Yukun Jiang (蒋宇昆)

In this blog, we will look at a very classic algorithm – Reservoir sampling to solve such a problem:

Choose a random sample k items without replacement from a population of unknown size n in a single pass.

It has the following constraints:

the size of n items are too big to fit into main memory
the items are presented in a streaming fashion and the algorithm cannot look back
at any point, the algorithm needs to be able to present a random sample of size k based on the data it processes so far

We will introduce a basic Algorithm R by Jeffrey Vitter.

Algorithm R

Initialize an array R containing the first k items of the input. This serves as the initial reservoir.
For each new input $x_i$, generate a random integer $j \in ( 1, \cdots, i ) $. If $j$ happens to be within $ \in ( 1, \cdots, k ) $, then $x_i$ will replace $R[j]$. otherwise $x_i$ is discarded.
Repeat step (2) until all inputs are processed.

This basic algorithm R looks simple enough. Here is the mathematical proof by induction.

We will prove that during the processing of each element, after processing $i$ elements, each processed element has a uniform probability of $\frac{k}{i}$ to be retained in the sample.

Base: when $i$ is $k$, all the elements are included in the sample with probability 1.
Induction:
- assume after processing $m$ elements, each elements seen so far have a probability of $\frac{k}{m}$ to be retained in the sample.
- The next element $x_{m+1}$ has a probability $\frac{k}{m+1}$ to be included in the reservoir. So the chance that it replaces on particular element already in the reservoir is $\frac{k}{m+1}\frac{1}{k} = \frac{1}{m+1}$
- So for any prior element $x_i$ where $i < m+1$, the new probability that it is retained in the sample is P(it is in the sample before $m+1$) * P(it is not replaced by $x_{m+1}$) = $\frac{k}{m}(1 - \frac{1}{m+1})=\frac{k}{m+1}$

Assume the average time complexity to generate a random number is $O(1)$, this basic algorithm R has an asymptotic complexity of $O(n)$.

Notice there is an optimized version of this called Algorithm L, which runs in $O(k(1+\log(\frac{n}{k})))$ by only computing three random numbers for each item that becomes part of the reservoir. Interested readers could further look it up.

Reference