Time flies when you're having fun (even when you're not). It's been more than a year since my last work related blog. Writing about random computer science and programming topics is like going to the gym; once you get off the routine it's very easy to just sit on your literal and figurative ass.

For my first topic back I want to relay some sequential random sampling algorithms I came across this week. It started out when we needed some random samples from the database. Performance issues aside, MySQL lets you do

`select column from table order by rand() limit n `

to select n random rows from a table. The randomness of the rand() function aside, someting troubled me about this approach. Won't your results be skewed toward the lower rows when you get duplicate random values?

So I was searching around for some random sampling algorithms and ran across a few from the works of Jeffrey Vitter I found rather interesting.

**Algorithm S**

Sequentially go through N elements to select n random samples. For each element generate an independent random variable U between 0 and 1 and test that NU > n. If true, skip, otherwise select for the sample and decrement n. In either case decrement N.

**Algorithms A-D**

These algorithms optimize Algorithm S by skipping elements. Define a skip function S(n,N) that will tell us how many elements to skip.

Algorithm S and its optimizations are for sequential sampling where N is known. But sometimes N is not known or it's prohibitive to determine N. Then we turn to reservoir samping.

**Algorithm R**

There is a reservoir R of n sample elements. The invariant is that the reservoir is a random sample of all the elements we've processed thus far. To maintain the invariant, as we sequentially process each element generate an independent random variable between 0 and the current element. Then either put the current element into the reservoir or ignore it.

**Algorithms X, Y and Z**

Optimization of algorithm R so we can skip over elements, much like Algorithms A-D.

## Comments