Time flies when you're having fun (even when you're not). It's been more than a year since my last work related blog. Writing about random computer science and programming topics is like going to the gym; once you get off the routine it's very easy to just sit on your literal and figurative ass.
For my first topic back I want to relay some sequential random sampling algorithms I came across this week. It started out when we needed some random samples from the database. Performance issues aside, MySQL lets you do
select column from table order by rand() limit n
to select n random rows from a table. The randomness of the rand() function aside, someting troubled me about this approach. Won't your results be skewed toward the lower rows when you get duplicate random values?
So I was searching around for some random sampling algorithms and ran across a few from the works of Jeffrey Vitter I found rather interesting.
Sequentially go through N elements to select n random samples. For each element generate an independent random variable U between 0 and 1 and test that NU > n. If true, skip, otherwise select for the sample and decrement n. In either case decrement N.
These algorithms optimize Algorithm S by skipping elements. Define a skip function S(n,N) that will tell us how many elements to skip.
Algorithm S and its optimizations are for sequential sampling where N is known. But sometimes N is not known or it's prohibitive to determine N. Then we turn to reservoir samping.
There is a reservoir R of n sample elements. The invariant is that the reservoir is a random sample of all the elements we've processed thus far. To maintain the invariant, as we sequentially process each element generate an independent random variable between 0 and the current element. Then either put the current element into the reservoir or ignore it.
Algorithms X, Y and Z
Optimization of algorithm R so we can skip over elements, much like Algorithms A-D.