On a beautiful, warm Sunday in San Francisco there's nothing more enjoyable than doing laundry and reading up on hashing. I'm being sarcastic, if it wasn't obvious (maybe there are people out there who enjoy doing laundry). Fundamentally, hashing is a simple, basic concept, but there are all kinds of neat things you can do with it.
There's consistent hashing if you want to build a distributed hash table or route requests to a dynamic server farm. I wrote about it a while back, but unfortunately it got deleted during the migration. You can try this link for a Google cached version. Hashing plays a big part in cryptography, so you can securely buy laundry detergent online at Amazon.com. Compiler writers like to use perfect hashing for keyword lookups and switch-statement optimization.
When talking about hashing and hash functions/values/codes, collisions should be part of the discussion. Yet you'd be surprised how many developers aren't aware of collisions, or simply ignore them, when using a hash-based data structure. Often the reasoning is that the probability of hash(key1) = hash(key2) for random keys is so low that it's acceptable to pretend it will never happen. With a 32-bit hash output, for example, there are something like 4+ billion possible values. One in 4 billion is pretty good odds. But you need to be careful about how you calculate collision probability.
You might have heard of the birthday problem/paradox or the birthday attack. The probability that two specific people have the same birthday is not that high: 1/365, ignoring leap years. The probability that someone in a small group will have the same birthday as you is still pretty low. But if you are hosting a cocktail party for just 23 friends, there's more than a 50 percent chance that at least two of them share a birthday.
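To see where 23 comes from, here's a quick sketch in Java (my own illustration, not from the post) that multiplies out the probability that all 23 birthdays are distinct and takes the complement:

    // Exact birthday-paradox math: probability that all 23 birthdays differ,
    // then take the complement. Prints roughly 0.507, i.e. just over 50%.
    public class BirthdayParadox {
        public static void main(String[] args) {
            double allDifferent = 1.0;
            for (int i = 0; i < 23; i++) {
                allDifferent *= (365.0 - i) / 365.0;
            }
            System.out.printf("P(shared birthday among 23 people) = %.3f%n",
                    1 - allDifferent);
        }
    }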
Applied to a 32-bit hash output, the same math means that with only about 10,000 objects there's roughly a one percent chance of at least one colliding pair. That's a lot more than one in 4 billion.
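The same birthday math scales to hash codes. A back-of-the-envelope sketch (again my own, not from the post), using the standard approximation P(collision) ~ 1 - e^(-n(n-1)/2N) for n keys drawn from N = 2^32 possible hash values:

    // Birthday-bound approximation for an ideal 32-bit hash:
    // P(at least one collision) ~= 1 - exp(-n*(n-1) / (2*N)), N = 2^32.
    public class HashCollisionOdds {
        public static void main(String[] args) {
            double N = Math.pow(2, 32);
            long[] keyCounts = {1_000, 10_000, 100_000, 200_000};
            for (long n : keyCounts) {
                double p = 1 - Math.exp(-((double) n * (n - 1)) / (2 * N));
                System.out.printf("%,9d keys -> P(collision) = %.4f%n", n, p);
            }
            // ~10,000 keys already gives about a 1% chance of a collision;
            // 200,000 keys pushes it to roughly 99%.
        }
    }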
Yeah, I've run into hash collisions on just 8,000 items in a .NET dictionary. That opened my eyes :)
Still trying to figure out why .NET has 32-bit hashes instead of 64....
Posted by: Nathanael Jones | May 19, 2009 at 08:00 AM
I'm sure there are plenty of reasons. One is probably that Java (and probably .NET) doesn't require loads and stores of 64-bit values to be atomic. This can create subtle problems if the programmer is not fully aware of the non-atomic treatment of double and long, and it would probably force hash code handling to be more expensive. And let's be fair, this was all done back when 64-bit chips weren't widely available.
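A minimal sketch of what that rule (JLS 17.7, "Non-Atomic Treatment of double and long") means in practice; the field name here is just for illustration:

    // JLS 17.7: reads and writes of non-volatile long and double fields are
    // not guaranteed to be atomic, so one thread can observe a "torn" value
    // half-written by another. Marking the field volatile restores atomicity,
    // but not for free.
    class HashHolder {
        volatile long hash64; // hypothetical 64-bit hash code field
    }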
Posted by: tinou | May 20, 2009 at 05:32 PM
An interesting aspect is to understand when a hash table becomes a performance bottleneck due to the collision rate. At 200,000 entries there is a 99% probability of a collision. If using a ConcurrentHashMap (CHM), which uses up to 2^16 segments, the average length of the link chain would be 3. This assumes a uniform distribution, of course.
In a forum post I provided help on, the author had a map of 60M entries and a serious performance problem. My diagnosis was that this resulted in a length of 925 entries per chain. I suggested that he use a skip-list-based map instead, which would have a height of 26. It seems that in the ideal case, the cross-over is at 1.3M entries.
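A rough sketch of that arithmetic (assuming the chain length is simply entries divided by 2^16 segments, and a skip-list lookup costs about log2(n) comparisons; exact figures depend on the map's internals):

    // Rough model: expected chain length = n / 2^16 segments versus a
    // skip-list lookup cost of about log2(n). The curves cross near 1.3M entries.
    public class ChainVsSkipList {
        public static void main(String[] args) {
            long segments = 1L << 16;
            long[] sizes = {200_000, 1_300_000, 60_000_000};
            for (long n : sizes) {
                double chainLength = (double) n / segments;
                double skipListHeight = Math.log(n) / Math.log(2);
                System.out.printf("%,11d entries: chain ~%.0f, skip list ~%.0f%n",
                        n, chainLength, skipListHeight);
            }
        }
    }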
So I agree that it's important to understand how a hash table works, because that also lets you know when not to use one.
Posted by: Ben Manes | May 26, 2009 at 02:43 AM