Friday, 4 March 2011

Bloom Filters

A common and severe bottleneck in distributed systems is I/O. Many applications in industry are based on storing data in a relational database that persists to a disc-based file system and the data is retrieved using many non-contiguous disk reads. The most common solution is to use a cache for the data so the data is read from memory to avoid reading from disk. The improvements are significant.

What if you wanted to detect the presence of a record in the database without I/O? You could use a cache but it's not space-efficient as you would need to increase the size of the cache as data is added to the database. A much better solution is to use space-efficient bloom filter, which uses a constant size of memory as the data grows to infinity.

A bloom filter is a probabilistic data structure that enables to detect the presence of an element in an universal set in constant space and time making it extremely scalable as the data tends to petabytes. The wikipedia article will do a better job of describing it: http://en.wikipedia.org/wiki/Bloom_filter. The most important property is that it cannot give false negatives but false positives; which means that it will never falsely tell you an element does not exist in the set it represents but can give you a false positive.

Let's give a real word use-case where you might want to use it. Let's say we are storing orders in a order management system, which persists all ordes to a relational database. We could use a cache to store the primary keys of each order but this solution would run into problems if the cache runs out of memory due too story too many keys. A much efficient solution would to use a bloom filter to detect if any order is in the database. If it is in the database then we can retrieve it. It may give you false positives but will never tell you a false negative.

I will be open sourcing my implementation that I use for my own startup project.

No comments:

Post a Comment