Want to know “Bloom filters” ? — Here is your one stop store

Shakeel Ahmed V
8 min readMay 22, 2021
credits

Intro:

I have been spending some time to study about bloom filter and all about it.
I observed that I was jumping around multiple documentation and videos to know use cases and their stories. I am sure just like me there are other people who do this. So I thought why not make a single place of all those documentation(primarily medium docs) and various other sources. You can consider this log as index for interesting explanation(personal opinion) of Bloom filters with short summary.

What you can expect:

  • My opinion on other medium blogs of bloom filters
  • Brief definition
  • Use cases list

Medium blogs on Bloom filter

Title : Bloom filter in Ads Recommendation | Read time: 7min
Summary: Starts It starts by sharing problem statement of Ads recommendation which they were going to solve using Bloom filter at Tokopedia. I personally like the blog. If you love things sorted and in points you will definitely like this explanation.
Talks about
→ Detail problem statement
→ What | How — Bloom filters.
→ Implementation and how it was solved
→ Simple conclusion and “Go” references can be found.
Author: Nupur Bansal -Tokopedia Engineering| Year: 2020 | Claps: 1.1K

Title : Bloom Filter : A Probabilistic Data Structure | Read time: 7min
Summary: Starts with short history and goes into detail explanation using simple string with bulletin points. If you are in need to of crisp and to the point explanation this is the one.
Author talks about
→ Short history
→ Properties(6–7)
→ Hashing theory
→ Detail Implementation with Python snippet for add and test operation uses simple strings in example.
→ Some good list of Applications(5–6)
Author: Vivek Kumar Singh — System design Blog | Year: 2019 | Claps: 425

Title : What are Bloom filters? | Read time: 16min
Summary: One Explains via story of problem(A story has been suggested to them too many times already. Perhaps they’ve already read it. We need to filter those stories out) in Medium(company) and how they are using Bloom filter for it. It sounds like pod cast to me.
Author talks about
→ Story about how the discussion arose.
→ Detail explanation of hashing and various different applications of hashing(with one code)
→ Detail explanation of Bloom using basic animal strings “cat”, “dog” and “elephant”

Interesting section:
Author: Jamie Talbot | Year: 2015 | Claps: 10k

Title :Bloom Filters | Read time: 5min
Summary: Starts off with use case of google chrome(Ex:username already takes).
→ Bloom filters in details — explains about properties, basic examples
→ Application of bloom filters(3–4)
→ Short summary of how to implement and also shares Java code for it.

Interesting section:

  • One imminent question that might be coming to your mind would be why bloom filters and why not hashtables or hashmaps? Hashmaps are indeed a good data structure to answer whether an element exists in the set or not. But it comes at a cost. If there are 1M unique elements in data storage, then we are required to store all of them to answer whether particular elements exist or not. However, with bloom filters, we can answer the same query in a memory-efficient way (Though it will be a probabilistic answer)
  • Faster hash function implementations are murmur, the fnv series of hashes, Jenkins hashes, and HashMix.

Author: Sandeep Verma | | Year: 2019 | Claps: 64

Title : BLOOM FILTER | Read time: 4min
Summary: Starts off with problem statement and why Bloom filter over hash table/list/set. Author is basically YouTuber hence have shared short and crisp content with Video link. I personally watch is videos for system design concepts.
→ Bloom filter explanation with basic examples with video
→ Shares various has function to use in Bloom filter.
→ Shares detail applications of Bloom filter

Interesting section:

  • Detail application section and video explanation.
  • Types of hash functions available for Bloom filters.

Author: Narendra L | | Year: 2018 | Claps: 94

Other references:

  1. Bloom Filters in Action — Shares Formulas and Java code with basic usage
  2. BloomFilter in SearchEngine (Solr)
  3. Probabilistic Data structures: Bloom filter
  4. Bloom Filter Calculator
  5. Let’s Implement a Bloom Filter in Go
  6. All About Bloom Filter
  7. Bloom Filter: A simple but interesting data structure

Interesting Video content:

  1. Redis Labs video — Use of Bloom filters in Redis
  2. Bloom Filter with formulation — Has lot of other CS collection.
  3. Deduplicating Data Streams with Bloom Filters

— — — — — — — — — — — — — — — — — — — — -

Brief definition —

It is a probabilistic data structure - a set(bit array) which only supports two operation “add” and “exists”. You cannot remove item from this set, the interesting part is “exists” method sometimes lies(Returns true(false positive), probably but not when it doesn’t exist), where as “add” method is honest ;-)

In short a bloom filter can tell whether an element is probably in the set or surely not part of set.

Want to know how it lies ?

First without going into much depth let’s understand with simple example — Assume that you want to add “foo” and “bar” to set and then find “buzz” item if it exist in set or not. We will use hash function — same input must always hash to the same output

Our set(size 6) looks like this {0,0,0,0,0,0}

Add():
foo” → Hash(“foo”, {n1, n2, n3}) % 6→ {0,4,5} → {1,0,0,0,1,1}
# Marking 1 on {0,4,5} index of bloom filter set — No collision

bar” → Hash(“bar”, {n1, n2, n3}) % 6 → {2,3,5} →{1,0,1,1,1,1}
# Now Marking 1 on {2,3,5} index of same set — notice collision happens at index 5

IsExist():
Now let’s check if “buzz” item exist in bloom filter — Assume that hashed and mod set of item “buzz” returns set {0,3,5}. Note that our set looks something like this : {1,0,1,1,1,1}.
Now when you try to check on those indices it says it exist but in-fact it doesn’t.

IsExists(“buzz”) → Yes
buzz” → Hash(“buzz”, {n1, n2, n3}) % 6 → {0,3,5}
Do we have 1s on index {0,3,5} of bloom filter {1,0,1,1,1,1}? yes. False positive result.

Can we not use hashtables or hashmaps? Hashmaps are indeed a good data structure to answer whether an element exists in the set or not. But it comes at a cost. Bloom filter provides better space-efficient data structures

Hash function:

Well for bloom filter the hash function should show below properties

  • It should be fast.
  • The hash function should have rare collisions and hash value should be evenly and randomly distributed. which are fast and provides

Here are some list of hash functions:

Application and use cases:

Bloom filter has a large number of applications in software engineering.

  • Medium uses bloom filter in medium blog recommendation to check whether a user has already read this post before or not. Read more about it on this amazing blog.
  • Google Bigtable, Apache HBase and Apache Cassandra, and Postgresql use bloom filters to reduce the disk lookups for non-existent rows or columns. Avoiding costly disk lookups considerably increases the performance of a database query operation.
  • Bloom filter is used by Content Delivery Network provider Akamai to avoid caching web objects which are requested by users just once ( also known as “one-hit-wonders”). Using a Bloom filter to detect the second request for a web object and caching that object only on its second request prevents one-hit wonders from entering the disk cache, significantly reducing disk workload and increasing disk cache hit rates.
  • The Google Chrome web browser used to use a Bloom filter to identify malicious URLs. Any URL was first checked against a local Bloom filter, and only if the Bloom filter returned a positive result was a full check of the URL performed (and the user warned if that too returned a positive result)
  • Databases using LSM trees, like HBase, Cassandra, and BigTable uses bloom filters to reduce searching into different segments (on disks) for the non-existent rows or columns
  • Recommendation engine of Medium uses bloom filters to filters out articles already seen by a user
  • Quora filters out stories already seen by people using bloom filters
  • The classic example is using bloom filters to reduce expensive disk (or network) lookups for non-existent keys. If the element is not in the bloom filter, then we know for sure we don’t need to perform the expensive lookup. On the other hand, if it is in the bloom filter, we perform the lookup, and we can expect it to fail some proportion of the time (the false positive rate).
  • Our problem of checking whether the username is already taken. update bloom filter everytime a username is taken and check nexttime when some one checks for username is taken or not.
  • Google Bigtable, Apache HBase and Apache Cassandra, and Postgresql[11] use Bloom filters to reduce the disk lookups for non-existent rows or columns. Avoiding costly disk lookups considerably increases the performance of a database query operation.
  • The Google Chrome web browser used to use a Bloom filter to identify malicious URLs. Any URL was first checked against a local Bloom filter, and only if the Bloom filter returned a positive result was a full check of the URL performed (and the user warned if that too returned a positive result).
  • Bitcoin uses Bloom filters to speed up wallet synchronization
  • Medium uses Bloom filters to avoid recommending articles a user has previously read
  • is passwords weak: instead of having a huge set of all possible weak passwords, you can just check whether the password is surely not weak with a way smaller bloom filter
  • Akamai’s web servers use Bloom filters to prevent “one-hit-wonders” from being stored in its disk caches. One-hit-wonders are web objects requested by users just once, something that Akamai found applied to nearly three-quarters of their caching infrastructure. Using a Bloom filter to detect the second request for a web object and caching that object only on its second request prevents one-hit wonders from entering the disk cache, significantly reducing disk workload and increasing disk cache hit rates (taken from examples in bloom’s filter article at wiki)
  • Bloom filters are quite useful in bioinformatics. They can be more space efficient compared to using a regular hash, especially when the size of the strings you are working with can be hundreds of millions of letters with a very small alphabet ie {A,G,T,C} . They are usually used to assess whether a certain k-mer is present or absence in a genome. There’s an example of one used for something relevant here.

Implementations of bloom filters:

  • Bloomd is a high-performance C server which is used to expose bloom filters and operations over them to networked clients. It nicely fits into the micro-services architecture.
  • PyreBloom is python based implementation of bloom filter based on Redis
  • The Cuckoo filter is another variation of the bloom filter. Cuckoo filters support adding and removing items dynamically while achieving even higher performance than Bloom filters. There is a nice blog by Farhan on the difference between Cuckoo and Bloom filters.

References:

  • All blogs mentioned in this blog.

--

--