Want to know “Bloom filters” ? — Here is your one stop store

credits

Intro:

What you can expect:

  • Brief definition
  • Use cases list

Medium blogs on Bloom filter

Title : Bloom Filter : A Probabilistic Data Structure | Read time: 7min
Summary: Starts with short history and goes into detail explanation using simple string with bulletin points. If you are in need to of crisp and to the point explanation this is the one.
Author talks about
→ Short history
→ Properties(6–7)
→ Hashing theory
→ Detail Implementation with Python snippet for add and test operation uses simple strings in example.
→ Some good list of Applications(5–6)
Author: Vivek Kumar Singh — System design Blog | Year: 2019 | Claps: 425

Title : What are Bloom filters? | Read time: 16min
Summary: One Explains via story of problem(A story has been suggested to them too many times already. Perhaps they’ve already read it. We need to filter those stories out) in Medium(company) and how they are using Bloom filter for it. It sounds like pod cast to me.
Author talks about
→ Story about how the discussion arose.
→ Detail explanation of hashing and various different applications of hashing(with one code)
→ Detail explanation of Bloom using basic animal strings “cat”, “dog” and “elephant”

Interesting section:
Author: Jamie Talbot | Year: 2015 | Claps: 10k

Title :Bloom Filters | Read time: 5min
Summary: Starts off with use case of google chrome(Ex:username already takes).
→ Bloom filters in details — explains about properties, basic examples
→ Application of bloom filters(3–4)
→ Short summary of how to implement and also shares Java code for it.

Interesting section:

  • One imminent question that might be coming to your mind would be why bloom filters and why not hashtables or hashmaps? Hashmaps are indeed a good data structure to answer whether an element exists in the set or not. But it comes at a cost. If there are 1M unique elements in data storage, then we are required to store all of them to answer whether particular elements exist or not. However, with bloom filters, we can answer the same query in a memory-efficient way (Though it will be a probabilistic answer)
  • Faster hash function implementations are murmur, the fnv series of hashes, Jenkins hashes, and HashMix.

Author: Sandeep Verma | | Year: 2019 | Claps: 64

Title : BLOOM FILTER | Read time: 4min
Summary: Starts off with problem statement and why Bloom filter over hash table/list/set. Author is basically YouTuber hence have shared short and crisp content with Video link. I personally watch is videos for system design concepts.
→ Bloom filter explanation with basic examples with video
→ Shares various has function to use in Bloom filter.
→ Shares detail applications of Bloom filter

Interesting section:

  • Detail application section and video explanation.
  • Types of hash functions available for Bloom filters.

Author: Narendra L | | Year: 2018 | Claps: 94

Other references:

  1. BloomFilter in SearchEngine (Solr)
  2. Probabilistic Data structures: Bloom filter
  3. Bloom Filter Calculator
  4. Let’s Implement a Bloom Filter in Go
  5. All About Bloom Filter
  6. Bloom Filter: A simple but interesting data structure

Interesting Video content:

  1. Bloom Filter with formulation — Has lot of other CS collection.
  2. Deduplicating Data Streams with Bloom Filters

— — — — — — — — — — — — — — — — — — — — -

Brief definition —

In short a bloom filter can tell whether an element is probably in the set or surely not part of set.

Want to know how it lies ?

First without going into much depth let’s understand with simple example — Assume that you want to add “foo” and “bar” to set and then find “buzz” item if it exist in set or not. We will use hash function — same input must always hash to the same output

Our set(size 6) looks like this {0,0,0,0,0,0}

Add():
foo” → Hash(“foo”, {n1, n2, n3}) % 6→ {0,4,5} → {1,0,0,0,1,1}
# Marking 1 on {0,4,5} index of bloom filter set — No collision

bar” → Hash(“bar”, {n1, n2, n3}) % 6 → {2,3,5} →{1,0,1,1,1,1}
# Now Marking 1 on {2,3,5} index of same set — notice collision happens at index 5

IsExist():
Now let’s check if “buzz” item exist in bloom filter — Assume that hashed and mod set of item “buzz” returns set {0,3,5}. Note that our set looks something like this : {1,0,1,1,1,1}.
Now when you try to check on those indices it says it exist but in-fact it doesn’t.

IsExists(“buzz”) → Yes
buzz” → Hash(“buzz”, {n1, n2, n3}) % 6 → {0,3,5}
Do we have 1s on index {0,3,5} of bloom filter {1,0,1,1,1,1}? yes. False positive result.

Can we not use hashtables or hashmaps? Hashmaps are indeed a good data structure to answer whether an element exists in the set or not. But it comes at a cost. Bloom filter provides better space-efficient data structures

Hash function:

  • It should be fast.
  • The hash function should have rare collisions and hash value should be evenly and randomly distributed. which are fast and provides

Here are some list of hash functions:

Application and use cases:

  • Medium uses bloom filter in medium blog recommendation to check whether a user has already read this post before or not. Read more about it on this amazing blog.
  • Google Bigtable, Apache HBase and Apache Cassandra, and Postgresql use bloom filters to reduce the disk lookups for non-existent rows or columns. Avoiding costly disk lookups considerably increases the performance of a database query operation.
  • Bloom filter is used by Content Delivery Network provider Akamai to avoid caching web objects which are requested by users just once ( also known as “one-hit-wonders”). Using a Bloom filter to detect the second request for a web object and caching that object only on its second request prevents one-hit wonders from entering the disk cache, significantly reducing disk workload and increasing disk cache hit rates.
  • The Google Chrome web browser used to use a Bloom filter to identify malicious URLs. Any URL was first checked against a local Bloom filter, and only if the Bloom filter returned a positive result was a full check of the URL performed (and the user warned if that too returned a positive result)
  • Databases using LSM trees, like HBase, Cassandra, and BigTable uses bloom filters to reduce searching into different segments (on disks) for the non-existent rows or columns
  • Recommendation engine of Medium uses bloom filters to filters out articles already seen by a user
  • Quora filters out stories already seen by people using bloom filters
  • The classic example is using bloom filters to reduce expensive disk (or network) lookups for non-existent keys. If the element is not in the bloom filter, then we know for sure we don’t need to perform the expensive lookup. On the other hand, if it is in the bloom filter, we perform the lookup, and we can expect it to fail some proportion of the time (the false positive rate).
  • Our problem of checking whether the username is already taken. update bloom filter everytime a username is taken and check nexttime when some one checks for username is taken or not.
  • Google Bigtable, Apache HBase and Apache Cassandra, and Postgresql[11] use Bloom filters to reduce the disk lookups for non-existent rows or columns. Avoiding costly disk lookups considerably increases the performance of a database query operation.
  • The Google Chrome web browser used to use a Bloom filter to identify malicious URLs. Any URL was first checked against a local Bloom filter, and only if the Bloom filter returned a positive result was a full check of the URL performed (and the user warned if that too returned a positive result).
  • Bitcoin uses Bloom filters to speed up wallet synchronization
  • Medium uses Bloom filters to avoid recommending articles a user has previously read
  • is passwords weak: instead of having a huge set of all possible weak passwords, you can just check whether the password is surely not weak with a way smaller bloom filter
  • Akamai’s web servers use Bloom filters to prevent “one-hit-wonders” from being stored in its disk caches. One-hit-wonders are web objects requested by users just once, something that Akamai found applied to nearly three-quarters of their caching infrastructure. Using a Bloom filter to detect the second request for a web object and caching that object only on its second request prevents one-hit wonders from entering the disk cache, significantly reducing disk workload and increasing disk cache hit rates (taken from examples in bloom’s filter article at wiki)
  • Bloom filters are quite useful in bioinformatics. They can be more space efficient compared to using a regular hash, especially when the size of the strings you are working with can be hundreds of millions of letters with a very small alphabet ie {A,G,T,C} . They are usually used to assess whether a certain k-mer is present or absence in a genome. There’s an example of one used for something relevant here.

Implementations of bloom filters:

  • Bloomd is a high-performance C server which is used to expose bloom filters and operations over them to networked clients. It nicely fits into the micro-services architecture.
  • PyreBloom is python based implementation of bloom filter based on Redis
  • The Cuckoo filter is another variation of the bloom filter. Cuckoo filters support adding and removing items dynamically while achieving even higher performance than Bloom filters. There is a nice blog by Farhan on the difference between Cuckoo and Bloom filters.

References:

  • All blogs mentioned in this blog.

Software developer