Bloom Filters

A Bloom filter is a probabilistic data structure used to check whether or not an element is a member of a set.

Why Use It

  • Constant-time membership queries after preprocessing (independent of the number of stored items)

  • Extremely space efficient. If we can determine that something definitely does not exist, we can skip the expensive call (e.g., a disk or network lookup) we would otherwise make to try to find it.

  • O(k) insertion time where k is the number of hash functions.

How and Why it Works

  • Define a bit array, bit_ar, of constant size. This array is used to identify whether or not a hash index is occupied.

  • Define a constant number of different hash functions.

  • Inserting element into a set:

    • For every hash function h_i, set bit_ar[h_i(elm)] = 1

  • Now when we check whether something exists, we iterate through the list of hash functions. If any of them indexes a 0 bit, we know for a fact that the element is not in the set: the hash functions are deterministic, so inserting the element would have set every one of its bits to 1. If all of them index a 1 bit, the element exists with some high probability (those bits may have been set by other insertions).

  • Checking whether an element exists in a set is a lot like having the right key for a lock. The key may go in partway, but the moment one cut of the key does not fit the pins of the lock, it stops and will not turn; every cut must fit.
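
The steps above can be sketched in Python. This is a minimal illustration, not a production implementation; the class name, default sizes, and the salted-SHA-256 trick for deriving multiple hash functions are all assumptions of the sketch:

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter sketch: a bit array plus k derived hash functions."""

    def __init__(self, size=1024, num_hashes=3):
        self.size = size
        self.num_hashes = num_hashes
        self.bit_ar = [0] * size  # the bit array from the steps above

    def _indexes(self, elm):
        # Derive num_hashes hash functions by salting SHA-256 with i.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{elm}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def insert(self, elm):
        # For every hash function h_i: bit_ar[h_i(elm)] = 1
        for j in self._indexes(elm):
            self.bit_ar[j] = 1

    def might_contain(self, elm):
        # Any 0 bit -> definitely absent; all 1 bits -> probably present.
        return all(self.bit_ar[j] for j in self._indexes(elm))
```

On an empty filter every bit is 0, so any lookup correctly returns False; after inserting an element, looking up that same element always returns True, since the hashes are deterministic.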

Key Insights

  • Bloom filters cannot tell you that an item is definitely within the set. Only "possibly in the set" and "definitely not in the set."

  • No support for removals. This is because different inputs can hash to the same bit index, so clearing the bits for one element could effectively remove other elements as well.

  • As the number of elements inserted into the Bloom filter increases, so does the probability of a false positive. Additionally, each extra hash function sets more bits per insertion, so past a point more hash functions fill the array faster and also increase the probability of false positives.

    • There is a sweet spot that is based on the number of items to be placed into the filter and the acceptable error rate.

    • If $n$ is the number of planned items to be inserted, and $p$ is the acceptable false-positive probability, then $m = -\frac{n \ln(p)}{(\ln 2)^2}$ and $k = \frac{m}{n}\ln(2)$, where $m$ is the optimal array size and $k$ is the optimal number of hash functions. (The minus sign is needed because $\ln(p)$ is negative for $p < 1$.)

  • Hash functions are ideally independent (not related to each other), uniformly distributed (to minimize collisions), and fast (of course).
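
The sweet-spot formulas above can be sketched directly; the function name and the example values of $n$ and $p$ below are assumptions for illustration:

```python
from math import ceil, log

def optimal_params(n, p):
    """Given n planned insertions and target false-positive rate p,
    return (m, k): optimal bit-array size and number of hash functions."""
    m = -n * log(p) / (log(2) ** 2)  # m = -n ln(p) / (ln 2)^2
    k = (m / n) * log(2)             # k = (m/n) ln(2)
    return ceil(m), round(k)

# e.g. 1000 planned items at a 1% acceptable false-positive rate:
m, k = optimal_params(1000, 0.01)  # roughly 9600 bits and 7 hash functions
```

Note that $m$ scales linearly with $n$: doubling the planned insertions doubles the required bits for the same error rate, while $k$ depends only on the ratio $m/n$.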

The Maths

For our purposes, let

k = number of hash functions
m = number of inserted items
n = table (bit array) size
p = false positive probability threshold

(Note that m and n are swapped relative to the formulas in Key Insights above.)

For a single hash, the probability that it does not map to some particular index $j$ is $P(h_i(x) \neq j) = 1 - \frac{1}{n}$. This is because we are assuming uniformity in our hash function: each of the $n$ indices is equally likely to be hashed to, so any one index is hit with probability $\frac{1}{n}$, where $n$ is the table size.

Generalizing this to $k$ hash functions, each of which is called $m$ times (once per inserted item): the probability that the same arbitrary index $j$ is never hashed to becomes $P(h_{1 \ldots k}(x) \neq j \text{ over all } m \text{ insertions}) = (1 - \frac{1}{n})^{km}$. This is the previous probability multiplied by itself $km$ times, because each hash is completely independent of the next.

Next, some calculus and algebraic simplification. $(1 - \frac{1}{n})^n$ tends to $\frac{1}{e}$ when $n$ is very large (take the limit as $n$ approaches infinity). So $(1-\frac{1}{n})^{km} = \left((1-\frac{1}{n})^n\right)^{\frac{km}{n}} \approx \left(\frac{1}{e}\right)^{\frac{km}{n}} = e^{-\frac{km}{n}}$.
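
A quick numerical sanity check of that approximation; the particular values of $k$, $m$, and $n$ below are arbitrary choices:

```python
from math import exp

# k hash functions, m inserted items, n table size (this section's notation)
k, m, n = 5, 1_000, 10_000
exact = (1 - 1 / n) ** (k * m)  # (1 - 1/n)^{km}
approx = exp(-k * m / n)        # e^{-km/n}
# For large n the two values agree to several decimal places.
```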

To find the probability of a false positive (when the Bloom filter reports that an element exists in the set when it does not), we take the complement of the probability above: a given bit is set with probability $1 - e^{-\frac{km}{n}}$. A false positive occurs exactly when the queried element's bit index is set for all $k$ hash functions, and that is all it takes. So our probability for a false positive becomes $(1-e^{-\frac{km}{n}})^k$.

We can see that increasing the table size $n$ lowers the probability of false positives because there is less chance of overlap; the number of hash functions $k$, by contrast, has a sweet spot rather than a monotonic effect, as noted in Key Insights.
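
The effect of the table size can be checked numerically; the function name and sample values below are assumptions for illustration:

```python
from math import exp

def false_positive_rate(k, m, n):
    """Approximate false-positive probability (1 - e^{-km/n})^k,
    with k hash functions, m inserted items, and table size n."""
    return (1 - exp(-k * m / n)) ** k

# Holding k and m fixed, a larger table n lowers the false-positive rate.
small_table = false_positive_rate(3, 100, 1_000)
big_table = false_positive_rate(3, 100, 10_000)
```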
