Bloom Filters
A Bloom filter is a probabilistic data structure used to check whether or not an element is within a set.
Why Use It
Constant-time lookups after preprocessing.
Extremely space efficient. If we can determine that something definitely does not exist, we can skip the expensive call (for example, a disk or database lookup) that we would otherwise make to find it.
O(k) insertion time, where k is the number of hash functions.
How and Why it Works
Define a bit array,
bit_ar
of constant size. This array will be used to identify whether or not a hash index is occupied. Define a constant number of different hash functions.
Inserting an element into the set: for every hash function
h_i
set
bit_ar[h_i(elm)] = 1
Now when we check whether something exists, we iterate through the list of hash functions. If any of them points to a bit that is 0, we know for a fact that the element is not in the set: the hash functions are deterministic, and inserting the element would have set every one of those bits to 1. If all of the bits are 1, then the element exists with some high probability.
Checking whether an element exists in the set is a lot like having the right key for a lock. The key may go partway in, but the moment one cut of the key does not fit the lock, it stops and will not go through.
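The insert and lookup steps above can be sketched as a small Python class. This is an illustrative sketch, not the original's implementation: the names (BloomFilter, might_contain) and the salted-SHA-256 trick for deriving k hash functions are assumptions.

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter sketch (illustrative names and hashing scheme)."""

    def __init__(self, size, num_hashes):
        self.size = size              # n: length of the bit array
        self.num_hashes = num_hashes  # k: number of hash functions
        self.bits = [0] * size        # the bit_ar from the text

    def _indexes(self, item):
        # Derive k different hash values by salting one base hash.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, item):
        # Set the bit for every hash function.
        for idx in self._indexes(item):
            self.bits[idx] = 1

    def might_contain(self, item):
        # False => definitely absent; True => present with high probability.
        return all(self.bits[idx] for idx in self._indexes(item))
```

Here a `False` from `might_contain` is the guaranteed "definitely not in the set" answer; a `True` is only probabilistic.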
Key Insights
Bloom filters cannot tell you that an item is definitely within the set: only "possibly in the set" and "definitely not in the set."
No support for removals. Different inputs can hash to the same index, so clearing a bit to remove one element could also effectively remove other elements that share that bit.
As the number of elements inserted into the Bloom filter increases, so does the probability of a false positive. Additionally, using more hash functions sets more bits per insertion, increasing the chance of collisions and, past a point, the probability of false positives.
There is a sweet spot that is based on the number of items to be placed into the filter.
If $m$ is the number of planned items to be inserted, and $p$ is the acceptable probability of a false positive, then $n = -\frac{m \ln p}{(\ln 2)^2}$ and $k = \frac{n}{m} \ln 2$, where $n$ is the optimal array size and $k$ is the optimal number of hash functions.
Hash functions are ideally independent (uncorrelated with each other), uniformly distributed (to minimize collisions), and fast (of course).
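The sweet-spot sizing can be computed directly from the standard formulas n = -m ln(p) / (ln 2)^2 and k = (n/m) ln 2. A minimal sketch (the function name is an assumption):

```python
import math

def optimal_params(num_items, fp_rate):
    """Optimal bit-array size n and hash count k for num_items planned
    insertions and target false-positive rate fp_rate, using the
    standard Bloom filter sizing formulas."""
    n = math.ceil(-num_items * math.log(fp_rate) / (math.log(2) ** 2))
    k = round((n / num_items) * math.log(2))
    return n, k

# For 1000 items at a 1% false-positive rate, this gives roughly
# 9.6 bits per item and 7 hash functions.
print(optimal_params(1000, 0.01))  # → (9586, 7)
```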
The Maths
For our purposes, let $n$ be the size of the bit array, $k$ the number of hash functions, and $m$ the number of elements inserted into the filter.
For a single hash, the probability that it does not index to some particular index $j$ is $1 - \frac{1}{n}$. This is because we are assuming uniformity in our hash function, so if each index has the same probability of being hashed to, that probability would be $\frac{1}{n}$, where $n$ is the table_size.
Generalizing this to $k$ hash functions, each of which is called upon $m$ times: the probability that the same arbitrary index $j$
never gets hashed becomes $\left(1 - \frac{1}{n}\right)^{km}$. This is the same as multiplying the previous probability by itself $km$ times, because each hash is completely independent of the next.
Next, some calculus and algebraic simplification: $\left(1 - \frac{1}{n}\right)^{km}$ can be reduced to $e^{-km/n}$ when $n$ is very large (take the limit as $n$ approaches infinity).
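The approximation can be checked numerically for a concrete choice of parameters (the specific values below are arbitrary, chosen only for illustration):

```python
import math

# Compare (1 - 1/n)^(k*m) with its large-n approximation e^(-k*m/n).
n, k, m = 10_000, 7, 1_000
exact = (1 - 1 / n) ** (k * m)
approx = math.exp(-k * m / n)
print(exact, approx)  # the two values agree to about four decimal places
```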
To find the probability of a false positive (when the Bloom filter reports that an element exists in the set when it does not), we take the complement of the probability above, $1 - e^{-km/n}$, and raise it to the power of the $k$ hash functions. This is because a false positive occurs exactly when the bit index is set for all $k$ hash functions, and that is all it takes. So our probability for a false positive becomes $\left(1 - e^{-km/n}\right)^k$.
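The false-positive probability (1 - e^(-k*m/n))^k can be evaluated directly. A minimal sketch using the same notation (n bits, k hash functions, m inserted elements; the function name is an assumption):

```python
import math

def false_positive_rate(n, k, m):
    """Probability of a false positive for a filter with n bits,
    k hash functions, and m inserted elements: (1 - e^(-k*m/n))^k."""
    return (1 - math.exp(-k * m / n)) ** k

# A filter sized by the sweet-spot formulas (about 9.6 bits per item,
# 7 hash functions for 1000 items) lands near the 1% target rate.
print(false_positive_rate(9586, 7, 1000))  # ≈ 0.01
```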
We can see that increasing the array size $n$ lowers the probability of false positives because there is less chance of overlap; reducing $k$ also helps when $k$ is above its optimal value, since fewer bits are set per insertion.