Deduplication is a specialised data-reduction technique that removes repeated copies of duplicate data. It improves storage utilisation and reduces the number of bytes sent over a network. Deduplication should be performed before ingesting data into a data warehouse, so that the data is clean for analysis and consumption.
What Are Bloom Filters?
Invented by Burton Howard Bloom in 1970, a Bloom filter is a probabilistic data structure used to test set membership. It can guarantee that an element is NOT present in the set, but it cannot guarantee presence. In other words, a Bloom filter query answers either 'not in set' or 'possibly in set': an element may be present in the set, but the Bloom filter cannot assure us of it (hence 'possibly in set').
The biggest advantage of this type of data structure is its space and time efficiency. It uses a bit array and has a lookup time of O(1): an extremely small space requirement, yet superfast. This speed and space efficiency, however, comes at the cost of accuracy, because a Bloom filter can return false positives.
An empty Bloom filter is a bit array of size m, with all bits initially set to 0. There are also k hash functions that map each set element e to one of the m array positions. The hash functions generate indices i1, i2, …, ik, where each index is at most m. When an element is added, each of its k array positions is set to 1. When checking an element, if any of its k array positions is still 0, the element is definitely not present.
A Bloom filter allows two types of operations – Add and Check.
Let us take an example to further explain how Bloom filters work:
For our example, we take a bit array of size nine with all positions initially set to 0, and we choose the number of hash functions (k) to be 2.
- Adding an element "Dog": let's say the hash functions return index positions 2 and 4. We set positions 2 and 4 to '1'.
- Adding another element, "Cat": the hash functions now return index positions 5 and 6. We set positions 5 and 6 to '1'.
- Checking for an element "Rat": let's say the hash functions return index positions 4 and 7. Since position 7 is still 0, we are assured that "Rat" is not present in the Bloom filter.

Bloom filters are time and space efficient, but there is a catch: false positives. A check may report that an element is possibly present even though it was never added. Continuing our example, checking an element "Cow" could return index positions 2 and 5; both are already set to '1' (by "Dog" and "Cat" respectively), so the filter answers 'possibly in set' even though "Cow" was never added. That is a false positive.
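The walkthrough above can be sketched as a small Python toy (illustrative only: the class name and the salted-SHA-256 hashing scheme are our assumptions, and the actual positions a real hash produces will differ from the hand-picked ones in the example):

```python
import hashlib

class BloomFilter:
    """Toy Bloom filter: a bit array of size m probed by k hash functions."""

    def __init__(self, m, k):
        self.m = m
        self.k = k
        self.bits = [0] * m  # all positions start at 0

    def _positions(self, element):
        # Derive k array positions by salting one hash function with i;
        # production filters typically use faster hashes such as MurmurHash.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{element}".encode()).hexdigest()
            yield int(digest, 16) % self.m

    def add(self, element):
        for pos in self._positions(element):
            self.bits[pos] = 1

    def check(self, element):
        # True  -> 'possibly in set' (may be a false positive)
        # False -> 'definitely not in set'
        return all(self.bits[pos] for pos in self._positions(element))

bf = BloomFilter(m=9, k=2)
bf.add("Dog")
bf.add("Cat")
print(bf.check("Dog"))  # True (possibly in set)
```

Note that `check` can only ever be wrong in one direction: an added element is always reported as possibly present, while an absent element is usually, but not always, reported as absent.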
Deciding Optimal Parameters for Bloom Filters
As the Bloom filter fills up, the chance of false positives increases. The question now is: how do we decide the size of the Bloom filter?
Let's say we need a Bloom filter that can accommodate n elements, and we decide on a permissible false positive rate fp. Given n and fp, we can calculate the optimal size of the Bloom filter m and the number of hash functions required k.
```python
import math
import sys

def bloom_filters_optimal(fp, n):
    if not 0 < fp < 1:
        raise ValueError("False positive rate should be > 0 and < 1")
    # Optimal bit-array size: m = -n * ln(fp) / (ln 2)^2
    m = math.ceil(-1 * (n * math.log(fp)) / math.pow(math.log(2), 2))
    # Optimal number of hash functions: k = (m / n) * ln 2
    k = math.ceil(math.log(2) * m / n)
    print("Optimal size of bloom filter %s bits" % m)
    print("Optimal size of bloom filter %s MiB" % round(m / (math.pow(1024, 2) * 8), 2))
    print("Number of hash functions required %s" % k)

if __name__ == '__main__':
    false_positive = float(sys.argv[1])
    number_of_elements = int(sys.argv[2])
    bloom_filters_optimal(false_positive, number_of_elements)
```
Assume we are expecting around 10 million elements and we start with a false positive rate of 10%, i.e. 0.1. Running the code with these values gives:
```
python bloom_filter.py 0.1 10000000
Optimal size of bloom filter 47925292 bits
Optimal size of bloom filter 5.71 MiB
Number of hash functions required 4
```
For the same number of elements with a 0.01 false positive rate, the output would be:
```
python bloom_filter.py 0.01 10000000
Optimal size of bloom filter 95850584 bits
Optimal size of bloom filter 11.43 MiB
Number of hash functions required 7
```
And with 0.001 false positive rate, the output is:
```
python bloom_filter.py 0.001 10000000
Optimal size of bloom filter 143775876 bits
Optimal size of bloom filter 17.14 MiB
Number of hash functions required 10
```
We can see that the lower the false positive rate, the larger the bit array (m) and the more hash functions (k) we need.
Bloom filters are used in everyday applications. Some of the more common ones are:
- Web crawlers use Bloom filters to check whether a page has already been crawled.
- The Chrome browser has used Bloom filters to check for malicious URLs.
- The Squid caching proxy uses Bloom filters in its cache digests (to check whether a page has previously been requested).
- Online hotel booking portals use Bloom filters to check room availability instead of querying the database every time.
When in Doubt, Use Actual Source of Truth
Every Check operation on an element returns either 'not present' or 'maybe present'. For assurance, consult the actual source of truth. For example, suppose an e-commerce company has released some promo codes. When a coupon code is applied, the system can first check with the Bloom filter whether such a code may exist. If the filter returns 'not present', the company can safely assume an invalid coupon code was applied. If 'maybe present' is returned, however, the database should be queried to confirm existence.
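This "filter first, source of truth second" pattern can be sketched as follows (a hedged illustration: `bloom_check` and `database_lookup` are hypothetical stand-ins, here simulated with plain Python callables):

```python
def coupon_exists(code, bloom_check, database_lookup):
    # bloom_check and database_lookup are hypothetical stand-ins for a real
    # Bloom filter query and a real database lookup.
    if not bloom_check(code):
        # 'Not present' is guaranteed: reject without touching the database.
        return False
    # 'Maybe present': confirm against the actual source of truth.
    return database_lookup(code)

# Toy stand-ins for demonstration:
valid_codes = {"SAVE10", "FREESHIP"}
bloom = lambda c: c in valid_codes or c == "FALSEPOS"  # simulates one false positive
db = lambda c: c in valid_codes

print(coupon_exists("SAVE10", bloom, db))    # True
print(coupon_exists("BOGUS", bloom, db))     # False: filter says 'not present'
print(coupon_exists("FALSEPOS", bloom, db))  # False: database overrules the false positive
```

The win is that the vast majority of invalid codes are rejected in O(1) without a database round trip; only the rare false positive pays for a real lookup.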
Persisting Bloom filters to Disk
You can save a Bloom filter directly to disk and marshal/unmarshal it every time. Another, more efficient way is to use Redis; there are open-source implementations written in Java that use Redis to persist.
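The disk-based marshal/unmarshal approach can be sketched in a few lines of Python (an illustration only: a plain bit list stands in for the filter, and `pickle` stands in for whatever serialisation your Bloom filter library provides):

```python
import os
import pickle
import tempfile

# A plain bit list stands in for an in-memory Bloom filter.
bloom_bits = [0] * 16
bloom_bits[3] = bloom_bits[7] = 1

path = os.path.join(tempfile.gettempdir(), "bloom.pkl")

# Marshal the filter to disk at shutdown (or on a schedule).
with open(path, "wb") as f:
    pickle.dump(bloom_bits, f)

# Unmarshal it again on the next run.
with open(path, "rb") as f:
    restored = pickle.load(f)

print(restored == bloom_bits)  # True
```

With Redis, the bit array can instead live server-side (e.g. in a Redis bitmap), so multiple processes share one filter and nothing needs re-loading on restart.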
Use Rolling-Date Bloom Filters and Check in Parallel
In one of our use cases, the data we ingested was not in order, which meant old data could also arrive for ingestion. To tackle this, we maintained a Bloom filter per day for the past eight days and checked all of them for possible duplicates. Doing this sequentially was quite slow, so we implemented the checks in parallel, resulting in an overall gain of over 250%.
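The rolling-date scheme can be sketched like this (a simplified illustration, not our production code: plain sets stand in for the daily Bloom filters, and the thread pool models the parallel fan-out, which pays off when each check is an I/O-bound call to a store such as Redis):

```python
from concurrent.futures import ThreadPoolExecutor
from datetime import date, timedelta

# One "filter" per day for the past eight days; sets stand in for Bloom filters.
today = date.today()
filters_by_day = {today - timedelta(days=d): set() for d in range(8)}
filters_by_day[today - timedelta(days=3)].add("event-123")

def is_duplicate(key, filters):
    # Fan the membership check out across all daily filters in parallel.
    with ThreadPoolExecutor(max_workers=len(filters)) as pool:
        results = pool.map(lambda f: key in f, filters.values())
    return any(results)

print(is_duplicate("event-123", filters_by_day))  # True: seen three days ago
print(is_duplicate("event-999", filters_by_day))  # False: safe to ingest
```

Each day a new filter is created and the oldest is dropped, so memory stays bounded no matter how long the pipeline runs.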
If you have any queries or feedback, please feel free to drop a comment below.