Data Deduplication Using Bloom Filters

Deduplication is a specialised compression technique that removes repeated copies of data. It is used to improve storage utilisation and to reduce the number of bytes sent over a network. Deduplication needs to be performed before ingesting data into a Data Warehouse, so that the data is clean and pristine for analysis and consumption.

What Are Bloom Filters

Invented by Burton Howard Bloom in 1970, a Bloom filter is a probabilistic data structure used to test set membership. It can guarantee that an element is NOT present in the set, but it cannot guarantee presence. In other words, a Bloom filter query responds with either ‘not in set’ or ‘possibly in set’: an element may be present in the set, but the Bloom filter cannot assure us of it (hence ‘possibly in set’).

The biggest advantage of this type of data structure is its space and time efficiency. It uses a plain bit array and has a lookup time of O(1): an extremely small space requirement, yet superfast. This speed and space efficiency, however, trades off against accuracy, because a Bloom filter can return false positives.

An empty Bloom filter is a bit array of size m, with all bits initially set to 0. There are also k hash functions used to hash (or map) each set element e to k of the m array positions, producing indices i1, i2, …, ik, with each index less than m. When an element is added, the bits at all k of its positions are set to 1. When checking, if any of those positions holds a 0, the element is definitely not present.

A Bloom filter allows two types of operations – Add and Check.

Let us take an example to further explain how Bloom filters work:

For our example, we take a bit array of size nine with all positions initially set to 0. We choose the number of hash functions (k) to be 2.

  1. Adding an element called “Dog”. Let’s say the hash functions return index positions 2 and 4. We update the bits at positions 2 and 4 to ‘1’.
  2. Adding another element, called “Cat”. The hash functions now return index positions 5 and 6. We update the bits at positions 5 and 6 to ‘1’.
  3. Looking up an element called “Rat”. Let’s say the hash functions return index positions 4 and 7. Since the bit at position 7 is 0, the filter assures us that “Rat” is not present.
  4. Bloom filters are time and space efficient, but there is a catch: false positives. An element may be reported as present even if it was never added. Continuing our example, checking for “Cow” could return index positions 2 and 5, both of which are already set to 1, so the filter answers ‘possibly in set’, a false positive.
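The walk-through above can be sketched as a toy Python implementation. The hash scheme (salted MD5) and the tiny sizes are illustrative choices, not how production filters are built:

```python
import hashlib

class BloomFilter:
    """Toy Bloom filter mirroring the walk-through above (illustrative only)."""

    def __init__(self, m, k):
        self.m = m            # number of bits
        self.k = k            # number of hash functions
        self.bits = [0] * m

    def _indexes(self, element):
        # Derive k positions by salting a single hash; real implementations
        # use faster non-cryptographic hashes such as MurmurHash.
        for i in range(self.k):
            digest = hashlib.md5(f"{i}:{element}".encode()).hexdigest()
            yield int(digest, 16) % self.m

    def add(self, element):
        for i in self._indexes(element):
            self.bits[i] = 1

    def check(self, element):
        # 'definitely not in set' if any position is 0, else 'possibly in set'
        return all(self.bits[i] for i in self._indexes(element))

bf = BloomFilter(m=9, k=2)
bf.add("Dog")
bf.add("Cat")
print(bf.check("Dog"))  # True (possibly in set)
# "Rat" will most likely be reported absent, though a false positive is
# always possible with such a tiny filter:
print(bf.check("Rat"))
```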

Deciding Optimal Parameters for Bloom filters

As the Bloom filter fills up, the chance of false positives increases. The question now is: how do we decide on the size of the Bloom filter?

Let’s say we need a Bloom filter to accommodate n elements, and we decide on a permissible false positive rate fp. Given n and fp, we can calculate the optimal size of the Bloom filter m and the number of hash functions required k.

import math
import sys

def bloom_filters_optimal(fp, n):
    if fp <= 0 or fp >= 1:
        return "False positive rate should be > 0 and < 1"

    m = math.ceil(-(n * math.log(fp)) / math.pow(math.log(2), 2))
    k = math.ceil(math.log(2) * m / n)
    print("Optimal size of bloom filter %s Bits" % m)
    print("Optimal size of bloom filter %s MiB" % round(m / (math.pow(1024, 2) * 8), 2))
    print("Number of hash functions required %s" % k)

if __name__ == '__main__':
    false_positive = float(sys.argv[1])
    number_of_elements = int(sys.argv[2])
    bloom_filters_optimal(false_positive, number_of_elements)

Assume we expect around 10 million elements and start with a false positive rate of 10%, i.e. 0.1. On running the code with these values, the output is:

python 0.1 10000000
Optimal size of bloom filter 47925292.0 Bits
Optimal size of bloom filter 5.71 MiB
Number of hash functions required 4.0

For the same number of elements with a 0.01 false positive rate, the output is:

python 0.01 10000000
Optimal size of bloom filter 95850584.0 Bits
Optimal size of bloom filter 11.43 MiB
Number of hash functions required 7.0

And with 0.001 false positive rate, the output is:

python 0.001 10000000
Optimal size of bloom filter 143775876.0 Bits
Optimal size of bloom filter 17.14 MiB
Number of hash functions required 10.0

We can see that the lower the permissible false positive rate, the larger the required array size (m) and the greater the number of hash functions (k).


Bloom filters are used in everyday applications. Some of the more common ones are:

  • Web crawlers use Bloom filters to check whether a page has already been crawled.
  • The Chrome browser used Bloom filters to check for malicious URLs.
  • The Squid caching proxy uses Bloom filters (cache digests) to check whether a page has been requested before.
  • Online hotel booking portals use Bloom filters to check room availability instead of querying the database every time.

When in Doubt, Use Actual Source of Truth

Every time a Check operation is done for an element, it returns either that the element is not present in the Bloom filter or that it may be present. For assurance, use the actual source of truth. For example, say an e-commerce company has released some promo codes. When a coupon code is applied, the system can first check with the Bloom filter whether such a code may exist. If the filter returns ‘Not Present’, the company can safely assume that a wrong coupon code has been applied. However, if ‘Maybe Present’ is returned, the database should be queried for an existence check.
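A sketch of that flow in Python; `bloom` and `db` are hypothetical stand-ins for a real Bloom filter and the coupon database:

```python
def coupon_exists(code, bloom, db):
    """Answer definitively, touching the database only when necessary."""
    if not bloom.check(code):
        return False            # 'not in set' is guaranteed: skip the DB entirely
    return db.has_coupon(code)  # 'possibly in set': confirm with the source of truth
```

With a 1% false positive rate, roughly 99% of invalid codes are rejected without a database round trip.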

Persisting Bloom filters to Disk

You can save a Bloom filter directly to disk and marshal/unmarshal it every time. Another, more efficient way is to use Redis; open-source implementations exist, for example in Java, that use Redis for persistence.

Use Rolling-Date Bloom Filters and Check in Parallel

In one of our use cases, the data we ingested was not in order, which made it possible for old data to be ingested again. To tackle this problem, we maintained Bloom filters based on the current date for the past eight days and checked all of them for possible duplicates. Doing this sequentially was quite slow, so we implemented the check in parallel for every entry, resulting in an overall gain of over 250%.
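A rough sketch of that parallel lookup, assuming each day's Bloom filter exposes a `check()` method (names and pool sizing are illustrative):

```python
from concurrent.futures import ThreadPoolExecutor

def is_duplicate(entry, daily_filters):
    """Check one entry against several days' Bloom filters in parallel.

    An entry is a duplicate if ANY of the rolling-date filters possibly
    contains it. A real system would reuse one pool rather than create one
    per entry.
    """
    with ThreadPoolExecutor(max_workers=len(daily_filters)) as pool:
        results = pool.map(lambda f: f.check(entry), daily_filters)
    return any(results)
```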

If you have any queries or comments, please feel free to drop us a comment below.

Content that brings inner peace, delivered straight to your inbox:
(no spam, promise!)

Build a Highly Available Distributed Cron

Crons are widely used to set up scheduled jobs and automate certain parts of a system. Setting up Cron for a simple application is very straightforward. All you need to do is configure it in your server using `crontab` to run at scheduled intervals and trigger the job.

But what if the server with this crontab goes down? We need to ensure that the system is resilient to a single-node failure. The simplest way to do this is to have 2 (or more) servers with the crontab running. That, however, might trigger the job multiple times, and we surely don’t want that to happen!

What we really need is some sort of a fault tolerant, resilient cron setup. How do we then build a Highly Available Distributed Cron?
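The AWS-based setups below move scheduling out of the servers entirely, but the core requirement, that only one node triggers the job, can be sketched with a shared lock. The lock service here is faked in-process; a real deployment would use an external store such as a database row or a Redis key:

```python
import threading

class LockService:
    """In-process stand-in for a shared lock service (illustrative only)."""

    def __init__(self):
        self._mutex = threading.Lock()
        self._owner = None

    def try_acquire(self, node_id):
        with self._mutex:
            if self._owner is None:
                self._owner = node_id
                return True
            return False

def run_job_once(node_id, lock_service, job):
    """Every node attempts the lock at the scheduled time; only the winner runs."""
    if lock_service.try_acquire(node_id):
        job()
        return True
    return False
```

With this in place, N crontabs can fire simultaneously and the job still runs exactly once per schedule.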

Let’s say, an application wants to send out reminder emails for drinking water every 30 minutes.

Using AWS CloudWatch As Cron

CloudWatch is a monitoring service. While it is extremely popular for tracking status and logging system reports, its alarm-setting ability can also be used to act as a cron.

  • Expose the Email Sending Service as an HTTP endpoint.
  • Create an AWS SNS topic that will hit the exposed HTTP endpoint for triggering the job.
  • Now set up CloudWatch event to periodically trigger a notification. The SNS topic created above will be the event target.

(CloudWatch by itself cannot talk to HTTP endpoints, thus bringing SNS into the picture.)

(Diagram: a highly available distributed cron built with CloudWatch, SNS and an HTTP endpoint)

Using AWS Lambda as CloudWatch target

If you don’t want to expose any extra HTTP endpoints, AWS Lambda could be another alternative.

  • Create an email sending service as a Lambda function.
  • Set up CloudWatch event to periodically trigger a notification. Include Lambda as this event’s target.

There are a few limitations to using AWS Lambda: memory per run is capped (128 MB at the lowest setting) and each run cannot be longer than 5 minutes. So if your logic is heavy and consumes more memory or time, you’d be better off exposing the endpoint and using CloudWatch with SNS.

Note: While we have used AWS tools in our solutions, other cloud service providers also have provisions to implement a similar solution. Microsoft Azure has scheduled Azure Functions and Google Cloud provides App Engine Cron.

Enjoyed our content? Subscribe to receive our latest articles right in your inbox:
(no spam, promise!)

Microservice Architecture – Beyond HTTP

This talk explores why HTTP is not the best fit for microservice architectures and how it results in the need for complex cloud infrastructure components. It presents an alternative using Redis pub/sub and some of Redis’s other synchronisation facilities.

It’s a demo of the simplicity and extensibility of the solution, and of how it enables the development team to focus on the core logic rather than worry about deployment, operations and orchestration overhead.


Resiliency in microservices


One of the important aspects of building a system involving a lot of microservices is the ability to heal or contain failure: resilience.

How do you ensure resiliency, avoiding cascading failures in microservices?

Let’s take an example.

Service Dependencies

Service Interaction

  • Client calls Service A
  • Service A depends on Service B to satisfy the request
  • Service B
    • Responds fast – Success.
    • Responds with Connection Refused / Reset – Handled in code.
    • Responds slow – Timeouts, Retries.

Timeouts, Retries

Slow resources fail slowly.

The last situation where the dependent service is slow is the most interesting. Service A’s handler blocks for the slow resource. During that time, the handler is doing nothing useful, and causing a cascading failure.
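The timeout-and-retry handling mentioned above can be sketched as follows; the retry budget, backoff factor and exception type are illustrative assumptions:

```python
import time

def call_with_retries(func, retries=3, backoff=0.1):
    """Retry a flaky call a bounded number of times with exponential backoff."""
    for attempt in range(retries):
        try:
            return func()
        except ConnectionError:
            if attempt == retries - 1:
                raise                          # budget exhausted: surface the failure
            time.sleep(backoff * (2 ** attempt))  # back off before retrying
```

Note that unbounded retries against a slow dependency make the cascading-failure problem worse, which is why the circuit-breaker approach discussed next caps the damage.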

This can be solved in a couple of ways, both involving some global state to monitor such failures.

  1. Circuit Breaker : If we hit a timeout on a dependent resource more than once, it will probably fail on subsequent requests too. Instead, we can mark it as dead and throw exceptions to be handled immediately.
  2. Bulkheads : This looks at services as connection pools. If access to Service B is restricted to 5 workers at a time, the rest fail immediately unless a connection can be established. Arriving at the number 5 requires a lot of monitoring insight. This works best when response times are expected to be long.
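A minimal sketch of the circuit-breaker idea in Python; the threshold, timeout and error handling are illustrative choices, not a production implementation:

```python
import time

class CircuitBreaker:
    """Fail fast once a dependency has failed repeatedly (illustrative sketch)."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                # Circuit is open: don't touch the slow dependency at all.
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None  # half-open: allow one trial call
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()  # trip the breaker
            raise
        self.failures = 0  # a success closes the circuit again
        return result
```

Service A would wrap every call to Service B in `breaker.call(...)`; once B has failed `failure_threshold` times, callers get an immediate exception instead of blocking on a slow resource.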

A bulkhead is an upright wall within the hull of a ship which serves to limit the failure within the compartment.

(Figure: bulkheads in a ship)

If water breaks through into one compartment, the bulkheads prevent it from flowing into the others. This prevents cascading failures and the entire ship from capsizing.

The Titanic is a well-known example of what happens when you don’t have proper isolation, leading to cascading failures.


Some great libraries that help with the actual instrumentation are:

  1. Shopify’s Semian (Ruby, Great documentation)
  2. Netflix’s Hystrix (Very popular)


Terraform, null_resources & Azure ASM API

Recently, I was trying to bring up virtual machines in Microsoft Azure but ran into the interesting and annoying problem of not being able to upload SSH keys via the Terraform DSL. There is a provision to supply an ssh_key_thumbprint, but sadly no way to upload what you would call a KeyPair in AWS jargon.

While Terraform does not support this operation via its DSL, it is possible to achieve it using some less-explored features of Terraform.


I am using OS X, so my code samples might include some OS X specific commands. However it should be fairly easy to carry out these operations on other operating systems too.

First, the azure cli must be installed. The easiest way to do that is using brew:

$: brew install azure-cli

Post installation, you will have to authenticate the azure cli, but that’s fairly easy: run $: azure login and the subsequent on-screen instructions will handhold you through the process.

Next, generate an SSL certificate that meets the following requirements:

  • The certificate must contain a private key.
  • The certificate must be created for key exchange, exportable to a Personal Information Exchange (.pfx) file.
  • The certificate must use a minimum of 2048-bit encryption.

An SSH key pair needs to be associated with an Azure service, which you can describe in a service.json file.

Here’s how you can generate a certificate, a .pfx file and upload it to Azure portal.

openssl req -x509 \
  -keyout /tmp/${service}-deployer.key \
  -nodes \
  -days 1365 -newkey rsa:2048 \
  -out /tmp/${service}-deployer.pem \
  -subj '/O=<org>/C=US'
openssl x509 \
  -outform der \
  -in /tmp/${service}-deployer.pem \
  -out /tmp/${service}-deployer.pfx
azure service cert create $service /tmp/${service}-deployer.pfx

The Azure API also provides a way to fetch the list of all certificates uploaded and attached to its services.

piyush:azure master λ azure service cert list
info: Executing command service cert list
+ Getting cloud services
+ Getting cloud service certificates
data: Service Name Thumbprint Algorithm
data: domain-gamma 4F2AUA9ADF39830CDEHAJAND553DEANAJNAD8C8F sha1
info: service cert list command OK

The recently uploaded certificate now shows up with a corresponding thumbprint, which can be used to provision new Azure machines.


So while the above example works well, it does not yet feel automatic. I am still responsible for the grunt work of checking whether the certificate has been uploaded and, if not, creating a key pair, uploading the .pfx and saving the thumbprint corresponding to that service, all of this before running the Terraform plan. Things can definitely be done better.


You mainly have to observe four things:

  • depends_on
  • null_resource.ssh_key
  • ssh_key_thumbprint: ${file("./ssl/ssh_thumbprint")}
  • ssl/
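The Terraform snippet under discussion was roughly of this shape. This is a hypothetical sketch: the resource names, script path and variable names are illustrative, not the original configuration.

```hcl
# Hypothetical sketch; names and paths are illustrative.
resource "null_resource" "ssh_key" {
  # Re-run if the thumbprint file changes (or disappears and is recreated).
  triggers {
    thumbprint = "${file("./ssl/ssh_thumbprint")}"
  }

  # Delegate certificate creation/upload to a standalone script.
  provisioner "local-exec" {
    command = "./ssl/ensure_cert.sh ${var.service}"
  }
}

resource "azure_instance" "vm" {
  depends_on         = ["null_resource.ssh_key"]
  ssh_key_thumbprint = "${file("./ssl/ssh_thumbprint")}"
  # ... remaining instance arguments ...
}
```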


While most dependencies in Terraform are implicit, i.e. Terraform is able to infer dependencies based on the usage of attributes of other resources, sometimes you need to specify explicit dependencies. You can do this with the depends_on parameter, which is available on any resource.

I recommend reading more about Terraform dependencies here.

By injecting a depends_on, we can defer the responsibility of ensuring a thumbprint exists to another resource, one that completes before the instance is created.

Note (FAQ): Using a local-exec provisioner directly will not work here, because local-exec runs AFTER the resource has been created, not before. Also, a local-exec provisioner on a previous operation does not guarantee a re-run if the resource itself does not change.

Read on for the solution.


The null_resource is a resource that allows you to configure provisioners that are not directly associated with a single existing resource.

null_resource is like a dummy stub that you can use to insert a node encapsulating provisioners between two existing stages of the graph. Its position is determined by referring to this resource via a depends_on from the child resource. In this case, the null_resource will be referenced from the azure_instance resource.

You can read more about terraform’s null_resource here.

Say we delegate all the duties to a standalone Bash script; we can then invoke the script as a local-exec provisioner from the null_resource.


But what if someone deletes the ssh_thumbprint file? Every subsequent Terraform run would panic and crash. The solution lies in the triggers attribute of a null_resource. triggers is a mapping of values which, when changed, trigger a re-run of this set of provisioners. The values are meant to be interpolated references to variables or attributes of other resources.

In this case it is a file read from the filesystem, so any change forces the resource to be re-triggered, eventually forcing a re-converge on the instances that depend on this null_resource.


Putting it together: the Bash script accepts the service name and tries to locate an existing uploaded certificate for that service. If none exists, it generates a new .pfx using the techniques mentioned above, fetches the ssh_key_thumbprint and saves it to a common file that the Terraform instance resource can read.

Now you should be able to provision an SSH-only VM and use the generated .pem file to log in to your freshly created virtual machine. Yay!


Terraform RemoteState Server

Terraform is a pretty nifty tool for laying out complex infrastructure across cloud providers. It is an expressway past the otherwise mundane and tedious task of wading through insane amounts of API documentation.

The output of a Terraform run is a JSON document that carries a wealth of information the cloud platform provides about each resource: instance_id, public_ip, local_ip, tags, DNS, security groups etc. It has often left me wondering if I could search or access these JSON documents from configuration management recipes, playbooks, or modules.

Example: while provisioning a ZooKeeper instance, I want the local IPs of all the peer nodes; I could run a query that fetches the local_ips of all nodes in this VPC that have the same security group. Or while applying a security patch to all the Redis nodes, I need the public IPs of all nodes that carry the tag `node_type: redis`.
I hope you get the idea of the use cases by now, and it definitely sounds like something a document DB should handle with relative ease.

To achieve this, Terraform does not expose pluggable backends for custom formatters; it does, however, provide the ability to talk to a RESTful server. Every time state needs to be read, Terraform makes a GET call on the /path specified while setting up the remote config. A save operation corresponds to a POST call on the same /path, and a delete operation to a DELETE call.
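That protocol is small enough to sketch. The handler below keeps state in an in-memory dict purely for illustration; the real implementation described here persists to MongoDB:

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# In-memory stand-in for a real backend (the real server uses MongoDB).
STATE = {}

class TFStateHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        body = STATE.get(self.path)
        if body is None:
            self.send_response(404)
            self.end_headers()
            return
        self.send_response(200)
        self.end_headers()
        self.wfile.write(body)

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        STATE[self.path] = self.rfile.read(length)  # save the state blob
        self.send_response(200)
        self.end_headers()

    def do_DELETE(self):
        STATE.pop(self.path, None)
        self.send_response(200)
        self.end_headers()
```

Pointing `terraform remote config -backend=http` at such a server exercises exactly these three methods.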

Here’s how you add a remote config to your terraform project:

terraform remote config \
    -backend=http \

While I wanted to export the information to MongoDB, others might want to store it somewhere else, maybe Redis. Capitalising on Terraform’s ability to talk to a RESTful state server, I decided to write an implementation that takes data from the RESTful endpoint and saves it to MongoDB. Once it reaches MongoDB, it is fairly convenient to use that information in configuration management code.

So I quickly put together a RESTful server in Golang; it took less than a day’s effort.

Given that you have GOPATH etc. configured properly (in case you are new to Golang, I suggest reading up on workspaces first), you can download tfstate as simply as:

$: go get

This should provide you with a binary that you can execute as:

$: tfstate -config=/path/to/config.yaml

A sample configuration looks like this:

  database: terraform
  username: transformer
  password: 0hS0sw33t

Although tfstate talks to MongoDB by default, implementing your own backend is fairly easy. Each provider has to implement the Storer interface, which looks like this:

type Storer interface {
    Setup(cfgpath string) error
    Get(ident string) ([]byte, error)
    Save(ident string, data []byte) error
    Delete(ident string) error
}
Look at the default MongoDB provider for a sample implementation of this interface.

Here’s an output from a working use case:

piyush:infra-monk: master λ tfstate -config tfstate.yaml

2016/06/15 22:19:02 Getting ident azure-state-zookeeper
2016/06/15 22:19:07 Saving ident azure-state-zookeeper to DB
2016/06/15 22:19:27 Saving ident azure-state-zookeeper to DB

2016/06/15 22:20:39 Getting ident aws-state-cassandra
2016/06/15 22:20:41 Saving ident aws-state-cassandra to DB
2016/06/15 22:23:52 Saving ident aws-state-cassandra to DB

Feel free to leave a comment or send Pull Requests 🙂


Singletons in Golang

I was recently writing an application in Golang that required some database interaction. The DB library I was using had built-in pooling, so I didn’t have to bother about connection recycling and reuse as long as I could initialise a DbPool once and keep using it. A module-level singleton DbPool object would do the trick. However, the problem with singletons is that, in a multithreaded environment, the initialisation must be protected to prevent re-initialisation.

This post discusses a few common ways to achieve this, along with the shortcomings of each approach.

Module init()

The most common approach I have come across is to define init() functions in module files. These module-level constructors perform operations like DB pool initialisation or cache warming. Such code is guaranteed to run once and only once at startup of your program. Looks good.

I have two problems with this approach:

  1. Import Order: The init order is defined by the order in which the packages and files show up in your code, and unless you rename your files obscurely there is no way to control this sequence.
  2. Implicit Calls: These inits are called automatically at startup and there is no way to invoke them explicitly. This makes such code quite a challenge to test. For example, if you wanted to test a piece of code that depends on a pre-initialised DB state, you could not easily mock the connection by seeding that value from within the test suite.

Import Order Problem

Let’s say you have a directory structure that looks something like this:

├── abc
│   ├── one.go
│   └── two.go
├── main.go
└── pack
    ├── one.go
    └── two.go

Where one.go has the following init() function:

func init() {
	log.Println("<package_name> - One")
}

And two.go’s init method looks like this:

func init() {
	log.Println("<package_name> - Two")
}

And your main.go has very simple code that looks like this:

package main

import (
	"log"
	// imports of the abc and pack packages elided in the original
)

func main() {
	log.Println("hello world")
	log.Println(pack.PackOne, pack.PackTwo, abc.AbcOne, abc.AbcTwo)
}

Output will always be:

2016/06/29 21:34:53 Abc - One
2016/06/29 21:34:53 Abc - Two
2016/06/29 21:34:53 Pack - One
2016/06/29 21:34:53 Pack - Two
2016/06/29 21:34:53 hello world
2016/06/29 21:34:53 1 2 1 2

Since package abc appears ahead of package pack (alphabetically), and both of them are imported in main, there is no way to alter the init order without renaming a package.
Also, if pack.PackOne had to be seeded with a mock value while testing, it could not be done, because there is no way to invoke the init method explicitly. And while testing, database connectors are something you more often than not have to mock.


Module-Level Cache with a Mutex

An alternate approach is to use a module-level cache variable with an embedded read-write mutex to ensure synchronisation across multiple goroutines.

An explicit method can then acquire the read-write lock, check whether the value has already been initialised and return it, or initialise it first and then return it.

A sample implementation of this approach produced the output below. On examining it carefully, you will observe a problem: the code keeps trying to acquire locks even after the first initialisation is complete.

2016/06/29 21:22:57 lock 5
2016/06/29 21:22:57 Initializing GetInt
2016/06/29 21:22:57 lock freed 5
2016/06/29 21:22:57 &{1}
2016/06/29 21:22:57 lock 3
2016/06/29 21:22:57 lock freed 3
2016/06/29 21:22:57 &{1}
2016/06/29 21:22:57 lock 1
2016/06/29 21:22:57 lock freed 1
2016/06/29 21:22:57 &{1}
2016/06/29 21:22:57 lock 2
2016/06/29 21:22:57 lock freed 2
2016/06/29 21:22:57 &{1}
2016/06/29 21:22:57 lock 0
2016/06/29 21:22:57 lock freed 0
2016/06/29 21:22:57 &{1}
2016/06/29 21:22:57 lock 4
2016/06/29 21:22:57 lock freed 4
2016/06/29 21:22:57 &{1}

After 5 was initialised, goroutines 1, 2, 3 and 4 should have been free to run in parallel. But since access to the cached value is guarded by a read-write lock that only one goroutine holds at a time, they pretty much execute in sequence.

There should be a better way to tackle this.


sync.Once

By definition, a singleton is a design pattern that restricts instantiation to one object. It would be a lot more efficient if there were a way to lock JUST the first initialisation. Thereafter, any piece of code should be free to access the value without having to bother about locking and inevitably blocking other goroutines.

sync.Once allows you to do exactly that: a Once is an object that will perform exactly one action. You can read more about it in the sync package documentation.
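For readers less familiar with Go, the once-only semantics boil down to a double-checked flag. A rough Python sketch of the same idea (illustrative, not the Go standard library's implementation):

```python
import threading

class Once:
    """Rough Python analogue of Go's sync.Once (illustrative sketch)."""

    def __init__(self):
        self._done = False
        self._lock = threading.Lock()

    def do(self, fn):
        if self._done:             # fast path: no locking after the first run
            return
        with self._lock:
            if not self._done:     # double-check under the lock
                fn()
                self._done = True
```

Only the very first caller pays for the lock; every later call returns on the unlocked fast path.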

The same code as in the last method, moved to the sync.Once pattern (with the initialisation inside once.Do), produces the following output:

2016/06/29 21:26:42 No lock 5
2016/06/29 21:26:42 No lock 0
2016/06/29 21:26:42 No lock 2
2016/06/29 21:26:42 No lock 4
2016/06/29 21:26:42 No lock 3
2016/06/29 21:26:42 Initializing GetInt
2016/06/29 21:26:42 No lock return 5
2016/06/29 21:26:42 &{1}
2016/06/29 21:26:42 No lock return 0
2016/06/29 21:26:42 &{1}
2016/06/29 21:26:42 No lock return 2
2016/06/29 21:26:42 No lock 1
2016/06/29 21:26:42 No lock return 1
2016/06/29 21:26:42 &{1}
2016/06/29 21:26:42 No lock return 4
2016/06/29 21:26:42 &{1}
2016/06/29 21:26:42 &{1}
2016/06/29 21:26:42 No lock return 3
2016/06/29 21:26:42 &{1}

Do observe that after the first initialisation by 5, the goroutines do not block each other and run in effectively random order. This bypasses the locking and still gives you the flexibility of being able to invoke the initialisation explicitly. The only downside of this method is that you need a separate Once object for each such cached variable in your code. It also requires a promise that the value will not change through the lifecycle of the program.
