
Resiliency in microservices


One of the important aspects of building a system composed of many microservices is its ability to heal from, or contain, failure. Resilience.

How do you ensure resiliency, avoiding cascading failures in microservices?

Let’s take an example.

Service Dependencies

Service Interaction

  • Client calls Service A
  • Service A depends on Service B to satisfy the request
  • Service B
    • Responds fast – Success.
    • Responds with Connection Refused / Reset – Handled in code.
    • Responds slow – Timeouts, Retries.

Timeouts, Retries

Slow resources fail slowly.

The last situation, where the dependent service is slow, is the most interesting. Service A’s handler blocks on the slow resource. During that time the handler is doing nothing useful, and the blocking can snowball into a cascading failure.

This can be solved in a couple of ways, both involving some global state that monitors this performance.

  1. Circuit Breaker : If we hit a timeout on a dependent resource more than once, it will probably fail on subsequent requests too. Instead of waiting, we can mark it as dead and throw exceptions that are handled immediately.
  2. Bulkheads : This treats services as connection pools. If access to Service B is restricted to 5 workers at a time, the rest fail immediately unless a connection can be established. Arriving at the number 5 requires a lot of monitoring insight. This works best when response times are expected to be long.

A bulkhead is an upright wall within the hull of a ship which serves to limit a failure to the compartment it occurs in.


Bulkheads in a ship

If water breaks through in one compartment, it is prevented from flowing into the others. This stops the failure from cascading and the entire ship from capsizing.

The Titanic is a very well known example of what happens when you don’t have proper isolation, leading to cascading failures.


Some great libraries that help with the actual instrumentation are:

  1. Shopify’s Semian (Ruby, Great documentation)
  2. Netflix’s Hystrix (Very popular)

Enjoyed our content? Subscribe to receive our latest articles right in your inbox:
(no spam, promise!)

Terraform, null_resources & Azure ASM API

Recently, I was trying to bring up virtual machines in Microsoft Azure but ran into an interesting and annoying problem: not being able to upload SSH keys via the Terraform DSL. There is a provision to provide an ssh_key_thumbprint, but sadly no way to upload what you would call a KeyPair in AWS jargon.

While Terraform does not support this operation via its DSL, it is possible to achieve it using some less-explored features of Terraform.


I am using OS X, so my code samples might include some OS X specific commands. However, it should be fairly easy to carry out these operations on other operating systems too.

First, the Azure CLI must be installed. The easiest way to do that is using brew:

$: brew install azure-cli

Post installation, you will have to authenticate the Azure CLI, but that’s fairly easy: all you have to do is run $: azure login, and the subsequent instructions on screen will handhold you through the process.

Next, generate an SSL certificate that meets the following requirements:

  • The certificate must contain a private key.
  • The certificate must be created for key exchange, exportable to a Personal Information Exchange (.pfx) file.
  • The certificate must use a minimum of 2048-bit encryption.

An SSH key pair must be associated with an Azure service, so start by creating a service.json describing the service.

Here’s how you can generate a certificate, a .pfx file and upload it to Azure portal.

openssl req -x509 \
  -nodes \
  -days 1365 -newkey rsa:2048 \
  -keyout /tmp/$service-deployer.key \
  -out /tmp/$service-deployer.pem \
  -subj '/ Inc./C=US'
openssl x509 \
  -outform der \
  -in /tmp/$service-deployer.pem \
  -out /tmp/$service-deployer.pfx
azure service cert create $service /tmp/$service-deployer.pfx

The Azure API also provides a way to fetch the list of all certificates uploaded and attached to its services.

piyush:azure master λ azure service cert list
info: Executing command service cert list
+ Getting cloud services
+ Getting cloud service certificates
data: Service Name Thumbprint Algorithm
data: domain-gamma 4F2AUA9ADF39830CDEHAJAND553DEANAJNAD8C8F sha1
info: service cert list command OK

The recently uploaded certificate now shows up with a corresponding thumbprint, which can be used to provision new Azure machines.


So while the above example works well, it does not yet have an automatic essence to it. I am still responsible for the grunt work: checking whether the certificate has been uploaded and, if not, creating a key pair, uploading the .pfx and saving the thumbprint corresponding to that service, all before running the terraform plan. Things can definitely be done better.


You mainly have to observe these four things in the above example:

  • depends_on
  • null_resource.ssh_key
  • ssh_key_thumbprint: ${file("./ssl/ssh_thumbprint")}
  • ssl/


While most dependencies in Terraform are implicit (Terraform is able to infer dependencies based on usage of attributes of other resources), sometimes you need to specify explicit dependencies. You can do this with the depends_on parameter, which is available on any resource.

I recommend reading more about Terraform dependencies here.

By injecting a depends_on, we can defer the responsibility of ensuring a thumbprint exists to another resource, which must run before the instance is created.

Note (FAQ): Using a local-exec provisioner on the instance itself will not work here, because local-exec runs AFTER the resource has been created, not before. Also, a local-exec provisioner on any previous resource is not guaranteed to re-run if that resource itself does not change.

Read on, for the solution.


The null_resource is a resource that allows you to configure provisioners that are not directly associated with a single existing resource.

null_resource is like a dummy stub that you can use to insert a node that encapsulates provisioners between two existing stages of the graph. Its position is determined by referring to this resource via a depends_on from the child resource. In this case, the null_resource will be referenced from the azure_instance resource.

You can read more about terraform’s null_resource here.

If we delegate all the duties to a standalone Bash script, we can invoke the script as a local-exec provisioner from the null_resource.
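Tying these pieces together, the wiring looks roughly like the following sketch. The resource names, script path and service name here are illustrative reconstructions from the description above, not the original configuration:

```hcl
# Stub resource whose only job is to run the helper script that
# guarantees ./ssl/ssh_thumbprint exists and is current.
resource "null_resource" "ssh_key" {
  # Re-run the provisioner whenever the thumbprint file changes on disk.
  triggers {
    thumbprint = "${file("./ssl/ssh_thumbprint")}"
  }

  provisioner "local-exec" {
    # Hypothetical helper script; it uploads the cert if missing
    # and writes the thumbprint to ./ssl/ssh_thumbprint.
    command = "./scripts/ensure_ssh_thumbprint.sh myservice"
  }
}

resource "azure_instance" "web" {
  # ... other instance attributes elided ...
  ssh_key_thumbprint = "${file("./ssl/ssh_thumbprint")}"

  # Explicit dependency: the thumbprint must be ensured first.
  depends_on = ["null_resource.ssh_key"]
}
```

The depends_on edge forces the null_resource (and hence the script) to converge before the azure_instance is created.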


But what if someone deletes the ssh_thumbprint file? Every subsequent Terraform run would panic and crash. The solution lies in the triggers attribute of a null_resource. triggers is a mapping of values which should trigger a rerun of this set of provisioners. The values are meant to be interpolated references to variables or attributes of other resources.

In this case, it’s a file that is being read from the filesystem, so any change forces the resource to be re-triggered, eventually forcing a re-converge on the instances that depend on this null_resource.


Finally, we put together the Bash script: it accepts the service name and tries to locate an existing uploaded certificate for that service. If none exists, it generates a new .pfx using the techniques mentioned above, fetches the ssh_key_thumbprint, and saves it to a common file that the terraform instance resource can read.

Now, you should be able to provision an SSH-only VM and use the generated .pem file to log in to your freshly created Virtual Machine. Yay!


Terraform RemoteState Server

Terraform is a pretty nifty tool for laying out complex infrastructures across cloud providers. It is an expressway past the otherwise mundane and tedious task of going through insane amounts of API documentation.

The output of a terraform run is a JSON document which carries an awful lot of information that the cloud platform provides about a resource, like instance_id, public_ip, local_ip, tags, dns, security groups etc., and it has often left me wondering if I could search/access these JSON documents from configuration management recipes, playbooks, or modules.

Example: while provisioning a zookeeper instance, I want the local_ip of all the peer nodes. I could run a query that would fetch me the local_ips of all the nodes in this VPC that have the same security group. Or while applying a security patch to all the Redis nodes, I need the public_ip of all nodes that carry the tag `node_type: redis`.
I hope you get the idea of the use cases by now; it definitely sounds like something that a document DB should be able to handle with relative ease.

To achieve this, Terraform does not expose any pluggable backends for custom formatters; however, it does provide the ability to talk to a RESTful server. Every time the state needs to be read, terraform makes a GET call on the /path specified while setting up the remote config. A save operation corresponds to a POST call on the same /path, and a delete operation to a DELETE call.

Here’s how you add a remote config to your terraform project (the address below is illustrative; point it at wherever your state server is listening):

terraform remote config \
    -backend=http \
    -backend-config="address=http://localhost:8080/remote/my-project"
While I wanted to export the information to MongoDB, others might want to store it somewhere else, maybe a Redis? Capitalising on Terraform’s ability to talk to a RESTful state server, I decided to write an implementation that would take data from the RESTful endpoint and save it to a MongoDB. Once it reaches MongoDB, it’s fairly convenient and easy to use that information in the configuration management code.

So I quickly put together a RESTful server (less than a day’s effort) written in Golang, and it is available at

Given that you have GOPATH etc. configured properly (in case you are new to Golang, I suggest reading more about it here), you can download tfstate as simply as:

$: go get

This should provide you with a binary file that you can execute as:

$: tfstate -config=/path/to/config.yaml

A sample configuration looks like this:

  database: terraform
  username: transformer
  password: 0hS0sw33t

Although tfstate talks to MongoDB by default, implementing your own backend is fairly easy. Each provider has to implement the Storer interface, which looks like this:

type Storer interface {
    Setup(cfgpath string) error
    Get(ident string) ([]byte, error)
    Save(ident string, data []byte) error
    Delete(ident string) error
}
Look at for a sample implementation of this interface.
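For illustration, a hypothetical in-memory Storer might look like the sketch below. This is not part of tfstate; it is the kind of minimal implementation you might write as a reference or for unit tests:

```go
package main

import (
	"fmt"
	"sync"
)

// Storer is the backend interface from the post.
type Storer interface {
	Setup(cfgpath string) error
	Get(ident string) ([]byte, error)
	Save(ident string, data []byte) error
	Delete(ident string) error
}

// MemStore satisfies Storer with a plain map guarded by a mutex.
type MemStore struct {
	mu   sync.Mutex
	data map[string][]byte
}

func (m *MemStore) Setup(cfgpath string) error {
	// An in-memory store has no config to parse; just allocate the map.
	m.data = make(map[string][]byte)
	return nil
}

func (m *MemStore) Get(ident string) ([]byte, error) {
	m.mu.Lock()
	defer m.mu.Unlock()
	b, ok := m.data[ident]
	if !ok {
		return nil, fmt.Errorf("no state for ident %q", ident)
	}
	return b, nil
}

func (m *MemStore) Save(ident string, data []byte) error {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.data[ident] = data
	return nil
}

func (m *MemStore) Delete(ident string) error {
	m.mu.Lock()
	defer m.mu.Unlock()
	delete(m.data, ident)
	return nil
}

func main() {
	var s Storer = &MemStore{}
	s.Setup("")
	s.Save("azure-state-zookeeper", []byte("{}"))
	b, _ := s.Get("azure-state-zookeeper")
	fmt.Println(string(b))
}
```

Swapping MongoDB for another store is then just a matter of providing a different Storer.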

Here’s an output from a working use case:

piyush:infra-monk: master λ tfstate -config tfstate.yaml

2016/06/15 22:19:02 Getting ident azure-state-zookeeper
2016/06/15 22:19:07 Saving ident azure-state-zookeeper to DB
2016/06/15 22:19:27 Saving ident azure-state-zookeeper to DB

2016/06/15 22:20:39 Getting ident aws-state-cassandra
2016/06/15 22:20:41 Saving ident aws-state-cassandra to DB
2016/06/15 22:23:52 Saving ident aws-state-cassandra to DB

Feel free to leave a comment or send Pull Requests 🙂


Singletons in Golang

I was recently writing an application in Golang which required some database interaction. The db library I was using had inbuilt pooling, so I didn’t have to bother about connection recycling and reuse, as long as I initialised a DbPool once and kept reusing it. Having a module-level singleton DbPool object would do the trick. However, the problem with singletons is that, in a multithreaded environment, the initialisation must be protected to prevent re-initialisation.

This post discusses a few common ways to achieve this, along with the shortcomings of each approach.

Module init()

The most common approach I have come across is to define an init() function in module files. These module-level constructors perform operations like DB pool or cache initialisation. It is guaranteed that this code runs once and only once at the startup of your program. Looks good.

I have two problems with this approach:

  1. Import Order: The init order is defined by the order in which the files show up in your code, and unless you rename your files obscurely there is no way to control this sequence.
  2. Implicit Calls: These inits are automatically called at startup and there is no way to invoke them explicitly. This makes it quite a challenge to test such code. For example, if you wanted to test a part of the code which depends on a pre-initialised DB state, you cannot easily mock the connection by seeding that value from within the test suite.

Import Order Problem

Let’s say you have a directory structure that looks something like this:

├── abc
│   ├── one.go
│   └── two.go
├── main.go
└── pack
    ├── one.go
    └── two.go

Where one.go has the following init method:

func init() {
	log.Println("<package_name> - One")
}
And two.go’s init method looks like this:

func init() {
	log.Println("<package_name> - Two")
}
And your main.go has some very simple code which looks like this:

package main

import (
	"log"

	// module paths are illustrative; use your own import paths
	"yourmodule/abc"
	"yourmodule/pack"
)

func main() {
	log.Println("hello world")
	log.Println(pack.PackOne, pack.PackTwo, abc.AbcOne, abc.AbcTwo)
}

Output will always be:

2016/06/29 21:34:53 Abc - One
2016/06/29 21:34:53 Abc - Two
2016/06/29 21:34:53 Pack - One
2016/06/29 21:34:53 Pack - Two
2016/06/29 21:34:53 hello world
2016/06/29 21:34:53 1 2 1 2

Since package abc appears ahead of package pack (alphabetically), and both of them are imported in main, there is no way you can alter the init order without renaming a package to something else.
Also, if pack.PackOne had to be seeded with a mock value while testing, it cannot be done, because there is no way of invoking the init method explicitly. And while testing, database connectors are something that you more often than not have to mock.


An alternate approach is to use a module-level cache variable with an accompanying read-write mutex to ensure synchronisation across multiple goroutines.

An explicit method can then acquire the lock, check whether the value has already been initialised and return it, or initialise it with a value and return that otherwise.

A sample code for such an approach would look like this:
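The original snippet was embedded externally and is not reproduced here; the following is a minimal reconstruction consistent with the log output below. The type and identifier names (container, cached, GetInt’s id parameter) are guesses for illustration:

```go
package main

import (
	"log"
	"sync"
)

type container struct {
	Value int
}

var (
	cached *container
	mu     sync.Mutex
	wg     sync.WaitGroup
)

// GetInt returns the cached container, initialising it on first use.
// Note: EVERY caller pays for the lock, even long after initialisation.
func GetInt(id int) *container {
	log.Println("lock", id)
	mu.Lock()
	defer func() {
		mu.Unlock()
		log.Println("lock freed", id)
	}()
	if cached == nil {
		log.Println("Initializing GetInt")
		cached = &container{Value: 1}
	}
	return cached
}

func main() {
	for i := 0; i < 6; i++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			log.Println(GetInt(id)) // prints &{1} once initialised
		}(i)
	}
	wg.Wait()
}
```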

On carefully examining the output of this code, you will observe a problem: the code keeps acquiring the lock even after the first initialisation is complete.

2016/06/29 21:22:57 lock 5
2016/06/29 21:22:57 Initializing GetInt
2016/06/29 21:22:57 lock freed 5
2016/06/29 21:22:57 &{1}
2016/06/29 21:22:57 lock 3
2016/06/29 21:22:57 lock freed 3
2016/06/29 21:22:57 &{1}
2016/06/29 21:22:57 lock 1
2016/06/29 21:22:57 lock freed 1
2016/06/29 21:22:57 &{1}
2016/06/29 21:22:57 lock 2
2016/06/29 21:22:57 lock freed 2
2016/06/29 21:22:57 &{1}
2016/06/29 21:22:57 lock 0
2016/06/29 21:22:57 lock freed 0
2016/06/29 21:22:57 &{1}
2016/06/29 21:22:57 lock 4
2016/06/29 21:22:57 lock freed 4
2016/06/29 21:22:57 &{1}

After 5 was initialised, 1, 2, 3 and 4 should have been free to run in parallel. But since access to the cached value is bound by the lock, and only one goroutine holds it at a time, they pretty much execute in sequence.

There should be a better way to tackle this.


By definition, a singleton is a design pattern that restricts instantiation to one object. It would be a lot more efficient if there was a way to lock JUST the first initialisation. Thereafter, any piece of code should be free to access the value without having to bother about locking, and inevitably blocking, other goroutines.

sync.Once allows you to do exactly that: Once is an object that will perform exactly one action.
You can read more about it in the documentation here:

The same code, as demonstrated in the last method, when moved to sync.Once pattern will look like this:
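Again, the original snippet was embedded externally; here is a minimal reconstruction of the sync.Once variant, consistent with the log output below (names are the same guesses as in the previous sketch):

```go
package main

import (
	"log"
	"sync"
)

type container struct {
	Value int
}

var (
	cached *container
	once   sync.Once
	wg     sync.WaitGroup
)

// GetInt initialises the cached value exactly once; all later calls
// return immediately without contending on a lock of their own.
func GetInt(id int) *container {
	log.Println("No lock", id)
	once.Do(func() {
		log.Println("Initializing GetInt")
		cached = &container{Value: 1}
	})
	log.Println("No lock return", id)
	return cached
}

func main() {
	for i := 0; i < 6; i++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			log.Println(GetInt(id)) // prints &{1} once initialised
		}(i)
	}
	wg.Wait()
}
```

once.Do guarantees the closure runs exactly once, and every subsequent call returns without blocking the other goroutines.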

While the output of this code will be:

2016/06/29 21:26:42 No lock 5
2016/06/29 21:26:42 No lock 0
2016/06/29 21:26:42 No lock 2
2016/06/29 21:26:42 No lock 4
2016/06/29 21:26:42 No lock 3
2016/06/29 21:26:42 Initializing GetInt
2016/06/29 21:26:42 No lock return 5
2016/06/29 21:26:42 &{1}
2016/06/29 21:26:42 No lock return 0
2016/06/29 21:26:42 &{1}
2016/06/29 21:26:42 No lock return 2
2016/06/29 21:26:42 No lock 1
2016/06/29 21:26:42 No lock return 1
2016/06/29 21:26:42 &{1}
2016/06/29 21:26:42 No lock return 4
2016/06/29 21:26:42 &{1}
2016/06/29 21:26:42 &{1}
2016/06/29 21:26:42 No lock return 3
2016/06/29 21:26:42 &{1}

Do observe that after the first initialisation (by 5), the goroutines do not block each other and pretty much run at random. This bypasses the locking and still provides the flexibility of being able to invoke the initialisation explicitly. The only downside of this method is that you need a separate Once object for each such cached variable in your code. It also requires a promise that the value is not going to change through the lifecycle of the program.
