Data Engineering

ABC of Data Engineering

Why Data Engineering?

Let’s look at the following problems and see what they have in common:

  • For a supermarket, find the products that have the lowest shelf life (the amount of time they spend on the shelf before being purchased).
  • For a national telecom provider, find the call drop ratio of every cell tower, across all cellular calls in the last two minutes.
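At toy scale, both problems are only a few lines of code. The sketch below is illustrative, not a production design: the tuple layouts for `items` and `calls`, and the function names, are assumptions made for this example and do not come from any real system.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def lowest_shelf_life(items, n=1):
    """Return the n products with the lowest average shelf life.

    items: iterable of (product, stocked_at, purchased_at) tuples
    (a hypothetical schema chosen for this sketch).
    """
    total = defaultdict(timedelta)
    count = defaultdict(int)
    for product, stocked_at, purchased_at in items:
        total[product] += purchased_at - stocked_at
        count[product] += 1
    averages = {p: total[p] / count[p] for p in total}
    # Sort products by average time on the shelf, shortest first.
    return sorted(averages, key=averages.get)[:n]


def call_drop_ratio(calls, now, window=timedelta(minutes=2)):
    """Return {tower_id: dropped / total} for calls that ended in the window.

    calls: iterable of (tower_id, ended_at, was_dropped) tuples
    (again a hypothetical schema).
    """
    total = defaultdict(int)
    dropped = defaultdict(int)
    for tower_id, ended_at, was_dropped in calls:
        # Keep only calls that ended within the last `window` before `now`.
        if now - window <= ended_at <= now:
            total[tower_id] += 1
            dropped[tower_id] += int(was_dropped)
    return {tower: dropped[tower] / total[tower] for tower in total}
```

Both functions assume that every record fits in memory on one machine and can be scanned in a single pass. That assumption is what stops holding once the records number in the billions or arrive continuously from across a country.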

The nature of data no longer fits the boundaries of our traditional computation frameworks.

When is my data big?

As soon as you have two out of these three (Velocity, Volume and Variety), you would have to look beyond the standard monolithic data solutions.

[Figure: ABC of Distributed Data Processing]

Volume. It’s the sheer size of data. Example: Facebook produces around 500 TB of data every day. Each time a Boeing aircraft flies from NY to LA, it records around 240 TB of data.

Velocity. This implies the rate at which data is coming in. Imagine an energy trading company constantly trying to divert/buy/sell power based on consumption being recorded across millions of homes and industries across the country.

Variety. We have methods to search through audio, video and photos, among other formats (check out the IBM Watson project for exemplary work around this). Our ability to draw patterns is no longer limited or tied to traditional sources of data. Someone could be building a model based on the types of purchases made at a fashion outlet by people who came in wearing a cap.

Since we now have the resources to handle such large volumes and weird combinations of data, and you already have the data, you might as well save it.

Data usually has the answers to everything that you want to ask about your business.

And since you don’t know today what questions you will ask tomorrow, you save everything that was generated yesterday.

In most cases, addressing one of the above three problems in isolation is easy (until, of course, you become the likes of Facebook, in which case this post is already irrelevant for you). But as soon as you have two out of these three, you would have to look beyond the standard monolithic data solutions.

*Achieving Buzzword Compliance