Big data ETL – tame the beast!

What is Big Data?

Big data refers to large volumes of complex data that are generally difficult to process with relational database systems. This data is typically gathered from sources such as weather sensors, posts to social media sites, stock market transaction records, and cell phone call records, to name a few.

What is ETL ?

ETL is commonly used for data warehousing. It is the process of transforming data and loading it into a data warehouse, and it involves three steps:

  • Extract pulls data from one or more sources.
  • Transform cleans and manipulates the data into the target format.
  • Load writes the result into the data warehouse or data mart.
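The three steps above can be sketched as a minimal pipeline. This is an illustrative example only, not a production ETL job; the input rows, field names, and the in-memory SQLite data mart are all assumptions made for the sketch.

```python
import sqlite3

def extract(raw_lines):
    """Extract: parse raw CSV-like lines into records."""
    for line in raw_lines:
        name, amount = line.strip().split(",")
        yield name, float(amount)

def transform(records):
    """Transform: normalize names and drop invalid (non-positive) amounts."""
    for name, amount in records:
        if amount > 0:
            yield name.strip().lower(), round(amount, 2)

def load(records, conn):
    """Load: write the transformed records into a data-mart table."""
    conn.execute("CREATE TABLE IF NOT EXISTS sales (name TEXT, amount REAL)")
    conn.executemany("INSERT INTO sales VALUES (?, ?)", records)
    conn.commit()

raw = ["Alice,10.5", "BOB,-3.0", " Carol ,7.25"]
conn = sqlite3.connect(":memory:")
load(transform(extract(raw)), conn)
print(conn.execute("SELECT name, amount FROM sales ORDER BY name").fetchall())
# → [('alice', 10.5), ('carol', 7.25)]  (BOB dropped: negative amount)
```

At small volumes this runs comfortably on one machine; the rest of the article is about what happens when it no longer does.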

What is Big Data ETL?

As the three V's (volume, velocity, and variety) grow, relational databases start to struggle with query performance and fail to meet ETL service-level agreements.

As traditional ETL struggled with big data, a new distributed storage and processing system evolved: Apache Hadoop, built from the ground up to scale massively (to thousands of servers) on commodity hardware. In addition, Hadoop is very flexible in terms of data type, format, and structure.


In big data ETL, the most common architecture pattern is a hybrid model: Hadoop takes care of the ETL, while the RDBMS is dedicated to queries and serves as the data mart for reporting. In this model the data marts can keep their existing SQL, or the RDBMS can be replaced with a NoSQL database such as MongoDB or Cassandra, or even with HDFS itself.
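In this hybrid pattern, the reporting side simply loads the output files that Hadoop writes and queries them with ordinary SQL. A minimal sketch, assuming tab-separated Hadoop part-file output and an in-memory SQLite data mart (both are illustrative stand-ins):

```python
import sqlite3

# Simulated contents of a Hadoop output part file (key<TAB>value),
# e.g. total sales per region produced by an ETL job.
part_file_lines = ["east\t1200.0", "west\t950.5", "north\t430.25"]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE region_sales (region TEXT, total REAL)")
for line in part_file_lines:
    region, total = line.rstrip("\n").split("\t")
    conn.execute("INSERT INTO region_sales VALUES (?, ?)", (region, float(total)))
conn.commit()

# Reporting queries run against the data mart, not against Hadoop.
top = conn.execute(
    "SELECT region, total FROM region_sales ORDER BY total DESC LIMIT 1"
).fetchone()
print(top)  # → ('east', 1200.0)
```

The division of labor is the point: the heavy batch transformation happens in the cluster, and the data mart only ever sees small, query-ready aggregates.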


Big data ETL revolves around Hadoop, which, apart from being cost effective, provides scalability and flexibility out of the box. Hadoop's architecture comprises two major components: HDFS, for massive redundant data storage, and MapReduce, for scalable batch data processing.
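The MapReduce half of that architecture is easiest to see with the canonical word-count example. With Hadoop Streaming, the mapper and reducer can be any programs that read stdin and write stdout; the sketch below simulates the map, shuffle/sort, and reduce phases in-process rather than on a cluster.

```python
from itertools import groupby
from operator import itemgetter

def mapper(lines):
    """Map: emit a (word, 1) pair for every word, as a streaming
    mapper would emit tab-separated lines on stdout."""
    for line in lines:
        for word in line.split():
            yield word.lower(), 1

def reducer(pairs):
    """Reduce: sum the counts per word. Hadoop delivers pairs sorted
    by key (the shuffle phase); groupby relies on that ordering."""
    for word, group in groupby(pairs, key=itemgetter(0)):
        yield word, sum(count for _, count in group)

lines = ["big data ETL", "big data big cluster"]
shuffled = sorted(mapper(lines))  # simulate Hadoop's shuffle/sort phase
print(dict(reducer(shuffled)))
# → {'big': 3, 'cluster': 1, 'data': 2, 'etl': 1}
```

On a real cluster the same two functions would run on many nodes at once, each over its own slice of the input in HDFS, which is exactly where the scalability comes from.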


Hadoop can perform transformations much more efficiently than RDBMSs. Besides the performance benefits, it is also very fault tolerant and elastic. These features are useful in practice: suppose an error is discovered after an ETL transformation has been running for six months, and the last six months of data now have to be reprocessed. With Hadoop, a few nodes can be added to the cluster temporarily and removed once the catch-up ETL jobs complete.


In cases where data volume is high, transformations bring the RDBMS to its knees and query performance becomes the bottleneck. Transformation is much better suited to a batch processing system like Hadoop, which offers the agility to work with any data type and scales out easily.

A number of ETL tools (e.g. Informatica and Pentaho) already provide ETL capabilities for Hadoop via MapReduce. This allows business transformation logic to be reused and defined through the ETL tool, while the actual execution happens in Hadoop.
