Tuesday, June 19, 2018

ETL vs. ELT. How two letters make such a difference...

Since the rise of Data Warehouses, the ETL process has evolved and reached its maturity, and has been the "go-to" technology for BI and Analytics solutions.

Although the ETL process provides a great solution to many different problems, it also creates a few problems of its own. If you are here, reading this post, you have probably built a few (or many) ETL processes yourself, but I will outline the concepts a bit before laying out the pros and cons of each technique.

An ETL is a data process defined by three steps: Extract, Transform and Load. So, if you need to import a CSV file into a database, converting the column values to proper database columns, you would use an ETL tool like SQL Server Integration Services (SSIS) or Pentaho.




Your transformation would look something like this in SSIS:


You can clearly see the ETL process steps: a Flat File Source component Extracts the CSV file data, then a Derived Column and a Data Conversion Transform it so that it can be Loaded into a SQL Server database (or any other database, for that matter).
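If you prefer to see the same idea in code, here is a minimal sketch of that flow in Python using pandas and SQLAlchemy. The file name, column names and connection string are placeholders for the example, not the exact data from the screenshot:

    # Minimal ETL sketch: Extract a CSV, Transform it in memory, Load it into the database.
    import pandas as pd
    from sqlalchemy import create_engine

    # Extract: read the raw CSV file
    df = pd.read_csv("sales.csv")

    # Transform: a derived column and a data conversion, just like the SSIS components
    df["total"] = df["quantity"] * df["unit_price"]
    df["order_date"] = pd.to_datetime(df["order_date"])

    # Load: write the already-transformed rows into the destination table
    engine = create_engine("mssql+pyodbc://user:password@server/dw?driver=ODBC+Driver+17+for+SQL+Server")
    df.to_sql("fact_sales", engine, if_exists="append", index=False)

Note how the Transform happens before the data ever touches the destination database; that ordering is exactly what ELT changes.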

OK, that is great: your data is loaded into a database using simple enough software and everybody is happy. So what is so wrong with that, that a new fancy word came along to "improve" it?

A simple ETL like that (a simple CSV file into a database) wouldn't be much of a problem, BUT suppose you need to do that with thousands of files of hundreds of GB each, on a daily basis, and the destination database is heavily queried by hundreds of users and third-party applications. Your ETLs become long and dragging, and your users and "data consumers" need to wait until the whole process is done to access the data. Boring!


The ETL is designed so that the Extract and Load steps may occur in different maintenance window time frames, so that neither the source nor the destination suffers any downtime, minimizing the corporate stress.

But with the rise of NoSQL and Cloud appliances, that architecture is no longer needed. We can load the data into a centralized scalable data repository, and make all that data available to different users and applications.

This centralized data repository is called a "Data Lake", and once the extract is loaded into it, your consumers can access it in a timely fashion. Later on, if that is the purpose, this data can be loaded into another appliance such as a Data Warehouse, database applications, etc.



In the image above, the Data Lake is shown as Hadoop, but you can use any technology you want: HBase, CouchDB, Cassandra... even relational cloud databases such as Postgres, MySQL, SQL Server, etc. As far as technology goes, the Data Lake only needs "elastic" hardware resources and 24x7x365 availability.

The key idea is that once the data is extracted, it can be made readily available to the consumers simply by loading it into the Data Lake, leaving the transformation to be done later, as needed.
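As a rough sketch (the bucket name, file paths and table names are made up for the example), an ELT flow in Python would just copy the raw extract into the lake and leave the transformation to be run later, inside the destination engine:

    # Minimal ELT sketch: Load the raw extract as-is, Transform later inside the lake.
    import boto3

    # Extract + Load: ship the raw CSV straight into the data lake, untouched
    s3 = boto3.client("s3")
    s3.upload_file("sales.csv", "corporate-data-lake", "raw/sales/2018-06-19.csv")

    # Transform: only later, and only if needed, the heavy lifting runs inside the
    # lake/warehouse engine itself (Hive, Presto, Spark SQL, etc.), for example:
    #
    #   INSERT INTO dw.fact_sales
    #   SELECT order_id, CAST(order_date AS DATE), quantity * unit_price AS total
    #   FROM raw.sales;

The raw file is available to consumers as soon as the upload finishes; nobody has to wait for a transformation step to complete.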

Pros of using an ELT process:

  • Time: once the data is extracted, it is quickly loaded and made available.
  • Cost: since the destination is usually an unstructured data repository, and NoSQL storage is usually cheaper, it is cost-effective.
  • Flexibility: it is very useful for ad-hoc analysis, and its simplicity makes it more accessible to developers and data consumers.

So what are your thoughts on it? Have you used ELT before?


Check out my new book about the R language: http://www.amazon.com/dp/B00SX6WA06
