Top 10 ETL Activities in DW Projects
As an ETL developer, it’s important to know the various touch points in the ETL world, since you may be working at any one of them along the ETL pipeline. In a DW system, most of our time goes into analyzing the source data and creating ETL jobs to populate the data warehouse. Let’s discuss the 10 things you should know to become a very effective ETL resource for an organization.
1. Source Data
What is your source? Understand the nature of the source. If the source data is in an RDBMS, what credentials do you have, how do you get the metadata about the source, and can you get sample data from it? If it is a flat file, XML, or a web service / API, understand the structure of the data you are getting. If the source is a NoSQL database, understand the basics of how to retrieve information from it.
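As a rough illustration of profiling an RDBMS source, the sketch below reads column metadata and a few sample rows. It assumes an in-memory SQLite database as a stand-in for the real source; the customer table and the helper names describe_table and sample_rows are hypothetical, and other databases expose the same information through their own catalog views.

```python
import sqlite3

# Stand-in for a real source system: an in-memory SQLite database.
# The "customer" table and its columns are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT NOT NULL, city TEXT)"
)
conn.executemany(
    "INSERT INTO customer (name, city) VALUES (?, ?)",
    [("Asha", "Chennai"), ("Ravi", None)],
)


def describe_table(conn, table):
    """Return (column name, declared type, nullable) for each column."""
    # PRAGMA table_info is SQLite-specific. The table name is interpolated
    # directly, so only pass trusted names here.
    rows = conn.execute(f"PRAGMA table_info({table})").fetchall()
    return [(r[1], r[2], not r[3]) for r in rows]


def sample_rows(conn, table, n=5):
    """Fetch up to n rows to get a feel for the actual data."""
    return conn.execute(f"SELECT * FROM {table} LIMIT ?", (n,)).fetchall()


columns = describe_table(conn, "customer")
sample = sample_rows(conn, "customer")
```

Comparing the declared metadata with the sampled rows is often the quickest way to spot columns that are nullable in practice or hold unexpected values.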
2. Extraction cycle
How frequently does the data get updated in the source? Are you extracting the data once a day, week, month, or quarter? Are you going to extract the whole data set every time you access the source, or are you extracting the data incrementally (delta)? The methodology differs based on the data source. Typically, if you know how to implement an incremental extract from any source, you are good to go in this area.
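One common way to implement an incremental extract is a "high-water mark": remember the latest change timestamp already extracted and pull only rows newer than it. The sketch below is a minimal version of that idea, assuming the source has a reliable updated_at column; the orders and etl_watermark tables and the extract_delta function are hypothetical names used only for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL, updated_at TEXT);
    CREATE TABLE etl_watermark (source TEXT PRIMARY KEY, last_extracted TEXT);
    INSERT INTO orders VALUES (1, 10.0, '2024-01-01 09:00:00');
    INSERT INTO orders VALUES (2, 25.5, '2024-01-02 11:30:00');
""")


def extract_delta(conn):
    """Return only the rows changed since the previous extract."""
    row = conn.execute(
        "SELECT last_extracted FROM etl_watermark WHERE source = 'orders'"
    ).fetchone()
    # An empty string sorts before any timestamp, so the first run is a full extract.
    watermark = row[0] if row else ""
    rows = conn.execute(
        "SELECT id, amount, updated_at FROM orders "
        "WHERE updated_at > ? ORDER BY updated_at",
        (watermark,),
    ).fetchall()
    if rows:
        # Advance the watermark to the newest timestamp just extracted.
        conn.execute(
            "INSERT OR REPLACE INTO etl_watermark VALUES ('orders', ?)",
            (rows[-1][2],),
        )
        conn.commit()
    return rows


first_run = extract_delta(conn)    # full extract: both existing rows
conn.execute("INSERT INTO orders VALUES (3, 7.25, '2024-01-03 08:15:00')")
second_run = extract_delta(conn)   # delta: only the new row
third_run = extract_delta(conn)    # nothing changed since the last run
```

This approach does not detect deleted rows, and it depends on the source maintaining updated_at correctly; sources without such a column need a different technique, such as change data capture or snapshot comparison.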
3. Staging Area
A staging database (or staging area) is a temporary area where we store data for processing. Typically, when we have more than one operational system and the data is stored in different RDBMSs (say Oracle, SQL Server, DB2), we have to consolidate the required data in one RDBMS. The database used for this consolidation is called the staging database.
4. Transformation
Based on the target table / file, you sometimes have to convert data that arrives in a different form. For example, in the DW you may have to store the data at a monthly level for performance reasons, while in the source the data is typically at a transactional level. In this scenario you have to apply aggregate transformations. Typical transformations are router or splitter, aggregate, filter, sorter, de-duplication, etc.
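To make the monthly example concrete, here is a small sketch that chains de-duplication, a filter, an aggregate, and a sort over transaction-level rows. The data, the (date, region, amount) layout, and the rule that non-positive amounts are filtered out are all assumptions made for illustration.

```python
from collections import defaultdict
from datetime import date

# Hypothetical transaction-level source rows: (transaction date, region, amount).
transactions = [
    (date(2024, 1, 5), "north", 120.0),
    (date(2024, 1, 5), "north", 120.0),   # exact duplicate of the row above
    (date(2024, 1, 20), "north", 80.0),
    (date(2024, 2, 3), "south", 50.0),
    (date(2024, 2, 9), "south", -5.0),    # removed by the filter step
]


def to_monthly(rows):
    """De-duplicate, filter, aggregate to month/region, and sort."""
    # De-duplication: drop exact repeats, keeping the first occurrence.
    deduped = list(dict.fromkeys(rows))
    # Filter: keep only positive amounts (an assumed business rule).
    valid = [r for r in deduped if r[2] > 0]
    # Aggregate: sum the amounts per (month, region).
    totals = defaultdict(float)
    for day, region, amount in valid:
        totals[(day.strftime("%Y-%m"), region)] += amount
    # Sorter: return the groups in a stable, ordered form.
    return sorted(totals.items())


monthly = to_monthly(transactions)
```

In a real source, de-duplication would normally key on a transaction identifier rather than on the whole row, since two genuine transactions can share the same date, region, and amount.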
5. Auditing
The ETL system should hold the data needed to answer questions from various stakeholders. For this reason, we have to keep track of how many records we got from a specific system, what data we got, how long it took, and so on. Any time we change the data, the system should keep a record for audit purposes. An ETL system without auditing information is not a mature implementation.
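A minimal sketch of such audit tracking might wrap each job and write one row per run to an audit table. The etl_audit table layout, the run_with_audit wrapper, and the load_customers job below are hypothetical; a real implementation would usually capture more detail, such as the rejected count and the run status.

```python
import sqlite3
import time
from datetime import datetime, timezone

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE etl_audit (
        job_name TEXT, source_system TEXT, started_at TEXT,
        duration_sec REAL, rows_read INTEGER, rows_loaded INTEGER
    )
""")


def run_with_audit(conn, job_name, source_system, job):
    """Run a job and record what it read, what it loaded, and how long it took."""
    started_at = datetime.now(timezone.utc).isoformat()
    t0 = time.perf_counter()
    rows_read, rows_loaded = job()
    duration = time.perf_counter() - t0
    conn.execute(
        "INSERT INTO etl_audit VALUES (?, ?, ?, ?, ?, ?)",
        (job_name, source_system, started_at, duration, rows_read, rows_loaded),
    )
    conn.commit()
    return rows_read, rows_loaded


def load_customers():
    """Toy job: reads three rows and loads the two that have an id."""
    source = [{"id": 1}, {"id": 2}, {"id": None}]
    loaded = [r for r in source if r["id"] is not None]
    return len(source), len(loaded)


run_with_audit(conn, "load_customers", "crm", load_customers)
audit_rows = conn.execute(
    "SELECT job_name, source_system, rows_read, rows_loaded FROM etl_audit"
).fetchall()
```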
6. Reconciliation
There should be an automated way of identifying how many records we read and how many were rejected because of a data type mismatch, a mismatch between the source definition and the data, a failed business validation, and so on. We typically do two types of reconciliation: row count validation and value-based validation. If the granularity is the same between the source system and the target system, we should go for row count matching; if the granularity differs, do value matching. For example, compute sum(rev) at the source level and sum(rev) at the target level. The values should match even though the target is summarized at a different level. When the expectation is not met, automatically send an email to the production support team so they can take care of it.
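The two reconciliation checks can be sketched as a single function that picks the check based on granularity. The reconcile and notify_support names, the rev field, and the 0.01 tolerance are assumptions for illustration; the notification here only prints, where a real job might send an email instead.

```python
import math


def reconcile(source_rows, target_rows, same_grain, measure="rev"):
    """Row count check when the grain matches, value check when it does not."""
    if same_grain:
        ok = len(source_rows) == len(target_rows)
        detail = f"row count: source={len(source_rows)} target={len(target_rows)}"
    else:
        source_total = sum(r[measure] for r in source_rows)
        target_total = sum(r[measure] for r in target_rows)
        # Allow a small tolerance for floating-point rounding.
        ok = math.isclose(source_total, target_total, abs_tol=0.01)
        detail = f"sum({measure}): source={source_total} target={target_total}"
    return ok, detail


def notify_support(message):
    """Placeholder alert; a real job might email the support team here."""
    print(f"RECONCILIATION FAILED: {message}")


# Source is at transaction level, target is one monthly summary row.
source = [{"rev": 10.0}, {"rev": 15.5}, {"rev": 4.5}]
target = [{"rev": 30.0}]

value_ok, value_detail = reconcile(source, target, same_grain=False)
count_ok, count_detail = reconcile(source, target, same_grain=True)
if not count_ok:
    notify_support(count_detail)
```

In this example the value check passes because both sides sum to the same revenue, while the row count check fails, which shows why row counts are only meaningful when source and target share the same grain.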
7. Reject Handling
To show the value the DW platform adds to the organization over time, we can showcase how our understanding of the organization’s data and the quality of that data have improved. A key part of this is knowing what happens when we reject data during a load, and how to handle it. We have four types of rejected data:
- Self-healing data: Today’s transaction data arrives, but the master data is not yet in the dimension table. When we load the transaction data into the fact table, it fails because the corresponding dimension record does not exist. So we reject the data and try loading it again the next day; by then the dimension data has arrived, so yesterday’s rejected data loads successfully. This is called self-healing data, or the auto-correct function, in the data processing world.
- Programmatically corrected data (automated): Once we know the reason for the failure, we can fix it based on the business rules and the data validation process. For example, some addresses have a zip code but no city name; we can use external data to fill in the missing city. If the ETL program rejects the data, create another program to cleanse the rejected data and load it again. In the ETL world, we call this a reject load.
- Manually corrected data: If the rejected records cannot be corrected in an automated way, we develop an interface to view the errors and fix the data issues manually. This typically happens most while we are still learning the data across source systems, until the system matures.
- Reject (incomplete source data): In some cases we cannot fix the data at all. Such data gets flagged as a reject and is discarded after a period of time.
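The self-healing case can be sketched in a few lines: rejected rows are parked, then pushed through the same load again once the missing dimension record has arrived. The load_facts and reprocess_rejects functions, the product_id key, and the in-memory lists standing in for the fact and reject tables are all hypothetical.

```python
def load_facts(transactions, dimension_keys, fact_table, rejects):
    """Load rows whose dimension key exists; park the rest as rejects."""
    for txn in transactions:
        if txn["product_id"] in dimension_keys:
            fact_table.append(txn)
        else:
            rejects.append(txn)


def reprocess_rejects(dimension_keys, fact_table, rejects):
    """Retry previously rejected rows; whatever still fails is parked again."""
    pending = list(rejects)
    rejects.clear()
    load_facts(pending, dimension_keys, fact_table, rejects)


# Day 1: product P2 is not yet in the dimension, so its transaction is rejected.
dimension_keys = {"P1"}
fact_table, rejects = [], []
load_facts(
    [{"txn_id": 1, "product_id": "P1"}, {"txn_id": 2, "product_id": "P2"}],
    dimension_keys, fact_table, rejects,
)
day1_rejects = len(rejects)

# Day 2: the dimension record for P2 arrives, and the reject reload succeeds.
dimension_keys.add("P2")
reprocess_rejects(dimension_keys, fact_table, rejects)
loaded_ids = sorted(t["txn_id"] for t in fact_table)
```

A real reject table would usually also store the rejection reason and a retry count, so that rows which never heal can be routed to manual correction or flagged as permanent rejects.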
8. House Cleaning
We end up storing error records and log records when we process data with ETL programs. If the log files take up too much space, or the error records can no longer be used, we can write programs to delete the records or files we no longer need, since we have already finished the reconciliation process and audited the reasons for the rejects. Typically, at regular intervals, we remove the unwanted data / files from the DW platform using such programs. This is called the house cleaning process.
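A house cleaning program for log files could look roughly like the sketch below, which removes files older than a retention period. The purge_old_files function, the 30-day retention, and the *.log pattern are assumptions; the dry-run default is a precaution so that a first run only reports what it would delete.

```python
import os
import tempfile
import time
from pathlib import Path


def purge_old_files(directory, max_age_days, pattern="*.log", dry_run=True):
    """Remove files older than max_age_days; return the names affected."""
    cutoff = time.time() - max_age_days * 86400
    affected = []
    for path in sorted(Path(directory).glob(pattern)):
        if path.is_file() and path.stat().st_mtime < cutoff:
            if not dry_run:
                path.unlink()
            affected.append(path.name)
    return affected


# Demonstration in a throwaway directory with one old and one recent log file.
with tempfile.TemporaryDirectory() as tmp:
    old_log = Path(tmp, "load_old.log")
    new_log = Path(tmp, "load_new.log")
    old_log.write_text("old run")
    new_log.write_text("recent run")
    forty_days_ago = time.time() - 40 * 86400
    os.utime(old_log, (forty_days_ago, forty_days_ago))  # backdate the old file

    would_remove = purge_old_files(tmp, max_age_days=30)             # dry run
    removed = purge_old_files(tmp, max_age_days=30, dry_run=False)   # real delete
    remaining = sorted(p.name for p in Path(tmp).iterdir())
```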
9. Scheduling
Because the ETL programs we write have to be executed at a regular frequency, we have to make sure scheduling is possible within the ETL tool, without depending on OS scheduling. If we use a database programming language as the ETL tool, then we have to use the scheduler of the database. If you are using Oracle, you can make use of DBMS_JOB (or DBMS_SCHEDULER, which supersedes it in newer releases). An ETL developer should also know a scripting language to interact with the files and OS of the servers.
10. Log Handling
Logging is very important in the ETL layer. When a job fails, the log is where we start the work of fixing the issue. Some tools provide both database logging and flat file logging. There should be logging levels: do we log only errors, or do we log everything, including which records were processed, the timing for each transformation, and so on? The more detailed the logging you opt for, the more time it takes to write the log files. So analyze carefully before you choose the level that makes sense for your implementation.
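The trade-off between logging levels can be illustrated with Python's standard logging module. In this sketch the same toy job runs once at ERROR level and once at DEBUG level; the build_logger and run_job helpers and the logger names are hypothetical, and the output goes to in-memory buffers rather than files or a database.

```python
import io
import logging


def build_logger(name, level, stream):
    """Create a logger that writes 'LEVEL message' lines to the given stream."""
    logger = logging.getLogger(name)
    logger.setLevel(level)
    logger.handlers.clear()
    logger.propagate = False  # keep output out of the root logger
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter("%(levelname)s %(message)s"))
    logger.addHandler(handler)
    return logger


def run_job(logger, rows):
    """Toy job: logs every row at DEBUG and rejected rows at ERROR."""
    for row in rows:
        logger.debug("processing row %s", row)
        if row is None:
            logger.error("rejected empty row")
    logger.info("finished: %d rows", len(rows))


# Errors only: a small log that is cheap to write.
errors_only = io.StringIO()
run_job(build_logger("etl.prod", logging.ERROR, errors_only), [1, None, 3])

# Everything: a detailed log with more output per row processed.
verbose = io.StringIO()
run_job(build_logger("etl.debug", logging.DEBUG, verbose), [1, None, 3])

error_lines = errors_only.getvalue().splitlines()
verbose_lines = verbose.getvalue().splitlines()
```

For three input rows the error-level log holds a single line while the debug-level log holds five, and that gap grows with every row processed, which is the cost to weigh when choosing a level.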
These top 10 things will help you design an ETL layer for your DW implementation. They can also be used to check whether an ETL tool is on par with industry standards.