Things to Care About as a Data Engineer

Rahul Dubey
8 min read · Oct 31, 2023


Being a Data Engineer is like being a double-edged sword cutting through data roles and Software Engineering. If you have been a Data Engineer for more than two years, you already know the job comes in a variety of flavors.

One day you are building data pipelines that don't cry in the middle of production, and the next you are building data models to support your fellow Data Analysts and Data Scientists. Some Data Engineer roles also require you to build end-to-end data pipelines plus Business Intelligence reports in dashboarding tools like Tableau, ThoughtSpot, Power BI, etc.; those roles are usually called "Data Analytics Engineer".

Anyway, the role varies from company to company, but some things remain constant, and if you can excel at those concepts, you will have a good career ahead in the data field.

Some of the common technical and business terms that come up most often between business teams and engineering teams are:

SLAs, ETL/ELT, Data Models, Validation and verification, Source and Target, Change Data Capture, Scheduling, Schema Change, Landing, Conformance, Consumption etc.

It's important to communicate Source and Target data changes effectively and to keep the Target up to date. Here are a few things that, as a Data Engineer, will help you keep up the good work:

1. Carry out your transformations before the Consumption Layer

I am keeping this point at the top because the Consumption layer is always going to be exposed to the Business Intelligence team and Data Scientists. All transformations should be done before the data is promoted to Consumption. A good place to start transforming is the Data Lake (S3, GCS) or the Landing/Staging layer. Once this transformed data has moved to Conformance, there should not be much transformation left to apply on the way to Consumption.

Never ask them to do the data cooking in the Consumption layer!
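As a minimal sketch of doing the cooking earlier, assuming pandas and hypothetical bucket paths (a real pipeline might use Spark or a warehouse engine instead), the cleanup can happen while promoting data from Landing to Conformance:

```python
import pandas as pd

# Hypothetical paths; reading straight from S3/GCS would also need s3fs/gcsfs installed.
LANDING_PATH = "landing/orders_raw.parquet"
CONFORMANCE_PATH = "conformance/orders.parquet"

def promote_landing_to_conformance() -> None:
    raw = pd.read_parquet(LANDING_PATH)

    # Apply the transformations here, long before the data reaches Consumption.
    cleaned = (
        raw.drop_duplicates(subset=["order_id"])          # de-duplicate on the business key
           .dropna(subset=["order_id", "customer_id"])    # drop rows missing mandatory keys
           .assign(order_date=lambda df: pd.to_datetime(df["order_date"]))
    )

    cleaned.to_parquet(CONFORMANCE_PATH, index=False)

if __name__ == "__main__":
    promote_landing_to_conformance()
```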

Your data modelling stage can begin at the Conformance layer, where all your dimension, look-up, and fact tables live. From there you write a query with CTEs and JOINs to build the final Consumption fact table.
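A sketch of what that Conformance-to-Consumption query might look like; the table and column names are hypothetical, the `CREATE OR REPLACE TABLE` syntax varies by warehouse, and `conn` is assumed to be any DB-API connection:

```python
CONSUMPTION_FACT_SQL = """
CREATE OR REPLACE TABLE consumption.fact_sales AS
WITH orders AS (
    SELECT order_id, customer_id, product_id, order_date, amount
    FROM conformance.orders
)
SELECT
    o.order_id,
    o.order_date,
    c.customer_name,
    p.product_category,
    o.amount
FROM orders o
JOIN conformance.dim_customer c ON o.customer_id = c.customer_id
JOIN conformance.dim_product  p ON o.product_id  = p.product_id
"""

def build_consumption_fact(conn) -> None:
    """Run the modelling query so BI tools only ever read consumption.fact_sales."""
    with conn.cursor() as cur:
        cur.execute(CONSUMPTION_FACT_SQL)
    conn.commit()
```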

2. One model goes a long way

Tools like Tableau, ThoughtSpot, and Power BI use connectors to establish a one-to-one connection with the data source. They can also localize the dataset in cloud storage to cache the data. But often the best method is the simplest one.

Keep the model simple!

You create a data model that sits in the Consumption layer and make a one-to-one connection to the BI tool. It is then the Data Analysts' and Data Scientists' responsibility to derive multiple reports, dashboards, or ML models from that single data source. In real-world scenarios it isn't always possible to manage all the dimensions and the fact table under one model, since the fact table metrics and other fields won't be available at every granular level of the dimensions. But the effort should go into keeping it to one model as far as possible. This will make both the Data Engineers' and the Data Analysts' lives much simpler.

3. Don’t take Data Pipeline development lightly

The claim gets thrown around that Data Engineering is not Software Engineering, which I feel is simply ignorant: it means relying less on your skills and more on the tool's capabilities.

Think like a Software Engineer!

In my experience, Data Engineers need to write code the way performance-focused Software Engineers do. Tools like Apache Airflow will surface code issues at runtime and notify you about failures, but it's better not to rely on that completely. Write your code to be as robust as possible, with proper exception handling, and raise errors at the appropriate levels to maintain accurate and useful logs.

Often these logs are maintained and monitored by another process that audits your pipelines and data sources, so log reporting should be consistent and clear.
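A minimal sketch of what that can look like with Python's standard logging module; the exception class, batch shape, and messages are all hypothetical:

```python
import logging

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(name)s %(message)s",
)
logger = logging.getLogger("orders_pipeline")

class DataValidationError(Exception):
    """Raised when a batch fails a business rule."""

def load_orders(batch: list[dict]) -> int:
    """Validate and 'load' a batch, logging at the appropriate level at each step."""
    if not batch:
        raise DataValidationError("Received an empty batch from the source")

    bad_rows = [row for row in batch if "order_id" not in row]
    if bad_rows:
        logger.warning("Skipping %d rows with no order_id", len(bad_rows))

    good_rows = [row for row in batch if "order_id" in row]
    logger.info("Loaded %d rows into the target table", len(good_rows))
    return len(good_rows)

if __name__ == "__main__":
    try:
        load_orders([{"order_id": 1}, {"amount": 10}])
    except DataValidationError:
        logger.exception("Batch rejected; leaving the target untouched")
        raise
```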

4. Code Reviews are important, never skip them

This point can't be emphasized enough: code reviews are extremely important, especially when they involve both the Dev and Business teams. Why the Business team? They won't get into the technical details, but they need to know how the data flows and how the transformations are applied to produce the final, ingestible data. Tech leads will give you feedback on best practices, whereas the business team will expand your knowledge of the implications of each transformation.

Finally, the review culture depends on the team you are working with, but in general code reviews will bring out the best in you.

NOTE: Some leads will push back on anything they see as complicating the code too much. If the change improves the quality and performance of the data pipeline, try to convince them why it's an important decision.

5. Keep your QA team informed about the data flow

QA teams are just as important as the development team, because at the end of the day the QA team will save you from beating yourself up over what slipped past you during development.

A developer can pre-validate their data pipelines and the source/target variances, but as the code moves into production, the number of QA test cases grows with the data being used. QA teams also have much broader knowledge of the validations and test cases being carried out. So it's better not to sideline the QA folks during development; keep them up to date even if the QA migration is still some way off.
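That pre-validation doesn't have to be elaborate; even a simple source-versus-target reconciliation, shared with QA, goes a long way. A sketch, assuming two DB-API connections and hypothetical table names:

```python
def reconcile_counts(source_conn, target_conn, source_table: str, target_table: str) -> bool:
    """Compare row counts between a source and a target table and report the variance."""

    def count_rows(conn, table: str) -> int:
        with conn.cursor() as cur:
            cur.execute(f"SELECT COUNT(*) FROM {table}")
            return cur.fetchone()[0]

    source_count = count_rows(source_conn, source_table)
    target_count = count_rows(target_conn, target_table)
    variance = source_count - target_count

    print(f"{source_table}: {source_count} rows, {target_table}: {target_count} rows, "
          f"variance: {variance}")
    return variance == 0
```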

6. Know your Upstream/Downstream data sources/targets

Often, in big projects, you won't be building the whole data pipeline from scratch. Your team may have already set up base sources housed in data warehouses, data lakes, and data marts.

When you are asked to create a separate pipeline for a different application, you must know the load patterns of your upstream (parent) data sources. Their scheduled load should always complete before your pipeline runs. If it doesn't, your pipeline will be out of sync with its parent sources and your target users will always see lagged data.
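With Airflow 2.x, for example, one way to enforce that ordering is to make your DAG wait on the parent load before running its own; the DAG names and schedule offsets below are hypothetical:

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.empty import EmptyOperator
from airflow.sensors.external_task import ExternalTaskSensor

with DAG(
    dag_id="orders_mart_load",
    start_date=datetime(2023, 1, 1),
    schedule_interval="0 6 * * *",   # runs after the upstream's hypothetical 5 AM load
    catchup=False,
) as dag:
    # Block until the parent/upstream DAG run has finished loading the base source.
    wait_for_upstream = ExternalTaskSensor(
        task_id="wait_for_upstream_load",
        external_dag_id="warehouse_base_load",  # hypothetical parent DAG
        external_task_id=None,                   # wait for the whole upstream DAG run
        execution_delta=timedelta(hours=1),      # upstream is scheduled one hour earlier
        timeout=60 * 60,
        mode="reschedule",
    )

    load_orders_mart = EmptyOperator(task_id="load_orders_mart")  # placeholder for the real load

    wait_for_upstream >> load_orders_mart
```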

7. Documentation is a must!

Another obvious point: documentation! Keep the documentation up to date with the latest changes as you develop the pipeline. Put in all the nitty-gritty details: source buckets, target buckets, DAGs, event systems, pipes, and so on.

It's crucial to maintain Confluence pages for documentation, and every member of the Dev/Ops team should have access to them.

8. Audit logs should always exist alongside raw/transformed/final data

Audit logs are required to know the latest state of your data. Data pipelines fail, and when they do, either a manual or an automated run needs to pick up from where the last run left off. To avoid duplication and inconsistency in the data, audit logs become crucial: they let your pipeline decide where to resume and what data needs to be picked up.
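A sketch of the idea, using a single audit table to record how far each run got; the schema and table names are hypothetical, and sqlite3 stands in for whatever store you actually use:

```python
import sqlite3
from datetime import datetime, timezone

# Assumes an audit_log(pipeline, run_at, loaded_up_to, status) table already exists.

def get_last_watermark(conn: sqlite3.Connection, pipeline: str) -> str:
    """Return the high-water mark of the last successful run, or a default for the first run."""
    row = conn.execute(
        "SELECT MAX(loaded_up_to) FROM audit_log WHERE pipeline = ? AND status = 'success'",
        (pipeline,),
    ).fetchone()
    return row[0] or "1970-01-01T00:00:00"

def record_run(conn: sqlite3.Connection, pipeline: str, loaded_up_to: str, status: str) -> None:
    """Append one audit row per run so reruns know exactly where to resume."""
    conn.execute(
        "INSERT INTO audit_log (pipeline, run_at, loaded_up_to, status) VALUES (?, ?, ?, ?)",
        (pipeline, datetime.now(timezone.utc).isoformat(), loaded_up_to, status),
    )
    conn.commit()

# Usage: read the watermark, load only rows newer than it, then record the new watermark.
```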

9. Get to know other programming paradigms and design patterns

Data Engineers are an amalgam of Data Analysts and Software Engineers. They can get rusty when they lose touch with the fundamentals of Software Engineering: software design patterns, the difference between imperative and declarative programming paradigms, data structures and algorithms.

Often you won't be building your own DAG class from scratch, since sophisticated libraries and packages already exist. But it can become necessary if your team or project insists on developing its own framework. There your role shifts from Data Engineer towards core Software Engineer, and it helps if you already are a good one.

This is where your knowledge of software design patterns comes in handy. You will create or follow a templated architecture for the target software, which will also serve as a base for future engineers to understand the code.
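As one illustration (not a prescription from any particular framework), the Template Method pattern fixes the extract-transform-load skeleton in a base class and lets each concrete pipeline fill in the steps; every name below is hypothetical:

```python
from abc import ABC, abstractmethod

class Pipeline(ABC):
    """Template Method: run() fixes the order of steps; subclasses supply the details."""

    def run(self) -> None:
        raw = self.extract()
        transformed = self.transform(raw)
        self.load(transformed)

    @abstractmethod
    def extract(self) -> list[dict]: ...

    @abstractmethod
    def transform(self, rows: list[dict]) -> list[dict]: ...

    @abstractmethod
    def load(self, rows: list[dict]) -> None: ...

class OrdersPipeline(Pipeline):
    def extract(self) -> list[dict]:
        return [{"order_id": 1, "amount": "10.5"}]           # stand-in for a real source read

    def transform(self, rows: list[dict]) -> list[dict]:
        return [{**row, "amount": float(row["amount"])} for row in rows]

    def load(self, rows: list[dict]) -> None:
        print(f"Loading {len(rows)} rows")                    # stand-in for a real warehouse write

OrdersPipeline().run()
```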

Also, you won't always follow an imperative approach to programming. Some crucial Big Data software operates in a functional way: Scala, for example, is a core language for Big Data processing tools, and functional languages like Haskell don't even allow looping statements, so you have to be comfortable writing recursive code, general-purpose functions, higher-order functions, and more.
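To keep all the examples in one language, here is that idea sketched in Python rather than Haskell: no explicit loop statements, just recursion and higher-order functions.

```python
from functools import reduce

def total(amounts: list[float]) -> float:
    """Sum a list recursively instead of with a loop."""
    if not amounts:
        return 0.0
    head, *tail = amounts
    return head + total(tail)

# Higher-order functions: filter and reduce take other functions as arguments.
amounts = [10.0, -3.0, 25.5, 7.25]
positive_total = reduce(lambda acc, x: acc + x, filter(lambda x: x > 0, amounts), 0.0)

print(total(amounts))     # 39.75
print(positive_total)     # 42.75
```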

10. Learn to develop APIs and MPP Systems

APIs are generally used when you need a central point of communication with outside applications that consume or send data. Cases will appear in your career where clients don't allow sharing data via buckets, data shares, or other usual approaches. In that situation you can design and develop API endpoints that authenticate the clients and let them send and receive data.
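A minimal sketch of such an endpoint, assuming FastAPI is the chosen framework; the path, payload shape, and key check are all placeholders for whatever your client actually needs:

```python
from fastapi import FastAPI, Header, HTTPException
from pydantic import BaseModel

app = FastAPI()

API_KEYS = {"hypothetical-client-key"}  # assumption: real keys would come from a secrets store

class Record(BaseModel):
    id: int
    payload: dict

@app.post("/ingest")
def ingest(record: Record, x_api_key: str = Header(...)):
    # Authenticate the client before accepting any data.
    if x_api_key not in API_KEYS:
        raise HTTPException(status_code=401, detail="Invalid API key")
    # In a real pipeline this would land the record in a staging table or queue.
    return {"status": "accepted", "id": record.id}
```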

If your client requires a concurrent, traffic-heavy API from your side, then you need to be familiar with Massively Parallel Processing (MPP) architecture for software design. You must be able to write parallel code that stays consistent under concurrency and can handle whatever load is thrown at your API.
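A true MPP system spreads work across many nodes, but the core habit of partitioning work and processing partitions concurrently can be sketched with Python's standard library; the per-partition work below is a stand-in:

```python
from concurrent.futures import ProcessPoolExecutor

def process_partition(partition: list[int]) -> int:
    """Stand-in for real per-partition work (parsing, aggregating, writing a shard, ...)."""
    return sum(x * x for x in partition)

def process_in_parallel(values: list[int], n_partitions: int = 4) -> int:
    # Split the input into independent partitions so they can be processed concurrently.
    partitions = [values[i::n_partitions] for i in range(n_partitions)]
    with ProcessPoolExecutor(max_workers=n_partitions) as pool:
        partial_results = pool.map(process_partition, partitions)
    # Combine partial results; each partition was processed independently, so the total stays consistent.
    return sum(partial_results)

if __name__ == "__main__":
    print(process_in_parallel(list(range(1_000_000))))
```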

Conclusion

The list could go on, and many more points can be added, but to me the points above matter most. Feel free to share more in the comments so that other Data Engineers know about them too. Happy pipe building!


Written by Rahul Dubey

Data-Intensive Software Engineer
