Starting a technical design is hard and getting consensus on an architecture is harder. Therefore, having a set of proven foundational principles allows you to prioritize accordingly during the design process. For instance, the Kubernetes project has a great example of aspirational design principles in their docs.
There is quite a lot to take into consideration when designing data architectures. We have a tendency to abstract away data problems into ETL (Extract, Transform, Load) processes and yes, at a high level everything is just ETL — because you have to fetch data, process it in some way and surface that data layer up. But reducing all of data engineering to ETL is like reducing all of application engineering to CRUD services. It is foundational and gives a great high level abstraction but doesn’t solve the underlying problem of fleshing out the details and nuances of a technical design.
The following are my top three design principles that I believe add clarity to data engineering as a practice.
- Design for Immutable Data
- Creating Data Lineage
- Gradual and Optional Data Validation
With these, I believe we can build data systems that are easy to test, idempotent, and traceable — all of which primarily leads to maintainability.
Immutable data is core to designing a system that is easy to test, that is idempotent and that is reproducible — without which the other two principles below are incredibly challenging to execute. Idempotent operations means that the same input will consistently produce the same output (no side effects). There is a strong case to be made that batch data processing ought to follow functioning programming paradigms because functional programming facilitates the ability to make data processes reproducible.
In this paradigm, data is stored as an immutable sequence of events that collectively tells the whole story. Take for instance a bank account. At the end, you probably want to display a user’s current balance. And sure, you can store the current balance for each user in a table and override that whenever it changes but that limits the data usage to just displaying the current balance. Storing the data as immutable transactions with credits and debits allows you to leverage the data for creative use cases and allows you to view the history of the account.
This can enable developers to understand which version of the data and which version of the code led to a specific computation. As your code functionality becomes more and more complex — say instead of simple arithmetic, the code is running a machine learning algorithm — having immutable data that allows for reproducible testing becomes invaluable.
Creating Data Lineage
Immutable data enables you to create data lineage — data lineage on its own is the ability to trace and understand why data was mutated a certain way at a specific step in the pipeline.
From the data science and analytics perspective, every story told by data has a caveat, every fact in an analytics report deserves an asterisks. Take for instance, the Street Bump app launched in 2012 in Boston, Massachusetts as a way to crowdsource the identification of potholes on the road. The general data told the story that most of the potholes in the city were in affluent neighborhoods. However, upon further analysis, this was due to the majority of the people who used the Street Bump app being based out of affluent neighborhoods (Crawford “The Hidden Biases in Big Data”). So, that begs the question, do you know where your data is coming from and what underlying assumptions were introduced to your pipeline without your knowing?
On the architectural and data engineering side, there’s a need to create observability for the data pipeline by tracing data lineage, which is identifying how data mutated at each iteration of the process. Let’s go back to the functional programming paradigm and say each incremental process is idempotent. If at the end of a series of mutations, the outcome of the data is unexpected, you need to be able to iterate through each step of the pipeline in order to identify at which step there was a either a bug in the code or an unexpected behavior of the data.
This is the equivalent in data engineering to what distributed tracing is in the world of microservices and it is certainly not easy to do. One approach is to collect metadata that tags each record with its input source(s) and which version of code processed it. This creates a graph of which row record points to which previous data sources and so on.
Gradual and Optional Validation
While data science needs to know the contextual assumptions on the data, it is the responsibility of data engineering to understand and document the numerical and data type assumptions (n > 0, field0: must be a string). Because it takes evidence to differentiate valid assumptions from invalid ones, it doesn’t make sense to have strict validation at the front door.
This doesn’t mean there’s no validation. Instead, identify the non-negotiable security and privacy concerns around data and make sure those are handled with care. And then, allow for gradual and optional validation as you build up domain expertise about the data.
Without capturing the pieces of data that introduce uncertainty to the system, you can’t readily determine what the correct behavior is. This means that validation has to be re-evaluated as data evolves and at a certain level, validation has to be optional. Ideally, the system can turn off bits and pieces of data validation to allow the processing of unexpected data. This allows for gradual testing of the data in a QA environment and the ability of the team to document and understand what is acceptable from the data and why. Then, you can decide whether to keep the validation in production or disable it.
I think general community consensus is that data engineering has far evolved beyond just ETL and successful teams need to know how to abstract, generalize, and create long lasting solutions. And that involves coming up with core principles that guide your design practices. Immutable data, gradual validation and data lineage all tie together to solve for data architectures that are optimized for observability of data pipelines both from the perspective of code and data. This then facilitates debugging and triaging issues and testing new features.
Agree or disagree with any of the ideas above? Join the conversation at Button, we’re hiring!