Data lineage, the lost child of data science

Lineage, in simple terms, is the ability to leave breadcrumbs in the woods so that when you need to retrace your steps you are able to find your way out. In many organisations this simple principle is missing, and clients end up being eaten by the evil witch.

Data lineage includes the data’s origins, what happens to it and where it moves over time. Data lineage gives visibility while greatly simplifying the ability to trace errors back to the root cause in a data analytics process. — Wikipedia

To understand lineage, and to truly appreciate its impact, there are a few concepts worth exploring. These concepts are fundamental to building the correct vocabulary for this beautiful science. I'm going to introduce the following concepts and elaborate on them in this article:

1. Source Data

2. Immutability and time lineage

3. Prescriptive data lineage

4. Active vs Lazy lineage

5. Actors

6. Associations

Source Data is any data, structured or unstructured, produced inside or outside your company, that the company consumes to produce reports and make business decisions.

When collecting source data from different systems, it is important to understand that the source being consumed today might change when you collect the same data again at a different time. A good example is an external service provider who gives you a snapshot of the state of data in their system at a given time. The provider might realise that they haven't used the correct reference data to calculate a value and need to send you another copy of the data for the same time frame. This creates a problem in your data consumption, because you now need the ability to roll back multiple records associated with the first dataset before you can load the new set in. We will discuss how lineage solves this.

Data Immutability and time lineage are two separate concepts, but they are tightly coupled and presented together in this article to drive the idea of proper data management. Data immutability is the idea that once a record of data has been created, it cannot be altered or deleted. This idea is counter-intuitive in a world where the state of data in your source systems is changing on a regular basis. Take a client record, for example: in your CRM system an agent might change the email address of a client multiple times during the life of that client. For the CRM system only the latest information is relevant, so the record gets updated and overridden. The system might keep a log of activity, but normally only for a couple of months before the log gets overwritten. In your operational data store, however, you need the ability to keep track of such changes and to recreate the state of your data for a specific point in time. One method of achieving this is to create tables with the following four fields as attributes of a record:

1. Created By — This indicates the user/system that created the record

2. First Valid — This field holds a date-time stamp of the record creation

3. Updated By — This field is initially set to a default blank identity and is mutable

4. Last Valid — This is the date-time stamp up to which the record was in a valid state. As a default value, we normally set this date to ‘2300/01/01’. This field is mutable.

With this in place, we now have a framework in which we can roll data back to its state on a specific date, without compromising the integrity of our data's current state. This feature of immutable data gives us time lineage, as we can trace the state of our data through time. It is a crucial piece of the puzzle when creating principles around lineage. A closely related term is data provenance, which refers to the record of a dataset's origins and the processes that influenced it.
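The four fields above can be sketched in code. This is a minimal illustration, not a production store: the class name, the dict-per-row layout and the `as_of` query are assumptions made for the example, while the field names follow the list above.

```python
from datetime import datetime

FAR_FUTURE = datetime(2300, 1, 1)  # default "still valid" marker from field 4

class RecordStore:
    """Append-only store: records are closed off, never altered or deleted."""

    def __init__(self):
        self.rows = []

    def insert(self, key, value, created_by, now):
        self.rows.append({
            "key": key, "value": value,
            "created_by": created_by, "first_valid": now,   # fields 1 and 2
            "updated_by": None, "last_valid": FAR_FUTURE,   # fields 3 and 4
        })

    def update(self, key, value, updated_by, now):
        # Close off the current version instead of overwriting it...
        for row in self.rows:
            if row["key"] == key and row["last_valid"] == FAR_FUTURE:
                row["updated_by"] = updated_by
                row["last_valid"] = now
        # ...then append the new state as a fresh immutable record.
        self.insert(key, value, updated_by, now)

    def as_of(self, key, when):
        # Time lineage: recreate the state of the data at a point in time.
        for row in self.rows:
            if row["key"] == key and row["first_valid"] <= when < row["last_valid"]:
                return row["value"]
        return None
```

With the CRM example from above, an agent changing a client's email address produces a second row rather than an overwrite, so `as_of` can answer for any date in the client's history.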

The concept of Prescriptive Data Lineage combines both the logical model (entity) of how that data should flow with the actual lineage for that instance.

Data lineage typically refers to the way, or the steps, a dataset came to its current state, as well as all its copies and derivatives. However, simply looking back at audit or log correlations to determine lineage from a forensic point of view is flawed for certain data management cases. For instance, without the logical model it is impossible to determine with certainty whether the route a data workflow took was correct or in compliance.

Lazy lineage collection typically captures only coarse-grain lineage at run time. These systems incur low capture overheads due to the small amount of lineage they capture. However, to answer fine-grain tracing queries, they must replay the data flow on all (or a large part) of its input and collect fine-grain lineage during the replay. This approach is suitable for forensic systems, where a user wants to debug an observed bad output.

Active collection systems capture entire lineage of the data flow at run time. The kind of lineage they capture may be coarse-grain or fine-grain, but they do not require any further computations on the data flow after its execution. Active fine-grain lineage collection systems incur higher capture overheads than lazy collection systems. However, they enable sophisticated replay and debugging. In this article, we will focus on active collection.

An actor is an entity that transforms data. Actors act as black-boxes and the inputs and outputs of an actor are tapped to capture lineage in the form of associations, where an association is a triplet {i, T, o} that relates an input i with an output o for an actor T. The instrumentation thus captures lineage in a dataflow one actor at a time, piecing it into a set of associations for each actor.

An association is a combination of the inputs, the outputs and the operation itself. The operation is represented as a black box, also known as the actor. The associations describe the transformations that are applied to the data and are stored in association tables, with each unique actor represented by its own association table. An association itself looks like {i, T, o}, where i is the set of inputs to the actor T and o is the set of outputs produced by the actor. Associations are the basic units of data lineage. Individual associations are later combined to construct the entire history of transformations that were applied to the data.
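The tapping of inputs and outputs can be sketched as below. The actor names, the lambda bodies and the dict-of-tables layout are all assumptions for the sake of the example; only the {i, T, o} shape comes from the text.

```python
# One association table per unique actor, as described above.
association_tables = {}

def run_actor(actor_name, actor_fn, inputs):
    """Treat the actor as a black box: tap its inputs and outputs
    and record the association {i, T, o}."""
    outputs = actor_fn(inputs)
    association_tables.setdefault(actor_name, []).append(
        {"i": inputs, "T": actor_name, "o": outputs}
    )
    return outputs

# Two toy actors chained into a small dataflow, one association each.
raw = [" Alice ", " bob "]
trimmed = run_actor("trim", lambda xs: [x.strip() for x in xs], raw)
titled = run_actor("title_case", lambda xs: [x.title() for x in xs], trimmed)
```

Reading the association tables in order reconstructs the full history of transformations, which is exactly what active collection relies on.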

Now that we have some theory we can add some substance to the science of Data Lineage. The story of Hansel and Gretel is a good analogy to explain the importance of a solid data lineage practice in any organisation. When Hansel and Gretel were left in the woods by their father the first time they used pebbles that shone in the night, and they had no problem retracing their steps back to their home. The second time they were left in the woods the birds ate the breadcrumbs they left, so they were unable to retrace their steps.

With data lineage, we are trying to create a trail of pebbles rather than bread. With a solid understanding of why and how lineage benefits your company the road becomes a bit easier. The benefits are a fabric that supports client centricity, compliance and business efficiency.

The first step in this process is a solid understanding of what data is being collected, and why it's being collected. When designing a lineage fabric it is important to create sufficient metadata that explains the source: the fields that will be ingested, the categorisation of the data, the SLAs (Service Level Agreements) attached to each source, and the version of the source. The more time you spend planning and decoding your sources, the more value flows to the organisation and the sooner the business benefits start to show. You can now see, in one central place, who the external providers are, what escalation processes flow from the SLAs, and, in the spirit of good governance, where SLAs should still be put in place. SLAs, in my mind, are one of the most important aspects of good data governance for both internal and external sources. They allow you to plan your batch processes, be alerted upfront of any data model changes, and have proper escalation processes in place when an SLA is not met.

When the sources are defined and ready for consumption, we move on to the next step: data ingestion. The metadata attached to each source allows us to create a unique session identifier every time we consume a source. These session identifiers provide the first building block for creating lineage in your data. The combination of source identifier and session identifier gives us a mechanism to identify specific time frames of imports from specific sources. We now have the ability to invalidate data from a previous import session and replace it with a new set of the same data, in a different state but for the same time frame. Because our tables are immutable, we don't lose the previous state and we can replay the data for specific time frames. During the ingestion phase we are not applying any business logic to the data, so no actors are involved in the transformation yet; we are simply putting the source data into a format we understand and can work with.
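The invalidate-and-replace mechanic can be sketched as follows. This is a toy illustration under stated assumptions: the class and method names are hypothetical, and a UUID stands in for whatever session identifier scheme your metadata produces.

```python
import uuid
from datetime import datetime

FAR_FUTURE = datetime(2300, 1, 1)  # same "still valid" default as before

class IngestionStore:
    """Each import runs under a unique session id, so a whole session
    can later be invalidated without deleting a single row."""

    def __init__(self):
        self.rows = []

    def ingest(self, source_id, records, now):
        session_id = uuid.uuid4().hex  # unique per import session
        for rec in records:
            self.rows.append({
                "source_id": source_id, "session_id": session_id,
                "data": rec, "first_valid": now, "last_valid": FAR_FUTURE,
            })
        return session_id

    def invalidate_session(self, session_id, now):
        # Immutable tables: close the records off rather than delete them.
        for row in self.rows:
            if row["session_id"] == session_id and row["last_valid"] == FAR_FUTURE:
                row["last_valid"] = now

    def current(self, source_id):
        return [r["data"] for r in self.rows
                if r["source_id"] == source_id and r["last_valid"] == FAR_FUTURE]
```

When the provider from the earlier example sends a corrected snapshot, you invalidate the old session and ingest the new one; the flawed import stays on record and can still be replayed for its time frame.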

Beautiful. Now we have all the different sources in multiple source tables and we can start playing. This phase involves the normalisation of data. You have ingested the data for a reason, and you now need to apply certain transformations to get it into the format dictated by your master data management regime.

Master data management (MDM) is a method used to define and manage the critical data in an organisation to provide, with data integration, a single point of reference. The data that is mastered may include reference data- the set of permissible values, and the analytical data that supports decision making. — Wikipedia

Master data is the round hole you need to make your square data go into. You do this by chopping off the corners of the square: transformation. As explained earlier, this is done with single actors, or associations of multiple actors, producing an output for a given input. This means we can encode business rules in these actors and trace the transformation from source to normalised state. It is very important to note that, in order to maintain lineage through this transformation process, the session identifier needs to be attached to every artifact the transformation produces.
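Carrying the session identifier through an actor might look like the sketch below. The actor, its field names and the normalisation rule are invented for illustration; the point is only that the artifact keeps the lineage link back to its import session.

```python
def normalise_client(record, session_id):
    """Toy normalisation actor: squares the data into the master-data
    round hole while carrying the session identifier through."""
    return {
        "email": record["email"].strip().lower(),  # hypothetical business rule
        "session_id": session_id,  # lineage link back to the ingestion session
        "actor": "normalise_client",
    }

artifact = normalise_client({"email": "  Alice@Example.COM "}, session_id="sess-42")
```

Any downstream report built from `artifact` can now be traced, via the session identifier and the actor's association table, all the way back to the original import.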

The final step in our lineage journey is the consumption of normalised data. This is where lineage really starts paying off. By consumption, I mean any further transformation into cubes for analytics, the use of data in reports, or any extract of data being sent to external parties. When we produce a tax report for a client and queries arise from incorrectly stated figures, my support team can now effortlessly trace the lineage graph back to the source where the data was imported, and replay any transformations that were applied, to quickly resolve the query. The lineage graph provides a bulletproof method of traceability, which leads to consistency, which leads to the end users trusting the data.

The ultimate goal of data lineage is to reduce turnaround time on data queries, improve your data quality with well-defined actors, and give your business beautiful, useful data that can be used to make informed decisions. Decisions that can potentially help two poor kids get out of the woods on a beautiful pebble-paved lineage highway.
