The time dimension manifests itself in many aspects of software engineering. For something so ubiquitous, it is surprisingly often overlooked. It takes years of experience to start thinking upfront about the evolution of requirements and system design, about versioning of artifacts and data, schema changes, and DB migrations. With ever-rising demands for audit, accountability, provenance, and business analytics, I decided to dedicate my first two technical posts specifically to the topic of representing the time dimension in data.
From the perspective of application engineers, capturing how application state changes over time is often an afterthought, so when the time comes to capture business metrics from the application or add auditing, it often leads to expensive re-designs or sends teams down the winding road of even more costly sub-optimal solutions. From the perspective of data engineers, temporal data is better understood, but as we will see, basic temporality cannot satisfy many advanced use cases and fails to capture the true history. There is also a significant disconnect between the two camps on what data in transit between an application and the warehouse should look like.
This article is constantly getting a lot of traffic, so I decided to give a talk on this topic at PyData Global 2021. You can find the recording of it here.
Why Should You Care? #︎
In Kamu we firmly believe that the way we treat [open] data from the historical perspective needs to change. Currently, a large portion of published datasets are non-temporal. If you look at major open-data hubs like geonames.org and naturalearthdata.com, their data is exposed in the form of daily snapshots, meaning these datasets lose all history of how a city's population changed over time, or how country boundaries shifted.
In an age when 300+ hours of video are uploaded to YouTube every minute, is this really the way we should treat our factual data? How did it come to be that the historical data that can (and should) be used to drive important business and political decisions is less important to us than keeping all of the cat videos?
In Kamu we care about the quality of data published on the web. This data falls into the “data in transit” category, as it is captured from some task-specific systems and exposed for further analysis. Such data can only be as good as the data models in the system that produces it, so all too often we see it suffering from the above-mentioned symptoms. By taking a holistic look at the systems where data originates and the systems that process it, I’m hoping to better understand the causes and see what can be done differently.
A ton of scientific effort has been put into temporal data research over the last 25-30 years, but very few aspects of it have gained any traction. Let’s find out why and how this can be changed.
In this post we will look at:
- What OLTP systems are
- Basic non-temporal relational data models in the OLTP systems
- Evolution towards temporal models as a result of business analytics requirements
- Introduction of temporality as a result of application feature requirements
In the next post we will look at:
- OLAP systems
- Basics of relational schema in OLAP systems with conventional ETL process
- Temporal features of facts and dimensions in a star/snowflake schema
- Different kinds of time within one application
- Bitemporal data models as a way to capture the entire history of change
- Inherent bitemporality of stream processing systems
- Temporal data design in the age of ELT and schema-on-read
- And finally, how data published on the web and data in transit should look like
Non-Temporal Models #︎
If you are a software engineer - this type of data will be most familiar to you.
To keep this more grounded, let’s immediately start with an example in which we will be implementing the item store for an online video game. Commonly, this type of system is known as OLTP (online transaction processing). Such systems are optimized for processing high volumes of short-lived transactions.
Accounting only for the most basic requirements (let people earn virtual currency via in-game activities and spend their money on in-game items) the data model can be as simple as:
Store listings (prices per SKU):

| SKU | Price |
|-----|-------|
| 1   | 100   |
| 2   | 250   |

SKU contents (which item each SKU grants):

| SKU | ItemID |
|-----|--------|
| 1   | 1      |
| 2   | 2      |

Account balances:

| AccountID | Balance |
|-----------|---------|
| 1         | 100     |
| 2         | 250     |

Account inventories:

| AccountID | ItemID |
|-----------|--------|
| 1         | 3      |
| 1         | 4      |
In the model we built, data is nothing but the current state. As the name suggests, non-temporal models ignore the time dimension entirely - there is no concept of history, there’s only “now” (or any other time really - it’s outside of the model’s knowledge). Changes to this state are made via CRUD operations: create, update, and delete - meaning data is mutable. We can get away with this model in our example because our requirements are only concerned with the current state.
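To make this concrete, here is a minimal sketch of such a model in SQLite (table and column names are my own, chosen to mirror the tables above). Note how the update overwrites the previous value in place:

```python
import sqlite3

# Illustrative schema mirroring the non-temporal tables above
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE listing   (sku INTEGER PRIMARY KEY, price INTEGER NOT NULL);
    CREATE TABLE sku_items (sku INTEGER, item_id INTEGER);
    CREATE TABLE account   (account_id INTEGER PRIMARY KEY, balance INTEGER NOT NULL);
    CREATE TABLE inventory (account_id INTEGER, item_id INTEGER);
""")
conn.execute("INSERT INTO account VALUES (1, 100), (2, 250)")

# A purchase is just a pair of CRUD operations: the old balance is overwritten
# in place and a row is added to the inventory - nothing records that a
# purchase ever happened, or when.
conn.execute("UPDATE account SET balance = balance - 100 WHERE account_id = 1")
conn.execute("INSERT INTO inventory VALUES (1, 1)")
conn.commit()
```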
Satisfying the requirements while keeping the footprint of the system absolutely minimal is great, right? Well… yes. But this design should really be regarded as an extreme optimization. Unfortunately, this is not how it’s usually presented. Non-temporal data models are the first and often the only kind of models that are taught to software engineers. We learn how to build them, how to normalize them, but disregard the time dimension all the way through. When the system matures and requirements appear for insight into how its state changes over time - you’re in for an expensive re-design.
I’m not saying that building a minimal system that satisfies the requirements is wrong, but:
- Some requirements should be anticipated, especially if they can significantly influence the design.
- You should know what the evolution path toward time-awareness looks like, to avoid following a long winding road of half-measures.
Evolution of Time Dimension in OLTP Systems #︎
Associating time with data may be needed in the following cases:
- Data represents some outside world entities - in this case, we may need to track the “real world” time at which our knowledge about the state of these entities is considered to be “correct”
- There’s a need to track the history of state changes of the system over time - this may follow from regulations and accountability requirements (e.g. banking software needs to show everything that happened to an account) or can be driven solely by the desire to understand your own business better.
The latter case is interesting because the existing non-temporal data model is enough to keep the core business running, so temporality becomes a “nice to have” but not an essential feature. This often leads to it being “bolted onto” an existing system to enable data analytics - this is the start of the long winding road I mentioned.
Let’s first take a look at what kinds of changes this can result in for the original system. In order of increasing intrusiveness, the designs I will describe are:
- Snapshot Differencing
- Changed Values Selection
- Change Data Capture
- Domain Events
- Event Sourcing
Snapshot Differencing #︎
If you are doing backups - you are already adding temporality to your data. Backups are very low-frequency snapshots, but diffing data between two backups may be just enough for some simple use cases, e.g. analyzing user signups and user churn rate on a daily/weekly basis.
To increase the frequency of diffs you can resort to techniques like table diffing, where periodic snapshots of specific tables are created and diffed regularly. The impact on the OLTP system can be minimized by performing such diffs on replicated data.
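As a minimal sketch of the idea (the table contents are made up), here is a diff of two snapshots of an account table keyed by primary key:

```python
def diff_snapshots(old: dict, new: dict) -> dict:
    """Diff two snapshots of a table, each given as a mapping of primary key -> row."""
    return {
        "created": {k: new[k] for k in new.keys() - old.keys()},
        "deleted": {k: old[k] for k in old.keys() - new.keys()},
        "updated": {k: (old[k], new[k]) for k in old.keys() & new.keys() if old[k] != new[k]},
    }

# Two daily snapshots of the account table: account_id -> balance
monday = {1: 100, 2: 250, 3: 1000}
tuesday = {1: 100, 3: 900, 4: 50}

print(diff_snapshots(monday, tuesday))
# {'created': {4: 50}, 'deleted': {2: 250}, 'updated': {3: (1000, 900)}}
```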
There are many ways to skin a cat, but in general, these approaches have the following drawbacks:
- They involve copying large amounts of data
- Running a diff operation on large data volumes is expensive
- All updates to data between the diffs are lost
- Data that changed but then reverted to old values will not appear in the diffs
Changed Values Selection #︎
To avoid the costly diffing operation and reduce the amount of data that needs to be transported and duplicated, we can modify the original data model to include a `LastUpdated` column, which is set to the current system timestamp every time a row is modified. Changed data can then be easily scraped by specifying a time interval.
While the other drawbacks of snapshot differencing still remain, this method also adds some overhead to the original system, increasing its footprint and requiring an additional index to query changed data efficiently.
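A minimal sketch of this approach in SQLite (table, column, and index names are assumptions):

```python
import sqlite3
from datetime import datetime, timedelta, timezone

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE account (
        account_id   INTEGER PRIMARY KEY,
        balance      INTEGER NOT NULL,
        last_updated TEXT    NOT NULL
    )
""")
# The extra index needed to scrape changed rows efficiently
conn.execute("CREATE INDEX idx_account_last_updated ON account (last_updated)")

start = datetime.now(timezone.utc)
conn.execute("INSERT INTO account VALUES (3, 1000, ?)", (start.isoformat(),))

# Every write must remember to bump the timestamp (or do it via a trigger)
conn.execute(
    "UPDATE account SET balance = ?, last_updated = ? WHERE account_id = ?",
    (900, datetime.now(timezone.utc).isoformat(), 3),
)
conn.commit()

# The scraper periodically pulls everything modified since its last run
since = (start - timedelta(seconds=1)).isoformat()
changed = conn.execute(
    "SELECT * FROM account WHERE last_updated > ?", (since,)
).fetchall()
print(changed)  # only the rows touched within the interval
```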
Change Data Capture #︎
What if we could capture every single modification of data that happened?
There are multiple approaches that achieve this:
- Setting up triggers that record changed values into special tables
- Capturing data from transaction logs/redo logs (something you might already be doing to enable point-in-time recovery)
- There are also some vendor-specific change data capture solutions (see Oracle CDC)
All of them have different pros and cons, but let’s focus on the core idea.
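To make the first option concrete, here is a minimal sketch of trigger-based capture in SQLite (table and trigger names are assumptions; real CDC tooling typically tails the transaction log instead):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE account (account_id INTEGER PRIMARY KEY, balance INTEGER NOT NULL);

    -- Special table that accumulates every modification of the tracked table
    CREATE TABLE account_changes (
        account_id  INTEGER,
        old_balance INTEGER,
        new_balance INTEGER,
        changed_at  TEXT DEFAULT CURRENT_TIMESTAMP
    );

    CREATE TRIGGER account_update_cdc AFTER UPDATE ON account
    BEGIN
        INSERT INTO account_changes (account_id, old_balance, new_balance)
        VALUES (NEW.account_id, OLD.balance, NEW.balance);
    END;
""")

conn.execute("INSERT INTO account VALUES (3, 1000)")
conn.execute("UPDATE account SET balance = 900 WHERE account_id = 3")
conn.commit()

# The change table now holds (3, 1000, 900, <timestamp>) - a faithful record of
# the state change, but nothing about *why* the balance went down.
print(conn.execute("SELECT * FROM account_changes").fetchall())
```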
Let’s say we now have an amazing ability to see how data was changing at any point in time. Is this all we can dream of?
Example:
State of the system at time T:
| SkuID | ItemID | Price |
|-------|--------|-------|
| 1     | 2      | 100   |

| AccountID | Balance |
|-----------|---------|
| 3         | 1000    |

| AccountID | ItemID | Quantity |
|-----------|--------|----------|
| 3         | 4      | 1        |
State of the system at time T+1:
| SkuID | ItemID | Price |
|-------|--------|-------|
| 1     | 2      | 100   |

| AccountID | Balance |
|-----------|---------|
| 3         | 900     |

| AccountID | ItemID | Quantity |
|-----------|--------|----------|
| 3         | 4      | 1        |
| 3         | 2      | 1        |
What just happened? From our CDC data, we can see that the user with accountID “3” got a new item “2” in their inventory, and their balance decreased by “100”. Since the price of “100” and itemID “2” happen to correspond to SKU “1” - we can tell this was a purchase.
You see, our game store application did its part to transform the user operation into state changes according to our business rules. Now our data analytics system is suddenly in a place where it needs to reconstruct which operation happened by looking only at the state changes.
This is not the position you want to be in. It may look manageable in this example, but even with this simple model - imagine reconstructing what happened if one transaction could include purchases of multiple SKUs at once (shopping cart), or if one SKU could include multiple items (bundles):
- This quickly turns into an NP-hard problem!
- It also puts you on a slippery slope of duplicating your business logic in two places and keeping them in sync.
When the data model is designed around storing state and not around storing history, you lose the context of what led to these state changes, and that context is practically impossible to recover. For these reasons, I think CDC is acceptable in the very early stages of adding data analytics to your OLTP system, or as a way to capture changes from legacy systems, but this should not be your end state.
Domain Events #︎
The best way to reconstruct the context of a state change is not to lose it to begin with. We can make the application take on additional responsibility - along with making changes to the “current state” it will start to emit so-called “domain events” that capture why something has happened, which entities were affected, and how. This approach was popularized largely by DDD (domain-driven design).
In our earlier example, the state changes would be accompanied by a domain event such as:
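The exact shape here is my own sketch - the field names follow those discussed in the rules of thumb below:

```python
# Illustrative "ItemPurchased" domain event emitted alongside the state change
purchase_event = {
    "eventType": "ItemPurchased",
    "eventTime": "2021-01-01T12:30:00Z",   # system time of the transaction
    "accountID": 3,
    "skuIDs": [1],                         # what was purchased
    "debit": 100,                          # amount charged
    "newBalance": 900,                     # resulting state, denormalized
    "items": [
        {"itemID": 2, "newQuantity": 1},   # resulting inventory change
    ],
}
```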
Cool, now the data analytics team will have everything they need to analyze the purchase behavior of users.
- The domain events “speak” the language of your business, which makes them easy to understand and consume
- There is no expensive context reconstruction step
- The business logic stays in one place
But keep in mind that this is yet another “contract” of your application - you now have a “data API” which you need to carefully maintain and evolve.
Note: In DDD domain events are generally used for communicating things that happen between services within the bounded context and to the outside world. There is much debate on how granular such events need to be. To keep this article simple I have to pick a side and will assume that domain events capture all state transitions of the system. Therefore it is possible to accurately reconstruct the full state of the system by observing all domain events it emitted from the beginning of time.
Transactional Consistency #︎
How should you be emitting domain events from the application?
The simplest way could be to use a side channel, e.g. publishing to a Kafka topic. But be aware that by doing so you are giving up transactional consistency between domain events and state changes - you may see events out of order, or events emitted for state changes that were never committed to the DB, and vice versa.
If you are only using domain events to feed data analytics - giving up transactional consistency may be fine. If you are using domain events in the broader scope of DDD, such as asynchronous flows and inter-service communication - it is far less (if at all) acceptable.
To preserve transactional consistency you can emit the domain events by writing them to the same database you use for storing the state. The domain events can be written to a special table as part of the same transaction that mutates the rest of the state, thus preserving atomicity and correct ordering. Later, events can be scraped from this table and transported to their destination, and the table can be cleaned up. Such a scraper can easily achieve “at least once” and even “exactly once” guarantees for transporting these events. This, of course, works only if you are using a datastore with ACID properties.
Note that all different types of events can be stored in a single table in a serialized form. We won’t be querying them or validating any constraints, so we don’t have to model them in a relational way.
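A minimal sketch of this pattern (often called a transactional outbox) in SQLite - the table and field names are assumptions:

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE account   (account_id INTEGER PRIMARY KEY, balance INTEGER NOT NULL);
    CREATE TABLE inventory (account_id INTEGER, item_id INTEGER, quantity INTEGER);
    -- Events are stored serialized; we never query inside them or enforce constraints
    CREATE TABLE outbox    (event_id INTEGER PRIMARY KEY AUTOINCREMENT, payload TEXT NOT NULL);
""")
conn.execute("INSERT INTO account VALUES (3, 1000)")

event = {"eventType": "ItemPurchased", "accountID": 3, "skuIDs": [1],
         "debit": 100, "newBalance": 900,
         "items": [{"itemID": 2, "newQuantity": 1}]}

# State change and domain event are committed atomically in a single transaction
with conn:
    conn.execute("UPDATE account SET balance = 900 WHERE account_id = 3")
    conn.execute("INSERT INTO inventory VALUES (3, 2, 1)")
    conn.execute("INSERT INTO outbox (payload) VALUES (?)", (json.dumps(event),))

# A separate scraper later reads the outbox in event_id order, publishes the
# events downstream, and deletes the rows it has shipped.
```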
Rules of Thumb for Event Design #︎
What made me choose this particular design of the event?
There are no hard guidelines to follow, but here's a set of things I considered:
- Transactional consistency is essential to make it possible to fully reconstruct the state of the application by looking at all of the events it emitted, therefore one event can only span one DB transaction.
- At a minimum, the event needs to capture changes to all entities affected by the transaction - in our case the changed entities are the account balance and the account inventory.
- Event should contain enough contextual information to understand what happened from the business perspective - in our case `skuIDs` tells us what was purchased, but this could also include any active promotions, coupons applied, etc.
- Including new state information of changed entities will help to prevent duplication of the business logic on the data analytics side - in our example we included the `debit`, `newBalance`, `itemID`, and `newQuantity` fields even though they could be reconstructed by keeping track of all events regarding changes to the store content to know the cost and content of all SKUs. By doing so we simplify the work for data analysts, make querying more efficient by denormalizing data, and most importantly, if the business logic around calculating the debit amount and item quantity ever becomes non-trivial (e.g. discounts, coupons, item deduplication logic, etc.) the event consumers don't need to know about it.
As you can see, with the introduction of domain events we are getting into territory where temporality requirements have a significant impact on the design of the OLTP system, and we’re about to take it one big step further.
Event Sourcing #︎
What we ended up with after introducing domain events is a kind of weird duality in our storage layer:
- We store everything that happened in our system in the events table, with events periodically shipped elsewhere and the table cleaned up
- We also store the “current state” of the system as our main data set
What would happen if we said that we never clean up the events table and store all events indefinitely? Because the current state can be fully reconstructed by looking at every domain event from the beginning of time until the present - we are duplicating data. To be more mathematical - the “current state” is a projection of all historical events at the current point in time. We will call it an “AS OF” projection from now on (following SQL:2011 terminology).
What’s cool about replacing “current state” data with historical events?
- Since we are not collapsing the time dimension we can navigate it as we please. The “AS OF” projection can be applied to any timestamp to see the exact state of the system at that specific point.
- Data becomes an append-only log, with every past event being immutable
- CRUD operations are replaced by expressing what actually happened and interpreting what this means for the state
This is a whole new perspective to look at your application from - application as a business rule engine, not a mere state updater. This idea is at the center of the design technique called “event sourcing”.
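As a minimal sketch of such an “AS OF” projection (the event shape carries over the assumptions from the earlier examples), the current state is just the replay of all events up to the chosen timestamp:

```python
# A toy event log; in an event-sourced system this would be the source of truth
events = [
    {"eventTime": "2021-01-01T10:00:00", "eventType": "AccountCredited",
     "accountID": 3, "credit": 1000},
    {"eventTime": "2021-01-02T12:30:00", "eventType": "ItemPurchased",
     "accountID": 3, "debit": 100, "items": [{"itemID": 2, "newQuantity": 1}]},
]

def balance_as_of(events, account_id, as_of):
    """Project an account's balance AS OF a timestamp by replaying events."""
    balance = 0
    for e in events:
        if e["eventTime"] > as_of or e["accountID"] != account_id:
            continue
        if e["eventType"] == "AccountCredited":
            balance += e["credit"]
        elif e["eventType"] == "ItemPurchased":
            balance -= e["debit"]
    return balance

print(balance_as_of(events, 3, "2021-01-01T23:59:59"))  # 1000
print(balance_as_of(events, 3, "2021-01-03T00:00:00"))  # 900
```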
This is a rather radical departure from our original OLTP design. This article is not about implementing event sourcing, but a few things are worth mentioning:
- If your OLTP system primarily operates with the “current” time projection of data:
  - Switching to vanilla event sourcing will result in you losing most of your indexes
  - The overhead of keeping the entire ever-expanding history in one database may complicate operations (the way you do backups may need to change)
  - Reconstituting state from events can become expensive as history grows
  - Thus using pure event sourcing in such systems may not be justified, and domain events may be better shipped elsewhere
- Consider doing periodic snapshots to speed up the projection (see the sketch after this list)
- The CQRS technique allows separating the historical database (the source of truth) from the database that stores projections (read models) to accelerate querying
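Here is a sketch of the snapshot idea, continuing the assumed event shape from above - the projection starts from the latest snapshot and only replays the events recorded after it:

```python
def balance_from_snapshot(snapshot, events_since_snapshot, account_id):
    """Project the balance starting from a snapshot instead of the full history."""
    balance = snapshot["balances"].get(account_id, 0)
    for e in events_since_snapshot:
        if e["accountID"] != account_id:
            continue
        if e["eventType"] == "AccountCredited":
            balance += e["credit"]
        elif e["eventType"] == "ItemPurchased":
            balance -= e["debit"]
    return balance

# Snapshot taken at the end of 2021-01-01; only the events after it are replayed
snapshot = {"asOf": "2021-01-01T23:59:59", "balances": {3: 1000}}
recent_events = [{"eventTime": "2021-01-02T12:30:00", "eventType": "ItemPurchased",
                  "accountID": 3, "debit": 100}]
print(balance_from_snapshot(snapshot, recent_events, 3))  # 900
```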
Coming up next, we will look at the other side of the fence - the systems that are used to collect and store the historical data.