While some might observe that the difference between slowly changing dimensions (SCD) And Change Data Capture (CDC) might be subtle, there is in fact a technical difference between the two processes.
Both processes detect changes in a source database and deliver the changed data to a target database. The difference between the two is almost entirely about what happens in the target database to the data.
What are Slowly Changing Dimensions (SCD)?
There are actually six types of SCD with the most common being Type 1, Type 2 and Type 3. In Type 1, any new data that is ingested overwrites existing data. In Type 2, new data are inserted as new records and the data that would have been overwritten are flagged as inactive or closed with effective time and expiration time assigned to the change to maintain a history.
Type 1
Data is overwritten
Type 2
History is maintained, new data is inserted as new rows
Type 3
Stores two versions per record: a previous
and a current value
For a modern data integration tool to be considered truly modern support for SCD is key. StreamSets supports both type 1 and type 2 Slowly Changing Dimensions.
What is Change Data Capture?
CDC is a method of detecting and extracting new or updated records in a source and loading just this new information into your destination. Very often, the alternative to CDC is a full load from one table to another resulting in a very costly and time consuming operation. By sipping into your target database just the delta or changed data you get a much more streamlined process.
There are actually three different ways of performing CDC: log-based, query-based, and trigger based. Differences that are explored elsewhere in our blogs in detail. Essentially, however, log-based CDC updates a log for every INSERT, UPDATE or DELETE and reads that information when it is time to insert into the target database, while trigger CDC kicks off a trigger every operation with the same result. Log-based CDC is considered to be more efficient than a trigger CDC method. Query-based CDC involves using queries to find differences between datasets and can be untenable with larger datasets as it can require much more resources to perform this comparison.
Log-based
Updates a log after every operation
Trigger-based
Kicks off a trigger after every operation
Query-based
Compares two versions of the data with queries
CDC looks the most like Type 1 Slowly Changing Dimensions as overwriting new data as it appears. It is most useful to use when you’re not worried about maintaining a history of all the changes to your database. Like most other modern data integration systems, StreamSets supports log-based CDC.
Conclusion
Regardless of the method chosen, StreamSets can help you get your data from your target to your destination in an efficient way. With support for both SCD and CDC the question of which to choose becomes one of process instead of technical limitation. Does your organization need to maintain a record of all changes to the data flowing through your systems? If that’s the case, some version of SCD might be the better choice. If your business process requires only that the changed data arrive in your target post haste, CDC might work better for you and your business. Let us help you navigate your constantly changing data flows with ease.
The post Difference Between Slowly Changing Dimensions and Change Data Capture appeared first on StreamSets.
Read MoreStreamSets