Schema on Write vs. Schema on Read

By mullaned2002

January 3, 2023

In the simplest terms, a schema is the structure of data inside a database. That structure can include things like field and table names, views, indexes, and snapshots. The definition of schema will often expand to include the relationships between data, for example, the primary and foreign keys that logically connect separate tables.

Analytics systems and legacy data management systems require a schema, which can be generated either on write or on read. When schema is generated on write, the schema comes before the data. A very common schema on write scenario is that a data engineer creates several tables in a relational database with a rigid schema, connected to each other by primary and foreign keys. Then, the data engineer populates the tables with data. In a schema on read scenario, different types of data, potentially both structured and unstructured, are loaded into the destination as they are, and the schema is generated when queries against the data are executed. This means the data engineer can spend more time crafting queries to gain better insights and less time carefully defining fields up front.
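To make the two scenarios above concrete, here is a minimal sketch in Python. It assumes an in-memory SQLite database for the schema on write side and a list of raw JSON strings for the schema on read side. The table names, field names, and the helper functions `write_is_rejected` and `read_with_schema` are all hypothetical and chosen only for illustration.

```python
import json
import sqlite3

# Schema on write: the structure is declared before any data arrives.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.execute(
    "CREATE TABLE orders ("
    " id INTEGER PRIMARY KEY,"
    " customer_id INTEGER NOT NULL REFERENCES customers(id),"
    " total REAL NOT NULL)"
)
conn.execute("INSERT INTO customers (id, name) VALUES (1, 'Ada')")
conn.execute("INSERT INTO orders (id, customer_id, total) VALUES (10, 1, 25.0)")


def write_is_rejected(sql):
    """Return True if the database refuses the statement."""
    try:
        conn.execute(sql)
        return False
    except sqlite3.Error:
        return True


# A field that is not part of the declared schema cannot be written.
unknown_column_rejected = write_is_rejected(
    "INSERT INTO orders (id, customer_id, total, coupon) VALUES (11, 1, 5.0, 'X')"
)

# Schema on read: raw records are stored untouched, whatever fields they carry.
raw_lines = [
    '{"id": 10, "customer_id": 1, "total": 25.0}',
    '{"id": 11, "customer_id": 1, "total": 5.0, "coupon": "X"}',
]


def read_with_schema(lines, fields):
    """Apply a schema at query time: keep the requested fields, None if absent."""
    return [{f: rec.get(f) for f in fields} for rec in map(json.loads, lines)]


# The "coupon" field was never declared anywhere, yet it can be queried.
coupons = read_with_schema(raw_lines, ["id", "coupon"])
```

In this sketch the relational side refuses the unexpected `coupon` field at write time, while the raw-record side accepts it and only decides which fields matter when the query runs.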

Schema Past and Future

Schema on write was the default method for decades. Data engineers would spend a significant portion of their time defining schema and relationships before ever starting to analyze their data. Today, more modern data tools tend towards schema on read. The trend is towards automation of time-consuming and manual processes that don’t need human intervention. Defining schema falls squarely into this bucket.

At a Glance: Schema on Read vs. Schema on Write

We’ve already gone over the main differentiator between schema on read and schema on write, but there are other more subtle differences. Let’s explore them.

| | Schema on Write | Schema on Read |
| --- | --- | --- |
| Schema | User has to define a schema | Schema is inferred from the data |
| Data | Structured and relational | Unstructured and structured |
| End User Experience | The only queryable data is pre-selected | Allows richer data exploration |
| Positive Features | Lightweight | Adaptable |

When you have to define your data before it arrives at your destination, as you do with schema on write, it most often has to be structured and relational. Schema on read, on the other hand, can handle all kinds of data, including unstructured and structured.
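As a rough illustration of what "schema is inferred from the data" can mean, the sketch below scans a handful of records and collects the fields and value types it finds. The `infer_schema` function and the sample records are assumptions made for this example, not the behavior of any particular product.

```python
def infer_schema(records):
    """Map each field name to the set of Python type names observed for it."""
    schema = {}
    for record in records:
        for field, value in record.items():
            schema.setdefault(field, set()).add(type(value).__name__)
    return schema


# Records of mixed shape: fields come and go, and "id" changes type.
records = [
    {"id": 1, "name": "Ada"},
    {"id": 2, "name": "Lin", "tags": ["vip"]},
    {"id": "3", "note": None},
]

schema = infer_schema(records)
```

Nothing had to be declared before the records arrived. The inferred result also shows that `id` appears as both an integer and a string, which is the kind of inconsistency a schema on write system would have rejected at load time.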

Regarding the end user experience, schema on write forces data architects and engineers to be explicit about what data goes to their warehouse before they can analyze it. This can pose a problem: any field left out of the schema is not available to query until the schema is changed. Schema on read allows for more flexibility and a richer data exploration experience because analysts can pull in fields as needed.

Finally, while there is no right or wrong way to apply schema, both approaches have positive features. Schema on read benefits from the adaptability inherent in its design, while schema on write offers a lightweight solution that can deliver very fast query performance, because the structure of the data is already settled by the time a query runs.

StreamSets and Schema

StreamSets aligns with the more modern way of handling schema, taking a schema on read approach. This design choice means pipelines don’t need to be rewritten when new fields are created in the origin. Instead, the schema is inferred and passed to the transforms and destination without the need for human intervention. This makes for robust pipelines that can adapt to change. In other words, StreamSets pipelines respond automatically to data drift, a critical function for a modern data strategy.
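The idea of a pipeline that tolerates new fields can be sketched in a few lines of plain Python. This is not the StreamSets API. The `run_pipeline` and `uppercase_name` functions and the sample records are hypothetical, and the sketch only assumes that each transform touches the fields it cares about and passes everything else through.

```python
def uppercase_name(record):
    """Transform that changes one field and passes every other field through."""
    out = dict(record)
    if "name" in out:
        out["name"] = out["name"].upper()
    return out


def run_pipeline(records, transforms):
    """Send each record through the transforms and track the fields seen so far."""
    destination = []
    seen_fields = []
    for record in records:
        for transform in transforms:
            record = transform(record)
        for field in record:
            if field not in seen_fields:
                seen_fields.append(field)
        destination.append(record)
    return destination, seen_fields


# The second record carries a "region" field the first one did not have.
origin = [
    {"id": 1, "name": "ada"},
    {"id": 2, "name": "lin", "region": "EU"},
]

rows, fields = run_pipeline(origin, [uppercase_name])
```

Because no stage hard-codes the full list of fields, the new `region` field reaches the destination without any change to the pipeline code.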

The post Schema on Write vs. Schema on Read appeared first on StreamSets.
