What is a Data Warehouse?
Data warehouses are centralized repositories used to store data for an entire organization. Data warehouses contain data from many disparate data sources and can often be quite large. Data warehouses are different from other data repositories in that they are relational, meaning they store data in tables with rows and columns.
What is a Data Mart?
Data Marts are a subset of data warehouses. Ultimately, what distinguishes data marts is that they are specialized. Often described as subject-orientated, data marts are limited in scope to a specific purpose or subject. There are three types of data marts: those dependent on data warehouses, those that are wholly independent, and hybrids of the two.
Why Would an Organization Build a Completely Independent Data Mart?
Say a sales organization wants to join together Salesforce data and commission information to display sales goals in Tableau. The sales organization isn’t interested in connecting to the larger data warehouse because they want a speedy connection to the data that strikes at the point – how near or far is the sales team away from meeting their quarterly sales goals?
Dependent or Hybrid models might come into play when the sales team wants to include data from Marketo and other internal systems, and as the connected applications become larger, the need to gather these inputs into one internal source becomes greater.
Dependent and Hybrid Data Marts
Comparison At a Glance: Data Mart vs Data Warehouse
Often just a few data sources. Often subject-oriented containing data for a specific purpose
Many data sources
Cost and Size
Usually very speedy due to the limited amount of data
May have performance issues due to their large size
Sensitive data can be omitted completely
Security measures must be put in place to protect sensitive data
Ease of Implementation
Can be quite quick because the data is small
Can take years to implement fully
Short-lived. Often terminated after the purpose for which they were built is complete
Can extend throughout the lifetime of the organization.
Cost & Size
I’ll combine cost and size here because the relationship between the two is so strong. The cost for data storage can be broken down into two elements: storage and compute. Storage, of course, depends on the volume of data to be stored, while compute is determined by the transactions (SELECT, INSERT, UPDATE, DELETE) done to the data. As you might expect, the bigger the data, the bigger the compute. Data marts are, by definition, smaller data repositories than data warehouses and so naturally will cost an organization less to spin up and maintain.
After cost, the other attribute affected by size is performance. Data marts are generally speedier than data warehouses because they are smaller. A thoughtful data warehouse design is necessary to ensure that these larger repositories are performant. Considerations like indexes, query optimization, materialized views, etc., must be implemented for a data warehouse to perform like the much simpler data mart.
One interesting capability of a data mart is the sectioning off of data from parties that either aren’t interested in it or, more importantly, shouldn’t see it. For example, a data warehouse might include salary or employee retention information that shouldn’t be made available to most employees. This sensitive information can be separated securely by spinning up a data mart that just contains the data appropriate to a group of employees.
While a savvy database administrator can apply security rules to a data warehouse to ensure the same outcome, removing the possibility of access completely is a security benefit.
Ease of Implementation
It can take years for data warehouses to be fully implemented. Collecting the necessary data, gathering permissions, and storing it all in a sensible way that allows multiple teams within an organization to utilize the collected data successfully takes time. On the other hand, data marts can be spun up quickly with their much simpler designs and use cases.
Data marts can and are often spun up and then deleted after the specific use case they were created for is complete. Data warehouses have a much longer lifetime, potentially lasting years.
When to Use a Data Mart vs. a Data Warehouse
I’ve already mentioned one use case for using a data mart – a sales organization wanting to view Salesforce and commission data independent from the larger organization’s data repository.
Historically, another reason organizations have chosen to add data marts to their data architecture is data mining. Data mining is the process of manually sifting through data to come up with an analysis that could potentially be used for a business decision. However, data mining has been largely replaced with machine learning. Machines, after all, are better than us at finding patterns. However, machine learning still requires cleaned and sampled data to train models, and this is an area where a data mart might come into play again in modern data architecture.
Another frequent use case for a data mart is to power a business intelligence solution like Tableau. Users expect visualizations that load and filter quickly. One easy way to make sure this is possible is only to make a small subset of your data available to Tableau: enter the data mart.
Whenever an organization wants a centralized repository for their data – which today is most of the time – it’s time to implement a data warehouse. Extra points if your organization wants the repository to be relational.
At StreamSets, we help customers load data into popular cloud and on-prem data warehouses or directly into specific data marts. StreamSets smart data pipelines can also help you transform and cleanse your data before it gets to the data warehouse. StreamSets pipelines can also help you sync two data warehouses for hybrid cloud use cases.