In an increasingly data-centric world, enterprises must both gather valuable physical information and generate the information that they need but can’t easily capture. Data access, regulation, and compliance are a growing source of friction for innovation in analytics and artificial intelligence (AI).
For highly regulated sectors such as Financial Services, Healthcare, Life Sciences, Automotive, Robotics, and Manufacturing, the problem is even greater. It creates barriers to system design, data sharing (internal and external), monetization, analytics, and machine learning (ML).
Synthetic data is a tool that addresses many data challenges, particularly AI and analytics issues such as privacy protection, regulatory compliance, accessibility, data scarcity, and bias. It also helps with data sharing and time to data (and therefore time to market).
Synthetic data is algorithmically generated. It mirrors statistical properties and patterns from the source data. But importantly it contains no sensitive, private, or personal data points.
The goal is that you can ask questions of the synthetic data and get answers close to those that you would get from the real data.
In our earlier post, we demonstrated how to use Generative Adversarial Networks (GANs) to generate tabular datasets to enhance credit fraud model training.
For business stakeholders to adopt synthetic data for their ML and analytics projects, it’s imperative not only that the generated synthetic data fits the purpose and the expected downstream applications, but also that they can measure and demonstrate the quality of the generated data.
With increasing legal and ethical obligations to preserve privacy, one of synthetic data’s strengths is the ability to remove sensitive and original information during its synthesis. Therefore, in addition to quality, we need metrics to evaluate the risk of private information leaks, if any, and to verify that the generation process isn’t “memorizing” or copying any of the original data.
To achieve all of this, we can map the quality of synthetic data into dimensions, which help the users, stakeholders, and us to better understand the generated data.
The three dimensions of synthetic data quality evaluation
The generated synthetic data is measured against three key dimensions: fidelity, utility, and privacy.
These are some of the questions about any generated synthetic data that should be answered by a synthetic data quality report:
How similar is this synthetic data as compared to the original training set?
How useful is this synthetic data for our downstream applications?
Has any information been leaked from the original training data into the synthetic data?
Has any data which is considered sensitive in the real world (from other data sets not used for training the model) been inadvertently synthesized by our model?
The metrics that translate each one of these dimensions for the end-users are somewhat flexible. After all, the data to be generated can vary in terms of distributions, size, and behaviors. The metrics should also be easy to grasp and interpret.
Ultimately, the metrics must be completely data-driven and must not require any prior knowledge or domain-specific information. However, if the user wants to apply rules and constraints that are specific to a business domain, then they should be able to define them during the synthesis process to make sure that the domain-specific fidelity is met.
We look at the metrics for each of these dimensions in more detail in the following sections.
Metrics to understand fidelity
In any data science project, we must understand whether a certain sample population is relevant to the problem that we’re solving. Similarly, for the process of assessing the relevance of the synthetic data generated, we must evaluate it in terms of fidelity as compared to the original.
Visual representations of these metrics make them easier to comprehend. We could illustrate whether the cardinality and ratio of categories were respected, the correlations between the different variables were kept, and so on.
Visualizing the data not only helps to evaluate the quality of the synthetic data, but also fits in as one of the initial steps in the data science lifecycle for a better understanding of the data.
Let’s dive into some fidelity metrics in more detail.
Exploratory statistical comparisons
In the exploratory statistical comparisons, the features of the original and synthetic datasets are explored using key statistical measures. For continuous features, these include the mean, median, standard deviation, distinct values, missing values, minima, maxima, and quartile ranges. For categorical attributes, they include the number of records per category, missing values per category, and the most frequently occurring characters.
This comparison should be conducted between the original hold-out dataset and the synthetic data. This evaluation would reveal if the datasets compared are statistically similar. If they aren’t, then we’ll have an understanding of which features and measures are different. You should consider retraining and regenerating the synthetic data with different parameters if a significant difference is noted.
This test acts as an initial screening to make sure that the synthetic data has reasonable fidelity to the original dataset and can therefore usefully undergo more rigorous testing.
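As a minimal sketch of such a screening for one continuous feature, the following uses NumPy only. The function names `summary_stats` and `compare_stats` are hypothetical and the data is randomly generated for illustration; a real report would cover every feature and also handle categorical attributes.

```python
import numpy as np

def summary_stats(values):
    """Descriptive statistics for one numeric feature; NaN marks a missing value."""
    values = np.asarray(values, dtype=float)
    present = values[~np.isnan(values)]
    q1, median, q3 = np.percentile(present, [25, 50, 75])
    return {
        "mean": float(present.mean()),
        "std": float(present.std(ddof=1)),
        "min": float(present.min()),
        "q1": float(q1),
        "median": float(median),
        "q3": float(q3),
        "max": float(present.max()),
        "missing": int(np.isnan(values).sum()),
        "distinct": int(np.unique(present).size),
    }

def compare_stats(original, synthetic):
    """Absolute gap per statistic between two samples of the same feature."""
    a, b = summary_stats(original), summary_stats(synthetic)
    return {name: abs(a[name] - b[name]) for name in a}

# Illustrative data only: a hold-out sample and a slightly different synthetic sample.
rng = np.random.default_rng(0)
holdout = rng.normal(50, 10, size=1000)
synthetic = rng.normal(51, 12, size=1000)
for name, gap in compare_stats(holdout, synthetic).items():
    print(f"{name:>8}: {gap:.3f}")
```

Large gaps on a given statistic point to the features that deserve a closer look before moving on to the stricter metrics.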
Histogram similarity score
The histogram similarity score compares, for each feature, the marginal distributions of the synthetic and original datasets.
The similarity score is bounded between zero and one, with a score of one indicating that the synthetic data distributions perfectly overlap the distributions of the original data.
A score close to one would give the users the confidence that the holdout dataset and the synthetic dataset are statistically similar.
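One common way to obtain an overlap score between zero and one is histogram intersection. The sketch below assumes that definition and a fixed number of equal-width bins; the article does not specify the exact formula, so treat `histogram_similarity` and its `bins` parameter as illustrative choices.

```python
import numpy as np

def histogram_similarity(original, synthetic, bins=20):
    """Histogram intersection of two samples of one numeric feature (0 to 1)."""
    original = np.asarray(original, dtype=float)
    synthetic = np.asarray(synthetic, dtype=float)
    lo = min(original.min(), synthetic.min())
    hi = max(original.max(), synthetic.max())
    if lo == hi:
        # Both samples hold a single identical value, so they overlap fully.
        return 1.0
    # Shared bin edges so that both histograms are directly comparable.
    edges = np.linspace(lo, hi, bins + 1)
    p, _ = np.histogram(original, bins=edges)
    q, _ = np.histogram(synthetic, bins=edges)
    p = p / p.sum()
    q = q / q.sum()
    # Overlap is the sum of the smaller probability mass in each bin.
    return float(np.minimum(p, q).sum())

rng = np.random.default_rng(0)
holdout = rng.normal(0, 1, size=5000)
close = rng.normal(0, 1, size=5000)
shifted = rng.normal(3, 1, size=5000)
print("similar distributions:", round(histogram_similarity(holdout, close), 3))
print("shifted distribution: ", round(histogram_similarity(holdout, shifted), 3))
```

Averaging this value over all features gives a single dataset-level number, at the cost of hiding which feature drifted.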
Mutual information score
The mutual information score measures the mutual dependence of two features, numerical or categorical, indicating how much information can be obtained from one feature by observing another.
Mutual information can capture non-linear relationships, so it gives a more comprehensive view of synthetic data quality by showing how well the relationships between variables have been preserved.
A score of one indicates that the mutual dependence between features has been perfectly captured in the synthetic data.
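The following sketch shows one way such a score could be computed for discrete features. It assumes that numerical features have already been binned, that mutual information is normalized by the smaller of the two entropies, and that the score is one minus the mean absolute gap across feature pairs. These are assumptions for illustration, and all function names are hypothetical.

```python
import numpy as np
from itertools import combinations

def _codes(values):
    """Map arbitrary discrete values to integer codes 0..k-1."""
    _, codes = np.unique(np.asarray(values), return_inverse=True)
    return codes.ravel()

def _entropy(codes):
    """Shannon entropy (in nats) of an integer-coded feature."""
    counts = np.bincount(codes)
    p = counts[counts > 0] / codes.size
    return float(-(p * np.log(p)).sum())

def normalized_mi(x, y):
    """I(X;Y) = H(X) + H(Y) - H(X,Y), scaled to [0, 1] by min(H(X), H(Y))."""
    cx, cy = _codes(x), _codes(y)
    hx, hy = _entropy(cx), _entropy(cy)
    if hx == 0.0 or hy == 0.0:
        return 0.0  # a constant feature carries no information
    joint = cx * (cy.max() + 1) + cy  # unique code per (x, y) combination
    mi = hx + hy - _entropy(joint)
    return float(np.clip(mi / min(hx, hy), 0.0, 1.0))

def mutual_information_score(original_cols, synthetic_cols):
    """1 minus the mean absolute gap in normalized MI over all feature pairs."""
    pairs = combinations(range(len(original_cols)), 2)
    gaps = [
        abs(normalized_mi(original_cols[i], original_cols[j])
            - normalized_mi(synthetic_cols[i], synthetic_cols[j]))
        for i, j in pairs
    ]
    return 1.0 - float(np.mean(gaps))

# Illustrative data: the synthetic set breaks the dependence between two features.
original = [["a", "a", "b", "b"], [0, 0, 1, 1], ["x", "y", "x", "y"]]
synthetic = [["a", "a", "b", "b"], [0, 1, 0, 1], ["x", "y", "x", "y"]]
print(mutual_information_score(original, synthetic))
```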
Correlation score
The correlation score measures how well the correlations in the original dataset have been captured in the synthetic data.
Correlations between two or more columns are extremely important for ML applications, because they help uncover relationships between features and the target variable and help create a well-trained model.
The correlation score is bounded between zero and one, with a score of one indicating that the correlations have been perfectly matched.
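A simple way to turn this idea into a number between zero and one is to compare the two Pearson correlation matrices. The sketch below assumes numeric columns with non-zero variance and rescales the mean absolute difference by two, because two correlation coefficients can differ by at most two. The formula and the name `correlation_score` are assumptions, not the definition used by any particular tool.

```python
import numpy as np

def correlation_score(original, synthetic):
    """1 minus the scaled mean absolute gap between pairwise Pearson correlations.

    Both inputs are 2-D arrays (rows = records, columns = numeric features).
    """
    corr_orig = np.corrcoef(np.asarray(original, dtype=float), rowvar=False)
    corr_synth = np.corrcoef(np.asarray(synthetic, dtype=float), rowvar=False)
    # Ignore the diagonal, which is always 1 in both matrices.
    off_diagonal = ~np.eye(corr_orig.shape[0], dtype=bool)
    gap = np.abs(corr_orig - corr_synth)[off_diagonal].mean()
    return float(1.0 - gap / 2.0)

rng = np.random.default_rng(0)
x = rng.normal(size=2000)
original = np.column_stack([x, 2 * x + rng.normal(scale=0.5, size=2000)])
# Illustrative synthetic data that keeps the marginals but loses the dependence.
synthetic = np.column_stack([rng.permutation(original[:, 0]), original[:, 1]])
print("vs. itself:   ", correlation_score(original, original))
print("vs. shuffled: ", round(correlation_score(original, synthetic), 3))
```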
Unlike structured tabular data, which we commonly encounter in data problems, some types of structured data have a particular behavior where past observations have a probability of influencing the following observation. These are known as time-series or sequential data – for example, a dataset with hourly measurements of room temperature.
This behavior means that we need metrics that specifically measure the quality of these time-series datasets.
Autocorrelation and partial autocorrelation score
Although similar to correlation, autocorrelation shows the relationship between the present value of a time series and its previous values. Removing the effects of the intermediate time lags yields partial autocorrelation. Therefore, the autocorrelation score measures how well the synthetic data has captured the significant autocorrelations, or partial autocorrelations, from the original dataset.
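The sketch below illustrates the autocorrelation half of this idea with NumPy. It computes the sample autocorrelation for the first few lags of each series and turns the mean absolute gap into a score. Partial autocorrelation is not implemented here, and the scoring formula and function names are assumptions for illustration.

```python
import numpy as np

def autocorrelation(series, max_lag):
    """Sample autocorrelation of a 1-D series for lags 1..max_lag."""
    x = np.asarray(series, dtype=float)
    x = x - x.mean()
    denom = (x * x).sum()
    return np.array([(x[:-lag] * x[lag:]).sum() / denom
                     for lag in range(1, max_lag + 1)])

def autocorrelation_score(original, synthetic, max_lag=10):
    """1 minus the scaled mean absolute gap between the two autocorrelation curves."""
    gap = np.abs(autocorrelation(original, max_lag)
                 - autocorrelation(synthetic, max_lag)).mean()
    # Autocorrelations lie in [-1, 1], so a gap can be at most 2.
    return float(1.0 - gap / 2.0)

# Illustrative data: an hourly temperature-like signal with a daily cycle.
rng = np.random.default_rng(0)
hours = np.arange(24 * 14)
original = 20 + 3 * np.sin(2 * np.pi * hours / 24) + rng.normal(scale=0.3, size=hours.size)
shuffled = rng.permutation(original)  # same values, temporal structure destroyed
print("vs. itself:   ", autocorrelation_score(original, original))
print("vs. shuffled: ", round(autocorrelation_score(original, shuffled), 3))
```

The shuffled series has exactly the same histogram as the original, which is why a time-series metric is needed in addition to the marginal comparisons.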
Metrics to understand utility
At this point, we may have established statistically that the synthetic data is similar to the original dataset. In addition, we must also assess how well the synthesized dataset fares on common data science problems when used to train several ML algorithms.
Using the following utility metrics, we aim to build confidence that downstream applications can achieve performance on the synthetic data that is comparable to their performance on the original data.
Prediction score
Measuring the performance of synthetic data as compared to the original real data can be done through ML models. The downstream model score captures the quality of the synthetic data by comparing the performance of ML models trained on both the synthetic and original datasets and validated on withheld testing data from the original dataset. This provides a Train Synthetic Test Real (TSTR) score and a Train Real Test Real (TRTR) score respectively.
Figure: TSTR and TRTR scores, and the feature importance score (image by author)
The score incorporates a wide range of the most trusted ML algorithms for either regression or classification tasks. Using several classifiers and regressors makes sure that the score is more generalizable across most algorithms, so that the synthetic data may be considered useful in the future.
In the end, if the TSTR score and TRTR score are comparable, this indicates that the synthetic data has the quality to be used to train effective ML models for real-world applications.
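To make the TSTR and TRTR comparison concrete, the sketch below trains the same model twice and evaluates both on the same withheld real data. A tiny nearest-centroid classifier written in NumPy stands in for the several classifiers and regressors described above, so the model, the accuracy metric, and all names are simplifying assumptions.

```python
import numpy as np

def fit_centroids(X, y):
    """Stand-in model: one mean vector (centroid) per class."""
    classes = np.unique(y)
    centroids = np.array([X[y == c].mean(axis=0) for c in classes])
    return classes, centroids

def predict(model, X):
    """Assign each row to the class with the nearest centroid."""
    classes, centroids = model
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return classes[dists.argmin(axis=1)]

def accuracy(model, X, y):
    return float((predict(model, X) == y).mean())

def tstr_trtr(X_real_train, y_real_train, X_synth, y_synth, X_real_test, y_real_test):
    """Return (TSTR, TRTR): both models are scored on the same real test set."""
    trtr = accuracy(fit_centroids(X_real_train, y_real_train), X_real_test, y_real_test)
    tstr = accuracy(fit_centroids(X_synth, y_synth), X_real_test, y_real_test)
    return tstr, trtr

def make_data(rng, n):
    """Illustrative two-class data; a real synthesizer would produce X_synth."""
    y = rng.integers(0, 2, size=n)
    X = rng.normal(size=(n, 2)) + 2.0 * y[:, None]
    return X, y

rng = np.random.default_rng(0)
X_train, y_train = make_data(rng, 500)
X_test, y_test = make_data(rng, 200)
X_synth, y_synth = make_data(rng, 500)  # stands in for generated data
tstr, trtr = tstr_trtr(X_train, y_train, X_synth, y_synth, X_test, y_test)
print(f"TSTR={tstr:.3f}  TRTR={trtr:.3f}  gap={abs(tstr - trtr):.3f}")
```

In practice, the stand-in model would be replaced by a set of established estimators, with the TSTR and TRTR values averaged or reported per algorithm.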
Feature importance score
Highly related to the prediction score, the feature importance (FI) score extends it by adding interpretability to the TSTR and TRTR scores.
The FI score compares the changes and stability of the feature importance order obtained with the prediction score. A synthetic set of data is considered to be of high utility if it yields the same order of feature importance as the original real data.
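One way to quantify whether two models agree on the order of feature importance is a rank correlation. The sketch below assumes that the importance vectors have already been extracted from a model trained on real data and a model trained on synthetic data, and it uses Spearman's rank correlation (without tie handling) as the comparison. The choice of Spearman and the function name are assumptions.

```python
import numpy as np

def importance_rank_agreement(importance_real, importance_synth):
    """Spearman rank correlation between two feature importance vectors.

    Returns 1.0 when both models rank the features identically and -1.0 when
    the order is fully reversed. Ties are not given special treatment.
    """
    real = np.asarray(importance_real, dtype=float)
    synth = np.asarray(importance_synth, dtype=float)
    # Rank 0 = most important feature.
    rank_real = np.argsort(np.argsort(-real))
    rank_synth = np.argsort(np.argsort(-synth))
    return float(np.corrcoef(rank_real, rank_synth)[0, 1])

# Hypothetical importances for four features from the two models.
from_real_model = [0.50, 0.30, 0.15, 0.05]
from_synth_model = [0.45, 0.35, 0.05, 0.15]  # last two features swap places
print(importance_rank_agreement(from_real_model, from_synth_model))
```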
QScore
To make sure that a model trained on our newly generated data is going to produce the same answers to the same questions as a model trained using the original data, we use the QScore. This measures the downstream performance of the synthetic data by running many random aggregation-based queries on both the synthetic and original (and holdout) datasets.
The idea here is that each query should return similar results on both datasets.
A high QScore indicates that downstream applications that rely on querying and aggregation operations can get close to the same value from the synthetic data as from the original dataset.
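The sketch below illustrates the idea with a single query type: the fraction of records whose value for a randomly chosen numeric feature falls within a randomly chosen range. The query family, the scoring formula, and the name `qscore` are assumptions made for illustration; a fuller version would mix counts, sums, and averages over several columns.

```python
import numpy as np

def qscore(original, synthetic, n_queries=200, seed=0):
    """1 minus the mean absolute gap over random range-count queries.

    Both inputs are 2-D numeric arrays with the same columns.
    """
    original = np.asarray(original, dtype=float)
    synthetic = np.asarray(synthetic, dtype=float)
    rng = np.random.default_rng(seed)
    gaps = []
    for _ in range(n_queries):
        col = rng.integers(original.shape[1])
        # Random range whose bounds are values observed in the original data.
        lo, hi = np.sort(rng.choice(original[:, col], size=2))
        frac_orig = np.mean((original[:, col] >= lo) & (original[:, col] <= hi))
        frac_synth = np.mean((synthetic[:, col] >= lo) & (synthetic[:, col] <= hi))
        gaps.append(abs(frac_orig - frac_synth))
    return float(1.0 - np.mean(gaps))

rng = np.random.default_rng(1)
original = rng.normal(size=(1000, 3))
similar = rng.normal(size=(1000, 3))
shifted = rng.normal(loc=1.0, size=(1000, 3))
print("similar data:", round(qscore(original, similar), 3))
print("shifted data:", round(qscore(original, shifted), 3))
```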
Metrics to understand privacy
With privacy regulations already in place, it’s an ethical obligation and a legal requirement to make sure that sensitive information is protected.
Before this synthetic data can be shared freely and used for downstream applications, we must consider the privacy metrics that can help the stakeholder understand where the generated synthetic data stands as compared to the original data in terms of the extent of leaked information. Moreover, we must make critical decisions regarding how the synthetic data can be shared and used.
Exact match score
A direct and intuitive evaluation of privacy is to look for copies of the real data among the synthetic records. The exact match score counts the number of real records that can be found among the synthetic set.
The score should be zero, stating that no real information is present as-is in the synthetic data. This metric acts as a screening mechanism before we evaluate further privacy metrics.
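A minimal sketch of this check, using only the Python standard library, is shown below. It treats each record as a tuple and assumes that values are directly comparable; in practice, floating-point columns may need rounding first. The function name is hypothetical.

```python
def exact_match_count(real_rows, synthetic_rows):
    """Number of real records that also appear verbatim in the synthetic set."""
    synthetic_set = {tuple(row) for row in synthetic_rows}
    return sum(1 for row in real_rows if tuple(row) in synthetic_set)

# Illustrative records: (age, postcode, diagnosis).
real = [(34, "1011", "flu"), (58, "2042", "asthma"), (41, "3090", "flu")]
synthetic = [(35, "1011", "flu"), (58, "2042", "asthma"), (40, "3000", "cold")]
print(exact_match_count(real, synthetic))  # one real record was copied as-is
```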
Neighbors’ privacy score
Furthermore, the neighbors’ privacy score measures the ratio of synthetic records that might be too close in similarity to the real ones. This means that, although they aren’t direct copies, they are potential points of privacy leakage and a source of useful information for inference attacks.
The score is calculated by conducting a high-dimensional nearest-neighbors search on the synthetic data overlapped with the original data.
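The following sketch shows the core of such a search with a brute-force distance computation in NumPy. It assumes numeric features that have already been scaled to comparable ranges, uses Euclidean distance, and takes the closeness threshold as a parameter that the user must choose. These are all assumptions, and larger datasets would call for an indexed nearest-neighbor search instead.

```python
import numpy as np

def too_close_ratio(real, synthetic, threshold):
    """Share of synthetic records whose nearest real record is closer than threshold."""
    real = np.asarray(real, dtype=float)
    synthetic = np.asarray(synthetic, dtype=float)
    # Pairwise Euclidean distances, shape (n_synthetic, n_real).
    dists = np.linalg.norm(synthetic[:, None, :] - real[None, :, :], axis=2)
    nearest = dists.min(axis=1)
    return float((nearest < threshold).mean())

real = np.array([[0.0, 0.0], [10.0, 10.0]])
synthetic = np.array([[0.0, 0.1], [5.0, 5.0], [10.0, 10.0]])
print(too_close_ratio(real, synthetic, threshold=0.5))
```

Here two of the three synthetic records sit within the threshold of a real record, even though only one of them is an exact copy.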
Membership inference score
In the data science lifecycle, once a model has been trained, it no longer needs access to the training samples and can make predictions on unseen data. Similarly, in our case, once the synthesizer model is trained, samples of synthetic data can be generated without the need for the original data.
Through a type of attack called a “membership inference attack”, attackers can attempt to reveal the data that was used to create the synthetic data, without having access to the original data. This results in a compromise of privacy.
The membership inference score measures the likelihood of a membership inference attack being successful.
A low score suggests that an attacker could feasibly infer that a particular record was a member of the training dataset that led to the creation of the synthetic data. In other words, the attack can reveal whether an individual’s record was used, thereby compromising privacy.
A high membership inference score indicates that an attacker is unlikely to determine whether a particular record was part of the original dataset used to create the synthetic data. This suggests that individuals’ information is unlikely to be compromised through this type of attack.
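As a rough illustration, the sketch below simulates a simple distance-based attack. The attacker sees candidate records, half taken from the training data and half from the holdout data, and guesses "member" for the candidates that lie closest to a synthetic record. The attack strategy, the median threshold, and the mapping from attack accuracy to a score are all assumptions; real evaluations typically use stronger attacks.

```python
import numpy as np

def membership_inference_score(train, holdout, synthetic):
    """Score in [0, 1]: 1 means the attack does no better than random guessing."""
    train = np.asarray(train, dtype=float)
    holdout = np.asarray(holdout, dtype=float)
    synthetic = np.asarray(synthetic, dtype=float)
    n = min(len(train), len(holdout))  # balanced candidate set
    candidates = np.vstack([train[:n], holdout[:n]])
    is_member = np.r_[np.ones(n, dtype=bool), np.zeros(n, dtype=bool)]
    # Distance from each candidate to its nearest synthetic record.
    dists = np.linalg.norm(candidates[:, None, :] - synthetic[None, :, :], axis=2)
    nearest = dists.min(axis=1)
    # Attacker's guess: the closer half of the candidates were training members.
    guess = nearest < np.median(nearest)
    attack_accuracy = (guess == is_member).mean()
    # Accuracy 0.5 (chance) maps to 1.0; accuracy 1.0 or 0.0 maps to 0.0.
    return float(1.0 - 2.0 * abs(attack_accuracy - 0.5))

rng = np.random.default_rng(0)
train = rng.normal(size=(200, 4))
holdout = rng.normal(size=(200, 4))
memorized = train + rng.normal(scale=0.01, size=train.shape)  # near copies
fresh = rng.normal(size=(200, 4))                             # independent draws
print("memorizing synthesizer:", round(membership_inference_score(train, holdout, memorized), 3))
print("independent samples:   ", round(membership_inference_score(train, holdout, fresh), 3))
```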
The holdout concept
An important best practice is to make sure that the synthetic data is general enough and doesn’t overfit the original data on which it was trained. In a typical data science flow, while building ML models such as a Random Forest classifier, we set aside test data, train the models using the training data, and evaluate the metrics on unseen test data.
Similarly, for synthetic data, we keep aside a sample of the original data – generally referred to as a hold-out dataset or unseen withheld test data – and evaluate the generated synthetic data against the hold-out dataset.
The holdout dataset is expected to be representative of the original data, yet it is not seen when the synthetic data is generated. Therefore, it’s vital that each metric yields similar scores whether the synthetic data is compared to the original training data or to the holdout dataset.
When similar scores are obtained, we can establish that the synthetic data points aren’t a result of memorization of the original data points, while preserving the same fidelity and utility.
The world is starting to understand the strategic importance of synthetic data. As data scientists and data generators, it’s our duty to build trust in the synthetic data that we generate and make sure that it’s fit for purpose.
Synthetic data is evolving into a must-have in the data science development toolkit. MIT Technology Review has noted synthetic data as one of the breakthrough technologies of 2022. According to Gartner, it is hard to imagine building high-value AI models without synthetic data.
According to McKinsey, synthetic data minimizes costs and barriers that you would otherwise have when developing algorithms or getting access to data.
The generation of synthetic data is about knowing the downstream applications and understanding the trade-offs between the different dimensions for the quality of synthetic data.
As the user of the synthetic data, it’s essential to define the context of the use case for which every sample of synthetic data will be used in the future. Just as with real data, the quality of the synthetic data depends on the intended use case, as well as on the parameters chosen for synthesis.
For example, keeping outliers in the synthetic data as they appear in the original data is useful for a fraud detection use case. However, it’s not useful in a healthcare use case with privacy concerns, because outliers can be a source of information leakage.
Moreover, a tradeoff exists between fidelity, utility, and privacy. Data can’t be optimized for all three simultaneously. These metrics enable the stakeholders to prioritize what is essential for each use case and manage expectations from the generated synthetic data.
Ultimately, when we see the values of each metric and when they meet expectations, stakeholders can be confident in the solutions that they build using the synthetic data.
The use cases for structured synthetic data cover a wide gamut of applications, from test data for software development to creating synthetic control arms in clinical trials.
Reach out to explore these opportunities or build a PoC to demonstrate the value.
Read more: AWS Machine Learning Blog