Synthetic data is information that has been generated artificially rather than recorded from real-world events. It is produced algorithmically and is used to train machine learning models, validate mathematical models, and stand in for production or operational data in test datasets. In this article, we will look at the metrics used to compare synthetic data with the original data.
The Synthetic Data Evaluation Framework in SDV makes it easier to assess the quality of your synthetic dataset by applying a variety of synthetic data metrics to it and reporting the results in a comprehensive way.
Comparing Synthetic Data and Original Data
To assess the quality of synthetic data, we essentially need two datasets: the real data and the synthetic data that attempts to imitate it.
There are three types of metrics for comparing synthetic data with the original data:
- Single Table Metrics
- Multi Table Metrics
- Time Series Metrics
Single Table Metrics
Single table metrics are divided into the following families:
- Statistical Metrics
- Likelihood Metrics
- Detection Metrics
- Machine Learning Efficacy Metrics
- Privacy Metrics
Statistical Metrics
The metrics of this family compare the real and synthetic tables by running statistical tests on them.
In the simplest case, these metrics compare individual columns of the real table with the corresponding columns of the synthetic table and then report the average result of the test across columns.
Such metrics include:
- metrics.tabular.KSTest: This metric uses the two-sample Kolmogorov-Smirnov test to compare the distributions of continuous columns through their empirical CDFs. The output for each column is 1 minus the KS test D statistic, where D is the maximum distance between the empirical CDF of the real column and the empirical CDF of the synthetic column.
- metrics.tabular.CSTest: This metric uses the Chi-Squared test to compare the distributions of two discrete columns. The output for each column is the CSTest p-value, which is read as the likelihood that the two columns were sampled from the same distribution.
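The KSTest output described above can be reproduced by hand. The following is a minimal sketch in plain Python, not the SDV implementation: the function name `ks_complement` and the sample values are made up for illustration. It builds the two empirical CDFs and returns 1 minus the largest gap between them.

```python
from bisect import bisect_right


def ks_complement(real, synthetic):
    """Return 1 - D, where D is the two-sample Kolmogorov-Smirnov statistic."""
    real_sorted = sorted(real)
    synth_sorted = sorted(synthetic)
    d_stat = 0.0
    # Empirical CDFs only change at observed values, so checking those is enough.
    for x in set(real_sorted) | set(synth_sorted):
        cdf_real = bisect_right(real_sorted, x) / len(real_sorted)
        cdf_synth = bisect_right(synth_sorted, x) / len(synth_sorted)
        d_stat = max(d_stat, abs(cdf_real - cdf_synth))
    return 1.0 - d_stat


# Made-up column values, for illustration only.
real_ages = [23, 35, 41, 52, 60]
synthetic_ages = [25, 33, 44, 50, 58]
score = ks_complement(real_ages, synthetic_ages)
print(round(score, 3))
```

A score close to 1 means the two empirical CDFs almost coincide; a score of 0 means the two columns do not overlap at all.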
Likelihood Metrics
This family of metrics compares the tables by fitting a probabilistic model to the real data and then computing the likelihood of the synthetic data under the learned distribution.
Below are some examples:
- metrics.tabular.BNLikelihood: This metric fits a BayesianNetwork to the real data and then computes the average likelihood of the synthetic data rows on it.
- metrics.tabular.BNLogLikelihood: This metric fits a BayesianNetwork to the real data and then computes the average log likelihood of the synthetic data rows on it.
- metrics.tabular.GMLogLikelihood: This metric fits several GaussianMixture models to the real data and then computes the average log likelihood of the synthetic data on them.
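To make the "fit on real, score the synthetic" idea concrete, here is a deliberately simplified sketch. It assumes a single numeric column and fits one Gaussian with the standard library's `statistics.NormalDist` instead of a Bayesian Network or a Gaussian mixture, so it only illustrates the principle. The function name and the data are hypothetical.

```python
from math import log
from statistics import NormalDist


def average_log_likelihood(real_column, synthetic_column):
    """Fit a Gaussian to the real column and score the synthetic column on it."""
    model = NormalDist.from_samples(real_column)  # learned from real data only
    # Note: values extremely far from the real data could make pdf() underflow to 0.
    log_likelihoods = [log(model.pdf(value)) for value in synthetic_column]
    return sum(log_likelihoods) / len(log_likelihoods)


# Made-up values, for illustration only.
real_heights = [4.8, 5.1, 5.0, 4.9, 5.2]
close_synthetic = [5.0, 4.9, 5.1]
far_synthetic = [7.0, 7.5, 8.0]

close_score = average_log_likelihood(real_heights, close_synthetic)
far_score = average_log_likelihood(real_heights, far_synthetic)
print(close_score > far_score)
```

Synthetic values that fall where the real data is dense receive a higher average log likelihood than values far from it.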
Detection Metrics
The metrics in this family assess how difficult it is to tell synthetic data apart from real data using a Machine Learning model. To do this, the metrics combine the real and synthetic data, add a flag to each row indicating whether it is real or synthetic, and then cross-validate a Machine Learning model that tries to predict this flag. The metrics’ output is 1 minus the average ROC AUC score across all cross-validation splits.
Below are some examples:
- metrics.tabular.LogisticDetection: A detection metric based on scikit-learn’s LogisticRegression classifier.
- metrics.tabular.SVCDetection: A detection metric based on a scikit-learn SVC classifier.
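The detection procedure can be sketched end to end without any library. In the hedged example below, a toy nearest-centroid scorer stands in for the scikit-learn classifiers named above, the folds are built by taking every n-th row of each table, and all function names and data are hypothetical. It assumes purely numeric rows.

```python
def roc_auc(labels, scores):
    """ROC AUC as the chance that a synthetic row outranks a real one (ties count half)."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


def centroid(rows):
    return [sum(col) / len(rows) for col in zip(*rows)]


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def detection_score(real_rows, synthetic_rows, n_splits=3):
    aucs = []
    for i in range(n_splits):
        # Stratified split: every n_splits-th row of each table is held out.
        test_real, test_synth = real_rows[i::n_splits], synthetic_rows[i::n_splits]
        train_real = [r for j, r in enumerate(real_rows) if j % n_splits != i]
        train_synth = [r for j, r in enumerate(synthetic_rows) if j % n_splits != i]
        real_center, synth_center = centroid(train_real), centroid(train_synth)
        # Flags: 0 = real, 1 = synthetic.
        labels = [0] * len(test_real) + [1] * len(test_synth)
        # Higher score = closer to the synthetic centroid than to the real one.
        scores = [sq_dist(r, real_center) - sq_dist(r, synth_center)
                  for r in test_real + test_synth]
        aucs.append(roc_auc(labels, scores))
    return 1.0 - sum(aucs) / len(aucs)


# Made-up rows: the synthetic table is shifted far away from the real one.
real = [[0.0, 0.0], [0.1, 0.2], [0.2, 0.1], [0.0, 0.1], [0.1, 0.0], [0.2, 0.2]]
synthetic = [[5.0, 5.0], [5.1, 5.2], [5.2, 5.1], [5.0, 5.1], [5.1, 5.0], [5.2, 5.2]]
print(detection_score(real, synthetic))  # trivially detectable
print(detection_score(real, real))       # indistinguishable
```

With the plain "1 minus average ROC AUC" formula, perfectly detectable synthetic data scores 0, while data the model cannot tell apart scores about 0.5, since a chance-level classifier has an AUC of 0.5.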
Machine Learning Efficacy Metrics
This metric family assesses whether synthetic data can replace real data when solving a Machine Learning problem. It does so by training a Machine Learning model on the synthetic data and then measuring the score the model obtains when evaluated on the real data.
Because these metrics are computed by attempting to solve a Machine Learning problem, they can only be used on datasets that contain a target column that needs to be, or can be, predicted from the rest of the data. The scores will also be inversely proportional to the difficulty of the Machine Learning problem.
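The "train on synthetic, test on real" loop can be illustrated as follows. This is a sketch under stated assumptions: a tiny 1-nearest-neighbour classifier stands in for whatever model a real efficacy metric would use, accuracy is the score, and the names and data are invented for the example.

```python
def nearest_neighbor_predict(train_rows, train_targets, row):
    """Predict the target of `row` from its closest training row (1-NN)."""
    distances = [sum((a - b) ** 2 for a, b in zip(train_row, row))
                 for train_row in train_rows]
    return train_targets[distances.index(min(distances))]


def ml_efficacy(synth_rows, synth_targets, real_rows, real_targets):
    """Accuracy on real data of a model fitted on synthetic data only."""
    hits = sum(
        nearest_neighbor_predict(synth_rows, synth_targets, row) == target
        for row, target in zip(real_rows, real_targets)
    )
    return hits / len(real_rows)


# Made-up data: one numeric feature and a binary target column.
synth_rows, synth_targets = [[1.0], [2.0], [8.0], [9.0]], [0, 0, 1, 1]
real_rows, real_targets = [[1.5], [2.5], [7.5], [4.0]], [0, 0, 1, 1]

efficacy = ml_efficacy(synth_rows, synth_targets, real_rows, real_targets)
print(efficacy)
```

Here the model learned from synthetic rows misclassifies one of the four real rows, so the efficacy score is 0.75.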
Privacy Metrics
This metric family assesses the privacy of a synthetic dataset by asking the question, “Can an attacker predict sensitive attributes in the real dataset given the synthetic data?” These metrics do this by training an adversarial attacker model on the synthetic data to predict sensitive attributes from “key” attributes and then testing its accuracy on the real data.
- Categorical metrics:
- Numerical metrics:
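For categorical attributes, the attacker idea can be sketched with a simple lookup model. The example below is an assumption-laden illustration, not one of the SDV privacy metrics: the attacker memorises the most common sensitive value for each combination of key attributes in the synthetic data, and the field names and records are invented.

```python
from collections import Counter, defaultdict


def attacker_accuracy(synthetic_rows, real_rows, key_fields, sensitive_field):
    """Accuracy of an attacker trained on synthetic data and tested on real data."""
    # "Train": most common sensitive value per key combination in the synthetic data.
    votes = defaultdict(Counter)
    for row in synthetic_rows:
        votes[tuple(row[k] for k in key_fields)][row[sensitive_field]] += 1
    # Unseen key combinations fall back to the overall most common value.
    fallback = Counter(r[sensitive_field] for r in synthetic_rows).most_common(1)[0][0]

    correct = 0
    for row in real_rows:
        key = tuple(row[k] for k in key_fields)
        guess = votes[key].most_common(1)[0][0] if key in votes else fallback
        correct += guess == row[sensitive_field]
    return correct / len(real_rows)


# Made-up records, for illustration only.
synthetic = [
    {"age": "30s", "zip": "A", "disease": "flu"},
    {"age": "30s", "zip": "A", "disease": "flu"},
    {"age": "40s", "zip": "B", "disease": "cold"},
]
real = [
    {"age": "30s", "zip": "A", "disease": "flu"},
    {"age": "40s", "zip": "B", "disease": "asthma"},
    {"age": "50s", "zip": "C", "disease": "flu"},
    {"age": "40s", "zip": "B", "disease": "cold"},
]
accuracy = attacker_accuracy(synthetic, real, ["age", "zip"], "disease")
print(accuracy)
```

The higher the attacker's accuracy on the real records, the more the synthetic data reveals about the sensitive attribute.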
Multi Table Metrics
The following are the two types of Multi Table Metrics:
Multi Single Table Metrics
These metrics simply perform a Single Table Metric on each table in the dataset and provide the average score.
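The averaging step is simple enough to show directly. In this sketch, each table is reduced to a plain list of values and `category_overlap` is a made-up toy metric; any function that takes a real table and a synthetic table and returns a score could be passed in its place.

```python
def category_overlap(real_values, synthetic_values):
    """Toy single-table metric: Jaccard overlap of the values seen in each table."""
    real_set, synth_set = set(real_values), set(synthetic_values)
    return len(real_set & synth_set) / len(real_set | synth_set)


def multi_single_table_score(real_tables, synthetic_tables, single_table_metric):
    """Apply one single-table metric to every table and average the scores."""
    scores = [
        single_table_metric(real_tables[name], synthetic_tables[name])
        for name in real_tables
    ]
    return sum(scores) / len(scores)


# Made-up dataset with two tables, each simplified to a list of values.
real_tables = {"users": ["a", "b", "c"], "orders": ["x", "y"]}
synthetic_tables = {"users": ["a", "b", "d"], "orders": ["x", "y"]}

overall = multi_single_table_score(real_tables, synthetic_tables, category_overlap)
print(overall)
```

The `users` table scores 0.5 and the `orders` table scores 1.0, so the dataset-level score is their average, 0.75.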
Parent-Child Detection Metrics
These metrics de-normalize the child table of each parent-child relationship found in the dataset, that is, they join every child row with the columns of its parent row, and then apply a Single Table Detection Metric to the resulting tables. If the dataset contains more than one parent-child relationship, the overall score is the average of the Single Table Detection Metric scores obtained for each of them.
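De-normalizing a parent-child relationship amounts to joining each child row with its parent. A minimal sketch, assuming tables are lists of dictionaries and using invented table and column names:

```python
def denormalize(parent_rows, child_rows, parent_key, foreign_key):
    """Attach the parent's columns to every child row (an inner join)."""
    parents = {row[parent_key]: row for row in parent_rows}
    flat = []
    for child in child_rows:
        parent = parents.get(child[foreign_key])
        if parent is None:
            continue  # orphan child rows are skipped in this sketch
        # Prefix parent columns so they cannot clash with child columns.
        merged = {f"parent.{name}": value for name, value in parent.items()}
        merged.update(child)
        flat.append(merged)
    return flat


# Made-up parent and child tables.
users = [{"user_id": 1, "country": "ES"}, {"user_id": 2, "country": "US"}]
sessions = [
    {"session_id": 10, "user_id": 1, "device": "mobile"},
    {"session_id": 11, "user_id": 2, "device": "tablet"},
    {"session_id": 12, "user_id": 1, "device": "pc"},
]
flat_sessions = denormalize(users, sessions, "user_id", "user_id")
print(flat_sessions[0])
```

The same join would be applied to the real and the synthetic dataset, and the two flattened tables would then be passed to a single-table detection metric.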
Time Series Metrics
The following are the two types of Time Series Metrics:
Detection Metrics
These metrics attempt to train a Machine Learning classifier that learns to differentiate real data from synthetic data and report a score based on the classifier’s success.
The metrics in this family assess how difficult it is to tell synthetic data apart from real data using a Machine Learning model. To do this, the metrics combine the real and synthetic data, add a flag to each row indicating whether it is real or synthetic, and then cross-validate a Machine Learning model that tries to predict this flag. Each split is scored with a corrected version of the ROC AUC score that returns values in the range [0, 1]. The metrics’ output is 1 minus the average of these scores across all cross-validation splits.
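The text does not spell out how the ROC AUC is corrected. One plausible correction, assumed here purely for illustration, clips the AUC at the chance level of 0.5 and rescales it so that chance maps to 0 and a perfect classifier maps to 1. The function names and the per-split AUC values are hypothetical.

```python
def corrected_auc(auc):
    """Assumed correction: AUC of 0.5 or below (chance) becomes 0, 1.0 stays 1."""
    return max(0.5, auc) * 2 - 1


def detection_output(fold_aucs):
    """1 minus the average corrected score over the cross-validation splits."""
    corrected = [corrected_auc(auc) for auc in fold_aucs]
    return 1 - sum(corrected) / len(corrected)


# Made-up raw ROC AUC values, one per cross-validation split.
print(detection_output([0.5, 0.5]))    # classifier no better than chance
print(detection_output([1.0, 1.0]))    # classifier separates the data perfectly
print(detection_output([0.75, 0.75]))  # partially detectable
```

Under this assumption the final output spans the whole [0, 1] range: 1 when the classifier cannot beat chance and 0 when it separates real from synthetic data perfectly.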
Machine Learning Efficacy Metrics
These metrics train a Machine Learning model on the synthetic data and then evaluate the model’s performance on the real data. Because they must measure the performance of a Machine Learning model on the dataset, they can only be applied to datasets that represent a Machine Learning problem.
This metric family assesses whether synthetic data can replace real data when solving a Machine Learning problem. It trains one Machine Learning model on the synthetic data and another on the real data, and then compares the scores both obtain on held-out real data. The output is the score achieved by the model fitted on synthetic data divided by the score achieved by the model fitted on real data.
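The ratio described above can be sketched with a toy model. The example assumes a single numeric feature, a binary target where class 1 has the larger values, and a threshold classifier placed midway between the two class means. All names and values are invented, and a real metric would use a proper time series model.

```python
def fit_threshold(values, labels):
    """Toy 1-D classifier: the midpoint between the two class means."""
    class0 = [v for v, y in zip(values, labels) if y == 0]
    class1 = [v for v, y in zip(values, labels) if y == 1]
    return (sum(class0) / len(class0) + sum(class1) / len(class1)) / 2


def accuracy(threshold, values, labels):
    """Share of rows where 'value above threshold' matches the label."""
    return sum((v > threshold) == bool(y) for v, y in zip(values, labels)) / len(values)


def efficacy_ratio(synthetic, real_train, real_holdout):
    """Score of the synthetic-trained model divided by the real-trained one."""
    synth_score = accuracy(fit_threshold(*synthetic), *real_holdout)
    real_score = accuracy(fit_threshold(*real_train), *real_holdout)
    return synth_score / real_score  # undefined if the real-trained model scores 0


# Made-up (values, labels) pairs.
synthetic = ([1.0, 3.0, 5.0, 7.0], [0, 0, 1, 1])
real_train = ([1.0, 2.0, 8.0, 9.0], [0, 0, 1, 1])
real_holdout = ([2.0, 4.5, 6.0, 9.0], [0, 0, 1, 1])

ratio = efficacy_ratio(synthetic, real_train, real_holdout)
print(ratio)
```

A ratio near 1 means the model trained on synthetic data performs about as well on held-out real data as the model trained on real data.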