QuantPi analyzes authenticity metrics and methods for detecting mode deficiencies, which together enable the quality assessment of synthetic medical data.
Biomedicine is becoming increasingly data-driven, producing huge collections of patient data. The integration of multiple data sources in multi-omics is a major driver in understanding a range of diseases, in precision medicine, and in drug development and repurposing in the pharma sector.
However, biomedical data is highly sensitive. For imaging data this is readily apparent, and as our understanding of the human genome improves, multi-omics data too can reveal personal and medical information about its source. Sharing of biomedical data is therefore highly regulated. To facilitate progress in this important research direction, we need novel and innovative ways to share data while providing strong guarantees with respect to privacy protection.
QuantPi develops various metrics to assess the quality of synthetic data and its generators. These metrics will be included in the overall benchmarking effort for synthetically generated biomedical data and will be agnostic to the underlying generative model to ensure broad applicability.
How to measure the data quality of generative models is an open research question. On the one hand, there are task-independent metrics that measure agreement in the joint distribution or its marginals. On the other hand, there are task-dependent metrics that measure utility for a specific use case. Borrowing concepts from statistics, information theory, and feature selection, QuantPi combines several of these metrics into a more holistic assessment of synthetic data quality.
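To make these two families concrete, here is a minimal Python sketch of one metric from each: a marginal-agreement score based on the Jensen-Shannon distance (task-independent) and a "train on synthetic, test on real" utility score (task-dependent). This is an illustration for tabular data, not QuantPi's implementation; the function names and defaults are our own.

```python
# Illustrative sketch of two synthetic-data quality metrics (not QuantPi's
# actual implementation; function names and defaults are placeholders).
import numpy as np
from scipy.spatial.distance import jensenshannon
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

def marginal_agreement(real_col, synth_col, bins=20):
    """Task-independent: Jensen-Shannon distance between the histograms
    (marginals) of one feature in real vs. synthetic data.
    0 = identical marginals; 1 = maximal disagreement (base 2)."""
    lo = min(real_col.min(), synth_col.min())
    hi = max(real_col.max(), synth_col.max())
    p, _ = np.histogram(real_col, bins=bins, range=(lo, hi))
    q, _ = np.histogram(synth_col, bins=bins, range=(lo, hi))
    return jensenshannon(p, q, base=2)  # scipy normalizes the histograms

def tstr_utility(X_synth, y_synth, X_real_test, y_real_test):
    """Task-dependent: 'train on synthetic, test on real' (TSTR).
    A classifier fitted only on synthetic data should still predict
    well on held-out real data if the synthetic data is useful."""
    clf = LogisticRegression(max_iter=1000).fit(X_synth, y_synth)
    return accuracy_score(y_real_test, clf.predict(X_real_test))

# Toy check: a slightly miscalibrated generator yields a small but
# non-zero marginal disagreement.
rng = np.random.default_rng(0)
real = rng.normal(0.0, 1.0, 1000)
synth = rng.normal(0.1, 1.1, 1000)
print(marginal_agreement(real, synth))
```

Combining both views matters: a generator can match every marginal and still be useless for a downstream task, and a generator that scores well on one task can badly distort the joint distribution.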
At the same time, machine learning models suffer from biases in their training data, and such biases are often even amplified during learning. Generative models are no different, leading to well-known problems like “mode collapse.” The effects are severe: they limit the usefulness of such models and data for minority groups, which stands in stark contrast to the European understanding of trustworthy AI and its values of inclusiveness and fairness.
This is why QuantPi will develop a methodology to identify such data biases, in particular by identifying rare examples that are prone to being suppressed or misrepresented in the final model.
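As a rough illustration of how suppressed modes might be surfaced (a sketch under our own assumptions, not QuantPi's methodology), the following example flags real records whose nearest synthetic neighbours are unusually far away; with a mode-collapsed generator, records from rare groups are precisely the poorly covered ones. The helper flag_underrepresented, its distance measure, and its thresholds are hypothetical.

```python
# Hypothetical sketch: surface real records that a generator covers poorly
# (candidates for suppressed or misrepresented modes). Not QuantPi's method.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def flag_underrepresented(X_real, X_synth, k=5, quantile=0.95):
    """For each real record, compute the mean distance to its k nearest
    synthetic neighbours; flag records above the given quantile as
    poorly covered by the generator."""
    nn = NearestNeighbors(n_neighbors=k).fit(X_synth)
    dist, _ = nn.kneighbors(X_real)
    coverage = dist.mean(axis=1)
    return np.where(coverage > np.quantile(coverage, quantile))[0]

# Toy check: the generator reproduces the majority mode but drops a
# rare one, so mostly the rare-mode records (indices >= 900) are flagged.
rng = np.random.default_rng(0)
majority = rng.normal(0.0, 1.0, size=(900, 2))
rare = rng.normal(6.0, 0.5, size=(100, 2))
X_real = np.vstack([majority, rare])
X_synth = rng.normal(0.0, 1.0, size=(1000, 2))  # mode-collapsed generator
print(flag_underrepresented(X_real, X_synth))
```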
This project is funded by the German Federal Ministry of Education and Research (BMBF).