Synthetic Data in Data Protection Law

On the promise of synthetic data, the distinction from anonymization, and the legal risks.

1. The promise of synthetic data

Synthetic data is attractive above all because it seems to offer an elegant solution to an old problem: the desire to use data while preserving privacy. The basic idea is to generate datasets that resemble the original data in their statistical structure without simply copying it. The European Data Protection Supervisor, or EDPS, describes synthetic data in precisely this sense as artificial data generated from original data and a model, intended to produce results in statistical analyses that are as similar as possible to those of the source data.[1]
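
To make the idea concrete, here is a minimal sketch of this generate-from-a-model approach, assuming purely numeric data and a deliberately simple model (a multivariate Gaussian fitted with numpy; the column semantics are hypothetical). Real generators are far more sophisticated, but the structure is the same: input data, a fitted model, sampled output.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Hypothetical source data: two correlated numeric attributes
# (say, age and weekly working hours) for 1,000 data subjects.
source = rng.multivariate_normal(
    mean=[45.0, 38.0],
    cov=[[120.0, 25.0], [25.0, 60.0]],
    size=1000,
)

# "Train" the generative model: estimate mean and covariance.
mu = source.mean(axis=0)
sigma = np.cov(source, rowvar=False)

# Sample fresh records that mimic the statistical structure of
# the source data without copying any individual record.
synthetic = rng.multivariate_normal(mean=mu, cov=sigma, size=1000)

print("source mean:   ", np.round(mu, 2))
print("synthetic mean:", np.round(synthetic.mean(axis=0), 2))
```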

For research, development, and testing environments, this is immediately appealing because it allows people to work with data-like material without always having to access the original data directly. At the same time, the EDPS matters for the legal analysis because, as the independent data protection authority of the EU institutions, it is a particularly influential voice on new technologies. It expressly emphasizes that the controller must assess the legal status of both the input dataset and the output dataset in their specific context.[1]

That much already makes one thing clear: the promise of synthetic data is significant, but it does not carry legal weight by itself.

2. Anonymization is not the same as synthetic data generation

Conceptually, the two notions need to be kept clearly separate. Synthetic data initially refers to a technical mode of production. Anonymization, by contrast, is not merely a technical label, but the result of a legally relevant process: personal data is altered in such a way that the data subject is not or no longer identifiable. What matters is not only whether a name or a direct identifier has been removed. Recital 26 GDPR instead requires an overall assessment of all means reasonably likely to be used to identify a person directly or indirectly; it explicitly mentions singling out as well.[2]

That is why anonymity is not an abstract property of a dataset “as such”, but the outcome of a context-dependent assessment. The older opinion of the Article 29 Working Party on anonymization techniques puts this very clearly: effective anonymization must prevent individuals from being singled out, datasets from being linked, or additional information about data subjects from being inferred.[3]
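
The singling-out criterion in particular can be made tangible: a record is effectively singled out when its combination of quasi-identifiers is unique in the released data, even though no name appears anywhere. A minimal sketch of such a uniqueness check (the attributes and values are hypothetical):

```python
from collections import Counter

# Hypothetical release, reduced to quasi-identifiers:
# (postal code, year of birth, sex). No direct identifiers.
records = [
    ("4051", 1980, "F"),
    ("4051", 1980, "F"),
    ("4052", 1975, "M"),
    ("4053", 1990, "F"),  # unique combination
]

counts = Counter(records)
singled_out = [r for r in records if counts[r] == 1]

# A unique combination isolates exactly one individual, which is
# the "singling out" the WP29 opinion warns about.
print(f"{len(singled_out)} of {len(records)} records can be singled out")
```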

For precisely that reason, “synthetic” is not synonymous with “anonymous”.

3. Relates to a person: not only identification, but also content, purpose, and effect

For legal analysis, the question of direct identifiability alone is not enough. In European data protection law, it has long been recognized that information may relate to a person not only through identification, but also through its content, purpose, or effect. This line of reasoning comes from the classic WP29 interpretation of the concept of personal data.[4]

Institutionally, the Article 29 Working Party was replaced on May 25, 2018 by the European Data Protection Board (EDPB), but this triad remains highly influential in substance.[5] That is particularly important for synthetic data.

A statement can relate to a person even if it is artificially generated or factually false, provided that it is used to say something about that person, classify them, assign them a risk, or prepare decisions affecting their rights and interests. Even a factually false statement is therefore not simply legally irrelevant: in its effect, it may still be personal data if it becomes operative in a decision-making context concerning a particular person.

4. Input, model, and output must be assessed separately

From a legal point of view, it is therefore too simplistic to look only at the synthetic output. With synthetic data, at least three levels must be distinguished: the source data, the model, and the output data. If the training data is personal data, then using it to generate synthetic data is already a data-processing operation relevant under data protection law. The EDPS accordingly states that the controller must assess the legal status of the input and output datasets in their specific context.[1]

What matters here in particular are the nature of the data, realistic attack and risk scenarios, and the technical and organizational safeguards of the environment. This leads to the key conclusion: synthetic data is not legally privileged merely because it is “artificial”. It remains personal data as long as it has not been shown with sufficient confidence that the link to a person has actually been eliminated.

5. The core risks

The risks of synthetic data can be developed well from the classic anonymization debate. WP29 identifies singling out, linkability, and inference as the core problems.[3] This triad is helpful here as well. A dataset may remain legally problematic if individual records can be singled out, linked with other data sources, or used to infer additional information about data subjects. Modern models add further risks.

Membership inference attacks are particularly important: they aim to determine whether a particular record or person was included in the training data. Model inversion attacks are equally relevant: they infer properties of the training data from the behavior of the model itself. Especially in black-box or query-based scenarios, such attacks are not merely theoretical.[6]
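
To illustrate the intuition behind membership inference, here is a deliberately stylized threshold attack, not a reproduction of any specific published method: overfitted models tend to behave more confidently on records they were trained on, and an attacker exploits exactly that gap. The "model" and threshold below are stand-ins.

```python
import numpy as np

rng = np.random.default_rng(seed=1)

train = rng.normal(size=(200, 5))    # records the "model" memorized
outside = rng.normal(size=(200, 5))  # records never seen in training

def model_confidence(record, training_set):
    # Stand-in for a trained, overfitted model: it scores records
    # higher the closer they are to something it memorized.
    distances = np.linalg.norm(training_set - record, axis=1)
    return np.exp(-distances.min())

THRESHOLD = 0.7  # hypothetical; attackers tune this on shadow data

def looks_like_member(record):
    # Membership inference: guess "was in the training data"
    # whenever the model's confidence exceeds the threshold.
    return model_confidence(record, train) > THRESHOLD

hits_in = sum(looks_like_member(r) for r in train)
hits_out = sum(looks_like_member(r) for r in outside)
print(f"flagged as members: {hits_in}/200 training, {hits_out}/200 outside")
```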

There is also the problem of attribute disclosure. Even without positive identification, it may be legally significant if sensitive characteristics can be inferred about individual persons or very small groups. Finally, the principle of accuracy should not be underestimated either. Synthetic data may create fictitious or distorted profiles. As long as such profiles remain in isolated test environments, this is primarily a quality issue. But once they are used for assessment, profiling, or operational decision-making, they can become a genuine data protection and fundamental rights problem.
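
Attribute disclosure can also be made concrete. Even a release in which no record is unique may leak sensitive information when everyone in a small group shares the same sensitive value, a pattern known from the anonymization literature as a homogeneity problem. A minimal sketch with hypothetical values:

```python
from collections import defaultdict

# Hypothetical release: every quasi-identifier group has at least
# two records, so no single record can be isolated ...
rows = [
    (("4051", "1980s"), "diabetes"),
    (("4051", "1980s"), "diabetes"),   # homogeneous group
    (("4052", "1970s"), "healthy"),
    (("4052", "1970s"), "diabetes"),
]

groups = defaultdict(set)
for quasi_ids, diagnosis in rows:
    groups[quasi_ids].add(diagnosis)

# ... yet where a whole group shares one sensitive value, that value
# is disclosed for anyone known to belong to the group.
for quasi_ids, diagnoses in groups.items():
    if len(diagnoses) == 1:
        print(f"group {quasi_ids}: attribute disclosed -> {diagnoses.pop()}")
```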

6. Synthetic data as a PET, but not as a miracle solution

The best way to understand synthetic data is therefore as a privacy-enhancing technology. Its value lies in making data processing less data-intensive and thereby supporting the principle of data minimization.[7]

That is precisely where its promise lies. At the same time, that promise is only partially reliable. Synthetic data replaces neither the legal assessment nor other protective measures. Instead, it must be combined with access restrictions, secure processing environments, governance rules, and, where appropriate, further technical means. The underlying problem remains a trade-off: the better a system is protected against re-identification, linkage, or inference, the greater the likely loss of usefulness.[7]
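
The trade-off can be shown in miniature with a differential-privacy-style mechanism: the stronger the protection (here, more Laplace noise on a released statistic), the lower the utility of what is released. All parameters below are illustrative, and a real deployment would derive the sensitivity bound rigorously:

```python
import numpy as np

rng = np.random.default_rng(seed=2)

# Hypothetical attribute: incomes of 1,000 data subjects.
incomes = rng.lognormal(mean=11.0, sigma=0.5, size=1000)
true_mean = incomes.mean()

# Rough stand-in for the sensitivity of the mean query; a real
# system would use a worst-case bound on one record's influence.
sensitivity = incomes.max() / len(incomes)

for epsilon in (10.0, 1.0, 0.1):
    # Laplace mechanism: smaller epsilon = stronger privacy = more
    # noise, and therefore a less useful released statistic.
    noisy_mean = true_mean + rng.laplace(scale=sensitivity / epsilon)
    print(f"epsilon={epsilon:>4}: error {abs(noisy_mean - true_mean):,.0f} "
          f"(true mean {true_mean:,.0f})")
```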

For that reason, synthetic data should neither be romanticized nor dismissed too quickly. It is not a mere illusion, but neither is it automatic anonymization. A legally persuasive analysis is possible only through a context-sensitive assessment that weighs utility and residual risk together.

7. Switzerland: first practical applications, especially in health

For Switzerland, it can now be said quite clearly that synthetic data is no longer discussed only in theory, but is already being used in practice. This is especially visible in the Basel health and research environment. The University of Basel reports on a workshop in which representatives from University Hospital Basel, Roche, and MDClone presented the use of synthetic data.[8]

That report also makes clear that Basel is already working with concrete infrastructures intended to make clinical data accessible in a more privacy-friendly form. This supports the conclusion that synthetic data has already arrived in Switzerland, though so far it is most publicly visible in university, clinical, and research-adjacent contexts.[9]

The legal core point, however, does not change as a result: in Switzerland too, it is not the label “synthetic” that is decisive, but whether there is still a legally relevant link to a person in the specific context.[10]


Glossary

Anonymization: A process in which personal data is altered so that the data subject cannot be identified, or can no longer be identified, taking into account all means reasonably likely to be used. What matters is therefore not only the removal of direct identifiers, but also the prevention of singling out, linkability, and inference.

Synthetic data: Artificially generated data based on original data and a model, intended to reproduce statistical properties of the source data. It is not automatically anonymous.

EDPS: European Data Protection Supervisor; the independent data protection authority of the EU institutions. Especially relevant for synthetic data because the EDPS describes its legal assessment as a contextual question concerning input and output datasets.

EDPB: European Data Protection Board; successor institution to the Article 29 Working Party since May 25, 2018.

Relates to a person: Information relates to a person not only in cases of direct identification, but also where it is assigned to that person by content, purpose, or effect.

Singling out: The isolation of a record or profile from a set, even without immediate identification by name.

Linkability: The possibility of linking datasets or data traces from different sources.

Inference: The derivation of additional information about a person from existing data, statistical patterns, or model behavior.

Membership inference attack: An attack intended to determine whether a specific record or person was part of the training data.

Model inversion attack: An attack in which information about training data or its characteristics is reconstructed from model behavior.

Attribute disclosure: The disclosure or plausible inference of sensitive characteristics of a person, even if that person cannot be identified with certainty by name.

Attribute disclosure instead of identification: In data protection law, the disclosure or inference of sensitive characteristics may already be relevant even without positive identification of the person.

Singling out, linkability, inference: Core risk categories in anonymization analysis: isolating, linking, and inferring additional information.

Accuracy principle: The data protection principle of factual accuracy; relevant where synthetic data produces false or distorted personal profiles.

Further recent problem settings: Additional specialized issues, such as erasure claims, persistently memorizing models, or conflicts with special legal regimes.

PET: Privacy-enhancing technology; a technical measure intended to reduce privacy risks. Synthetic data can be understood as a PET in this sense.

Data minimization: A data protection principle according to which only the data necessary for the relevant purpose should be processed. Synthetic data is often justified on the basis that it can support this principle.


Bibliography

[1] European Data Protection Supervisor, “Synthetic Data.” Available at: https://www.edps.europa.eu/press-publications/publications/techsonar/synthetic-data

[2] European Union, “Recital 26 - Not Applicable to Anonymous Data - General Data Protection Regulation (GDPR).” Available at: https://gdpr-info.eu/recitals/no-26/

[3] Article 29 Data Protection Working Party, “Opinion 05/2014 on Anonymisation Techniques.” Available at: https://ec.europa.eu/justice/article-29/documentation/opinion-recommendation/files/2014/wp216_en.pdf

[4] Article 29 Data Protection Working Party, “Opinion 4/2007 on the Concept of Personal Data.” Available at: https://www.clinicalstudydatarequest.com/Documents/Privacy-European-guidance.pdf

[5] European Data Protection Board, “Legacy of Article 29 Working Party.” Available at: https://www.edpb.europa.eu/about-edpb/who-we-are/legacy-art-29-working-party_en

[6] Information Commissioner’s Office, “Guidance on AI and data protection.” Available at: https://ico.org.uk/for-organisations/uk-gdpr-guidance-and-resources/artificial-intelligence/guidance-on-ai-and-data-protection/how-should-we-assess-security-and-data-minimisation-in-ai/

[7] Information Commissioner’s Office, “Chapter 5: Privacy-enhancing technologies (PETs).” Available at: https://ico.org.uk/media2/migrated/4021464/chapter-5-anonymisation-pets.pdf

[8] Universität Basel, “Workshop: The Death of Data Sharing? A Modern Privacy Technology Survival Kit.” Available at: https://www.unibas.ch/en/Research/Research-in-Basel/University-Networks/Personalized-Health-Basel/PHB-Events/Archive/Workshop-The-death-of-data-sharing-A-modern-privacy-technology-survival-kit.html

[9] MDClone, “Personalized Health Basel and Leading Life Sciences Organization Collaborate with MDClone to Drive Innovation.” Available at: https://mdclone.com/press-release/personalized-health-basel-and-leading-life-sciences-organization-collaborate-with-mdclone-to-drive-innovation/

[10] Schweizerische Eidgenossenschaft, “Bundesgesetz über den Datenschutz (DSG), SR 235.1.” Available at: https://www.fedlex.admin.ch/eli/cc/2022/491/de