Synthetic data generation for training healthcare NLP models without compromising privacy

Author(s): Veerendra Nath Jasthi

Publication #: 2508027

Date of Publication: 05.05.2025

Country: United States

Pages: 1-9

Published In: Volume 11 Issue 3 May-2025

DOI: https://doi.org/10.5281/zenodo.17062967

Abstract

Natural Language Processing (NLP) models have shown considerable promise in medical care, particularly for clinical note summarization, diagnosis prediction, and patient record classification. However, medical text data is sensitive, which raises critical privacy concerns and regulatory restrictions and thus hinders access to high-quality training data. Synthetic data generation is an attractive alternative because it produces artificial datasets that replicate the statistical properties of real clinical narratives without exposing identifiable patient data. This article details approaches to constructing synthetic medical data suitable for downstream NLP tasks in machine learning settings, based on generative adversarial networks (GANs), large language models (LLMs), and rule-based anonymization and augmentation methods. We train several types of NLP models on real and synthetic data, compare their performance, and assess how predictive and linguistically relevant privacy-preserving synthetic datasets can be. According to our findings, good-quality synthetic datasets can both preserve privacy and support the training of reliable models, enabling safer and more scalable application of AI in medicine.

Keywords: Synthetic data, Natural Language Processing, Healthcare, Privacy preservation, GANs, Medical text generation, Clinical NLP, Data anonymization.
