Generate Synthetic Text Approximating the Private Distribution with Differential Privacy

Wenhao Zhao, Shaoyang Song, Chunlai Zhou

Proceedings of the Thirty-Third International Joint Conference on Artificial Intelligence
Main Track. Pages 6651-6659. https://doi.org/10.24963/ijcai.2024/735

Because text can leak sensitive information, there is a growing societal demand for training models on privacy-preserving text. Recent work has shown that using synthetic text generated with differential privacy, rather than the private text itself, can provide strong privacy protection. However, how to achieve higher semantic similarity between synthetic and private text has not been thoroughly investigated. In this paper, we propose an approach that borrows the iterative optimization idea of genetic algorithms to align the distribution of the synthetic text with that of the private text. The final synthetic text not only meets the privacy requirements but is also of high quality. Through comparisons with various baselines on different datasets, we demonstrate that our synthetic text can closely match the utility of private text while providing privacy protection robust enough to resist membership inference attacks from malicious users.
Keywords:
Natural Language Processing: NLP: Language models
Multidisciplinary Topics and Applications: MTA: Security and privacy