Generative AI: A Comparison of CTGAN and CTGAN with Gaussian Copula in Generating Synthetic Data with Synthetic Data Vault
Authors: Lakshmi Prayaga, Chandra Prayaga, Aaron Wade, Gopi Shankar Mallu, Harsha Satya Pola
Abstract:
Synthetic data generated by Generative Adversarial Networks (GANs) and autoencoders are increasingly used to address the problem of insufficient data for research. However, generating synthetic data is a tedious task that requires an extensive mathematical and programming background. Open-source platforms such as the Synthetic Data Vault (SDV) and Mostly AI offer user-friendly tools that allow non-technical professionals to generate synthetic data to augment existing data for further analysis. The SDV also provides extensions to the generic GAN, such as the Gaussian copula. We present results from two synthetic data sets generated with the SDV, one using the Conditional Tabular Generative Adversarial Network (CTGAN) and one using CTGAN with a Gaussian copula. The results indicate that the Receiver Operating Characteristic (ROC) curves and Area Under the Curve (AUC) values for the data generated with the added Gaussian copula layer are much higher than those for the data generated by CTGAN alone.
Keywords: Synthetic data generation, Generative Adversarial Networks, GANs, Conditional Tabular GAN, CTGAN, Gaussian copula.
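For readers unfamiliar with the Gaussian copula named in the abstract, the following is a minimal illustrative sketch of the general idea for two numeric columns, written with the Python standard library only. It is not the SDV's implementation and not the authors' pipeline, and the abstract does not describe how the copula is combined with CTGAN. The function names, the toy columns, and the use of empirical (rank-based) marginals are all assumptions made for illustration.

```python
import bisect
import math
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal: mean 0, standard deviation 1


def normal_scores(column):
    """Map each value to a standard-normal score via its empirical CDF rank."""
    ordered = sorted(column)
    n = len(column)
    scores = []
    for value in column:
        # Mid-rank strictly inside (0, 1); tied values share one rank.
        lo = bisect.bisect_left(ordered, value)
        hi = bisect.bisect_right(ordered, value)
        scores.append(STD_NORMAL.inv_cdf((lo + hi) / (2 * n)))
    return scores


def pearson(xs, ys):
    """Pearson correlation; assumes neither column is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)


def empirical_quantile(ordered, u):
    """Inverse empirical CDF: the observed value at fraction u of the sorted column."""
    return ordered[min(int(u * len(ordered)), len(ordered) - 1)]


def sample_gaussian_copula(col_a, col_b, n_samples, seed=0):
    """Draw synthetic (a, b) rows that keep the marginals and the rank dependence."""
    rng = random.Random(seed)
    # Dependence is estimated in normal-score space, not on the raw values.
    rho = pearson(normal_scores(col_a), normal_scores(col_b))
    residual_sd = math.sqrt(max(0.0, 1.0 - rho * rho))
    sorted_a, sorted_b = sorted(col_a), sorted(col_b)
    rows = []
    for _ in range(n_samples):
        # Correlated standard normals with correlation rho.
        z1 = rng.gauss(0.0, 1.0)
        z2 = rho * z1 + residual_sd * rng.gauss(0.0, 1.0)
        # Map back to the data scale through each column's empirical quantiles.
        rows.append((
            empirical_quantile(sorted_a, STD_NORMAL.cdf(z1)),
            empirical_quantile(sorted_b, STD_NORMAL.cdf(z2)),
        ))
    return rows


# Toy "real" data (hypothetical): col_b depends linearly on col_a plus noise.
toy_rng = random.Random(42)
real_a = [toy_rng.gauss(50.0, 10.0) for _ in range(500)]
real_b = [2.0 * a + toy_rng.gauss(0.0, 5.0) for a in real_a]

synthetic = sample_gaussian_copula(real_a, real_b, n_samples=500, seed=1)
synthetic_corr = pearson([r[0] for r in synthetic], [r[1] for r in synthetic])
```

Because this sketch uses empirical quantiles, every synthetic value is one that already occurs in the corresponding real column; only the pairing of values across columns is new. Libraries may instead fit parametric marginal distributions, which is a different design choice from the one shown here.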