Report
Mixed-Type Tabular Data Synthesis with Score-based Diffusion in Latent Space
Title: | Mixed-Type Tabular Data Synthesis with Score-based Diffusion in Latent Space |
---|---|
Authors: | Zhang, Hengrui; Zhang, Jiani; Srinivasan, Balasubramaniam; Shen, Zhengyuan; Qin, Xiao; Faloutsos, Christos; Rangwala, Huzefa; Karypis, George |
Publication Year: | 2023 |
Collection: | Computer Science |
Subject Terms: | Computer Science - Machine Learning |
Description: | Recent advances in tabular data generation have greatly enhanced synthetic data quality. However, extending diffusion models to tabular data is challenging because tabular data combine intricately varied distributions with a blend of data types. This paper introduces Tabsyn, a methodology that synthesizes tabular data by leveraging a diffusion model within a latent space crafted by a variational autoencoder (VAE). The key advantages of the proposed Tabsyn include (1) Generality: the ability to handle a broad spectrum of data types by converting them into a single unified space and explicitly capture inter-column relations; (2) Quality: optimizing the distribution of latent embeddings to enhance the subsequent training of diffusion models, which helps generate high-quality synthetic data; (3) Speed: far fewer reverse steps and faster synthesis than existing diffusion-based methods. Extensive experiments on six datasets with five metrics demonstrate that Tabsyn outperforms existing methods. Specifically, it reduces the error rates by 86% and 67% for column-wise distribution and pair-wise column correlation estimations, respectively, compared with the most competitive baselines. Comment: Accepted by ICLR 2024 (Oral Presentation). Code is available at: https://github.com/amazon-science/tabsyn |
Document Type: | Working Paper |
Open Access: | http://arxiv.org/abs/2310.09656 |
Accession Number: | edsarx.2310.09656 |
Database: | arXiv |
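
The description reports an error rate for pair-wise column correlation estimation but does not say how that error is computed. The following is a minimal sketch of one plausible reading, assuming the error is the mean absolute difference between the Pearson correlations of every column pair in the real and synthetic tables. The function names (`pearson`, `pairwise_correlation_error`), the numeric-only columns, and the toy data are all assumptions made for illustration; none of them comes from the paper or its code.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        # Convention chosen for this sketch: a constant column has correlation 0.
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def pairwise_correlation_error(real, synthetic):
    """Mean absolute difference between pairwise Pearson correlations.

    Each table is a dict mapping a column name to a list of numeric values.
    Both tables are assumed to share the same column names.
    """
    columns = sorted(real)
    diffs = []
    for i, first in enumerate(columns):
        for second in columns[i + 1:]:
            r_real = pearson(real[first], real[second])
            r_syn = pearson(synthetic[first], synthetic[second])
            diffs.append(abs(r_real - r_syn))
    return sum(diffs) / len(diffs) if diffs else 0.0


# Toy tables, made up for illustration only.
real = {"age": [20, 30, 40, 50], "income": [10, 20, 30, 40]}
close = {"age": [22, 31, 39, 52], "income": [11, 19, 31, 42]}
flipped = {"age": [20, 30, 40, 50], "income": [40, 30, 20, 10]}

print(pairwise_correlation_error(real, close))    # small: correlation preserved
print(pairwise_correlation_error(real, flipped))  # large: correlation sign reversed
```

Under this reading, a synthetic table that preserves the relationships between columns scores near zero, while one that matches each column's marginal distribution but scrambles the relationships scores high, which is why the description lists column-wise and pair-wise errors separately.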