Journal of Nanjing University (Natural Science), 2026, Vol. 62, Issue 2: 277-284, 8. DOI: 10.13232/j.cnki.jnju.2026.02.010
A method for generating tabular data based on LLM probabilistic prompt words
Abstract
Large language models (LLMs) have demonstrated significant potential in tabular data generation. However, they often struggle to preserve the statistical dependencies between columns. To address this challenge, we propose TabProLLM, a probabilistic prompting framework that generates numerical and categorical columns separately, using strategies grounded in probability distributions. For numerical columns, we fit a Gaussian Mixture Model (GMM) to decompose the empirical distribution into multiple Gaussian components. Prompts are then constructed from these segmented distributions to guide the LLM in generating realistic numerical values. For categorical columns, we condition on a reference numerical column by partitioning its range and computing the conditional probability distribution of each category within each interval. These conditional probabilities are embedded into the prompt design to steer the generation of categorical data consistent with the observed inter-variable dependencies. During prompt construction, correlation coefficients and other statistical measures are incorporated to verify that the generated data preserve the correlation structure of the original dataset. Experimental results on 10 public datasets show that TabProLLM, while ensuring strong data privacy, achieves performance gains of 0.5% to 18.3% over existing methods across multiple fidelity metrics in the SDMetrics toolkit, including RangeCoverage, CategoryCoverage, KSComplement, and TVComplement. On the CorrelationSimilarity metric, TabProLLM performs comparably to the state-of-the-art TabDDPM model and surpasses GPT-4o (using mean-variance prompts) by approximately 4.1%. Furthermore, in privacy evaluations, TabProLLM achieves the best or second-best performance on the DCR and NNDR metrics (evaluated at the 5th percentile), highlighting its robust privacy-preserving capabilities.
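The conditional-probability step for categorical columns can be illustrated with a minimal sketch. The abstract does not specify how the reference column is partitioned or how the prompt is worded, so the cut points, the toy data, the prompt template, and the function names `conditional_probs`, `interval_label`, and `build_prompt` below are all hypothetical; this is not the authors' implementation.

```python
from bisect import bisect_right
from collections import Counter, defaultdict


def conditional_probs(num_values, cat_values, edges):
    """Estimate P(category | interval) for a reference numerical column.

    `edges` holds the interior cut points (an assumption of this sketch),
    giving len(edges) + 1 half-open intervals [lo, hi).
    """
    counts = defaultdict(Counter)
    for x, c in zip(num_values, cat_values):
        # bisect_right maps x to the index of the interval that contains it
        counts[bisect_right(edges, x)][c] += 1
    probs = {}
    for idx, counter in counts.items():
        total = sum(counter.values())
        probs[idx] = {c: n / total for c, n in counter.items()}
    return probs


def interval_label(idx, edges):
    """Render interval number `idx` as a readable half-open range."""
    lo = "-inf" if idx == 0 else str(edges[idx - 1])
    hi = "+inf" if idx == len(edges) else str(edges[idx])
    return f"[{lo}, {hi})"


def build_prompt(num_name, cat_name, probs, edges):
    """Embed the per-interval distributions in a (hypothetical) prompt."""
    lines = [f"Generate a value for '{cat_name}' given '{num_name}'."]
    for idx in sorted(probs):
        dist = ", ".join(f"{c}: {p:.2f}" for c, p in sorted(probs[idx].items()))
        lines.append(
            f"If {num_name} is in {interval_label(idx, edges)}, "
            f"sample with probabilities {dist}."
        )
    return "\n".join(lines)


# Toy data, invented purely for illustration
ages = [22, 25, 31, 38, 45, 52, 58, 63]
jobs = ["student", "student", "engineer", "engineer",
        "engineer", "manager", "manager", "retired"]
edges = [30, 50]

probs = conditional_probs(ages, jobs, edges)
prompt = build_prompt("age", "job", probs, edges)
print(prompt)
```

Within each interval the estimated probabilities sum to 1, so each prompt line describes a complete categorical distribution for the LLM to sample from. In practice the cut points would presumably be derived from the data rather than fixed by hand as they are here.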
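Key words
tabular data generation / large language model / prompt words / conditional probability
Classification
Information Technology and Security Science
Cite this article
张爽 (Zhang Shuang), 房俊 (Fang Jun), 欧阳琛 (Ouyang Chen). A method for generating tabular data based on LLM probabilistic prompt words [J]. Journal of Nanjing University (Natural Science), 2026, 62(2): 277-284, 8.
Funding
National Key Research and Development Program of China (2023YFC3107900); National Natural Science Foundation of China (72272140)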