A Tabular Data Generation Method Based on LLM Probabilistic Prompt Words

Zhang Shuang (张爽), Fang Jun (房俊), Ouyang Chen (欧阳琛)

Journal of Nanjing University (Natural Science), 2026, Vol. 62, Issue 2: 277-284, 8. DOI: 10.13232/j.cnki.jnju.2026.02.010

A method for generating tabular data based on LLM prompt words

Zhang Shuang (1), Fang Jun (2), Ouyang Chen (1)

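Author information

  • 1. School of Information, North China University of Technology, Beijing 100144
  • 2. School of Information, North China University of Technology, Beijing 100144; Beijing Key Laboratory on Integration and Analysis of Large-scale Stream Data, North China University of Technology, Beijing 100144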

Abstract

Large language models (LLMs) have demonstrated significant potential in tabular data generation. However, they often struggle to accurately preserve the statistical dependencies between columns. To address this challenge, we propose TabProLLM, a probabilistic prompting framework that generates numerical and categorical columns separately, using strategies grounded in probability distributions. For numerical columns, we fit a Gaussian Mixture Model (GMM) to decompose the empirical distribution into multiple Gaussian components. Prompts are then constructed from these segmented distributions to guide the LLM in generating realistic numerical values. For categorical columns, we condition on a reference numerical column by partitioning its range and computing the conditional probability distribution of the categories within each interval. These conditional probabilities are embedded in the prompt to steer the generation of categorical data consistent with the observed inter-variable dependencies. During prompt construction, correlation coefficients and other statistical measures are incorporated to verify that the generated data preserve the correlation structure of the original dataset. Experimental results on 10 public datasets show that TabProLLM, while ensuring strong data privacy, achieves performance gains of 0.5% to 18.3% over existing methods across multiple fidelity metrics in the SDMetrics toolkit, including RangeCoverage, CategoryCoverage, KSComplement, and TVComplement. On the CorrelationSimilarity metric, TabProLLM performs comparably to the state-of-the-art TabDDPM model and surpasses GPT-4o (using mean-variance prompts) by approximately 4.1%. Furthermore, in privacy evaluations, TabProLLM achieves the best or second-best performance on the DCR and NNDR metrics (evaluated at the 5th percentile), highlighting its robust privacy-preserving capabilities.
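
The two prompting steps described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`fit_gmm_1d`, `conditional_probs`, `build_prompt`), the prompt wording, the interval edges, and the toy data are all assumptions made for illustration, and a small pure-Python EM loop stands in for whatever GMM fitting routine the paper actually uses.

```python
import math
from collections import Counter, defaultdict

def fit_gmm_1d(xs, k=2, iters=50):
    """Fit a 1-D Gaussian mixture with plain EM; returns (weight, mean, std) per component."""
    xs = sorted(xs)
    n = len(xs)
    # Deterministic initialisation (an assumption): means at evenly spaced quantiles.
    means = [xs[int((i + 0.5) * n / k)] for i in range(k)]
    stds = [(xs[-1] - xs[0]) / (2 * k) or 1.0] * k
    weights = [1.0 / k] * k
    for _ in range(iters):
        # E-step: responsibility of each component for each point.
        resp = []
        for x in xs:
            dens = [w * math.exp(-0.5 * ((x - m) / s) ** 2) / (s * math.sqrt(2 * math.pi))
                    for w, m, s in zip(weights, means, stds)]
            total = sum(dens) or 1e-300
            resp.append([d / total for d in dens])
        # M-step: re-estimate weights, means, and standard deviations.
        for j in range(k):
            nj = sum(r[j] for r in resp) or 1e-12
            weights[j] = nj / n
            means[j] = sum(r[j] * x for r, x in zip(resp, xs)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, xs)) / nj
            stds[j] = max(math.sqrt(var), 1e-3)  # floor avoids a collapsed component
    return list(zip(weights, means, stds))

def conditional_probs(nums, cats, edges):
    """Estimate P(category | interval) for half-open intervals [edges[i], edges[i+1])."""
    counts = defaultdict(Counter)
    for x, c in zip(nums, cats):
        for lo, hi in zip(edges, edges[1:]):
            if lo <= x < hi:
                counts[(lo, hi)][c] += 1
                break
    return {iv: {c: cnt / sum(ctr.values()) for c, cnt in ctr.items()}
            for iv, ctr in counts.items()}

def build_prompt(num_col, cat_col, components, cond):
    """Render the fitted statistics as instructions for an LLM (hypothetical wording)."""
    lines = [f"Generate rows with columns '{num_col}' and '{cat_col}'."]
    for w, m, s in components:
        lines.append(f"- With probability {w:.2f}, draw {num_col} from a Gaussian "
                     f"with mean {m:.1f} and std {s:.1f}.")
    for (lo, hi), probs in sorted(cond.items()):
        dist = ", ".join(f"{c}: {p:.2f}" for c, p in sorted(probs.items()))
        lines.append(f"- If {lo} <= {num_col} < {hi}, choose {cat_col} with probabilities {dist}.")
    return "\n".join(lines)

# Toy data, invented for this example only.
ages = [22, 25, 27, 24, 26, 58, 61, 63, 60, 59]
jobs = ["student", "student", "clerk", "student", "clerk",
        "manager", "retired", "retired", "manager", "manager"]

components = fit_gmm_1d(ages, k=2)
cond = conditional_probs(ages, jobs, edges=[20, 40, 70])
prompt = build_prompt("age", "job", components, cond)
print(prompt)
```

The resulting prompt text would be sent to the LLM together with any further statistics, such as the correlation coefficients mentioned in the abstract. The number of mixture components and the way the reference column is partitioned are design choices that the abstract does not specify.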

Key words

tabular data generation/large language model/prompt words/conditional probability

Classification

Information Technology and Security Science

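Cite this article

Zhang Shuang, Fang Jun, Ouyang Chen. A tabular data generation method based on LLM probabilistic prompt words[J]. Journal of Nanjing University (Natural Science), 2026, 62(2): 277-284, 8.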

Funding

National Key Research and Development Program of China (2023YFC3107900); National Natural Science Foundation of China (72272140)

Journal of Nanjing University (Natural Science)

ISSN 0469-5097
