China Scientific Data (Chinese-English online edition), 2025, Vol. 10, Issue 3: 523-534, 12. DOI: 10.11922/11-6035.csd.2024.0208.zh
TCST-UT: a dataset of Tibetan-Chinese speech translation in the Ü-Tsang dialect
Abstract
In the era of large language models, the construction of multilingual language resources is of great significance. However, publicly available Tibetan-Chinese speech translation datasets are currently very limited, which significantly hinders progress on Tibetan within multilingual language resource construction. To address this, this paper presents a large-scale Tibetan-Chinese speech translation dataset constructed with a semi-automatic annotation approach that adheres to international speech translation dataset standards. Starting from M2ASR, a publicly available Tibetan automatic speech recognition dataset for the Ü-Tsang dialect, we used the Gemini-1.5-pro large language model to translate the corresponding Tibetan transcripts into Chinese. These translations were then rigorously reviewed and corrected by experts, resulting in a high-quality Ü-Tsang dialect Tibetan-Chinese speech translation dataset. The dataset comprises 58,767 speech-text pairs, totaling 72.08 hours of audio from 147 different speakers, with a total size of 22 MB. It can provide foundational data for Tibetan-Chinese speech translation research and can offer insights for the construction of speech translation datasets for other low-resource languages.
Keywords
Tibetan-Chinese speech translation / dataset / semi-automatic annotation / low-resource languages
Cite this article
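黎鑫, 刘佳洛, 多杰朋毛, 看卓措, 戚肖克, 赵小兵. TCST-UT: a dataset of Tibetan-Chinese speech translation in the Ü-Tsang dialect [J]. China Scientific Data (Chinese-English online edition), 2025, 10(3): 523-534, 12.
Funding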
Major Program of the National Social Science Foundation of China (22&ZD035)