TCST-UT: An Ü-Tsang Dialect Tibetan-Chinese Speech Translation Dataset

中国科学数据（中英文网络版） [China Scientific Data (Chinese-English online edition)], 2025, Vol. 10, Issue 3: 523-534, 12. DOI: 10.11922/11-6035.csd.2024.0208.zh

TCST-UT: A Dataset of Tibetan-Chinese Speech Translation in the Ü-Tsang Dialect

黎鑫¹, 刘佳洛¹, 多杰朋毛², 看卓措², 戚肖克³, 赵小兵¹

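Author information

1. National Language Resource Monitoring and Research Center for Minority Languages, Beijing 100081; School of Information Engineering, Minzu University of China, Beijing 100081
2. School of Information Engineering, Minzu University of China, Beijing 100081
3. School of Information Management for Law, China University of Political Science and Law, Beijing 102249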

Abstract

In the era of large language models, building multilingual language resources is of great significance. However, publicly available Tibetan-Chinese speech translation datasets remain very limited, which significantly hinders progress on Tibetan within multilingual resource construction. To address this, this paper presents a large-scale Tibetan-Chinese speech translation dataset constructed with a semi-automatic annotation approach that adheres to international speech translation dataset standards. Starting from M2ASR, a publicly available Tibetan automatic speech recognition dataset for the Ü-Tsang dialect, we used the Gemini-1.5-pro large language model to translate the Tibetan transcripts into Chinese. Experts then rigorously reviewed and corrected these translations, yielding a high-quality Ü-Tsang dialect Tibetan-Chinese speech translation dataset. The dataset comprises 58,767 speech-text pairs totaling 72.08 hours of audio from 147 speakers, with a total size of 22 MB. It provides foundational data for Tibetan-Chinese speech translation research and may offer insights for constructing speech translation datasets for other low-resource languages.
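As a rough sanity check on the figures quoted in the abstract, the short sketch below derives a few per-utterance and per-speaker averages from them. Only the three input numbers come from the abstract; the function name `summarize` and the dictionary keys are illustrative choices, and the results are plain averages that say nothing about how audio is actually distributed across speakers.

```python
# Figures quoted in the abstract.
NUM_PAIRS = 58_767      # speech-text pairs
TOTAL_HOURS = 72.08     # hours of audio
NUM_SPEAKERS = 147      # distinct speakers


def summarize(num_pairs: int, total_hours: float, num_speakers: int) -> dict:
    """Derive simple averages from the headline statistics (hypothetical helper)."""
    total_seconds = total_hours * 3600
    return {
        "avg_seconds_per_utterance": total_seconds / num_pairs,
        "avg_utterances_per_speaker": num_pairs / num_speakers,
        "avg_minutes_per_speaker": total_hours * 60 / num_speakers,
    }


stats = summarize(NUM_PAIRS, TOTAL_HOURS, NUM_SPEAKERS)
for name, value in stats.items():
    print(f"{name}: {value:.2f}")
```

Under these figures, an utterance averages roughly 4.4 seconds and each speaker contributes about 400 utterances on average.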

Keywords

Tibetan-Chinese speech translation; dataset; semi-automatic annotation; low-resource languages

Cite this article

黎鑫, 刘佳洛, 多杰朋毛, 看卓措, 戚肖克, 赵小兵. TCST-UT: a dataset of Tibetan-Chinese speech translation in the Ü-Tsang dialect [J]. 中国科学数据（中英文网络版）, 2025, 10(3): 523-534, 12.

Funding

Major Program of the National Social Science Foundation of China (22&ZD035)

中国科学数据（中英文网络版）, ISSN: 2096-2223
