TCST-UT: A Dataset of Tibetan-Chinese Speech Translation in the Ü-Tsang Dialect

China Scientific Data (Chinese and English online edition), 2025, Vol. 10, Issue 3: 523-534, 12. DOI: 10.11922/11-6035.csd.2024.0208.zh

Li Xin (黎鑫) 1, Liu Jialuo (刘佳洛) 1, Duojie Pengmao (多杰朋毛) 2, Kanzhuocuo (看卓措) 2, Qi Xiaoke (戚肖克) 3, Zhao Xiaobing (赵小兵) 1

Author information

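  • 1. National Language Resource Monitoring and Research Center for Minority Languages, Beijing 100081 || School of Information Engineering, Minzu University of China, Beijing 100081
  • 2. School of Information Engineering, Minzu University of China, Beijing 100081
  • 3. School of Information Management for Law, China University of Political Science and Law, Beijing 102249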

Abstract

In the era of large language models, the construction of multilingual language resources is of great significance. However, publicly available Tibetan-Chinese speech translation datasets are currently very limited, which significantly hinders the development of Tibetan within the context of multilingual language resource construction. To address this, this paper presents a large-scale dataset of Tibetan-Chinese speech translation constructed using a semi-automatic annotation approach, adhering to international speech translation dataset standards. First, leveraging the publicly available Tibetan automatic speech recognition dataset for the Ü-Tsang dialect (M2ASR), we employed the Gemini-1.5-pro large language model to translate the corresponding Tibetan transcripts into Chinese. These translations then underwent rigorous review and correction by experts, resulting in a high-quality dataset of Ü-Tsang dialect Tibetan-Chinese speech translation. The dataset comprises 58,767 speech-text pairs, totaling 72.08 hours of audio from 147 different speakers, with a total size of 22 MB. It can provide foundational data for Tibetan-Chinese speech translation research. Furthermore, it can offer insights for the construction of speech translation datasets for other low-resource languages.
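The headline figures in the abstract imply a few averages that help in judging the dataset's granularity. The short Python sketch below derives them from the three numbers quoted above; the function name `summarize` and the dictionary keys are illustrative choices, not part of the dataset or the paper. The results are plain averages only, since the abstract does not describe how utterances or durations are distributed across speakers.

```python
# Figures quoted in the abstract.
NUM_PAIRS = 58_767      # speech-text pairs
TOTAL_HOURS = 72.08     # total audio duration in hours
NUM_SPEAKERS = 147      # distinct speakers


def summarize(num_pairs: int, total_hours: float, num_speakers: int) -> dict:
    """Derive simple averages from the dataset-level totals.

    Hypothetical helper for illustration; it assumes nothing about the
    actual per-speaker or per-utterance distribution.
    """
    total_seconds = total_hours * 3600
    return {
        "avg_seconds_per_utterance": total_seconds / num_pairs,
        "avg_utterances_per_speaker": num_pairs / num_speakers,
        "avg_minutes_per_speaker": total_hours * 60 / num_speakers,
    }


stats = summarize(NUM_PAIRS, TOTAL_HOURS, NUM_SPEAKERS)
for name, value in stats.items():
    print(f"{name}: {value:.2f}")
```

Under these totals, an average utterance lasts roughly four and a half seconds, and each speaker contributes on the order of 400 utterances, or about half an hour of audio.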

Keywords

Tibetan-Chinese speech translation / dataset / semi-automatic annotation / low-resource languages

Cite this article

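Li Xin, Liu Jialuo, Duojie Pengmao, Kanzhuocuo, Qi Xiaoke, Zhao Xiaobing. TCST-UT: A Dataset of Tibetan-Chinese Speech Translation in the Ü-Tsang Dialect [J]. China Scientific Data (Chinese and English online edition), 2025, 10(3): 523-534, 12.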

Funding

Major Program of the National Social Science Foundation of China (22&ZD035)

China Scientific Data (Chinese and English online edition)

2096-2223
