
Generative Sequence-to-Sequence Framework for Target Speaker Voice Activity Detection and Diarization

Chen Zhengyang¹, Qian Yanmin¹

Journal of Signal Processing (信号处理), 2025, Vol. 41, Issue 9: 1570-1580, 11. DOI: 10.12466/xhcl.2025.09.010

Author information

1. School of Computer Science, Shanghai Jiao Tong University, Shanghai 200240, China

Abstract

Speaker diarization has been developed over several decades and plays a pivotal role in speech processing, particularly in multi-speaker scenarios. Speaker diarization systems segment multi-speaker audio into homogeneous regions with consistent speaker identities. In the era of big data, extracting single-speaker segments from large volumes of unlabeled speech has become increasingly important, making speaker diarization systems essential tools. Prior research on neural network-based speaker diarization has focused largely on architectural design, with most implementations relying on binary cross-entropy as the optimization objective. This choice is intuitive, because speaker activity is typically represented as a binary label sequence: 0 for absence and 1 for presence of the target speaker. Most such systems use discriminative algorithms that produce a fixed output for a fixed input. However, this fixed mapping can limit performance, because speaker diarization labels are interval-based and often have uncertain boundaries that may adversely affect the training of discriminative algorithms. Recently, generative algorithms have gained attention for their iterative inference and distribution modeling capabilities, which enable finer-grained results and increase resilience to label uncertainty. Neural network-based diarization systems generally fall into two categories: end-to-end systems and target speaker voice activity detection (TSVAD) systems. In this paper, we investigate the application of generative algorithms to a sequence-to-sequence TSVAD system. Building upon an existing implementation, we integrate two generative models, a diffusion algorithm and a flow-matching algorithm, to predict the distribution of outcomes. Experiments show that generative models underperform in the binary label space. To mitigate this, we propose a label autoencoder that compresses binary sequences into a low-dimensional continuous latent space. Within this space, our flow-matching algorithm outperforms the baseline. Moreover, the stochastic nature of generative models allows the system to be sampled multiple times, yielding varying results. Fusing these results leads to further performance improvements, achieving an approximately 12% relative improvement over the baseline. Additional findings include that diffusion algorithms, because of their inherent stochasticity, perform significantly worse and converge more slowly than flow matching, which uses deterministic sampling paths. Notably, while generative methods in other domains often require more than 10 inference steps, only two steps were sufficient for effective diarization. Although our generative model yields variable outputs, performance remains robust, and all samples consistently outperform the discriminative baseline. We also conduct ablation studies on the compression of binary label sequences into a continuous low-dimensional latent space to determine optimal configurations. Finally, while this study focuses on TSVAD systems, extending generative methods to end-to-end diarization architectures remains a promising direction for future research.
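
As an illustration of the fusion step mentioned in the abstract, the following minimal sketch averages several sampled outputs frame by frame and then binarizes the result. The abstract does not state how the samples are fused, so score averaging, the 0.5 threshold, the function name fuse_samples, and the example scores are all assumptions made for illustration, not the authors' method.

```python
from typing import List


def fuse_samples(samples: List[List[float]], threshold: float = 0.5) -> List[int]:
    """Average per-frame activity scores over several samples, then binarize.

    Assumption: each sample is a list of per-frame scores in [0, 1] for one
    target speaker. The averaging rule and threshold are illustrative only.
    """
    if not samples:
        raise ValueError("need at least one sample")
    n_frames = len(samples[0])
    if any(len(s) != n_frames for s in samples):
        raise ValueError("all samples must cover the same frames")

    fused = []
    for frame in range(n_frames):
        # Mean score for this frame across all sampled outputs.
        mean_score = sum(s[frame] for s in samples) / len(samples)
        # 1 = target speaker present, 0 = absent.
        fused.append(1 if mean_score >= threshold else 0)
    return fused


# Three hypothetical sampled outputs over six frames (made-up numbers).
samples = [
    [0.9, 0.8, 0.6, 0.2, 0.1, 0.0],
    [0.8, 0.9, 0.3, 0.4, 0.1, 0.1],
    [0.9, 0.7, 0.7, 0.1, 0.2, 0.0],
]
fused_labels = fuse_samples(samples)
print(fused_labels)
```

In this toy input the second sample disagrees with the other two on the third frame; averaging lets the majority evidence decide that frame, which is one plausible reason fusing several stochastic samples can help near uncertain boundaries.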

Key words

target speaker voice activity detection / speaker diarization / generative method / diffusion algorithm / flow-matching algorithm

Classification

Information Technology and Security Science

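Cite this article

Chen Zhengyang, Qian Yanmin. Generative Sequence-to-Sequence Framework for Target Speaker Voice Activity Detection and Diarization[J]. Journal of Signal Processing (信号处理), 2025, 41(9): 1570-1580, 11.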

Funding

China STI 2030-Major Projects (2021ZD0201500)

The National Natural Science Foundation of China (62122050, 62071288)

Journal of Signal Processing (信号处理)

Open access; Peking University Core Journal

ISSN: 1003-0530
