Christoph Boeddeker

Universität Paderborn

H-index: 22

Europe-Germany

Description

Christoph Boeddeker is a researcher at Universität Paderborn with an h-index of 22 overall and 21 since 2020.

His recent articles include:

Geodesic interpolation of frame-wise speaker embeddings for the diarization of meeting scenarios

TS-SEP: Joint diarization and separation conditioned on estimated speaker embeddings

On Word Error Rate Definitions and Their Efficient Computation for Multi-Speaker Speech Recognition Systems

Meeting Recognition with Continuous Speech Separation and Transcription-Supported Diarization

Mixture Encoder Supporting Continuous Speech Separation for Meeting Recognition

Reverberation as Supervision For Speech Separation

A Teacher-Student approach for extracting informative speaker embeddings from speech mixtures

MeetEval: A Toolkit for Computation of Word Error Rates for Meeting Transcription Systems

Professor Information

University	Universität Paderborn
Citations(all)	1532
Citations(since 2020)	1467
Cited By	478
hIndex(all)	22
hIndex(since 2020)	21
i10Index(all)	26
i10Index(since 2020)	26

Top articles of Christoph Boeddeker

Geodesic interpolation of frame-wise speaker embeddings for the diarization of meeting scenarios

We propose a modified teacher-student training for the extraction of frame-wise speaker embeddings that allows for an effective diarization of meeting scenarios containing partially overlapping speech. To this end, a geodesic distance loss is used that enforces the embeddings computed from regions with two active speakers to lie on the shortest path on a sphere between the points given by the d-vectors of each of the active speakers. Using those frame-wise speaker embeddings in clustering-based diarization outperforms segment-level clustering-based diarization systems such as VBx and Spectral Clustering. By extending our approach to a mixture-model-based diarization, the performance can be further improved, approaching the diarization error rates of diarization systems that use a dedicated overlap detection, and outperforming these systems when also employing an additional overlap detection.

Authors

Tobias Cord-Landwehr,Christoph Boeddeker,Cătălin Zorilă,Rama Doddipatla,Reinhold Haeb-Umbach

Journal

arXiv preprint arXiv:2401.03963

Published Date

2024/1/8
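The geometric idea in the abstract above, that an embedding for a two-speaker region should lie on the shortest path on a sphere between the two speakers' d-vectors, can be illustrated with a minimal sketch. The function names and the off-path penalty below are assumptions made for illustration and are not the loss defined in the paper. Antipodal vectors, where the shortest path is not unique, are not handled.

```python
import numpy as np


def normalize(v):
    """Project a vector onto the unit sphere."""
    v = np.asarray(v, dtype=float)
    return v / np.linalg.norm(v)


def geodesic_distance(a, b):
    """Great-circle distance (angle in radians) between two vectors."""
    cos = np.clip(np.dot(normalize(a), normalize(b)), -1.0, 1.0)
    return float(np.arccos(cos))


def slerp(a, b, t):
    """Point at fraction t along the shortest arc from a to b."""
    a, b = normalize(a), normalize(b)
    omega = geodesic_distance(a, b)
    if omega < 1e-8:  # (nearly) identical directions: nothing to interpolate
        return a
    return (np.sin((1 - t) * omega) * a + np.sin(t * omega) * b) / np.sin(omega)


def off_path_penalty(e, a, b):
    """Hypothetical penalty: about zero if e lies on the arc between a and b,
    positive otherwise (triangle inequality on the sphere)."""
    return geodesic_distance(a, e) + geodesic_distance(e, b) - geodesic_distance(a, b)
```

A point produced by slerp between two d-vectors has a penalty of approximately zero, while an embedding that leaves the arc is penalized in proportion to its detour.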

TS-SEP: Joint diarization and separation conditioned on estimated speaker embeddings

Since diarization and source separation of meeting data are closely related tasks, we here propose an approach to perform the two objectives jointly. It builds upon the target-speaker voice activity detection (TS-VAD) diarization approach, which assumes that initial speaker embeddings are available. We replace the final combined speaker activity estimation network of TS-VAD with a network that produces speaker activity estimates at a time-frequency resolution. Those act as masks for source extraction, either via masking or via beamforming. The technique can be applied both for single-channel and multi-channel input and, in both cases, achieves a new state-of-the-art word error rate (WER) on the LibriCSS meeting data recognition task. We further compute speaker-aware and speaker-agnostic WERs to isolate the contribution of diarization errors to the overall WER performance.

Authors

Christoph Boeddeker,Aswin Shanmugam Subramanian,Gordon Wichern,Reinhold Haeb-Umbach,Jonathan Le Roux

Journal

IEEE/ACM Transactions on Audio, Speech, and Language Processing

Published Date

2024/1/8
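The abstract above describes speaker activity estimates at time-frequency resolution that act as masks for source extraction. A minimal sketch of the masking variant follows, assuming a single-channel complex STFT and one real-valued mask per speaker. The array shapes, the function names, and the mean-and-threshold reduction to frame-level activity are assumptions for illustration, not the network or post-processing used in the paper.

```python
import numpy as np


def extract_by_masking(mixture_stft, masks):
    """Apply one time-frequency mask per speaker to a mixture STFT.

    mixture_stft: complex array, shape (frames, bins)
    masks:        real array in [0, 1], shape (speakers, frames, bins)
    returns:      complex array, shape (speakers, frames, bins)
    """
    # Broadcasting multiplies every speaker's mask with the same mixture.
    return masks * mixture_stft[None, :, :]


def frame_activity(masks, threshold=0.5):
    """Hypothetical reduction to diarization output: average each mask
    over frequency and threshold it to get per-frame speaker activity."""
    return masks.mean(axis=-1) > threshold
```

The same masks therefore serve two purposes in this sketch: multiplied with the mixture they give per-speaker signals, and collapsed over frequency they give a who-spoke-when estimate.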

On Word Error Rate Definitions and Their Efficient Computation for Multi-Speaker Speech Recognition Systems

We propose a general framework to compute the word error rate (WER) of ASR systems that process recordings containing multiple speakers at their input and that produce multiple output word sequences (MIMO). Such ASR systems are typically required, e.g., for meeting transcription. We provide an efficient implementation based on a dynamic programming search in a multi-dimensional Levenshtein distance tensor under the constraint that a reference utterance must be matched consistently with one hypothesis output. This also results in an efficient implementation of the ORC WER which previously suffered from exponential complexity. We give an overview of commonly used WER definitions for multi-speaker scenarios and show that they are specializations of the above MIMO WER tuned to particular application scenarios. We conclude with a discussion of the pros and cons of the various WER definitions and a …

Authors

Thilo von Neumann,Christoph Boeddeker,Keisuke Kinoshita,Marc Delcroix,Reinhold Haeb-Umbach

Published Date

2023/6/4
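To make the exponential complexity mentioned in the abstract above concrete, here is a brute-force sketch of the ORC WER under the commonly used definition: every reference utterance is assigned to exactly one hypothesis output stream, utterances on the same stream keep their temporal order, and the assignment with the fewest word errors is chosen. This is not the paper's dynamic-programming algorithm. It enumerates all assignments, and the function names are illustrative.

```python
from itertools import product


def levenshtein(ref, hyp):
    """Word-level edit distance between two lists of words."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,             # deletion
                cur[j - 1] + 1,          # insertion
                prev[j - 1] + (r != h),  # substitution or match
            ))
        prev = cur
    return prev[-1]


def orc_wer_brute_force(ref_utterances, hyp_streams):
    """ref_utterances: non-empty list of strings in temporal order.
    hyp_streams: list of strings, one per output stream."""
    refs = [u.split() for u in ref_utterances]
    hyps = [h.split() for h in hyp_streams]
    total_words = sum(len(r) for r in refs)
    best = None
    # Every utterance may go to any stream: len(hyps) ** len(refs) cases.
    for assignment in product(range(len(hyps)), repeat=len(refs)):
        errors = 0
        for k, hyp in enumerate(hyps):
            merged = [w for r, a in zip(refs, assignment) if a == k for w in r]
            errors += levenshtein(merged, hyp)
        best = errors if best is None else min(best, errors)
    return best / total_words
```

With U reference utterances and C output streams the loop runs C^U times, which is why an efficient formulation matters for full-length meetings.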

Meeting Recognition with Continuous Speech Separation and Transcription-Supported Diarization

We propose a modular pipeline for the single-channel separation, recognition, and diarization of meeting-style recordings and evaluate it on the Libri-CSS dataset. Using a Continuous Speech Separation (CSS) system with a TF-GridNet separation architecture, followed by a speaker-agnostic speech recognizer, we achieve state-of-the-art recognition performance in terms of Optimal Reference Combination Word Error Rate (ORC WER). Then, a d-vector-based diarization module is employed to extract speaker embeddings from the enhanced signals and to assign the CSS outputs to the correct speaker. Here, we propose a syntactically informed diarization using sentence- and word-level boundaries of the ASR module to support speaker turn detection. This results in a state-of-the-art Concatenated minimum-Permutation Word Error Rate (cpWER) for the full meeting recognition pipeline.

Authors

Thilo von Neumann,Christoph Boeddeker,Tobias Cord-Landwehr,Marc Delcroix,Reinhold Haeb-Umbach

Journal

arXiv preprint arXiv:2309.16482

Published Date

2023/9/28
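The cpWER reported in the abstract above can be sketched as follows, under the usual definition: all utterances of a speaker are concatenated, and the permutation of hypothesis speakers that minimizes the total word-level edit distance to the reference speakers is selected. The function names and the dictionary input format are assumptions for illustration. The search over permutations is brute force, so it is only practical for a handful of speakers.

```python
from itertools import permutations


def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two lists of words."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (r != h)))
        prev = cur
    return prev[-1]


def cp_wer(ref_by_speaker, hyp_by_speaker):
    """Both arguments map a speaker label to that speaker's utterances
    (list of strings in temporal order). The reference must contain words."""
    refs = [" ".join(u).split() for u in ref_by_speaker.values()]
    hyps = [" ".join(u).split() for u in hyp_by_speaker.values()]
    # Pad the shorter side with empty transcripts so every speaker has a partner.
    n = max(len(refs), len(hyps))
    refs += [[] for _ in range(n - len(refs))]
    hyps += [[] for _ in range(n - len(hyps))]
    total_words = sum(len(r) for r in refs)
    best = min(
        sum(edit_distance(r, h) for r, h in zip(refs, perm))
        for perm in permutations(hyps)
    )
    return best / total_words
```

Because the speaker labels are permuted as a whole, a word attributed to the wrong speaker counts as an error even if it was recognized correctly, which is how cpWER captures diarization mistakes.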

Mixture Encoder Supporting Continuous Speech Separation for Meeting Recognition

Many real-life applications of automatic speech recognition (ASR) require processing of overlapped speech. A common method involves first separating the speech into overlap-free streams and then performing ASR on the resulting signals. Recently, the inclusion of a mixture encoder in the ASR model has been proposed. This mixture encoder leverages the original overlapped speech to mitigate the effect of artifacts introduced by the speech separation. Previously, however, the method only addressed two-speaker scenarios. In this work, we extend this approach to more natural meeting contexts featuring an arbitrary number of speakers and dynamic overlaps. We evaluate the performance using different speech separators, including the powerful TF-GridNet model. Our experiments show state-of-the-art performance on the LibriCSS dataset and highlight the advantages of the mixture encoder. Furthermore, they demonstrate the strong separation of TF-GridNet which largely closes the gap between previous methods and oracle separation.

Authors

Peter Vieting,Simon Berger,Thilo von Neumann,Christoph Boeddeker,Ralf Schlüter,Reinhold Haeb-Umbach

Journal

arXiv preprint arXiv:2309.08454

Published Date

2023/9/15

Reverberation as Supervision For Speech Separation

This paper proposes reverberation as supervision (RAS), a novel unsupervised loss function for single-channel reverberant speech separation. Prior methods for unsupervised separation required the synthesis of mixtures of mixtures or assumed the existence of a teacher model, making them difficult to consider as potential methods explaining the emergence of separation abilities in an animal’s auditory system. We assume the availability of two-channel mixtures at training time, and train a neural network to separate the sources given one of the channels as input such that the other channel may be predicted from the separated sources. As the relationship between the room impulse responses (RIRs) of each channel depends on the locations of the sources, which are unknown to the network, the network cannot rely on learning that relationship. Instead, our proposed loss function fits each of the separated sources …

Authors

Rohith Aralikatti,Christoph Boeddeker,Gordon Wichern,Aswin Subramanian,Jonathan Le Roux

Published Date

2023/6/4

A Teacher-Student approach for extracting informative speaker embeddings from speech mixtures

We introduce a monaural neural speaker embeddings extractor that computes an embedding for each speaker present in a speech mixture. To allow for supervised training, a teacher-student approach is employed: the teacher computes the target embeddings from each speaker's utterance before the utterances are added to form the mixture, and the student embedding extractor is then tasked to reproduce those embeddings from the speech mixture at its input. The system much more reliably verifies the presence or absence of a given speaker in a mixture than a conventional speaker embedding extractor, and even exhibits comparable performance to a multi-channel approach that exploits spatial information for embedding extraction. Further, it is shown that a speaker embedding computed from a mixture can be used to check for the presence of that speaker in another mixture.

Authors

Tobias Cord-Landwehr,Christoph Boeddeker,Cătălin Zorilă,Rama Doddipatla,Reinhold Haeb-Umbach

Journal

arXiv preprint arXiv:2306.00634

Published Date

2023/6/1

MeetEval: A Toolkit for Computation of Word Error Rates for Meeting Transcription Systems

MeetEval is an open-source toolkit to evaluate all kinds of meeting transcription systems. It provides a unified interface for the computation of commonly used Word Error Rates (WERs), specifically cpWER, ORC WER and MIMO WER along with other WER definitions. We extend the cpWER computation by a temporal constraint to ensure that words are identified as correct only when the temporal alignment is plausible. This leads to a better quality of the matching of the hypothesis string to the reference string that more closely resembles the actual transcription quality, and a system is penalized if it provides poor time annotations. Since word-level timing information is often not available, we present a way to approximate exact word-level timings from segment-level timings (e.g., a sentence) and show that the approximation leads to a similar WER as a matching with exact word-level annotations. At the same time, the time constraint leads to a speedup of the matching algorithm, which outweighs the additional overhead caused by processing the time stamps.

Authors

Thilo von Neumann,Christoph Boeddeker,Marc Delcroix,Reinhold Haeb-Umbach

Journal

arXiv preprint arXiv:2307.11394

Published Date

2023/7/21
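The two ideas in the abstract above, approximating word-level timings from a segment-level interval and allowing a match only when the timing is plausible, can be sketched as below. This is not MeetEval's API. The function names, the character-proportional split, and the collar parameter are assumptions chosen for illustration.

```python
def approximate_word_timings(words, start, end):
    """Split the segment [start, end] into one interval per word, with each
    interval proportional to the word's character count (an assumed heuristic)."""
    total_chars = sum(len(w) for w in words)
    if total_chars == 0:
        return []
    timings = []
    cursor = start
    for w in words:
        duration = (end - start) * len(w) / total_chars
        timings.append((w, cursor, cursor + duration))
        cursor += duration
    return timings


def may_match(ref_interval, hyp_interval, collar=0.0):
    """Temporal constraint: a reference word and a hypothesis word may be
    matched only if their intervals overlap after widening by `collar` seconds."""
    r_start, r_end = ref_interval
    h_start, h_end = hyp_interval
    return r_start < h_end + collar and h_start < r_end + collar
```

A matching algorithm that consults may_match before scoring a word pair can skip pairs that are far apart in time, which is consistent with the speedup the abstract attributes to the time constraint.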