Christoph Boeddeker
Universität Paderborn
H-index: 22
Europe-Germany
Description
Christoph Boeddeker is a researcher at Universität Paderborn with an h-index of 22 overall and 21 since 2020.
His recent articles reflect a diverse array of research interests and contributions to the field:
Geodesic interpolation of frame-wise speaker embeddings for the diarization of meeting scenarios
TS-SEP: Joint diarization and separation conditioned on estimated speaker embeddings
On Word Error Rate Definitions and Their Efficient Computation for Multi-Speaker Speech Recognition Systems
Meeting Recognition with Continuous Speech Separation and Transcription-Supported Diarization
Mixture Encoder Supporting Continuous Speech Separation for Meeting Recognition
Reverberation as Supervision For Speech Separation
A Teacher-Student approach for extracting informative speaker embeddings from speech mixtures
MeetEval: A Toolkit for Computation of Word Error Rates for Meeting Transcription Systems
Professor Information
University | Universität Paderborn |
---|---|
Position | ___ |
Citations(all) | 1532 |
Citations(since 2020) | 1467 |
Cited By | 478 |
hIndex(all) | 22 |
hIndex(since 2020) | 21 |
i10Index(all) | 26 |
i10Index(since 2020) | 26 |
University Profile Page | Universität Paderborn |
Top articles of Christoph Boeddeker
Geodesic interpolation of frame-wise speaker embeddings for the diarization of meeting scenarios
We propose a modified teacher-student training for the extraction of frame-wise speaker embeddings that allows for an effective diarization of meeting scenarios containing partially overlapping speech. To this end, a geodesic distance loss is used that enforces the embeddings computed from regions with two active speakers to lie on the shortest path on a sphere between the points given by the d-vectors of each of the active speakers. Using those frame-wise speaker embeddings in clustering-based diarization outperforms segment-level clustering-based diarization systems such as VBx and Spectral Clustering. By extending our approach to a mixture-model-based diarization, the performance can be further improved, approaching the diarization error rates of diarization systems that use a dedicated overlap detection, and outperforming these systems when also employing an additional overlap detection.
Authors
Tobias Cord-Landwehr,Christoph Boeddeker,Cătălin Zorilă,Rama Doddipatla,Reinhold Haeb-Umbach
Journal
arXiv preprint arXiv:2401.03963
Published Date
2024/1/8
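The abstract of the geodesic interpolation paper above describes embeddings for two-speaker regions lying on the shortest path on a sphere between the two speakers' d-vectors. As a minimal sketch of that geometric idea only (not the authors' loss function or training code), the following uses spherical linear interpolation between two unit vectors. The function names and the toy 3-dimensional vectors are illustrative assumptions.

```python
import numpy as np


def normalize(v):
    """Project a vector onto the unit sphere."""
    return v / np.linalg.norm(v)


def geodesic_distance(u, v):
    """Great-circle distance (angle in radians) between two vectors on the unit sphere."""
    u, v = normalize(u), normalize(v)
    return float(np.arccos(np.clip(np.dot(u, v), -1.0, 1.0)))


def slerp(a, b, t):
    """Point at fraction t along the shortest arc from a to b on the unit sphere."""
    a, b = normalize(a), normalize(b)
    omega = geodesic_distance(a, b)
    # Degenerate cases (identical or antipodal points) have no unique arc;
    # this sketch simply returns the start point.
    if np.isclose(np.sin(omega), 0.0):
        return a
    return (np.sin((1 - t) * omega) * a + np.sin(t * omega) * b) / np.sin(omega)


# Hypothetical "d-vectors" of two speakers (toy 3-D example).
d1 = np.array([1.0, 0.0, 0.0])
d2 = np.array([0.0, 1.0, 0.0])

# A point halfway between both speakers on the geodesic.
mid = slerp(d1, d2, 0.5)
```

A point produced this way stays on the sphere, and its distances to the two endpoints add up to the distance between the endpoints, which is what "lying on the shortest path" means here.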
TS-SEP: Joint diarization and separation conditioned on estimated speaker embeddings
Since diarization and source separation of meeting data are closely related tasks, we here propose an approach to perform the two objectives jointly. It builds upon the target-speaker voice activity detection (TS-VAD) diarization approach, which assumes that initial speaker embeddings are available. We replace the final combined speaker activity estimation network of TS-VAD with a network that produces speaker activity estimates at a time-frequency resolution. Those act as masks for source extraction, either via masking or via beamforming. The technique can be applied both for single-channel and multi-channel input and, in both cases, achieves a new state-of-the-art word error rate (WER) on the LibriCSS meeting data recognition task. We further compute speaker-aware and speaker-agnostic WERs to isolate the contribution of diarization errors to the overall WER performance.
Authors
Christoph Boeddeker,Aswin Shanmugam Subramanian,Gordon Wichern,Reinhold Haeb-Umbach,Jonathan Le Roux
Journal
IEEE/ACM Transactions on Audio, Speech, and Language Processing
Published Date
2024/1/8
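The TS-SEP abstract above mentions speaker activity estimates at time-frequency resolution that act as masks for source extraction. The masking variant can be illustrated with a short sketch, assuming a single-channel STFT of shape (frames, frequency bins) and one mask per speaker. The random data, shapes, and function name are assumptions for illustration; the network that estimates the masks and the beamforming variant are not shown.

```python
import numpy as np

rng = np.random.default_rng(0)
num_speakers, num_frames, num_bins = 2, 4, 3

# Stand-in for the complex STFT of a single-channel mixture.
mixture_stft = (
    rng.standard_normal((num_frames, num_bins))
    + 1j * rng.standard_normal((num_frames, num_bins))
)

# Stand-in for per-speaker time-frequency masks with values in [0, 1).
masks = rng.uniform(0.0, 1.0, size=(num_speakers, num_frames, num_bins))


def extract_by_masking(mixture_stft, masks):
    """Multiply the mixture STFT element-wise with each speaker's mask."""
    return masks * mixture_stft[None, :, :]


# One masked STFT per speaker, shape (speakers, frames, bins).
estimates = extract_by_masking(mixture_stft, masks)
```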
On Word Error Rate Definitions and Their Efficient Computation for Multi-Speaker Speech Recognition Systems
We propose a general framework to compute the word error rate (WER) of ASR systems that process recordings containing multiple speakers at their input and that produce multiple output word sequences (MIMO). Such ASR systems are typically required, e.g., for meeting transcription. We provide an efficient implementation based on a dynamic programming search in a multi-dimensional Levenshtein distance tensor under the constraint that a reference utterance must be matched consistently with one hypothesis output. This also results in an efficient implementation of the ORC WER which previously suffered from exponential complexity. We give an overview of commonly used WER definitions for multi-speaker scenarios and show that they are specializations of the above MIMO WER tuned to particular application scenarios. We conclude with a discussion of the pros and cons of the various WER definitions and a …
Authors
Thilo von Neumann,Christoph Boeddeker,Keisuke Kinoshita,Marc Delcroix,Reinhold Haeb-Umbach
Published Date
2023/6/4
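The WER definitions discussed in the abstract above all build on the word-level Levenshtein distance. As a reference point, here is a minimal sketch of the ordinary single-speaker WER computed by dynamic programming. It is not the multi-dimensional MIMO algorithm from the paper, and the function names are illustrative.

```python
def levenshtein(ref, hyp):
    """Minimum number of substitutions, insertions, and deletions
    needed to turn the word list `ref` into the word list `hyp`."""
    prev = list(range(len(hyp) + 1))  # distances for an empty reference prefix
    for i, r in enumerate(ref, start=1):
        cur = [i]  # deleting the first i reference words
        for j, h in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,             # deletion
                cur[j - 1] + 1,          # insertion
                prev[j - 1] + (r != h),  # substitution or match
            ))
        prev = cur
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    return levenshtein(ref, hyp) / len(ref)


# One substitution ("sat" -> "sit") and one deletion ("the") over six reference words.
example = wer("the cat sat on the mat", "the cat sit on mat")
```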
Meeting Recognition with Continuous Speech Separation and Transcription-Supported Diarization
We propose a modular pipeline for the single-channel separation, recognition, and diarization of meeting-style recordings and evaluate it on the Libri-CSS dataset. Using a Continuous Speech Separation (CSS) system with a TF-GridNet separation architecture, followed by a speaker-agnostic speech recognizer, we achieve state-of-the-art recognition performance in terms of Optimal Reference Combination Word Error Rate (ORC WER). Then, a d-vector-based diarization module is employed to extract speaker embeddings from the enhanced signals and to assign the CSS outputs to the correct speaker. Here, we propose a syntactically informed diarization using sentence- and word-level boundaries of the ASR module to support speaker turn detection. This results in a state-of-the-art Concatenated minimum-Permutation Word Error Rate (cpWER) for the full meeting recognition pipeline.
Authors
Thilo von Neumann,Christoph Boeddeker,Tobias Cord-Landwehr,Marc Delcroix,Reinhold Haeb-Umbach
Journal
arXiv preprint arXiv:2309.16482
Published Date
2023/9/28
Mixture Encoder Supporting Continuous Speech Separation for Meeting Recognition
Many real-life applications of automatic speech recognition (ASR) require processing of overlapped speech. A common method involves first separating the speech into overlap-free streams and then performing ASR on the resulting signals. Recently, the inclusion of a mixture encoder in the ASR model has been proposed. This mixture encoder leverages the original overlapped speech to mitigate the effect of artifacts introduced by the speech separation. Previously, however, the method only addressed two-speaker scenarios. In this work, we extend this approach to more natural meeting contexts featuring an arbitrary number of speakers and dynamic overlaps. We evaluate the performance using different speech separators, including the powerful TF-GridNet model. Our experiments show state-of-the-art performance on the LibriCSS dataset and highlight the advantages of the mixture encoder. Furthermore, they demonstrate the strong separation of TF-GridNet which largely closes the gap between previous methods and oracle separation.
Authors
Peter Vieting,Simon Berger,Thilo von Neumann,Christoph Boeddeker,Ralf Schlüter,Reinhold Haeb-Umbach
Journal
arXiv preprint arXiv:2309.08454
Published Date
2023/9/15
Reverberation as Supervision For Speech Separation
This paper proposes reverberation as supervision (RAS), a novel unsupervised loss function for single-channel reverberant speech separation. Prior methods for unsupervised separation required the synthesis of mixtures of mixtures or assumed the existence of a teacher model, making them difficult to consider as potential methods explaining the emergence of separation abilities in an animal’s auditory system. We assume the availability of two-channel mixtures at training time, and train a neural network to separate the sources given one of the channels as input such that the other channel may be predicted from the separated sources. As the relationship between the room impulse responses (RIRs) of each channel depends on the locations of the sources, which are unknown to the network, the network cannot rely on learning that relationship. Instead, our proposed loss function fits each of the separated sources …
Authors
Rohith Aralikatti,Christoph Boeddeker,Gordon Wichern,Aswin Subramanian,Jonathan Le Roux
Published Date
2023/6/4
A Teacher-Student approach for extracting informative speaker embeddings from speech mixtures
We introduce a monaural neural speaker embeddings extractor that computes an embedding for each speaker present in a speech mixture. To allow for supervised training, a teacher-student approach is employed: the teacher computes the target embeddings from each speaker's utterance before the utterances are added to form the mixture, and the student embedding extractor is then tasked to reproduce those embeddings from the speech mixture at its input. The system much more reliably verifies the presence or absence of a given speaker in a mixture than a conventional speaker embedding extractor, and even exhibits comparable performance to a multi-channel approach that exploits spatial information for embedding extraction. Further, it is shown that a speaker embedding computed from a mixture can be used to check for the presence of that speaker in another mixture.
Authors
Tobias Cord-Landwehr,Christoph Boeddeker,Cătălin Zorilă,Rama Doddipatla,Reinhold Haeb-Umbach
Journal
arXiv preprint arXiv:2306.00634
Published Date
2023/6/1
MeetEval: A Toolkit for Computation of Word Error Rates for Meeting Transcription Systems
MeetEval is an open-source toolkit to evaluate all kinds of meeting transcription systems. It provides a unified interface for the computation of commonly used Word Error Rates (WERs), specifically cpWER, ORC WER and MIMO WER along with other WER definitions. We extend the cpWER computation by a temporal constraint to ensure that words are identified as correct only when the temporal alignment is plausible. This leads to a better quality of the matching of the hypothesis string to the reference string that more closely resembles the actual transcription quality, and a system is penalized if it provides poor time annotations. Since word-level timing information is often not available, we present a way to approximate exact word-level timings from segment-level timings (e.g., a sentence) and show that the approximation leads to a similar WER as a matching with exact word-level annotations. At the same time, the time constraint leads to a speedup of the matching algorithm, which outweighs the additional overhead caused by processing the time stamps.
Authors
Thilo von Neumann,Christoph Boeddeker,Marc Delcroix,Reinhold Haeb-Umbach
Journal
arXiv preprint arXiv:2307.11394
Published Date
2023/7/21
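To make the cpWER mentioned in the MeetEval abstract above concrete, the following brute-force sketch concatenates each speaker's utterances, tries every assignment of hypothesis speakers to reference speakers, and keeps the assignment with the fewest word errors. It is an illustration under simplifying assumptions (equal numbers of reference and hypothesis speakers, no temporal constraint) and does not use or reproduce MeetEval's own interface. All names are hypothetical.

```python
from itertools import permutations


def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two word lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (r != h)))
        prev = cur
    return prev[-1]


def cpwer(reference, hypothesis):
    """Concatenated minimum-permutation WER.

    `reference` and `hypothesis` map a speaker label to that speaker's
    list of utterance strings. Assumes both have the same number of speakers.
    """
    ref_streams = [" ".join(utts).split() for utts in reference.values()]
    hyp_streams = [" ".join(utts).split() for utts in hypothesis.values()]
    if len(ref_streams) != len(hyp_streams):
        raise ValueError("this sketch requires equal numbers of speakers")
    total_words = sum(len(r) for r in ref_streams)
    # Try every speaker assignment and keep the one with the fewest errors.
    best_errors = min(
        sum(edit_distance(r, h) for r, h in zip(ref_streams, perm))
        for perm in permutations(hyp_streams)
    )
    return best_errors / total_words


reference = {"A": ["hello world", "how are you"], "B": ["fine thanks"]}
hypothesis = {"spk1": ["fine thank"], "spk2": ["hello world how are you"]}

# Best assignment is A -> spk2, B -> spk1 with one substitution over seven words.
score = cpwer(reference, hypothesis)
```

Enumerating all permutations grows factorially with the number of speakers, so this form is only practical for small meetings.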
Professor FAQs
What is Christoph Boeddeker's h-index at Universität Paderborn?
Christoph Boeddeker's h-index is 22 overall and 21 since 2020.
What are Christoph Boeddeker's top articles?
Christoph Boeddeker's top articles at Universität Paderborn include:
Geodesic interpolation of frame-wise speaker embeddings for the diarization of meeting scenarios
TS-SEP: Joint diarization and separation conditioned on estimated speaker embeddings
On Word Error Rate Definitions and Their Efficient Computation for Multi-Speaker Speech Recognition Systems
Meeting Recognition with Continuous Speech Separation and Transcription-Supported Diarization
Mixture Encoder Supporting Continuous Speech Separation for Meeting Recognition
Reverberation as Supervision For Speech Separation
A Teacher-Student approach for extracting informative speaker embeddings from speech mixtures
MeetEval: A Toolkit for Computation of Word Error Rates for Meeting Transcription Systems
...
What is Christoph Boeddeker's total number of citations?
Christoph Boeddeker has 1,532 citations in total.
Who are Christoph Boeddeker's co-authors?
Christoph Boeddeker's co-authors include Reinhold Haeb-Umbach, Joerg Schmalenstroeer, Thilo von Neumann, and Jens Heitkaemper.