Local normalization and delayed decision making in speaker detection and tracking


Reference

Title: Local normalization and delayed decision making in speaker detection and tracking

Author(s): Johan Koolwaaij & Lou Boves

Reference: Special Issue of Digital Signal Processing: A Review Journal on the NIST Speaker Recognition Workshop, Volume 10, Number 1-3, pp. 113-132

Keywords: Speaker Recognition

Abstract

This paper describes A2RT's speaker detection and tracking system and its performance on the 1999 NIST speaker recognition evaluation data. The system is not a chain of modules (for instance silence-speech detection, then handset and gender detection, and finally speaker detection or tracking) in which each module builds on the hard decisions of its predecessors. Instead, it applies the principle of delayed decision making and postpones all hard decisions until the final stage of the detection process. The paper focuses on two locality issues that matter when detecting or tracking speakers in a telephone conversation, where the speaker change frequency is usually high. First, channel estimation needs segments that are sufficiently long yet homogeneous; several kinds of local channel normalization are compared. Second, local estimation of speaker likelihoods depends critically on the segmentation of the conversation.

Our experiments show that a global level of segmentation improves speaker tracking performance, whereas a more detailed segmentation is needed for speaker detection, because likelihood computation over clusters of segments depends on the purity of the segments. Furthermore, choosing the appropriate type of channel normalization can give a small but consistent improvement in speaker tracking performance.
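
Illustration (not taken from the paper): the abstract does not say which local channel normalizations were compared. As a minimal sketch of the general idea, the snippet below assumes one common form, mean subtraction over a sliding window of feature frames. The function name `local_mean_normalize` and the parameter `half_window` are hypothetical and chosen only for this example.

```python
from typing import List, Sequence


def local_mean_normalize(
    frames: Sequence[Sequence[float]], half_window: int
) -> List[List[float]]:
    """Subtract from each frame the mean of the frames within
    +/- half_window positions of it (window truncated at the edges).

    Assumption for illustration only: a sliding-window mean is used as
    the local channel estimate; the paper may use other schemes.
    """
    if half_window < 0:
        raise ValueError("half_window must be non-negative")
    n = len(frames)
    normalized: List[List[float]] = []
    for t in range(n):
        lo = max(0, t - half_window)
        hi = min(n, t + half_window + 1)
        window = frames[lo:hi]
        dim = len(frames[t])
        # Per-dimension mean over the local window: the channel estimate.
        mean = [sum(f[d] for f in window) / len(window) for d in range(dim)]
        normalized.append([frames[t][d] - mean[d] for d in range(dim)])
    return normalized


# Toy input: the last two frames carry a large constant offset,
# standing in for a change of speaker or channel.
toy_frames = [[1.0, 2.0], [3.0, 4.0], [11.0, 12.0], [13.0, 14.0]]
local_result = local_mean_normalize(toy_frames, half_window=1)
global_result = local_mean_normalize(toy_frames, half_window=len(toy_frames))
```

With a small `half_window` the estimate follows local conditions but rests on few frames; with a very large one it reduces to a single global mean that mixes frames from both sides of a speaker change. This is the trade-off between segment length and homogeneity that the abstract describes.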