Authors:
Reinhold Haeb-Umbach,
Volume: 1, Page (NA) Paper number 1513
Abstract:
We apply Fisher variate analysis to measure the effectiveness of speaker
normalization techniques. A trace criterion, which measures the ratio
of the variations due to different phonemes compared to variations
due to different speakers, serves as a first assessment of a feature
set without the need for recognition experiments. By using this measure
and by recognition experiments we demonstrate that cepstral mean normalization
also has a speaker normalization effect, in addition to the well-known
channel normalization effect. Similarly, vocal tract normalization (VTN)
is shown to remove inter-speaker variability. For VTN we show that
normalization on a per-sentence basis performs better than normalization
on a per-speaker basis. Recognition results are given on the Wall Street
Journal and Hub-4 databases.
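
The trace criterion mentioned in this abstract can be illustrated with a
minimal sketch. The abstract does not give the exact definition, so the
code below assumes a diagonal simplification of the Fisher criterion
tr(S_w^-1 S_b): per feature dimension, the variance of the phoneme means
is divided by the variance remaining inside each phoneme class (which
includes speaker variation), and the ratios are summed. The function
names and the data layout are hypothetical.

```python
def mean_vector(vectors):
    """Component-wise mean of a list of equal-length feature vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def diagonal_trace_criterion(frames_by_phoneme):
    """Diagonal approximation of tr(S_w^-1 S_b).

    frames_by_phoneme maps a phoneme label to a list of feature vectors
    pooled over all speakers (assumed layout, not from the paper).
    """
    all_frames = [v for vs in frames_by_phoneme.values() for v in vs]
    total = len(all_frames)
    global_mean = mean_vector(all_frames)
    dim = len(global_mean)

    between = [0.0] * dim  # spread of phoneme means around the global mean
    within = [0.0] * dim   # spread of frames around their phoneme mean
    for vectors in frames_by_phoneme.values():
        class_mean = mean_vector(vectors)
        weight = len(vectors) / total
        for d in range(dim):
            between[d] += weight * (class_mean[d] - global_mean[d]) ** 2
            within[d] += sum((v[d] - class_mean[d]) ** 2 for v in vectors) / total

    # Assumes every dimension has non-zero within-class variance.
    return sum(b / w for b, w in zip(between, within))
```

On one-dimensional toy data with phoneme means 1 and 11, shrinking the
within-phoneme spread from +/-1 to +/-0.5 raises the value from 25 to
100, which is the kind of change a speaker normalization step would be
expected to produce.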
Authors:
Seung Ho Choi, Dept. of Electrical Eng., Korea Advanced Institute of Science and Technology, 373-1 Kusong-Dong, Yusong-Ku, Taejon 305-701, Korea (Korea)
Hong Kook Kim, AT&T Labs Research, Rm. E148, 180 Park Avenue, Florham Park NJ 07932, USA (USA)
Hwang Soo Lee, Central Research Laboratory, SK Telecom, 58-4 Hwaam-Dong, Yusong-Gu, Taejon 305-348, Korea (Korea)
Volume: 1, Page (NA) Paper number 1331
Abstract:
In digital communication networks, a speech recognition system extracts
feature parameters after reconstructing speech signals. In this paper,
we consider a useful approach of incorporating speech coding parameters
into a speech recognizer. Most speech coders employ line spectrum pairs
(LSPs) to represent spectral parameters. We introduce weighted distance
measures to improve the recognition performance of an LSP-based speech
recognizer. Experiments on speaker-independent connected-digit recognition
showed that weighted distance measures provide better recognition accuracy
than unweighted distance measures do. Compared with a conventional
method employing mel-frequency cepstral coefficients, the proposed
method achieved higher recognition accuracy.
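
A weighted LSP distance of the kind described above can be sketched as
follows. The abstract does not state which weights the authors use; this
example assumes inverse harmonic mean weighting, a scheme known from LSP
quantization that emphasizes closely spaced LSPs, purely for
illustration. The function names are hypothetical.

```python
import math


def ihm_weights(lsp):
    """Inverse harmonic mean weights for LSP frequencies in radians.

    Assumes 0 < lsp[0] < lsp[1] < ... < pi. Closely spaced LSPs, which
    tend to mark spectral peaks, receive larger weights.
    """
    padded = [0.0] + list(lsp) + [math.pi]
    return [
        1.0 / (padded[i] - padded[i - 1]) + 1.0 / (padded[i + 1] - padded[i])
        for i in range(1, len(padded) - 1)
    ]


def weighted_lsp_distance(test_lsp, ref_lsp, weights=None):
    """Weighted squared Euclidean distance between two LSP vectors.

    With weights=None this reduces to the unweighted distance.
    """
    if weights is None:
        weights = [1.0] * len(test_lsp)
    return sum(w * (t - r) ** 2 for w, t, r in zip(weights, test_lsp, ref_lsp))
```

A recognizer could call weighted_lsp_distance(frame, template,
ihm_weights(frame)) in place of the unweighted distance when scoring a
test frame against a reference.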
Authors:
Chun-Ping Chan, Department of Electronic Engineering, The Chinese University of Hong Kong (Hong Kong)
Yiu-Wing Wong, Department of Electronic Engineering, The Chinese University of Hong Kong (Hong Kong)
Tan Lee, Department of Electronic Engineering, The Chinese University of Hong Kong (Hong Kong)
Pak-Chung Ching, Department of Electronic Engineering, The Chinese University of Hong Kong (Hong Kong)
Volume: 1, Page (NA) Paper number 2261
Abstract:
This paper describes a novel approach of using multi-resolution analysis
(MRA) for automatic speech recognition. Two-dimensional MRA is applied
to the short-time log spectrum of the speech signal to extract the slowly
varying spectral envelope that contains the most important articulatory
and phonetic information. After passing through a standard cepstral
analysis process, the MRA features are used for speech recognition
in the same way as conventional short-time features like MFCCs, PLPs,
etc. Preliminary experiments on both clean connected speech and noisy
telephone conversation speech show that the use of MRA cepstra results
in a significant reduction in insertion error when compared with MFCCs.
Authors:
Rathinavelu Chengalvarayan,
Volume: 1, Page (NA) Paper number 2257
Abstract:
In this paper, a new approach for linear prediction (LP) analysis is
explored, where the predictor is computed from mel-warped subband-based
autocorrelation functions obtained from the power spectrum. For spectral
representation, a set of multi-resolution cepstral features is proposed.
The general idea is to divide up the full frequency-band into several
subbands, perform the IDFT on the mel power spectrum for each subband,
followed by Durbin's algorithm and the standard conversion from LP
to cepstral coefficients. This approach can be extended to several
levels of different resolutions. Multi-resolution feature vectors, formed
by concatenation of the subband cepstral features into an extended
feature vector, are shown to yield better performance than the conventional
mel-warped LPCCs over the full voice-bandwidth for a connected digit
recognition task.
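
The processing chain named in this abstract (subband split, inverse DFT
of the power spectrum, Durbin's algorithm, LP-to-cepstrum conversion,
concatenation) can be sketched as below. Several details are assumptions
rather than the author's method: the input is taken to be an already
mel-warped one-sided power spectrum, subbands are equal-width slices,
each slice is treated as if it spanned 0 to pi on its own, and the
inverse DFT is written in cosine form. All function names are
hypothetical.

```python
import math


def autocorr_from_power(power, order):
    """Cosine-form inverse DFT of a one-sided power spectrum slice."""
    m = len(power)
    return [
        sum(p * math.cos(math.pi * k * (i + 0.5) / m) for i, p in enumerate(power)) / m
        for k in range(order + 1)
    ]


def levinson_durbin(r, order):
    """Durbin's recursion; returns predictor a[1..order] and residual energy.

    Assumes r[0] > 0 (the slice carries some energy).
    """
    a = [0.0] * (order + 1)
    err = r[0]
    for i in range(1, order + 1):
        k = (r[i] - sum(a[j] * r[i - j] for j in range(1, i))) / err
        new_a = a[:]
        new_a[i] = k
        for j in range(1, i):
            new_a[j] = a[j] - k * a[i - j]
        a = new_a
        err *= 1.0 - k * k
    return a[1:], err


def lpc_to_cepstrum(a, n_ceps):
    """Standard recursion from predictor coefficients to cepstra c[1..n_ceps]."""
    p = len(a)
    c = []
    for n in range(1, n_ceps + 1):
        value = a[n - 1] if n <= p else 0.0
        for k in range(max(1, n - p), n):
            value += (k / n) * c[k - 1] * a[n - k - 1]
        c.append(value)
    return c


def subband_cepstra(power, n_subbands, order, n_ceps):
    """LP cepstra of each equal-width subband, concatenated."""
    width = len(power) // n_subbands
    features = []
    for b in range(n_subbands):
        band = power[b * width:(b + 1) * width]
        a, _ = levinson_durbin(autocorr_from_power(band, order), order)
        features.extend(lpc_to_cepstrum(a, n_ceps))
    return features


def multi_resolution_cepstra(power, levels, order, n_ceps):
    """Concatenate subband cepstra over several resolution levels, e.g. (1, 2)."""
    features = []
    for n_subbands in levels:
        features.extend(subband_cepstra(power, n_subbands, order, n_ceps))
    return features
```

With levels=(1, 2) the extended vector holds the full-band cepstra
followed by the cepstra of the lower and upper half-bands.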
Authors:
Douglas O'Shaughnessy,
Hesham Tolba,
Volume: 1, Page (NA) Paper number 1672
Abstract:
In this paper, we show that Voiced-Unvoiced (V-U) classification
of speech sounds is useful not only in speech analysis and
speech enhancement, but also in recognition.
Specifically, incorporating such a classification in a
continuous speech recognition (CSR) system not only improves its performance
in low SNR environments, but also reduces the time and
memory needed for recognition. The proposed V-U
classification of the speech sounds has two principal functions: (1)
it allows the enhancement of the voiced and unvoiced parts of speech
separately; (2) it limits the Viterbi search space, and consequently
the process of recognition can be carried out in real time without
degrading the performance of the system. We show experimentally that
such a system outperforms the baseline HTK when a V-U decision is included
in both front- and far-end of the HTK-based recognizer.
Authors:
Jhing-Fa Wang, Department of Electrical Engineering & Department of Information Engineering, National Cheng Kung University, Tainan, Taiwan, R.O.C. (Taiwan)
Shi-Huang Chen, Department of Electrical Engineering, National Cheng Kung University, Tainan, Taiwan, R.O.C. (Taiwan)
Volume: 1, Page (NA) Paper number 1261
Abstract:
This paper proposes a new consonant/vowel (C/V) segmentation algorithm
for Mandarin speech signals. Since the Mandarin phoneme structure is
a combination of a consonant (may be null) followed by a vowel,
C/V segmentation is an important part of a Mandarin speech recognition
system. Based on the wavelet transform, the proposed method can directly
search for the C/V segmentation point by using a product function and
energy profile. The product function is generated from the appropriate
wavelet and scaling coefficients of the input speech signal and it can
be applied to indicate the C/V segmentation point. With this product
function and the additional verification of the energy profile, the C/V
segmentation point can be located accurately with low computational
complexity. Experiments are provided that demonstrate the superior
performance of the proposed algorithm. An overall accuracy rate of
97.2% is achieved. This algorithm is suitable for Mandarin speech recognition
tasks.
Authors:
Tsuneo Nitta,
Volume: 1, Page (NA) Paper number 2298
Abstract:
This paper describes an attempt to extract multiple topological structures,
hidden in time-spectrum (TS) patterns, by using multiple mapping functions,
and to incorporate the functions into the feature extractor of a speech
recognition system. In previous work, the author proposed a novel
feature extraction method based on MAFP/KLT (MAFP: multiple acoustic
feature planes), in which 3*3 derivative operators were used for mapping
functions, and showed that the method achieved significant improvement
in preliminary experiments. In this paper, the mapping functions
are first extracted directly in the form of a 3*3 orthogonal basis from a
speech database. Next, the functions are evaluated, together with 3*3
simplified operators modeled on the orthogonal basis. Finally, after
comparing the experimental results, the author proposes an effective
feature extraction method based on MAFP/LDA, in which a Sobel operator
is used for mapping functions.
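
Using a 3*3 Sobel operator as a mapping function on a time-spectrum
pattern can be sketched as follows. This is only an illustration of the
operator itself: the pattern layout (rows as frequency channels, columns
as time frames), the skipping of border cells, and the function names
are assumptions, and the later LDA stage of MAFP/LDA is not shown.

```python
# Standard 3*3 Sobel kernels: one differentiates along time (columns),
# the other along frequency (rows).
SOBEL_TIME = [[-1, 0, 1],
              [-2, 0, 2],
              [-1, 0, 1]]
SOBEL_FREQ = [[-1, -2, -1],
              [0, 0, 0],
              [1, 2, 1]]


def apply_3x3(pattern, kernel):
    """Slide a 3*3 kernel over a 2-D pattern, skipping the border cells.

    pattern[f][t] is assumed to hold the spectral value of frequency
    channel f at time frame t. The result is two rows and two columns
    smaller than the input.
    """
    rows, cols = len(pattern), len(pattern[0])
    out = []
    for r in range(1, rows - 1):
        out_row = []
        for c in range(1, cols - 1):
            out_row.append(sum(
                kernel[i][j] * pattern[r - 1 + i][c - 1 + j]
                for i in range(3) for j in range(3)
            ))
        out.append(out_row)
    return out


def acoustic_feature_planes(pattern):
    """Map one TS pattern to two derivative planes (time and frequency)."""
    return apply_3x3(pattern, SOBEL_TIME), apply_3x3(pattern, SOBEL_FREQ)
```

For a pattern that rises linearly with time and is constant across
frequency, the time plane is constant and the frequency plane is zero,
so each plane isolates one direction of spectral change.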
Authors:
Partha Niyogi, Bell Labs, Lucent Technologies, USA. (USA)
Chris Burges, Bell Labs, Lucent Technologies, USA. (USA)
Padma Ramesh, Bell Labs, Lucent Technologies, USA. (USA)
Volume: 1, Page (NA) Paper number 1995
Abstract:
An important aspect of distinctive feature based approaches to automatic
speech recognition is the formulation of a framework for robust detection
of these features. We discuss the application of support vector
machines (SVMs), which arise when the structural risk minimization principle
is applied to such feature detection problems. In particular, we describe
the problem of detecting stop consonants in continuous speech and discuss
an SVM framework for detecting these sounds. In this paper we use both
linear and nonlinear SVMs for stop detection and present experimental
results to show that they perform better than a cepstral-feature-based
hidden Markov model (HMM) system on the same task.