

Robust Features for Automatic Text-Independent Speaker Recognition using Ergodic Hidden Markov Models (HMMs)
Abstract
In this paper, robust features for text-independent speaker recognition are explored. Experimental studies demonstrate that these robust features capture speaker-specific information effectively when modeled with ergodic hidden Markov models (HMMs). A study of the effect of feature vector size shows that, for a speech signal sampled at 16 kHz, a feature vector size in the range of 18-22 captures speaker-specific information effectively. It is also established that the proposed speaker recognition system using robust features requires significantly less data for both training and testing. Finally, speaker recognition with the robust features is studied for different numbers of mixture components and for different training and test durations. The speaker recognition studies are demonstrated on the TIMIT database.
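To illustrate how closed-set speaker identification with ergodic HMMs can be scored, the following is a minimal sketch and not the authors' implementation. It assumes one diagonal-covariance Gaussian per state, randomly initialised toy models in place of models trained on TIMIT, and synthetic 20-dimensional frames (a size inside the 18-22 range stated above). All function and variable names are hypothetical.

```python
import numpy as np


def log_gaussian_diag(x, means, variances):
    """Log density of one frame under each state's diagonal Gaussian.

    x: (dim,), means and variances: (n_states, dim). Returns (n_states,).
    """
    diff = x - means
    return -0.5 * np.sum(np.log(2.0 * np.pi * variances) + diff ** 2 / variances, axis=1)


def forward_log_likelihood(frames, log_pi, log_A, means, variances):
    """log P(frames | model) computed with the forward algorithm in the log domain."""
    log_alpha = log_pi + log_gaussian_diag(frames[0], means, variances)
    for x in frames[1:]:
        # scores[i, j] = log_alpha[i] + log A[i, j]; sum over i with log-sum-exp
        scores = log_alpha[:, None] + log_A
        m = scores.max(axis=0)
        log_alpha = m + np.log(np.exp(scores - m).sum(axis=0))
        log_alpha = log_alpha + log_gaussian_diag(x, means, variances)
    m = log_alpha.max()
    return m + np.log(np.exp(log_alpha - m).sum())


def make_ergodic_model(n_states, dim, center, rng):
    """Toy ergodic model: every transition probability is strictly positive,
    so each state can be reached from every other state in one step."""
    A = rng.random((n_states, n_states)) + 0.1
    A /= A.sum(axis=1, keepdims=True)
    pi = np.full(n_states, 1.0 / n_states)
    means = center + rng.normal(scale=0.5, size=(n_states, dim))
    variances = np.ones((n_states, dim))
    return np.log(pi), np.log(A), means, variances


def identify(frames, models):
    """Return the name of the model with the highest log-likelihood, plus all scores."""
    scores = {name: forward_log_likelihood(frames, *m) for name, m in models.items()}
    return max(scores, key=scores.get), scores


rng = np.random.default_rng(0)
DIM = 20  # assumed feature vector size, inside the 18-22 range

# Two hypothetical speakers whose state means sit in different regions.
models = {
    "speaker_a": make_ergodic_model(3, DIM, center=-2.0, rng=rng),
    "speaker_b": make_ergodic_model(3, DIM, center=2.0, rng=rng),
}

# Synthetic test utterance drawn near speaker_b's region.
test_frames = rng.normal(loc=2.0, scale=1.0, size=(50, DIM))
best, scores = identify(test_frames, models)
print(best)
```

In a real system the model parameters would be estimated from each speaker's training features (for example with Baum-Welch re-estimation), and the frames would come from a feature extraction front end rather than a random generator.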

This work is licensed under a Creative Commons Attribution 3.0 License.