A Novel Algorithm for Acoustic and Visual Classifiers Decision Fusion in Audio-Visual Speech Recognition System
R. Rajavel, P.S. Sathidevi
Pages - 23 - 37     |    Revised - 25-02-2010     |    Published - 26-03-2010
Volume - 4   Issue - 1    |    Publication Date - March 2010
Audio-visual speech recognition, Reliability-ratio based weight optimization, late integration
Audio-visual speech recognition (AVSR) using acoustic and visual signals of speech has received attention recently because of its robustness in noisy environments. Perceptual studies also support this approach by emphasizing the importance of visual information for speech recognition in humans. An important issue in decision-fusion-based AVSR systems is how to obtain an appropriate integration weight for the speech modalities, so that the combined AVSR system performs better than the audio-only and visual-only systems under various noise conditions. To solve this issue, we present a genetic algorithm (GA) based optimization scheme to obtain the appropriate integration weight from the relative reliability of each modality. The performance of the proposed GA-optimized, reliability-ratio-based weight estimation scheme is demonstrated via single-speaker, mobile-functions isolated word recognition experiments. The results show that the proposed scheme improves recognition accuracy over the conventional unimodal systems and the baseline reliability-ratio-based AVSR system under various signal-to-noise ratio conditions.
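As an illustration of the decision-fusion (late integration) step the abstract refers to, the following is a minimal sketch of the commonly used weighted combination of per-class log-likelihoods from an audio classifier and a visual classifier. It is not taken from the paper: the function name `fuse_decisions`, the example scores, and the convention that a single weight in [0, 1] multiplies the audio stream (with one minus that weight on the visual stream) are assumptions made for illustration. The paper's reliability measure and its GA-based mapping from reliability to weight are not reproduced here.

```python
def fuse_decisions(audio_loglik, visual_loglik, weight):
    """Combine per-class log-likelihoods from two classifiers.

    audio_loglik, visual_loglik: sequences with one score per word class
        (hypothetical outputs of an audio-only and a visual-only recognizer).
    weight: integration weight in [0, 1] applied to the audio stream;
        (1 - weight) is applied to the visual stream. This convention is
        an assumption of this sketch.
    Returns (index of the winning class, list of combined scores).
    """
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    if len(audio_loglik) != len(visual_loglik):
        raise ValueError("both classifiers must score the same classes")

    # Weighted sum of the two modality scores for each class.
    combined = [
        weight * a + (1.0 - weight) * v
        for a, v in zip(audio_loglik, visual_loglik)
    ]
    # The recognized word is the class with the highest combined score.
    best = max(range(len(combined)), key=combined.__getitem__)
    return best, combined


# Hypothetical scores for three word classes.
audio_scores = [-10.0, -12.0, -15.0]
visual_scores = [-9.0, -5.0, -11.0]
winner, scores = fuse_decisions(audio_scores, visual_scores, 0.5)
```

With a weight near 1 the decision follows the audio classifier, and with a weight near 0 it follows the visual classifier. Choosing the weight according to how reliable each modality is under the current noise level is the problem the abstract addresses.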
CITED BY (3)  
1 Stewart, D., Seymour, R., Pass, A., & Ming, J. (2014). Robust audio-visual speech recognition under noisy audio-video conditions. Cybernetics, IEEE Transactions on, 44(2), 175-184.
2 Shaikh, A. A., Kumar, D. K., & Gubbi, J. (2013). Automatic visual speech segmentation and recognition using directional motion history images and Zernike moments. The Visual Computer, 29(10), 969-982.
3 Chaudhary, K. (2012). Joint Error Optimization Algorithms for Multimodal Information Fusion.
Mr. Rajavel
- India
Professor P.S. Sathidevi
NIT Calicut - India
