Mispronunciation Detection and Diagnosis
Introduction
Mispronunciation Detection and Diagnosis (MDD) is a subfield of speech processing and computer-assisted language learning (CALL) that aims to automatically detect and analyze pronunciation errors in spoken language, particularly for second-language learners. With the rapid advancement of machine learning and deep learning techniques, MDD has achieved significant progress in developing systems that provide accurate, fine-grained feedback on learners’ pronunciation, contributing to more effective and personalized language learning experiences.
Applications of MDD include computer-assisted pronunciation training, language assessment, intelligent tutoring systems, and speech-enabled educational platforms. These systems can identify mispronounced phonemes, stress and intonation errors, and offer diagnostic feedback to help learners improve their spoken proficiency.
Our research group focuses on exploiting machine learning and deep learning approaches, in combination with acoustic-phonetic features and linguistic knowledge, to develop high-performance MDD systems. We also investigate methods for constructing high-quality annotated pronunciation datasets and for building robust models that generalize well across speakers, accents, proficiency levels, and languages.
Contact: Dr. Nguyen Thi Thu Trang | ✉️ trangntt@soict.hust.edu.vn
Research Directions
- Effective Use of Linguistic Information in MDD: This research direction focuses on enhancing the learning capability of mispronunciation detection and diagnosis systems by effectively exploiting linguistic and accent-related information. In particular, we investigate how accent-specific pronunciation biases can be leveraged to compensate for and balance mispronunciation patterns from other accents, thereby improving model robustness. To address this challenge, we explore ensemble learning strategies trained on multi-accent datasets, where each component model captures accent-dependent characteristics while the ensemble integrates complementary knowledge across accents. In addition, we augment existing English speech corpora (e.g., TIMIT, L2-ARCTIC) by explicitly incorporating accent information into the training data, enabling accent-aware modeling and better generalization across diverse learner populations.
- Graph-Based Phonetic Knowledge Modeling: This research direction explores the use of graph-based learning to explicitly model phonetic knowledge for mispronunciation detection and diagnosis. We investigate Graph Convolutional Networks (GCNs) to learn structured relationships among phonemes, such as articulatory similarity, phonological features, and confusion patterns observed in non-native speech. By representing phonetic knowledge as graphs, we aim to enhance the model’s internal representations and improve its ability to detect and diagnose systematic pronunciation errors. This direction emphasizes representation engineering, focusing on how graph-informed embeddings can be integrated with acoustic models to provide more linguistically grounded and interpretable MDD systems.
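The ensemble strategy for multi-accent MDD can be illustrated with a minimal sketch. This is not the group's actual system: the tiny phoneme inventory, the averaging of per-accent phoneme posteriors, and the frame-level comparison against canonical phonemes are all simplifying assumptions made here for illustration.

```python
import numpy as np

# Hypothetical phoneme inventory (a tiny subset for illustration)
PHONEMES = ["ae", "eh", "ih", "iy"]

def ensemble_posteriors(accent_posteriors, weights=None):
    """Combine per-accent phoneme posteriors by (weighted) averaging.

    accent_posteriors: list of arrays, each (n_frames, n_phonemes),
    one per accent-specific component model.
    """
    stacked = np.stack(accent_posteriors)          # (n_models, n_frames, n_phonemes)
    if weights is None:
        weights = np.full(len(accent_posteriors), 1.0 / len(accent_posteriors))
    return np.tensordot(weights, stacked, axes=1)  # (n_frames, n_phonemes)

def detect_mispronunciations(posteriors, canonical):
    """Flag frames where the ensemble's top phoneme differs from the canonical one."""
    predicted = [PHONEMES[i] for i in posteriors.argmax(axis=1)]
    return [(p, c, p != c) for p, c in zip(predicted, canonical)]

# Two accent-specific models disagree on a frame; the ensemble resolves it.
model_a = np.array([[0.7, 0.1, 0.1, 0.1]])   # leans toward "ae"
model_b = np.array([[0.2, 0.6, 0.1, 0.1]])   # leans toward "eh"
combined = ensemble_posteriors([model_a, model_b])
diagnosis = detect_mispronunciations(combined, canonical=["eh"])
```

In a real system, each component model would be a neural acoustic model trained on one accent's data, and the weights could themselves be learned; soft-voting over posteriors is just one of several fusion schemes.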
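The graph-based direction above rests on a standard GCN propagation rule: phoneme embeddings are smoothed over a graph whose edges encode phonetic relations. The sketch below, written under simplifying assumptions (a four-phoneme graph with edges connecting phonemes that share a place of articulation, a single NumPy layer rather than a trained network), shows the symmetric-normalization update H' = ReLU(D^-1/2 (A+I) D^-1/2 · H · W) that such models use.

```python
import numpy as np

PHONEMES = ["p", "b", "t", "d"]
# Illustrative adjacency: edges connect phonemes sharing place of articulation
# (bilabial pair p/b, alveolar pair t/d) -- a toy stand-in for a real phonetic graph.
A = np.array([[0, 1, 0, 0],
              [1, 0, 0, 0],
              [0, 0, 0, 1],
              [0, 0, 1, 0]], dtype=float)

def gcn_layer(H, A, W):
    """One graph-convolution layer over phoneme node features H."""
    A_hat = A + np.eye(A.shape[0])            # add self-loops
    d = A_hat.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    A_norm = D_inv_sqrt @ A_hat @ D_inv_sqrt  # symmetric normalization
    return np.maximum(A_norm @ H @ W, 0.0)    # ReLU activation

# One-hot phoneme features propagated through the graph; in an MDD system these
# graph-informed embeddings would condition or augment the acoustic model.
H = np.eye(len(PHONEMES))
W = np.ones((len(PHONEMES), 2))              # untrained weights, for illustration
embeddings = gcn_layer(H, A, W)              # (4, 2) graph-smoothed embeddings
```

After one layer, each phoneme's embedding mixes information from its articulatory neighbors, which is the mechanism by which confusion patterns and phonological similarity can be injected into the model's representations.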
Members
Ha Viet Khanh
Team Leader
Tran Tien Dat
Researcher
Chu Hoang Viet
Researcher
Nguyen Hoang Lam
Researcher
Publications
- Huu, T.T., Ha, V.K., Tran, T.D., Vu, H., Thien, V.L., Nguyen, T.C., Nguyen, T.T.T. "Mispronunciation Detection and Diagnosis Without Model Training: A Retrieval-Based Approach". ICASSP 2026 - 2026 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
- Huu Tuong Tu, Huan Vu, Cuong Tien Nguyen, Dien Hy Ngo, and Nguyen Thi Thu Trang. 2025. O_O-VC: Synthetic Data-Driven One-to-One Alignment for Any-to-Any Voice Conversion. In Findings of the Association for Computational Linguistics: EMNLP 2025, pages 16197–16208, Suzhou, China. Association for Computational Linguistics.
- Huu, T.T., Pham, V.T., Nguyen, T.T.T., Dao, T.L. (2023). "Mispronunciation Detection and Diagnosis Model for Tonal Language, Applied to Vietnamese". Proc. Interspeech 2023, 1014-1018.