Welcome to Zhihao Du（杜志浩）’s homepage.

Highlights

CosyVoice series have reached stars
CosyVoice 3: Towards In-the-wild Speech Generation via Scaling-up and Post-training [Code][Paper][Demos]
MinMo: A Multimodal Large Language Model for Seamless Voice Interaction [Code][Paper][Demos]
CosyVoice 2: Scalable Streaming Speech Synthesis model has been open-sourced: [Code][Paper][Demos]
CosyVoice: Large-scale Multilingual Zero-shot Text-to-speech Synthesizer has been open-sourced: [Code][Paper][Demos]
FunAudioLLM has been released at: [Code][Paper]

About（关于）

My research interests include speech generative models (e.g., the CosyVoice series and InspireMusic), multi-modal large language models (e.g., LauraGPT and MinMo), speech processing, and deep learning. I received the Ph.D. degree from the School of Computer Science and Technology at Harbin Institute of Technology in 2021, under the supervision of Jiqing Han. I received the B.E. degree in software engineering from the College of Software of Inner Mongolia University in 2015, under the supervision of Xueliang Zhang. Last, but certainly not least, I'd like to thank my wonderful wife for her understanding and support. [About Me]

Publications（出版物）

(Note: Most of my papers can be found on arXiv.)

Journal Papers（期刊论文）

Zhihao Du, Xueliang Zhang, Jiqing Han. A joint framework of denoising autoencoder and generative vocoder for monaural speech enhancement. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2020. View Demos

Conference Papers（会议论文）

Guanrou Yang, Chen Yang, Qian Chen, Ziyang Ma, Wenxi Chen, Wen Wang, Tianrui Wang, Yifan Yang, Zhikang Niu, Wenrui Liu, Fan Yu, Zhihao Du, Zhifu Gao, ShiLiang Zhang, Xie Chen, "EmoVoice: LLM-based Emotional Text-To-Speech Model with Freestyle Text Prompting", ACM MM 2025
Qinglin Zhang, Luyao Cheng, Chong Deng, Qian Chen, Wen Wang, Siqi Zheng, Jiaqing Liu, Hai Yu, Chao-Hong Tan, Zhihao Du, ShiLiang Zhang, "OmniFlatten: An End-to-end GPT Model for Seamless Voice Conversation", ACL 2025
Changfeng Gao, Zhihao Du, Shiliang Zhang, Differentiable Reward Optimization for LLM based TTS system, INTERSPEECH 2025
Xiang Lyu, Yuxuan Wang, Tianyu Zhao, Hao Wang, Huadai Liu, Zhihao Du, Build LLM-Based Zero-Shot Streaming TTS System with CosyVoice. ICASSP 2025
Guanrou Yang, Fan Yu, Ziyang Ma, Zhihao Du, Zhifu Gao, Shiliang Zhang, Xie Chen, Enhancing Low-Resource ASR through Versatile TTS: Bridging the Data Gap. ICASSP 2025
Ziyang Ma, Guanrou Yang, Yifan Yang, Zhifu Gao, Jiaming Wang, Zhihao Du, Fan Yu, Qian Chen, Siqi Zheng, ShiLiang Zhang, Xie Chen, Speech Recognition Meets Large Language Model: Benchmarking, Models, and Exploration. AAAI 2025(Oral)
Yue Gu, Zhihao Du, Shiliang Zhang, Jiqing Han, Yongjun He, Personality-aware Training based Speaker Adaptation for End-to-end Speech Recognition. INTERSPEECH 2024 Paper
Zhihao Du, Shiliang Zhang, Kai Hu, Siqi Zheng, Funcodec: A Fundamental, Reproducible and Integrable Open-source Toolkit for Neural Speech Codec. ICASSP 2024
Mohan Shi, Zhihao Du, et al., A Comparative Study on Multichannel Speaker-Attributed Automatic Speech Recognition in Multi-Party Meetings. APSIPA 2023
Yangze Li, Fan Yu, Yuhao Liang, Pengcheng Guo, Mohan Shi, Zhihao Du, Shiliang Zhang, Lei Xie, Sa-Paraformer: Non-Autoregressive End-To-End Speaker-Attributed ASR. ASRU 2023
Yuhao Liang, Mohan Shi, Fan Yu, Yangze Li, Shiliang Zhang, Zhihao Du, Qian Chen, Lei Xie, Yanmin Qian, Jian Wu, Zhuo Chen, Kong Aik Lee, Zhijie Yan, Hui Bu, The Second Multi-Channel Multi-Party Meeting Transcription Challenge (M2MeT 2.0): A Benchmark for Speaker-Attributed ASR. ASRU 2023
Yue Gu, Zhihao Du, Shiliang Zhang, Qian Chen, Jiqing Han, Personality-aware Training based Speaker Adaptation for End-to-end Speech Recognition. INTERSPEECH 2023 Paper
Mohan Shi, Zhihao Du, Qian Chen, Fan Yu, Yangze Li, Shiliang Zhang, Jie Zhang, Lirong Dai, CASA-ASR: Context-Aware Speaker-Attributed ASR. INTERSPEECH 2023
Zhifu Gao, Zerui Li, Jiaming Wang, Haoneng Luo, Xian Shi, Mengzhe Chen, Yabin Li, Lingyun Zuo, Zhihao Du, Zhangyu Xiao, Shiliang Zhang, FunASR: A Fundamental End-to-End Speech Recognition Toolkit. INTERSPEECH 2023
Jiaming Wang*, Zhihao Du*, Shiliang Zhang. TOLD: A Novel Two-stage Overlap-aware Framework for Speaker Diarization. ICASSP 2023 (equal contribution)
Zhihao Du, Shiliang Zhang, Siqi Zheng, Zhijie Yan. Speaker Overlap-aware Neural Diarization for Multi-party Meeting Analysis. EMNLP 2022 (long paper)
Yuxiao Lin, Zhihao Du, Shiliang Zhang, Fan Yu, Zhou Zhao, Fei Wu, Separate-to-Recognize: Joint Multi-target Speech Separation and Speech Recognition for Speaker-attributed ASR. ISCSLP 2022
Fan Yu, Shiliang Zhang, Pengcheng Guo, Yuhao Liang, Zhihao Du, et al. MFCCA: Multi-Frame Cross-Channel attention for multi-speaker ASR in Multi-party meeting scenario. SLT 2022
Fan Yu, Zhihao Du, Shiliang Zhang, Yuxiao Lin, Lei Xie. A Comparative Study on Speaker-attributed Automatic Speech Recognition in Multi-party Meetings. ICASSP 2022
Fan Yu, Shiliang Zhang, Pengcheng Guo, Yihui Fu, Zhihao Du, et al. Summary on the ICASSP 2022 multi-channel multi-party meeting transcription grand challenge. ICASSP 2022
Fan Yu, Shiliang Zhang, Yihui Fu, Lei Xie, Siqi Zheng, Zhihao Du, et al. M2MeT: The ICASSP 2022 multi-channel multi-party meeting transcription challenge. ICASSP 2022
Hongwei Song, Jiqing Han, Shiwen Deng, Zhihao Du. Capturing Temporal Dependencies Through Future Prediction for CNN-Based Audio Classifiers. ICASSP 2021
Zhihao Du, Ming Lei, Jiqing Han, Shiliang Zhang. Pan: Phoneme-aware network for monaural speech enhancement. ICASSP 2020.
Zhihao Du, Ming Lei, Jiqing Han, Shiliang Zhang. Self-Supervised Adversarial Multi-Task Learning for Vocoder-Based Monaural Speech Enhancement. INTERSPEECH 2020
Zhihao Du, Jiqing Han, Xueliang Zhang. Double Adversarial Network Based Monaural Speech Enhancement for Robust Speech Recognition. INTERSPEECH 2020, https://github.com/ZhihaoDU/du2020dan
Yue Gu, Zhihao Du, Hui Zhang, Xueliang Zhang. An Efficient Joint Training Framework for Robust Small-Footprint Keyword Spotting. ICONIP 2020
Hongwei Song, Jiqing Han, Shiwen Deng, Zhihao Du. Acoustic scene classification by implicitly identifying distinct sound events. INTERSPEECH 2019
Zhihao Du, Xueliang Zhang, Jiqing Han. Investigation of Monaural Front-End Processing for Robust Speech Recognition Without Retraining or Joint-Training. APSIPA 2019.

Preprints（预印本）

Zhihao Du, Shiliang Zhang, Siqi Zheng, Zhijie Yan. Speaker Embedding-aware Neural Diarization for Flexible Number of Speakers with Textual Information. https://arxiv.org/abs/2111.13694.

PhD Thesis（博士论文）

Research on Monaural Speech Enhancement Based on Prior Information in Different Semantic Levels（基于不同语义层级先验信息的单通道语音增强方法研究）.

Reviewer（审稿）

Journal（期刊）

KNOSYS, Knowledge-Based Systems (IF=7.6)
TASLP, IEEE Transactions on Audio, Speech and Language Processing (IF=6.1)
SPECOM, Speech Communication (IF=2.9)

Conference（会议）

Program Committee for AAAI Conference on Artificial Intelligence (AAAI-26)
ACL, Association for Computational Linguistics Conferences
ACM MM, ACM Multimedia
IJCAI, International Joint Conference on Artificial Intelligence
ICASSP, International Conference on Acoustics, Speech and Signal Processing
INTERSPEECH, Conference of the International Speech Communication Association
IALP, International Conference on Asian Language Processing
ISCSLP, International Symposium on Chinese Spoken Language Processing

Open Source（开源代码）

Widely used speech features, https://github.com/ZhihaoDU/speech_feature_extractor, 100+ stars

Honors（荣誉）

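Nominee for Outstanding Doctoral Dissertation, Harbin Institute of Technology (2021)
Outstanding Graduate of the Inner Mongolia Autonomous Region (2015)
MCM Meritorious Winner
ACM/ICPC Second Prize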

Organization（组织）

IEEE Member
SIGDAT Member

Contact me（联系我）

TEL: +86-15600609952

E-mails: duzhihao.china@gmail.com and zhihaodu0622@163.com