Zhihao Du (杜志浩)

Highlights 近期动态

The CosyVoice series has reached 22k+ stars on GitHub.
CosyVoice 3: Towards In-the-wild Speech Generation via Scaling-up and Post-training is open-sourced. Code Paper Demos
MinMo: A Multimodal Large Language Model for Seamless Voice Interaction. Paper Demos
CosyVoice 2: Scalable Streaming Speech Synthesis with Large Language Models is open-sourced. Code Paper Demos
CosyVoice: A Scalable Multilingual Zero-shot TTS Synthesizer based on Supervised Semantic Tokens is open-sourced. Code Paper Demos
FunAudioLLM has been released. Code Paper

About 关于

My research interests include speech generative models (e.g. CosyVoice series, InspireMusic), multi-modal large language models (e.g. LauraGPT, MinMo), speech processing and deep learning.

I received my Ph.D. from the School of Computer Science and Technology, Harbin Institute of Technology, in 2021, under the supervision of Prof. Jiqing Han. I received my B.E. in software engineering from the College of Software, Inner Mongolia University, in 2015, under the supervision of Prof. Xueliang Zhang.

Last, but certainly not least, I'd like to thank my wonderful wife for her understanding and support.

Experience & Education 经历 & 教育

Researcher · Speech Generation & Multimodal LLMs

Speech Team, Tongyi Lab, Alibaba Group（通义实验室）

2021 – Present

Creator of the CosyVoice series, MinMo, LauraGPT, FunCodec and FunAudioLLM; core contributor to FunASR and InspireMusic.

Ph.D. in Computer Science and Technology

Harbin Institute of Technology（哈尔滨工业大学）

2015 – 2021 · Advisor: Prof. Jiqing Han

Research on monaural speech enhancement based on prior information at different semantic levels.

B.E. in Software Engineering

Inner Mongolia University（内蒙古大学）

2011 – 2015 · Advisor: Prof. Xueliang Zhang

Publications 出版物

Note: most of my papers can be found on arXiv.

Selected Publications 代表作

2025

CosyVoice 3: Towards In-the-wild Speech Generation via Scaling-up and Post-training. Zhihao Du, Changfeng Gao, Yuxuan Wang, et al. Code Paper Demos

2025

MinMo: A Multimodal Large Language Model for Seamless Voice Interaction. Qian Chen, ..., Zhihao Du, et al. Paper Demos

2024

CosyVoice 2: Scalable Streaming Speech Synthesis with Large Language Models. Zhihao Du, Yuxuan Wang, Qian Chen, et al. Code Paper Demos

2024

CosyVoice: A Scalable Multilingual Zero-shot TTS Synthesizer based on Supervised Semantic Tokens. Zhihao Du, Qian Chen, Shiliang Zhang, et al. Code Paper Demos

2024

FunCodec: A Fundamental, Reproducible and Integrable Open-source Toolkit for Neural Speech Codec. Zhihao Du, Shiliang Zhang, Kai Hu, Siqi Zheng. ICASSP 2024

2023

LauraGPT: Listen, Attend, Understand, and Regenerate Audio with GPT. Zhihao Du, Jiaming Wang, Qian Chen, et al. Paper

Journal Papers 期刊论文

Zhihao Du, Xueliang Zhang, Jiqing Han. A joint framework of denoising autoencoder and generative vocoder for monaural speech enhancement. IEEE/ACM TASLP, 2020. View Demos
Yue Gu, Zhihao Du†, Ying Shi, Shiliang Zhang, Qian Chen, Jiqing Han†. Enhancing the Robustness of Contextual ASR to Varying Biasing Information Volumes Through Purified Semantic Correlation Joint Modeling. IEEE TASLP, 2025.
Yue Gu, Zhihao Du†, Ying Shi, Jiqing Han†, Yongjun He. Knowledge-Decoupled Functionally Invariant Path with Synthetic Personal Data for Personalized ASR. IEEE SPL, 2025.

Conference Papers 会议论文

Guanrou Yang, Chen Yang, Qian Chen, Ziyang Ma, Wenxi Chen, Wen Wang, Tianrui Wang, Yifan Yang, Zhikang Niu, Wenrui Liu, Fan Yu, Zhihao Du, Zhifu Gao, Shiliang Zhang, Xie Chen. EmoVoice: LLM-based Emotional Text-To-Speech Model with Freestyle Text Prompting. ACM MM 2025
Qinglin Zhang, Luyao Cheng, Chong Deng, Qian Chen, Wen Wang, Siqi Zheng, Jiaqing Liu, Hai Yu, Chao-Hong Tan, Zhihao Du, Shiliang Zhang. OmniFlatten: An End-to-end GPT Model for Seamless Voice Conversation. ACL 2025
Changfeng Gao, Zhihao Du, Shiliang Zhang. Differentiable Reward Optimization for LLM based TTS system. INTERSPEECH 2025
Xiang Lyu, Yuxuan Wang, Tianyu Zhao, Hao Wang, Huadai Liu, Zhihao Du. Build LLM-based Zero-Shot Streaming TTS System with CosyVoice. ICASSP 2025
Guanrou Yang, Fan Yu, Ziyang Ma, Zhihao Du, Zhifu Gao, Shiliang Zhang, Xie Chen. Enhancing Low-Resource ASR through Versatile TTS: Bridging the Data Gap. ICASSP 2025
Ziyang Ma, Guanrou Yang, Yifan Yang, Zhifu Gao, Jiaming Wang, Zhihao Du, Fan Yu, Qian Chen, Siqi Zheng, Shiliang Zhang, Xie Chen. Speech Recognition Meets Large Language Model: Benchmarking, Models, and Exploration. AAAI 2025 (Oral)
Yue Gu, Zhihao Du, Shiliang Zhang, Jiqing Han, Yongjun He. Personality-aware Training based Speaker Adaptation for End-to-end Speech Recognition. INTERSPEECH 2024 [Paper]
Zhihao Du, Shiliang Zhang, Kai Hu, Siqi Zheng. FunCodec: A Fundamental, Reproducible and Integrable Open-source Toolkit for Neural Speech Codec. ICASSP 2024
Mohan Shi, Zhihao Du, et al. A Comparative Study on Multichannel Speaker-Attributed ASR in Multi-Party Meetings. APSIPA 2023
Yangze Li, Fan Yu, Yuhao Liang, Pengcheng Guo, Mohan Shi, Zhihao Du, Shiliang Zhang, Lei Xie. SA-Paraformer: Non-Autoregressive End-To-End Speaker-Attributed ASR. ASRU 2023
Yuhao Liang, Mohan Shi, Fan Yu, Yangze Li, Shiliang Zhang, Zhihao Du, Qian Chen, Lei Xie, Yanmin Qian, Jian Wu, Zhuo Chen, Kong Aik Lee, Zhijie Yan, Hui Bu. The Second Multi-Channel Multi-Party Meeting Transcription Challenge (M2MeT 2.0). ASRU 2023
Yue Gu, Zhihao Du, Shiliang Zhang, Qian Chen, Jiqing Han. Personality-aware Training based Speaker Adaptation for End-to-end Speech Recognition. INTERSPEECH 2023 [Paper]
Mohan Shi, Zhihao Du, Qian Chen, Fan Yu, Yangze Li, Shiliang Zhang, Jie Zhang, Lirong Dai. CASA-ASR: Context-Aware Speaker-Attributed ASR. INTERSPEECH 2023
Zhifu Gao, Zerui Li, Jiaming Wang, Haoneng Luo, Xian Shi, Mengzhe Chen, Yabin Li, Lingyun Zuo, Zhihao Du, Zhangyu Xiao, Shiliang Zhang. FunASR: A Fundamental End-to-End Speech Recognition Toolkit. INTERSPEECH 2023
Jiaming Wang*, Zhihao Du*, Shiliang Zhang. TOLD: A Novel Two-stage Overlap-aware Framework for Speaker Diarization. ICASSP 2023 (equal contribution)
Zhihao Du, Shiliang Zhang, Siqi Zheng, Zhijie Yan. Speaker Overlap-aware Neural Diarization for Multi-party Meeting Analysis. EMNLP 2022 (long paper)
Yuxiao Lin, Zhihao Du, Shiliang Zhang, Fan Yu, Zhou Zhao, Fei Wu. Separate-to-Recognize: Joint Multi-target Speech Separation and Speech Recognition for Speaker-attributed ASR. ISCSLP 2022
Fan Yu, Shiliang Zhang, Pengcheng Guo, Yuhao Liang, Zhihao Du, et al. MFCCA: Multi-Frame Cross-Channel Attention for Multi-speaker ASR in Multi-party Meeting Scenario. SLT 2022
Fan Yu, Zhihao Du, Shiliang Zhang, Yuxiao Lin, Lei Xie. A Comparative Study on Speaker-attributed ASR in Multi-party Meetings. ICASSP 2022
Fan Yu, Shiliang Zhang, Pengcheng Guo, Yihui Fu, Zhihao Du, et al. Summary on the ICASSP 2022 Multi-channel Multi-party Meeting Transcription Grand Challenge. ICASSP 2022
Fan Yu, Shiliang Zhang, Yihui Fu, Lei Xie, Siqi Zheng, Zhihao Du, et al. M2MeT: The ICASSP 2022 Multi-channel Multi-party Meeting Transcription Challenge. ICASSP 2022
Hongwei Song, Jiqing Han, Shiwen Deng, Zhihao Du. Capturing Temporal Dependencies Through Future Prediction for CNN-Based Audio Classifiers. ICASSP 2021
Zhihao Du, Ming Lei, Jiqing Han, Shiliang Zhang. PAN: Phoneme-aware Network for Monaural Speech Enhancement. ICASSP 2020
Zhihao Du, Ming Lei, Jiqing Han, Shiliang Zhang. Self-Supervised Adversarial Multi-Task Learning for Vocoder-Based Monaural Speech Enhancement. INTERSPEECH 2020
Zhihao Du, Jiqing Han, Xueliang Zhang. Double Adversarial Network Based Monaural Speech Enhancement for Robust Speech Recognition. INTERSPEECH 2020 [Code]
Yue Gu, Zhihao Du, Hui Zhang, Xueliang Zhang. An Efficient Joint Training Framework for Robust Small-Footprint Keyword Spotting. ICONIP 2020
Hongwei Song, Jiqing Han, Shiwen Deng, Zhihao Du. Acoustic Scene Classification by Implicitly Identifying Distinct Sound Events. INTERSPEECH 2019
Zhihao Du, Xueliang Zhang, Jiqing Han. Investigation of Monaural Front-End Processing for Robust Speech Recognition Without Retraining or Joint-Training. APSIPA 2019

Preprints 预印本

Zhihao Du, Shiliang Zhang, Siqi Zheng, Zhijie Yan. Speaker Embedding-aware Neural Diarization for Flexible Number of Speakers with Textual Information. arXiv:2111.13694

PhD Thesis 博士论文

Research on Monaural Speech Enhancement Based on Prior Information in Different Semantic Levels（基于不同语义层级先验信息的单通道语音增强方法研究）.

Reviewer & Service 审稿

Journal 期刊

KNOSYS, Knowledge-Based Systems (IF=7.6)
TASLP, IEEE Trans. Audio, Speech and Language Processing (IF=6.1)
SPECOM, Speech Communication (IF=2.9)

Conference 会议

AAAI, Conf. on Artificial Intelligence (Program Committee)
ACL, Association for Computational Linguistics
ACM MM, ACM Multimedia
SIGGRAPH Asia, ACM Conf. on Computer Graphics and Interactive Techniques in Asia
IJCAI, Int'l Joint Conference on Artificial Intelligence
ICASSP, Int'l Conf. on Acoustics, Speech and Signal Processing
INTERSPEECH, Conf. of the Int'l Speech Communication Assoc.
IALP, Int'l Conf. on Asian Language Processing
ISCSLP, Int'l Symposium on Chinese Spoken Language Processing

Open Source 开源代码

CosyVoice — Multilingual zero-shot TTS. github.com/FunAudioLLM/CosyVoice
speech_feature_extractor — Widely-used speech features, 100+ stars. github.com/ZhihaoDU/speech_feature_extractor

Honors & Membership 荣誉 & 组织

Honors 荣誉

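Nominee for Outstanding Doctoral Dissertation, Harbin Institute of Technology (2021)
Outstanding Graduate of Inner Mongolia Autonomous Region (2015)
MCM Meritorious Winner
ACM/ICPC Second Prize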

Organization 组织

IEEE Member
SIGDAT Member

Contact 联系我

+86-15600609952 · duzhihao.china@gmail.com · zhihaodu0622@163.com