Dongzhi Jiang

Dongzhi Jiang

I'm a PhD student in Multimedia Lab, CUHK, supervised by Prof. Hongsheng Li. Please email me if you have any questions or want to collaborate.

Email / Google Scholar / Github

Selected Publications

Currently, I am focusing on Multimodal Large Language Model (MLLM) and Text-to-Image models.

	T2I-R1: Reinforcing Image Generation with Collaborative Semantic-level and Token-level CoT Dongzhi Jiang, Ziyu Guo, Renrui Zhang, Zhuofan Zong, Hao Li, Le Zhuo, Shilin Yan, Pheng-Ann Heng, Hongsheng Li NeurIPS*, 2025 arXiv / GitHub A reasoning-enhanced text-to-image generation model, powered by RL with a bi-level CoT reasoning process..
	MME-CoT: Benchmarking Chain-of-Thought in Large Multimodal Models for Reasoning Quality, Robustness, and Efficiency Dongzhi Jiang, Renrui Zhang, Ziyu Guo, Yanwei Li, Yu Qi, Xinyan Chen, Liuhui Wang, Jianhan Jin, Claire Guo, Shen Yan, Bo Zhang, Chaoyou Fu, Peng Gao, Hongsheng Li ICML, 2025 arXiv / website / dataset / GitHub A comprehensive chain-of-thought evaluation suite for large multimodal models, focusing on reasoning quality, robustness, and efficiency.
	EasyRef: Omni-Generalized Group Image Reference for Diffusion Models via Multimodal LLM Zhuofan Zong, Dongzhi Jiang, Bingqi Ma, Guanglu Song, Hao Shao, Dazhong Shen, Yu Liu, Hongsheng Li ICML, 2025 A novel approach for group image reference in diffusion models using large multimodal model capabilities.
	MMSearch: Unveiling the Potential of Large Models as Multi-modal Search Engines Dongzhi Jiang, Renrui Zhang, Ziyu Guo, Yanmin Wu, Jiayi Lei, Pengshuo Qiu, Pan Lu, Zehui Chen, Guanglu Song, Peng Gao, Yu Liu, Chunyuan Li, Hongsheng Li, ICLR, 2025 arXiv / website / dataset / GitHub We investigate the potential of current LMM to function as a multimodal AI search engine. We also introduce a multimodal AI search engine pipeline and outperforms Perplexity-pro with only open-source LMMs.
	MAVIS: Mathematical Visual Instruction Tuning with an Automatic Data Engine Renrui Zhang, Xinyu Wei, Dongzhi Jiang, Ziyu Guo, Yichi Zhang, Chengzhuo Tong, Jiaming Liu, Aojun Zhou, Shanghang Zhang, Peng Gao, Hongsheng Li ICLR, 2025 An automatic data engine and specialized vision encoder for mathematical visual instruction tuning.
	💫CoMat: Aligning Text-to-Image Diffusion Model with Image-to-Text Concept Matching Dongzhi Jiang, Guanglu Song, Xiaoshi Wu, Renrui Zhang, Dazhong Shen, Zhuofan Zong, Yu Liu, Hongsheng Li NeurIPS, 2024 arXiv / website / GitHub A fine-tuning strategy to address the text-to-image misalignment issue with image-to-text concept matching. The training data only includes text prompts.
	MoVA: Adapting Mixture of Vision Experts to Multimodal Context Zhuofan Zong, Bingqi Ma, Dazhong Shen, Guanglu Song, Hao Shao, Dongzhi Jiang, Hongsheng Li, Yu Liu NeurIPS, 2024 An adaptive approach for mixing vision experts for large multimodal model.
	MathVerse: Does Your Multi-modal LLM Truly See the Diagrams in Visual Math Problems? Renrui Zhang, Dongzhi Jiang, Yichi Zhang, Haokun Lin, Ziyu Guo, Pengshuo Qiu, Aojun Zhou, Pan Lu, Aojun Zhou, Kai-Wei Chang, Peng Gao, Hongsheng Li ECCV*, 2024 arXiv / website / dataset / GitHub We investigate current benchmarks to incorporate excessive visual content within textual questions, which potentially assist MLLMs in deducing answers without truly interpreting the input diagrams.
	Temporal Enhanced Training of Multi-view 3D Object Detector via Historical Object Prediction Zhuofan Zong, Dongzhi Jiang, Guanglu Song, Zeyue Xue, Jingyong Su, Hongsheng Li, Yu Liu ICCV, 2023 arXiv / GitHub We design a plug-and-play approach to enhance the temporal modeling capability of BEV detectors with no additional inference cost.