Selected Publications
I am interested in AIGC (AI-generated content). Currently, I focus on Multimodal Large Language Models (MLLMs) and text-to-image diffusion models.
MMSearch: Benchmarking the Potential of Large Models as Multi-modal Search Engines
Dongzhi Jiang*, Renrui Zhang*, Ziyu Guo, Yanmin Wu, Jiayi Lei, Pengshuo Qiu, Pan Lu, Zehui Chen, Guanglu Song, Peng Gao, Yu Liu, Chunyuan Li, Hongsheng Li
arXiv, 2024
arXiv / website / dataset / GitHub
We investigate the potential of current LMMs to function as multimodal AI search engines. We also introduce a multimodal AI search engine pipeline that outperforms Perplexity Pro using only open-source LMMs.
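To make the pipeline idea concrete, below is a minimal sketch of a staged LMM search loop (requery, rerank, summarize). The helpers `call_lmm` and `web_search` are hypothetical stand-ins for an LMM endpoint and a search API, not the released MMSearch code.

```python
# Minimal sketch of a staged LMM search pipeline. `call_lmm` and
# `web_search` are hypothetical stand-ins; plug in real backends to use it.
from dataclasses import dataclass

@dataclass
class SearchResult:
    url: str
    snippet: str

def call_lmm(prompt: str, image_path: str | None = None) -> str:
    """Hypothetical stand-in for any LMM inference call."""
    return "placeholder LMM output"

def web_search(query: str, k: int = 8) -> list[SearchResult]:
    """Hypothetical stand-in for a text search API."""
    return [SearchResult(f"https://example.com/{i}", f"snippet {i}") for i in range(k)]

def mm_search(question: str, image_path: str | None = None) -> str:
    # Stage 1: requery -- rewrite the (image, text) question as a search query.
    query = call_lmm(f"Rewrite as a web search query: {question}", image_path)
    # Stage 2: rerank -- ask the LMM to pick the most relevant result.
    results = web_search(query)
    listing = "\n".join(f"[{i}] {r.url}: {r.snippet}" for i, r in enumerate(results))
    choice = call_lmm(f"Question: {question}\nResults:\n{listing}\nBest index:")
    idx = int(choice) if choice.isdigit() else 0
    best = results[min(idx, len(results) - 1)]
    # Stage 3: summarize -- answer the question from the chosen source.
    return call_lmm(f"Answer '{question}' using this source:\n{best.snippet}", image_path)

print(mm_search("When was this landmark built?", image_path="landmark.jpg"))
```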
💫CoMat: Aligning Text-to-Image Diffusion Model with Image-to-Text Concept Matching
Dongzhi Jiang, Guanglu Song, Xiaoshi Wu, Renrui Zhang, Dazhong Shen, Zhuofan Zong, Yu Liu, Hongsheng Li
NeurIPS, 2024
arXiv / website / GitHub
A fine-tuning strategy that addresses the text-to-image misalignment issue via image-to-text concept matching; the training data consists only of text prompts.
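As a rough illustration of the idea, the sketch below fine-tunes a toy generator so that a frozen image-to-text model assigns high likelihood to the prompt's tokens given the generated image. `ToyT2I` and `ToyCaptioner` are illustrative stand-ins, and the bag-of-words loss is a simplification of the paper's token-level concept matching.

```python
# Sketch of concept-matching fine-tuning: only the generator is updated,
# using a frozen captioner's likelihood of the prompt as the training signal.
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB = 100

class ToyT2I(nn.Module):
    """Stand-in for a text-to-image model with a differentiable decode."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, 32)
        self.dec = nn.Linear(32, 3 * 8 * 8)
    def forward(self, prompt_ids):
        h = self.embed(prompt_ids).mean(dim=1)      # pool prompt tokens
        return self.dec(h).view(-1, 3, 8, 8)        # "generated image"

class ToyCaptioner(nn.Module):
    """Frozen stand-in for an image-to-text model scoring vocabulary tokens."""
    def __init__(self):
        super().__init__()
        self.enc = nn.Linear(3 * 8 * 8, 32)
        self.head = nn.Linear(32, VOCAB)
    def forward(self, images):
        h = F.relu(self.enc(images.flatten(1)))
        return self.head(h)                          # (batch, VOCAB) logits

t2i, captioner = ToyT2I(), ToyCaptioner()
for p in captioner.parameters():
    p.requires_grad_(False)                          # captioner stays frozen
opt = torch.optim.AdamW(t2i.parameters(), lr=1e-4)

prompt_ids = torch.randint(0, VOCAB, (4, 6))         # text prompts are the only data
images = t2i(prompt_ids)                             # generate from the prompt
logp = F.log_softmax(captioner(images), dim=-1)
# Maximize the captioner's likelihood of each prompt token (a bag-of-words
# simplification of the paper's concept matching).
loss = -logp.gather(1, prompt_ids).mean()
opt.zero_grad()
loss.backward()                                      # gradients reach only the generator
opt.step()
```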
MathVerse: Does Your Multi-modal LLM Truly See the Diagrams in Visual Math Problems?
Renrui Zhang*, Dongzhi Jiang*, Yichi Zhang*, Haokun Lin, Ziyu Guo, Pengshuo Qiu, Aojun Zhou, Pan Lu, Kai-Wei Chang, Peng Gao, Hongsheng Li
ECCV, 2024
arXiv / website / dataset / GitHub
We find that current benchmarks incorporate excessive visual content within textual questions, which can assist MLLMs in deducing answers without truly interpreting the input diagrams.
Temporal Enhanced Training of Multi-view 3D Object Detector via Historical Object Prediction
Zhuofan Zong*, Dongzhi Jiang*, Guanglu Song, Zeyue Xue, Jingyong Su, Hongsheng Li, Yu Liu
ICCV, 2023
arXiv / GitHub
We design a plug-and-play approach that enhances the temporal modeling capability of BEV detectors at no additional inference cost.
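A hedged sketch of the auxiliary objective: predict objects at a held-out past frame from the BEV features of its neighboring frames, then drop the head at test time. The shapes and the head design below are illustrative assumptions, not the paper's exact architecture.

```python
# Sketch of historical object prediction as an auxiliary training task.
# The head fuses BEV features from frames t-2 and t to predict objects at
# t-1, and is removed at inference, leaving deployment cost unchanged.
import torch
import torch.nn as nn
import torch.nn.functional as F

class HoPHead(nn.Module):
    """Auxiliary head: fuse adjacent-frame BEV features, predict the past frame."""
    def __init__(self, c=64, num_classes=10):
        super().__init__()
        self.fuse = nn.Conv2d(2 * c, c, kernel_size=3, padding=1)
        self.cls = nn.Conv2d(c, num_classes, kernel_size=1)
    def forward(self, bev_prev, bev_next):
        h = F.relu(self.fuse(torch.cat([bev_prev, bev_next], dim=1)))
        return self.cls(h)                      # per-cell class logits at t-1

bev_prev = torch.randn(2, 64, 32, 32)           # BEV features at frame t-2
bev_next = torch.randn(2, 64, 32, 32)           # BEV features at frame t
past_gt = torch.randint(0, 10, (2, 32, 32))     # object labels at frame t-1

head = HoPHead()
aux_loss = F.cross_entropy(head(bev_prev, bev_next), past_gt)
# total_loss = detection_loss + aux_loss; delete `head` at test time.
```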