Research
I am interested in AIGC. Currently, I am focusing on text-to-image diffusion models and multimodal large language models (MLLMs).
💫CoMat: Aligning Text-to-Image Diffusion Model with Image-to-Text Concept Matching
Dongzhi Jiang, Guanglu Song, Xiaoshi Wu, Renrui Zhang, Dazhong Shen, Zhuofan Zong, Yu Liu, Hongsheng Li
arXiv, 2024
arXiv / website / GitHub
A fine-tuning strategy that addresses text-to-image misalignment through image-to-text concept matching. The training data consists only of text prompts.
MathVerse: Does Your Multi-modal LLM Truly See the Diagrams in Visual Math Problems?
Renrui Zhang*, Dongzhi Jiang*, Yichi Zhang*, Haokun Lin, Ziyu Guo, Pengshuo Qiu, Aojun Zhou, Pan Lu, Kai-Wei Chang, Peng Gao, Hongsheng Li
arXiv, 2024
arXiv / website / dataset / GitHub
We observe that current benchmarks include excessive visual content within their textual questions, which may help MLLMs deduce answers without truly interpreting the input diagrams.
Temporal Enhanced Training of Multi-view 3D Object Detector via Historical Object Prediction
Zhuofan Zong*, Dongzhi Jiang*, Guanglu Song, Zeyue Xue, Jingyong Su, Hongsheng Li, Yu Liu
ICCV, 2023
arXiv / GitHub
We design a plug-and-play approach to enhance the temporal modeling capability of BEV detectors with no additional inference cost.