Comparison Visual Instruction Tuning

Wei Lin¹, Muhammad Jehanzeb Mirza², Sivan Doveh³,⁴, Rogerio Feris⁷, Raja Giryes⁵, Sepp Hochreiter¹,⁶, Leonid Karlinsky⁷

¹ELLIS Unit, LIT AI Lab, Institute for Machine Learning, JKU Linz, Austria ²TU Graz ICG, Austria ³IBM Research, Israel ⁴Weizmann Institute of Science, Israel ⁵Tel-Aviv University, Israel ⁶NXAI GmbH, Austria ⁷MIT-IBM Watson AI Lab, USA

Abstract

Comparing two images in terms of Commonalities and Differences (CaD) is a fundamental human capability that forms the basis of advanced visual reasoning and interpretation.

We develop and contribute a new two-phase approach CaD-VI for collecting synthetic visual instructions, together with an instruction-following dataset CaD-Inst containing 349K image pairs with CaD instructions collected using CaD-VI.

Additionally, we propose an evaluation benchmark with 7.5K open-ended QAs to assess the CaD understanding abilities of LMMs.

Pipeline

Two-phase data collection: In Phase-1, we use captions of image pairs and the Mixtral 8x7B model to generate CaD visual instruction (VI) data, CaD-Inst-V1 (278K), and perform visual instruction tuning on it to obtain the Phase-1 model CaD-LLaVA-V1.

In Phase-2, we use CaD-LLaVA-V1 to generate CaD VI data on additional image pairs and collect CaD-Inst-V2 (71K). Visual instruction tuning on CaD-Inst-V1 and CaD-Inst-V2 yields our final model, CaD-LLaVA-V2.
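The two phases described above can be read as a small bootstrapping loop. The following is a minimal sketch of that control flow only, not the authors' implementation: `llm_summarize`, `train`, and the sample dictionary layout are hypothetical placeholders standing in for the Mixtral 8x7B call, visual instruction tuning, and the real data format.

```python
from typing import Callable, Dict, List, Tuple

# All names below are hypothetical placeholders for illustration.
ImagePair = Tuple[str, str]   # (image_id_a, image_id_b)
Sample = Dict[str, object]    # one instruction-following sample
Summarizer = Callable[[str, str], str]


def collect_phase1(pairs: List[ImagePair], captions: Dict[str, str],
                   llm_summarize: Summarizer) -> List[Sample]:
    """Phase 1: a text-only LLM sees only the captions of each image pair."""
    samples = []
    for a, b in pairs:
        summary = llm_summarize(captions[a], captions[b])
        samples.append({"images": (a, b), "cad_summary": summary, "phase": 1})
    return samples


def collect_phase2(pairs: List[ImagePair],
                   lmm_summarize: Summarizer) -> List[Sample]:
    """Phase 2: the Phase-1 tuned LMM is queried on the images themselves."""
    return [{"images": (a, b), "cad_summary": lmm_summarize(a, b), "phase": 2}
            for a, b in pairs]


def two_phase_collection(phase1_pairs: List[ImagePair],
                         phase2_pairs: List[ImagePair],
                         captions: Dict[str, str],
                         llm_summarize: Summarizer,
                         train: Callable[[List[Sample]], Summarizer]):
    """Run both phases; `train` stands in for visual instruction tuning
    and returns a callable model that summarizes an image pair."""
    v1_data = collect_phase1(phase1_pairs, captions, llm_summarize)
    model_v1 = train(v1_data)                      # Phase-1 model
    v2_data = collect_phase2(phase2_pairs, model_v1)
    model_v2 = train(v1_data + v2_data)            # final model uses both sets
    return v1_data, v2_data, model_v2
```

The point of the sketch is the data dependency: Phase-2 data cannot be collected until a model tuned on Phase-1 data exists, and the final model is tuned on the union of both sets.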

Another example of (a) a Phase-1 LLM-collected CaD summary and (b) a Phase-2 LMM-collected CaD summary

Data Statistics

Word clouds of CaD summaries in (a) Phase-1 data and (b) Phase-2 data collections

Distribution of (a) questions (first 5 words) and (b) answers (first 3 words) in our collected evaluation benchmark CaD-QA
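A prefix distribution of this kind can be computed by counting the leading words of each question or answer. The snippet below is a minimal sketch under assumptions: the function name `prefix_distribution`, the lowercasing, and the whitespace tokenization are illustrative choices, and the example questions are invented rather than taken from CaD-QA.

```python
from collections import Counter
from typing import Iterable


def prefix_distribution(texts: Iterable[str], n_words: int) -> Counter:
    """Count how often each n-word prefix occurs (case-insensitive,
    whitespace-tokenized; punctuation is left attached to words)."""
    prefixes = (" ".join(t.split()[:n_words]).lower() for t in texts)
    return Counter(prefixes)


# Invented example questions, for illustration only.
questions = [
    "What is common between the two images?",
    "What is common between the two scenes?",
    "How do the animals differ?",
]
question_counts = prefix_distribution(questions, n_words=5)
```

Calling the same function with `n_words=3` on the answers would give the corresponding answer-side distribution.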

Examples of Collected QA Pairs

Examples of Q&A pairs in our CaD-QA benchmark, together with LMM-predicted answers and the corresponding LLM evaluation ratings for the predictions (red and green text denote incorrect and correct descriptions, respectively).

Reasoning on Binary Image Selection Task

Examples of predictions on the binary image selection task (selecting the image that matches a given text query). Here we instruct the LMMs to also give the reasoning behind their selection (red and green text denote incorrect and correct predictions, respectively).

BibTeX

@article{lin2024comparison,
  title={Comparison Visual Instruction Tuning},
  author={Lin, Wei and Mirza, Muhammad Jehanzeb and Doveh, Sivan and Feris, Rogerio and Giryes, Raja and Hochreiter, Sepp and Karlinsky, Leonid},
  journal={arXiv preprint},
  year={2024}
}