Enhancing Visual Reasoning with Autonomous Imagination in Multimodal Large Language Models

Jingming Liu*, Yumeng Li*, Boyuan Xiao,
Yichang Jian, Ziang Qin, Tianjia Shao, Yao-Xiang Ding, Kun Zhou
State Key Lab of CAD&CG, Zhejiang University
*Denotes Equal Contribution
Our autonomous imagination method empowers advanced MLLMs to engage in iterative imaginative reasoning, enabling them to address previously unsolvable tasks without additional training or fine-tuning.

Abstract

There have been recent efforts to extend the Chain-of-Thought (CoT) paradigm to Multimodal Large Language Models (MLLMs) by finding visual clues in the input scene, advancing the visual reasoning ability of MLLMs. However, current approaches are designed for tasks in which clue finding plays a major role in the overall reasoning process, making it difficult to handle complex visual scenes where clue finding does not actually simplify the reasoning task. To address this challenge, we propose a new visual reasoning paradigm that enables MLLMs to autonomously modify the input scene into new ones based on their reasoning status, so that CoT is reformulated as a sequence of simple closed-loop decision-making and reasoning steps over "imagined" visual scenes, leading to natural and general CoT construction. To implement this paradigm, we introduce a novel plug-and-play "imagination space", in which MLLMs conduct visual modifications through operations such as focus, ignore, and transform, relying on their native reasoning ability without task-specific training. We validate our approach on a benchmark spanning dense counting, simple jigsaw puzzle solving, and object placement, which challenges reasoning abilities beyond clue finding. The results show that while existing techniques fall short, our approach enables MLLMs to reason effectively step by step through autonomous imagination.

Method

The imagination space is initialized with the unstructured input scene and evolves through an iterative reasoning process. In each cycle, the MLLM first perceives the current state of the imagination space, selects an operation to apply, and then reassesses the updated imagination space. Upon completing this reasoning sequence, the MLLM generates an answer based on the cumulative context of the process and the final state of the imagination space, as sketched below.
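The following is a minimal, runnable Python sketch of this closed-loop cycle. The scene representation (a list of labelled regions), the concrete operation set, and the choose_step / answer callbacks are illustrative assumptions made for exposition, not the authors' actual implementation or API.

from dataclasses import dataclass, field
from typing import Any, Callable

@dataclass
class Step:
    op: str                                  # "focus", "ignore", "transform", or "answer"
    args: dict = field(default_factory=dict)

def apply_operation(scene: list, step: Step) -> list:
    """Modify the imagined scene according to the chosen operation (assumed semantics)."""
    if step.op == "focus":      # keep only the regions of interest
        return [r for r in scene if r["id"] in step.args["ids"]]
    if step.op == "ignore":     # drop distracting regions
        return [r for r in scene if r["id"] not in step.args["ids"]]
    if step.op == "transform":  # e.g. move or relabel a single region
        return [dict(r, **step.args["update"]) if r["id"] == step.args["id"] else r
                for r in scene]
    raise ValueError(f"unknown operation: {step.op}")

def reason_with_imagination(choose_step: Callable, answer: Callable,
                            scene: list, max_steps: int = 10) -> Any:
    """Iterate: perceive the scene, choose an operation, update the imagination space."""
    history = []
    for _ in range(max_steps):
        step = choose_step(scene, history)    # MLLM perceives current state and decides
        if step.op == "answer":
            break
        scene = apply_operation(scene, step)  # update the imagination space
        history.append(step)                  # keep cumulative context for later steps
    return answer(scene, history)             # answer from final state + full context

In practice, choose_step and answer would wrap prompts to the MLLM (rendering the current scene as an image or description), while apply_operation would render the modified scene back into the imagination space; the dict-based scene here only stands in for that machinery.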

An overview of our autonomous imagination method.

Video

BibTeX

@misc{liu2024enhancingvisualreasoningautonomous,
      title={Enhancing Visual Reasoning with Autonomous Imagination in Multimodal Large Language Models}, 
      author={Jingming Liu and Yumeng Li and Boyuan Xiao and Yichang Jian and Ziang Qin and Tianjia Shao and Yao-Xiang Ding and Kun Zhou},
      year={2024},
      eprint={2411.18142},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2411.18142}, 
}