AssistGPT: A General Multi-modal Assistant that can
Plan, Execute, Inspect, and Learn

Difei Gao,
Lei Ji,
Luowei Zhou,
Kevin Qinghong Lin,
Joya Chen,
Zihan Fan,
Mike Zheng Shou
Show Lab, National University of Singapore · Microsoft Research Asia · Microsoft

AssistGPT reasons in an interleaved language-and-code format. Given a query and visual inputs, it plans the problem-solving path in natural language and emits structured code to call various powerful tools. The Inspector manages visual inputs and intermediate results, helping the Planner invoke tools correctly, while the Learner assesses the reasoning process and collects in-context examples.


Recent advancements in Large Language Models (LLMs) have improved general-purpose NLP assistants and their capacity to invoke models or APIs. However, complex visual tasks remain a challenge due to the diversity of reasoning paths, variable inputs, and multimodal intermediate results. In many real-world applications, a query cannot be decomposed from its text alone; the decomposition depends on the specific visual content and on step-by-step intermediate results. Inputs often combine videos and images, and the reasoning process can generate diverse multimodal intermediate results such as video narrations and segmented video clips.
To address such general cases, we propose a multi-modal AI assistant, AssistGPT, with an interleaved code and language reasoning approach called Plan, Execute, Inspect, and Learn (PEIL) that integrates LLMs with various tools. Specifically, the Planner uses natural language to decide which tool in the Executor should be invoked next, based on the current reasoning progress. The Inspector is an efficient memory manager that assists the Planner in feeding the proper visual information into a specific tool. Finally, since the entire reasoning process is complex and flexible, a Learner is designed to let the model autonomously explore and discover the optimal solution.
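The interaction among the four PEIL components can be sketched roughly as below. All class names, tool names, and the fixed plan here are illustrative assumptions for a minimal sketch, not the authors' actual implementation; in the real system, an LLM-based Planner would generate each step rather than follow a hard-coded plan.

```python
from dataclasses import dataclass, field

@dataclass
class Inspector:
    """Records summaries of visual inputs and intermediate results
    so the planner can reference them by name in later steps."""
    memory: dict = field(default_factory=dict)

    def record(self, key: str, summary: str) -> None:
        self.memory[key] = summary

class Executor:
    """Dispatches structured tool calls (e.g. a captioner or an OCR
    model) by tool name."""
    def __init__(self, tools: dict):
        self.tools = tools

    def run(self, name: str, arg: str) -> str:
        return self.tools[name](arg)

class Learner:
    """Keeps successful reasoning traces as in-context examples."""
    def __init__(self):
        self.examples: list = []

    def assess(self, trace: list, success: bool) -> None:
        if success:
            self.examples.append(trace)

def plan_execute(query: str, plan: list, executor: Executor,
                 inspector: Inspector, learner: Learner) -> str:
    """Runs a fixed plan of (thought, tool_name, argument) steps."""
    trace, result = [], query
    for thought, tool, arg in plan:
        result = executor.run(tool, arg)            # Execute
        inspector.record(f"{tool}({arg})", result)  # Inspect
        trace.append((thought, tool, result))
    learner.assess(trace, success=True)  # Learn (self-assessment stubbed)
    return result

# Toy tools standing in for real vision models.
tools = {
    "caption": lambda src: f"caption of {src}",
    "ocr": lambda src: f"text in {src}",
}

answer = plan_execute(
    "What does the sign say?",
    [("Locate a frame with a sign", "caption", "video.mp4"),
     ("Read the sign", "ocr", "video.mp4")],
    Executor(tools), Inspector(), Learner(),
)
# answer == "text in video.mp4"
```

The key design point the sketch tries to convey is the separation of concerns: the Executor knows only how to run tools, the Inspector only stores and summarizes results, and the Learner only filters traces, so the planning policy can be swapped out independently.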


By integrating multiple models and employing an interleaved language-and-code reasoning manner, AssistGPT possesses the following features:

  • Understanding Complex Video Content: It can process various forms of information, such as visual content, subtitles, and OCR.
  • Addressing High-Level Queries: The model can automatically plan a reasoning path based on the visual content.
  • Supporting Flexible Inputs: It can accommodate a variety of input types, each serving different functions.


@article{assistgpt2023,
      title = {AssistGPT: A General Multi-modal Assistant that can Plan, Execute, Inspect, and Learn},
      author = {Gao, Difei and Ji, Lei and Zhou, Luowei and Lin, Kevin Qinghong and Chen, Joya and Fan, Zihan and Shou, Mike Zheng},
      year = {2023},
}