Recent advances in Large Language Models (LLMs) have improved general NLP AI assistants and their capacity to plan and invoke models or APIs. However, complex visual tasks remain challenging due to the diversity of reasoning paths, variable inputs, and intermediate results. In many real-world applications, it is hard to break down a query based on the query text alone; the decomposition also depends on the specific visual content and on step-by-step intermediate results. Inputs can vary, often combining videos and images, and the reasoning process can produce diverse multimodal intermediate results such as video narrations and segmented video clips.
To address such general cases, we propose a multi-modal AI assistant, AssistGPT, which uses an interleaved code-and-language reasoning approach called Plan, Execute, Inspect, and Learn (PEIL) to integrate LLMs with various tools. Specifically, the Planner uses natural language to decide which tool in the Executor should be invoked next, based on the current reasoning progress. The Inspector is an efficient memory manager that assists the Planner in feeding the proper visual information into a specific tool. Finally, since the entire reasoning process is complex and flexible, the Learner enables the model to autonomously explore and discover the optimal solution.
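To make the interplay of these four roles concrete, below is a minimal, illustrative sketch of a PEIL-style loop in Python. It is not the AssistGPT implementation: the tool names (caption_video, answer), the memory layout, and the rule-based planner standing in for an LLM call are all hypothetical placeholders, used only to show how a Planner, Executor, Inspector-managed memory, and Learner could interact.

from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Memory:
    """Inspector-managed record of inputs and intermediate results."""
    items: Dict[str, str] = field(default_factory=dict)

    def add(self, name: str, summary: str) -> None:
        self.items[name] = summary


class Executor:
    """Wraps external tools; each tool maps a reference/argument to a result."""
    def __init__(self, tools: Dict[str, Callable[[str], str]]):
        self.tools = tools

    def run(self, tool_name: str, argument: str) -> str:
        return self.tools[tool_name](argument)


class Planner:
    """Plans in natural language which tool to call next; a trivial rule
    stands in for the LLM call here."""
    def next_step(self, query: str, memory: Memory, history: List[str]) -> str:
        if not any("caption_video" in step for step in history):
            return "caption_video(video_0)"
        return "answer(question)"


class Learner:
    """Stores successful reasoning traces as examples for later queries."""
    def __init__(self) -> None:
        self.examples: List[List[str]] = []

    def record(self, trace: List[str], success: bool) -> None:
        if success:
            self.examples.append(trace)


def assist(query: str) -> str:
    memory = Memory()
    memory.add("video_0", "user-provided video (placeholder)")
    executor = Executor({
        "caption_video": lambda ref: f"caption of {ref}",   # hypothetical tool
        "answer": lambda q: "final answer (placeholder)",   # hypothetical tool
    })
    planner, learner = Planner(), Learner()
    trace: List[str] = []

    for _ in range(5):  # bounded reasoning loop
        step = planner.next_step(query, memory, trace)   # Plan
        tool, arg = step.split("(", 1)
        result = executor.run(tool, arg.rstrip(")"))     # Execute
        memory.add(f"result_{len(trace)}", result)       # Inspect (update memory)
        trace.append(step)
        if tool == "answer":
            learner.record(trace, success=True)          # Learn from the trace
            return result
    return "no answer found"


print(assist("What happens in the video?"))

In this sketch the Planner only ever sees a textual summary of the memory, while the Inspector decides which raw visual inputs or intermediate results get handed to a tool; that separation is the design point the pseudocode is meant to illustrate, under the stated assumptions.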
By integrating multiple models and adopting this interleaved language-and-code reasoning scheme, AssistGPT offers the following features:
@article{gao2023assistgpt,
  title  = {AssistGPT: A General Multi-modal Assistant that can Plan, Execute, Inspect, and Learn},
  author = {Gao, Difei and Ji, Lei and Zhou, Luowei and Lin, Kevin Qinghong and Chen, Joya and Fan, Zihan and Shou, Mike Zheng},
  year   = {2023},
}