Recent advances in Large Language Models (LLMs) have improved general NLP AI assistants and their capacity to plan and invoke models or APIs. However, complex visual tasks remain challenging due to the diversity of reasoning paths, variable inputs, and intermediate results. In many real-world applications, it is hard to break down a query based on the query text alone; the decomposition also depends on the specific visual content and on step-by-step intermediate results. Inputs can vary, often combining videos and images, and the reasoning process can produce diverse multimodal intermediate results such as video narrations and segmented video clips.
To address such general cases, we propose a multi-modal AI assistant, AssistGPT, which uses an interleaved code-and-language reasoning approach called Plan, Execute, Inspect, and Learn (PEIL) to integrate LLMs with various tools. Specifically, the Planner uses natural language to decide which tool in the Executor should be invoked next, based on the current reasoning progress. The Inspector is an efficient memory manager that assists the Planner in feeding the proper visual information into a specific tool. Finally, since the entire reasoning process is complex and flexible, the Learner enables the model to autonomously explore and discover the optimal solution.
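To make the interplay of these four roles concrete, below is a minimal, illustrative sketch of a PEIL-style loop in Python. It is not the AssistGPT implementation: the tool names (caption_video, answer), the memory layout, and the rule-based planner standing in for an LLM call are all hypothetical placeholders, used only to show how a Planner, Executor, Inspector-managed memory, and Learner could interact.

from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Memory:
    """Inspector-managed record of inputs and intermediate results."""
    items: Dict[str, str] = field(default_factory=dict)

    def add(self, name: str, summary: str) -> None:
        self.items[name] = summary


class Executor:
    """Wraps external tools; each tool maps a reference/argument to a result."""
    def __init__(self, tools: Dict[str, Callable[[str], str]]):
        self.tools = tools

    def run(self, tool_name: str, argument: str) -> str:
        return self.tools[tool_name](argument)


class Planner:
    """Plans in natural language which tool to call next; a trivial rule
    stands in for the LLM call here."""
    def next_step(self, query: str, memory: Memory, history: List[str]) -> str:
        if not any("caption_video" in step for step in history):
            return "caption_video(video_0)"
        return "answer(question)"


class Learner:
    """Stores successful reasoning traces as examples for later queries."""
    def __init__(self) -> None:
        self.examples: List[List[str]] = []

    def record(self, trace: List[str], success: bool) -> None:
        if success:
            self.examples.append(trace)


def assist(query: str) -> str:
    memory = Memory()
    memory.add("video_0", "user-provided video (placeholder)")
    executor = Executor({
        "caption_video": lambda ref: f"caption of {ref}",   # hypothetical tool
        "answer": lambda q: "final answer (placeholder)",   # hypothetical tool
    })
    planner, learner = Planner(), Learner()
    trace: List[str] = []

    for _ in range(5):  # bounded reasoning loop
        step = planner.next_step(query, memory, trace)   # Plan
        tool, arg = step.split("(", 1)
        result = executor.run(tool, arg.rstrip(")"))     # Execute
        memory.add(f"result_{len(trace)}", result)       # Inspect (update memory)
        trace.append(step)
        if tool == "answer":
            learner.record(trace, success=True)          # Learn from the trace
            return result
    return "no answer found"


print(assist("What happens in the video?"))

In this sketch the Planner only ever sees a textual summary of the memory, while the Inspector decides which raw visual inputs or intermediate results get handed to a tool; that separation is the design point the pseudocode is meant to illustrate, under the stated assumptions.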
By integrating multiple models and adopting this interleaved language-and-code reasoning scheme, AssistGPT offers the following features:
@article{gao2023assistgpt,
  title  = {AssistGPT: A General Multi-modal Assistant that can Plan, Execute, Inspect, and Learn},
  author = {Gao, Difei and Ji, Lei and Zhou, Luowei and Lin, Kevin Qinghong and Chen, Joya and Fan, Zihan and Shou, Mike Zheng},
  year   = {2023},
}