TL;DR: The first streaming video LLM, running at high speed (5-10 FPS on an RTX 3090 GPU, 10-15 FPS on an A100 GPU) on long-form videos (10 minutes), with state-of-the-art performance in both online and offline settings.

  • Online Video Streaming: Unlike previous models that operate in an offline mode (querying/responding to a complete video), our model supports online interaction within a video stream. It can proactively update its responses during the stream, for example by noting activity changes or helping with the next steps in real time. Even GPT-4o, which is audio-driven, requires the user's voice to interact with the visual scene rather than processing an actual video stream.

  • Cheap and Scalable Streaming Data Synthesis: Current video datasets for training multimodal LLMs are mostly offline and unsuitable for training an online video language model. Our method transforms any offline annotation into streaming dialogue data by prompting an open-source LLM (a minimal sketch follows this list). The model is trained entirely on Llama-synthesized data.

  • Real-Time Inference: Our inference method parallelizes video encoding, LLM forwarding for video frames, and LLM response generation, arranging them asynchronously. This significantly improves real-time performance, achieving 5-10 FPS on an RTX 3090 GPU and 10-15 FPS on an A100 GPU.
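To illustrate the data-synthesis idea from the second bullet above, here is a minimal sketch, not the released pipeline: it assumes a hypothetical Annotation record, an llm_generate callable that wraps an open-source chat LLM such as Llama, and a simple "<time>s: <message>" reply format. All names, the prompt wording, and the parsing rules are illustrative assumptions.

# Hypothetical sketch: convert one offline temporal annotation (start/end time
# plus a caption) into timestamped streaming-dialogue turns by prompting an
# open-source chat LLM. Names and formats here are assumptions for illustration.

from dataclasses import dataclass

@dataclass
class Annotation:
    start: float   # segment start time in seconds
    end: float     # segment end time in seconds
    caption: str   # offline description of the segment

SYNTHESIS_PROMPT = (
    "You are narrating a video stream in real time. The segment from "
    "{start:.1f}s to {end:.1f}s is annotated as: \"{caption}\". Write short "
    "assistant messages, one per line in the form '<time>s: <message>', that "
    "describe this activity as it unfolds."
)

def to_streaming_dialogue(ann: Annotation, llm_generate) -> list:
    """Prompt the LLM and parse its reply into timestamped assistant turns.

    `llm_generate` is any callable mapping a prompt string to the model's text
    reply (e.g., a locally served Llama model).
    """
    prompt = SYNTHESIS_PROMPT.format(start=ann.start, end=ann.end, caption=ann.caption)
    turns = []
    for line in llm_generate(prompt).splitlines():
        try:
            time_str, message = line.split("s:", 1)
            turns.append({"time": float(time_str.strip()),
                          "role": "assistant",
                          "content": message.strip()})
        except ValueError:
            continue  # skip lines that do not follow the expected format
    return turns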

Abstract

Recent large language models have been enhanced with vision capabilities, enabling them to comprehend images, videos, and interleaved vision-language content. However, the learning methods of these large multimodal models typically treat videos as predetermined clips, making them less effective and efficient at handling streaming video inputs. In this paper, we propose a novel Learning-In-Video-Stream (LIVE) framework, which enables temporally aligned, long-context, and real-time conversation within a continuous video stream. Our LIVE framework comprises complementary approaches for video streaming dialogue: (1) a training objective designed to perform language modeling on continuous streaming inputs, (2) a data generation scheme that converts offline temporal annotations into a streaming dialogue format, and (3) an optimized inference pipeline to speed up model responses in real-world video streams. With our LIVE framework, we build the VideoLLM-online model upon Llama-2/Llama-3 and demonstrate its significant advantages in processing streaming videos. For instance, on average, our model can support streaming dialogue in a 5-minute video clip at over 10 FPS on an A100 GPU. It also achieves state-of-the-art performance on public offline video benchmarks, such as recognition, captioning, and forecasting. The code, model, data, and demo have been made available at https://showlab.github.io/videollm-online.
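To make component (3), the optimized inference pipeline, concrete, the following is a minimal sketch (not the released implementation) of how frame encoding, per-frame LLM forwarding, and response generation can be arranged asynchronously. The hooks encode_frame, llm_append_frame, llm_wants_to_respond, and llm_generate_response are hypothetical stand-ins for the vision encoder and the language model.

# A minimal sketch of the asynchronous arrangement: a background thread keeps
# encoding incoming frames while the language model consumes encoded frames and
# generates a reply only when it decides one is needed. All hook names below
# are hypothetical, not the project's actual API.

import queue
import threading

def streaming_dialogue_loop(video_stream, encode_frame, llm_append_frame,
                            llm_wants_to_respond, llm_generate_response):
    """Interleave per-frame LLM forwarding with on-demand response generation."""
    frame_queue = queue.Queue(maxsize=64)  # buffer of encoded frame embeddings

    def encoder_worker():
        # Encode frames as they arrive so the LLM never waits on the vision encoder.
        for frame in video_stream:
            frame_queue.put(encode_frame(frame))
        frame_queue.put(None)  # sentinel: the stream has ended

    threading.Thread(target=encoder_worker, daemon=True).start()

    while True:
        frame_embedding = frame_queue.get()
        if frame_embedding is None:
            break
        llm_append_frame(frame_embedding)   # extend the LLM's context with this frame
        if llm_wants_to_respond():          # per-frame decision: speak or stay silent
            print(llm_generate_response())  # decode a reply while frames keep being encoded

The intent of the sketch is simply to show the three stages overlapping rather than running strictly in sequence, which is the asynchronous arrangement described in the Real-Time Inference bullet above.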


BibTeX

@inproceedings{VideoLLM-online,
  author    = {Joya Chen and Zhaoyang Lv and Shiwei Wu and Kevin Qinghong Lin and Chenan Song and Difei Gao and Jia-Wei Liu and Ziteng Gao and Dongxing Mao and Mike Zheng Shou},
  title     = {VideoLLM-online: Online Video Large Language Model for Streaming Video},
  booktitle = {CVPR},
  year      = {2024},
}