Paper2Video
Automatic Video Generation from Scientific Papers

Show Lab, National University of Singapore
*Equal Contribution · Corresponding Author

TL;DR

We address two questions: how to create a presentation video from a paper,
and how to evaluate a presentation video.

Teaser

😼 Example: Paper2Video applied to the Paper2Video paper itself


🎬 Demos by Paper2Video

Paper + Image + Audio 👉 Presentation Video

🆚 Comparison between Agents and Humans

Each example compares a video generated by Veo3, a video generated by Paper2Video, and the human-made recording:

  • 🔗 NeRSP: Neural 3D Reconstruction for Reflective Objects with Sparse Polarized Images
  • 🔗 Seeing What You Said: Talking Face Generation Guided by a Lip Reading Expert
  • 🔗 What's Mine becomes Yours: Defining, Annotating and Detecting Context-Dependent Paraphrases in News Interview Dialogs
  • 🔗 ConQRet: Benchmarking Fine-Grained Evaluation of Retrieval Augmented Argumentation with LLM Judges

Abstract

Academic presentation videos have become an essential medium for research communication, yet producing them remains highly labor-intensive, often requiring hours of slide design, recording, and editing for a short 2-to-10-minute video. Unlike natural video, presentation video generation poses distinctive challenges: inputs from research papers, dense multi-modal information (text, figures, tables), and the need to coordinate multiple aligned channels such as slides, subtitles, speech, and a human talker. To address these challenges, we introduce Paper2Video, the first benchmark of 101 research papers paired with author-created presentation videos, slides, and speaker metadata. We further design four tailored evaluation metrics (Meta Similarity, PresentArena, PresentQuiz, and IP Memory) to measure how well videos convey the paper's information to the audience. Building on this foundation, we propose PaperTalker, the first multi-agent framework for academic presentation video generation. It integrates slide generation with effective layout refinement via a novel tree-search visual choice, cursor grounding, subtitling, speech synthesis, and talking-head rendering, while parallelizing slide-wise generation for efficiency. Experiments on Paper2Video demonstrate that the presentation videos produced by our approach are more faithful and informative than those of existing baselines, establishing a practical step toward automated and ready-to-use academic video generation.

PaperTalker

To address these challenges and liberate researchers from the burdensome task of manual video preparation, we introduce PaperTalker, a multi-agent framework designed to automatically generate presentation videos directly from academic papers.

As illustrated in Figure 4, the pipeline decouples the different roles into four builders, which keeps the method scalable and flexible (a minimal orchestration sketch follows the list):

  1. Slide builder. Given the paper, we first synthesize slides as LaTeX code and refine them with compilation feedback to fix errors and optimize the layout.
  2. Subtitle builder. The slides are then processed by a VLM to generate subtitles and sentence-level visual-focus prompts.
  3. Cursor builder. These prompts are then grounded into on-screen cursor coordinates and synchronized with the narration.
  4. Talker builder. Given the voice sample and the portrait of the speaker, text-to-speech and talking-head modules generate a realistic, personalized talker video.
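
To make the division of labor concrete, here is a minimal Python sketch of how the four builders might be orchestrated, with the per-slide work parallelized as described above. Every name in it (build_presentation, SlideScript, the builder callables) is a hypothetical stand-in for illustration, not the released PaperTalker implementation.

# Hypothetical orchestration of the four builders; every class and function
# name is an illustrative stand-in, not the released PaperTalker API.
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Callable, Sequence

@dataclass
class SlideScript:
    subtitles: list[str]       # sentence-level narration for one slide
    focus_prompts: list[str]   # per-sentence visual-focus prompts

def build_presentation(
    paper_path: str,
    portrait_path: str,
    voice_sample_path: str,
    slide_builder: Callable[[str], Sequence[str]],             # paper -> slide images
    subtitle_builder: Callable[[str], SlideScript],            # slide -> script
    cursor_builder: Callable[[str, list[str]], list[tuple[float, float]]],
    talker_builder: Callable[[list[str], str, str], str],      # script -> talker clip
) -> list[tuple[str, list[tuple[float, float]], str]]:
    # 1. Slide builder: LaTeX slides refined with compilation feedback.
    slides = slide_builder(paper_path)

    # 2. Subtitle builder: subtitles plus visual-focus prompts per slide.
    scripts = [subtitle_builder(slide) for slide in slides]

    # 3-4. Cursor grounding and talking-head rendering are independent
    # across slides, so they are parallelized slide-wise for efficiency.
    def render_one(slide: str, script: SlideScript):
        cursor_track = cursor_builder(slide, script.focus_prompts)
        talker_clip = talker_builder(script.subtitles, portrait_path, voice_sample_path)
        return (slide, cursor_track, talker_clip)

    with ThreadPoolExecutor() as pool:
        clips = list(pool.map(render_one, slides, scripts))
    return clips  # composed downstream into the final presentation video

Passing the builders in as callables keeps each role decoupled and swappable, which is the scalability and flexibility the pipeline design aims for.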

Paper2Video Benchmark

We present Paper2Video, the first high-quality benchmark of 101 curated paper–video pairs spanning diverse research topics, each pairing a research paper with its author-recorded presentation video, slides, and speaker metadata. Each paper averages about 13.3K words, 44.7 figures, and 28.7 pages, providing rich multimodal long-document inputs. Presentations contain 16 slides on average and run for about 6 minutes 15 seconds, with some reaching 14 minutes. Rather than focusing only on video generation, Paper2Video is designed to evaluate long-horizon agentic tasks that require integrating text, figures, slides, and spoken presentations.
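
As a rough illustration of what one benchmark entry contains, a minimal data model might look like the following. The field names are assumptions inferred from the description above, not the released schema.

# Illustrative data model for one Paper2Video entry; field names are
# assumptions inferred from the benchmark description, not the real schema.
from dataclasses import dataclass

@dataclass
class Paper2VideoEntry:
    paper_pdf: str              # source paper (~13.3K words, ~28.7 pages on average)
    figures: list[str]          # extracted figures (~44.7 per paper on average)
    slides_pdf: str             # author-created slide deck (~16 slides on average)
    presentation_video: str     # author-recorded video (~6 min 15 s, up to ~14 min)
    speaker_portrait: str       # speaker metadata: reference portrait image
    speaker_voice_sample: str   # speaker metadata: short voice sample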

Paper2Video Metrics

Unlike natural videos, academic presentation videos serve a highly specialized role: they are not merely about visual fidelity but about communicating scholarship, and their value lies in how well they disseminate research and amplify scholarly visibility. This makes it difficult to directly apply conventional metrics from video synthesis (e.g., FVD, IS, or CLIP-based similarity). From this perspective, we argue that a high-quality academic presentation video should be judged along two complementary dimensions:

For the audience

  • The video is expected to faithfully convey the paper’s core ideas.
  • It should remain accessible to diverse audiences.

For the author

  • The video should foreground the authors’ intellectual contribution and identity.
  • It should enhance the work’s visibility and impact.

To capture these goals, we introduce four evaluation metrics tailored to academic presentation videos: Meta Similarity, PresentArena, PresentQuiz, and IP Memory.
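
As an example of how one of these metrics could be operationalized, below is a minimal sketch of a PresentArena-style pairwise comparison, in which a video-capable judge model is asked which of two presentations better conveys the paper. The judge callable, the A/B protocol, and the position-swapping detail are assumptions made for illustration, not the paper's exact procedure.

# Minimal sketch of a PresentArena-style pairwise evaluation; `judge` is an
# assumed callable (e.g., a VLM wrapper) returning "A" or "B" for the
# preferred presentation. This is illustrative, not the paper's exact setup.
from typing import Callable, Sequence

def present_arena_win_rate(
    generated: Sequence[str],     # paths to generated presentation videos
    human_made: Sequence[str],    # paths to the authors' own videos
    judge: Callable[[str, str], str],
    rounds: int = 2,
) -> float:
    wins, total = 0, 0
    for gen, ref in zip(generated, human_made):
        for r in range(rounds):
            # Swap presentation order across rounds to reduce position bias.
            if r % 2 == 0:
                wins += judge(gen, ref) == "A"
            else:
                wins += judge(ref, gen) == "B"
            total += 1
    return wins / total if total else 0.0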

BibTeX


      @misc{paper2video,
            title={Paper2Video: Automatic Video Generation from Scientific Papers}, 
            author={Zeyu Zhu and Kevin Qinghong Lin and Mike Zheng Shou},
            year={2025},
            eprint={2510.05096},
            archivePrefix={arXiv},
            primaryClass={cs.CV},
            url={https://arxiv.org/abs/2510.05096}, 
      }