Code2Video: A Code-centric Paradigm
for Educational Video Generation

Yanzhe Chen* Kevin Qinghong Lin* Mike Zheng Shou

Show Lab, National University of Singapore
* Equal Contribution Corresponding Author


Showcase of Code2Video

The videos below are generated via Coding.


Abstract


While recent generative models advance pixel-space video synthesis, they remain limited in producing professional educational videos, which demand disciplinary knowledge, precise visual structures, and coherent transitions; this limits their applicability in educational scenarios. Intuitively, such requirements are better addressed through the manipulation of a renderable environment, which can be explicitly controlled via logical commands (e.g., code). In this work, we propose Code2Video, a code-centric agent framework for generating educational videos via executable Python code. The framework comprises three collaborative agents: (i) Planner, which structures lecture content into temporally coherent flows and prepares corresponding visual assets; (ii) Coder, which converts structured instructions into executable Python code while incorporating scope-guided auto-fix to enhance efficiency; and (iii) Critic, which leverages vision-language models (VLMs) with anchor visual prompts to refine spatial layout and ensure clarity. To support systematic evaluation, we build MMMC, a benchmark of professionally produced, long-form, discipline-specific educational videos. We evaluate on MMMC across diverse dimensions, including VLM-as-a-Judge aesthetic scores, code efficiency, and, in particular, TeachQuiz, a novel end-to-end knowledge-transfer metric measured by a VLM's ability to learn from the generated videos. Our results demonstrate the potential of Code2Video as a scalable, interpretable, and controllable approach for educational video generation.
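As a rough illustration of what a scope-guided auto-fix loop for the Coder might look like, the sketch below tries the narrowest repair first and widens the scope only when that fails. It is not the authors' implementation: the scope names ("line", "section"), the retry budget, and the toy fixer functions are all assumptions, and Python's built-in `exec` stands in for actually rendering a video section. In a real system each fixer would presumably be an LLM call.

```python
from typing import Callable, List, Optional, Tuple

# A fixer takes (code, error message) and returns repaired code.
# Hypothetical interface; a real fixer would likely wrap an LLM call.
Fixer = Callable[[str, str], str]


def run_section(code: str) -> Optional[str]:
    """Run one section's code; return None on success, else the error text."""
    try:
        exec(compile(code, "<section>", "exec"), {})
        return None
    except Exception as exc:  # includes SyntaxError raised by compile()
        return f"{type(exc).__name__}: {exc}"


def scope_guided_fix(
    code: str,
    fixers: List[Tuple[str, Fixer]],
    tries_per_scope: int = 2,
) -> Tuple[str, List[str]]:
    """Try fixers from narrowest to broadest scope until the code runs.

    Returns the working code and a log of which scopes were attempted.
    """
    log: List[str] = []
    error = run_section(code)
    if error is None:
        return code, log
    for scope, fixer in fixers:
        for _ in range(tries_per_scope):
            candidate = fixer(code, error)
            log.append(scope)
            if run_section(candidate) is None:
                return candidate, log
    raise RuntimeError(f"unresolved after all scopes: {error}")


# Toy fixers (assumptions for illustration only).
def fix_line(code: str, error: str) -> str:
    # Patches a single known typo, mimicking a cheap line-level repair.
    return code.replace("totl", "total")


def regenerate_section(code: str, error: str) -> str:
    # Discards the section and rewrites it, mimicking a costly full redo.
    return "total = 1 + 2\nresult = total * 2\n"


broken = "total = 1 + 2\nresult = totl * 2\n"
fixed, attempts = scope_guided_fix(
    broken, [("line", fix_line), ("section", regenerate_section)]
)
print(attempts)
```

The point of ordering fixers by scope is that cheap, local repairs are attempted before expensive regeneration, which is one plausible reading of how such a mechanism could "enhance efficiency".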

Method


Illustration of Code2Video. Given a user inquiry, Code2Video renders an educational video by writing Manim code: (i) the Planner converts a learning topic into a storyboard and retrieves visual assets; (ii) the Coder performs parallel code synthesis with scope-guided refinement to ensure efficiency and temporal consistency; (iii) the Critic uses anchor visual prompts to iteratively adjust spatial layout and clarity, yielding reproducible, pedagogically structured videos.
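One way to picture the anchor visual prompts used by the Critic is as a labeled grid over the canvas, so that a VLM can refer to a named cell such as "B3" instead of raw coordinates. The sketch below builds such a grid in plain Python. The 6x6 grid size, the letter-plus-number labels, and the function name `build_anchor_grid` are assumptions made for illustration; the frame dimensions follow Manim Community's default 16:9 frame (height 8 units, origin at the center).

```python
from typing import Dict, Tuple

# Manim Community's default frame: 8 units tall, 16:9 aspect ratio.
FRAME_HEIGHT = 8.0
FRAME_WIDTH = FRAME_HEIGHT * 16 / 9


def build_anchor_grid(
    rows: int = 6,
    cols: int = 6,
    width: float = FRAME_WIDTH,
    height: float = FRAME_HEIGHT,
) -> Dict[str, Tuple[float, float]]:
    """Map labels like 'A1' to the (x, y) center of each grid cell.

    Rows are lettered top to bottom, columns numbered left to right,
    and the origin sits at the center of the frame.
    """
    cell_w = width / cols
    cell_h = height / rows
    anchors: Dict[str, Tuple[float, float]] = {}
    for r in range(rows):
        for c in range(cols):
            label = f"{chr(ord('A') + r)}{c + 1}"
            x = -width / 2 + (c + 0.5) * cell_w
            y = height / 2 - (r + 0.5) * cell_h
            anchors[label] = (x, y)
    return anchors


grid = build_anchor_grid()
x, y = grid["B3"]
print(f"B3 -> ({x:.3f}, {y:.3f})")
```

With a mapping like this, a layout instruction such as "move the formula to B3" could be translated into a Manim call along the lines of `formula.move_to([x, y, 0])`, keeping the Critic's feedback discrete and easy to apply.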