Showcase of Code2Video
The videos below are generated via code.
Hanoi Problem
Puzzles
Neural Network Structure
Neural Networks
History and Definition of π
Calculus
Space-filling Curves
Topology
Abstract
While recent generative models advance pixel-space video synthesis, they remain limited in producing professional educational videos, which demand disciplinary knowledge, precise visual structures, and coherent transitions. This restricts their applicability in educational scenarios. Intuitively, such requirements are better addressed through the manipulation of a renderable environment, which can be explicitly controlled via logical commands (e.g., code). In this work, we propose Code2Video, a code-centric agent framework for generating educational videos via executable Python code. The framework comprises three collaborative agents: (i) Planner, which structures lecture content into temporally coherent flows and prepares corresponding visual assets; (ii) Coder, which converts structured instructions into executable Python code while incorporating scope-guided auto-fix to enhance efficiency; and (iii) Critic, which leverages vision-language models (VLMs) with anchor visual prompts to refine spatial layout and ensure clarity. To support systematic evaluation, we build MMMC, a benchmark of professionally produced, long-form, discipline-specific educational videos. We evaluate on MMMC across diverse dimensions, including VLM-as-a-Judge aesthetic scores, code efficiency, and, in particular, TeachQuiz, a novel end-to-end knowledge-transfer measure based on a VLM's ability to learn from the generated videos. Our results demonstrate the potential of Code2Video as a scalable, interpretable, and controllable approach for educational video generation.
Method
Illustration of Code2Video. Given a user inquiry, Code2Video renders an educational video by writing Manim code: (i) the Planner converts a learning topic into a storyboard and retrieves visual assets; (ii) the Coder performs parallel code synthesis with scope-guided refinement to ensure efficiency and temporal consistency; (iii) the Critic uses anchor visual prompts to iteratively adjust spatial layout and clarity, yielding reproducible, pedagogically structured videos.
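The Planner, Coder, and Critic flow described above can be pictured with a small sketch. This is not the Code2Video implementation: every name here (`Section`, `plan`, `write_scene`, `check_syntax`, `critique`, `generate`) is hypothetical, the Planner is a hard-coded stub rather than an LLM call, sections are processed sequentially rather than in parallel, and the Critic is replaced by a simple text-length check instead of a VLM with anchor visual prompts. The sketch only emits Manim source as text and checks that it parses; rendering would require Manim to be installed.

```python
import ast
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Section:
    """One storyboard entry: a title plus the lines to show on screen."""
    title: str
    lines: List[str]


def plan(topic: str) -> List[Section]:
    """Planner stand-in (assumption: a real system would query an LLM)."""
    return [
        Section(f"{topic}: setup", ["Three pegs", "Disks stacked by size"]),
        Section(f"{topic}: rule", ["Move one disk at a time"]),
    ]


def write_scene(index: int, section: Section) -> str:
    """Coder stand-in: emit Manim source for one section as plain text."""
    body = [
        "from manim import Scene, Text, Write, FadeIn, DOWN",
        "",
        f"class Section{index}(Scene):",
        "    def construct(self):",
        f"        title = Text({section.title!r})",
        "        self.play(Write(title))",
        "        previous = title",
    ]
    for line in section.lines:
        body += [
            f"        item = Text({line!r}).scale(0.6)",
            "        item.next_to(previous, DOWN)",
            "        self.play(FadeIn(item))",
            "        previous = item",
        ]
    body.append("        self.wait(1)")
    return "\n".join(body)


def check_syntax(source: str) -> Optional[str]:
    """Return an error message if the source does not parse, else None."""
    try:
        ast.parse(source)
    except SyntaxError as exc:
        return f"line {exc.lineno}: {exc.msg}"
    return None


def critique(section: Section, max_chars: int = 40) -> List[str]:
    """Critic placeholder: flag lines that may overflow the frame.

    The 40-character limit is an arbitrary illustrative threshold.
    """
    return [line for line in section.lines if len(line) > max_chars]


def generate(topic: str) -> Dict[str, str]:
    """Run plan -> write -> check for each section; return scene sources."""
    scenes: Dict[str, str] = {}
    for index, section in enumerate(plan(topic)):
        source = write_scene(index, section)
        error = check_syntax(source)
        if error is not None:
            # A fuller system would attempt a repair here instead of failing.
            raise ValueError(f"Section{index} failed to parse: {error}")
        scenes[f"Section{index}"] = source
    return scenes


scenes = generate("Hanoi Problem")
```

Each value in `scenes` is the text of a standalone Manim scene file. Keeping one scene per storyboard section means a failure in one section can be reported, and in a fuller system repaired, without regenerating the others.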
