Kiwi-Edit: Versatile Video Editing via Instruction and Reference Guidance

Yiqi Lin, Guoqiang Liang, Ziyun Zeng, Zechen Bai, Yanzhe Chen,
Mike Zheng Shou✉️

ShowLab, National University of Singapore   

We present Kiwi-Edit, a unified and fully open-source framework for instruction-guided and reference-guided video editing using natural language. Kiwi-Edit supports high-quality, temporally consistent edits across global and local tasks, and delivers strong open-model performance at 720p resolution with released code, models, and datasets.

Fully Open Source

Code, models, and datasets are fully open-sourced.

Instruction Guided Editing

Edit videos with natural language instructions for global and local modifications.

Reference Image Guided

Replace the background or insert an object from a reference image into the target video.

Instruction-Guided Video Editing

Kiwi-Edit supports a wide range of text instruction-based video editing tasks. Select a category below to view results.

Reference Image-Guided Video Editing

Kiwi-Edit can transfer visual attributes from a reference image to the target video. The model extracts background or subject information from the reference image while preserving the original motion and structure of the video.

Quantitative Results

OpenVE-Bench

We evaluate on OpenVE-Bench, assessed by Gemini-2.5-Pro, across five editing categories: Global Style, Background Change, Local Change, Local Remove, and Local Add. Kiwi-Edit (Stage-3 Instruct-Reference) achieves the best overall score among open-source methods.

| Method | #Params. | Resolution | Overall ↑ | Global Style ↑ | Background Change ↑ | Local Change ↑ | Local Remove ↑ | Local Add ↑ |
|---|---|---|---|---|---|---|---|---|
| VACE | 14B | 1280×720 | 1.57 | 1.49 | 1.55 | 2.07 | 1.46 | 1.26 |
| OmniVideo | 1.3B | 640×352 | 1.19 | 1.11 | 1.18 | 1.14 | 1.14 | 1.36 |
| InsViE | 2B | 720×480 | 1.45 | 2.20 | 1.06 | 1.48 | 1.36 | 1.17 |
| Lucy-Edit | 5B | 1280×704 | 2.22 | 2.27 | 1.57 | 3.20 | 1.75 | 2.30 |
| ICVE | 13B | 384×240 | 2.18 | 2.22 | 1.62 | 2.57 | 2.51 | 1.97 |
| DITTO | 14B | 832×480 | 2.13 | 4.01 | 1.68 | 2.03 | 1.53 | 1.41 |
| OpenVE-Edit | 5B | 1280×704 | 2.50 | 3.16 | 2.36 | 2.98 | 1.85 | 2.15 |
| Ours (Stage-2 Instruct-Only) | 5B | 720×480 | 2.92 | 3.54 | 3.80 | 2.59 | 2.55 | 2.12 |
| Ours (Stage-2 Instruct-Only) | 5B | 1280×704 | 2.98 | 3.54 | 3.84 | 2.57 | 2.71 | 2.25 |
| Ours (Stage-3 Instruct-Reference) | 5B | 1280×704 | 3.02 | 3.64 | 2.64 | 3.83 | 2.63 | 2.36 |
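The "best overall among open-source methods" claim can be verified directly from the overall column of the table; a minimal check (method labels abbreviated from the table):

```python
# Overall scores copied from the OpenVE-Bench table (open-source methods only).
overall = {
    "VACE": 1.57,
    "OmniVideo": 1.19,
    "InsViE": 1.45,
    "Lucy-Edit": 2.22,
    "ICVE": 2.18,
    "DITTO": 2.13,
    "OpenVE-Edit": 2.50,
    "Kiwi-Edit (Stage-3 Instruct-Reference)": 3.02,
}

# Rank methods by overall score, best first.
ranking = sorted(overall, key=overall.get, reverse=True)
print(ranking[0])  # Kiwi-Edit (Stage-3 Instruct-Reference)
```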

RefVIE-Bench

We report subject-reference and background-reference evaluation scores on RefVIE-Bench. The table summarizes identity/temporal/physical consistency for subject guidance, reference similarity/matting/video quality for background guidance, and an overall score.

| Model | Identity Consist. (Subj.) | Temporal Consist. (Subj.) | Physical Consist. (Subj.) | Reference Sim. (Bg.) | Matting Quality (Bg.) | Video Quality (Bg.) | Overall |
|---|---|---|---|---|---|---|---|
| Runway Aleph | 3.79 | 3.65 | 3.58 | 3.33 | 2.81 | 2.58 | 3.29 |
| Kling-O1 | 4.75 | 4.66 | 4.60 | 3.95 | 3.21 | 2.75 | 3.99 |
| Ours (All data) | 3.51 | 2.96 | 2.91 | 3.40 | 2.58 | 2.40 | 2.96 |
| Ours (Ref. data only) | 3.98 | 3.40 | 3.34 | 3.72 | 2.90 | 2.51 | 3.31 |

Methodology

Architecture Design

Kiwi-Edit combines a multimodal LLM (MLLM) with a video diffusion transformer (DiT). Given a source video, an editing instruction, and an optional reference image, the model produces temporally consistent edited videos with controllable appearance and structure.

Kiwi-Edit architecture diagram

Architecture overview. The MLLM provides semantic guidance (instruction and reference context tokens), while the DiT receives both semantic context and direct structural controls from source/reference latents for controllable video editing.
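The two conditioning paths in the caption can be sketched in code. This is an illustrative sketch only: the module interfaces, projection layer, and channel-wise concatenation of latents are assumptions, not the released implementation.

```python
import torch
import torch.nn as nn

class KiwiEditSketch(nn.Module):
    """Sketch: MLLM context tokens plus source/reference latents condition a DiT."""

    def __init__(self, mllm: nn.Module, dit: nn.Module, proj_dim: int = 1024):
        super().__init__()
        self.mllm = mllm  # multimodal LLM (assumed interface: tokens in, tokens out)
        self.dit = dit    # video diffusion transformer (assumed interface)
        self.proj = nn.Linear(proj_dim, proj_dim)  # bridge MLLM tokens to DiT width

    def forward(self, noisy_latents, timestep, instruction, source_latents, ref_latents=None):
        # Semantic guidance: the MLLM encodes the instruction (and reference
        # context) into tokens the DiT consumes, e.g. via cross-attention.
        context = self.proj(self.mllm(instruction))
        # Structural control: source (and optional reference) latents are fed
        # to the DiT directly, here concatenated along the channel axis.
        controls = [noisy_latents, source_latents]
        if ref_latents is not None:
            controls.append(ref_latents)
        x = torch.cat(controls, dim=1)
        return self.dit(x, timestep, context)  # predicted noise / velocity
```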

Training Curriculum

We train Kiwi-Edit with a simple three-stage curriculum: alignment, instruction fine-tuning, and reference-guided fine-tuning. This progressive strategy improves stability and enables strong performance across instruction and reference editing settings.
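The curriculum can be expressed as an ordered stage list where each stage resumes from the previous checkpoint. Stage names follow the text; the trainable-module and reference-usage fields are illustrative assumptions.

```python
# Three-stage curriculum from the text: alignment -> instruction fine-tuning
# -> reference-guided fine-tuning. Per-stage settings are assumptions.
STAGES = [
    {"name": "stage1_alignment",          "trainable": ["proj"],        "use_reference": False},
    {"name": "stage2_instruct",           "trainable": ["proj", "dit"], "use_reference": False},
    {"name": "stage3_instruct_reference", "trainable": ["proj", "dit"], "use_reference": True},
]

def run_curriculum(train_stage):
    """Run the stages in order; each stage initializes from the previous checkpoint."""
    checkpoint = None
    for stage in STAGES:
        checkpoint = train_stage(stage, init_from=checkpoint)
    return checkpoint
```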

Training Dataset

We build an automated pipeline that converts instruction-editing triplets (source video, instruction, edited video) into reference-guided training quadruplets by pairing each triplet with a reference image.
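The triplet-to-quadruplet conversion can be sketched as follows; the field names and the `extract_reference` hook are hypothetical stand-ins for whatever extraction step (e.g. subject or background cropping) the actual pipeline uses.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Triplet:
    source_video: str   # path to the source clip
    instruction: str    # natural-language edit instruction
    edited_video: str   # path to the edited result

@dataclass
class Quadruplet(Triplet):
    reference_image: str  # extracted subject/background reference

def to_quadruplets(triplets: List[Triplet],
                   extract_reference: Callable[[Triplet], str]) -> List[Quadruplet]:
    """Attach a reference image to each instruction triplet."""
    return [
        Quadruplet(t.source_video, t.instruction, t.edited_video,
                   reference_image=extract_reference(t))
        for t in triplets
    ]
```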

Citation

If you find Kiwi-Edit useful for your research, please cite our paper:

@misc{Kiwi-Edit2026,
title={Kiwi-Edit: Versatile Video Editing via Instruction and Reference Guidance}, 
author={Yiqi Lin and Guoqiang Liang and Ziyun Zeng and Zechen Bai and Yanzhe Chen and Mike Zheng Shou},
year={2026},
eprint={2603.02175},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2603.02175}, 
}