Kiwi-Edit:
Versatile Video Editing via Instruction and Reference Guidance
We present Kiwi-Edit, a unified and fully open-source framework for instruction-guided and reference-guided video editing using natural language. Kiwi-Edit supports high-quality, temporally consistent edits across global and local tasks, and delivers strong open-model performance at 720p resolution with released code, models, and datasets.
Code, models, and datasets are fully open-sourced.
Edit videos with natural language instructions for global and local modifications.
Replace the background or insert an object from a reference image into the target video.
Kiwi-Edit supports a wide range of text instruction-based video editing tasks.
Kiwi-Edit can transfer visual attributes from a reference image to the target video. The model extracts background or subject information from the reference image while preserving the original motion and structure of the video.
We evaluate on OpenVE-Bench, with outputs scored by Gemini-2.5-Pro, across five editing categories: Global Style, Background Change, Local Change, Local Remove, and Local Add. Kiwi-Edit (Stage-3 Instruct-Reference) achieves the best Overall score (3.02) among open-source methods and leads on Local Change and Local Add. The Stage-2 model at 1280×704 leads on Background Change and Local Remove, and DITTO leads on Global Style.
| Method | Params | Resolution | Overall ↑ | Global Style ↑ | Background Change ↑ | Local Change ↑ | Local Remove ↑ | Local Add ↑ |
|---|---|---|---|---|---|---|---|---|
| VACE | 14B | 1280×720 | 1.57 | 1.49 | 1.55 | 2.07 | 1.46 | 1.26 |
| OmniVideo | 1.3B | 640×352 | 1.19 | 1.11 | 1.18 | 1.14 | 1.14 | 1.36 |
| InsViE | 2B | 720×480 | 1.45 | 2.20 | 1.06 | 1.48 | 1.36 | 1.17 |
| Lucy-Edit | 5B | 1280×704 | 2.22 | 2.27 | 1.57 | 3.20 | 1.75 | 2.30 |
| ICVE | 13B | 384×240 | 2.18 | 2.22 | 1.62 | 2.57 | 2.51 | 1.97 |
| DITTO | 14B | 832×480 | 2.13 | 4.01 | 1.68 | 2.03 | 1.53 | 1.41 |
| OpenVE-Edit | 5B | 1280×704 | 2.50 | 3.16 | 2.36 | 2.98 | 1.85 | 2.15 |
| Ours (Stage-2 Instruct-Only) | 5B | 720×480 | 2.92 | 3.54 | 3.80 | 2.59 | 2.55 | 2.12 |
| Ours (Stage-2 Instruct-Only) | 5B | 1280×704 | 2.98 | 3.54 | 3.84 | 2.57 | 2.71 | 2.25 |
| Ours (Stage-3 Instruct-Reference) | 5B | 1280×704 | 3.02 | 3.64 | 2.64 | 3.83 | 2.63 | 2.36 |
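As a reading aid, the Overall column above appears to be the unweighted mean of the five category scores. The source does not state this; it is an inference from the numbers. The sketch below checks it against every row. The scores are copied from the table, and the tolerance of 0.006 is an assumption that allows for two-decimal rounding.

```python
# Scores copied from the OpenVE-Bench table.
# Each entry: reported Overall, then
# [Global Style, Background Change, Local Change, Local Remove, Local Add].
ROWS = {
    "VACE": (1.57, [1.49, 1.55, 2.07, 1.46, 1.26]),
    "OmniVideo": (1.19, [1.11, 1.18, 1.14, 1.14, 1.36]),
    "InsViE": (1.45, [2.20, 1.06, 1.48, 1.36, 1.17]),
    "Lucy-Edit": (2.22, [2.27, 1.57, 3.20, 1.75, 2.30]),
    "ICVE": (2.18, [2.22, 1.62, 2.57, 2.51, 1.97]),
    "DITTO": (2.13, [4.01, 1.68, 2.03, 1.53, 1.41]),
    "OpenVE-Edit": (2.50, [3.16, 2.36, 2.98, 1.85, 2.15]),
    "Ours Stage-2 (720x480)": (2.92, [3.54, 3.80, 2.59, 2.55, 2.12]),
    "Ours Stage-2 (1280x704)": (2.98, [3.54, 3.84, 2.57, 2.71, 2.25]),
    "Ours Stage-3 (1280x704)": (3.02, [3.64, 2.64, 3.83, 2.63, 2.36]),
}


def mean_score(scores):
    """Unweighted mean of a list of category scores."""
    return sum(scores) / len(scores)


def find_mismatches(rows, tol=0.006):
    """Return rows whose reported Overall differs from the category mean by more than tol."""
    return {
        name: (overall, round(mean_score(cats), 3))
        for name, (overall, cats) in rows.items()
        if abs(overall - mean_score(cats)) > tol
    }


mismatches = find_mismatches(ROWS)
print(mismatches or "every Overall matches the mean of its five category scores")
```

Because the Overall score weights all categories equally, a method can rank first overall without leading every category, as the per-category columns show.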
We report subject-reference and background-reference evaluation scores on RefVIE-Bench. The table summarizes identity/temporal/physical consistency for subject guidance, reference similarity/matting/video quality for background guidance, and an overall score.
| Model | Subject Ref.: Identity Consist. | Subject Ref.: Temporal Consist. | Subject Ref.: Physical Consist. | Background Ref.: Reference Sim. | Background Ref.: Matting Quality | Background Ref.: Video Quality | Overall |
|---|---|---|---|---|---|---|---|
| Runway Aleph | 3.79 | 3.65 | 3.58 | 3.33 | 2.81 | 2.58 | 3.29 |
| Kling-O1 | 4.75 | 4.66 | 4.60 | 3.95 | 3.21 | 2.75 | 3.99 |
| Ours (All data) | 3.51 | 2.96 | 2.91 | 3.40 | 2.58 | 2.40 | 2.96 |
| Ours (Ref. data only) | 3.98 | 3.40 | 3.34 | 3.72 | 2.90 | 2.51 | 3.31 |
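To compare the two guidance modes directly, the six sub-scores can be grouped into a subject-reference mean and a background-reference mean. The sketch below does this for the rows above. The grouping and the assumption that Overall is the unweighted mean of all six sub-scores are inferred from the table, not stated by the source.

```python
# Scores copied from the RefVIE-Bench table.
# Each entry: (subject scores, background scores, reported Overall).
# Subject: identity, temporal, physical consistency.
# Background: reference similarity, matting quality, video quality.
REFVIE = {
    "Runway Aleph": ([3.79, 3.65, 3.58], [3.33, 2.81, 2.58], 3.29),
    "Kling-O1": ([4.75, 4.66, 4.60], [3.95, 3.21, 2.75], 3.99),
    "Ours (All data)": ([3.51, 2.96, 2.91], [3.40, 2.58, 2.40], 2.96),
    "Ours (Ref. data only)": ([3.98, 3.40, 3.34], [3.72, 2.90, 2.51], 3.31),
}


def summarize(subject, background):
    """Return (subject mean, background mean, mean of all six sub-scores)."""
    all_scores = subject + background
    return (
        sum(subject) / len(subject),
        sum(background) / len(background),
        sum(all_scores) / len(all_scores),
    )


summary = {name: summarize(subj, bg) for name, (subj, bg, _) in REFVIE.items()}

for name, (subj_mean, bg_mean, overall) in summary.items():
    reported = REFVIE[name][2]
    print(f"{name}: subject={subj_mean:.2f} background={bg_mean:.2f} "
          f"overall={overall:.2f} (reported {reported:.2f})")
```

Under this grouping, every model in the table scores higher on subject guidance than on background guidance.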
Kiwi-Edit combines a multimodal LLM (MLLM) with a video diffusion transformer (DiT). Given a source video, an editing instruction, and an optional reference image, the model produces temporally consistent edited videos with controllable appearance and structure.
Architecture overview. The MLLM provides semantic guidance (instruction and reference context tokens), while the DiT receives both semantic context and direct structural controls from source/reference latents for controllable video editing.
We train Kiwi-Edit with a simple three-stage curriculum: alignment, instruction fine-tuning, and reference-guided fine-tuning. This progressive strategy improves stability and enables strong performance across instruction and reference editing settings.
We build an automated pipeline to convert instruction triplets into reference-guided training quadruplets.
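The data shapes involved can be sketched as follows. This is an illustrative sketch, not the released pipeline. The field names, the `to_quadruplets` helper, and the `make_reference` callback are all hypothetical, and it assumes that a triplet is (source video, instruction, edited video) and that a quadruplet adds a reference image.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator, Optional


@dataclass(frozen=True)
class InstructionTriplet:
    """Assumed layout of an instruction-editing sample (hypothetical field names)."""
    source_video: str
    instruction: str
    edited_video: str


@dataclass(frozen=True)
class ReferenceQuadruplet:
    """Assumed layout of a reference-guided sample: a triplet plus a reference image."""
    source_video: str
    instruction: str
    reference_image: str
    edited_video: str


def to_quadruplets(
    triplets: Iterable[InstructionTriplet],
    make_reference: Callable[[InstructionTriplet], Optional[str]],
) -> Iterator[ReferenceQuadruplet]:
    """Attach a reference image to each triplet.

    `make_reference` stands in for whatever step produces the reference image.
    Samples for which it returns None are skipped (an assumed filtering step).
    """
    for t in triplets:
        ref = make_reference(t)
        if ref is None:
            continue
        yield ReferenceQuadruplet(t.source_video, t.instruction, ref, t.edited_video)


# Usage with a stub reference step and placeholder paths.
def stub_reference(t: InstructionTriplet) -> Optional[str]:
    # Pretend only background edits yield a usable reference image.
    return "ref_000.png" if "background" in t.instruction else None


samples = [
    InstructionTriplet("src_000.mp4", "change the background to a beach", "out_000.mp4"),
    InstructionTriplet("src_001.mp4", "remove the red car", "out_001.mp4"),
]
quads = list(to_quadruplets(samples, stub_reference))
print(quads)
```

Passing the reference step in as a callback keeps the conversion loop independent of how the reference image is actually produced.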
If you find Kiwi-Edit useful for your research, please cite our paper:
@misc{Kiwi-Edit2026,
title={Kiwi-Edit: Versatile Video Editing via Instruction and Reference Guidance},
author={Yiqi Lin and Guoqiang Liang and Ziyun Zeng and Zechen Bai and Yanzhe Chen and Mike Zheng Shou},
year={2026},
eprint={2603.02175},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2603.02175},
}