VIVID-10M: A Dataset and Baseline for Versatile and Interactive Video Local Editing

Jiahao Hu1, Tianxiong Zhong2, Xuebo Wang3, Boyuan Jiang3, Xingye Tian3, Fei Yang3, Pengfei Wan3, Di Zhang3

1Northwestern Polytechnical University, Xi'an, 2Beijing Institute of Technology, 3Kuaishou Technology


🎉 News

2025.07.10: The VIVID-10M dataset is released on Hugging Face.

「VIVID」Video Local Editing Model

VIVID is a versatile and interactive video local editing model trained on the VIVID-10M dataset. It supports entity addition, modification, and deletion.
Given a video, VIVID generates high-quality, harmonious content within the mask sequence, guided by the semantics of the local caption.
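To make the interface concrete, here is a minimal sketch of what a call to such a mask-conditioned editor could look like. `model.sample` and its arguments are hypothetical placeholders, since the paper does not publish an API:

```python
import torch

def edit_video_locally(model, video: torch.Tensor, masks: torch.Tensor,
                       local_caption: str, num_steps: int = 50) -> torch.Tensor:
    """Inpaint the masked region of a video, guided by a local caption.

    video: (T, C, H, W) source frames in [0, 1]
    masks: (T, 1, H, W) binary mask sequence (1 = region to edit)
    """
    # The model sees only the unedited pixels plus the mask sequence,
    # so everything outside the mask is preserved by construction.
    masked_video = video * (1 - masks)
    edited = model.sample(masked_video=masked_video, masks=masks,
                          prompt=local_caption, steps=num_steps)
    # Composite: original pixels outside the mask, generated ones inside.
    return video * (1 - masks) + edited * masks
```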

「KIVE」Keyframe-guided Interactive Video Editing

KIVE enables users to iteratively edit keyframes and propagate the edits to other frames, reducing the latency of reaching the desired outcome.
Furthermore, the KIVE mechanism supports local editing of long videos by using the last frame of one edited clip as the keyframe for the next.
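A rough sketch of that chaining logic, where `edit_keyframe` and `propagate` stand in for the two stages (both names are assumptions, not the released API):

```python
def kive_edit_long_video(clips, masks, caption, edit_keyframe, propagate):
    """Sketch of KIVE chaining over a long video split into clips.

    clips / masks: lists of per-clip frame and mask arrays
    edit_keyframe(frame, mask, caption) -> edited keyframe (cheap, interactive)
    propagate(clip, clip_masks, keyframe, caption) -> fully edited clip
    """
    edited_clips = []
    # The user iterates on the first keyframe until satisfied; only then
    # is the expensive propagation pass run.
    keyframe = edit_keyframe(clips[0][0], masks[0][0], caption)
    for clip, clip_masks in zip(clips, masks):
        edited = propagate(clip, clip_masks, keyframe, caption)
        edited_clips.append(edited)
        # The last edited frame seeds the next clip's keyframe,
        # keeping long videos temporally consistent.
        keyframe = edited[-1]
    return edited_clips
```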

For example, suppose we want to edit a video into a Halloween promotional video ...

Naive Video Editing Pipeline (34.2 PFLOPs)

1. Draw masks.
2. Video editing (17.1 PFLOPs): "A skeleton is staring ahead."
3. Video editing (17.1 PFLOPs): "A skeleton is staring ahead, with flames flickering in his eyes."

KIVE Pipeline (20.1 PFLOPs)

1. Draw masks.
2. Keyframe editing (1.5 PFLOPs): "A skeleton is staring ahead."
3. Keyframe editing (1.5 PFLOPs): "A skeleton is staring ahead, with flames flickering in his eyes."
4. Edit propagation (17.1 PFLOPs): "A skeleton is staring ahead, with flames flickering in his eyes."
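As a sanity check on the totals above, KIVE replaces one of the two full video-editing passes with two cheap keyframe edits:

```python
# Cost comparison from the two pipelines above (values in PFLOPs).
VIDEO_EDIT = 17.1      # one full video-editing pass
KEYFRAME_EDIT = 1.5    # one keyframe-editing pass

naive = 2 * VIDEO_EDIT                 # two full passes      -> 34.2
kive = 2 * KEYFRAME_EDIT + VIDEO_EDIT  # two keyframe edits
                                       # + one propagation    -> 20.1

print(f"naive: {naive:.1f} PFLOPs, KIVE: {kive:.1f} PFLOPs")
print(f"savings: {100 * (1 - kive / naive):.0f}%")  # ~41%
```

The gap widens with each extra editing attempt, since every additional iteration costs a keyframe edit rather than a full video pass.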

「VIVID-10M」Video Local Editing Dataset

VIVID-10M is the first large-scale hybrid image-video local editing dataset, built to reduce data construction and model training costs.
VIVID-10M comprises 9.7M samples covering a wide range of video editing tasks.

Each sample pairs a Ground Truth clip with its Mask Sequence, the resulting Masked Video, and a Local Caption describing the masked region. Three example local captions:

"A woman dressed in a black blazer adorned with golden buttons, complemented by a patterned scarf around her neck. She has short blonde hair and is wearing glasses. The background is textured with ornate designs. She holds a small object in her right hand and appears to be speaking or presenting. Her attire and overall demeanor suggest a formal or professional setting."

"A sculpture statue of a dynamic, white marble figure, likely representing a classical or mythological character, with its right arm raised and head turned upward as if capturing a divine moment. The intricate detailing includes flowing drapery that adds a sense of movement, emphasizing the figure's dramatic pose. The statue is mounted on a pedestal, which is placed prominently in a room with various paintings and a red wall, creating a harmonious blend of art forms."

"A cat with a predominantly black and white coat, possibly a tabby, running energetically in what seems to be a play area designed for pets. The area has green artificial turf flooring and includes various toys and small pieces of furniture, which are positioned randomly around the space. The cat's coat exhibits a classic tabby pattern with stripes and patches of solid fur, and it appears to be engaged in a lively activity within this environment."
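Read together, the examples above suggest a simple per-sample layout. A hypothetical record type mirroring the four components (field names are illustrative; the released schema may differ):

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class VividSample:
    """One VIVID-10M training sample (field names assumed, not official)."""
    ground_truth: np.ndarray   # (T, H, W, 3) original frames
    mask_sequence: np.ndarray  # (T, H, W) binary edit region per frame
    masked_video: np.ndarray   # (T, H, W, 3) frames with the region blanked
    local_caption: str         # text describing only the masked region
```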

Abstract

Diffusion-based image editing models have made remarkable progress in recent years. However, achieving high-quality video editing remains a significant challenge. One major hurdle is the absence of open-source, large-scale video editing datasets based on real-world data, as constructing such datasets is both time-consuming and costly. Moreover, video data requires a significantly larger number of tokens for representation, which substantially increases the training costs for video editing models. Lastly, current video editing models offer limited interactivity, often making it difficult for users to express their editing requirements effectively in a single attempt. To address these challenges, this paper introduces a dataset, VIVID-10M, and a baseline model, VIVID. VIVID-10M is the first large-scale hybrid image-video local editing dataset aimed at reducing data construction and model training costs; it comprises 9.7M samples that encompass a wide range of video editing tasks. VIVID is a Versatile and Interactive VIdeo local eDiting model trained on VIVID-10M, which supports entity addition, modification, and deletion. At its core, a keyframe-guided interactive video editing mechanism enables users to iteratively edit keyframes and propagate the edits to other frames, thereby reducing the latency of achieving desired outcomes. Extensive experimental evaluations show that our approach achieves state-of-the-art performance in video local editing, surpassing baseline methods in both automated metrics and user studies. The VIVID-10M dataset and the VIVID editing model will be open-sourced.