VIVID-10M

VIVID is a versatile and interactive video local editing model trained on the VIVID-10M dataset. It supports entity addition, modification, and deletion.
Given a video, VIVID generates high-quality, harmonious content within the mask sequence, guided by the semantics of the local caption.

Inputs

Edited Video

A sleek, circular UFO hovers in the air, its metallic surface gleaming with a futuristic sheen. The round body is smooth and streamlined, with a slightly domed top and a series of glowing lights around its outer edge.

A woman is wearing a stylish crossbody bag that rests comfortably against her side. The bag features a chic design, made from soft leather, and is adorned with metallic accents and decorative stitching.

A pillow rests quietly on the sofa, made of warm velvet fabric in a gentle light blue color. Its surface is adorned with delicate white floral patterns, and the edges feature meticulous stitching. The pillow has a classic square shape and a moderate size, with a few subtle wrinkles that add character.

A rock star signs an autograph for a fan. Dressed in edgy rock attire, with tattoos and jewelry adding to their iconic look. Enthusiastic fans are blocked behind the fence and stretch out their markers and notebooks, asking for the star's autograph. The star holds a marker and carefully signs the notebooks. Some security personnel stand between the celebrities and the bus, carefully observing the surroundings.

A small pond is situated at the center of the indoor space, its water crystal clear, with gentle ripples occasionally disturbed by a few golden koi swimming leisurely, adding vitality to the scene. The edges of the pond are adorned with natural stone, covered in moss and small plants, as if weathered by time. Floating on the water's surface are several white water lilies, their petals shimmering in the sunlight and emitting a subtle fragrance.

The black sunglasses showcase a trendy oversized design, crafted from lightweight and durable plastic. Their chic cat-eye shape flatters various face types, while the deep, reflective black lenses provide UV protection and a touch of allure. The arms are slightly tapered and feature subtle geometric patterns, enhancing their modern appeal.

The yellow sports car features a streamlined design, with a low profile and sharp lines that convey a sporty feel. The front is equipped with bold headlights and a distinctive grille, while the rear includes a stylish spoiler that enhances its performance appearance.

Remove objects and generate areas that blend with the background.

「KIVE」Keyframe-guided Interactive Video Editing

KIVE enables users to iteratively edit keyframes and propagate the edits to other frames, thereby reducing latency in achieving desired outcomes.
Furthermore, the KIVE mechanism supports local editing of long videos by using the last frame of one edited clip as the keyframe for the next.
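The clip-chaining idea for long videos can be sketched as follows. This is a minimal illustration, not the VIVID implementation: `split_into_clips`, `edit_long_video`, and the `propagate` callback are hypothetical names, and the real model would also take a mask sequence and a local caption for each clip.

```python
from typing import Callable, List, Sequence, TypeVar

Frame = TypeVar("Frame")


def split_into_clips(frames: Sequence[Frame], clip_len: int) -> List[List[Frame]]:
    """Split a long frame sequence into consecutive clips of at most clip_len frames."""
    if clip_len <= 0:
        raise ValueError("clip_len must be positive")
    return [list(frames[i:i + clip_len]) for i in range(0, len(frames), clip_len)]


def edit_long_video(
    frames: Sequence[Frame],
    clip_len: int,
    first_keyframe: Frame,
    propagate: Callable[[List[Frame], Frame], List[Frame]],
) -> List[Frame]:
    """Edit a long video clip by clip.

    `propagate` is a stand-in (assumption) for the edit-propagation step:
    it takes one clip plus an edited keyframe and returns the edited clip.
    The last edited frame of each clip becomes the keyframe for the next one.
    """
    edited: List[Frame] = []
    keyframe = first_keyframe
    for clip in split_into_clips(frames, clip_len):
        edited_clip = propagate(clip, keyframe)
        edited.extend(edited_clip)
        keyframe = edited_clip[-1]  # hand the last edited frame to the next clip
    return edited


def toy_propagate(clip, keyframe):
    """Toy stand-in: copy the keyframe's style tag onto every frame of the clip."""
    _, style = keyframe
    return [(frame_id, style) for frame_id in clip]


if __name__ == "__main__":
    result = edit_long_video(list(range(5)), 2, (-1, "skeleton"), toy_propagate)
    print(result)
```

In the toy run, frames are plain integers and an "edit" is just a style tag, so the output only shows that the edit introduced in the first keyframe carries through every later clip.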

For example, suppose we want to turn a video into a Halloween promotional video ...

Naive Video Editing Pipeline - 34.2 PFLOPs

1. Draw Masks

2. Video Editing

17.1 PFLOPs

A skeleton is staring ahead.

3. Video Editing

17.1 PFLOPs

A skeleton is staring ahead, with flames flickering in his eyes.

KIVE Pipeline - 20.1 PFLOPs

1. Draw Masks

2. Keyframe Editing

1.5 PFLOPs

A skeleton is staring ahead.

3. Keyframe Editing

1.5 PFLOPs

A skeleton is staring ahead, with flames flickering in his eyes.

4. Edits Propagation

17.1 PFLOPs

A skeleton is staring ahead, with flames flickering in his eyes.
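The totals quoted for the two pipelines follow from the per-step costs: two full video edits for the naive pipeline, versus two keyframe edits plus one propagation pass for KIVE. The short sketch below reproduces that arithmetic. The per-step figures come from the example above; extending the formula to an arbitrary number of editing attempts is an assumption made here for illustration, as are the function names.

```python
# Per-step costs taken from the example above, in PFLOPs.
FULL_VIDEO_EDIT_PFLOPS = 17.1  # one full video edit, or one edit-propagation pass
KEYFRAME_EDIT_PFLOPS = 1.5     # one keyframe edit


def naive_cost(num_attempts: int) -> float:
    """Naive pipeline: every editing attempt re-edits the whole video."""
    return num_attempts * FULL_VIDEO_EDIT_PFLOPS


def kive_cost(num_attempts: int) -> float:
    """KIVE pipeline: iterate on the keyframe, then propagate once.

    Assumes (for illustration) one propagation pass after the final attempt.
    """
    return num_attempts * KEYFRAME_EDIT_PFLOPS + FULL_VIDEO_EDIT_PFLOPS


if __name__ == "__main__":
    # Two attempts, as in the skeleton example.
    print(round(naive_cost(2), 1))  # 34.2
    print(round(kive_cost(2), 1))   # 20.1
```

Under this assumption each extra attempt adds 1.5 PFLOPs to the KIVE total instead of 17.1 PFLOPs, so the gap widens as the number of iterations grows.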

「VIVID-10M」Video Local Editing Dataset

VIVID-10M is the first large-scale hybrid image-video local editing dataset aimed at reducing data construction and model training costs.
VIVID-10M comprises 9.7M samples that encompass a wide range of video editing tasks.

Abstract

Diffusion-based image editing models have made remarkable progress in recent years. However, achieving high-quality video editing remains a significant challenge. One major hurdle is the absence of open-source, large-scale video editing datasets based on real-world data, as constructing such datasets is both time-consuming and costly. Moreover, video data requires a significantly larger number of tokens for representation, which substantially increases the training costs for video editing models. Lastly, current video editing models offer limited interactivity, often making it difficult for users to express their editing requirements effectively in a single attempt. To address these challenges, this paper introduces a dataset, VIVID-10M, and a baseline model, VIVID. VIVID-10M is the first large-scale hybrid image-video local editing dataset aimed at reducing data construction and model training costs; it comprises 9.7M samples that encompass a wide range of video editing tasks. VIVID is a Versatile and Interactive VIdeo local eDiting model trained on VIVID-10M, which supports entity addition, modification, and deletion. At its core, a keyframe-guided interactive video editing mechanism is proposed, enabling users to iteratively edit keyframes and propagate the edits to other frames, thereby reducing latency in achieving desired outcomes. Extensive experimental evaluations show that our approach achieves state-of-the-art performance in video local editing, surpassing baseline methods in both automated metrics and user studies. The VIVID-10M dataset and the VIVID editing model will be open-sourced.

VIVID-10M: A Dataset and Baseline for Versatile and Interactive Video Local Editing
