


The Ghost Guitar Problem
An editor removes an actor from a film shot. The scene looks clean, but there is a problem: the guitar the actor was holding remains frozen in mid-air, a phantom instrument defying the laws of nature. In VFX, this is called a “ghost object,” and it has been the bane of professional cleanup for decades. Correcting that single artifact (animating a believable fall, adjusting the shadow, matching the bounce) used to cost a team weeks of meticulous, frame-by-frame labor.
NETFLIX VOID (Video Object and Interaction Deletion) doesn’t just paint over the hole. It rewrites the scene’s physical history.
Think of traditional video inpainting tools as “blind wallpaper hangers”: they can cover a mark on the wall, but they have no idea whether the wall is load-bearing. VOID is the structural engineer who understands that once you remove a support beam, the floor above must sag. By the end of this piece, you will understand exactly how VOID teaches AI to reason about gravity, why that matters for your production budget, and how Sterlites measures this new standard of physics-aware video editing.
The Bottom Line
In human preference trials, VOID earned a 64.8% preference rating, dwarfing competitors like Runway (18.4%). This is not a marginal improvement: it is a category shift.
The Causality Shift: From Pixel Painting to Stage Directing
VOID is a physics-aware video deletion framework that uses interaction-aware conditioning to reconstruct physically consistent environments after an object is removed. The crucial word is “interaction.” Every existing inpainting tool can fill in the pixels behind a deleted object. None of them ask the follow-up question: “What should happen to everything that was touching it?”
Imagine removing a bowling ball from a slow-motion video of a strike. A traditional tool replaces the ball with background texture. The pins? They keep scattering as if an invisible force hit them. VOID ensures those pins stay standing, because the cause of their motion has been erased.
Developed through a collaboration between Netflix and INSAIT (the Institute for Computer Science, Artificial Intelligence and Technology in Sofia), and built upon Zhipu AI’s 5-billion-parameter CogVideoX foundation model, VOID is optimized for NVIDIA’s high-end silicon. It solves the “so what?” of video world models: standard models fail precisely when a removed object was colliding with, supporting, or casting light on something else.
The physics is the product.
The Quadmask Blueprint: A Demolition Crew’s Color-Coded Map
To execute a physically accurate deletion, the AI needs a map of the environment’s dependencies. Imagine a ball resting on a pillow: if you remove the ball, the pillow shouldn’t just stay indented. It needs to recover its shape. VOID’s answer is the Quadmask, a four-value segmentation map that instructs the model on exactly what to keep, what to remove, and what must change as a consequence.
Think of the Quadmask like a multi-colored blueprint for a demolition crew. It doesn’t just show which wall to pull down; it identifies which floorboards will creak, which shadows will vanish, and which adjacent structures need reinforcement.
Why Four Values?
Traditional inpainting uses a binary mask (keep or remove). The Quadmask’s four-value system is what enables VOID to distinguish between objects that should be erased and objects that should be physically transformed, a distinction no prior framework could make.
What This Looks Like in Practice
A Sterlites engineer using the VOID pipeline would employ a Vision-Language Model (VLM) like Gemini to analyze a scene. If an actor is jumping into a pool, the VLM identifies the actor as “Remove” (0) and the resulting splash as “Affected” (127). VOID then ensures that when the actor is gone, the water surface remains calm and undisturbed, erasing the splash that no longer has a cause.
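The label-to-pixel mapping can be sketched with a toy single-frame builder. The values 0 (Remove) and 127 (Affected) come from the description above; 255 for Keep and 64 for the fourth class are illustrative assumptions, as is the use of simple boxes in place of VLM-guided segmentation.

```python
import numpy as np

# Quadmask label values. REMOVE (0) and AFFECTED (127) follow the text;
# KEEP (255) and the fourth class OCCLUDED (64) are hypothetical.
KEEP, REMOVE, AFFECTED, OCCLUDED = 255, 0, 127, 64

def build_quadmask(height, width, remove_box, affected_box):
    """Build a single-frame Quadmask from axis-aligned boxes.

    remove_box / affected_box are (top, left, bottom, right) in pixels.
    A real pipeline would use per-pixel segmentation, not boxes.
    """
    mask = np.full((height, width), KEEP, dtype=np.uint8)
    t, l, b, r = affected_box
    mask[t:b, l:r] = AFFECTED          # e.g. the splash that must be erased
    t, l, b, r = remove_box
    mask[t:b, l:r] = REMOVE            # e.g. the actor being deleted
    return mask

# Actor occupies one region of the frame, their splash another.
qm = build_quadmask(480, 640, remove_box=(100, 200, 300, 320),
                    affected_box=(300, 150, 420, 400))
print(np.unique(qm))   # the label values present in this mask
```

In a full pipeline, one such mask would be produced per frame, with the VLM deciding which regions carry which label.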
But how does the model know that splashes follow jumps, or that guitars fall when released? That required building a school where reality itself was the curriculum.
Training on Truth: Synthetic Data and Blender
How do you teach an AI the “truth” of physics when you can’t film a ghost? You build a digital laboratory to simulate one.
Training a model for causality requires “counterfactual” data: paired videos of the same scene with and without an object, where the physics is correct in both. Netflix solved this by creating two primary synthetic data sources:
- HUMOTO: Using motion-capture data rendered in Blender, the team created scenes of humans interacting with objects, then re-ran the simulation with the human removed to observe how gravity, momentum, and contact forces reacted to the absence.
- Kubric: A Google Research framework used to simulate object-to-object collisions, bouncing trajectories, and stacking dependencies with pixel-perfect ground truth.
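The counterfactual-pair idea can be shown at whiteboard scale: simulate the same scene twice, toggling a single causal factor. This toy free-fall sketch is not the HUMOTO/Kubric rendering pipeline, just the principle behind it.

```python
# Toy counterfactual pair: the same ball, with and without a supporting
# shelf. One causal factor is toggled; everything else is identical.

G = 9.81          # gravitational acceleration, m/s^2
DT = 1.0 / 30.0   # one frame at 30 fps

def simulate_ball(frames, shelf_present, start_height=2.0):
    """Return the ball's height per frame. With the shelf it rests;
    without it, it free-falls (clamped at the floor, y = 0)."""
    y, v = start_height, 0.0
    trajectory = []
    for _ in range(frames):
        if not shelf_present:
            v += G * DT            # simple Euler integration
            y = max(0.0, y - v * DT)
        trajectory.append(y)
    return trajectory

with_shelf = simulate_ball(60, shelf_present=True)
without_shelf = simulate_ball(60, shelf_present=False)

# The pair (with_shelf, without_shelf) is one counterfactual training
# sample: identical scene, one cause removed, ground truth in both.
print(with_shelf[-1], without_shelf[-1])
```

A rendered version of the same idea, frames instead of height values, is exactly what the paired Blender and Kubric scenes provide.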
The team trained the model on 8x NVIDIA A100 80GB GPUs using DeepSpeed ZeRO Stage 2 optimization. By rendering a scene once with an object and then re-simulating the physics without it, the model was provided with a provable ground truth for how the world should behave.
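For readers unfamiliar with ZeRO Stage 2, a minimal DeepSpeed configuration for such a run might look like the sketch below. Every hyperparameter here is an assumption; only the ZeRO stage and the 8-GPU A100 setup come from the text.

```python
# Sketch of a DeepSpeed config in the spirit of the reported setup
# (ZeRO Stage 2 on 8x A100 80GB). All numeric values are assumptions.
ds_config = {
    "train_batch_size": 8,              # assumption: 1 per GPU x 8 GPUs
    "gradient_accumulation_steps": 1,   # assumption
    "bf16": {"enabled": True},          # bf16 is common on A100s; assumption
    "zero_optimization": {
        "stage": 2,                     # shard optimizer state + gradients
        "overlap_comm": True,           # overlap reduction with backward pass
        "contiguous_gradients": True,   # reduce memory fragmentation
    },
}
# In a real run this dict is passed to deepspeed.initialize(...).
```

Stage 2 shards optimizer states and gradients across the 8 GPUs, which is what makes a 5-billion-parameter backbone trainable on this node count.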
You can’t film the absence of something. Netflix’s insight was to simulate it. That shift, from observation to counterfactual simulation, is the bridge between AI that generates plausible textures and AI that reasons about physical causality.
This training methodology explains why VOID outperforms competitors by such a wide margin. Runway, ProPainter, and similar tools were trained on real-world video pairs where the “removed” version was simply the scene without the object appearing. They never learned what should change downstream, because their training data never showed it.
The Two-Pass Engine: How VOID Prevents Morphing
A common “hallucination” in AI video is morphing, where generated objects change shape like melting wax across frames. Picture a vase that starts as a cylinder but slowly warps into an oval over 50 frames. VOID addresses this through a sophisticated Two-Pass Pipeline that acts as a built-in “second opinion.”
Pass 1 utilizes the 5-billion-parameter CogVideoX base to sketch the primary action: filling the deleted region and initiating the physical response (for example, starting the guitar’s fall).
If structural instability or warping is detected, Pass 2 kicks in. Using Optical Flow-warped noise (a technique that anchors the random initialization of the diffusion process to the motion vectors from Pass 1), the system locks down the shapes. Objects remain consistent and stabilized along their new physical trajectories.
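The core idea of flow-warped noise, re-seeding each frame’s diffusion noise from the previous frame’s along the motion field, can be sketched with nearest-neighbor backward warping. The actual warping scheme VOID uses is not specified, so treat this as an illustration of the principle.

```python
import numpy as np

def warp_noise_with_flow(noise, flow):
    """Warp a noise field along a dense optical-flow field.

    noise: (H, W) initial Gaussian noise for frame t.
    flow:  (H, W, 2) per-pixel motion (dy, dx) from frame t to t+1.
    Returns noise for frame t+1 whose structure follows the motion, so
    the diffusion process is re-seeded consistently across frames.
    Nearest-neighbor backward warping; real systems use finer schemes.
    """
    h, w = noise.shape
    ys, xs = np.mgrid[0:h, 0:w]
    src_y = np.clip(np.round(ys - flow[..., 0]).astype(int), 0, h - 1)
    src_x = np.clip(np.round(xs - flow[..., 1]).astype(int), 0, w - 1)
    return noise[src_y, src_x]

rng = np.random.default_rng(0)
noise_t = rng.standard_normal((64, 64))
flow = np.zeros((64, 64, 2))
flow[..., 1] = 3.0                      # whole frame shifts 3 px right
noise_t1 = warp_noise_with_flow(noise_t, flow)

# A pixel's noise travels with the motion: position (10, 13) at t+1
# holds the value that sat at (10, 10) at t.
assert np.isclose(noise_t1[10, 13], noise_t[10, 10])
```

Because the noise moves with the scene rather than being redrawn per frame, shapes have no statistical incentive to drift, which is what suppresses the melting-wax effect.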
This entire process is optimized using FP8 quantization (a technique that stores model weights in 8-bit floating point rather than 16-bit, halving their memory footprint), allowing massive computations to fit within memory-efficient profiles without sacrificing the physical fidelity of the output. In practice, this means a 40GB GPU can run what would otherwise require 80GB of VRAM.
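The arithmetic behind that claim is easy to verify for the weights alone; the remaining savings presumably come from activations and intermediate buffers, which also shrink at lower precision.

```python
# Back-of-envelope VRAM for the 5B-parameter backbone's weights alone
# (activations, KV caches, and latents add more on top of this).
params = 5e9
fp16_gb = params * 2 / 1024**3   # 2 bytes per weight in FP16
fp8_gb = params * 1 / 1024**3    # 1 byte per weight in FP8
print(f"FP16 weights: {fp16_gb:.1f} GB, FP8 weights: {fp8_gb:.1f} GB")
```

That works out to roughly 9.3 GB versus 4.7 GB for the weights, with the rest of the 80GB-to-40GB gap coming from everything else the forward pass allocates.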
What does that cost reduction look like for a real studio?
The Executive’s Calculus: Why This Is a Fiscal Necessity
The business impact of VOID extends far beyond technical novelty. In an era where industry leaders like Ben Affleck (via InterPositive) are pushing to reduce VFX budgets by up to 50%, VOID is not a luxury. It is a fiscal necessity.
Netflix’s reported $600 million acquisition of InterPositive underscores the stakes: they aren’t just buying tools. They are buying “Causal Continuity,” the ability to edit a scene’s physical history without reshooting it.
The Trade-Off
The hardware “ante” is real: running VOID effectively requires enterprise-grade power, specifically 40GB or more of VRAM. For studios, the calculus is clear: invest in the silicon today to eliminate the cost of the reshoot tomorrow.
By automating the “pixel-pushing” tasks (the tedious rotoscoping and clean-plate generation), this technology frees junior VFX artists to focus on higher-level creative and strategic vision.
Here's What Most Teams Get Wrong
Studios often evaluate AI tools solely on visual quality. VOID’s advantage isn’t prettier pixels: it is correct physics. A tool that generates a beautiful but physically impossible scene is a liability, not an asset. Always benchmark causal plausibility alongside visual fidelity.
The Sterlites “Causal Continuity Score” Framework
At Sterlites, we have defined a new metric for this era of production: the Sterlites Causal Continuity Score (CCS).
Think of CCS like a “safety rating” for AI edits. A car’s crash test score tells you how well the vehicle protects its occupants after impact. The CCS tells you how well an AI edit preserves the “downstream” physical effects of a deletion.
The CCS measures three dimensions:
- Gravity Compliance: Do unsupported objects fall at physically plausible rates?
- Shadow and Reflection Integrity: Are light-dependent artifacts correctly updated or removed?
- Interaction Chain Resolution: When Object A is removed, do Objects B and C (which were touching A) behave correctly?
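As a minimal illustration, the three dimensions can be combined into a single score. The equal weighting below is illustrative; the exact CCS aggregation is a design choice, not a published formula.

```python
from dataclasses import dataclass

@dataclass
class CCSReport:
    gravity_compliance: float            # in [0, 1]
    shadow_reflection_integrity: float   # in [0, 1]
    interaction_chain_resolution: float  # in [0, 1]

    def score(self, weights=(1 / 3, 1 / 3, 1 / 3)):
        """Weighted aggregate of the three CCS dimensions.
        Equal weights are an illustrative default."""
        subs = (self.gravity_compliance,
                self.shadow_reflection_integrity,
                self.interaction_chain_resolution)
        return sum(w * s for w, s in zip(weights, subs))

# A deletion that leaves an orphaned shadow scores poorly on the
# second dimension and drags the aggregate down with it.
report = CCSReport(0.9, 0.2, 0.8)
print(round(report.score(), 3))
```

Per-dimension reporting matters more than the aggregate: a single failing dimension (a floating object, an orphaned shadow) is usually what makes an edit unusable.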
Traditional inpainting tools score poorly here because they leave floating objects or orphaned shadows. VOID is the first tool to achieve high CCS ratings by ensuring that every removal has a physically plausible reaction.
Sterlites POV
The future of video isn’t “generative” (making things from scratch), but “causal” (editing things with physical integrity). While others focus on making AI “look” real, VOID makes it behave realistically. This shift toward physical logic, not visual mimicry, is the defining competitive edge for studios investing in AI-driven post-production pipelines.
Conclusion
Within the next 12 months, the concept of a “reshoot” will begin to feel as antiquated as the fax machine. When a stray coffee cup, a visible support wire, or an unwanted extra ruins a multi-million dollar take, VOID allows editors to remove the error and re-simulate the truth of the scene, with gravity, shadows, and reflections all updating automatically.
The trajectory is clear:
- Benchmark causal plausibility, not just visual fidelity, when evaluating any AI video tool.
- Adopt the Sterlites CCS framework to quantify how well your editing pipeline preserves physical integrity after deletions.
- Invest in the hardware foundation (40GB+ VRAM) now, because the models that justify the silicon are already here.
The question for studios is no longer “Can AI remove this object?” It is “Can AI understand what happens next?” VOID answers yes.


