


The Ghost Guitar Problem
An editor removes an actor from a film shot. The scene looks clean, but there is a problem: the guitar the actor was holding remains frozen in mid-air, a phantom instrument defying the laws of nature. In VFX, this is called a “ghost object,” and it has been the bane of professional cleanup for decades. Correcting that single artifact (animating a believable fall, adjusting the shadow, matching the bounce) used to cost a team weeks of meticulous, frame-by-frame labor.
NETFLIX VOID (Video Object and Interaction Deletion) doesn’t just paint over the hole. It rewrites the scene’s physical history.
Think of traditional video inpainting tools as “blind wallpaper hangers”: they can cover a mark on the wall, but they have no idea whether the wall is load-bearing. VOID is the structural engineer who understands that once you remove a support beam, the floor above must sag. By the end of this piece, you will understand exactly how VOID teaches AI to reason about gravity, why that matters for your production budget, and how Sterlites measures this new standard of physics-aware video editing.
The Bottom Line
In human preference trials, VOID earned a 64.8% preference rating, dwarfing competitors like Runway (18.4%). This is not a marginal improvement: it is a category shift.
The Causality Shift: From Pixel Painting to Stage Directing
VOID is a physics-aware video deletion framework that uses interaction-aware conditioning to reconstruct physically consistent environments after an object is removed. The crucial word is “interaction.” Every existing inpainting tool can fill in the pixels behind a deleted object. None of them ask the follow-up question: “What should happen to everything that was touching it?”
Imagine removing a bowling ball from a slow-motion video of a strike. A traditional tool replaces the ball with background texture. The pins? They keep scattering as if an invisible force hit them. VOID ensures those pins stay standing, because the cause of their motion has been erased.
Developed through a collaboration between Netflix and INSAIT (the Institute for Computer Science, Artificial Intelligence and Technology in Sofia), and built upon Zhipu AI’s 5-billion-parameter CogVideoX foundation model, VOID is optimized for NVIDIA’s high-end silicon. It solves the “so what?” of video world models: standard models fail precisely when a removed object was colliding with, supporting, or casting light on something else.
The physics is the product.
The Quadmask Blueprint: A Demolition Crew’s Color-Coded Map
To execute a physically accurate deletion, the AI needs a map of the environment’s dependencies. Imagine a ball resting on a pillow: if you remove the ball, the pillow shouldn’t just stay indented. It needs to recover its shape. VOID’s answer is the Quadmask, a four-value segmentation map that instructs the model on exactly what to keep, what to remove, and what must change as a consequence.
Think of the Quadmask like a multi-colored blueprint for a demolition crew. It doesn’t just show which wall to pull down; it identifies which floorboards will creak, which shadows will vanish, and which adjacent structures need reinforcement.
Why Four Values?
Traditional inpainting uses a binary mask (keep or remove). The Quadmask’s four-value system is what enables VOID to distinguish between objects that should be erased and objects that should be physically transformed, a distinction no prior framework could make.
What This Looks Like in Practice
A Sterlites engineer using the VOID pipeline would employ a Vision-Language Model (VLM) like Gemini to analyze a scene. If an actor is jumping into a pool, the VLM identifies the actor as “Remove” (0) and the resulting splash as “Affected” (127). VOID then ensures that when the actor is gone, the water surface remains calm and undisturbed, erasing the splash that no longer has a cause.
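The label-to-pixel mapping can be sketched with a toy single-frame builder. The values 0 (Remove) and 127 (Affected) come from the description above; 255 for Keep and 64 for the fourth class are illustrative assumptions, as is the use of simple boxes in place of VLM-guided segmentation.

```python
import numpy as np

# Quadmask label values. REMOVE (0) and AFFECTED (127) follow the text;
# KEEP (255) and the fourth class OCCLUDED (64) are hypothetical.
KEEP, REMOVE, AFFECTED, OCCLUDED = 255, 0, 127, 64

def build_quadmask(height, width, remove_box, affected_box):
    """Build a single-frame Quadmask from axis-aligned boxes.

    remove_box / affected_box are (top, left, bottom, right) in pixels.
    A real pipeline would use per-pixel segmentation, not boxes.
    """
    mask = np.full((height, width), KEEP, dtype=np.uint8)
    t, l, b, r = affected_box
    mask[t:b, l:r] = AFFECTED          # e.g. the splash that must be erased
    t, l, b, r = remove_box
    mask[t:b, l:r] = REMOVE            # e.g. the actor being deleted
    return mask

# Actor occupies one region of the frame, their splash another.
qm = build_quadmask(480, 640, remove_box=(100, 200, 300, 320),
                    affected_box=(300, 150, 420, 400))
print(np.unique(qm))   # the label values present in this mask
```

In a full pipeline, one such mask would be produced per frame, with the VLM deciding which regions carry which label.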
But how does the model know that splashes follow jumps, or that guitars fall when released? That required building a school where reality itself was the curriculum.
Training on Truth: Synthetic Data and Blender
How do you teach an AI the “truth” of physics when you can’t film a ghost? You build a digital laboratory to simulate one.
Training a model for causality requires “counterfactual” data: paired videos of the same scene with and without an object, where the physics is correct in both. Netflix solved this by creating two primary synthetic data sources:
- HUMOTO: Using motion-capture data rendered in Blender, the team created scenes of humans interacting with objects, then re-ran the simulation with the human removed to observe how gravity, momentum, and contact forces reacted to the absence.
- Kubric: A Google Research framework used to simulate object-to-object collisions, bouncing trajectories, and stacking dependencies with pixel-perfect ground truth.
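The counterfactual-pair idea can be shown at whiteboard scale: simulate the same scene twice, toggling a single causal factor. This toy free-fall sketch is not the HUMOTO/Kubric rendering pipeline, just the principle behind it.

```python
# Toy counterfactual pair: the same ball, with and without a supporting
# shelf. One causal factor is toggled; everything else is identical.

G = 9.81          # gravitational acceleration, m/s^2
DT = 1.0 / 30.0   # one frame at 30 fps

def simulate_ball(frames, shelf_present, start_height=2.0):
    """Return the ball's height per frame. With the shelf it rests;
    without it, it free-falls (clamped at the floor, y = 0)."""
    y, v = start_height, 0.0
    trajectory = []
    for _ in range(frames):
        if not shelf_present:
            v += G * DT            # simple Euler integration
            y = max(0.0, y - v * DT)
        trajectory.append(y)
    return trajectory

with_shelf = simulate_ball(60, shelf_present=True)
without_shelf = simulate_ball(60, shelf_present=False)

# The pair (with_shelf, without_shelf) is one counterfactual training
# sample: identical scene, one cause removed, ground truth in both.
print(with_shelf[-1], without_shelf[-1])
```

A rendered version of the same idea, frames instead of height values, is exactly what the paired Blender and Kubric scenes provide.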
The team trained the model on 8x NVIDIA A100 80GB GPUs using DeepSpeed ZeRO Stage 2 optimization. By rendering a scene once with an object and then re-simulating the physics without it, the model was provided with a provable ground truth for how the world should behave.
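For readers unfamiliar with ZeRO Stage 2, a minimal DeepSpeed configuration for such a run might look like the sketch below. Every hyperparameter here is an assumption; only the ZeRO stage and the 8-GPU A100 setup come from the text.

```python
# Sketch of a DeepSpeed config in the spirit of the reported setup
# (ZeRO Stage 2 on 8x A100 80GB). All numeric values are assumptions.
ds_config = {
    "train_batch_size": 8,              # assumption: 1 per GPU x 8 GPUs
    "gradient_accumulation_steps": 1,   # assumption
    "bf16": {"enabled": True},          # bf16 is common on A100s; assumption
    "zero_optimization": {
        "stage": 2,                     # shard optimizer state + gradients
        "overlap_comm": True,           # overlap reduction with backward pass
        "contiguous_gradients": True,   # reduce memory fragmentation
    },
}
# In a real run this dict is passed to deepspeed.initialize(...).
```

Stage 2 shards optimizer states and gradients across the 8 GPUs, which is what makes a 5-billion-parameter backbone trainable on this node count.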
You can’t film the absence of something. Netflix’s insight was to simulate it. That shift, from observation to counterfactual simulation, is the bridge between AI that generates plausible textures and AI that reasons about physical causality.
This training methodology explains why VOID outperforms competitors by such a wide margin. Runway, ProPainter, and similar tools were trained on real-world video pairs where the “removed” version was simply the scene without the object appearing. They never learned what should change downstream, because their training data never showed it.
The Two-Pass Engine: How VOID Prevents Morphing
A common “hallucination” in AI video is morphing, where generated objects change shape like melting wax across frames. Picture a vase that starts as a cylinder but slowly warps into an oval over 50 frames. VOID addresses this through a sophisticated Two-Pass Pipeline that acts as a built-in “second opinion.”
Pass 1 utilizes the 5-billion-parameter CogVideoX base to sketch the primary action: filling the deleted region and initiating the physical response (for example, starting the guitar’s fall).
If structural instability or warping is detected, Pass 2 kicks in. Using Optical Flow-warped noise (a technique that anchors the random initialization of the diffusion process to the motion vectors from Pass 1), the system locks down the shapes. Objects remain consistent and stabilized along their new physical trajectories.
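The core idea of flow-warped noise, re-seeding each frame’s diffusion noise from the previous frame’s along the motion field, can be sketched with nearest-neighbor backward warping. The actual warping scheme VOID uses is not specified, so treat this as an illustration of the principle.

```python
import numpy as np

def warp_noise_with_flow(noise, flow):
    """Warp a noise field along a dense optical-flow field.

    noise: (H, W) initial Gaussian noise for frame t.
    flow:  (H, W, 2) per-pixel motion (dy, dx) from frame t to t+1.
    Returns noise for frame t+1 whose structure follows the motion, so
    the diffusion process is re-seeded consistently across frames.
    Nearest-neighbor backward warping; real systems use finer schemes.
    """
    h, w = noise.shape
    ys, xs = np.mgrid[0:h, 0:w]
    src_y = np.clip(np.round(ys - flow[..., 0]).astype(int), 0, h - 1)
    src_x = np.clip(np.round(xs - flow[..., 1]).astype(int), 0, w - 1)
    return noise[src_y, src_x]

rng = np.random.default_rng(0)
noise_t = rng.standard_normal((64, 64))
flow = np.zeros((64, 64, 2))
flow[..., 1] = 3.0                      # whole frame shifts 3 px right
noise_t1 = warp_noise_with_flow(noise_t, flow)

# A pixel's noise travels with the motion: position (10, 13) at t+1
# holds the value that sat at (10, 10) at t.
assert np.isclose(noise_t1[10, 13], noise_t[10, 10])
```

Because the noise moves with the scene rather than being redrawn per frame, shapes have no statistical incentive to drift, which is what suppresses the melting-wax effect.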
This entire process is optimized using FP8 quantization (a technique that stores model weights in 8-bit floating point rather than 16-bit, halving their memory footprint), allowing massive computations to fit within memory-efficient profiles without sacrificing the physical fidelity of the output. In practice, this means a 40GB GPU can run what would otherwise require 80GB of VRAM.
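The arithmetic behind that claim is easy to verify for the weights alone; the remaining savings presumably come from activations and intermediate buffers, which also shrink at lower precision.

```python
# Back-of-envelope VRAM for the 5B-parameter backbone's weights alone
# (activations, KV caches, and latents add more on top of this).
params = 5e9
fp16_gb = params * 2 / 1024**3   # 2 bytes per weight in FP16
fp8_gb = params * 1 / 1024**3    # 1 byte per weight in FP8
print(f"FP16 weights: {fp16_gb:.1f} GB, FP8 weights: {fp8_gb:.1f} GB")
```

That works out to roughly 9.3 GB versus 4.7 GB for the weights, with the rest of the 80GB-to-40GB gap coming from everything else the forward pass allocates.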
What does that cost reduction look like for a real studio?
The Executive’s Calculus: Why This Is a Fiscal Necessity
The business impact of VOID extends far beyond technical novelty. In an era where industry leaders like Ben Affleck (via InterPositive) are pushing to reduce VFX budgets by up to 50%, VOID is not a luxury. It is a fiscal necessity.
Netflix’s reported $600 million acquisition of InterPositive underscores the stakes: they aren’t just buying tools. They are buying “Causal Continuity,” the ability to edit a scene’s physical history without reshooting it.
The Trade-Off
The hardware “ante” is real: running VOID effectively requires enterprise-grade power, specifically 40GB or more of VRAM. For studios, the calculus is clear: invest in the silicon today to eliminate the cost of the reshoot tomorrow.
By automating the “pixel-pushing” tasks (the tedious rotoscoping and clean-plate generation), this technology frees junior VFX artists to focus on higher-level creative and strategic vision.
Here's What Most Teams Get Wrong
Studios often evaluate AI tools solely on visual quality. VOID’s advantage isn’t prettier pixels: it is correct physics. A tool that generates a beautiful but physically impossible scene is a liability, not an asset. Always benchmark causal plausibility alongside visual fidelity.
The Sterlites “Causal Continuity Score” Framework
At Sterlites, we have defined a new metric for this era of production: the Sterlites Causal Continuity Score (CCS).
Think of CCS like a “safety rating” for AI edits. A car’s crash test score tells you how well the vehicle protects its occupants after impact. The CCS tells you how well an AI edit preserves the “downstream” physical effects of a deletion.
The CCS measures three dimensions:
- Gravity Compliance: Do unsupported objects fall at physically plausible rates?
- Shadow and Reflection Integrity: Are light-dependent artifacts correctly updated or removed?
- Interaction Chain Resolution: When Object A is removed, do Objects B and C (which were touching A) behave correctly?
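As a minimal illustration, the three dimensions can be combined into a single score. The equal weighting below is illustrative; the exact CCS aggregation is a design choice, not a published formula.

```python
from dataclasses import dataclass

@dataclass
class CCSReport:
    gravity_compliance: float            # in [0, 1]
    shadow_reflection_integrity: float   # in [0, 1]
    interaction_chain_resolution: float  # in [0, 1]

    def score(self, weights=(1 / 3, 1 / 3, 1 / 3)):
        """Weighted aggregate of the three CCS dimensions.
        Equal weights are an illustrative default."""
        subs = (self.gravity_compliance,
                self.shadow_reflection_integrity,
                self.interaction_chain_resolution)
        return sum(w * s for w, s in zip(weights, subs))

# A deletion that leaves an orphaned shadow scores poorly on the
# second dimension and drags the aggregate down with it.
report = CCSReport(0.9, 0.2, 0.8)
print(round(report.score(), 3))
```

Per-dimension reporting matters more than the aggregate: a single failing dimension (a floating object, an orphaned shadow) is usually what makes an edit unusable.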
Traditional inpainting tools score poorly here because they leave floating objects or orphaned shadows. VOID is the first tool to achieve high CCS ratings by ensuring that every removal has a physically plausible reaction.
Sterlites POV
The future of video isn’t “generative” (making things from scratch), but “causal” (editing things with physical integrity). While others focus on making AI “look” real, VOID makes it behave realistically. This shift toward physical logic, not visual mimicry, is the defining competitive edge for studios investing in AI-driven post-production pipelines.
Conclusion
Within the next 12 months, the concept of a “reshoot” will begin to feel as antiquated as the fax machine. When a stray coffee cup, a visible support wire, or an unwanted extra ruins a multi-million dollar take, VOID allows editors to remove the error and re-simulate the truth of the scene, with gravity, shadows, and reflections all updating automatically.
The trajectory is clear:
- Benchmark causal plausibility, not just visual fidelity, when evaluating any AI video tool.
- Adopt the Sterlites CCS framework to quantify how well your editing pipeline preserves physical integrity after deletions.
- Invest in the hardware foundation (40GB+ VRAM) now, because the models that justify the silicon are already here.
The question for studios is no longer “Can AI remove this object?” It is “Can AI understand what happens next?” VOID answers yes.


