Integrating a real, keyed greenscreen actor into an AI-generated Jefferson Memorial background whose camera move matches the live plate — what was tried, what won, what died, what runs next.
Traditional VFX (real plates, a NeRF or Gaussian-splat reconstruction, full CG, or an LED volume) is still the best route, and none of this beats it. This is R&D into a faster, lower-cost alternative, and the finding is that it has crossed into potentially usable territory.
What we discovered. The question: can an AI-generated background carry the same camera move as a real greenscreen plate, well enough to drop a keyed performer in? The look is there — fully generated shots can look great. Control is the hard part. You can get a generated background very close to a real plate's move, but exact only happens if the subject is regenerated in the same shot too; keep the real keyed performer and the camera lands close, not exact. Control comes from concrete inputs — a previz carrying the tracked camera data and a start frame that blocks out the subject — not from wordier prompts.
Recommendation: two methods, chosen by how exact the camera must be. Method 1 · Previz-Steered — track the plate, generate the BG from it, key and composite the real subject over it (closest to exact). Method 2 · Plate-Direct — drive generation from the plate itself with person-lock (faster, close enough to composite over). Seedance 2.0 is the recommended engine; Gemini Omni is a strong alternate capped at 720p; Kling, Veo, Runway and local Wan/LTX were ruled out — see the table below.
Every video generation model tested. Movement is the deciding column, because matching the original plate's camera move is the goal (only depth-conditioned local generation is truly exact; Seedance and Omni get very close); quality and output limits follow. Rows that list 4K under Max output are the 4K-capable options.
| Model | Movement | Quality | Max output | Verdict |
|---|---|---|---|---|
| Seedance 2.0 | very close to the plate move (not exact) via previz/plate refs · true parallax | best tested | 4K · 15s | RECOMMEND |
| Gemini Omni | very close to the plate move (not exact) · best physics | strong | 720p · 10s | ALTERNATE |
| Kling 3.0 | tightest keyframes | middling — not bad, below Seedance/Omni | 4K · 15s | REJECTED — quality |
| Kling Omni (video-ref) | re-anchors composition | — | 1080p | REJECTED |
| Veo 3.1 | keyframes ok | strong | 4K · 8s | REJECTED |
| Runway Gen-4.5 | drifts (start-only) | — | 4K · 10s | REJECTED |
| LTX-2 depth IC-LoRA | exact (depth input) | weak detail — ok behind heavy bokeh only | 4K · local (VRAM-bound) | GUIDE LAYER ONLY |
| Wan2.2 Fun Control | exact (depth) | unusable | 720p local | DEAD END |
Everything converges on two output methods: Method 1 (previz-steered — closest to frame-exact) generates BG plates that meet the real keyed subject in the W1 composite, and Method 2 (plate-direct) generates the subject in-scene from the plate itself. The remaining graphs are the supporting and historical workflows. Arrows show data direction.
One performer, three camera behaviors: a locked-off medium, a push-in dolly, and a crane-down jib. Together they cover the camera-matching problem space from trivial to hard.
CorridorKey on the 5090 (BiRefNet hint → neural key + despill → sharp hybrid composite) outputs the subject on flat gray-148 for clean matte extraction. Naive chroma keying failed outright — the warm dark-olive "green" had r≈g. A distance-ramp matte off the gray, plus interior hole-fill for near-gray dress folds, yields the final true-alpha ProRes 4444.
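The matte step described above can be sketched in a few lines. This is a minimal illustration, not the production keyer: alpha ramps up with RGB distance from the gray-148 backing, and any low-alpha region fully enclosed by the subject (such as a near-gray dress fold) is filled solid. The thresholds `lo`, `hi` and `threshold`, and all function names, are assumptions made for the example.

```python
import math
from collections import deque

GRAY = (148, 148, 148)  # flat backing colour the keyer composites onto


def ramp_alpha(pixel, lo=6.0, hi=40.0):
    """Map RGB distance from the gray backing to alpha in [0, 1].

    lo and hi are illustrative thresholds, not the production values.
    """
    dist = math.dist(pixel, GRAY)
    return min(1.0, max(0.0, (dist - lo) / (hi - lo)))


def fill_interior_holes(alpha, threshold=0.5):
    """Set to 1.0 every low-alpha pixel that is not connected to the frame border."""
    h, w = len(alpha), len(alpha[0])
    outside = [[False] * w for _ in range(h)]
    queue = deque()
    # Seed the flood fill with low-alpha pixels on the frame border.
    for y in range(h):
        for x in range(w):
            on_border = y in (0, h - 1) or x in (0, w - 1)
            if on_border and alpha[y][x] < threshold:
                outside[y][x] = True
                queue.append((y, x))
    # Grow the "outside" region through 4-connected low-alpha neighbours.
    while queue:
        y, x = queue.popleft()
        for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
            if 0 <= ny < h and 0 <= nx < w and not outside[ny][nx]:
                if alpha[ny][nx] < threshold:
                    outside[ny][nx] = True
                    queue.append((ny, nx))
    # Low-alpha pixels never reached from the border are interior holes.
    return [
        [1.0 if alpha[y][x] < threshold and not outside[y][x] else alpha[y][x]
         for x in range(w)]
        for y in range(h)
    ]


def matte(frame):
    """frame: rows of (r, g, b) tuples in 0-255. Returns rows of alpha values."""
    return fill_interior_holes([[ramp_alpha(p) for p in row] for row in frame])
```

Keying against a known flat gray sidesteps the r≈g problem of the original cloth, because the ramp depends only on distance from the backing colour rather than on green dominance.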
03-keyed/jeff_jib_key_ALPHA.mov (ProRes 4444, 4K, 246 MB, local)
Keyer validation suite: original greenscreen source and keyed result side-by-side in each clip (keys composited on a gradient specifically to expose edge artifacts).
The law from four early iterations still holds — Seedance interprets a camera, it never hard-locks one — but the ceiling moved. Feeding the solved-camera previz as a video_references motion track (rather than describing the move or clamping endpoints alone) is what closed the gap; that recipe graduated into W5. Current evidence, all against the MegaSaM-solved cameras:
The first of the two methods: the real camera is tracked and re-authored, so the generated plate comes closest to a frame-exact match with the solve.
The camera is tracked, not described. MegaSaM (DROID-SLAM + monocular depth, on the 5090) solves every frame of each greenscreen plate (position, rotation, and focal) where Blender's tracker failed outright on the flat green cloth. MegaSaM's AI solve was only necessary because these test plates were rough greenscreen with no tracking markers; a properly tracked shoot would solve with a standard tracker.
The solves import into one Blender scene as keyframed cameras under registration empties (focal-derived endpoints → look/travel alignment → de-roll), get staged against the art-directed rotunda blockout (with a Gaussian-smoothed twin baked for shaky solves), and render as clay previz: stills for keyframe generation, full-length move videos as the motion reference.
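The Gaussian-smoothed twin mentioned above can be pictured as a per-channel low-pass filter over the solved keyframes. The sketch below is an assumption about the general technique, not the actual Blender bake: it smooths one animated channel (for example camera X per frame) with a normalized Gaussian kernel and repeats the end values so the track keeps its length. The function names and the default `sigma` are hypothetical.

```python
import math


def gaussian_kernel(sigma, radius=None):
    """Return normalized Gaussian weights covering roughly three sigma each side."""
    radius = radius if radius is not None else max(1, int(3 * sigma))
    weights = [math.exp(-(i * i) / (2 * sigma * sigma))
               for i in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def smooth_track(samples, sigma=2.0):
    """Smooth one animated channel, one sample per frame, clamping at the ends."""
    kernel = gaussian_kernel(sigma)
    radius = len(kernel) // 2
    last = len(samples) - 1
    out = []
    for i in range(len(samples)):
        acc = 0.0
        for k, w in enumerate(kernel):
            j = min(max(i + k - radius, 0), last)  # repeat edge values
            acc += w * samples[j]
        out.append(acc)
    return out


# Example: a slow push-in with frame-to-frame jitter on top.
shaky_x = [i * 0.1 + (0.05 if i % 2 else -0.05) for i in range(40)]
smooth_x = smooth_track(shaky_x, sigma=2.0)
```

Position and focal channels can be filtered this way directly. Rotation stored as Euler angles needs care around wraparound, so a real bake would more likely smooth quaternions or unwrap the angles first.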
1 · Tracking → Blender. Solved cameras ghosted over their plates — the registration check that earned the "spot on" sign-off before anything was generated:
[Figure caption, truncated: … DollyCamSmooth) is baked alongside]
2 · Previz move renders. Staged scene, subject marks hidden, one clay move video per shot. These are fed to the video model verbatim as video_references:
[Figure caption, truncated: … LockedCamSmooth)]
3 · Previz stills → generated keyframes. Clay previz frame (left) beside the photoreal keyframe generated from it (right). GPT-Image-2 marries the previz composition with a REAL photograph's materials; it was the only model tested that swaps the clay placeholder for the real Jefferson statue in one pass. Nano Banana derives every other view from that master frame, keeping all shots in one coherent generated rotunda:




4 · Elements per shot. Exactly what went into each Seedance 2.0 run (Higgsfield, 720p/5s; job IDs from prior generations work directly as media references):
| Shot | start_image | end_image | video_references | Result |
|---|---|---|---|---|
| Dolly | GPT-Image-2 start | Nano Banana end | Dolly previz move (smoothed) | plate adopted (operator's own run of the recipe) |
| Jib | Nano Banana angle-match v2d | — none — | Jib previz move (raw) | previz alone held the path; end frame unnecessary |
| Locked | GPT-Image-2 start | — none — | Locked previz move (smoothed) | carries the plate's tripod micro-motion |
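The table above can also be read as a small data structure: each shot supplies a start image, an optional end image, and one previz move as the motion reference. The sketch below only organizes those inputs; it is not the Higgsfield or Seedance client API. The field names `start_image`, `end_image` and `video_references` come from the table, while `prompt`, `resolution`, `duration_seconds` and the function name are hypothetical, and the values are descriptive placeholders rather than real paths or job IDs.

```python
# Per-shot inputs transcribed from the table; values are descriptive
# placeholders, not real file paths or job IDs.
SHOT_ELEMENTS = {
    "dolly": {
        "start_image": "GPT-Image-2 start",
        "end_image": "Nano Banana end",
        "video_reference": "Dolly previz move (smoothed)",
    },
    "jib": {
        "start_image": "Nano Banana angle-match v2d",
        "end_image": None,  # previz alone held the path
        "video_reference": "Jib previz move (raw)",
    },
    "locked": {
        "start_image": "GPT-Image-2 start",
        "end_image": None,
        "video_reference": "Locked previz move (smoothed)",
    },
}


def build_inputs(shot, prompt):
    """Assemble one generation's inputs as a plain dict (hypothetical shape)."""
    elements = SHOT_ELEMENTS[shot]
    inputs = {
        "prompt": prompt,
        "start_image": elements["start_image"],
        "video_references": [elements["video_reference"]],
        "resolution": "720p",      # runs were 720p / 5s
        "duration_seconds": 5,
    }
    # Only clamp the endpoint when the shot actually used an end frame.
    if elements["end_image"] is not None:
        inputs["end_image"] = elements["end_image"]
    return inputs
```

Leaving `end_image` out entirely, rather than sending an empty value, mirrors the jib and locked rows, where the previz move alone carried the camera.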
5 · The prompts. Verbatim production prompts, image rail and video rail — the bolded clauses are load-bearing (removing any one reproduced a documented failure).
IMAGE — keyframes
Generate a photorealistic image of the Thomas Jefferson Memorial interior. The FIRST reference image is a gray clay 3D previz frame — it defines the EXACT composition to reproduce: a LOW-ANGLE camera looking slightly up, the bronze statue on its pedestal at frame-right against the coffered dome, columns placed exactly as shown, floor plane low in frame. Match this composition precisely — do not recenter or reframe. The SECOND reference image is a real photograph of the actual Jefferson Memorial interior — use it as the source of truth for everything visual: the real white Georgia marble and its veining, the real engraved inscription text panels with laurel wreaths on the walls, the real dark weathered bronze Jefferson statue (replace the clay placeholder figure with the real statue's actual sculpted form), the real coffered dome detail, the polished floor reflections, and its soft natural daylight. Empty interior, no people.
IMAGE 1 is a real photograph of the Thomas Jefferson Memorial interior — the MASTER SCENE: this exact bronze statue, white marble, inscription panels, soft even neutral daylight. IMAGE 2 is a gray clay 3D previz frame that defines the CAMERA for a second photo of the exact same scene — and its framing is COMPLETELY DIFFERENT from image 1: the camera is down near the FLOOR, tilted strongly UPWARD. Reproduce IMAGE 2's framing exactly: the coffered dome ceiling fills the entire TOP HALF of the frame, the statue on its pedestal stands at the RIGHT THIRD of frame seen from below against the dome, columns lean inward with strong upward perspective convergence, and only a small strip of floor shows at the very bottom. Do NOT reuse image 1's eye-level framing. Every material and lighting property still comes from IMAGE 1 unchanged: same neutral white-balanced daylight (not warm, not golden, not moody), same marble, same dark weathered bronze, same grade — two photos minutes apart in the same session. Empty interior, no people. Photorealistic.
VIDEO — generation
Slow cinematic crane shot inside a neoclassical marble rotunda. The image reference is the exact opening frame: a low camera position looking up, coffered dome filling the top of frame, memorial sculpture at the right third. The video reference is a gray 3D architectural previsualization showing the exact camera path to follow: a smooth motorized crane move — the camera starts low and rises steadily while tilting down, moving closer, settling on a tighter framing near the sculpture's stone base. Constant speed, no handheld sway, no walking rhythm, no speed ramps. Follow the previsualization's framing trajectory exactly. Keep the marble architecture, engraved wall text and soft neutral daylight from the opening frame consistent for the whole shot. Empty interior, documentary style.
A static tripod shot inside a neoclassical marble rotunda. The image reference is the exact opening frame — hold this exact composition for the whole shot. The video reference is a gray 3D architectural previsualization of the EXACT camera behavior to reproduce: a locked-off tripod camera with only the faintest natural micro-movement — follow it exactly. NO push, NO drift, NO pan, NO tilt, NO zoom, NO handheld sway beyond what the previsualization shows. The interior is empty and still; soft neutral daylight through the colonnade. Keep the marble architecture, engraved wall text, sculpture and lighting from the opening frame perfectly consistent from first frame to last. Documentary style, photorealistic.
6 · Generated plates + previz verification. Every generation ships with a 50% previz-ghost overlay — the acceptance test that the model rode the solved camera:
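A 50% ghost overlay is a plain per-pixel average of the generated frame and the matching previz frame. The sketch below shows that blend on frames held as rows of RGB tuples; the function name and the frame representation are assumptions for illustration, and a real check would run on decoded video frames in a compositing tool.

```python
def ghost_overlay(generated, previz, opacity=0.5):
    """Blend two same-sized frames (rows of (r, g, b) tuples) for a visual camera check.

    opacity is the weight of the previz layer; 0.5 gives the 50% ghost.
    """
    same_rows = len(generated) == len(previz)
    if not same_rows or any(len(g) != len(p) for g, p in zip(generated, previz)):
        raise ValueError("frames must have identical dimensions")
    return [
        [
            tuple(round((1 - opacity) * g + opacity * p) for g, p in zip(gen_px, pre_px))
            for gen_px, pre_px in zip(gen_row, pre_row)
        ]
        for gen_row, pre_row in zip(generated, previz)
    ]
```

If the model rode the solved camera, architectural edges in the two layers line up and the overlay reads as one image; a drifted camera shows up as doubled edges.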
7 · Composite. Method 1's endpoint — the real keyed subject over each previz-steered plate. All three assembled; the one visible seam is grade, not key (she carries the warm greenscreen light against the plate's cooler daylight — a curves pass closes it).
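The composite itself is the standard "over" operation: the keyed subject weighted by its alpha, plus the plate weighted by the remainder. The sketch below assumes straight (unpremultiplied) alpha in the range 0 to 1; whether the ProRes 4444 key is stored straight or premultiplied is not stated here, so treat that as an assumption. The grade fix (the curves pass) is deliberately left out.

```python
def over(fg_pixel, alpha, bg_pixel):
    """Composite one straight-alpha foreground pixel over a background pixel."""
    return tuple(round(f * alpha + b * (1.0 - alpha)) for f, b in zip(fg_pixel, bg_pixel))


def composite(fg, alpha, bg):
    """fg, bg: rows of (r, g, b) tuples; alpha: rows of floats in [0, 1]."""
    return [
        [over(f, a, b) for f, a, b in zip(fg_row, a_row, bg_row)]
        for fg_row, a_row, bg_row in zip(fg, alpha, bg)
    ]
```

With premultiplied footage the foreground term would not be multiplied by alpha a second time; applying it twice produces dark fringes along soft edges.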
The second of the two methods: no solve, no Blender; the original plate itself is the reference. A Nano Banana start frame places her in the master scene at the plate's framing, then the plate drives motion and performance, person-locked. The camera match is close but interpretive, and close enough that compositing over it works. (This began as a dead end: the early "replace the background" test repainted a stranger. That verdict was prompt-shaped and reference-shaped, not architectural.)
The reversed recipe, run on all three shots — each with its own Nano Banana start frame (plate frame 1 + GPT master) and its own original plate as motion/performance reference: