LTX-2: Audio-Video Generation Model
What is the LTX-2 model?
LTX-2 is an open-weights audio-video foundation model developed by Lightricks. Unlike most open video models that generate silent clips, LTX-2 is designed to generate video and audio together in a single, synchronized process—so motion, speech, ambience, music, and foley cues align naturally.
Under the hood, LTX-2 is described as a DiT-based (Diffusion Transformer) joint audiovisual model with an asymmetric dual-stream transformer: a larger video stream and a smaller audio stream, connected with cross-modal attention. This design aims to keep video quality high while still producing coherent sound.
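The cross-modal link can be illustrated with a toy scaled dot-product attention step, in which a single audio-stream token queries the video stream's tokens. This is a from-scratch sketch of the general mechanism only, not LTX-2's actual implementation; all dimensions and values are made up.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(query, keys, values):
    """One audio-stream query attends over video-stream keys/values.

    query: list[float] of length d; keys/values: lists of such vectors.
    Returns the attention-weighted mix of the video values.
    """
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]

# Tiny example: one audio token attends over three video tokens (d = 2).
audio_q = [1.0, 0.0]
video_k = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
video_v = [[10.0, 0.0], [0.0, 10.0], [-10.0, 0.0]]
mixed = cross_attention(audio_q, video_k, video_v)
```

The query aligns most with the first video token, so that token's value dominates the mixed output — the same principle that lets the audio stream "read" visual context at each step.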
Key features (why people care)
1) Synchronized video + audio in one pass
LTX-2 is built to generate the soundtrack (speech, background noise, music, effects) together with the visuals. This is the headline capability that distinguishes it from many open video models.
2) Strong efficiency focus
LTX-2’s research and related coverage emphasize inference efficiency (speed per diffusion step) compared to some open alternatives. In practice, speed still depends heavily on your resolution, FPS, duration, and precision/quantization choice.
3) Production-friendly modes and workflows
In real workflows, creators often iterate quickly with “fast” settings and then re-run for higher quality. LTX-2 is commonly used through:
- ComfyUI node workflows (visual, repeatable)
- Hosted APIs (usage-based billing by generated seconds)
4) LoRA and IC-LoRA for control
LTX-2 supports LoRA adapters (lightweight add-ons) for styles, motion biases, and camera behaviors. IC-LoRA (in-context LoRA) can also condition generation on reference frames / control signals for tighter results.
How to run LTX-2 in ComfyUI
- Update ComfyUI to the latest version.
- Open the Template Library (Video) and choose an LTX-2 workflow (Text-to-Video, Image-to-Video, etc.).
- Let the nodes download required weights automatically (or install manually for offline setups).
- Set your prompt in the text encoder node and click “Queue Prompt”.
Tip: Start with shorter duration + lower resolution while iterating, then scale up once the shot is stable.
LTX-2 prompt guide (with examples)
LTX-2 prompts tend to work best when written like a short shot description:
- Establish the shot type (wide/close-up, handheld, cinematic, etc.)
- Set the scene (lighting, mood, textures)
- Describe the action in chronological order
- Define characters with concrete visual details
- Specify camera movement
- Describe audio (and put spoken dialogue in quotes)
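The checklist above can be wrapped in a small helper that assembles a shot-style prompt in that order. The function and parameter names here are illustrative, not any official API:

```python
def build_shot_prompt(shot, scene, action, camera, audio, dialogue=None):
    """Assemble a shot-style prompt: shot -> scene -> action -> camera -> audio.

    Spoken dialogue, if any, is placed in quotes, per the guidance above.
    """
    parts = [shot, scene, action]
    if dialogue:
        parts.append(f'Spoken line: "{dialogue}"')
    parts.extend([camera, audio])
    # Normalize each fragment to end with exactly one period.
    return " ".join(p.rstrip(".") + "." for p in parts)

prompt = build_shot_prompt(
    shot="Medium close-up, warm indoor lighting",
    scene="A cozy living room with a fireplace",
    action="A young woman holds a ceramic mug and smiles",
    camera="Camera gently pans right",
    audio="Crackling fire, subtle room tone",
    dialogue="We made it",
)
```

Templating prompts like this keeps iteration repeatable: you can vary one field (say, the camera move) while holding the rest of the shot constant.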
Prompt example (Text-to-Video)
Wide establishing shot of a rainy neon street at night. A delivery rider stops under a storefront awning, shakes water off their jacket, then looks up at a flickering sign. Slow dolly-in, slight handheld feel. Reflections ripple on wet asphalt, pink and cyan color palette. Ambient rain, distant traffic hiss, occasional horn. Soft lo-fi beat plays from inside the shop.
Prompt example (Dialogue + language)
Medium close-up, warm indoor lighting. A young woman in a knitted sweater holds a ceramic mug, smiles, then whispers: “We made it.” (English, soft voice). Camera gently pans right revealing a cozy living room with a fireplace. Crackling fire, subtle room tone.
Prompt example (Image-to-Video guidance)
Use the input image as the first frame. Keep the character’s face and outfit identical. The character turns head slightly left, then smiles. Subtle breathing motion. Slow push-in. Quiet room ambience, faint computer fan noise.
Practical warning: text/logos
Like many video generators, LTX-2 is not reliable for readable on-screen text or accurate logos. If you need titles, captions, or branding, plan to add them in post.
LTX-2 system requirements (and what “minimum” really means)
Official documentation lists a very high “recommended” setup (A100/H100-class GPU, 64GB+ RAM, large SSD, modern CUDA). Many users do run smaller GPUs, but typically by:
- lowering resolution / FPS / duration
- using distilled or quantized weights
- accepting slower renders and occasional instability
A realistic approach:
- Iterate at 540p–720p, 3–6 seconds, lower FPS.
- Scale up only when the prompt + motion are already working.
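A quick back-of-the-envelope calculation shows why iterating small pays off: the raw pixel volume of a clip scales linearly with each of resolution, FPS, and duration, so modest bumps multiply fast. The numbers below are illustrative only:

```python
def clip_pixel_volume(width, height, fps, seconds):
    """Total raw pixels a clip represents: frame count x pixels per frame."""
    frames = fps * seconds
    return frames, width * height * frames

# A small 540p draft pass vs. a 4K-class final pass (illustrative settings).
draft_frames, draft_px = clip_pixel_volume(960, 540, 12, 4)
final_frames, final_px = clip_pixel_volume(3840, 2160, 24, 8)
ratio = final_px / draft_px  # 4x width * 4x height * 2x fps * 2x duration = 64x
```

A 64x difference in raw pixel volume is why a prompt that takes ten cheap draft renders to dial in is far better tuned at low settings first.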
LTX-2 LoRA (and IC-LoRA) explained
What is LTX-2 LoRA?
LoRA (Low-Rank Adaptation) is a small set of additional weights that nudges the base model toward a style, look, or motion behavior—without full fine-tuning. LoRAs are usually easy to swap and share.
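The low-rank idea can be shown with the standard LoRA merge formula, W' = W + (alpha / r) * (B @ A), where B and A are thin matrices of rank r much smaller than W. A pure-Python sketch with made-up numbers (real LoRAs apply this per attention/MLP layer, not to one tiny matrix):

```python
def matmul(X, Y):
    # Naive matrix multiply for small lists-of-lists.
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def apply_lora(W, A, B, alpha, r):
    """Merge a LoRA delta into base weights: W' = W + (alpha / r) * (B @ A)."""
    delta = matmul(B, A)
    scale = alpha / r
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W, delta)]

# 3x3 base weights with a rank-1 adapter (B: 3x1, A: 1x3), alpha = 2.
W = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
B = [[1.0], [0.0], [0.0]]
A = [[0.0, 0.5, 0.0]]
W_merged = apply_lora(W, A, B, alpha=2.0, r=1)
```

Because only A and B are trained and shipped, a LoRA file stays tiny relative to the base checkpoint, which is what makes styles and motion biases easy to swap and share.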
What is IC-LoRA?
IC-LoRA (In-Context LoRA) enables conditioning video generation on reference frames / control signals at inference time. This is useful when you want:
- stronger identity consistency
- more controlled motion paths
- refinement on top of a base generation
Performance and evaluation notes
LTX-2’s paper reports strong audiovisual alignment and competitive quality among open systems, emphasizing efficiency and speed in benchmark settings. Real-world performance varies by workflow (ComfyUI graphs), model variant, and hardware.
Pricing (API)
If you use the official API, pricing is typically per second of generated video, with rates that rise with output resolution and tier (Fast vs. Pro). Audio-to-Video is also billed per second, based on the input audio's duration. Always confirm the latest pricing page before budgeting.
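Per-second billing makes budgeting a batch of shots simple multiplication. The rates below are placeholders invented for the example, not real LTX-2 prices; check the official pricing page for actual numbers:

```python
def estimate_cost(seconds, rate_per_second):
    """Cost of one generated clip under simple per-second billing."""
    return round(seconds * rate_per_second, 2)

# Hypothetical tier rates in USD per generated second (not real prices).
fast_rate, pro_rate = 0.04, 0.16
shot_durations = [5, 8, 6]  # three shots, in seconds
total_fast = sum(estimate_cost(s, fast_rate) for s in shot_durations)
total_pro = sum(estimate_cost(s, pro_rate) for s in shot_durations)
```

A common pattern is to run the whole batch on the cheap tier to lock in prompts and motion, then re-render only the keeper shots on the higher tier.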
Best use cases
- Fast ideation for short ads, social clips, and storyboards
- Audio-driven video concepts (voice/music guiding motion and pacing)
- ComfyUI pipelines for repeatable creative workflows
- Custom looks and camera behaviors via LoRA / IC-LoRA
FAQ
Is LTX-2 open source?
LTX-2 is released with open weights and public code, but it uses a specific “Open Weights License.” Read the license text carefully for redistribution and commercial usage.
Can I use LTX-2 with ComfyUI?
Yes. LTX-2 is supported through ComfyUI workflows and nodes, including templates to get started quickly.
What are the minimum system requirements?
The official docs list a high recommended configuration. Lower-end GPUs can work with reduced settings or lighter variants, but results vary.
Where can I find LTX-2 LoRAs?
LoRAs and IC-LoRAs are published in the LTX docs and on Hugging Face (including camera-control and detailer variants). Pick LoRAs that match your workflow and base checkpoint.
How do I get better results?
Use shot-like prompts (clear action sequence + camera + audio). Iterate at low settings first, then scale up. Avoid demanding readable text/logos inside the video.