Sunday, April 12, 2026

Counting on AI: NUMINA Makes Text-to-Video Models Actually Get the Numbers Right

Tired of your AI-generated videos showing two objects when you asked for three? This paper introduces NUMINA, a training-free framework that targets numerical accuracy in text-to-video diffusion models, with the aim of making creative and commercial AI applications more reliable.

Original paper: 2604.08546v1

Authors: Zhengyang Sun, Yu Chen, Xin Zhou, Xiaofan Li, Xiwu Chen, +2 more

Key Takeaways

1. Text-to-video diffusion models frequently fail to generate the correct number of objects specified in a prompt.
2. NUMINA is a training-free framework that identifies and guides the generation process to improve numerical alignment.
3. It leverages discriminative self- and cross-attention heads to derive a countable latent layout and modulates cross-attention for guidance.
4. NUMINA improves counting accuracy by up to 7.4% on various models, improves CLIP alignment, and maintains temporal consistency.
5. This structural guidance complements existing techniques and offers a practical path toward count-accurate text-to-video diffusion for developers.

Why Your AI Agent Needs to Count

As developers and AI builders, we're constantly pushing the boundaries of what AI can create. Text-to-video diffusion models have opened up incredible possibilities, from generating marketing content to simulating complex scenarios. But there's a persistent, often frustrating, flaw: these models frequently struggle with basic numerical accuracy. You ask for 'five red apples,' and you might get four, or six, or even just a blurry pile that vaguely suggests 'some' apples.

This isn't just a minor annoyance; it’s a critical limitation that impacts the utility and trustworthiness of AI-generated content. Imagine building a tool for e-commerce that generates product ads, only for it to misrepresent the quantity of items. Or a simulation engine for logistics where the number of vehicles or packages is crucial. This is where the new research on NUMINA steps in, offering a practical, elegant solution that promises to make our AI agents far more precise and reliable.

The Paper in 60 Seconds

The paper "When Numbers Speak: Aligning Textual Numerals and Visual Instances in Text-to-Video Diffusion Models" introduces NUMINA, a novel framework designed to improve the numerical alignment in text-to-video diffusion models. The core problem is that existing models often fail to generate the correct number of objects specified in a text prompt. NUMINA addresses this with a training-free identify-then-guide approach. It works by analyzing the model's internal attention mechanisms to "identify" the intended numerical layout and then "guides" the generation process to adhere to that count. The results are significant: NUMINA boosts counting accuracy by up to 7.4% on various models, improves CLIP alignment, and maintains temporal consistency, all without requiring additional training.

The Problem: When AI Can't Count Past Two (or Three)

Text-to-video diffusion models are incredibly powerful at understanding concepts, styles, and actions. They can generate a 'dog running on a beach at sunset' with impressive fidelity. However, when you add a specific number, like 'three dogs running on a beach at sunset,' the model often falters. This isn't because the model doesn't understand 'three'; it's a fundamental challenge in how these models synthesize visual information from textual prompts.

Diffusion models rely heavily on attention mechanisms to map elements from the text prompt to specific regions and features in the generated image or video. While attention is excellent at understanding *what* should be where, it's less adept at consistently enforcing *how many* of a particular object should appear. The model might distribute the 'dog' concept across multiple latent regions, but without explicit guidance, it struggles to consolidate these into an exact, countable number of distinct entities. This leads to common issues like:

• Under-generation: Fewer objects than requested.

• Over-generation: More objects than requested.

• Ambiguous objects: Features that look like parts of an object but aren't clearly a distinct entity, making counting difficult.

These inconsistencies undermine the precision required for many real-world applications, making the generated content less useful or even misleading.
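To make these failure modes measurable, one simple option is to compare the requested count with the number of instances found in each generated video. The sketch below assumes exact-match accuracy and assumes the detected counts come from some external object counter. The article does not describe the paper's evaluation protocol, so the function names and the sample data are illustrative only.

```python
from collections import Counter


def classify_count(requested: int, detected: int) -> str:
    """Label one generation as under-, over-, or exactly matching the prompt."""
    if detected < requested:
        return "under"
    if detected > requested:
        return "over"
    return "exact"


def counting_accuracy(pairs):
    """pairs: iterable of (requested, detected) counts.

    Returns (exact-match accuracy, breakdown of outcomes).
    Exact-match accuracy is an assumed metric, not taken from the paper.
    """
    outcomes = Counter(classify_count(r, d) for r, d in pairs)
    total = sum(outcomes.values())
    accuracy = outcomes["exact"] / total if total else 0.0
    return accuracy, dict(outcomes)


# Made-up (requested, detected) pairs, for illustration only.
samples = [(3, 3), (5, 4), (2, 2), (4, 6)]
accuracy, breakdown = counting_accuracy(samples)
print(accuracy, breakdown)
```

Ambiguous objects are the hard case for a metric like this, because the detected count then depends on how the counter treats partial or merged instances.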

NUMINA's Elegant Solution: Identify, Then Guide

NUMINA tackles this numerical alignment problem with a clever, two-phase, training-free framework:

Phase 1: Identify Prompt-Layout Inconsistencies

NUMINA first checks whether the layout the model is forming matches the number in the prompt. It does this by inspecting the internal workings of the diffusion model itself, specifically the self-attention and cross-attention heads. These heads govern how the model relates regions within the image to each other (self-attention) and how it relates the text prompt to the image (cross-attention).
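As a rough illustration of what a cross-attention head exposes, the sketch below computes how strongly each latent position attends to one prompt token using standard scaled dot-product attention. The two-dimensional vectors, the token names, and the function `cross_attention_map` are toy assumptions for illustration, not values or code from NUMINA.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention_map(queries, keys, token_index):
    """Attention weight that each latent position places on one text token.

    queries: one vector per latent (image/video) position.
    keys:    one vector per text token.
    """
    scale = 1.0 / math.sqrt(len(keys[0]))
    weights = []
    for q in queries:
        logits = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
        weights.append(softmax(logits)[token_index])
    return weights


# Toy setup: token 0 stands for "dogs", token 1 for "beach".
keys = [[1.0, 0.0], [0.0, 1.0]]
# Four latent positions; positions 0 and 2 resemble the "dogs" key.
queries = [[4.0, 0.0], [0.0, 4.0], [4.0, 0.0], [0.0, 4.0]]

dog_map = cross_attention_map(queries, keys, token_index=0)
print(dog_map)
```

Positions with high weight on the object token are candidates for where that object will appear, which is the kind of signal a layout can be read from.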

NUMINA's innovation here is to identify discriminative attention heads: the particular heads that seem to be most sensitive to numerical information in the prompt. By analyzing the patterns within these specific attention heads, NUMINA can derive a countable latent layout. Think of it as a rough sketch of how many distinct objects the model is placing in the scene, which can then be compared against the count the prompt asks for.
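One way to picture a "countable" layout is to threshold an attention map and count its connected regions. The sketch below does that on a small grid. Both the head-selection rule (highest variance) and the fixed threshold are stand-in assumptions chosen for illustration, since the article does not specify how NUMINA scores heads or extracts instances.

```python
from collections import deque
from statistics import pvariance


def select_head(head_maps):
    """Pick the head whose map has the highest variance (assumed criterion)."""
    def spread(attn_map):
        return pvariance([v for row in attn_map for v in row])
    return max(range(len(head_maps)), key=lambda h: spread(head_maps[h]))


def find_regions(attn_map, threshold=0.5):
    """Group above-threshold cells into 4-connected regions via BFS."""
    rows, cols = len(attn_map), len(attn_map[0])
    seen = [[False] * cols for _ in range(rows)]
    regions = []
    for r in range(rows):
        for c in range(cols):
            if seen[r][c] or attn_map[r][c] < threshold:
                continue
            seen[r][c] = True
            queue, cells = deque([(r, c)]), []
            while queue:
                y, x = queue.popleft()
                cells.append((y, x))
                for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    ny, nx = y + dy, x + dx
                    if (0 <= ny < rows and 0 <= nx < cols
                            and not seen[ny][nx]
                            and attn_map[ny][nx] >= threshold):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            regions.append(cells)
    return regions


# Two toy heads: one uninformative, one with distinct blobs.
flat_head = [[0.4] * 5 for _ in range(4)]
peaky_head = [
    [0.9, 0.9, 0.0, 0.0, 0.8],
    [0.9, 0.0, 0.0, 0.0, 0.8],
    [0.0, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.7, 0.7, 0.0, 0.0],
]

best = select_head([flat_head, peaky_head])
regions = find_regions([flat_head, peaky_head][best])
requested = 4  # e.g. the prompt asked for "four dogs"
print(f"head {best}: {len(regions)} regions, requested {requested}")
```

A mismatch between the region count and the requested number is the kind of prompt-layout inconsistency the first phase is meant to surface.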

Phase 2: Guide Regeneration

Once NUMINA has this layout and knows where it departs from the prompt, it doesn't just hope for the best. It actively guides regeneration of the video toward the requested count. This guidance is achieved by modulating cross-attention. Essentially, NUMINA intervenes in how the text prompt's numerical information influences the visual synthesis, so that the latent layout of the objects better matches the desired count.

The key here is that this guidance is conservative. It doesn't force the model to create objects out of nothing or drastically alter the scene. Instead, it gently steers the generation, refining the latent layout to ensure numerical accuracy while preserving the overall quality and temporal consistency of the video. This 'structural guidance' acts as a powerful complement to other techniques like prompt engineering or seed searching, providing a new dimension of control over text-to-video outputs.
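To give a feel for what modulating cross-attention can look like, the sketch below adds a small bias to the object token's attention logits inside the target layout and subtracts it outside. An additive logit bias is a common generic way to steer attention and is used here as an assumption. The article does not give NUMINA's actual modulation rule, and `modulate_logits` and `strength` are hypothetical names.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def modulate_logits(logits, token_index, inside_layout, strength=1.0):
    """Nudge attention toward one token inside the layout, away from it outside.

    logits:        per-position lists of per-token attention logits.
    inside_layout: per-position booleans from the refined layout mask.
    strength:      size of the nudge; small values keep the change gentle.
    """
    adjusted = []
    for row, inside in zip(logits, inside_layout):
        new_row = list(row)
        new_row[token_index] += strength if inside else -strength
        adjusted.append(new_row)
    return adjusted


def attention_on_token(logits, token_index):
    """Softmax each position's logits and read off one token's weight."""
    return [softmax(row)[token_index] for row in logits]


# Two latent positions, two text tokens, initially undecided.
logits = [[0.0, 0.0], [0.0, 0.0]]
inside_layout = [True, False]  # position 0 should hold an object instance

before = attention_on_token(logits, 0)
after = attention_on_token(modulate_logits(logits, 0, inside_layout), 0)
print(before, after)
```

Keeping `strength` small mirrors the conservative spirit described above: the attention is steered rather than overwritten, so the rest of the scene is left largely alone.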

What This Means for Your AI Projects

The implications of NUMINA are significant for developers and AI product builders:

1. Increased Reliability: Your text-to-video models will more often generate the requested number of objects, which matters for applications where precision is non-negotiable.

2. Enhanced Realism: Videos will look more natural and believable when object counts match expectations.

3. Reduced Iteration Time: Less time spent prompt engineering or regenerating videos to get the correct number of objects.

4. Training-Free Advantage: Since NUMINA doesn't require retraining the underlying diffusion model, it's practical to adopt. You can integrate it into existing pipelines without the cost of a training run.

5. Complements Existing Techniques: It works alongside and enhances current methods for improving video quality, offering a new layer of control.

Building with NUMINA: Practical Applications

NUMINA opens up a wealth of possibilities across various industries:

• Advertising & Marketing: Generate product showcases with exact quantities of items. Imagine an ad for a 'pack of six sodas' that consistently shows six cans, not five or seven. This ensures brand message accuracy and avoids customer confusion.

• Content Creation & Entertainment: Produce animated scenes or short films with precise numbers of characters, props, or environmental elements. A director could specify 'four knights approaching a castle' and have a better chance of getting exactly four, not an army or a duo.

• Simulation & Training: Create highly controlled virtual environments for training AI agents or human operators. For instance, a self-driving car simulator could reliably generate scenarios with 'three pedestrians crossing the street' or 'two cars merging into traffic,' ensuring consistent and repeatable test conditions.

• Education: Develop engaging instructional videos where numerical accuracy is paramount. A science video demonstrating 'two atoms bonding with three others' can now visually represent those exact counts, aiding comprehension.

• Architectural Visualization: Generate walkthroughs of buildings with specific numbers of furniture pieces, windows, or fixtures, providing accurate representations for clients.

By ensuring that AI can not only *understand* but also *accurately represent* numerical information, NUMINA takes a significant step towards more intelligent, reliable, and practically useful text-to-video generation.

The code for NUMINA is openly available, meaning you can start experimenting and integrating this powerful capability into your own projects today. This isn't just an academic breakthrough; it's a practical tool ready for your developer toolkit.

---

Cross-Industry Applications

Advertising/Marketing

Generating product advertisements or promotional videos that accurately display the specified quantity of items, e.g., 'a six-pack of soda' showing exactly six cans.

Ensures brand message consistency and eliminates visual inaccuracies that could mislead customers or require costly manual corrections.

Gaming/Metaverse

Populating virtual environments with a precise number of NPCs (Non-Player Characters), enemies, collectibles, or environmental features based on game logic or narrative prompts.

Enables more consistent and predictable game world generation, facilitating level design, quest creation, and dynamic content scaling.

Robotics/Simulation

Creating highly controlled and numerically accurate training or testing environments for robotic systems, e.g., 'three obstacles in a row' for navigation tasks or 'four specific objects on a table' for manipulation training.

Improves the reliability and reproducibility of simulated scenarios, accelerating robot learning and validation by ensuring consistent object counts.

Education/Training

Developing instructional videos that visually represent exact numerical concepts, such as 'two molecules bonding with three others' or 'five historical figures meeting at a table'.

Enhances learning effectiveness and clarity by providing precise visual aids that accurately reflect the numerical information being taught.
