Test-Time Training Done Right

Tianyuan Zhang1, Sai Bi2, Yicong Hong2, Kai Zhang2, Fujun Luan2, Songlin Yang1, Kalyan Sunkavalli2, William T. Freeman1, Hao Tan2
1 Massachusetts Institute of Technology    2Adobe Research   


LaCT Architecture

Figure 2. The basic diagram for a LaCT block. The large-chunk TTT layer updates the fast weight W to store history chunk information, while the window attention handles the internal structures within the chunk. The solid line denotes the information flow over model depth and the dashed line denotes the information flow over time (i.e., the fast weight W passing through chunks).
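To make the block structure above concrete, here is a minimal, hedged sketch of how such a block might be organized. It assumes a linear fast weight, a placeholder reconstruction objective for the test-time update, and a generic window_attn callable; these names, the chunk size, and the learning rate are illustrative assumptions, not details taken from the paper.

```python
# Hypothetical sketch of one LaCT block (simplified from Fig. 2).
# A fast weight W is applied and updated once per large chunk, while
# window attention models token interactions inside the chunk.
import torch
import torch.nn.functional as F

def lact_block(tokens, window_attn, fast_weight, chunk_size=4096, lr=1e-2):
    """tokens: (seq_len, dim); fast_weight: (dim, dim) recurrent state."""
    outputs = []
    for chunk in tokens.split(chunk_size, dim=0):
        # Intra-chunk structure: ordinary (window) attention over the chunk.
        local = window_attn(chunk)

        # Apply the current fast weight to retrieve memory of past chunks.
        retrieved = local @ fast_weight

        # Test-time training: one large-chunk gradient step on a
        # self-supervised loss (a placeholder objective, not the paper's).
        w = fast_weight.detach().requires_grad_(True)
        loss = F.mse_loss(local @ w, local)
        (grad,) = torch.autograd.grad(loss, w)
        fast_weight = (w - lr * grad).detach()

        outputs.append(local + retrieved)
    return torch.cat(outputs, dim=0), fast_weight
```

Because the update happens once per large chunk, the work inside the loop is dominated by dense matrix multiplications over thousands of tokens, which is what allows the high GPU utilization reported below.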

Abstract

Test-Time Training (TTT) models context dependencies by adapting part of the model's weights (often referred to as fast weights) at inference time. This adapted fast weight, similar to recurrent states in RNNs, stores temporary memories of past tokens in the current sequence. Existing TTT methods have struggled to demonstrate effectiveness in handling long-sequence data, due to their computational inefficiency on modern GPUs. The TTT layers in many of these approaches operate with extremely low FLOPs utilization (often below 5%) because they deliberately apply small online mini-batch sizes (e.g., updating fast weights every 16 or 64 tokens). Moreover, a small mini-batch implies fine-grained block-wise causal dependencies in the data, making them unsuitable for data beyond 1D ordered sequences, like sets or N-dimensional grids such as images or videos. In contrast, we pursue the opposite direction by proposing an extremely large chunk update, ranging from 2K to 1M tokens across tasks of varying modalities, which we refer to as Large Chunk Test-Time Training (LaCT). This approach improves hardware utilization by orders of magnitude, and more importantly, facilitates scaling of nonlinear state size (up to 40% of model parameter size), hence substantially improving state capacity, all without requiring cumbersome and error-prone custom kernel implementations. It also allows easy integration of sophisticated optimizers like Muon for online memory updates. We validate our approach across diverse data modalities and tasks, including novel view synthesis from image sets, language modeling, and auto-regressive video diffusion models. Our approach can scale up to 14-billion-parameter auto-regressive video diffusion models handling sequences of up to 56K tokens. In our longest sequence experiment, we perform novel view synthesis with a context length of more than one million tokens. Our results highlight the computational and performance benefits of large-chunk test-time training, paving the way for more efficient and scalable long-context sequence modeling. We hope that this work will inspire and accelerate new research in the field of long-context modeling and test-time training.
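The abstract notes that large chunks make it easy to plug in sophisticated optimizers such as Muon for the online fast-weight update. As a rough illustration only, the sketch below shows one way such an update could look: a momentum buffer over the chunk gradient, orthogonalized with a Newton-Schulz iteration before being applied. The coefficients, step count, and function names follow common Muon practice and are assumptions, not values from the paper.

```python
# Hedged sketch of a Muon-style fast-weight update, assuming the gradient
# `grad` of the current chunk's TTT loss has already been computed.
import torch

def newton_schulz_orthogonalize(G, steps=5, eps=1e-7):
    # Approximately orthogonalize G via a quintic Newton-Schulz iteration
    # (coefficients follow common Muon implementations, not the paper).
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (G.norm() + eps)  # normalize so the iteration converges
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X

def muon_fast_weight_update(fast_weight, grad, momentum, beta=0.9, lr=1e-2):
    momentum = beta * momentum + grad               # momentum accumulation
    update = newton_schulz_orthogonalize(momentum)  # orthogonalized direction
    return fast_weight - lr * update, momentum

# Usage: initialize momentum = torch.zeros_like(fast_weight) and call this
# once per large chunk, in place of the plain gradient step.
```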

Hardware Utilization

Figure 1. Using larger chunk sizes significantly improves GPU utilization compared to the original test-time training (TTT) method, even though the latter uses customized kernels (a). This enhanced utilization enables efficient and effective scaling to larger state sizes (b), (c), leading to better overall performance in less wall-clock time (d). The dotted line in (a) is the theoretical peak BF16 throughput of the GPU. Panel (c) measures the average validation loss over the last 2K tokens of sequences processed by a LaCT language model with different state sizes, demonstrating the benefit of a larger state size. Panel (d) compares performance versus training time across different baselines on the novel view synthesis benchmark.

Results on Novel View Synthesis (GSO Dataset)


Compressing 262K tokens (16 images at 1024x1024 resolution) into fast weights online.

Rendering at 1024x1024 resolution, best viewed in full screen.

Results on Novel View Synthesis (DL3DV Dataset)


Compressing 1 million tokens (128 images at 960x536 resolution) into fast weights online.


Results on Language Modeling

Language Modeling

Figure 5. Language modeling results. (a, c) Our model achieves lower per-position loss at larger token indices compared to GLA and DeltaNet at both the 760M and 3B scales, indicating stronger long-context modeling capability. (b, d) Our model consistently outperforms GLA and DeltaNet in retrieval accuracy. Furthermore, our Muon variant consistently outperforms our Momentum variant.


Results on Autoregressive Video Diffusion


This close-up shot of a Victoria crowned pigeon showcases its striking blue plumage and red chest. Its crest is made of delicate, lacy feathers, while its eye is a striking red color. The bird's head is tilted slightly to the side, giving the impression of it looking regal and majestic. The background is blurred, drawing attention to the bird's striking appearance.

A gorgeously rendered papercraft world of a coral reef, rife with colorful fish and sea creatures.

An astronaut runs on the surface of the moon, the low angle shot shows the vast background of the moon, the movement is smooth and appears lightweight.

A man riding a horse through the Gobi Desert with a beautiful sunset behind him, movie quality.

A woman singing and standing in a concert stage with a bright light in the background.

A cyclist powering up a steep hill in a road race.