When to Ponder: Adaptive Compute Allocation for Code Generation via Test-Time Training

Abstract

We propose PonderTTT, a method that selectively applies test-time training (TTT) updates to a language model based on input difficulty. The reconstruction loss signal from the TTT layers is used to decide when to trigger an update, so no learned classifier is needed. A single scalar threshold, calibrated on unlabeled data and adapted during inference, governs the update frequency. On GPT-2 models (124M to 1.5B parameters) for code language modeling, our approach achieves 82–89% Oracle Recovery while remaining fully training-free, and substantially outperforms a random-selection baseline on out-of-distribution languages.
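The abstract specifies the decision signal (the TTT reconstruction loss) and the scalar threshold, but not an implementation. The sketch below is a hypothetical illustration of how such a threshold gate with online adaptation might look; the class name, the running-average update-rate estimate, and the adaptation step size are assumptions for illustration, not the paper's actual code.

```python
class PonderGate:
    """Hypothetical threshold gate: decide whether to run a TTT update.

    `threshold` would be calibrated offline on unlabeled data; during
    inference it is nudged so the realized update frequency tracks a
    target rate (one plausible adaptation scheme, assumed here).
    """

    def __init__(self, threshold: float, target_rate: float = 0.5,
                 adapt_step: float = 0.01, momentum: float = 0.99):
        self.threshold = threshold
        self.target_rate = target_rate
        self.adapt_step = adapt_step
        self.momentum = momentum
        self.update_rate = target_rate  # running estimate of update frequency

    def should_update(self, reconstruction_loss: float) -> bool:
        """Trigger a TTT update only when the current chunk looks 'hard'."""
        update = reconstruction_loss > self.threshold

        # Track how often updates actually fire, then adapt the threshold:
        # updating too often -> raise the bar; too rarely -> lower it.
        self.update_rate = (self.momentum * self.update_rate
                            + (1.0 - self.momentum) * float(update))
        if self.update_rate > self.target_rate:
            self.threshold += self.adapt_step
        else:
            self.threshold -= self.adapt_step
        return update
```

In a generation loop, one would compute the TTT layer's reconstruction loss for the current chunk, call `should_update`, and perform the inner-loop weight update only when it returns `True`; otherwise the chunk is processed without an update, which is the behavior the SKIP baseline applies everywhere.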

Results

Main Results (Python In-Distribution)

Model          SKIP    Oracle  Ours    Recovery
Small (124M)   2.324   1.935   1.977   89.2%
Medium (355M)  1.909   1.653   1.697   82.8%
Large (774M)   2.005   1.580   1.656   82.1%
XL (1.5B)      1.875   1.518   1.576   83.8%
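Lower is better for SKIP, Oracle, and Ours. The Recovery column is consistent with the normalized improvement (SKIP - Ours) / (SKIP - Oracle); for the Small model, (2.324 - 1.977) / (2.324 - 1.935) ≈ 89.2%.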

OOD Generalization (XL 1.5B)

Language    SKIP    Random  Ours    Oracle
JavaScript  2.85    2.12    1.84    1.57
Java        3.21    2.33    1.95    1.64
Go          6.52    4.25    4.15    3.70
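If recovery is computed the same way as above, these numbers correspond to roughly 79% (JavaScript), 80% (Java), and 84% (Go), close to the in-distribution range despite the distribution shift.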

BibTeX

@article{sim2025ponderttt,
  title={When to Ponder: Adaptive Compute Allocation for Code Generation via Test-Time Training},
  author={Sim, Gihyeon},
  journal={arXiv preprint arXiv:2601.00894},
  year={2025}
}