Core Idea: Binary Gating via Gumbel-Softmax
PonderTTT introduces Adaptive Test-Time Training with learned SKIP/UPDATE decisions. Instead of applying TTT updates uniformly to all input chunks, we learn when to update using a binary gating mechanism trained via Gumbel-Softmax.
| Feature | Fixed TTT | PonderTTT (Binary Gating) |
|---|---|---|
| Decision | Always UPDATE | SKIP or UPDATE per chunk |
| Training | N/A | Gumbel-Softmax (differentiable) |
| Inference | Fixed cost | True computational savings |
| Cost | 3.0x (UPDATE_1) | 2.67x (83% update rate) |
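The mechanism can be sketched as follows. This is a minimal illustration, not the repository's implementation: `sample_gate`, `apply_gate`, the fixed temperature `tau`, and the tensor shapes shown are assumptions. A straight-through Gumbel-Softmax turns per-chunk gate logits into a hard SKIP/UPDATE decision while gradients still flow through the relaxed probabilities.

```python
import jax
import jax.numpy as jnp


def sample_gate(logits, key, tau=1.0):
    """Straight-through Gumbel-Softmax over {SKIP, UPDATE}.

    logits: (num_chunks, 2) unnormalized scores for [SKIP, UPDATE].
    Returns hard one-hot decisions in the forward pass while gradients
    flow through the relaxed (soft) probabilities.
    """
    u = jax.random.uniform(key, logits.shape, minval=1e-6, maxval=1.0 - 1e-6)
    gumbel = -jnp.log(-jnp.log(u))
    soft = jax.nn.softmax((logits + gumbel) / tau, axis=-1)  # differentiable relaxation
    hard = jax.nn.one_hot(jnp.argmax(soft, axis=-1), 2)      # discrete SKIP/UPDATE
    return soft + jax.lax.stop_gradient(hard - soft)         # straight-through estimator


def apply_gate(gate, updated_out, skipped_out):
    """Blend the TTT-updated path and the skip path by the per-chunk decision.

    gate:        (num_chunks, 2) decisions from sample_gate.
    updated_out: (num_chunks, chunk_len, d_model) outputs with the TTT update.
    skipped_out: (num_chunks, chunk_len, d_model) outputs without it.
    """
    update_prob = gate[:, 1][:, None, None]  # 1.0 = UPDATE, 0.0 = SKIP
    return update_prob * updated_out + (1.0 - update_prob) * skipped_out
```

The temperature is typically annealed toward a small value during training; at inference the hard decision alone determines whether the TTT update is executed at all, which is where the computational savings come from.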
Key Results (GPT-2 125M on Python)
- 4.5x perplexity improvement over non-adaptive baseline (26.36 → 5.85)
- Strong OOD generalization: perplexity improvements on JavaScript (2.5x), Java (6.2x), and Go (70x)
- Learned policy captures universal "when to adapt" patterns
Technical Architecture
This project is a pure JAX/Flax NNX rewrite of the official TTT-LM, enhanced with adaptive gating.
- Base Model: Pretrained GPT-2 (125M, 350M) with frozen backbone weights
- Fast-Weight Layer (`TTTLayer`): TTT-Linear with causal convolutions and dual-form updates
- Binary Gating Network: Lightweight MLP that makes SKIP/UPDATE decisions via Gumbel-Softmax
- Training Objective: Top-k Discriminative Gating. Instead of an explicit cost penalty, we use an implicit budget constraint by training the gate to identify the top-k% chunks with the highest TTT advantage.
- Loss Function: \( L_{total} = L_{CE} + \beta \cdot L_{TTT} + L_{gate} \)
- \( L_{CE} \): Main task cross-entropy (always computed with TTT updates)
- \( L_{TTT} \): TTT reconstruction loss (auxiliary self-supervised task)
- \( L_{gate} \): Binary Cross-Entropy (BCE) loss aligning the gate with the oracle Top-k ranking
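As a rough sketch of how these terms could be assembled (the helper name `total_loss`, the per-chunk `ttt_advantage` signal, and the defaults for `target_rate` and `beta` are illustrative assumptions, not taken from the code): the top-k% chunks by TTT advantage become positive labels, and a BCE loss trains the gate to reproduce that ranking.

```python
import jax.numpy as jnp
import optax


def total_loss(ce_loss, ttt_loss, gate_logits, ttt_advantage,
               target_rate=0.5, beta=0.1):
    """Sketch of L_total = L_CE + beta * L_TTT + L_gate.

    gate_logits:   (num_chunks,) gate score for UPDATE on each chunk.
    ttt_advantage: (num_chunks,) oracle benefit of updating a chunk,
                   e.g. loss_when_skipped - loss_when_updated.
    """
    # Implicit budget: label the top-k% highest-advantage chunks as UPDATE.
    num_chunks = ttt_advantage.shape[0]          # static shape, so k is a Python int
    k = max(1, int(round(target_rate * num_chunks)))
    threshold = jnp.sort(ttt_advantage)[-k]      # k-th largest advantage
    labels = (ttt_advantage >= threshold).astype(jnp.float32)

    # BCE aligns the gate with the oracle top-k ranking.
    gate_loss = optax.sigmoid_binary_cross_entropy(gate_logits, labels).mean()

    return ce_loss + beta * ttt_loss + gate_loss
```

Because the budget enters only through the top-k labels, no explicit cost penalty term is needed in the loss.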
Roadmap & Status
The project is currently in active development. Phase 1 is complete with a preprint available.
Phase 1: Complete (Preprint)
- Pure NNX GPT-2, TTT Layer with Binary Gating
- Gumbel-Softmax training for SKIP/UPDATE decisions
- End-to-End differentiable training with Top-k Discriminative Gating
- Results on GPT-2 (125M, 350M) with OOD evaluation
Phase 2: Planned (Conference Submission)
See PLAN.md for the detailed roadmap.
- Scale to Gemma 3 (4B, 12B): Validate on modern, production-relevant architectures.
- LoRA-TTT for Efficiency: Replace full TTT updates with Low-Rank Adaptation.
- Reasoning Benchmarks: MATH500, GSM8K, LiveCodeBench, GPQA-Diamond.
- Advanced Gating Features: Entropy, VOG, Attention Dispersion.
Quick Start
Installation
```bash
# Install uv if you do not have it yet
curl -LsSf https://astral.sh/uv/install.sh | sh

# Install the project in editable mode
uv pip install -e . --group gpu # or tpu/cpu
```
Reproduce Paper Results (Recommended)
Run the full suite of experiments (Training, OOD Evaluation, Latency, Ablations) with a single script:
```bash
chmod +x scripts/run_all_experiments.sh
./scripts/run_all_experiments.sh
```
Manual Training
```bash
python -m ponderttt.experiments.train_hard_skip \
  --model_scale 125m \
  --target_update_rate 0.5 \
  --num_iterations 10000 \
  --output_dir outputs/hard_skip
```
Citation
```bibtex
@article{sim2025ponderttt,
  title={Learning to Ponder: Adaptive Compute Allocation via Test-Time Training},
  author={Sim, Gihyeon},
  year={2025}
}
```