Inference-Time Compute Scaling is the idea that allocating more computational resources (compute) during the inference stage leads to better model performance and reasoning capabilities.
- Just as humans give better answers when they think for longer, LLMs generate better answers when they are allowed to “think” by generating more tokens (reasoning steps) before producing the final answer.
- Model accuracy generally scales with the amount of test-time compute.
This method does not involve changing the underlying model parameters (learning/training); rather, it changes how the model is used during inference, for example through techniques like Chain-of-Thought prompting.
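
As a concrete illustration, the minimal sketch below combines a Chain-of-Thought prompt with self-consistency sampling: more compute is spent at inference by drawing several independent reasoning chains and majority-voting their final answers. The `generate` function is a placeholder (an assumption, not a specific library's API), and the `Answer: <value>` output format is likewise an assumed convention.

```python
from collections import Counter

def generate(prompt: str, temperature: float = 0.8) -> str:
    """Placeholder for a call to any LLM completion API.
    Assumption: returns the model's full text response,
    including its intermediate reasoning steps."""
    raise NotImplementedError

def extract_answer(response: str) -> str:
    """Assumes the model was instructed to end with 'Answer: <value>'."""
    return response.rsplit("Answer:", 1)[-1].strip()

def solve_with_more_compute(question: str, num_samples: int = 8) -> str:
    # Chain-of-Thought prompt: ask the model to reason step by step
    # before committing to a final answer.
    prompt = (
        f"{question}\n"
        "Think step by step, then finish with 'Answer: <value>'."
    )
    # Scale inference compute: sample several independent reasoning
    # chains (self-consistency) and majority-vote the extracted answers.
    answers = [extract_answer(generate(prompt)) for _ in range(num_samples)]
    return Counter(answers).most_common(1)[0][0]
```

Increasing `num_samples` spends more tokens per question without touching the model weights, which is exactly the trade-off inference-time compute scaling describes.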
