Reasoning with Large Reasoning Models (LRMs) can be slow, and users often want to interrupt them mid-thought. We test SOTA reasoning models under several real-world interruption scenarios, and find three new failure modes!
What would happen if we interrupt LRMs when they're 30% done thinking?
Because LRMs' reasoning and inference can take a lot of time, users often don't want to wait for them to finish. Instead, they want to interrupt mid-inference: forcing an immediate answer (hard interrupt), asking the model to accelerate (speedup), or changing the task specification (info update). We explore how well LRMs handle these interruptions and dynamic context changes across math and coding tasks. The figure below illustrates our evaluation protocol:
In practice, we run a single inference session to obtain the full reasoning chain $r$, and then insert the interrupt message $i$ at different stages of the reasoning process ($0 \leq X < |r|$), depending on the scenario:
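The protocol above can be sketched as follows. This is a minimal illustration, not the paper's released code: the helper name `build_interrupted_prompt`, the `INTERRUPT_MESSAGES` wording, and the `<think>` tag convention are all assumptions for the sake of the example.

```python
# Illustrative interrupt messages for the three scenarios (wording assumed).
INTERRUPT_MESSAGES = {
    "hard": "Stop thinking and give your final answer now.",
    "speedup": "Please speed up and wrap up your reasoning quickly.",
    "info_update": "<update>The problem statement has changed.</update>",
}

def build_interrupted_prompt(question: str, reasoning_tokens: list[str],
                             fraction: float, scenario: str) -> str:
    """Truncate the full reasoning chain r at position X = fraction * |r|
    and append the scenario's interrupt message i, so the model resumes
    generation from the interrupted prefix."""
    assert 0.0 <= fraction < 1.0
    cut = int(fraction * len(reasoning_tokens))
    prefix = "".join(reasoning_tokens[:cut])
    return f"{question}\n<think>\n{prefix}\n{INTERRUPT_MESSAGES[scenario]}"
```

The model's continuation is then generated from this partial prompt, and the resulting answer is scored as usual.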
For Hard Interrupt and Speedup evaluations, we directly utilized existing math and coding datasets, including GSM8K, MATH-500, AIME 24/25, and LiveCodeBench-v6. For the info update track specifically, we augmented these datasets to create a new collection containing 1401 reasoning problems with multiple conflicting information updates.
Explore the full Update-Interrupt Benchmark dataset on Hugging Face.
When cutting LRMs' thinking budget in the end-thinking setup, the results look surprisingly good: Pass@1 doesn't drop much even when interrupting halfway through the reasoning process. When we look at the answer lengths, however, most models "cheat" when told to answer right away. Even with the inserted end-thinking token, they keep reasoning inside the answer section, sneaking in extra thoughts to reach the right result. This hidden reasoning, which we call reasoning leakage, makes models look smarter than they really are. Once we disable it in the force-answering setup, Pass@1 accuracy drops sharply, showing that LRMs still have room for improvement under anytime scenarios. Coding tasks are even worse: models still "think" through inline comments in the force-answering setup, producing up to 10x longer code.
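One simple way to surface this effect is a length-based heuristic: flag answers that are far longer than a typical final answer for the task. This is a hedged sketch, not the paper's actual detector; the whitespace tokenizer and the `3.0` ratio threshold are assumptions.

```python
def flag_reasoning_leakage(answer: str, typical_answer_len: int,
                           ratio_threshold: float = 3.0) -> bool:
    """Flag an answer whose length far exceeds the typical final-answer
    length for the task -- a telltale sign that the model kept reasoning
    in the answer section after the end-thinking token."""
    n_tokens = len(answer.split())
    return n_tokens > ratio_threshold * typical_answer_len
```

In practice one would calibrate `typical_answer_len` per dataset (e.g., from uninterrupted runs) rather than hard-code it.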
Do you recall the "thinking tokens vs. accuracy" plot from prior work (e.g., Qwen3 and NVIDIA's Nemotron)? Our results show that this plot may not tell the full story. Longer outputs often hide extra reasoning and quietly inflate compute cost, even when the model appears to stop thinking early. You can see this in our full results below:
Instead of secretly thinking more during the answer, ideal models should strike a better balance between following instructions and producing correct results.
Takeaway: Current LRMs show only partial anytime behavior, owing to reasoning leakage. In the future, we need to understand and mitigate this leakage to build truly interruptible models.
We all want faster AI. But can we speed up a reasoning model after it has already started a complex task? Here, we explore this idea of "on-the-fly" acceleration. We found that timing is everything, and the effect follows a distinct U-shaped curve: interrupt at just the right moment, and you can get a faster response with almost no drop in quality:
For some models working on math problems, this works incredibly well. It's almost a "free lunch"; models think faster and consume fewer resources without sacrificing accuracy. However, the story changes dramatically with coding tasks.
When pushed to accelerate on code generation, some models like Qwen and GPT-OSS tend to panic. Instead of gracefully speeding up, their performance collapses by up to 25%. They rush to a conclusion, providing low-quality, incomplete answers before their first reasoning cycle is even finished. This "panic answering" shows that while we can speed up AI, pushing too hard can cause it to stumble completely.
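The U-shaped timing effect can be probed with a simple sweep: interrupt the same set of problems at several fractions of the reasoning chain and record accuracy at each point. The sketch below assumes a caller-supplied `run_with_speedup_interrupt(problem, fraction)` that performs the actual model call and returns 1 for a correct answer and 0 otherwise; that function and the fraction grid are illustrative, not from the paper's code.

```python
def sweep_interrupt_fractions(problems, run_with_speedup_interrupt,
                              fractions=(0.1, 0.3, 0.5, 0.7, 0.9)):
    """Return {fraction: accuracy} for a speedup interrupt inserted at
    each fraction of the reasoning chain. Plotting accuracy against the
    fraction is what reveals the U-shaped curve."""
    results = {}
    for f in fractions:
        correct = sum(run_with_speedup_interrupt(p, f) for p in problems)
        results[f] = correct / len(problems)
    return results
```

A sharp accuracy dip at the early fractions is the "panic answering" regime; the sweet spot sits where accuracy recovers while tokens are still being saved.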
Takeaway: When we want to speed up LRMs, we can often get a "free lunch", achieving faster responses with minimal accuracy loss. However, pushing too hard can lead to panic answering, where models rush to incomplete conclusions. Future work should focus on developing more robust acceleration techniques that avoid this pitfall.
To simulate real-time context changes, we prompt the model with a system instruction explaining that updates will appear between <update>...</update> tags, and then insert these updates mid-reasoning. This simple setup caused a noticeable performance drop across models.
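Concretely, the injection amounts to appending a tagged update to the partially generated reasoning, matching the format announced in the system instruction. The exact system-prompt wording below is an assumption, shown for illustration.

```python
# Illustrative system instruction (wording assumed, format as described).
SYSTEM_INSTRUCTION = (
    "While you reason, new information may appear between "
    "<update>...</update> tags. Incorporate it into your reasoning."
)

def inject_update(reasoning_so_far: str, update_text: str) -> str:
    """Append an information update between <update> tags to the
    partial reasoning chain; generation then resumes from here."""
    return f"{reasoning_so_far}\n<update>{update_text}</update>\n"
```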
To mitigate this, we introduced a more natural baseline: guided prompting using the model's own voice. Instead of a raw update tag, we phrase the update as if the model is reminding itself of the new information. This almost eliminates the self-doubt issue and significantly stabilizes performance, as shown in the chart below.
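Guided prompting only changes how the update is phrased: instead of a raw tag, the new information is written as a first-person self-reminder so it blends into the model's own chain of thought. The exact self-reminder wording below is an assumption, not the paper's verbatim template.

```python
def guided_update(update_text: str) -> str:
    """Rephrase an information update in the model's own voice, as if the
    model is reminding itself of the new information mid-reasoning."""
    return (f"Wait, I just remembered an important update: {update_text} "
            f"Let me continue my reasoning with this in mind.")
```

The appeal of this baseline is that it requires no special tags or template support: the update is just more reasoning text in the model's own register.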
Models generally maintain strong performance, even outperforming the "stop-and-redo" baseline. For example, GPT-OSS on LiveCodeBench-v6 and Qwen on GSM8K preserve around 95% of oracle accuracy at a much lower computation cost.
One limitation of current LRMs is that not all of them natively support multi-turn thinking. This means we cannot easily close a reasoning block, inject a user update, and reopen it. To validate this, we performed an ablation study simulating that setup, confirming that current LRMs remain fragile under such mid-thinking interruptions, performing worse than our guided prompting baseline.
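The ablation setup can be simulated at the template level: close the reasoning block, inject the user's update as a new turn, then reopen a thinking block. The sketch below uses the common `<think>...</think>` convention purely for illustration; real chat templates vary by model, and this is not the paper's implementation.

```python
def close_and_reopen(reasoning_prefix: str, user_update: str) -> str:
    """Simulate multi-turn thinking: end the current reasoning block,
    insert the user's mid-thinking update as a new turn, and open a
    fresh thinking block for the model to continue in."""
    return (f"<think>{reasoning_prefix}</think>\n"
            f"User: {user_update}\n"
            f"<think>")
```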
Takeaway: LRMs can adapt to mid-reasoning updates with proper prompting, but they still struggle with late or conflicting information. Future work should focus on developing models that can natively handle multi-turn reasoning and interruptions.
In this work, we are the first to systematically evaluate large reasoning models (LRMs) under dynamic contexts, simulating real-world interruptions like hard stops, speedups, and information updates. Our findings reveal three new failure modes: reasoning leakage, panic answering, and self-doubt. These issues highlight the fragility of current LRMs in time-sensitive applications.
We call on researchers to: develop LRMs that can gracefully handle interruptions, implement checkpointing mechanisms for reasoning steps, and design adaptive inference strategies that maintain consistency under dynamic constraints. We encourage practitioners to: rigorously test reasoning models in interrupt-rich environments before deployment, establish clear fallback mechanisms for time-sensitive applications, and report interruptibility failures to inform the research community. Together, we can build more robust, deployable reasoning systems.
We are deeply grateful to Lisa Dunlap for her invaluable feedback and thoughtful discussions. We also thank Modal for supporting this work through their Academics Compute Grant. Sky Computing Lab is supported by gifts from Accenture, AMD, Anyscale, Cisco, Google, IBM, Intel, Intesa Sanpaolo, Lambda, Lightspeed, Mibura, Microsoft, NVIDIA, Samsung SDS, and SAP. Authors, as part of their affiliation with UC Berkeley, were supported in part by the National Science Foundation, US Department of Defense, and/or the Berkeley Artificial Intelligence Research (BAIR) industrial alliance program, as well as gifts from Amazon.
@article{wu2025interruptible,
title={Are Large Reasoning Models Interruptible?},
author={Wu, Tsung-Han and Miroyan, Mihran and Chan, David M and Darrell, Trevor and Norouzi, Narges and Gonzalez, Joseph E},
journal={arXiv preprint arXiv:2510.11713},
year={2025}
}