Shorn of all the computing details, LLMs are extremely interesting. Here is an article about an assessment where the AI goes absolutely mental 😂 🦜🦜🦜
World-Model Collapse as a Phase Transition
https://arxiv.org/html/2606.31399v1
"Water looks unchanged as it warms, then at a critical point it boils. We ask whether long-horizon language agents show an analogous transition in their implicit world models. In some parameter settings, changing state load by a small amount, or adding a single step of horizon, leaves behavior nearly unchanged; near a critical boundary, the same small change causes a sudden world collapse. We study this effect in a deterministic task family with exact per-step gold state. A large grid search over state cardinality, dependency density, horizon, branching, observation mode, and mutation rate reveals a phase diagram: a solved plateau, a narrow transition band, and a collapse floor. Per-step traces show the mechanism: world-state fidelity fails before action validity, so the agent is not merely choosing a bad action; it is acting from a corrupted world. Stronger models translate the critical boundary but do not remove the qualitative transition. These results make world-model collapse a measurable bottleneck for long-horizon agents."
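To make the "solved plateau, narrow transition band, collapse floor" picture concrete, here is a minimal sketch of how one slice of such a grid might be read. It is not the paper's code: the function names, the 0.9 and 0.1 thresholds, and the success rates in the example are all invented for illustration.

```python
from typing import List, Tuple


def classify_cell(success_rate: float,
                  solved_min: float = 0.9,
                  collapsed_max: float = 0.1) -> str:
    """Label one grid cell by its success rate (thresholds are assumptions)."""
    if success_rate >= solved_min:
        return "solved plateau"
    if success_rate <= collapsed_max:
        return "collapse floor"
    return "transition band"


def sharpest_drop(loads: List[int],
                  success_rates: List[float]) -> Tuple[int, int, float]:
    """Find the adjacent pair of load values with the largest fall in success.

    Returns (load_before, load_after, size_of_drop).
    """
    if len(loads) != len(success_rates) or len(loads) < 2:
        raise ValueError("need two or more loads, each with a success rate")
    drops = [success_rates[i] - success_rates[i + 1]
             for i in range(len(loads) - 1)]
    i = max(range(len(drops)), key=drops.__getitem__)
    return loads[i], loads[i + 1], drops[i]


# Made-up numbers for one slice of a grid (state load on the x-axis).
example_loads = [2, 4, 6, 8, 10]
example_rates = [0.98, 0.97, 0.95, 0.30, 0.02]

if __name__ == "__main__":
    for load, rate in zip(example_loads, example_rates):
        print(load, rate, classify_cell(rate))
    print(sharpest_drop(example_loads, example_rates))
```

With these toy numbers almost nothing changes between loads 2 and 6, and most of the fall happens in the single step from 6 to 8. That is the "boiling point" shape the abstract describes, as opposed to a steady decline.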
Recommendation:
Long-horizon LLM agent collapse is better described as a world-model phase transition than as ordinary gradual drift. When state cardinality and dependency density cross a critical region, the agent first loses the represented world and only then loses valid action. Fine scans localize this boundary, cross-model probes show that stronger models shift it rather than erase it, and secondary-axis ablations show that horizon, branching, observation, and mutation play distinct supporting roles. The practical lesson is direct: world-model capacity is a measurable, model-specific bottleneck. Evaluations that only average final success over naturalistic tasks can hide this boundary; reliable long-horizon agents require stress grids, per-step state instrumentation, and scaffolds that support the world representation before the planner fails. The broader message is that agent evaluation should measure the state the agent thinks it is acting in, not only the action it finally takes.
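The "per-step state instrumentation" idea can be sketched roughly as follows: at every step, compare the state the agent believes it is in against the exact gold state, and separately record whether its action was valid. The record format, the field names, and the fidelity measure below are assumptions made for illustration, not the paper's actual instrumentation.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class StepRecord:
    believed_state: Dict[str, int]  # what the agent reports the world to be (hypothetical format)
    gold_state: Dict[str, int]      # exact ground truth from the deterministic task
    action_valid: bool              # was the chosen action legal in the gold state?


def state_fidelity(believed: Dict[str, int], gold: Dict[str, int]) -> float:
    """Fraction of gold variables whose value the agent reports correctly."""
    if not gold:
        return 1.0
    correct = sum(1 for key, value in gold.items()
                  if key in believed and believed[key] == value)
    return correct / len(gold)


def first_failures(trace: List[StepRecord],
                   fidelity_threshold: float = 1.0
                   ) -> Tuple[Optional[int], Optional[int]]:
    """Return (first step with degraded state, first step with an invalid action).

    Either entry is None if that kind of failure never occurs in the trace.
    """
    first_state: Optional[int] = None
    first_action: Optional[int] = None
    for step, record in enumerate(trace):
        fidelity = state_fidelity(record.believed_state, record.gold_state)
        if first_state is None and fidelity < fidelity_threshold:
            first_state = step
        if first_action is None and not record.action_valid:
            first_action = step
    return first_state, first_action


# Toy trace: the agent's picture of the world goes stale at step 1,
# but its actions only become invalid at step 2.
example_trace = [
    StepRecord({"a": 1, "b": 2}, {"a": 1, "b": 2}, True),
    StepRecord({"a": 1, "b": 2}, {"a": 1, "b": 3}, True),
    StepRecord({"a": 1, "b": 2}, {"a": 4, "b": 3}, False),
]

if __name__ == "__main__":
    print(first_failures(example_trace))
```

In this toy trace the state failure comes one step before the action failure, which is the ordering the paper reports. A check that only scored the final action would see the problem a step later and would not show that the root cause was a corrupted picture of the world.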