
Inference strategies: RLM and the REPL idea

Coding is good, and sometimes also not so good.

In the midst of so much information—new models, post-training strategies (read: RL), and more—I found myself very excited when I first read the blog post about the RLM approach. The reason is simple: inference strategies are mostly overlooked and unsexy as a topic of discussion, yet very present in day-to-day AI engineering (Chain of Thought, ReAct, etc.).

For most practitioners out there, context engineering is not just jargon anymore, and everyone is taking context rot somewhat seriously (if you are not, you should!). Keeping the context controlled and relevant is definitely one of the main engineering tasks when building performant agents. For this reason, I pay close attention whenever a different inference-time strategy shows up in the wild, and that’s the case for RLM.

I’ll be straight to the point here: of the many things discussed in the RLM blog post, the one I find most interesting is having the REPL as the main environment for agent actions.

I tried it myself over the last week, and I want to share below some of the things I found.

The good, and the not-so-good

The thoughts logged here came from interacting with the repo published by the authors, to which I only added LiteLLM and enabled some async operations for my own use cases (here).
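
For reference, the change was small. A rough sketch of what it amounts to, assuming a single completion call that can be swapped out (the wrapper function is mine, not the repo’s; only LiteLLM’s acompletion is real):

```python
# Illustrative sketch: routing the model call through LiteLLM and making it
# awaitable. The llm_call wrapper is hypothetical.
import asyncio
from litellm import acompletion

async def llm_call(prompt: str, model: str = "gpt-4o-mini") -> str:
    response = await acompletion(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

async def main() -> None:
    print(await llm_call("Summarize the variable `ctx` in one line."))

asyncio.run(main())
```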

The good

Quoting from the authors:

“The natural solution is something along the lines of, ‘well maybe if I split the context into two model calls, then combine them in a third model call, I’d avoid this degradation issue’. We take this intuition as the basis for a recursive language model.”

From this, there are two obvious things the RLM approach enables:

  • Dealing with context rot (by outsourcing a lot of the info to the REPL).
  • Helping with very large contexts (letting the LLM use native programming-language resources to parse the context and break it down into variables; see the sketch right after this list).
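
To make the second point concrete, here is a minimal sketch of the idea: the large context lives as a plain Python variable inside the REPL, and the model emits ordinary Python to probe it instead of reading all of it from the prompt (the file name and queries are illustrative):

```python
# The full document never enters the model's context window; it sits in the
# REPL as a variable and gets probed with ordinary Python.
with open("huge_log.txt") as f:  # hypothetical input
    ctx = f.read()

# The kind of code an RLM-style agent might emit to get its bearings:
print(len(ctx))                 # how big is this thing?
print(ctx[:500])                # peek at the beginning
errors = [l for l in ctx.splitlines() if "ERROR" in l]
print(len(errors), errors[:3])  # narrow down to the relevant slice
```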

There are a few things that are not so obvious from the above, but are enabled by the REPL environment idea:

  • It enables the model to think and spend tokens mostly on coding—which we already know these models do somewhat well (or well enough for the tasks we assign them).
  • (The most interesting to me) it enables the model to create its own tools on the fly.
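
To illustrate that last point, here is a toy version of on-the-fly tool creation, seen from the model’s side of the REPL (everything below is illustrative of the pattern, not code from the repo):

```python
# Instead of being handed a counting tool, the model writes one on the fly,
# and can reuse it in later turns because the REPL keeps state.
from collections import Counter

def count_by_level(text: str) -> Counter:
    """Ad-hoc 'tool' defined by the model for this specific task."""
    return Counter(line.split()[0] for line in text.splitlines() if line.strip())

ctx = "ERROR disk full\nINFO started\nERROR timeout\n"  # stand-in context
print(count_by_level(ctx))  # Counter({'ERROR': 2, 'INFO': 1})
```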

From the blog post, and from my own interactions, an interesting pattern is easy to spot:

LLMs try to get ahead in the REPL, running many commands at once, which to me seemed like an interesting way to compress information gathering, making it more efficient and rich.
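
A single emitted block often looks something like the composite below (illustrative, not a transcript; ctx is the hypothetical context variable from the earlier sketch):

```python
# One REPL turn that bundles several rounds of information gathering.
ctx = open("huge_log.txt").read()  # hypothetical file, as in the sketch above
lines = ctx.splitlines()
print("total lines:", len(lines))
print("head:", lines[:3])
print("timeout mentions:", sum("timeout" in l for l in lines))
print("first match:", next((l for l in lines if "timeout" in l), None))
```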

As I said, the REPL idea is something I found really interesting, but in my tests, the positives above can kind of backfire…

The not-so-good

  • LLMs doing more coding: with the REPL, most of the LLM interactions are coding responses that get parsed and executed. If you’ve used LLMs for coding tasks, you know errors happen quite often, and that isn’t different here. Which leads to…
  • REPL management: as you start getting errors, dealing with the REPL gets somewhat complex. Depending on where the error happens, some variables from the proposed code may have been added to the REPL while the lines after the failure never executed, and I’ve seen this confuse LLMs quite a bit (see the sketch right after this list). I believe you could prompt/RL this out and/or improve context management for REPL information, but instead of making context gathering feel fast, when it goes wrong it gets very slow, and frustrating.
  • Tool-creation automation: this one sounds very attractive, but I have the feeling it might not play out well in production agents. The reason is simple: for tasks that are well defined and scoped, you tend to see consolidation in how the task is performed, which means you’ll end up with at least one well-defined tool that helps the LLM achieve good results most of the time. If you’re not adding these functions to the REPL globals (and, as a consequence, to the prompts), you’ll be relying on the randomness of the model to recreate these tools from scratch every time.
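
The partial-execution problem is easy to reproduce. If an emitted block is run with a single exec against the session globals, everything before the failing line sticks and everything after it silently never happens (a minimal repro, not tied to any particular repo):

```python
# Minimal repro of the half-applied block failure mode in a REPL-style loop.
session_globals: dict = {}

block = """
threshold = 10            # runs fine, lands in session state
counts = {"a": 1}["b"]    # raises KeyError here...
summary = "done"          # ...so this line never executes
"""

try:
    exec(block, session_globals)
except KeyError as err:
    print("block failed:", err)

print("threshold" in session_globals)  # True:  survived the failure
print("summary" in session_globals)    # False: silently missing
```

From the model’s point of view, the session now holds a state that no single block it wrote ever produced, which is exactly where I saw it get confused.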

Conclusion

The verdict is that I find the REPL idea very interesting, and I’ll definitely keep a close eye on other people’s work with it going forward. For now, it doesn’t seem like a clear replacement for the normal ways (read: for-loops and context management in templated prompts).

From my tests so far, the frustrations with code execution and with handling the REPL are not trivial to overcome. On the same note, dealing with code suggests you might be better off with handcrafted functions, which puts you back into the traditional tool loop all over again.
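
For contrast, this is roughly what I mean by the traditional tool loop: a bounded for-loop over model turns, dispatching handcrafted functions by name and templating results back into the context (a bare-bones sketch; call_model is a stand-in for a real LLM call, and the tool is illustrative):

```python
# Bare-bones version of the traditional tool loop, for contrast with the REPL.
TOOLS = {"grep_log": lambda pattern: f"3 lines matching {pattern!r} ..."}  # handcrafted

def call_model(messages: list[dict]) -> dict:
    """Stand-in for a real LLM call: request one tool, then answer."""
    if len(messages) == 1:
        return {"tool": "grep_log", "args": {"pattern": "timeout"}}
    return {"answer": "The job failed after repeated timeouts."}

messages = [{"role": "user", "content": "Why did the job fail?"}]
for _ in range(5):  # bounded turns instead of an open-ended REPL session
    reply = call_model(messages)
    if "answer" in reply:
        print(reply["answer"])
        break
    result = TOOLS[reply["tool"]](**reply["args"])
    messages.append({"role": "tool", "content": result})
```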