Cantrip’s intellectual lineage: an annotated bibliography

I wrote Cantrip as a synthesis of several conversations I’d been following — and participating in — for years. Read together, the sources here show those conversations converging: recursive and code-based models of agency; RL-style thinking about trajectories, replay, and environments; Unix-like compositional environments; the recent emergence of harness engineering, agent-computer interface design, and agent experience; and the containment-oriented naming and invocation practices that give Cantrip its vocabulary.

These sources don’t all belong to one conscious school, and they didn’t all shape Cantrip in a simple linear way. Some are direct architectural precursors. Some are symbolic neighbors. Some are later formalizations. Some are independent convergences. The aim here is not to collapse those distinctions but to make them navigable.

Architectural shift: from chat to medium

deepfates. “Recursive language models.”

Recursive language models | deepfates deepfates.com

In my quest to understand the true nature of an Agent I have been thinking a lot about the loop and the actions and the environment. And I think I see where we're headed next.

Where I articulate the shift from chat to programmable environment most directly. Agents are not chatbots with tools but something closer to “programming languages come alive,” with state and action relocated into the environment.

Zhang, Alex, et al. “Recursive Language Models.”

Alex L. Zhang | Recursive Language Models alexzhang13.github.io

We propose Recursive Language Models (RLMs), an inference strategy where language models can decompose and recursively interact with input context of unbounded length through REPL environments.

The academic formalization of the same move. RLMs let language models interact recursively with unbounded context through a REPL environment rather than consuming it all as prompt.

Zhang, Alex. “RLM vs Agents.”

Fundamentally, what really is the difference between an RLM and S={context folding, Codex, Claude Code, Terminus, agents, etc.}?

This is the last and most important RLM post I'll make for a while to finally answer all the "this is trivially obvious" from HackerNews, Reddit, X,…
— alex zhang (@a1zhang) January 22, 2026

Sharpens a distinction that matters here: in an RLM, recursion is embedded in the symbolic medium itself rather than delegated to an external orchestration layer. Explains why I put Cantrip’s composition model inside the medium.

Cheng, Ellie Y., et al. “Enabling RLM Inference with Shared Program State.”

Enabling RLM Inference with Shared Program State | Ellie Y. Cheng elliecheng.com

A formalization of the shared-state variant. Shows how natural-language code can read and write host-program state, which maps closely to how my gates work: host functions projected into the medium as native callables, so the entity writes read("data.json") in code without knowing it’s crossing the circle’s boundary.
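A minimal sketch of that projection, assuming a Python medium. The names `host_read`, `make_circle`, and the in-memory `FILES` dict are hypothetical illustrations, not Cantrip's actual API, and an empty `__builtins__` is only a gesture at containment, not a real sandbox.

```python
# Hypothetical sketch: a host function projected into the medium as a plain callable.
FILES = {"data.json": '{"answer": 42}'}  # stand-in for host-side state


def host_read(path: str) -> str:
    """Runs on the host side of the boundary."""
    return FILES[path]


def make_circle(gates: dict) -> dict:
    """Build the namespace the entity's code runs in.

    Only the supplied gates are visible; builtins are withheld.
    (Assumption: this is illustrative, not a secure sandbox.)
    """
    return {"__builtins__": {}, **gates}


circle = make_circle({"read": host_read})

# Code the entity might write: read() looks like any native function.
entity_code = 'result = read("data.json")'
exec(entity_code, circle)

print(circle["result"])  # prints {"answer": 42}
```

From inside the medium there is no visible difference between `read` and an ordinary function; the boundary crossing lives entirely in how the namespace was built.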

Wang, Xingyao, et al. “Executable Code Actions Elicit Better LLM Agents.”

Executable Code Actions Elicit Better LLM Agents arxiv.org

Large Language Model (LLM) agents, capable of performing a broad range of actions, such as invoking tools and controlling robots, show great potential in tackling real-world challenges. LLM agents are typically prompted to produce actions by generating JSON or text in a pre-defined format, which is usually limited by constrained action space (e.g., the scope of pre-defined tools) and restricted flexibility (e.g., inability to compose multiple tools). This work proposes to use executable Python code to consolidate LLM agents' actions into a unified action space (CodeAct). Integrated with a Python interpreter, CodeAct can execute code actions and dynamically revise prior actions or emit new actions upon new observations through multi-turn interactions. Our extensive analysis of 17 LLMs on API-Bank and a newly curated benchmark shows that CodeAct outperforms widely used alternatives (up to 20% higher success rate). The encouraging performance of CodeAct motivates us to build an open-source LLM agent that interacts with environments by executing interpretable code and collaborates with users using natural language. To this end, we collect an instruction-tuning dataset CodeActInstruct that consists of 7k multi-turn interactions using CodeAct. We show that it can be used with existing data to improve models in agent-oriented tasks without compromising their general capability. CodeActAgent, finetuned from Llama2 and Mistral, is integrated with Python interpreter and uniquely tailored to perform sophisticated tasks (e.g., model training) using existing libraries and autonomously self-debug.

The empirical case for code as action space. Executable code gives agents a more expressive and composable action surface than pre-specified JSON or text tool calls — a direct validation of the code-circle idea.
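The contrast can be sketched in a few lines. This is an illustration under assumptions, not CodeAct's implementation: `search`, `fetch`, `run_json_action`, and `run_code_action` are hypothetical names, and the tools return canned strings.

```python
import json


# Hypothetical tools with canned results.
def search(query):
    return [f"{query}-1", f"{query}-2"]


def fetch(doc_id):
    return f"contents of {doc_id}"


TOOLS = {"search": search, "fetch": fetch}


def run_json_action(action_json: str):
    """Pre-specified format: one tool, one call, per action."""
    call = json.loads(action_json)
    return TOOLS[call["tool"]](**call["args"])


def run_code_action(source: str):
    """Code as action: the tools are just names the code can compose."""
    env = dict(TOOLS)
    exec(source, env)
    return env.get("result")


# One JSON action can only search; fetching each hit needs further turns.
json_out = run_json_action('{"tool": "search", "args": {"query": "loom"}}')

# One code action searches and fetches every hit in a single step.
code_out = run_code_action('result = [fetch(d) for d in search("loom")]')

print(json_out)  # ['loom-1', 'loom-2']
print(code_out)  # ['contents of loom-1', 'contents of loom-2']
```

The JSON path needs one model turn per call, while the code path composes both tools in a single action.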

Varda, Kenton, and Sunil Pai. “Code Mode: the better way to use MCP.”

Code Mode: the better way to use MCP blog.cloudflare.com

It turns out we've all been using MCP wrong. Most agents today use MCP by exposing the "tools" directly to the LLM. We tried something different: Convert the MCP tools into a TypeScript API, and then ask an LLM to write code that calls that API. The results are striking.

A production restatement of the same principle. Tools work better when presented to agents as a code-facing API rather than a huge menu of tool schemas.

Zechner, Mario. “What I learned building an opinionated and minimal coding agent.”

What I learned building an opinionated and minimal coding agent mariozechner.at

Lessons I learned while building my own coding agent from scratch.

A good small-system example. Shows the shell-first, low-token, minimal-tool, context-disciplined style of coding agent that makes the broader architectural claims feel concrete.

Weitekamp, Raymond A. “ypi: a recursive coding agent.”

RAW.works - ypi: a recursive coding agent raw.works

A working recursive example. Demonstrates shell-based recursive agent composition rather than describing the idea in abstract terms.

Loom / RL / learning

deepfates. “Cantrip: On summoning entities from language in circles.”

Cantrip: On summoning entities from language in circles | deepfates deepfates.com

With Cantrip, deepfates reimagines the fundamentals of language model agents. Available as a ghost library with generative test specification.

The primary text. I define an agent as an LLM acting in a loop with an environment (the circle). The Loom — the append-only tree of every turn across every run — is simultaneously the debugging trace, the training data, and the replay buffer. I map this vocabulary explicitly onto RL: the circle is the environment, the entity’s code is the action, gate results are observations, threads are trajectories, and the Loom’s branching structure provides exactly the comparative data that methods like GRPO need.
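One way to picture the structure is as a small append-only tree. This is a sketch under assumptions: the `Turn` and `Loom` classes and their fields are hypothetical and are not taken from the Cantrip specification.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Turn:
    id: int
    parent: Optional[int]  # None marks the root of a run
    action: str            # the entity's code
    observation: str       # what the gates returned


class Loom:
    """Append-only: turns are added, never edited or removed."""

    def __init__(self) -> None:
        self._turns: List[Turn] = []

    def append(self, parent: Optional[int], action: str, observation: str) -> int:
        turn = Turn(len(self._turns), parent, action, observation)
        self._turns.append(turn)
        return turn.id

    def thread(self, leaf: int) -> List[Turn]:
        """Walk parent links back to the root: one trajectory."""
        path = []
        node: Optional[int] = leaf
        while node is not None:
            turn = self._turns[node]
            path.append(turn)
            node = turn.parent
        return path[::-1]

    def children(self, parent: int) -> List[int]:
        """Sibling turns: alternative continuations of the same state."""
        return [t.id for t in self._turns if t.parent == parent]


loom = Loom()
root = loom.append(None, "ls()", "data.json")
a = loom.append(root, 'read("data.json")', '{"answer": 42}')
b = loom.append(root, 'read("missing.json")', "error: not found")

print([t.action for t in loom.thread(a)])  # ['ls()', 'read("data.json")']
print(loom.children(root))                 # [1, 2]
```

In this sketch a thread is a root-to-leaf path, and the children of a shared parent are the alternatives that a comparative method would score against each other.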

Moire. “Loom: interface to the multiverse.”

Loom: interface to the multiverse generative.ink

Loom, a tool for generating, navigating and visualizing natural language multiverses

The origin point for the Loom. Moire presents it as an interface for generating, navigating, and visualizing “natural language multiverses” — a branching tree of possible continuations, not a log or memory store. In Cantrip, I take the Loom and make it the durable record of every turn an entity takes across every run: the structure that persists after the entity is gone.

cyborgism.wiki. “Loom” and “Pyloom.”

https://cyborgism.wiki/hypha/loom
https://cyborgism.wiki/hypha/pyloom

The native-context definition and original software implementation lineage around Loom. Useful together because they preserve both the conceptual meaning (an interface to probabilistic generative models for navigating multiverses) and the concrete interface practice, including the Janus/Morpheus origin story, outside later formalizations.

Epoch AI. “An FAQ on Reinforcement Learning Environments.”

An FAQ on Reinforcement Learning Environments epoch.ai

We interviewed 18 people across RL environment startups, neolabs, and frontier labs about the state of the field and where it's headed.

Describes how practitioners actually think about RL environments, tasks, and graders. Keeps my RL reading honest and tied to contemporary usage rather than free association.

Zhang, Jenny, et al. “Darwin Gödel Machine: Open-Ended Evolution of Self-Improving Agents.”

Darwin Godel Machine: Open-Ended Evolution of Self-Improving Agents arxiv.org

Today's AI systems have human-designed, fixed architectures and cannot autonomously and continuously improve themselves. The advance of AI could itself be automated. If done safely, that would accelerate AI development and allow us to reap its benefits much sooner. Meta-learning can automate the discovery of novel algorithms, but is limited by first-order improvements and the human design of a suitable search space. The Gödel machine proposed a theoretical alternative: a self-improving AI that repeatedly modifies itself in a provably beneficial manner. Unfortunately, proving that most changes are net beneficial is impossible in practice. We introduce the Darwin Gödel Machine (DGM), a self-improving system that iteratively modifies its own code (thereby also improving its ability to modify its own codebase) and empirically validates each change using coding benchmarks. Inspired by Darwinian evolution and open-endedness research, the DGM maintains an archive of generated coding agents. It grows the archive by sampling an agent from it and using a foundation model to create a new, interesting, version of the sampled agent. This open-ended exploration forms a growing tree of diverse, high-quality agents and allows the parallel exploration of many different paths through the search space. Empirically, the DGM automatically improves its coding capabilities (e.g., better code editing tools, long-context window management, peer-review mechanisms), increasing performance on SWE-bench from 20.0% to 50.0%, and on Polyglot from 14.2% to 30.7%. Furthermore, the DGM significantly outperforms baselines without self-improvement or open-ended exploration. All experiments were done with safety precautions (e.g., sandboxing, human oversight). The DGM is a significant step toward self-improving AI, capable of gathering its own stepping stones along paths that unfold into endless innovation.

Important for the archive/tree/branching side of the Loom. Its archive of generated agents and its open-ended tree of self-improving trajectories are close formal neighbors to the Loom as a branching training substrate.

Shao, Zhihong, et al. “DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models.”

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models arxiv.org

Mathematical reasoning poses a significant challenge for language models due to its complex and structured nature. In this paper, we introduce DeepSeekMath 7B, which continues pre-training DeepSeek-Coder-Base-v1.5 7B with 120B math-related tokens sourced from Common Crawl, together with natural language and code data. DeepSeekMath 7B has achieved an impressive score of 51.7% on the competition-level MATH benchmark without relying on external toolkits and voting techniques, approaching the performance level of Gemini-Ultra and GPT-4. Self-consistency over 64 samples from DeepSeekMath 7B achieves 60.9% on MATH. The mathematical reasoning capability of DeepSeekMath is attributed to two key factors: First, we harness the significant potential of publicly available web data through a meticulously engineered data selection pipeline. Second, we introduce Group Relative Policy Optimization (GRPO), a variant of Proximal Policy Optimization (PPO), that enhances mathematical reasoning abilities while concurrently optimizing the memory usage of PPO.

Here for one reason: it’s the canonical source for GRPO. The Loom reads more naturally as a comparative structure once you have a concrete reference for relative optimization across grouped rollouts.
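The core of the group-relative idea fits in a few lines: each rollout's reward is normalized against the mean and standard deviation of its own group. The sketch below covers only that advantage step, not the full GRPO objective. The function name, the `eps` term, and the choice of population standard deviation are assumptions made for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Score each rollout relative to the other rollouts in its group.

    Assumptions: population standard deviation, and a small eps so a
    group with identical rewards yields zeros instead of dividing by zero.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four rollouts from the same starting point, e.g. four sibling
# branches under one Loom node, with pass/fail rewards.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print([round(a, 3) for a in advantages])  # [1.0, -1.0, -1.0, 1.0]
```

No separate value model appears anywhere in the calculation; the group itself supplies the baseline, which is why a branching record is a natural source of training signal.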

Sutton, Richard. “The Bitter Lesson.”

https://www.incompleteideas.net/IncIdeas/BitterLesson.html

Not a direct ancestor of my terminology, but a pressure source on the whole formation. Supports the subtractive, capability-first instinct and warns against overdesigned abstractions that hard-code too much human structure.

Amp. “Feedback Loopable.”

Feedback Loopable ampcode.com

How to make hard-to-debug problems debuggable by making them feedback loopable.

A compact articulation of how environment design and fast feedback turn brittle prompting into something adaptive. Treats the loop itself as the load-bearing object, connecting operational practice to RL-flavored intuitions.

Computing substrate

Kay, Alan, and Adele Goldberg. “Personal Dynamic Media.”

https://augmentingcognition.com/assets/Kay1977.pdf

The oldest substrate for the word “medium” in this bibliography. Not a direct precursor to agents, but it roots my insistence that computation is a medium for action and thought in a much older computing tradition.

Raymond, Eric S. The Art of Unix Programming.

https://www.catb.org/~esr/writings/taoup/html

The Unix worldview: composition over monoliths, text and inspectability over opaque abstraction, small tools over giant frameworks. So much of the current shell-first agent formation reads as a return of Unix values under LLM conditions.

Miller, Mark S., Ka-Ping Yee, and Jonathan Shapiro. “Capability Myths Demolished.”

https://papers.agoric.com/assets/pdf/papers/capability-myths-demolished.pdf

A retrospective substrate for wards and structural containment. Explains why restricting ambient authority and passing explicit capabilities is a computing pattern, not a fantasy metaphor.
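A small sketch of the pattern, under stated assumptions: `make_reader` and `word_count` are hypothetical names, and Python closures only approximate the encapsulation a real capability system enforces.

```python
def make_reader(files: dict, allowed):
    """Return a read function that can reach only the named files.

    Whoever holds the returned function holds exactly that authority.
    """
    allowed = frozenset(allowed)

    def read(name: str) -> str:
        if name not in allowed:
            raise PermissionError(f"no capability for {name!r}")
        return files[name]

    return read


def word_count(read, name: str) -> int:
    """Has no ambient file access; it can only use what it is handed."""
    return len(read(name).split())


FILES = {"notes.txt": "circles hold entities", "secrets.txt": "do not read"}
reader = make_reader(FILES, allowed={"notes.txt"})

print(word_count(reader, "notes.txt"))  # 3

try:
    word_count(reader, "secrets.txt")
except PermissionError as err:
    print("denied:", err)
```

The restriction is structural: `word_count` is not told to avoid `secrets.txt`; it simply never receives a reference that can reach it.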

Emmerling, Jakob. “FUSE is All You Need – Giving agents access to anything via filesystems.”

GitHub - Jakob-em/agent-fuse: Example repo showcasing how to represent anything as a filesystem to an agent via FUSE github.com

Example repo showcasing how to represent anything as a filesystem to an agent via FUSE - Jakob-em/agent-fuse

Here for the filesystem-as-medium idea. Many agent interfaces collapse into a filesystem abstraction — a Unix-native expression of the medium concept.

Operational convergence

Anthropic. “Building Effective AI Agents.”

Building Effective AI Agents anthropic.com

Discover how Anthropic approaches the development of reliable AI agents. Learn about our research on agent capabilities, safety considerations, and technical framework for building trustworthy AI.

A clear industry expression of the thin-harness, simple-patterns, compositional view. Doesn’t share my vocabulary but independently converges on many of the same operational assumptions.

Anthropic. “Effective harnesses for long-running agents.”

Effective harnesses for long-running agents anthropic.com

Focused on continuity across context windows. Makes artifact handoff, persistence, and long-horizon coordination explicit rather than treating them as implementation details.

OpenAI. “Harness engineering: leveraging Codex in an agent-first world.”

https://openai.com/index/harness-engineering

A direct statement that the harness is becoming a first-class engineering discipline. “The loop around the model” is now the main work.

OpenAI. “Unrolling the Codex agent loop.”

https://openai.com/index/unrolling-the-codex-agent-loop

A lab-side decomposition of the core coding-agent loop into repeated inference and tool use. Makes loop structure, compaction, and iteration visible in plain operational terms.
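The basic shape of such a loop can be sketched as follows. This is a generic illustration, not the Codex implementation: `run_loop`, `scripted_model`, and the step dictionary format are hypothetical, the model is a scripted stand-in for an LLM call, and compaction is left out.

```python
def run_loop(model, tools, task, max_turns=10):
    """Alternate inference and tool use until the model gives a final answer."""
    history = [("user", task)]
    for _ in range(max_turns):
        step = model(history)                         # inference
        if step["type"] == "final":
            return step["content"], history
        result = tools[step["tool"]](*step["args"])   # tool use
        history.append(("tool_call", step))
        history.append(("tool_result", result))
    return None, history  # turn budget exhausted


def scripted_model(history):
    """Hypothetical stand-in for an LLM: call a tool once, then answer."""
    if not any(kind == "tool_result" for kind, _ in history):
        return {"type": "tool", "tool": "add", "args": (2, 3)}
    return {"type": "final", "content": f"The sum is {history[-1][1]}"}


answer, history = run_loop(scripted_model, {"add": lambda a, b: a + b}, "Add 2 and 3")
print(answer)        # The sum is 5
print(len(history))  # 3
```

Everything the model knows about the run lives in `history`, which is why long runs eventually need compaction.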

OpenAI. “Shell + Skills + Compaction: Tips for long-running agents that do real work.”

Shell + Skills + Compaction: Tips for long-running agents that do real work | OpenAI Developers developers.openai.com

Practical patterns for building with skills, hosted shell, and server-side compaction in the Responses API.

A specific source for the shell/skills/compaction stack. Shows that reusable skill bundles and explicit compaction are a recognizable design pattern, not just folk practice.

Cursor. “Towards self-driving codebases.”

Towards self-driving codebases · Cursor cursor.com

We're making a part of our multi-agent research harness available to try today in preview.

A large-scale example of the multi-agent, long-running, repository-centered version of this formation. Shows these ideas aren’t confined to toy agents or essays.

StrongDM. “The StrongDM Software Factory: Building Software with AI.”

The StrongDM Software Factory: Building Software with AI discover.strongdm.com

Inside the StrongDM Software Factory: a new approach to building software with AI using agent-driven execution, scenario-based validation, and digital twin systems—where validation replaces code review.

The “spec persists, implementation regenerates” idea given industrial form. An example of the ghost-library pattern — specification as the durable artifact, implementation code as ephemeral and regenerable — that runs through the broader formation.

Biilmann, Mathias. “Introducing AX: Why Agent Experience Matters.”

Introducing AX: Why Agent Experience Matters biilmann.blog

As builders, we need to start focusing on AX or “agent experience” — the holistic experience AI Agents will have as the user of a product or platform.

The main source for AX as a product and platform problem. Moves the frame from building agents to designing software ecosystems that agents can inhabit well.

Snyder, Josh Bleecher. “AX: Agent Experience.”

AX: Agent Experience sketch.dev

AX is UX, but for AI agents

Shows AX isn’t just Biilmann’s coinage but part of a broader design discourse. Grounds agent-facing design in concrete concerns: prompts, tools, handoffs, interaction timing.

Yang, John, et al. “SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering.”

SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering arxiv.org

Language model (LM) agents are increasingly being used to automate complicated tasks in digital environments. Just as humans benefit from powerful software applications, such as integrated development environments, for complex tasks like software engineering, we posit that LM agents represent a new category of end users with their own needs and abilities, and would benefit from specially-built interfaces to the software they use. We investigate how interface design affects the performance of language model agents. As a result of this exploration, we introduce SWE-agent: a system that facilitates LM agents to autonomously use computers to solve software engineering tasks. SWE-agent's custom agent-computer interface (ACI) significantly enhances an agent's ability to create and edit code files, navigate entire repositories, and execute tests and other programs. We evaluate SWE-agent on SWE-bench and HumanEvalFix, achieving state-of-the-art performance on both with a pass@1 rate of 12.5% and 87.7%, respectively, far exceeding the previous state-of-the-art achieved with non-interactive LMs. Finally, we provide insight on how the design of the ACI can impact agents' behavior and performance.

A bridge between interface design and later AX work. Argues that language models are a new kind of end user, and that interface quality materially affects agent performance.

Breunig, Drew. “How to Fix Your Context.”

How to Fix Your Context dbreunig.com

6 tactics for fixing your context and shipping better agents. As Karpathy says, building LLM-powered apps means learning to ‘pack the context windows just right’—smartly deploying tools, managing information, and maintaining context hygiene.

A practical counterweight to the more architectural pieces. Makes context management a central engineering problem in its own right.

Breunig, Drew. “A Software Library with No Code.”

A Software Library with No Code dbreunig.com

Do we still need libraries of 3rd party code when AI agents are this good?

An independent articulation of the ghost-library form. Makes “specs and tests as the shipped product” legible outside my world.

Fly.io. “Unfortunately, Sprites Now Speak MCP.”

Unfortunately, Sprites Now Speak MCP fly.io

Here less for MCP specifically than for the broader point: agent infrastructure is being shaped around durable process habitats, discoverable APIs, and small composable interfaces.

Zunic, Gregor. “The Bitter Lesson of Agent Frameworks.”

The Bitter Lesson of Agent Frameworks browser-use.com

All the value is in the RL'd model, not your 10,000 lines of abstractions.

A modern restatement of Sutton in agent terms. Captures the backlash against overly abstracted agent middleware and defends the loop as the real unit.

Terminus.

Terminal-Bench tbench.ai

A benchmark for terminal agents

A terminal-native benchmark. Treats persistent terminal control via tmux as the neutral interface for evaluating agent capability, rather than assuming a bespoke tool menu.

Symbolic / worldview

deepfates. “AI as planar binding.”

AI as planar binding | deepfates deepfates.com

The command line is an incantation. Programming languages are magic words. The GUI is a somatic space, where you can use careful gestures

My seed text for the circle/ward/containment vocabulary. The central move: AI interfaces require structural containment, not just better instructions. That commitment carries directly into Cantrip.

deepfates. “The mirror of language.”

The mirror of language | deepfates deepfates.com

Mirror worlds. The new generative AI models like GPT-3 and DALL-E have amazing powers

Where I develop the simulator-style framing, the magical taxonomy of prompting, and the idea that precise naming is itself part of practice. Also explains why terms like “crystal” (the model itself, named for what it does rather than what it is) are operative, not ornamental.

Prompt / program lineage

Moire. “Methods of prompt programming.”

Methods of prompt programming generative.ink

"Like programming, but more fluid. You're not programming a computer, you're writing reality. It's strange. It's always different. It's never the same twice."

An early and explicit treatment of promptcraft as procedural method rather than ad hoc prose. Connects the Janus/Moire lineage to later skill systems and program-like prompt design.

Khattab, Omar, et al. DSPy.

DSPy dspy.ai

The framework for programming—rather than prompting—language models.

DSPy treats prompts as modules with typed signatures, optimizes them against metrics rather than hand-tuning them, and composes them into pipelines — the same move from craft to engineering that Cantrip makes for agent loops. Where Cantrip says “the cantrip is a value, not a running process,” DSPy says “the prompt is a program, not a string.”