Architecting the Future of AI Agents
Rethinking AI Agent Design with O1-Powered Reasoning at its Heart
I've been exploring OpenAI's o1 over the last couple of weeks, particularly its potential for reasoning. o1 takes a variable amount of time to "think" through problems, making it well-suited for complex reasoning tasks. In my last post, I delved into how this represents system 2 thinking - deliberative and time-consuming.
One consequence of this idea is that it enables more interesting architectures for agents. I also covered agents in my last post, but to recap:
An AI agent is software that can perform tasks or make decisions on behalf of a user, similar to how a person might delegate responsibilities to a trusted representative.
Now, let's explore what a software architecture might look like with o1 at an agent's core. Buckle up!
Components
In brief, my proposed AI agent architecture would consist of the following components:
A preamble prompt that summarizes the agent's purpose, the user's goals, and relevant interactions.
A reasoning engine, which would be built around a model like o1.
Tools/functions for task-specific functionality.
A customer experience model, whose entire purpose is to translate the agent's thoughts, actions and outputs to the user in an intelligible and personable way.
General purpose memory that the agent, tools and other components can use to query relevant information.
Let's double-click into each component to understand how each fits into the agent.
Preamble Prompt
An agent frequently needs to recall several key pieces of information for almost any user interaction: its purpose, the user's details, and a summary of past interactions.
Let's start with purpose: role-playing is a powerful prompting technique crucial for an agent to understand its role. Without it, an agent defaults to a generic personality dependent on the model provider (e.g., generic ChatGPT for OpenAI's APIs). Technically, role-playing serves as contextual priming for LLMs, activating specific knowledge domains and behavioral patterns. By specifying a role, we narrow the model's vast knowledge base to a specific context, potentially improving response efficiency and accuracy.
Next, let's consider the user. At a basic level, an agent needs to know a user's name and other useful identifiers (e.g., IDs). Depending on the application, this could extend to include a comprehensive user profile.
The final part of the preamble prompt is more fluid: a summary of interactions. Think of this as short-term working memory. Just as you might not remember every lunch with a colleague but recall the recent ones, this summary is application-specific and requires experimentation. While not strictly necessary, it's worth considering what frequently-accessed data could benefit the agent.
This information is a preamble because it should precede any prompt, leveraging the model's tendency to pay closest attention to the beginning and end of data.
At the end of the day, the preamble prompt is just a dynamically generated text string; I'm not proposing anything very sophisticated. But it's useful because you'll need the same basic piece of context over and over again. Rather than having every component re-implement it, it pays dividends to design this piece upfront.
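To make this concrete, here's a minimal sketch of what that dynamic assembly might look like. Everything here (the build_preamble helper, the profile fields, the example values) is hypothetical; the point is simply that every component reads from one shared function.

```python
from dataclasses import dataclass, field


@dataclass
class UserProfile:
    name: str
    user_id: str
    preferences: list[str] = field(default_factory=list)


def build_preamble(purpose: str, user: UserProfile, recent_summary: str) -> str:
    """Assemble the preamble that precedes every prompt the agent sends to the model."""
    preference_text = ", ".join(user.preferences) or "none recorded"
    return (
        f"You are {purpose}.\n"
        f"You are assisting {user.name} (id: {user.user_id}). "
        f"Known preferences: {preference_text}.\n"
        f"Summary of recent interactions: {recent_summary}"
    )


# Example usage with made-up values:
preamble = build_preamble(
    purpose="a nutrition-tracking assistant",
    user=UserProfile(name="Alex", user_id="u_123", preferences=["low sugar"]),
    recent_summary="Logged breakfast yesterday; asked about protein goals.",
)
```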
Reasoning Engine
Now let's circle back to my recent obsession: o1 as a reasoning engine. I conceptualize a reasoning engine as a model capable of routing requests to the appropriate tools with the appropriate parameters. Simple enough? Well, in practice it isn't. I've previously leveraged function/tool calling in OpenAI's models for this purpose, but I've found the performance lacking, mainly because calling the right tools at the right time requires a fair amount of reasoning over context.
Let me illustrate with a real-world example. Recently, while ordering chicken and waffles for breakfast, I considered adding lemonade. I inquired about it, and the cashier added it to my order without my knowledge. To correct this, I said, "I don't want both... just the chicken and waffles." Surprisingly, this was misinterpreted as keeping only the lemonade (🤦🏽).
This mixup was the result of a few things:
Long-lived context. The interaction spanned several messages, and the meaning of later, partial messages was influenced by earlier ones.
An understanding of preference. My hesitancy around adding the lemonade should have provided context around what I wanted to keep in the order.
While o1 isn't quite ready to man a cash register at your local eatery, I believe its reasoning capabilities bring us much closer to "getting it right" in multi-step and ambiguous situations.
Note: o1 currently lacks native function/tool calling, but as the model evolves, this capability seems inevitable. Furthermore, function/tool calling is not strictly necessary; more often than not, it's a developer convenience (a much-welcomed one at that!).
TLDR: a model like o1 serves as both the brain and air traffic control for an agent.
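To sketch what this routing could look like in practice, here's a toy version where the reasoning model is prompted to emit a JSON tool decision that we dispatch ourselves (handy today, given the note above about o1 lacking native tool calling). The tool names, the routing instructions, and the call_reasoning_model wrapper are all stand-ins, not a real API:

```python
import json

# A registry of tools: plain Python callables keyed by name (all made up).
TOOLS = {
    "update_order": lambda items: f"Order updated to: {', '.join(items)}",
    "answer_question": lambda question: f"(answering) {question}",
}

ROUTING_INSTRUCTIONS = (
    "Decide which tool to call next. Respond with JSON only, for example: "
    '{"tool": "update_order", "arguments": {"items": ["chicken and waffles"]}}'
)


def call_reasoning_model(prompt: str, conversation: str) -> str:
    """Stand-in for a call to a reasoning model like o1; returns the model's text."""
    # A real implementation would hit the provider's API; here we fake the decision.
    return '{"tool": "update_order", "arguments": {"items": ["chicken and waffles"]}}'


def route(preamble: str, conversation: str) -> str:
    # Ask the reasoning model which tool to call, then dispatch it ourselves.
    decision = json.loads(
        call_reasoning_model(preamble + "\n" + ROUTING_INSTRUCTIONS, conversation)
    )
    tool = TOOLS[decision["tool"]]
    return tool(**decision["arguments"])


print(route("You are an ordering assistant.", "I don't want both... just the chicken and waffles."))
```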
Tools
An agent isn't very interesting if it can't do stuff. That's where I imagine tools coming in. Tools serve as a very broad label for subroutines that do stuff. A tool can be as simple as a function call, or as complex as a separate RAG pipeline with several inference steps. It's a very broad spectrum. I'm excited to see that tool interfaces are standardizing around JSON schema specifications, with OpenAI and Anthropic, the leading model providers, converging on essentially the same interface. Standardization like this will make it easier to share best practices and help AI application developers hit the ground running. It also makes switching providers easier. What's not to love?
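For a taste of what that standardization looks like, here's an illustrative tool definition in the JSON-schema style. This one follows OpenAI's function-calling shape (Anthropic's differs slightly in field names), and the log_meal tool itself is made up:

```python
# A hypothetical tool definition, expressed as a JSON-schema-style spec.
log_meal_tool = {
    "type": "function",
    "function": {
        "name": "log_meal",
        "description": "Log a meal with estimated macronutrients for the current user.",
        "parameters": {
            "type": "object",
            "properties": {
                "description": {"type": "string", "description": "What the user ate."},
                "carbs_g": {"type": "number"},
                "protein_g": {"type": "number"},
                "fat_g": {"type": "number"},
            },
            "required": ["description", "carbs_g", "protein_g", "fat_g"],
        },
    },
}
```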
When designing tools I think about a few characteristics. First, tools should follow the Unix Philosophy, namely:
Make each program do one thing well.
AI app development is challenging enough, and tools offer a great way to manage overall complexity. Moreover, I believe tools are the most impactful component for introducing determinism into an agent. LLMs are inherently probabilistic, flipping traditional programming on its head: the same input can lead to different outputs. Tools provide an opportunity to inject more determinism through entirely deterministic code. However, this isn't guaranteed, as tools can themselves call LLMs and other models for additional inference.
Finally, tools are yet another opportunity to imbue agents with system 2, deliberative thinking. When leveraging LLMs within tools, it's advantageous to break complex steps into multi-step RAG pipelines, essentially affording any underlying models more time to "think". For example, suppose you wanted to use an LLM with vision capabilities to estimate the macronutrient content of a meal from a picture (ahem). Instead of performing this as a single step, such a tool could be decomposed into the following:
Determine whether the picture consists of food in the first place
Identify the foods in the picture
Identify the portion sizes in the picture
For each food and portion size estimate, estimate macronutrient content
Combine the results for the entire meal
By breaking the task down, the model essentially has more time to "think", à la o1, which leads to better quality results.
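Here's a rough sketch of that decomposition as code, with each step as its own model call. The vision_model helper is a stand-in for whatever vision-capable model you'd actually wire up:

```python
def vision_model(prompt: str, image: bytes) -> str:
    """Stand-in for a call to a vision-capable LLM; returns the model's text."""
    raise NotImplementedError("wire up your provider's vision API here")


def estimate_meal_macros(image: bytes) -> str:
    # Step 1: is there even food in the picture?
    is_food = vision_model("Does this image contain food? Answer yes or no.", image)
    if is_food.strip().lower().startswith("no"):
        return "No food detected in the image."

    # Steps 2 and 3: identify the foods and their portion sizes.
    foods = vision_model("List each food item visible in the image.", image)
    portions = vision_model(f"Estimate a portion size for each of these items: {foods}", image)

    # Step 4: estimate macros for each food/portion pair.
    per_item = vision_model(
        f"For each item and portion below, estimate carbs, protein, and fat in grams:\n{portions}",
        image,
    )

    # Step 5: combine the per-item estimates into totals for the whole meal.
    return vision_model(f"Sum these per-item estimates into totals for the meal:\n{per_item}", image)
```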
The design space for tools is very big, and tools are a pillar of this proposed architecture. As such, I believe this component has the most room for creative software design and engineering.
CX model
So far, we've covered fairly technical, behind-the-scenes aspects of an overall agent architecture. Most agents will need to communicate with humans, whether internal or external to an organization. Human communication differs significantly from inter-component communication, even when both use natural language. For instance, communication between the food logging tool and the reasoning engine might be programmed as: "A sandwich with 40g carbs, 20g protein and 10g fat was logged on behalf of the user". This terse, impersonal message suffices for software components that understand human language, like the reasoning engine.
When communicating with a person, the requirements are likely to be different:
You likely want to ensure the response matches your brand voice. You do not want to sound like a generic ChatGPT wrapper.
You likely want the model to be more affable. While people appreciate brevity, they also appreciate feeling heard.
On a technical level, you may want a model that responds fast. Once the reasoning engine and tools have finished their work, you don't want the final response to become a bottleneck. Relatedly, this is an opportunity for cost optimization: this model can be a smaller, fine-tuned variant. Think: a fine-tuned gpt-4o-mini rather than gpt-4o.
While I believe a fine-tuned small model is best, I also think separating this into a separate component allows developers to leverage models that are better communicators out-of-the-box. Personally, I enjoy and prefer using Claude for writing, and would consider using it here if a fine-tuned model is not initially feasible.
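As a rough sketch of that hand-off: the terse internal message from the reasoning engine gets rewritten by a small, fast model carrying brand-voice instructions before it reaches the user. The prompt and the small_model wrapper below are assumptions, not a specific provider API:

```python
BRAND_VOICE = (
    "You are the friendly voice of our nutrition app. Rewrite internal status "
    "messages for the user: warm, concise, first person, no jargon."
)


def small_model(system_prompt: str, message: str) -> str:
    """Stand-in for a fast, fine-tuned model (think: gpt-4o-mini)."""
    raise NotImplementedError("call your provider's chat API here")


def to_user(internal_message: str) -> str:
    # "A sandwich with 40g carbs, 20g protein and 10g fat was logged on behalf of the user"
    # might come back as: "Got it! I logged your sandwich: roughly 40g carbs, 20g protein, 10g fat."
    return small_model(BRAND_VOICE, internal_message)
```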
For now, I'm calling this the "customer experience model", but I'm open to more catchy names, especially since it's not exclusive to customers - any human interaction could benefit from such a model. Got any ideas? I'm all ears!
Memory
This is where things get a bit hazy (pun intended!). While memory is intuitively core to any AI application or agent, it's currently the most ill-defined and least understood component. A slew of startups are attempting to build this as a separate service, with varying degrees of success.
In my experience, a good memory component would accomplish the following:
Allow "querying" for disparate facts. Querying can be highly structured (think: SQL) or unstructured (eg "what important things do I know about this user?"). Folks are using knowledge graphs to bridge the gap between structured and unstructured data. It remains to be seen if this is a viable long-term solution.
It should be usable by all components, as presumably all components benefit from contextual understanding of what's going on.
As alluded to in "querying", this component would likely be backed by several databases:
Potentially a knowledge graph, like Neo4j, for more flexible querying of facts.
Example Query: "Who are known associates of this user?"
A relational database like PostgreSQL for highly structured querying.
Example Query: "When was the last time this user logged in?"
A vector database for semantic searches.
Example Query: "The user asked about the health of their product line, retrieve all documents related to sales and revenue."
Memory is also the subject of a lot of interesting research, some of which at least tries to emulate human memory. Most notably, Generative Agents: Interactive Simulacra of Human Behavior discusses a novel architecture that compresses memories into higher-level reflections and applies a time decay when retrieving them. In brief:
Generative agents are AI characters that can simulate believable human behavior in interactive environments. The key to their abilities is a novel memory architecture that stores experiences as natural language, retrieves relevant memories, and synthesizes them into higher-level reflections to guide behavior. This allows the agents to form relationships, spread information, and coordinate activities in emergent ways. While not perfect, the architecture enables more coherent and context-aware behavior compared to previous approaches, opening up new possibilities for interactive AI characters in games, simulations, and other applications.
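For a flavor of the retrieval idea described in the paper, here's a tiny sketch that scores each memory by a mix of recency (with exponential time decay), importance, and relevance. The weights and decay rate below are made up for illustration, not the paper's exact values:

```python
import math
import time


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def memory_score(memory, query_embedding, now=None, decay_per_hour=0.99):
    """Score a stored memory by recency, importance, and relevance (equal weights here)."""
    now = now or time.time()
    hours_since_access = (now - memory["last_accessed"]) / 3600
    recency = decay_per_hour ** hours_since_access   # exponential time decay
    importance = memory["importance"] / 10           # e.g. rated 1-10 by a model
    relevance = cosine_similarity(memory["embedding"], query_embedding)
    return recency + importance + relevance
```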
Memory is a domain I'm watching closely, as it holds the potential to be just as transformative for AI agents as advanced reasoning capabilities.
Putting it all together
To summarize, the proposed architecture ties the preamble prompt, reasoning engine, tools, CX model, and memory together:
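In lieu of a diagram, here's the shape of the main loop in rough Python. Every name here is illustrative (build_preamble is the helper sketched earlier), so treat it as a sketch rather than a spec:

```python
def handle_message(user, user_message, memory, tools, reasoning_model, cx_model):
    # 1. Preamble prompt: the agent's purpose, the user's details, recent interactions.
    preamble = build_preamble(
        "a nutrition-tracking assistant", user, memory.recent_summary(user)
    )

    # 2. Reasoning engine: decide which tool(s) to call, and with what arguments.
    plan = reasoning_model.decide(preamble, user_message, tools)

    # 3. Tools: execute the chosen subroutines (which may query memory or call models).
    results = [tools[step.name](**step.arguments) for step in plan]

    # 4. Memory: record what happened so future preambles can summarize it.
    memory.record(user, user_message, results)

    # 5. CX model: translate the internal results into a personable reply.
    return cx_model.to_user(results)
```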
Wrapping Up: Where Do We Go From Here?
Let's bring this home. We've just taken a whirlwind tour of a potential AI agent architecture, from preamble prompts to reasoning engines and everything in between. But we're still feeling our way through the dark. This isn't the definitive blueprint for AI agent design - far from it. We're in the "throw spaghetti at the wall and see what sticks" phase, and that's thrilling.
This reminds me of the wild west days of web development. Remember when we were all writing PHP spaghetti code, mixing business logic and presentation with reckless abandon? Then frameworks like Rails and Django swooped in, making MVC the new standard. They didn't invent MVC, but they certainly democratized it.
I can't help but wonder if we're on the brink of a similar watershed moment for AI agents. Perhaps this architecture, or something like it, will bring some order to the chaos. Or maybe it'll be something entirely different that we haven't even conceived yet. That's the thrill of innovation, isn't it?
One thing's certain: the AI landscape is evolving at breakneck speed. Today's cutting-edge could be tomorrow's old news. So let's stay flexible, keep experimenting, keep building, and above all, keep sharing our insights.
Now, I'm eager to hear your thoughts. Have you ventured into building AI agents? What triumphs or spectacular failures have you encountered? Does this architecture seem promising, or am I way off base? Share your insights in the comments - let's spark a discussion. After all, collective wisdom is how we'll navigate this new frontier.