Agents with Senses: OpenAI's DevDay
Remember Intel's famous tick-tock model? New microarchitectures ("tocks") alternated with process refinements ("ticks"). While the "ticks" seemed minor, they drove rapid progress. I see OpenAI's DevDay as their "tick" – a slew of developer-facing changes that will ripple through applications in the coming months.
Here's my quick recap of the announcements and initial impressions.
Realtime API
OpenAI's new Realtime API is a game-changer. It lets developers create their own versions of OpenAI's Advanced Voice Mode, enabling low-latency voice interactions. The possibilities are vast – from customer support to AI tutors. But what caught my eye was the pricing:
The token-based rates work out to approximately $0.06 per minute of audio input and $0.24 per minute of audio output.
Let's crunch the numbers: at ~$15 per hour (based on output pricing), it's already competitive with many call center roles. Having tinkered with advanced voice mode, I can attest to its impressive capabilities. The Realtime API's support for function-calling is the cherry on top – it allows for dynamic information retrieval and action-taking based on conversation flow. We're one step closer to building assistant experiences straight out of Her.
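Here's that back-of-envelope math as a tiny script, using the per-minute rates quoted above; the usage splits are assumptions for illustration, not real traffic patterns.

```python
# Back-of-envelope cost math for the Realtime API, using the per-minute
# rates quoted above (~$0.06/min audio in, ~$0.24/min audio out).
AUDIO_IN_PER_MIN = 0.06   # USD per minute of audio input
AUDIO_OUT_PER_MIN = 0.24  # USD per minute of audio output

def hourly_cost(input_minutes: float, output_minutes: float) -> float:
    """Cost of one hour of conversation, split between listening and speaking."""
    return input_minutes * AUDIO_IN_PER_MIN + output_minutes * AUDIO_OUT_PER_MIN

# Worst case: the assistant talks for the entire hour.
print(f"${hourly_cost(0, 60):.2f} per hour")   # $14.40 per hour -> the "~$15" figure
# A (hypothetical) 50/50 split between listening and speaking.
print(f"${hourly_cost(30, 30):.2f} per hour")  # $9.00 per hour
```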
Prompt Caching
OpenAI has joined the prompt caching party, following Anthropic and Google's Gemini. Their implementation stands out for its automatic nature – no explicit requests needed. It's an Apple-esque move, prioritizing simplicity for developers. However, the cost savings (50%) pale in comparison to Anthropic's "up to ~90%", and latency improvements remain unclear. Let's break it down:
OpenAI: Automatic, 50% cost savings, unspecified latency improvements
Anthropic: Explicit requests, variable savings (50-90%)
Google: Most flexible, with configurable TTL for cached prompts
While I'd love to see configurable TTL across the board, flexibility is tricky to balance with simplicity.
I expect the proliferation of prompt caching to have substantial but subtle downstream effects:
UX is going to improve
There'll be a wider proliferation of "chat with data" use cases, and of AI features more broadly.
Prompt caching is set to revolutionize AI app UX through reduced latency. Picture this: an entire book cached, ready for instant Q&A. While OpenAI's specific latency improvements are yet to be quantified, this addresses a key pain point in AI-native apps.
The cost reduction from prompt caching opens doors for more "chat with data" features across applications. While RAG with dynamic data might not benefit, simpler implementations – where document corpora are injected into prompts – are ripe for caching. As context windows expand, it's becoming increasingly cost-effective to load extensive context into a single prompt.
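To make that concrete, here's a minimal sketch of the "document injected into the prompt" pattern, assuming the standard Chat Completions API and a local book.txt. Because OpenAI's caching is automatic, the only real design decision is ordering: the stable document goes first and the per-request question last, so repeated calls can reuse the cached prefix (OpenAI's launch notes indicate prefixes beyond roughly 1,024 tokens are eligible).

```python
# Minimal sketch: "chat with a document" where the document is a long, stable
# prefix. OpenAI's prompt caching is automatic, so the only design choice is
# putting the unchanging content first and the per-request question last.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

with open("book.txt") as f:  # hypothetical corpus to chat with
    book = f.read()

def ask(question: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            # Stable prefix: identical across calls, so it can be cached.
            {"role": "system", "content": f"Answer questions about this book:\n\n{book}"},
            # Variable suffix: changes per request, falls outside the cached prefix.
            {"role": "user", "content": question},
        ],
    )
    return response.choices[0].message.content

print(ask("Who is the narrator?"))
print(ask("Summarize chapter 3."))  # second call should reuse the cached prefix
```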
Prompt caching isn't just a minor upgrade – it's a quality-of-life improvement that could dramatically expand the horizons of what developers can create.
Model Distillation
Model distillation is a technique in machine learning where knowledge from a larger, more complex model (often called the 'teacher' model) is transferred to a smaller, more efficient model (the 'student' model). This is typically done by training the student model to mimic the outputs or intermediate representations of the teacher model, rather than training directly on the original dataset. Think of it as a master chef (the larger model) teaching a novice cook (the smaller model) to prepare gourmet dishes by demonstrating techniques and sharing insights, rather than just handing over a recipe book.
While developers have been distilling models for a while, OpenAI's latest announcement streamlines the process by integrating multiple steps – including automatic prompt saving and evals for fine-tuning smaller models – directly into their platform. This integration isn't just convenient; it's a catalyst for rapid iteration and experimentation.
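For readers who haven't done this by hand, here's a rough sketch of the manual distillation loop that OpenAI is now folding into its platform: collect a teacher model's outputs, write them into a fine-tuning file, and train a smaller student on them. The model names, prompts, and file paths are illustrative, not prescriptive.

```python
# Rough sketch of model distillation via the OpenAI API: use a large "teacher"
# model to label prompts, then fine-tune a smaller "student" on those labels.
# Model names and prompts are illustrative.
import json
from openai import OpenAI

client = OpenAI()

prompts = [
    "Classify the sentiment of: 'The battery died after two hours.'",
    "Classify the sentiment of: 'Setup took thirty seconds. Flawless.'",
]  # in practice you'd collect far more examples than this

# 1. Generate teacher outputs and store each pair in the chat fine-tuning format.
with open("distillation.jsonl", "w") as f:
    for prompt in prompts:
        teacher = client.chat.completions.create(
            model="gpt-4o",  # teacher: larger, more capable model
            messages=[{"role": "user", "content": prompt}],
        )
        answer = teacher.choices[0].message.content
        f.write(json.dumps({"messages": [
            {"role": "user", "content": prompt},
            {"role": "assistant", "content": answer},
        ]}) + "\n")

# 2. Fine-tune the student on the teacher's outputs.
training_file = client.files.create(file=open("distillation.jsonl", "rb"), purpose="fine-tune")
job = client.fine_tuning.jobs.create(training_file=training_file.id, model="gpt-4o-mini")
print(job.id)
```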
This addition implies a few things:
OpenAI is continuing to differentiate itself not just as a model provider, but as a developer platform.
As a consequence of 1, many AI infra startups are in trouble. There are still use cases that necessitate this tooling outside of the model providers, but the bar has been raised.
As LLMs become commodities, OpenAI is pivoting towards becoming a full-fledged developer platform. This strategy echoes the early 2010s cloud evolution:
OpenAI's journey:
2023: Launched Assistants API, offering higher-level abstractions
Now: Integrating model distillation and fine-tuning
The parallel with AWS's evolution:
Started with basic services (blob storage, virtual servers)
Moved "up the stack" with managed services like RDS
The rationale? Just as cloud services became commoditized in the 2010s, LLMs are following suit in the 2020s. OpenAI's platform approach not only differentiates their offering but also increases stickiness and workload share. The Assistants API, while not as flexible as custom solutions, opens the LLM playground to a broader developer base.
OpenAI's developer platform endangers AI infrastructure startups that are superficial wrappers on top of the model providers. It's the classic case of bundling and defaults. Why would a developer or organization choose an external vendor for fine-tuning and distillation, when they get it all for "free" with OpenAI?
I don't think the cost-benefit analysis is universally in OpenAI's favor, but the inclusion of this suite of features raises the bar for startups and the completeness of their offerings. One obvious reason developers will continue to use external vendors for this kind of work is to avoid vendor lock-in, and have uniform tooling for their portfolio of models, since advanced AI applications typically use more than one model provider.
There's yet another cloud analogy: as multiple cloud providers rose to prominence and the vision of the cloud came to fruition, tools like Terraform enabled teams to seamlessly manage multi-cloud deployments. But Terraform made a name for itself by creating substantial value beyond simply wrapping the cloud providers. I expect something similar will happen with AI infrastructure startups: the best ones will deliver on a vision so compelling that you'll use them because you want to, not just to avoid vendor lock-in.
Vision Fine-Tuning
OpenAI also announced that it's now possible to fine-tune GPT-4o on both text and images. My excitement for vision fine-tuning is second only to the Realtime API because it will enable substantially better performance on vision-related tasks. OpenAI's models (and other leading LLMs) were already remarkably good at computer vision tasks out of the box, without any custom training data. With vision fine-tuning, developers now have a clear path to improving performance on these tasks.
Some use-cases I suspect will take off:
Better structured extraction from documents.
Automation of (currently) manual data labeling tasks, which facilitates training smaller, more efficient models.
UI automation (browser + desktop) because the model can be trained on specific sites + apps.
I think vision fine-tuning will essentially "unlock" vision use-cases, the same way GPT-4 was enough of a leap from GPT-3.5 that it enabled a litany of new applications. What's even more remarkable is that OpenAI suggests performance on many tasks can improve substantially with as few as 100 samples!
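For a sense of what the training data looks like: vision fine-tuning uses the same chat-format JSONL as text fine-tuning, with images supplied as image_url content parts. The sketch below builds one example for a hypothetical invoice-extraction task; the URL and target fields are made up.

```python
# Sketch of a single vision fine-tuning example for structured extraction
# from an invoice image. The image URL and target JSON are illustrative.
import json

example = {
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Extract the vendor, date, and total from this invoice as JSON."},
                {"type": "image_url", "image_url": {"url": "https://example.com/invoices/0001.png"}},
            ],
        },
        {
            "role": "assistant",
            "content": json.dumps({"vendor": "Acme Corp", "date": "2024-09-30", "total": "142.50"}),
        },
    ]
}

# One JSON object per line in the training file.
with open("vision_finetune.jsonl", "a") as f:
    f.write(json.dumps(example) + "\n")
```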
Conclusion
Phew! We've covered a lot, but there's even more: free limited-time fine-tuning (clever customer acquisition!), the summer's cheaper model becoming the default, and a slew of new developer tools in the OpenAI playground.
Can you feel my excitement? These updates are transformative. The Realtime API and vision fine-tuning pave the way for agents with voice, hearing, and sight. As you scale, model distillation and prompt caching offer powerful optimization tools. Buckle up – we're in for an exhilarating season of AI innovation 🤓.