A bird's-eye view of the competitive landscape in AI, part 1
Or why it's such an exciting time to build up and down the stack
As a technologist excited about the promise of AI, I found my initial enthusiasm tempered by a few nagging concerns. If AI represents a platform shift in tech, will everyone be left paying an OpenAI tax? Or an Nvidia tax? Is there any differentiation when everyone is using the same models? As I dug into these questions, I found that the AI ecosystem is extremely competitive, and there's no reason to believe that's going to change.
In this first part, I set the table by discussing the hardware and infrastructure inputs that are necessary for models. In the next part, I’ll dive into the landscape at a higher level.
the decamillionaire gorilla in the room
The immediate concern around the centralization of value capture in AI stems from the fact that training large state-of-the-art (SOTA) models is a costly endeavor that favors scale and incumbents. Let's jump into a few examples.
Training GPT-3 (one of ChatGPT's ancestors) likely cost OpenAI around $12 million.
Google's PaLM, another large language model (LLM), is estimated to have cost between $10 million and $20 million to train.
Examples of similar costs abound for LLMs; please refer to this thorough breakdown.
This is not limited to LLMs. For example, Stable Diffusion initially cost ~$600k to train.
Of course, these figures gloss over a number of nuances: developing these models incurs other costs like SG&A and R&D, and large clients typically negotiate preferential pricing with cloud providers like AWS, so they rarely pay list prices. Despite these nuances, it's clear that bringing large models to life is capital-intensive. So what hope do builders and new entrants have?
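To get a feel for where figures like these come from, here's a rough back-of-envelope sketch. The inputs are assumptions on my part, not hard numbers: ~3.14e23 FLOPs of training compute for GPT-3 (from the GPT-3 paper), V100-class GPUs at ~125 TFLOPS peak mixed precision, ~30% realized utilization, and an illustrative cloud rate of $2 per GPU-hour.

```python
# Back-of-envelope estimate of GPT-3's training cost.
# All inputs are assumptions for this sketch, not published figures.

TOTAL_FLOPS = 3.14e23      # training compute, per the GPT-3 paper
PEAK_FLOPS = 125e12        # assumed V100 tensor-core peak (FLOP/s)
UTILIZATION = 0.30         # assumed fraction of peak actually sustained
PRICE_PER_GPU_HOUR = 2.00  # assumed discounted cloud rate (USD)

gpu_seconds = TOTAL_FLOPS / (PEAK_FLOPS * UTILIZATION)
gpu_hours = gpu_seconds / 3600
cost = gpu_hours * PRICE_PER_GPU_HOUR

print(f"GPU-hours: {gpu_hours:,.0f}")   # ~2.3 million GPU-hours
print(f"Estimated cost: ${cost:,.0f}")  # ~$4-5 million at these inputs
```

Nudging the utilization or the hourly rate easily pushes this estimate into the $10M+ range quoted above, which is why published figures vary so widely.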
infrastructure
Put simply, I believe there’s opportunity for upstarts and small players because there’s lively competition up and down the stack.
Let’s start at the lowest level.
foundries
SOTA models require massive compute, and today that typically means GPUs.
TSMC is the world's leading contract foundry; it manufactures chips designed by other companies, including AMD, Nvidia, and Apple. Its business is not limited to GPUs, but it does manufacture chips for the leading GPU makers.
TSMC faces stiff competition from rivals, many of them bolstered by the CHIPS Act. The CHIPS Act not only boosted America's domestic semiconductor production capability, it will also help diversify the foundry business writ large, since it provides incentives to a litany of manufacturers beyond just TSMC. Notably, the CHIPS Act and relentless competition from AMD have awoken the sleeping giant Intel, which is seeking to substantially expand its own contract manufacturing business.
Interestingly, Berkshire Hathaway atypically took short-term profits in TSMC, entering and then drastically reducing its position within a single quarter.
Putting it all together, I don’t see TSMC or another manufacturer capturing all the value in AI.
chips
As I touched upon earlier, GPUs drive SOTA models. As with foundries, there's a clear leader: Nvidia. GPUs got their start in gaming, but they're critical to AI/ML because these workloads favor parallel processing and high memory bandwidth, and GPUs deliver both cost-effectively (they're commodity parts rather than specialized hardware).
Nvidia's dominance is multifaceted, but the linchpin of its AI strategy is CUDA, which enables performant parallel computing that makes the most of its chips. CUDA is proprietary, and Nvidia's closest competitor in the GPU space, AMD, leverages open source software for its equivalent. There's reason to believe Nvidia will face increased competition here, as software across the broader AI ecosystem increasingly abstracts over CUDA.
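As a minimal sketch of what that abstraction looks like in practice: in PyTorch, model code targets a generic device, and the framework dispatches to CUDA on Nvidia hardware, to AMD's ROCm (whose PyTorch builds reuse the same torch.cuda interface), or to the CPU. The toy model and shapes below are arbitrary placeholders.

```python
# Device-agnostic PyTorch: the same code runs on Nvidia GPUs (CUDA),
# on AMD GPUs via ROCm builds, or on a plain CPU.
import torch

# Pick whatever accelerator is available; fall back to CPU otherwise.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = torch.nn.Linear(1024, 1024).to(device)  # arbitrary toy model
x = torch.randn(32, 1024, device=device)        # a batch of 32 inputs
y = model(x)                                    # dispatched to the active backend

print(f"Ran forward pass on: {device}")
```

The more model code is written against interfaces like this, the weaker CUDA's lock-in becomes.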
Furthermore, Google leverages its own custom processor for ML compute, the tensor processing unit (TPU). TPUs famously power Google's class-leading machine learning features on Pixel phones. TPUs are not direct substitutes for GPUs (only certain software and models can benefit from them), and they are themselves proprietary.
Similarly, Amazon has developed its Graviton chips in the CPU space. At the edge, Apple's custom chips contain a "Neural Engine" optimized for machine learning workloads. Microsoft is also rumored to be working on its own chips. Again, these are not one-to-one substitutes for Nvidia's GPUs, but they demonstrate that other firms have taken notice and intend to compete vigorously.
These facts notwithstanding, I expect Nvidia to retain its dominant lead in the short term, since it has a substantial head start and is the most deeply integrated into the ecosystem. Over a longer time horizon, other firms will compete for Nvidia's juicy margins.
cloud
The most powerful GPUs, the ones that make training SOTA models economically tenable, are expensive: Nvidia's A100s cost $10,000+ each, and newer GPUs like the H100 cost even more. This is the crux of why training is expensive.
Cloud computing remains advantageous for most enterprises, in no small part because infrastructure costs like GPUs become OpEx instead of CapEx. With ML/AI this advantage is especially substantial: instead of spending tens of millions of dollars on infrastructure, companies can rent it for training runs and scale it on demand for inference.
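A quick rent-versus-buy sketch makes the tradeoff concrete. Both figures are assumptions for illustration, not vendor quotes: roughly $150k to buy an 8x A100 server outright, versus roughly $32/hour to rent a comparable 8-GPU cloud instance on demand.

```python
# Illustrative rent-vs-buy break-even for an 8x A100 server.
# Both inputs are assumptions for this sketch, not vendor quotes.

PURCHASE_PRICE = 150_000  # assumed upfront cost of an 8x A100 server (USD)
CLOUD_RATE = 32.0         # assumed on-demand rate for a comparable instance (USD/hour)

breakeven_hours = PURCHASE_PRICE / CLOUD_RATE
breakeven_months = breakeven_hours / (24 * 30)

print(f"Break-even: {breakeven_hours:,.0f} hours "
      f"(~{breakeven_months:.1f} months of 24/7 use)")
# ~4,700 hours, or roughly 6.5 months of continuous utilization
```

Unless a team can keep the hardware busy for months on end, renting wins, and buying also carries power, cooling, and staffing costs this sketch ignores.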
This space remains highly competitive, even as growth has slowed: AWS has 34% market share, Azure has 21%, and Google Cloud has 11%.
What's especially exciting for new entrants in AI is that it's unclear whether any cloud provider has an outsized advantage in AI specifically, so I expect them to compete vigorously:
Microsoft has its intriguing partnership with OpenAI, but trails Amazon in market share; AI can serve as a substantial growth driver in the years to come.
Amazon is the overall cloud leader, but relies on partnerships with AI startups like Stability AI. Its Alexa model is deployable, but lacks capabilities like fine-tuning.
Google is widely believed to have the lead in ML/AI expertise, and it has custom chips to boot; as the cloud laggard, it can use those advantages to aggressively take market share.
coming up next
I've discussed the broad inputs that go into training SOTA models and why they showcase a dynamic industry where no firm has an unassailable advantage.
In the next part I’ll discuss the competitive landscape closer to the applications end of the stack. Specifically, I’m excited to discuss the human capital and technology forces that are likely to keep AI competitive for all market participants, especially builders.
thank you!