Compound AI Systems Beat Monolithic Models
The best AI results are increasingly achieved not by single monolithic models but by compound systems that combine multiple models, retrievers, tools, and orchestration logic into something greater than the sum of its parts.
"State-of-the-art AI results are increasingly obtained by compound systems with multiple components, not just monolithic models." (Matei Zaharia et al., Berkeley BAIR)
Microsoft demonstrated that a chaining strategy exceeded GPT-4's accuracy on medical exams by 9%, not by using a better model but by using the same model more cleverly. Google's Gemini launch reported MMLU results using a CoT@32 strategy that calls the model 32 times per question. RAG pipelines, tool-use agents, multi-step verification chains, and ensemble methods all exemplify this shift. The pattern across these examples is consistent: careful orchestration of existing models can outperform simply scaling a single model.
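To make "calling the model 32 times" concrete, here is a minimal sketch of plain majority voting over k sampled answers (self-consistency-style voting). It is a simplification, not Google's exact CoT@32 procedure, and `toy_model` is a hypothetical deterministic stand-in for a stochastic LLM call that is right on 3 of every 5 samples.

```python
from collections import Counter
from typing import Callable

# Hypothetical model interface: (question, seed) -> final answer string.
SampleFn = Callable[[str, int], str]


def majority_vote(sample_answer: SampleFn, question: str, k: int = 32) -> str:
    """Sample k answers and return the most frequent one."""
    answers = [sample_answer(question, seed) for seed in range(k)]
    # most_common(1) -> [(answer, count)]; ties resolve to the answer seen first.
    return Counter(answers).most_common(1)[0][0]


def toy_model(question: str, seed: int) -> str:
    """Hypothetical stand-in for an LLM: answers 'B' on 3 of every 5 seeds, else 'A' or 'C'."""
    if seed % 5 < 3:
        return "B"
    return "A" if seed % 2 == 0 else "C"


print(toy_model("Q", 3))             # a single call can be wrong: C
print(majority_vote(toy_model, "Q"))  # 32 calls, vote is 20 B / 6 A / 6 C: B
```

The model is unchanged; only the orchestration around it differs, which is the point of the examples above.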
This shift opens AI development to a much broader set of practitioners. While training a frontier model requires hundreds of millions of dollars and rare expertise, building a compound system requires engineering skill, domain knowledge, and creative architecture: the kind of work most software teams can do. The challenge moves from "how do we train a bigger model" to "how do we design the right system around the models we have." This includes decisions about when to retrieve vs. generate, how to decompose complex tasks, when to verify outputs, and how to route between specialized models.
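Routing is one of the design decisions listed above that is ordinary software engineering. The sketch below assumes a toy keyword classifier and three hypothetical back ends (a calculator tool, a retrieve-then-generate path, and a small general model); each back end is a stub that only tags the query so the routing decision is visible.

```python
from typing import Callable, Dict

Handler = Callable[[str], str]


def classify(query: str) -> str:
    """Toy heuristic classifier; a real system might use a small model here."""
    q = query.strip().lower()
    if any(ch.isdigit() for ch in q):
        return "math"
    if q.startswith(("who", "when", "where")):
        return "retrieve"
    return "general"


# Hypothetical back ends; each stub just tags the query with where it was sent.
ROUTES: Dict[str, Handler] = {
    "math": lambda q: f"[calculator tool] {q}",
    "retrieve": lambda q: f"[retrieve, then generate] {q}",
    "general": lambda q: f"[small general model] {q}",
}


def route(query: str) -> str:
    """Send the query to the handler for its class, falling back to 'general'."""
    handler = ROUTES.get(classify(query), ROUTES["general"])
    return handler(query)


print(route("What is 17 * 23?"))         # [calculator tool] What is 17 * 23?
print(route("Who wrote the BAIR post?"))  # [retrieve, then generate] Who wrote the BAIR post?
print(route("Summarize this paragraph"))  # [small general model] Summarize this paragraph
```

Keeping the classifier separate from the handlers means either side can later be swapped for a model without changing the other.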
The practical implications are significant. The "new language model stack" is not just a model but an entire pipeline: embedding models for retrieval, vector databases for storage, orchestration frameworks for chaining, guardrails for safety, and evaluation harnesses for quality. Fine-tuning is often less important than getting the retrieval and prompting right. The system boundary, not the model boundary, is where competitive advantage lives.
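The retrieval half of that stack can be sketched in a few dozen lines. This is a toy, assuming a bag-of-words "embedding" in place of a real embedding model and a Python list in place of a vector database; `embed`, `InMemoryStore`, and `build_prompt` are hypothetical names, and the assembled prompt would be handed to whatever model the system uses.

```python
import math
from collections import Counter
from typing import List, Tuple


def embed(text: str) -> Counter:
    """Toy 'embedding': lowercase bag-of-words counts (stand-in for a real embedding model)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class InMemoryStore:
    """Stand-in for a vector database: stores (document, vector) pairs in a list."""

    def __init__(self) -> None:
        self.items: List[Tuple[str, Counter]] = []

    def add(self, doc: str) -> None:
        self.items.append((doc, embed(doc)))

    def search(self, query: str, k: int = 2) -> List[str]:
        q = embed(query)
        ranked = sorted(self.items, key=lambda item: cosine(q, item[1]), reverse=True)
        return [doc for doc, _ in ranked[:k]]


def build_prompt(question: str, store: InMemoryStore, k: int = 2) -> str:
    """Orchestration step: retrieve top-k context and assemble a grounded prompt."""
    context = "\n".join(f"- {d}" for d in store.search(question, k))
    return f"Answer using only the context.\nContext:\n{context}\nQuestion: {question}"


store = InMemoryStore()
for doc in [
    "refunds are processed within 5 business days",
    "shipping is free on orders over 50 dollars",
    "support is available by email",
]:
    store.add(doc)

print(store.search("how many days until refunds are processed", k=1))
# ['refunds are processed within 5 business days']
```

Each piece (embedding, store, prompt assembly) can be upgraded independently, which is why the quality of the pipeline often matters more than the choice of model inside it.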
Takeaway: Engineering leverage in AI has shifted from training bigger models to designing smarter systems around existing ones, and this is where most teams should invest.
See also: Text Is the Universal Interface | The Bitter Lesson: Scale Beats Cleverness | Choose Boring Technology