I have been wary of the prediction game in this field. The five-year predictions made in 2020 were almost all wrong. The five-year predictions made in 2023 are mostly already embarrassing. With that caveat noted, I am going to make a set anyway. The framing is what I expect to see in applied work over the next five years. I am going to be specific about confidence, and I am going to be willing to be wrong in print.
Capability
I expect frontier model capability to continue improving, but at a slower rate than the headline narrative suggests on the dimensions that matter for applied work. The improvements in benchmarks are not always improvements in production behaviour. The model that scores ten points higher on a reasoning benchmark may not produce a ten-point improvement on your eval.
The dimensions where I expect meaningful progress. Long-context handling, especially the reliability of finding specific information in a large context. Tool use, particularly the ability to chain a small number of tools correctly. Multimodal handling, especially document and image inputs.
The dimensions where I expect slower progress. Genuine robustness on adversarial inputs. Calibrated uncertainty. The ability to reliably refuse rather than hallucinate when information is unavailable. These are the failure modes that matter most in production. They are also the ones the training paradigm has the least direct grip on.
Cost
I expect inference cost to continue falling, on a per-token basis, at a meaningful but decelerating rate. The hardware gains will continue. The architectural gains in inference engines have largely been harvested. The pricing pressure from open-weight models will keep the hosted providers honest.
Total cost of ownership for AI workloads will not fall as fast as the per-token cost, because workloads are getting larger. Reasoning models that consume more tokens per task. Longer context. More extensive tool use. The unit cost of a task is more stable than the per-token cost suggests.
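A small worked example makes the arithmetic concrete. The prices and token counts below are invented for illustration, not measurements; the point is only that a fall in per-token price can be cancelled out by growth in tokens per task.

```python
def cost_per_task(price_per_million_tokens: float, tokens_per_task: int) -> float:
    """Unit cost of one task in dollars: tokens consumed times the per-token price."""
    return price_per_million_tokens * tokens_per_task / 1_000_000


# Hypothetical figures, chosen only to illustrate the arithmetic.
# "Before": a short prompt to a model priced at $10 per million tokens.
before = cost_per_task(price_per_million_tokens=10.0, tokens_per_task=2_000)

# "After": the per-token price has fallen by 4x, but a reasoning-style
# workload with longer context and tool calls uses 4x the tokens.
after = cost_per_task(price_per_million_tokens=2.5, tokens_per_task=8_000)

print(f"before: ${before:.4f} per task")
print(f"after:  ${after:.4f} per task")
```

In this made-up case the per-token price drops by a factor of four and the cost of the task does not move at all. Real workloads will land somewhere between that and the full per-token saving, which is why I track cost per task rather than cost per token.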
The shape of production deployments
The architectural patterns that have been emerging will solidify. Most production AI will run as a deterministic system with LLMs in narrow roles, validated against schemas, with human review on edges, and tracing across the whole stack. The unconstrained-agent paradigm will not be the norm in production, even as it remains a research direction.
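As a minimal sketch of what "LLM in a narrow role, validated against a schema, with human review on edges" can look like in code: the label set, the `Decision` fields, the `handle` function, and the 0.8 review threshold are all hypothetical names and values chosen for illustration, and the model call itself is left out. The sketch uses only the Python standard library; a production system would more likely use a dedicated schema library.

```python
import json
from dataclasses import dataclass
from typing import Optional

ALLOWED_DECISIONS = {"approve", "deny", "escalate"}  # hypothetical label set


@dataclass(frozen=True)
class Decision:
    decision: str
    confidence: float


def validate_output(raw: str) -> Optional[Decision]:
    """Return a Decision if the raw model output matches the schema, else None."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    decision = data.get("decision")
    confidence = data.get("confidence")
    if not isinstance(decision, str) or decision not in ALLOWED_DECISIONS:
        return None
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(confidence, bool) or not isinstance(confidence, (int, float)):
        return None
    if not 0.0 <= confidence <= 1.0:
        return None
    return Decision(decision=decision, confidence=float(confidence))


def handle(raw: str, review_threshold: float = 0.8) -> str:
    """Deterministic routing around the model: accept valid, confident
    outputs and send everything else to a human reviewer."""
    result = validate_output(raw)
    if result is None:
        return "human_review"  # malformed output never reaches downstream systems
    if result.decision == "escalate" or result.confidence < review_threshold:
        return "human_review"
    return result.decision


print(handle('{"decision": "approve", "confidence": 0.95}'))
print(handle("Sure! Here is the JSON you asked for..."))
```

The design choice worth noting is that the model never decides what happens to its own output. The surrounding code does, and its behaviour is the same on every run.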
Multi-model deployments will be common. The right model for a task will be selected from a small set, with cheaper models on routine cases and more capable models on edge cases. The selection logic will live above the model layer, in the system the team controls.
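One way the selection logic might look, as a sketch under stated assumptions: the model names, the task kinds, the token cutoff, and the rule that a previously failed task goes straight to the more capable model are all hypothetical. The only thing taken from the argument above is that the choice is made by plain code the team owns, not by the model.

```python
from dataclasses import dataclass

# Hypothetical model identifiers and routing rules, for illustration only.
CHEAP_MODEL = "small-model"
CAPABLE_MODEL = "large-model"
ROUTINE_KINDS = {"classification", "extraction"}


@dataclass(frozen=True)
class Task:
    kind: str
    input_tokens: int
    prior_failures: int = 0  # times a cheaper attempt failed validation


def select_model(task: Task, max_cheap_tokens: int = 4_000) -> str:
    """Pick a model tier using rules that live outside the model layer."""
    if task.prior_failures > 0:
        # A cheaper attempt already failed: treat this as an edge case.
        return CAPABLE_MODEL
    if task.kind in ROUTINE_KINDS and task.input_tokens <= max_cheap_tokens:
        return CHEAP_MODEL
    return CAPABLE_MODEL


print(select_model(Task(kind="extraction", input_tokens=1_200)))
print(select_model(Task(kind="extraction", input_tokens=1_200, prior_failures=1)))
```

Because the router is ordinary code, it can be unit-tested, traced, and changed without touching either model.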
Domain-specific fine-tunes will be more common than they are now, particularly in regulated industries where the cost of operating a custom model is dominated by the regulatory overhead, not the inference. A fine-tuned model that runs inside the regulatory boundary is a strong default in those settings.
Regulation
The regulatory picture will continue to fragment. The major jurisdictions will continue to take different approaches. The EU will be more prescriptive. The US will be more sectoral. The UK will be lighter-touch. Asia will vary significantly by country.
For practitioners, this means that any AI deployment with international exposure will face a patchwork of compliance requirements. The teams that have invested in audit logs, model documentation, and incident response will absorb the new requirements as paperwork. The teams that have not will spend quarters retrofitting.
Healthcare and financial services will see the most movement. Government adoption will lag, but the early adopters in government will set patterns that subsequent agencies follow.
What gets built
A short list of categories where I expect real businesses to be built over the next five years.
Document intelligence at scale, in domains where the documents are heterogeneous and high-volume. Insurance. Legal. Healthcare claims. Government benefits. The infrastructure to extract structured data from messy documents will be a meaningful business category.
Workflow automation in operationally heavy back offices. Prior authorisation. Claims adjudication. Regulatory filing. The tasks that involve reading several systems, applying rules, and producing a structured output. The pattern is narrower than the agent vision but more durable.
Voice AI in narrow domains. Customer service for specific industries. Healthcare triage. The voice modality has been slower to mature than the text modality, and it is now catching up. The deployments that work are constrained.
Domain-specific models in regulated industries. Models trained on the specific data and tasks of an industry, operated inside the regulatory boundary. This is a quieter category than the frontier model news, and it is where I expect a lot of the durable value to settle.
What does not pan out
Predictions where I think the consensus is wrong.
The autonomous agent vision will not arrive on the timeline the marketing has suggested. The constrained patterns will continue to dominate production. This is a high-confidence prediction.
General self-service for business users will not arrive broadly. Specific narrow self-service tools will work. The broader vision of natural-language analytics for any user will continue to be hard, mostly because of trust and context, not because of query generation.
The displacement narrative will be more uneven than the headlines suggest. Some categories will see real displacement. Many will see role evolution. The economic data will be noisier and slower than either the boosters or the doomers expect.
What I am committed to
A few things I will keep doing in the work, regardless of how the broader trajectory plays out.
Build with discipline. Validate every output against a schema. Trace every system. Measure unit economics. Reserve the LLM for narrow, well-bounded roles. Use deterministic backbones.
Engage the regulatory environment as a design input, not as a checklist. The systems that are easy to govern are the systems that survive.
Stay in the work. The most useful predictions about applied AI come from people who are still shipping it. The predictions that age worst come from people who have moved to the conference circuit.
Five years from now, I will reread this. I expect to be right about the shape of production deployments and the direction of cost. I expect to be wrong about at least one of the capability dimensions. I look forward to finding out which.