The conversation about self-hosting LLMs has moved on from where it was a year ago. Open-weight models have improved. Inference engines have matured. Quantisation is better understood. The decision is no longer obvious in either direction. Here is how I think about it for clients in 2025.

The decision is workload-shaped

There is no universal answer. The right answer depends on the shape of the workload, and the variables that matter are not the ones vendor pitches usually focus on.

The relevant variables are token volume per month, latency sensitivity, the privacy classification of the data, the acceptable quality floor, and the team's capacity to operate GPU infrastructure. Each of these moves the calculation significantly.

When self-hosting wins

Self-hosting wins when several of the following are true.

Token volume is high and the workload is steady. The break-even on a self-hosted deployment requires high utilisation of the hardware. A workload that runs eight hours a day with idle time at night will not amortise the GPUs unless you can sell the idle time to another workload.
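The utilisation point is simple arithmetic, and a short sketch makes it concrete. The function name and the hourly cost of 10.0 are placeholders for illustration, not real prices. The only assumption is that the hardware is paid for around the clock whether or not it is serving traffic.

```python
def cost_per_busy_hour(hourly_hardware_cost: float, busy_hours_per_day: float) -> float:
    """Hardware cost attributed to each hour in which the GPUs do useful work.

    Assumes the hardware is paid for 24 hours a day regardless of load.
    """
    if not 0 < busy_hours_per_day <= 24:
        raise ValueError("busy_hours_per_day must be in (0, 24]")
    return hourly_hardware_cost * 24 / busy_hours_per_day


# Placeholder hourly cost of 10.0 in any currency; only the ratio matters.
steady = cost_per_busy_hour(10.0, 24)  # fully utilised
office = cost_per_busy_hour(10.0, 8)   # eight hours a day, idle at night
print(office / steady)  # 3.0
```

The eight-hour workload pays three times as much per useful hour as the steady one, before any other cost is counted.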

The data classification prohibits or strongly discourages sending payloads to a third-party API. Protected health information (PHI) in many jurisdictions, regulated financial data, and government data are common cases. The compliance cost of a third-party API can be high enough to dominate the unit economics.

The latency budget is tight, and the round trip to a hosted API takes a non-trivial share of it. This is rare in practice for text workloads, but it does come up for high-throughput streaming use cases or for deployments inside a private network.

The team has GPU operations capacity, or can buy it credibly. This is the part that gets understated. Operating an inference deployment well requires real expertise. If the team does not have it, the operational cost will erode whatever savings the hardware promised.

When the hosted API wins

The hosted API wins when the workload is bursty, the volume is moderate, the data is not specially classified, and the team does not have the operational depth to run GPUs. This describes most enterprise workloads.

For a team running a few thousand tokens per minute on average, the math almost always favours the hosted API. The hardware would sit idle most of the time. The savings from self-hosting do not materialise.

The quality floor is real

There is a quality floor below which self-hosting falls down. The best open-weight models are competitive with the best hosted models on many tasks, particularly on summarisation, classification, and structured extraction. They lag on long-context reasoning, on complex multi-step tasks, and on tasks that require very tight instruction following on the first attempt.

The right way to evaluate is to run your own quality bench against your actual workload. Published benchmarks are directional but not decisive. I have seen workloads where the open model was indistinguishable from the hosted model in production, and workloads where the gap was twenty points on the metric we cared about. The only way to know is to run the test on your data.
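A quality bench does not need to be elaborate to be useful. The sketch below is one minimal shape for it, assuming a classification-style task scored by exact match. The two stub models, the example prompts, and every function name are hypothetical stand-ins. In a real harness the callables would wrap the hosted API and the self-hosted endpoint, and the metric would be whatever the workload actually cares about.

```python
from typing import Callable, Sequence

Example = tuple[str, str]  # (prompt, expected answer)


def accuracy(model: Callable[[str], str], examples: Sequence[Example]) -> float:
    """Percentage of examples where the model output matches the expected answer."""
    if not examples:
        raise ValueError("need at least one example")
    hits = sum(
        model(prompt).strip().lower() == expected.strip().lower()
        for prompt, expected in examples
    )
    return 100.0 * hits / len(examples)


def gap_in_points(hosted: Callable[[str], str],
                  open_weight: Callable[[str], str],
                  examples: Sequence[Example]) -> float:
    """Hosted accuracy minus open-weight accuracy, in percentage points."""
    return accuracy(hosted, examples) - accuracy(open_weight, examples)


# Hypothetical stand-ins for real model calls.
def hosted_stub(prompt: str) -> str:
    return "refund" if "money back" in prompt else "other"


def open_stub(prompt: str) -> str:
    return "other"


examples = [
    ("I want my money back", "refund"),
    ("Where is my parcel?", "other"),
    ("money back please", "refund"),
    ("Hello", "other"),
]
print(gap_in_points(hosted_stub, open_stub, examples))  # 50.0
```

The examples should be sampled from production traffic rather than written by hand, since the point is to measure the gap on your data and not on a benchmark.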

The cost model that makes it real

Here is the rough shape of the calculation I use for a self-hosted deployment of a 70-billion-parameter open-weight model on H100s in 2025.

A pair of H100s, with the right inference engine, will deliver roughly 2,000 to 4,000 input tokens per second and 100 to 200 output tokens per second on a 70B model with long context. Output throughput is the binding constraint for most workloads.

At reasonable utilisation, this gives an effective cost per million output tokens that is competitive with the hosted API when monthly output token volume crosses roughly 50 to 100 billion tokens. Below that, the hosted API is cheaper. Above that, self-hosting starts to pull ahead, and the gap widens with volume.
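The core of that calculation can be sketched in a few lines. The throughput range is the one quoted above. The hourly cost is a placeholder to be replaced with your own figure, not a quote for real hardware, and the function name is illustrative. The sketch covers hardware cost only and ignores labour and other overheads.

```python
SECONDS_PER_HOUR = 3600


def cost_per_million_output_tokens(hourly_cost: float,
                                   output_tokens_per_second: float,
                                   utilisation: float) -> float:
    """Hardware cost per million output tokens at a given utilisation.

    utilisation is the fraction of wall-clock time the deployment spends
    generating tokens, in (0, 1].
    """
    if not 0 < utilisation <= 1:
        raise ValueError("utilisation must be in (0, 1]")
    tokens_per_hour = output_tokens_per_second * SECONDS_PER_HOUR * utilisation
    return hourly_cost / tokens_per_hour * 1_000_000


HOURLY_COST = 5.0  # placeholder: substitute your own cost for the pair of GPUs

for tokens_per_second in (100, 200):       # output throughput range from the text
    for utilisation in (0.3, 0.6, 0.9):    # illustrative utilisation levels
        cost = cost_per_million_output_tokens(HOURLY_COST, tokens_per_second, utilisation)
        print(f"{tokens_per_second} tok/s at {utilisation:.0%}: {cost:.2f} per million")
```

Comparing the result against the hosted provider's price per million output tokens gives the hardware side of the break-even. Halving utilisation doubles the cost per token, which is why steady load matters so much.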

The break-even moves with hardware prices, with the rate at which open-weight models improve, and with what hosted providers charge. It has been moving in favour of self-hosting for two years and is likely to continue moving in that direction, but the threshold is still high enough that most enterprise workloads do not cross it.

The hidden costs

The break-even calculation usually ignores three costs that matter in practice.

Operational labour. The fully loaded cost of an MLOps engineer is high. A serious self-hosted deployment will need at least one, and probably more.

Capacity planning risk. Hosted APIs absorb traffic spikes for you. A self-hosted deployment either has to be provisioned for peak, which kills the unit economics, or has to handle spikes gracefully through queuing or fallback to a hosted provider. The fallback path is itself an integration cost.

Model upgrade cost. The hosted provider upgrades the model for you. Self-hosted, you upgrade. That is an evaluation cycle, a deployment cycle, and a regression test cycle every quarter or so. The labour cost is real.
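The fallback path mentioned under capacity planning risk can be sketched as a small router. This is one possible shape, assuming a simple cap on concurrent requests as the spill trigger. The class name, the cap, and the two stub backends are all hypothetical, and a production version would also need timeouts, retries, and metrics.

```python
import threading
from typing import Callable

Backend = Callable[[str], str]


class SpilloverRouter:
    """Route to the self-hosted backend until a concurrency cap is hit, then spill over."""

    def __init__(self, self_hosted: Backend, hosted_fallback: Backend, max_in_flight: int):
        if max_in_flight < 1:
            raise ValueError("max_in_flight must be at least 1")
        self._self_hosted = self_hosted
        self._hosted_fallback = hosted_fallback
        self._max_in_flight = max_in_flight
        self._in_flight = 0
        self._lock = threading.Lock()

    def complete(self, prompt: str) -> str:
        # Claim a self-hosted slot if one is free; the lock is held only briefly.
        with self._lock:
            use_self_hosted = self._in_flight < self._max_in_flight
            if use_self_hosted:
                self._in_flight += 1
        if not use_self_hosted:
            return self._hosted_fallback(prompt)
        try:
            return self._self_hosted(prompt)
        finally:
            # Release the slot even if the backend raised.
            with self._lock:
                self._in_flight -= 1


# Hypothetical stand-ins for the real backends.
def self_hosted_stub(prompt: str) -> str:
    return f"self-hosted: {prompt}"


def hosted_stub(prompt: str) -> str:
    return f"hosted: {prompt}"


router = SpilloverRouter(self_hosted_stub, hosted_stub, max_in_flight=8)
print(router.complete("hello"))
```

A spillover path like this only makes sense for traffic whose data classification permits the hosted API. Requests that must stay in-house need to queue instead.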

The decision framework I actually use

For most clients, the answer for 2025 is to start on a hosted API and instrument the workload to make a future migration possible. Build the prompts and the evaluation harness so that swapping the model is a configuration change. Track the data classification of every prompt. Track the token volume and latency profile.
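One way to make the model swap a configuration change is to put every call behind a thin client that also records the variables worth tracking. The sketch below is illustrative only. All names are hypothetical, the backends are stubs, and the whitespace token count is a crude proxy for a real tokenizer.

```python
import time
from dataclasses import dataclass, field
from typing import Callable


def count_tokens(text: str) -> int:
    """Crude whitespace proxy; a real tokenizer would give different counts."""
    return len(text.split())


@dataclass
class CallRecord:
    backend: str
    data_class: str
    prompt_tokens: int
    output_tokens: int
    latency_s: float


@dataclass
class InstrumentedClient:
    backends: dict[str, Callable[[str], str]]
    active: str  # read from configuration, so swapping models needs no code change
    log: list[CallRecord] = field(default_factory=list)

    def complete(self, prompt: str, data_class: str) -> str:
        start = time.perf_counter()
        output = self.backends[self.active](prompt)
        latency = time.perf_counter() - start
        self.log.append(CallRecord(self.active, data_class,
                                   count_tokens(prompt), count_tokens(output), latency))
        return output


# Hypothetical configuration and stub backends.
config = {"active_backend": "hosted"}
client = InstrumentedClient(
    backends={
        "hosted": lambda prompt: "summary from hosted model",
        "self_hosted": lambda prompt: "summary from open-weight model",
    },
    active=config["active_backend"],
)
client.complete("Summarise this support ticket", data_class="internal")
print(client.log[0].backend, client.log[0].data_class, client.log[0].prompt_tokens)
```

Because the evaluation harness can call the same client with a different `active` value, the quality comparison and the production path exercise the same code.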

Revisit the self-hosting decision when one of the variables crosses a threshold. Volume crosses 50 billion output tokens per month. The data classification changes. The hosted provider's prices change materially. The team gains GPU operations capacity through some other workload.
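These triggers are easy to check mechanically once the workload is instrumented. The sketch below is a hedged illustration. The 50 billion figure comes from the text, but the field names and the 25% cut-off for a "material" price change are assumptions to adjust to taste.

```python
from dataclasses import dataclass

VOLUME_THRESHOLD = 50_000_000_000  # output tokens per month, from the text


@dataclass(frozen=True)
class Snapshot:
    monthly_output_tokens: int
    data_classification: str
    hosted_price_per_million: float
    has_gpu_ops_capacity: bool


def revisit_reasons(previous: Snapshot, current: Snapshot,
                    material_price_change: float = 0.25) -> list[str]:
    """Return the triggers that fired between two snapshots (empty list means none)."""
    reasons = []
    if current.monthly_output_tokens >= VOLUME_THRESHOLD:
        reasons.append("volume crossed threshold")
    if current.data_classification != previous.data_classification:
        reasons.append("data classification changed")
    if previous.hosted_price_per_million > 0:
        change = abs(current.hosted_price_per_million - previous.hosted_price_per_million)
        # The 25% default is an assumed definition of "materially".
        if change / previous.hosted_price_per_million >= material_price_change:
            reasons.append("hosted price changed materially")
    if current.has_gpu_ops_capacity and not previous.has_gpu_ops_capacity:
        reasons.append("team gained GPU operations capacity")
    return reasons


last_quarter = Snapshot(10_000_000_000, "internal", 10.0, False)
this_quarter = Snapshot(60_000_000_000, "internal", 10.0, False)
print(revisit_reasons(last_quarter, this_quarter))  # ['volume crossed threshold']
```

An empty list means the hosted API remains the default and nothing needs to be reopened this quarter.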

This is unsexy advice. It is the advice that has saved my clients the most money over the last two years.