Fine-Tuning in 2026 — Cheaper Than You Think, Harder Than You Think
Bloomberg spent an estimated $10 million building a domain-specific model. You can now get started for under $100. What changed, and when does fine-tuning actually make sense?

In 2023, Bloomberg trained BloombergGPT from scratch: a 50-billion-parameter model, 363 billion tokens of financial data, roughly 53 days of continuous compute, and an estimated $10 million in total cost. It significantly outperformed similarly sized general models on financial NLP tasks.⁹
Today, a team can fine-tune a capable open-source model on their domain data over a weekend, on a single rented GPU, for under $100.
Two things changed. Open-source models matured, and a technique called LoRA collapsed the hardware requirements. What was a Bloomberg-scale project in 2023 is a realistic build for a mid-sized engineering team today.
But there's a common mistake worth addressing before the how, because it sends most fine-tuning projects off course from the start.
The Most Common Misuse
Most teams come to fine-tuning wanting to inject facts into the model: product catalog, pricing, internal policies.
Wrong use case.
Fine-tuning teaches a model how to respond: the reasoning pattern, output format, tone, domain structure. It doesn't reliably encode what to know.
Fine-tune on your legal briefs and the model learns to structure arguments like your firm, use your citation format, hit the right level of formality. Behavioral learning. It sticks.
Try to encode facts ("our refund policy is 30 days," "Product X supports API v2.3") and those facts compete with similar patterns from original training, get blurred, and resurface as confident approximations. The model appears to know your policy. What it's carrying is a plausible reconstruction.
For facts, use RAG. Fine-tuning is for teaching the model how your domain thinks, not what it contains.
The Risk Nobody Budgets For
Fine-tuning on a new domain can degrade performance on things the model previously did well, a failure mode known as catastrophic forgetting. Luo et al. found that after continual fine-tuning on new tasks, accuracy on the social science subset of the MMLU benchmark dropped from 36% to 26%. The new specialization overwrote general capability.¹⁰
There's a separate finding that's harder to anticipate: fine-tuning on ordinary, benign domain data can silently erode safety alignment, the built-in guardrails from original training, even when the fine-tuning set contains nothing harmful. Qi et al. confirmed this at ICLR 2024.¹¹
Neither finding means fine-tuning is too risky. It means fine-tuning requires evaluation across more dimensions than just the task you trained for. Teams that skip general-capability and safety regression checks tend to get surprised.
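A regression check of this kind can be very small. The sketch below is one hypothetical way to do it, not a standard tool: it compares scores from before and after fine-tuning across several eval suites and flags any suite that fell by more than a chosen tolerance. The suite names, the `max_drop` threshold, and every score except the 36% to 26% pair are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Regression:
    suite: str
    before: float
    after: float

    @property
    def drop(self) -> float:
        return self.before - self.after


def find_regressions(baseline, finetuned, max_drop=0.02):
    """Return every eval suite whose score fell by more than max_drop."""
    flagged = []
    for suite, before in baseline.items():
        if suite not in finetuned:
            # Refuse to pass silently when a suite was skipped after fine-tuning.
            raise KeyError(f"fine-tuned model was not evaluated on {suite!r}")
        after = finetuned[suite]
        if before - after > max_drop:
            flagged.append(Regression(suite, before, after))
    return flagged


# Illustrative scores only. The mmlu_social_science pair mirrors the
# 36% -> 26% drop cited above; the other numbers are invented.
baseline = {"target_task": 0.61, "mmlu_social_science": 0.36, "safety_refusals": 0.97}
finetuned = {"target_task": 0.84, "mmlu_social_science": 0.26, "safety_refusals": 0.96}

for r in find_regressions(baseline, finetuned):
    print(f"{r.suite}: {r.before:.2f} -> {r.after:.2f} (drop {r.drop:.2f})")
```

The point of the design is that the target task is only one row. A run that improves `target_task` can still fail the check if a general-capability or safety suite slips past the tolerance.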
What LoRA Changed
Full fine-tuning updates every parameter in the model. For a 7B model, that's 7 billion numbers, requiring significant GPU memory and long training runs.
LoRA (Low-Rank Adaptation) freezes all existing weights and trains a small set of lightweight adapter layers alongside them.¹² At inference time, these adapters shape outputs without overwriting core knowledge.
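To make the mechanism concrete, here is a minimal pure-Python sketch of the update described in the LoRA paper: the frozen weight matrix `W` is left alone, and a low-rank product `B A`, scaled by `alpha / rank`, is added to its output. The layer size (4096 x 4096), the rank (8), and the helper names are assumptions chosen for illustration; real training would use a tensor library rather than nested lists.

```python
import random


def lora_param_counts(d_out, d_in, rank):
    """Trainable parameters: full update of a d_out x d_in matrix vs. LoRA."""
    full = d_out * d_in
    lora = rank * (d_out + d_in)  # B is d_out x rank, A is rank x d_in
    return full, lora


def matvec(matrix, vector):
    """Plain matrix-vector product on nested lists."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, alpha, rank):
    """h = W x + (alpha / rank) * B (A x); W is frozen, only A and B train."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / rank
    return [b + scale * u for b, u in zip(base, update)]


# Hypothetical layer: one 4096 x 4096 projection adapted at rank 8.
full, lora = lora_param_counts(4096, 4096, rank=8)
print(f"full: {full:,}  lora: {lora:,}  share: {lora / full:.2%}")

# Tiny demo: B starts at zero, so the adapted layer initially matches the base.
random.seed(0)
d_out, d_in, rank = 3, 4, 2
W = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]
A = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(rank)]
B = [[0.0] * rank for _ in range(d_out)]
x = [1.0, -2.0, 0.5, 3.0]
print(lora_forward(W, A, B, x, alpha=16, rank=rank) == matvec(W, x))
```

For the assumed layer, the adapter trains 65,536 parameters instead of 16,777,216, which is where the memory and file-size savings come from.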
| | Full Fine-Tuning | LoRA |
|---|---|---|
| GPU memory (7B model) | 60+ GB | 12-16 GB |
| Training time (few thousand examples) | Hours to days | 30-120 minutes |
| Output file size | ~16 GB (full checkpoint) | 50-200 MB (adapter only) |
| Base model weights left unchanged | No | Yes |
Because the base model stays intact, you can train multiple LoRA adapters for different tasks and swap them at runtime. The general-capability regressions described earlier also tend to be less severe with LoRA than with full fine-tuning.
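The memory gap follows from simple arithmetic. The sketch below is a back-of-envelope estimate under stated assumptions, not a measurement: 16-bit weights and gradients (2 bytes each), an Adam-style optimizer that keeps about 12 bytes of state per trainable parameter (a 32-bit master copy plus two 32-bit moment estimates), and no accounting for activations or framework overhead. The 20-million-parameter adapter size is a hypothetical figure.

```python
GB = 1e9  # decimal gigabytes


def training_memory_gb(total_params, trainable_params,
                       weight_bytes=2, grad_bytes=2, optimizer_bytes=12):
    """Rough training footprint: weights + gradients + optimizer state.

    Gradients and optimizer state are only needed for trainable parameters.
    Activations and framework overhead are deliberately ignored.
    """
    weights = total_params * weight_bytes
    grads = trainable_params * grad_bytes
    optimizer = trainable_params * optimizer_bytes
    return (weights + grads + optimizer) / GB


# Full fine-tuning: every one of the 7B parameters is trainable.
full_gb = training_memory_gb(total_params=7e9, trainable_params=7e9)

# LoRA: 7B frozen weights plus a hypothetical 20M trainable adapter parameters.
lora_gb = training_memory_gb(total_params=7e9, trainable_params=20e6)

print(f"full fine-tuning: ~{full_gb:.0f} GB")
print(f"LoRA:             ~{lora_gb:.1f} GB")
```

Under these assumptions full fine-tuning lands well above the 60 GB floor quoted here, and LoRA lands inside the 12-16 GB range. Memory-saving options such as 8-bit optimizers or quantized base weights would shift both numbers down.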
LoRA is the default fine-tuning method in 2026. Full fine-tuning is for very large datasets and teams with compute budgets to match.
What It Actually Costs
Open-source models (Llama 3.2, Phi-4, Mistral 7B, Gemma 3) changed the cost structure. You don't need a proprietary API.
Via proprietary APIs
(Prices as of Q1 2026. Verify at provider pricing pages before budgeting, since these change.)¹³
| Model | Training cost | Fine-tuned inference |
|---|---|---|
| GPT-4o | ~$25 / 1M training tokens | ~$3.75 / 1M input tokens |
| GPT-4.1 mini | ~$0.80 / 1M training tokens | ~$0.80 / 1M input tokens |
| Gemini 2.0 Flash | Competitive with above | Same rate as base model |
Via open-source on rented compute
Fine-tuning Llama 3.1 8B with LoRA runs approximately $0.48 per 1M training tokens on providers like Together AI. A 100K-token training run costs roughly $0.05 in compute. Hosting a fine-tuned 7B model on a single private GPU runs $500-$2,000/month versus $5,000-$50,000/month for equivalent frontier LLM API volume at scale.¹⁴
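The training-cost figures reduce to one multiplication, sketched below. The per-million-token rates are the approximate ones quoted in this section; treating cost as tokens times epochs times rate is an assumption about how a provider bills, and the labels and dataset size are illustrative.

```python
def training_cost_usd(training_tokens, usd_per_million_tokens, epochs=1):
    """Estimated training cost, assuming billing per token processed."""
    return training_tokens * epochs / 1_000_000 * usd_per_million_tokens


# Approximate rates quoted in this section (USD per 1M training tokens).
RATES = {
    "open-source 8B, LoRA": 0.48,
    "GPT-4.1 mini": 0.80,
    "GPT-4o": 25.00,
}

dataset_tokens = 100_000  # the 100K-token run used as the example above

for name, rate in RATES.items():
    one_epoch = training_cost_usd(dataset_tokens, rate)
    three_epochs = training_cost_usd(dataset_tokens, rate, epochs=3)
    print(f"{name:22s} 1 epoch: ${one_epoch:.3f}   3 epochs: ${three_epochs:.3f}")
```

At this dataset size, compute is a rounding error on every option. The numbers that matter for budgeting are the inference and hosting costs, plus the engineering time spent on data and evaluation.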
The GitHub Copilot Case Study
Copilot started with RAG: knowledge bases and repository indexing for the chat experience. In 2024, GitHub added fine-tuning for inline code completion, reasoning that autocomplete demands speed that retrieval latency can't match. Enterprises could train custom models on their private codebases.¹⁵
According to GitHub's documentation and community reports from late 2025, the custom model feature was subsequently marked for discontinuation. GitHub pointed enterprise users toward contextual customization via embeddings instead.¹⁶
The reasoning in those discussions: maintaining custom models as codebases evolved was difficult to justify. Every meaningful codebase change made the fine-tuned model stale. Retraining meant engineering cycles. The embedding-based approach updated more gracefully.
Fine-tuning creates an ongoing maintenance obligation. A fine-tuned model isn't a one-time project. It's a system component with a retraining cadence attached. That cost belongs in the decision from day one.
When Fine-Tuning Is Worth It
Domain reasoning is the product. BloombergGPT's value wasn't stored financial facts. It was reasoning about finance the way a trained analyst would. When deep domain reasoning is the core value proposition, fine-tuning builds it in.
Output format consistency is non-negotiable. Valid JSON in an exact schema, every time. Reports that always match your firm's structure. Consistency that prompting alone won't reliably sustain.
Latency is constrained. Real-time code completion, voice interfaces, anything where an extra 200-500 ms of retrieval breaks the experience.
On-device or air-gapped deployment. A model on a factory floor or medical device with no network access. Fine-tuning is often the only path to domain competence here.
Very high query volume. At millions of queries per day, shorter prompts (no retrieved context) reduce per-query costs meaningfully.
For everything else, start with RAG and add fine-tuning when you hit a specific behavioral problem it can't solve.
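The query-volume argument can be checked with a rough model before committing. The sketch below compares monthly input-token spend for a RAG prompt (instruction plus retrieved context) against a fine-tuned model that needs only the instruction. Every number in it is a hypothetical placeholder: the query volume, the token counts, and both per-token rates. It also ignores output tokens, retrieval infrastructure, and retraining costs.

```python
def monthly_input_cost_usd(queries_per_day, tokens_per_query,
                           usd_per_million_tokens, days=30):
    """Monthly spend on input tokens only."""
    total_tokens = queries_per_day * days * tokens_per_query
    return total_tokens / 1_000_000 * usd_per_million_tokens


# Hypothetical workload: 2M queries/day, 300-token instruction,
# 1,500 tokens of retrieved context per RAG query.
QUERIES_PER_DAY = 2_000_000
INSTRUCTION_TOKENS = 300
RETRIEVED_TOKENS = 1_500

# Hypothetical rates: the fine-tuned model is priced higher per token.
BASE_RATE = 2.50       # USD per 1M input tokens, base model
FINETUNED_RATE = 3.75  # USD per 1M input tokens, fine-tuned model

rag_cost = monthly_input_cost_usd(
    QUERIES_PER_DAY, INSTRUCTION_TOKENS + RETRIEVED_TOKENS, BASE_RATE)
finetuned_cost = monthly_input_cost_usd(
    QUERIES_PER_DAY, INSTRUCTION_TOKENS, FINETUNED_RATE)

print(f"RAG prompt:        ${rag_cost:,.0f}/month")
print(f"Fine-tuned prompt: ${finetuned_cost:,.0f}/month")
print(f"Difference:        ${rag_cost - finetuned_cost:,.0f}/month")
```

With these placeholder numbers the shorter prompt wins even at a higher per-token rate. At a few thousand queries per day the same arithmetic produces a difference too small to cover a retraining cadence, which is why volume belongs on the list.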
Up next, Part 4: Agentic RAG, the architecture that reframes everything in this series and explains what production AI actually looks like in 2026.
Key Takeaways
- Fine-tuning teaches how to respond, not what to know. Using it for facts is the most common mistake.
- Catastrophic forgetting is real: domain fine-tuning can degrade general capability and safety alignment.
- LoRA cut GPU memory requirements for a 7B model from 60+ GB to 12-16 GB.
- Fine-tuning creates a retraining obligation. Factor that into the build decision.
- GitHub reportedly marked Copilot's custom fine-tuned model feature for discontinuation and pointed enterprises toward embedding-based customization instead.
Sources
⁹ Wu, S., et al. BloombergGPT: A Large Language Model for Finance. arXiv:2303.17564, 2023; Shah, A. Bloomberg's $10M Data Experiment. Medium, July 2025.
¹⁰ Luo, Y., et al. An Empirical Study of Catastrophic Forgetting in Large Language Models During Continual Fine-Tuning. arXiv:2308.08747, 2023.
¹¹ Qi, X., et al. Fine-Tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To! ICLR 2024.
¹² Hu, E., et al. LoRA: Low-Rank Adaptation of Large Language Models. arXiv:2106.09685, 2021.
¹³ OpenAI: platform.openai.com/docs/pricing; Google: cloud.google.com/vertex-ai/pricing (verify before use); LLM Fine-Tuning Pricing 2026, PricePerToken.com, February 2026.
¹⁴ How to Fine-Tune a Small AI Model for Your Business in 2026, AI Magicx, March 2026; Small Language Models 2026, Iterathon.
¹⁵ Fine-Tuned Models Are Now in Limited Public Beta for GitHub Copilot Enterprise, GitHub Blog, September 2024.
¹⁶ GitHub Community Discussion #161278, 2025. Reflects documentation references and community reports, not a standalone official announcement at time of writing.