Shipping with AI, responsibly. The practices we hold ourselves and our work to.
For a while, AI tooling inside the IDE felt effectively free. A flat seat fee, unlimited prompts, and no obvious downside to firing off another question. That window is closing.
GitHub Copilot now meters premium requests against a monthly allowance and charges for overages. Claude Code and the broader agentic tooling stack bill on usage from day one. And the underlying frontier models — the ones these tools quietly route to by default — keep getting more expensive per token, not less. A team that doubled its AI usage last year and barely noticed will feel the same behaviour on next quarter’s invoice.
That changes the job. Adopting AI is no longer the achievement; using it with discipline is. Knowing which model to route to, how much context to load, how long an answer needs to be, and when to stop an agent loop — these are now core engineering skills, not nice-to-haves. The rest of this guide is the playbook 8 West engineers work to, shared openly with the teams we build alongside, so AI delivery stays fast, predictable and worth what it costs.
Where dev teams quietly lose the budget
These are the patterns that show up in agent traces, IDE telemetry and provider dashboards across real-world delivery. We configure a guardrail against each one from day one, on our own work and on any team we’re embedded with.
Per-seat token burn climbs week-on-week unnoticed
Without per-developer and per-workflow attribution, the only visible signal is the monthly provider invoice. By then the new prompt patterns, longer agent loops or model upgrades that drove the drift have been the team’s default for weeks.
Opus / GPT-5.5 / Sonnet handling work a Haiku can close
IDE assistants and agent harnesses default to the strongest model in the account. Summarisation, lint fixes, doc-string generation, commit messages and trivial refactors then run at 5–10× the necessary unit cost. This is typically the single largest line item.
Full-file rewrites where a diff would do
Output tokens cost 2–8× input. Prompts that ask for "the updated file" instead of a unified diff, or that omit length and format constraints, blow the output side of the bill without improving the change. Most chat-style coding sessions are output-dominated.
Set max_tokens deliberately per task class, and prefer structured (JSON / tool-call) responses over prose where the consumer is code.
Agents re-reading the repo on every step
Coding agents that grep the whole tree, re-load files already in context, or retry failed tool calls without a stop condition can spend in a single task what a focused engineer spends in a week. Agent-attributed tokens are now the majority share on most projects we measure.
Loading 300k tokens of repo into a 1M window
Beyond roughly 100k tokens of live context, attention spreads thin and instruction-following degrades — the "smart zone" ceiling. Larger context produces more confused output and more reruns, and you pay input cost on every token of the haystack.
Provider dashboard is the only source of truth
Anthropic/OpenAI dashboards report cost, but not which repo, branch, agent, developer or task class drove it. Without that attribution, the team cannot self-correct — they cannot see what they did differently this week.
Output is the expensive side of every request
You pay for tokens in and tokens out, but not at the same rate. Generating a token takes a full sequential pass through the model, so output runs roughly two to eight times the price of input — and five to eight times across the models in our current stack. The cheapest way to cut a bill is rarely a smaller model. It is a shorter, more precise answer.
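A rough worked example of why a shorter answer moves the bill. The prices and token counts are placeholders, not any provider's real rates; the only figure taken from the text is an output premium inside the two-to-eight-times range (5× here). The function name `request_cost` is illustrative.

```python
# Hypothetical prices per million tokens; substitute your provider's actual rates.
INPUT_PRICE_PER_MTOK = 3.00
OUTPUT_PRICE_PER_MTOK = 15.00  # 5x input, inside the 2-8x range cited above


def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Cost of one request, in the same currency as the prices above."""
    return (
        input_tokens * INPUT_PRICE_PER_MTOK
        + output_tokens * OUTPUT_PRICE_PER_MTOK
    ) / 1_000_000


# Same 4,000-token prompt, two answer styles (token counts are illustrative).
full_file_rewrite = request_cost(input_tokens=4_000, output_tokens=6_000)
unified_diff = request_cost(input_tokens=4_000, output_tokens=400)

print(f"full-file rewrite: {full_file_rewrite:.4f}")
print(f"unified diff:      {unified_diff:.4f}")
```

With these placeholder numbers the input side is identical in both cases, and the whole difference comes from the length of the answer.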
Caching is the mirror image. Context the model has already seen (the system prompt, your coding standards, the brief) reads back at about a tenth of the price. So stable context is cached, not re-sent on every turn, and the input side of the bill shrinks to a fraction of what it was.
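The caching effect can be sketched the same way. This is a simplified model under stated assumptions: the base price and token counts are hypothetical, the first turn pays full price for the stable context, and any cache-write surcharge or cache expiry a provider applies is ignored. Only the one-tenth read rate comes from the text.

```python
INPUT_PRICE_PER_MTOK = 3.00   # hypothetical base input price
CACHED_READ_FACTOR = 0.10     # "about a tenth of the price", per the text above


def input_cost(stable_tokens: int, fresh_tokens: int, turns: int, cached: bool) -> float:
    """Input-side cost of a multi-turn session (turns >= 1).

    stable_tokens: context repeated every turn (system prompt, standards, brief)
    fresh_tokens:  new content per turn
    Simplification: the first turn pays full price for the stable context;
    cache-write surcharges and cache expiry are ignored.
    """
    per_token = INPUT_PRICE_PER_MTOK / 1_000_000
    if not cached:
        return turns * (stable_tokens + fresh_tokens) * per_token
    first_turn = (stable_tokens + fresh_tokens) * per_token
    later_turns = (turns - 1) * (
        stable_tokens * CACHED_READ_FACTOR + fresh_tokens
    ) * per_token
    return first_turn + later_turns


uncached = input_cost(stable_tokens=20_000, fresh_tokens=1_000, turns=30, cached=False)
cached = input_cost(stable_tokens=20_000, fresh_tokens=1_000, turns=30, cached=True)
print(f"uncached: {uncached:.3f}  cached: {cached:.3f}")
```

The larger the stable share of each turn, the more of the input bill the cache absorbs; the fresh tokens are always paid at the full rate.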
Models reason clearly only in a narrow band
A million-token context window is a storage claim, not a thinking claim. As context fills, attention spreads thinner and reasoning degrades. The engineer and educator Matt Pocock calls the reliable region the “smart zone” and puts its ceiling near 100k tokens — past that, a bigger window mostly ships more “dumb zone.” So we size tasks to fit the band: small, vertically-sliced units, research offloaded to sub-agents that hand back short summaries, and context curated for quality over volume.
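A minimal sketch of sizing context to fit the band before a task starts. The four-characters-per-token ratio is a crude heuristic rather than a real tokenizer, the file names are invented, and `estimate_tokens` and `select_context` are hypothetical helpers; only the 100k ceiling comes from the text.

```python
SMART_ZONE_CEILING = 100_000  # tokens; the ceiling quoted above
CHARS_PER_TOKEN = 4           # crude heuristic, not a real tokenizer


def estimate_tokens(text: str) -> int:
    """Very rough token estimate from character count."""
    return len(text) // CHARS_PER_TOKEN + 1


def select_context(snippets: list[tuple[str, str]], budget: int = SMART_ZONE_CEILING) -> list[str]:
    """Keep snippets, in priority order, while they fit the token budget.

    snippets: (name, text) pairs, most relevant first.
    Returns the names of the snippets that fit.
    """
    chosen, used = [], 0
    for name, text in snippets:
        cost = estimate_tokens(text)
        if used + cost > budget:
            continue  # skip what does not fit; a smaller item later may still fit
        chosen.append(name)
        used += cost
    return chosen


snippets = [
    ("billing/service.py", "x" * 200_000),  # roughly 50k tokens
    ("billing/models.py", "x" * 240_000),   # roughly 60k tokens, would overflow
    ("README.md", "x" * 40_000),            # roughly 10k tokens
]
print(select_context(snippets))
```

In practice the priority order would come from retrieval or from the engineer, and a real tokenizer would replace the character heuristic.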
Twelve levers every engineer operates under
These are configured into the tooling (assistant rules, agent harnesses, CI prompts, dashboards), not left to individual habit. They’re the standard we hold our own engineers to, and the practices we bring into any team we’re working with.
Right-size the model
Routing policy in the assistant config: Haiku / Flash / mini for summaries, refactors, lint fixes, commit messages. Sonnet / GPT-5 by default. Opus / GPT-5.5 / Codex behind explicit task-class opt-in.
industry 60–70%
Cache the stable context
System prompt, repo conventions, coding standards and architectural briefs marked as cacheable so they read back at ~1/10 input rate instead of being re-billed every turn.
up to ~90% on cached input
Generate less: force brevity
Diff-only output, deliberate max_tokens per task class, structured / tool-call responses where the consumer is code. Brevity skills like caveman and explicit "terse, no preamble, no recap" system prompts are switched on by default for routine work. Targets the 2–8× output premium directly.
Stay in the smart zone
Vertically-sliced tasks sized to <60k working context. Symbol-level retrieval (LSP, tree-sitter) over whole-file dumps. Fewer reruns, sharper diffs.
fewer reruns, higher quality
Sub-agents for research
Child agents handle repo exploration, doc search and log triage, returning short summaries to the main task. Protects the working window and the spend that goes with it.
preserves the window
Right tool, not always the model
Variable renames, file moves, formatting and mass find-and-replace go through deterministic IDE refactors, tsc, codemods or a one-shot generated script — not a multi-file model edit. Faster, cheaper, and verifiable by diff.
Focus on the right code
Every repo carries a project constitution linked from AGENTS.md that names the major modules, boundaries and entry points. The model is pointed at the relevant slice instead of grepping the tree. Repetitive flows are formalised as skills, locking that focus in.
Curate the skill library
Skills are reviewed like code. Anything not meant to auto-trigger is tagged disable-model-invocation: true in its frontmatter so it loads only on explicit call. Stops dormant skills from inflating every prompt.
Trim the active tool & MCP list
Every tool and MCP server enabled for a session ships its schema into the system prompt on every turn. Engineers disable connectors and MCPs that aren’t needed for the current task, keeping the active list short. Heavy catalogues sit behind deferred / on-demand loading rather than eager registration.
shorter system prompt
Session hygiene between tasks
When the next task does not benefit from the previous context, close the session and open a fresh one — or compress the conversation down to the decisions and artefacts that matter before pivoting. Stops context bloat from being silently re-billed turn after turn.
resets accumulated context
Batch the deferrable
Evals, bulk refactors, doc generation and codebase analyses run through batch APIs at ~50% off, not through interactive sessions.
~50% off
Measure, never compress blind
Every guardrail is validated against output quality — PR acceptance rate, rework rate, test pass rate — before it ships. Discipline that hurts delivery is not discipline.
quality gate
Stack the disciplines
The levers compound rather than simply add: each one acts on whatever spend the others have left.
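A small worked illustration of compounding. The three percentages are invented for the arithmetic and are not measured savings; each is treated as a fraction of the spend still remaining when that lever is applied, which is itself a simplifying assumption.

```python
from math import prod

# Hypothetical per-lever savings, each a fraction of the spend still left
# when the lever is applied. Not measured figures.
levers = {
    "right-size the model": 0.30,
    "force brevity": 0.20,
    "cache stable context": 0.10,
}


def compounded_saving(savings: list[float]) -> float:
    """Total fraction saved when each lever acts on the remaining spend."""
    remaining = prod(1.0 - s for s in savings)
    return 1.0 - remaining


naive_sum = sum(levers.values())                    # what simple addition would claim
stacked = compounded_saving(list(levers.values()))  # 1 - 0.7 * 0.8 * 0.9
print(f"added: {naive_sum:.3f}  compounded: {stacked:.3f}")
```

Because every lever after the first works on a smaller base, the stacked figure comes out below the simple sum, and the order in which the levers are applied does not change the result.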
What we configure into the tooling
Concrete, configurable changes prioritised by impact on the run-rate — whether the project is internal or a team we’re embedded with. Each one targets a measured driver of spend in the current model mix, not friction for its own sake.
Set default model routing
Configure the IDE assistants (GitHub Copilot policy, Claude Code settings, agentic tooling rules) to route summarisation, simple explanations, boilerplate and small edits to Haiku, Flash or mini. Reserve Sonnet, GPT-5.4, Codex, Opus and GPT-5.5 for genuine reasoning and architecture work.
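A routing policy of this kind could be sketched as a lookup with an opt-in gate. The task-class names and the three tier labels are illustrative, not settings from any particular tool; each tier would be mapped to whatever model identifiers your provider and assistant actually expose.

```python
# Task classes and tier names are illustrative; map tiers to the model
# identifiers your provider and tooling actually expose.
ROUTES = {
    "summarise": "small",
    "explain_simple": "small",
    "boilerplate": "small",
    "small_edit": "small",
    "feature_work": "default",
    "architecture": "frontier",
}
DEFAULT_TIER = "default"
OPT_IN_TIERS = {"frontier"}


def pick_tier(task_class: str, opted_in: bool = False) -> str:
    """Return the model tier for a task class.

    The frontier tier is granted only with an explicit opt-in;
    otherwise the request falls back to the default tier.
    """
    tier = ROUTES.get(task_class, DEFAULT_TIER)
    if tier in OPT_IN_TIERS and not opted_in:
        return DEFAULT_TIER
    return tier
```

For example, `pick_tier("summarise")` stays on the small tier, while `pick_tier("architecture")` returns the default tier until the caller passes `opted_in=True`. Unknown task classes also land on the default tier rather than the most expensive one.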
Control output volume
Standard prompt templates request concise answers, unified-diff / patch-only code, and explicitly forbid full-file rewrites unless asked. Brevity skills like caveman and "no preamble, no recap" system prompts are on by default for routine work. max_tokens set deliberately per task class. Targets the 5–8× output-token premium directly.
Use the right tool, not always the model
For variable / file renames, formatting, mass find-and-replace and mechanical refactors, reach for IDE refactors, tsc, codemods or a one-shot generated script — not a multi-file model edit. Cheaper by orders of magnitude, faster, and verifiable by diff.
Constrain agent scope
Agents operate on named files or scoped folders, must produce a plan before execution, and stop before broad repo scans. Tool-call depth and iteration count capped. Agent share of spend is the majority on most projects, so this is where governance pays back hardest.
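The caps can be pictured as a loop with explicit stop conditions. This is a sketch, not a real harness: `step` stands in for one plan, tool-call or model turn, and the limit values are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class AgentLimits:
    max_iterations: int = 8          # hypothetical cap on loop turns
    max_tokens_total: int = 50_000   # hypothetical per-task token budget


def run_agent(step: Callable[[int], tuple[bool, int]], limits: AgentLimits) -> str:
    """Drive an agent loop until it finishes or hits a limit.

    step(i) stands in for one agent turn and returns (done, tokens_used).
    Returns the reason the loop ended.
    """
    spent = 0
    for i in range(limits.max_iterations):
        done, tokens = step(i)
        spent += tokens
        if done:
            return "completed"
        if spent >= limits.max_tokens_total:
            return "stopped: token budget reached"
    return "stopped: iteration cap reached"


# A stand-in step that never finishes and burns 10k tokens per turn.
def runaway_step(i: int) -> tuple[bool, int]:
    return False, 10_000


print(run_agent(runaway_step, AgentLimits()))
```

The point of returning a reason rather than raising is that the stop can be logged and attributed to a workflow, which is what makes a runaway loop visible in the weekly review.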
Focus the model with a project constitution
Every repo carries a project constitution linked from AGENTS.md that names the major modules, boundaries and entry points — so the model is pointed at the relevant slice instead of reading the tree. Recurring flows are formalised as skills, locking that focus in across the team.
Manage the skill library
Skills are reviewed like code. Anything not meant to auto-trigger is tagged disable-model-invocation: true in its frontmatter, so it loads only on explicit call instead of inflating every system prompt.
Trim active tools & MCPs
Every enabled tool and MCP server ships its schema into the system prompt on every turn. Keep the active list short — disable connectors and MCPs that aren’t needed for the current task, and put large catalogues behind deferred / on-demand loading instead of eager registration.
Session hygiene between tasks
When the next task does not benefit from the previous context, close the session and start fresh — or compress the conversation to the decisions and artefacts that matter before pivoting. Stops accumulated context from being silently re-billed turn after turn.
Review auto-routing
Auto-routed traffic in tools like GitHub Copilot and agentic assistants is a material slice of spend. Decide per task class when auto may select expensive models, and when simple prompts must stay on cheaper routes. Re-tune monthly from the model-mix report.
The numbers the work is run against
Guardrails only hold if the spend is observable. We operate against an explicit per-workflow budget, a live measurement view shared with the engineering lead, and threshold alerts that fire before the line is crossed — not after the invoice lands.
Monthly budget, per workflow and per team
A hard ceiling per IDE seat group, agent harness and CI workflow — so every euro of spend has a named owner and a named task class.
Live telemetry: tokens, cost, model mix, agent share, cache-hit rate
Provider API calls tagged with engagement / workflow / actor metadata, surfaced in a shared dashboard refreshed daily — same view we use internally, same view the engineering lead sees.
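A sketch of how tagged records might roll up into the dashboard figures. The record shape, field names and values are hypothetical, and cache-hit rate is defined here as cached input tokens over total input tokens, which is an assumption rather than a provider definition.

```python
from collections import defaultdict

# Hypothetical usage records, shaped the way a tagged API-call log might look.
records = [
    {"workflow": "ci-review", "actor": "agent", "model": "small", "cost": 0.40,
     "input_tokens": 9_000, "cached_tokens": 6_000},
    {"workflow": "ci-review", "actor": "agent", "model": "frontier", "cost": 2.10,
     "input_tokens": 12_000, "cached_tokens": 3_000},
    {"workflow": "ide-chat", "actor": "developer", "model": "default", "cost": 1.50,
     "input_tokens": 9_000, "cached_tokens": 0},
]


def summarise(records: list[dict]) -> dict:
    """Roll tagged usage records up into headline dashboard figures."""
    cost_by_workflow = defaultdict(float)
    total_cost = agent_cost = 0.0
    input_tokens = cached_tokens = 0
    for r in records:
        cost_by_workflow[r["workflow"]] += r["cost"]
        total_cost += r["cost"]
        if r["actor"] == "agent":
            agent_cost += r["cost"]
        input_tokens += r["input_tokens"]
        cached_tokens += r["cached_tokens"]
    return {
        "cost_by_workflow": dict(cost_by_workflow),
        "agent_share": agent_cost / total_cost if total_cost else 0.0,
        "cache_hit_rate": cached_tokens / input_tokens if input_tokens else 0.0,
    }


print(summarise(records))
```

The same grouping extends to model mix or per-developer views by keying on a different tag; none of it is possible unless the tags are attached when the call is made.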
Threshold alerts at 50%, 75%, 90% of budget
Routed to Slack and email for the engagement lead and budget owner. Alerts also fire on 7-day moving-average deltas, so behavioural drift is caught before it hits the absolute ceiling.
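One way to read "7-day moving-average deltas" is to compare the latest weekly average with the average of the week before it. The sketch below does that; the 25% rise threshold and the function names are assumptions, not values from the text.

```python
def moving_average(values: list[float], window: int = 7) -> list[float]:
    """Trailing moving average; one value per full window."""
    return [
        sum(values[i - window:i]) / window
        for i in range(window, len(values) + 1)
    ]


def drift_alert(daily_spend: list[float], window: int = 7, max_rise: float = 0.25) -> bool:
    """True when the latest moving average exceeds the one a full window
    earlier by more than max_rise (a hypothetical 25% threshold)."""
    averages = moving_average(daily_spend, window)
    if len(averages) <= window:
        return False  # not enough history for two non-overlapping windows
    previous, latest = averages[-1 - window], averages[-1]
    return previous > 0 and (latest - previous) / previous > max_rise
```

With fourteen days of history, `drift_alert([10.0] * 7 + [14.0] * 7)` fires because the weekly average rose by 40%, while a flat `[10.0] * 14` does not, even though neither series need be anywhere near the monthly ceiling.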
Hard stop at 100% — explicit approval required to continue
Provider keys throttle or rotate at the line. Continuing past the budget is a deliberate, written decision — not a quiet overrun discovered at month-end.
Weekly review — run-rate, anomalies, top spenders
A 20-minute review against the budget line with the engineering lead. Anomalies (a new prompt pattern, an agent loop, a model upgrade) are tied back to a workflow and corrected the same week.
Example alert ladder
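A minimal sketch of the ladder as a function of month-to-date spend. The 50%, 75%, 90% and 100% rungs are the ones listed above; the function name, the returned strings and the idea of a single numeric budget per workflow are illustrative assumptions.

```python
THRESHOLDS_PCT = (50, 75, 90)  # alert rungs listed above
HARD_STOP_PCT = 100            # explicit approval required at or beyond this


def ladder_state(spend: float, budget: float) -> str:
    """Map month-to-date spend onto the alert ladder."""
    if budget <= 0:
        raise ValueError("budget must be positive")
    used_pct = 100 * spend / budget
    if used_pct >= HARD_STOP_PCT:
        return "hard stop: approval required"
    crossed = [t for t in THRESHOLDS_PCT if used_pct >= t]
    if crossed:
        return f"alert: {crossed[-1]}% of budget reached"
    return "ok"
```

A real deployment would also remember which rung has already fired so that each alert is sent once per month rather than on every check.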
We hold ourselves to the same standard
These principles are instrumented, not aspirational. Internally and on every project, we baseline usage, watch it weekly, set per-team budgets with alerts before the line, ship model-routing defaults and prompt templates into the tooling, and review token spend the way we review any other cost. Here is an anonymised read from our own most recent month.