8 West Consulting AI Delivery Guardrails
Best practice guide · AI delivery guardrails

Shipping with AI, responsibly. The practices we hold ourselves and our work to.

For a while, AI tooling inside the IDE felt effectively free. A flat seat fee, unlimited prompts, and no obvious downside to firing off another question. That window is closing.

GitHub Copilot now meters premium requests against a monthly allowance and charges for overages. Claude Code and the broader agentic tooling stack bill on usage from day one. And the underlying frontier models — the ones these tools quietly route to by default — keep getting more expensive per token, not less. A team that doubled its AI usage last year and barely noticed will feel the same behaviour on next quarter’s invoice.

That changes the job. Adopting AI is no longer the achievement; using it with discipline is. Knowing which model to route to, how much context to load, how long an answer needs to be, and when to stop an agent loop — these are now core engineering skills, not nice-to-haves. The rest of this guide is the playbook 8 West engineers work to, shared openly with the teams we build alongside, so AI delivery stays fast, predictable and worth what it costs.

Failure modes we see in the wild

Where dev teams quietly lose the budget

These are the patterns that show up in agent traces, IDE telemetry and provider dashboards across real-world delivery. Each one is a guardrail we configure against from day one — on our own work and any team we’re embedded with.

Risk 01 · Run-rate drift

Per-seat token burn climbs week-on-week unnoticed

Without per-developer and per-workflow attribution, the only visible signal is the monthly provider invoice. By then the new prompt patterns, longer agent loops or model upgrades that drove the drift have been the team’s default for weeks.

In practice: instrument usage at the API key / workspace level, attribute by IDE session and CI job, alert on 7-day moving-average deltas, not just absolute spend.
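
The moving-average alert described above can be sketched in a few lines. This is a minimal illustration, not a prescribed implementation: the function names, the 25% threshold and the daily-token input format are all assumptions made for the example.

```python
from statistics import mean


def moving_average_delta(daily_tokens: list[float], window: int = 7) -> float:
    """Relative change between the latest window's average and the one before it."""
    if len(daily_tokens) < 2 * window:
        raise ValueError("need at least two full windows of daily data")
    current = mean(daily_tokens[-window:])
    previous = mean(daily_tokens[-2 * window:-window])
    if previous == 0:
        raise ValueError("previous window has no usage to compare against")
    return (current - previous) / previous


def should_alert(daily_tokens: list[float], threshold: float = 0.25, window: int = 7) -> bool:
    # threshold is a hypothetical value: alert when the average grows by more than 25%.
    return moving_average_delta(daily_tokens, window) > threshold
```

Comparing window against window catches a gradual climb that never trips an absolute spend limit on any single day.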
Risk 02 · Default-to-frontier routing

Opus / GPT-5.5 / Sonnet handling work a Haiku can close

IDE assistants and agent harnesses often default to the strongest model available in the account. Summarisation, lint fixes, doc-string generation, commit messages and trivial refactors then run at 5–10× the necessary unit cost. This is typically the single largest line item.

In practice: set a model-routing policy in the assistant config (GitHub Copilot org policy, Claude Code settings, agentic tooling rules); cap frontier models behind explicit opt-in or task class; review the model-mix report weekly.
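
A routing policy of this kind reduces to a lookup table plus an opt-in gate. The sketch below assumes hypothetical task-class names and three generic tiers ("small", "standard", "frontier"); real assistant configs express this in their own settings format, so treat it as an illustration of the rule only.

```python
# Hypothetical task classes mapped to generic model tiers.
ROUTING_POLICY = {
    "summarise": "small",
    "lint_fix": "small",
    "docstring": "small",
    "commit_message": "small",
    "feature_work": "standard",
    "architecture": "frontier",
}


def route(task_class: str, frontier_opt_in: bool = False) -> str:
    """Return the tier for a task; frontier is only granted on explicit opt-in."""
    tier = ROUTING_POLICY.get(task_class, "standard")  # unknown work gets the default tier
    if tier == "frontier" and not frontier_opt_in:
        return "standard"
    return tier
```

Under these assumptions, `route("lint_fix")` stays on the small tier, and `route("architecture")` only reaches the frontier tier when `frontier_opt_in=True` is passed.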
Risk 03 · Output-token bloat

Full-file rewrites where a diff would do

Output tokens cost 2–8× input. Prompts that ask for "the updated file" instead of a unified diff, or that omit length and format constraints, blow the output side of the bill without improving the change. Most chat-style coding sessions are output-dominated.

In practice: enforce diff/patch output in system prompts and CI-generated PR descriptions, set max_tokens deliberately per task class, prefer structured (JSON / tool-call) responses over prose where the consumer is code.
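
To make the size difference concrete, the sketch below uses Python's standard `difflib` to compare a one-line change expressed as a unified diff with the same change expressed as a full rewritten file. The 200-line file and the path `example.py` are made up for the illustration.

```python
import difflib


def as_unified_diff(before: str, after: str, path: str = "example.py") -> str:
    """Render the change between two versions of a file as a unified diff."""
    diff = difflib.unified_diff(
        before.splitlines(keepends=True),
        after.splitlines(keepends=True),
        fromfile=f"a/{path}",
        tofile=f"b/{path}",
    )
    return "".join(diff)


# A hypothetical 200-line file with a single edited line.
before = "".join(f"line {i}\n" for i in range(1, 201))
after = before.replace("line 100\n", "line one hundred\n")
patch = as_unified_diff(before, after)

print(len(after), "characters for the full rewrite")
print(len(patch), "characters for the diff")
```

The diff carries only the changed line plus a few lines of context, so the output side of the request scales with the size of the change rather than the size of the file.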
Risk 04 · Unbounded agent loops

Agents re-reading the repo on every step

Coding agents that grep the whole tree, re-load files already in context, or retry failed tool calls without a stop condition can spend in a single task what a focused engineer spends in a week. Agent-attributed tokens are now the majority share on most projects we measure.

In practice: scope agents to named paths, require a plan-then-execute pattern, cap tool-call depth and iteration count, prefer sub-agents that return summaries over a single agent that hoards context.
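
The iteration and retry caps can be pictured as a wrapper around the agent's step function. This is a sketch under stated assumptions: the `step` callback and its "done" / "continue" / "failed" return values are a hypothetical protocol, and the default limits are placeholders rather than recommended values.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class LoopBudget:
    max_iterations: int = 8            # hypothetical cap on agent steps
    max_consecutive_failures: int = 2  # hypothetical cap on back-to-back failed tool calls


def run_bounded(step: Callable[[int], str], budget: LoopBudget) -> str:
    """Run step(i) until it reports "done" or one of the stop conditions trips."""
    failures = 0
    for i in range(budget.max_iterations):
        outcome = step(i)
        if outcome == "done":
            return f"done after {i + 1} steps"
        if outcome == "failed":
            failures += 1
            if failures >= budget.max_consecutive_failures:
                return f"stopped: {failures} consecutive failures"
        else:
            failures = 0  # a successful step resets the failure streak
    return f"stopped: iteration cap of {budget.max_iterations} reached"
```

The point of the design is that every exit path is explicit: the loop either finishes, gives up on repeated failures, or hits the cap, and each case is reported.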
Risk 05 · Context-window abuse

Loading 300k tokens of repo into a 1M window

Beyond roughly 100k tokens of live context, attention spreads thin and instruction-following degrades — the "smart zone" ceiling. Larger context produces more confused output and more reruns, and you pay input cost on every token of the haystack.

In practice: vertical-slice tasks to fit <60k of working context, push retrieval/search out to sub-agents, prefer symbol-level context (LSP, tree-sitter) over whole-file dumps, cache the stable parts (system prompt, standards, conventions).
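
Keeping a task under a working-context budget can be sketched as a greedy packing step. Everything here is an assumption for illustration: the four-characters-per-token estimate is a rough heuristic rather than a tokenizer, and the relevance scores would come from whatever retrieval step the team uses.

```python
def estimate_tokens(text: str) -> int:
    # Rough heuristic (assumption): about 4 characters per token.
    return max(1, len(text) // 4)


def pack_context(snippets: list[tuple[float, str]], budget_tokens: int = 60_000) -> list[str]:
    """Keep the most relevant snippets that fit the budget.

    snippets are (relevance, text) pairs; anything that would overflow is skipped.
    """
    chosen: list[str] = []
    used = 0
    for _relevance, text in sorted(snippets, key=lambda s: s[0], reverse=True):
        cost = estimate_tokens(text)
        if used + cost > budget_tokens:
            continue
        chosen.append(text)
        used += cost
    return chosen
```

Snippets that do not fit are dropped rather than truncated, on the assumption that a partial symbol definition is more likely to confuse the model than to help it.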
Risk 06 · No live telemetry

Provider dashboard is the only source of truth

Anthropic/OpenAI dashboards report cost, but not which repo, branch, agent, developer or task class drove it. Without that attribution, the team cannot self-correct — they cannot see what they did differently this week.

In practice: tag every API call with project / workflow / actor metadata, ship a shared dashboard (tokens, cost, model mix, agent share, cache-hit rate), review run-rate against budget weekly with the engineering lead.
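
Once each call carries project / workflow / actor metadata, attribution is a simple group-by. The record shape below is hypothetical: field names and the example values in the usage note are invented for the sketch, and a real pipeline would read them from logged API responses.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class UsageRecord:
    project: str
    workflow: str
    actor: str
    model: str
    input_tokens: int
    output_tokens: int
    cost: float


def cost_by(records: list[UsageRecord], field: str) -> dict[str, float]:
    """Sum cost per distinct value of one metadata field, e.g. "workflow" or "actor"."""
    totals: dict[str, float] = defaultdict(float)
    for record in records:
        totals[getattr(record, field)] += record.cost
    return dict(totals)
```

Calling `cost_by(records, "workflow")` or `cost_by(records, "model")` on the same records gives the per-workflow and model-mix views that a provider dashboard alone cannot produce.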
The number that drives the bill

Output is the expensive side of every request

You pay for tokens in and tokens out, but not at the same rate. Generating a token takes a full sequential pass through the model, so output runs roughly two to eight times the price of input — and five to eight times across the models in our current stack. The cheapest way to cut a bill is rarely a smaller model. It is a shorter, more precise answer.

[Figure: relative token pricing. Input + context is sent once; a cached read costs about 1/10 of the input rate; generated output is priced 2–8× higher, token by token.]
As the answer gets longer, the output side grows and the bill grows with it. Shortening the output (diff-only edits, concise answers, structured formats with hard length limits) is the single highest-leverage saving available.

Caching is the mirror image. Context the model has already seen (the system prompt, your coding standards, the brief) reads back at about a tenth of the price. So stable context is cached, not re-sent at full price on every turn, and the repeated part of the input bill shrinks to roughly a tenth.
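
The arithmetic behind both effects fits in one function. The sketch below prices everything in "input-token equivalents" (1.0 is the price of one uncached input token) and assumes a 5× output multiplier and a 1/10 cached-read rate, in line with the ranges quoted above; the token counts in the usage note are illustrative, not measurements.

```python
def answer_cost(
    input_tokens: int,
    output_tokens: int,
    output_multiplier: float = 5.0,  # assumed: output priced at 5x input
    cached_fraction: float = 0.0,    # share of the input served from cache
    cache_read_rate: float = 0.1,    # assumed: cached reads cost ~1/10 of input
) -> float:
    """Cost of one answer in input-token equivalents."""
    if not 0.0 <= cached_fraction <= 1.0:
        raise ValueError("cached_fraction must be between 0 and 1")
    cached = input_tokens * cached_fraction
    fresh = input_tokens - cached
    return fresh + cached * cache_read_rate + output_tokens * output_multiplier
```

With 4,000 input tokens and an 800-token answer, the output side costs as much as the whole input (800 × 5 = 4,000). Cutting the answer to 200 tokens removes 3,000 units; caching three quarters of the input removes 2,700.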

Context discipline · after Matt Pocock

Models reason clearly only in a narrow band

A million-token context window is a storage claim, not a thinking claim. As context fills, attention spreads thinner and reasoning degrades. The engineer and educator Matt Pocock calls the reliable region the “smart zone” and puts its ceiling near 100k tokens — past that, a bigger window mostly ships more “dumb zone.” So we size tasks to fit the band: small, vertically-sliced units, research offloaded to sub-agents that hand back short summaries, and context curated for quality over volume.

Discipline here is not only cheaper, it is higher quality. A focused 40k-token task beats a sprawling 300k-token one on both counts.
The 8 West delivery guardrails

Eleven levers every engineer operates under

These are configured into the tooling (assistant rules, agent harnesses, CI prompts, dashboards), not left to individual habit. They’re the standard we hold our own engineers to, and the practices we bring into any team we’re working with. Stacked together, the savings compound against a monthly budget.

01

Right-size the model

Routing policy in the assistant config: Haiku / Flash / mini for summaries, refactors, lint fixes, commit messages. Sonnet / GPT-5 by default. Opus / GPT-5.5 / Codex behind explicit task-class opt-in.

industry 60–70%
02

Cache the stable context

System prompt, repo conventions, coding standards and architectural briefs marked as cacheable so they read back at ~1/10 input rate instead of being re-billed every turn.

up to ~90% on cached input
03

Generate less — force brevity

Diff-only output, deliberate max_tokens per task class, structured / tool-call responses where the consumer is code. Brevity skills like caveman and explicit "terse, no preamble, no recap" system prompts are switched on by default for routine work. Targets the 2–8× output premium directly.

output is 2–8× input
04

Stay in the smart zone

Vertically-sliced tasks sized to <60k working context. Symbol-level retrieval (LSP, tree-sitter) over whole-file dumps. Fewer reruns, sharper diffs.

fewer reruns, higher quality
05

Sub-agents for research

Child agents handle repo exploration, doc search and log triage, returning short summaries to the main task. Protects the working window and the spend that goes with it.

preserves the window
06

Right tool, not always the model

Variable renames, file moves, formatting and mass find-and-replace go through deterministic IDE refactors, tsc, codemods or a one-shot generated script — not a multi-file model edit. Faster, cheaper, and verifiable by diff.

$0 vs. multi-file edit
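
As an illustration of the "one-shot generated script" option, a whole-word identifier rename is a few lines of standard-library Python. This is a deliberately simple sketch: it works on raw text, so it will also rename matches inside strings and comments, and the identifier names in the usage note are hypothetical. A language-aware IDE refactor or codemod is the safer choice where one exists.

```python
import re


def rename_identifier(source: str, old: str, new: str) -> tuple[str, int]:
    """Replace whole-word occurrences of old with new; return (text, replacement count)."""
    pattern = re.compile(rf"\b{re.escape(old)}\b")
    # A function replacement keeps `new` literal even if it contains backslashes.
    return pattern.subn(lambda _match: new, source)
```

For example, renaming `userId` to `accountId` in `userId = getUserId(userId)` changes the two variable references and leaves `getUserId` alone, because the word-boundary match is case-sensitive and whole-word only. The result is deterministic and reviewable as an ordinary diff.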
07

Focus on the right code

Every repo carries a project constitution linked from AGENTS.md that names the major modules, boundaries and entry points. The model is pointed at the relevant slice instead of grepping the tree. Repetitive flows are formalised as skills, locking that focus in.

less context, fewer tokens
08

Curate the skill library

Skills are reviewed like code. Anything not meant to auto-trigger is tagged disable-model-invocation: true in its frontmatter so it loads only on explicit call. Stops dormant skills from inflating every prompt.

smaller system prompt
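
A skill-library review can include a quick audit of which skills are still allowed to auto-trigger. The sketch below is an assumption-heavy illustration: it reads only flat `key: value` frontmatter lines (it is not a YAML parser), and the skill names and contents in the usage note are hypothetical.

```python
def parse_frontmatter(text: str) -> dict[str, str]:
    """Read flat key: value pairs from a leading '---' block (simplified, not full YAML)."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}
    meta: dict[str, str] = {}
    for line in lines[1:]:
        if line.strip() == "---":
            break
        key, sep, value = line.partition(":")
        if sep:
            meta[key.strip()] = value.strip()
    return meta


def auto_triggerable(skills: dict[str, str]) -> list[str]:
    """Names of skills that do NOT set disable-model-invocation: true."""
    return sorted(
        name
        for name, text in skills.items()
        if parse_frontmatter(text).get("disable-model-invocation", "false").lower() != "true"
    )
```

Given a hypothetical `deploy` skill tagged `disable-model-invocation: true` and an untagged `review` skill, `auto_triggerable` returns only `review`, which is the list a reviewer would then check for skills that should not load by default.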
09

Trim the active tool & MCP list

Every tool and MCP server enabled for a session ships its schema into the system prompt on every turn. Engineers disable connectors and MCPs that aren’t needed for the current task, keeping the active list short. Heavy catalogues sit behind deferred / on-demand loading rather than eager registration.

shorter system prompt
10

Session hygiene between tasks

When the next task does not benefit from the previous context, close the session and open a fresh one — or compress the conversation down to the decisions and artefacts that matter before pivoting. Stops context bloat from being silently re-billed turn after turn.

resets accumulated context
11

Batch the deferrable

Evals, bulk refactors, doc generation and codebase analyses run through batch APIs at ~50% off, not through interactive sessions.

~50% off
+

Measure, never compress blind

Every guardrail is validated against output quality — PR acceptance rate, rework rate, test pass rate — before it ships. Discipline that hurts delivery is not discipline.

quality gate

Stack the disciplines

The levers compound rather than simply add: each one acts on the spend that remains after the others have been applied.
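
Compounding can be checked with a short calculation. The sketch assumes each lever removes a fixed fraction of whatever spend is left, which is a simplification (real levers act on different parts of the bill); the baseline and the three saving rates in the usage note are purely illustrative.

```python
from math import prod


def compounded_spend(baseline: float, savings: list[float]) -> float:
    """Spend left after applying each fractional saving to the remaining spend."""
    if any(not 0.0 <= s <= 1.0 for s in savings):
        raise ValueError("each saving must be a fraction between 0 and 1")
    return baseline * prod(1.0 - s for s in savings)
```

With an illustrative baseline of 12,000 and hypothetical savings of 30%, 20% and 10%, the remaining spend is 12,000 × 0.7 × 0.8 × 0.9 = 6,048, a 49.6% reduction. Simply adding the three percentages would have suggested 60%.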

Cost-saving actions, in practice

What we configure into the tooling

Concrete, configurable changes prioritised by impact on the run-rate — whether the project is internal or a team we’re embedded with. Each one targets a measured driver of spend in the current model mix, not friction for its own sake.

High impact

Set default model routing

Configure the IDE assistants (GitHub Copilot policy, Claude Code settings, agentic tooling rules) to route summarisation, simple explanations, boilerplate and small edits to Haiku, Flash or mini. Reserve Sonnet, GPT-5.4, Codex, Opus and GPT-5.5 for genuine reasoning and architecture work.

High impact

Control output volume

Standard prompt templates request concise answers and unified-diff / patch-only code, and explicitly forbid full-file rewrites unless asked. Brevity skills like caveman and "no preamble, no recap" system prompts are on by default for routine work. max_tokens is set deliberately per task class. This targets the 5–8× output-token premium directly.

High impact

Use the right tool, not always the model

For variable / file renames, formatting, mass find-and-replace and mechanical refactors, reach for IDE refactors, tsc, codemods or a one-shot generated script — not a multi-file model edit. Cheaper by orders of magnitude, faster, and verifiable by diff.

High impact

Constrain agent scope

Agents operate on named files or scoped folders, must produce a plan before execution, and stop before broad repo scans. Tool-call depth and iteration count are capped. Agent share of spend is the majority on most projects, so this is where governance pays back hardest.

High impact

Focus the model with a project constitution

Every repo carries a project constitution linked from AGENTS.md that names the major modules, boundaries and entry points — so the model is pointed at the relevant slice instead of reading the tree. Recurring flows are formalised as skills, locking that focus in across the team.

Medium impact

Manage the skill library

Skills are reviewed like code. Anything not meant to auto-trigger is tagged disable-model-invocation: true in its frontmatter, so it loads only on explicit call instead of inflating every system prompt.

Medium impact

Trim active tools & MCPs

Every enabled tool and MCP server ships its schema into the system prompt on every turn. Keep the active list short — disable connectors and MCPs that aren’t needed for the current task, and put large catalogues behind deferred / on-demand loading instead of eager registration.

Medium impact

Session hygiene between tasks

When the next task does not benefit from the previous context, close the session and start fresh — or compress the conversation to the decisions and artefacts that matter before pivoting. Stops accumulated context from being silently re-billed turn after turn.

Medium impact

Review auto-routing

Auto-routed traffic in tools like GitHub Copilot and agentic assistants is a material slice of spend. Decide per task class when auto may select expensive models, and when simple prompts must stay on cheaper routes. Re-tune monthly from the model-mix report.

Budgets · measurement · alerts

The numbers the work is run against

Guardrails only hold if the spend is observable. We operate against an explicit per-workflow budget, a live measurement view shared with the engineering lead, and threshold alerts that fire before the line is crossed — not after the invoice lands.

Monthly budget, per workflow and per team

A hard ceiling per IDE seat group, agent harness and CI workflow — so every euro of spend has a named owner and a named task class.

Live

Live telemetry: tokens, cost, model mix, agent share, cache-hit rate

Provider API calls tagged with engagement / workflow / actor metadata, surfaced in a shared dashboard refreshed daily — same view we use internally, same view the engineering lead sees.

Live

Threshold alerts at 50%, 75%, 90% of budget

Alerts are routed to Slack and email for the engagement lead and budget owner. They also fire on 7-day moving-average deltas, so behavioural drift is caught before spend reaches the absolute ceiling.

Live

Hard stop at 100% — explicit approval required to continue

Provider keys throttle or rotate at the line. Continuing past the budget is a deliberate, written decision — not a quiet overrun discovered at month-end.

Live

Weekly review — run-rate, anomalies, top spenders

A 20-minute review against the budget line with the engineering lead. Anomalies (a new prompt pattern, an agent loop, a model upgrade) are tied back to a workflow and corrected the same week.

Weekly

Example alert ladder

Monthly budget: €8,000 / month
Notice (50%): €4,000
Warning (75%): €6,000
Critical (90%): €7,200
Hard stop (100%): €8,000
Without this instrumentation, the first signal of an overrun is the provider invoice — weeks after the behaviour that caused it became the team’s default. That is not how 8 West engineers work.
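
The example alert ladder above maps directly onto a threshold lookup. This is a minimal sketch: the level names mirror the ladder, while the function name and the idea of returning a single level string are assumptions made for the example.

```python
from typing import Optional

# Thresholds from the example ladder, highest first so the first match wins.
ALERT_LADDER = [
    (1.00, "hard_stop"),
    (0.90, "critical"),
    (0.75, "warning"),
    (0.50, "notice"),
]


def alert_level(spend: float, budget: float) -> Optional[str]:
    """Return the highest alert level reached, or None below the first threshold."""
    if budget <= 0:
        raise ValueError("budget must be positive")
    ratio = spend / budget
    for threshold, level in ALERT_LADDER:
        if ratio >= threshold:
            return level
    return None
```

Against the €8,000 example budget, a spend of €6,000 returns `"warning"` and €7,200 returns `"critical"`; anything under €4,000 returns `None`.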
Discipline in practice

We hold ourselves to the same standard

These principles are instrumented, not aspirational. Internally and on every project, we baseline usage, watch it weekly, set per-team budgets with alerts before the line, ship model-routing defaults and prompt templates into the tooling, and review token spend the way we review any other cost. Here is an anonymised read from our own most recent month.

~⅔ of spend sat in just the top three models, so routing effort is aimed there first.
5–8× output-to-input cost across our active model mix, which is why we govern answer length.
~77% of AI-changed lines came from agents, so agent scope and stop conditions are governed.
67% of the monthly allowance used, with zero overage: run-rate held by the disciplines above.