
The Single-Vendor Ceiling

We began Gysho’s R&D and early projects with a single cloud provider. OpenAI offered the strongest reasoning and instruction-following capabilities at the time, and standardizing on one API let us ship fast. Over three years, however, the trade-offs crystallized into operational friction.

Vendor lock-in becomes a structural risk. Pricing changes, token-rate revisions, and model retirements arrive without alternative paths. Engineering teams had organized their prompting patterns and output parsers around one schema, which made even the prospect of migration expensive and painful. More importantly, we hit a hard ceiling on data control. Every prompt traveled outside our network, so sensitive context had to be carefully filtered or omitted entirely. Rising inference volumes translated directly into rising bills, with no lever to pull except usage restrictions.

Analysts now describe this inflection point as inevitable.

Gartner notes that 2026 is the year AI sovereignty shifts from preference to mandate, with fragmentation into regional blocs already underway. The Deloitte State of AI 2026 survey of 3,235 leaders confirms the pattern: 83 percent view sovereign AI as strategically important, 77 percent factor a vendor's country of origin into selection decisions, and 73 percent cite data privacy and security as the top AI risk. Yet only 21 percent report mature governance models, and 58 percent are building primarily with local vendors.

We reached these conclusions earlier than many, but we were not alone.

01 | Three Requirements for Change

We defined three non-negotiable outcomes before writing any routing code.

  1. Vendor and Model Agnosticism. We would never again be forced into a painful migration because one provider changed terms or fell behind on capability.

  2. Cost Control. We needed to place suitable workloads on internal compute alongside cloud inference rather than treating every token as a billable event.

  3. Privacy and Data Control. Sensitive processing had to stay inside our network unless we explicitly chose to route it outward.

Deloitte's research quantified the economic pressure we felt. Some enterprises now face tens of millions in monthly AI compute spend, and the tipping point for on-prem investment arrives when cloud costs hit 60 to 70 percent of equivalent hardware costs. We were approaching that threshold on certain workloads. We also noted that analyst estimates commonly associate intelligent routing across heterogeneous infrastructure with OpEx reductions of 20 to 80 percent. We intended to capture that efficiency without sacrificing engineering velocity.



02 | The Balancer: A Single Endpoint for Every Model

The result is the Balancer, a single-tenant LLM router running on a small Azure VM. It took approximately six weeks to build and deploy.

The Balancer exposes one OpenAI-compatible endpoint to every calling application:

/[tier]/v1/chat/completions

Behind that path, the system routes requests to any cloud provider or any local model. Callers configure three tier URLs and need zero code changes when we swap providers, rotate API keys, or add local inference workers.
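For callers, nothing changes relative to a stock OpenAI integration. A minimal sketch, assuming the official openai Node SDK; the hostname, environment variable, and model alias are illustrative:

```typescript
import OpenAI from "openai";

// The tier is baked into the base URL; the SDK appends /chat/completions.
const client = new OpenAI({
  baseURL: "https://balancer.internal.example/mini/v1",
  apiKey: process.env.BALANCER_API_KEY!,
});

const completion = await client.chat.completions.create({
  model: "mini", // resolved by the Balancer to a concrete upstream model
  messages: [{ role: "user", content: "Summarize this ticket in one line." }],
});

console.log(completion.choices[0].message.content);
```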

Inbound traffic arrives in standard OpenAI Chat Completions format. The Balancer translates and normalizes the request into whatever the upstream provider expects — Anthropic Messages, Azure Responses, or another proprietary schema — then normalizes the response back into OpenAI-format JSON before returning it to the caller. Applications speak one dialect. The infrastructure speaks many. This translation layer is what makes downstream agnosticism possible without rewriting client code.
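A simplified sketch of that translation for the Anthropic case, with abbreviated type shapes and without streaming or tool calls; the function names are ours for illustration, not the Balancer's internals:

```typescript
type OpenAIMessage = { role: "system" | "user" | "assistant"; content: string };

// OpenAI Chat Completions request -> Anthropic Messages request.
function toAnthropic(body: { messages: OpenAIMessage[]; max_tokens?: number }) {
  // Anthropic takes the system prompt as a top-level field, not a message.
  const system = body.messages.find((m) => m.role === "system")?.content;
  return {
    system,
    max_tokens: body.max_tokens ?? 1024, // required by the Messages API
    messages: body.messages.filter((m) => m.role !== "system"),
  };
}

// Anthropic Messages response -> OpenAI Chat Completions response.
function toOpenAIResponse(res: {
  content: { type: string; text?: string }[];
  stop_reason: string;
  usage: { input_tokens: number; output_tokens: number };
}) {
  const text = res.content.map((c) => c.text ?? "").join("");
  return {
    choices: [{
      index: 0,
      message: { role: "assistant", content: text },
      finish_reason: res.stop_reason === "max_tokens" ? "length" : "stop",
    }],
    usage: {
      prompt_tokens: res.usage.input_tokens,
      completion_tokens: res.usage.output_tokens,
      total_tokens: res.usage.input_tokens + res.usage.output_tokens,
    },
  };
}
```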


 

03 | Tiered Routing and Transparent Fallback

We simplified capacity planning by collapsing model selection into three tiers: frontier, mini, and nano.

The Balancer uses tier-or-higher routing. A worker capable of serving the mini tier can also handle nano requests, but it will not accept frontier workloads. This lets us match capability to demand without exposing model names or provider details to the calling services. An agent requesting nano inference might land on a local four-billion-parameter model or a cloud micro-instance; the caller does not know, and does not need to know.
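As a sketch of the eligibility rule, with illustrative names rather than the Balancer's real internals:

```typescript
// Tier-or-higher eligibility: a worker serves its own tier and every tier below.
const TIER_RANK = { nano: 0, mini: 1, frontier: 2 } as const;
type Tier = keyof typeof TIER_RANK;

interface Worker {
  id: string;
  maxTier: Tier;    // highest tier this worker is rated for
  healthy: boolean; // updated by the periodic health checks
}

function eligibleWorkers(requested: Tier, pool: Worker[]): Worker[] {
  // A mini-rated worker can serve nano, but never frontier.
  return pool.filter(
    (w) => w.healthy && TIER_RANK[w.maxTier] >= TIER_RANK[requested]
  );
}
```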

Cloud fallback is handled inside the Balancer, not by the callers. If a local worker drops offline or a regional endpoint throttles traffic, the router retries against the next eligible upstream according to a declarative priority list. Applications receive a consistent response format regardless of which provider ultimately served the request. The failover is invisible.
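The fallback loop itself is short. A hedged sketch, where forwardTo stands in for the actual proxying and priority comes from the declarative list (lower number wins):

```typescript
// Transparent fallback: try each eligible upstream in declared priority order.
// forwardTo is a placeholder for the real proxying logic, not Balancer internals.
async function routeWithFallback(
  request: unknown,
  upstreams: { name: string; priority: number }[],
  forwardTo: (name: string, req: unknown) => Promise<Response>
): Promise<Response> {
  const ordered = [...upstreams].sort((a, b) => a.priority - b.priority);
  let lastError: unknown;
  for (const upstream of ordered) {
    try {
      return await forwardTo(upstream.name, request); // first success wins
    } catch (err) {
      lastError = err; // offline worker or throttled endpoint: fall through
    }
  }
  throw lastError ?? new Error("no eligible upstream");
}
```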

 

 

04 | The Local Inference Pool

Cloud inference is not always necessary. We built a local pool from idle team laptops — Apple Silicon Macs running LM Studio and Linux boxes running Ollama — connected to the Balancer through a lightweight Electron agent.
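Because both LM Studio and Ollama can expose OpenAI-compatible HTTP servers, a laptop is just another upstream. A sketch of liveness probing under that assumption, using the tools' default ports and hypothetical worker names:

```typescript
// Local workers registered with the Balancer (names and tiers illustrative).
const LOCAL_WORKERS = [
  { name: "macbook-lmstudio", baseURL: "http://localhost:1234/v1", maxTier: "nano" },
  { name: "linux-ollama", baseURL: "http://localhost:11434/v1", maxTier: "mini" },
];

// A worker counts as alive if its OpenAI-compatible /models route responds.
async function isAlive(baseURL: string): Promise<boolean> {
  try {
    const res = await fetch(`${baseURL}/models`, {
      signal: AbortSignal.timeout(2_000), // laptops come and go; fail fast
    });
    return res.ok;
  } catch {
    return false;
  }
}
```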

These machines run open-weight models from the Gemma family. On short prompts, a 4-billion-parameter Gemma instance on a MacBook Air averaged 1.32 seconds to first token. For longer outputs of roughly 500 tokens, the same local setup was up to eight times faster than a comparable cloud mini model, with zero API cost.

The economics are direct. Every request served locally is a request that incurs no external usage charge. For high-volume, low-context tasks — classification, summarization, internal tool orchestration — this eliminates API cost for a significant share of traffic. Deloitte's observation that on-prem investment becomes rational when cloud prices reach 60 to 70 percent of equivalent hardware costs guided our thinking here, but we pushed further by using existing hardware rather than provisioning a new server farm.
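As a rough illustration of that tipping-point rule, with every figure hypothetical:

```typescript
// Hypothetical numbers only: the rule of thumb says on-prem is worth evaluating
// once recurring cloud spend passes ~60-70% of amortized hardware cost.
const monthlyCloudSpend = 8_000;              // USD on routable workloads
const hardwareOutlay = 120_000;               // USD for equivalent servers
const amortizationMonths = 36;
const monthlyHardwareCost = hardwareOutlay / amortizationMonths; // ~3,333 USD

const ratio = monthlyCloudSpend / monthlyHardwareCost;           // 2.4 here
console.log(ratio >= 0.65 ? "evaluate on-prem" : "stay in cloud");
```

In our case, repurposed laptops pushed the hardware side of that ratio close to zero, which made local serving attractive immediately.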

Forrester's Predictions 2026 report suggests that at least 15 percent of enterprises will seek private AI options and that on-premises servers will capture 50 percent of server share. BCG and Benedict Evans argue that vendor fragmentation will force enterprises to compose "agentlakes" — composable agent architectures spanning multi-vendor AI deployments. Our local pool is the inference layer of that composition. It is not a rejection of cloud; it is a geographically distributed, cost-aware extension of it.

 

05 | Operations Through Configuration 

Managing multiple upstream providers through hand-maintained configuration files does not scale. We replaced that approach with a web-based administrative interface called LLM Profiles.

The UI lets us switch providers, endpoints, models, and API keys declaratively. A profile change propagates to the Balancer without a deploy or a restart. Callers continue hitting the same tiered URLs while the back end shifts underneath them.
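The shape of a profile, with hypothetical field names, is roughly:

```typescript
// Illustrative profile shape; the real LLM Profiles schema may differ.
interface LlmProfile {
  tier: "frontier" | "mini" | "nano";
  provider: "openai" | "anthropic" | "azure" | "local";
  endpoint: string;   // upstream base URL
  model: string;      // concrete model the tier resolves to
  apiKeyRef: string;  // reference to an encrypted key, never the key itself
  priority: number;   // fallback order within the tier (lower wins)
}

// Swapping a tier's provider is a data change, not a deploy:
const miniTier: LlmProfile = {
  tier: "mini",
  provider: "anthropic",
  endpoint: "https://api.anthropic.com",
  model: "claude-3-5-haiku-latest",
  apiKeyRef: "vault:anthropic-prod",
  priority: 1,
};
```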

Security and observability are built in, not bolted on. API keys are encrypted at rest with AES-256-GCM. An append-only audit log records every routing decision, model selection, and configuration change. Health checks run every five minutes against every registered upstream, including local workers, so the Balancer knows which nodes are alive before it routes traffic to them.
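The encryption step uses nothing exotic. A sketch with Node's built-in crypto module, leaving master-key management out of scope:

```typescript
import { createCipheriv, createDecipheriv, randomBytes } from "node:crypto";

// masterKey must be exactly 32 bytes for AES-256.
function encryptKey(plaintext: string, masterKey: Buffer) {
  const iv = randomBytes(12); // 96-bit nonce, the recommended size for GCM
  const cipher = createCipheriv("aes-256-gcm", masterKey, iv);
  const ciphertext = Buffer.concat([cipher.update(plaintext, "utf8"), cipher.final()]);
  return { iv, ciphertext, tag: cipher.getAuthTag() }; // tag authenticates the record
}

function decryptKey(rec: { iv: Buffer; ciphertext: Buffer; tag: Buffer }, masterKey: Buffer) {
  const decipher = createDecipheriv("aes-256-gcm", masterKey, rec.iv);
  decipher.setAuthTag(rec.tag); // a tampered record fails at final()
  return Buffer.concat([decipher.update(rec.ciphertext), decipher.final()]).toString("utf8");
}
```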

 

06 | Our Mixed Estate Today

Today we run a heterogeneous AI estate that would be chaotic without the abstraction layer above it.

Claude powers our coding workflows. OpenAI remains in application workflows where its reasoning style fits best. Local open-weight models handle nano and mini tiers, running on Apple Silicon and Linux laptops through LM Studio and Ollama.

To the calling applications, this mixture is completely invisible. A customer-facing feature does not know whether its summary request was handled by a cloud frontier model or a local four-billion-parameter instance. An internal agent does not distinguish between Anthropic Messages semantics and OpenAI Chat Completions syntax. The Balancer normalizes both directions.

This architecture aligns with the direction analysts describe. The shift toward best-of-breed solutions is accelerating as enterprise disappointment with platform-centric approaches grows. We no longer ask which single vendor can solve every problem. We ask which model, on which infrastructure, at which cost profile, is optimal for the specific task.

 

Conclusion | Results and Lessons

The operational results are concrete.

Migration between providers is now a configuration change measured in minutes, not a refactoring project measured in sprints. Local inference has eliminated API cost for a significant share of our total request volume. Sensitive data never leaves our network unless we explicitly route it outward, which means internal documents, proprietary code context, and customer metadata can be processed without sanitization trade-offs.

The governance layer matters as much as the routing engine. With AES-256-GCM encryption, five-minute health checks, and an append-only audit log, we closed the gap between experimentation and production readiness. Only 21 percent of enterprises in the Deloitte survey report mature governance models; these controls are our foundation for joining that minority.

We built the Balancer in six weeks on a single small Azure VM. The investment was modest because the scope was precise: one OpenAI-compatible endpoint, three tiers, translation middleware, and a declarative control plane. We did not attempt to build a multi-tenant platform or a universal model hub. We built an abstraction layer that keeps our options open and our data sovereign.

That single decision — routing over rewriting — has given us vendor independence, material cost avoidance, and architectural flexibility we could not have achieved by remaining on a single provider.

WHERE GYSHO FITS

The Balancer wasn't a product we set out to build. It was the answer to a specific question: where does our infrastructure end and a vendor's begin? Most firms running on AI will face some version of that question, and the right response is rarely a single platform choice. It's working out what to standardise, what to keep flexible, and what must stay inside your own walls.


That's the kind of problem Gysho works on. Not picking a model or a provider, but building the layer underneath that keeps those decisions reversible, so a change in pricing, capability, or strategy is a configuration change rather than a rebuild. The Balancer is one example of that approach. The specifics differ for every firm; the principle of routing over rewriting does not.


If AI is becoming part of how you deliver, the question of what you own versus what you rent is no longer optional. Working out where to draw that line is the next step.