From OpenAI Proxy to Multi-Provider Routing: The ASC Architecture in Plain English

The first implementation of an AI gateway at most organisations is a three-line nginx config or a thin Python proxy that appends an Authorization header and forwards to the OpenAI API. This works until it doesn't: a provider outage takes down a production feature, a new model launch by a competitor offers better economics, or the compliance team asks why all AI requests are going to a single vendor in a single region.

The gap between a naive proxy and a production-grade multi-provider routing fabric is significant. ASC's routing layer bridges that gap. This article explains how it works, the design decisions behind it, and what it takes to operate reliably at scale.

Why Naive Proxying Is Not Enough

A simple proxy forwards requests to a single upstream. The operational properties of a single-upstream proxy are fully inherited from the upstream: if OpenAI's API is degraded, your proxy is degraded. If OpenAI raises prices, your entire AI spend is repriced. If you need to add Anthropic for a reasoning task, you write a second integration, a second error handler, a second rate limiter.

Multi-provider routing adds four capabilities that a single-upstream proxy cannot provide:

Provider failover. When a primary provider is unhealthy, route to an alternative without application code changes.
Cost arbitrage. Route to the cheapest capable provider for each request class.
Capability routing. Route requests to the model with the best capability profile for the task.
Regulatory compliance. Route requests that contain sensitive data to providers within the required geographic boundary.

These capabilities require the gateway to maintain state about provider health, pricing, and capabilities, and to make a routing decision on each request based on that state. That is the core of what ASC's routing fabric does.

The Routing Fabric: Architecture Overview

ASC's routing fabric is composed of four components that work together on each request.

The normalisation layer translates incoming requests from the OpenAI API format to the target provider's API format. OpenAI, Anthropic, Mistral, Cohere, and Google each have slightly different request schemas, parameter names, and authentication mechanisms. The normalisation layer absorbs these differences, presenting a unified interface to the calling application. On the response side, it translates provider-specific response schemas back to the OpenAI format.

The routing engine selects the target provider and model for each request. It evaluates a prioritised list of routing rules defined in the tenant's configuration. Rules can be:

Static: always route gpt-4o requests to OpenAI
Failover: route to Anthropic Claude if OpenAI's error rate exceeds 5%
Cost-based: route to the cheapest provider that supports the requested context length
Latency-based: route to the provider with the best p95 latency over the past five minutes
Canary: route a configurable percentage of traffic to a new model for A/B evaluation

Rules are evaluated in priority order. The first matching rule wins.

The health tracker maintains a real-time view of provider health. It monitors error rates, latency percentiles, and connection pool saturation per provider. Health state is stored in a shared cache accessible to all gateway pods, ensuring that a provider degradation observed by one pod is immediately visible to all pods. Health checks use an exponential backoff to avoid thundering herd problems during provider recoveries.

The circuit breaker implements the classic circuit-breaker pattern per provider. When a provider's error rate crosses a threshold, the circuit opens: requests that would route to that provider are immediately redirected to the next rule in the routing chain, without waiting for a timeout. When the circuit is open, ASC periodically probes the provider with a small percentage of traffic. When the probe succeeds, the circuit closes and normal traffic resumes.

How Credentials Are Stored

Provider API keys are among the most sensitive secrets in an organisation's AI infrastructure. A leaked OpenAI key can generate tens of thousands of dollars in charges before the incident is detected.

ASC uses a three-layer credential architecture.

At the outermost layer, tenants register credentials through the ASC management API or console. Credentials are transmitted over TLS and are never logged.

In the data plane, credentials are stored using envelope encryption. Each credential is encrypted with a unique data encryption key (DEK). The DEK is itself encrypted by a key encryption key (KEK) stored in the tenant's KMS — AWS KMS, Google Cloud KMS, or Azure Key Vault. The encrypted DEK and the encrypted credential are stored together in ASC's credential store. The KEK never leaves the KMS.

In the gateway pods, credentials are cached in memory in decrypted form for the duration of a configurable TTL, typically five minutes. When the cache expires, the pod re-fetches and decrypts the credential from the data plane. This cache avoids a KMS call on every request while ensuring that revoked credentials stop working within the TTL window.

Request and Response Normalisation

Different providers implement the same conceptual operations with different API shapes. This is most visible in the message format, stop sequence handling, and streaming response structure.

ASC's normalisation layer maintains a provider adapter for each supported provider. Each adapter implements a bidirectional transformation: from the canonical OpenAI format to the provider's native format on the request path, and from the provider's native format back to the canonical format on the response path.

This normalisation is transparent to the calling application. An application that calls gpt-4o can be rerouted to claude-3-5-sonnet without changing any application code, because the response it receives is in the same format regardless of which model served the request.

Tool call normalisation deserves special mention. The OpenAI tool call format, Anthropic's tool use format, and Google's function calling format all model the same concept differently. ASC normalises tool call requests and responses across providers, allowing an application that uses tool calls to be routed to any provider without modification.

Streaming Support

Streaming responses — where the model outputs tokens incrementally rather than waiting for the full response — require careful handling in a proxy layer. A naive proxy that buffers the full response before forwarding defeats the purpose of streaming.

ASC implements pass-through streaming: the gateway opens a persistent connection to the provider, begins receiving SSE (Server-Sent Events) chunks, and immediately forwards each chunk to the caller as it arrives. The gateway parses each chunk to extract token count metadata for billing and observability purposes, but does not buffer the response body.

For providers that use different streaming formats — Anthropic's streaming format differs from OpenAI's in several ways — ASC translates the chunk format in real time before forwarding to the caller.

Semantic Caching

For workloads where the same or semantically similar prompts are sent repeatedly, semantic caching can significantly reduce both cost and latency. ASC implements optional semantic caching at the routing layer.

When a request arrives, ASC computes an embedding of the prompt and compares it to embeddings in the cache index. If a cached entry with cosine similarity above a configurable threshold is found, the cached response is returned immediately without forwarding to the provider.

Cache entries are scoped to the tenant. A tenant's cached responses are never visible to other tenants. Cache TTL and similarity threshold are configurable per tenant; typical production configurations use a 0.95 similarity threshold and a 24-hour TTL.

Semantic caching is not appropriate for all workloads. Requests where prompt variation is high, where responses are time-sensitive, or where determinism is critical should disable caching. The cost and latency benefits accrue primarily to use cases with repetitive prompt patterns: document classification, FAQ answering, and batch analysis workloads.

Load Balancing Across Provider Endpoints

Some providers offer multiple regional endpoints. ASC can load-balance across endpoints within a provider for throughput or latency optimisation.

The load balancing algorithm is configurable per provider: round-robin, least-connections, or latency-weighted. Latency-weighted balancing measures the recent p50 latency of each endpoint and routes proportionally more traffic to lower-latency endpoints. This is particularly useful when a provider's US-East endpoint is consistently faster than their EU-West endpoint for a given workload.

Operational Reality

Operating a multi-provider routing fabric in production requires attention to a few areas that are easy to overlook during initial deployment.

Schema drift. Providers update their APIs without always maintaining perfect backward compatibility. ASC's provider adapters need to be updated when providers introduce breaking changes. AIARCO maintains the adapters as part of the ASC release cycle; for self-hosted deployments, staying current on ASC releases is important.

Cost model accuracy. Cost-based routing requires accurate pricing data per provider per model. ASC maintains a pricing table that is updated with each release. Organisations with custom pricing agreements with providers can override the default pricing table in their configuration.

Audit completeness. When a request fails after a provider routing attempt and is retried on a different provider, ASC records both attempts in the audit log. This is important for cost attribution — the first failed attempt may have consumed tokens even if it returned an error — and for debugging provider-specific issues.

Conclusion

The progression from a simple OpenAI proxy to a multi-provider routing fabric reflects the maturity of an organisation's AI infrastructure. The naive proxy solves the immediate need. The routing fabric solves the production requirements: reliability, cost control, regulatory compliance, and provider independence.

ASC's routing architecture encodes the lessons from operating AI infrastructure at scale: normalised request formats so provider switching is invisible to applications, credential management that eliminates plaintext secret exposure, circuit breaking that prevents provider failures from becoming application failures, and semantic caching that makes repeated workloads dramatically cheaper.

Want to see the routing fabric in action? AIARCO ASC ships with pre-built provider adapters for OpenAI, Anthropic, Mistral, Cohere, Google, and more. Start free or read the architecture docs.

From OpenAI Proxy to Multi-Provider Routing: The ASC Architecture in Plain English

From OpenAI Proxy to Multi-Provider Routing: The ASC Architecture in Plain English

Why Naive Proxying Is Not Enough

The Routing Fabric: Architecture Overview

How Credentials Are Stored

Request and Response Normalisation

Streaming Support

Semantic Caching

Load Balancing Across Provider Endpoints

Operational Reality

Conclusion

Ready to take control of your AI services?

Related Articles

Context Window Management at the Gateway Level: Truncation, Summarization, and Compression

Failover Strategies for AI Gateways: From Simple Retries to Provider Arbitrage

Designing Immutable Audit Logs for an AI Platform: Schema, Storage, and Query Patterns