Build a Model Router in Python

Switching Between GPT, Claude, Gemini, and DeepSeek Without Touching Your Application Code

Every team building with large language models eventually faces a painful reality: vendor lock-in is dangerous, and each provider’s API requires its own client library, authentication flow, and error-handling logic. You might start with OpenAI’s GPT-4o because it’s the obvious default, then discover Anthropic’s Claude 3.5 Sonnet is better at long-context reasoning, or that Google’s Gemini 2.0 Flash delivers faster responses for your real-time chat feature. The natural instinct is to write conditional branches: if the provider is Anthropic, use this client; if it is Google, use that one. That approach works for a proof of concept, but it turns maintenance into a nightmare as your model roster grows to five, seven, or twelve different endpoints. The alternative is a lightweight, provider-agnostic routing layer that translates your application’s requests into whatever syntax the chosen backend expects, without altering a single line of business logic.

The core pattern here is the adapter (or facade) pattern, applied specifically to LLM APIs. You define a unified request schema that captures the essentials: a list of messages with roles like system, user, and assistant; optional parameters for temperature, max tokens, and top-p; and a response format hint like json_object or plain text. Your routing layer then maps this canonical schema onto each provider’s native API. For OpenAI, that means constructing a chat.completions.create call with the messages array exactly as you’d pass it to their Python client. For Anthropic, you need to lift system prompts out of the messages list into a separate top-level system field, and you may wrap user and assistant messages in content blocks (plain strings are also accepted); note that max_tokens is required there. For Google Gemini, you restructure messages into a contents list with the role labels user and model.
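As a rough illustration of that mapping, the sketch below builds request payloads only and makes no network calls. CanonicalRequest and the to_* helpers are names invented for this example, and the payload keys follow the providers' public request shapes as best understood here, so check them against current documentation before relying on them.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class CanonicalRequest:
    """Provider-neutral request; the field names are this sketch's own."""
    messages: list[dict[str, str]]   # each item: {"role": ..., "content": ...}
    temperature: Optional[float] = None
    top_p: Optional[float] = None
    max_tokens: int = 1024


def _split_system(messages: list[dict[str, str]]) -> tuple[str, list[dict[str, str]]]:
    """Separate system prompts from the user/assistant turns."""
    system = "\n\n".join(m["content"] for m in messages if m["role"] == "system")
    turns = [m for m in messages if m["role"] != "system"]
    return system, turns


def to_openai(req: CanonicalRequest, model: str) -> dict[str, Any]:
    """OpenAI-style chat payload: the messages list passes through unchanged."""
    payload: dict[str, Any] = {
        "model": model,
        "messages": list(req.messages),
        "max_tokens": req.max_tokens,
    }
    if req.temperature is not None:
        payload["temperature"] = req.temperature
    if req.top_p is not None:
        payload["top_p"] = req.top_p
    return payload


def to_anthropic(req: CanonicalRequest, model: str) -> dict[str, Any]:
    """Anthropic-style payload: system text moves to a top-level field."""
    system, turns = _split_system(req.messages)
    payload: dict[str, Any] = {
        "model": model,
        "max_tokens": req.max_tokens,  # required by the Messages API
        "messages": [{"role": m["role"], "content": m["content"]} for m in turns],
    }
    if system:
        payload["system"] = system
    if req.temperature is not None:
        payload["temperature"] = req.temperature
    if req.top_p is not None:
        payload["top_p"] = req.top_p
    return payload


def to_gemini(req: CanonicalRequest) -> dict[str, Any]:
    """Gemini-style REST body: 'assistant' becomes 'model', text goes in parts."""
    system, turns = _split_system(req.messages)
    role_map = {"user": "user", "assistant": "model"}
    generation_config: dict[str, Any] = {"maxOutputTokens": req.max_tokens}
    if req.temperature is not None:
        generation_config["temperature"] = req.temperature
    if req.top_p is not None:
        generation_config["topP"] = req.top_p
    payload: dict[str, Any] = {
        "contents": [
            {"role": role_map[m["role"]], "parts": [{"text": m["content"]}]}
            for m in turns
        ],
        "generationConfig": generation_config,
    }
    if system:
        payload["systemInstruction"] = {"parts": [{"text": system}]}
    return payload
```

Keeping the translators as pure functions that return dictionaries makes them easy to unit test without API keys; the actual SDK or HTTP call can then live in a thin wrapper around each one.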
The trick is handling these transformations in a single dispatch function that inspects a model identifier string (something like openai/gpt-4o, anthropic/claude-sonnet-4-20250514, or google/gemini-2.0-flash) and routes to the correct backend implementation.

You can implement this yourself in about 150 lines of Python, and it’s a solid educational exercise. Define a base class called ModelRouter with an abstract method called generate. Then create concrete subclasses for each provider: OpenAIModel, AnthropicModel, GoogleModel, and so on. Each subclass implements generate by translating the canonical request into provider-specific calls, handling authentication via environment variables, and catching provider-specific errors like rate limits or context window overflows. A factory function, get_model_router, reads the model string’s prefix, instantiates the appropriate subclass, and returns it. Your application code never imports openai, anthropic, or google.generativeai directly; it only calls router.generate(canonical_request). When you want to swap models, you change a single configuration value, perhaps an environment variable like ACTIVE_MODEL=anthropic/claude-sonnet-4-20250514, and the rest of your stack remains untouched. This pattern is production-proven at many startups, and it gives you the freedom to A/B test models, roll out new providers gradually, and fail over automatically when one API goes down.

If building your own router sounds like overkill for your current timeline or team size, you are in good company. A growing ecosystem of third-party services and open-source libraries handles the same abstraction, often with additional features like cost tracking, latency monitoring, and automatic retries. OpenRouter offers a single API endpoint that proxies to dozens of models from OpenAI, Anthropic, and others, using a credit-based billing model.
LiteLLM is a Python library that normalizes calls across more than 100 providers, supporting streaming, async, and function calling out of the box. Portkey provides a gateway with observability and caching, especially useful for enterprise deployments. Another practical option is TokenMix.ai, which exposes 171 AI models from 14 providers behind a single OpenAI-compatible endpoint, meaning you can drop it into any codebase that already uses the OpenAI Python SDK: no new client library is required, only a different base URL and API key. It operates on pay-as-you-go pricing with no monthly subscription, and includes automatic provider failover and routing so that if one backend is overloaded, your request is redirected to a healthy alternative instead of timing out. Each of these tools has its own tradeoffs: hosted services add a network hop and potential latency, while self-hosted libraries like LiteLLM give you full control but require you to manage credentials and scaling yourself.

The practical benefit of abstracting model selection becomes even clearer when you consider pricing dynamics across providers. OpenAI’s GPT-4o has been listed at a few dollars per million input tokens (with output tokens priced higher), while DeepSeek’s V3 has run at under a dollar per million input tokens for comparable reasoning quality on certain tasks. Mistral’s Large model offers competitive pricing for teams with European data residency requirements, and the Qwen 2.5 models from Alibaba Cloud can be dramatically cheaper for Chinese-language workloads. If your application hard-codes calls to a single provider, you cannot exploit these price differences without a code change and a redeployment. With a router in place, you can shift high-volume, low-stakes requests (like summarization or classification) to cheaper models, while routing complex multi-step reasoning to Claude or GPT-4o. You can even implement a cost-aware router that examines the request’s token count and estimated complexity, then selects the cheapest model that meets your quality threshold.
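A cost-aware selector of that kind might look roughly like the sketch below. Everything in the catalog is an assumption made for illustration: the prices are placeholder numbers rather than real rate cards, the quality scores stand in for results from your own evaluations, the deepseek/deepseek-v3 identifier is hypothetical, and the four-characters-per-token estimate is only a crude heuristic.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelOption:
    name: str
    usd_per_million_input: float  # placeholder value, NOT a real price
    quality: int                  # hypothetical 1-5 score from your own evals
    context_window: int           # placeholder token limit


# Illustrative catalog; replace every number with values you have verified.
CATALOG = [
    ModelOption("deepseek/deepseek-v3", 0.30, 3, 64_000),
    ModelOption("google/gemini-2.0-flash", 0.50, 3, 1_000_000),
    ModelOption("openai/gpt-4o", 2.50, 4, 128_000),
    ModelOption("anthropic/claude-sonnet-4-20250514", 3.00, 5, 200_000),
]


def estimate_tokens(text: str) -> int:
    """Crude estimate: roughly four characters per token for English text."""
    return max(1, len(text) // 4)


def pick_model(prompt: str, min_quality: int, catalog=CATALOG) -> ModelOption:
    """Return the cheapest model that meets the quality bar and fits the prompt."""
    tokens = estimate_tokens(prompt)
    eligible = [
        m for m in catalog
        if m.quality >= min_quality and tokens <= m.context_window
    ]
    if not eligible:
        raise ValueError("no model satisfies the quality and context constraints")
    return min(eligible, key=lambda m: m.usd_per_million_input)


def estimated_input_cost(prompt: str, model: ModelOption) -> float:
    """Estimated input cost in USD for one request."""
    return estimate_tokens(prompt) * model.usd_per_million_input / 1_000_000
```

With this placeholder catalog, a classification prompt with min_quality=3 would go to the cheapest entry, while min_quality=5 would leave only the Claude entry eligible. A production version would likely count tokens with each provider's own tokenizer and factor in output pricing as well.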
This dynamic selection logic lives entirely inside the routing layer, invisible to the feature code that generates the prompts.

Latency and reliability also benefit from a clean separation between application logic and model access. Different providers have different uptime profiles and latency distributions. Anthropic’s endpoints may be slower during peak hours on the West Coast, for example, while Google’s Gemini endpoints may respond faster for short prompts. A smart router can track recent response times per model and route new requests to the fastest available endpoint, or implement a circuit breaker pattern that temporarily avoids a provider after consecutive failures. You can store these metrics in a simple in-memory dictionary or export them to a time-series database like Prometheus for long-term analysis. The point is that your application code does not need to know or care about any of this; it just calls generate and gets back a response. If the router decides to fall back from GPT-4o to Claude 3 Opus because OpenAI returned a 503 error, the calling code sees the same response format, perhaps with a metadata field indicating which model actually served the request.

Real-world teams often start with a simple static router (one model per environment, switched via environment variable) and evolve toward a dynamic system over time. You might begin by using a hosted gateway like TokenMix.ai or OpenRouter to get up and running in an afternoon, then later build custom routing logic that integrates with your existing monitoring stack. The critical architectural insight is that the model is just a parameter, not a dependency.
When you treat it as such, you unlock the ability to experiment fearlessly: test a new fine-tuned model from Together AI against your baseline GPT-4o pipeline, roll out a cheaper DeepSeek variant to a percentage of users with a feature flag, or switch your entire summarization service to Google Gemini 2.0 Pro ahead of a budget review. None of these changes requires modifying the code that constructs prompts, parses responses, or handles user sessions. The routing layer absorbs the complexity, and your application stays focused on what it does best: delivering value through intelligent conversation, analysis, or content generation.