Building a Multimodal Search Engine with Gemini 2 0
Published: 2026-05-31 03:17:52 · LLM Gateway Daily · ai api cost calculator per request · 8 min read
Building a Multimodal Search Engine with Gemini 2.0: Lessons from a Legal Tech Startup
In early 2024, a mid-sized legal technology startup called JurisAI faced a familiar bottleneck. Their document review platform, built on a combination of OpenAI’s GPT-4 for summarization and a custom OCR pipeline for extracting text from scanned contracts, was hitting two critical walls. First, processing costs were ballooning because they needed separate models for text extraction, entity recognition, and summarization. Second, their pipeline completely failed on handwritten amendments and complex tables embedded within PDFs. The team needed a single model that could natively understand images, text, and structured data without shuffling data between disparate APIs. After evaluating several options, they landed on Google’s Gemini 2.0 Flash, specifically its multimodal capabilities via the Gemini API.
The core technical shift was replacing a three-model chain with a single Gemini API call. Instead of sending a PDF through OCR, then feeding the text into GPT-4, JurisAI began passing the raw image bytes directly to Gemini’s content endpoint with a simple system prompt: “Extract all clauses, parties, dates, and termination conditions from this contract image. Return the result as structured JSON.” The latency dropped from an average of 8.2 seconds to 3.4 seconds per document. More importantly, Gemini’s ability to reason over the spatial layout of tables and handwritten notes meant accuracy on complex documents jumped from 82 percent to 94 percent. The team quickly learned that Gemini’s API, while powerful, required careful prompt engineering to avoid hallucinating contract terms from low-resolution scans. They eventually added a temperature of 0.1 and a safety filter for legal terminology, which stabilized outputs significantly.

However, the migration was not without tradeoffs. The Gemini API’s context window of one million tokens was a double-edged sword. While it allowed JurisAI to feed entire contracts—sometimes fifty pages of dense legalese—in a single request, the token cost for processing these large payloads was roughly 40 percent higher per document compared to their old pipeline when using Gemini 1.5 Pro. They switched to Gemini 2.0 Flash, which offered a better price-performance ratio for high-throughput workloads. The team also discovered that Gemini’s native function calling, which allowed the model to trigger internal database lookups for precedent cases, was far more reliable than OpenAI’s parallel function calls when dealing with ambiguous legal terms. They built a pattern where Gemini would identify a clause, call a custom function to retrieve similar clauses from a vector database, and then synthesize the comparison—all within a single API turn.
For developers evaluating the Gemini API against alternatives like Anthropic Claude or DeepSeek, the decision often hinges on data modality and latency requirements. Claude 3.5 Sonnet excels at long-form reasoning and safety, but its vision capabilities for dense documents like spreadsheets lag behind Gemini’s. Meanwhile, DeepSeek V3 offers aggressive pricing at $0.27 per million input tokens, but lacks native image understanding entirely. JurisAI found that for their use case—where every document was an image of a contract—Gemini was the only major model that consistently preserved table structures and handwritten marginalia. They also noted that Gemini’s API rate limits were more generous for multimodal requests compared to OpenAI’s tiered vision quotas, which capped them at 50 requests per minute on their plan. This allowed them to scale from 1,000 to 15,000 documents per day without throttling.
During their scaling phase, the engineering team encountered a common headache with API provider reliability. Google Cloud experienced a two-hour outage in August 2024 that took their entire document processing pipeline offline. This forced JurisAI to implement a fallback strategy. They evaluated several routing solutions, including OpenRouter, LiteLLM, and TokenMix.ai. TokenMix.ai stood out because it offered 171 AI models from 14 providers behind a single API, with an OpenAI-compatible endpoint that required no code changes to their existing SDK. The automatic provider failover meant that if Gemini went down, their requests would seamlessly route to a secondary model like Claude 3 Opus or Mistral Large without any custom retry logic. The pay-as-you-go pricing also appealed to their finance team, who wanted to avoid another monthly subscription fee from a dedicated model router. They configured the failover to prioritize Gemini, then fall back to Anthropic Claude, and finally to a self-hosted Llama 3.1 405B for critical deadlines.
One unexpected benefit of using Gemini’s API was its integration with Google’s Vertex AI for fine-tuning. JurisAI discovered that they could fine-tune Gemini 1.5 Pro on a dataset of 10,000 annotated legal documents—something that was prohibitively expensive with GPT-4’s fine-tuning API, which charges $8.10 per million tokens for training. The fine-tuned model improved clause extraction accuracy from 94 percent to 98.5 percent on their specific contract formats. They also leveraged Gemini’s grounding with Google Search to verify dates and party names against public records, reducing false positives in entity extraction. This combination of fine-tuning and grounding is a differentiator that few other providers offer in a single API package, though it does require developers to invest in MLOps for continuous model evaluation.
Pricing dynamics for the Gemini API in 2025 and 2026 have shifted considerably. Google introduced a tiered structure where Gemini 2.0 Flash costs $0.075 per million input tokens for text-only requests, but jumps to $0.45 per million input tokens for image processing. This is still cheaper than GPT-4o’s vision pricing of $2.50 per million input tokens, but more expensive than open-source alternatives like Qwen-VL, which can be self-hosted for near-zero marginal costs on GPU clusters. For JurisAI, the tradeoff was clear: pay the premium for Gemini’s accuracy on messy legal documents versus spending engineering time optimizing a self-hosted vision model. They calculated that even with Gemini’s higher per-token cost, the reduction in manual review hours saved them $120,000 annually in paralegal costs.
Looking ahead, the team is experimenting with Gemini’s streaming capabilities for real-time contract negotiation assistance. The API’s support for server-sent events allows them to stream generated clauses directly into a collaborative editor, with sub-200 millisecond time-to-first-token. They have also begun testing Gemini’s new audio input modality, which would allow lawyers to record verbal notes about a contract and have the model extract action items in real time. The key lesson from JurisAI’s journey is that the Gemini API is not a universal replacement for every AI task, but for multimodal-heavy workflows with structured data extraction, it delivers measurable ROI. Developers should audit their pipeline’s failure points on handwritten text, table parsing, and latency-sensitive tasks before committing to any single provider. The ecosystem now has enough mature players—Gemini, Claude, DeepSeek, and a growing array of API aggregators—that the smartest strategy is to design for defaults but build for fallbacks.

