✦ Gemini Live Agent Challenge

Eyes, Voice, and a Wrench: What Happens When AI Can See What You See

Building FixIt Genie — a real-time multimodal AI agent for equipment repair, powered by Gemini Live API and Google ADK.

March 2026
12 min read
#GeminiLiveAgentChallenge

The Problem With Every Repair Tool That Exists Today

Picture a maintenance technician on a factory floor, hands inside an electrical panel, Ray-Ban glasses on. An AI agent is watching through the glasses camera — seeing the exact wiring configuration, the exact components — talking them through the fault diagnosis step by step. Hands-free. No manual to flip through. Just a voice that sees what they see.

Every existing repair tool — manuals, YouTube tutorials, support hotlines — shares the same fundamental flaw: they're blind. They can't see what you're looking at. They can't adapt when your setup is different from the example. They can't tell you "the valve on your left, the one with the red handle" because they have no idea what's in front of you.

Gemini Live changes this constraint. An AI that sees through your camera, hears your voice, and talks back in real time isn't a better search engine — it's a different kind of tool entirely.

What Makes This Different From a Voice Assistant

The combination that makes FixIt Genie work isn't any single capability — it's all three running simultaneously: continuous vision (camera frames arriving at 1 FPS), bidirectional audio (you can speak and be heard at any moment, including mid-response), and domain knowledge retrieval (six tools firing mid-conversation without breaking the flow).

Remove any one of these and the experience degrades. Vision without voice is just image search. Voice without vision is Siri. Knowledge without real-time delivery is a PDF. Together, they produce something that behaves like a knowledgeable colleague standing next to you — one who can say "I can see the corrosion on your terminal" and mean it.


What FixIt Genie Does

FixIt Genie is a native Android app. Point your camera at broken equipment, describe the problem out loud, and the agent identifies what it sees, queries its knowledge base, and walks you through the repair — one step at a time, confirming each step through the camera before giving the next one.

Visual understanding. Camera frames arrive at 1 FPS. The agent reads error codes on displays, checks gauge levels, identifies corrosion on battery terminals, and spots issues you haven't asked about. It describes what it sees out loud — "I can see white buildup on your positive terminal" — so you know it's actually looking, not guessing.

Bidirectional conversation, not push-to-talk. Audio streams continuously in both directions. You can speak mid-response and the agent stops immediately. No button to hold, no turn indicator to wait for. The Gemini native audio model's VAD handles this automatically — say "wait" or "hold on" and it listens.

Safety checks are not optional. The system prompt enforces a hard rule: get_safety_warnings() fires before any instruction involving physical action. The agent won't tell you to open a breaker panel without first warning about lethal voltages on the bus bars. It won't guide coolant system work without warning about pressurized steam. This is enforced in code, not just guidelines.
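The tool behind that rule can be small. Below is an illustrative sketch of what get_safety_warnings might look like — the keyword map and warning strings here are assumptions for illustration, not the app's actual rule set:

```python
# Illustrative sketch of the safety tool. The keyword map and warning text
# are assumptions -- the real rules live in the app's backend code.
_SAFETY_RULES = {
    "electrical": [
        "De-energize the circuit at the breaker before touching wiring.",
        "Bus bars can stay live even with branch breakers off -- treat them as lethal.",
    ],
    "coolant": [
        "Never open a hot cooling system; pressurized steam causes severe burns.",
    ],
    "battery": [
        "Disconnect the negative terminal first to avoid shorting tools to ground.",
    ],
}

def get_safety_warnings(task_description: str) -> list[str]:
    """Return every warning whose domain keyword appears in the task description."""
    text = task_description.lower()
    hits = [w for key, rules in _SAFETY_RULES.items() if key in text for w in rules]
    return hits or ["No domain-specific warnings on file; apply general caution."]
```

Because the system prompt requires this call before any physically actionable instruction, the safety gate is auditable in the session logs rather than implicit in the model's behavior.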

Ray-Ban Glasses: Hands-Free Repair

For situations where holding a phone gets in the way — under a car, inside an appliance panel, deep in an engine bay — FixIt Genie supports Ray-Ban Meta glasses as a fully integrated camera source. One tap switches the live video stream from your phone camera to the glasses. The same AI agent that was watching your phone is now watching through your eyewear, hands-free.

This isn't a novelty. For professional use cases — factory floor, field service, surgical suite — it's the only viable form factor. You cannot hold a phone while your hands are inside running machinery.

[Diagram: Camera Source — Live Toggle]
📱 Phone Camera — CameraX ImageAnalysis · back camera · 1 FPS · ACTIVE
  ⇄ one tap ⇄
🕶️ Ray-Ban Meta Glasses — Meta DAT SDK v0.4 · I420 → JPEG conversion · STANDBY
Integration stack: SessionViewModel (CameraSource enum) · GlassesCameraManager · Meta Wearables SDK (mwdat-camera) · I420 → JPEG into the same ADK LiveRequest pipeline.
No session restart required on switch — the agent sees the new stream immediately.

The integration uses the Meta DAT SDK (mwdat-core, mwdat-camera) to stream glasses frames via the Wearables SDK. GlassesCameraManager handles I420→JPEG conversion and feeds frames into the same WebSocket pipeline as the phone camera. SessionViewModel exposes a CameraSource enum (PHONE / GLASSES) — switching is live with no session restart.


Architecture: How It All Fits Together

The system is a monorepo: a native Android app and a Python backend on Google Cloud Run.

The Android App (Kotlin + Jetpack Compose)

Three real-time data streams, one WebSocket connection:

[Diagram: Live Streaming Pipeline]
Android app: 📱 Camera (CameraX · 1 FPS · 768×768 JPEG) · 🎙️ Mic in (AudioRecord · 16 kHz PCM mono) · 🔊 Speaker (AudioTrack · 24 kHz PCM · AEC)
  → single wss:// WebSocket connection →
Cloud Run (ADK): 👁️ Vision (Gemini sees frames continuously) · 👂 Hearing (native VAD · no start/stop signals) · 🗣️ Speech (gemini-2.5-flash-native-audio)

Camera (CameraX) — ImageAnalysis captures frames from the back camera at 1 FPS, scaled to 768×768, JPEG-compressed, Base64-encoded, and sent as ADK LiveRequest JSON blobs. When glasses mode is active, GlassesCameraManager replaces this stream with I420→JPEG-converted frames from the Meta DAT SDK.
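On the wire, each frame is a small JSON message. A minimal sketch of the wrapping step — the exact field names ("blob", "mime_type", "data") are an assumption about the LiveRequest shape and should be checked against the ADK version in use:

```python
import base64
import json

def frame_to_live_request(jpeg_bytes: bytes) -> str:
    """Wrap one JPEG camera frame as a LiveRequest-style JSON blob.
    Field names are assumed, not verified against a specific ADK release."""
    return json.dumps({
        "blob": {
            "mime_type": "image/jpeg",
            "data": base64.b64encode(jpeg_bytes).decode("ascii"),
        }
    })
```

The same wrapper serves both camera sources: phone frames and glasses frames differ only in where the JPEG bytes come from.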

Audio Input (AudioRecord) — Raw PCM at 16 kHz mono, streamed as Base64-encoded blobs over the same connection. The native audio model handles voice activity detection automatically — no need to signal speech start/end.

Audio Output (AudioTrack) — The agent's voice responses arrive as PCM audio in LiveEvent messages. AudioTrack is configured with USAGE_VOICE_COMMUNICATION and MODE_IN_COMMUNICATION so hardware acoustic echo cancellation kicks in — critical when the agent's voice and the user's microphone are in the same room.

The Backend (Google ADK on Cloud Run)

The backend uses ADK's adk web, which exposes a /run_live WebSocket endpoint for bidi-streaming. The agent combines three ADK domain skills with six function tools:

Domain Skills (SkillToolset) — loaded on demand:

Each skill is a SKILL.md with behavioral instructions plus references/ markdown docs. The agent calls list_skills to discover available domains and load_skill to pull one into context only when needed — keeping the context window lean.

Function Tools:

from pathlib import Path

from google.adk.agents import Agent
from google.adk.skills import load_skill_from_dir
from google.adk.tools.skill_toolset import SkillToolset
from google.adk.tools.google_search_tool import GoogleSearchTool

# Assumed location of the skill packages relative to this module.
_SKILLS_DIR = Path(__file__).parent / "skills"

_skill_toolset = SkillToolset(skills=[
    load_skill_from_dir(_SKILLS_DIR / "automotive"),
    load_skill_from_dir(_SKILLS_DIR / "electrical"),
    load_skill_from_dir(_SKILLS_DIR / "appliances"),
])

agent = Agent(
    model="gemini-2.5-flash-native-audio-latest",
    name="fixitgenie",
    instruction=SYSTEM_INSTRUCTION,
    tools=[
        _skill_toolset,
        lookup_equipment_knowledge,
        get_safety_warnings,
        log_diagnostic_step,
        GoogleSearchTool(bypass_multi_tools_limit=True),
        analyze_youtube_repair_video,
        lookup_user_manual,
    ],
)

Deployment

Cloud Run with session affinity (critical for persistent WebSocket connections). The deploy.sh script is a single-command IaC deployment: enables APIs, creates the Artifact Registry, builds the container, deploys. One discovery worth noting: gemini-2.5-flash-native-audio-latest only works with the Gemini API, not Vertex AI. GOOGLE_GENAI_USE_VERTEXAI=FALSE plus a direct API key is the required configuration.


The Knowledge Stack

Why ADK Skills + Vector Search Instead of a Hardcoded Dict?

The first prototype embedded knowledge directly in tools.py as a Python dictionary. Fast, dependency-free, and architecturally wrong for three reasons: keyword matching fails on synonyms ("engine won't turn over" misses the battery document), adding knowledge requires redeploying code, and a hardcoded dict is a weak answer to "show me evidence of grounding."

The production architecture uses two complementary layers:

[Diagram: Knowledge Lookup Cascade — layers compose, not replace]
Query: 💬 "engine oil pressure alarm"
1. ADK Domain Skills — behavioral instructions · domain expertise · repair patterns (🚗 Automotive · ⚡ Electrical · 🏠 Appliances). Each skill is a SKILL.md with reference docs pulled in on demand via load_skill; it defines how the agent diagnoses — what questions to ask, when to escalate, what safety checks to run — and stays out of the context window until needed. On a miss, fall through to the next layer.
2. Firestore Vector Search — semantic similarity · 1536-dim embeddings · COSINE index · find_nearest() · gemini-embedding-001. Equipment documents are embedded and stored in Firestore; "engine oil pressure alarm" semantically matches the oil system document with zero keyword overlap. Handles synonyms, paraphrases, and partial descriptions. On a miss, fall through to the next layer.
3. Live Web Tools — real-time search · video transcripts · manufacturer PDFs (google_search · YouTube · PDF manuals). Handles the long tail — the 2009 Mitsubishi Outlander P2101 that no KB covers: Google Search finds relevant pages, analyze_youtube_repair_video extracts transcripts from niche repair channels, lookup_user_manual fetches and parses manufacturer PDFs.

ADK Skills — Each skill package is a SKILL.md with behavioral instructions plus references/ markdown docs. The behavioral layer (how the agent diagnoses, what questions it asks, when it escalates) is separate from the knowledge layer. Agents that mix behavior and facts in a single blob become brittle.

Firestore Vector Search — Equipment documents are embedded with gemini-embedding-001 (1536 dimensions) and stored in Firestore with a COSINE vector index. lookup_equipment_knowledge calls find_nearest() — "engine oil pressure alarm" semantically matches the oil system document with zero keyword overlap.
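The matching itself is plain cosine similarity over embedding vectors. Here is a toy stand-in for find_nearest() — real embeddings are 1536-dimensional vectors from gemini-embedding-001; the two-dimensional vectors below are only to make the idea concrete:

```python
import math

def cosine_sim(a: list[float], b: list[float]) -> float:
    """Cosine similarity -- the distance measure behind the COSINE index."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def find_nearest(query_vec: list[float], docs: list[dict], limit: int = 1) -> list[dict]:
    """Toy stand-in for Firestore's find_nearest(): rank docs by similarity."""
    return sorted(docs, key=lambda d: cosine_sim(query_vec, d["embedding"]),
                  reverse=True)[:limit]
```

Because both the query and the documents pass through the same embedding model, a paraphrase like "engine won't turn over" lands near the battery document even with no shared keywords.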

Handling the Long Tail

Nine knowledge documents cover common scenarios well. Someone with a 2009 Mitsubishi Outlander asking about a P2101 throttle actuator code will exhaust the skills KB fast. Three web tools handle the rest — each with non-obvious implementation decisions:

google_search — GoogleSearchTool (ADK's built-in grounding tool) cannot coexist with custom function tools by default. The fix: bypass_multi_tools_limit=True on the GoogleSearchTool instance. Without it, every live session throws a 400 INVALID_ARGUMENT on deployment.

analyze_youtube_repair_video — The obvious approach (pass the YouTube URL to Gemini's REST API) works for massively popular videos but silently fails for niche repair channels — exactly the content that's most useful. The fix: fetch the transcript independently via youtube-transcript-api, pass the text to Gemini for summarization.
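The transcript arrives as timed segments, which need flattening before the summarization call. A small sketch assuming the segment shape youtube-transcript-api returns (dicts carrying 'text', 'start', and 'duration'):

```python
def transcript_to_text(segments: list[dict]) -> str:
    """Flatten timed transcript segments into one string for summarization.
    The segment shape ({'text', 'start', 'duration'}) is assumed to match
    what youtube-transcript-api returns."""
    return " ".join(seg["text"].strip() for seg in segments if seg["text"].strip())
```

Keeping this step independent of Gemini means a fetch failure surfaces as a clear tool error rather than a silent hallucinated summary.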

lookup_user_manual — ManualsLib has no public API and bot-detection blocks scraping. Instead: a grounded Gemini search query finds the manufacturer's official PDF URL, requests fetches it, pypdf extracts the text, a second Gemini call summarizes the relevant sections.
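Manufacturer PDFs run to hundreds of pages, so only the relevant ones should reach the second Gemini call. A hypothetical helper for that filtering step — relevant_pages is not from the app's code, just one plausible way to do it:

```python
def relevant_pages(pages: list[str], query: str, max_pages: int = 3) -> list[int]:
    """Rank extracted PDF pages by how many query terms each contains;
    return the indices of the best non-empty matches. Hypothetical helper --
    the real tool may select pages differently."""
    terms = {t.lower() for t in query.split()}
    scored = [(sum(t in page.lower() for t in terms), i)
              for i, page in enumerate(pages)]
    return [i for score, i in sorted(scored, reverse=True)[:max_pages] if score > 0]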


Design Decisions That Had Real Consequences

Native Android, not web. CameraX gives direct access to the camera sensor with no browser sandboxing overhead. AudioRecord with VOICE_COMMUNICATION mode activates hardware acoustic echo cancellation at the driver level — critical when the agent's voice and the user's microphone are in the same room. AudioTrack in MODE_IN_COMMUNICATION routes audio through the earpiece path, enabling AEC. None of this is available in a browser. The torch API also works — useful when pointing a phone into a dark engine bay.

Always-on audio streaming. The first implementation gated the microphone while the agent was speaking — a reasonable attempt to prevent echo feedback from triggering a second response. It also made interruption impossible, because the server never heard the user speaking. Removing the gate and letting Gemini's native VAD handle turn detection restored true bidi-streaming. The server sends interrupted: true when it detects the user speaking mid-response; the app clears the audio queue instantly. This is the difference between a live agent and a voice-skinned chatbot.
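Client-side, the interruption contract reduces to one rule: an interrupted event flushes the playback queue. A minimal sketch of that logic — the event field names mirror the prose above, but treat them as assumptions about the exact LiveEvent shape:

```python
from collections import deque

class SpeakerQueue:
    """Playback queue sketch: live events append PCM chunks; an 'interrupted'
    event drops everything not yet played so the agent falls silent at once.
    Field names ('interrupted', 'audio') are assumed."""

    def __init__(self) -> None:
        self._chunks: deque = deque()

    def on_live_event(self, event: dict) -> None:
        if event.get("interrupted"):
            self._chunks.clear()  # user barged in: discard queued agent audio
        elif event.get("audio"):
            self._chunks.append(event["audio"])

    def next_chunk(self):
        """Pop the next PCM chunk for AudioTrack, or None when idle."""
        return self._chunks.popleft() if self._chunks else None
```

The app's Kotlin equivalent does the same thing against AudioTrack; the point is that interruption is handled by clearing client state, not by renegotiating the session.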

1 FPS video, not a video stream. Equipment doesn't move fast. A JPEG frame once per second is sufficient to read an error code, check a fluid level, or identify a component — at a fraction of the bandwidth and cost of continuous video. The 768×768 resolution preserves enough detail to read small text on appliance displays without saturating the connection.
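The bandwidth claim is easy to sanity-check. Assuming roughly 50 KB per 768×768 JPEG (an estimate; real sizes vary with scene detail and compression quality):

```python
def frame_stream_kbps(jpeg_kb: float, fps: float) -> float:
    """Uplink cost of the frame stream in kilobits per second.
    Base64 encoding inflates the payload by a factor of 4/3."""
    return jpeg_kb * fps * 8 * 4 / 3

# ~50 KB JPEG at 1 FPS: roughly 533 kbps -- far below continuous video.
```

Even on a congested factory-floor connection, that budget leaves ample headroom for the two audio streams on the same WebSocket.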

Behavioral instructions separate from knowledge. The ADK skills (SKILL.md + reference docs) define how the agent diagnoses — what questions to ask, when to escalate, what safety checks to run for a given domain. The Firestore vector search holds what the agent knows about specific equipment. Mixing behavior and facts in a single prompt produces agents that are either too cautious (safety warnings for everything) or too aggressive (skips checks when confident). Separation keeps each layer tunable independently.


Non-Obvious Things We Discovered

The system prompt is load-bearing for multimodal trust. "Identify the equipment" produces an agent that gives answers. "Describe what you see out loud before identifying it" produces an agent users trust — because they can hear it noticing the same thing they're looking at. For a repair guidance app, trust is not a UX nice-to-have. A user who doubts the agent is seeing their equipment will not follow its instructions into a live electrical panel.

Function calls during live audio streaming don't break the conversation. The native audio model supports custom function calling alongside bidi-streaming. The agent hears "my washing machine shows E4", calls lookup_equipment_knowledge, gets the result, and responds — all without the user experiencing a pause or mode switch. The same applies to the three web tools: google_search for error codes and model-specific procedures not in the KB, analyze_youtube_repair_video for step-by-step instructions from niche repair channels, and lookup_user_manual for extracting error code tables from manufacturer PDFs. The agent reaches the open web mid-conversation without breaking the flow. The diagram below shows the full sequence.

[Diagram: Function Call Mid-Conversation]
1. You: "My washing machine shows E4 and stopped mid-cycle" (PCM 16 kHz → Cloud Run)
2. Agent: identifies "E4" + "washing machine" → triggers lookup (Gemini Live API · function_call detected)
3. 🔧 Tool call: lookup_equipment_knowledge(query="washing machine E4 error")
4. ↩ Result: "E4: water supply issue — machine not filling within time limit. Check: inlet valves open, inlet filters not clogged, water pressure adequate."
5. Agent speaks: "E4 on a Samsung typically means a water supply issue. Are both the hot and cold valves behind the machine fully open?" (PCM 24 kHz → AudioTrack → speaker)

The Architecture Is a Template

The camera and audio pipeline, WebSocket protocol, ADK skills structure, safety-first enforcement, and Ray-Ban glasses support are all reusable. A new vertical is a new set of tools and a new system prompt — the streaming core stays the same.

New vertical = new SKILL.md files + new function tools. The streaming infrastructure doesn't change.


Built for the Gemini Live Agent Challenge using Google ADK, Gemini 2.5 Flash Native Audio, Cloud Run, and a native Android app with CameraX, Ray-Ban Meta glasses support, and Jetpack Compose.