The Problem With Every Repair Tool That Exists Today
Picture a maintenance technician on a factory floor, hands inside an electrical panel, Ray-Ban glasses on. An AI agent is watching through the glasses camera — seeing the exact wiring configuration, the exact components — talking them through the fault diagnosis step by step. Hands-free. No manual to flip through. Just a voice that sees what they see.
Every existing repair tool — manuals, YouTube tutorials, support hotlines — shares the same fundamental flaw: they're blind. They can't see what you're looking at. They can't adapt when your setup is different from the example. They can't tell you "the valve on your left, the one with the red handle" because they have no idea what's in front of you.
What Makes This Different From a Voice Assistant
The combination that makes FixIt Genie work isn't any single capability — it's all three running simultaneously: continuous vision (camera frames arriving at 1 FPS), bidirectional audio (you can speak and be heard at any moment, including mid-response), and domain knowledge retrieval (six tools firing mid-conversation without breaking the flow).
Remove any one of these and the experience degrades. Vision without voice is just image search. Voice without vision is Siri. Knowledge without real-time delivery is a PDF. Together, they produce something that behaves like a knowledgeable colleague standing next to you — one who can say "I can see the corrosion on your terminal" and mean it.
What FixIt Genie Does
FixIt Genie is a native Android app. Point your camera at broken equipment, describe the problem out loud, and the agent identifies what it sees, queries its knowledge base, and walks you through the repair — one step at a time, confirming each step through the camera before giving the next one.
Visual understanding. Camera frames arrive at 1 FPS. The agent reads error codes on displays, checks gauge levels, identifies corrosion on battery terminals, and spots issues you haven't asked about. It describes what it sees out loud — "I can see white buildup on your positive terminal" — so you know it's actually looking, not guessing.
Bidirectional conversation, not push-to-talk. Audio streams continuously in both directions. You can speak mid-response and the agent stops immediately. No button to hold, no turn indicator to wait for. The Gemini native audio model's VAD handles this automatically — say "wait" or "hold on" and it listens.
Safety checks are not optional. The system prompt enforces a hard rule: get_safety_warnings() fires before any instruction involving physical action. The agent won't tell you to open a breaker panel without first warning about lethal voltages on the bus bars. It won't guide coolant system work without warning about pressurized steam. This is enforced in code, not just guidelines.
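A minimal sketch of what such a safety-gate tool could look like. The equipment categories and warning strings below are invented examples for illustration, not FixIt Genie's actual knowledge base:

```python
# Illustrative sketch of a safety-gate tool in the style the text describes.
# Categories and warning text are invented examples, not the app's real data.
SAFETY_WARNINGS = {
    "electrical_panel": [
        "De-energize the circuit and verify with a non-contact tester.",
        "Bus bars may remain live even with the main breaker off.",
    ],
    "coolant_system": [
        "Never open a pressurized coolant system while hot: risk of steam burns.",
    ],
}

def get_safety_warnings(equipment_type: str) -> list[str]:
    """Return mandatory warnings; the agent must voice these before any physical step."""
    return SAFETY_WARNINGS.get(
        equipment_type,
        ["No specific warnings found: advise general caution and protective equipment."],
    )
```

The enforcement itself lives in the system prompt (call this tool before any physical instruction); the tool just guarantees the warnings are retrievable.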
Ray-Ban Glasses: Hands-Free Repair
For situations where holding a phone gets in the way — under a car, inside an appliance panel, deep in an engine bay — FixIt Genie supports Ray-Ban Meta glasses as a fully integrated camera source. One tap switches the live video stream from your phone camera to the glasses. The same AI agent that was watching your phone is now watching through your eyewear, hands-free.
This isn't a novelty. For professional use cases — factory floor, field service, surgical suite — it's the only viable form factor. You cannot hold a phone while your hands are inside running machinery.
- Back camera · 1 FPS
- I420 → JPEG conversion
- No session restart required on switch · agent sees the new stream immediately

The integration uses the Meta DAT SDK (mwdat-core, mwdat-camera) to stream glasses frames via the Wearables SDK. GlassesCameraManager handles I420→JPEG conversion and feeds frames into the same WebSocket pipeline as the phone camera. SessionViewModel exposes a CameraSource enum (PHONE / GLASSES) — switching is live with no session restart.
Architecture: How It All Fits Together
The system is a monorepo: a native Android app and a Python backend on Google Cloud Run.
The Android App (Kotlin + Jetpack Compose)
Three real-time data streams, one WebSocket connection:
Camera (CameraX) — ImageAnalysis captures frames from the back camera at 1 FPS, scaled to 768×768, JPEG-compressed, Base64-encoded, and sent as ADK LiveRequest JSON blobs. When glasses mode is active, GlassesCameraManager replaces this stream with I420→JPEG-converted frames from the Meta DAT SDK.
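The frame pipeline's wire format can be sketched in a few lines of Python. The field names ("blob", "mime_type", "data") are an assumption about the ADK LiveRequest JSON shape, shown here only to illustrate the encode-and-wrap step:

```python
import base64
import json

def frame_to_live_request(jpeg_bytes: bytes) -> str:
    """Wrap one JPEG frame as a JSON blob message for the WebSocket.

    The exact field names are an assumption about the ADK LiveRequest
    wire format, used to illustrate the pipeline shape.
    """
    return json.dumps({
        "blob": {
            "mime_type": "image/jpeg",
            "data": base64.b64encode(jpeg_bytes).decode("ascii"),
        }
    })
```

The same wrapping applies whether the JPEG came from CameraX or from the glasses' I420 conversion, which is why the two sources can share one pipeline.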
Audio Input (AudioRecord) — Raw PCM at 16kHz mono, streamed as Base64-encoded blobs over the same connection. The native audio model handles voice activity detection automatically — no need to signal speech start/end.
Audio Output (AudioTrack) — The agent's voice responses arrive as PCM audio in LiveEvent messages. AudioTrack is configured with USAGE_VOICE_COMMUNICATION and MODE_IN_COMMUNICATION so hardware acoustic echo cancellation kicks in — critical when the agent's voice and the user's microphone are in the same room.
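The microphone stream's sizing is simple arithmetic: 16 kHz, mono, 16-bit PCM. A quick sketch (the chunk duration here is an illustrative choice, not a value from the app):

```python
# Back-of-envelope sizing for the microphone stream: 16 kHz, mono, 16-bit PCM.
SAMPLE_RATE_HZ = 16_000
BYTES_PER_SAMPLE = 2   # 16-bit linear PCM
CHANNELS = 1

def chunk_bytes(duration_ms: int) -> int:
    """Raw bytes in one PCM chunk of the given duration."""
    return SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * CHANNELS * duration_ms // 1000

# One second of mic audio is 32,000 raw bytes; Base64 inflates that by ~4/3,
# still tiny next to the JPEG frames sharing the connection.
```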
Key components on the app side:
- SessionViewModel
- AudioStreamManager
- AgentWebSocket
- GlassesCameraManager
- GenieAvatar · Compose Canvas
The Backend (Google ADK on Cloud Run)
The backend runs ADK's adk web server, which exposes a /run_live WebSocket endpoint for bidi-streaming. The agent combines three ADK domain skills with six function tools:
Domain Skills (SkillToolset) — loaded on demand:
- 🚗 Automotive
- ⚡ Electrical
- 🏠 Appliances
Each skill is a SKILL.md with behavioral instructions plus references/ markdown docs. The agent calls list_skills to discover available domains and load_skill to pull one into context only when needed — keeping the context window lean.
Function Tools:
- 🔧 lookup_equipment_knowledge
- ⚠️ get_safety_warnings
- ▶️ analyze_youtube_repair_video
- 📄 lookup_user_manual
- 🔍 google_search (built-in)
- 📝 log_diagnostic_step
from google.adk.agents import Agent
from google.adk.skills import load_skill_from_dir
from google.adk.tools.skill_toolset import SkillToolset
from google.adk.tools.google_search_tool import GoogleSearchTool

# _SKILLS_DIR points at the skills/ directory (defined elsewhere)
_skill_toolset = SkillToolset(skills=[
    load_skill_from_dir(_SKILLS_DIR / "automotive"),
    load_skill_from_dir(_SKILLS_DIR / "electrical"),
    load_skill_from_dir(_SKILLS_DIR / "appliances"),
])

agent = Agent(
    model="gemini-2.5-flash-native-audio-latest",
    name="fixitgenie",
    instruction=SYSTEM_INSTRUCTION,
    tools=[
        _skill_toolset,
        lookup_equipment_knowledge,
        get_safety_warnings,
        log_diagnostic_step,
        GoogleSearchTool(bypass_multi_tools_limit=True),
        analyze_youtube_repair_video,
        lookup_user_manual,
    ],
)
Deployment
Cloud Run with session affinity (critical for persistent WebSocket connections). The deploy.sh script is a single-command IaC deployment: enables APIs, creates the Artifact Registry, builds the container, deploys. One discovery worth noting: gemini-2.5-flash-native-audio-latest only works with the Gemini API, not Vertex AI. GOOGLE_GENAI_USE_VERTEXAI=FALSE plus a direct API key is the required configuration.
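A sketch of the deploy step the paragraph describes. The service name, region, and image path below are placeholders, and the real deploy.sh may differ; the load-bearing flags are --session-affinity and the Gemini API env configuration:

```shell
# Illustrative gcloud invocation; names, region, and image path are placeholders.
# --session-affinity keeps each client's WebSocket pinned to one instance.
gcloud run deploy fixitgenie \
  --image "us-central1-docker.pkg.dev/PROJECT_ID/fixitgenie/backend:latest" \
  --region us-central1 \
  --session-affinity \
  --timeout 3600 \
  --set-env-vars "GOOGLE_GENAI_USE_VERTEXAI=FALSE,GOOGLE_API_KEY=${GOOGLE_API_KEY}"
```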
The Knowledge Stack
Why ADK Skills + Vector Search Instead of a Hardcoded Dict?
The first prototype embedded knowledge directly in tools.py as a Python dictionary. Fast, dependency-free, and architecturally wrong for three reasons: keyword matching fails on synonyms ("engine won't turn over" misses the battery document), adding knowledge requires redeploying code, and a hardcoded dict is a weak answer to "show me evidence of grounding."
The production architecture uses two complementary layers, with web tools as a fallback for the long tail:

ADK Skills — Each skill package is a SKILL.md with behavioral instructions plus references/ markdown docs, loaded on demand via load_skill. It defines how the agent diagnoses: what questions to ask, when to escalate, what safety checks to run. Keeping this behavioral layer separate from the knowledge layer matters — agents that mix behavior and facts in a single blob become brittle — and loading skills only when needed keeps the context window lean.

Firestore Vector Search — Equipment documents are embedded with gemini-embedding-001 (1536 dimensions) and stored in Firestore with a COSINE vector index. lookup_equipment_knowledge calls find_nearest(), so "engine oil pressure alarm" semantically matches the oil system document with zero keyword overlap. Synonyms, paraphrases, and partial descriptions all resolve correctly.

Web tools — For equipment outside the knowledge base, analyze_youtube_repair_video extracts transcripts from niche repair channels and lookup_user_manual fetches and parses manufacturer PDFs.
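The ranking behind that match is cosine distance, which the Firestore COSINE index computes over the embedding vectors. A toy illustration of the principle, with 3-d stand-ins for the real 1536-d gemini-embedding-001 vectors:

```python
import math

def cosine_distance(a: list[float], b: list[float]) -> float:
    """The quantity a COSINE vector index ranks by: 1 - cos(angle between vectors)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / norm

# Toy 3-d stand-ins for real 1536-d embeddings (values are illustrative):
query = [0.9, 0.1, 0.0]      # "engine oil pressure alarm"
oil_doc = [0.8, 0.2, 0.1]    # oil system document
wiper_doc = [0.0, 0.1, 0.9]  # unrelated document

# The oil document ranks closer despite sharing no keywords with the query.
assert cosine_distance(query, oil_doc) < cosine_distance(query, wiper_doc)
```

Because distance is measured in embedding space rather than over tokens, "won't turn over" and "battery" can land near each other even with zero lexical overlap.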
Handling the Long Tail
Nine knowledge documents cover common scenarios well. Someone with a 2009 Mitsubishi Outlander asking about a P2101 throttle actuator code will exhaust the skills KB fast. Three web tools handle the rest — each with non-obvious implementation decisions:
google_search — GoogleSearchTool (ADK's built-in grounding tool) cannot coexist with custom function tools by default. The fix: bypass_multi_tools_limit=True on the GoogleSearchTool instance. Without it, every live session fails with a 400 INVALID_ARGUMENT once deployed.
analyze_youtube_repair_video — The obvious approach (pass the YouTube URL to Gemini's REST API) works for massively popular videos but silently fails for niche repair channels — exactly the content that's most useful. The fix: fetch the transcript independently via youtube-transcript-api, pass the text to Gemini for summarization.
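That flow can be sketched as a small helper plus a network wrapper. The segment shape (a "text" field per segment) and the get_transcript call pattern are assumptions about youtube-transcript-api's interface; treat the wrapper as a sketch, not the app's actual code:

```python
def join_transcript(segments: list[dict]) -> str:
    """Flatten transcript segments into plain text for summarization.

    Assumes each segment carries a "text" field, as youtube-transcript-api
    segments do (an assumption about that library's output shape).
    """
    return " ".join(seg["text"].strip() for seg in segments if seg.get("text"))

def fetch_repair_transcript(video_id: str) -> str:
    """Network path, shown for shape only; the call pattern is an assumption
    based on youtube-transcript-api's documented usage."""
    from youtube_transcript_api import YouTubeTranscriptApi
    return join_transcript(YouTubeTranscriptApi.get_transcript(video_id))
    # ...the returned text would then go to Gemini for summarization...
```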
lookup_user_manual — ManualsLib has no public API and bot-detection blocks scraping. Instead: a grounded Gemini search query finds the manufacturer's official PDF URL, requests fetches it, pypdf extracts the text, a second Gemini call summarizes the relevant sections.
Design Decisions That Had Real Consequences
Native Android, not web. CameraX gives direct access to the camera sensor with no browser sandboxing overhead. AudioRecord with VOICE_COMMUNICATION mode activates hardware acoustic echo cancellation at the driver level — critical when the agent's voice and the user's microphone are in the same room. AudioTrack in MODE_IN_COMMUNICATION routes audio through the earpiece path, enabling AEC. None of this is available in a browser. The torch API also works — useful when pointing a phone into a dark engine bay.
Always-on audio streaming. The first implementation gated the microphone while the agent was speaking — a reasonable attempt to prevent echo feedback from triggering a second response. It also made interruption impossible, because the server never heard the user speaking. Removing the gate and letting Gemini's native VAD handle turn detection restored true bidi-streaming. The server sends interrupted: true when it detects the user speaking mid-response; the app clears the audio queue instantly. This is the difference between a live agent and a voice-skinned chatbot.
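The client-side half of that interruption handling is small. A minimal sketch — the interrupted field comes from the text above, while the class and its queue are illustrative, not the app's Kotlin implementation:

```python
from collections import deque

class PlaybackQueue:
    """Buffers agent audio chunks; drops everything when the user barges in."""

    def __init__(self) -> None:
        self._chunks = deque()

    def on_event(self, event: dict) -> None:
        if event.get("interrupted"):
            self._chunks.clear()           # stop talking the instant the user speaks
        elif "audio" in event:
            self._chunks.append(event["audio"])

    def pending(self) -> int:
        return len(self._chunks)
```

The key property: interruption is handled by discarding buffered output, not by gating the microphone, so the server always hears the user.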
1 FPS video, not a video stream. Equipment doesn't move fast. A JPEG frame once per second is sufficient to read an error code, check a fluid level, or identify a component — at a fraction of the bandwidth and cost of continuous video. The 768×768 resolution preserves enough detail to read small text on appliance displays without saturating the connection.
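The bandwidth argument is back-of-envelope arithmetic. The JPEG size and video bitrate below are ballpark assumptions, not measured figures from the app:

```python
# Rough bandwidth comparison; the JPEG size and video bitrate are
# ballpark assumptions, not measurements.
jpeg_kb = 60                   # assumed size of one 768x768 JPEG frame
stills_kbps = jpeg_kb * 8 * 1  # 1 frame per second -> ~480 kbit/s
video_kbps = 2_000             # assumed bitrate for modest-quality live video

print(f"1 FPS stills: ~{stills_kbps} kbit/s vs continuous video: ~{video_kbps} kbit/s")
```

Even before Base64 overhead, the stills stream runs at a small fraction of a live video feed, and every frame sent is one the model actually inspects.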
Behavioral instructions separate from knowledge. The ADK skills (SKILL.md + reference docs) define how the agent diagnoses — what questions to ask, when to escalate, what safety checks to run for a given domain. The Firestore vector search holds what the agent knows about specific equipment. Mixing behavior and facts in a single prompt produces agents that are either too cautious (safety warnings for everything) or too aggressive (skips checks when confident). Separation keeps each layer tunable independently.
Non-Obvious Things We Discovered
The system prompt is load-bearing for multimodal trust. "Identify the equipment" produces an agent that gives answers. "Describe what you see out loud before identifying it" produces an agent users trust — because they can hear it noticing the same thing they're looking at. For a repair guidance app, trust is not a UX nice-to-have. A user who doubts the agent is seeing their equipment will not follow its instructions into a live electrical panel.
Function calls during live audio streaming don't break the conversation. The native audio model supports custom function calling alongside bidi-streaming. The agent hears "my washing machine shows E4", calls lookup_equipment_knowledge, gets the result, and responds — all without the user experiencing a pause or mode switch. The same applies to the three web tools: google_search for error codes and model-specific procedures not in the KB, analyze_youtube_repair_video for step-by-step instructions from niche repair channels, and lookup_user_manual for extracting error code tables from manufacturer PDFs. The agent reaches the open web mid-conversation without breaking the flow. The diagram below shows the full sequence.
[Sequence diagram: a live tool call with query="washing machine E4 error"]

The Architecture Is a Template
The camera and audio pipeline, WebSocket protocol, ADK skills structure, safety-first enforcement, and Ray-Ban glasses support are all reusable. A new vertical is a new set of tools and a new system prompt — the streaming core stays the same.
- Industrial maintenance — Technician wears Ray-Ban glasses, hands occupied inside machinery. Agent sees through the glasses, cross-references OEM service manuals, guides lockout/tagout before any panel opens.
- Fleet and field service — Agent reads the VIN from the camera, pulls active service bulletins for that chassis, guides the repair and escalates when it's outside safe DIY scope.
- Healthcare equipment — Biomedical tech gets cross-referenced FDA device databases, manufacturer service manuals, and safety warnings specific to radiation interlocks and high-voltage components.
New vertical = new SKILL.md files + new function tools. The streaming infrastructure doesn't change.
Built for the Gemini Live Agent Challenge using Google ADK, Gemini 2.5 Flash Native Audio, Cloud Run, and a native Android app with CameraX, Ray-Ban Meta glasses support, and Jetpack Compose.