I Built a Real-Time AI Agent That Sees Your Screen and Does the Clicking. Here's Every Bug That Nearly Broke Me.

Spectra, real-time AI agent that sees your screen and does the clicking

I spent the last three weeks building something I badly wanted during my worst screen-tired days: an agent that can see your screen, hear your voice, and do the clicking for you.

It runs in real time, talks while it works, and almost broke me with timing bugs along the way.

Meet Spectra, built for people who are shut out by “normal” interfaces.

TL;DR: Solo build, three weeks. Real-time AI agent powered by Gemini Live API, bidirectional audio, multimodal vision, and browser control in a single WebSocket. Sub-second voice-to-action. Zero data stored. Apache 2.0.

“Voice audio streams in, native audio streams back, and tool calls interleave with speech mid-conversation.” One WebSocket. No turn boundaries.


There is a specific kind of tired you get from staring at screens all day where you stop reading and start scanning. You know the feeling. Your eyes move but nothing goes in. You click the wrong tab. You re-read the same sentence four times.

One night I was in that state, trying to catch up on something, and I thought: I wish I could just close my eyes and have someone read this to me. Not a podcast. Not a summary. The actual thing, on the actual page I was already on, in real time.

That thought did not leave.

Because while I was tired from a long day, there are 2.2 billion people who navigate the web like that every single day. Not because they are tired, but because the web was never built for them. 96% of the top million websites fail basic accessibility standards. Screen readers read DOM trees, not meaning. They describe. They never act.

So I built Spectra: an AI agent that sees your screen, hears your voice, and does the clicking for you. No mouse. No keyboard. No reading required.

This is how I built it, and every bug that nearly broke me along the way.


What Spectra Does

Spectra is a real-time AI agent that closes the loop between seeing and doing:

  1. Sees your screen continuously via live video stream
  2. Listens to your voice in real time: no button press, no wake-and-wait
  3. Understands layout, text, images, buttons, forms: everything a sighted person would see
  4. Acts on your behalf: clicks, types, scrolls, navigates, fills forms
  5. Speaks back naturally, in 30+ languages, interruptible mid-sentence

Here's what a real interaction looks like:

You: “Go to BBC News and read me the top headline.”

Spectra: “You're on BBC News. The top story is: 'Scientists confirm water ice found beneath Mars south pole.' Want me to open it?”

You: “Yes, open it.”

Spectra: “Opening the article... The piece starts: Researchers at the European Space Agency have confirmed the largest deposit of water ice ever detected on Mars. Want me to keep reading?”

No mouse. No keyboard. No reading. A task that takes a sighted person 30 seconds, done entirely by voice, on any website, without the site needing to support any accessibility standard.

Spectra home interface showing the eye logo, wake word activation, and screen sharing controls

Spectra's home interface with wake word detection and screen sharing controls

The Problem: The Accessibility Gap

Billions of people are underserved by today's assistive technology:

  • 2.2 billion people worldwide live with a vision impairment
  • 96% of websites fail accessibility standards
  • 3x longer to complete basic tasks

Traditional screen readers can't see images, understand layout, or take action.

Until now.


Why This Needs a Persistent Streaming Primitive

The request-response pattern (screenshot in, text out, parse, act, repeat) has a ceiling. There is always a gap, always a turn boundary, always a moment where the AI is gone and you are waiting. For accessibility use cases that is a fundamental problem, not a performance issue.

Gemini Live API exposes that primitive: bidiGenerateContent.

It is a persistent bidirectional WebSocket: voice audio streams in continuously, native audio streams back in real time, and tool calls (click, type, navigate) interleave with speech mid-conversation. The result is that Spectra talks while it works, not waiting until it finishes clicking before responding. It sounds and feels like a person sitting next to you at a computer.

With screenshot-per-turn models, there's always a dead air gap. With a persistent audio stream, Spectra feels like sitting next to a human who's talking while they work.


Architecture

Spectra has four components: a Next.js frontend that captures your screen and mic, a FastAPI backend on Cloud Run that bridges the Gemini Live API, the Gemini Live API itself, and a Chrome extension that executes browser actions.

Spectra system architecture: User Browser connects to Next.js Frontend (audio capture, screen capture, audio playback, wake word) via WebSocket to FastAPI Backend on Cloud Run (SpectraStreamingSession, tool router, system prompt, session state) which connects to Gemini Live API (native audio I/O, multimodal vision, thinking budget, function calling) and Chrome Extension (content.js DOM executor, background.js message router)

Spectra's full system architecture, from user input to browser automation

The Agent Loop

Spectra runs a continuous observe → think → plan → act loop:

  1. Observe: screen frames (2 FPS) and audio (16kHz) stream to Gemini
  2. Think: Gemini reasons over visual + audio context (suppressed CoT)
  3. Plan: Gemini selects a tool call with parameters (e.g., click_element)
  4. Act: the call routes through WebSocket → extension → tab, and results flow back

This loop runs continuously with no turn boundaries. The user can interrupt at any point (barge-in), and Spectra stops immediately.
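
The observe step is mostly a throttling problem: outgoing frames are capped at 2 FPS. Here's a minimal timing sketch of such a throttle; the class name and interface are illustrative, not Spectra's actual code:

```python
class FrameThrottle:
    """Caps outgoing screen frames at a target FPS (2 by default).
    Timing logic only; real capture and JPEG encoding happen in the
    frontend before frames reach this point."""

    def __init__(self, fps: float = 2.0):
        self.min_interval = 1.0 / fps      # 0.5s between frames at 2 FPS
        self.last_sent = float("-inf")     # allow the first frame through

    def should_send(self, now: float) -> bool:
        """Return True if enough time has passed to send another frame."""
        if now - self.last_sent >= self.min_interval:
            self.last_sent = now
            return True
        return False
```

Frames that arrive too soon are simply dropped, which is fine for UI understanding: the next frame replaces the last anyway.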

Technical Deep Dive

The Core: SpectraStreamingSession

The heart of Spectra is a single class, SpectraStreamingSession in backend/app/streaming/session.py, that manages the bidirectional bridge between the browser WebSocket and Gemini Live API. It's about 1,600 lines, and honestly, it's the file I've rewritten the most.

Connecting to Gemini Live API:

python
from google import genai
from google.genai import types

# On Google Cloud: Vertex AI with service account
client = genai.Client(
    vertexai=True, 
    project=gcp_project, 
    location="europe-west1"
)

config = types.LiveConnectConfig(
    response_modalities=["AUDIO"],
    system_instruction=types.Content(
        parts=[types.Part(text=system_prompt)]
    ),
    tools=SPECTRA_TOOLS,   # 9 browser action tools
    speech_config=types.SpeechConfig(
        voice_config=types.VoiceConfig(
            prebuilt_voice_config=types.PrebuiltVoiceConfig(
                voice_name="Aoede"
            )
        )
    ),
)

async with client.aio.live.connect(
    model="gemini-live-2.5-flash-native-audio", 
    config=config
) as session:
    # session is now a persistent bidirectional stream:
    # audio, video, and tool calls all flow through
    # this single connection
    ...
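
Once the session is open, streaming input is just repeated send_realtime_input calls. Here's a sketch of the mic side; the Blob class below is a local stand-in for google.genai.types.Blob so the snippet runs on its own, and the chunk sizing is my assumption, not Spectra's measured value:

```python
import asyncio
from dataclasses import dataclass

# Stand-in for types.Blob so the sketch is self-contained; the real
# code wraps chunks with google.genai.types as shown above.
@dataclass
class Blob:
    data: bytes
    mime_type: str

SAMPLE_RATE = 16_000                                   # input PCM rate
CHUNK_MS = 20                                          # audio per message (assumed)
BYTES_PER_CHUNK = SAMPLE_RATE * 2 * CHUNK_MS // 1000   # 16-bit mono chunks

async def pump_mic(session, mic_chunks):
    """Forward raw PCM chunks from the mic into an open Live session.
    `session` is assumed to expose send_realtime_input as in the
    connect example above."""
    async for pcm in mic_chunks:
        await session.send_realtime_input(
            audio=Blob(data=pcm, mime_type="audio/pcm;rate=16000")
        )
```

The same pattern applies to video frames, swapping the mime type for image/jpeg.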

The Audio Gate, The Trickiest Bug

This one kept me up at night. Gemini sometimes starts speaking before it decides to call a tool. It might say “Sure, clicking,” and then issue a click_element call. If I forward that premature audio, the user hears “Sure, clicking,” followed by silence (while the tool executes), then the actual result. It sounds broken, like Spectra is announcing things before they happen.

Problem: premature audio
“Sure, clicking,” → tool call → silence → result
Broken: Spectra narrates before acting

Solution: 200ms audio gate
Buffer → 200ms wait → tool call? → discard → execute → post-action audio
Smooth: Spectra acts, then speaks

My solution: buffer audio at the start of each model turn. If a tool call arrives within 200ms, discard the buffer (it was premature narration). If no tool call arrives, flush the buffer to the speaker:

python
# In __init__: audio-gating state, reset at the start of each model turn
self._audio_buffer: list[dict] = []
self._audio_gate_open: bool = False

async def _flush_audio_gate(self):
    """Flush after a 200ms hold-off.
    Cancelled if a tool call arrives first."""
    await asyncio.sleep(0.20)
    for msg in self._audio_buffer:
        await self.websocket.send_json(msg)
    self._audio_buffer.clear()
    self._audio_gate_open = True

# In the receive loop, when a tool call arrives: discard premature audio
if response.tool_call:
    self._audio_buffer.clear()      # discard "Sure, clicking,"
    self._audio_gate_open = True    # post-tool audio flows directly
    await self._handle_tool_calls(response.tool_call)

The root cause is timing, not model quality. The model generates narration at the right moment for a conversation but at the wrong moment for tool execution. The audio gate separates those two modes.
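
To make the gate's lifecycle concrete, here's the same idea as a self-contained class: buffer on turn start, flush after the hold-off, cancel and discard when a tool call wins the race. The names and structure are mine; Spectra's real implementation lives inside SpectraStreamingSession:

```python
import asyncio

class AudioGate:
    """Sketch of the 200ms hold-off described above (timing assumed).
    Audio buffers until either the hold-off elapses (flush) or a tool
    call arrives first (discard)."""

    def __init__(self, send, hold_off: float = 0.2):
        self.send = send            # async callable: forward one message
        self.hold_off = hold_off
        self.buffer: list = []
        self.open = False
        self._task: asyncio.Task | None = None

    def start_turn(self):
        """Called when a new model turn begins: close the gate."""
        self.buffer.clear()
        self.open = False
        self._task = asyncio.ensure_future(self._flush_later())

    async def _flush_later(self):
        # No tool call within the hold-off: the narration was genuine.
        await asyncio.sleep(self.hold_off)
        for msg in self.buffer:
            await self.send(msg)
        self.buffer.clear()
        self.open = True

    async def on_audio(self, msg):
        if self.open:
            await self.send(msg)    # gate open: stream directly
        else:
            self.buffer.append(msg) # gate closed: hold it back

    def on_tool_call(self):
        if self._task:
            self._task.cancel()     # stop the pending flush
        self.buffer.clear()         # drop premature narration
        self.open = True            # post-tool audio flows directly
```

The race between _flush_later and on_tool_call is the whole trick: whichever fires first decides whether the buffered audio was narration or noise.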

Transparent Gemini Reconnection

Gemini Live sessions have a 15-minute limit (a go_away signal). When it fires, I reconnect to Gemini without closing the browser WebSocket. The user never notices; their session continues seamlessly:

python
while not self._client_disconnected:
    async with client.aio.live.connect(
        model=model, config=config
    ) as session:
        # Re-inject context: current URL, latest frame, 
        # extension status
        await session.send_realtime_input(text=context)
        if latest_frame:
            await session.send_realtime_input(
                video=types.Blob(
                    data=frame, 
                    mime_type="image/jpeg"
                )
            )
        
        # Process messages until go_away or error
        async for response in session.receive():
            if response.go_away:
                break  # reconnect loop handles this
            # ... handle audio, text, tool calls

Nine Agent Tools

Spectra has nine tools that Gemini can call, split between server-side and client-side:

| Tool | Side | Purpose |
| --- | --- | --- |
| describe_screen | Server | Trigger fresh visual analysis of the current frame |
| read_page_structure | Server | Fetch DOM structure with labels and selectors (works without screen share) |
| read_selection | Client | Read selected text, paragraph, or full page |
| click_element | Client | Click by text label or coordinates with HiDPI scaling |
| type_text | Client | Type into an input field, targeted by description |
| scroll_page | Client | Scroll in any direction |
| press_key | Client | Press any key or shortcut (Enter, Tab, Ctrl+A, etc.) |
| navigate | Client | Go to a URL |
| highlight_element | Client | Visual purple highlight for feedback |

Server-side tools execute in the backend. Client-side tools route through WebSocket → frontend → Chrome extension → target tab, with results flowing back to Gemini.
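
One way to implement that round trip is a correlation-id pattern: each tool call goes out over the WebSocket with an id, and the matching result from the extension resolves a pending future. This is a sketch of the pattern, not Spectra's actual wire format:

```python
import asyncio
import itertools

class ToolRouter:
    """Sketch of client-side tool routing. Message shapes ("type",
    "id", "name", "args", "result") are assumptions for illustration."""

    def __init__(self, send_json):
        self.send_json = send_json                      # async: push to browser WS
        self.pending: dict[int, asyncio.Future] = {}    # id → awaiting caller
        self.ids = itertools.count(1)

    async def call(self, name: str, args: dict) -> dict:
        """Forward one tool call and wait for the extension's result."""
        call_id = next(self.ids)
        fut = asyncio.get_running_loop().create_future()
        self.pending[call_id] = fut
        await self.send_json({"type": "tool_call", "id": call_id,
                              "name": name, "args": args})
        return await fut        # resolved when the extension replies

    def on_result(self, msg: dict):
        """Called when a result message arrives from the frontend."""
        self.pending.pop(msg["id"]).set_result(msg["result"])
```

The returned dict is what gets packaged into a function response back to Gemini, so the model sees tool results in the same stream as everything else.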

The Chrome Extension: Spectra Bridge

The extension is the “hands” of Spectra. It runs in two contexts:

  • content.js: injected into every tab. Receives action messages and executes them: clicking elements, typing text, scrolling, pressing keys. Shows a purple highlight overlay on clicked elements for visual feedback.
  • background.js: a service worker that routes messages from the Spectra frontend tab to whichever tab the user is browsing.

I went with a description-first element finding strategy: instead of relying on coordinates (which drift when pages re-render), Spectra matches elements by their visible text, aria-label, or title attribute. Coordinates are a fallback, not the primary targeting method. This made a huge difference in reliability. Clicking by description works across page re-renders, zoom levels, and dynamic content.

Spectra chat interface showing real-time voice interaction with the AI agent performing browser actions

Real-time voice interaction with Spectra performing browser actions and providing natural language responses


Deploying on Google Cloud

Spectra runs entirely on Google Cloud Run with a one-command deployment:

bash
./deploy.sh your-gcp-project-id europe-west1

Google Cloud Architecture (europe-west1): GitHub Actions CI/CD pushes to Cloud Build and Artifact Registry, deploying to Cloud Run Frontend (Next.js 14, 1 vCPU, 512 MiB) and Backend (FastAPI + WebSocket, 2 vCPU, 1 GiB, session affinity), with Secret Manager, Cloud Logging, and HTTPS/CORS

Google Cloud deployment architecture with pay-per-use pricing and auto-scaling to zero

This script enables GCP APIs, stores the Gemini API key in Secret Manager, deploys the FastAPI backend (2 vCPU, 1 GiB, session affinity, 3600s timeout) and Next.js frontend to Cloud Run, and patches CORS.

Both services auto-scale from 0–10 instances. The backend uses session affinity to keep WebSocket connections pinned to the same instance across the Gemini session lifetime (up to 15 minutes).


Accessibility by Design

Spectra isn't accessible as an afterthought: accessibility is the whole point. I built this because my mum needed it. Everything else followed from that.

For Blind Users

  • Wake word activation: say “Hey Spectra” to start. No button to find.
  • Keyboard shortcuts: Q (toggle), W (screen share), Escape (stop). All work without sight.
  • ARIA live regions: two of them (assertive for urgent events, polite for responses) announce state changes to screen readers running alongside Spectra.
  • Skip-to-content link: standard accessibility pattern for keyboard navigation.
  • Screen reader compatible: tested with VoiceOver (macOS), NVDA, and JAWS.

For the Audio-First Experience

  • Spatial language: “top left,” “centre of the page,” “just below the header.”
  • Natural numbers: “twenty-three,” not “23.” “The fifth result,” not “result 5.”
  • One thing at a time: never a wall of text. Headlines first, then summary, then detail.
  • Barge-in support: interrupt Spectra mid-sentence. She stops immediately.
  • Never asks the user to click: Spectra does everything. “Click here” is forbidden in the system prompt.

Privacy

  • Zero data stored: screenshots exist as a single variable in memory. Each new frame replaces the last. No files, no database, no cloud storage.
  • No accounts, no tracking, no analytics.
  • Open source: Apache 2.0 for transparency and community audit.

Key Technical Decisions

| Decision | Why |
| --- | --- |
| Gemini 2.5 Flash | Only model with native bidirectional audio + vision + function calling in a single streaming API |
| WebSocket vs HTTP/SSE | Bidirectional audio + video + tool calls require full-duplex communication |
| JPEG @ 2 FPS vs video stream | ~80KB per frame: low bandwidth, high enough fidelity for UI understanding |
| Description-first clicking vs coordinate-only | Pages re-render and coordinates shift; text/aria-label matching is more reliable |
| Cloud Run vs GKE/VMs | Auto-scaling, managed HTTPS, session affinity for WebSockets, pay-per-use |
| Chrome Extension vs browser automation | Direct DOM access on any tab; no Puppeteer process, no headless browser |

The Numbers

The part I'm most proud of is not just that it works; it's that it works in real time on normal connections.

| Metric | Value |
| --- | --- |
| Screen capture rate | 2 FPS adaptive JPEG (~80KB/frame) |
| Audio sample rate | 16kHz PCM (input) / 24kHz PCM (output) |
| Languages supported | 30+ (Gemini native audio) |
| Agent tools | 9 (2 server-side, 7 client-side) |
| Data stored on disk | Zero |
| Deployment | One command (./deploy.sh) |
| Source code | ~16,700 lines |
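
The audio rows translate directly into bandwidth: 16-bit mono PCM at those rates is a fixed number of bytes per second. A quick arithmetic check:

```python
def pcm_bytes_per_second(sample_rate_hz: int, bits: int = 16, channels: int = 1) -> int:
    """Raw PCM bandwidth for the mono 16-bit streams in the table."""
    return sample_rate_hz * (bits // 8) * channels

mic_rate = pcm_bytes_per_second(16_000)   # input stream
tts_rate = pcm_bytes_per_second(24_000)   # output stream
```

That puts the mic stream at 32,000 B/s and the speech stream at 48,000 B/s, small enough that the 2 FPS JPEG frames (~160KB/s) dominate total bandwidth.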

What I Learned

Building Spectra solo taught me a few things I didn't expect:

  • Prompt engineering is real engineering. The system instruction took longer to get right than the WebSocket infrastructure. Every word matters when you're shaping real-time audio behaviour.
  • The Live API rewards careful systems thinking. Audio gating, reconnection handling, and VAD tuning are the details that separate a demo from something that actually works. This post documents what those details are.
  • Accessibility is a design constraint, not a feature. When I built for my mum first, the interface got better for everyone.
  • Good infrastructure compounds. The Gemini API, Google Cloud, and open-source tooling made it possible to build something production-grade in three weeks without cutting corners.

Spectra is part of a broader bet we're making at Aqta: AI systems should be both powerful and answerable for how they behave, whether they're governing models in production or helping someone use the web with their voice.


What's Next

Spectra is live and open source. The accessibility gap it addresses is real. What comes next:

  1. Chrome Web Store: package Spectra Bridge as a public extension
  2. Firefox support: port the extension to Firefox's extension model
  3. Mobile: PWA with system-level screen capture
  4. Multi-tab awareness: Spectra remembers what's in each tab
  5. Workflow learning: “Remember that 'check email' means navigate to Gmail”
  6. Community testing: real user testing with VoiceOver, NVDA, and JAWS users

Try It

Self-host

bash
git clone https://github.com/Aqta-ai/spectra.git
cd spectra
cp backend/.env.example backend/.env
# Set GOOGLE_CLOUD_PROJECT or GOOGLE_API_KEY
docker-compose up

Open http://localhost:3000, install the Chrome extension from extension/, press Q and start talking.

Deploy to Cloud Run

bash
./deploy.sh your-project-id europe-west1

Resources

  1. Gemini Live API: bidirectional audio + vision + tool calls in one stream
  2. Gemini API: models, SDKs, and pricing
  3. Cloud Run: deploy containers with WebSocket support
  4. Chrome Extensions: Manifest V3 and content scripts
  5. Web Speech API: wake word and speech recognition in the browser
  6. My mum, for the inspiration

Spectra is open source under Apache 2.0. Star it, fork it, make it better. You can also watch a 60-second product walkthrough at aqta.ai/demo/spectra.


Anya Chueayen

Founder of Aqta. Before this, I worked on integrity at social media platforms, the unglamorous side of AI where human behaviour, edge cases, and ethics collide at scale. That work convinced me that responsible AI needs infrastructure, not just good intentions. Based in Dublin, closely watching how regulation is reshaping what we build and how.

© 2026 Aqta. All rights reserved.
