Tuesday, April 14, 2026

Entry 002: The Machine Learns to Listen

Entry 001 ended with a promise. The machine had opened its eyes — live video streaming from glasses I’d never touched into an app that didn’t exist two days earlier. The camera was the easy win, I wrote. The real experiment is voice.

Seventeen days later, the machine learned to listen. And to shut up when interrupted.


The architecture came first, and it came from a place I didn’t expect: a virtual boardroom. Before writing a single line of Swift, I asked ten senior women — a data scientist from Booking, a product lead from Spotify, an engineer from Stripe, a behavioral scientist from DeepMind, six others — what they thought about the design. They’re not real people; they’re personas I built to catch my blind spots, and the reason I trust them is that they disagree with each other more than they agree with me.

Mei, the Stripe engineer, said: struct, not enum. ModeConfig as a read-only value the controller consumes but never mutates. At Stripe, she said, they learned the hard way — enums with associated values look clean at two cases but become a switch-statement tax across every consumer when you add case three. Priya, from Spotify, pushed back on extensibility: don’t build for mode three, build two configs that work, and if mode three needs a new field you add it then. At Spotify they called premature extensibility “the plugin nobody plugged.”
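Mei’s advice, sketched as Swift. None of this is the real app’s code — the field names, the modes, and the cue times are my stand-ins — but it shows the shape she argued for: a plain read-only struct, and a third mode becomes a third static value rather than a new case in every consumer’s switch.

```swift
import Foundation

// Read-only value the controller consumes but never mutates.
// All field names and numbers here are illustrative guesses.
struct ModeConfig {
    let name: String
    let sessionLimit: TimeInterval   // hard stop for the session, in seconds
    let timeboxCues: [TimeInterval]  // seconds-remaining marks that trigger a cue
    let allowsBargeIn: Bool

    static let presentation = ModeConfig(
        name: "Presentation Co-pilot",
        sessionLimit: 45 * 60,
        timeboxCues: [10 * 60, 5 * 60, 2 * 60, 30],
        allowsBargeIn: true
    )

    static let biking = ModeConfig(
        name: "Biking Assistant",
        sessionLimit: 2 * 60 * 60,
        timeboxCues: [],             // no countdown while riding
        allowsBargeIn: true
    )
}
```

Priya’s point lives in the same sketch: there is no `case mode3` waiting anywhere. If mode three needs a new field, you add the field then.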

I listened to both of them, and then I wrote the code, and the code was better because the argument happened before the implementation rather than inside it.


The v1 app was one brain talking to one pair of eyes. Camera frames flowing in, Claude sending text back, a conversation displayed on a phone screen nobody was looking at while walking. It worked. It also missed the point.

The v2 architecture treats the app as three cooperating organs, a word I borrowed from the research doc and kept because it captures something a word like “module” doesn’t — the sense that these pieces are alive in the way that a kidney is alive, doing one thing quietly and well, and the system dies if any one of them stops.

ModeController is the heartbeat. It knows what mode you’re in — presentation co-pilot, biking assistant — and it owns the clock, the session state machine, the timebox cues that count down from the end of your talk. It fires callbacks when it’s time to remind you that five minutes remain, but it doesn’t know or care what happens in those callbacks. It doesn’t know Claude exists.
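The “doesn’t know Claude exists” property falls out of the callback design. A minimal sketch of the idea, with all names and the tick-driven clock assumed rather than taken from the real code:

```swift
import Foundation

// The controller owns the clock and the cue schedule. Its callbacks
// are opaque hooks: it never learns what listens on the other side.
// Names and the tick-based clock are illustrative assumptions.
final class ModeController {
    enum SessionState { case idle, running, ended }

    private(set) var state: SessionState = .idle
    private var remainingCues: [TimeInterval]  // seconds-remaining marks, descending
    private let sessionLimit: TimeInterval
    var onCue: ((TimeInterval) -> Void)?       // "5 minutes remaining", etc.
    var onSessionEnd: (() -> Void)?

    init(sessionLimit: TimeInterval, cues: [TimeInterval]) {
        self.sessionLimit = sessionLimit
        self.remainingCues = cues.sorted(by: >)
    }

    func start() { state = .running }

    // Driven by whatever owns the timer (the app there, a test here).
    func tick(elapsed: TimeInterval) {
        guard state == .running else { return }
        let remaining = sessionLimit - elapsed
        while let next = remainingCues.first, remaining <= next {
            remainingCues.removeFirst()
            onCue?(next)                       // fire the mark we just crossed
        }
        if remaining <= 0 {
            state = .ended
            onSessionEnd?()
        }
    }
}
```

Whether the hook plays an earcon, pings Claude, or does nothing at all is someone else’s decision — which is the whole point.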

VoiceServer is the ears and mouth. It wraps Apple’s speech recognition and text-to-speech in a layer that understands turn-taking — when the human is speaking, when the machine is speaking, and the critical moment when the human starts speaking while the machine is still talking. That moment has a name in the literature: barge-in. In most voice systems, barge-in is a bug they try to suppress. In ours, it’s the most important feature. The machine must learn to shut up when you start talking, instantly, without finishing its sentence, without a polite “I’ll wait.” Because the alternative — an AI that talks over you and won’t stop — is the fastest way to make someone throw a computer off their face.
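The barge-in rule is small enough to state as code. A turn-taking sketch under my own naming (the real VoiceServer wraps Apple’s engines; here the stop call is just a hook), showing the one invariant that matters — human speech during machine speech cancels the utterance immediately:

```swift
// Minimal turn-taking state machine. All names are assumptions;
// the invariant is the point: barge-in cuts the machine off
// mid-sentence, with no polite "I'll wait."
final class TurnTaker {
    enum Turn { case idle, humanSpeaking, machineSpeaking }

    private(set) var turn: Turn = .idle
    private(set) var interruptions = 0
    var stopSpeaking: (() -> Void)?   // hook to the real TTS engine's stop call

    func machineStartedSpeaking() {
        turn = .machineSpeaking
    }

    func machineFinishedSpeaking() {
        if turn == .machineSpeaking { turn = .idle }
    }

    // Called the instant the recognizer detects human speech.
    func humanStartedSpeaking() {
        if turn == .machineSpeaking {
            interruptions += 1
            stopSpeaking?()           // barge-in: cancel, don't finish the sentence
        }
        turn = .humanSpeaking
    }

    func humanFinishedSpeaking() {
        if turn == .humanSpeaking { turn = .idle }
    }
}
```

In a real wrapper around Apple’s text-to-speech, the `stopSpeaking` hook would call `AVSpeechSynthesizer`’s `stopSpeaking(at: .immediate)` — the `.immediate` boundary, not `.word`, because finishing even the current word is still talking over you.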

ContextServer is the memory. It captures every exchange as a snippet with a salience score, builds a prompt skeleton for Claude that includes only what matters within a token budget, and after the session ends, surfaces “moments worth examining” — the snippets that scored highest, the ones most likely to be worth thinking about again.
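The salience-plus-budget mechanism can be sketched in a few lines. Everything here is assumed — the scoring scale, the field names, and especially the rough characters-per-token estimate — but it shows the two moves: greedily pack the highest-salience snippets into the budget, and surface the top scorers afterward.

```swift
// Sketch of salience scoring and a token budget. The 0...1 scale,
// the names, and the 4-chars-per-token heuristic are my assumptions.
struct Snippet {
    let text: String
    let salience: Double          // 0...1, higher = more worth keeping
}

struct ContextServer {
    private(set) var snippets: [Snippet] = []

    mutating func log(_ text: String, salience: Double) {
        snippets.append(Snippet(text: text, salience: salience))
    }

    // Prompt skeleton: highest-salience snippets that fit the budget.
    func promptSkeleton(tokenBudget: Int) -> [String] {
        var remaining = tokenBudget
        var chosen: [String] = []
        for s in snippets.sorted(by: { $0.salience > $1.salience }) {
            let cost = max(1, s.text.count / 4)   // crude token estimate
            guard cost <= remaining else { continue }
            remaining -= cost
            chosen.append(s.text)
        }
        return chosen
    }

    // "Moments worth examining": the top scorers after the session ends.
    func highlights(_ n: Int) -> [Snippet] {
        Array(snippets.sorted(by: { $0.salience > $1.salience }).prefix(n))
    }
}
```

How salience itself gets assigned is the interesting open question, and nothing in this sketch answers it.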

MCPOrchestrator is the glue. It wires the organs together: user speaks → VoiceServer transcribes → ContextServer retrieves relevant context → Claude reasons → VoiceServer speaks the response → ContextServer logs it. When a timebox cue fires, it plays an earcon. When the glasses disconnect mid-session, it attempts to reconnect. When the session ends, it generates a summary.
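That pipeline is pure wiring, and pure wiring is easy to sketch: each organ becomes an opaque function, and the orchestrator is their composition. Signatures and names below are illustrative, not the real app’s.

```swift
import Foundation

// Glue-layer sketch: the orchestrator knows the order of operations
// and nothing else. Every closure is a stand-in for a real organ.
struct Orchestrator {
    let transcribe: (Data) -> String            // VoiceServer: audio in → text
    let retrieveContext: (String) -> [String]   // ContextServer: query → snippets
    let reason: (String, [String]) -> String    // Claude: text + context → reply
    let speak: (String) -> Void                 // VoiceServer: text → audio out
    let logExchange: (String, String) -> Void   // ContextServer: persist the turn

    // user speaks → transcribe → retrieve → reason → speak → log
    func handleUtterance(_ audio: Data) -> String {
        let text = transcribe(audio)
        let context = retrieveContext(text)
        let reply = reason(text, context)
        speak(reply)
        logExchange(text, reply)
        return reply
    }
}
```

Because the organs are injected, the whole pipeline can be exercised with stubs and no glasses on the desk — which is how the hundred-odd unit tests stay honest without hardware.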

Six stories. Thirteen story points. One hundred and three unit tests. Roughly twenty-eight hundred lines of Swift. One afternoon.


The earcons were the part that surprised me. Six pre-baked audio cues — “Session started,” “10 minutes remaining,” “5 minutes remaining,” “2 minutes remaining,” “30 seconds,” “Session complete” — generated on the command line with say and afconvert, each one a few kilobytes. They sound like what they are: a professional instrument giving you a time signal, not a chatty friend counting down your presentation. The difference matters more than it should.

Günther Anders described Promethean shame as the feeling of inadequacy humans experience when confronted by the machines they’ve created. The temptation is to either reject the machine or surrender to it. Neither works. The third path — the one this project is walking — is to use the machine as a craft tool while maintaining the judgment the machine cannot have. The earcons are a tiny example of that principle made audible. They don’t say “You’re doing great!” They say “5 minutes remaining.” The warmth is in the precision, not in the flattery.


The glasses are sitting on my desk. They have been since Entry 001. Tomorrow — or tonight, depending on how restless I get — they go back on, and I find out whether the voice pipeline actually works when a real human is wearing real hardware in a real room, speaking real sentences at a machine that’s supposed to listen, understand, respond, and most importantly, shut up when told to.

The simulator lies about Bluetooth and audio. It always has. The only truth is the device.

Entry 001 was first light: the machine opened its eyes. Entry 002 is first breath: the machine learned to listen. Entry 003 will be first contact: the machine and I walk through Amsterdam together, talking.

If it works, I’ll know because I’ll forget I’m wearing it.


Merge Diary is the build log of Anders Cyborg — a wearable AI companion project. Each entry is written after a session with the glasses. Raw logs, not polished prose. The writing is the build.