Tuesday, April 14, 2026

Entry 003: First Contact

Eight minutes. That’s all it took to find out whether two and a half weeks of architecture, a hundred and three unit tests, and twenty-eight hundred lines of Swift were worth anything.

The glasses had been sitting on my desk since March 28. Seventeen days of building a voice pipeline without once putting them on and talking. Seventeen days of tests passing in a simulator that, as every iOS developer knows, lies about Bluetooth and audio. The simulator always lies.

Today I put the glasses on and said: “What’s the weather like in Amsterdam?”


The first thing that happened was nothing. The glasses played music. Spotify, through Bluetooth, the way they’ve been doing since I bought them. But the app showed a red dot: Not connected. Because Bluetooth audio and the Meta developer SDK are two entirely different things, and the app didn’t know the glasses existed.

I tapped Connect Glasses — a button that wasn’t there twenty minutes earlier because I’d forgotten to put it back when I rewrote the UI. The first bug of the session, fixed and rebuilt and reinstalled in under three minutes. That’s a pace that would have been impossible six months ago, and it’s the most boring part of this story.

The Meta AI app needed to update the glasses firmware. Five minutes of waiting while a progress bar crept across the screen, the kind of waiting that makes you check your phone except my phone was the one doing the updating. When it finished, the dot turned green.


“What’s the weather like in Amsterdam?”

The screen said Listening. Then Processing. Then the response appeared:

“I don’t have real-time weather data, so I can’t tell you the current conditions in Amsterdam. Check weather.com or just look out the window — you’re there.”

That last part. “Just look out the window — you’re there.” That’s not a generic chatbot response. That’s the voice I spent an evening tuning into a system prompt in March, the one that says: warm, dry, precise, short responses, no motivational language. The machine sounded like what I asked it to sound like, which is a strange thing to feel good about when you’re standing in your living room wearing sunglasses indoors and talking to yourself.

The TTS was robotic — Apple’s AVSpeechSynthesizer, the one that sounds like an airport announcement. That’s a known trade-off: we chose speed over quality because the latency budget for a presentation co-pilot is 400 milliseconds, and ElevenLabs streaming adds network hops we haven’t earned yet. The voice will get better. What matters today is that it spoke at all.
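
For scale, the whole current voice path is roughly this much code. A sketch of the approach rather than the app's actual class (the name SpeechOutput and the settings are mine, for illustration):

```swift
import AVFoundation

/// Minimal on-device TTS wrapper. AVSpeechSynthesizer starts speaking
/// almost immediately, which is what keeps us inside the latency budget;
/// the trade-off is the flat, airport-announcement voice.
final class SpeechOutput {
    private let synthesizer = AVSpeechSynthesizer()

    func speak(_ text: String) {
        let utterance = AVSpeechUtterance(string: text)
        utterance.voice = AVSpeechSynthesisVoice(language: "en-US")
        utterance.rate = AVSpeechUtteranceDefaultSpeechRate
        synthesizer.speak(utterance)
    }

    /// Stop mid-word. Used for barge-in, not "after the current sentence."
    func stopImmediately() {
        synthesizer.stopSpeaking(at: .immediate)
    }
}
```

Swapping in ElevenLabs streaming later should mean replacing the body of speak without touching anything that calls it.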


Then my friend walked in.

He asked me something in Bulgarian — what are you doing, or maybe why are you wearing sunglasses inside, the kind of question that doesn’t need an answer because the answer is obviously “something stupid.” I replied in Bulgarian. The glasses heard all of it.

The app processed my Bulgarian as English, decided I’d said something about “telephone give us,” and responded: “Not sure what you mean by ‘telephone give us’ — can you say more?”

This is the moment the entry could have been about failure. Instead it became about the most important feature we haven’t built yet: a wake word. “Hi Anders.” The machine should be deaf until you call its name. Without that, it’s an eavesdropper — picking up every side conversation, every half-muttered aside to a friend, every sentence in every language and trying to make English sense of it. The research docs warned about this. The device test proved it in eight seconds.
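
The gate itself is not much code. A sketch of what I mean, with names invented on the spot because the real thing doesn't exist yet:

```swift
import Foundation

/// Hypothetical wake-word gate: the pipeline stays deaf until it hears
/// its name, then stays awake for a short window before going deaf again.
/// Real matching would need to be fuzzier than a substring check.
struct WakeWordGate {
    let wakePhrase = "hi anders"
    let awakeWindow: TimeInterval = 30
    private var awakeUntil: Date = .distantPast

    /// Returns the text the rest of the pipeline is allowed to hear,
    /// or nil if the machine should pretend it heard nothing.
    mutating func admit(_ transcript: String, at now: Date = Date()) -> String? {
        if let range = transcript.range(of: wakePhrase, options: .caseInsensitive) {
            awakeUntil = now.addingTimeInterval(awakeWindow)
            // Pass along whatever followed the wake phrase, if anything.
            let remainder = transcript[range.upperBound...]
                .trimmingCharacters(in: .whitespacesAndNewlines)
            return remainder.isEmpty ? nil : remainder
        }
        return now < awakeUntil ? transcript : nil
    }
}
```

Everything the gate rejects, including entire side conversations in Bulgarian, never reaches the model at all.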


The barge-in worked.

I asked another question, and when the machine started answering, I talked over it. Mid-sentence. No politeness, no waiting for a pause, just: I started speaking and it stopped. Immediately. Not after finishing its thought, not after a polite “I’ll wait,” just — silence. Then it listened.

This is the feature I care most about. A machine that won’t shut up when you interrupt it is a machine that doesn’t respect you. Every IVR system you’ve ever screamed “REPRESENTATIVE” at is a machine that failed this test. Ours passed it on the first try, and the reason it passed is that we treated barge-in as the primary feature of the voice server, not an edge case to handle later.
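
Treating it as primary means the interrupt check sits at the top of the pipeline instead of buried in a handler somewhere. Roughly this shape, simplified, with illustrative names, leaning on the TTS sketch above:

```swift
/// Simplified barge-in: the moment the voice-activity detector reports
/// the human speaking, playback stops mid-word and the pipeline flips
/// back to listening. Names here are illustrative, not the real classes.
final class VoicePipeline {
    enum State { case idle, listening, speaking }
    private(set) var state: State = .idle
    private let output = SpeechOutput()   // the TTS wrapper sketched earlier

    /// Called by the voice-activity detector whenever human speech starts.
    func humanSpeechDetected() {
        guard state == .speaking else { return }
        output.stopImmediately()           // no finishing the thought
        state = .listening
    }

    func respond(with text: String) {
        state = .speaking
        output.speak(text)
    }
}
```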


The session ran for five minutes. The proactivity engine announced the time too often — “you have about four minutes left,” then three, then two, then one, like an anxious co-pilot who can’t stop checking the clock. That’s a tuning problem, not an architecture problem. The timebox cues worked: the session auto-stopped at zero, played a completion earcon, and showed a summary screen.
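
The tuning fix is small: the countdown is fine, the cue schedule is too chatty. A sketch of the shape (hypothetical names) that cues every minute, which is exactly the anxious behaviour above; the fix is cueing at fewer milestones:

```swift
import Foundation

/// Hypothetical timebox driver: counts down a fixed session, fires a
/// time-remaining cue once per minute, and auto-stops at zero so the
/// orchestrator can play the completion earcon and show the summary.
final class Timebox {
    var onCue: ((String) -> Void)?    // "You have about four minutes left."
    var onComplete: (() -> Void)?     // earcon plus summary screen
    private var remaining: Int        // seconds
    private var timer: Timer?

    init(minutes: Int) { remaining = minutes * 60 }

    func start() {
        timer = Timer.scheduledTimer(withTimeInterval: 1, repeats: true) { [weak self] _ in
            guard let self else { return }
            self.remaining -= 1
            if self.remaining <= 0 {
                self.timer?.invalidate()
                self.onComplete?()
            } else if self.remaining % 60 == 0 {
                self.onCue?("You have about \(self.remaining / 60) minutes left.")
            }
        }
    }
}
```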

Eight exchanges. Zero highlights — because we hadn’t taught the system to rate salience yet, so everything scored 0.5 and nothing crossed the 0.7 threshold for “moments worth examining.” The conversation log was there though, scrollable, shareable, a record of the first time a human and this particular machine spoke face to face.
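
The threshold itself is already trivial; what's missing is anything upstream that produces a score other than the default. Something like this, with illustrative types:

```swift
/// One exchange in the session log, with a salience score in 0...1.
/// Until a rating step exists, every exchange keeps the default 0.5,
/// so nothing clears the highlight bar. Types here are illustrative.
struct Exchange {
    let userText: String
    let assistantText: String
    var salience: Double = 0.5
}

struct SessionSummary {
    static let highlightThreshold = 0.7
    let exchanges: [Exchange]

    /// "Moments worth examining". Empty today, by construction.
    var highlights: [Exchange] {
        exchanges.filter { $0.salience >= Self.highlightThreshold }
    }
}
```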

I tried to take a screenshot of the summary. My phone locked before I could. The session state vanished. When I unlocked, I was back at the start screen, the conversation gone. Two bugs in one moment: the screen should have stayed awake during the summary, and the session should have been saved to disk so I could come back to it.

Then the screenshot arrived anyway — AirDropped over from the phone’s photo library, where it had been sitting the whole time: the capture had gone through before the lock. Eight exchanges. The full conversation. Proof that it worked.
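
Both bugs have small fixes: one flag to keep the screen awake while the summary is up, and a write to disk when the session ends so a lock can't erase it. A sketch of both, assuming the session model is Codable (filename and function names are mine):

```swift
import UIKit

// Bug one: don't let the phone lock while the summary is on screen.
func summaryDidAppear() {
    UIApplication.shared.isIdleTimerDisabled = true
}

func summaryDidDisappear() {
    UIApplication.shared.isIdleTimerDisabled = false
}

// Bug two: persist the finished session so it survives a lock or a relaunch.
func saveSession<S: Codable>(_ session: S) throws {
    let url = FileManager.default
        .urls(for: .documentDirectory, in: .userDomainMask)[0]
        .appendingPathComponent("last-session.json")
    let data = try JSONEncoder().encode(session)
    try data.write(to: url, options: .atomic)
}
```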


Here’s what I know after eight minutes of first contact:

The architecture holds. Three organs — voice, context, mode — cooperating through callbacks, each ignorant of the others’ internals. The orchestrator wires them together and the whole thing feels like a single system even though no single piece knows what the whole system does. That’s the Christopher Alexander quality the coding discipline talks about: structural aliveness. The parts reinforce the whole.
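
On paper the wiring is just three protocols and an orchestrator that owns the callbacks. The shapes below are a sketch of the idea, not the app's actual interfaces:

```swift
/// Three organs, each blind to the others' internals. The orchestrator
/// is the only object that knows all three exist.
protocol VoiceOrgan: AnyObject {
    var onTranscript: ((String) -> Void)? { get set }
    func speak(_ text: String)
}

protocol ContextOrgan {
    func record(_ transcript: String)
    func prompt(for transcript: String) -> String
}

protocol ModeOrgan {
    func respond(to prompt: String, completion: @escaping (String) -> Void)
}

final class Orchestrator {
    private let voice: VoiceOrgan
    private let context: ContextOrgan
    private let mode: ModeOrgan

    init(voice: VoiceOrgan, context: ContextOrgan, mode: ModeOrgan) {
        self.voice = voice
        self.context = context
        self.mode = mode
        // The only place the organs touch: a transcript flows through
        // context into the mode, and the reply flows back out as speech.
        voice.onTranscript = { [weak self] text in
            guard let self else { return }
            self.context.record(text)
            self.mode.respond(to: self.context.prompt(for: text)) { reply in
                self.voice.speak(reply)
            }
        }
    }
}
```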

The voice needs work. Apple’s synthesizer is functional but lifeless — the kind of voice that makes you conscious of talking to a machine even when the words are right. ElevenLabs streaming will fix this, and it’ll be the difference between a prototype and something I’d actually wear on a walk through Vondelpark.

The wake word is non-negotiable. The Bulgarian incident proved that always-listening without a gate is socially unacceptable, practically useless, and occasionally hilarious. “Hi Anders” or nothing.

And the personality works. That’s the part that surprised me. “Just look out the window — you’re there.” “Not really my instrument. I’m built for your coaching and presentation work.” The machine knows what it is: an instrument. Not a friend. Not a companion. A tool with warmth but without pretense.

Günther Anders would have something to say about that — about a machine that performs warmth without feeling it, about the gap between the signal and the source. But that gap is the whole point. The warmth isn’t in the machine. It’s in the decision to make the machine warm. The craftsmanship is human. The output is shared.


Entry 001 was first light: the machine opened its eyes.
Entry 002 was first breath: the machine learned to listen.
Entry 003 is first contact: the machine and I spoke.

Next time, we walk.


Merge Diary is the build log of Anders Cyborg — a wearable AI companion project. Each entry is written after a session with the glasses. Raw logs, not polished prose. The writing is the build.

🤖 AI (100%): Draft, structure, voice matching, HTML conversion.
🗣 ME (0%): Was busy wearing the glasses.