
Yeah, unless the utility of these devices is large enough to override existing cultural norms, there are actually very few venues where it feels "comfortable" to voice-interact with a device.

I went through this exercise with GPT voice. It's an awesome capability, but other than perhaps walking outside or sitting in my office, there's no space where it feels "ok" to just spontaneously talk to something.

A grey area is when you have headphones in or on and it looks like you're in a phone conversation with somebody; then it kinda feels ok. But generally you're not going to take a phone call in a public area without distancing yourself from others.

There's a reason most casual communication these days is text rather than voice or video calls.



> it looks like you're in a phone conversation with somebody

Even though everyone's seen AirPods by now, on those rare occasions when I'm on the phone in public, I feel compelled to have my phone out and be vaguely talking at it, so it's clear I'm on a phone call and not a crazy person.

I'm curious whether we'd see similar usage with the pin, where voice commands in public are always performed with the hand held up for the projection screen (it will still prompt looks, but hopefully be clear in context: "oh, they're doing some tech thing").

Of course at this price point, it's highly dubious that we'll see anywhere near the ubiquitous market penetration of AirPods (which garner understandable complaints even at a sub-$200 price point, and that's with a clear value prop).


I don't mind earphones, but headsets are often entirely impractical, most notably in any sort of weather, wind, etc. A phone can also get rained on, but it's a bit easier to keep safe.

The other reason they're mostly impractical: keeping a charge. *Wired* headsets were great in this regard, but then there's the wire, and now there's the phone (that may not even support the wire?).


The weirdness is caused by the incantations all these things require. Once you can just talk to the AI without doing anything else, it'll catch on very easily.


"Siri, lights to half."

"Siri, lights to HALF."

"Siri, lights to HAAAAALF."

"Siri, LIGHTS TO FIFTY PERCENT!"


This fortunately is a solved problem. Or will be, once Amazon, Apple and Google get off their asses and plug a better voice recognition model into an LLM.

It's silly how OpenAI could blow all voice assistants out of the water today if they just added Android intents as function calls to the ChatGPT app. Yes, the "voice chat mode" is that good.


I know I'm getting close to Torment Nexus territory, but how do you get an LLM to run code as the response? Given that an LLM basically calculates the most probable text that follows a prompt, how do you go from that response to a function call that flips a light switch? It seems like you'd need some other ML/AI that takes the LLM output, figures out that it most likely means a certain call to an API, and then executes that call.

With Alexa I can program if/then statements: basically, when I say X, do Y. If something like ChatGPT requires the same thing, then I don't see the advantage.


> With Alexa I can program if/then statements: basically, when I say X, do Y. If something like ChatGPT requires the same thing, then I don't see the advantage.

Yes, I was thinking about even something as simple as if/then, which could be configured in the UI and manifest to GPT-4 as the usual function-call stuff.

The advantage here would be twofold:

1. GPT-4 won't need you to speak a weird command language; it's quite good at understanding regular talk and turning it into structured data. It will have no problem understanding things like "oh flip the lights in the living room and run some music, idk, maybe some Beatles", followed by "nah, too bright, tone it down a little", and reliably converting them into data you could feed to your if/else logic.

2. ChatGPT (the app) has a voice recognition model that, unlike Google Assistant, Siri and Alexa, does not suck. It's the first model I've experienced that can convert my casual speech into text with 95%+ accuracy even with lots of ambient noise.

Those are the features the ChatGPT app offers today. Right now, if they added a basic bidirectional Tasker integration (user-configurable "function calls" emitting structured data for Tasker, and the ability for Tasker to add messages into the chat), anyone could quickly DIY something 20x better than Google Assistant.
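To make point 1 concrete, here's a tiny sketch of what "structured data feeding your if/then logic" could look like, including a relative command. The call schema, function names and delta value are all made up for illustration; assume the model has already converted the two utterances into these calls:

```python
# Hypothetical home state and two "function calls" as a model might emit them
# for: "flip the lights in the living room" and then
# "nah, too bright, tone it down a little".
state = {"living room": {"on": False, "brightness": 100}}

def apply_call(call):
    args = call["arguments"]
    room = state[args["room"]]
    if call["name"] == "lights_on":
        room["on"] = True
    elif call["name"] == "adjust_brightness":
        # relative change, clamped to the 0..100 range
        room["brightness"] = max(0, min(100, room["brightness"] + args["delta"]))

apply_call({"name": "lights_on", "arguments": {"room": "living room"}})
apply_call({"name": "adjust_brightness",
            "arguments": {"room": "living room", "delta": -30}})
print(state["living room"])  # {'on': True, 'brightness': 70}
```

The interesting part is the second call: "tone it down a little" has no absolute value, so the model only needs to emit a relative delta and the dumb if/then side handles the state.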


At some point you've got to get from language to action, yes. In my case, I use the LLM as a multi-stage classifier, mapping from a set of high-level areas of capability down to more focused mappings onto specific systems and capabilities. The first layer of classification might say something like "this interaction was about <environmental control>", where <environmental control> is one of a finite set of possible systems. The next layer might say something like "this is about <lighting>", and the layer after that may have enough information to interrogate with a sufficiently specific prompt, which may itself be generated from a capability definition. For example, "determine any physical location, an action, and any inputs regarding colour or brightness from the following input" can be generated from the possible inputs of the capability you think you're addressing.
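The layering described above can be sketched roughly like this, with the actual model calls stubbed out. Everything here (the system names, the canned answers, the extracted fields) is invented for illustration; a real `classify` would prompt an LLM to pick one option from `choices`:

```python
# Stubbed multi-stage classifier: each layer narrows the space before a
# capability-specific extraction prompt runs.
def classify(utterance, choices):
    # Stub: canned answers for the running example
    # "dim the kitchen lights a bit". A real version calls a model.
    canned = {
        ("environmental control", "media"): "environmental control",
        ("lighting", "heating"): "lighting",
    }
    return canned[tuple(choices)]

def extract(utterance, inputs):
    # Stub for the final, capability-specific prompt ("determine any
    # physical location, an action, and any brightness input ...").
    return {"location": "kitchen", "action": "dim", "brightness": "-20%"}

SYSTEMS = {"environmental control": ["lighting", "heating"],
           "media": ["music", "tv"]}

def route(utterance):
    system = classify(utterance, ["environmental control", "media"])  # layer 1
    capability = classify(utterance, SYSTEMS[system])                 # layer 2
    # layer 3: extraction prompt generated from the capability's declared inputs
    return system, capability, extract(utterance, ["location", "action", "brightness"])

print(route("dim the kitchen lights a bit"))
```

The appeal of this shape is exactly what the comment says: adding a new capability means adding an entry and its declared inputs, not writing new parsing code.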

Of course this isn't foolproof, and there's still work to do defining the capabilities of systems, etc. (although these are tasks AI can assist with). But it's promising: "teaching" the system to do new things is relatively simple, and effectively akin to describing capabilities rather than programming directly.


> If something like ChatGPT requires the same thing, then I don't see the advantage.

So LLMs today can do this a few ways. One: they can write and execute code. You can ask for some complex math (e.g. calculate the tip for this bill), and the LLM can respond with a Python program for that math; the wrapping program then executes it and returns the result. You can scale this up a bit; use your creativity at the possibilities (e.g. SQL queries, one-off UIs, etc.).
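A toy version of that write-then-execute loop, for the tip example. The model reply is hard-coded here, and I'm assuming it was prompted to return only Python that leaves its answer in `result`; a real wrapper must sandbox this, since `exec()` on untrusted model output is dangerous:

```python
# Pretend this string came back from the model for
# "calculate an 18% tip on an $84.50 bill".
llm_response = (
    "bill = 84.50\n"
    "tip = round(bill * 0.18, 2)\n"
    "result = tip\n"
)

def run_llm_code(code):
    scope = {}
    exec(code, scope)  # sandbox this in anything real
    return scope["result"]

print(run_llm_code(llm_response))  # 15.21
```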

You can also use an LLM to "craft a call to an API from <api library>". Today, Alexa basically works by calling APIs: you get a weather API, a timer API, etc., and make them all conform to the Alexa standard. An LLM can one-up it by using any existing API unchanged, as long as there's adequate documentation somewhere for the LLM.

An LLM won’t revolutionize Alexa type use cases, but it will give it a way to reach the “long tail” of APIs and data retrieval. LLMs are pretty novel for the “write custom code to solve this unique problem” use case.


Yup, from where I see it, the only things holding LLMs back from generating API calls on the fly in a voice-chat scenario are probably latency (and, to a lesser degree, malformed output).


Yeah, the latency is absolutely killing a lot of this. Alexa's first-party APIs are of course tuned and reside in the same datacenter, so it's fast, but a west-coast US LLM trying to control a Philips Hue will discover its calls are crossing the Atlantic, which could rival the LLM itself for slowness.

> and to a lesser degree malformed output

What's cool is that this isn't a huge issue. Most LLMs now have "grammar" controls, where the model doesn't pick just any token as the next one; it picks the highest-probability token that conforms to the grammar. This dramatically helps with producing well-formed JSON (or XML, or ...) output.


Disagree. The extra latency of adding an LLM to a voice pipeline is not that much compared to doing voice via the cloud in the first place. Improved accuracy and handling of natural-language queries would be worth it relative to the barely-working "assistants" that people only ever use to set timers, and which can't even handle that correctly half the time.


Check out "LLM tool use".

The basic idea is to instruct the LLM to output some kind of signal in text (often a JSON blob) that describes what it should do, then have a normal program use that JSON to execute some function.
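A minimal version of that pattern, with the reply text and tool name made up for illustration: the model's reply carries a JSON blob naming a tool, and a plain program parses it and runs the matching function.

```python
import json
import re

# Pretend this came back from the model.
llm_reply = 'Sure, setting that now. {"tool": "set_timer", "args": {"minutes": 10}}'

def set_timer(minutes):
    return f"timer set for {minutes} min"

# Registry mapping tool names the model may emit to real functions.
TOOLS = {"set_timer": set_timer}

def handle(reply):
    blob = re.search(r"\{.*\}", reply, re.DOTALL).group(0)  # pull out the JSON
    call = json.loads(blob)
    return TOOLS[call["tool"]](**call["args"])

print(handle(llm_reply))  # timer set for 10 min
```

Production APIs return the tool call as a structured field rather than free text, but the dispatch side looks essentially like this.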


Google's version could have flawless voice recognition backed by AGI. Within a couple of years, it would decay and fail randomly at setting timers.


Comparing ChatGPT's voice bot with Pi's, ChatGPT's lacks Pi's personality and zing. Pi is completely free, and I've been using it since the beginning of October. For ChatGPT's I have to pay $20, and it's a lesser voice bot (the personality / tone of voice is more monotone).


It's staggering to me that Apple has not improved the UI for "try again" or "keep trying", whether the fault is with Siri itself or just network conditions. It seems like (relatively) low-hanging fruit compared to the challenge of improving the engine. (I don't use any other voice assistants; no idea how well they do here.)


For iOS, there's nothing more frustrating than dictating a long note only to have it come back with "try again".


Feels like there needs to be more frequent feedback about what Siri is doing in cases like that instead of treating the whole input as a single unit.


If I want to ask ChatGPT about something I will, and the speech-to-text is a lot faster than typing on my phone. There's no voice incantation needed, just a button press, but people still raise their eyebrows and make me feel self-conscious. I wish I could subvocalize to it, like I remember reading about in the Artemis Fowl books.



