Most likely, an AI model that detects specific sounds is running directly on those devices. No need to transfer anything to Google's servers for analysis =)
From what I have heard, you don't need any fancy AI models for detecting either of these sounds. Some of the older alarms were using classic signal processing to do this decades ago.
Similarly, glass-break detectors for home security systems have done this for years.
The advantage of an ML model is that you can do multi-class prediction for a sound clip of somewhat arbitrary complexity, and the cost of execution stays more or less the same even if you add an extra class or two - it's just an extra row in your output prediction vector. By complexity I mean signals that don't have an obviously characteristic spectrum the way glass smashing does. A typical CNN backbone is capable of classifying hundreds if not thousands of classes with high accuracy - even the edge architectures. Always-on detection tends to use very compact networks (kB in size) that will run on low-power ARM cores, or even specialist ASICs, but even so 20-30 common audio event types seem very feasible.
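To make the "extra row" point concrete, here's a minimal sketch of a compact audio-event CNN in PyTorch. The layer sizes, input shape, and class count are illustrative assumptions, not anything from an actual product; the point is that adding a class only grows the final linear layer by one row of weights.

```python
import torch
import torch.nn as nn

class TinyAudioCNN(nn.Module):
    """Compact CNN over log-mel spectrograms (e.g. 64 mel bins x 96 frames)."""
    def __init__(self, num_classes: int):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 8, 3, stride=2, padding=1), nn.ReLU(),   # 64x96 -> 32x48
            nn.Conv2d(8, 16, 3, stride=2, padding=1), nn.ReLU(),  # 32x48 -> 16x24
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(), # 16x24 -> 8x12
            nn.AdaptiveAvgPool2d(1),                               # global pool -> 32 features
        )
        # One extra class = one extra row here (32 weights + 1 bias):
        self.classifier = nn.Linear(32, num_classes)

    def forward(self, x):  # x: (batch, 1, mel_bins, frames)
        return self.classifier(self.features(x).flatten(1))

model = TinyAudioCNN(num_classes=20)  # smoke alarm, glass break, dog bark, ...
print(sum(p.numel() for p in model.parameters()))  # ~6.5k params, a few kB quantized
```

Even this toy network lands in the "kB in size" range once quantized to int8, which is roughly the regime where always-on edge models live.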
For people worrying about sending data back, there's no reason why you'd do this off-device. The exception might be that you feed back to Google that there was a false alarm, so they can use your sound clip as a negative training example. Just a guess there, but Tesla does this extensively for Autopilot - they deploy models in your car and specifically ask them to capture images of rare events (Andrej Karpathy gave an example of tree-occluded stop signs).
Well, they might do it with AI, but there are some issues: constantly running sound through ML models is power hungry, which is, as far as I know, why devices like these optimize heavily for wake phrases such as "OK Google" - that's when the ML loop actually starts.
On the other hand, a smoke alarm sound is very easy to detect with classical methods, so it might be extremely cheap to detect without any ML.
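For the classical route, something like the Goertzel algorithm gives you the energy at a single frequency for a few multiply-adds per sample, no FFT needed. A rough sketch, assuming the roughly 3 kHz "T3" beep pattern that US smoke alarms use; the threshold and frame handling are placeholders:

```python
import math

def goertzel_power(samples, sample_rate, target_hz):
    """Single-bin DFT power at target_hz via the Goertzel algorithm.
    O(N) multiplies and two state variables - trivially cheap on an MCU."""
    n = len(samples)
    k = round(n * target_hz / sample_rate)
    w = 2.0 * math.pi * k / n
    coeff = 2.0 * math.cos(w)
    s_prev, s_prev2 = 0.0, 0.0
    for x in samples:
        s = x + coeff * s_prev - s_prev2
        s_prev2, s_prev = s_prev, s
    return s_prev2**2 + s_prev**2 - coeff * s_prev * s_prev2

# Hypothetical detector: US smoke alarms beep around 3 kHz in a "T3"
# cadence (three short beeps, then a pause). Flag frames where the
# 3 kHz bin dominates, then confirm the on/off timing matches T3.
def frame_is_beep(frame, sample_rate, threshold=1e6):  # threshold is a placeholder
    return goertzel_power(frame, sample_rate, 3000.0) > threshold
```

A fixed tone plus a fixed cadence is about the easiest target classical DSP can ask for, which is presumably why the old alarms managed it decades ago.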
> constantly running sound through ML models is power hungry, which is, as far as I know, why devices like these optimize heavily for wake phrases such as "OK Google" - that's when the ML loop actually starts.
How power hungry is it actually? I'd only have to run the big model if a pretty dumb model thinks there's even a voice in the audio stream.
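A minimal sketch of that cascade idea; the RMS gate, energy_threshold, and expensive_model are all hypothetical placeholders, not anyone's actual pipeline:

```python
import numpy as np

def rms_energy(frame: np.ndarray) -> float:
    """Near-free stage-1 check: root-mean-square amplitude of the frame."""
    return float(np.sqrt(np.mean(frame ** 2)))

def cascade_detect(frames, expensive_model, energy_threshold=0.01):
    """Two-stage cascade: a cheap energy gate wakes the expensive
    classifier only for frames that plausibly contain activity."""
    for frame in frames:
        if rms_energy(frame) < energy_threshold:
            continue                      # stage 1: skip quiet frames, ~free
        yield expensive_model(frame)      # stage 2: run the big model rarely
```

Since most of the audio stream in a home is silence or background hum, the expensive stage ends up running only a small fraction of the time, which is essentially what wake-word pipelines exploit.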