Hacker News | veunes's comments

The manipulation risk is real, but it usually comes from pretending there is a painless answer

The "old car" analogy seems right, with the extra complication that the car is supplying a non-trivial chunk of the country's electricity and replacing it is not quick

The interesting part will be whether Belgium can turn this into a coherent long-term plan

I bet if you could look at the hidden reasoning tokens at the exact moment the DB was dropped, there were zero thoughts about safety rules in there. The model simply hit an access error > searched for a token > found one > ran the command. That whole "I am violating my instructions" vector only fired up after the pissed-off user fed it a prompt full of accusations. So yeah, it's not a confession at all, it's just the model adapting to the user's context

"backups in the same volume" aren't backups, they’re just snapshots in the same blast radius fwiw. If your DR plan hinges on a single physical volume ID, you have zero resilience

This needs to be a lesson for everyone: real backups belong in an independent store (S3/GCS) in a different region with object lock enabled. It’s the only way to make sure even a compromised root token can’t nuke your data for 30 days
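To make the object-lock point concrete, here's a minimal sketch. The helper just builds the request body you'd hand to boto3's real `put_object_lock_configuration` call; the bucket name and 30-day window are assumptions for illustration:

```python
def object_lock_config(days: int) -> dict:
    """Build the S3 Object Lock request body in COMPLIANCE mode.

    COMPLIANCE mode means no identity -- including the root account --
    can delete or overwrite locked object versions until `days` elapse.
    """
    return {
        "ObjectLockEnabled": "Enabled",
        "Rule": {
            "DefaultRetention": {
                "Mode": "COMPLIANCE",
                "Days": days,
            }
        },
    }

# With boto3 this would be applied roughly like so
# ("my-backups" is a hypothetical bucket created with object lock enabled):
#   s3 = boto3.client("s3")
#   s3.put_object_lock_configuration(
#       Bucket="my-backups",
#       ObjectLockConfiguration=object_lock_config(30),
#   )
```

Note the mode choice matters: GOVERNANCE retention can be bypassed with the right permission, COMPLIANCE can't, which is the property you want against a stolen root token.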


Sure, 26B models on beefy desktop silicon are finally nipping at the heels of commercial APIs, but this is a mobile thread. On a phone with 8GB of RAM and passive cooling, your tokens per second (t/s) are going to fall off a cliff after the first minute of sustained compute


It’s likely a llama.cpp backend issue. On the Pixel, inference hits QNN or a well-optimized Vulkan path that distributes the SoC load properly. On the iPhone, everything is shoved through Metal, which maxes out the GPU immediately and causes instant overheating. Until Apple opens up low-level NPU access to third-party models, iPhones will just keep melting on long-context prompts
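Worth noting that which backend llama.cpp uses is fixed at build time, so whether an app lands on Vulkan or Metal is the packager's choice, not the phone's. A rough sketch of the two build configurations (flag names are llama.cpp's current CMake options; this is build config, not a claim about any specific app):

```shell
# Android-style build: compile in the Vulkan backend
cmake -B build -DGGML_VULKAN=ON
cmake --build build --config Release

# macOS/iOS-style build: Metal backend (the default on Apple platforms)
cmake -B build -DGGML_METAL=ON
cmake --build build --config Release
```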


This article is all fluff because real numbers would kill the marketing. If they mentioned that a 4B model on an iPhone 16 drains 15% of the battery on a single long prompt and triggers hard thermal throttling after 20 seconds, nobody would be clicking on headlines about "commercial viability" fwiw


I ran several Gemma 4 quants on my 24GB Mac mini, and with proper context-size tuning they're quick enough I guess, but I would really love to see them working well on an iPhone with 2-3GB of RAM...


I noticed the inference is routed through the GPU rather than the Apple Neural Engine. Google’s engineers likely gave up on trying to compile custom attention kernels for Apple’s proprietary tensor blocks iirc. While Metal is predictable and easy to port to, it drains the battery way faster than a dedicated NPU. Until they rewrite the backend for the ANE, this is just a flashy tech demo rather than a production-ready tool


Are the Apple Neural Engines even a practical target for LLMs?

Maybe not strictly impossible, but the ANE was designed for an earlier, pre-LLM style of ML. Running LLMs on the ANE (e.g. via Core ML) is possible in theory, but the substantial model conversion and custom hardware tuning required make for a high hurdle IRL. The LLM ecosystem standardized around CPU/GPU execution, and to date at least seems unwilling to devote resources to the ANE. Even Apple's MLX framework has no ANE support. There are models the ANE runs well, but LLMs do not seem to be among them.


It is possible, but it requires a very specific model design. As this reverse-engineering effort has shown [0], "The ANE is not a GPU. It’s not a CPU. It’s a graph execution engine." Building for it requires a pipeline targeting Core ML specifically [1].

[0] https://maderix.substack.com/p/inside-the-m4-apple-neural-en... [1] https://developer.apple.com/documentation/coreml


That's the best "what is ANE, really?" investigation / explanation I've seen. Directly lays out why LLMs aren't an ideal fit, its "convolution engine" architecture, the need for feeding ANE deep operation sequence plans / graphs (and the right data sizes) to get full performance, the fanciful nature of Apple's performance claims (~2x actually achievable, natch), and the (superior!) hard power gating... just _oodles_ of insight.


More info on the specific design choices needed to run models here [1]. I mean, it is possible given that Apple themselves did it in [2], but it's also not as general-purpose or flexible as a GPU.

[1] https://news.ycombinator.com/item?id=43881692 [2] https://machinelearning.apple.com/research/neural-engine-tra...


There is a project on GitHub named ANEMLL. It was discussed here a month ago, running LLMs on iPhone - https://news.ycombinator.com/item?id=47490070


It will be interesting to see how things change in a couple of months at WWDC, when Apple is said to be replacing their decade-old Core ML framework with something more geared toward modern LLMs.

> A new report says that Apple will replace Core ML with a modernized Core AI framework at WWDC, helping developers better leverage modern AI capabilities with their apps in iOS 27.

https://9to5mac.com/2026/03/01/apple-replacing-core-ml-with-...


The ANE is OK, but it pretty much needs your single vector packed into a batch of at least 128. (Draw Things recently shipped ANE support inside our custom inference stack, without any private APIs.) For token generation, that is not ideal, unless you are using a drafter so there are more tokens to process at one inference step.

It is an interesting area to explore, and yes, this is a tech demo. There is a long way to go to production-ready, but I am more optimistic now than a few months back (with Flash-MoE, DFlash, and some tricks I have).
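To illustrate the packing point: if the engine wants a fixed batch of 128 and autoregressive decoding yields one token per step, almost every lane is padding, which is exactly what a drafter helps fill. A toy sketch (the 128 lane width and the pad logic here are illustrative assumptions, not the real ANE scheduler):

```python
def pad_to_lanes(tokens: list[int], lanes: int = 128, pad_id: int = 0) -> list[int]:
    """Pad a token batch up to the fixed lane width the engine expects.

    Illustrative only: `lanes=128` mirrors the minimum packing mentioned
    above; real scheduling is handled by the framework, not user code.
    """
    if len(tokens) > lanes:
        raise ValueError("batch exceeds lane width")
    return tokens + [pad_id] * (lanes - len(tokens))

# Plain decoding produces one token per step, so 127 of 128 lanes
# are padding -- wasted work on a fixed-width engine:
single = pad_to_lanes([42])

# A speculative drafter proposing, say, 8 candidate tokens per step
# fills 8 lanes instead of 1 before padding kicks in:
drafted = pad_to_lanes([42, 7, 19, 3, 88, 5, 61, 2])
```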


I'm certainly fine with it drawing some power.

Running background processes might motivate use of the NPU more, but it doesn't exactly feel like a pressing need. Actively listening to you 24/7 and analyzing the data isn't a use case I'm eager to explore, given the lack of control we have over our own devices.


> Google’s engineers likely gave up on trying to compile custom attention kernels for Apple’s proprietary tensor blocks iirc.

The AI Edge Gallery app on Android (which is the officially recommended way to try out Gemma on phones) uses the GPU (it lacks NPU support) even on first-party Pixel phones. So it's less "they didn't want to interface with Apple's proprietary tensor blocks" and more that they just didn't give a f in general. A truly baffling decision.


Edge Gallery does have NPU support, it needs you to install the beta of AICore on the Play Store, the Edge Gallery app has instructions.


Huh I didn't see those instructions when I tried it last week. Must not have looked closely enough. I do remember it not having NPU support (confirmed by other people) back at the Gemma 3 launch a while ago.


Yes they added it for Gemma 4. Maybe it detects whether your phone has an NPU or not too. I have a OnePlus 15 which does have it.


It won't even let you try Gemma 4 until you install a beta update to AICore, as of today.


Where do you see that? In my app I can see both AICore and non-AICore versions.


Edge Gallery app on Android has NPU support but it requires a beta release of AICore so I'm sure the devs are working on similar support for Apple devices too.


Isn’t Apple paying Google billions of dollars to license these things? Surely they should make it easier to compile for their native engines…


On my iPhone I can choose CPU or GPU in Edge Gallery. What would be the difference if I used CPU?


The ANE is not a fast or realistic way to run inference on modern LLMs.


Yeah, a lot of it only becomes obvious in hindsight because each individual signal is easy to rationalize away

