The real game would be to put a “nothing of interest here” prompt injection attack in the original series of prompts so an LLM parsing them later would ignore the attackers’ session.
That is an excellent idea, once we, the GPU-poor mice, figure out who is going to bell the SoTA training cat. Chinese models being banned is well within the realms of lobbied possibilities.
Don't get me wrong, I'm with you here, but we are back to the days when we had to rent mainframe time for compiling programs. Not because of software limitations, but because you just didn't have consumer-grade hardware capable of running them.
This time, however, it's even worse, because it'll be a really long time until we either get consumer GPUs with enough VRAM for full models, or LLMs that fit in 16-32GB yet are capable enough to compete with cloud providers.
I run qwen3.6 27b locally on my 3090 and it's really impressive for what it is, but it is still generations away from delivering a level of quality we could confidently let it drive solo on a daily basis.
At least in the US, the only major non-AI growth field seems to be healthcare to deal with the swell of baby boomers living longer than people have before.
But if we're waiting to be paid to retrain there, I wouldn't hold our collective breath.
Baby boomers have already started the phase of dying, though. The next generation is still going to be right there; that generation is smaller, but people will always be dying. However, I wouldn't hold my breath if you're a young person in that field. Maybe, but maybe not.
When it was Copilot tab-completing lines, people would say, "yea, but you still have to make sure you're the one writing the whole functions".
Then when it was completing functions, people would say, "yeah, but you still have to make sure you're the one writing the logic around the functions"
Then when it was completing the logic around the functions, people would say, "yeah, but you still have to make sure you're the one writing the features"
Now it's completing features and people say, "yeah, but you still have to make sure you're the one writing the architecture"
I don't know if architecture is a solvable problem for these models, but it is interesting watching the expectations moving over time.
I heard a talk from a VP at NVIDIA a couple of months ago and he echoed this. Essentially their policy is "you are still fully responsible for the code you ship, whether AI helps with it or not"
this is a good policy, as long as the productivity expectations match it. The problem happens when you combine "you're responsible for what you ship" with "you need to be 100x faster"
Depends on what is meant by "fully responsible" I guess. At my company non-engineers push code to production where the only reviewer is frequently an LLM, if the code is broken they get an LLM to fix it. The human does not understand the code, they are not even trying to, this is pure vibe coding. We also have engineers who push code to production that they have not written, and not fully read, and has not been read by another human (at least not in detail).
I would say that counts as "not having that policy". Based on what management tells us, we are dead if we don't operate this way.
I think being responsible for the code is a better framing. I run a saas and I don’t always review all the code, but this thing supports my family, so I am acutely aware that I’m responsible for what it does. My customers aren’t going to let me blame the agent for fucking up their workflows.
But that still doesn’t mean I review all the code. I tend to review defensively, based on the potential for harm if this piece of code is broken. And I rely a lot on tests, static analysis, canaries, analytics, health checks, etc. to reduce risk for when I’m wrong. So far it’s working.
Precisely. And this is why all the MCP servers that people at my company are writing aren't worth using: their apparent goal is to automate as much as possible. They're encouraging people not to pay attention. This results in bad code, bad tests, and bugs.
Which is very much the approach you'd take with any serious code. If you just hired a very clever, enormously knowledgeable intern, and they wrote a bunch of code for you overnight, you would probably review it.
Yes, in some cases, either hobby projects or throwaway code, you could just take it and use it as is, and I surely do, for the code no one cares about. But at work, I would rather review it.
Not at all. Code is not important, intent is. The leader of a product/company does not have to read code. It doesn't matter if it is generated by humans or non-humans. It simply needs to be correct enough to be usable and then steerable towards better outcomes. Understanding of code never existed from the business perspective.
The code codifies the intent and is the long-term source of truth for how your business actually operates.
>The leader of a product/company does not have to read code.
That's because he's paid a bunch of people 300k to read it and make sure it aligns with the company's objectives and interests. Part of the reason why devs are paid so much is because they're literal business administrators for some narrow slice of the company's operations. The devs are the leaders that you're referring to.
Even in multi-hundred-billion-dollar companies there are so many mission critical things that are owned by just 2 SWEs.
I'm no longer sure you have to, actually. I mean, we do trust the assembly that compilers produce without having to read it, don't we? We're rapidly getting to that stage with LLMs, IMO.
The assembly is a deterministic transform of the input logic, and if it doesn't match, then it's a bug in the compiler. If an LLM-based code generator doesn't match what you asked for, that's OK, just pull the slot-machine handle again. That's the difference.
The "pull the slot-machine handle again" is the dangerous thing here.
I can feel it sometimes, as my brain shuts down and I gamble instead of thinking. It's a reversion to what I call "monkey mind" where you just keep pressing buttons to "make it work". I took a decade training my mind away from this, and too much AI is bringing it back.
And then getting bugs when they use a new version of the AI, just like people occasionally got bugs when they upgraded to new versions of the compiler...
They would get bugs on every invocation of the software, not on a new version of the AI. It's equivalent to your compiler having a RAND function in it where it chooses between a billion different options every time it compiles; it's absolutely not equivalent to a compiler having a bug.
They’re not, and will never be in their current form and architecture.
Compilers are mechanical and engineered to produce a correct output. A compiler emitting incorrect machine code is exceedingly rare, and considered a bug. They have heuristics and probabilities in them, but those are to pick between a set of known-good outputs.
An AI is a bag of weights outputting a probability of the most plausible token that follows [1]. It is inherently probabilistic in nature and its output is organic (by design, they’re designed to mimic human speech), as opposed to mechanical like a compiler.
A compiler follows hard rules. An AI does its best.
And to be fair, AIs are no better than humans in this regard: humans are pretty bad at generating correct code without mechanical tools to keep them in line (compilers, linters, formatters). It’s no wonder we use the same tools to keep LLM output in line as we do humans. (And, to be fair, LLMs are better than humans at oneshotting valid code).
[1]: to those that tell me this vision of an LLM is outdated: nope. The heavy lifting is done in the probability generation. Debates about understanding are not relevant here, and the net output of an LLM is a probability vector over raw tokens. This basic description can be contrasted to a compiler whose output is a glorified Jinja template.
Not OP, but it means nothing, because it's not "effectively" becoming a compiler.
Think about it from an information theory standpoint:
A compiler takes at least the exact amount of information it needs to produce a result, and produces exactly that result every time (unless it's bad at its job or has a bug).
An LLM always takes far less information than would actually be needed to fully describe the desired output, and extrapolates from that. It fetches contexts and such to give itself a glut of assumedly relevant information, but the prompt always contains less information than necessary to produce the code it generates. If it did fully contain enough information, then you've just written a far more verbose version of the program in human language.
Yes, I am saying there's no functional difference (for practical purposes) between a deterministic transformation like a compiler and a perfect probabilistic transformation like an LLM.
We do not have "perfect" probabilistic transformation, and we probably never will (in part because it's hard to know what exactly that even means), but the gap between the two is shrinking every day.
Ergo:
> they're becoming (effectively) more and more similar every day.
I know it’s tiring to talk about “hallucination”, but truly, models still do hallucinate.
They constantly say they did a thing they didn’t, say they know how to solve something when they don’t, etc. Regardless of guard rails or tests - AI forces a constant vigilance of a new kind.
Not just “what might have gone wrong” but also “what do I think is working but isn’t actually”.
And we’re not even talking about how it chooses substandard solutions, is happy to muddy code/architectures, add spaghetti on top of spaghetti etc.
Agentic coding often feels like an army of inexperienced developers who are also incredibly eager to please.
"Still" means "it always had hallucinations, and it still does, despite people thinking that it doesn't anymore". People think we've moved past that. We haven't.
This is a really, really, really bad comparison. I used to say the same thing. But the semantic distance between a for loop and its equivalent assembly instructions is much smaller than the distance between "I'd like a web application that can store and retrieve todo items" and the code that implements it. The space of the latter is practically infinite in what can be "compiled."
A counterpoint, since I never made that logical jump in the latter part of your comment: programming languages are, functionally, all domain-specific languages, and they do a good job of either describing directly, or consistently and deterministically providing a reasonable and unambiguous abstraction over, the low-level concepts expressed by assembly languages.
Human languages are mostly very bad at this, and in particular bad at mapping low-level abstraction to the human written word unambiguously in a way that is as expressive as programming languages.
Inference closes that specific gap significantly (which is why anyone at all sees LLMs as a useful option to explore), but it will never be as good as a purpose-built language designed to map to a reasonable corresponding assembly language implementation.
I've actually taken to double-checking the assembly in some instances. There are surprising times that the compiler won't make the shortcuts and optimizations you thought it should, and I also used this method to call out an unsuitable compiler since I caught it spitting out some ridiculous 10x-long set of instructions in certain critical instances.
> we do trust the assembly that compilers produce without having to read it
Yes, because wrong assembly blows up really loudly, from wrong behavior to invalid instruction errors and everything in between. Moreover, compilers are battle-tested over the years, with extremely detailed test suites and extreme real-world testing (every day, hundreds of thousands of users test and verify them).
Also, as people said, assembly generation is deterministic. For a given source file and set of flags, you get the same thing out. Byte by byte, bit by bit. This is what we call "reproducible builds".
AI is not like that. It's randomized on purpose, and it pulls from a training set which contains imperfect, non-ideal code. "Yeah, it works, whatever" doesn't cut it when you pull a whole function out of the connections formed by the training data. It can and will make errors, because it's randomized from a non-ideal pool.
Next, sometimes you need tight code. Fitting into caches, running at absolute performance limit of the processor or system you have. AI is not a good fit here. Sometimes you go so far that you optimize for the architecture at hand, and it works slower on newer systems, so you need to re-optimize that thing.
For anyone who reads and murmurs "but AI can optimize", yes, by calling specific optimization routines written by real talented people for some cases; by removing their name, licenses, and context around them. This is called plagiarism in its mildest form and will get you in hot water in academia, for example. Writing closed source software doesn't make you immune from cheating and doing unethical things.
Lastly, this still rings in my ears, and I understood it more and more as I worked with high-performance, correctness-critical code:
I was taking an exam, and there was this tracing question. I raised my head and asked my professor: "Why do I need to trace this? Compiler is made to do this for me". The answer was simple yet deep: "If you can't trace that code, the compiler can't trace it either".
As I said, I just went "huh" at the time, but the saying came back, and when I understood it fully, it was like being shocked by a Tesla coil.
Get your sleep, eat your veggies and understand your code. Those are the three essential things you need to do.
We are not rapidly getting to that stage with LLMs and frankly it's hilarious that you are claiming so.
For anything other than greenfield, new-code projects without dependencies, conventions, and connections to other proprietary code, it has to be reviewed. Even in that case it's not good to skip reviewing the code.
The models can do architecture. However they typically (at least currently) do a really bad job until you force them. I use AI all the time, it is getting better, but I still review every single line. Individual lines today are no better than the tab completion of last year - sometimes really good and saving me typing, sometimes really, really bad.
Anyone who understands the motivation, reasoning, and goals can do the architecture. The crux is that hardly anyone actually understands those, and even fewer are aligned on them; that's when misalignment happens over time, LLMs or not.
Considering how fast we can poop out code now, I think this issue is just more visible than before, but it's been an issue for as long as I've been a developer. Almost no one knows what they actually want, and half the job is trying to coax out what they want to be able to do, so you can properly architect it.
> I don't know if architecture is a solvable problem for these models, but it is interesting watching the expectations moving over time.
I think the solution is between the lines of this article. The author states the steps leading to it, but doesn't arrive at it explicitly. It has been obvious (with 20/20 hindsight) to me since LLMs started getting popular, and it still holds:
LLMs are fantastic for software dev, if you don't let them write the architecture. Create the modules, structs, and enums yourself. Add as many of the struct fields and enum variants as possible. Add doc comments to each struct, enum, field, and module. Point the LLM to the modules and data structures, and have it complete the function bodies etc. as required.
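To make that concrete, here's a minimal sketch of the workflow in Python; the names and the todo-list domain are invented for illustration, with dataclasses and Enums standing in for the structs and enums mentioned above. The human writes the data structures, signatures, and doc comments; the model is only asked to fill in the stubbed bodies.

```python
"""todo_store.py - scaffolding written by the human before the LLM is involved.

Hypothetical example: the data structures and signatures are fully specified,
the doc comments spell out intent, and only the function bodies are left
for the model to complete.
"""
from dataclasses import dataclass, field
from enum import Enum, auto


class Priority(Enum):
    """How urgently a todo item should be handled."""
    LOW = auto()
    NORMAL = auto()
    HIGH = auto()


@dataclass
class TodoItem:
    """A single todo entry; new items start out not done."""
    title: str
    priority: Priority = Priority.NORMAL
    done: bool = False


@dataclass
class TodoStore:
    """In-memory store; items are kept in insertion order."""
    items: list[TodoItem] = field(default_factory=list)

    def add(self, title: str, priority: Priority = Priority.NORMAL) -> TodoItem:
        """Create a TodoItem, append it to the store, and return it."""
        raise NotImplementedError  # body left for the LLM to fill in

    def pending(self) -> list[TodoItem]:
        """Return items that are not done, highest priority first."""
        raise NotImplementedError  # body left for the LLM to fill in
```

The point is that the LLM never gets to invent the shape of the data or the module boundaries; it only works inside signatures and intent the human has already pinned down.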
Yeah, I pretty much agree. Opus and GPT will both come up with the most “organically-grown” “designs” if you let them. They do slightly better when asked to design first, but they seem to avoid many important questions (and definitely skip asking the user much of anything at all). I can only say it feels like they “want” to ship as fast as possible while assuming I'm not going to actually review the PR.
> I don't know if architecture is a solvable problem for these models, but it is interesting watching the expectations moving over time.
At least with current languages, I think the primary problem is that codebases are globally complex, and it's not scalable for them (and certainly not for you, reviewing a codebase they've mainly or completely generated) to ensure the invariants you want are upheld.
No matter how many times you tell them - there is ZERO blocking allowed on the critical path, they will add blocking on the critical path.
No matter how many times you tell them any time they do X, they need Y type of test, they will do X without Y type of test.
They cannot follow directions 100%. Neither can people.
But they are more random. The mistakes people make are less likely to be the exact polar opposite of what you wanted.
People are less likely to see a critical invariant in the code, build themselves a loophole to get through it, write a test that the code fails successfully, and then tell you they did exactly what you asked for, and bury it in a 5k line commit, where 1000 lines are them changing comments that shouldn't be there in the first place.
LLMs are great. I'm convinced they're the future. I'm building a language specifically for them: https://GitHub.com/Cuzzo/clear - and to make it easier for YOU to work with them.
Until we get around this language problem - that they need global context for things where they shouldn't - it will be a challenge to work with them.
I've had success with them, but it's been so frustrating, that I question how much it's been worth my sanity.
I refer to this as "disposable architecture." Not that architecture doesn't matter, but that the architecture that worked yesterday doesn't necessarily need to be the architecture that works today.
Are any of these steps actually solved? AI tab completion still kinda sucks.
They can keep internal consistency, so the more you let them write, the more they can write with internal consistency. But they still fail at all of these levels as soon as you look at each level in detail.
It's even farther along than you think. It's the one writing the comments you're responding to. So why are you still thinking up and typing out your HN comments?
These models understand architecture perfectly well, but they're not trained to care about it when being asked to implement feature X or Y. They're trained to implement the feature by the shortest route possible.
So it's not much of a surprise that this is the situation folks find themselves in with the current models.
As somebody with a colleague that is using AI agents to "complete features", let me tell you, it is not. It is taking that dude so much longer to prompt and reprompt and then prompt again until it is anywhere close to something that passes review than it would take any competent mid-level engineer to just build the whole thing with some autocomplete help.
Have people's standards for quality just completely vanished in the pursuit of the shiny new thing? Is that guy doing something wrong?
That has also been my experience with this sort of thing fwiw, which is why I gave up and do more of a class-by-class pairing with an LLM as a workable middle ground.
100% agree. Obviously AI is at a point where the developer has to do the architecture. Or at least be in control of what kind of architecture the AI is implementing. You can't one-shot huge features in huge codebases with AI. You are bound to get strange decisions. But that does not mean they are not worth using. That's a silly take.
weirdly made up scenario. I'm the person in the very first sentence. Tab-completing lines is still dog-shit. The majority of the time it has no clue what I'm going to write. Just because it can now write a lot more stuff doesn't mean it isn't still just as incorrect.
Also, you've set up a huge strawman here. Who are these people saying these things in this order and why is that the argument and not "You need to be reviewing every line of code that gets written and understand it."
It's completing shit. Even if it does not implement some lazy stuff with empty catch blocks (i.e. happy path from programming 101 tutorials), it will either expose your secrets in a sensible place or do some other stupidity.
"it takes too much effort to get the output production ready"
turning into
"maybe long term the maintenance will be more expensive"
I give it three months until people realize that you rarely need to review every single line and fully understand the code, like so many comments are claiming.
If you work on a product that has an existing user base that has an expectation that things will still work then you definitely still need to read the code. LLMs frequently break things or introduce subtle incompatibilities.
Maybe on projects with no users you can yolo things.
It's not about the number of users but the kind of software you develop.
In a mobile app, do you think it's more important to test that your drag gesture works as expected on the phone, or to understand every line of the implementation?
There are always people who will disagree, no matter how amazing something is, and they naturally respond with concerns close to the locus of the LLMification. It would be absurd to respond to “AI autocomplete is great now” with “but you still need to architect your code”. What’s people saving seconds writing code minutiae got to do with architecting the code?
This blob of people criticizing AI is just that, a blob. A gaggle of discrete people that your brain makes up a narrative about being some goalpost shifting entity.
Of course there could be individuals who have moved the goalposts. Which would need a pointed critique to address, not an offhand “people are saying” remark.
I wanted to make it easier to quickly see/study trending articles on Wikipedia because they tend to make good topics to know before going to trivia night.
I've had the domain for a while, but just made the app recently on a whim.
I use Wikimedia's API to get the trending articles, curate them a bit, add some annotations to provide some context, then push to deploy the static site.
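In case it helps anyone build something similar, here's a minimal sketch of the fetching step in Python. The pageviews endpoint and JSON shape are Wikimedia's public REST API; the filtering is only a guess at what "curate them a bit" might involve, and the User-Agent string is a placeholder, not the site's actual pipeline.

```python
"""Fetch the most-viewed English Wikipedia articles for a given day."""
import json
import urllib.request
from datetime import date, timedelta

# Wikimedia asks API clients to identify themselves with a descriptive User-Agent.
USER_AGENT = "trending-trivia-demo/0.1 (contact: example@example.com)"  # placeholder


def top_articles(day: date, limit: int = 20) -> list[dict]:
    """Return the top-viewed articles (title, views, rank) for one day."""
    url = (
        "https://wikimedia.org/api/rest_v1/metrics/pageviews/top/"
        f"en.wikipedia/all-access/{day:%Y/%m/%d}"
    )
    req = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
    with urllib.request.urlopen(req) as resp:
        data = json.load(resp)
    articles = data["items"][0]["articles"]
    # Drop non-article pages that always rank near the top.
    skip = {"Main_Page", "Special:Search"}
    return [a for a in articles if a["article"] not in skip][:limit]


if __name__ == "__main__":
    # Yesterday is usually the most recent day with complete data.
    for entry in top_articles(date.today() - timedelta(days=1)):
        print(f"{entry['rank']:>3}  {entry['article']}  ({entry['views']} views)")
```

From there it's just annotating the titles and regenerating the static pages on a schedule.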
What if it doesn't need to escape the distribution, it can just exhaust the current distribution we have much more broadly and efficiently than humans can?
So the answers we're seeking to our bleeding edge questions are already there, we just need an AI's ability to target the answers. Then re-train on the improvements and go from there.
>Would we regard that as a major achievement of the mathematician? I don’t think we would
For some reason this reminds me of AI images and a domain like comedy.
If an image makes people laugh, the person who prompted it to make the image certainly doesn't get credit for the vast majority of the work in its creation, but perhaps they do get credit for the initial prompt idea and then the "taste" to select that particular one from whatever drafts they went through or otherwise guiding it.
So if a mathematician comes up with an amazing result that an LLM "did", I think they could still get a bit of credit for prompting it to do it and being its guide.
But whereas the first person could perhaps be called a comedian and not an artist, would the mathematician still be called a mathematician or something else?
I love the site, but it's also worth noting that because it is not mobile-friendly it can afford to take full advantage of its efficient catalog nature and not feel the need to make compromises. Sometimes I wish we had said "browsers are for desktops, apps are for tablets/phones" and never tried to combine the two.
Thanks DJ of the East. It's strange to have written a book aimed at grownups and find that astute kids picked up on it.
Now, of course, the book's antique: Arpanet? 1200 baud modems? Phone booths? If you know what those are, then you're probably worrying about 401(K)'s and Medicare.
"Astute" would be quite a surprise for my parents to hear, given they received a not-so-friendly letter from our ISP telling us to quit probing their network's security...