Hacker News | jp57's comments

I think he could have gotten away with naming his daughter Photophone, but he would have to pronounce it as if she were a character from an ancient Greek epic: phoTOphoNEE.

Actually seems absurdly simple now, but sometime last year I was trying to figure out what I'd need to tow my daughter's car cross country with my truck: what are the trailer/dolly options, what do they cost, can my truck actually tow the combined weight, etc.

I started out prompting ChatGPT kinda how I would with Google, one small prompt at a time, asking about various details. But after one or two of those I just tried "I want to tow a car of make A with my truck model B, from point C to point D, what are my options?" And it wrote me a report with comparison tables and computed towing weights and other details for different options.

At that point, I was like "Oh. This is different. And it's just the beginning."
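For anyone curious what the "can my truck actually tow the combined weight" part boils down to, here is a rough sketch. Every number and name in it is a made-up placeholder, not a figure from my truck or my daughter's car, and it only compares towed weight against a tow rating with a safety margin. A real check would also look at things like the hitch rating, tongue weight, and the truck's gross combined weight rating.

```python
from dataclasses import dataclass


@dataclass
class TowOption:
    name: str
    equipment_lb: int  # weight of the dolly or trailer itself (hypothetical values below)


def towed_weight(car_curb_lb: int, option: TowOption) -> int:
    # Conservative simplification: count the full car weight for every option.
    return car_curb_lb + option.equipment_lb


def can_tow(tow_rating_lb: int, car_curb_lb: int, option: TowOption, margin: float = 0.10) -> bool:
    # Require the towed weight to stay under the rating minus a safety margin.
    return towed_weight(car_curb_lb, option) <= tow_rating_lb * (1 - margin)


# Placeholder figures for illustration only.
TRUCK_TOW_RATING_LB = 5000
CAR_CURB_WEIGHT_LB = 3000
options = [TowOption("tow dolly", 600), TowOption("car trailer", 2200)]

results = {o.name: can_tow(TRUCK_TOW_RATING_LB, CAR_CURB_WEIGHT_LB, o) for o in options}
for name, ok in results.items():
    print(f"{name}: {'within margin' if ok else 'over margin'}")
```

The point is less the arithmetic than the shape of the question: one prompt that names the vehicles and the route gets you this comparison for every option at once.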


It very plausibly might have been totally wrong.

Out of laziness, I have several times asked Claude and ChatGPT for torque figures and other simple, hard data related to my dirt bike. They often got it completely wrong, but with full confidence every time. I never trust LLMs with hard data unless I RAG the PDF into the context, and even then it's sketchy.


Dates matter. Questions I asked about my Mazda a year ago that produced total hallucinations were answered very well this year. To me it feels like the early days of computing: what was not possible one year became possible when a new generation of CPU or GPU came out, and you have to consistently re-evaluate your expectations or else you'll miss the things that others are discovering with fresh eyes.

I made this personal 'benchmark' of odd and strange questions a few years back when this took off, and I keep re-running these questions whenever some big news comes out about a new model, also going back and forth between the different companies to see where they all stand. (Obviously with clean cache/new accounts.)

There are 10 questions. In 2023 the models only got as far as question 3 or 4. Last year they reached the last question but were still hallucinating. This year they provide sources pulled from really obscure books.

For example, one of the harder questions was about the transition of a particular 30 second portion of a background song used in a 30+ year old Bond film that was only played once in the entire film. It went from totally making up nonsense to accurately describing the music theory definition of the transition (called a 'stinger'), explaining why it was done in that particular scene of the film, and providing sources from a snippet of an unrelated interview with the composer explaining his mindset at the time.

Maybe this isn't considered a real benchmark as it's not reproducible, but for a 'personal benchmark' I came away impressed. I would encourage everyone to define their own benchmarks and 'tests' and to consistently challenge the models to see if there are any meaningful improvements. Now I treat the AI as something to stay skeptical of, but also to always consider what it proposes as an answer (i.e. don't ever dismiss it outright). I sometimes wonder if this is slowly messing up my biases, and maybe that's what Altman, Amodei and others want.
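If you want to keep a personal benchmark like this comparable from year to year, a minimal sketch could look like the following. It is only an illustration: `fake_model` is a stand-in for whatever chat interface you actually use, the questions and model name are invented placeholders, and grading is left as a manual step.

```python
import json
from datetime import date


def run_benchmark(questions, ask, model_name):
    """Ask every question once and return one record per question."""
    records = []
    for number, question in enumerate(questions, start=1):
        records.append({
            "date": date.today().isoformat(),
            "model": model_name,
            "question_number": number,
            "question": question,
            "answer": ask(question),
            "grade": None,  # filled in by hand after reading the answer
        })
    return records


def fake_model(question):
    # Hypothetical stand-in; replace with a call to the model being tested.
    return "placeholder answer to: " + question


# Invented example questions, not the ones from my list.
questions = [
    "What is the musical term for a short accent cue in a film score?",
    "Which source supports that answer?",
]

records = run_benchmark(questions, fake_model, "example-model")
print(json.dumps(records, indent=2))
```

Saving each run's output to a dated file makes it easy to compare what changed between model generations.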


Use the latest models, set effort a bit higher, and try again. It probably won't be wrong.

It wasn’t wrong, though, in my case.

Hard numbers, no. Even for high-level concepts and theory you need to triangulate, prompting from different angles across different models, and figure out what overlaps to build a mental model that’s, even then, roughly 80% correct. It’s better than Google, but the information isn’t free.

Similarly, I used gen AI to review a real estate purchase. I provided Zillow listing photos, serial numbers of all appliances, and photos I took during the walk-through of the electric panel and a few additional areas not pictured in the listing.

I prompted the AI to write a report as if it were a home inspector, and it actually did a better job and identified some issues that the inspector I paid 750 USD missed.


From pictures alone? What are some examples?


It noticed a flooding area due to low grass by the walkout door. It noticed mixed 15A and 20A receptacles on the same circuit. It noticed warped siding and recalled circuit breakers still in use.

15A and 20A receptacles on the same circuit sounds fine as long as it's a 20A circuit? And how could it tell which outlet is on which circuit?

It can’t, but it’s read reports before so it sure can simulate an answer.

To give it the benefit of doubt, it's possible it saw a circuit labelled "kitchen" in the panel, and then in photos of the kitchen saw mixed outlets.

(I'm not in the US - would a 'home inspector' actually go around buzzing out outlets anyway?)


They won't necessarily map out all the circuits but they will generally test them all with a tester to find wiring problems.

Yes, most will at least test GFCI receptacles especially in the kitchen. I bought one to test my basement after a renovation.

What, the Zillow listing of your home doesn't have pictures of mixed 15A and 20A receptacles on the same circuit that an AI caught but that an inspector missed?

Is that what you're telling us??


Good thing you didn't want to wash the car on your way.

Fascinating; you used a non-deterministic tool, one that disclaims its own accuracy, to calculate critical information that could result in serious damage or physical injury? Did you, like, double-check the results?

One must imagine how many claims have been denied by insurance companies for doing something like this...


Just recently it occurred to me that the sage parenting advice, "Don't try to make a happy baby happier," applies to so many other things. Once I had this idea, it seemed like everywhere I looked were people trying to make happy babies happier. Improving tools that work fine, optimizing things where the available margin for improvement is small, etc.

I really think a majority of NYTimes and ABCnews consumers don't know the difference between a 2/3 chance (super close) of winning and 2/3 of the vote (a landslide).
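To put rough numbers on that difference, here is a small sketch. It assumes a two-candidate race and models the forecast uncertainty as a normal distribution around the expected vote share. The 3-point forecast error and both vote shares are illustrative assumptions, not figures from any real forecast.

```python
from statistics import NormalDist


def win_probability(expected_share: float, forecast_error: float) -> float:
    """Chance of finishing above 50% given an expected share and its uncertainty."""
    outcome = NormalDist(mu=expected_share, sigma=forecast_error)
    return 1 - outcome.cdf(0.5)


# Assumed forecast error of 3 percentage points in both cases.
close_race = win_probability(expected_share=0.513, forecast_error=0.03)
landslide = win_probability(expected_share=0.667, forecast_error=0.03)

print(f"Expected 51.3% of the vote -> win probability {close_race:.2f}")
print(f"Expected 66.7% of the vote -> win probability {landslide:.2f}")
```

Under these assumptions, an expected lead of barely more than one point already gives roughly a two-in-three chance of winning, while two-thirds of the vote is a near-certain win.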


Apparently neither do a big chunk of HN readers.


Wait. There were 10000 elevator attendants in the USA in 1990?


Supposedly, in 1990, there were somewhere between 132,000 and 270,000 travel agents. Consider that.


That’s far more believable than 10,000 elevator attendants. I was an adult in 1990 and used travel agents. But I can’t remember ever encountering an elevator attendant.

Well, he is standing still, the camera is stationary. For just that last segment, it is easy to answer "how did he do it?" Write out his remarks, rehearse with a timer, then figure out at what point in the countdown to begin speaking.

The main thing is that he has to say basically one sentence right in a single take, but he is a seasoned television announcer, so that in itself is not too surprising.

The much longer segment, including walking with a moving camera at exactly the right timing, would have been much harder to get in a single take. (Not to mention that that Saturn V lying on its side is probably not even in the same location.)


Yes and no. The Saturn V he walks past was on its side next to the Vehicle Assembly Building, which is close to launch complex 41 from which the Voyager missions launched.

But I think it's about 120 degrees to the left from where they shot the shot of Burke walking. They absolutely had to set up a different shot to get Burke and the Titan III in the same shot.

As an aside... That Saturn V is no longer at that location. Several years ago they moved it a mile to the north and built a building around it to create the Apollo / Saturn V center. Or at least I think it's the same artefact.


They mention Meta's layoffs, which probably have more impact on employee morale than the AI stuff.

My current theory of tech layoffs is that over the last decade or so, churn-inducing practices like stack-ranking have gone out of vogue. One can speculate as to why this happened. Perhaps generational change made middle management unwilling to do the dirty work? Nevertheless, it happened.

However, companies still want to, and some would argue need to, eliminate low performers, so now they periodically do a companywide reduction in force and frame it with whatever justification is handy, macroeconomic conditions, AI, whatever.

This hypothesis would explain phenomena like companies hiring aggressively during or after a layoff, and why the layoffs keep happening year after year.


Not sure about other tech cos, but I think Meta and Amazon currently do stack ranking.

It seems to be a thing that comes and goes as the job market is weaker or stronger


Oh don't worry your pretty little head, stack-ranking and churning are still _very_ in vogue with the tech companies.


But are they actually serious people? I had corporate astroturf accounts arguing with me on my otherwise-ignored blog as early as 2004. All this time later, I just assume that every serious corporation employs PR firms using sock-puppet accounts to shill in favor of whatever dark shit they're doing, acting like it's all just really great and good for us.


We've seen this on HN before as well: companies targeting blogs and reddit with LLM-generated content that "subtly" name-drops products or services, fake praise, and even meaningless "support" requests on discussion boards.


Claude is a mediocre programmer that can do great things with great supervision, but it can't make mediocre human programmers into good ones, because they can't provide great supervision.

It will try and try and try, though.


I'd bet it's the LLM doom loop: vaguely ask it to do something, tab to news.ycombinator.com for 30 minutes, tab back, notice it misunderstood the prompt. Restart with a new, improved prompt, tab back to HN.

So yeah, probably the same thing people do anyway, except instead of compile time it's now generating time.


We opened the Claude Code floodgates all at once in my org. After a few months we looked at stats, and asked managers for impressions on performance changes. The API cost per engineer doesn't correlate with the apparent increases in performance, but it sure seems that the vast majority of people who used to have good reviews got a lot better, while the bottom third just didn't, even though they use the LLMs about as much. It makes the performance differences in teams look like an abyss. Someone appears stuck in a task, and we see what they've been prompting, and then one of the best seniors comes in, actually asks the questions well, and the LLM does all the debugging and all the fixing in 20 minutes.

It's not that the best performers are magical prompt engineers providing detailed instructions: they ask better questions that the LLM knows how to try to answer, and provide the specific information that the LLM would take a while finding. It's as if some people just had no "theory of mind" of the LLM, and what it can know, and others just do. It's not a living thing or anything like that, but it's still so useful to predict it, put yourself in its shoes, so to speak. Just like you'd do with a new hire, or a random junior.


This comment is buried deep but I think it's actually quite important. In 2005 you had the elderly googling "Can I have a recipe for an apple pie? Thank you." vs kids doing "apple pie recipe" and clicking the first result. Some (most?) people just weren't capable of conceptualizing the abstract idea of "internet search" so they talked to the machine like they'd talk to other humans.

Until coronavirus, virtually anyone, regardless of mental skills, could get a high-paying job as a coder; there was no filter at all.

What we'll observe now is the split between those who can conceptualize the idea of an AI, and those who cannot. The latter group will be stuck talking to AI in a way that doesn't leverage how it actually works.


I saw a talk by Boris where he said, basically, that Claude codes itself now. They have it automatically writing features and reviewing PRs, apparently. I suspect that much of the code has never been seen by human eyes within Anthropic.


What are all their SWEs doing, if Claude is coding itself? And why are there hundreds of open SWE positions on their careers page?


There’s all kinds of stuff there that’s not the Claude Code app.


Why isn't Claude coding that stuff too? Is Claude only good at coding Claude?


IDK. I was just reporting what Boris said. In any case the litany of reports of slop inside Claude Code speaks for itself.


lol so they aren't even good at using Claude


These are people who lucked into working at FAANG 10 years ago and have been riding the coattails since. Highly incompetent people dictating how we should all work.

