
> No. It did not. Of course it didn't. Why would they do that?

I feel like this misses the forest for the trees. Sure, they could fast-path the specific problem of the day into the dataset, but that's not really an approach to making a better overall tool; it's a temporary, one-off hack you have to add to an ever-growing context of specific task steps. Trying to make a better general tool, such as the new o1-preview, is a "real" path forward.



The point still is that considering they 1) named the model strawberry, 2) released a promo video showing the model solving this exact problem successfully and 3) put a strawberry joke as an example prompt on the landing page, you'd have expected them to actually have fixed it. Otherwise why even bring it up so many times?


People who understand where the problem comes from know to ignore it and understand the irony of the naming. Others have a few seconds of fun. The problem and its solution are almost irrelevant.


An AI that people consider to be AGI, or at least on a path to AGI, not being able to count the number of letters in a word is "almost irrelevant"?


Yes. Nobody is seriously going to ask the system to count letters in a word they just typed. It could be interesting for crosswords and similar things, but adding "use python" solves those already.

Bringing up the strawberry is like saying: "people consider computers to be great at dealing with numbers, but they can't even add 0.1 and 0.2 correctly". You learn about this limitation once, understand how to deal with it, and get on with your life.
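For anyone who hasn't seen either workaround, here is a minimal sketch of what "use python" and "deal with it" look like in practice. The helper name `count_letter` is made up for illustration; everything else is the standard library.

```python
import math
from decimal import Decimal


def count_letter(word: str, letter: str) -> int:
    """Count occurrences of a letter in a word, ignoring case."""
    return word.lower().count(letter.lower())


# Exact character counting is trivial once you work on characters, not tokens.
print(count_letter("strawberry", "r"))  # 3

# The classic floating-point surprise: binary floats cannot represent 0.1 exactly.
total = 0.1 + 0.2
print(total == 0.3)  # False

# Two common ways to deal with it: compare with a tolerance, or use decimal arithmetic.
print(math.isclose(total, 0.3))  # True
print(Decimal("0.1") + Decimal("0.2") == Decimal("0.3"))  # True
```

In both cases the limitation comes from the representation (tokens, binary floats), and the fix is to switch to a tool that works on the right representation.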


"Nobody is seriously going to..."

People will throw any problem at it they want to. That's the entire point of general intelligence.


Answering simple computational questions correctly is not almost irrelevant.


If the question is simpler than any reasonable use case, then it's basically irrelevant.

It's as irrelevant as asking a math genius what's 1+1 and getting the answer "a bazillion". Was it wrong - yes. Was everyone's time wasted - also yes.

OpenAI people know this is something models don't answer right. And they barely care to attempt fixing it - it's basically a joke at this point. But it's irrelevant because nobody pays OpenAI to ask about the spelling of words they just typed. If they actually had that need, then "use python" or similar approaches work just fine. The model could also be taught to call a function to get the token-to-spelling mapping it needed.
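As a rough sketch of that last idea: the functions below are hypothetical tools a model could be allowed to call, not anything OpenAI actually ships. The names `spell_out`, `letter_counts`, `TOOLS` and `call_tool` are all invented for illustration, and the dispatch is a plain dictionary rather than a real function-calling API.

```python
from collections import Counter


def spell_out(text: str) -> list[str]:
    """Hypothetical tool: return the individual characters of a string."""
    return list(text)


def letter_counts(text: str) -> dict[str, int]:
    """Hypothetical tool: count each alphabetic character, ignoring case."""
    return dict(Counter(ch for ch in text.lower() if ch.isalpha()))


# Assumed registry that a host application would expose to the model.
TOOLS = {"spell_out": spell_out, "letter_counts": letter_counts}


def call_tool(name: str, argument: str):
    """Dispatch a tool request by name; raises KeyError for unknown tools."""
    return TOOLS[name](argument)


print(call_tool("spell_out", "strawberry"))
print(call_tool("letter_counts", "strawberry")["r"])  # 3
```

The model never has to "see" letters through its tokenizer here; it only has to decide to ask for the spelling and read the result.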


Right, I also thought that was a strange take.


The author's complaint is that it doesn't solve this class of problem correctly 100% of the time, like a one-off hack would, not that it does nothing for it at all. https://i.imgur.com/Ar3rlJ1.png


This was the case with 4o as well. It sometimes solved it, sometimes didn't.


One could say that of anything, down to a Markov chain with noise. The point is that the new model, named after the problem, solves it significantly more reliably than the previous one, without a problem-specific hack.

It's also worth noting the current model is the lower-scoring o1-preview, not o1.


Well they did in the testing I have seen.


Sometimes I wonder if our brain is some heuristic tool or just an infinite amount of cobbled together fast paths.

Like, there are so many adults that when faced with a new task, they really struggle to pick it up. Is this because tangentially related fast paths weren't learned in their "training phase"?


I think a particularly useful "feature" of the brain is it can often identify when it doesn't have a fast path (or the fast path is broken) for something and then revert to trying inefficient and generic approaches, even after the optimal learning period has passed.

This is a stark difference from LLMs, where it's either learned or not, "just add noise". Models like o1 take a very small step in that direction.


Ya, basically LLMs just always "go for it" even if they've got no chance of getting it right. They just need a feature to identify when they're interpolating vs extrapolating, since they're not so great at the latter :).


A century of neuroscience has said "yes" every time and every way that has been checked.


The OP went a little too far with the specific prompt, but "generally intelligently decide when to use an exact computation (programming) module" was part of the LLM plugin model that was solved last year. Why the regression?



