in-silico · 2026-06-20T02:21:40 1781922100

Additionally, maybe it's easier for a model to realize that it doesn't know the answer when the question is easier.

If Opus gets all but the hardest questions right, it might have a higher hallucination rate, because the questions it gets wrong are the ones where verification or hallucination detection is most difficult.

in-silico · 2026-06-16T20:44:06 1781642646

Neither of these strikes me as particularly groundbreaking.

The first idea (which I understand to mean retrieving token IDs rather than hidden states) is going to really struggle to do useful compositional reasoning and contextual recall.

The second idea has been done a million times, with Linear Attention being maybe the first modern example. Hyena, state-space models, DeltaNet, and LaCT also lie in different regions of the performance-parallelizability spectrum of fixed-size models.

in-silico · 2026-06-16T17:22:43 1781630563

> LLMs cannot do math

This is plainly not true anymore.

razorbeamz · 2026-06-17T00:03:57 1781654637

No, they fundamentally cannot do math. They are next token predictors, not calculators.

in-silico · 2026-06-20T19:48:56 1781984936

Why can't a next token predictor do math? Humans aren't calculators either, but we can do math.

If you want proof just look at the benchmarks. Modern frontier models can get basically perfect accuracy on American Invitational Mathematics Examination tests: https://matharena.ai/?comp=aime--aime_2026

If you want an explanation of how they do math, we've found geometric calculators inside their neural networks: https://www.goodfire.ai/research/a-geometric-calculator#

in-silico · 2026-06-04T23:42:58 1780616578

These types of ablation studies are always good. However, I'm not sure how generalizable the language model findings here are.

Their 1.2B model was trained on only 10B tokens, which is less than half of the Chinchilla compute-optimal number. Modern overtrained 1B LLMs are trained on the order of 10T tokens (1000x more).
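As a rough check on those numbers, here is a minimal sketch. It assumes the commonly cited rule of thumb of about 20 training tokens per parameter for Chinchilla-optimal training; that ratio and the function name are my own assumptions, not figures from the paper.

```python
# Rough check of the token budget, assuming ~20 tokens per parameter
# (a rule-of-thumb reading of the Chinchilla results, not an exact law).
TOKENS_PER_PARAM = 20  # assumption

def chinchilla_optimal_tokens(n_params: float) -> float:
    """Approximate compute-optimal token count for a model with n_params parameters."""
    return TOKENS_PER_PARAM * n_params

n_params = 1.2e9        # model size discussed above
trained_tokens = 10e9   # tokens the model was actually trained on
optimal = chinchilla_optimal_tokens(n_params)

fraction_of_optimal = trained_tokens / optimal
overtrained_ratio = 10e12 / trained_tokens  # ~10T tokens for modern 1B-class models

print(f"optimal ~ {optimal / 1e9:.0f}B tokens")
print(f"trained on {fraction_of_optimal:.0%} of that")
print(f"modern runs use ~{overtrained_ratio:.0f}x more tokens")
```

Under that assumption the optimal budget comes out around 24B tokens, so 10B is a bit over 40% of it.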

This is important because, from my own experience, simplifications and alternatives to standard attention can look fine in the under-trained regime but lag after over-training. This happens because attention has very little out-of-the-gate inductive bias, so it takes a lot of training for the expressiveness to really shine through.

I can't fault the authors since longer training runs cost money, but it warrants pointing out.

I'm also disappointed that they didn't report reasoning benchmark results for the Q=K-V case, since that is by far the most theoretically interesting case (in my eyes).

janalsncm · 2026-06-05T02:40:30 1780627230

It’s a data point. I could imagine in a hardware constrained setting we might not care about training on enormous token counts, and on smaller devices it’s great if we can simplify the architecture.

I agree that this isn’t proof that it scales to trillions of tokens, but this does show a scaled up experiment would be worth a shot.

Philpax · 2026-06-05T03:32:49 1780630369

The Chinchilla scaling laws give you a minimum for the number of tokens you should be using for a given size: if you can't meet what they suggest for that size, you should shrink the size, as, otherwise, the capacity of the model is going to waste.

I do agree that it is a datapoint, but GP's point is that this model was undertrained, so it's hard to draw the same conclusions from it that we would from other research.

ACCount37 · 2026-06-05T02:11:54 1780625514

I wonder if some of those synthetics that specifically burn in attention inductive bias could help there - i.e. by getting attention to converge faster than it normally would?

in-silico · 2026-06-04T23:27:19 1780615639

> The problem is "public schools". The idea itself is wrong, and it can't be made to work.

Do you have an alternative idea in mind?

in-silico · 2026-06-04T01:24:22 1780536262

> It's not changed by the experience

The entire file is not changed, but the KV cache is.

> It doesn't remember anything

The model definitely remembers previous exchanges within the same conversation.

rmunn · 2026-06-04T02:09:04 1780538944

> The model definitely remembers previous exchanges within the same conversation.

No it doesn't. They get added to its context, and it reads them afresh when answering the next question. That's not remembering.

If your short-term memory completely malfunctioned one day, so you had no ability to remember what was said to you a minute ago, then you would have to find workarounds. For example, you could write down everything someone says to you, then read your notes of the previous exchanges in that conversation in order to continue the conversation. That would be a good way to work around the fact that your short-term memory was broken. And if your notes were invisible to other people and you could read them really fast, then you could even make most people believe that you remembered what they said a minute ago. But you don't actually have a working memory, you're just writing down what they said and re-reading it while coming up with your next response.

That's exactly what LLMs do. That's not memory.

ACCount37 · 2026-06-04T17:38:29 1780594709

Continuous learning allows past behavior and past inputs to influence future inputs and future behavior. In humans.

Attention over KV cache allows past behavior and past inputs to influence future inputs and future behavior. In LLMs.

Until the cache runs out, that is. But even then, you could totally use any of 9000 methods of cache compression, truncation, dropping or streaming and get away with it.

The difference between continuous learning and in-context learning seems to be in capacity, not in principle. Both are doing a similar thing, but one has more length and depth to it.

nomel · 2026-06-04T18:42:32 1780598552

Maybe, every night, you send the AI off to "sleep", where it uses those in-cache "memories" to influence the long-term weights [1].

[1] https://www.pnas.org/doi/10.1073/pnas.2220275120

ACCount37 · 2026-06-04T19:46:21 1780602381

Context self-distillation does exist, but as is, it's used mostly in training rather than as a part of a continuous learning mechanism.

in-silico · 2026-06-04T03:19:53 1780543193

This is really semantics, but I wouldn't call attending to the KV cache re-reading the context.

The model takes in the context, encodes it into a "memory" (the KV cache), and accesses that memory later. That fact doesn't change just because the KV cache grows in size with the context.

I don't know what memory would look like other than an encode-retrieve loop.
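To make the encode-retrieve framing concrete, here is a toy sketch of single-head attention over a growing cache. The class name and the hand-picked vectors are hypothetical; in a real transformer the keys, values, and queries come from learned projections of hidden states, across many heads and layers.

```python
import math

class KVCache:
    """Toy single-head attention memory: encode appends, retrieve attends."""

    def __init__(self):
        self.keys = []    # one key vector per past token
        self.values = []  # one value vector per past token

    def encode(self, key, value):
        # "Writing" to memory: the cache grows by one entry per token.
        self.keys.append(key)
        self.values.append(value)

    def retrieve(self, query):
        # "Reading" from memory: softmax over scaled dot products,
        # then a weighted sum of the stored values.
        scale = 1.0 / math.sqrt(len(query))
        scores = [scale * sum(q * k for q, k in zip(query, key)) for key in self.keys]
        peak = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        dim = len(self.values[0])
        return [sum(w * v[i] for w, v in zip(weights, self.values)) / total
                for i in range(dim)]

cache = KVCache()
cache.encode(key=[10.0, 0.0], value=[1.0, 0.0])  # token A
cache.encode(key=[0.0, 10.0], value=[0.0, 1.0])  # token B

# A query aligned with token A's key mostly recovers token A's value.
out = cache.retrieve([10.0, 0.0])
print(out)
```

Nothing is re-read as text here: earlier tokens are only reachable through the vectors that were stored when they were first processed.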

Relevant: Transformers are Multi-State RNNs - https://arxiv.org/abs/2401.06104

CommieBobDole · 2026-06-04T02:19:47 1780539587

Right, but that's still external to the LLM, it's just a KV cache that's stored on the provider side for performance reasons, so that the client doesn't have to re-send the whole chat history with every subsequent call in the conversation.

It still generates every response using the model's pristine state with every new API call; whether the context is provided from the client or from a colocated cache server doesn't really change that.

fipar · 2026-06-04T02:41:56 1780540916

Not the model though. The model really only takes input text and produces output text. Memory within a conversation is achieved by the harness adding the conversation (or parts of it) to the input text. The LLM itself has no memory, it’s the augmented system of several orchestrated LLM calls that does.

nomel · 2026-06-04T18:45:04 1780598704

Wait until you hear about the hippocampus!!! [1]

I don't think physical integration within one container is relevant to system-level behavior.

[1] https://en.wikipedia.org/wiki/Neuroanatomy_of_memory

fipar · 2026-06-04T19:47:26 1780602446

I had heard (or rather, read) about the hippocampus before, but I don’t understand how that relates to my claim that the models have no memory.

nomel · 2026-06-04T20:15:54 1780604154

> The LLM itself has no memory, it’s the augmented system of several orchestrated LLM calls that does

Your own long term memory is the orchestration of systems that make it long term.

fipar · 2026-06-04T22:29:05 1780612145

You seem to be arguing as if I'm saying AI can't think or have memory.

Now, my opinion is it currently can't think, but it certainly has memory.

However, LLMs don't have memory. That's what I (and others on this thread) responded to, which is unrelated to how my own memory works.

nprateem · 2026-06-04T02:57:58 1780541878

> The model definitely remembers previous exchanges within the same conversation.

Christ HN isn't what it used to be

in-silico · 2026-06-04T03:34:34 1780544074

Care to elaborate?

in-silico · 2026-06-01T09:17:27 1780305447

> Would be interesting whether it is possible to write an LLM-like program just using compression and function interpolation algorithms.

gzip can be used as a (not very good) LLM-like text and image generator: https://arxiv.org/abs/2309.10668

in-silico · 2026-05-28T17:04:34 1779987874

We know how the models are built and trained, but we have a very limited understanding of how the final products work.

That is to say, we don't know why they give the outputs that they do.

If we did know how they worked, AI interpretability would not be an open and growing field.

in-silico · 2026-05-21T04:36:02 1779338162

> But it is another good example that "AI" is just glorified search and there is not reasoning or thinking going on behind the covers

A bold claim given that the current top post on HN is "An OpenAI model has disproved a central conjecture in discrete geometry": https://news.ycombinator.com/item?id=48212493

in-silico · 2026-05-16T20:38:28 1778963908

While there is a limit to the amount of information you can fit in a fixed-size state, the theoretical ceiling is pretty high.

A Hebbian associative matrix (one of the simplest and weakest memory constructions) can store about 0.7 bits of information per parameter. If you have a state with 300M parameters (the size of a Llama 3 8B KV cache at 10K context length), and a context with 2.1 bits of entropy per token (a reasonable estimate), then the state can encode 100M tokens worth of information.

Real models obviously aren't powerful enough to operate at the limit, but you can see why this is a promising research direction.
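The arithmetic behind that estimate, as a minimal sketch. All three constants are the figures quoted above, taken as given; the function name is just for illustration.

```python
# Back-of-the-envelope capacity estimate using the figures quoted above.
BITS_PER_PARAM = 0.7   # Hebbian associative matrix, as stated above
STATE_PARAMS = 300e6   # state size used above
BITS_PER_TOKEN = 2.1   # assumed entropy of text per token

def token_capacity(state_params, bits_per_param, bits_per_token):
    """Tokens' worth of information a fixed-size state could hold in theory."""
    return state_params * bits_per_param / bits_per_token

capacity = token_capacity(STATE_PARAMS, BITS_PER_PARAM, BITS_PER_TOKEN)
print(f"{capacity / 1e6:.0f}M tokens")
```

That is 210M bits of state divided by 2.1 bits per token, which gives roughly 100M tokens.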

RandomBK · 2026-05-17T04:40:36 1778992836

> context with 2.1 bits of entropy per token

Can you elaborate on this? I've seen estimates of ~1.5 bits per English letter, and tokens encode a lot more than that - sometimes full words, with multimodal even more. If KV cache embeddings are storing more than just simple tokens but entire concepts with context and nuance, that'll bump the entropy up quite quickly.

in-silico · 2026-05-17T04:57:36 1778993856

> Can you elaborate on this? I've seen estimates of ~1.5 bits per English letter

The reference I always go back to is the GPT-3 paper. The cross-entropy loss (an upper bound for entropy) got down to 1.75 nats (2.5 bits). I took 2.1 because 2.5 is an upper bound and I wanted the estimate to end up as a round number.
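For anyone checking the unit conversion, a quick sketch: nats become bits when divided by ln 2. The helper name is mine.

```python
import math

def nats_to_bits(nats: float) -> float:
    """Convert information measured in nats to bits (divide by ln 2)."""
    return nats / math.log(2)

loss_bits = nats_to_bits(1.75)
print(f"{loss_bits:.2f} bits per token")
```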

> If KV cache embeddings are storing more than just simple tokens but entire concepts with context and nuance, that'll bump the entropy up quite quickly.

Here's the thing: the concepts that the model stores in the KV cache are a deterministic function of the input tokens. Similar to the data processing inequality, this implies that no entropy is actually added.

Looking at it mechanically, a sufficiently powerful model only needs to encode the tokens and can recompute concepts later as needed.
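A toy illustration of the entropy point, assuming a made-up token stream and a made-up "concept" function: the entropy of a deterministic function of the tokens can never exceed the entropy of the tokens themselves, and storing the tokens together with derived concepts adds nothing.

```python
import math
from collections import Counter

def entropy_bits(samples):
    """Empirical Shannon entropy of a list of hashable samples, in bits."""
    counts = Counter(samples)
    n = len(samples)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

tokens = ["a", "b", "c", "d", "a", "b", "c", "d"]  # hypothetical token stream

# A deterministic, non-invertible "concept" function merges some tokens...
concepts = [t in "ab" for t in tokens]
# ...while tokens paired with their derived concept are just a relabeling.
enriched = [(t, t in "ab") for t in tokens]

h_tokens = entropy_bits(tokens)
h_concepts = entropy_bits(concepts)
h_enriched = entropy_bits(enriched)
print(h_tokens, h_concepts, h_enriched)
```

The derived concepts carry at most as much entropy as the tokens, and the enriched representation carries exactly as much.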

usernametaken29 · 2026-05-16T21:33:55 1778967235

While 100 million tokens sounds like a lot, think about it for a bit, and you’ll see why it is basically nothing. Try to cram a human lifetime of sounds, smells, video and more sensory data into 100 million tokens. Heck, try to process the video plot of a single series into that window. It just won’t work, it won’t scale, and is laughable compared to contextual memory. I’m not saying that to belittle the authors of the paper but the reality is that this has very little to do with transient long-term memory.

ltbarcly3 · 2026-05-16T21:59:15 1778968755

You don't remember a lifetime of smells. You don't have any memories from huge swaths of time. There are entire years of your life compressed down to vibes and a handful of events you largely misremember.

usernametaken29 · 2026-05-16T22:43:29 1778971409

That’s a very weak argument. Memories are not exact replicas of experiences. We know that many memories are retained through a lifetime, particularly the ones from early childhood. Unlike computers we always reconstruct memories from several modalities. Even if we remember largely on vibes as you say (which is not true when you look into neuroscience), the sheer amount of information is overwhelming. Again, try to run a 90-minute movie through an LLM memory system. It won’t be able to tell you the plot. That’s before you even feed it sound. Even 100M tokens is not enough for that. You on the other hand will largely remember the movies you liked and their major plot lines and from there be able to reconstruct their scenes. I think the engineers working on memory vastly underestimate the capacity problem of discrete states.

ltbarcly3 · 2026-05-18T17:10:34 1779124234

blah blah we know that blah neuroscience blah blah blah.

This isn't an argument you are making, it's just an assertion that you could make an argument if you were so inclined, but you won't be doing so at this time, but "science" is obviously on your side, but you can't be bothered to say how or even give enough detail for someone to check what you are referring to. I can do that too, see my first sentence in this reply.

I don't know how LLM memory systems work. I do know that you don't have a lifetime of remembering everything with high precision. Not only do most people not remember the plot of most of the movies they have seen, they can't reliably list most of the movies they have seen. Not everyone has a good memory. My point is that it's not valid to reference a false model of how human memory works as a reason some specific LLM memory implementation isn't useful for solving some problems.

kami23 · 2026-05-16T22:37:06 1778971026

Exactly, and for a given task you don't need to recall what your friend's brother's name is to do a git commit and push. There's a pull for more context to make these things better, but also the pull to make these execute in such a small context effectively when appropriate.

I'm more on team small tasks because of my love of Unix piping. I keep telling folks, as an old Linux dude, seeing subagents work together for the first time felt like I was learning to pipe sed and awk for the first time. I realized how powerful these could be, and we still seem to be going that direction.

in-silico · 2026-05-16T22:47:43 1778971663

I think you underestimate just how much information 100M-ish words is. It's like a 300,000-page novel. That's a 50 foot (~15 meter) thick book.

Surely with (much less than) 300K pages you could describe every meaningful detail of a video series' plot. You don't need to remember the exact pixel values.

You can also scale the numbers up. I specifically chose a relatively small model and short context length as a reference, so 100x bigger is not out of the question. At that point, with a 10B token capacity, you are looking at all of English Wikipedia in a single state.