Certainly that’s part of it, but also just NIMBYism. Los Angeles was able to defeat the Owens Valley farmers back then; I don’t think it would be able to now.
Hmm, that's interesting. The code as written only has one branch, the if statement (well, two, the while loop exit clause as well). My mental model of the branch predictor was that for each branch, the CPU maintained some internal state like "probably taken/not taken" or "indeterminate", and it "learned" by executing the branch many times.
But that's clearly not right, because apparently the specific data it's branching off matters too? Like, "test memory location X, and branch at location Y", and it remembers both the specific memory location and which specific branch branches off of it? That's really impressive, I didn't think branch predictors worked like that.
Or does it learn the exact pattern? "After the pattern ...0101101011000 (each 0/1 representing the branch not taken/taken), it's probably 1 next time"?
Your mental model is close. Predictors generally work by keeping a table of predictions and indexing into that table (usually via some sort of hash) to obtain a prediction.
The simplest thing to do is use the address of the branch instruction as the index into the table. That way, each branch instruction maps onto a (not necessarily unique) entry in the table. Each of those entries will usually be a two-bit saturating counter that predicts either taken, not taken, or unknown.
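To make that concrete, here's a toy sketch of that simplest scheme (all names and sizes are invented for illustration, not any particular CPU's design):

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// A table of 2-bit saturating counters indexed by the low bits of the
// branch address. Counter values 0-1 predict not taken, 2-3 predict taken.
struct BimodalPredictor {
    std::vector<uint8_t> table;
    // Start every counter at 1 ("weakly not taken").
    explicit BimodalPredictor(std::size_t bits) : table(1u << bits, 1) {}

    std::size_t index(uint64_t pc) const { return pc & (table.size() - 1); }

    bool predict(uint64_t pc) const { return table[index(pc)] >= 2; }

    // After the branch resolves, nudge the counter toward the real outcome,
    // saturating at 0 and 3.
    void update(uint64_t pc, bool taken) {
        uint8_t& c = table[index(pc)];
        if (taken && c < 3) ++c;
        if (!taken && c > 0) --c;
    }
};
```

The saturation is what gives hysteresis: a branch that's taken 99% of the time sits at "strongly taken", and the occasional odd outcome only nudges it to "weakly taken" rather than flipping the prediction outright.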
But you can add additional information to the key. For example, a gselect predictor maintains a shift register with the outcome of the last M branches. Then it combines that shift register along with the address of the branch instruction to index into the table: https://people.cs.pitt.edu/~childers/CS2410/slides/lect-bran... (page 9). That means that the same branch instruction will map to multiple entries of the table, depending on the pattern of branches in the shift register. So you can get different predictions for the same branch depending on what else has happened.
That, for example, lets you predict small-iteration loops. Say you have a loop inside a loop, where the inner loop iterates 4 times. So you’ll have a taken branch (back to the loop header) three times but then a not-taken branch on the fourth. If you track that in the branch history shift register, you might get something like this (with 1s being taken branches):
11101110
If you use this to index into a large enough branch table, the table entries corresponding to the shift register ending in “0111” will have a prediction that the branch will be not taken (i.e. the next outcome will be a 0) while the table entries corresponding to the shift register ending in say “1110” will have a prediction that the next branch will be taken.
So the basic principle of having a big table of branch predictions can be extended in many ways by using various information to index into the table.
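A toy sketch of that gselect-style indexing (sizes invented): concatenate the recent-outcome shift register with low bits of the branch address to pick the table entry.

```cpp
#include <cstdint>

// gselect-style index: the same branch occupies different table entries
// for different recent histories, which is what lets the "...1110" and
// "...0111" contexts above get different predictions.
constexpr unsigned HIST_BITS = 4;
constexpr unsigned PC_BITS   = 8;

uint32_t history = 0;  // global shift register of recent branch outcomes

uint32_t gselect_index(uint64_t pc) {
    uint32_t h = history & ((1u << HIST_BITS) - 1);
    uint32_t p = pc & ((1u << PC_BITS) - 1);
    return (h << PC_BITS) | p;  // indexes a table of 2^(HIST_BITS+PC_BITS) counters
}

void record_outcome(bool taken) {
    history = (history << 1) | (taken ? 1u : 0u);
}
```

gshare, which comes up downthread, is the same idea except the history is XORed with the address instead of concatenated with it, which lets a longer history share the same table size.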
Yeah, the "two-bit saturating counter" thing is pretty much exactly how I thought it worked (which would be terrible for the example in the article), but I had no idea about the fact that it also kept track of the branch history thing, and how that maps to different branch predictor entries. Thanks for the link, that's really fascinating!
It seems like that would struggle with detecting how many layers of branching to pay attention to. Imagine the two nested loops surrounded by a randomized one. Wouldn't that implementation keep hitting patterns it hadn't seen before?
Obviously that must be a solved problem; I'd be curious to know what the solution is.
Can you walk through why you think the random outer loop would interfere?
If the inner loop's behaviour is predictable no matter the outer loop, then because the branch predictor is keyed by instruction address, it can be predicted. Only the inner loop's history is considered.
Or maybe I'm misunderstanding what code you're imagining?
It seems likely I've fundamentally misunderstood gselect or some other aspect here.
If you only look at the history of a single address independently then don't shallow nested loops with dependent behavior completely break the entire scheme?
Whereas with global history I think it mostly works. Maybe? But in that case what happens when the inner branch is trivially predictable but shallow and enclosed by ones that aren't?
In the linked article the branch always has the same address. AMD starts gradually degrading at 30k. Doesn't that indicate 16 bits of history? So for a single trivially predictable inner branch wouldn't you expect an initial 14 mispredictions? That seems like a lot.
My (I suspect very flawed) mental model here is global branch history with a 16 bit shift register and a 16 bit table, the latter keyed on hash( register ) ^ hash( address ) to explain the observed behavior.
Looking at the linked slides again my mental model is what it refers to as "gshare". A tournament predictor between that and "local" addresses the two scenarios I posed but still has obvious failure modes.
Local: if( rng() ){ if( true ){ ... }}
Global: if( f ){} if( !f ){}
Trashed state: if( rng() ){} if( f ){} if( !f ){}
But notice that the third happens naturally (ie no need for an RNG) any time the history depth doesn't match up nicely with the looping pattern. Hence my initial question about how real world implementations determine how many layers to pay attention to. You could solve it with a tree structure but do hardware implementers go that far?
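For what it's worth, the classic real-world answer is the tournament predictor you mention: run a local predictor and a global predictor side by side, plus a table of 2-bit chooser counters that learns, per branch, which component to trust. A rough sketch in the spirit of the Alpha 21264 (heavily simplified; the real design's local predictor is itself a two-level structure, and all sizes here are invented):

```cpp
#include <cstddef>
#include <cstdint>

uint8_t  local_t[4096];   // 2-bit counters indexed by branch address
uint8_t  global_t[4096];  // 2-bit counters indexed by global history
uint8_t  chooser[4096];   // >= 2 means "trust the global predictor"
uint32_t ghist = 0;       // global outcome shift register

std::size_t lidx(uint64_t pc) { return pc & 4095; }
std::size_t gidx()            { return ghist & 4095; }

void bump(uint8_t& c, bool up) {  // saturating 2-bit update
    if (up && c < 3) ++c;
    if (!up && c > 0) --c;
}

bool predict(uint64_t pc) {
    bool l = local_t[lidx(pc)] >= 2;
    bool g = global_t[gidx()] >= 2;
    return (chooser[lidx(pc)] >= 2) ? g : l;
}

void update(uint64_t pc, bool taken) {
    bool l = local_t[lidx(pc)] >= 2;
    bool g = global_t[gidx()] >= 2;
    if (l != g)                               // train the chooser only when the
        bump(chooser[lidx(pc)], g == taken);  // components disagree
    bump(local_t[lidx(pc)], taken);
    bump(global_t[gidx()], taken);
    ghist = (ghist << 1) | (taken ? 1u : 0u);
}
```

Modern designs go further: TAGE-style predictors keep several tables keyed on different history lengths and predict from the longest history that matches, which is about as direct an answer to "how many layers should I pay attention to" as hardware gives.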
It even gets much more sophisticated than that. Even the first Ryzen had something perceptron-based (yes, a neural network!), and there are several predictors plus logic to pick the best one for a given branch (rough sketch of the perceptron idea below).
Check out [1]: it has the most thorough description of branch prediction I've ever seen (chapter 3), across a lot of historical and current CPUs. It is mostly empirical, so you do have to take it with a grain of salt sometimes (the author acknowledges this).
Supposedly the branch prediction on modern AMD CPUs is far more sophisticated, based on [2] (a citation pulled from [1]).
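Here's roughly what the perceptron scheme looks like, per the Jiménez/Lin papers (a toy sketch, certainly not AMD's actual implementation; all sizes invented). One small weight vector per table entry, with the recent global outcomes as the +1/-1 inputs:

```cpp
#include <cstdint>
#include <cstdlib>

constexpr int HIST  = 16;
constexpr int THETA = (int)(1.93 * HIST + 14);  // training threshold from the paper

struct Perceptron { int16_t w[HIST + 1] = {0}; };  // w[0] is the bias weight

Perceptron table[1024];
int hist[HIST] = {0};  // recent outcomes as +1/-1 (0 until warmed up)

// Call predict first, then update with the resolved outcome and the same y.
bool predict(uint64_t pc, int& y_out) {
    Perceptron& p = table[pc & 1023];
    int y = p.w[0];
    for (int i = 0; i < HIST; ++i) y += p.w[i + 1] * hist[i];
    y_out = y;
    return y >= 0;  // non-negative dot product means "predict taken"
}

void update(uint64_t pc, bool taken, int y) {
    Perceptron& p = table[pc & 1023];
    int t = taken ? 1 : -1;
    // Train on a misprediction, or when confidence |y| is below the threshold.
    if ((y >= 0) != taken || std::abs(y) <= THETA) {
        p.w[0] += t;
        for (int i = 0; i < HIST; ++i) p.w[i + 1] += t * hist[i];
    }
    for (int i = HIST - 1; i > 0; --i) hist[i] = hist[i - 1];  // shift history
    hist[0] = t;
}
```

The appeal is that a perceptron can exploit much longer histories than a counter table: its storage grows linearly with history length instead of exponentially.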
> My mental model of the branch predictor was that for each branch, the CPU maintained some internal state like "probably taken/not taken" or "indeterminate", and it "learned" by executing the branch many times.
I always figured the algorithm was much simpler, it would just use the same branch as last execution — should work fairly well.
Didn’t realize it used the input value as well, which to me makes no sense — the whole point is to avoid having to inspect the value. This article raises more questions than answers, I’m intrigued now.
It does not and cannot use the input value. The branch predictor runs something like a dozen cycles before execution, so it generally cannot see values in registers. It also runs some 3-4 cycles before decode, so it cannot even use any bits of the instruction being executed.⁰ Branch predictors strictly use branch history, but this is more complex than just tracking the probability of a single branch: there are tables that maintain all branches over the past tens of thousands of cycles and try to cover common patterns.
0: This is why the first prediction is always "don't branch", because the first time executing code the predictor has literally no information at all. Every now and then people ask for hint bits on branches, but, er, how are you planning to do that when the instruction with the branch hasn't arrived from L1 when the prediction is due?
> I always figured the algorithm was much simpler, it would just use the same branch as last execution — should work fairly well.
Sure, that would work significantly better than no predictor at all. But you'd agree that a better predictor would work better, right? The missing detail might be how expensive mispredicted branches are compared to other costs. If you can go from 50% accuracy to 90% accuracy, it wouldn't be surprising to more than double your performance.
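Back-of-envelope with made-up but plausible numbers: say a branch every 6 instructions, 2 instructions per cycle when prediction goes well, and a 15-cycle misprediction penalty. At 50% accuracy each 6-instruction chunk costs 3 cycles plus 0.5 × 15 = 7.5 cycles of penalty, about 10.5 cycles; at 90% it's 3 + 0.1 × 15 = 4.5 cycles. That's roughly a 2.3x speedup from prediction accuracy alone.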
> Didn’t realize it used the input value as well, which to me makes no sense — the whole point is to avoid having to inspect the value.
It doesn't, and can't for the reasons you hint at. The reason branch prediction is necessary is that the value often isn't available yet when the branch is taken. Was there something in the article that implied the opposite?
--
I wonder if Daniel's tricksy approach using a random number generator to simulate a complex pattern is misleading people here.
One of the main benefits of branch prediction is predicting the end of a loop, particularly, a loop within a loop. In assembly, a loop is just a comparison at the end and a branch back to the beginning. Assume you had a loop that always executes 8 times, or some other small fixed value. Also assume there is some reason you can't unroll that loop, and that loop is inside another loop that executes millions of times. It's a real boost to performance if you can consistently predict the end of the inner loop.
If you predicted just from the loop-closing branch's previous outcome, you'd always miss the ending. But if you can remember a pattern that is longer than 8, you can always get it right. This is obviously valuable. The bigger question is how much more valuable it is to predict a loop (where "loop" might actually be a complex execution pattern across multiple branches) that is thousands long rather than just 8. But quantifying how long this pattern can be on different processors is part of the groundwork for analyzing this.
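Concretely, the shape of code being described (a made-up example, not from the article):

```cpp
#include <cstdio>
#include <vector>

int main() {
    std::vector<int> data(8u * 1000000, 1);  // hypothetical input
    long sum = 0;
    for (int i = 0; i < 1000000; ++i) {
        // Inner loop: the closing branch is taken 7 times, then not taken
        // once, repeating. A plain 2-bit counter mispredicts every exit;
        // with at least 8 bits of branch history the TTTTTTTN pattern is
        // fully predictable.
        for (int j = 0; j < 8; ++j) {
            sum += data[i * 8 + j];
        }
    }
    std::printf("%ld\n", sum);  // keep sum live so the loops aren't elided
}
```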
No, branch predictors are really important and even small improvements in them are extremely valuable on real loads. Improving branch prediction is both a power and a performance optimization.
There are many branch prediction algorithms out there. They range from fun architecture papers that try to use machine learning to static predictors that don’t even adapt to the prior outcomes at all.
The idea here is about maintaining a "path history"!
When looking up a register that tracks the "local" history of outcomes for a particular branch, you want to have a hash function that captures enough context to distinguish the different situations where that branch might be encountered.
Apart from folding a long "global history" of recent outcomes and mixing in the current program counter, I think many modern machines also mix in the target addresses of recently-taken branches.
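A toy sketch of what that mixing might look like (the sizes and the mixing function here are invented; real designs tune these hashes carefully):

```cpp
#include <cstdint>

// Fold recent taken-branch *targets* into the lookup hash, alongside the
// outcome history and the current PC.
uint64_t path = 0;      // running hash of recent branch targets
uint32_t outcomes = 0;  // global taken/not-taken shift register

void on_taken_branch(uint64_t target) {
    path = (path << 3) ^ (target >> 2);  // fold in a few bits of the target
}

void on_outcome(bool taken) {
    outcomes = (outcomes << 1) | (taken ? 1u : 0u);
}

uint32_t table_index(uint64_t pc) {
    // Mix PC, outcome history, and path history into one index.
    uint64_t h = pc ^ ((uint64_t)outcomes * 0x9E3779B9u) ^ path;
    return (uint32_t)(h & 0xFFFF);  // 2^16-entry table (assumed)
}
```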
Typical branch predictors can both learn patterns (even very long patterns) and use branch history (the probability of a branch being taken depends on the path taken to reach that branch). They don't normally look at data other than branch addresses (and targets, for indirect branches).
But the predictor would need to be almost as bad as pure chance for the odds to tilt away from one box. Unless you think it is materially worse than chance, it's a bad bet.
Conceptually it's quite similar to how C++ templates work (not including C++20 concepts, which is the kind of constraining you're talking about).
I quite like it when writing C++ code. It makes it dead easy to write code like `min` that works for any type in a generic way (toy version below). It is, however, arguably the main culprit behind C++'s terrible compiler errors: you'll have standard library functions with a stack of fifteen generic calls, and the failure happens really deep down on some obscure inner thing with a type requirement, so it's really hard to trace back what you actually did wrong.
In my (quite limited) experience, Zig largely avoids this by having a MUCH simpler type system than C++, and the standard library written by a sane person. Zig seems "best of both worlds" in this regard.
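That toy `min`, for illustration (the real std::min is fancier, but this is the gist):

```cpp
#include <iostream>

// Works for any T with operator<; nothing about T is declared up front.
// If T lacks operator<, the error surfaces inside min's body, which is
// exactly how the deep-template error messages come about.
template <typename T>
T min(T a, T b) {
    return (b < a) ? b : a;
}

struct Celsius {
    double v;
    bool operator<(Celsius o) const { return v < o.v; }
};

int main() {
    std::cout << min(3, 7) << "\n";                         // int: prints 3
    std::cout << min(2.5, 1.5) << "\n";                     // double: prints 1.5
    std::cout << min(Celsius{20}, Celsius{15}).v << "\n";   // prints 15
}
```

Now imagine that body is fifteen template layers deep in the standard library, and the diagnostic starts at the innermost layer.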
my hypothesis is that chatgpt was trained on the internet, and useful technical answers on the internet were posted by autistic people. who else would spend their time learning and then rushing to answer such things the moment they get their chance to shine? so chatgpt is basically pure distilled autism, which is why it sounds so familiar.
Just as bad if it's human. No information has been shared. The writer has turned idle wondering into prose:
> Once threads actually run concurrently, libraries (which?) that never needed locking (contradiction?) could (will they or won't they?) start hitting race conditions in surprising (go on, surprise me) places.
It was an essentially pointless platitude about the GIL from a very new account not really related to the article, and all comments from this account are the same: top level comments with lots of em-dashes that are just a vague piece of pablum somewhat related to the subject. If it was just this comment, sure, it could be possible it's a rather uninteresting human. But given the history, this account is pure AI slop.
Great interview, really fun to read. I love that they were both so technically involved; the section where they just start discussing how a range literal would work and why it's a good/bad idea is a highlight. A wonderful example of how really talented and smart engineers talk to each other.
It's interesting to see how LLMs make mistakes sometimes: replacing `->` with `-λ` because arrow sort-of has the same meaning as lambdas in lambda calculus. It's like an LLM brain fart replacing something semantically similar but nonsensical in context.
> Setting aside the LLM topic for a second, I think the most impactful way to preserve these 2 goals is to create torrent magnets/hashes for each individual book/file in their collection.
Not sure that's the case. I fear it would quickly lead to the vast majority of those torrents having zero seeders. Even if Anna's Archive is dedicated to seeding them, the point is to preserve it even if Anna's Archive ceases to exist, I think. Seems to me having massive torrents is a safer bet, easier for the data hoarders of the world to make sure those stay alive.
Also: seeding one massive torrent is probably way less resource intensive than seeding a billion tiny ones.
My favorite example of this is Audacity, which stores its entire project file, including all the audio data, as a SQLite database. It's really amazing: you can stream audio data from many input sources into a SQLite database at once and it won't break a sweat, and it's flexible and extremely safe from data loss due to crashes or power outages. It's also trivially mutable: in many of these kinds of formats, "export" is a heavy-duty operation that serializes everything to JSON or whatever, but in SQLite it's just a matter of updating the database with normal INSERTs, very cheap and simple operations. We've put a similar system into production, and it's incredible how well it works.
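The pattern, as a minimal sketch (this is not Audacity's actual schema, just the general shape, with invented table and column names):

```cpp
#include <sqlite3.h>
#include <vector>

int main() {
    sqlite3* db = nullptr;
    if (sqlite3_open("project.db", &db) != SQLITE_OK) return 1;

    sqlite3_exec(db,
        "CREATE TABLE IF NOT EXISTS blocks ("
        "  id INTEGER PRIMARY KEY,"
        "  track INTEGER,"
        "  samples BLOB)",
        nullptr, nullptr, nullptr);

    sqlite3_stmt* ins = nullptr;
    sqlite3_prepare_v2(db,
        "INSERT INTO blocks (track, samples) VALUES (?, ?)", -1, &ins, nullptr);

    std::vector<float> block(4096, 0.0f);  // one block of incoming audio

    // Append blocks as they arrive, batched inside a transaction.
    sqlite3_exec(db, "BEGIN", nullptr, nullptr, nullptr);
    for (int i = 0; i < 100; ++i) {  // pretend 100 blocks arrived
        sqlite3_bind_int(ins, 1, /*track=*/0);
        sqlite3_bind_blob(ins, 2, block.data(),
                          (int)(block.size() * sizeof(float)), SQLITE_STATIC);
        sqlite3_step(ins);
        sqlite3_reset(ins);
    }
    sqlite3_exec(db, "COMMIT", nullptr, nullptr, nullptr);

    sqlite3_finalize(ins);
    sqlite3_close(db);
}
```

Batching the INSERTs in a transaction is what keeps it cheap; SQLite's journaling is what makes it safe across crashes and power loss.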