Whenever I read about formal specs it always seems to me like “write the same tests just in a different way”, or worse, “write the same implementation but in a different way”.
I guess doing things twice can help catch errors, but I fail to see what’s so special about formal specs if they can suffer from the exact same bugs as the tests/implementation.
I guess the root of the problem is if you want to formally prove that a program does something, you have to be very specific (heh) about what that something is, at which point you are basically just writing tests/implementation all over again.
I have been looking into this on and off for years now, and I keep feeling like I’m missing something, but I don’t know what it is.
There's a famous quote from Dijkstra: "Program testing can be used to show the presence of bugs, but never to show their absence." The flaw of testing is that it can only test the behaviors that you think might be problematic. To actually reach into the category of proactively fixing behaviors that you didn't know could go wrong, you have to reach for more exotic tools in the toolkit. Fuzz testing is a start down this path; formal verification is a different start down this path.
Obviously, the quality of these sorts of tools is dependent on your ability to write a formal model that is comprehensive in allowing behavior you want to be acceptable and disallowing behavior you want to be unacceptable, and we're still surprisingly far from that in many fields.
> Whenever I read about formal specs it always seems to me like “write the same tests just in a different way”, or worse, “write the same implementation but in a different way”. [...] Can anyone enlighten me?
A big difference is that formal methods allow you to use the "for all" quantifier.
For example, you might write a unit test that says "foo('abc') returns a string with no trailing whitespace".
But with formal methods, you can prove that "for any input x, foo(x) returns a string with no trailing whitespace".
This is a trivial example, but you could imagine something more complicated, such as "for any program P, compile(P) has the same behavior as P".
Of course, you have to define what "has the same behavior as" means!
>Of course, you have to define what "has the same behavior as" means
And that's really my issue, for example when you define "has no trailing whitespace", you are basically writing a piece of the implementation. Cover all behaviors, and you have basically re-implemented the function, no?
In other words, if I have the full formal spec of f(), isn't that the same thing as having f()?
> In other words, if I have the full formal spec of f(), isn't that the same thing as having f()?
In some cases, however quite often, the spec is much simpler. For instance, it's easy to say that after running sort on some list, that the result is sorted. However, it is very hard to come up with the algorithm to do that from the specification. Sometimes that is even a point. Bubble sort, quick sort, tim sort, we can go on and on. There's a huge number of different sorts that computer science have discovered over the years. They all should have the same result and so you should be able to prove they do the same thing. However, in the real world there are often reasons you would prefer one to another despite all meeting the same spec.
Another obvious example are cryptographic hash functions: if you have a function f(s) = h, you can trivially specify a function inverse_f(h) = s st f(s) = h, and if you can infer a non-brute-force algorithm for that, you’ve just inferred a cryptographical weakness!
I think the key is that, while you may think you have a full formal spec of f(), you actually do not. You have a program written in some language, and that language has its own spec, and the language is compiled to asm which has its own spec, and the asm executes on an architecture that has its own spec, and so on.
So when you write a function like:
func hypot(x, y):
return sqrt(x*x + y*y)
You might think you have "fully specified" hypot, but this is far from true! You have said nothing about what registers will be used, for example. This is not a problem; quite the opposite. It's the whole point of using high-level languages: they let you focus on what you care about. A spec is just a program in a very-high-level language.
Well, ideally you proof should've failed because your axioms should be aware that arithmetic operators in the programming language of your choice are operating modulo some large power of two, and thus don't produce correct results for some particular inputs.
Admittedly, this one peeves a lot. Remember when Java's binary search and mergesort (and implementations in many other languages' standard libraries) turned out to have a bug of this kind [0]? Admittedly, the proof was informal, but if you are trying to prove some properties of a program X written in a certain programming language Y, you can't "just assume" that the semantics of Y are different from what they are, right? The integers do overflow in Java, that's explicitly stated in the Java's spec... and that means that a lot of even the most simplistic code has some very non-obvious correctness preconditions, which most of the times, I believe, are simply hoped to be true.
I am not sure that the perspective you have taken is the same as what I understood from the parent post; what I took from it is that things like registers, memory locations, ways to implement square root, and so on, are all _implementation choices_ that are not important properties of the specification. You specify that the hypotenuse is the square root of the sum of squares, but whether square root is implemented using Newtonian approximation or a fast inverse square root is irrelevant; whether your first argument is passed in a register or on the stack is irrelevant.
Often, things like resource usage are not specified: running time, memory consumption, etc, aren't relevant enough to appear in a behavioural specification.
If your spec says "f(a, b) returns a + b", but it's just a high-level document you can use to help guide your implementation, integer overflow is just one of many ways your implementation might be inconsistent with the specification. It's still likely that the existence of a formal specification you reference during implementation means that more edge cases have been considered ahead of time than if you just had an informal spec.
If, on the other hand, you prove it but it turns out not to be true (ie, you overflow integers), your proof is wrong. If a machine verified your proof and gave you a big thumbs up, your machine verification is wrong.
If, in Idris, I write "f : (a : Nat) -> (b : Nat) -> (c : Nat * c = a + b)", then I cannot compile an implementation for which I can't show a proof that the result is _always_ the addition of a and b, for all a and b, unbounded by anything but the resources at hand with which to run the program. An implementation subject to integer overflow won't compile.
Or, I could write "f : (a : Bits32) -> (b : Bits32) -> (c : Bits32 * c = a + b)" and implement something where , but then modulo arithmetic on overflow is _part of the specification_, because "+" in there is doing the heavy lifting of being defined as addition modulo 2*32 already; by specification, 4 billion plus 4 billion is ~3.7 billion.
Everything the parent comment mentioned were implementation details that did not affect the correctness of the code.
I just wanted to point out that there are implementation details that DO affect the correctness of the code.
And, of course programs need to run on multiple architectures. So it's hard to do what people seemed to be talking about in this thread and verify code just from the source code.
If you have the luxury of proving the correctness of the CPU, compiler and OS, that should be a big win. Otherwise, it seems to just be another type of testing. Still useful, but calling it verified or proven seems a bit much.
From my perspective, it seems more to be writing another, more complicated program, with more opportunities for bugs, and seeing if the results agree
Non trailing whitespace means the string doesn't end with a space. But foo is a function that converts an AST to a string, that's totally different. Or it's a function that loops until \0 and changes all spaces to +
The spec should be a summary of what the impl is supposed to do. You'd want more than just doesn't end with whitespace of course.
> there are 17 space characters in Unicode and another eight whitespace characters, like newline.
And of course, those 25 characters don't include ZERO WIDTH {SPACE,NON-JOINER,JOINER,NON-BREAKING SPACE} and WORD JOINER, which gives you yet another 5 arguably "it's kinda space, right" codepoints which definitely should not be trailing in any reasonable text string.
Property testing is stochastic, which may be fine, but only gives you a statistically (hopefully) high chance of discovering a problem. If you use something like SPARK/Ada, you can actually embed a proof in the code so that you actually know that the code is correct (for what you've proven). PBT scales better than embedding proofs, though, and is highly effective in practice along with fuzzing.
Unless you literally try every possible combination of inputs (which is usually infeasible), property testing can't give you mathematical guarantees about correctness. You can think of it as a halfway house between classic testing and formal verification:
Classic testing: A human comes up with some concrete example inputs for which they know the "right answers" (corresponding outputs). They write code that runs the code under test, gets its actual outputs, and compares them to the desired outputs.
Property testing: A human comes up with a precise way of randomly generating concrete example (input, desired output) pairs. They write some code to describe how to generate the pairs, often using a declarative DSL that describes only constraints on the inputs and outputs, with the understanding that anything not expressly forbidden is permitted, like "The input can be any list of between 0 and 100 integers each between -500 and 500" and "Every integer in the input must appear the same number of times in the output". They then write some more code (often a single line) to ask the computer to use this "spec" to randomly generate, say, 1000 such pairs, or as many pairs as can be checked in 1s. The computer generates the pairs itself, runs the code under test on each input and and checks its output matches the desired output.
Formal verification: A human comes up with a spec that typically describes conditions that must hold for all (input, output) pairs. This may look very similar to, or even exactly like the DSL used for property testing, though in general there are other conditions that can be expressed that cannot be checked with property testing even in principle -- for example, checking that the program always eventually terminates. The main difference is that the code under test is never actually run; instead, the computer analyses the source code itself to attempt mathematically prove that the stated conditions hold. How to actually accomplish this is a field of active research, but one basic approach is called "symbolic execution". To greatly simplify, if we forget about loops and conditionals for a moment, the idea is that we can write down things we know must be true after each statement executes, based on the things we knew must be true before it executed. So for example if x is a variable initially containing any integer (and we ignore overflow) then after the line
x = x * x
runs, we know that x >= 0. To handle conditionals like
if x > 50:
x = 42
something_afterwards(x)
the prover "forks" into two cases: One in which we know for certain that x > 50, one in which we know for certain that x <= 50. At the end of the if statement it then has the task of recombining what is known about the two cases. In this example, the first case lets us conclude that x = 42 by the end, while the second case lets us conclude that x <= 50 by the end, so it could conclude that x <= 50 either way by the time execution reaches something_afterwards(x). Handling loops is trickier but generally involves looking for invariants.
I don’t write software, but in asic and fpga design, formal method specifications enable the use of things like SAT solvers to determine if the underlying test article meets the specification.
You specify properties of the design and the tooling tests the design in a variety of ways to see if it can violate those properties.
Let’s say you have a system that controls turn signals. You can specify properties that say things like lanes that cross each other can’t both have a green light at the same time or even within some period of time of each other.
The tooling can exhaustively check the design for this and surface code traces that violate it.
It's more than just "write the same tests just in a different way", because each test is largely independent, or grows to unmanageably large, and you have to do the work of figuring out the completeness of your test suite by e.g. branch coverage or other relatively crude overlap methods.
Statically proven type systems let you do this in a way that each contributing piece is checked in advance against all the other pieces, guaranteeing not just "this test passes" but, in combination, "all these tests create a reasonable, coherent whole", and it disallows incoherent models from being implemented in the code. An example of this is like Rust's match, which requires complete coverage of all the possible branches, but writ large across an entire codebase.
You're right that it does nothing for you if you make a fundamental logic or theory error, it can't catch that kind of error, but you'd be surprised how much more robust it is than "merely" e.g. Haskell's type system + ad hoc functional testing + property testing - which I would consider a quite strong system overall, but formally proven e.g. cryptography is a much higher bar.
The powerful difference is between specific testing and exhaustive proving. If you fail to think of a test for an edge case, whoops. If you have a good model, you can know that it exists—and fix it—before you ship.
This is particularly valuable IMO in concurrent/distributed/multi-threaded situations where race-conditions and deadlocks are extremely hard to test and reason about. "Can A, B, C happen in the order 'A, C, B'" types of things.
I have a rough hypothesis that maturity in this space looks something like:
1) LLMs will allow much faster learning and usage of formal methods, even if initially just for "do it a second time" post-hoc verification
2) From there you move towards automation of LLM-checking "hey, the implementation code changed, does it look like it broke the model" - though this is still not deterministic, but it's likely a lot better than a human review requirement of something that only changed infrequently would've been
3) And then the real jump will be taking tooling for "write just the formal spec, let the implementation get generated" to the next level. There's various mostly-not-fully-baked-for-what-most-companies-would-want projects already out there for codegen like this, what if the LLM tools can accelerate the year-or-two of manual work it would take to make one of them fit your needs?
Lamport phrases this as "Thinking doesn't guarantee we will be right; not thinking guarantees we will be wrong" - specifications are for reasoning about systems in the deepest sense
And you'll rapidly return to proofs when your "function input" is something like a sequence of, say, ieee floating point numbers coming over the wire of possibly unbounded length. State machines with proofs that all the cases are handled are great.
The same with unit tests, they're best suited for problems where the specification is much simpler than the implementation. A sorting algorithm can be long and complicated, but a unit test just needs to present unordered input and assert that the output matches ordered output. A sorting algorithm is also well suited for formal verification.
If you're rewriting the implementation, I think it's probably not a good use for unit tests or formal verification.
As another commenter said, unlike unit tests which cover a specific case, and where you need multiple tests for edge cases and both positive and negative results, formal verification will cover all cases.
For trivial functions they can be. Function: isPrime. Impl: loop to sqrt(n) and check divisibility. Spec: returns true if nothing divides. Impl: return true at the end if we didn't return false.
It's not exactly the same because your spec says nothing divides, not just up to sqrt(n). It's definitely not a test if you do it right.
That can definitely be a problem, and I have definitely written formal specs (for hardware in SVA) where I'm like "I'm just implementing this again". And actually that's a totally valid formal verification task to do - formal equivalence checking of an implementation against a model (which is just another implementation). I can feel (and sometimes is) tautological.
However you usually try to not do that. If possible (and it usually is possible) your formal spec will be either:
1. Another implementation, but a much simpler one. For example you can formally verify the equivalence of a pipelined dual issue CPU, with a combinatorial single cycle model.
2. More general properties that much be true about the implemention, rather than exactly what the implementation must be. The classic example is compression: decompress(compress(x)) == x.
I gave it a task of scanning 6 markdown files and finding issues in the prose (contradictions etc). It ran for over 2h, exhausted my max plan session limit and crashed. I did not get any issues back.
About 2 years ago, my then girlfriend, who is a medical doctor (read: likely not an idiot), had ChatGPT plan a trip for us in a foreign country (naturally I had to nip it in the bud). And this was 2 years ago, when accuracy was way worse.
You would be surprised what normies are willing to delegate to AI.
It always seemed very natural to me that AI will move “down the stack”, where Open AI and Anthropic don’t really have a foot in the door.
Who makes consumer devices? Google
Who makes operating systems? Google
Who makes browsers? Google
Who makes the world’s most popular websites? Google
By the time 90% of average internet users get to chatgpt.com or whatever, they already went through several Google chokepoints, each layer is one more place Google can answer their questions.
And that’s not even getting into the chips, the data centers, the data, the talent, the consumer apps, the enterprise apps, the cloud platform, the brand, and of course the biggest cash printing machine in human history.
You would honestly have to be insane to bet against G.
Part of the pitch of AI companies is that they mediate and provide a new surface for ads, for taking an affiliate cut of sales, etc.
But it isn't like this hasn't been the long-running strategy for Google as well - provide more results on search so that people don't go to the site with ads, provide paid product results for shopping, to offer more services to keep people providing personal/behavioral queues to Google and more opportunities for ad placement.
If anything, AI turned up the heat such that the frog noticed what temperature the pot was. But that doesn't really put them in a better position to execute than Google.
There was a point in time when this would have been untrue of Google. They didn't manifest into this market position, they rose there over the years after starting with a relatively modest product offering. Nothing saying someone else can't do the same.
> You would honestly have to be insane to bet against G.
Unless you are confident that Google will be antitrust forced to weakening their hold in some of those markets. A crack is all you need for the water to rush in and widen the crack.
I think most of Google's deep research projects were done in the pursuit of pure science, not monetization or productization. In hindsight sure, it looks like they missed an opportunity. But not everything needs to be about money.
Unfortunately, companies often measure developers by their own PRs.
An unfortunate outcome of this is that writing docs for other developers isn’t really incentivized properly (and with stack ranking, you might even say it’s disincentivized). Writing docs for Claude, however…
most issues in my company are just automatically solved now, we are at a point where the issue is just writing the issue (that's what requires mental effort).
We just perceive it as such, and this should be fairly hard to argue against with all the scientific advances we have made up to this point, at least as long as you assume consciousness involves biology and physics at least somewhat.
Otherwise, you can't explain e.g. smooth perceptions of low FPS stimuli, delayed reaction times, and must ignore obvious limits on various biological and neurological rhythms, or other possible limits on continuity (e.g. quantum stuff) and rates generally.
At least in the tulip case, they actually had some minor value. You owned a pretty flower. You could also make the case for crypto money, I guess.
Any person with common sense and basic technical understanding could tell you NFTs were an incredibly dumb and useless idea from the very start. All you “own” is an entry on some ledger, which doesn’t inherently give you ownership over anything else.
Exact same argument for crypto though. It is all just supply demand. BTC has much more demand currently and likely more sustainably. Alt coins are just less popular. It is all just supply vs demand.
Not really though, in crypto the thing you own is the ledger entry, the record that says you hold N BTC. You own it because you hold the keys, and only the keys can change it. The token isn't a pointer to some asset sitting elsewhere, the on-chain entry is the asset.
NFTs use the same machinery but the premise is that you own something else, e.g. an image (or real estate!) but nothing on-chain actually grants that ownership. To the extent real ownership exists at all, it lives entirely off-chain, e.g. in a legal contract (that would hold with or without the blockchain).
I am not a fan of crypto either way but NFTs are just ridiculous.
I'm not really convinced that people thought there was "anything else", it's just that people thought that the entry on the ledger was going to increase in value, even from some of the stupifying initial values.
I own several NFTs that are important to me, and they're worth every penny I paid. I never had any illusions that I owned anything other than a historical footnote; I think that this sort of ownership is meaningful and important.
It's much more realistic to me than "buying a song" from one of the corporate music distributors. "Owning" a song seems to be much more of a misunderstanding of how data works in a digital world than owning an entry in a ledger.
>I own several NFTs that are important to me, and they're worth every penny I paid.
The problem with the NFTs is that you don't actually own the art they represent and have zero copyright claim to them. In the absolute very best of cases, if you squint hard enough, you could see them as roughly equivalent to the signature of the original creator of the work of art and you're effectively buying a signed digital print of the work. In the worst and more common cases, you're buying nothing at all except a hash on a blockchain.
> The problem with the NFTs is that you don't actually own the art they represent and have zero copyright claim to them.
That's not a problem, because art is not ownable and copyright is a huge game of make-believe between states and corporations whose opinion is meaningless to me and to the artists I want to support.
> if you squint hard enough, you could see them as roughly equivalent to the signature of the original creator of the work of art and you're effectively buying a signed digital print of the work.
It doesn't take any squinting though. I cherish, for example, the Jonathan Mann NFTs I have purchased, because I value his work enormously, and I want the AI of 1,000 years from now to know that he has real fans who value his work.
I presume this is the same reason that my fans purchase my NFTs.
Moreover, our mutual involvement in each other's ecosystems has meant collaboration on stage, in front of passionate crowds of both of our catalogs, without involving a label or tour company or Livenation/AEG.
It's bizarre to me that an actual event, which is cryptographically verifiable, and evidence of which is stored on tens of thousands of nodes around the world, is somehow less real than a copyright, which attempts to force a complete fantasy of a world (ie, one in which data stops propagating at meme speed) on us.
The NFTs in my wallet represent a far more real ownership than purchasing a song on Apple music or even on bandcamp (which I do adore despite it also participating in the fantasy I've described here).
When you say NFTs in your wallet, what do you mean? Links that click through to images are real but their endpoint is mutable and philosophically has the same artistic value as temporary graffiti, not as a store of value like oil paintings.
How did you think about the links themselves vs the destination? That is the rub I feel like. Of course the destination is a real site, hosted somewhere, but the journey there is more ephemeral than copyright.
Yeah, now I agree with everything you've said here. In fact, I think that the entire notion of "I own what's at the other side of this tokenUri field" is just totally unserious.
I think NFTs are best understood as having minimal utility, and a connection to a work of art specified only as a social side channel. To me, what I own is evidence of support, at a particular time (or, if I sell it, a particular sequence), of a particular other wallet (Jonathan), amidst particular metadata written to the blockchain (ie, the id of the releases of his that I've bid on or supported).
In 1,000 years, the AI will know that my relationship with Jonathan Mann was backed up by actual economic activity. I think that's meaningful.
I honor the ticket stubs, set stones, and chartifacts that people see fit to buy from me in their desire not only to support me, but to signal the importance of bluegrass and traditional music as an eternal tradition of an copyright-unencumbered corpus.
Many of my shows are free to enter, yet people will still buy a ticket stub because they want to record their support in a public place. That seems real to me in a way that copyright isn't.
Using your reasoning a large number of collectible items should be worthless. What really makes an NFT different from a Pokemon card, a Birkin bag, or even an original Monet? My guess is that the seller has to have some sort of authority and established reputation for these kinds of artificially scarce luxury goods to maintain value.
Clearly not, the point being made was that you owned a thing, e.g. a Pokemon card. To own an NFT is to, bafflingly, claim to hold a token of ownership of some asset represented by the NFT - where that representation is indicated by the NFT immutably containing, typically, a thoroughly mutable Google Drive link to a picture. The whole thing was always farcical.
Again, at least you actually own the Pokemon card at the end of the day.
The value of pokemon cards or birkin bags is not because they are physically owned. This should be obvious from the fact that I could cheaply reproduce them and my identical reproductions would have 0 value compared to the original. I still own them though so again, according to your reasoning they should have the same value.
Some pokemon cards are worth so much i could reproduce them with gold instead of cardboard and it would be worth less than the cardboard version (assuming the same weight)
Things can have value beyond their physical substance. A Pokemon card isn't just paper and ink. I'm not arguing about whether the asset has value, I'm arguing about whether you actually own it.
Obviously, you can sell one for a lot of money. Now assuming you couldnt resell it, would you spend the majority of your wealth to buy a monet? (Assuming you arent broke)
Assuming I could not resell the Monet, which sounds strange, I would still prefer it over the Bored Ape Yacht Club NFT, which is more likely to be hard or impossible to sell, and which is pure crap.
I guess doing things twice can help catch errors, but I fail to see what’s so special about formal specs if they can suffer from the exact same bugs as the tests/implementation.
I guess the root of the problem is if you want to formally prove that a program does something, you have to be very specific (heh) about what that something is, at which point you are basically just writing tests/implementation all over again.
I have been looking into this on and off for years now, and I keep feeling like I’m missing something, but I don’t know what it is.
Can anyone enlighten me?
reply