When I got fed up with the way the OS handled writes 15 years ago in one of my (ordinary) applications, I wrote to a temp file in the same directory, closed the file, did the atomic filename swap, and then verified the integrity of the new file under its new name by opening it again and comparing it byte for byte to the content to be stored. Only after that did the operation count as successful. On failure, the file names were swapped back and an error was signaled to the user. The temp file remained on disk to avoid further data loss.
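For anyone who wants to see the shape of that, here is a rough Python sketch of the write, rename, read back, roll back sequence. It is not the original code: the function name `save_verified` and the `.tmp`/`.bak` suffixes are made up for illustration, the `os.fsync` call is an addition the description above does not mention, and the two renames are not a single atomic swap (there is a short window where the target name does not exist).

```python
import os

def save_verified(path, data):
    """Write data to path via a temp file, then read it back and compare.

    Hypothetical helper: the name and the .tmp/.bak suffixes are assumptions.
    """
    tmp = path + ".tmp"
    bak = path + ".bak"

    # 1. Write the new content to a temp file in the same directory.
    with open(tmp, "wb") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())  # ask the OS to push the data to the device

    # 2. Move the old version aside (if any), then move the temp file in.
    had_old = os.path.exists(path)
    if had_old:
        os.replace(path, bak)
    os.replace(tmp, path)

    # 3. Read the file back under its new name and compare byte for byte.
    #    Note: this read may well be served from the OS page cache.
    with open(path, "rb") as f:
        ok = f.read() == data

    if not ok:
        # 4. Roll back: keep the suspect file as .tmp, restore the old version.
        os.replace(path, tmp)
        if had_old:
            os.replace(bak, path)
        raise OSError("read-back verification failed for " + path)

    if had_old:
        os.remove(bak)
```

Calling `save_verified("settings.dat", payload)` either leaves the new content under the target name or raises and leaves the previous version in place, with the failed write kept next to it as `settings.dat.tmp`.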
Are you guaranteed that the data you read back didn't come from kernel caches of the file system blocks? How do you know the data actually came back from physical storage?
Yes, but why can't we add an API, all the way down to the lowest levels of the hardware, that says "once I say this write is done, it will be readable again even after a power cycle"?
Achieving that guarantee is not impossible, but it needs to be explicit. It can't be inferred from other API calls.
The behavior of fsync is a battle of OS developers and OEMs versus application and database developers. Gaming the implementation of fsync so that it does less than a full sync makes benchmarks look amazing, improves user-perceived latency, and reduces flash wear. On the other hand, it corrupts data - but that's rare, "the hardware is failing anyway", etc.
That's how you end up with stuff like OS X's "no really, fsync" param [1], or Motorola shipping nobarrier on their phones. [2]
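The "no really, fsync" knob is presumably the `F_FULLFSYNC` fcntl on macOS (that is my reading of the comment, not something taken from the linked reference). A minimal sketch of how an application might ask for it from Python, falling back to plain `os.fsync` elsewhere; the helper name `full_sync` is made up:

```python
import os

try:
    import fcntl  # not available on Windows
except ImportError:
    fcntl = None

def full_sync(fd):
    """Request the strongest flush available; return which call was used.

    Hypothetical helper. On macOS, Python exposes fcntl.F_FULLFSYNC, which
    asks the drive to flush its own cache as well. Elsewhere we fall back
    to os.fsync and get whatever the platform's fsync actually delivers.
    """
    if fcntl is not None and hasattr(fcntl, "F_FULLFSYNC"):
        try:
            fcntl.fcntl(fd, fcntl.F_FULLFSYNC)
            return "F_FULLFSYNC"
        except OSError:
            pass  # some filesystems reject it; fall through to fsync
    os.fsync(fd)
    return "fsync"
```

Whether the device honors the flush is still up to the firmware, which is the whole complaint in this thread.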
What you call "broken" others call "performance choices". Except SSD controllers, they are big fat phonies.
Look at how much slower CPUs are without those speculative execution tricks. I can buy gigabytes of RAM with a 20 I found in an old jacket. You can explore entire worlds consisting of gigabytes of high res textures and mesh data in near real-time, while downloading 4 new albums off the internet.
Yes, it is breaking the promise of what it is supposed to do. If fsync() was defined as "will ensure your data is on disk, unless that's kinda slow, then who knows", then the behaviour would not be broken, just potentially useless for many applications. But if you promise to ensure something is stored on disk, and then don't, that's the definition of being broken.
This. It's like those counterfeit memory cards you could get on eBay that had a 128 MB chip but reported themselves as 8 GB or so. "Works perfectly if you stay below 128 MB!"
On Windows there's FILE_FLAG_WRITE_THROUGH ("Write operations will not go through any intermediate cache, they will go directly to disk").
It's all there.
I agree with the other poster: just because you read the file back and it compared byte for byte doesn't mean it came from the disk. Unless the file is large, it's likely to have come from the OS's RAM file cache.
I suppose that's not fast enough for a DB.
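On the page cache point, one partial mitigation on Linux is to hint the kernel to drop its cached pages for the file before reading it back. A sketch, assuming the data was already fsynced (dirty pages are not dropped); the function name is made up, `os.posix_fadvise` is not available on every platform, and the call is only advisory, so a matching read still does not prove the bytes came off the physical medium (the drive's own cache can answer too):

```python
import os

def read_back_with_cache_hint(path):
    """Read a file after asking the kernel to discard its cached pages.

    Hypothetical helper. POSIX_FADV_DONTNEED is a hint, not a guarantee.
    """
    fd = os.open(path, os.O_RDONLY)
    try:
        if hasattr(os, "posix_fadvise"):
            # offset 0, length 0 means "the whole file"
            os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)
        chunks = []
        while True:
            chunk = os.read(fd, 1 << 16)
            if not chunk:
                break
            chunks.append(chunk)
        return b"".join(chunks)
    finally:
        os.close(fd)
```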