As someone currently fighting a battle with an LoB application written without foreign-key constraints with hundreds of thousands of rows of corrupted data because of bugs in sprocs that assigned the wrong value to the wrong foreign key column because they were similarly named - THIS!
The reply in the GitHub thread we’re talking about makes it clear that they still perform FK validation - it’s just performed in the application code rather than the DBMS.
I note there is another alternative: deferred constraints - or just run a query to check for invalid rows at 3am every morning.
I’m pretty confident that GitHub doesn’t use foreign keys because it was built as a Rails app. And the “Rails Way” is to create these constraints in the model. Foreign key constraints weren’t a first-class member in Rails until v4 (if memory serves correctly).
I was once a full-time Rails dev and really loved the framework (I don’t write as many user facing applications these days). Most of the Omakase trade offs didn’t bother me. But I never understood the disdain for foreign keys. For 99.99% of web apps, you want them (dare I say, need them).
Even in Rails 3 I would add them by hand in the migration files. Very few applications will ever actually care about sharding. We were pulling in millions at the last Rails company I worked for and were just fine with a master and a single replica.
If you get to a point where sharding is best for your company you hopefully have enough revenue coming in to fund a data transition. Your goal should be to outgrow what MySQL (or Postgres) can do for you in master/replica mode. If you do, you’ll likely be independently wealthy...
If data integrity matters at all, model-based checks (or periodic queries for orphaned data) will not suffice. Just put a foreign key in the table where it belongs and let your DB do what it does best. ACID is an amazing thing...
> I’m pretty confident that GitHub doesn’t use foreign keys because it was built as a Rails app
Maybe originally, but lack of foreign key usage is certainly not Rails specific today. Large MySQL shops generally don't use foreign keys, full stop, for the exact reasons Shlomi described in the original comment.
Facebook does not use foreign keys either. In my experience, same thing is true at all the other large MySQL-based companies. And those companies make up a majority of the largest sites on the net btw -- including most of the consumer-facing social networking and user-generated content products/apps out there.
This does not mean that, for example, Facebook is full of data that would violate constraints. There are other asynchronous processes that detect and handle such problems.
> If you get to a point where sharding is best for your company you hopefully have enough revenue coming in to fund a data transition.
Do you mean, stay with your current RDBMS of choice and shard while also removing the FKs? Or do you mean transition to some other system? (and if so, what?)
The former path is bad: sharding alone is very painful even without introducing a ton of new application-level constraint logic at the same time.
The latter path is also bad: only very recent NewSQL DBs have sharding built-in, and you probably don't want to bet your existing rocket ship startup on your ability to transition to one under pressure. Especially given much higher latency profiles of those DBs, meaning a very thorough caching system is also required, and now you have all the fun of figuring out multi-region read-after-write cache consistency while also transitioning to an entirely new DB :)
“[S]harding alone is very painful even without introducing a ton of new application-level constraint logic at the same time.”
I typically espouse app logic to check state and use foreign keys to ensure enforcement (eg, race conditions that are very hard to ensure at the app level but are built into many RDBMS). Foreign key failures are just treated like any other failure mode.
But honestly, I haven’t been part of a company that have really hit those upper limits that require sharding. They exist, yes. But most companies will never need to worry about it. Which is my point.
I agree that most companies won't ever need to shard, and pre-sharding (or worrying about extreme scalability challenges in general) is usually unwise premature optimization. But it does depend on the product category.
Social networking and user-generated content products (including GitHub) need to be built with scale somewhat in mind: if the product reaches hockey-stick growth, a lot of scalability work will need to be completed very quickly or the viral opportunity is lost and the company will fail. I'm not saying they should pre-shard, but it does make sense to skip FKs with future sharding in mind.
This was nearly a decade ago, but in one case I helped shard my employer's DBs (and roll out the related application-level changes) with literally only a few hours to spare before the main unsharded DB's drives filled up. If we also had to deal with removing FKs at the same time, we definitely wouldn't have made it in time and the product literally would have gone read-only for days or weeks, probably killing the company. Granted, these situations are incredibly rare, but they really do happen!
It’s at its core a case by case decision, I think FK are also a net negative in data ingestion scenari where the data set is big enough.
Trying to make sure everything is where it needs to be at any given time, everything is inserted in the right order and the data is always consistent brings exponential amount of conplexity when it could all be checked at the end and pruned for invalid data. And usually DB integrity will not be enough, you’ll want business level validation that all is OK, so there will be app level checks anyway.
This looks nice, it’s still limit to a single commit though. That’s where it’s a PITA for anything that won’t (or we don’t want to) fit a single commit.
In particular splitting commits allows to ingest data in parralel (for instance if we import stores and store owners, both could be ingested separately without caring at first if each references a valid entity)
At least with mysql, and probably with postgres as well, you can temporarily turn off foreign key checks for a set of statements. So you can still get the benefits of foreign key constraints by default but when they do more harm than good you can turn them off. With the added benefit that turning off FK constraints screams "I am doing something unusual and dangerous - this requires extra caution."
App level data integrity is usually critical. Ingestion into a reporting database is an entirely different construct (where I agree FK constraints are burdensome).
It's been a while, but IIRC at the time Rails got started MySQL actually did not even support foreign key constraints. Since that was the DBMS of choice, it wasn't much of choice.
InnoDB's foreign key support predates the existence of Rails by several years. However, InnoDB wasn't the default storage engine for MySQL at the time, so that may be a factor.
I wonder how different the state of database application development would be today if all those cheap whitelabel webhosts powered by cPanel or Plesk (where most of us got started, I imagine) opted for PostgreSQL instead of MySQL - which would have influenced the major MySQL adopters like phpNuke, phpBB, WordPress, etc.
The application code isn’t enforcing FK constraints because that is impossible for the application to do. Their database is almost 100% guaranteed to be corrupt as a result.
Application code has bugs. Application code can fail in ways that result in corruption. A primary job of the database is to keep itself from getting corrupted. Enforcing foreign key violations is only something the database can do correctly. Punting the responsibility to a higher layer will result in corruption.
> A primary job of the database is to keep itself from getting corrupted
That would be where the road splits. If you handle the database as a very fast and structured storage application, a lot of these assumptions go away, with then a different set of tradeoffs.
Some data corruption could be fine if you can guarantee the critical cases, just as bugs in the code are fine as long as the useful cases are covered.
I remember a database with scheduling entries in it that could get duplicated depending on the sharding, but it didn’t matter because the app handled the case gracefully.
I think understanding the tradeoffs that match the best the use case is the most important point, always assuming that a DB has to guarantee integrity can be a burden preventing from looking at all the options.
There are 100% foreign key violations in their database. That is not the same as their database being corrupt.
They have engineered for, and understand the implications of, foreign key violations. Typically, it's as simple as "This row can be deleted", and that can cascade - at a totally different rate than you'd find in a database and with totally different performance characteristics.
Foreign key violations are data corruption!!! It violates the rules of how the data relates and can and will screw up any number of things that depend on the rules being enforced.
Reporting and bi data might get hosed.
Account management might get hosed.
Who knows what happens when FK rules are violated because by definition they should never be violated. It puts all applications on top into a undefined state, leading to bugs and god knows what else.
You're missing an important point, it violates a certain set of rules of how the data relates. That certain set of rules typically makes it easier to write most general data applications.
If your rules define a foreign key as optional, you can easily engineer an application around it.
Great examples include NoSQL and Ruby duck-typing. In both cases, you infer actions based on the data structure, rather than making explicit assumptions.
The whole fact that you made a foreign key implies that that is now a rule of your database. If anything is in that column that does not adhere to the foreign key, that means your database is corrupted.
Then you have an implicit rule that is enforced by a hope and a prayer that some junior dev never makes a mistake, your senior engineers are clairvoyant and understand every single aspect of your systems 100% with zero off-days, your code review process catches every single possible edge case (especially the edge cases that you never knew existed), your QA process is 100% and never makes mistakes, your servers never crash in ways that leave things in an inconsistent state, etc.
Or you could, you know, simply add a foreign key constraint and never, ever, ever have corrupt table relations.
Why people fight their tools is beyond me. There is almost zero reason to defend not using foreign keys.
i suggest being open to the concept that other perfectly capable humans may, in fact, design systems with different underpinning assumptions than those you appear to presuppose always must apply to everyone and everything.
It's important to remove the DBA or Developer hat and realize that you are working together on a singular system (or maybe it's even the same person, etc.)
"corruption" implies that the desired results have not been achieved, that essential data has been lost or compromised, and this clearly is not the case with such a deliberate design decision. one should be prepared to accept this possibility in order to not be merely a fanatic and unreasonable.
These are self-imposed rules, so breaking them is wholly up to yourself. And there are trade-offs involved that may often make it acceptable relax them.
You seem to be rather invested in one set of arbitrary definitions. If people do fine even in situations where they "by definition" shouldn't, it's the definitions that have been found lacking, not the people.
You can keep saying that, but it does not make it true. Properly modeled data does not need those constraints. Well written software handles these correctly. Mediocrely written software fails and complains loudly. Badly written software might get hosed. You're at just as much risk (or more, in my opinion) of that with bad schema changes as with not having foreign key constraints.
Do you not check input from your javascript front-end before you save it? Even if that front-end does its own validation? (No, you see, but there is only one web front-end.... we don't need the backend to validate the input!!!!)
In what way is letting the database ensure it isn't getting fed crap any different?
Why do developers constantly think it is okay to let unvalidated user input hit their database? Any client calling your database is a hostile client that will feed your database bad data. Arguing otherwise is complete ignorance.
> In what way is letting the database ensure it isn't getting fed crap any different?
Because one canonical validation layer is generally enough. You can see how having two separate partial validation layers could cause problems, right? And you can't put all the validation in the database, for non-trivial apps.
And "clients" shouldn't be talking to the database, no matter if you have foreign keys or not. That's a totally separate issue.
LOL, if I read the data directly from the db instead of via the application's API then sure, I lose the application's guarantees. But, y'know, same might be said if I just go and read the DBs files from a sidecar shell script or something.
> It puts all applications on top into a undefined state
If you want to operate assuming your data is always corrupt because your engineers don't understand how to use the tools provided by their database.... Seems like an awful lot of work to re-invent a wheel that your DB server can solve for you. I guess that is on you though....
1. Over my career almost none of it has been debugging problems due to a lack of foreign keys.
Further to that, instead of having lazy cascading deletes moving the goal posts on you. Data missing its parent for example is an indication something is wrong. It's a useful diagnostic.
2. Most string data is safe to represent as empty. Handling null on the other hand is a different situation and often induces warnings or other side effects depending on the language. Forcing the null checking to your boundaries is much like forcing your state/mutation to the boundaries in FP. Hell eveb though I think it's insane, hexagonal arch works on this idea as well.
> or just run a query to check for invalid rows at 3am every morning
One of the nice things you get from database FK constraints is when something messes up the data, you get an exception thrown in the code that's messing it up, complete with a backtrace and whatever inputs and timestamps you care to log.
The reply in the GitHub thread we’re talking about makes it clear that they still perform FK validation - it’s just performed in the application code rather than the DBMS.
I note there is another alternative: deferred constraints - or just run a query to check for invalid rows at 3am every morning.