> Whenever a server failed and restarted, it would pick up processing from the last successful commit. If a server did not restart, the worker count would be updated and remaining workers would redistribute the partitions among themselves based on the shard keys of the records.
You don't describe how you come to a global view of which is the "last successful commit". Do you mean that the co-ordinator recovers (such as from its own log) and uses that information, or do you mean you use local information? In the latter case, what you describe doesn't sound like it preserves transaction atomicity across node crashes.
In RethinkDB, a table can be split into shards, and each shard has at most one master and multiple replicas. With one master performing all writes for a shard, there is no need for a co-ordinator. If the master of a shard goes down, a replica can be promoted to master. For writes, RethinkDB prioritizes consistency over availability, so a server failure may mean some downtime for writes, but reads can be configured to be highly available (though they could potentially be out of date).
But there is definitely a downside: if a RethinkDB master server fails immediately after a write (before the write has propagated to a replica) and does not recover, some data can be lost (though only the most recent writes on that shard).
So if I'm reading this right, the master process can lose writes (if there's permanent failure). So taking your 2PC scheme above with the 'settled' flag, you can respond to the client that the txn has been committed once you've written the flag, but this commit marker can also just be entirely lost? (again, permanent failure)
If this 'settled' flag exists on each 'shard' instead of just the coordinator, any random subset of those too can be lost? I don't understand what's going on here, or what guarantees this 2PC implementation provides.
Actually, I may be mistaken about my previous comment. I'm not completely sure that this loss of recent data would happen as I've described. It depends on the client implementation. For example, a client could wait for a write to propagate to at least 1 replica before telling the caller that the data was inserted successfully. This is an implementation detail I'm not sure about.
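To make the "wait for at least 1 replica" idea concrete, here is a minimal in-memory sketch. It is not RethinkDB's client API and says nothing about how RethinkDB actually behaves; `Primary`, `Replica`, and `min_replica_acks` are hypothetical names invented for illustration.

```python
class Replica:
    """Hypothetical replica; `reachable` simulates whether it is up."""

    def __init__(self, reachable=True):
        self.reachable = reachable
        self.data = {}

    def apply(self, key, value):
        # Returns True only if the write was actually stored here.
        if not self.reachable:
            return False
        self.data[key] = value
        return True


class Primary:
    """Hypothetical shard master that forwards each write to its replicas."""

    def __init__(self, replicas):
        self.replicas = replicas
        self.data = {}

    def write(self, key, value, min_replica_acks=1):
        # Apply locally, then count how many replicas confirmed the write.
        self.data[key] = value
        acks = sum(1 for r in self.replicas if r.apply(key, value))
        # Report success only if enough replicas hold a copy. Note that a
        # False result does NOT roll back the local write in this sketch.
        return acks >= min_replica_acks


healthy = Primary([Replica(reachable=True), Replica(reachable=False)])
isolated = Primary([Replica(reachable=False)])
print(healthy.write("txn-1", {"settled": True}))   # one replica confirmed
print(isolated.write("txn-1", {"settled": True}))  # no replica confirmed
```

Under this assumption, a write that was reported as successful survives the permanent loss of the master, because at least one replica already holds it. A write that was not acknowledged may or may not survive, so the caller would have to treat it as unknown rather than failed.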
Also, the settled flag exists on each record, not each shard. A shard is typically made up of multiple unsettled records. Each worker is assigned to a shard using a hash function, so the assignment is deterministic, and each worker only processes unsettled transactions from its own shard.
Also I said something else misleading in one of my previous comments. In my case, the shard key of each record (which determines which shard a record belongs to) was not based on its own record ID but on the account ID of the user who owns that record. So effectively the sharding was happening based on user accounts and it was designed so that the records created by an account could be processed independently of records created by a different account.
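The assignment described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the original code: the SHA-256 hash, the modulo mapping, the record field names, and the helpers `shard_for` and `unsettled_for_worker` are all hypothetical.

```python
import hashlib


def shard_for(account_id: str, worker_count: int) -> int:
    """Map an account ID to a worker index deterministically."""
    # A stable hash is used here because Python's built-in hash() is
    # randomized per process for strings.
    digest = hashlib.sha256(account_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % worker_count


def unsettled_for_worker(records, worker_index, worker_count):
    """Return the unsettled records that belong to the given worker."""
    return [
        r for r in records
        if not r["settled"]
        and shard_for(r["account_id"], worker_count) == worker_index
    ]


# Hypothetical records: the shard key is the owning account, not the record ID.
records = [
    {"id": "r1", "account_id": "alice", "settled": False},
    {"id": "r2", "account_id": "alice", "settled": False},
    {"id": "r3", "account_id": "bob", "settled": True},
    {"id": "r4", "account_id": "carol", "settled": False},
]

# With 3 workers, each unsettled record is claimed by exactly one worker.
for worker in range(3):
    print(worker, [r["id"] for r in unsettled_for_worker(records, worker, 3)])

# If one worker disappears, the same function with worker_count=2
# redistributes the accounts among the remaining workers.
for worker in range(2):
    print(worker, [r["id"] for r in unsettled_for_worker(records, worker, 2)])
```

Because the key is the account ID, all records of one account always land on the same worker for a given worker count, which is what lets accounts be processed independently. A plain modulo mapping moves many accounts when the worker count changes; this sketch does not attempt to minimize that.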