There's a lot of real-world input situations in which I want NULL to indicate that a record field hasn't yet been collected, versus a record in which the answer to that field was "" – i.e. left blank by the respondent/data source. Sure I could have the front-facing data app auto-convert blank values to "N/A" or "(blank)", but that's unnecessary complexity.
I've been really enjoying using Rust because it has no NULL. You have to instead use the semantics of Some(x) / None()and handle both cases. I'm not sure how you work around the fact that an empty string is Null.
The problem with Null is that it can be either a value or show the lack of presence, and in JS because both undefined and null are values you can't depend on them being used for their semantics either (you can try to enforce their use but it that's no guarantee).
SQL suffers this exact same problem. I wonder what SQL would look like without Null.
Select id, Option (key)
From table
Insert... Some(5), None::Int
From this perspective it still seems that empty string should not be None, but I think you're right that many people do use a code like "N/A", but you can just add a supplemental Boolean field which is cleaner for set / not set.
> SQL suffers this exact same problem. I wonder what SQL would look like without Null.
As a base table value, it's easy to mechanically avoid NULL in SQL. Every nullable column is, in the most general case, simply a column that doesn't belong in the table but in a different table whose primary key is alsona foreign key referencing the original table’s primary key.
Your solution works, but I would hardly describe having a new table for every nullable column is "easy". It also would not be nice for maintenance -- often I want to add a new column as null initially, then soon after write a migration to convert it to non-null once it's populated. Having to create / remove a new table every time would be a pain.
>The problem with Null is that it can be either a value or show the lack of presence
No, the trouble with null is that it's a supertype that breaks the type systems -- if you actually wanted to have safe code in the sense you expect typesafety to give you, you should be checking for NULLs everywhere. Because a codebase can be changed from never-null to sometimes-null, and the type system is happy to let it go through, because never-null was not actually a thing -- we just imagined it for our convenience.
The problem with null is that, in general, if you didn't implement an explicit handling for it, then it's presence will do the wrong thing.
The primary benefit of rust's Option is that you can safely not check for null.
SQL is in fact the same problem -- we just don't bother to handle the null until it bites us in the ass. For example, booleans don't actually exist, because really a bool has values (True, False, Null). Which breaks your Boolean logic because you're using trinary values (with unintuitive results, because there isn't really a good answer). Which leads you to innocent queries including both false positives and false negatives, silently, when nulls are present in your data [0]
SQL does however allow for the column constraint NOT NULL, but there's nothing that actually enforces your queries to handle the possibility of nulls correctly (because, by default, everything is nullable, and your queries would be horrificly verbose) so queries tend not to handle the case correctly until they run into the problem.
The problem people tend to not understand is that we should be checking for nulls everywhere, and we only don't do so because it would be insanity. The primary benefit of moving nulls into an Enum is that we can now safely specify that thing is NOT NULL (and our compiler will enforce it). And as it turns out, most of your data does not need, or want, to be nullable
This still doesn't fix the user input Null, versus this value is not set. I think in that case you'd want a nullable boolean companion field that can then express set, actively not set and passively not set (when that matters) but it's a bit ugly.
Empty string isn't enough for this situation since it doesn't work for data types other than string, I actually prefer null vs undefined in javascript / json here, xml also has the concept of null vs undefined.
Emtpy string vs null is not that useful if you have null and undefined.
Undefined is very underappreciated invention. I wish I had undefined in Java, throwing exception on read. That would make a lot of software much more reliable. I'm going to write my own AutoValue library just to generate java beans which will allow uninitialized values.
PS undefined in JavaScript is not that useful IMO. But it could be useful with more strict language.
"undefined" in this context means that the JSON object has had the relevant key omitted rather than explicitly set to null. This is valid JSON (it just so happens that JavaScript handles this scenario by returning "undefined" as with all undefined variables).
In my defense, JavaScript treats a value of `undefined` differently from an unset field / variable (e.g. when testing `object.hasOwnProperty('field')`).
Yes omitted value vs provided with null which is easy to work with in JS because of null and undefined and them being falsey.
Working with xml or json in more typed language like C# is less fun because there is not distinction in the language instead you end up using dom like checks to see if the key was provided or was set to null.
.Net code generator use to add an extra property called [property name]Specified for xml serialization so you could do null values or not send the node. Definitely prefer more direct language support for no value provided vs a null value.
I've never needed such in 3 decades of systems design. I'd like to hear the details. An explicit flag or time-stamp should be used to indicate when or if a record as been updated. To be frank, heavily reliance on Null strings usually means somebody is doing something wrong or awkward in my opinion. Null strings cause 10 problems for every 1 they solve. I stand by that and will and have defended it for hours in debates. Bring it on! (Granted, most RDBMS don't offer enough tools to easily do it correctly.)
A common situation I've run into is with "default" values and overrides, especially for configuration-type settings. NULL indicates use parent record value, while non-NULL, including empty string, means to use that value. By allowing empty string, you explicitly allow a user to basically say "don't inherit". Think along the lines of `COALESCE(userValue, tenantDefaultValue, "Default Value")`.
One way of implementing the UI for this is to have a checkbox labelled "Inherit" or "Use default", and if it's checked, disable the actual textbox (and put NULL in the field).
I've also run into similar patterns with data telemetry. I want to accurately represent the data I received, and so if some older version doesn't include a particular value, I want to store that as NULL, because empty string is a potentially valid value. If you "normalize" NULLs to empty string, and then it makes it impossible to answer a question like "What % of systems have a value for this data item?" since it includes versions that can't get it at all.
What exactly does "not see" mean? Some use tab, others use the mouse. That doesn't tell us much and there are better ways to track user hand/mouse movements if you need such telemetry. UI api's often handle nulls/blanks different such that you don't want to over-rely on how one of them does it. Multi-screen tasks should track when each sub-screen is finished, not just the final transaction. I'd like to see a more specific use-case.
Re: Did you want to update the field to null or not update it at all is another one.
Who, the user or DBA? And why? Users don't know a Null from a horse; you shouldn't make them think about that, that's an internal guts thing.
I have to say I'm a bit amused at the insistence on tabs/mice/telemetry :)
In at least one project I worked on, there was a lot of survey data entered from paper surveys used in the 90s and early 2000s. The structure of the survey included pass-through questions: "If you have ever smoked a joint, please check yes and answer questions 42 and 43. If you have not ever smoked a joint, please check no and TURN THE PAGE."
One can certainly build a logic to process these replies (check answer to question 41, tally blanks in questions 42, 43 according to answer to question 41) but since these questions and answers were also entered into the computer in the olden days, NULL was used if questions 42, 43 were passed through, while blank was used if questions 42, 43 were left blank (and 41 was yes).
Was the user prompted to enter the value but left it blank or where they never prompted. Yes you could always have more data, but this extra two value distinction is common and practical in my experience.
On updating its related to prompting typically, again communicating user intent, did the user update a field from having a value to not having a value (they purposely blanked it out) or they didn't touch the field and maybe it wasn't even shown so don't modify it. Basically am I going to generate a update statement with set field = null or no set at all for that field. This is trivial in json to send through application due to null vs undefined.
Re: but this extra two value distinction is common and practical in my experience.
I'm still not following. What exact hand or keyboard/mouse movements constitute "left it blank"? "Leave" means to "move away from". What is moving away from what?
Usually one stores a time-stamp for input prompts (screens). If the prompt never appears, you then have no time-stamp and/or record for it. Why would a broken prompt create a record to begin with? I'm missing some context you seem to be assuming. You may be assuming behavior of a specific UI product or stack.
Not sure I can explain more simply, just think multi part wizard interfaces or forms with button that open up sub-forms.
I think you get it since you talk about time stamps. Just as you could record all mouse movements and keyboards to get higher fidelity you could break a single record into multiple with times stamps to record each step in a form, then the lack of record would be distinct from a record with all nulls along with time stamps. You could also do a record per field (which I have seen more than once) with metadata per field as other columns.
But without all that a system that supports null and some empty value gives you more fidelity than just null that again in my experience is practical.
Empty string and null work fine with strings for this purpose but for other datatypes you start needing to pick special values like the minimum value for a 32 bit int or 0 date etc.
Usually you don't get the entire sub-screen's data such that a blank versus null distinction wouldn't help. You can't "half submit" a dialog or sub-screen on the vast majority of systems I ever worked with it, and one adds time-stamps to critical sub-screens if they can.
Maybe you encountered a specific product's edge-case where it helps, but I'd rather trade away 9 other headaches to get your edge case headache. It may be the case that Product X doesn't allow time-stamps, but should we gum up all strings to make just Product X happy? I don't see how the tradeoff math favors that. Plus, it encourages bad designs by plugging problems with funny non-printable codes. Let's not enable slop-heads.
Re: but for other datatypes
For the purpose of this discussion I'm limiting it to strings, where the majority of null-related headaches are found in my experience.
I just want to know if a boolean has been explicitly set, without having to check the value of another more complex data type that would never be used for anything else.
I’d go further than you and say they should be removed by default on all fields.
Want to know if a Boolean field is unset? Well it’s no longer Boolean because you now have 3 states for that field. So why not use a char, U/Y/N with the default being U?
NULL might have made more sense 30+ years ago when systems were more resource constrained but it doesn’t make sense now for most of the problems people are trying to solve day to day. If anything, it creates more problems.
Just to be clear, I’m not saying they should be removed entirely but rather that they shouldn’t have to be explicitly disabled on every CREATE.
I will say one use case for NULL that is hugely helpful is outer joins and nested queries. However these don’t generate high performance queries so if you’re having to rely on them then you might need to rethink your database schema anyway.
So essentially I don’t disagree with you, I just think you’re being too nice limiting your complaint to string fields.
> Well it’s no longer Boolean because you now have 3 states for that field. So why not use a char, U/Y/N with the default being U?
Well because instead of using a type that exactly encodes the concept of "yes/no/unset" (nullable boolean), you'd be using a type that encodes "any 1-character text, with arbitrary meaning and most of the values being nonsensical"
The problem is you need a boat load of additional code to cover unset. Not just in SQL (syntax is different for NULL than it is for comparing any other type) but often also in your importing language too (eg some languages will cast NULL to a nil value that can actually raise exceptions or even crash your application if not handled correctly).
Capturing those edge cases is non-trivial compared checking the value of a char.
In an idea world your unit tests and CI pipelines would catch all of those but that depends on well written tests. Thus in my experience having fewer hidden traps from the outset is automatically a better design than one that perfectly fits an academic theory but is harder to support in practice.
It'd probably be more sane than trying to stuff a 3VL into bunch of 2VL operations, because you refuse to acknowledge that you don't actually have a 2VL type
That's entirely a schema/data policy question. Null could indicate no name, or it just as easily could indicate tha data was not collected, and an empty string would indicate it was collected but is empty.
Both are entirely valid methods of encoding the data, and require knowing how the data is intended to be collected, entered and used to determine the best choice.
If that's the data policy of the group collecting the data, then an empty date (which some systems support, 0000-00-00 is valid in MySQL for certain modes of strictness IIRC), or a sentinel value could be used (1111-11-11, depending on expected values).
If the data needs to differentiate between those cases and a separate boolean bit of data to track whether it was collected is no feasible (as it so often isn't in the real world), then you do what you must. How that works out in the real world would be the policy for how to interpret the data in that field. It's really no different than any time you've seem a field that is named something counter-intuitive to what it holds in some schema. Something tracks that difference, whether it be institutional knowledge, come conversion code, or a label prior to display. That's what I mean by it being a "schema/data policy question".
Yes absolutely a data schema question. And yes, you can set whatever arbitrary policy you want. eg "nmn" for no-middle-name. Or require that a horse's birthday is recorded as 1st August.
But SQL and the relational model is meant to be a logical system, and I'm interested in preserving some conceptual integrity in the domain modelling.
I would say there's a spectrum. My guess is that whatever schema you could come up with could have an extra layer of metadata applied to describe the data to some benefit, but at some point you have to make a decision about how something is best represented.
To keep with the example of asking questions, if a sruvey is asked and each question is represented by field in a table for that survey, but they also want to track whether any particular question is answered, there are multiple ways to track this information. It can be done through a separate table (or multiple), it can be done with an extra boolean field per original question field, or it could be done by making each question field nullable, and ensuring that no question answered will ever be left null.
Which of those cases breaks the conceptual integrity of the domain modeling? I would argue that as long as they are applied consistently and accurately, none of them break it.
If your SQL query tells you the most popular name is "nmn" or that that a whole lot of people died before they were born (DOD = 1111-11-11) then yes, I think there has been a loss of integrity.
That only matters if your SQL data is ever meant to be used in isolation. If the only appropriate way to access the data and retain integrity is either through the same set of applications that insert it, or through fucntions written to mediate access within the SQL instance, than how the data is stored on disk is mostly irrelevant.
If all your access is mediated by a layer that takes care of this, then complaining about how it's stored in SQL is no different than complaining about how your DBMS is storing the data on disk. It doesn't matter, you aren't supposed to be doing anything with it there anyway.
Data integrity is not something a database can ensure, for any non-trivial use. ANSI SQL provides tools to help ensure data integrity, through data types, referential integrity checks if you define them and they are supported, but ultimately, those tools can only account for what the database supports. Will they prevent the favorite_color field from receiving the data that should have been entered into the pet_name field? Unlikely?
It's no different in programming languages and bugs (as a data integrity problem is a bug in data domain). Using C, Java, Rust and Haskell will not prevent all your bugs, and Perl, Python, JavaScript and Bash do not mean you will have bugs. A bug free Perl script is possible, and so is a bug riddled Rust script, as not all bugs are (or even can be) caught by the assurances the more strict languages provide.
Unless all your SQL in a schema use is through a directly connected SQL client that loads data directly from native SQL formats, runs queries, and exports data as the output of those queries, the applications that have ingress and egress to that database are what really matters for data integrity, and worrying that a date field might contain 1111-11-11 as a special sentinel value that means something other than what null value in the same field means is being unable to see the forest for the trees.
> No, what you suggest is counter intuitive and should never pass a code review.
You're assertion is that it's counter-intuitive to use the database's ability to encode a lack of information to denote a lack of knowledge about the information, and should only be used to denote the information does not universally exist? And additionally that this is so well accepted that to do otherwise would not be accepted by any peers?
If so, that's a fairly bold assertion to make when you haven't provided any any evidence or reasoning to back it up.
The problem is a bit more subtle. NULL is often used with two distinct meanings:
- I know that Theodore Roosevelt has no middle name.
- I don't know what Theodore Roosevelt's middle name is
The semantics of each case are different in a subtle but important way. Consider the following two predicates
'Jim' = [no value]
vs
'Jim' = [I don't know]
The former predicate is obviously false. 'Jim' is not equivalent to a non-existent value. The latter predicate however, can't be assigned a truth value. It's entirely possible that the unknown value could later turn out to be 'Jim'.
The key issue is that SQL treats NULL values according to the latter ('I don't know') rule and ternary (Kleene) logic, as TFA discusses. That, in turn means that using NULL to represent [no value] will lead to painful and hard to diagnose logic bugs (e.g., WHERE x = 1 OR x != 1 not returning all records).
I do recognize (1) there is a logical difference between "missing because absent" and "missing because unknown", (2) empty string is not an element from the domain of personal names.
In a sense, empty string is available as a marker, just as NULL is available as a marker, to satisfy either scenario. I think the possibility of three-valued logic applies in either scenario.
I'm not at all convinced empty (non-null) string is appropriate for many real world applications, just as empty (non-null) date is appropriate for similar scenarios.
Eg. Date of death: unknown - NULL; not dead yet - NULL