Representing Null in CSV
The comma-separated value (CSV) format is ubiquitous. Even though there exists a multitude of implementation variations, the basic idea remains: each line of the format contains values separated by commas (or tabs in a parallel format, TSV). The first line may be a header "row" describing the values in the "column" beneath it. Any value that contains the delimiter must be surrounded by double quotes, and if the double quote itself appears, it must be escaped by repeating it. These core rules were recently codified in RFC 4180, but neither the specification nor most implementations know how to handle the absence of any value in a field.
The idea of null
is one that has stoked great debate, especially in relational database circles. As some have argued convincingly, the concept of null
isn't really part of a relation (the real name for a "table") in the relational model. But a CSV file is not a relation, nor does it use the relational model. CSV is just a data storage format, the contents of which can be interpreted in various ways based upon the application.
In particular, I want to interpret a CSV file as containing logical assertions, such as the triples of RDF. In this view, each CSV record indicates a series of propositions, with each column indicating the predicate and each field representing the object. In other words, each row indicates the property values of a single resource, with each column indicating the name of each respective property. But in a typical knowledge representation framework such as RDF, not all resources will have the same properties.
For the properties for which a resource does not specify a value, I want to indicate "no value", which we can refer to out of convenience as null
. If we assume that each column has a particular type (such as the convoluted typed literals of RDF), some types have lexical forms that make it easy to indicate null
. An integer, for example, could indicate null
by the lexical form of the empty string. For example, a CSV record might contain [1,2,,4]
to indicate that there is no third value.
A problem arises when representing null
for strings. How can one distinguish between an empty string value ""
, and no value at all? That is, if we have the CSV record [one,two,,four]
, how can we know if the third value is ""
or if there is no third value?
CSV requires that certain values be quoted, and in fact allows any value to be arbitrarily quoted without changing its decoded value. The preceding record, therefore, could be represented as ["one","two","","four"]
. This suggests a possible solution: always represent the empty string as a quoted empty string ""
. If a value in a string column is empty but not quoted, it represents null
; only a quoted empty string is considered a true empty string value. This approach has also been suggested by Liu Junfeng.
Philosophically this is distateful, though, because it "reach[es] down a layer" and applies special semantics to quoted values, as noted by Dominic John Repici. In other words, decoding of a value is done before any type-based or higher-level interpretation is applied to the value, and one should be able to quote any CSV value without changing its decoded form. This could be worked around by pushing the idea of null
down to the encoding level, so that every value has a null
form (and only ""
indicates a non-null
empty string), but remember that one can quote all types, even those that will eventually be interpreted, not as strings, but as integers, for example. This would require the strange situation in which both the empty string or null
from the lexical level yield null
(a missing/absent value) at the typed value level. Such an interpretation would also mean that applications such as Excel, which does not have a distinction between null
and the empty string, would effectively replace all empty strings with null
when round-trip saving as CSV.
But for the serialization of a knowledge framework, which requires the ability to represent the empty string without requiring every resource to have the same properties, the concept of null
or an absent value is essential. Perhaps imputing the concept of null
to the lexical level, though ugly, may be the only resort when using the very simple format of CSV, the caveat of corruption by null
-unaware applications such as Excel notwithstanding.