Individual data values follow the JSON representations where possible. JSON exposes the following types: number, string, array, object, true, false, null. Crunch adds additional types with special syntax (see Types, below). Examples:
- 13
- 45.330495
- “foo”
- [3, 4, 5]
- {“bar”: {“a”: [12.4, 89.2, 0]}}
- true
- null
- “2014-03-02T14:29:59Z”
Because a single JSON type may be used to represent multiple Crunch types, you should never rely on the JSON type to interpret the class of a datum. Instead, inspect the type object (see below) to interpret the data.
Missing values
Crunch provides a robust “missing entries” system. Anywhere a (valid) data value can appear, a missing value may also appear. Missing values are represented by an object with a single “?” key. The value is a missing integer code (see Missing reasons, below). Examples:
- {“?”: -1}
- {“?”: 24}
Arrays
A set of data values (and/or missing values) which are of the same type can be ordered in an array. All entries in an array are of the same Crunch type.
Examples:
- [13, 4, 5, {“?”: -2}, 7, 2]
- [“foo”, “bar”]
Enumerations
Some arrays, rather than repeating a small set of large values, benefit from storing a small integer code instead, moving the larger values they represent into the metadata, and doing lookups when needed to encode/decode. The “categorical” type is the most common example of this: rather than store an array of large string names like [“Internet Explorer”, “Internet Explorer”, “Firefox”, …] it instead stores integer codes like: [1, 1, 2], placing the longer strings in the metadata as type.categories = [{“id”: 1, “name”: “Internet Explorer”, …}, …]. We call this encoding process enumeration, and its opposite, where the coded are re-expanded into their original values, elaboration.
Enumeration also provides the opportunity to order the possible values, as well as include potential values which do not yet exist in the data array itself.
Enumeration typically causes the volume of data to shrink dramatically, and can speed up very common operations like filtering, grouping, and almost any data transfer. Because of this, it is common to:
- Enumerate a data array as early as possible. Indeed, when a variable can be enumerated, the fastest way to insert new data is to send the new values as the integer codes.
- Elaborate a data array as late as possible. As long as the metadata is shipped along with the enumerated data, the transfer size and therefore time is much smaller. Many cases do not even call for a complete elaboration of the entire column.