Because Crunch employs a structural type system rather than a nominative one, a variable definition carries more than just the type name (numeric, text, categorical, etc.); it also describes range, precision, missing values and their reasons, order, and so on. For example:
{
    "type": "categorical",
    "name": "Party ID",
    "description": "Do you consider yourself generally a Democrat, a Republican, or an Independent?",
    "categories": [
        {
            "name": "Republican",
            "numeric_value": 1,
            "id": 1,
            "missing": false
        },
        {
            "name": "Democrat",
            "numeric_value": -1,
            "id": 2,
            "missing": false
        },
        {
            "name": "Independent",
            "numeric_value": 0,
            "id": 3,
            "missing": false
        }
    ]
}
This section describes the metadata of a variable as exposed across HTTP, both expected response values and valid input values.
Variable types
The “type” of a Variable is a string which defines the superset of values from which the variable may draw. The type governs not only the set of values but also their syntax. (See “Values” below.)
The following types are defined for public use:
- text
- numeric
- categorical
- datetime
- multiple_response
- categorical_array
Variable names
Variables in Crunch have multiple attributes that provide identifying information: “name”, “alias”, and “description”.
name
Crunch takes a principled stand that variable “names” should be for people, not for computers.
You may be used to domains that have a variable “name”, “label”, and “description”. There, the name is a short, unique, machine-friendly ID like “Q2”; the label is short and human-friendly, something like “Brand awareness”; and the description is where you might put question wording if you have survey data. Crunch instead has “alias”, “name”, and “description”. What you may be used to thinking of as a variable name, we consider an alias: something for internal use, not something appropriate for a polished dataset that is ready to share with people who didn’t create it (see the “alias” section below). In Crunch, the variable’s “name” is what you may be used to thinking of as a label.
All variables must have a name, and these names must be unique across all variables, including “hidden” variables (see below) but excluding subvariables (see “Subvariables” below). Within an array variable, subvariable names must be unique. (You can think of subvariable names within an array as being variable_name.subvariable_name, and with that approach, all “variable names” must be unique.)
Names must be a string of length greater than zero, and any valid unicode string is allowed. See “Identifiers” above.
alias
Alias is a string identifier for variables. It must be unique across all variables, including subvariables, such that it can be used as an identifier. This is what legacy statistical software typically calls a variable name.
Aliases have several uses. Client applications, such as those exposing a scripting interface, may want to use aliases as a more machine-friendly, yet still human-readable, way of referencing variables. Aliases may also be used to help line up variables across different import batches.
When creating variables via the API, alias is not a required field; if omitted, an alias will be generated. If an alias is supplied, it must be unique across all variables, including subvariables, and the new variable request will be rejected if the alias is not unique. When data are imported from file formats that have unique variable names, those names will in many cases be used as the alias in Crunch.
description
Description is an optional string that provides more information about the variable. It is displayed in the web application on variable summary cards and with analyses.
Type-specific attributes
These attributes must be present for the specified variable types when creating a variable, but they are not defined for other types.
categories
Categorical variables must contain an array of Category objects, each of which includes:
- id: a read-only integer identifier for the category. These correspond to the data values.
- name: the string name which applications should use to identify the category.
- numeric_value: the numeric value bound to the category, or null if no numeric value should be bound. Numeric values need not be unique.
- missing: (optional) boolean indicating whether the data corresponding to this category should be interpreted as missing.
- selected: (optional) boolean indicating whether this category should be treated as a “true” value for logical operations. Defaults to false if omitted. Multiple response variables are essentially logical categorical arrays, and therefore must have at least one “selected” category. More than one Category may be marked “selected”.
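The “selected” rule for multiple response variables can be checked on the client before a definition is submitted. The following is a minimal sketch, not part of the Crunch API; the function name selected_category_ids and the sample categories are hypothetical.

```python
def selected_category_ids(categories):
    """Return the ids of categories flagged as "selected".

    "selected" is optional and is treated as false when omitted.
    """
    return [c["id"] for c in categories if c.get("selected", False)]


# Hypothetical categories for a multiple response variable.
categories = [
    {"id": 1, "name": "Selected", "numeric_value": 1, "missing": False, "selected": True},
    {"id": 2, "name": "Not selected", "numeric_value": 0, "missing": False},
]

selected = selected_category_ids(categories)
if not selected:
    raise ValueError("a multiple_response variable needs at least one selected category")
```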
Categories are valid if:
- Category names are unique within the set
- Category ids are unique within the set
- Category ids are integers less than 2 ** 63
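The three validity rules above can be expressed as a small client-side check. This is a sketch under the assumption that categories are plain dictionaries shaped like the examples in this section; validate_categories is a hypothetical helper, not a Crunch API call, and the server remains the authority on what it accepts.

```python
def validate_categories(categories):
    """Return a list of problems; an empty list means the three rules hold."""
    problems = []
    names = [c.get("name") for c in categories]
    ids = [c.get("id") for c in categories]

    if len(set(names)) != len(names):
        problems.append("category names are not unique")
    if len(set(ids)) != len(ids):
        problems.append("category ids are not unique")
    for cid in ids:
        # bool is a subclass of int in Python, so it is excluded explicitly.
        is_integer = isinstance(cid, int) and not isinstance(cid, bool)
        if not is_integer or cid >= 2 ** 63:
            problems.append(f"category id {cid!r} is not an integer less than 2 ** 63")
    return problems


party_id = [
    {"name": "Republican", "numeric_value": 1, "id": 1, "missing": False},
    {"name": "Democrat", "numeric_value": -1, "id": 2, "missing": False},
    {"name": "Independent", "numeric_value": 0, "id": 3, "missing": False},
]
print(validate_categories(party_id))
```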
The order of the array defines the order of the categories, and thus the order in which aggregate data will be presented. This order can be changed by saving a reordered set of Categories.
All categorical variables must have a missing category named “No Data”; the Crunch system will add one if not supplied. The “No Data” missing category (whatever its code) will be used for default values when appending partial rows or collapsing timeseries data.
subvariables
Multiple Response and Categorical Array variables contain an array of subvariable references. In the HTTP API, these are presented as URLs. To create a variable of type “multiple_response” or “categorical_array”, you must include a “subvariables” member with an array of subvariable references. These variables will become the subvariables in the new array variable.
As with Categories, the order of the subvariables array within an array variable is the order in which the subvariables are presented; to reorder them, save a modified array of subvariable ids/urls.
subreferences
Multiple Response and Categorical Array variables contain an object of subvariable “references”: name, alias, description, etc. To create a variable of type “multiple_response” or “categorical_array” directly, you must include a “subreferences” member with an object of objects. These label the subvariables in the new array variable.
Each member of subreferences must contain a name and may optionally contain an alias. Note that subreferences is an unordered object; the order of the subvariables is read from the “subvariables” attribute.
{
    "type": "categorical_array",
    "name": "Example array",
    "categories": [
        {
            "name": "Category 1",
            "numeric_value": 1,
            "id": 1,
            "missing": false
        },
        {
            "name": "Category 2",
            "numeric_value": 0,
            "id": 2,
            "missing": false
        }
    ],
    "subvariables": [
        "/api/datasets/abcdef/variables/abc/subvariables/1/",
        "/api/datasets/abcdef/variables/abc/subvariables/2/",
        "/api/datasets/abcdef/variables/abc/subvariables/3/"
    ],
    "subreferences": {
        "/api/datasets/abcdef/variables/abc/subvariables/2/": {"name": "subvariable 2", "alias": "subvar2_alias"},
        "/api/datasets/abcdef/variables/abc/subvariables/1/": {"name": "subvariable 1"},
        "/api/datasets/abcdef/variables/abc/subvariables/3/": {"name": "subvariable 3"}
    }
}
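Because “subreferences” is unordered, a client that wants subvariable names in presentation order has to walk the “subvariables” array and look each URL up in “subreferences”. A minimal sketch of that lookup, using the relevant members of the example above (ordered_subvariable_names is a hypothetical helper name):

```python
def ordered_subvariable_names(definition):
    """Return subvariable names in the order given by "subvariables"."""
    subreferences = definition["subreferences"]
    return [subreferences[url]["name"] for url in definition["subvariables"]]


base = "/api/datasets/abcdef/variables/abc/subvariables/"
definition = {
    "type": "categorical_array",
    "name": "Example array",
    "subvariables": [base + "1/", base + "2/", base + "3/"],
    # Key order here is deliberately different from the presentation order.
    "subreferences": {
        base + "2/": {"name": "subvariable 2", "alias": "subvar2_alias"},
        base + "1/": {"name": "subvariable 1"},
        base + "3/": {"name": "subvariable 3"},
    },
}

names = ordered_subvariable_names(definition)
print(names)
```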
Other definition attributes
These attributes may be supplied on variable creation, and they are included in API responses unless otherwise noted.
format
An object with various members to control the display of Variable data:
- data: An object with a “digits” member, stating how many digits to display after the decimal point.
- summary: An object with a “digits” member, stating how many digits to display after the decimal point.
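As an illustration of how a client might honor the “digits” member when rendering numbers, here is a small sketch. It assumes the format object is available as a plain dictionary; display_value is a hypothetical helper, and how the Crunch web application itself rounds values is not specified in this section.

```python
def display_value(value, variable_format, section="data"):
    """Format a number using the "digits" member of the given format section.

    section is either "data" or "summary".
    """
    digits = variable_format[section]["digits"]
    return f"{value:.{digits}f}"


variable_format = {"data": {"digits": 2}, "summary": {"digits": 0}}
print(display_value(4.5678, variable_format))             # data view
print(display_value(4.5678, variable_format, "summary"))  # summary view
```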
view
An object with various members to control the display of Variable data:
- show_codes: For categorical types only. If true, numeric values are shown.
- show_counts: If true, show counts; if false, show percents.
- include_missing: For categorical types only. If true, include missing categories.
- include_noneoftheabove: For multiple-response types only. If true, display a “none of the above” category in the requested summary or analysis.
- geodata: A list of associations of a variable to Crunch geodatum entities. To create, modify, or remove an association, PATCH the variable entity with an amended view.geodata. An association is an object with the required keys geodatum and feature_key and an optional key match_field. The geodatum must exist; feature_key is the name of the property of each “feature” in the geojson/topojson that corresponds to the match_field of the variable (perhaps a dotted string for nested properties, e.g. “properties.postal-code”). By default, match_field is “name”: a categorical variable will match category names to the feature_key present in the given geodatum.
- transform: An object that can optionally contain insertions, a list of insertion objects.
discarded
Discarded is a boolean value indicating whether the variable has been hidden from the dataset. Setting discarded to true hides the variable; it acts as a soft, restorable delete.
Default is false.
private
If true, the variable will not show in the common variable catalog; instead, it will be included in the personal variables catalog.
missing_reasons
An object whose keys are reason strings and whose values are the codes used for missing entries.
Crunch allows any entry in a column to be either a valid value or a missing code. Regardless of the variable type, missing codes are represented in the interface as an object with a single "?" key mapped to a single missing integer code. For example, a segment of [4.56, 9.23, {"?": -1}] includes 2 valid values and 1 missing value.
For non-categorical variables, the missing codes map to a reason phrase via this “missing_reasons” member. Users may define their own missing reasons.
The “No Data” missing reason (whatever its code) will be used for default values when appending partial rows or collapsing timeseries data. All non-categorical variables must have a “No Data” missing reason; the Crunch system will add one if not supplied.
In the above example, the code of -1 would be looked up in a missing reasons map such as:
{
    "missing_reasons": {
        "No Data": -1,
        "type mismatch": -2,
        "my backup was corrupted": 1
    }
}
Categorical variables do not require a missing_reasons object because the categories array contains the information about missingness.
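The lookup described above, from a {"?": code} entry to its reason phrase, can be sketched as follows. The reasons map is keyed by reason string, so it is inverted first; describe_values is a hypothetical helper, not a Crunch API call.

```python
def describe_values(values, missing_reasons):
    """Label each entry of a column segment as valid or missing.

    Raises KeyError if a missing code is not defined in missing_reasons.
    """
    # missing_reasons maps reason -> code; invert it to look up by code.
    code_to_reason = {code: reason for reason, code in missing_reasons.items()}
    described = []
    for value in values:
        if isinstance(value, dict) and set(value) == {"?"}:
            described.append(("missing", code_to_reason[value["?"]]))
        else:
            described.append(("valid", value))
    return described


missing_reasons = {"No Data": -1, "type mismatch": -2, "my backup was corrupted": 1}
segment = [4.56, 9.23, {"?": -1}]
print(describe_values(segment, missing_reasons))
```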
Values
When creating a new variable, one can also include a “values” member that contains the data column corresponding to the variable metadata. See Importing Data: Column-by-column. This subsection outlines how the various variable types have their values formatted both when one supplies values to add to the dataset and when one requests values from a dataset.
Text
Text values are an array of quoted strings. Missing values are indicated as {"?": <integer>}, as discussed above, and all integer missing value codes must be defined in the “missing_reasons” object of the variable’s metadata.
Numeric
A “numeric” value will always be e.g. 500 (a number, without quotes) in the JSON request and response messages, not “500” (a string, with quotes). Missing values are handled as with text variables.
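The difference between a JSON number and a JSON string is easy to see when serializing a column. A short sketch using Python's standard json module; the column contents are made up for illustration:

```python
import json

# Numbers stay unquoted in the JSON text; the missing entry uses the "?" object.
column = [500, 12.5, {"?": -1}]
encoded = json.dumps(column)
print(encoded)

# A quoted "500" is a string, which is not a valid numeric value.
not_numeric = json.dumps(["500"])
print(not_numeric)
```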
Categorical
Insert an array of integers that correspond to the ids of the variable’s categories. Only integers found in the category ids are allowed. That is, you cannot insert values for which there is no category metadata. It is, however, permitted to have categories defined for which there are no values.
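A client can check a column against the category ids before sending it. The following sketch assumes values are plain integers and categories are dictionaries as in the examples above; invalid_category_values is a hypothetical helper name.

```python
def invalid_category_values(values, categories):
    """Return (row_index, value) pairs for values that match no category id."""
    allowed = {c["id"] for c in categories}
    return [(i, v) for i, v in enumerate(values) if v not in allowed]


categories = [
    {"name": "Republican", "numeric_value": 1, "id": 1, "missing": False},
    {"name": "Democrat", "numeric_value": -1, "id": 2, "missing": False},
    {"name": "Independent", "numeric_value": 0, "id": 3, "missing": False},
]

# The last value has no category metadata, so it would not be allowed.
print(invalid_category_values([1, 3, 3, 2, 4], categories))
```

Categories that never appear in the values are not reported, since defining unused categories is permitted.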
Datetime
Datetime input and output are in ISO-8601 formatted strings.
resolution
Datetime variables must have a resolution string that indicates the unit size of the datetime data. Valid values include “Y”, “M”, “D”, “h”, “m”, “s”, and “ms”. Every datetime variable must have a resolution.
Arrays
Crunch supports array type variables, which contain an array of subvariables. “Multiple response” and “Categorical array” are both arrays of categorical subvariables. Subvariables do not exist as independent variables; they are exposed as “virtual” variables in some places, and can be analyzed independently, but they do not have their own type or categories.
Arrays are currently always categorical, so they send and receive data in the same format as categorical variables: category ids. The only difference is that regular categorical variables send and receive one id per row, whereas arrays send and receive a list of ids per row (equal in length to the number of subvariables in the array).
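The row shape for arrays can be checked in the same spirit as for single categorical columns. This is a client-side sketch only; check_array_rows and the sample data are hypothetical.

```python
def check_array_rows(rows, subvariables, categories):
    """Return (row_index, problem) pairs for rows that break the array format."""
    allowed = {c["id"] for c in categories}
    width = len(subvariables)
    problems = []
    for index, row in enumerate(rows):
        if len(row) != width:
            problems.append((index, f"expected {width} ids, got {len(row)}"))
        elif any(value not in allowed for value in row):
            problems.append((index, "id without a matching category"))
    return problems


categories = [{"id": 1, "name": "Category 1"}, {"id": 2, "name": "Category 2"}]
subvariables = ["subvar 1", "subvar 2", "subvar 3"]  # placeholders for subvariable references
rows = [
    [1, 2, 1],  # one id per subvariable
    [2, 2],     # too short
    [1, 9, 2],  # 9 is not a category id
]
print(check_array_rows(rows, subvariables, categories))
```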
Variables
A complete Variable, then, is simply a Definition combined with its data array.