Variable Definitions

Since Crunch employs a structural type system rather than a nominative one, a variable definition carries more information than just the type name (numeric, text, categorical, etc.): it also describes range, precision, missing values and their reasons, order, and so on. For example:

{
    "type": "categorical",
    "name": "Party ID",
    "description": "Do you consider yourself generally a Democrat, a Republican, or an Independent?",
    "categories": [
        {
            "name": "Republican",
            "numeric_value": 1,
            "id": 1,
            "missing": false
        },
        {
            "name": "Democrat",
            "numeric_value": -1,
            "id": 2,
            "missing": false
        },
        {
            "name": "Independent",
            "numeric_value": 0,
            "id": 3,
            "missing": false
        }
    ]
}

This section describes the metadata of a variable as exposed across HTTP, both expected response values and valid input values.

Variable types

The “type” of a Variable is a string which defines the superset of values from which the variable may draw. The type governs not only the set of values but also their syntax. (See below.)

The following types are defined for public use:

text
numeric
categorical
datetime
multiple_response
categorical_array

Variable names

Variables in Crunch have multiple attributes that provide identifying information: “name”, “alias”, and “description”.

name

Crunch takes a principled stand that variable “names” should be for people, not for computers.

You may be used to domains in which variables have a “name”, a “label”, and a “description”. There, the name is a short, unique, machine-friendly ID like “Q2”; the label is short and human-friendly, something like “Brand awareness”; and the description is where you might put question wording if you have survey data. Crunch instead has “alias”, “name”, and “description”. What you may think of as a variable name, Crunch treats as an alias: something for internal use, not something appropriate for a polished dataset shared with people who did not create it (see the “alias” section below). In Crunch, the variable’s “name” is what you may think of as a label.

All variables must have a name, and these names must be unique across all variables, including “hidden” variables (see below) but excluding subvariables (see “Subvariables” below). Within an array variable, subvariable names must be unique. (You can think of subvariable names within an array as being variable_name.subvariable_name, and with that approach, all “variable names” must be unique.)
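
The uniqueness rules above can be checked on the client side before definitions are submitted. The following is a minimal sketch, not part of the Crunch API: it assumes a hypothetical in-memory list of variable definitions in which each array variable carries a “subvariable_names” list, a simplification invented for this illustration.

```python
from collections import Counter


def duplicates(names):
    """Return the names that occur more than once, sorted."""
    counts = Counter(names)
    return sorted(name for name, n in counts.items() if n > 1)


def name_conflicts(variables):
    """Check top-level names globally and subvariable names within each array.

    `variables` is a hypothetical list of dicts; "subvariable_names" is an
    illustrative key, not a Crunch attribute.
    Returns (duplicated variable names, {array name: duplicated subvariable names}).
    """
    top_level = duplicates(v["name"] for v in variables)
    within_arrays = {}
    for v in variables:
        repeated = duplicates(v.get("subvariable_names", []))
        if repeated:
            within_arrays[v["name"]] = repeated
    return top_level, within_arrays


variables = [
    {"name": "Party ID"},
    {"name": "Brands", "subvariable_names": ["Coke", "Pepsi", "Coke"]},
    # A top-level "Coke" does not clash with the subvariable "Coke" above,
    # because subvariables are excluded from the dataset-wide check.
    {"name": "Coke"},
]
top_level, within_arrays = name_conflicts(variables)
```

Here the top-level names are all distinct, while the “Brands” array is reported because “Coke” appears twice among its subvariables.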

Names must be a string of length greater than zero, and any valid unicode string is allowed. See “Identifiers” above.

alias

Alias is a string identifier for variables. It must be unique across all variables, including subvariables, such that it can be used as an identifier. This is what legacy statistical software typically calls a variable name.

Aliases have several uses. Client applications, such as those exposing a scripting interface, may want to use aliases as a more machine-friendly, yet still human-readable, way of referencing variables. Aliases may also be used to help line up variables across different import batches.

When creating variables via the API, alias is not a required field; if omitted, an alias will be generated. If an alias is supplied, it must be unique across all variables, including subvariables, and the new variable request will be rejected if the alias is not unique. When data are imported from file formats that have unique variable names, those names will in many cases be used as the alias in Crunch.

description

Description is an optional string that provides more information about the variable. It is displayed in the web application on variable summary cards and with analyses.

Type-specific attributes

These attributes must be present for the specified variable types when creating a variable, but they are not defined for other types.

subvariables

Multiple Response and Categorical Array variables contain an array of subvariable references. In the HTTP API, these are presented as URLs. To create a variable of type “multiple_response” or “categorical_array”, you must include a “subvariables” member with an array of subvariable references. These variables will become the subvariables in the new array variable.

Like categories, the array of subvariables within an array variable indicates the order in which they are presented; to reorder them, save a modified array of subvariable ids/URLs.

subreferences

Multiple Response and Categorical Array variables contain an object of subvariable “references”: names, alias, description, etc. To create a variable of type “multiple_response” or “categorical_array” directly, you must include a “subreferences” member with an object of objects. These label the subvariables in the new array variable.

Each subreferences member must contain a name and may optionally contain an alias. Note that subreferences is an unordered object; the order of the subvariables is read from the “subvariables” attribute.

{
    "type": "categorical_array",
    "name": "Example array",
    "categories": [
        {
            "name": "Category 1",
            "numeric_value": 1,
            "id": 1,
            "missing": false
        },
        {
            "name": "Category 2",
            "numeric_value": 0,
            "id": 2,
            "missing": false
        }
    ],
    "subvariables": [
      "/api/datasets/abcdef/variables/abc/subvariables/1/",
      "/api/datasets/abcdef/variables/abc/subvariables/2/",
      "/api/datasets/abcdef/variables/abc/subvariables/3/"
    ],
    "subreferences": {
        "/api/datasets/abcdef/variables/abc/subvariables/2/": {"name": "subvariable 2", "alias": "subvar2_alias"},
        "/api/datasets/abcdef/variables/abc/subvariables/1/": {"name": "subvariable 1"},
        "/api/datasets/abcdef/variables/abc/subvariables/3/": {"name": "subvariable 3"}
    }
}
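
Because “subreferences” is unordered, a client that wants subvariable labels in display order has to walk the “subvariables” array and look each URL up. A minimal sketch, assuming the definition has already been parsed into a Python dict; the function name is hypothetical and the categories are omitted for brevity:

```python
def ordered_subvariable_names(definition):
    """Return subvariable names in the order given by "subvariables"."""
    refs = definition["subreferences"]
    unlabeled = [url for url in definition["subvariables"] if url not in refs]
    if unlabeled:
        raise ValueError(f"no subreference for: {unlabeled}")
    return [refs[url]["name"] for url in definition["subvariables"]]


base = "/api/datasets/abcdef/variables/abc/subvariables/"
definition = {
    "type": "categorical_array",
    "name": "Example array",
    "subvariables": [base + "1/", base + "2/", base + "3/"],
    # Deliberately listed out of order, as in the example above.
    "subreferences": {
        base + "2/": {"name": "subvariable 2", "alias": "subvar2_alias"},
        base + "1/": {"name": "subvariable 1"},
        base + "3/": {"name": "subvariable 3"},
    },
}
names = ordered_subvariable_names(definition)
```

The resulting list follows the “subvariables” order (1, 2, 3) regardless of how the “subreferences” keys happen to be arranged.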

Other definition attributes

These attributes may be supplied on variable creation, and they are included in API responses unless otherwise noted.

format

An object with various members to control the display of Variable data:

data: An object with a “digits” member, stating how many digits to display after the decimal point.
summary: An object with a “digits” member, stating how many digits to display after the decimal point.

view

An object with various members to control the display of Variable data:

show_codes: For categorical types only. If true, numeric values are shown.
show_counts: If true, show counts; if false, show percents.
include_missing: For categorical types only. If true, include missing categories.
include_noneoftheabove: For multiple-response types only. If true, display a “none of the above” category in the requested summary or analysis.
geodata: A list of associations between the variable and Crunch geodatum entities. PATCH a variable entity, amending view.geodata, to create, modify, or remove an association. An association is an object with the required keys geodatum and feature_key and the optional key match_field. The geodatum must exist; feature_key is the name of the property of each “feature” in the geojson/topojson that corresponds to the match_field of the variable (it may be a dotted string for nested properties, e.g. “properties.postal-code”). By default, match_field is “name”: a categorical variable will match category names to the feature_key present in the given geodatum.
transform: An object that can optionally contain insertions, a list of insertion objects.
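
To make the geodata association concrete, here is a rough sketch of building one and of how a dotted feature_key such as “properties.postal-code” could be resolved against a feature. Several things here are assumptions: the geodatum URL is invented, the helper names are hypothetical, and the exact shape of the PATCH body is a guess based on the phrase “amending the view.geodata”.

```python
from functools import reduce


def resolve_feature_key(feature, feature_key):
    """Walk a dotted key through nested dicts, one segment per dot."""
    return reduce(lambda node, part: node[part], feature_key.split("."), feature)


def geodata_association(geodatum, feature_key, match_field=None):
    """Build an association; match_field is left out when not supplied."""
    association = {"geodatum": geodatum, "feature_key": feature_key}
    if match_field is not None:
        association["match_field"] = match_field
    return association


# A single geojson-style feature with a nested property.
feature = {"type": "Feature", "properties": {"postal-code": "90210"}}

association = geodata_association(
    "/api/geodata/abc123/",       # hypothetical geodatum URL
    "properties.postal-code",
)
# Assumed PATCH body shape for the variable entity.
patch_body = {"view": {"geodata": [association]}}

matched_value = resolve_feature_key(feature, association["feature_key"])
```

This simple resolver splits on every dot, so it would not handle a property whose own name contains a dot.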

discarded

Discarded is a boolean value indicating whether the variable has been removed from view in the dataset. Hiding a variable by setting discarded to true acts as a soft, restorable delete.

Default is false.

private

If true, the variable will not show in the common variable catalog; instead, it will be included in the personal variables catalog.

missing_reasons

An object whose keys are reason strings and whose values are the codes used for missing entries.

Crunch allows any entry in a column to be either a valid value or a missing code. Regardless of the variable type, missing codes are represented in the interface as an object with a single “?” key mapped to a single integer missing code. For example, a segment of [4.56, 9.23, {“?”: -1}] includes 2 valid values and 1 missing value.

For non-categorical variables, the missing codes map to a reason phrase via this “missing_reasons” member. Users may define their own missing reasons.

The “No Data” missing reason (whatever its code) will be used for default values when appending partial rows or collapsing timeseries data. All non-categorical variables must have a “No Data” missing reason; the Crunch system will add one if not supplied.

In the above example, the code of -1 would be looked up in a missing reasons map such as:

{
    "missing_reasons": {
        "No Data": -1,
        "type mismatch": -2,
        "my backup was corrupted": 1
    }
}

Categorical variables do not require a missing_reasons object because the categories array contains the information about missingness.
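
The lookup from missing code to reason described in this section can be sketched as follows. This is an illustration only; the function name is hypothetical, and it assumes the values and the missing reasons object have already been parsed from JSON.

```python
def describe_segment(segment, missing_reasons):
    """Label each entry as ("valid", value) or ("missing", reason)."""
    # missing_reasons maps reason -> code; invert it for lookup by code.
    code_to_reason = {code: reason for reason, code in missing_reasons.items()}
    described = []
    for entry in segment:
        if isinstance(entry, dict) and set(entry) == {"?"}:
            code = entry["?"]
            if code not in code_to_reason:
                raise ValueError(f"undefined missing code: {code}")
            described.append(("missing", code_to_reason[code]))
        else:
            described.append(("valid", entry))
    return described


missing_reasons = {
    "No Data": -1,
    "type mismatch": -2,
    "my backup was corrupted": 1,
}
segment = [4.56, 9.23, {"?": -1}]
described = describe_segment(segment, missing_reasons)
```

For the segment used in the text, the first two entries are labeled valid and the third resolves to “No Data”.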

Values

When creating a new variable, one can also include a “values” member that contains the data column corresponding to the variable metadata. See Importing Data: Column-by-column. This subsection outlines how the various variable types have their values formatted both when one supplies values to add to the dataset and when one requests values from a dataset.

Text

Text values are an array of quoted strings. Missing values are indicated as {"?": <integer>}, as discussed above, and all integer missing value codes must be defined in the “missing_reasons” object of the variable’s metadata.

Numeric

A “numeric” value will always be e.g. 500 (a number, without quotes) in the JSON request and response messages, not “500” (a string, with quotes). Missing values are handled as with text variables.
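
The distinction between 500 and “500” is easy to check after parsing a JSON payload. A small sketch, with hypothetical helper names, that flags entries in a numeric column that are neither JSON numbers nor missing-value objects:

```python
import json


def is_json_number(value):
    """True for parsed JSON numbers; bool is excluded because it subclasses int."""
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def is_missing_marker(value):
    """True for objects of the form {"?": <integer>}."""
    return (
        isinstance(value, dict)
        and set(value) == {"?"}
        and isinstance(value["?"], int)
        and not isinstance(value["?"], bool)
    )


def invalid_numeric_entries(values):
    """Return the entries that a numeric column should not contain."""
    return [v for v in values if not (is_json_number(v) or is_missing_marker(v))]


values = json.loads('[500, "500", 12.5, {"?": -1}]')
invalid = invalid_numeric_entries(values)
```

Only the quoted “500” is reported; the unquoted 500, the 12.5, and the missing marker pass.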

Categorical

Insert an array of integers that correspond to the ids of the variable’s categories. Only integers found in the category ids are allowed. That is, you cannot insert values for which there is no category metadata. It is, however, permitted to have categories defined for which there are no values.
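
The rule that every value must match a defined category id, while unused categories are allowed, can be checked before sending data. A minimal sketch using the “Party ID” categories from the first example; the function name is hypothetical:

```python
def unknown_category_ids(values, categories):
    """Return, sorted, the distinct values that match no category id."""
    valid_ids = {category["id"] for category in categories}
    return sorted({value for value in values if value not in valid_ids})


categories = [
    {"name": "Republican", "numeric_value": 1, "id": 1, "missing": False},
    {"name": "Democrat", "numeric_value": -1, "id": 2, "missing": False},
    {"name": "Independent", "numeric_value": 0, "id": 3, "missing": False},
]

# Category 3 is never used, which is permitted.
ok_values = [1, 2, 2, 1]
# The id 4 has no category metadata, so it is not permitted.
bad_values = [1, 4, 2, 4]

ok_result = unknown_category_ids(ok_values, categories)
bad_result = unknown_category_ids(bad_values, categories)
```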

Datetime

Datetime input and output are in ISO-8601 formatted strings.

resolution

Every datetime variable must have a resolution string that indicates the unit size of the datetime data. Valid values include “Y”, “M”, “D”, “h”, “m”, “s”, and “ms”.
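
As an illustration, a datetime definition with ISO-8601 values might be assembled as below. This is a sketch under assumptions: the helper name is hypothetical, the set of resolutions contains only the values listed above (the text says “include”, so others may exist), and the text does not say whether values need to be truncated to the resolution, so whole dates are sent at “D” resolution.

```python
from datetime import date

# Only the resolutions named in the text; the real list may be longer.
KNOWN_RESOLUTIONS = ("Y", "M", "D", "h", "m", "s", "ms")


def datetime_variable(name, resolution, moments):
    """Build a datetime definition whose values are ISO-8601 strings."""
    if resolution not in KNOWN_RESOLUTIONS:
        raise ValueError(f"unrecognized resolution: {resolution!r}")
    return {
        "type": "datetime",
        "name": name,
        "resolution": resolution,
        # date.isoformat() yields YYYY-MM-DD.
        "values": [moment.isoformat() for moment in moments],
    }


interview_date = datetime_variable(
    "Interview date", "D", [date(2020, 1, 15), date(2020, 2, 1)]
)
```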

Arrays

Crunch supports array type variables, which contain an array of subvariables. “Multiple response” and “Categorical array” are both arrays of categorical subvariables. Subvariables do not exist as independent variables; they are exposed as “virtual” variables in some places, and can be analyzed independently, but they do not have their own type or categories.

Arrays are currently always categorical, so they send and receive data in the same format: category ids. The only difference is that regular categorical variables send and receive one id per row, whereas arrays send and receive a list of ids per row (of equal length to the number of subvariables in the array).
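
The row shape for array values can be sketched as a simple check: each row is a list with one category id per subvariable. The function name and the problem labels below are hypothetical.

```python
def array_row_problems(rows, n_subvariables, categories):
    """Return (row index, problem) pairs for rows that break the array format."""
    valid_ids = {category["id"] for category in categories}
    problems = []
    for index, row in enumerate(rows):
        if len(row) != n_subvariables:
            problems.append((index, "wrong length"))
        elif any(value not in valid_ids for value in row):
            problems.append((index, "unknown category id"))
    return problems


categories = [
    {"name": "Category 1", "numeric_value": 1, "id": 1, "missing": False},
    {"name": "Category 2", "numeric_value": 0, "id": 2, "missing": False},
]

rows = [
    [1, 2, 1],   # one id per subvariable: fine
    [2, 2],      # only two ids for three subvariables
    [1, 3, 2],   # 3 is not a defined category id
]
problems = array_row_problems(rows, 3, categories)
```

With three subvariables, the second row is flagged for its length and the third for the undefined id 3.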

Variables

A complete Variable, then, is simply a Definition combined with its data array.
