Identifiers
Datasets, variables, and other resources are always identified by strings. All identifiers are case-sensitive, and may contain any unicode character, including spaces. Examples:
- “q1”
- “My really useful dataset”
- “变量”
Data Values
Individual data values follow the JSON representations where possible. JSON exposes the following types: number, string, array, object, true, false, null. Crunch adds additional types with special syntax (see Types, below). Examples:
- 13
- 45.330495
- “foo”
- [3, 4, 5]
- {“bar”: {“a”: [12.4, 89.2, 0]}}
- true
- null
- “2014-03-02T14:29:59Z”
Because a single JSON type may be used to represent multiple Crunch types, you should never rely on the JSON type to interpret the class of a datum. Instead, inspect the type object (see below) to interpret the data.
Missing values
Crunch provides a robust “missing entries” system. Anywhere a (valid) data value can appear, a missing value may also appear. Missing values are represented by an object with a single “?” key. The value is a missing integer code (see Missing reasons, below). Examples:
- {“?”: -1}
- {“?”: 24}
Arrays
A set of data values (and/or missing values) which are of the same type can be ordered in an array. All entries in an array are of the same Crunch type.
Examples:
- [13, 4, 5, {“?”: -2}, 7, 2]
- [“foo”, “bar”]
Enumerations
Some arrays, rather than repeating a small set of large values, benefit from storing a small integer code instead, moving the larger values they represent into the metadata, and doing lookups when needed to encode/decode. The “categorical” type is the most common example of this: rather than store an array of large string names like [“Internet Explorer”, “Internet Explorer”, “Firefox”, …] it instead stores integer codes like: [1, 1, 2], placing the longer strings in the metadata as type.categories = [{“id”: 1, “name”: “Internet Explorer”, …}, …]. We call this encoding process enumeration, and its opposite, where the coded are re-expanded into their original values, elaboration.
Enumeration also provides the opportunity to order the possible values, as well as include potential values which do not yet exist in the data array itself.
Enumeration typically causes the volume of data to shrink dramatically, and can speed up very common operations like filtering, grouping, and almost any data transfer. Because of this, it is common to:
- Enumerate a data array as early as possible. Indeed, when a variable can be enumerated, the fastest way to insert new data is to send the new values as the integer codes.
- Elaborate a data array as late as possible. As long as the metadata is shipped along with the enumerated data, the transfer size and therefore time is much smaller. Many cases do not even call for a complete elaboration of the entire column.
Variable Definitions
Crunch employs a structural type system rather than a nominative one. The variable definition includes more knowledge than just the type name (numeric, text, categorical, etc); we also learn details about range, precision, missing values and reasons, order, etc. For example:
{
"type": "categorical",
"name": "Party ID",
"description": "Do you consider yourself generally a Democrat, a Republican, or an Independent?",
"categories": [
{
"name": "Republican",
"numeric_value": 1,
"id": 1,
"missing": false
},
{
"name": "Democrat",
"numeric_value": -1,
"id": 2,
"missing": false
},
{
"name": "Independent",
"numeric_value": 0,
"id": 3,
"missing": false
}
]
}
This section describes the metadata of a variable as exposed across HTTP, both expected response values and valid input values.
Variable types
The “type” of a Variable is a string which defines the superset of values from which the variable may draw. The type governs not only the set of values but also their syntax. (See below.)
The following types are defined for public use:
- text
- numeric
- categorical
- datetime
- multiple_response
- categorical_array
Variable names
Variables in Crunch have multiple attributes that provide identifying information: “name”, “alias”, and “description”.
name
Crunch takes a principled stand that variable “names” should be for people, not for computers.
You may be used to domains that have variable “name”, “label”, and “description”. Name is some short, unique, machine-friendlier ID like “Q2”; label is short and human-friendly, something like “Brand awareness”, and description is where you might put question wording if you have survey data. Crunch has “alias”, “name”, and “description”. What you may be used to thinking of as a variable name, we consider as an alias: something for more internal use, not something appropriate for a polished dataset ready to share with people who didn’t create the dataset (See more in the “Alias” section below). In Crunch, the variable’s “name” is what you may be used to thinking of as a label.
All variables must have a name, and these names must be unique across all variables, including “hidden” variables (see below) but excluding subvariables (see “Subvariables” below). Within an array variable, subvariable names must be unique. (You can think of subvariable names within an array as being variable_name.subvariable_name, and with that approach, all “variable names” must be unique.)
Names must be a string of length greater than zero, and any valid unicode string is allowed. See “Identifiers” above.
alias
Alias is a string identifier for variables. It must be unique across all variables, including subvariables, such that it can be used as an identifier. This is what legacy statistical software typically calls a variable name.
Aliases have several uses. Client applications, such as those exposing a scripting interface, may want to use aliases as a more machine-friendly, yet still human-readable, way of referencing variables. Aliases may also be used to help line up variables across different import batches.
When creating variables via the API, alias is not a required field; if omitted, an alias will be generated. If an alias is supplied, it must be unique across all variables, including subvariables, and the new variable request will be rejected if the alias is not unique. When data are imported from file formats that have unique variable names, those names will in many cases be used as the alias in Crunch.
description
Description is an optional string that provides more information about the variable. It is displayed in the web application on variable summary cards and with analyses.
Type-specific attributes
These attributes must be present for the specified variable types when creating a variable, but they are not defined for other types.
categories
Categorical variables must contain an array of Category objects, each of which includes:
- id: a read-only integer identifier for the category. These correspond to the data values.
- name: the string name which applications should use to identify the category.
- numeric_value: the numeric value bound to each name. If no numeric value should be bound, this should be null. numeric_values need not be unique, and they may be null.
- missing: (optional) boolean indicating whether the data corresponding to this category should be interpreted as missing.
- selected: (optional) boolean indicating whether this category should be treated as a “true” value for logical operations. Defaults to false if omitted. Multiple response variables are essentially logical categorical arrays, and therefore must have at least one “selected” category. More than one Category may be marked “selected”.
Categories are valid if:
- Category names are unique within the set
- Category ids are unique within the set
- Category ids are integers less than 2 ** 63
The order of the array defines the order of the categories, and thus the order in which aggregate data will be presented. This order can be changed by saving a reordered set of Categories.
All categorical variables must have a missing category named “No Data”; the Crunch system will add one if not supplied. The “No Data” missing category (whatever its code) will be used for default values when appending partial rows or collapsing timeseries data.
subvariables
Multiple Response and Categorical Array variables contain an array of subvariable references. In the HTTP API, these are presented as URLs. To create a variable of type “multiple_response” or “categorical_array”, you must include a “subvariables” member with an array of subvariable references. These variables will become the subvariables in the new array variable.
Like Categories, the array of subvariables within an array variable indicate the order in which they are presented; to reorder them, save a modified array of subvariable ids/urls.
subreferences
Multiple Response and Categorical Array variables contain an object of subvariable “references”: names, alias, description, etc. To create a variable of type “multiple_response” or “categorical_array” directly, you must include a “subreferences” member with an object of objects. These label the subvariables in the new array variable.
The shape of each subreferences member must contain a name and optionally an alias. Note that the subreferences is an unordered object. The order of the subvariables is read from the “subvariables” attribute.
{
"type": "categorical_array",
"name": "Example array",
"categories": [
{
"name": "Category 1",
"numeric_value": 1,
"id": 1,
"missing": false
},
{
"name": "Category 2",
"numeric_value": 0,
"id": 2,
"missing": false
}
],
"subvariables": [
"/api/datasets/abcdef/variables/abc/subvariables/1/",
"/api/datasets/abcdef/variables/abc/subvariables/2/",
"/api/datasets/abcdef/variables/abc/subvariables/3/"
],
"subreferences": {
"/api/datasets/abcdef/variables/abc/subvariables/2/": {"name": "subvariable 2", "alias": "subvar2_alias"},
"/api/datasets/abcdef/variables/abc/subvariables/1/": {"name": "subvariable 1"},
"/api/datasets/abcdef/variables/abc/subvariables/3/": {"name": "subvariable 3"}
}
}
resolution
Datetime variables must have a resolution string that indicates the unit size of the datetime data. Valid values include “Y”, “M”, “D”, “h”, “m”, “s”, and “ms”. Every datetime variable must have a resolution.
Other definition attributes
These attributes may be supplied on variable creation, and they are included in API responses unless otherwise noted.
format
An object with various members to control the display of Variable data:
- data: An object with a “digits” member, stating how many digits to display after the decimal point.
- summary: An object with a “digits” member, stating how many digits to display after the decimal point.
view
An object with various members to control the display of Variable data:
- show_codes: For categorical types only. If true, numeric values are shown.
- show_counts: If true, show counts; if false, show percents.
- include_missing: For categorical types only. If true, include missing categories.
- include_noneoftheabove: For multiple-response types only. If true, display a “none of the above” category in the requested summary or analysis.
- geodata: A list of associations of a variable to Crunch geodatm entities. PATCH a variable entity amending the view.geodata in order to create, modify, or remove an association. An association is an object with required keys geodatum, feature_key, and optional match_field. The geodatum must exist; feature_key is the name of the property of each ‘feature’ in the geojson/topojson that corresponds to the match_field of the variable (perhaps a dotted string for nested properties; e.g. ”properties.postal-code”). By default, match_field is “name”: a categorical variable will match category names to the feature_key present in the given geodatum.
discarded
Discarded is a boolean value indicating whether the variable should be viewed as part of the dataset. Hiding variables by setting discarded to True is like a soft, restorable delete method.
Default is false.
private
If true, the variable will not show in the common variable catalog; instead, it will be included in the personal variables catalog.
missing_reasons
An object whose keys are reason strings and whose values are the codes used for missing entries.
Crunch allows any entry in a column to be either a valid value or a missing code. Regardless of the class, missing codes are represented in the interface as an object with a single “?” key mapped to a single missing integer code. For example, a segment of [4.56, 9.23, {“?”: -1}] includes 2 valid values and 1 missing value.
For non-categorical variables, the missing codes map to a reason phrase via this “missing reasons” type member. Users may define their own missing reasons.
The “No Data” missing reason (whatever its code) will be used for default values when appending partial rows or collapsing timeseries data. All non-categorical variables must have a “No Data” missing reason; the Crunch system will add one if not supplied.
In the above example, the code of -1 would be looked up in a missing reasons map such as:
{
"missing reasons": {
"No Data": -1,
"type mismatch": -2,
"my backup was corrupted": 1
}
}
See the Endpoint Reference for user-defined missing reasons.
Categorical variables do not require a missing_reasons object because the categories array contains the information about missingness.
Values
When creating a new variable, one can also include a “values” member that contains the data column corresponding to the variable metadata. See Importing Data: Column-by-column. This subsection outlines how the various variable types have their values formatted both when one supplies values to add to the dataset and when one requests values from a dataset.
Text
Text values are an array of quoted strings. Missing values are indicated as {"?": <integer>}, as discussed above, and all integer missing value codes must be defined in the “missing_reasons” object of the variable’s metadata.
Numeric
A “numeric” value will always be e.g. 500 (a number, without quotes) in the JSON request and response messages, not “500” (a string, with quotes). Missing values are handled as with text variables.
Categorical
Insert an array of integers that correspond to the ids of the variable’s categories. Only integers found in the category ids are allowed. That is, you cannot insert values for which there is no category metadata. It is, however, permitted to have categories defined for which there are no values.
Datetime
Datetime input and output are in ISO-8601 formatted strings.
Arrays
Crunch supports array type variables, which contain an array of subvariables. “Multiple response” and “Categorical array” are both arrays of categorical subvariables. Subvariables do not exist as independent variables; they are exposed as “virtual” variables in some places, and can be analyzed independently, but they do not have their own type or categories.
Arrays are currently always categorical, so they send and receive data in the same format: category ids. The only difference is that regular categorical variables send and receive one id per row, where arrays send and receive a list of ids (of equal length to the number of subvariables in the array).
Variables
A complete Variable, then, is simply a Definition combined with its data array.
Expressions
Crunch expressions are used to compute on a dataset, to do nuanced selects, updates, and deletes, and to accomplish many other routine operations. Expressions are JSON objects in which each term is wrapped in an object which declares whether the term is a variable, a value, or a function, etc. While verbose, doing so allows us to be more explicit about the operations we wish to do.
Expressions generally contain references to variables, values, or columns of values, often composed in functions. The output of expressions can be other variables, values, or cube aggregations, depending on the context and expression content. Some endpoints have special semantics, but the general structure of the expressions follows the model described below.
Variable terms
Terms refer to variables when they include a “variable” member. The value is the URL for the desired variable. For example:
- {"variable": "../variables/X/"}
- {"variable": "https://[workspace].crunch.io/api/datasets/48ffc3/joins/abcd/variables/Y/"}
URLs must either be absolute or relative to the URL of the current request. For example, to refer to a variable in a query at https://[workspace].crunch.io/api/datasets/48ffc3/cube/, a variable at https://[workspace].crunch.io/api/datasets/48ffc3/variables/9410fc/ may be referenced by its full URL or by “../variables/9410fc/”.
Value terms
Terms refer to data values when they include a “value” member. Its value is any individual data value; that is, a value that is addressable by a column and row in the dataset. For example:
- {"value": 13}
- {"value": [3, 4, 5]}
Column terms
Terms refer to columns (construct them, actually) when they include a “column” member. The value is an array of data values. You may include “type” and/or “references” members as well.
- {"column": [1, 2, 3, 17]}
- {"column": [{"?": -2}, 1, 4, 1], "type": {"class": "categorical", "categories": [...], ...}}
Function terms
Terms refer to functions (and operators) when they include a “function” member. The value is the identifier for the desired function. They parameterize the function with an “args” member, whose value is an array of terms, one for each argument. Examples:
- {"function": "==", "args": [{"variable": "../variables/X/"}, {"value": 13}]}
- {"function": "contains", "args": [{"variable": "../joins/abcd/variables/Y/"}, {"value": "foo"}]}
You may include a “references” member to provide a name, alias, description, etc to the output of the function.
The following is a list of all supported functions available for crunch expressions. Note that these functions can be used in conjuction to compose an expression.
Array functions
- array — returns the given Frame as an array. The type of each variable in the Frame must be close enough to form a supertype for the array.
- get — returns a subvariable from the given column.
- subvariables — return a Frame containing subvariables of the given array.
- tiered — return a variable formed by collapsing the given array’s subvariables in the given category tiers.
- complete_cases — returns a logical column, with “selected” for each row where all of the input variables are valid, and “missing” where they are all missing.
Binary functions
In general, these operate only on “numeric” types.
- + — add
- - — subtract
- * — multiply
- / — div divide
- // — floor division
- ^ — power
- % — modulus
- & — bitwise and
- | — bitwise or
- ~ — invert
Logic functions
These all return a “logical” categorical column with just three categories: one is marked “selected”, one is marked “missing”, and the other is neither.
User interfaces may use the presence of a “selected” category to decide to reduce analyses to only show the “selected” category.
Any “logical” column can be used as a filter expression; rows which result in a “selected” value will match the filter, and those which are “missing” or “other” will not.
- == — equals. Exact matches will return “selected”. For non-matching values, if either input term is missing, the result is missing. Otherwise, the result is “other”.
- != — not equals. Exact matches will return “other”. For non-matching values, if either input term is missing, the result is missing. Otherwise, the result is “selected”. This is the same result as not(==).
- =><= — between
- between — between
- < — less than
- > — greater than
- <= — less than or equal
- >= — greater than or equal
- in — “selected” for each row where A is an element of array B, or a key of object B.
- all — “selected” for each row where all subvariables in a multiple_response array are selected.
- any — “selected” for each row where any subvariable in a multiple_response array is selected.
- is_none_of_the_above — “selected” for each row where no subvariables in a multiple_response array are selected, unless all subvariables have missing values.
- contains — “selected” for each row where text value A is a substring of text value B.
- icontains — Case-insensitive version of ‘contains’.
- ~= — compare against regular expression (regex).
- and — logical and. A “selected” value and “selected” results in “selected”. If either input term is missing, the result is missing. Otherwise, the result is “other”.
- or — logical or. If either input term is “selected”, the result is “selected”. A “missing” value or “missing” results in “missing”. Otherwise, the result is “other”.
- not — logical not; this is the “relative complement”–any missing values will remain missing.
- not_selected — logical not; this is the “absolute complement”–any missing values will become “selected”.
- selected — returns “selected” only for “selected” categories; “other” and “missing” values will become “other”.
- is_valid — logical array of rows which are valid for the given column.
- is_missing — logical array of rows which are missing for the given column.
- any_missing — logical array of rows where any of the subvariables are missing.
- all_valid — logical array of rows where all of the subvariables are valid.
- all_missing — logical array of rows where all of the subvariables are missing.
Selection functions
- as_selected — returns the given variable reduced to the [1, 0, -1] “logical” categories.
- selected_array — returns a bool array from the given categorical, plus None/__none__/__any__
- selected_depth — returns a numeric column containing the number of selected categories in each row of the given array.
- selections — returns the given array, reduced to the [1, 0, -1] “logical” categories, plus an __any__ magic subvariable.
Miscellaneous functions
- bin — returns column’s values broken into equidistant bins.
- case — evaluates the given conditions in order, selecting the corresponding choice.
- cast — returns a Column of column’s values cast to the given type.
- char_length — returns the length of each string (or missing reason) in the given column.
- copy_variable — returns a copy of the column with a copy of its metadata.
- combine_categories — returns a column of categories combined according to the category_info.
- combine_responses — combines the given categorical variables into a new one.
- lookup — maps each row of source through its keys index to a corresponding value.
- missing — returns the given column as missing for the given reason.
- normalize — returns a column with the given values normalized so sum(c) == len(c).
- row — returns a numeric column with row indices.
- typeof — returns (a copy of) the type of the given column.
- unmissing — returns the given column with user missing replaced by valid values.
- alter_categories — returns a categorical column with the same values but updates the metadata of the input variable. Only allowed changes are categories names and order and subvariables names.
{
"function": "alter_categories",
"args": [{"value": {
"categories": [
{"id": "<cat_id>", "name": "<new name>"},
// ... repeat for each category to be renamed
],
"order": [1, 2, 3, 4],
"subvariables": [
{"id": "<subvar_id>", "name": "<new name>"},
// ... repeat for each subvariable to be renamed
]
}}]
}
Date functions
- default_rollup_resolution — default_rollup_resolution
- datetime_to_numeric — converts the given datetime column to numeric.
- format_datetime — converts datetime values to strings using the fmt as strftime mask.
- numeric_to_datetime — converts the given numeric column to datetime with the given resolution.
- parse_datetime — parses string to datetime using optional format string.
- rollup — returns column’s values (which must be type datetime) into calendrical bins.
- difftime — takes two columns as input and an optional resolution value and returns a numeric column representing the number of units diference between the two columns.
- now — has no parameters and returns the current time on the server.
- datetime — given a set of numeric columns (year, day, month), returns a datetime variable. Optionally add hours, minutes, and seconds. Note that seconds is a floating point number with no more than 6 decimal places.
Frame Functions
- page — returns the given frame, limited/offset by the given values.
- make_frame or select — returns a frame built from the given map of variables, optionally specifying the frame’s variable order.
- frame_subset — returns a frame NOT including the indicated variables.
- sheet — returns the given frame, limited/offset in the number of variables.
- dependents — returns the given frame with only dependents of the given variable.
- adapt — returns the given frame adapted to the given to_key.
- join — returns a JoinedFrame from the given list of subframes.
- find — returns a Frame with those variables which match the given criteria.
- flatten — returns a frame including all variables, plus all subvariables as dotted ids.
{
"function": "make_frame",
"args": [
{
"map": {
<destination id>: <variable expression>,
<destination id>: <variable expression>,
...
}
},
{"value": ["destination id 2", "destination id 1", ...]}
]
}
- make_frame or select: Receives as its first argument a map expression where destination id is the ID that the mapped variable will have on the resulting frame and where variable expression` can be a reference to an existing variable from the source frame ({"variable": "source frame id"}) or a completely new variable that doesn’t exist on the source frame. An optional second argument defines the order in which variables must be arranged in the resulting frame.
{
"function": "frame_subset",
"args": [
// the frame to subset
{"frame": "unsecured"},
// the IDs of variables to explicitly include
{"value": null},
// the IDs of variables to explicitly exclude
{"value": ["000001", "000009", "001874", ...]}
]
}
- frame_subset: Same as select but will exclude the passed variable IDs from the source frame.
Measures functions
- cube_count
- cube_distinct_count
- cube_max — a measure which returns the maximum value in a column.
- cube_mean
- cube_min — a measure which returns the minimum value in a column.
- cube_missing_frequencies — returns an object with parallel ‘code’ and ‘count’ arrays.
- cube_quantile
- cube_stddev — a measure which returns the standard deviation value in a column.
- cube_sum
- cube_valid_count
- cube_weighted_max
- cube_weighted_min
- top — returns the given (1D/1M) cube, filtered to its top N members.
Cube functions
- autocube — returns a cube crossing A by B (which may be None).
- autofreq — returns a cube of frequencies for A.
- cube — returns a Cube instance from the given arguments.
- each — yields one expression result per item in the given iterable.
- multitable — returns cubes for each target variable crossed by None + each template variable.
- transpose — transposes the given cube, rearranging its (0-based) axes to the given order.
- stack — returns a cube of 1 more dimension formed by stacking the given array.
Filter terms
Terms that refer to filters entities by URL are shorthand for the logical expression stored in the entity. So, {"filter": "../filters/59fc4d/"} yields the Crunch expression contained in the Filter entity’s “expression” attribute. Filter terms can be combined together with other expressions as well. For example, {"function": "and", "args": [{"filter": "../filters/59fc4d/"}, {"function": "==", "args": [{"variable": "../variables/X/"}, {"value": 13}]}]} would “and” together the logical expression in filter 59fc4d with the X == 13 expression.