Most of the other representations returned from the API are Crunch Objects. They are built with JSON, so any JSON parser should be able to at least deserialize Crunch documents. Crunch adds two document types: Table and Cube.
Table
Tables collect columns of data and (optionally) their metadata into two-dimensional relations.
{
"element": "crunch:table",
"self": "https://.../api/datasets/.../table/?limit=7",
"description": "The data belonging to this Dataset.",
"metadata": {
"1ef0455": {"name": "Education", "type": "categorical", "categories": [...], ...},
"588392a": {"name": "Favorite color", "type": "text", ...}
},
"data": {
"1ef0455": [6, 4, 7, 7, 3, 2, 1],
"588392a": ["green", "red", "blue", "Red", "RED", "pink", " red"]
}
}
Each key in the “data” member is a variable identifier, and its corresponding value is a column of Crunch data values. The data values in a given column are homogeneous, but across columns they are heterogeneous. The lengths of all columns MUST be the same. The “metadata” member is optional; if given, it MUST contain matching keys that correspond to variable definitions.
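The two MUST rules above can be checked on the client before a table is sent. The following is a minimal sketch, not part of the Crunch API: the helper name validate_table is hypothetical, and the metadata in the sample is trimmed to the fields shown in full above.

```python
def validate_table(table):
    """Return a list of problems found in a crunch:table-shaped dict."""
    problems = []
    data = table.get("data", {})

    # All columns MUST have the same length.
    lengths = {var_id: len(column) for var_id, column in data.items()}
    if len(set(lengths.values())) > 1:
        problems.append(f"column lengths differ: {lengths}")

    # "metadata" is optional; if given, every data key needs a definition.
    metadata = table.get("metadata")
    if metadata is not None:
        undefined = sorted(set(data) - set(metadata))
        if undefined:
            problems.append(f"no metadata for: {undefined}")
    return problems


table = {
    "element": "crunch:table",
    "metadata": {
        "1ef0455": {"name": "Education", "type": "categorical"},
        "588392a": {"name": "Favorite color", "type": "text"},
    },
    "data": {
        "1ef0455": [6, 4, 7, 7, 3, 2, 1],
        "588392a": ["green", "red", "blue", "Red", "RED", "pink", " red"],
    },
}
print(validate_table(table))  # prints []
```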
Like any JSON object, the “data” and “metadata” objects are explicitly unordered. When supplying a crunch:table, such as when POSTing to datasets/ to create a new dataset, you must supply an Order if you want an explicit variable order.
Table for POST
The shape of a crunch:table supplied in a POST is slightly different from what one would GET from the dataset/{id}/table/ endpoint. The major differences are:
- Variables are keyed by alias instead of by id or URL
- Array subvariables are a list, not a keyed object (and the list contains information about the subreference in the form {"alias": "subvar_alias", "name": "Subvariable name", ...})
The following is an example payload that includes each variable type that Crunch has. POSTing this to datasets/ will create an empty dataset with these variables as specified.
{
"element": "shoji:entity",
"body": {
"name": "A dataset for demoing crunch:table",
"description": "A set of a number of different variable types to show how to create crunch:table",
"table": {
"element": "crunch:table",
"metadata": {
"food_groups": {
"alias": "food_groups",
"name": "Food groups",
"type": "categorical",
"notes": "A categorical variable where the missing categories are interspersed throughout the non-missing categories",
"description": "Four of the five USDA food groups",
"categories": [
{
"numeric_value": 0,
"id": 0,
"name": "Vegetables",
"missing": false
},
{
"numeric_value": 32766,
"id": 32766,
"name": "Don't know",
"missing": true
},
{
"numeric_value": 2,
"id": 2,
"name": "Fruit",
"missing": false
},
{
"numeric_value": 5,
"id": 5,
"name": "Grain",
"missing": false
},
{
"numeric_value": null,
"id": -1,
"name": "No Data",
"missing": true
},
{
"numeric_value": 4,
"id": 4,
"name": "Meat",
"missing": false
},
{
"numeric_value": 32767,
"id": 32767,
"name": "Not asked",
"missing": true
}
]
},
"abolitionists": {
"description": "Do you have a favorable or an unfavorable opinion of the following:",
"name": "Abolitionists",
"notes": "A categorical array variable, where one item has no responses",
"alias": "abolitionists",
"type": "categorical_array",
"subvariables": [
{
"alias": "brown",
"name": "John Brown"
},
{
"alias": "douglass",
"name": "Frederick Douglass"
}
],
"categories": [
{
"numeric_value": 0,
"selected": false,
"id": 0,
"missing": false,
"name": "Very favorable"
},
{
"numeric_value": 2,
"selected": false,
"id": 2,
"missing": false,
"name": "Somewhat favorable"
},
{
"numeric_value": 3,
"selected": false,
"id": 3,
"missing": false,
"name": "Somewhat unfavorable"
},
{
"numeric_value": 4,
"selected": false,
"id": 4,
"missing": false,
"name": "Very unfavorable"
},
{
"numeric_value": 5,
"selected": false,
"id": 5,
"missing": false,
"name": "Don't know"
},
{
"numeric_value": 32766,
"selected": false,
"id": 32766,
"missing": true,
"name": "skipped"
},
{
"numeric_value": 32767,
"selected": false,
"id": 32767,
"missing": true,
"name": "not asked"
},
{
"numeric_value": null,
"selected": false,
"id": -1,
"missing": true,
"name": "No Data"
}
]
},
"nordics": {
"description": "Which of the following Nordic countries have you visited? (select all that apply)",
"uniform_basis": false,
"name": "Nordic countries",
"notes": "A multiple response variable",
"alias": "nordics",
"type": "multiple_response",
"subvariables": [
{
"alias": "se",
"name": "Sweden"
},
{
"alias": "fi",
"name": "Finland"
},
{
"alias": "dk",
"name": "Denmark"
},
{
"alias": "no",
"name": "Norway"
},
{
"alias": "is",
"name": "Iceland"
}
],
"categories": [
{
"numeric_value": 1,
"selected": true,
"id": 1,
"missing": false,
"name": "selected"
},
{
"numeric_value": 2,
"missing": false,
"id": 2,
"name": "not selected"
},
{
"numeric_value": 32766,
"missing": true,
"id": 32766,
"name": "skipped"
},
{
"numeric_value": 32767,
"missing": true,
"id": 32767,
"name": "not asked"
},
{
"numeric_value": null,
"missing": true,
"id": -1,
"name": "No Data"
}
]
},
"age": {
"alias": "age",
"name": "age",
"type": "numeric",
"notes": "",
"description": "Age — with a missing rule that people over 80 are marked as missing.",
"missing_reasons": {
"No Data": -1,
"over 80": 1
},
"missing_rules": {
"over 80": {
"range": [
80,
150
],
"inclusive": [
false,
false
]
}
}
},
"endtime": {
"alias": "endtime",
"resolution": "s",
"name": "endtime",
"type": "datetime",
"notes": "",
"description": "Questionnaire End Time",
"missing_reasons": {
"No Data": -1
}
},
"weight": {
"alias": "weight",
"name": "weight",
"type": "numeric",
"notes": "",
"description": "weight",
"missing_reasons": {
"No Data": -1
}
},
"identity_text": {
"alias": "identity_text",
"name": "identity (text)",
"type": "text",
"notes": "",
"description": "Identity",
"missing_reasons": {
"No Data": -1
}
},
"long_pastas": {
"alias": "long_pastas",
"name": "Long pasta",
"type": "categorical",
"notes": "",
"description": "Is the pasta a long one?",
"categories": [
{
"numeric_value": 1,
"selected": true,
"id": 1,
"missing": false,
"name": "Long"
},
{
"numeric_value": 0,
"id": 0,
"name": "Not long",
"missing": false
},
{
"numeric_value": null,
"id": -1,
"name": "No Data",
"missing": true
}
]
}
},
"order": [
"food_groups",
"abolitionists",
"nordics",
"age",
"endtime",
"weight",
"identity_text",
"long_pastas"
]
}
}
}
Variable by variable
Each variable type that Crunch supports is included in the example above. This section steps through each type and shows some additional options. For every variable, name and type are required; other attributes (e.g., notes, description) are optional and can be omitted if they aren’t being set. If you supply an alias, it must be identical to the key for that variable. For more information about the attributes available for each variable, see the Variable Definitions section and the Variables endpoint.
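The requirements in the paragraph above (name and type present, alias equal to the key) can be checked before POSTing. This is a sketch only; check_definitions is a hypothetical helper name, and the server remains the authority on what it accepts.

```python
def check_definitions(metadata):
    """Return a list of problems in a POST-style "metadata" object."""
    problems = []
    for key, definition in metadata.items():
        # name and type are required for every variable.
        for required in ("name", "type"):
            if required not in definition:
                problems.append(f"{key}: missing required '{required}'")
        # alias is optional, but if present it must equal the key.
        alias = definition.get("alias")
        if alias is not None and alias != key:
            problems.append(f"{key}: alias '{alias}' does not match key")
    return problems


metadata = {
    "weight": {"alias": "weight", "name": "weight", "type": "numeric"},
    "age": {"name": "age", "type": "numeric"},  # alias omitted: allowed
}
print(check_definitions(metadata))  # prints []
```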
Categorical
Each categorical variable must include a list of categories in its categories key. For each category, name and id are required; if omitted, missing defaults to false and numeric_value defaults to null. To specify missing (or a non-null numeric value), include them in the payload (see the snippet below for examples).
"food_groups": {
"alias": "food_groups",
"name": "Food groups",
"type": "categorical",
"notes": "A categorical variable where the missing categories are interspersed throughout the non-missing categories",
"description": "Four of the five USDA food groups",
"categories": [
{
"numeric_value": 0,
"id": 0,
"name": "Vegetables",
"missing": false
},
{
"numeric_value": 32766,
"id": 32766,
"name": "Don't know",
"missing": true
},
{
"numeric_value": 2,
"id": 2,
"name": "Fruit",
"missing": false
},
{
"numeric_value": 5,
"id": 5,
"name": "Grain",
"missing": false
},
{
"numeric_value": null,
"id": -1,
"name": "No Data",
"missing": true
},
{
"numeric_value": 4,
"id": 4,
"name": "Meat",
"missing": false
},
{
"numeric_value": 32767,
"id": 32767,
"name": "Not asked",
"missing": true
}
]
}
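To make the category defaults concrete, the sketch below mirrors them on the client side. The helper name with_category_defaults is hypothetical; the server applies these defaults itself, so this is only an illustration of the rule.

```python
def with_category_defaults(category):
    """Return a copy of a category with the documented defaults filled in."""
    for required in ("name", "id"):
        if required not in category:
            raise ValueError(f"category is missing required key '{required}'")
    # missing defaults to false and numeric_value to null when omitted;
    # values supplied in the category override the defaults.
    return {"missing": False, "numeric_value": None, **category}


print(with_category_defaults({"id": 2, "name": "Fruit"}))
# prints {'missing': False, 'numeric_value': None, 'id': 2, 'name': 'Fruit'}
```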
Numeric
Numeric variables can include missing reasons and missing rules. See the missing reasons section for more information about both.
"age": {
"alias": "age",
"name": "age",
"type": "numeric",
"notes": "",
"description": "Age — with a missing rule that people over 80 are marked as missing.",
"missing_reasons": {
"No Data": -1,
"over 80": 1
},
"missing_rules": {
"over 80": {
"range": [
80,
150
],
"inclusive": [
false,
false
]
}
}
}
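One way to read the missing rule above is as an interval test. The sketch below assumes that each entry in "inclusive" says whether the corresponding bound in "range" itself belongs to the interval; that interpretation is an assumption, not something this section states, and missing_reason_for is a hypothetical helper name.

```python
def missing_reason_for(value, missing_rules):
    """Return the name of the first rule whose range contains value, else None."""
    for reason, rule in missing_rules.items():
        low, high = rule["range"]
        low_inclusive, high_inclusive = rule["inclusive"]
        # Assumed semantics: an inclusive flag admits the bound itself.
        above = value >= low if low_inclusive else value > low
        below = value <= high if high_inclusive else value < high
        if above and below:
            return reason
    return None


missing_rules = {"over 80": {"range": [80, 150], "inclusive": [False, False]}}
print(missing_reason_for(85, missing_rules))  # prints over 80
print(missing_reason_for(80, missing_rules))  # prints None
```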
Text
"identity_text": {
"alias": "identity_text",
"name": "identity (text)",
"type": "text",
"notes": "",
"description": "Identity",
"missing_reasons": {
"No Data": -1
}
}
Datetime
"endtime": {
"alias": "endtime",
"resolution": "s",
"name": "endtime",
"type": "datetime",
"notes": "",
"description": "Questionnaire End Time",
"missing_reasons": {
"No Data": -1
}
}
Logical
Logical variables (also known as 3-value-logic variables) are similar to categorical variables, with the exception that (at least) one of their categories has the attribute "selected": true.
"long_pastas": {
"alias": "long_pastas",
"name": "Long pasta",
"type": "categorical",
"notes": "",
"description": "Is the pasta a long one?",
"categories": [
{
"numeric_value": 1,
"selected": true,
"id": 1,
"missing": false,
"name": "Long"
},
{
"numeric_value": 0,
"id": 0,
"name": "Not long",
"missing": false
},
{
"numeric_value": null,
"id": -1,
"name": "No Data",
"missing": true
}
]
}
Categorical array
Categorical arrays have categories just like a categorical, but additionally have a subvariables key with the names and aliases of each of the items/subvariables.
"abolitionists": {
"description": "Do you have a favorable or an unfavorable opinion of the following:",
"name": "Abolitionists",
"notes": "A categorical array variable, where one item has no responses",
"alias": "abolitionists",
"type": "categorical_array",
"subvariables": [
{
"alias": "brown",
"name": "John Brown"
},
{
"alias": "douglass",
"name": "Frederick Douglass"
}
],
"categories": [
{
"numeric_value": 0,
"selected": false,
"id": 0,
"missing": false,
"name": "Very favorable"
},
{
"numeric_value": 2,
"selected": false,
"id": 2,
"missing": false,
"name": "Somewhat favorable"
},
{
"numeric_value": 3,
"selected": false,
"id": 3,
"missing": false,
"name": "Somewhat unfavorable"
},
{
"numeric_value": 4,
"selected": false,
"id": 4,
"missing": false,
"name": "Very unfavorable"
},
{
"numeric_value": 5,
"selected": false,
"id": 5,
"missing": false,
"name": "Don't know"
},
{
"numeric_value": 32766,
"selected": false,
"id": 32766,
"missing": true,
"name": "skipped"
},
{
"numeric_value": 32767,
"selected": false,
"id": 32767,
"missing": true,
"name": "not asked"
},
{
"numeric_value": null,
"selected": false,
"id": -1,
"missing": true,
"name": "No Data"
}
]
}
Multiple Response
Multiple response variables are very similar to the categorical array variables above. The main difference is that (at least) one of the categories has the attribute "selected": true.
"nordics": {
"description": "Which of the following Nordic countries have you visited? (select all that apply)",
"uniform_basis": false,
"name": "Nordic countries",
"notes": "A multiple response variable",
"alias": "nordics",
"type": "multiple_response",
"subvariables": [
{
"alias": "se",
"name": "Sweden"
},
{
"alias": "fi",
"name": "Finland"
},
{
"alias": "dk",
"name": "Denmark"
},
{
"alias": "no",
"name": "Norway"
},
{
"alias": "is",
"name": "Iceland"
}
],
"categories": [
{
"numeric_value": 1,
"selected": true,
"id": 1,
"missing": false,
"name": "selected"
},
{
"numeric_value": 2,
"missing": false,
"id": 2,
"name": "not selected"
},
{
"numeric_value": 32766,
"missing": true,
"id": 32766,
"name": "skipped"
},
{
"numeric_value": 32767,
"missing": true,
"id": 32767,
"name": "not asked"
},
{
"numeric_value": null,
"missing": true,
"id": -1,
"name": "No Data"
}
]
}
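The "selected" requirement can be verified with a short check before sending a definition. This is a sketch with a hypothetical helper name, selected_category_ids; the sample definitions are trimmed versions of the ones above.

```python
def selected_category_ids(definition):
    """Return the ids of categories flagged with "selected": true."""
    return [
        category["id"]
        for category in definition.get("categories", [])
        if category.get("selected", False)
    ]


nordics = {
    "type": "multiple_response",
    "categories": [
        {"id": 1, "name": "selected", "selected": True},
        {"id": 2, "name": "not selected"},
        {"id": -1, "name": "No Data", "missing": True},
    ],
}
abolitionists = {
    "type": "categorical_array",
    "categories": [
        {"id": 0, "name": "Very favorable", "selected": False},
        {"id": -1, "name": "No Data", "selected": False, "missing": True},
    ],
}
print(selected_category_ids(nordics))        # prints [1]
print(selected_category_ids(abolitionists))  # prints []
```

An empty result for a definition of type multiple_response would indicate that no category carries the "selected": true attribute.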
Cube
Cubes have both input and output formats. The “crunch:cube” element is used for the output only.
Cube input
The input format may vary slightly according to the API endpoint (since some parameters may be inherent in the particular resource), but involves the same basic ingredients.
Example:
{
"dimensions": [
{"variable": "datasets/ab8832/variables/3ffd45/"},
{"function": "+", "args": [{"variable": "datasets/ab8832/variables/2098f1/"}, {"value": 5}]}
],
"measures": {
"count": {"function": "cube_count", "args": []}
}
}
dimensions
An array of input expressions. Each expression contributes one dimension to the output cube. The only exception is when a dimension results in a boolean (true/false) column, in which case the data are filtered by it as a mask instead of adding a dimension to the output.
When a dimension is added, the resulting axis consists of distinct values rather than all values. Variables which are already “categorical” or “enumerated” will simply use their “categories” or “elements” as the extent. Other variables form their extents from their distinct values.
measures
A set of cube functions to populate each cell of the cube. You can request multiple functions over the same dimensions (such as “cube_mean” and “cube_stddev”) or more commonly just one (like “cube_count”). Each member MUST be a ZZ9 cube function designed for the purpose. See ZZ9 User Guide:Cube Functions for a list of such functions and their arguments.
filters
An array containing references to filters that need to be applied to the dataset before starting the cube calculations. It can be an empty array or null, in which case no filtering will be applied.
weight
A reference to a variable to be used as the weight on all cube operations.
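Putting the four members together, a cube input document can be assembled as an ordinary dictionary and serialized to JSON. The sketch below is an assumption-laden illustration: cube_query is a hypothetical helper, it only builds plain variable dimensions, and it omits "filters" and "weight" unless the caller supplies them, as the example above does.

```python
import json


def cube_query(dimension_urls, measures=None, filters=None, weight=None):
    """Build a cube input document from variable URLs (illustrative only)."""
    query = {
        # Each expression contributes one dimension to the output cube.
        "dimensions": [{"variable": url} for url in dimension_urls],
        # Default to the common case: a single count measure.
        "measures": measures
        or {"count": {"function": "cube_count", "args": []}},
    }
    if filters is not None:
        query["filters"] = filters
    if weight is not None:
        query["weight"] = weight
    return query


query = cube_query(["datasets/ab8832/variables/3ffd45/"])
print(json.dumps(query, indent=2))
```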
Cube output
Cubes collect columns of measure data in an arbitrary number of dimensions. Multiple measures in the same cube share dimensions, effectively overlaying each other. For example, a cube might contain a “count” measure and a “mean” measure with the same shape:
{
"element": "crunch:cube",
"n": 210,
"missing": 12,
"dimensions": [
{"references": {"name": "A", ...}, "type": {"class": "categorical", "categories": [{"id": 1, ...}, {"id": 2, ...}, {"id": 3, ...}]}},
{"references": {"name": "B", ...}, "type": {"class": "categorical", "categories": [{"id": 11, ...}, {"id": 12, ...}]}}
],
"measures": {
"count": {
"metadata": {"references": {}, "type": {"class": "numeric", "integer": true, ...}},
"data": [10, 20, 30, 40, 50, 60],
"n_missing": 12
},
"mean": {
"metadata": {"references": {}, "type": {"class": "numeric", ...}},
"data": [3.5, 17.8, 9.9, 7.32, 0, 23.4],
"n_missing": 12
}
},
"margins": {
"data": [210],
"0": {"data": [30, 70, 110]},
"1": {"data": [90, 120]}
}
}
dimensions
The “dimensions” member is the most straightforward: an array of variable Definition objects. Each one defines an axis of the cube’s output. This may be different from the input dimensions’ definitions. For example, when counting numeric variables, the input dimension might be an expression involving the bin builtin function. Even though the input variable is of type “numeric”, the output dimension would be of type “enum”.
n
The number of rows considered for all measures.
measures
The “measures” member includes one object for each measure. The “metadata” member of each tells you the name, type and other definitions of the measure. The “data” member of each is a flattened array of values for that measure; the dimensions stride into that array in order, with the last dimension varying the fastest. In the example above, the first dimension (“A”) has 3 categories, while “B” has 2; therefore, the “flat” array [10, 20, 30, 40, 50, 60] for the “count” measure is interpreted as the “unflattened” array [[10, 20], [30, 40], [50, 60]]. Graphically:
    | B:11 | B:12 |
A:1 | 10   | 20   |
A:2 | 30   | 40   |
A:3 | 50   | 60   |
This is known in NumPy and other domains as “C order” (versus “Fortran order” which would be interpreted as [[10, 30, 50], [20, 40, 60]] instead).
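The C-order interpretation can be reproduced without NumPy. The following sketch uses a hypothetical helper, unflatten, and takes the shape (3 categories for “A”, 2 for “B”) directly from the example above.

```python
from math import prod


def unflatten(flat, shape):
    """Reshape a flat list into nested lists in C order (last axis fastest)."""
    if not shape:
        raise ValueError("shape must have at least one dimension")
    if prod(shape) != len(flat):
        raise ValueError("shape does not match the number of values")
    if len(shape) == 1:
        return list(flat)
    # Each slice along the first axis holds prod(shape[1:]) consecutive values.
    step = prod(shape[1:])
    return [
        unflatten(flat[i * step:(i + 1) * step], shape[1:])
        for i in range(shape[0])
    ]


counts = unflatten([10, 20, 30, 40, 50, 60], [3, 2])
print(counts)        # prints [[10, 20], [30, 40], [50, 60]]
print(counts[2][0])  # prints 50, the cell for A:3, B:11
```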
n_missing
The number of rows that are missing for this measure. Because different measures may have different inputs (the column to take the mean of, for example, or weighted versus unweighted), this number may vary from one measure to another even though the total “n” is the same for all.
margins
The “margins” member is optional. When present, it is a tree of nested margins with one level of depth for each dimension. At the top, we always include the “grand total” for all dimensions. Then, we include a branch for each axis we “unroll”. So, for example, for a 3-dimensional cube of X, Y, and Z, the margins member might contain:
{
"margins": {
"data": [4526],
"0": {
"data": [1755, 2771],
"1": {"data": [
[601, 370, 322, 269, 147, 46],
[332, 215, 596, 523, 437, 668]
]},
"2": {"data": [[1198, 557], [1493, 1278]]}
},
"1": {
"data": [933, 585, 918, 792, 584, 714],
"0": {"data": [
[601, 370, 322, 269, 147, 46],
[332, 215, 596, 523, 437, 668]
]},
"2": {"data": [
[825, 108], [560, 25], [325, 593],
[417, 375], [191, 393], [373, 341]
]}
},
"2": {
"data": [2691, 1835],
"0": {"data": [[1198, 557], [1493, 1278]]},
"1": {"data": [
[825, 108], [560, 25], [325, 593],
[417, 375], [191, 393], [373, 341]
]}
}
}
}
Again, each branch in the tree is an axis we “unroll” from the grand total. So margins[0][2] contains the margin where X (axis 0) and Z (axis 2) are unrolled, and only Y (axis 1) is still “rolled up”.
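Walking the margins tree amounts to following one key per unrolled axis. In the sketch below, get_margin is a hypothetical helper and the tree is a trimmed copy of the example above; note that after JSON parsing the axis keys are strings, so margins[0][2] in the prose corresponds to margins["0"]["2"] in code.

```python
def get_margin(margins, unrolled_axes):
    """Return the margin data with the given axes unrolled, in that order."""
    node = margins
    for axis in unrolled_axes:
        node = node[str(axis)]  # JSON object keys are strings
    return node["data"]


# Trimmed copy of the 3-dimensional example (X = 0, Y = 1, Z = 2).
margins = {
    "data": [4526],
    "0": {"data": [1755, 2771], "2": {"data": [[1198, 557], [1493, 1278]]}},
    "2": {"data": [2691, 1835]},
}

xz = get_margin(margins, [0, 2])
print(xz)  # prints [[1198, 557], [1493, 1278]]

# Rolling an axis back up recovers the coarser margins.
assert [sum(row) for row in xz] == get_margin(margins, [0])
assert [sum(col) for col in zip(*xz)] == get_margin(margins, [2])
assert sum(get_margin(margins, [0])) == get_margin(margins, [])[0]
```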