Cubes have both input and output formats. The “crunch:cube” element is used for the output only.
Cube input
The input format may vary slightly according to the API endpoint (since some parameters may be inherent in the particular resource), but involves the same basic ingredients.
Example:
{
"dimensions": [
{"variable": "datasets/ab8832/variables/3ffd45/"},
{"function": "+", "args": [{"variable": "datasets/ab8832/variables/2098f1/"}, {"value": 5}]}
],
"measures": {
"count": {"function": "cube_count", "args": []}
}
}
dimensions
An array of input expressions. Each expression contributes one dimension to the output cube. The only exception is when a dimension results in a boolean (true/false) column, in which case the data are filtered by it as a mask instead of adding a dimension to the output.
When a dimension is added, the resulting axis consists of distinct values rather than all values. Variables which are already “categorical” or “enumerated” will simply use their “categories” or “elements” as the extent. Other variables form their extents from their distinct values.
measures
A set of cube functions to populate each cell of the cube. You can request multiple functions over the same dimensions (such as “cube_mean” and “cube_stddev”) or more commonly just one (like “cube_count”). Each member MUST be a ZZ9 cube function designed for the purpose. See ZZ9 User Guide:Cube Functions for a list of such functions and their arguments.
filters
An array containing references to filters that need to be applied to the dataset before starting the cube calculations. It can be an empty array or null, in which case no filtering will be applied.
weight
A reference to a variable to be used as the weight on all cube operations.
Cube output
Cubes collect columns of measure data in an arbitrary number of dimensions. Multiple measures in the same cube share dimensions, effectively overlaying each other. For example, a cube might contain a “count” measure and a “mean” measure with the same shape:
{
"element": "crunch:cube",
"n": 210,
"missing": 12,
"dimensions": [
{"references": {"name": "A", ...}, "type": {"class": "categorical", "categories": [{"id": 1, ...}, {"id": 2, ...}, {"id": 3, ...}]}},
{"references": {"name": "B", ...}, "type": {"class": "categorical", "categories": [{"id": 11, ...}, {"id": 12, ...}]}}
],
"measures": {
"count": {
"metadata": {"references": {}, "type": {"class": "numeric", "integer": true, ...}},
"data": [10, 20, 30, 40, 50, 60],
"n_missing": 12
},
"mean": {
"metadata": {"references": {}, "type": {"class": "numeric", ...}},
"data": [3.5, 17.8, 9.9, 7.32, 0, 23.4],
"n_missing": 12
}
},
"margins": {
"data": [210],
"0": {"data": [30, 70, 110]},
"1": {"data": [90, 120]}
}
}
dimensions
The “dimensions” member is the most straightforward: an array of variable Definition objects. Each one defines an axis of the cube’s output. This may be different from the input dimensions’ definitions. For example, when counting numeric variables, the input dimension might be an expression involving the bin builtin function. Even though the input variable is of type “numeric”, the output dimension would be of type “enum”.
n
The number of rows considered for all measures.
measures
The “measures” member includes one object for each measure. The “metadata” member of each tells you the name, type and other definitions of the measure. The “data” member of each is a flattened array of values for that measure; the dimensions stride into that array in order, with the last dimension varying the fastest. In the example above, the first dimension (“A”) has 3 categories, while “B” has 2; therefore, the “flat” array [10, 20, 30, 40, 50, 60] for the “count” measure is interpreted as the “unflattened” array [[10, 20], [30, 40], [50, 60]]. Graphically:
B:11 | B:12 | |
A:1 | 10 | 20 |
A:2 | 30 | 40 |
A:3 | 50 | 60 |
This is known in NumPy and other domains as “C order” (versus “Fortran order” which would be interpreted as [[10, 30, 50], [20, 40, 60]] instead).
n_missing
The number of rows that are missing for this measure. Because different measures may have different inputs (the column to take the mean of, for example, or weighted versus unweighted), this number may vary from one measure to another even though the total “n” is the same for all.
margins
The “margins” member is optional. When present, it is a tree of nested margins with one level of depth for each dimension. At the top, we always include the “grand total” for all dimensions. Then, we include a branch for each axis we “unroll”. So, for example, for a 3-dimensional cube of X, Y, and Z, the margins member might contain:
{
"margins": {
"data": [4526],
"0": {
"data": [1755, 2771],
"1": {"data": [
[601, 370, 322, 269, 147, 46],
[332, 215, 596, 523, 437, 668]
]},
"2": {"data": [[1198, 557], [1493, 1278]]}
},
"1": {
"data": [933, 585, 918, 792, 584, 714],
"0": {"data": [
[601, 370, 322, 269, 147, 46],
[332, 215, 596, 523, 437, 668]
]},
"2": {"data": [
[825, 108], [560, 25], [325, 593],
[417, 375], [191, 393], [373, 341]
]}
},
"2": {
"data": [2691, 1835],
"0": {"data": [[1198, 557], [1493, 1278]]},
"1": {"data": [
[825, 108], [560, 25], [325, 593],
[417, 375], [191, 393], [373, 341]
]}
}
}
Again, each branch in the tree is an axis we “unroll” from the grand total. So margins[0][2] contains the margin where X (axis 0) and Z (axis 2) are unrolled, and only Y (axis 1) is still “rolled up”.