In the Crunch system, any analysis is also referred to as a “cube”. Cubes are the mechanical means of representing analyses to and from the Crunch system; you can think of them as spreadsheets that might have other than two dimensions. A cube consists of two primary parts: “dimensions” which supply the cube axes, and “measures” which populate the cells. Although both the request and response include dimensions and measures, it is important to distinguish between them. The request supplies expressions for each, while the response has data (and metadata) for each. The request declares what variables to use and what to do with them, while the response includes and describes the results. See Object Reference:Cube for complete details.
Dimensions
Each dimension of an analysis can be simply one variable, a function over it, a traversal of its subvariables (for array variables), or even a combination of multiple variables (e.g. A + B). Any expression you can use in a “make_frame” command can be used as a dimension. The big difference is that the system will consider the distinct values rather than all values of the result. Variables which are already “categorical” or “enumerated” will simply use their “categories” or “elements” as the extent. Other variables form their extents from their distinct values.
For example, if “3ffd45” is a categorical variable with three categories (one of which is “No Data”: -1), then the following dimension expressions:
{
"dimensions": [
{"variable": "datasets/ab8832/variables/3ffd45/"},
{"function": "+", "args": [{"variable": "datasets/ab8832/variables/2098f1/"}, {"value": 5}]}
]
}
…would form a result cube with two dimensions: one using the categories of variable “3ffd45”, and one using the distinct values of (variable “2098f1” + 5). If variable “2098f1” has the distinct values [5, 15, 25, 35], then we would obtain a cube with the following extents:
1 | 2 | -1 | |
5 | |||
15 | |||
25 | |||
35 |
Each dimension used in a cube query needs to be reduced to distinct values. For categorical or enumerated variables, we only need to refer to the variable, and the system will automatically use the “categories” or “elements” metadata to determine the distinct values. For other types, the default is to scan the variable’s data to find the unique values present and use those. Often, however, we want a more sophisticated approach: numeric variables, for example, are usually more useful when binned into a handful of ranges, like “0 to 10, 10 to 20, …90 to 100” rather than 100 distinct points (or many more when dealing with non-integers). The available dimensioning functions vary from type to type; the most common are:
- categorical: {“variable”: url}
- text: {“variable”: url}
- numeric: Group the distinct values into a smaller number of bins via:
- {“function”: “bin”, “args”: [{“variable”: url}]}
- datetime: Roll up seconds into hours, days into months, or any other grouping via:
- {“function”: “rollup”, “args”: [{“variable”: url}, {“value”: variable.rollup_resolution}]}
- categorical_array:
- One dimension for the subvariables: {“each”: url}
- One dimension for the categories: {“variable”: url}
- multiple response:
- One dimension for the subvariables: {“each”: url}
- One dimension for the selected-ness, which means transforming the array from a set of arbitrary categories to a standard “selected” set of categories (1, 0, -1) via:
- {“function”: “selections”, “args”: [{“variable”: url}]}
Measures
A set of named functions to populate each cell of the cube. You can request multiple functions over the same dimensions (such as “cube_mean” and “cube_stddev”) or more commonly just one (like “cube_count”). For example:
{"measures": {"count": {"function": "cube_count", "args": []}}}
or:
{"measures": {
"mean": {"function": "cube_mean", "args": [{"variable": "datasets/1/variables/3"}]},
"stddev": {"function": "cube_stddev", "args": [{"variable": "datasets/1/variables/3/"}]}
}}
When applied to the dimensions we defined above, this second example might fill the table thusly for the “mean” measure:
mean | 1 | 2 | -1 |
5 | 4.3 | 12.3 | 8.1 |
15 | 13.1 | 0.0 | 9.2 |
25 | 72.4 | 4.2 | 55.5 |
35 | 8.9 | 9.1 | 0.4 |
…and produce a similar one for the “stddev” measure. You can think of multiple measures as producing “overlays” over the same dimensions. However, the actual output format (in JSON) is more compact in that the dimensions are not repeated; see Object Reference:Cube output for details.
ZCL expressions are composable. If you need, for example, to find the mean of a categorical variable’s “numeric_value” attributes, cast the variable to the “numeric” type class before including it as the cube argument:
{"measures": {
"mean": {
"function": "cube_mean",
"args": [{
"function": "cast",
"args": [
{"variable": "datasets/1/variables/3"},
{"class": "numeric"}
]
}]
}
}}
Multitables
GET /datasets/{id}/multitables/ HTTP/1.1
200 OK
{
"element": "shoji:catalog",
"index": {
"1/": {"name": "Major demographics"},
"2/": {"name": "Political tendencies"}
}
}
POST /datasets/{id}/multitables/ HTTP/1.1
{
"element": "shoji:entity",
"body": {
"name": "Geographical indicators",
"template": [
{
"query": [
{
"variable": "../variables/de85b32/"
}
]
},
{
"query": [
{
"variable": "../variables/398620f/"
}
]
},
{
"query": [
{
"function": "bin",
"args": [
{
"variable": "../variables/398620f/"
}
]
}
]
}
],
"is_public": false
}
}
201 Created
Location: datasets/{id}/multitables/3/
Analyses as described above are truly multidimensional; when you add another variable, the resulting cube obtains another dimension. Sometimes, however, you want to compare analyses side by side, typically looking at several (even all) variables against a common set of conditioning variables. For example, you might nominate “Gender”, “Age”, and “Race” as the conditioning variables and cross every other variable with those, in order to quickly discover common correlations.
Multi-table definitions mainly provide a template member that clients can use to construct a valid query with the variable(s) of interest.
Crunch provides a separate catalog where you can define and manage these common sets of variables. Like most catalogs, you can GET it to see which multitables are defined.
Template query
A multitable is a set of queries that form groups of ‘columns’ for different later chosen ‘row’ variables. It is defined by a name and a template. At minimum the template must contain a query fragment: this will be later inserted after some function of a row variable to form the dimension <analyzing-dimensions> of a result. Each template dimension can currently only be a function of one variable.
GET /datasets/{id}/multitable/3/ HTTP/1.1
{
"element": "shoji:entity",
"body": {
"name": "Geographical indicators",
"template": [
{
"query": [
{
"variable": "../variables/de85b32/"
}
]
},
{
"query": [
{
"variable": "../variables/398620f/"
}
]
},
{
"query": [
{
"function": "bin",
"args": [
{
"variable": "../variables/398620f/"
}
]
}
]
}
]
}
}
Each multi-table template may be a list of variable references and other information used to construct the dimension and transform its output.
Transforming Analyses for Presentation
The transform member of an analysis specification (or multitable definition) is a declarative definition of what the dimension should look like after computation. The cube result dimension itself will always be derived from the query part of the request ({variable: $variableId}), {function: f, args: [$variableId, …]}, etc., after which clients should do what is necessary to arrive at the transformed result — changing element names, orders, and so on.
For historical reasons, the JSON schema for transforms on a multitable definition differ somewhat from the schema used for slide analyses. The following schema description applies to transforms appearing on an analysis. The (legacy) schema used for a multitable template appears in the following section.
Slide-analysis transforms schema
The transforms affecting a slide in a deck are on the analysis entity corresponding to the slide.
Unlike a multitable-template transform, transforms on a slide-analysis can apply to both the rows and columns of the data matrix underlying a visualization.
Example:
A saved analysis can include a “transforms”: object. This example makes a univariate table of a multiple response variable, applies an explicit reordering, relabels one row, and hides another, and specifies the color of one bar when displayed as a chart.
{
"query": {
"dimensions": [
{"variable": "../variables/398620f/"},
{
"function": "selections",
"args": [{"variable": "../variables/398620f/"}]
}
],
"measures": {
"count": {"function": "cube_count", "args": []}
}
},
"transforms": {
"rows_dimension": {
"elements": {
"0003": {
"fill": "#AB12EF",
"name": "Toastie"
},
"0002": {
"hide": true
}
},
"order": {
"type": "explicit",
"element_ids": ["0003", "0001", "0002", "0005", "0004"],
}
}
}
}
Schema — General principles
The following principles underlie the design of this schema for analysis transforms.
- transforms cascade
- Transforms can be defined at multiple levels. For example, default transforms can be defined on a variable and apply to all analyses deriving from that variable. The analysis-transforms described here inherit from the default transforms. When a transform specified at the default level is also specified at the analysis level, the analysis-transform overrides the inherited value. Any default-transforms not specified at the analysis level apply without change.
- Not all inherited values arise from other transforms. For example, element name and element ordering are inherited from the variable definition itself.
- strictly-sparse
- The transforms schema is strictly sparse, meaning only those items that must be specified should be specified. This is critical to the proper cascading behavior of transforms.
- In particular, adding an element transform for all elements in the dimension, setting "name": to the existing element name in each case, will produce misbehavior. While this would produce a visualization that looked right initially, it defines transforms which the user did not ask for and almost certainly did not intend. In this case, if the underlying element name were later changed on the variable definition, that change would appear to “not take” on this analysis because the new name is being transformed back to the old name.
- A consequence of the strictly-sparse ethic is that everything is optional. Every value present specifies a transform and no values need be present. Container objects that are empty should not appear.
- In charts, rows are series and columns are categories
- Because transforms describe the underlying data matrix, we use the names “rows” and “columns” to give the dimensions an intuitive interpretation that corresponds to a tabular visualization. The mapping of rows and columns to chart items has some added complication though, depending on the dimensionality of the analysis.
- In a bivariate chart, each row maps to a series (set of same-color bars) and each column maps to a chart category (cluster of bars). In a univariate chart, the overall chart maps to the rows-dimension and each row element represents a chart category.
transforms object
The analysis entity can contain an optional "transforms": object. If omitted, no transforms are specified on the analysis.
The "transforms": object can contain a "rows_dimension": item and a "columns_dimension": item. Both are optional. A "columns_dimension": object can only appear on a bivariate analysis (a univariate analysis has no columns dimension).
transforms/rows_dimension object
The "rows_dimension": item in the transforms object is optional. If omitted, no transforms are specified on the rows dimension.
The "rows_dimension": object can contain a "name": item, a "description": item, an "elements": object and an "order": object. All are optional.
- The "name": item contains a string display-name that will override the dimension name (generally the variable name) wherever that appears in the analysis or visualization.
- The "description": item contains a string display-name that will override the dimension description (generally the variable description) wherever that appears in the analysis or visualization.
- The "elements": object contains an element-transforms object for each rows-dimension element having one or more transforms. Each element-transforms object is keyed by its element-id, converted to a string when numeric. The form of an element-transforms object is described in a following section. The elements container object should be omitted when empty.
- The "order": object, if present, specifies the sequence in which rows should appear. The form of the order object is described in a following section.
"transforms": {
"rows_dimension": {
"elements": {
"3": {...},
"2": {...},
},
"order": {
...
}
}
transforms/columns_dimension object
The "columns_dimension": item in the transforms object is optional. If omitted, no transforms are specified on the columns dimension.
A "columns_dimension": object is only meaningful for a bivariate analysis (a univariate analysis has no columns-dimension). It should not appear on a univariate analysis.
The "columns_dimension": object can contain a "name": item, a "description": item, an "elements": object and an "order": object. All are optional.
- The "name": item contains a string display-name that will override the dimension name (generally the variable name) wherever that appears in the analysis or visualization.
- The "description": item contains a string display-name that will override the dimension description (generally the variable description) wherever that appears in the analysis or visualization.
- The "elements": object contains an element-transforms object for each columns-dimension element having one or more transforms. Each element-transforms object is keyed by its element-id, converted to string when numeric. The form of the element-transforms object is the same as that appearing in the rows-dimension object and is described in a following section. The elements container object should be omitted when empty.
- The "order": object, if present, specifies the sequence in which columns should appear. The form of the order object is the same as that appearing in the rows-dimension object and is described in a following section.
"transforms": {
"columns_dimension": {
"elements": {
"1": {...},
"4": {...},
},
"order": {
...
}
}
Element-transforms objects
Certain transforms can be applied to an individual element, including rename and hide. These transforms are specified in an element-transforms object, contained in the "elements:" collection for a row or column dimension. Each element-transforms object is keyed by its element-id (category-id on certain variable types). The id-value is converted to string when numeric as JSON only allows string keys.
An element-transforms object can contain a "fill": item (sometimes), a "hide": item and a "name": item. All are optional.
- The "fill": item is an RGB color value like "#ffd0ac" and only applies to visualizations that use color, such as a chart; it is ignored by a table visualization. The color value must begin with a hash character (“#”) followed by six hexadecimal digits (no 3-digit color codes). Case is insignificant.
Note that the fill item can appear only in a rows-dimension element. Fill color behavior for a columns dimension element is not defined.
The behavior of a fill transform varies by dimensionality. In a univariate chart, it specifies the color of the single bar (sector in a donut chart) associated with the row-element. In a bivariate chart it specifies the color of all bars in the chart series associated with the row-element. - A "hide": item takes the value true or false. Omitted implies “inherit”. By default, all non-missing (aka. valid) elements in a dimension appear in the visualization. The element-hide transform allows appearance of an element to be suppressed. This translates to an “invisible” row or column in a table, or a category or series that does not appear in a chart.
Hiding of this sort does not change the calculation of the underlying data matrix in any way. All values are the same, including summary values and totals. Only the display of the values associated with the element in the visualization is suppressed.
In general, a hide item should only appear when its value is true. A value of false overrides any setting cascaded from a higher level, such as a default transform, and would restore visibility to an element hidden there. The hide-transform should be thought of as a three-valued setting (show, hide, or inherit-setting). Make sure you’re reflecting the user’s intent when setting this value, in particular when choosing between false and omitting the item entirely. - A "name": item changes the name displayed on the analysis without changing the underlying element name. Do not add a name item where the user has not explicitly specified a name change, in particular, do not write the element-name into this item for each element simply because it is more convenient than distinguishing actual name-overrides. Doing so would shadow element-name changes made in the variable definition.
"elements": {
"1": {
"fill": "#ffd0ac",
"hide": true,
"name": "T-Mobility"
}
}
Element-order objects
By default, dimension elements appear in cube-result document order (where they are represented as a sequence). An element-order transform applies a different ordering to those elements, for this analysis only. The new order is specified as a list of element-ids, possibly partial.
This transform must be tolerant of both element additions and removals. Since the transform is stored separately from the variable definition, there is no assurance that a once exhaustive sequence of element-ids will fully match the elements in a later cube-result. The following processing rules provide this accommodation:
- Any element ids in the order sequence that do not correspond to an element in the cube-result are ignored.
- Any elements in the cube result that are not present in the order-sequence appear at the end, in cube-result document order.
Note that this rule provides support for “partial” ordering, such as placing a given brand first, followed by the remaining brands in document order, without needing to specify the remaining brands.
Currently, the only sort type is “explicit”. Others, such as sort by value are expected to follow.
"order": {
"type": "explicit",
"element_ids": [42, 7, 3]
}
Multitable Transforms Schema
A transform can contain elements or categories, which is an array of target transforms for output-dimension elements. Therefore to create a valid element/category transform it is generally necessary to make a cube query, inspect the result dimension, and proceed from there. For categorical and multiple response variables, elements may also be obtained from the variable entity.
Transforms are designed for variables that are more stable than not, with element ids that inhere in the underlying elements, such as category or subvariable ids. Dynamic elements such as results of binning a numeric variable, may not be transformed.
Transformations stored on a variable’s view are the default transforms for that variable. They may be shorter, alternate versions of category names, or contain insertions, described below.
Insertions
In addition to transforming the categories or elements already defined on a cube ‘dimension’, it is possible to insert headings and subtotals to the result. These insertions are attached after an anchor element/category id.
Insertions are processed last, after renaming, reordering, or sorting elements according to the elements/categories transform specification. They are “attached” to their anchor, always following it in the result; or, simply appended to the end of the result. If the result is sorted by some column’s value, it may make the most sense to choose to display insertions last, rather than inserting them into a result table because their values will not be considered when sorting the non-inserted elements themselves.
An insertion is defined by an anchor and a name, which will be displayed alongside the names of categories/elements. It may also contain "function": "subtotal" and "args": [], where the array of args are the category ids of elements to combine as “subtotals”.
Use an anchor of top to indicate an insertion before other results. Use an anchor of bottom to indicate an insertion after other results. Any anchor that does not match an id in the elements/categories will be included at the end of results.
Example
Consider the following example result dimension:
Name | missing | id |
Element A | 0 | |
Element B | 1 | |
Element C | 2 | |
Don't know | 3 | |
Not asked | true | 4 |
An element transform can specify a new order of output elements, new names, and in the future, bases for hypothesis testing, result sorting, and aggregation of results. A transform has elements that look generally like the dimension’s extent, with some optional properties:
- id: (required) id of the target element/category
- name: name of new target element/category
- sort: -1 or 1 indicating to sort results descending or ascending by this element
- compare: neq, leq, geq indicating to test other rows/columns against the hypothesis that they are ≠, ≤, or ≥ to the present element
- hide: suppress this element’s row/column from displaying at all. Defaults to false for valid elements, true for missing, so that if an element is added, it will be present until a transform with hide: true is added to suppress it.
A transform with object members can do lots of things. Suppose we want to put Element C first, hide the Don’t know, and more compactly represent the result as just C, A, B:
{
"transform": {"categories": [
{"id": 2, "name": "C"},
{"id": 0, "name": "A"},
{"id": 1, "name": "B"},
{"id": 3, "hide": true}
]}
}
Example transform in a multitable template
In a multitable, the transform is part of each dimension definition object in the template array.
{
"template": [
{
"query": [
{"variable": "A"}
],
"transform": [{}, {}]
},
{
"query": [
{
"function": "rollup",
"args": [
{"value": "M"},
{"variable": "B"}
]
}
]
}
]
}
More Complex Multitable Templates
The template may contain in addition to variable references and their query arguments, an optional transform: To obtain their multiple output cubes, you GET datasets/{id}/cube?query=<q> where <q> is a ZCL object in JSON format (which must then be URI encoded for inclusion in the querystring). Use the “each” function to iterate over the overview variables’ query, producing one output cube for each one as “variable x”. For example, to cross each of the above 3 variables against another variable “449b421”:
{
"function": "each",
"args": [
{"value": "x"},
[
{"variable": "de85b32"},
{"variable": "398620f"},
{"variable": "c116a77"}
]
],
"block": {
"function": "cube",
"args": [
[
{"variable": "449b421"},
{"variable": "x"}
],
{
"map": {
"count": {
"function": "cube_count",
"args": []
}
}
},
{"value": null}
]
}
}
The result will be an array of output cubes:
{
"element": "shoji:view",
"value": [
{
"query": {},
"result": {
"element": "crunch:cube",
"dimensions": [
{
"references": "449b421",
"type": "etc."
},
{
"references": "de85b32",
"type": "etc."
}
],
"measures": {
"count": {
"function": "cube_count",
"args": []
}
}
}
},
{
"query": {},
"result": {
"element": "crunch:cube",
"dimensions": [
{
"references": "449b421",
"type": "etc."
},
{
"references": "398620f",
"type": "etc."
}
],
"measures": {
"count": {
"function": "cube_count",
"args": []
}
}
}
},
{
"query": {},
"result": {
"element": "crunch:cube",
"dimensions": [
{
"references": "449b421",
"type": "etc."
},
{
"references": "c116a77",
"type": "etc."
}
],
"measures": {
"count": {
"function": "cube_count",
"args": []
}
}
}
}
]
}