Deriving Variables
Derived variables are variables that, instead of having a column of values backing them, are functionally dependent on other variables. In Crunch, users with view-only permissions on a dataset can still make derived variables of their own–just as they can make filters. Dataset editors can also derive other types of variables as permanent additions to the dataset, available for all viewers.
Combining Categories
The “combine_categories” function takes two arguments:
- A reference to the categorical or categorical_array variable to be combined.
- A definition of the categories of the new variable, including all members found in categories, plus a “combined_ids” key that maps the derived category to one or more categories (by id) in the input variable.
Given a variable such as:
{
"element": "shoji:entity",
"self": "https://[workspace].crunch.io/api/datasets/3ad42c/variables/0000f5/",
"body": {
"name": "Education",
"alias": "educ",
"type": "categorical",
"categories": [
{"numeric_value": null, "missing": true, "id": -1, "name": "No Data"},
{"numeric_value": 1, "missing": false, "id": 1, "name": "No HS"},
{"numeric_value": 2, "missing": false, "id": 2, "name": "High school graduate"},
{"numeric_value": 3, "missing": false, "id": 3, "name": "Some college"},
{"numeric_value": 4, "missing": false, "id": 4, "name": "2-year"},
{"numeric_value": 5, "missing": false, "id": 5, "name": "4-year"},
{"numeric_value": 6, "missing": false, "id": 6, "name": "Post-grad"},
{"numeric_value": 8, "missing": true, "id": 8, "name": "Skipped"},
{"numeric_value": 9, "missing": true, "id": 9, "name": "Not Asked"}
],
"description": "Education"
}
}
POST’ing to the private variables catalog a Shoji Entity containing a ZCL function like:
{
"element": "shoji:entity",
"body": {
"name": "Education (3 category)",
"description": "Combined from six-category education",
"alias": "educ3",
"derivation": {
"function": "combine_categories",
"args": [
{
"variable": "https://[workspace].crunch.io/api/datasets/3ad42c/variables/0000f5/"
},
{
"value": [
{"name": "High school or less", "numeric_value": null, "id": 1, "missing": false, "combined_ids": [1, 2]},
{"name": "Some college", "numeric_value": null, "id": 2, "missing": false, "combined_ids": [3, 4]},
{"name": "4-year college or more", "numeric_value": null, "id": 3, "missing": false, "combined_ids": [5, 6]},
{"name": "Missing", "numeric_value": null, "id": 4, "missing": true, "combined_ids": [8, 9]},
{"name": "No Data", "numeric_value": null, "id": -1, "missing": true, "combined_ids": [-1]}
]
}
]
}
}
}
…results in a private categorical variable with three valid categories.
Combining the categories of a categorical array is the same as it is for categorical variables. The resulting variable is also of type “categorical_array”. This variable type also has a “subvariables_catalog”, like the variable from which it is derived, and the subvariables contained in it are derived “combine_categories” categorical variables.
Combining Responses
For multiple response variables, you may combine responses rather than categories.
Given a variable such as:
{
"element": "shoji:entity",
"self": "https://[workspace].crunch.io/api/datasets/455288/variables/3c2e57/",
"body": {
"name": "Aided awareness",
"alias": "aided",
"subvariables": [
"../870a2d/",
"../a8b0eb/",
"../dc444f/",
"../8e6279/",
"../f775ab/",
"../6405c2/"
],
"type": "multiple_response",
"categories": [
{"numeric_value": 1, "selected": true, "id": 1, "name": "Selected", "missing": false},
{"numeric_value": 2, "id": 2, "name": "Not selected", "missing": false},
{"numeric_value": 8, "id": 3, "name": "Skipped", "missing": true},
{"numeric_value": 9, "id": 4, "name": "Not asked", "missing": true},
{"numeric_value": null, "id": -1, "name": "No Data", "missing": true}
],
"description": "Which of the following coffee brands do you recognize? Check all that apply."
}
}
POSTing to the variables catalog a Shoji Entity containing a ZCL function like:
{
"element": "shoji:entity",
"body": {
"name": "Aided awareness by region",
"description": "Combined from aided brand awareness",
"alias": "aided_region",
"derivation": {
"function": "combine_responses",
"args": [
{
"variable": "https://[workspace].crunch.io/api/datasets/455288/variables/3c2e57/"
},
{
"value": [
{"name": "San Francisco", "combined_ids": ["../870a2d/", "../a8b0eb/","../dc444f/"]},
{"name": "Portland", "combined_ids": ["../8e6279/", "../f775ab/"]},
{"name": "Chicago", "combined_ids": ["../6405c2/" ]}
]
}
]
}
}
}
…results in a multiple response variable with three responses. The “selected” state of the responses in the derived variable is an “OR” of the combined subvariables.
Case Statements
The “case” function derives a variable using values from the first argument. Each of the remaining arguments contains a boolean expression. These are evaluated in order in an IF, ELSE IF, ELSE IF, …, ELSE fashion; the first one that matches selects the corresponding value from the first argument. For example, if the first two boolean expressions do not match (return False) but the third one matches, then the third value in the first argument is placed into that row in the output. You may include an extra value for the case when none of the boolean expressions match; if not provided, it defaults to the “No Data” missing value.
{
"element": "shoji:entity",
"body": {
"name": "Market segmentation",
"description": "Super-scientific classification of people",
"alias": "segments",
"derivation": {
"function": "case",
"args": [
{
"column": [1, 2, 3, 4],
"type": {
"value": {
"class": "categorical",
"categories": [
{"id": 3, "name": "Hipsters", "numeric_value": null, "missing": false},
{"id": 1, "name": "Techies", "numeric_value": null, "missing": false},
{"id": 2, "name": "Yuppies", "numeric_value": null, "missing": false},
{"id": 4, "name": "Other", "numeric_value": null, "missing": true}
]
}
}
},
{
"function": "and",
"args": [
{"function": "in", "args": [{"variable": "55fc29/"}, {"value": [5, 6]}]},
{"function": "<=", "args": [{"variable": "673dde/"}, {"value": 30}]}
]
},
{
"function": "and",
"args": [
{"function": "in", "args": [{"variable": "889dc3/"}, {"value": [4, 5, 6]}]},
{"function": ">", "args": [{"variable": "673dde/"}, {"value": 40}]}
]
},
{"function": "==", "args": [{"variable": "13cbf4/"}, {"value": 1}]}
]
}
}
}
Making Ad Hoc Arrays
It is possible to create derived arrays reusing subvariables from other arrays using the array function and indicating the reference for each of its subvariables.
The subvariables of an array are specified using the make_frame function, with its first map argument indicating the IDs for each of these virtual subvariables. These IDs are user defined and can be any string. They remain unique inside the parent variable so they can be reused between different arrays. The second argument of the make_frame function indicates the order of the subvariables in the array. They are referenced by the user defined IDs.
Each of its variables must point to a variable expression, which can take an optional (but recommended) references attribute to specify a particular name and alias for the subvariable, if not specified, the same name from the original will be used and the alias will be padded to ensure uniqueness.
{
"CA3": {
"name": "cat array 3",
"derivation": {
"function": "array",
"args": [
{
"function": "make_frame",
"args": [
{
"map": {
"var1": {
"variable": "ca2-subvar-2",
"references": {"alias": "subvar2", "name": "Subvar 2"}
},
"var0": {
"variable": "ca1-subvar-1",
"references": {"alias": "subvar1", "name": "Subvar 1"}
}
}
},
{"value": ["var1", "var0"]}
]
}
]
}
},
"CA2": {
"subvariables": [
{"alias": "ca2-subvar-1", "name": "ca2-subvar-1"},
{"alias": "ca2-subvar-2", "name": "ca2-subvar-2"}
],
"type": "categorical_array",
"name": "cat array 2",
"categories": [
{"numeric_value": null, "missing": false, "id": 1, "name": "yes"},
{"numeric_value": null, "missing": false, "id": 2, "name": "no"},
{"numeric_value": null, "missing": true, "id": -1, "name": "No Data"}
]
},
"CA1": {
"subvariables": [
{"alias": "ca1-subvar-1", "name": "ca1-subvar-1"},
{"alias": "ca1-subvar-2", "name": "ca1-subvar-2"},
{"alias": "ca1-subvar-3", "name": "ca1-subvar-3"}
],
"type": "categorical_array",
"name": "cat array 1",
"categories": [
{"numeric_value": null, "missing": false, "id": 1, "name": "yes"},
{"numeric_value": null, "missing": false, "id": 2, "name": "no"},
{"numeric_value": null, "missing": true, "id": -1, "name": "No Data"}
]
}
}
In the above example, the array CA3uses the array function and uses subvariables ca1-subvar-1 and ca2-subvar-2 from CA1 and CA2 respectively. The references attribute is used to indicate specific name/alias for these subvariables.
Weights
A numeric variable suitable for use as row weights can be constructed from one or more categorical variables and target proportions of their categories. The sample distribution is “raked” iteratively to each categorical marginal target to produce a set of joint values that can be used as weights. Note that available weight variables are shared by all; you may not create private weights. To create a weight variable, POST a JSON variable definition to the variables catalog describing the properties of the weight variable, with an “derivation” member indicating to use the “rake” function, which takes arguments containing an array of variable targets:
POST /api/datasets/{datasetid}/variables/ HTTP/1.1
Content-Type: application/shoji
Content-Length: 739
{
"name": "weight",
"description": "my raked weight",
"derivation": {
"function": "rake",
"args": [{
"variable": variabl1.id,
"targets": [[1, 0.491], [2, 0.509]]
}]
}
}
---------
201 Created
Location: /api/datasets/{datasetid}/variables/{variableid}/
Additionally, it is possible to trim the results of the rake by adding a trim argument at the end of the rake expression that contains the min and max limits as follows:
POST /api/datasets/{datasetid}/variables/ HTTP/1.1
Content-Type: application/shoji
Content-Length: 739
{
"name": "weight",
"description": "my raked weight",
"derivation": {
"function": "rake",
"args": [{
"variable": variabl1.id,
"targets": [[1, 0.491], [2, 0.509]]
}, {
"trim": {"min": 0.1, "max": 1.0}
}]
}
}
---------
201 Created
Location: /api/datasets/{datasetid}/variables/{variableid}/
Multiple Response Views
The “select_categories” function allows you to form a multiple response array from a categorical array, or alter the “selected” categories in an existing multiple response array. It takes two arguments:
- A reference to a categorical or categorical_array variable
- A list of the category ids or category names to mark as “selected”
Given a variable such as:
{
"element": "shoji:entity",
"self": "https://[workspace].crunch.io/api/datasets/3ad42c/variables/0000f5/",
"body": {
"name": "Cola",
"alias": "cola",
"type": "categorical",
"categories": [
{"id": -1, "name": "No Data", "numeric_value": null, "missing": true},
{"id": 0, "name": "Never", "numeric_value": null, "missing": false},
{"id": 1, "name": "Sometimes", "numeric_value": null, "missing": false},
{"id": 2, "name": "Frequently", "numeric_value": null, "missing": false},
{"id": 3, "name": "Always", "numeric_value": null, "missing": false}
],
"subvariables": ["0001", "0002", "0003"],
"references": {
"subreferences": {
"0003": {"alias": "Coke"},
"0002": {"alias": "Pepsi"},
"0001": {"alias": "RC"}
}
}
}
}
POST’ing to the private variables catalog a Shoji Entity containing a ZCL function like:
{
"element": "shoji:entity",
"body": {
"name": "Cola likes",
"description": "Cola preferences",
"alias": "cola_likes",
"derivation": {
"function": "select_categories",
"args": [
{"variable": "https://[workspace].crunch.io/api/datasets/3ad42c/variables/0000f5/"},
{"value": [2, 3]}
]
}
}
}
…results in a private multiple_response variable where the “Frequently” and “Always” categories are selected.
Categorical Logic
Crunch provides logical operations like most languages do, but with a few enhancements.
First, Crunch logic is actually based on three values so you can properly handle missing data. Many programming languages do this, too: SQL implements NULL in addition to TRUE and FALSE, and general languages often have a similar None or NA value.
In Crunch, rather than having a “boolean” type with true/false/null values, we instead allow any categorical variable to participate in logic. Categories which have the selected: true flag are “selected”; categories which have the missing: true flag are “missing”; any categories that lack both are “other”. Any categorical variable that has at least one “selected” category can function as a “logical variable”.
How can you obtain a logical variable? A few ways:
- You can define it yourself simply by setting the selected: true flag on one or more categories when your data is uploaded. You are probably already doing this for missing: true.
- You can use the select_categories function to derive new Multiple Response variables.
- Logic Functions like ==, >, in, or, and not all return logical variables.
- You can use the “filter builder” in the web UI to quickly create derived variables using the logic functions.
Given a variable such as:
{
"element": "shoji:entity",
"self": "https://[workspace].crunch.io/api/datasets/3ad42c/variables/000301/",
"body": {
"name": "Height",
"alias": "height",
"type": "numeric",
}
}
POST’ing to the private variables catalog a Shoji Entity containing a ZCL function like:
{
"element": "shoji:entity",
"body": {
"name": "Tall people",
"description": "People who are taller than 200cm",
"alias": "tall",
"derivation": {
"function": ">",
"args": [
{"variable": "https://[workspace].crunch.io/api/datasets/3ad42c/variables/000301/"},
{"value": 200}
]
}
}
}
…results in a private logical variable with three categories: “Selected”, “Other”, and “No Data”.
What can you do with logical variables? A few things:
- You can use them as filters. Any rows where the logical variable is “selected” match the filter and are retained; any that are “other” or “missing” do not match and are filtered out.
- You can reduce them to just three values with the as_selected function. This lets you:
- Calculate proportions and other statistics accurately for Multiple Response analyses by properly representing and propagating missingness.
- Present views where only the “selected” slice is shown.
- Represent the variable in clients as if it were true/false/NA; the “rcrunch” library for R does this, transforming three-valued Crunch variables into R logicals and back.
- You can use and and or to combine them.
Logical Negation
In addition to the typical not operation, Crunch provides a separate not_selected function. The difference is in how missing values are handled. For not, missing values are left missing:
A | not A |
---|---|
S | O |
O | S |
M | M |
This is called the “relative complement” and is usually what you want when you invert a set of selected categories. For example, if you have a variable telling you who likes Brad Pitt, it probably has some “missing” entries, where it’s unknown whether the respondent likes Brad Pitt or not. You often want the inverse of “who does not like Brad Pitt” to not turn those “missing” entries into “selected” ones. It also means that A != B has the same output as not(A == B).
When filtering, however, you often want the “absolute complement”. If you define a filter of those who like Brad Pitt, but then want to exclude those folks, you’ll want the “selected” entries to invert to “other” so they are excluded. But you don’t want the “missing” entries to stay “missing” or they will also be excluded. In this case, you want to use not_selected instead:
A | not_selected A |
---|---|
S | O |
O | S |
M | S |
Binary Logic
The and function behaves like most languages. Any “missing” value makes the output “missing”:
and | S | O | M |
---|---|---|---|
S | S | O | M |
O | O | O | M |
M | M | M | M |
The or function behaves like SQL and R: any “selected” value wins, but any row not matched by another case where any of the input variables are missing stays missing.
or | S | O | M |
---|---|---|---|
S | S | S | S |
O | S | O | M |
M | S | M | M |
Contrary to lots of other languages (like SQL and R), eq returns “selected” if the two values are the same code, even if missing. For example, when comparing whether variable A is equal to missing -8, we want a -8 value in A to evaluate to “selected”, not “missing”. But other missings should result in a missing result.
eq | 1 | 2 | 99 | -1 |
---|---|---|---|---|
missing -1 | M | M | M | S |
missing 99 | M | M | S | |
valid 2 | O | S | ||
valid 1 | S |
ne | 1 | 2 | 99 | -1 |
---|---|---|---|---|
missing -1 | M | M | M | O |
missing 99 | M | M | O | |
valid 2 | S | O | ||
valid 1 | O |
Text Analysis
Sentiment Analysis
The “sentiment” function allows you to derive a categorical variable from text variable data, which is classified and accumulated in three categories (positive, negative, and neutral). It takes one parameter:
- A reference to a text variable
Given a variable such as:
{
"element": "shoji:entity",
"self": "https://[workspace].crunch.io/api/datasets/3ad42c/variables/0000f5/",
"body": {
"name": "Zest",
"alias": "zest",
"type": "text",
"values": [
"Zest is best",
"Zest I can take it or leave it",
"Zest is the worst"
]
}
}
POSTing to the private variables catalog a Shoji Entity containing a ZCL function like:
{
"element": "shoji:entity",
"body": {
"name": "Zesty Sentiment",
"description": "Customer sentiment about Zest",
"alias": "zest_sentiment",
"derivation": {
"function": "sentiment",
"args": [
{"variable": "https://[workspace].crunch.io/api/datasets/3ad42c/variables/0000f5/"}
]
}
}
}
Will result in a new categorical variable, where for each row the text variable is classified as “Negative”, “Neutral”, or “Positive” using the VADER English social-media-tuned lexicon.
Other Transformations
Arithmetic operations
It is possible to create new numeric variables out of pairs of other numeric variables. The following arithmetic operations are available and will take two numeric variables as their arguments.
- + — for adding up two numeric variables
- - — returns the difference between two numeric variables
- * — for the product of two numeric variables
- / — real division
- // — floor division; Returns always an integer
- ^ — raises the first argument to the power of the second argument
- % — modulo operation; accepts floats
The usage is as follows for all operators:
{
"function": "+",
"args": [
{"variable": "https://[workspace].crunch.io/api/datasets/123/variables/abc/"}
{"variable": "https://[workspace].crunch.io/api/datasets/123/variables/def/"}
]
}
bin
Receives a numeric variable and returns a categorical one where each category represents a bin of the numeric values.
Each category on the new variable is annotated with a “boundaries” member that contains the lower/upper bound of each bin.
{
"function": "bin",
"args": [
{"variable": "https://[workspace].crunch.io/api/datasets/123/variables/abc/"}
]
}
Optionally it is possible to pass a second argument indicating the desired bin size to use instead of allowing the API to decide them.
{
"function": "bin",
"args": [
{"variable": "https://[workspace].crunch.io/api/datasets/123/variables/abc/"},
{'value': 100}
]
}
case
Returns a categorical variable with its categories following the specified conditions from different variables on the dataset. See Case Statements.
cast
Returns a new variable with its type and values casted. Not applicable on arrays or date variable; use Date Functions to work with date variables.
{
"function": "cast",
"args": [
{"variable": "https://[workspace].crunch.io/api/datasets/123/variables/abc/"},
{"value": "numeric"}
]
}
The allowed output variable types are:
- numeric
- text
- categorical
For categorical types, it is necessary to indicate the categories as a type definition instead of a string name.
To cast to categorical type, the second argument value should not be a name string (numeric, text) but a type definition indicating a class and categories as follows:
{
"function": "cast",
"args": [
{"variable": "https://[workspace].crunch.io/api/datasets/123/variables/abc/"},
{"value": {
"class": "categorical",
"categories": [
{"id": 1, "name": "one", "missing": false, "numeric_value": null},
{"id": 2, "name": "two", "missing": false, "numeric_value": null},
{"id": -1, "name": "No Data", "missing": true, "numeric_value": null},
]
}
}
]
}
To change the type of a variable a client should POST to the /variable/:id/cast/ endpoint. See Convert type for API examples.
char_length
Returns a numeric variable containing the text length of each value. Only applicable on text variables.
{
"function": "char_length",
"args": [
{"variable": "https://[workspace].crunch.io/api/datasets/123/variables/abc/"}
]
}
copy_variable
Returns a shallow copy of the indicated variable maintaining type and data.
{
"function": "copy_variable",
"args": [
{"variable": "https://[workspace].crunch.io/api/datasets/123/variables/abc/"}
]
}
Changes on the data of the original variable will be reflected on this copy.
combine_categories
Returns a categorical variable with values combined following the specified combination rules. See Combining categories.
combine_responses
Given a list of categorical variables, return the selected value out of them. See Combining responses.
row
Returns a numeric variable with row 0 based indices. It takes no arguments.
{
"function": "row",
"args": []
}
remap_missing
Given a text (numeric or datetime variable), it returns a new variable of the same type with its missing values mapped to new codes.
{
"function": "remap_missing",
"args": [
{"variable": "varid"},
{"value": [
{"reason": "Combined 1 and 2", "code": 1, "mapped_codes": [1, 2]},
{"reason": "Only 3", "code": 2, "mapped_codes": [3]},
{"reason": "No Data", "code": -1, "mapped_codes": [-1]}
]}
]
}
The example above will return a copy of the variable with id varid with the new missing_reasons grouping and mapping following the original codes.
Integrating Variables
“Integrating” a variable means to remove its derived properties and turn it into a regular base variable. Doing so will make this variable stop reflecting the expression if new data is added to its original parent variable and new rows will be filled with No Data {"?": -1}.
To integrate a variable it is necessary to PATCH to the variable entity with the derived attribute set to false as so:
PATCH /api/dataset/abc/variables/123/
{
"element": "shoji:entity",
"body": {
"derived": false
}
}
Will effectively integrate the variable and make its derivation attribute contain null from now in. Note that it is only possible to set the derived attribute to false and never to true.
Creating unlinked derivations
It is possible to create a material copy, or one off copy of a variable or an expression of it.
To create such variables, proceed normally creating a derived variable with the derivation expression, but also include derived: false attribute to it. So the variable will be created with the values of the expression but will be unlinked from the original variable.
POST /api/dataset/abc/variables/
Array Variables
Simple variables have only one value per row; sometimes, however, it is convenient to consider multiple values (of the same type) as a single Variable. The Crunch system implements the data as a 2-dimensional array, but the array variable includes two additional attributes: “subvariables”, which is a list of subvariable URLs, and “subreferences”, which is an object of {name, alias, description, …} objects keyed by subvariable URL. There are two types of array variable: categorical array and multiple response.
Categorical Arrays
For the “categorical_array” type, a row has multiple values, and may have a different value for each subvariable. For example, you might field a survey where you ask respondents to rate soft drinks by filling in a grid of a set of brands versus a set of ratings:
72. How much do you like each soft drink?
Not at all Not much OK A bit A lot
Coke o o o o o
Pepsi o o o o o
RC o o o o o
The respondent may only select one rating in each row. To represent that answer data in Crunch, you would define an array. For example, you might POST a Variable Entity with the payload:
{
"element": "shoji:entity",
"body": {
"name": "Soft Drinks",
"type": "categorical_array",
"subvariables": [
"./subvariables/001/",
"./subvariables/002/",
"./subvariables/003/"
],
"subreferences": {
"./subvariables/002/": {"name": "Coke", "alias": "coke"},
"./subvariables/003/": {"name": "Pepsi", "alias": "pepsi"},
"./subvariables/001/": {"name": "RC", "alias": "rc"}
},
"categories": [
{"id": -1, "name": "No Data", "numeric_value": null, "missing": true},
{"id": 1, "name": "Not at all", "numeric_value": null, "missing": false},
{"id": 2, "name": "Not much", "numeric_value": null, "missing": false},
{"id": 3, "name": "OK", "numeric_value": null, "missing": false},
{"id": 4, "name": "A bit", "numeric_value": null, "missing": false},
{"id": 5, "name": "A lot", "numeric_value": null, "missing": false},
{"id": 99, "name": "Skipped", "numeric_value": null, "missing": true}
],
"values": [
[1, 2, {"?": 99}],
[{"?": -1}, 4, 3],
[5, 2, {"?": -1}],
]
}
}
The “Soft Drinks” categorical array variable may now be included in analyses like any other variable, but has 2 dimensions instead of the typical 1. For example, a crosstab of a 1-dimensional “Gender” variable with a 1-dimensional “Education” variable yields a 2-D cube. A crosstab of 1-D “Gender” by 2-D “Soft Drinks” yields a 3-D cube.
In rare cases, you may have already added a separate Variable for “Coke”, one for “Pepsi”, and one for “RC”. You may move them to a single array variable by POSTing a Variable Entity for the array that instead of a “subreferences” attribute has a “subvariables” attribute, a list of URL’s of the variables you’d like to bind together:
{
"body": {
"name": "Soft Drinks",
"type": "categorical_array",
"subvariables": [<URI of the "Coke" variable>, <URI of the "Pepsi" variable>, <URI of the "RC" variable>]
}
}
The existing variables are removed from the normal order and become virtual subvariables of the new array. This approach will cast all subvariables to a common set of categories if they differ. The existing name and alias of each subvariable will be moved to the array’s “subreferences” attribute.
If you wish to analyze a set of categorical variables as an array without moving them, you need to build a derived array instead.
{
"body": {
"name": "Soft Drinks",
"type": "categorical_array",
"derivation": {
"function": "array",
"args": [{
"function": "make_frame",
"args": [{"map": {
"000000": {"variable": <URI of the "Coke" variable>},
"000001": {"variable": <URI of the "Pepsi" variable>},
"000002": {"variable": <URI of the "RC" variable>}
}}]
}]
}
}
}
Your client library may have helper functions to construct the above more easily. This is a bit more advanced, but consequently more powerful: you can grab subvariables from other existing arrays, use more powerful subsetting functions like “frame_subset” and “subvariables”, cast, combine, what-have-you.
Multiple Response
The second type of array is “multiple_response”. These arrays look very similar to categorical_array variables in their data representations, but are usually gathered very differently and behave differently in analyses. For example, you might field a survey where you ask respondents to select countries they have visited:
38. Which countries have you visited?
[] USA
[] Germany
[] Japan
[] None of the above
The respondent may check the box or not for each row. To represent that answer data in Crunch, you would define an array Variable with separate subreferences for “USA”, “Germany”, “Japan”, and “None of the above”:
{
"element": "shoji:entity",
"body": {
"name": "Countries Visited",
"type": "multiple_response",
"subvariables": [
"./subvariables/001/",
"./subvariables/002/",
"./subvariables/003/",
"./subvariables/004/"
],
"subreferences": {
"./subvariables/002/": {"name": "USA", "alias": "visited_usa"},
"./subvariables/004/": {"name": "Germany", "alias": "visited_germany"},
"./subvariables/001/": {"name": "Japan", "alias": "visited_japan"},
"./subvariables/003/": {"name": "None of the above", "alias": "visited_none_of_the_above"}
},
"categories": [
{"id": -1, "name": "No Data", "numeric_value": null, "missing": true},
{"id": 1, "name": "Checked", "numeric_value": null, "missing": false, "selected": true},
{"id": 2, "name": "Not checked", "numeric_value": null, "missing": false},
{"id": 98, "name": "Not shown", "numeric_value": null, "missing": true},
{"id": 99, "name": "Skipped", "numeric_value": null, "missing": true}
]
}
}
Aside from the new type name, the primary difference from the basic categorical array is that one or more categories are marked as “selected”. These are then used to dichotomize the categories such that any subvariable response is treated more as if it were true or false (selected or unselected) than maintaining the difference between each category. If POSTing to create “multiple_response”, you may include a “selected_categories” key in the body, containing an array of category names that indicate the dichotomous selection. If you do not include “selected_categories”, there must be at least one “selected”: true category in the subvariables you are binding into the multiple-response variable to indicate the dichotomous selection–see Object Reference#categories. If neither are true, the request will return 400 status.
The “Countries Visited” multiple response variable may now be included in analyses like any other variable, but with a noticeable difference. Rather than contributing a dimension of distinct categories, it instead contributes a dimension of distinct subvariables. For example, a crosstab of a 1-dimensional “Gender” variable with a 1-dimensional “Education” variable yields a 2-D cube: one dimension of the categories of Gender and one dimension of the categories of Education. A crosstab of 1-D “Gender” by the multiple response “Countries Visited” also yields a 2-D cube: one dimension of the categories of Gender but the other dimension has one entry for “USA”, one for “Germany”, one for “Japan”, and one for “None of the above”.
A quirk of multiple response variables is that analyses of them often require knowledge across subvariables: which rows had any subvariable selected, which rows had no subvariable selected, and which rows had all subvariables marked as “missing”. The Crunch system calculates these ancillary “subvariables” for you, and includes them in analysis output. Including an explicit “None of the above” subvariable in the example above complicates this, since Crunch has no way of knowing to treat such subvariables specially; it will faithfully consider the “None of the above” subvariable like any other subvariable when calculating the any/none/missing views. Depending on your application, you may wish to 1) not even include that option in your survey, 2) skip adding that variable to your Crunch dataset, 3) add it but do not bind it into the parent array variable, or 4) include it and have it be treated like any other multiple response subvariable in your analyses.
Non-uniform basis
As presented above, multiple response variables assume that subvariables have a consistent, uniform basis or number of rows in each subvariable. In some cases, the number of valid and missing entries may be wildly different from one subvariable to the next. In a survey example, a new response may be added to a longer-running series, or different responses may be presented to subsets of respondents in the context of an experiment. The boolean field uniform_basis, if false, provides a hint to users that, rather than using the __any__ column (from the selections function output) in an analysis query, they should instead calculate the basis per subvariable by summing the ‘selected’ and ‘not selected’ categories. The field’s default is true.
Adding New Subvariables
In the scenario that a variable was left out when creating an array variable, it is possible to modify the array variable so that new subvariables get added (always on the last position).
To do so, the subvariable-to-be should currently be a variable of the dataset and have the same type as the subvariables (“categorical”).
Send a PATCH request containing the url of the new subvariable with an empty object as its tuple:
{
...
"index": {
"http://.../url/new/subvariable/": {}
}
}
A 204 response will indicate that the catalog was updated, and the new subvariable now is part of the array variable.