Overview
Crunch Automation is only available for users who have Editor access on the folder where the dataset is saved. See the following article for more information:
This page describes how to use Crunch Automation, a scripting language to execute common Crunch commands. Crunch Automation runs inside the Crunch app, which means 1.) no development environment is required and 2.) no code is run on users' computers. Crunch Automation is executed via scripts, which are plain text files with a sequence of automation commands that run on existing datasets. Examples of available operations include creating filters, creating new derived variables, and changing variable attributes.
You must be the dataset’s current editor in order to execute a Crunch Automation script on a dataset. Crunch Automation can be accessed from within a dataset using the dataset dropdown menu in the upper left corner:
See the how-to guide for further information on how to use Crunch Automation in the web app.
Writing a script
Scripts are plain flat-text files that consist of a list of commands, each terminated by a semicolon ; and a new line for the next command.
The commands are organized into schema commands and non-schema commands.
The following schema commands must always be at the top of the script since they can change the schema of the dataset on which they’re running:
- DROP VARIABLES command
- RENAME command
- REPLACE CATEGORICAL ARRAY CASE command
- REPLACE CONVERT command
- RESET FOLDERS command
After the first non-schema command is declared, the app returns an error if any schema command is found subsequently in the script.
Comments
Scripts support comments. To write or comment any line, it must begin with the pound # character.
Variables
Variables are referred to by their unique alias in a dataset. If any RENAME command is used, then this how you should refer to the variable going forward.
Aliases do not need quotation marks within the script unless they are invalid. Valid aliases:
- cannot start with a number
- cannot contain spaces
- cannot contain any special characters, such as ()!@#$%^& *[]{}?¿`\\'\".,/"
You cannot create new variables with invalid aliases.
If your dataset already includes invalid aliases, you can refer to them in an Automation script by using backticks (`). You are encouraged to rename them using the RENAME command.
# Example use of an alias without escaping
variable_alias
# Example of an alias that has spaces escaped
`alias with space`
Referring to variables in an array
When working with an array of variables, each variable is referenced as array_alias[variable_alias].
For example, if you are working with an array (my_cat_array) that contains variables (var_1, var_2, var_3), you would refer to var_1 as follows:
CHANGE TITLE IN my_cat_array[var_1] WITH "My Variable Name";
Using regex matching variables
It is now possible to expand variables using regex matching against the dataset schema:
regex("quoted_regex")
which evaluates the regular expression against the beginning and end of each alias in the dataset.
For variables a_1, a_2, a_21, a_a, b_1:
regex("a_.") "a_1", "a_2", "a_a"
regex("a_.+") "a_1", "a_2", "a_a", "a_21"
REGEX("a_[1-9]") "a_1", "a_2"
regex("a_[a-z]") "a_a"
Additionally, a LIKE() function is available as well that performs SQL-like % matching:
% character will match any string
_ character will match any single character
For variables 1a2, 2a3, 1b2:
like("%a%") "1a2", "2a3"
LIKE("1%") "1a2", "1b2"
LIKE("1_2") "1a2", "1b2"
As with the rest of the syntax, the function calls are not case sensitive, but the matching argument is.
These can be combined in the same way with other kinds of expansions:
- a_1...a_10
- regex("b.?")
- other alias
which returns the full list of all three expansions.
Using the subvariables_of function
The subvariables_of function allows you to pass multiple array aliases and returns all variables in those arrays in the format array_alias[variable_alias].
subvariables_of(my_array1, my_array2, ...)
If any non-arrays are passed to this function an error is returned.
Using the ALL() function
You can use the ALL() function to indicate all the variables in a dataset. This function takes a single argument that refers to all variables of a certain type:
For example, use the following for variables my_num1, my_num2, my_text, and my_categorical:
ALL() "my_num1", "my_num2", "my_text", "my_categorical"
all("numeric") "my_num1", "my_num2"
All("text") "my_text"
Using the STARTSWITH / ENDSWITH functions
The STARTSWITH() and ENDSWITH() functions allow you to match variables aliases that either start or end with certain strings (case sensitive):
STARTSWITH("prefix_")
ENDSWITH("_suffix")
Generating aliases or variables attributes
Crunch Automation provides a handful of functions that operate on attributes of the input variables so that they can be used as attributes of the newly-generated variables:
- COPY(field_name)
- KEEP_AFTER(field_name, separator)
- KEEP_BEFORE(field_name, separator)
- KEEP_BETWEEN(field_name, start_separator, end_separator)
- PREFIX(field_name, prefix)
- REGEX_REPLACE(field_name, match, replacement)
- REMOVE_COMMON_PREFIX(field_name)
- REMOVE_COMMON_SUFFIX(field_name)
- SUFFIX(field_name, suffix)
For example:
CREATE CONVERT q_1...q_100 TO NUMERIC AS
REGEX_REPLACE(ALIAS, "^q_", "num_")
TITLE COPY(DESCRIPTION)
DESCRIPTION PREFIX(TITLE, "Numeric version of ")
NOTES REMOVE_COMMON_PREFIX(NOTES);
Expressions
The Crunch Automation language allows you to use logical and arithmetic expressions in filters or other commands.
Parenthesis order is respected, and the following junctions are supported:
- <
- >
- >=
- <=
- ==
- AND
- OR
- ORNM
- +
- -
- *
- /
Expressions support function calls, and the following functions are supported:
- add_valid(numeric, ..., numeric): Takes a list of numeric values or numeric arrays and returns the row sum of the valid values.
- all_missing(array): Returns rows where all of the subvariables have missing values.
- all_valid(array): Returns rows where none of the subvariables have missing values.
- any_missing(array): Returns rows where any of the subvariables have a missing value.
- complete_cases(variable, ..., variable): Returns the matching rows where values of all variables involved are not missing.
- duplicates(variable): Returns the column removing any repeated value. Returning only the first occurrence for each.
- is_missing(variable): Returns boolean type with True values on those rows that have a missing value.
- is_valid(variable): Returns boolean type with True on rows that don’t have a missing value.
- straightline(array): Returns a logical variable selecting those rows where only one value has been selected for all responses.
- trim(numeric_variable, min_number, max_number): Replaces all values lesser than min_number and replaces them with it, same for max_number.
Additionally, all boolean expressions can be wrapped in a NOT (expr) negation.
The special value NULL can be used in expressions to refer to system missing values for categorical variables.
Example expressions:
CREATE FILTER Age > 18 AND (Gender = "F" OR Gender = "Female") NAME "Adult women";
CREATE FILTER favorite_browser IN ["Chrome", "Firefox"] AND NOT is_missing(answered) NAME "Top Browsers";
CREATE FILTER 3 <= rating < 8 AND (Age / 2) + 7 NAME "Medium rating dating range";
CREATE FILTER opinion = NULL NAME "No opinion";
CREATE NUMERIC (2020 - Age) * 2 AS double_age NAME "Double Age";
Example script
# First commands specify the Schema block
REPLACE CONVERT v1, v2 TO CATEGORICAL;
REPLACE CONVERT v3, v4 TO NUMERIC;
RENAME v1 TO Age;
RENAME v2 TO Year;
# This is the first non-schema command; no schema commands are possible
# after this.
# Sets the exclusion filter on the dataset
SET EXCLUSION v3 < 0;
CREATE FILTER Age < 20 AS "Teenagers";
Running a script
Following any dataset’s shoji:entity representation, a scripts catalog will be linked.
To run a dataset, a client must perform a POST request with a shoji:entity to the scripts’ catalog with a body attribute containing the text contents of the script file inlined in the payload:
POST /api/datasets/123456/scripts/ HTTP/1.1
{
"element": "shoji:entity",
"body": {
"body": "<Script contents>"
}
}
The server will validate that the commands are consistent with the dataset’s schema and return either a 202 response indicating a progress of the script’s execution, or a 400 response with any detected errors.
Managing a script
There is no management of scripts: once a script is executed, you cannot manage it any further after that. Its record exists as an execution log, as a reference of what changes were performed on it.
There is also no support to DELETE scripts. The only alternative to undo or delete a script is to revert the dataset to a savepoint before its execution.
Reverting a script
There are two ways to revert the output of a script:
- UNDO - Delete he artifacts and variables created by a script, or
- RESTORE - A hard revert that returns the dataset to the state it was before running such script, deleting not only the artifacts and variables created by a script but also reverting all other changes made by the Crunch app, rCrunch, or any other API calls subsequent to running that script.
The difference between both is that a hard revert restores the dataset, as it drops all ensuing scripts and their output (artifacts and variables), while an undo only deletes the artifacts and variables created by this script, but changes made by other scripts and this script's record will remain in place. In both cases, the associated script will be deleted.
Undo
Undoing a script's output is accomplished by sending a DELETE request on the script's output catalog.
In cases where there are any dependencies that prevent the artifacts from being deleted, the request will return a 409 response.
Restore
The application stores a savepoint of the dataset directly before the execution of the commands in a script.
To perform a script-driven restore where all the associated non-versioned artifacts are deleted, the client must send a POST request to the revert endpoint. This will fire up an asynchronous task to iteratively drop all the artifacts from all scripts and restore the dataset.
In cases where there are any dependencies that prevent this script from being reverted, the request will return a 409 response. Keep in mind that some automation commands (e.g. ORGANIZE) cannot be reverted by the undo or restore commands.
Crunch Automation commands
See the following sections to view a current list of all of the commands: