See Crunch Automation basics for more information.
The CREATE CATEGORICAL CASE command allows you to create your own categorical variable. There are many use cases for this (e.g., allocating respondents to segments, fixing survey skips (DP jargon sometimes refers to this as "rebasing"), and combining variables into one). CATEGORICAL CASE variables can be thought of as cure-all for many DP tasks.
Each category is defined on a case-by-case basis by a logical condition (see the expressions section for more information on conditions/expressions). When a case (i.e., row of the data, typically a respondent) evaluates to TRUE in the expression (boolean logic), then it is allocated to that category. The web application allows you to do the same thing using the Create Categorical variable builder.
- A common user error is to forget the END argument, which is needed to tell the script that the series of conditional statements is over.
- Use commas (,) in the script to separate each statement. Commas are needed in all but the last statement (see examples below).
THEN VARIABLE (optional argument)
You can also optionally set each category as a constant or as a fill from another variable. Both examples are illustrated in the following script. If you are filling from another variable, then this is when you use the THEN VARIABLE argument.
Creating a variable with constant categories:
CREATE CATEGORICAL CASE WHEN condition THEN "label" [CODE value [MISSING]], ... WHEN condition THEN "label" [CODE value [MISSING]] [ELSE "string" [CODE value [MISSING]|ELSE INTO NULL] END AS alias [TITLE "string"] [DESCRIPTION "string"] [NOTES "string"];
Creating a variable with a variable fill:
CREATE CATEGORICAL CASE WHEN condition THEN VARIABLE alias, ... WHEN condition THEN VARIABLE alias [ELSE "string" [CODE value [MISSING]|ELSE INTO NULL] END AS alias [TITLE "string"] [DESCRIPTION "string"] [NOTES "string"];
You can also use REPLACE within the CATEGORICAL CASE command. The difference is that REPLACE does not include the AS... assignment at the end, which is used to indicated a new variable. Instead, the new variables are indicated at the beginning (before WHEN).
Everything else remains the same, including how it computes. The results are not stored on a new variable but instead overwrites the mentioned columns:
REPLACE CATEGORICAL CASE alias,...,alias WHEN condition THEN "label" [CODE value [MISSING]], ... WHEN condition THEN "label" [CODE value [MISSING]] [ELSE "string" [CODE value [MISSING]|ELSE INTO NULL] END;
There are two use cases in the following: the first use case uses constant output categories and the second one uses a variable fill.
- Allocating respondents to segments (example: combining two questions)
- Setting certain cases with missing or non-missing (example: fixing a survey skip—DP jargon sometimes refers to this as "rebasing a variable")
The data for both cases can be found in the example files (Core Trends Survey 2019) article.
Allocating respondents to segments
Suppose we have the following two variables, and the second variable is a confirmation of the speed status (a double-check in the survey). For analysis purposes, we want to combine this into one (and perhaps clean up the labels as well), which leads us to the third variable on the right:
This can be achieved with the following Crunch Automation script:
CREATE CATEGORICAL CASE WHEN bbhome2 = "No, higher speed connection" OR bbhome1 = "Higher-speed" THEN "High speed", WHEN bbhome2 = "Yes, dial-up" OR bbhome1 = "Dial-up" THEN "Dial up", WHEN bbhome1 = "(VOL) Both Slow-speed/Dial-up and Higher-speed/Broadband" THEN "Both", WHEN bbhome1 = "(VOL) Access internet only using cell phone or tablet" THEN "Cell/Phone tablet" ELSE INTO NULL END AS bbhome_combined TITLE "Confirmed Speed of subscription" DESCRIPTION "Combing questions about home internet speed" NOTES "Base: Only those with home internet access";
Setting certain cases with missing or non-missing
Consider the following variable, which has n=130 set to system missing (“No Data”). We want to create a categorical case where the system missing values are reallocated to a "No Internet" category. In that sense, the variable is "based on the total" (respondents in the data; i.e., all rows):