See Crunch Automation basics for more information.
Overview
The CREATE CATEGORICAL CUT command converts a numeric variable to a categorical one, by defining boundaries between ranges. It allows an analyst to work with ranges of numbers, or "bins", as categories.
For example, you might collect information about "Age", "Income", "Year of manufacture", or "Number of wine bottles bought last month" with the exact number. That exact number is stored in a numeric variable. However, for the purposes of analysis, you might want to create bins that are relevant to the study. You could take the numeric variable of "Age" and from it derive a new variable with a small number of bins that become the basis of comparisons (e.g., "18-29", "30-39", "40+").
This is an example of a command with many different optional arguments as denoted by the use of [square brackets]. Specifying the breaks is not optional, because these are the cut points on the continuous numeric variable.
The CREATE CATEGORICAL CUT command can also take multiple input variables and produce multiple output variables. For example, if you have a set of numeric variables for which you want to make the same cuts, then you can do them all at once. Afterward, you can run the CREATE CATEGORICAL ARRAY command from the resultant variables (since they will all have the same categories).
CREATE CATEGORICAL CUT
alias, ..., alias
BREAKS [MIN,] number, ..., number [, MAX] [closed LEFT|RIGHT]
[LABELS "string" [CODE code] [NUMERIC VALUE <float>] [MISSING], ..., "string" [CODE code] [NUMERIC VALUE <float>] [MISSING]]
[SET MISSING value ["string"] [CODE code], ..., value ["string"] [CODE code]]
AS alias, ..., alias
[TITLE "string", ..., "string"]
[DESCRIPTION "string", ..., "string" | COPY]
[NOTES "string", ..., "string" | COPY];
Understanding the breaks
The CLOSED argument tells how the bins treat the cut point: is it up to and including the cut point (RIGHT)? Or is it anything that is the cut point and greater (LEFT)? The CLOSED side is the side that is included in the range.
The following are three simple examples to illustrate the CLOSED argument using a numeric variable storing Age where we want categories ("18–29", "30–39", "40+):
BREAKS 18, 30, 40 CLOSED LEFT
which results in two categories:
- "18 <= x < 30" (i.e., "18 up to but not including 30")
- "30 <= x < 40" (i.e., "30 and up to but not including 40")
BREAKS 18, 30, 40 CLOSED RIGHT
which results in two categories:
- "18 < x <= 30" (i.e., "Over 18 up to and including 30")
- "30 < x <= 40" (i.e., "Over 30 up to and including 40")
Additionally, if we include the MIN or MAX command, we then expand the ranges on either side of the cut points:
BREAKS MIN, 18, 30, 40, MAX CLOSED RIGHT
which results in four categories:
- "<=18" (i.e., "Up until 18 inclusive")
- "18 < x <= 30" (i.e., "Over 18 up to and including 30")
- "30 < x <= 40" (i.e., "Over 30 up to and including 40")
- ">40" (i.e., "Over 40")
Using labels
Labels for the categories are optional but recommended. If you don’t specify labels, you will end up with categories with mathematical inequalities as labels:
- <18
- 18 <= x < 30
- 30 <= x < 40
- >=40
Though, users probably prefer to see:
- Under 18
- 18–29
- 30–39
- 40+
You also have the option within the labels options of setting the categorical CODE. The code is not the numeric value associated with the category, but rather the id of the category. Generally, you don’t need to specify the code unless you have a specific reason to do so.
Using the DATE attribute
The CREATE CATEGORICAL CUT command also allows you to create categorical-date variables by specifying a DATE attribute to each of the categories defined in the command.
Creating a categorical date variable requires that all the non-missing categories contain a DATE attribute that should contain a valid and unique ISO-8601 date string:
CREATE CATEGORICAL CUT my_num
BREAKS MIN, 18, 25, 45, 75, MAX
LABELS
"Underage" MISSING,
"Young adults" CODE 1 DATE "2022-01-23",
"Adults" CODE 3 DATE "2022-01-24",
"Older adults" CODE 5 DATE "2022-01-25",
"Elder" CODE 100 MISSING
AS my_cut;
Setting missing values
Surveys can sometimes include numeric variables to denote "missingness". For example, "-1" or a large number like "999". If you don’t want these values included in the categorical cuts, then you can specify them as missing here. Missing values indicated here are compared using strict equality (if you want to cap variables, such as to remove outliers, you should first use the CREATE NUMERIC CASE command).
Again, you can set the code of a category (if you wish). An example is that you might want all the missing values with IDs such as 98 or 99: because the data is defined as missing, it then has no bearing on the analysis by a dataset viewer.
CREATE CATEGORICAL CUT
alias, ..., alias
BREAKS [MIN,] number, ..., number [, MAX] [closed LEFT|RIGHT]
[LABELS "string" [CODE code] [NUMERIC VALUE <float>] [MISSING], ..., "string" [CODE code] [NUMERIC VALUE <float>] [MISSING]]
[SET MISSING value ["string"] [CODE code], ..., value ["string"] [CODE code]]
AS alias, ..., alias
[TITLE "string", ..., "string"]
[DESCRIPTION "string", ..., "string" | COPY]
[NOTES "string", ..., "string" | COPY];
Use case one
Consider a numeric variable representing the respondents' reported integer age. For analysis, we would create a categorical version with four categories:
CREATE CATEGORICAL CUT age
BREAKS MIN, 18, 30, 40, MAX CLOSED LEFT
LABELS "Under 18", "18-29", "30-39", "40+"
AS age_categories
TITLE "Age Categories";
Use case two
Let’s consider another example with a series of twelve numeric variables that we want to bin up in order to evaluate the ranges. We can do the same categorical cut on all of them at once:
- Raw Data (Numeric variables):
CREATE CATEGORICAL CUT Q5_1...Q5_12
BREAKS MIN, 51, 101, 151, 201, MAX CLOSED LEFT
LABELS "Upto 50", "51-100", "101-150", "151-200", "201+" MISSING
AS Q5Newt1_1...Q5Newt1_12
DESCRIPTION "Brand 1", "Brand 2", "Brand 3", "Brand 4", "Brand 5", "Brand 6", "Brand 7", "Brand 8", "Brand 9", "Brand 10", "Brand 11", "Brand 12";
Here, we set the "201+" cases as outliers (MISSING). Note the difference in the Missing counts between the input and output variables. If you click the "Properties" of the categorical variables created, you can see the category "201+" and its counts marked as "Missing".
We might subsequently group the resultant categorical variables as a categorical array variable (since they share equivalent categories):
CREATE CATEGORICAL ARRAY
Q5Newt1_1...Q5Newt1_12
LABELS USE DESCRIPTIONS
AS Q5_array
TITLE "Number of visits";
As you already set up the brand names in all the variable’s description, you can simply set the labels with LABELS USE DESCRIPTIONS.