In the following, we outline the process for each data cleaning task and point you to the relevant Crunch Automation commands. While you can apply simple exclusions in the web app (see this article), Crunch Automation is usually the best route for data cleaning because it is replicable and efficient. Some users may also choose R for special or niche tasks.
What is data cleaning?
Data cleaning generally describes three major data processing tasks:
- Removing respondents from analysis — incomplete respondents, those who completed the survey “too quickly”, outliers by some metric, straight-liners on grid questions, screen-outs, and so on.
- In Crunch, this is achieved using Exclusions.
- Recoding variable information — fix programming errors, miscoded data due to survey routing logic, and so on.
- In Crunch, you have the choice to REPLACE data if it was collected incorrectly or CREATE a new variable preserving the original input data.
- Converting variable types — when variables have the incorrect variable type, such as unlabeled integers appearing as numeric when they are meant to represent logical categories.
Removing respondents from analysis (Exclusions)
The general process is as follows:
- Write down the logical conditions that describe the rows (respondents) that should be excluded.
- Optionally, store these in one or more logical variables for easy reference or later modification.
- Use the SET EXCLUSION command with either the logical condition(s) or a reference to the variable.
The first step is the main step where you determine all the criteria for exclusions. These criteria can include:
- Incomplete responses and screen-outs
- Speeders (respondents who completed the survey “too quickly”)
- Straight-liners on grid questions
- Outliers by some metric
- Illogical answering
Use the CREATE LOGICAL command to create a series of variables (one per criterion).
CREATE LOGICAL condition_1 AS exclude_1 DESCRIPTION "Description of condition 1";
CREATE LOGICAL condition_2 AS exclude_2 DESCRIPTION "Description of condition 2";
CREATE LOGICAL condition_3 AS exclude_3 DESCRIPTION "Description of condition 3";
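As a concrete sketch, suppose the dataset has a numeric interview_duration variable holding completion time in seconds, and a numeric q5_grid_sd variable holding the standard deviation of answers across a grid (both hypothetical aliases). The criteria from step one might then look like:

CREATE LOGICAL interview_duration < 180 AS exclude_speeder DESCRIPTION "Completed the survey in under 3 minutes";
CREATE LOGICAL q5_grid_sd = 0 AS exclude_straightliner DESCRIPTION "Gave the same answer to every row of the Q5 grid";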
The second step is simply a recommended housekeeping step. By putting all of these derived variables into a hidden folder called “Exclusions”, you can easily identify which variables are being used in the exclusions.
ORGANIZE exclude_1, exclude_2, exclude_3 INTO HIDDEN "Exclusions";
The final step applies the exclusion to the dataset. If you’ve made the suggested individual logical variables at step one, the SET EXCLUSION step will exclude any case where at least one of these variables is true (equal to 1). The result of the exclusion is visible in the web app under Dataset Properties > Exclusions.
SET EXCLUSION exclude_1 = 1 OR exclude_2 = 1 OR exclude_3 = 1;
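Alternatively, as noted in step three, you can pass the logical conditions directly to SET EXCLUSION without creating intermediate variables. A minimal sketch, where interview_duration is a hypothetical numeric alias holding completion time in seconds:

SET EXCLUSION interview_duration < 180;

The intermediate-variable approach is usually preferable, though, because it documents each criterion separately and makes later tweaks easier.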
Will the exclusions be permanently deleted?
The exclusion feature in Crunch hides rows in the main dataset. This means that if you Restore/Undo the Crunch Automation script, you will un-exclude the respondents. Note that re-running the script will reinstall the exclusions.
Likewise, if you open the Dataset Properties > Exclusions option and remove the exclusions there, the respondents will be unhidden. This is not recommended if you are following a scripting approach: if you need to change the exclusions, change them in your Crunch Automation script and re-run it.
If you have a tracking study, you will need to run your Crunch Automation script on each new wave of data. That way, the same exclusions are applied to the new wave, unless you need to tweak the exclusion criteria (which often happens in trackers). In the subsequent append process, where the new wave is combined with the tracking dataset, excluded cases will not be brought over. See the Definitive Guide to Tracking article for more information.
Recoding variable information
Part of data cleaning can involve recoding to fix programming errors, survey skips, and much more. The effort required varies greatly depending on the project and how much recoding needs to be done.
There is a key distinction in the recoding commands for Crunch Automation:
- REPLACE overwrites an existing variable in place.
- CREATE derives a new variable, preserving the original input data.
All of the REPLACE commands are schema-changing commands and therefore must be run at the top of the very first script run on a dataset before running any non-schema commands. If you need to make a modification to the definition of the recoding when using REPLACE, you will need to Restore/Undo the script, make the change in the script, and then re-run the script. You would use REPLACE if you definitely do not want the source variables to remain as they were on import.
CREATE, by contrast, derives a new variable. Since you are not modifying the schema of the original input data, you can derive new variables at any point in the script; they don’t need to be at the top. Generally speaking, CREATE is preferred over REPLACE because it keeps the original source variables in case you need to refer to them in the future.
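To illustrate the CREATE approach, here is a minimal sketch using the CREATE CATEGORICAL CASE command to derive a banded variable from a hypothetical numeric age alias (check the command reference for the exact clause order and optional clauses such as codes):

CREATE CATEGORICAL CASE
    WHEN age < 35 THEN "Under 35"
    WHEN age >= 35 THEN "35 or older"
    AS age_group TITLE "Age group";

Because this derives a new age_group variable, the original age variable remains available for later reference.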
The main commands used in data recoding are:
Additional recoding logic and command tasks are covered in the Recoding: case-level missing & valid values ("rebasing") article.
RENAME is another schema command, used to change variable aliases. While technically not a recode, RENAME tends to be used for alignment purposes (see the Definitive Guide to Tracking).
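For example, to align a wave-specific alias with the alias used in the tracking dataset (both aliases hypothetical):

RENAME q1_w2 TO q1;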
Converting variable types
Sometimes data files (CSV files are notorious for this) present variables to Crunch with the incorrect variable type, often because the metadata in the data file is wrong. A classic example is dichotomous (Yes/No) variables arriving as numeric when they should be categorical, with something like 1=Yes and 2=No, because they will subsequently form a multiple response variable via the CREATE MULTIPLE DICHOTOMY command.
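Once such Yes/No variables are categorical, they can be combined into a multiple response variable. A sketch with hypothetical aliases brand_a, brand_b, and brand_c, assuming each has a "Yes" category:

CREATE MULTIPLE DICHOTOMY FROM brand_a, brand_b, brand_c SELECTED "Yes" AS brands_aware TITLE "Brand awareness";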
Similar to the above, you have the option of REPLACE and CREATE. In the context of changing variable type, you can use either of the following: