In the following, we outline the process you should take to achieve the outcomes described and point you to the relevant Crunch Automation commands. Whereas you can do simple exclusions in the web app (see this article), Crunch Automation is most often the best route to achieve data cleaning in terms of replicability and efficiency. Some users may choose to use R to do some special or niche tasks.
What is data cleaning?
Data cleaning generally describes three major DP tasks:
- Removing respondents from analysis — incomplete respondents, those who completed the survey “too quickly”, outliers by some metric, straight-liners on grid questions, screen-outs, and so on.
- In Crunch, this is achieved using Exclusions.
- Recoding variable information — fix programming errors, miscoded data due to survey routing logic, and so on.
- In Crunch, you have the choice to REPLACE data if it was collected incorrectly or CREATE a new variable preserving the original input data.
- Converting variable type — when variables have the incorrect variable type, such as unlabeled integers appearing as numeric when they are meant to represent logical categories.
Removing respondents from analysis (Exclusions)
The general process is as follows:
- Write down the logical conditions that describe the rows (respondents) that should be excluded.
- Optionally, store these in one or more logical variables for easy reference or later modification.
- Use the SET EXCLUSION command with either the logical condition(s) or with reference to the variable.
The first step is the main step where you set up all the criteria for exclusions. These criteria can include:
- Outliers
- Speedsters
- Straightliners/flatliners
- Screen-outs
- Illogical answering
Step one
Use the CREATE LOGICAL command to create a series of variables (one per criterion):
CREATE LOGICAL condition_1 AS exclude_1 DESCRIPTION “Description of condition 1”;
CREATE LOGICAL condition_2 AS exclude_2 DESCRIPTION “Description of condition 2”;
CREATE LOGICAL condition_3 AS exclude_3 DESCRIPTION “Description of condition 3”;
Step two
The second step is simply a recommended housekeeping step: by putting all of these derived variables into a hidden folder called “Exclusions”, you can easily identify which variables are being used in the exclusions:
ORGANIZE INTO HIDDEN “Exclusions”;
Step three
The final step does the excluding: if you’ve made the suggested different variables at step one, then the SET EXCLUSION step will exclude any case that has a value greater than or equal to one of any of these variables. The result of the exclusion will be apparent in the web app if you go to the Dataset Properties > Exclusions option.
SET EXCLUSION exclude_1 = 1 OR exclude_2 = 1 OR exclude_3 = 1 ;
Will the exclusions be permanently deleted?
The exclusion feature in Crunch hides rows in the main dataset. This means that if you Restore/Undo the Crunch Automation script, you will un-exclude the respondents. Note that re-running the script will reinstall the exclusions.
Likewise, if you open the Dataset Properties > Exclusions option and remove the exclusions there, it will also unhide the respondents. This is not recommended if you are following a scripting approach: if you need to change the exclusions, you should instead change it in your Crunch Automation script and then re-run the script.
If you have a tracking study, and you have a new wave of data, you will run your Crunch Automation script on the new wave. In that way, you will apply the same exclusions to the new wave, unless of course, you make any tweaks to the exclusion criteria (which may very well happen in trackers, since change is likely). In the subsequent append process, where the new wave is combined, excluded cases will not be brought over from the wave into the tracking dataset. See the Definitive Guide to Tracking article for more information.
Recoding variable information
Part of the data cleaning instructions can involve recoding to fix programming errors and survey skips and much more. It can vary greatly depending on the project and how much recoding needs to be done.
A key distinction in the recoding commands for Crunch Automation is between REPLACE and CREATE:
- REPLACE is where you want to overwrite a variable.
- CREATE is where you want to make a new variable.
Any of the REPLACE commands are schema-changing commands and must be run at the top of the very first script before any non-schema commands are run. If you want to make an alteration to the definition of the recoding when using REPLACE, you will need to Restore/Undo the script, make the change, and then re-run the script. You would use REPLACE if you definitely do not want the source variables to remain as they were on import.
CREATE, by contrast, derives a new variable. As such, you can derive new variables at any point (it doesn’t need to be at the top of the script). Generally speaking, CREATE is preferred to REPLACE since you are keeping the original source variables in case you want to refer to them in the future.
The main commands that get used in data recoding:
- REPLACE | CREATE CATEGORICAL RECODE
- REPLACE | CREATE CATEGORICAL CASE
- REPLACE | CREATE NUMERIC CASE
All the recoding logic and command tasks are covered in the Recoding: case-level missing & valid values ("rebasing") article.
RENAME is another schema command (involves changing variable aliases), but this is technically not a recode: RENAME tends to be used in alignment purposes (see Definitive Guide to Tracking).
Converting variable type
Sometimes datafiles (notoriously CSV and SPSS files) present variables to Crunch in the incorrect variable type (e.g., because the metadata in the datafile is wrong). A classic example is when dichotomous (Yes/No) variables are numeric (they should be categorical 1 = Yes and 2 = No because they subsequently will form a multiple response variable with CREATE MULTIPLE DICHOTOMY).
As with the previous section, with Crunch, you have the option of REPLACE and CREATE. In the context of changing variable type, you, therefore, can use either of the following: