This article is part of The Definitive Guide to Importing and Preparing Data.
High-quality data files are important to high-quality and efficient setup processes.
The three most important requirements when using SPSS files with Crunch are:
- Multiple response data should be in dichotomous format
- Encoding missing data to define the base
- Sufficient and tidy metadata in the SPSS file to recognize questions
Multiple response data should be in dichotomous format
Multiple response data should be stored in dichotomous columns, where there is one column per response option.
For example, if the question is:
Which of the following cola brands do you like?
Select all that apply.
- Coke
- Pepsi
- Fanta
There should be three columns in the data file for this question: one each for Coke, Pepsi, and Fanta. Each column should be stored in dichotomous format, that is, 1s and 0s plus missing data where applicable (see below).
That means that the data should be stored like the following:
The Coke response variable:
- 1 = “Selected”
- 0 = “Not Selected”
- System missing data
The Pepsi response variable:
- 1 = “Selected”
- 0 = “Not Selected”
- System missing data
The Fanta response variable:
- 1 = “Selected”
- 0 = “Not Selected”
- System missing data
It should not be stored like this:
The Coke response variable:
- 1 = “Coke”
The Pepsi response variable:
- 2 = “Pepsi”
The Fanta response variable:
- 3 = “Fanta”
Multiple response data is sometimes stored in a “fixed-column” format (sometimes called “max-multi”). Transform it to dichotomous format before using it in Crunch; otherwise, you may need to use the CREATE MULTIPLE SELECTIONS command in Crunch Automation.
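As a rough illustration of that transformation, the sketch below converts max-multi data to dichotomous columns in plain Python. The brand codes, the `mentions` slots, and the `was_asked` flag are all hypothetical names invented for this example; your file's layout will differ, and in practice you might do the same thing in SPSS syntax or a data-frame library.

```python
# Hypothetical max-multi layout: each respondent has up to three "mention"
# slots holding a brand code; None marks an unused slot.
BRANDS = {1: "coke", 2: "pepsi", 3: "fanta"}  # assumed code frame


def max_multi_to_dichotomous(mentions, was_asked=True):
    """Return one value per brand column for a single respondent.

    mentions  -- brand codes from the fixed-column slots (None = empty slot)
    was_asked -- False if the respondent never saw the question; every
                 brand column is then missing (None) rather than 0
    """
    if not was_asked:
        return {name: None for name in BRANDS.values()}
    selected = {code for code in mentions if code is not None}
    return {name: 1 if code in selected else 0 for code, name in BRANDS.items()}


respondents = [
    {"id": 1, "mentions": [1, 3, None], "was_asked": True},
    {"id": 2, "mentions": [2, None, None], "was_asked": True},
    {"id": 3, "mentions": [None, None, None], "was_asked": False},
]

dichotomous_rows = [
    {"id": r["id"], **max_multi_to_dichotomous(r["mentions"], r["was_asked"])}
    for r in respondents
]

for row in dichotomous_rows:
    print(row)
```

Respondent 3 was never shown the question, so all three brand columns come out as missing rather than 0.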
Encoding missing data to define the base
Missing data is what defines the base of statistics (i.e., the valid base). Without missing data, you are reliant on filters, which is tedious and error-prone because a user may forget to apply the filter during analysis.
For a multiple-response question, if a response is hidden from a respondent or if they are not exposed to it due to survey skip logic, then the cell must be encoded with missing data.
Each dichotomous variable in a multiple response question can take one of three values:
- 1 = “Selected”
- 0 = “Not Selected”
- NA = “Missing”
If the data is encoded with 1s and 0s only, then it is effectively “based to the total sample”. That may not reflect reality: for example, the response option may only have become available in wave 4, or may not have been offered in the German market. In that case, the data for waves 1, 2, and 3 should be encoded as missing, not 0, and likewise for the German sample.
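The effect on the base can be sketched in a few lines of Python. This is only an illustration under assumed names: the `wave` and `market` fields, the `"DE"` market code, and the `fanta` column are hypothetical, and `None` stands in for system-missing.

```python
FIRST_WAVE_OFFERED = 4        # assumption: the option was introduced in wave 4
MARKETS_NOT_OFFERED = {"DE"}  # assumption: the option was never shown in Germany


def encode_availability(row, column):
    """Return the stored value, or None (missing) if the option was not offered."""
    offered = (
        row["wave"] >= FIRST_WAVE_OFFERED
        and row["market"] not in MARKETS_NOT_OFFERED
    )
    return row[column] if offered else None


def percent_selected(values):
    """Percentage of 1s among non-missing values, i.e. on the valid base."""
    valid = [v for v in values if v is not None]
    return 100 * sum(valid) / len(valid) if valid else None


rows = [
    {"wave": 1, "market": "UK", "fanta": 0},
    {"wave": 3, "market": "UK", "fanta": 0},
    {"wave": 4, "market": "UK", "fanta": 1},
    {"wave": 4, "market": "UK", "fanta": 0},
    {"wave": 4, "market": "DE", "fanta": 0},
    {"wave": 5, "market": "UK", "fanta": 1},
]

raw = [r["fanta"] for r in rows]
recoded = [encode_availability(r, "fanta") for r in rows]

print(round(percent_selected(raw), 1))      # based to the total sample
print(round(percent_selected(recoded), 1))  # based on those who could select it
```

With 0s everywhere, two of six respondents count as selecting the option; once the unavailable cells are missing, it is two of the three respondents who could actually have selected it.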
For categorical (single-response) questions, if you add a category to a variable at a point in time, you are actually changing that variable and making it a different question. For example, if you have a 5-point scale (single-select) and then you add in a “Don’t Know” answer option at wave 4, then you’ve got a different question from wave 4 onwards because the distribution of responses can fundamentally change. So your options for that question are:
- accept in your analysis that the new category will show zero percent in waves 1, 2, and 3, or
- split the question into two questions: one version that is for waves before wave 4, and one version that is for wave 4 onwards. In doing so, you are acknowledging that you cannot compare/trend the two variables since they are fundamentally different measures.
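If you take the second option, the split itself is mechanical. The following is a minimal sketch in plain Python, assuming a hypothetical single-select variable `q5` where code 6 (“Don’t Know”) was added in wave 4; the generated column names are invented for the example.

```python
SPLIT_WAVE = 4  # assumption: the new answer option was added in wave 4


def split_by_wave(rows, column, split_wave=SPLIT_WAVE):
    """Copy `column` into two new columns, each missing outside its own waves."""
    before = f"{column}_pre_w{split_wave}"
    onwards = f"{column}_w{split_wave}_on"
    result = []
    for row in rows:
        is_late = row["wave"] >= split_wave
        result.append({
            "wave": row["wave"],
            before: None if is_late else row[column],
            onwards: row[column] if is_late else None,
        })
    return result


rows = [
    {"wave": 1, "q5": 3},
    {"wave": 3, "q5": 5},
    {"wave": 4, "q5": 6},  # 6 = "Don't Know", only possible from wave 4
    {"wave": 5, "q5": 2},
]

split_rows = split_by_wave(rows, "q5")
for row in split_rows:
    print(row)
```

Each respondent has a value in exactly one of the two new columns, so each version of the question carries its own valid base.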
Sufficient and tidy metadata in the SPSS file to recognize questions
The metadata in an SPSS file should be tidy and clean, enabling the user to link back to the questionnaire or other metadata map easily.
- SPSS variable names (called “aliases” in Crunch) should be short and logical. They should ideally reflect the question number in the questionnaire.
- SPSS variable labels should be tidy and avoid truncation. Long question text is commonly truncated, which can obscure what the variable is actually about (e.g., the statement, the TV program, the brand, or whatever it refers to).
Furthermore:
- Variable labels and category labels need to avoid duplication
- For example, you cannot repeat “Amazon” as a response option in the same question. Either roll the duplicate into the first one, or relabel one of the response options (e.g., “Amazon 1” or similar).
- Variables should be in chronological order in the data map
- Rather than appending new variables for question 3 at the bottom, keep all the question 3 variables together in sequence.
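A quick pre-import check for duplicated category labels might look like the sketch below. The `value_labels` dictionary is hypothetical metadata typed in by hand for illustration; in a real workflow you would pull it from the SPSS file with whatever tool you use. Comparing labels after trimming whitespace is also an assumption made here, not a statement of how Crunch matches labels.

```python
from collections import Counter


def duplicate_labels(labels):
    """Return labels that occur more than once, after trimming outer whitespace."""
    counts = Counter(label.strip() for label in labels)
    return sorted(label for label, n in counts.items() if n > 1)


# Hypothetical metadata: variable name -> {code: category label}
value_labels = {
    "q3": {1: "Amazon", 2: "eBay", 3: "Amazon ", 4: "Other"},
    "q4": {1: "Yes", 2: "No"},
}

# Keep only the variables that have at least one repeated category label.
problems = {
    var: dups
    for var, cats in value_labels.items()
    if (dups := duplicate_labels(cats.values()))
}

print(problems)  # {'q3': ['Amazon']}
```

Any variable listed in `problems` needs its labels merged or renamed before the file is imported.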