High-quality data files are important to high-quality and efficient setup processes.
The three most important requirements when using SPSS files with Crunch are:
- Multiple response data should be in dichotomous format
- Encoding missing data to define the base
- Sufficient and tidy metadata in the SPSS file to recognise questions
Multiple response data should be in dichotomous format
Multiple response data should be stored in dichotomous columns, where there is one column per response option.
For example, if the question was
Which of the following cola brands do you like?
Select all that apply.
- Coke
- Pepsi
- Fanta
There should be 3 columns in the data file for the question: one each for Coke, Pepsi and Fanta.
The data should be stored in dichotomous format (ie: 1’s and 0’s, plus missing data where it applies, see below). That means that the data should be stored like the following:
The Coke response variable
- 1 = “Selected”
- 0 = “Not Selected”
- System missing data
The Pepsi response variable
- 1 = “Selected”
- 0 = “Not Selected”
- System missing data
It should not be stored like this:
Coke response variable
1 = “Coke”
Pepsi response variable
2 = “Pepsi”
Multiple response data is sometimes stored in a “fixed-column” format (sometimes called “max-multi”), where each column holds the code of one selected option. This should be transformed to dichotomous format before use in Crunch (otherwise you may need to use the MULTIPLE SELECTIONS command in Crunch).
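As a rough illustration of that transformation, here is a minimal Python sketch using only the standard library. The brand codes, the `to_dichotomous` function and the `asked` flag are all assumptions made for this example; they are not part of Crunch or SPSS, and a real data file would need its own code frame.

```python
# Hypothetical code frame: the codes 1-3 are assumptions for illustration.
BRANDS = {1: "Coke", 2: "Pepsi", 3: "Fanta"}


def to_dichotomous(codes, asked=True):
    """Turn one respondent's fixed-column answers into one 1/0 flag per brand.

    codes: brand codes from the fixed columns, with None in unused slots.
    asked: False if the respondent never saw the question.
    """
    if not asked:
        # Not exposed to the question: every brand is missing, not 0.
        return {name: None for name in BRANDS.values()}
    selected = {code for code in codes if code is not None}
    return {name: int(code in selected) for code, name in BRANDS.items()}


# Respondent picked Pepsi and Coke, in that order, out of three slots.
print(to_dichotomous([2, 1, None]))
# Respondent skipped past the question.
print(to_dichotomous([None, None, None], asked=False))
```

Note that a fixed-column row with every slot empty is ambiguous on its own (nothing selected, or never asked), which is why the sketch takes the exposure information as a separate argument.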
Encoding missing data to define the base
Missing data is what defines the base of statistics (ie: the valid base). Without missing data, you are reliant on filters. Relying on filters can be tedious and error-prone (because you may forget to apply the filter in the course of your analysis).
In a multiple response question, if a response option is hidden from a respondent, or if they are not exposed to it due to a skip, then the cell must be encoded as missing data.
A dichotomous multiple response variable can take 3 values:
- 1 = “Selected”
- 0 = “Not Selected”
- NA = “Missing”
If the data is encoded with 1’s and 0’s only, then it is effectively “based to the total sample”. This may not be the case in reality, for example if the response option only became available in wave 4 or wasn’t available in the German market. In that case, the data for waves 1, 2 and 3 should be encoded as missing and not 0. Likewise for the German sample.
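The effect on the base can be sketched in a few lines of Python. Everything named here is hypothetical: the `wave`, `market` and `fanta` fields, the "DE" market code, and the two helper functions are assumptions for illustration only, with `None` standing in for system missing.

```python
def mask_unavailable(rows, column, first_wave, excluded_markets=()):
    """Return copies of rows with `column` set to None where it was not on offer."""
    masked = []
    for row in rows:
        row = dict(row)  # copy so the input is left untouched
        if row["wave"] < first_wave or row["market"] in excluded_markets:
            row[column] = None
        masked.append(row)
    return masked


def percent_selected(rows, column):
    """Percentage of 1s among non-missing values, ie: on the valid base."""
    valid = [row[column] for row in rows if row[column] is not None]
    return 100 * sum(valid) / len(valid) if valid else None


rows = [
    {"wave": 3, "market": "UK", "fanta": 0},  # option not yet in the survey
    {"wave": 4, "market": "UK", "fanta": 1},
    {"wave": 4, "market": "UK", "fanta": 0},
    {"wave": 4, "market": "DE", "fanta": 0},  # option not offered in this market
]
fixed = mask_unavailable(rows, "fanta", first_wave=4, excluded_markets=("DE",))

print(percent_selected(rows, "fanta"))   # based to the total sample
print(percent_selected(fixed, "fanta"))  # based to those who saw the option
```

With these four made-up respondents, the 0-filled data gives 25% while the correctly masked data gives 50%, because only two respondents could actually have selected the option.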
For categorical (single-response) questions, if you add a category to a variable at a point in time, you are actually changing that variable and making it a different question. For example, if you have a 5-point scale (single-select) and then you add in a “Don’t Know” answer option at wave 4, then you’ve actually got a different question from wave 4 onwards (because the distribution of responses can fundamentally change). So you either
- accept this as you do your analysis (ie: that you’re getting zeros for the percentages of the new category in waves 1, 2 and 3), or
- split the question into two. One version that is prior to wave 4, and one version that is wave 4 onwards. In doing so, you are acknowledging that you really can’t compare/trend the two variables, since they are fundamentally different measures.
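The second option might look something like the following sketch. The `q5` alias, the `_pre`/`_post` suffixes and the `split_at_wave` helper are hypothetical names chosen for this example; `None` again stands in for missing data.

```python
def split_at_wave(rows, column, change_wave):
    """Replace `column` with two variables: one before the change, one after.

    Each respondent has a value in exactly one of the two new variables
    and missing (None) in the other, so each keeps its own valid base.
    """
    split = []
    for row in rows:
        row = dict(row)  # copy so the input is left untouched
        value = row.pop(column)
        is_old = row["wave"] < change_wave
        row[column + "_pre"] = value if is_old else None
        row[column + "_post"] = None if is_old else value
        split.append(row)
    return split


# Code 6 is the hypothetical "Don't Know" category added at wave 4.
rows = [{"wave": 3, "q5": 2}, {"wave": 4, "q5": 6}]
result = split_at_wave(rows, "q5", change_wave=4)
print(result)
```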
Sufficient and tidy metadata in the SPSS file to recognise questions
The metadata in an SPSS file should be tidy, clean and enable the user to link back to the questionnaire or other metadata map easily.
SPSS variable names (called “aliases” in Crunch) should be short and logical. They should ideally reflect the question number in the questionnaire.
SPSS variable labels should be tidy and avoid truncation. Long-winded question text is often truncated, which can obscure the information that identifies the variable (ie: the statement, the TV program, the brand, or whatever it refers to).
Furthermore:
- Variable labels and category labels need to avoid duplication
- Eg: you can’t repeat “Amazon” as a response option. Either the two are rolled into one, or one of them is encoded as “Amazon 1” (or similar).
- Variables should be in chronological order in the data map
- Ie: don’t put new variables for question 3 right at the bottom. All the question 3 variables should belong together in sequence.
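A duplication check of this kind is easy to run before upload. The sketch below is one possible approach, not a Crunch feature: the `duplicate_labels` helper is hypothetical, and treating labels as equal when they differ only by case or surrounding spaces is an assumption you may want to relax.

```python
from collections import Counter


def duplicate_labels(labels):
    """Return the labels that occur more than once, after normalising.

    Normalising (strip + lower-case) is an assumption of this sketch.
    """
    counts = Counter(label.strip().lower() for label in labels)
    return sorted(label for label, n in counts.items() if n > 1)


# Hypothetical category labels for one variable.
category_labels = ["Amazon", "eBay", "Etsy", "Amazon"]
dupes = duplicate_labels(category_labels)
print(dupes)
```

The same function can be applied to the list of variable labels across the whole file.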