SPSS Datafile Specifications

This article is part of The Definitive Guide to Importing and Preparing Data.

High-quality data files are important to high-quality and efficient setup processes.

The three most important requirements when using SPSS files with Crunch are:

Multiple response data should be in dichotomous format
Encoding missing data to define the base
Sufficient and tidy metadata in the SPSS file to recognize questions

Multiple response data should be in dichotomous format

Multiple response data should be stored in dichotomous columns, where there is one column per response option.

For example, if the question is:

Which of the following cola brands do you like? 
Select all that apply.
- Coke
- Pepsi
- Fanta

There should be three columns in the datafile for this question: one each for Coke, Pepsi, and Fanta. The data should be stored in dichotomous format (i.e., 1s and 0s, plus missing data where applicable, as described under “Encoding missing data to define the base”).

That means that the data should be stored like the following:

The Coke response variable:

1 = “Selected”
0 = “Not Selected”
System missing data

The Pepsi response variable:

1 = “Selected”
0 = “Not Selected”
System missing data

The Fanta response variable:

1 = “Selected”
0 = “Not Selected”
System missing data

It should not be stored like this:

The Coke response variable:
- 1 = “Coke”
The Pepsi response variable:
- 2 = “Pepsi”
The Fanta response variable:
- 3 = “Fanta”

Multiple response data is sometimes stored in a “fixed-column” format (sometimes called “max-multi”). This should be transformed to dichotomous format before use in Crunch; otherwise, you may need to use the CREATE MULTIPLE SELECTIONS command in Crunch Automation.
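As a rough illustration of that transformation, the following plain-Python sketch converts max-multi rows into one 1/0 column per brand. The brand codes, the `mentions` and `asked` field names, and the `to_dichotomous` helper are all hypothetical and are not part of Crunch or SPSS; in practice you would do the equivalent in your own data-preparation tool.

```python
# Hypothetical "max-multi" layout: each respondent has up to three mention
# slots holding a brand code; None marks an unused slot.
BRANDS = {1: "coke", 2: "pepsi", 3: "fanta"}  # assumed code frame

def to_dichotomous(mentions, asked=True):
    """Return one 1/0 value per brand, or all-missing if the question was not asked."""
    if not asked:
        return {name: None for name in BRANDS.values()}
    chosen = {code for code in mentions if code is not None}
    return {name: int(code in chosen) for code, name in BRANDS.items()}

rows = [
    {"mentions": [1, 3, None], "asked": True},
    {"mentions": [2, None, None], "asked": True},
    {"mentions": [None, None, None], "asked": True},   # asked, selected nothing
    {"mentions": [None, None, None], "asked": False},  # skipped by survey routing
]
dichotomous = [to_dichotomous(r["mentions"], r["asked"]) for r in rows]
```

Note that a respondent who was asked but selected nothing gets 0s, while a respondent who was never asked gets missing values in every column.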

Encoding missing data to define the base

Missing data is what defines the base of statistics (i.e., the valid base). Without missing data, you are reliant on filters, which can be tedious and error-prone, as a user may forget to apply the filter in the course of analysis.

For a multiple-response question, if a response is hidden from a respondent or if they are not exposed to it due to survey skip logic, then the cell must be encoded with missing data.

Each dichotomous variable in a multiple-response question can take one of three values:

1 = “Selected”
0 = “Not Selected”
NA = “Missing”
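To see how missing values define the valid base, consider this minimal plain-Python sketch. It uses `None` to stand in for missing data, and `percent_selected` is a hypothetical helper rather than anything Crunch provides.

```python
def percent_selected(values):
    """Percentage of 1s among non-missing values; None if the valid base is empty."""
    valid = [v for v in values if v is not None]
    if not valid:
        return None
    return 100 * sum(valid) / len(valid)

# Six respondents; two were never shown the Fanta option.
fanta_with_missing = [1, 0, None, None, 1, 0]
# The same respondents with the unexposed cases wrongly coded as 0.
fanta_zero_filled = [1, 0, 0, 0, 1, 0]

valid_base_pct = percent_selected(fanta_with_missing)   # base of 4
total_sample_pct = percent_selected(fanta_zero_filled)  # base of 6
```

With missing values encoded, the percentage is computed on the four respondents who saw the option; with zeros, it is diluted across all six.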

If the data is encoded with 1s and 0s only, then it is effectively “based to the total sample”. This may not reflect reality, for example if the response option only became available in wave 4 or was not available to the German market. In that case, the data for waves 1, 2, and 3 should be encoded as missing, not 0, and likewise for the German sample.
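The wave and market example could be recoded along these lines. This is only a sketch: the field names, the `"DE"` market code, and the `recode_unavailable` helper are assumptions made for illustration.

```python
def recode_unavailable(value, wave, market, first_wave=4, excluded_markets=("DE",)):
    """Return None (missing) where the option was not shown; otherwise keep the 1/0 value."""
    if wave < first_wave or market in excluded_markets:
        return None
    return value

records = [
    {"wave": 2, "market": "UK", "new_option": 0},  # before the option existed
    {"wave": 4, "market": "UK", "new_option": 0},  # shown, not selected
    {"wave": 5, "market": "DE", "new_option": 0},  # never shown in this market
    {"wave": 5, "market": "UK", "new_option": 1},  # shown, selected
]
recoded = [
    recode_unavailable(r["new_option"], r["wave"], r["market"]) for r in records
]
```

Only the second and fourth records keep their 1/0 values; the other two become missing and drop out of the base.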

For categorical (single-response) questions, if you add a category to a variable at a point in time, you are actually changing that variable and making it a different question. For example, if you have a 5-point scale (single-select) and then you add in a “Don’t Know” answer option at wave 4, then you’ve got a different question from wave 4 onwards because the distribution of responses can fundamentally change. So your options for that question are:

accept this in your analysis (i.e., the new category shows zero percent in waves 1, 2, and 3), or
split the question into two questions: one version that is for waves before wave 4, and one version that is for wave 4 onwards. In doing so, you are acknowledging that you cannot compare/trend the two variables since they are fundamentally different measures.
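The second option, splitting the question, might look like the following sketch. The variable names `q5_pre_w4` and `q5_w4_on` and the `split_by_wave` helper are hypothetical; the point is that each respondent has a value in exactly one of the two variables and missing data in the other.

```python
def split_by_wave(value, wave, change_wave=4):
    """Place the answer in the pre-change or post-change variable; the other is missing."""
    if wave < change_wave:
        return {"q5_pre_w4": value, "q5_w4_on": None}
    return {"q5_pre_w4": None, "q5_w4_on": value}

wave_2_answer = split_by_wave(3, wave=2)  # answered on the original 5-point scale
wave_4_answer = split_by_wave(6, wave=4)  # 6 assumed here to be the new "Don't Know" code
```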

Sufficient and tidy metadata in the SPSS file to recognize questions

The metadata in an SPSS file should be tidy and clean, enabling the user to link back to the questionnaire or other metadata map easily.

SPSS variable names (called “aliases” in Crunch) should be short and logical. They should ideally reflect the question number in the questionnaire.
SPSS variable labels should be tidy and avoid truncation. Long question text is commonly truncated, which can obscure what the variable is actually about (e.g., the statement, the TV program, the brand, or whatever it refers to).

Furthermore:

Variable labels and category labels need to avoid duplication
- For example, you cannot repeat “Amazon” as a response option in the same question. Either the duplicate is rolled into the first occurrence, or you will need to relabel one of the response options as “Amazon 1” (or similar).
Variables should be in chronological order in the data map
- Rather than putting new variables for question 3 at the bottom, all the question 3 variables should belong together in sequence.
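A duplicate-label check can be run before upload. The sketch below assumes the category labels for one question have already been read into a plain dictionary of code to label; the `duplicate_labels` helper and the sample labels are hypothetical, and comparing labels after trimming whitespace is an assumption rather than a documented Crunch rule.

```python
from collections import Counter

def duplicate_labels(value_labels):
    """Return the labels that appear more than once, ignoring surrounding whitespace."""
    counts = Counter(label.strip() for label in value_labels.values())
    return sorted(label for label, n in counts.items() if n > 1)

retailers = {1: "Amazon", 2: "eBay", 3: "Amazon", 4: "Etsy"}
repeated = duplicate_labels(retailers)
```

Any label returned here would need to be merged or renamed before the file is imported.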
