The following article discusses the principles behind Crunch’s approach to tracking and provides a general template for conducting wave-based tracking studies.
Considerations
Consideration 1: Reusability
A key principle underlying tracking studies is that all the work you’ve done in previous waves should carry over. No one wants to recreate everything from scratch for each new wave. Crunch Automation enables you to replicate (and align) datasets for the append process.
Consideration 2: Reconciling changes
Change is inevitable. Category labels change; new questions are added to questionnaires; brands and statements are added and subtracted from arrays.
The back-end process that merges datasets (the append process) accommodates certain changes between datasets (e.g., the addition of brands in a multiple response variable). It also safeguards the target dataset against irregularities (i.e., where disparities exist that cannot be automatically reconciled). Common examples of irregularities include an alias changing or a category label changing between waves.
Wave data: discrete, cumulative, or continuous?
Some users work with discrete wave data (either via file upload or direct import). There is a separate upload/import for Wave 1, Wave 2, and Wave 3. This is the ideal scenario and fits best with the process outlined below. Essentially, each wave is uploaded as a separate dataset and then appended to the master dataset (after alignment work takes place).
Other users receive a cumulative data file. They start with Wave 1 (n=100), then they get a Wave 1 + Wave 2 file (n=200), then a Wave 1 + 2 + 3 file (n=300), and so forth. Crunch also accommodates this scenario: it just requires an additional step of excluding previous waves before the append (so you don’t double up).
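Below is a minimal sketch of that exclusion step using the crunch R package. It assumes a categorical variable with the alias wave that identifies each respondent’s wave; the dataset URLs, the alias, and the wave labels are placeholders for illustration.

library(crunch)

# ds_a is the master (target) dataset already containing Waves 1-2;
# ds_b is the cumulative file containing Waves 1-3.
ds_a <- loadDataset("<MASTER DATASET URL>")
ds_b <- loadDataset("<CUMULATIVE FILE URL>")

# Exclude the waves that are already in the master dataset so they
# are not appended a second time.
exclusion(ds_b) <- ds_b$wave %in% c("Wave 1", "Wave 2")

# Append only the remaining (new) rows to the master dataset.
ds_a <- appendDataset(ds_a, ds_b)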
Some users also work with continuous data collection. This is when data is streamed or updated via an integration (with a survey platform such as Qualtrics or Decipher, or perhaps your own proprietary platform). In this case you don’t have separate datasets to append; a single dataset is continuously updated. Even with continuous data collection, Crunch Automation should be used to define the target schema (Wave 1) so that, if the integration ever breaks, you can fall back on the general process outlined below.
The General Approach to Tracking (Discrete Waves)
- Import Wave 1 as Dataset A
- Script all changes using Crunch Automation (your script is stored)
- Import Wave 2 as Dataset B
- Tweak your Crunch Automation from Dataset A and run the whole thing on Dataset B
- Materialize all variables in both datasets with MATERIALIZE VARIABLES ALL();
- Append Dataset B to Dataset A
The above six steps are a general process that works for discrete-wave trackers (a sketch using the crunch R package appears below). There are further considerations and nuances, which this article attempts to clarify. Some scenarios for merging data don’t fit this process perfectly (for example, a partial fieldwork export that you want to update once fieldwork ends), in which case you may use a different process.
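The sketch assumes the Wave 1 and Wave 2 datasets have already been imported and that your Crunch Automation script is saved locally; the URLs and file names are placeholders, and the script can equally be run from the Crunch web interface.

library(crunch)

# 1. Load Wave 1 as Dataset A.
ds_a <- loadDataset("<WAVE 1 DATASET URL>")

# 2. Script all changes with Crunch Automation (the script is stored).
runCrunchAutomation(ds_a, "tracker_script_wave1.txt")

# 3. Load Wave 2 as Dataset B.
ds_b <- loadDataset("<WAVE 2 DATASET URL>")

# 4. Run a tweaked copy of the Wave 1 script on Dataset B.
runCrunchAutomation(ds_b, "tracker_script_wave2.txt")

# 5. Both scripts should end with MATERIALIZE VARIABLES ALL();

# 6. Append Dataset B to Dataset A.
ds_a <- appendDataset(ds_a, ds_b)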
Solving the alignment problem
Alignment means making sure that the information in dataset B (the incoming wave) is compatible with the information in dataset A (the target dataset), so that variables are correctly updated in dataset A’s schema when you append dataset B.
Crunch prevents datasets from merging when there are irreconcilable changes. There will be error messages when you try to combine misaligned data.
At this stage, the best tool for detecting misalignment in advance of an append is the Crunch R package. (Note: in the future, there may be non-R ways to detect and reconcile misalignment between datasets.) Here is some template R code you can leverage:
library(crunch)
ds_a <- loadDataset("<INSERT URL>")
ds_b <- loadDataset("<INSERT URL>")
report <- compareDatasets(ds_a, ds_b)
capture.output(summary(report), file = "my_file.txt")
Frequently asked questions
Why do you need to materialize variables?
When you create variables, the new variables are linked to their source variable(s).
This can cause problems with trackers. The Wave 1 definition for a particular variable may differ from the definition in Wave 2, perhaps because a new category has been introduced.
The following command decouples user-created variables from their source variable(s). After running it, different definitions of categories and subvariables (from different waves) can be combined:
MATERIALIZE VARIABLES ALL();
Run this command at the end of your script.
Materialized variables will not be altered by changes to their source variables. Generally, in trackers, this approach is preferred.
Do I need to run the 'whole thing' on dataset B?
Yes. You want each wave to operate as a dataset in its own right before you append it. There are a number of benefits to this, not least:
- It is a way to validate the dataset before append — you can see if any variables or categories have not been accounted for.
- Materialization (see above) means that all variables need to be created prior to append. You can't rely on derivation to 'automatically generate values for the new responses'. You need to see the variables/subvariables/categories in the latest wave, before append.
- You don't have to decide which commands are necessary to run and which are not. Importantly, commands like LABEL CATEGORIES must align (because categories are matched on their label, not on their code).
How does Crunch handle adding brands and statements?
Suppose in Wave 1 you have an array (multiple response, categorical array, or numeric array) which has 3 subvariables: Coke, Pepsi, and Fanta. In Wave 2, you then have a new brand, Sprite, in the same array. The append process leaves you with a variable that has 4 subvariables. The fourth subvariable, Sprite, will be missing for all rows in Wave 1.
How you set this up depends on the type of array you are working with.
If the array was made in Crunch (e.g., using a CREATE command), then it’s considered a derived variable. That means when you copy your Crunch Automation over from Wave 1 to Wave 2, you will need to tweak the definition of the variable in Wave 2 to incorporate the new subvariable Sprite. This is typically, but not always, the case when working with SPSS files.
If the array is already defined in the data file or comes from a direct import (via a Crunch integration), then you wouldn’t have used Crunch Automation to set it up in the previous wave. In that case, the array is considered a real array rather than a derived one, and you don’t need to do anything.
In both cases above (real or derived array), the append process automatically takes care of the union of subvariables. That is, it matches the array between dataset A and B based on its parent alias, and then in the append process, adds the new subvariable to the array.
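For the derived-array case, the Wave 2 tweak can also be expressed with the crunch R package’s deriveArray() rather than a Crunch Automation CREATE command. The sketch below is illustrative only: the brands alias, the subvariable aliases, and the array name are assumptions.

library(crunch)

# Wave 2 dataset, before append.
ds_b <- loadDataset("<WAVE 2 DATASET URL>")

# The Wave 1 definition used three subvariables; Wave 2 adds Sprite.
ds_b$brands <- deriveArray(
    subvariables = ds_b[c("coke", "pepsi", "fanta", "sprite")],
    name = "Brands purchased"
)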
What about dropping brands and statements?
The process is exactly the same as above. Suppose in Wave 3 you drop the Coke subvariable: if it’s a derived array, you tweak the Crunch Automation to remove Coke (since the subvariable it refers to is gone). If it’s a real array, you don’t need to do anything.
In the append process, you’ll still end up with 4 subvariables. It just means that for Wave 3 there will be missing data for Coke. Can you delete Coke completely from the array? No. You’d need to define a new array. (Note: Crunch has the ability to suppress empty rows/columns for analysts, so this may not be necessary.)
How do I create a weight with different definitions for each wave?
The six-step process above handles this for you.
In Wave 1, you define a weighting variable, raking on Age and Gender:
CREATE WEIGHT
RAKING (age = xxx, Gender = xxx)
AS weight_demo;
Then in Wave 2, you want to create a weight that rakes on Income in addition to Age and Gender. You simply modify the Crunch Automation for Wave 2:
CREATE WEIGHT
RAKING (age = xxx, Gender = xxx, Income = xxx)
AS weight_demo;
Then in the append process, when Crunch conjoins the variable weight_demo, it will have a different definition for wave 1 than it does for wave 2. This can be inspected in the merged dataset.
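One way to inspect this is with the crunch R package, as sketched below; it assumes the merged dataset contains a wave variable and the weight_demo weight created above (both are placeholders here).

library(crunch)

# Load the merged (target) dataset.
ds <- loadDataset("<MASTER DATASET URL>")

# Apply the conjoined weight and look at weighted counts by wave.
weight(ds) <- ds$weight_demo
crtabs(~ wave, data = ds)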