The following article discusses the principles behind Crunch’s approach to tracking studies and provides a general template for conducting wave-based tracking studies.
Tracking Study Considerations
Consideration 1: Reusability
A key principle for tracking studies is that you want all the work you’ve done in previous waves to carry over to future waves, rather than recreating everything for a new wave each time. Crunch Automation enables you to replicate work and align individual wave datasets for appending to the main dataset.
Note: To assist in aligning the categories (subvariables) of arrays and multiple response questions across waves, we strongly recommend that you define the subvariable aliases when defining those variables in your Crunch Automation script whenever possible (it is currently an optional setting).
Consideration 2: Reconciling Changes
In tracking studies, change is inevitable:
- Category labels change between waves
- Additional questions are added to the questionnaire
- Brands and statements are added and/or removed
The back-end process of appending a new wave to the main dataset accommodates certain changes between datasets by default (e.g., adding brands in a multiple response variable). It also safeguards the main dataset when disparities cannot be automatically reconciled (e.g., a question number/alias changing between waves, or a category label changing on a variable that still needs to be trended).
Types of Wave Data: Discrete, Cumulative, or Continuous
- Discrete Wave Data: Users often receive discrete wave data, meaning there is a separate data file to upload for Wave 1, Wave 2, and Wave 3. This is the ideal scenario for Crunch and the appending process outlined below. Each wave is uploaded as a separate dataset and then appended to a single master dataset after any variable alignment work is completed.
- Cumulative Data: Other times users receive a cumulative data file that starts with Wave 1 (n=100). After each wave, they receive a new file that contains the current wave along with all the historical waves: Wave 1 + 2 (n=200), then Wave 1 + 2 + 3 (n=300), and so forth. Working with this data is similar to the previous scenario in that each file is uploaded as a separate dataset, but it requires an additional step of excluding previous waves from the current wave's data before appending it to the main dataset, so that respondents are not duplicated (see the sketch after this list).
- Continuous Data: Some users work with continuous data collection, where data is streamed or updated via an integration (with a survey platform such as Qualtrics or Decipher, or their own proprietary platform). With this method, there are no individual wave datasets to append to the main dataset; rather, the integration continuously updates the single main dataset. Crunch Automation should still be used in this case to define the target schema (starting with Wave 1) so that, if the integration ever has an issue, you can fall back on the general process described in either of the options above.
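For the cumulative case, below is a minimal sketch of that exclusion step using the crunch R package. The wave variable alias (wave) and its category label are assumptions for illustration, and the dataset URL is a placeholder:
library(crunch)
# Load the cumulative file for the current wave (e.g., Wave 1 + 2)
ds_b <- loadDataset("<INSERT URL>")
# Exclude respondents from earlier waves so only current-wave rows remain;
# the "wave" alias and its labels are hypothetical and should match your data
exclusion(ds_b) <- ds_b$wave %in% c("Wave 1")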
The General Approach to Tracking for Discrete and Cumulative Waves
The steps below are a general process that works for discrete and cumulative wave trackers.
- Import Wave 1 as Dataset A. This will be the main dataset for the tracker moving forward.
- Script any changes to the dataset using Crunch Automation and save the text of your script for use on future waves.
- When the next wave is ready, import that wave as Dataset B.
- Tweak the Crunch Automation script you saved from Dataset A with any changes made between waves, and run the full script on Dataset B.
- For Cumulative Data: Exclude the respondents in Dataset B who belong to previous waves, so that Dataset B contains only current-wave respondents.
- Materialize all variables in both datasets to decouple the variables created in that wave from their source variables, using the following Crunch Automation command: MATERIALIZE VARIABLES ALL();
More details on materializing variables can be found below.
- Open Dataset A and append the data from Dataset B (see the sketch below).
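Below is a minimal sketch of the load-and-append steps using the crunch R package. It assumes the Crunch Automation script has already been run on each dataset and that any exclusions and materialization are complete; the dataset URLs are placeholders:
library(crunch)
# Load the main dataset (A) and the new wave (B)
ds_a <- loadDataset("<INSERT URL>")
ds_b <- loadDataset("<INSERT URL>")
# Append the new wave onto the main dataset
ds_a <- appendDataset(ds_a, ds_b)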
With tracking studies there may be additional nuances to consider, along with other scenarios for merging data that don't fit this process perfectly. One such example is partial fieldwork data that you have already imported and need to update with the final data.
Aligning Data Between Waves
Aligning the structure of each subsequent wave (Dataset B) to be compatible with the structure of the main target dataset (Dataset A) is critical for trending and analysis purposes. This is one of the reasons we recommend using Crunch Automation to set up the dataset so that you have a record of what was updated in each wave.
Crunch will prevent datasets from appending/merging when there are irreconcilable changes and you will receive an error message when you try to combine misaligned data. One example is when a question is categorical (single-punch) in one wave and multiple response in a subsequent wave.
Currently, the best method for detecting misalignments before appending data is the Crunch R package. (See: Connecting to the Crunch API with R and Crunch R Package).
Below is a sample template R script which can be modified to generate a report of the differences between datasets:
library(crunch)
# Load the main dataset (A) and the new wave (B) by their dataset URLs
ds_a <- loadDataset("<INSERT URL>")
ds_b <- loadDataset("<INSERT URL>")
# Compare the two datasets and write a summary of the differences to a text file
report <- compareDatasets(ds_a, ds_b)
capture.output(summary(report), file = "my_file.txt")
Frequently Asked Questions
Why do I need to materialize the variables?
When you create variables in Crunch, the newly created variables are linked to their source variable(s). This can cause issues with trackers, however, because the Wave 1 definition of a particular variable may differ from the Wave 2 definition, for example because a new category has been introduced.
The following command decouples all user-created variables from their source variable(s). After running it, waves with different category and subvariable definitions can be combined.
MATERIALIZE VARIABLES ALL();
Note: The above command should be run on both the main dataset and the subsequent wave dataset that is being appended.
Materialized variables will no longer be updated when their source variables change, which is generally the preferred behavior in trackers. Because materializing locks the variables in place, the command should be run at the end of your script, as the final step before data delivery.
Why must I run the full Crunch Automation script on Dataset B?
It is best practice to set each wave up as an independent dataset before appending it to the main dataset. There are several benefits to this, including:
- It is a quick way to validate the dataset before you append it: you can immediately see whether any variables or categories in this wave have not been accounted for in the main dataset script.
- Materializing the variables (see above answer) means that all variables created in the main dataset will need to be created in each wave before appending it. Created (derived) variables will not automatically generate values for the new responses, as they have been decoupled from their source variables, allowing you to modify the variable definitions each wave if desired.
- Categories of variables are matched on their label text rather than their code, so any LABEL CATEGORIES commands must produce exactly matching labels across waves for trending purposes (see the sketch after this list).
- By running the full script, you will save time by not having to decide which commands are necessary to run and which are not.
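Because category matching is based on labels, it can help to spot-check labels between waves before appending. Below is a minimal sketch using the crunch R package; my_question is a hypothetical variable alias and the dataset URLs are placeholders:
library(crunch)
ds_a <- loadDataset("<INSERT URL>")  # main dataset
ds_b <- loadDataset("<INSERT URL>")  # new wave
# Category labels present in the main dataset but not in the new wave;
# any label that differs will not trend as the same category after appending
setdiff(names(categories(ds_a$my_question)),
        names(categories(ds_b$my_question)))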
How does Crunch handle adding or removing brands and statements?
As frequently happens with tracking studies, lists of brands and statements may have items added or removed in subsequent waves.
Adding Statements
Suppose in Wave 1 you have an array (whether multiple response, categorical array, or numeric array) that has 3 brands: Coca-Cola, Diet Coke, and Coke Zero. In Wave 2, a new brand (Pepsi) has been added to the question. The append process results in a variable with 4 subvariables, where the new item, Pepsi, will be missing for all rows in Wave 1.
The work involved in setting this up depends on how the array variable was created in Wave 1:
- If the array variable was generated in Crunch using Crunch Automation (i.e., using a CREATE command), before running your full script on Wave 2, you will need to update the definition of the variable in the script to incorporate the new option, Pepsi. This is typically the case when working with SPSS files.
- If the array variable was defined in the source data file or from a direct import via an integration, then the array definition will already include the updated options and you should not need to do anything.
In both cases above (created or source variable), the append process automatically matches the array between the Wave 1 and Wave 2 datasets based on the array alias and adds the new subvariable to the array.
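To preview what the append will add, you can compare the array's subvariable aliases in each wave beforehand. Below is a minimal sketch using the crunch R package; brand_array is a hypothetical array alias and the dataset URLs are placeholders:
library(crunch)
ds_a <- loadDataset("<INSERT URL>")  # Wave 1 (main dataset)
ds_b <- loadDataset("<INSERT URL>")  # Wave 2
# Subvariable aliases present in Wave 2 but not yet in Wave 1 (e.g., the new Pepsi item)
setdiff(aliases(subvariables(ds_b$brand_array)),
        aliases(subvariables(ds_a$brand_array)))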
Removing Statements
Suppose in Wave 3 you decide to remove the Coca-Cola option.
- If the array variable was generated in Crunch using Crunch Automation (i.e., using a CREATE command):
- If the subvariable has been removed from the data file, you will need to update the definition of the variable in the script to remove the Coca-Cola option.
- If the subvariable still exists in the data file, no changes to the script will be necessary.
- If the array variable was defined in the source data file or from a direct import via an integration, you do not need to do anything.
The result of the append process will still be 4 subvariables; however, for Wave 3 there will be missing data for the Coca-Cola option.
If you would like to remove Coca-Cola completely from the array, you will need to define a new variable that does not include it in the definition. It is not possible to remove a category from an array variable.
Note: Crunch gives users the ability to suppress empty rows/columns, so it may not be necessary to create a new array for analysis.
How can I create a weight with different definitions for each wave?
Following the process outlined above for materializing and appending data, a different weight calculation can be used for each wave of a tracking study.
For example, in Wave 1, you define a weight variable, raking Gender and Age:
CREATE WEIGHT
RAKE
gender (TARGETS "Male"=0.5, "Female"=0.5),
age (TARGETS "18-24"=0.2, "25-34 years"=0.25, "45-54 years"=0.3, "55+ years"=0.25)
AS weight_demo
TITLE "Demo Weight";
Then in Wave 2, if you want to create a weight with an additional input, Income, or adjust the target percentages for the calculations, you would modify the weight variable's definition in the Crunch Automation script before running it on the Wave 2 dataset.
CREATE WEIGHT
RAKE
gender (TARGETS "Male"=0.5, "Female"=0.5),
age (TARGETS "18-24"=0.2, "25-34 years"=0.3, "45-54 years"=0.3, "55+ years"=0.2),
income (TARGETS "Less than $75,000"=0.6, "$75,000 or more"=0.4)
AS weight_demo
TITLE "Demo Weight";
Materializing the variables in both datasets before appending the data will lock the weight values in place for each wave. This way, the resulting combined weight_demo variable will reflect a different definition for Wave 1 than for Wave 2.