Surveys often need the base of a question corrected (rebasing a variable), and the correction frequently depends on the responses to another question in the survey.
- For example, everyone in the survey was asked which US state they live in, but the study covered China, the UK and the USA.
- For example, all respondents were asked which brands they would consider purchasing, but they should only have been shown the brands they were aware of (i.e., like a brand funnel).
The following shows some approaches to deriving a new variable in your dataset that defines the base (i.e., sets missing values) from responses to other variables in your study. This is a more advanced use of R, because you need to understand exactly how you are defining the logic of the variable you want to create.
We'll cover two scenarios:
- Rebasing a categorical variable using a single condition
- Rebasing a multiple response variable using a single condition
All of the above assumes you know how to log in and access your dataset with R.
Rebasing a categorical variable by using a single condition
Suppose a survey asked which US state you lived in, but the question was put to all respondents, irrespective of which country they lived in (for example, people in the UK were asked which US state they lived in). You may like to rebase the variable so that it pertains only to US residents. That means deriving a new variable and setting the missing data accordingly.
In the code below, update the placeholder values (categorical_variable, filter, and the output alias and name) to specify the categorical variable that you want to rebase. You can always hide the original variable if you don't want to see it.
The condition defines the target "base". This is a custom consideration that you need to define. In the case of the US states, it would be a filter where only those who lived in the US would be selected. Thus it can be useful to use the Filter Builder first and then Save as a Variable. That would then work in the below as the "condition" in the expression.
cat_var <- ds$categorical_variable
condition <- ds$filter
output_alias <- "new_alias"
output_name <- "New variable's output name"
Next, if you want to preserve the system missingness of the original categorical variable, but impose the "condition" (filter) to reduce the base further, then use the below, otherwise see the next segment of code.
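# Each non-missing category becomes a case that also requires the condition.
# The result is stored as new_var so it does not mask the categories() function.
new_var <- makeCaseVariable(
    cases = lapply(categories(cat_var)[!is.na(categories(cat_var))], function(cat) {
        list(expression = (cat_var == cat$numeric_value & condition == 1), name = cat$name)
    }),
    name = output_name
)
ds[[output_alias]] <- new_var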
Alternatively, if you want to turn the missing values from the original variable into non-missing values, and then impose the "condition", then use the following code.
new_var <- makeCaseVariable(
    cases = c(
        # One case per non-missing category, restricted to the condition
        lapply(categories(cat_var)[!is.na(categories(cat_var))], function(cat) {
            list(expression = (cat_var == cat$numeric_value & condition == 1), name = cat$name)
        }),
        # One "Not answered" case for each missing category of the original variable
        lapply(categories(cat_var)[is.na(categories(cat_var))], function(cat) {
            list(expression = (condition == 1), name = "Not answered")
        })),
    name = output_name
)
ds[[output_alias]] <- new_var
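To see how the two variants differ in the base they produce, here is a minimal sketch in base R. It does not touch a Crunch dataset: the country and state vectors are made up for illustration, and ifelse stands in for the case logic above, so treat it as a way to check your reasoning rather than as the Crunch implementation.

```r
# Toy data, invented for illustration. NA in `state` is a skipped question.
country <- c("USA", "USA", "UK", "China", "USA")
state   <- c("Texas", NA, "Ohio", "Texas", "Ohio")

# Plays the role of the saved filter variable.
condition <- country == "USA"

# Variant 1: keep the original missingness and blank out everyone outside the base.
rebased_keep_na <- ifelse(condition, state, NA)

# Variant 2: inside the base, turn missing answers into "Not answered".
rebased_fill_na <- ifelse(condition, ifelse(is.na(state), "Not answered", state), NA)

# The base is the number of non-missing values in each variant.
base_keep_na <- sum(!is.na(rebased_keep_na))
base_fill_na <- sum(!is.na(rebased_fill_na))

print(table(rebased_keep_na, useNA = "ifany"))
print(table(rebased_fill_na, useNA = "ifany"))
```

In this toy data the second variant has a larger base, because the US respondent who skipped the question is counted as "Not answered" instead of being dropped.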
Rebasing a multiple response variable by using a single condition
Multiple response variables are different to categorical variables, in that you need to rebase (define the missingness) for each of the subvariables individually.
In the below, we rebase a multiple response question by performing the same logic expression on each of the subvariables, and then combining them into a new derivation (deriveArray) at the end.
The tricky part is that you need to adjust the expressions carefully to meet your exact needs, so it requires an understanding of how logic expressions work. The template below assumes the subvariables in the input variable are dichotomous (1's and 0's), with potentially some missing data (NA). The first expression sets the 1's in the new MR variable, the second sets the 0's, and the third changes any missing data into a third category (value = 99), which you can optionally set as missing data later on.
Because the third expression converts all missing values to non-missing data, the condition alone defines the target "base". This is a custom consideration that you need to define. Typically, the condition is equivalent to a filter, so if you build one in the Filter Builder and use Save as a Variable, you can reference that variable as the condition. For example:
condition <- ds$new_filter
The template for the code is:
input_variable <- "my_mr_variable"
output_alias <- "output_alias"
output_name <- "The name of my output variable"
# Build one case variable per subvariable, applying the same three expressions to each
subvars <- lapply(aliases(subvariables(ds[[input_variable]])), function(sv_alias) {
    makeCaseVariable(
        cases = list(
            list(expression = (ds[[input_variable]][[sv_alias]] == 1 & condition == 1), name = "Yes", numeric_value = 1),
            list(expression = (ds[[input_variable]][[sv_alias]] != 1 & condition == 1), name = "No", numeric_value = 0),
            list(expression = (is.na(ds[[input_variable]][[sv_alias]]) & condition == 1), name = "None", numeric_value = 99)
        ),
        name = name(ds[[input_variable]][[sv_alias]])
    )
})
# Combine the new subvariables into a multiple response variable, with category 1 selected
ds[[output_alias]] <- deriveArray(subvars, name = output_name, selections = 1)
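The three expressions in the template can be checked on plain vectors before you run them against a dataset. The sketch below uses base R only, with a made-up subvariable and condition; the explicit !is.na guards are a base R detail and are an assumption of this sketch, not part of the Crunch template.

```r
# Toy data, invented for illustration.
subvar    <- c(1, 0, NA, 1, 0, NA)                     # one dichotomous subvariable
condition <- c(TRUE, TRUE, TRUE, FALSE, FALSE, FALSE)  # the target base

# Start with everything missing: rows that match no case stay out of the base.
rebased <- rep(NA_real_, length(subvar))

rebased[condition & !is.na(subvar) & subvar == 1] <- 1   # "Yes"
rebased[condition & !is.na(subvar) & subvar != 1] <- 0   # "No"
rebased[condition & is.na(subvar)] <- 99                 # "None"

print(rebased)
```

Only the first three respondents meet the condition, so only they receive a value; the missing answer among them becomes 99 and everyone outside the condition stays missing.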
If you're in the specific situation where you want to "rebase the multiple response variable to all those who answer the question" (defined in the code below as selecting at least one option), you can avoid making a filter variable and instead use the expression below.
input_variable <- "my_mr_variable"
output_alias <- "output_alias"
output_name <- "The name of my output variable"
# TRUE for anyone who selected at least one option in the input variable
condition <- Reduce(`|`, lapply(aliases(subvariables(ds[[input_variable]])), function(x) {
    !is.na(ds[[input_variable]][[x]]) & ds[[input_variable]][[x]] == 1
}))
# Build one case variable per subvariable, applying the same three expressions to each
subvars <- lapply(aliases(subvariables(ds[[input_variable]])), function(sv_alias) {
    makeCaseVariable(
        cases = list(
            list(expression = (ds[[input_variable]][[sv_alias]] == 1 & condition), name = "Yes", numeric_value = 1),
            list(expression = (ds[[input_variable]][[sv_alias]] != 1 & condition), name = "No", numeric_value = 0),
            list(expression = (is.na(ds[[input_variable]][[sv_alias]]) & condition), name = "None", numeric_value = 99)
        ),
        name = name(ds[[input_variable]][[sv_alias]])
    )
})
# Combine the new subvariables into a multiple response variable, with category 1 selected
ds[[output_alias]] <- deriveArray(subvars, name = output_name, selections = 1)
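The Reduce step that builds the condition can be easier to follow on ordinary vectors. This is a minimal base R sketch with three made-up subvariables (brand_a, brand_b and brand_c are hypothetical names); it mirrors the shape of the expression above but does not use a Crunch dataset.

```r
# Toy data, invented for illustration: three subvariables of one question.
subvars <- list(
  brand_a = c(1, 0, NA, 0),
  brand_b = c(0, 0, NA, NA),
  brand_c = c(0, 1, NA, 0)
)

# One logical vector per subvariable: TRUE where the option was selected.
selected <- lapply(subvars, function(x) !is.na(x) & x == 1)

# Reduce() folds the list with `|`, giving TRUE for anyone who selected
# at least one option.
condition <- Reduce(`|`, selected)

print(condition)
```

Note that the fourth respondent in this toy data answered with 0's but selected nothing, so they fall outside the condition along with the respondent who has no data at all. If you need such respondents in the base, the condition would have to be defined differently.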