Overview
You may have some questions when it comes to restrictions or limitations in Crunch. For example:
- What is the maximum dataset size I can have?
- Are there any restrictions on the number of variables? Or the number of rows?
- At what point will there be performance issues with Crunch?
These are all valid questions, and our users want to be sure that Crunch provides adequate performance with all of their datasets. This article describes the restrictions and limitations that apply to datasets, variables, and rows, and the conditions under which Crunch performs best.
Background
There are many factors that can affect performance.
The key determinant is not the size of the dataset (i.e., how many megabytes or gigabytes of disk space it takes up). What matters is the structure of the data. The rules of thumb below describe the practical limits, which are very large.
To begin, a dataset is made up of a number of rows (cases) and columns (which form variables).
In terms of the number of rows, Crunch can handle a lot of cases. As a rule of thumb, you are unlikely to notice performance issues up to around 10 million rows (10,000,000). In terms of survey data, especially tracking data, that’s a lot of respondents. Crunch’s columnar database storage means you get sub-second calculations on millions of cases. So generally speaking, the number of rows is not the limiting factor for performance in Crunch.
In terms of columns, this is where considerations around structure become important. Crunch can treat a single column as a variable (such as “Gender”) or treat multiple columns as a single variable (such as “Awareness of brands”, where each brand is a different column). In general, Crunch can handle tens of thousands of columns, provided they are structured appropriately. The main performance benefit of columnar data storage is that relatively few columns ever need to be accessed at the same time. Other systems must load and seek within entire datasets rather than only the columns needed for a given task or analysis.
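To make the columnar idea concrete, here is a minimal sketch in plain Python. It is a toy illustration only and does not reflect how Crunch is implemented; the column names, the data, and the `count_values` helper are all made up for the example. It shows that tabulating one variable in a column-oriented layout touches only that variable's column.

```python
# Toy illustration only: not Crunch's implementation. All names and data are hypothetical.

# Row-oriented layout: every record carries every field.
rows = [
    {"gender": "Female", "age": 34, "aware_brand_a": 1, "aware_brand_b": 0},
    {"gender": "Male", "age": 51, "aware_brand_a": 0, "aware_brand_b": 0},
    {"gender": "Female", "age": 27, "aware_brand_a": 1, "aware_brand_b": 1},
]

# Column-oriented layout: each column is stored as its own list.
columns = {name: [row[name] for row in rows] for name in rows[0]}


def count_values(column_store, name):
    """Tabulate a single column; no other column is read."""
    counts = {}
    for value in column_store[name]:
        counts[value] = counts.get(value, 0) + 1
    return counts


gender_counts = count_values(columns, "gender")
print(gender_counts)
```

In the row-oriented list, the same tabulation would have to walk every record, including the fields it does not need. With tens of thousands of columns, that difference is what keeps a single-variable calculation cheap.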
The number of variables in Crunch is generally not an issue either. At any given moment, Crunch loads only a few variables. For example, if you look at the variable cards in Variable Summaries mode, you are only looking at a handful of variables on the screen. On a dashboard, the tiles only query the database for the variables they need at any given time.
Performance issues may arise when Crunch has no way to know which columns are relevant, which effectively means that everything must be analyzed. This can happen when aspects of the metadata exceed functional limitations, such as:
- The columns are unstructured
- The variables are not organized (into folders)
- The list of labels is extremely long
Unstructured columns
Sometimes datasets have poorly defined metadata. One such deficit is when columns that belong together are not declared as a multiple response or categorical array variable. Each column is then treated as a separate variable, so Crunch recognizes more variables than there actually should be. This is commonplace with SPSS files, where metadata around column groupings can be lost. In other words, whether or not you have properly defined metadata for column groupings can have an impact on performance in Crunch.
Recommendation
Define your column groupings for the dataset before you upload data into Crunch. Groupings are taken care of automatically when importing data from data-collection platforms such as Decipher, SurveyMonkey, or Confirmit. SPSS .sav files can encode the grouping metadata as “mrsets”; we recommend defining these before uploading into Crunch.
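As a rough illustration of why groupings matter, the sketch below counts top-level variables before and after grouping. It uses plain Python rather than any Crunch or SPSS API, and it assumes a hypothetical naming convention in which grouped columns share a prefix followed by `_` and a number. Real groupings should come from your survey metadata, not from guessing at column names.

```python
from collections import defaultdict

# Hypothetical flat column names, as they might appear when grouping metadata is missing.
flat_columns = ["gender", "age", "aware_1", "aware_2", "aware_3", "rate_1", "rate_2"]


def group_by_prefix(names, separator="_"):
    """Group columns named <prefix><separator><number>; leave other columns on their own."""
    groups = defaultdict(list)
    for name in names:
        prefix, sep, suffix = name.rpartition(separator)
        if sep and suffix.isdigit():
            groups[prefix].append(name)  # subvariable of a grouped variable
        else:
            groups[name].append(name)  # stand-alone variable
    return dict(groups)


grouped = group_by_prefix(flat_columns)
print(len(flat_columns), "columns become", len(grouped), "variables")
```

Without grouping, the seven columns above are seven separate variables. With grouping, they become four: two stand-alone variables and two grouped variables whose columns act as subvariables.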
Unorganized variables
Crunch has a unique additional variable attribute that boosts performance by only using the data you need at any given time: folders. Variables are organized into folders and are not displayed until you open the folder. This is very helpful for the user because it allows the dataset to be navigated more easily. We say the dataset is ‘tidy’ once variables are organized into folders.
But folders go beyond navigation: Crunch doesn’t load anything about the variables outside the folder you are currently in. It loads and displays the folders themselves, and loads the variables within a folder only when you open it.
Recommendation
Organize your variables into folders and subfolders (using the Folder Sidebar) that reflect the organization of the data for analysis.
Extremely long lists of labels
Categorical variables have, by definition, categories, and the number of categories can be very large. Likewise, multiple response variables and categorical arrays have subvariables (that is, the columns that are grouped into the variable), and the number of subvariables can also be very large. For example, you might have a list of 3,000 brands in a categorical variable (i.e., 3,000 categories), or a multiple response variable with 3,000 brands (i.e., 3,000 subvariables). Either way, Crunch is pulling up thousands of pieces of information at the same time, which can lead to diminished performance.
Recommendation
Clean your variables so they contain only the information you actually want to analyze. It is unlikely you are going to analyze 3,000 brands at once or need to display them all together. If required, hide the original variable and derive a variable from it that has a smaller list of labels (categories or subvariables).
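One common way to shorten a long list of labels is to keep the most frequent categories and collapse the rest into a single catch-all. The sketch below shows that idea in plain Python; it is not a Crunch API call, and the `collapse_to_top_n` helper, the "Other" label, and the sample data are assumptions made for illustration. The original values are left untouched, mirroring the advice to hide the original variable and derive a new one.

```python
from collections import Counter


def collapse_to_top_n(values, n, other_label="Other"):
    """Return a derived list that keeps the n most frequent labels and collapses the rest."""
    counts = Counter(values)
    keep = {label for label, _ in counts.most_common(n)}
    return [value if value in keep else other_label for value in values]


# Hypothetical responses; a real variable might have thousands of distinct brands.
responses = ["Brand A", "Brand B", "Brand A", "Brand C", "Brand D", "Brand B", "Brand A"]

derived = collapse_to_top_n(responses, n=2)
print(sorted(set(derived)))
```

Here four distinct brands are reduced to three labels. Applied to a list of 3,000 brands, the same approach would leave a short list that is practical to display and analyze.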