Context
This document defines the canonical Crunch Schema to support both existing and new datasets provided by our users.
Over the years, the way Crunch describes its datasets has evolved, with an unclear distinction of what’s schema and what’s metadata.
Schema is meant to describe the variable types of a dataset, and metadata is meant to contain information that identifies and describes the data.
With this distinction, we are able to define rules that can safely enforce how schema definitions are built.
Use Cases
- Schema for datasets stored in the Silverlake
- Common representation for translating importer data
- Streaming schema
Driving Principles
User-friendly
The schema is intended to be as succinct and friendly to compose as possible. Data types are designed to be explicit in what’s describing, and they should be extensible when possible.
Explicit
Clear distinction between schema and metadata.
Schema and metadata are defined separately and each data type can have specific metadata attributes that are not present in other data types.
Metadata cannot be described by itself—it only supports the schema.
Compatible
Compatible with most common data formats and tools, such as parquet, pyarrow and pandas.
We want converting imports into DataFrames (whether from R, Pandas, Polars, Dask, or anywhere else) to be as easy as possible, using their native support for categorical types and nested columns.
On top of that, this schema is designed to support the existing datasets so all features should be supported in the new schema.
Terminology
Term | Definition |
---|---|
Variable | Identifies values on a dataset. |
Data Type | A data type defines a variable. |
Metadata | Information that identifies and describes an individual data type. |
Schema | The composition of multiple data type definitions that describe a dataset. |
Data Types
There are two required attributes that compose a variable definition: name and type.
For categorical, there’s an extra attribute (values) and for array, there’s value.
Attribute | Type | Description | Required |
---|---|---|---|
name | string | Unique name of the variable | true |
type | string | A Crunch data type | true |
values | list<integer, string> | Required if type categorical | false |
value | map | Required if type array | false |
Data types can be Atomic Types or Array Type.
- Generally speaking, the data can be represented and transported as a JSON document.
- All data types are represented in a basic {key: value} format as per standard JSON objects.
Atomic Types
Numeric
Numeric variables are used to store numbers, including both integer numbers or floats:
// schema
{
"name": "numchildren",
"type": "numeric"
}
Text
Text variables encoded as UTF-8. Text data is of variable length:
// schema
{
"name": "zipcode",
"type": "text"
}
Datetime
Datetime variables are serialized as an ISO8601 UTC datetime string (e.g., 2023-05-01T09:00:00):
// schema
{
"name": "starttime",
"type": "datetime"
}
Categorical
Categorical variables store a value from a finite set of categories:
// schema
{
"name": "gender",
"type": "categorical",
"values": ["Female", "Male", "Other"]
}
You can also use numeric codes instead of strings:
// schema
{
"name": "educ4",
"type": "categorical",
"values": [1, 2, 3, 4, 9]
}
// metadata
{
"values": [
{"value": 1, "label": "High school or less"},
{"value": 2, "label": "Some college"},
{"value": 3, "label": "College graduate"},
{"value": 4, "label": "Postgraduate (M.D., J.D., M.A., M.S., M.B.A., Ph.D., D.V.M., or equivalent)"},
{"value": 9, "label": "Refused, skipped, not applicable"}
]
}
Historically, numeric codes were used to save space compared to using string values.
There is no inefficiency in long-term storage file formats such as Parquet, including sending text strings as values, since these will be dictionary encoded.
With categorical types, metadata is assigned to each categorical value:
Attribute | Type | Description | Required |
---|---|---|---|
values | list<item> | false | |
values.item.value | same as schema.values.<item> | category id, the one from schema.values | true |
values.item.label | string | Presentable category name | false |
values.item.scale | float | Numeric value associated with the category | false |
values.item.date | string | Date in ISO8601 format (any precision) associated with the category | false |
values.item.selected | boolean | Category to be considered as selected | false |
values.item.missing | boolean | Category to be considered as missing data | false |
values.item.notes | string | false |
Values in metadata
While not required, when defined, the values in the metadata are set as a list because the order matter for categorical data.
The order of the values is the default order that they should be analyses/presented, which takes precedence over the order defined in the schema.
When there’s no metadata for a category variable, the default order is defined by the schema.
Array Type
Multiple response variables, categorical arrays, and numerical arrays are commonly found types in survey systems. All three of these concepts (and more, including multidimensional arrays) are represented as a single array type..
Here is a definition of a question where the respondent is asked the awareness of all five brands:
// schema
{
"name": "awareness",
"type": "array",
"value": {
"type": "categorical",
"values": [1, 2, 9]
},
"axes": [
{
"name": "brand",
"type": "categorical",
"values": ["101", "102", "103", "201", "202"]
}
]
}
// metadata
{
"label": "Aided awareness",
"description": "Which of these brands have you heard of?",
"values": [
{"value": 1, "label": "Selected", "selected": true},
{"value": 2, "label": "Not selected"},
{"value": 9, "label": "Not asked", "missing": true}
],
"axes": {
"brand": {
"values": [
{"value": "101", "label": "Apple"},
{"value": "102", "label": "Borland"},
{"value": "103", "label": "Corel"},
{"value": "201", "label": "Google"},
{"value": "202", "label": "Facebook"}
]
}
}
}
Data type attributes:
Attribute | Type | Description | Required |
---|---|---|---|
axes | list<item> | ||
axes.item.name | string | Unique name for the axis | true |
axes.item.type | string | An atomic data type | true |
axes.item.values | list<string> | Required if axes.item.type is categorical or array | false |
Metadata attributes:
Attribute | Type | Description | Required |
---|---|---|---|
axes.<axis>.values | list<item> | ||
metadata.axes.<axis>.values.<item>.value | same as in axes.<item>.values | value id, the one from schema.axes.<item>.values | true |
metadata.axes.<axis>.values.<item>.label | string | Presentable category name | false |
metadata.axes.<axis>.values.<item>.description | string | false | |
metadata.axes.<axis>.values.<item>.notes | string | false | |
metadata.axes.<axis>.values.<item>.missing | boolean | Axis value to be considered missing | false |
Examples
Categorical Examples
Example one — Categories as codes with labels
Question: “Who did you vote for in the 2020 presidential election?”
Answers:
- Joseph R. Biden, Jr.
- Donald J. Trump
- Other
- Did not vote
- Skipped
- Not asked
Variable definition:
// schema
{
"name": "presidential_vote_2020",
"type": "categorical",
"values": [1, 2, 3, 4, 8, 9]
}
// metadata
{
"label": "2020 Presidential vote choice",
"description": "Who did you vote for in...",
"values": [
{"value": 1, "label": "Biden"},
{"value": 2, "label": "Donald J. Trump"},
{"value": 3, "label": "Other"},
{"value": 4, "label": "Did not vote"},
{"value": 8, "label": "Skipped"},
{"value": 9, "label": "Not asked"}
]
}
// Sample data
{
"row_id": 101,
"var_name": "presidential_vote_2020",
"categorical_value": 4
}
Example two — Categories as labels with scale
Question: Do you approve or disapprove of the job Joe Biden is doing as President?”
Answers:
- Strongly approve
- Somewhat approve
- Somewhat disapprove
- Strongly disapprove
- Not sure
- Skipped
- Not asked
Variable definition:
// schema
{
"name": "approval_biden",
"type": "categorical",
"values": [
"Strongly Approve",
"Somewhat Approve",
"Neutral",
"Somewhat Disapprove",
"Strongly Disapprove",
"Not sure",
"Skipped",
"Not asked"
]
}
}
// metadata
"metadata": {
"label": "Biden approval",
"description": "Do you approve of the job...",
"values": [
{"value": "Strongly Approve", "scale": 20.0, "selected": true},
{"value": "Somewhat Approve", "scale": 10.0, "selected": true},
{"value": "Neutral", "scale": 0},
{"value": "Somewhat Disapprove", "scale": -10.0},
{"value": "Strongly Disapprove", "scale": -20.0},
{"value": "Not sure", "missing": true},
{"value": "Skipped", "missing": true},
{"value": "Not asked", "missing": true}
]
}
// Sample data
{
"row_id": 101,
"var_name": "approval_biden",
"categorical_value": "Not asked"
}
Example three — Categories with date associated
Variable definition:
// schema
{
"name": "month",
"type": "categorical",
"values": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 98, 99]
}
// metadata
{
"label": "Month of 2023",
"description": "What month did you...",
"values": [
{"value": 1, "date": "2023-01"},
{"value": 2, "date": "2023-02"},
{"value": 3, "date": "2023-03"},
// ... hidden for brevity
{"value": 98, "label": "Skipped", "missing": true},
{"value": 99, "label": "Not Asked", "missing": true},
]
}
// Sample data
{
"row_id": 101,
"var_name": "month",
"categorical_value": 2
}
Array Examples
Example one — Dichotomies array
Single dimension array, defined by one categorical axis and categorical values.
Question: Which social networks do you use?
Answers:
Variable definition:
// schema
{
"name": "socnetuse",
"type": "array",
"value": {
"type": "categorical",
"values": [1, 2, 8, 9]
},
"axes": [
{
"name": "network",
"type": "categorical",
"values": [1, 2, 3, 4]
}
]
}
// metadata
{
"values": {
{"value": 1, "label": "Yes", "selected": true},
{"value": 2, "label": "No", "selected": false},
{"value": 8, "label": "Not Asked", "missing": true},
{"value": 9, "label": "Not Aware", "missing": true},
},
"axes": {
"network": {
"values": [
{"value": 1, "label": "Facebook"},
{"value": 2, "label": "Twitter"},
{"value": 3, "label": "Linkedin"},
{"value": 4, "label": "Instagram"}
]
}
}
}
Example two — Numeric array
Single dimension array, defined by one categorical axis and integer values.
Question: Please rate the following computers on a scale from 0 to 100, where 100 is the platonic ideal and 0 is metaphysically terrible.
Answers:
- iPad Pro
- Chromebook
- Lenovo Laptop
- Macbook
Variable definition:
// schema
{
"name": "computer_rating",
"type": "array",
"value": { "type": "numeric" },
"axes": [
{
"name": "computer",
"type": "categorical",
"values": [
"ipad_pro",
"chromebook",
"lenovo",
"macbook"
]
}
]
}
Example three — Categorical array (Grid single selection)
Single dimension array, defined by one categorical axis with categorical values.
Question: How satisfied were you with each of the following products that said you had purchased?
Very satisfied | Somewhat satisfied | Somewhat dissatisfied | Very dissatisfied | |
---|---|---|---|---|
Lenovo PC | X | |||
MacBook Pro | X | |||
IBM 360 | X |
Variable definition:
// schema
{
"name": "brand_statements",
"type": "array",
"value": {
"type": "categorical",
"values": [1, 2, 3, 4]
},
"axes": [
{
"name": "brand",
"type": "categorical",
"values": [7, 8, 9],
}
]
}
// metadata
{
"values": {
{"value": 1, "label": "very_satisfied"},
{"value": 2, "label": "somewhat_satisfied"},
{"value": 3, "label": "somewhat_dissatisfied"},
{"value": 4, "label": "very_dissatisfied"},
},
"axes": {
"brand": {
"values": [
{"value": 7, "label": "Lenovo"},
{"value": 8, "label": "Macbook Pro"},
{"value": 9, "label": "IBM"}
]
}
}
}
Example four — Multiple response (Grid multi selection)
Multidimensional array, defined by two categorical axes with boolean values.
Question: For each brand name listed below, which characteristics do you perceive to be true? (check all that apply)
Trustworthy | Trendy | Reliable | High Quality | |
---|---|---|---|---|
Apple | X | X | X | |
X | X | |||
X |
Variable definition:
// schema
{
"name": "brand_metrics_select",
"type": "array",
"value": {
"type": "categorical",
"values": [1, 2, 9]
},
"axes": [
{
"name": "brand",
"type": "categorical",
"values": [1, 2, 3]
},
{
"name": "metric",
"type": "categorical",
"values": [10, 11, 12, 13]
}
]
}
// metadata
{
"values": {
{"value": 1, "label": "Yes", "selected": true},
{"value": 2, "label": "No", "missing": false},
{"value": 9, "label": "Not Asked", "missing": true}
},
"axes": {
"brand": {
"values": [
{"value": 1, "label": "Apple"},
{"value": 2, "label": "Google"},
{"value": 3, "label": "Facebook"}
]
},
"metric": {
"values": [
{"value": 10, "label": "Trustworthy"},
{"value": 11, "label": "Trendy"},
{"value": 12, "label": "Reliable"},
{"value": 13, "label": "High Quality"}
]
}
}
}