Crunch Logical Schema

Context

This document defines the canonical Crunch Schema to support both existing and new datasets provided by our users.

Over the years, the way Crunch describes its datasets has evolved, with an unclear distinction of what’s schema and what’s metadata.

Schema is meant to describe the variable types of a dataset, and metadata is meant to contain information that identifies and describes the data.

With this distinction, we are able to define rules that can safely enforce how schema definitions are built.

Use Cases

Schema for datasets stored in the Silverlake
Common representation for translating importer data
Streaming schema

Driving Principles

User-friendly

The schema is intended to be as succinct and friendly to compose as possible. Data types are designed to be explicit in what it is describing, and they should be extensible when possible.

Explicit

Clear distinction between schema and metadata.

Schema and metadata are defined separately and each data type can have specific metadata attributes that are not present in other data types.

Metadata cannot be described by itself—it only supports the schema.

Compatible

Compatible with most common data formats and tools, such as parquet, pyarrow and pandas.

We want converting imports into DataFrames (whether from R, Pandas, Polars, Dask, or anywhere else) to be as easy as possible, using their native support for categorical types and nested columns.

On top of that, this schema is designed to support the existing datasets so all features should be supported in the new schema.

Terminology

Term	Definition
Variable	Identifies values on a dataset.
Data Type	A data type defines a variable.
Metadata	Information that identifies and describes an individual data type.
Schema	The composition of multiple data type definitions that describe a dataset.

Data Types

There are two required attributes that compose a variable definition: name and type.
For categorical, there’s an extra attribute (values) and for array, there’s value.

Attribute	Type	Description	Required
name	string	Unique name of the variable	true
type	string	A Crunch data type	true
values	list<integer, string>	Required if type categorical	false
value	map	Required if type array	false

Data types can be Atomic Types or Array Type.

Generally speaking, the data can be represented and transported as a JSON document.
All data types are represented in a basic {key: value} format as per standard JSON objects.

Atomic Types

Numeric

Numeric variables are used to store numbers, including both integer numbers or floats:

// schema
{ 
  "name": "numchildren", 
  "type": "numeric"
}

Text

Text variables are encoded as UTF-8. Text data is of variable length:

// schema
{
  "name": "zipcode",
  "type": "text"
}

Datetime

Datetime variables are serialized as an ISO8601 UTC datetime string (e.g., 2023-05-01T09:00:00):

// schema
{
  "name": "starttime",
  "type": "datetime"
}

Categorical

Categorical variables store a value from a finite set of categories:

// schema
{
  "name": "gender",
  "type": "categorical",
  "values": ["Female", "Male", "Other"]
}

You can also use numeric codes instead of strings:

// schema
{
  "name": "educ4",
  "type": "categorical",
  "values": [1, 2, 3, 4, 9]
}
// metadata
{
  "values": [
      {"value": 1, "label": "High school or less"},
      {"value": 2, "label": "Some college"},
      {"value": 3, "label": "College graduate"},
      {"value": 4, "label": "Postgraduate (M.D., J.D., M.A., M.S., M.B.A., Ph.D., D.V.M., or equivalent)"},
      {"value": 9, "label": "Refused, skipped, not applicable"}
  ]
}

Historically, numeric codes were used to save space compared to using string values.

There is no inefficiency in long-term storage file formats such as Parquet, including sending text strings as values, since these will be dictionary encoded.

With categorical types, metadata is assigned to each categorical value:

Attribute	Type	Description	Required
values	list<item>		false
values.item.value	same as schema.values.<item>	category id, the one from schema.values	true
values.item.label	string	Presentable category name	false
values.item.scale	float	Numeric value associated with the category	false
values.item.date	string	Date in ISO8601 format (any precision) associated with the category	false
values.item.selected	boolean	Category to be considered as selected	false
values.item.missing	boolean	Category to be considered as missing data	false
values.item.notes	string		false

Values in metadata

While not required, when defined, the values in the metadata are set as a list because the order matters for categorical data.

The order of the values is the default order in which they should be analyzed/presented, which takes precedence over the order defined in the schema.

When there’s no metadata for a category variable, the default order is defined by the schema.

Array Type

Multiple response variables, categorical arrays, and numeric arrays are commonly found types in survey systems. All three of these concepts (and more, including multidimensional arrays) are represented as a single array type.

Here is a definition of a question where the respondent is asked about their awareness of all five brands:

// schema 
{
  "name": "awareness",
  "type": "array",
  "value": {
    "type": "categorical",
    "values": [1, 2, 9]
  },
  "axes": [
    { 
      "name": "brand",
      "type": "categorical",
      "values": ["101", "102", "103", "201", "202"]
    }
  ]
}

// metadata
{
  "label": "Aided awareness",
  "description": "Which of these brands have you heard of?",
  "values": [
      {"value": 1, "label": "Selected", "selected": true},
      {"value": 2, "label": "Not selected"},
      {"value": 9, "label": "Not asked", "missing": true}
  ],
  "axes": {
    "brand": {
      "values": [
        {"value": "101", "label": "Apple"},
        {"value": "102", "label": "Borland"},
        {"value": "103", "label": "Corel"},
        {"value": "201", "label": "Google"},
        {"value": "202", "label": "Facebook"}
      ]
    }
  }
}

Data type attributes:

Attribute	Type	Description	Required
axes	list<item>
axes.item.name	string	Unique name for the axis	true
axes.item.type	string	An atomic data type	true
axes.item.values	list<string>	Required if axes.item.type is categorical or array	false

Metadata attributes:

Attribute	Type	Description	Required
axes.<axis>.values	list<item>
metadata.axes.<axis>.values.<item>.value	same as in axes.<item>.values	value id, the one from schema.axes.<item>.values	true
metadata.axes.<axis>.values.<item>.label	string	Presentable category name	false
metadata.axes.<axis>.values.<item>.description	string		false
metadata.axes.<axis>.values.<item>.notes	string		false
metadata.axes.<axis>.values.<item>.missing	boolean	Axis value to be considered missing	false

Examples

Sample Data in these examples are actually stored in the Long Format. See Crunch Lakehouse Format Data Model to learn more.

Categorical Examples

Example one — Categories as codes with labels

Question: “Who did you vote for in the 2020 presidential election?”

Answers:

Joseph R. Biden, Jr.
Donald J. Trump
Other
Did not vote
Skipped
Not asked

Variable definition:

// schema
{
  "name": "presidential_vote_2020",   
  "type": "categorical",
  "values": [1, 2, 3, 4, 8, 9]
}

// metadata
{
  "label": "2020 Presidential vote choice",
  "description": "Who did you vote for in...",
  "values": [
    {"value": 1, "label": "Joseph R. Biden, Jr."},
    {"value": 2, "label": "Donald J. Trump"},     
    {"value": 3, "label": "Other"},
    {"value": 4, "label": "Did not vote"},
    {"value": 8, "label": "Skipped"},
    {"value": 9, "label": "Not asked"}
  ]
}

// Sample data
{
  "row_id": 101,
  "var_name": "presidential_vote_2020",
  "categorical_value": 4
}

Example two — Categories as labels with scale

Question: "Do you approve or disapprove of the job Joe Biden is doing as President?”

Answers:

Strongly approve
Somewhat approve
Somewhat disapprove
Strongly disapprove
Not sure
Skipped
Not asked

Variable definition:

// schema
{
  "name": "approval_biden",
  "type": "categorical",
  "values": [
     "Strongly Approve",
     "Somewhat Approve",
     "Neutral",
     "Somewhat Disapprove",
     "Strongly Disapprove",
     "Not sure",
     "Skipped",
     "Not asked"
    ]
  }
}

// metadata
"metadata": {
  "label": "Biden approval",
  "description": "Do you approve of the job Joe Biden is doing as President?",
  "values": [
    {"value": "Strongly Approve", "scale": 20.0, "selected": true},
    {"value": "Somewhat Approve", "scale": 10.0, "selected": true},
    {"value": "Neutral", "scale": 0},
    {"value": "Somewhat Disapprove", "scale": -10.0},
    {"value": "Strongly Disapprove", "scale": -20.0},
    {"value": "Not sure", "missing": true},
    {"value": "Skipped", "missing": true},
    {"value": "Not asked", "missing": true}   
  ]
}

// Sample data
{
  "row_id": 101,
  "var_name": "approval_biden",
  "categorical_value": "Not asked"
}

Example three — Categories with date associated

Variable definition:

// schema
{
  "name": "month",
  "type": "categorical",
  "values": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 98, 99]
}

// metadata
{
  "label": "Month of 2023",
  "description": "What month did you...",
  "values": [
    {"value": 1, "date": "2023-01"},
    {"value": 2, "date": "2023-02"},
    {"value": 3, "date": "2023-03"},
		// ... hidden for brevity
    {"value": 98, "label": "Skipped", "missing": true},
    {"value": 99, "label": "Not Asked", "missing": true},
  ]
}

// Sample data
{
  "row_id": 101,
  "var_name": "month",
  "categorical_value": 2
}

Array Examples

Example one — Dichotomies array

Single dimension array, defined by one categorical axis and categorical values.

Question: Do you use the following social networks?

Answers:

Facebook
Twitter
Linkedin

Variable definition:

// schema
{
  "name": "socnetuse",
  "type": "array",
  "value": {
    "type": "categorical",
    "values": [1, 2, 8, 9]
  },
  "axes": [
    { 
      "name": "network", 
      "type": "categorical", 
      "values": [1, 2, 3, 4]
    }
  ]
}

// metadata
{
  "values": {
    {"value": 1, "label": "Yes", "selected": true},
    {"value": 2, "label": "No", "selected": false},
    {"value": 8, "label": "Not Asked", "missing": true},
    {"value": 9, "label": "Not Aware", "missing": true},
  },
  "axes": {
    "network": {
      "values": [
        {"value": 1, "label": "Facebook"},
        {"value": 2, "label": "X (Twitter)"},
        {"value": 3, "label": "LinkedIn"},
        {"value": 4, "label": "Instagram"}
      ]
    }
  }
}

Example two — Numeric array

Single dimension array, defined by one categorical axis and integer values.

Question: Please rate the following computers on a scale from 0 to 100, where 100 is the platonic ideal and 0 is metaphysically terrible.

Answers:

iPad Pro
Chromebook
Lenovo Laptop
MacBook

Variable definition:

// schema
{
  "name": "computer_rating",
  "type": "array",
  "value": { "type": "numeric" },
  "axes": [
    { 
      "name": "computer",
      "type": "categorical",
      "values": [
        "ipad_pro", 
        "chromebook", 
        "lenovo", 
        "macbook"
      ]
    }
  ]
}

Example three — Categorical array (Grid single selection)

Single dimension array, defined by one categorical axis with categorical values.

Question: How satisfied were you with each of the following products that you have purchased?

	Very satisfied	Somewhat satisfied	Somewhat dissatisfied
Lenovo PC		X
MacBook Pro	X
IBM 360			X

Variable definition:

// schema
{
  "name": "brand_statements",
  "type": "array",
  "value": {
    "type": "categorical",
    "values": [1, 2, 3, 4]
  },
  "axes": [
    { 
      "name": "brand", 
      "type": "categorical",
      "values": [7, 8, 9],
    }
  ]
}
 
// metadata
{
  "values": {
    {"value": 1, "label": "Very Satisfied"},
    {"value": 2, "label": "Somewhat Satisfied"},
    {"value": 3, "label": "Somewhat Dissatisfied"},
    {"value": 4, "label": "Very Dissatisfied"},
  },
  "axes": {
    "brand": {
      "values": [
        {"value": 7, "label": "Lenovo PC"},
        {"value": 8, "label": "MacBook Pro"},
        {"value": 9, "label": "IBM 360"}
      ]
    }
  }
}

Example four — Multiple response (Grid multi selection)

Multidimensional array, defined by two categorical axes with boolean values.

Question: For each brand name listed below, which characteristics do you perceive to be true? (Check all that apply.)

	Trustworthy	Trendy	Reliable	High Quality
Apple	X		X	X
Google			X	X
Facebook		X

Variable definition:


// schema
{
  "name": "brand_metrics_select",
  "type": "array",
  "value": {
    "type": "categorical",
    "values": [1, 2, 9]
  },
  "axes": [
    { 
      "name": "brand", 
      "type": "categorical", 
      "values": [1, 2, 3] 
    },
    { 
      "name": "metric", 
      "type": "categorical", 
      "values": [10, 11, 12, 13]
    }
  ]
}

// metadata
{
  "values": {
    {"value": 1, "label": "Yes", "selected": true},
    {"value": 2, "label": "No", "missing": false},
    {"value": 9, "label": "Not Asked", "missing": true}   
  },
  "axes": {
    "brand": {
      "values": [
        {"value": 1, "label": "Apple"},
        {"value": 2, "label": "Google"},
        {"value": 3, "label": "Facebook"}
      ]
    },
    "metric": {
      "values": [
        {"value": 10, "label": "Trustworthy"},
        {"value": 11, "label": "Trendy"},
        {"value": 12, "label": "Reliable"},
        {"value": 13, "label": "High Quality"}
      ]
    }
  }
}

Help Center

Context

Use Cases

Driving Principles

User-friendly

Explicit

Compatible

Terminology

Data Types

Atomic Types

Numeric

Text

Datetime

Categorical

Array Type

Examples

Categorical Examples

Example one — Categories as codes with labels

Example two — Categories as labels with scale

Example three — Categories with date associated

Array Examples

Example one — Dichotomies array

Example two — Numeric array

Example three — Categorical array (Grid single selection)

Example four — Multiple response (Grid multi selection)

Related articles