This article describes the layout of the Parquet files that contain the row-level respondent data for surveys defined in the Crunch Logical Schema.
Data columns
| column | type | required |
|---|---|---|
| row_id | INT | true |
| var_name | STRING | true |
| var_type | STRING | true |
| axis | LIST<STRING> | false |
| categorical_value | STRING | false |
| numeric_value | DOUBLE | false |
| datetime_value | TIMESTAMP[us] | false |
| text_value | STRING | false |
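As a rough illustration of the table above, the following sketch checks a single row (represented as a Python dict) against the column list. The mapping from Parquet types to Python types and the helper name `validate_row` are assumptions made for this example, not part of the schema itself.

```python
from datetime import datetime

# (column name, accepted Python type, required): mirrors the column table.
# The Python types are an assumed stand-in for the Parquet types.
COLUMNS = [
    ("row_id", int, True),
    ("var_name", str, True),
    ("var_type", str, True),
    ("axis", list, False),
    ("categorical_value", str, False),
    ("numeric_value", float, False),
    ("datetime_value", datetime, False),
    ("text_value", str, False),
]


def validate_row(row: dict) -> list:
    """Return a list of problems found in one row (empty if there are none)."""
    problems = []
    for name, py_type, required in COLUMNS:
        value = row.get(name)
        if value is None:
            # Only required columns may not be null.
            if required:
                problems.append(f"{name} is required")
        elif not isinstance(value, py_type):
            problems.append(f"{name} should be {py_type.__name__}")
        elif name == "axis" and not all(isinstance(item, str) for item in value):
            problems.append("axis should contain only strings")
    return problems
```

A row that supplies the three required columns and leaves the optional ones out passes the check; a missing `var_name` or a string in `numeric_value` is reported.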
Examples
The first four examples are atomic types. To save space, the table shows only the single non-null value field for each row. In the first row, categorical_value is "male" and the other three value fields are null; for age, numeric_value is 42.0 and the other three value fields are null:
| row_id | var_name | var_type | axis | non-null value field |
|---|---|---|---|---|
| 1 | gender | categorical | NULL | "male" |
| 1 | age | numeric | NULL | 42.0 |
| 1 | firstname | text | NULL | "Joe" |
| 1 | birthdate | datetime | NULL | 1970-01-01 00:00:00 |
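To make the elided null fields explicit, here is a minimal sketch that builds the four atomic rows in full. The `VALUE_COLUMN` mapping from var_type to value column is inferred from the examples, and the helper `atomic_row` is a hypothetical name used only for illustration.

```python
from datetime import datetime

# Assumed mapping from var_type to the value column it populates,
# inferred from the four example rows.
VALUE_COLUMN = {
    "categorical": "categorical_value",
    "numeric": "numeric_value",
    "datetime": "datetime_value",
    "text": "text_value",
}


def atomic_row(row_id, var_name, var_type, value):
    """Build one full row for an atomic variable; unused value fields are None."""
    row = {"row_id": row_id, "var_name": var_name, "var_type": var_type, "axis": None}
    # Start with every value column null, then fill the one that applies.
    for column in VALUE_COLUMN.values():
        row[column] = None
    row[VALUE_COLUMN[var_type]] = value
    return row


rows = [
    atomic_row(1, "gender", "categorical", "male"),
    atomic_row(1, "age", "numeric", 42.0),
    atomic_row(1, "firstname", "text", "Joe"),
    atomic_row(1, "birthdate", "datetime", datetime(1970, 1, 1)),
]
```

Each resulting dict carries all eight columns, with exactly one of the four value fields set and axis left null.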
For array variables, axis is a list of strings. The first example is a one-dimensional categorical array recording awareness of Apple, Borland, and Corel:
| row_id | var_name | var_type | axis | categorical_value |
|---|---|---|---|---|
| 1 | awareness | categorical | ["apple"] | "yes" |
| 1 | awareness | categorical | ["borland"] | "yes" |
| 1 | awareness | categorical | ["corel"] | "no" |
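One way to produce rows like the awareness example is to walk a nested response and record the path to each answer in axis. The sketch below assumes the response arrives as a nested Python dict (a hypothetical input format) and that the value column is named after var_type; `flatten` is an illustrative helper, not part of any Crunch API.

```python
def flatten(row_id, var_name, var_type, nested, axis=()):
    """Yield one row per leaf of a nested dict, recording the key path in axis."""
    for key, value in nested.items():
        path = axis + (key,)
        if isinstance(value, dict):
            # Deeper nesting adds one more entry to the axis list.
            yield from flatten(row_id, var_name, var_type, value, path)
        else:
            yield {
                "row_id": row_id,
                "var_name": var_name,
                "var_type": var_type,
                "axis": list(path),
                # Assumes the value column is "<var_type>_value".
                f"{var_type}_value": value,
            }


awareness = {"apple": "yes", "borland": "yes", "corel": "no"}
rows = list(flatten(1, "awareness", "categorical", awareness))
```

Because the function recurses, a dict nested one level deeper would yield rows whose axis lists have two entries instead of one.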
A two-dimensional categorical array (brand metric) is represented as:
| row_id | var_name | var_type | axis | categorical_value |
|---|---|---|---|---|
| 1 | rating | categorical | ["apple", "value"] | "1" |
| 1 | rating | categorical | ["apple", "quality"] | "2" |
| 1 | rating | categorical | ["borland", "value"] | "2" |
| 1 | rating | categorical | ["borland", "quality"] | "3" |
This requires support for variable-length arrays in the axis column. (In a system without this feature, you could support up to a fixed number of dimensions by using a separate field for each axis, for example one for the first axis and one for the second.)
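The fixed-dimension fallback might look like the following sketch, which replaces the list-valued axis with scalar columns. The column names `axis_1` and `axis_2`, the limit `MAX_AXES`, and the helper `split_axis` are illustrative assumptions, not part of the schema described here.

```python
MAX_AXES = 2  # assumed upper bound on the number of dimensions


def split_axis(row, max_axes=MAX_AXES):
    """Return a copy of row with axis replaced by axis_1..axis_N scalar columns."""
    axis = row.get("axis") or []  # atomic variables have a null axis
    if len(axis) > max_axes:
        raise ValueError(f"{len(axis)} dimensions exceed the limit of {max_axes}")
    flat = {key: value for key, value in row.items() if key != "axis"}
    for i in range(max_axes):
        # Unused trailing axis columns stay null.
        flat[f"axis_{i + 1}"] = axis[i] if i < len(axis) else None
    return flat


row = {
    "row_id": 1,
    "var_name": "rating",
    "var_type": "categorical",
    "axis": ["apple", "value"],
    "categorical_value": "1",
}
flat_row = split_axis(row)
```

The trade-off is that any variable with more dimensions than the chosen limit cannot be stored, which is why the sketch raises an error rather than silently truncating the axis.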