You can retrieve tabulated data from the Lakehouse using the Tabulation API by specifying variable aliases for rows and columns..
The API returns a pre-signed Amazon S3 URL that provides temporary access to the results. The output is delivered as a single CSV file in long format. See the section below for details on the long-format CSV structure
The Tabulation API operates asynchronously. When a request completes, the result includes the S3 URL allowing you to download the generated results.
This article uses data from Example files to demonstrate how the Tabulation API works in practice.
At this time, the tabulation API only supports categorical variable types.
Categorical Tabulation
Let’s say you want to retrieve data showing which social media sites respondents use, cross-tabulated by gender. The dataset includes nine variables for the question “Please tell me if you ever use any of the following social media sites. Do you ever use…”, with aliases 'web1a' through 'web1i', and a gender variable with the alias 'sex'. These variables are placed in the columns and rows, respectively, as shown below.
You can specify a Lakehouse datasource or a Crunch platform dataset ID.
{
"datasource": "<datasource ID>", # or "dataset": "<dataset ID>",
"tabulation": {
"rows": [
"web1a", "web1b", "web1c", "web1d", "web1e",
"web1f", "web1g", "web1h", "web1i"
],
"columns": [
"sex"
],
}
}Returns a CSV that looks like:
row_alias,row_axis,row_code,column_alias,column_axis,column_code,column_proportion,count_unweighted,column_base_unweighted,count_weighted,column_base_weighted
web1a,,1,sex,,1,24.033816425120776,199,828,,
web1a,,1,sex,,2,18.991097922848667,128,674,,
web1a,,2,sex,,1,75.96618357487924,629,828,,
web1a,,2,sex,,2,81.00890207715133,546,674,,
web1b,,1,sex,,1,28.985507246376812,240,828,,
web1b,,1,sex,,2,37.53709198813057,253,674,,
web1b,,2,sex,,1,70.89371980676329,587,828,,
web1b,,2,sex,,2,62.166172106824924,419,674,,
web1b,,8,sex,,1,0.12077294685990338,1,828,,
web1b,,8,sex,,2,0.2967359050445104,2,674,,
...Each row is a single comparison of a row and a column variable value
Filtering Data
The tabulation request can include a filter expression along side the rows and columns.
{
"dataset": f"<dataset ID>",
"tabulation": {
"rows": ["web1a", ..., "web1i"],
"columns": ["sex"],
"filter": {"state": {"==": "17"}}
}
}Variables used in filters do not have to be included in rows or columns.
For a detailed explanation of constructing filter expressions, see Lakehouse Filtering Specification.
Cross-tabulating Arrays
The Tabulation API handles arrays the same as any other categorical variable, except the Axis column is filled out (in either rows or columns) with the respective axis name. This will naturally create more rows in the result as it calculates the proportion for every axis value row/column combination. The row_axis or column_axis column in the CSV will identify the corresponding axis for each row or column.
Weighted Data
Any tabulation may be weighted by a numeric variable. Every respondent in the weight variable must have a value to be eligible as a weighting variable. Add a weight attribute to the request indicating which variable alias ("weight" in this example) represents the weightings to be used:
{
"dataset": f"<dataset ID>",
"tabulation": {
"rows": ["web1a", ..., "web1i"],
"columns": ["sex"],
"weight": "weight"
}
}The CSV will contain the weighted and unweighted values and the column_proportion will be calculated on the weighted data.
row_alias,row_axis,row_code,column_alias,column_axis,column_code,column_proportion,count_unweighted,column_base_unweighted,count_weighted,column_base_weighted
web1a,,1,sex,,1,23.871648200406494,199,828,177,741
web1a,,1,sex,,2,20.544972708619554,128,674,156,761
web1a,,2,sex,,1,76.12835179959362,629,828,564,741
web1a,,2,sex,,2,79.45502729138049,546,674,605,761
web1b,,1,sex,,1,31.054658674376157,240,828,230,741
web1b,,1,sex,,2,42.78718422949066,253,674,326,761
web1b,,2,sex,,1,68.83509604947523,587,828,510,741
web1b,,2,sex,,2,56.92914225734111,419,674,433,761
web1b,,8,sex,,1,0.11024527614881985,1,828,1,741
web1b,,8,sex,,2,0.28367351316818723,2,674,2,761
...Similar to filters, a default weight variable can be overridden by specifying "weight": None.
Long Format Tabulation Results
The long format CSV result file will have the following columns:
| Column | Explanation |
| row_alias | Alias of the variable displayed on the row dimension of the table. |
| row_axis | Axis name for the row variable in the tabulation. |
| row_code | Coded value of the row variable representing a specific response category. |
| column_alias | Alias of the variable displayed on the column dimension of the table. |
| column_axis | Axis name for the column variable in the tabulation. |
| column_code | Coded value of the column variable representing a specific response category. |
| col_prop | Column percentage: the proportion of respondents in the column category who fall into the row category, expressed as a percent. |
| count_unweighted | Unweighted number of respondents in the cell. |
| col_base_unweighted | Total unweighted number of respondents in the column category (column base). |
| count_weighted | Weighted number of respondents in the cell. |
| col_base_weighted | Total weighted number of respondents in the column category (weighted column base). |
Each row in the CSV captures the results for one row and column category combination, reporting the count, total column population, and the proportion expressed as a percentage.
API References
- Tabulate API Documentation
Python Code Examples
The following code examples rely on the public Crunch client pycrunch package.
Installation
pip install pycrunch
Basic Tabulation
This example uses data from Example files and cross-tabulates the social media variables (web*) against gender (sex)
import pycrunch
import requests
import time
from pycrunch.shoji import as_entity, wait_progress, as_value, TaskError
from pycrunch.progress import DefaultProgressTracking
datasource_id = "<datasource ID>"
# or dataset_id = "<dataset ID>"
api_key = "<crunch_api_key>"
site = pycrunch.connect(
api_key=api_key,
site_url="<crunch_api_url>",
)
body = {
"datasource": f"{site_url}/datasources/{datasource_id}/",
# or "dataset": f"{site.self}/datasets/{dataset_id}/",
"tabulation": {
"rows": [
"web1a", "web1b", "web1c", "web1d", "web1e",
"web1f", "web1g", "web1h", "web1i"
],
"columns": [
"sex"
],
},
}
response = site.datasources.tabulate.post(as_entity(body))
assert response.status_code == 202
workflow_url = response.json()["value"]
tracker = DefaultProgressTracking(timeout=3600)
try:
wait_progress(response, site.session, progress_tracker=tracker)
except TaskError as exc:
print(exc)
response = requests.get(url=workflow_url, headers={"Authorization": f"Bearer {api_key}"})
assert response.status_code == 200
print(f'S3 URL: {response.json()["value"]["result"]["data"]}')
Filtered Tabulation
This example uses data from Example files and cross-tabulates the social media variables (web*) against gender (sex), additionally filtered to state Illinois (value 17)
import pycrunch
import requests
import time
from pycrunch.shoji import as_entity, wait_progress, as_value, TaskError
from pycrunch.progress import DefaultProgressTracking
datasource_id = "<datasource ID>"
# or dataset_id = "<dataset ID>"
api_key = "<crunch_api_key>"
site = pycrunch.connect(
api_key=api_key,
site_url="<crunch_api_url>",
)
body = {
"datasource": f"{site_url}/datasources/{datasource_id}/",
# or "dataset": f"{site.self}/datasets/{dataset_id}/",
"tabulation": {
"rows": [
"web1a", "web1b", "web1c", "web1d", "web1e",
"web1f", "web1g", "web1h", "web1i"
],
"columns": [
"sex"
],
"filter": {"state": {"==": 17}}
},
}
response = site.datasources.tabulate.post(as_entity(body))
assert response.status_code == 202
workflow_url = response.json()["value"]
print(f"tabulation task: {workflow_url}")
tracker = DefaultProgressTracking(timeout=3600)
try:
wait_progress(response, site.session, progress_tracker=tracker)
except TaskError as exc:
print(exc)
response = requests.get(url=workflow_url, headers={"Authorization": f"Bearer {api_key}"})
assert response.status_code == 200
print(f'S3 URL: {response.json()["value"]["result"]["data"]}')