Tabulating Data on the Lakehouse

You can retrieve tabulated data from the Lakehouse using the Tabulation API by specifying variable aliases for rows and columns..

The API returns a pre-signed Amazon S3 URL that provides temporary access to the results. The output is delivered as a single CSV file in long format. See the section below for details on the long-format CSV structure

The Tabulation API operates asynchronously. When a request completes, the result includes the S3 URL allowing you to download the generated results.

This article uses data from Example files to demonstrate how the Tabulation API works in practice.

At this time, the tabulation API only supports categorical variable types.

Categorical Tabulation

Let’s say you want to retrieve data showing which social media sites respondents use, cross-tabulated by gender. The dataset includes nine variables for the question “Please tell me if you ever use any of the following social media sites. Do you ever use…”, with aliases 'web1a' through 'web1i', and a gender variable with the alias 'sex'. These variables are placed in the columns and rows, respectively, as shown in the example requests:

Crunch Datasets

{
    "dataset": "<dataset ID>",
    "tabulation": {
        "rows": [
            "web1a", "web1b", "web1c", "web1d", "web1e",
            "web1f", "web1g", "web1h", "web1i"
        ], 
        "columns": [
            "sex"
        ],
    }
}

Lakehouse Datasources

{
    "datasource": "<datasource ID>",
    "tabulation": {
        "rows": [
            "web1a", "web1b", "web1c", "web1d", "web1e",
            "web1f", "web1g", "web1h", "web1i"
        ], 
        "columns": [
            "sex"
        ],
    }
}

Resulting Tabulation

The API returns a CSV that looks like:

row_alias,row_axis,row_code,column_alias,column_axis,column_code,column_proportion,count_unweighted,column_base_unweighted,count_weighted,column_base_weighted
web1a,,1,sex,,1,24.033816425120776,199,828,,
web1a,,1,sex,,2,18.991097922848667,128,674,,
web1a,,2,sex,,1,75.96618357487924,629,828,,
web1a,,2,sex,,2,81.00890207715133,546,674,,
web1b,,1,sex,,1,28.985507246376812,240,828,,
web1b,,1,sex,,2,37.53709198813057,253,674,,
web1b,,2,sex,,1,70.89371980676329,587,828,,
web1b,,2,sex,,2,62.166172106824924,419,674,,
web1b,,8,sex,,1,0.12077294685990338,1,828,,
web1b,,8,sex,,2,0.2967359050445104,2,674,,
...

Each row is a single comparison of a row and a column variable value

Filtering Data

The tabulation request can include a filter expression along side the rows and columns.

{
    "dataset": f"<dataset ID>",
    "tabulation": {
        "rows": ["web1a", ..., "web1i"], 
        "columns": ["sex"],
        "filter": {"state": {"==": "17"}}
    }
}

Variables used in filters do not have to be included in rows or columns.

For a detailed explanation of constructing filter expressions, see Lakehouse Filtering Specification.

Cross-tabulating Arrays

The Tabulation API handles arrays the same as any other categorical variable, except the Axis column is filled out (in either rows or columns) with the respective axis name. This will naturally create more rows in the result as it calculates the proportion for every axis value row/column combination. The row_axis or column_axis column in the CSV will identify the corresponding axis for each row or column.

Weighted Data

Any tabulation may be weighted by a numeric variable. Every respondent in the weight variable must have a value to be eligible as a weighting variable. Add a weight attribute to the request indicating which variable alias ("weight" in this example) represents the weightings to be used:

{
    "dataset": f"<dataset ID>",
    "tabulation": {
        "rows": ["web1a", ..., "web1i"], 
        "columns": ["sex"],
        "weight": "weight"
    }
}

The CSV will contain the weighted and unweighted values and the column_proportion will be calculated on the weighted data.

row_alias,row_axis,row_code,column_alias,column_axis,column_code,column_proportion,count_unweighted,column_base_unweighted,count_weighted,column_base_weighted
web1a,,1,sex,,1,23.871648200406494,199,828,177,741
web1a,,1,sex,,2,20.544972708619554,128,674,156,761
web1a,,2,sex,,1,76.12835179959362,629,828,564,741
web1a,,2,sex,,2,79.45502729138049,546,674,605,761
web1b,,1,sex,,1,31.054658674376157,240,828,230,741
web1b,,1,sex,,2,42.78718422949066,253,674,326,761
web1b,,2,sex,,1,68.83509604947523,587,828,510,741
web1b,,2,sex,,2,56.92914225734111,419,674,433,761
web1b,,8,sex,,1,0.11024527614881985,1,828,1,741
web1b,,8,sex,,2,0.28367351316818723,2,674,2,761
...

Similar to filters, a default weight variable can be overridden by specifying "weight": None.

Comparing Audiences

The tabulation API may be used to compare variables against an audience (sometimes known as the population) rather than a direct comparison of the variable categories. A named expression may be used to define an audience which results in a tabulation of respondents either "in" (value 1) or "not in" (value 0) the audience. A named expression is defined as:

{
    "name": "<expression name>",
    "expression": <filter expression>
}

Let's say you want to define an audience as the group of respondents who use the internet at least once a day using variable "intfreq" categories "Almost constantly", "Several times a day" and "About once a day". A named expression would look like:

{
    "name": "frequent-usage",
    "expression": {"intfreq": {"in": [1, 2, 3]}}
}

A tabulation request against variable "sex" would look like:

{
    "dataset": f"<dataset ID>",
    "tabulation": {
        "rows": ["sex"], 
        "columns": [
            {
                "name": "frequent-usage",
                "expression": {"intfreq": {"in": [1, 2, 3]}}
            }
        ]
    }
}

which results in a tabulation that looks like:

row_alias,row_axis,row_code,column_alias,column_axis,column_code,column_proportion,count_unweighted,column_base_unweighted,count_weighted,column_base_weighted
sex,,1,frequent-usage,,0,48.86363636363637,129,264,,
sex,,2,frequent-usage,,0,51.13636363636363,135,264,,
sex,,1,frequent-usage,,1,56.462035541195476,699,1238,,
sex,,2,frequent-usage,,1,43.537964458804524,539,1238,,

This shows 56% of respondents "in" (value 1) the audience "frequent-usage" (internet users who use the internet at least once per day) are Male. Correspondingly, the 48% of respondents “not in” (value 0) the “frequent-usage” audience are Male. Note: respondents with missing answers are included in the “not in” audience.

Audiences (or named expressions) may be specified in rows or columns.

Formatting Tabulations

Including Labels

The corresponding labels for each row variable or column variable can be included in the resulting file by adding the include_columns parameter to the request in the format block. Supported columns are row_label, column_label.

Setting Precision

The number of decimal places used in the tabulation can be defined be adding the num_decimal_places attribute to the format block. By default, the maximum precision available will be used.

Example

"format": {
  "num_decimal_places": 2,
  "include_columns": [
    "row_label",
    "column_label"
  ]
}

Long Format Tabulation Results

The long format CSV result file will have the following columns:

Column	Explanation
row_alias	Alias of the variable displayed on the row dimension of the table.
row_axis	Axis name for the row variable in the tabulation.
row_code	Coded value of the row variable representing a specific response category.
row_label	The label associated with the `row_code` for a categorical variable. [optional]
column_alias	Alias of the variable displayed on the column dimension of the table.
column_axis	Axis name for the column variable in the tabulation.
column_code	Coded value of the column variable representing a specific response category.
column_label	The label associated with the column`_code` for a categorical variable. [optional]
col_prop	Column percentage: the proportion of respondents in the column category who fall into the row category, expressed as a percent.
count_unweighted	Unweighted number of respondents in the cell.
col_base_unweighted	Total unweighted number of respondents in the column category (column base).
count_weighted	Weighted number of respondents in the cell.
col_base_weighted	Total weighted number of respondents in the column category (weighted column base).

Each row in the CSV captures the results for one row and column category combination, reporting the count, total column population, and the proportion expressed as a percentage.

API References

Tabulate API Documentation

Python Code Examples

The following code examples rely on the public Crunch client pycrunch package.

Installation

pip install pycrunch

Basic Tabulation

This example uses data from Example files and cross-tabulates the social media variables (web*) against gender (sex)

import pycrunch
import requests
import time

from pycrunch.shoji import as_entity, wait_progress, as_value, TaskError
from pycrunch.progress import DefaultProgressTracking


datasource_id = "<datasource ID>"
# or dataset_id = "<dataset ID>"

api_key = "<crunch_api_key>"

site = pycrunch.connect(
	api_key=api_key,
	site_url="<crunch_api_url>",
)

body = {
    "datasource": f"{site_url}/datasources/{datasource_id}/",
    # or "dataset": f"{site.self}/datasets/{dataset_id}/",
    "tabulation": {
        "rows": [
            "web1a", "web1b", "web1c", "web1d", "web1e",
            "web1f", "web1g", "web1h", "web1i"
        ], 
        "columns": [
            "sex"
        ],
    },
}

response = site.datasources.tabulate.post(as_entity(body))
assert response.status_code == 202

workflow_url = response.json()["value"]

tracker = DefaultProgressTracking(timeout=3600)
try:
    wait_progress(response, site.session, progress_tracker=tracker)
except TaskError as exc:
    print(exc)

response = requests.get(url=workflow_url, headers={"Authorization": f"Bearer {api_key}"})
assert response.status_code == 200

print(f'S3 URL: {response.json()["value"]["result"]["data"]}')

Filtered Tabulation

This example uses data from Example files and cross-tabulates the social media variables (web*) against gender (sex), additionally filtered to state Illinois (value 17)

import pycrunch
import requests
import time

from pycrunch.shoji import as_entity, wait_progress, as_value, TaskError
from pycrunch.progress import DefaultProgressTracking

datasource_id = "<datasource ID>"
# or dataset_id = "<dataset ID>"

api_key = "<crunch_api_key>"

site = pycrunch.connect(
	api_key=api_key,
	site_url="<crunch_api_url>",
)

body = {
    "datasource": f"{site_url}/datasources/{datasource_id}/",
    # or "dataset": f"{site.self}/datasets/{dataset_id}/",
    "tabulation": {
        "rows": [
            "web1a", "web1b", "web1c", "web1d", "web1e",
            "web1f", "web1g", "web1h", "web1i"
        ], 
        "columns": [
            "sex"
        ],
        "filter": {"state": {"==": 17}}
    },
}

response = site.datasources.tabulate.post(as_entity(body))
assert response.status_code == 202

workflow_url = response.json()["value"]
print(f"tabulation task: {workflow_url}")

tracker = DefaultProgressTracking(timeout=3600)
try:
    wait_progress(response, site.session, progress_tracker=tracker)
except TaskError as exc:
    print(exc)

response = requests.get(url=workflow_url, headers={"Authorization": f"Bearer {api_key}"})
assert response.status_code == 200

print(f'S3 URL: {response.json()["value"]["result"]["data"]}')

Help Center