The Crunch Lakehouse is designed to handle survey data efficiently and robustly, irrespective of its source. The primary use cases at the time of writing are SPSS- and Parquet-based datasets, though support for data from integrations such as Qualtrics, SurveyMonkey, and Decipher, among others, is planned.
The Lakehouse normalizes survey data into a long-format storage layout that is efficient to store and query.
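As an illustration of what long format means (this is an analogy only; the Lakehouse's actual storage layout is defined by the Crunch Logical Schema, and the pandas code below plays no role in it), wide survey data with one column per question is reshaped into one row per respondent/variable pair:

import pandas as pd

# Wide format: one row per respondent, one column per question.
wide = pd.DataFrame({"respondent_id": [1, 2], "q1": ["Yes", "No"], "q2": [5, 3]})

# Long format: one row per (respondent, variable) pair, which is compact to
# store and straightforward to filter and aggregate by variable.
long = wide.melt(id_vars="respondent_id", var_name="variable", value_name="value")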
There are currently two primary use cases for the Lakehouse:
- A survey data pipeline.
- Open, standards-based access to data in the traditional Crunch Data Platform, including YouGov syndicated data such as Profiles.
To understand the structure of datasets in the Crunch Lakehouse, it is important to understand the Crunch Logical Schema documentation, which includes examples of the various variable types and how they are defined.
The Crunch Lakehouse APIs require API keys. Refer to the API documentation for details on how to retrieve your API key.
See the dedicated developer guides for more details about ingesting, updating, and exporting data from the Lakehouse.
Lakehouse schema and metadata
The Crunch Lakehouse provides APIs to retrieve schema and metadata for any data that exists in the Lakehouse. Refer to Crunch Logical Schema for details about the structure of survey data schema and metadata.
The following code example fetches the latest schema and metadata from the specified datasource:
import json
import pycrunch
site = pycrunch.connect(
    api_key="<crunch_api_key>",
    site_url="<crunch_api_url>",
)
# get datasource entity
datasources = site.follow("datasources", {"name": "<name>"})
datasource = datasources.index[next(iter(datasources.index))].entity
# get schema
response = datasource.schema.latest["value"]
version, schema = response["version_id"], json.loads(response["content"].json)
# get metadata
response = datasource.metadata.latest["value"]
_, metadata = response["version_id"], json.loads(response["content"].json)
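As a quick sanity check after the fetch (assuming nothing about the schema's internal structure beyond it being the JSON parsed above), you can confirm which version was retrieved and what shape the payloads have:

# Confirm which schema version was fetched and the shape of the parsed payloads.
print("schema version:", version)
print("schema parsed as:", type(schema).__name__)
if isinstance(schema, dict):
    print("top-level schema keys:", sorted(schema))
print("metadata parsed as:", type(metadata).__name__)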
Tying together ingestion and exports
The following describes an example use case in which two applications interact to form a data pipeline: one sends data and the other consumes it, demonstrating how DataSources, DataDestinations, and Notifications tie the two applications together (see the sketch below).
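The sketch below is illustrative only. It reuses the site.follow pattern from the schema example above, but the "datadestinations" relation name, the notification handling, and the placeholder ingest/export steps are assumptions rather than confirmed pycrunch Lakehouse APIs; the Ingesting and Exporting developer guides document the real calls.

import pycrunch

# Shared connection setup; in practice each application holds its own API key.
site = pycrunch.connect(api_key="<crunch_api_key>", site_url="<crunch_api_url>")

def producer_send(datasource_name):
    """Producer application: look up its DataSource and submit new survey data.

    The "datasources" relation is the one used earlier in this document; the
    actual data submission is left as a placeholder (see the Ingesting guide).
    """
    datasources = site.follow("datasources", {"name": datasource_name})
    datasource = datasources.index[next(iter(datasources.index))].entity
    # Submit rows/files to the datasource here; the API returns a workflow ID
    # that the producer can track (see the next section).
    return datasource

def consumer_receive(destination_name):
    """Consumer application: look up its DataDestination and export when notified.

    The "datadestinations" relation name is an assumption, mirroring the
    "datasources" pattern above; Notifications tell the consumer that new data
    is ready to export (see the Exporting guide for the real calls).
    """
    destinations = site.follow("datadestinations", {"name": destination_name})
    destination = destinations.index[next(iter(destinations.index))].entity
    # React to a Notification here and trigger the export from the destination.
    return destination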
Tracking progress of asynchronous operations
Ingestion and Export operations are asynchronous. The APIs return a workflow ID that can be used to retrieve progress indications. The pycrunch package simplifies this by providing tracking functions to wait for completion and retrieve progress.
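A minimal polling sketch is shown below, assuming only that the workflow entity can be re-fetched and reports its state in its body; the refresh() call and the "progress" and "status" field names are assumptions, and the built-in pycrunch tracking functions should be preferred where available.

import time

def wait_for_workflow(workflow, poll_seconds=5, timeout_seconds=3600):
    """Poll an asynchronous workflow entity until it reports completion.

    Illustrative only: the refresh() call and the "progress"/"status" field
    names are assumptions about the workflow resource, not confirmed API.
    """
    deadline = time.time() + timeout_seconds
    while time.time() < deadline:
        workflow.refresh()                         # re-fetch the entity's current state
        body = workflow.body                       # entity payload as a dict-like object
        status = body.get("status")                # hypothetical field name
        print("progress:", body.get("progress"), "status:", status)
        if status in ("completed", "failed"):
            return status
        time.sleep(poll_seconds)
    raise TimeoutError("workflow did not finish before the timeout")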