The Crunch Lakehouse is designed to handle survey data efficiently and robustly, irrespective of its source. The primary use cases at the time of writing are SPSS- and Parquet-based datasets, though support for data from integrations such as Qualtrics, SurveyMonkey, and Decipher, among others, is planned.
The Lakehouse normalizes survey data into a long-format storage layout that is efficient to store and query.
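As an illustration of what long format means (this is an analogy only; the Lakehouse's actual storage layout is defined by the Crunch Logical Schema, and the pandas code below plays no role in it), wide survey data with one column per question is reshaped into one row per respondent/variable pair:

import pandas as pd

# Wide format: one row per respondent, one column per question.
wide = pd.DataFrame({"respondent_id": [1, 2], "q1": ["Yes", "No"], "q2": [5, 3]})

# Long format: one row per (respondent, variable) pair, which is compact to
# store and straightforward to filter and aggregate by variable.
long = wide.melt(id_vars="respondent_id", var_name="variable", value_name="value")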
There are currently two primary use cases for the Lakehouse:
- A survey data pipeline.
- Open, standards-based access to data in the traditional Crunch Data Platform, including YouGov syndicated data such as Profiles.
To understand the structure of datasets in the Crunch Lakehouse, it is important to understand the Crunch Logical Schema documentation, which includes examples of the various variable types and how they are defined.
The Crunch Lakehouse APIs require API keys. Refer to the API documentation for details on how to retrieve your API key.
See the dedicated developer guides for more details about ingesting, updating, and exporting data from the Lakehouse.
Lakehouse schema and metadata
The Crunch Lakehouse provides APIs to retrieve schema and metadata for any data that exists in the Lakehouse. Refer to Crunch Logical Schema for details about the structure of survey data schema and metadata.
The following code example fetches the latest schema and metadata from the specified datasource:
import json
import pycrunch
site = pycrunch.connect(
    api_key="<crunch_api_key>",
    site_url="<crunch_api_url>",
)
# get datasource entity
datasources = site.follow("datasources", {"name": "<name>"})
datasource = datasources.index[next(iter(datasources.index))].entity
# get schema
response = datasource.schema.latest["value"]
version, schema = response["version_id"], json.loads(response["content"].json)
# get metadata
response = datasource.metadata.latest["value"]
_, metadata = response["version_id"], json.loads(response["content"].json)
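As a quick sanity check after the fetch (assuming nothing about the schema's internal structure beyond it being the JSON parsed above), you can confirm which version was retrieved and what shape the payloads have:

# Confirm which schema version was fetched and the shape of the parsed payloads.
print("schema version:", version)
print("schema parsed as:", type(schema).__name__)
if isinstance(schema, dict):
    print("top-level schema keys:", sorted(schema))
print("metadata parsed as:", type(metadata).__name__)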
Tying together ingestion and exports
The following describes an example use case in which two applications interact to form a data pipeline: one sends data and the other consumes it, demonstrating how DataSources, DataDestinations, and Notifications tie the two applications together (see the sketch below).
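The sketch below is illustrative only. It reuses the site.follow pattern from the schema example above, but the "datadestinations" relation name, the notification handling, and the placeholder ingest/export steps are assumptions rather than confirmed pycrunch Lakehouse APIs; the Ingesting and Exporting developer guides document the real calls.

import pycrunch

# Shared connection setup; in practice each application holds its own API key.
site = pycrunch.connect(api_key="<crunch_api_key>", site_url="<crunch_api_url>")

def producer_send(datasource_name):
    """Producer application: look up its DataSource and submit new survey data.

    The "datasources" relation is the one used earlier in this document; the
    actual data submission is left as a placeholder (see the Ingesting guide).
    """
    datasources = site.follow("datasources", {"name": datasource_name})
    datasource = datasources.index[next(iter(datasources.index))].entity
    # Submit rows/files to the datasource here; the API returns a workflow ID
    # that the producer can track (see the next section).
    return datasource

def consumer_receive(destination_name):
    """Consumer application: look up its DataDestination and export when notified.

    The "datadestinations" relation name is an assumption, mirroring the
    "datasources" pattern above; Notifications tell the consumer that new data
    is ready to export (see the Exporting guide for the real calls).
    """
    destinations = site.follow("datadestinations", {"name": destination_name})
    destination = destinations.index[next(iter(destinations.index))].entity
    # React to a Notification here and trigger the export from the destination.
    return destination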
Tracking progress of asynchronous operations
Ingestion and Export operations are asynchronous. The APIs return a workflow ID that can be used to retrieve progress indications. The pycrunch package simplifies this by providing tracking functions to wait for completion and retrieve progress.
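A minimal polling sketch is shown below, assuming only that the workflow entity can be re-fetched and reports its state in its body; the refresh() call and the "progress" and "status" field names are assumptions, and the built-in pycrunch tracking functions should be preferred where available.

import time

def wait_for_workflow(workflow, poll_seconds=5, timeout_seconds=3600):
    """Poll an asynchronous workflow entity until it reports completion.

    Illustrative only: the refresh() call and the "progress"/"status" field
    names are assumptions about the workflow resource, not confirmed API.
    """
    deadline = time.time() + timeout_seconds
    while time.time() < deadline:
        workflow.refresh()                         # re-fetch the entity's current state
        body = workflow.body                       # entity payload as a dict-like object
        status = body.get("status")                # hypothetical field name
        print("progress:", body.get("progress"), "status:", status)
        if status in ("completed", "failed"):
            return status
        time.sleep(poll_seconds)
    raise TimeoutError("workflow did not finish before the timeout")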