Importing Data
There are several ways to build a Crunch dataset. The most appropriate method for you to create a dataset depends primarily on the format in which the data is currently stored.
Import from a data file
In some cases, you already have a file sitting on your computer which has source data, in CSV or SPSS format (or a Zip file containing a single file in CSV or SPSS format). You can upload these to Crunch and then attach them to datasets by following these three steps.
Step one: Create a dataset entity
POST /datasets/ HTTP/1.1
Content-Type: application/shoji
Content-Length: 974
...
{
"element": "shoji:entity",
"body": {
"name": "my survey",
...
}
}
--------
201 Created
Location: /datasets/{dataset_id}/
ds <- newDataset("my.csv", name="my survey")
# All three steps are handled within newDataset()
POST a dataset Entity to the datasets catalog. See the documentation for POST /datasets/ for details on valid attributes to include in the POST.
The returned Location header refers to the new dataset resource:
- A 201 Created response indicates success.
- A 202 Accepted response indicates that dataset creation is happening in the background; read the response body to get a progress URL. See Progress for details.
For a 202 Accepted response, you can poll the progress URL given in the response body to confirm that the dataset has been created successfully:
202 Accepted
Location: /datasets/{dataset_id}/
{
"element": "shoji:view",
"self": "/datasets/",
"value": "/progress/{progress_id}/"
}
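As a rough illustration, polling a progress URL from Python with the requests library might look like the following; the token cookie and URLs are placeholders, not a definitive client.
import time
import requests
session = requests.Session()
session.cookies.set("token", "<your token>")  # placeholder auth, as in the curl examples later in this section
progress_url = "https://[workspace].crunch.io/api/progress/{progress_id}/"  # from the 202 body's "value"
while True:
    value = session.get(progress_url).json()["value"]
    if value.get("progress") == 100:  # see Progress for how errors are reported
        break
    time.sleep(5)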
Step two: Upload the file
POST /sources/ HTTP/1.1
Content-Length: 8874357
Content-Type: multipart/form-data; boundary=df5b17ff463a4cb3aa61cf02224c7303
--df5b17ff463a4cb3aa61cf02224c7303
Content-Disposition: form-data; name="uploaded_file"; filename="my.csv"
Content-Type: text/csv
"case_id","q1","q2"
234375,3,"sometimes"
234376,2,"always"
...
--------
201 Created
Location: /sources/{source_id}/
POST the file to the sources catalog.
To import from a URL (rather than a local file), use a JSON body with a location property giving the URL.
POST /sources/ HTTP/1.1
Content-Length: 71
Content-Type: application/json
{"location": "https://www.dropbox.com/s/znpoawnhg0rdzhw/iris.csv?dl=1"}
Step three: Add the Source to the Dataset
POST /datasets/{dataset_id}/batches/ HTTP/1.1
Content-Type: application/json
...
{
"element": "shoji:entity",
"body": {
"source": "/sources/{source_id}/"
}
}
--------
202 Accepted
Location: /datasets/{dataset_id}/batches/{batch_id}/
...
{
"element": "shoji:view",
"value": "/progress/{progress_id}/"
}
POST the URL of the just-created source entity (the Location in the 201 response from the previous step) to the batches catalog of the dataset entity created in step 1.
The POST to the batches catalog will return 202 (Accepted) status, and the response body contains a progress URL. Poll that URL to monitor the completion of the batch addition. See Progress for details. The 202 response will also contain a Location header with the URL of the newly created batch.
Metadata document + CSV
This approach may be most natural for importing data from databases that store data by rows. You can dump or export your database to Crunch’s JSON metadata format, plus a CSV of data, and upload those to Crunch, without requiring much back-and-forth with the API.
Step one: Create a Dataset entity with variable definitions
POST /datasets/ HTTP/1.1
Content-Type: application/shoji
Content-Length: 974
...
{
"element": "shoji:entity",
"body": {
"name": "my survey",
...,
"table": {
"element": "crunch:table",
"metadata": {
"educ": {"name": "Education", "alias": "educ", "type": "categorical", "categories": [...], ...},
"color": {"name": "Favorite color", "alias": "color", "type": "text", ...},
"postcode": {"name": "Postal Code", "alias": "postcode", "type": "text", ...},
"state": {"name": "State", "alias": "state", "view": {"geodata": [{"geodatum": , "feature_key": "properties.postal-code"}]}}
},
"tags": {"global": ["educ", "color"]},
"order": ["educ", {'my group": "color"}],
"hidden": [{"My system variables": ["state"]}],
"secure": [{"PII": ["postcode"]}]
},
}
}
--------
202 Accepted
Location: /datasets/{dataset_id}/
...
{
"element": "shoji:view",
"value": "/progress/{progress_id}/"
}
POST a Dataset Entity to the datasets catalog, and in the “body”, include a Crunch Table object with variable definitions and order.
The POST to the dataset catalog will return 202 (Accepted) status, and the response body contains a progress URL. Poll that URL to monitor the completion of dataset creation. See Progress for details. The 202 response will also contain a Location header with the URL of the newly created dataset.
The “metadata” member in the table is an object containing all variable definitions, keyed by variable alias. See Object Reference: Variable Definitions for specific requirements for defining variables of various types, as well as the example below.
The “tags” member in the table is an object containing variables keyed by tag. See variable tagging.
The “order” member is a Shoji Order object specifying the order, potentially hierarchically nested, of the variables in the dataset. The example below illustrates how this can be used. Shoji is JSON, which means the “metadata” object is explicitly unordered. If you wish the variables to have an order, you must supply an order object rather than relying on any order of the “metadata” object.
Additionally, an optional “hidden” member is allowed, which receives a Shoji Order object just like the “order” member. This structure will be used to construct the subfolders of the Hidden folder, and the variables present in it will be considered hidden variables.
Likewise, an optional “secure” member is allowed, which will be used to construct the secure folder. Variables referenced in it will be made secure, and users with view-only access on the dataset will not be allowed to see or use these variables.
Any variable that has been defined with "discarded": true will automatically be appended to the Hidden folder order structure generated from the “order” member. Defining a variable as "discarded": true is a legacy option that cannot be combined with the “hidden” member.
Following the Variable Folders rules, a variable cannot be in two folders simultaneously, so the server will raise validation errors if any variable is present in both the public order and the hidden order.
It is possible to create derived variables, using any of the available derivation functions, in the same request that creates the dataset and its metadata. The variable references inside the derivation expressions must point to declared aliases of variables or subvariables.
POST /datasets/ HTTP/1.1
Content-Type: application/shoji
Content-Length: 3294
...
{
"element": "shoji:entity",
"body": {
"name": "Dataset with derived arrays",
"settings": {
"viewers_can_export": true,
"viewers_can_change_weight": false,
"min_base_size": 3,
"weight": "weight_variable",
"dashboard_deck": null
},
"table": {
"metadata": {
"element": "crunch:table"
"weight_variable": {
"name": "weight variable",
"alias": "weight_variable",
"type": "numeric"
},
"combined": {
"name": "combined CA",
"derivation": {
"function": "combine_categories",
"args": [
{
"variable": "CA1"
},
{
"value": [
{
"combined_ids": [2],
"numeric_value": 2,
"missing": false,
"name": "even",
"id": 1
},
{
"combined_ids": [1],
"numeric_value": 1,
"missing": false,
"name": "odd",
"id": 2
}
]
}
]
}
},
"numeric": {
"name": "numeric variable",
"type": "numeric"
},
"numeric_copy": {
"name": "Copy of numeric",
"derivation": {
"function": "copy_variable",
"args": [{"variable": "numeric"}]
}
},
"MR1": {
"name": "multiple response",
"derivation": {
"function": "select_categories",
"args": [
{
"variable": "CA3"
},
{
"value": [
1
]
}
]
}
},
"CA3": {
"name": "cat array 3",
"derivation": {
"function": "array",
"args": [
{
"function": "make_frame",
"args": [
{
"map": {
"var1": {
"variable": "ca2-subvar-2",
"references": {
"alias": "subvar2",
"name": "Subvar 2"
}
},
"var0": {
"variable": "ca1-subvar-1",
"references": {
"alias": "subvar1",
"name": "Subvar 1"
}
}
}
},
{
"value": ["var1", "var0"]
}
]
}
]
}
},
"CA2": {
"subvariables": [
{
"alias": "ca2-subvar-1",
"name": "ca2-subvar-1"
},
{
"alias": "ca2-subvar-2",
"name": "ca2-subvar-2"
}
],
"type": "categorical_array",
"name": "cat array 2",
"categories": [
{
"numeric_value": null,
"missing": false,
"id": 1,
"name": "yes"
},
{
"numeric_value": null,
"missing": false,
"id": 2,
"name": "no"
},
{
"numeric_value": null,
"missing": true,
"id": -1,
"name": "No Data"
}
]
},
"CA1": {
"subvariables": [
{
"alias": "ca1-subvar-1",
"name": "ca1-subvar-1"
},
{
"alias": "ca1-subvar-2",
"name": "ca1-subvar-2"
},
{
"alias": "ca1-subvar-3",
"name": "ca1-subvar-3"
}
],
"type": "categorical_array",
"name": "cat array 1",
"categories": [
{
"numeric_value": null,
"missing": false,
"id": 1,
"name": "yes"
},
{
"numeric_value": null,
"missing": false,
"id": 2,
"name": "no"
},
{
"numeric_value": null,
"missing": true,
"id": -1,
"name": "No Data"
}
]
}
}
}
}
}
--------
202 Accepted
Location: /datasets/{dataset_id}/
...
{
"element": "shoji:view",
"value": "/progress/{progress_id}/"
}
The example above does a number of things:
- Creates the variables weight_variable and numeric, and the arrays CA1 and CA2.
- Makes a shallow copy of variable numeric as numeric_copy.
- Makes an ad hoc array CA3 reusing subvariables from CA1 and CA2.
- Makes a multiple response view MR1 selecting category 1 from categorical array CA3.
Validation rules
- Each variable mentioned in the metadata must contain a valid variable definition with a matching alias.
- Each array variable definition should contain valid “subvariables” or “subreferences” members.
- An attribute with value null will receive the default value for that attribute.
- An empty order for the dataset will be handled as if no order was passed in.
- An empty hidden (list) value for the dataset will signify there are no hidden variables.
- Each variable must belong to one and only one of the two orders (order or hidden).
- Defining a variable as "discarded": true is a legacy option that cannot be combined with the hidden order for the dataset.
Step two: Add row data by file
POST /datasets/{dataset_id}/batches/ HTTP/1.1
Content-Type: text/csv
Content-Length: 8874357
Content-Disposition: form-data; name="file"; filename="thedata.csv"
...
"educ","color"
3,"red"
2,"yellow"
...
--------
202 Accepted
Location: /datasets/{dataset_id}/batches/{batch_id}/
...
{
"element": "shoji:view",
"value": "/progress/{progress_id}/"
}
By S3 URL:
POST /datasets/{dataset_id}/batches/ HTTP/1.1
Content-Type: application/shoji
Content-Length: 341
...
{
"element": "shoji:entity",
"body": {
"url": "s3://bucket_name/dir/subdir/?accessKey=ASILC6CBA&secretKey=KdJy7ZRK8fDIBQ&token=AQoDYXdzECAa%3D%3D"
}
}
--------
202 Accepted
Location: /datasets/{dataset_id}/batches/{batch_id}/
...
{
"element": "shoji:view",
"value": "/progress/{progress_id}/"
}
POST a CSV file or URL to the new dataset’s batches catalog. The CSV must include a header row of variable identifiers, which should be the aliases of the variables (and array subvariables) defined in step (1).
The CSV may be gzipped. In fact, you are encouraged to gzip it.
The values in the CSV MUST be in the same format as the values you get out of Crunch, and they must match the metadata specified in the previous step. This includes:
- Categorical variables should have data identified by the integer category ids, not strings, and all values must be defined in the “categories” metadata for each variable.
- Datetimes must all be valid ISO 8601 strings
- Numeric variables must have only (unquoted) numeric values
- The only special value allowed is an empty “cell” in the CSV, which will be read as the missing value “No Data”
Violation of any of these validation criteria will result in a 409 (Conflict) response status. To resolve, you can either (1) fix your CSV locally and re-POST it, or (2) PATCH the variable metadata that is invalid and then re-POST the CSV.
Imports are done in “strict” mode by default. Strict imports are faster, and strict mode will alert you if there is any mismatch between data and metadata. However, in some cases it may be convenient to be more flexible and silently ignore or resolve inconsistencies. For example, you may have a large CSV dumped out of a database whose data format isn’t exactly Crunch’s, and it would be costly to read-munge-write the whole file for minor changes. In cases like this, you may append ?strict=0 to the URL of the POST request to loosen that strictness.
With non-strict imports:
- The CSV may contain columns not described by the metadata; these columns will be ignored, rather than returning an error response
- The metadata may describe variables not contained in the CSV; these variables will be filled with missing values, rather than returning an error response
- And more things to come
The CSV can be sent in one of two ways:
- Upload a file by POSTing a multipart form
- POST a Shoji entity with a “url” in the body, containing all necessary auth keys as query parameters. If the URL points to a single file, it should be a CSV or gzipped CSV, as described above. If the URL points to a directory, the contents will be assumed to be (potentially zipped) batches of a CSV and will be concatenated for appending. In the latter case, only the first CSV in the directory listing should contain a header row.
A successful POST returns a 202 Accepted response, as shown above. All rows added in a single request become part of a new Batch, whose URL is returned in the response Location header.
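As a rough sketch, the multipart upload of a gzipped CSV (with strict mode loosened) might look like this in Python with requests; the file name, dataset URL, and token are placeholders.
import gzip
import shutil
import requests
session = requests.Session()
session.cookies.set("token", "<your token>")  # placeholder auth
with open("thedata.csv", "rb") as src, gzip.open("thedata.csv.gz", "wb") as dst:
    shutil.copyfileobj(src, dst)  # gzipping is encouraged but optional
batches_url = "https://[workspace].crunch.io/api/datasets/{dataset_id}/batches/"
with open("thedata.csv.gz", "rb") as f:
    resp = session.post(
        batches_url,
        params={"strict": 0},  # optional: loosen strict-mode validation
        files={"file": ("thedata.csv.gz", f, "text/csv")},
    )
resp.raise_for_status()  # expect 202 Accepted; poll the progress URL in the response body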
Example
Here’s an example dataset metadata and corresponding csv.
Several things to note:
- Everything (metadata, order, and data) is keyed by variable “alias”, not “name”, because Crunch believes that names are for people, not computers, to understand. Aliases must be unique across the whole dataset, while variable “names” must only be unique within their group or array variable.
- For categorical variables, all values in the CSV correspond to category ids, not category names, and also not “numeric_values”, which need not be unique or present for all categories in a variable.
- The array variables defined in the metadata (“allpets” and “petloc”) don’t themselves have columns in the CSV, but all of their “subvariables” do, keyed by their aliases.
- With the exception of those array variable definitions, all variables and subvariables defined in the metadata have columns in the CSV, and there are no columns in the CSV that are not defined in the metadata.
- For internal variables, such as a case identifier in this example, that you don’t want to be visible in the UI, you can add them as “hidden” from the beginning by including them in the “hidden” order for the dataset (see “caseid” in this example).
- Missing values
- Variables with categories (categorical, multiple_response, categorical_array) have missing values defined as categories with "missing": true
- Text, numeric, and datetime variables have missing values defined as “missing_rules”, which can be “value”, “set”, or “range”. See, for example, “q3” and “ndogs”.
- Empty cells in the CSV, if present, will automatically be translated as the “No Data” missing value in Crunch. See, for example, “ndogs_b”.
- Order
- All variables should be referenced by alias in the “order” object, inside a group’s “entities” key. Any omitted variables (in this case, the hidden variable “caseid”) will automatically be added to a group named “ungrouped”.
- Variables may appear in multiple groups.
- Groups may be nested within each other, as in the sketch following this list.
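As a rough illustration of those points, using the shorthand group form shown in the earlier metadata example (the group names and some aliases here are hypothetical):
# Illustrative "order"/"hidden" fragment for the dataset body; group names are hypothetical.
table_fragment = {
    "order": [
        {"Demographics": ["educ", {"Location": ["state"]}]},  # groups may be nested
        {"Survey questions": ["q1", "q3", "ndogs"]},
    ],
    "hidden": [{"My system variables": ["caseid"]}],  # internal variables hidden from the UI
}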
Creating new datasets from Parquet Files
Crunch can use the columnar serialized file format Parquet.
The advantage of using the Parquet format is that importing your existing data into Crunch is faster, since it skips the row-to-column transformation required for row-based formats.
The following describes how you can create a new dataset with your existing data.
Step one: Create a new empty dataset using our datasets endpoint
$ curl -v -b cookie-jar -X POST https://[workspace].crunch.io/api/datasets/ -H 'Content-Type: application/json' -d "{}"
> POST /api/datasets/ HTTP/2
> Host: [workspace].crunch.io
> User-Agent: curl/7.64.1
> Accept: */*
> Cookie: token=c75267f4240d182deca8fbdf73276adf
> Content-Type: application/json
> Content-Length: 2
>
* Connection state changed (MAX_CONCURRENT_STREAMS == 128)!
* We are completely uploaded and fine
< HTTP/2 201
< date: Fri, 29 May 2020 21:03:44 GMT
< content-type: application/json;charset=utf-8
< content-length: 4
< location: https://[workspace].crunch.io/api/datasets/a4b05b6837224707b694932fd520d5be/
< server: nginx
< vary: Cookie, Accept-Encoding
< allow: GET, HEAD, OPTIONS, PATCH, POST
< x-timing:
* Replaced cookie token="c75267f4240d182deca8fbdf73276adf" for domain crunch.io, path /, expire 1622322224
< set-cookie: token=c75267f4240d182deca8fbdf73276adf; Domain=.crunch.io; Max-Age=31536000; Path=/
<
Step two: Upload your parquet file to the parquet endpoint
$ curl -v -b cookie-jar -X POST -H "Content-Type:multipart/form-data" -F "uploaded_file=@varied.pqt;type=application/parquet" https://[workspace].crunch.io/api/sources/
> POST /api/sources/ HTTP/2
> Host: [workspace].crunch.io
> User-Agent: curl/7.64.1
> Accept: */*
> Cookie: token=c75267f4240d182deca8fbdf73276adf
> Content-Length: 4547
> Content-Type: multipart/form-data; boundary=------------------------a675784fbe4be8f9
>
< HTTP/2 201
< date: Fri, 29 May 2020 21:03:55 GMT
< content-type: application/json;charset=utf-8
< content-length: 73
< location: https://[workspace].crunch.io/api/sources/6874b6ed057d442f94aec77e53e8796b/
< server: nginx
< vary: Cookie, Accept-Encoding
< allow: GET, HEAD, OPTIONS, POST
< x-timing:
* Replaced cookie token="c75267f4240d182deca8fbdf73276adf" for domain crunch.io, path /, expire 1622322235
< set-cookie: token=c75267f4240d182deca8fbdf73276adf; Domain=.crunch.io; Max-Age=31536000; Path=/
<
* Connection #0 to host [workspace].crunch.io left intact
{"element": "shoji:view", "self": "https://[workspace].crunch.io/api/sources/"}*
Step three (optional): Upload the metadata you would like to use for your new dataset
Uploading your metadata is necessary if you want to create categorical variables, categorical arrays, or multiple response variables, since Crunch cannot currently auto-generate schema for these data types directly from Parquet:
$ curl -v -b cookie-jar -X POST https://[workspace].crunch.io/api/sources/ -F uploaded_file=@varied.custom.metadata.json
Note: Unnecessary use of -X or --request, POST is already inferred.
> POST /api/sources/ HTTP/2
> Host: [workspace].crunch.io
> User-Agent: curl/7.64.1
> Accept: */*
> Cookie: token=c75267f4240d182deca8fbdf73276adf
> Content-Length: 2057
> Content-Type: multipart/form-data; boundary=------------------------f79c0a47be4adfd8
>
< HTTP/2 201
< date: Fri, 29 May 2020 21:04:16 GMT
< content-type: application/json;charset=utf-8
< content-length: 73
< location: https://[workspace].crunch.io/api/sources/b1e1951933a04c75adcc2bc93d09f548/
< server: nginx
< vary: Cookie, Accept-Encoding
< allow: GET, HEAD, OPTIONS, POST
< x-timing:
* Replaced cookie token="c75267f4240d182deca8fbdf73276adf" for domain crunch.io, path /, expire 1622322256
< set-cookie: token=c75267f4240d182deca8fbdf73276adf; Domain=.crunch.io; Max-Age=31536000; Path=/
<
{"element": "shoji:view", "self": "https://[workspace].crunch.io/api/sources/"}* Closing connection 0
Step four: Send a request to the batches/ endpoint of your new dataset
You must specify the Parquet source (described in step two) and the optional metadata source (described in step three):
$ curl -v -b cookie-jar -X POST -H "content-type:application/parquet" https://[workspace].crunch.io/api/datasets/a4b05b6837224707b694932fd520d5be/batches/ -d '{"body": {"source":"https://[workspace].crunch.io/api/sources/6874b6ed057d442f94aec77e53e8796b/", "schema": "https://[workspace].crunch.io/api/sources/b1e1951933a04c75adcc2bc93d09f548/"}}'
> POST /api/datasets/a4b05b6837224707b694932fd520d5be/batches/ HTTP/2
> Host: [workspace].crunch.io
> User-Agent: curl/7.64.1
> Accept: */*
> Cookie: token=c75267f4240d182deca8fbdf73276adf
> content-type:application/parquet
> Content-Length: 175
>
< HTTP/2 202
< date: Fri, 29 May 2020 21:05:53 GMT
< content-type: application/json;charset=utf-8
< content-length: 252
< location: https://[workspace].crunch.io/api/datasets/a4b05b6837224707b694932fd520d5be/batches/import_batch%3Aa4b05b6837224707b694932fd520d5be%243a968fd9-c1fd-44bc-b6dd-2ca6c7e4a555/
< server: nginx
< vary: Cookie, Accept-Encoding
< allow: GET, HEAD, OPTIONS, POST
< x-timing:
* Replaced cookie token="c75267f4240d182deca8fbdf73276adf" for domain crunch.io, path /, expire 1622322353
< set-cookie: token=c75267f4240d182deca8fbdf73276adf; Domain=.crunch.io; Max-Age=31536000; Path=/
<
* Connection #0 to host [workspace].crunch.io left intact
{"element": "shoji:view", "self": "https://[workspace].crunch.io/api/datasets/a4b05b6837224707b694932fd520d5be/batches/", "value": "https://[workspace].crunch.io/api/progress/import_batch%3Aa4b05b6837224707b694932fd520d5be%243a968fd9-c1fd-44bc-b6dd-2ca6c7e4a555/"}*
Step five: Monitor the progress endpoint
You'll want to monitor the progress endpoint in order to watch the progress of your dataset’s import.
When the progress completes, visit your fully populated dataset—you will not be able to see any of the new data until the import has completed, because the import acts as a single operation in the Crunch system, to maintain transactional integrity. If your import fails along the way, the progress endpoint will provide an error message and the entire import will be aborted:
$ curl -v -b cookie-jar "https://[workspace].crunch.io/api/progress/import_batch%3Aa4b05b6837224707b694932fd520d5be%243a968fd9-c1fd-44bc-b6dd-2ca6c7e4a555/"
> GET /api/progress/import_batch%3Aa4b05b6837224707b694932fd520d5be%243a968fd9-c1fd-44bc-b6dd-2ca6c7e4a555/ HTTP/2
> Host: [workspace].crunch.io
> User-Agent: curl/7.64.1
> Accept: */*
> Cookie: token=c75267f4240d182deca8fbdf73276adf
>
< HTTP/2 200
< date: Wed, 10 Jun 2020 22:10:17 GMT
< content-type: application/json;charset=utf-8
< content-length: 365
< server: nginx
< vary: Cookie, Accept-Encoding
< allow: GET, HEAD, OPTIONS
< x-timing:
* Replaced cookie token="c75267f4240d182deca8fbdf73276adf" for domain crunch.io, path /, expire 1592431817
< set-cookie: token=c75267f4240d182deca8fbdf73276adf; Domain=.crunch.io; Max-Age=604800; Path=/
<
* Connection #0 to host [workspace].crunch.io left intact
{"element": "shoji:view", "self": "https://[workspace].crunch.io/api/progress/import_batch%3Aa4b05b6837224707b694932fd520d5be%243a968fd9-c1fd-44bc-b6dd-2ca6c7e4a555/", "views": {"result": "https://[workspace].crunch.io/api/progress/import_batch%3Aa4b05b6837224707b694932fd520d5be%243a968fd9-c1fd-44bc-b6dd-2ca6c7e4a555/result/"}, "value": {"progress": 100, "complete": "pending"}}* Closing connection 0
Column-by-column
Crunch stores data by column internally, so if your data are stored in a column-major format as well, importing by column may be the most efficient way to import data.
Step one: Create a Dataset entity
POST /datasets/ HTTP/1.1
Content-Type: application/shoji
Content-Length: 974
...
{
"element": "shoji:entity",
"body": {
"name": "my survey",
...
}
}
--------
201 Created
Location: /datasets/{dataset_id}/
ds <- createDataset("my survey")
POST a Dataset Entity to the datasets catalog, just as in the first import method.
Step two: Add Variable definitions and column data
POST /datasets/{dataset_id}/variables/ HTTP/1.1
Content-Type: application/shoji
Content-Length: 38475
...
{
"element": "shoji:entity",
"body": {
"name": "Gender",
"alias": "gender",
"type": "categorical",
"categories": [
{
"name": "Male",
"id": 1,
"numeric_value": null,
"missing": false
},
{
"name": "Female",
"id": 2,
"numeric_value": null,
"missing": false
},
{
"name": "Skipped",
"id": 9,
"numeric_value": null,
"missing": true
}
],
"values": [1, 9, 1, 2, 2, 1, 1, 1, 1, 2, 9, 1]
}
}
--------
201 Created
Location: /datasets/{dataset_id}/variables/{variable_id}/
# Here's a similar example. R's factor type becomes "categorical".
gender.names <- c("Male", "Female", "Skipped")
gen <- factor(gender.names[c(1, 3, 1, 2, 2, 1, 1, 1, 1, 2, 3, 1)],
levels=gender.names)
# Assigning an R vector into a dataset will create a variable entity.
ds$gender <- gen
POST a Variable Entity to the newly created dataset’s variables catalog, and include with that Entity definition a “values” key that contains the column of data. Do this for all columns in your dataset.
If the values attribute is not present, the new column will be filled with “No Data” in all rows.
The data passed in values can correspond to either the full data column for the new variable or a single value, in which case it will be used to fill up the column.
In the case of arrays, the single value should be a list containing the correct categorical values.
If the type of the values passed in does not correspond with the variable’s type, the server will return a 400 response indicating the error and the variable will not be created.
Note that the lengths of the columns of data you include in the “values” key must be the same for all variables, though if you’re importing from a normal, rectangular data store, this should already be the case.
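For instance, a sketch of a definition that fills the whole column with a single value (shown as a Python dict; the alias and name are hypothetical):
# Hypothetical variable definition whose column is filled with a single repeated value.
wave_def = {
    "element": "shoji:entity",
    "body": {
        "name": "Survey wave",
        "alias": "wave",
        "type": "numeric",
        "values": 1,  # a single value fills every row of the new column
    },
}
# POST this to /datasets/{dataset_id}/variables/ just like the full-column example above.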
Appending Data
Appending data to an existing Dataset is not much different from uploading the initial data; both use a “Batch” resource, which represents the process of importing the data from the source into the dataset. Once you have created a Source for your data, POST its URL to datasets/{id}/batches/ to start the import process. That process may take some time, depending on the size of the data. To check when the append has completed, use the progress URL returned in the body of the response. The returned Location header is the URI of the new Batch. User interfaces may choose to show a progress meter or some other widget here.
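A minimal sketch of that flow with Python and requests (file names, URLs, and the token are placeholders):
import requests
session = requests.Session()
session.cookies.set("token", "<your token>")  # placeholder auth
api = "https://[workspace].crunch.io/api"
with open("wave2.csv", "rb") as f:  # the new rows to append
    src = session.post(api + "/sources/",
                       files={"uploaded_file": ("wave2.csv", f, "text/csv")})
source_url = src.headers["Location"]
batch = session.post(api + "/datasets/{dataset_id}/batches/",
                     json={"element": "shoji:entity", "body": {"source": source_url}})
progress_url = batch.json()["value"]  # poll this URL to know when the append has completed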
Adding a subsequent Source
Once you have created a Dataset, you can upload new files and append rows to the same Dataset as often as you like. If the structure of each file is the same as that of the first uploaded file, Crunch should automatically pass your new rows through exactly the same process as the old rows. If there are any derived variables in your Dataset, new data will be derived in the new rows following the same rules as the old data. You can follow the progress as above via the progress endpoint.
Let’s look at an example: you had uploaded an initial CSV of 3 columns, A, B and C. Then:
- The Crunch system automatically converted column A from the few strings that were found in it to a Categorical type.
- You derived a new column D that consisted of B * C.
Then you decide to upload another CSV of new rows. What will happen?
When you POST to create the second Batch, the service will: 1) match up the new A with the old A and cast the new strings to existing categories by name, and 2) fill column D for you with B * C for each new row.
However, from time to time, the new source has significant differences: a new variable, a renamed variable, and other changes. When you append the first Source to a Dataset, there is nothing with which to conflict. But a subsequent POST to batches/ may result in a conflict if the new source cannot be confidently reconciled with the existing data. Even though you get a 201 Created response for the new batch resource, it will have a status of “conflict”.
Reporting and Resolving Conflicts
When you append a Source to an existing Dataset, the system attempts to match up the new data with the old. If the source’s schema can be automatically aligned with the target Dataset, the new rows from the Batch are appended. When things go wrong, however, the Batch (or the progress endpoint) can be inspected to see what conflicted with the target (or vice-versa, in some cases!).
GET the new Batch:
GET /api/datasets/{dataset_id}/batches/{batch_id}/ HTTP/1.1
...
--------
200 OK
Content-Type: application/shoji
{
element: "shoji:entity",
self: "https://crunch.io/api/datasets/30e4a927/batches/import_batch%3A30e4a92bca%24aa3f6b85-8215/",
description: "A batch for this Dataset. Each batch is a set of rows which were inserted together.",
body: {
status: "error",
source: null,
error: "Type of 'Count' does not match target and cannot be converted.",
id: "import_batch:30e4a92bca$aa3f6b85-8215",
progress: -1
}
}
If any variable conflicts, the "error" message will contain a description of the conflict. For example, if the new variable “Count” had a type that could not be converted to the type of the existing variable “Count”, the Batch resource would contain the above message. Only unresolvable conflicts are shown; if a variable is not reported in the conflicts object, it was appended cleanly.
See Batches for more details on batch entities and conflicts.
Streaming rows
Existing datasets are best sent to Crunch as a single Source, or a handful of subsequent Sources if gathered monthly or on some other schedule. Sometimes, however, you want to “stream” data to Crunch as it is being gathered, even one row at a time, rather than in a single post-processing phase. You do not want to make each row its own batch (it’s simply not worth the overhead). Instead, make a Stream and send rows to it, then periodically create a Source and Batch from it.
Send rows to a stream
To send one or more rows to a dataset stream, simply POST one or more lines of line-delimited JSON to the dataset’s stream endpoint:
{"var_id_1": 1, "var_id_2": "a"}
from pycrunch import importing
# Map variable aliases to ids up front if you prefer to key rows by id.
by_alias = ds.variables.by('alias')
while True:
    row = my_system.read_a_row()  # `my_system` stands in for your data collection system
    importing.importer.stream_rows(ds, {
        'gender': row['gender'],
        'age': row['age']
    })
Streamed values must be keyed either by id or by alias. The variable ids/aliases must correspond to existing variables in the dataset. The Python code shows how to efficiently map aliases to ids. The data must match the target variable types so that we can process the row as quickly as possible. We want no casting or other guesswork slowing us down here. Among other things, this means that categorical values must be represented as Crunch’s assigned category ids, not names or numeric values.
You may also send more than one row at a time if you prefer. For example, your data collection system may already post-process row data in, say, 5 minute increments. The more rows you can send together, the less overhead spent processing each one and the more you can send in a given time. Send multiple lines of line-delimited JSON, or if using pycrunch, a list of dicts rather than a single dict.
Each time you send a POST, all of the rows in that POST are assembled into a new message which is added to the stream. Each message can contain one or more rows of data.
As when creating a new source, don’t worry about sending values for derived variables; Crunch will fill these out for you for each row using the data you send.
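For example, a sketch with pycrunch that buffers rows and sends them in chunks (the chunk size and the `my_system` reader are illustrative):
from pycrunch import importing
buffer = []
for row in my_system.read_rows():  # `my_system` stands in for your data collection system
    buffer.append({"gender": row["gender"], "age": row["age"]})
    if len(buffer) >= 100:  # flush in chunks rather than one row per POST
        importing.importer.stream_rows(ds, buffer)  # a list of dicts sends several rows in one message
        buffer = []
if buffer:
    importing.importer.stream_rows(ds, buffer)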
Append the new rows to the dataset
The above added new rows to the Stream resource so that you can be confident that your data is completely safe with Crunch. To append those rows to the dataset requires another step. You could stream rows and then, once they are all assembled, append them all as a single Source to the dataset. However, if you’re streaming rows at intervals it’s likely you want to append them to the dataset at intervals, too. But doing so one row at a time is usually counter-productive; it slows the rate at which you can send rows, balloons metadata, and interrupts users who are analyzing the data.
Instead, you control how often you want the streamed rows to be appended to the dataset. When you’re ready, POST to /datasets/{id}/batches/ and provide the “stream” member, plus any extra metadata the new Source should possess:
{
"stream": null,
"type": "ldjson",
"name": "My streamed rows",
"description": "Yet Another batch from the stream"
}
ds.batches.create({"body": {
"stream": None,
"type": "ldjson",
"name": "My streamed rows",
"description": "Yet Another batch from the stream"
}})
The “stream” member tells Crunch to acquire the data from the stream to form this Batch. The “stream” member must be null; the system will then acquire all currently pending messages (any new messages that arrive during the formation of this Batch will be queued and not fetched). If there are no pending messages, 409 Conflict is returned instead of 201/202 for the new Batch.
Pending rows will be added automatically
Every hour, the Crunch system goes through all datasets, and for each that has pending streamed data, it batches up the pending rows and adds them to the dataset automatically, as long as the dataset is not currently in use by someone. That way, streamed data will magically appear in the dataset for the next time a user loads it, but if a user is actively working with the dataset, the system won’t update their view of the data and disrupt their session.
See Stream for more details on streams.
Combining Datasets
Combining datasets consists of creating a new dataset formed by stacking a list of datasets together. It works under the same rules as a normal append.
To create a new dataset combined from others, it is necessary to POST to the datasets catalog indicating a combine_datasets expression:
POST /api/datasets/
{
"element": "shoji:entity",
"body": {
"name": "My combined dataset",
"description": "Consists on dsA and dsB",
"derivation": {
"function": "combine_datasets",
"args": [
{"dataset": "https://[workspace].crunch.io/api/datasets/dsabc/"},
{"dataset": "https://[workspace].crunch.io/api/datasets/ds123/"}
]
}
}
}
The server will verify that the authenticated user has view permission on all of the datasets; otherwise it will return a 400 error.
The resulting dataset will consist of the matched union of all included datasets, with the rows in the same order. Private/public variable visibility and exclusion filters will be honored in the result.
Transformations during combination
The combine procedure follows the normal append matching rules, which means that any mismatch in aliases or types will cause it to fail, and the result is limited to the union of variables present across the included datasets.
It is possible to provide transformations on the datasets to ensure that they line up during the combination phase, and to add extra columns of constant per-dataset metadata to the combined result.
Each {"dataset"} argument allows an extra frame key containing a function expression that describes the desired dataset transformation, for example:
{
"dataset": "<dataset_url>",
"frame": {
"function": "make_frame",
"args": [{
"map": {
"*": {"variable": "*"},
"dataset_id": {
"value": "<dataset_id>",
"type": "text",
"references": {
"name": "Dataset ID",
"alias": "dataset_id"
}
}
}
}]
}
}
Selecting a subset of variables to combine
In the same fashion that it is possible to add extra variables via the dataset transforms, it is also possible to select which variables to include.
{
"dataset": "<dataset_url>",
"frame": {
"function": "make_frame",
"args": [{
"map": {
"A": {"variable": "A"},
"B": {"variable": "B"},
"C": {"variable": "C"},
"dataset_id": {
"value": "<dataset_id>",
"type": "text",
"references": {
"name": "Dataset ID",
"alias": "dataset_id"
}
}
}
}]
}
}
In this example, the expression indicates to only include variables with IDs A, B, and C from the referenced dataset as well as add the new extra variable dataset_id. This would effectively append only these 4 variables instead of the full dataset’s variables.
Merging and Joining Datasets
Crunch supports joining variables from one dataset to another by a key variable that maps rows from one to the other. To add a snapshot of those variables to the dataset, POST an adapt function expression to its variables catalog.
POST /api/datasets/{dataset_id}/variables/ HTTP/1.1
Host: [workspace].crunch.io
Content-Type: application/json
{
"function": "adapt",
"args": [{
"dataset": "https://[workspace].crunch.io/api/datasets/{other_id}/"
}, {
"variable": "https://[workspace].crunch.io/api/datasets/{other_id}/variables/{right_key_id}/"
}, {
"variable": "https://[workspace].crunch.io/api/datasets/{dataset_id}/variables/{left_key_id}/"
}]
}
-----
HTTP/1.1 202 Accepted
{
"element": "shoji:view",
"self": "https://[workspace].crunch.io/api/datasets/{dataset_id}/variables/",
"value": "https://[workspace].crunch.io/api/progress/5be82a/"
}
Please note that in the args array above, the right key variable comes before the left key variable.
A successful request returns 202 (Accepted) status with a progress resource in the response body; poll that to track the status of the asynchronous job that adds the data to your dataset.
Currently Crunch only supports left joins: all rows of the left (current) dataset will be kept, and only rows from the right (incoming) dataset that have a key value present in the left dataset will be brought in. If the value of the join key in the left dataset matches multiple values in the right dataset, only the first row matched in the right dataset will be used in the join. Rows in the left dataset that do not have a corresponding row in the right dataset will be filled with missing values for the incoming variables. If there are duplicate join key values in the left dataset, then the same row from the right dataset will be used multiple times in the join.
The join key must be of type “numeric” or “text” and must be the same type in both datasets.
Joining a subset of variables
To select certain variables to bring over from the right dataset, wrap the adapt function described above in a select function expression:
POST /api/datasets/{dataset_id}/variables/ HTTP/1.1
Host: [workspace].crunch.io
Content-Type: application/json
{
"function": "select",
"args": [{
"map": {
"{right_var1_id}/": {
"variable": "https://[workspace].crunch.io/api/datasets/{other_id}/variables/{right_var1_id}/"
},
"{right_var2_id}/": {
"variable": "https://[workspace].crunch.io/api/datasets/{other_id}/variables/{right_var2_id}/"
},
"{right_var3_id}/": {
"variable": "https://[workspace].crunch.io/api/datasets/{other_id}/variables/{right_var3_id}/"
}
}
}],
"frame": {
"function": "adapt",
"args": [{
"dataset": "https://[workspace].crunch.io/api/datasets/{other_id}/"
}, {
"variable": "https://[workspace].crunch.io/api/datasets/{other_id}/variables/{right_key_id}/"
}, {
"variable": "https://[workspace].crunch.io/api/datasets/{dataset_id}/variables/{left_key_id}/"
}]
}
}
-----
HTTP/1.1 202 Accepted
{
"element": "shoji:view",
"self": "https://[workspace].crunch.io/api/datasets/{dataset_id}/variables/",
"value": "https://[workspace].crunch.io/api/progress/5be82a/"
}
Joining a subset of rows
Rows to consider from the right dataset can also be filtered. To do so, include a filter attribute on the payload, containing either a filter expression, wrapped under {"expression": <expr>}, or an existing filter entity URL (from the right-side dataset), wrapped as {"filter": <url>}.
POST /api/datasets/{dataset_id}/variables/ HTTP/1.1
Host: [workspace].crunch.io
Content-Type: application/json
{
"function": "adapt",
"args": [{
"dataset": "https://[workspace].crunch.io/api/datasets/{other_id}/"
}, {
"variable": "https://[workspace].crunch.io/api/datasets/{other_id}/variables/{right_key_id}/"
}, {
"variable": "https://[workspace].crunch.io/api/datasets/{dataset_id}/variables/{left_key_id}/"
}],
"filter": {
"expression": {
"function": "==",
"args": [
{"variable": "https://[workspace].crunch.io/api/datasets/{other_id}/variables/{variable_id}/"},
{"value": "<value>"}
]
}
}
}
You can filter both rows and variables in the same request. Note that the “filter” parameter remains on the top-level function of the expression, which, when specifying a variable subset, is “select” rather than “adapt”.
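A skeleton of such a combined payload, shown as a Python dict (the variable and dataset IDs are placeholders); note "filter" sitting next to the top-level "select":
# Skeleton payload joining a variable subset with a row filter; IDs and URLs are placeholders.
other = "https://[workspace].crunch.io/api/datasets/{other_id}"
payload = {
    "function": "select",
    "args": [{"map": {
        "{right_var1_id}/": {"variable": other + "/variables/{right_var1_id}/"},
    }}],
    "frame": {
        "function": "adapt",
        "args": [
            {"dataset": other + "/"},
            {"variable": other + "/variables/{right_key_id}/"},
            {"variable": "https://[workspace].crunch.io/api/datasets/{dataset_id}/variables/{left_key_id}/"},
        ],
    },
    "filter": {"expression": {
        "function": "==",
        "args": [{"variable": other + "/variables/{variable_id}/"}, {"value": "<value>"}],
    }},
}
# POST `payload` as JSON to /api/datasets/{dataset_id}/variables/.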
Versioning Datasets
All Crunch datasets keep track of the changes you make to them, from the initial import, through name changes and deriving new variables, to appending new rows. You can review the changes to see who did what and when, revert to a previous version, “fork” a dataset to make a copy of it, make changes to the copy, and merge those changes back into the original dataset.
Actions
The list of changes is available in the dataset/{id}/actions/ catalog. GET it and sort/filter by the “datetime” and/or “user” members as desired. Follow the links to an individual action entity to get exact details about what changed.
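For instance, a sketch that fetches the catalog and orders its entries client-side (assuming, as described above, that the index tuples carry "datetime" and "user"; the token is a placeholder):
import requests
session = requests.Session()
session.cookies.set("token", "<your token>")  # placeholder auth
actions_url = "https://[workspace].crunch.io/api/datasets/{dataset_id}/actions/"
catalog = session.get(actions_url).json()
# The catalog "index" maps each action URL to a summary tuple.
for url, tup in sorted(catalog["index"].items(), key=lambda kv: kv[1].get("datetime", "")):
    print(tup.get("datetime"), tup.get("user"), url)  # GET `url` for exact details of the change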
Viewing Changes Diff
Through the actions catalog it’s possible to retrieve the differences of a “fork” dataset from its “upstream” dataset.
Two endpoints are provided to do so, the dataset/{id}/actions/since_forking and the dataset/{id}/actions/upstream_delta endpoints.
The dataset/{id}/actions/since_forking endpoint will return the state of the fork and of the upstream, and the list of actions that were performed on the fork since the two diverged.
>>> forkds.actions.since_forking
pycrunch.shoji.View(**{
"self": "https://[workspace].crunch.io/api/datasets/051ebb979db44523822ffe29236a6670/actions/since_forking/",
"value": {
"dataset": {
"modification_time": "2017-02-16T11:01:41.807000+00:00",
"revision": "58a586950183667486130f0c",
"id": "051ebb979db44523822ffe29236a6670",
"name": "My fork"
},
"actions": [
{
"hash": "2a863871-c809-4cad-a20c-9fea86b9e763",
"state": {
"failed": false,
"completed": true,
"played": true
},
"params": {
"variable": "fab0c81d16b442089cc50019cf610961",
"definition": {
"alias": "var1",
"type": "text",
"name": "var1",
"id": "fab0c81d16b442089cc50019cf610961"
},
"dataset": {
"id": "051ebb979db44523822ffe29236a6670",
"branch": "master"
},
"values": [
"sample sentence",
"sample sentence",
"sample sentence",
"sample sentence",
"sample sentence",
"sample sentence",
"sample sentence"
],
"owner_id": null
},
"key": "Variable.create"
}
],
"upstream": {
"modification_time": "2017-02-16T11:01:40.131000+00:00",
"revision": "58a586940183667486130efc",
"id": "2730c0744cba4d7c9acc9f3551380e49",
"name": "My Dataset"
}
},
"element": "shoji:view"
})
GET /api/datasets/5de96a/actions/since_forking HTTP/1.1
Host: [workspace].crunch.io
Content-Type: application/json
Content-Length: 1769
{
"element": "shoji:view",
"value": {
"dataset": {
"modification_time": "2017-02-16T11:01:41.807000+00:00",
"revision": "58a586950183667486130f0c",
"id": "051ebb979db44523822ffe29236a6670",
"name": "My fork"
},
"actions": [
{
"hash": "2a863871-c809-4cad-a20c-9fea86b9e763",
"state": {
"failed": false,
"completed": true,
"played": true
},
"params": {
"variable": "fab0c81d16b442089cc50019cf610961",
"definition": {
"alias": "var1",
"type": "text",
"name": "var1",
"id": "fab0c81d16b442089cc50019cf610961"
},
"dataset": {
"id": "051ebb979db44523822ffe29236a6670",
"branch": "master"
},
"values": [
"sample sentence",
"sample sentence",
"sample sentence",
"sample sentence",
"sample sentence",
"sample sentence",
"sample sentence"
],
"owner_id": null
},
"key": "Variable.create"
}
],
"upstream": {
"modification_time": "2017-02-16T11:01:40.131000+00:00",
"revision": "58a586940183667486130efc",
"id": "2730c0744cba4d7c9acc9f3551380e49",
"name": "My Dataset"
}
}
}
The dataset/{id}/actions/upstream_delta endpoint’s usage and response match those of since_forking, but the returned actions are the ones that were performed on the upstream since the two diverged.
Savepoints
You can snapshot the current state of the dataset at any time with a POST to datasets/{id}/savepoints/. This marks the current point in the actions history, allowing you to provide a description of your progress.
The response will contain a Location header that will lead to the new version created.
If the new version can be created quickly, a 201 response will be issued; if it takes too long, a 202 response will be issued and the creation will proceed in the background. In the case of a 202 response, the body will be a Shoji view containing a progress URL where you may query the progress.
>>> svp = ds.savepoints.create({"body": {"description": "TestSVP"}})
pycrunch.shoji.Entity(**{
"body": {
"creation_time": "2017-05-09T14:18:07.761000+00:00",
"version": "master__000003",
"user_name": "captain-68305620",
"description": "",
"last_update": "2017-05-09T14:18:07.761000+00:00"
},
"self": "http://local.crunch.io:19404/api/datasets/5283e3f4e3d645c0a750c09e854bdcb1/savepoints/6fbe47c97d8e4290a0c09227d6d6b63a/",
"views": {
"revert": "http://local.crunch.io:19404/api/datasets/5283e3f4e3d645c0a750c09e854bdcb1/savepoints/6fbe47c97d8e4290a0c09227d6d6b63a/revert/"
},
"element": "shoji:entity"
})
There is no guarantee that creating a savepoint will produce a savepoint pointing to the exact revision the dataset was at when the POST was issued, because the dataset might have moved forward in the meantime. For this reason, instead of responding with a Location header that points to an exact savepoint, the POST to the savepoints endpoint responds with a Location header that points to a /progress/{operation_id}/result URL, which, when accessed, will redirect to the nearest savepoint for that revision.
Reverting savepoints
You can revert to any savepoint version (throwing away any changes since that time) with a POST to /datasets/{dataset_id}/savepoints/{version_id}/revert/.
It will return a 202 response with a Shoji:view containing a progress URL on its value where the asynchronous job’s status can be observed.
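A sketch of the revert call with Python and requests (URLs and the token are placeholders):
import requests
session = requests.Session()
session.cookies.set("token", "<your token>")  # placeholder auth
revert_url = ("https://[workspace].crunch.io/api/datasets/{dataset_id}"
              "/savepoints/{version_id}/revert/")
resp = session.post(revert_url)  # expect 202 Accepted
progress_url = resp.json()["value"]  # poll this URL until the revert has finished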
Forking and Merging
A common pattern when collaborating on a dataset is for one person to make changes on their own and then, when all is ready, share the whole set of changes back to the other collaborators. Crunch implements this with two mechanisms: the ability to “fork” a dataset to make a copy, and then “merge” any changes made to it back to the original dataset.
To fork a dataset, POST a new fork entity to the dataset’s forks catalog.
>>> ds.forks.index
{}
>>> forked_ds = ds.forks.create({"body": {"name": "My fork"}}).refresh()
>>> ds.forks.index.keys() == [forked_ds.self]
True
>>> ds.forks.index[forked_ds.self]["name"]
"My fork"
The response will be 201 if the fork could complete within the allotted time limit for the request, or 202 if the fork requires more time and will continue in the background. Both cases include a Location header with the URL of the new dataset that has been forked from the current one.
POST /api/datasets/{id}/forks/ HTTP/1.1
Host: [workspace].crunch.io
Content-Type: application/json
Content-Length: 231
{
"element": "shoji:entity",
"body": {"name": "My fork"}
}
----
HTTP/1.1 201 Created
Location: https://[workspace].crunch.io/api/datasets/{forked_id}/
In the case of a 202, in addition to the Location header with the URL of the fork that is going to be created, the response will contain a Shoji view with the URL of an endpoint that can be polled to track fork completion.
POST /api/datasets/{id}/forks/ HTTP/1.1
Host: [workspace].crunch.io
Content-Type: application/json
Content-Length: 231
{
"element": "shoji:entity",
"body": {"name": "My fork"}
}
----
HTTP/1.1 202 Accepted
Location: https://[workspace].crunch.io/api/datasets/{forked_id}/
...
{
"element": "shoji:view",
"value": "/progress/{progress_id}/"
}
The forked dataset can then be viewed and altered like the original; however, those changes do not alter the original until you merge them back with a POST to datasets/{id}/actions/.
ds.actions.post({
"element": "shoji:entity",
"body": {"dataset": forked_ds.self, "autorollback": True}
})
POST /api/datasets/5de96a/actions/ HTTP/1.1
Host: [workspace].crunch.io
Content-Type: application/json
Content-Length: 231
{
"element": "shoji:entity",
"body": {
"dataset": {forked ds URL},
"autorollback": true
}
}
----
HTTP/1.1 204 No Content
*or*
HTTP/1.1 202 Accepted
{
"element": "shoji:view",
"self": "https://[workspace].crunch.io/api/datasets/5de96a/actions/",
"value": "https://[workspace].crunch.io/api/progress/912ab3/"
}
The POST to the actions catalog tells the original dataset to replay a set of actions; since we specify a “dataset” url, we are telling it to replay all actions from the forked dataset. Crunch keeps track of which actions are already common between the two datasets, and won’t try to replay those. You can even make further changes to the forked dataset and merge again and again.
Use the “autorollback” member to tell Crunch how to handle merge conflicts. If an action cannot be replayed on the original dataset (typically because it had conflicting changes or has been rolled back), then if “autorollback” is true (the default), the original dataset will be reverted to the previous state before any of the new changes were applied. If “autorollback” is false, the dataset is left to the last action that it could successfully play, which allows you to investigate the problem, repair it if possible (in either dataset as needed), and then POST again to continue the merge from that point.
Per-user settings (filters, decks and slides, variable permissions etc) are copied to the new dataset when you fork. However, changes to them are not merged back at this time. Please reach out to us as you experiment so we can fine-tune which details to fork and merge as we discover use cases.
Merging actions may take a few seconds, in which case the POST to actions/ will return 204 when finished. Merging many or large actions, however, may take longer, in which case the POST will return 202 with a Location header containing the URL of a Progress resource.
Filtered Merges
When merging actions it is possible to provide a filter to select which actions should be replayed from the other dataset. It is currently possible to filter them by key and by hash.
When filtering by hash, only the provided actions will be merged:
ds.actions.post({
"element": "shoji:entity",
"body": {"dataset": forked_ds.self,
"filter": {"hash": ["000003"]}}
})
When filtering by key, only the actions that are part of that category will be merged:
ds.actions.post({
"element": "shoji:entity",
"body": {"dataset": forked_ds.self,
"filter": {"key": ["Variable.create"]}}
})
Recording the filtered actions
If you know that you are going to merge the same two datasets multiple times, you can tell Crunch to remember which actions were filtered out, so that a subsequent merge to the same target won’t try to apply them again if they were skipped in a previous merge.
To do so, provide a remember: True option alongside the filter; the filtered actions will be recorded, and a subsequent merge won’t try to apply them to the target if they are not explicitly filtered again.
ds.actions.post({
"element": "shoji:entity",
"body": {"dataset": forked_ds.self,
"remember": True,
"filter": {"key": ["Variable.create"]}}
})
Tagging Datasets
Tags allow you to view only a subset of the variables inside the dataset.
Adding tags
You can add tags to the dataset by POSTing to the tags endpoint.
POST /api/datasets/{dataset_id}/tags/ HTTP/1.1
Host: [workspace].crunch.io
Content-Type: application/json
{
"body": {
"name": "<name_of_new_tag>",
"variables": [
"<var_url_1>",
"<var_url_2>",
"..."
]
}
}
-----
HTTP/1.1 202 Accepted
{
"element": "shoji:view",
"self": "https://[workspace].crunch.io/api/datasets/{dataset_id}/tags/{<name_of_new_tag>}"
}
Filtering by tag
You can filter variables by tag using the folders endpoint; some examples are shown below, followed by a short request sketch:
- /datasets/{id}/folders/?tag=tag1
- /datasets/{id}/folders/?tag=tag1&tag=tag%202
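For example, with Python and requests (the tag names and token are placeholders):
import requests
session = requests.Session()
session.cookies.set("token", "<your token>")  # placeholder auth
folders_url = "https://[workspace].crunch.io/api/datasets/{dataset_id}/folders/"
# Repeat the `tag` parameter to filter by more than one tag.
resp = session.get(folders_url, params=[("tag", "tag1"), ("tag", "tag 2")])
print(resp.json())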
Set the selected tags
Use the dataset preferences to set which tags are selected. This can be used by API clients to filter variables using the method explained above.
{
"element": "shoji:entity",
"self": "https://[workspace].crunch.io/api/datasets/223fd4/preferences/",
"body": {
"selected_tags": ["tag name"]
}
}
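A minimal sketch of setting that preference from Python, assuming the preferences entity accepts PATCH like other Shoji entities (the tag name and token are placeholders):
import requests
session = requests.Session()
session.cookies.set("token", "<your token>")  # placeholder auth
prefs_url = "https://[workspace].crunch.io/api/datasets/{dataset_id}/preferences/"
session.patch(prefs_url,
              json={"element": "shoji:entity", "body": {"selected_tags": ["tag name"]}})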