Marketplaces and Packaging #44

davidgasquez · 2023-04-19T08:04:55Z

davidgasquez
Apr 19, 2023
Maintainer

Frictionless

Frictionless standards provide a lightweight and minimal abstraction layer (data packages are JSON/YAML files) on top of data files to make them easier to use. Adhering to the Frictionless specs makes it easier to integrate into the existing community and interoperate with all the datasets and tools already built.

Another interesting side effect of the Frictionless design philosophy is that it allows everyone to package datasets in a permissionless way. You don't need to move the data, just wrap it with a simple metadata file.
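As a rough sketch of what "wrapping" could look like, the snippet below builds a descriptor that points at data hosted elsewhere and serializes it as the text of a datapackage.json file. The dataset name, resource name, and URL are placeholders invented for illustration, and `to_datapackage_json` is a hypothetical helper, not part of any Frictionless library.

```python
import json

# Placeholder values: the data stays at its original location,
# and the descriptor only points at it.
descriptor = {
    "name": "example-dataset",
    "resources": [
        {
            "name": "example-data",
            "path": "https://example.com/data.csv",
        }
    ],
}


def to_datapackage_json(descriptor):
    """Serialize a descriptor dict as the text of a datapackage.json file."""
    return json.dumps(descriptor, indent=2)


text = to_datapackage_json(descriptor)
print(text)
```

Nothing is copied or uploaded here: publishing the package amounts to sharing that small metadata file.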

It's already being used by organizations like Our World in Data, cooperatives like Catalyst, and many other places.

We need to solve the problem of "packaging data" as a community. Frictionless is a great starting point as it only takes someone to write a plugin/extension to integrate a new platform/format/scheme/portal into the ecosystem.

Why don't you use X instead?

I've tried quite a few data package managers. Frictionless is the simplest and most flexible one. It also has reasonable adoption and an active community.

That said, I'm open to other options. If you have a better idea, let's chat!

Why should people use this instead of doing their own thing?

If everybody converged on it (e.g., "datapackage.json" as the standard for describing metadata and schemas), an ecosystem of utilities and libraries for processing data could build on it.
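To illustrate why a shared schema description helps, here is a minimal sketch of a generic utility that casts CSV columns using only the `fields` list of a schema. It uses the standard library rather than a Frictionless package, handles just three field types, and the schema and data are made up for the example.

```python
import csv
import io

# Casting functions for a few schema field types.
CASTERS = {"string": str, "integer": int, "number": float}


def read_rows(csv_text, schema):
    """Parse CSV text and cast each column using the schema's field types."""
    types = {f["name"]: CASTERS[f["type"]] for f in schema["fields"]}
    reader = csv.DictReader(io.StringIO(csv_text))
    return [{key: types[key](value) for key, value in row.items()} for row in reader]


# Made-up schema and data, for illustration only.
schema = {
    "fields": [
        {"name": "city", "type": "string"},
        {"name": "population", "type": "integer"},
    ]
}
rows = read_rows("city,population\nSpringfield,1000\n", schema)
```

The function knows nothing about this particular dataset, so any package that ships a schema in the same shape could reuse it.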

How would you make datasets immutable?

Datasets could be IPFS native. Clients could either fetch the data via IPFS or use a public Gateway.

name: my-dataset
resources:
  - name: my-data
    type: table
    path: bafkreidgvpkjawlxz6sffxzwgooowe5yt7i6wsyg236mfoks77nywkptdq
    scheme: ipfs

In the end, the Frictionless abstraction is just a URL. We can use anything we want in the backend as long as we provide a way to read the data. In this case:

from frictionless import Package

ipfs_package = Package("my-dataset-datapackage.yaml")  # Could even be Package("bafyreca4sf...")
ipfs_resource = ipfs_package.get_resource("my-data")

ipfs_resource.to_pandas()
# Hypothetical sql() helper; the hyphenated table name has to be quoted
ipfs_resource.sql('SELECT * FROM "my-data"')
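One way a client could handle the `scheme: ipfs` resource is to rewrite its path into a public gateway URL before fetching. The sketch below assumes the ipfs.io gateway and its `/ipfs/<cid>` path layout; `resolve_path` is a hypothetical helper, not part of Frictionless.

```python
def resolve_path(resource, gateway="https://ipfs.io"):
    """Return a fetchable URL for a resource descriptor.

    Resources with scheme "ipfs" are rewritten to a gateway URL;
    anything else is returned unchanged.
    """
    if resource.get("scheme") == "ipfs":
        return f"{gateway}/ipfs/{resource['path']}"
    return resource["path"]


resource = {
    "name": "my-data",
    "path": "bafkreidgvpkjawlxz6sffxzwgooowe5yt7i6wsyg236mfoks77nywkptdq",
    "scheme": "ipfs",
}
url = resolve_path(resource)
```

Because the CID identifies the content itself, the same descriptor works against any gateway or a local IPFS node by changing only the `gateway` argument.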

How would you backup datasets?

Depending on the dataset, this feature could be pushed to the hosting layer. If you publish on Hugging Face, you get versioning and backup for free! Once the data is there, we can rely on the _cache property of the Frictionless Specs (or a _backup one) to point to the previous backup.

How would you make datasets discoverable?

This is something we have to do as a community. A great start is to create Catalogs. Storing the Catalog definitions in places like GitHub will make it easy to discover them and surface the best ones. In the end, a data package is just a URL!

datasets:
  - name: airport-codes
    package: https://raw.githubusercontent.com/datasets/airport-codes/master/datapackage.json
  - name: country-codes
    package: https://raw.githubusercontent.com/datasets/country-codes/master/datapackage.json
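A client could treat such a catalog as a simple name-to-URL index. The sketch below mirrors the catalog above as a Python dict (to stay free of a YAML dependency); `find_package` is a hypothetical helper name.

```python
# The catalog above, written as a Python dict.
catalog = {
    "datasets": [
        {
            "name": "airport-codes",
            "package": "https://raw.githubusercontent.com/datasets/airport-codes/master/datapackage.json",
        },
        {
            "name": "country-codes",
            "package": "https://raw.githubusercontent.com/datasets/country-codes/master/datapackage.json",
        },
    ]
}


def find_package(catalog, name):
    """Return the package URL registered under `name`, or None if absent."""
    for entry in catalog["datasets"]:
        if entry["name"] == name:
            return entry["package"]
    return None


package_url = find_package(catalog, "country-codes")
```

The returned URL is all that is needed to load the package, which is what keeps catalogs cheap to create and to merge.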

How would you make datasets interoperable?

The tabular resource representation can be an Arrow tabular dataset. With that, we get access to the Apache Arrow ecosystem. Data should be just a resources.to_arrow() command away!

Additionally, using a file system abstraction like fsspec makes it easy to interact with different "remotes" like S3, GCS, HDFS, etc.

I want to package a dataset on platform X. How would I do that?

The Frictionless ecosystem is extensible via plugins/extensions. You can create a plugin to integrate any platform with the Frictionless ecosystem. For example, you can create a plugin to integrate HuggingFace datasets so your package looks something like this:

name: hf-dataset
title: Hugging Face Dataset
resources:
  - name: rotten_tomatoes
    type: table
    path: rotten_tomatoes
    format: huggingface
    schema:
      fields:
        - name: text
          type: string
        - name: label
          type: integer

Some interesting plugin ideas might be to integrate with Socrata (Simon Willison did something similar), with Kaggle Datasets, or with Datalad.
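The core idea behind such a plugin is dispatching on the resource's `format`. The following is a plain-Python illustration of that dispatch, not Frictionless's actual plugin API; `LOADERS`, `register`, and `load` are hypothetical names, and the Hugging Face loader is a stub that fetches nothing.

```python
# Maps a resource format name to the function that knows how to load it.
LOADERS = {}


def register(format_name):
    """Decorator that registers a loader function for a resource format."""
    def decorator(func):
        LOADERS[format_name] = func
        return func
    return decorator


@register("huggingface")
def load_huggingface(resource):
    # A real plugin would call the platform's client library here.
    # This stub only reports what it would load.
    return f"would load Hugging Face dataset '{resource['path']}'"


def load(resource):
    """Dispatch to the loader registered for the resource's format."""
    try:
        loader = LOADERS[resource["format"]]
    except KeyError:
        raise ValueError(f"no plugin for format {resource['format']!r}") from None
    return loader(resource)


result = load({"name": "rotten_tomatoes", "path": "rotten_tomatoes", "format": "huggingface"})
```

Supporting Socrata, Kaggle, or Datalad would then mean registering one more loader, with no change to the packages themselves.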

davidgasquez · 2023-10-18T14:48:01Z

davidgasquez
Oct 18, 2023
Maintainer Author

You could bootstrap a starting collection of datasets from every datapackage.json on GitHub. Basically, get these files: https://github.com/search?q=path%3A**%2Fdatapackage.json&type=code

Some quick code to do that. 👇

import os

import requests


def search_code(query, page=1):
    """Return one page of GitHub code search results for `query`."""
    url = "https://api.github.com/search/code"
    headers = {
        "Accept": "application/vnd.github+json",
        # Token is read from the BEARER environment variable
        "Authorization": f"Bearer {os.getenv('BEARER')}",
    }

    # Define the query parameters
    params = {
        "q": query,
        "page": page,
    }

    # Make the GET request
    response = requests.get(url, headers=headers, params=params)
    response.raise_for_status()
    return response.json()


query = "filename:datapackage.json"

all_items = []
page = 1

# Keep requesting pages until one comes back empty.
# Note: the Search API only exposes the first 1,000 results of a query.
while page:
    items = search_code(query, page)["items"]
    all_items.extend(items)
    if not items:
        page = None
    else:
        page = page + 1
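Each search result carries an `html_url` pointing at the file's page on github.com. To turn the results into loadable package URLs, one option is to rewrite that into the raw.githubusercontent.com form. This is a sketch; `to_raw_url` is a hypothetical helper and the sample item below is illustrative rather than real API output.

```python
def to_raw_url(html_url):
    """Convert a github.com file URL into its raw.githubusercontent.com form."""
    prefix = "https://github.com/"
    if not html_url.startswith(prefix):
        raise ValueError(f"not a github.com URL: {html_url}")
    # Expected layout: owner/repo/blob/<ref>/<path to file>
    owner, repo, blob, rest = html_url[len(prefix):].split("/", 3)
    if blob != "blob":
        raise ValueError(f"not a file URL: {html_url}")
    return f"https://raw.githubusercontent.com/{owner}/{repo}/{rest}"


# Illustrative item with only the field used here.
sample_item = {
    "html_url": "https://github.com/datasets/airport-codes/blob/master/datapackage.json"
}
raw_url = to_raw_url(sample_item["html_url"])
```

Applying it to every entry in `all_items` would give a list of URLs in the same shape as a catalog's `package` values.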

0 replies

davidgasquez · 2025-10-13T07:43:48Z

davidgasquez
Oct 13, 2025
Maintainer Author

Ideally, marketplaces would be git repositories, URLs with a properly formatted /marketplace.json file, local folders, ...

A potential CLI could then run cli marketplace add user-or-org/repo-name to enable one.
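A sketch of how such a command might classify its argument follows. Everything here is an assumption made for illustration: the function name, the rule that local folders start with ".", "/" or "~", and the choice to expand the owner/repo shorthand to a GitHub clone URL.

```python
import re


def resolve_marketplace(source):
    """Classify a marketplace source and return a (kind, location) pair."""
    if source.startswith(("http://", "https://")):
        # URLs are expected to serve a marketplace.json at their root.
        return "url", source.rstrip("/") + "/marketplace.json"
    if source.startswith((".", "/", "~")):
        # Assumption: local folders are written as explicit paths.
        return "folder", source
    if re.fullmatch(r"[\w-]+/[\w.-]+", source):
        # Assumption: owner/repo shorthand refers to a GitHub repository.
        return "git", f"https://github.com/{source}.git"
    raise ValueError(f"unrecognized marketplace source: {source}")


git_source = resolve_marketplace("user-or-org/repo-name")
url_source = resolve_marketplace("https://example.com/data/")
folder_source = resolve_marketplace("./marketplaces/local")
```

Requiring an explicit path prefix for folders avoids the ambiguity between a relative path like `data/local` and an owner/repo shorthand.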

1 reply

davidgasquez Mar 26, 2026
Maintainer Author

Alternatively, reuse AT Protocol and have collections that do that (e.g., cli marketplace add nasa.com).
