Marketplaces and Packaging #44
Unanswered
davidgasquez
asked this question in
Q&A
Replies: 2 comments 1 reply
You could bootstrap all Some quick code to do that. 👇

    import os

    import requests


    def search_code(query, page=1):
        """Return one page of GitHub code search results for `query`."""
        url = "https://api.github.com/search/code"
        headers = {
            "Accept": "application/vnd.github+json",
            # The token is read from the BEARER environment variable
            "Authorization": f"Bearer {os.getenv('BEARER')}",
        }
        # Define the query parameters
        params = {"q": query, "page": page}
        # Make the GET request
        response = requests.get(url, headers=headers, params=params)
        response.raise_for_status()
        return response.json()


    # Collect results page by page until an empty page comes back
    all_items = []
    page = 1
    while page:
        items = search_code("filename:datapackage.yaml", page)["items"]
        all_items.extend(items)
        if not items:
            page = None
        else:
            page = page + 1

The ideal would be to have marketplaces to be A potential CLI would be able to run
Frictionless
Frictionless standards provide a lightweight and minimal abstraction layer (data packages are JSON/YAML files) on top of data files to make them easier to use. Adhering to the Frictionless specs makes it easier to integrate into the existing community and interoperate with all the datasets and tools already built.
Another interesting side effect of the Frictionless design philosophy is that it allows everyone to package datasets in a permissionless way. You don't need to move the data, just wrap it with a simple metadata file.
It's already being used by organizations like Our World in Data, cooperatives like Catalyst, and many other places.
We need to solve the problem of "packaging data" as a community. Frictionless is a great starting point as it only takes someone to write a plugin/extension to integrate a new platform/format/scheme/portal into the ecosystem.
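To make the "simple metadata file" idea concrete, here is a minimal sketch that builds a data package descriptor in Python and serializes it to JSON. The package name, resource name, and URL are placeholders invented for illustration; the `make_descriptor` helper is hypothetical, and only the `name`, `resources`, `path`, and `format` keys are taken from the Data Package spec.

```python
import json


def make_descriptor(name, resource_name, url):
    """Build a minimal data package descriptor that points at remote data."""
    return {
        "name": name,
        "resources": [
            {
                "name": resource_name,
                # The data stays where it is; the package only references it
                "path": url,
                "format": "csv",
            }
        ],
    }


# Placeholder values for illustration only
descriptor = make_descriptor(
    "example-package",
    "example-table",
    "https://example.com/data/table.csv",
)
datapackage_json = json.dumps(descriptor, indent=2)
print(datapackage_json)
```

Writing `datapackage_json` to a file named `datapackage.json` next to (or far away from) the data would be enough for tools that understand the spec to find the resource.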
Why don't you use X instead?
I've tried quite a few Data Package Managers. Frictionless is the simplest and most flexible one. It also has reasonable adoption and an active community.
That said, I'm open to other options. If you have a better idea, let's chat!
Why should people use this instead of doing their own thing?
If everybody converged on it (e.g., "datapackage.json" as a metadata and schema description standard), an ecosystem of utilities and libraries for processing data could take advantage of it.
How would you make datasets immutable?
Datasets could be IPFS native. Clients could either fetch the data via IPFS or use a public Gateway.
In the end, the Frictionless abstraction is just a URL. We can use anything we want in the backend as long as we provide a way to read the data. In this case:
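One way to picture the gateway fallback is a small helper that rewrites an `ipfs://` path into an HTTP gateway URL. This is only a sketch and not part of Frictionless: the `to_gateway_url` name, the default gateway, and the `<cid>` placeholder are assumptions made for illustration.

```python
def to_gateway_url(path, gateway="https://ipfs.io"):
    """Rewrite an ipfs:// path into an HTTP URL served by a public gateway."""
    prefix = "ipfs://"
    if not path.startswith(prefix):
        # Already a regular URL or local path; leave it untouched
        return path
    # Gateways expose content under /ipfs/<cid>/<optional path>
    return f"{gateway.rstrip('/')}/ipfs/{path[len(prefix):]}"


# "<cid>" is a placeholder, not a real content identifier
print(to_gateway_url("ipfs://<cid>/data.csv"))
```

A client with native IPFS support could use the `ipfs://` path directly, while everyone else reads the same bytes through the rewritten HTTP URL.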
How would you backup datasets?
Depending on the dataset, this feature could be pushed to the hosting layer. If you publish on HuggingFace, you get versioning and backup for free! Once the data is there, we can rely on the `_cache` property of the Frictionless Specs (or a `_backup` one) to point to the previous backup.
How would you make datasets discoverable?
This is something we have to do as a community. A great start is to create Catalogs. Storing the Catalog definitions in places like GitHub will make it easy to discover them and surface the best ones. In the end, a data package is only a URL!
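A catalog could be as small as a list of package URLs with a little metadata. The sketch below assumes a made-up catalog layout (`title`, `keywords`, `url`) and a hypothetical `find_packages` helper; neither is defined by the Frictionless specs, and the URLs are placeholders.

```python
# Hypothetical catalog: each entry points at a data package descriptor URL
CATALOG = [
    {
        "title": "Example energy data",
        "keywords": ["energy", "climate"],
        "url": "https://example.com/energy/datapackage.json",
    },
    {
        "title": "Example population data",
        "keywords": ["population"],
        "url": "https://example.com/population/datapackage.json",
    },
]


def find_packages(catalog, term):
    """Return the URLs of entries whose title or keywords mention `term`."""
    term = term.lower()
    return [
        entry["url"]
        for entry in catalog
        if term in entry["title"].lower()
        or any(term in keyword.lower() for keyword in entry["keywords"])
    ]


print(find_packages(CATALOG, "climate"))
```

Because each entry is just a URL, a catalog like this can live in a Git repository and be extended through pull requests.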
How would you make datasets interoperable?
The tabular resource representation can be an Arrow tabular dataset. With that, we get access to the Apache Arrow ecosystem. Data should be just a `resources.to_arrow()` command away!
Additionally, using a file system abstraction like `fsspec` makes it easy to interact with different "remotes" like S3, GCS, HDFS, etc.
I want to package a dataset on platform X. How would I do that?
The Frictionless ecosystem is extensible via plugins/extensions. You can create a plugin to integrate any platform with the Frictionless ecosystem. For example, you can create a plugin to integrate HuggingFace datasets so your package looks something like this:
Some interesting plugin ideas might be to integrate with Socrata (Simon Willison did something similar), with Kaggle Datasets, or with Datalad.