GitHub

Host open data sets on GitHub

An open dataset contains three parts:

  1. The dataset itself, as a .csv file, a .json file, or a directory of such files.
  2. The scripts used to generate the dataset, e.g. scrapers, data cleaning logic, and data transformations. See Dataprep for more information.
  3. A README.md that gives the basic information about the dataset.

Create a new file called README.md:

GitHub new readme

Write the description file. It usually covers the data source, the research background, the data fields, the data size, the limitations, and the license. If you do not know which licenses are available, we suggest a Creative Commons 4.0 license. The file is written in Markdown, which has a simple syntax that is legible both as plain text and as rendered HTML.

GitHub html rendered

The overall shape of an open dataset looks like this (you are looking at the rendered version of the markdown file):

Shape of dataset

See homework2 for a complete example.

Note: The "limitation" section is an important part of your README file. For example, you may only be able to crawl 95% of the original dataset due to technical problems. Highlighting that in your description file is crucial for other people who base their analysis on your dataset. No dataset is ideal, and an incomplete dataset is still valuable. The principle is full reporting.
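The figures that go into the README (data fields, data size, and how much of the source was actually collected) can be computed rather than typed by hand. The following is a minimal sketch using only the standard library. The function name `summarize_csv`, the sample rows, and the `expected_rows` figure are all hypothetical and stand in for your own data.

```python
import csv
import io


def summarize_csv(text, expected_rows=None):
    """Return field names, row count, and optional coverage for CSV text."""
    reader = csv.DictReader(io.StringIO(text))
    rows = list(reader)
    summary = {"fields": reader.fieldnames or [], "rows": len(rows)}
    if expected_rows:
        # Share of the original dataset that was actually collected.
        summary["coverage"] = len(rows) / expected_rows
    return summary


# Hypothetical sample data, for illustration only.
sample = "name,district,price\nA,Central,120\nB,Mong Kok,80\n"
print(summarize_csv(sample, expected_rows=4))
```

A coverage value below 1.0 is exactly the kind of fact that belongs in the "limitation" section.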

How to download a file from GitHub web page

Example: We will use data from Openrice for a restaurant analysis. Assume that we have already collected a certain amount of data from Openrice and saved it into a csv file.

The csv file can be downloaded here.

Pandas Csv Sample

Click "Raw" in the upper right corner.

Pandas Csv Raw

You will see the raw csv file, as shown below.

Csv Raw Data

Right click (or Control+click on a Mac) and choose "Save as".

Csv Save As

The file can then be saved in csv (comma-separated values) format.

Csv Saved
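The same download can be scripted. The sketch below assumes the usual GitHub URL layout, where a file page at `github.com/OWNER/REPO/blob/BRANCH/PATH` has its raw content at `raw.githubusercontent.com/OWNER/REPO/BRANCH/PATH`. The helper names `to_raw_url` and `download` and the example URL are hypothetical, and the conversion assumes a branch name without slashes.

```python
from urllib.parse import urlparse
from urllib.request import urlretrieve


def to_raw_url(page_url):
    """Convert a GitHub file page URL to its raw-content URL."""
    parts = urlparse(page_url).path.strip("/").split("/")
    # Expected shape: owner/repo/blob/branch/path/to/file
    if len(parts) < 5 or parts[2] != "blob":
        raise ValueError("not a GitHub file page URL: " + page_url)
    owner, repo, _, branch = parts[:4]
    file_path = "/".join(parts[4:])
    return f"https://raw.githubusercontent.com/{owner}/{repo}/{branch}/{file_path}"


def download(page_url, dest):
    """Save the raw file behind a GitHub file page to a local path."""
    urlretrieve(to_raw_url(page_url), dest)


# Hypothetical URL, for illustration only.
print(to_raw_url("https://github.com/example-user/example-repo/blob/master/data/sample.csv"))
```

Calling `download(page_url, "sample.csv")` needs network access and a URL that points at a real file.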

Why should we preview Jupyter notebooks on NBViewer? Is there any relationship with GitHub?

One can preview a Python notebook directly on GitHub. However, GitHub prohibits JavaScript execution for security reasons. If you have interactive charts, e.g. from echart or plotly, they will not render on GitHub. NBViewer supports JavaScript and it is the first free online tool to preview Python notebooks, so we recommend it. For concrete examples of dynamic charts, @ChicoXYC can find a notebook in our project archive: https://github.com/data-projects-archive .
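The relationship between the two services is visible in the URL: NBViewer can load a notebook straight from a GitHub repository. A minimal sketch, assuming NBViewer's `nbviewer.org/github/OWNER/REPO/blob/BRANCH/PATH` address scheme; the function name `to_nbviewer_url` and the example repository are hypothetical.

```python
from urllib.parse import urlparse


def to_nbviewer_url(github_url):
    """Build an NBViewer link for a notebook hosted on GitHub."""
    parsed = urlparse(github_url)
    if parsed.netloc != "github.com":
        raise ValueError("expected a github.com URL: " + github_url)
    # NBViewer mirrors the GitHub path under the /github prefix.
    return "https://nbviewer.org/github" + parsed.path


# Hypothetical notebook location, for illustration only.
print(to_nbviewer_url("https://github.com/example-user/example-repo/blob/master/charts.ipynb"))
```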

How to change the default branch for GitHub Pages?

Please see here.

gh-pages

What is index.html?

Basically, index.html is the default file served by the web server, so visiting example.com is equivalent to visiting example.com/index.html. Naming your file index.html gives a more concise address in the browser's address bar and in communication campaigns; on the web, shorter names are usually better. More explanations are here.
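This behaviour can be checked locally. The sketch below uses Python's built-in `http.server`, which serves `index.html` for a directory request just as most web servers do. It starts a throwaway server on a free local port, then fetches `/` and `/index.html` and returns both bodies. The names `QuietHandler` and `fetch_both` are hypothetical.

```python
import http.server
import tempfile
import threading
import urllib.request
from functools import partial
from pathlib import Path


class QuietHandler(http.server.SimpleHTTPRequestHandler):
    def log_message(self, *args):
        pass  # Silence per-request logging.


def fetch_both(html="<h1>hello</h1>"):
    """Serve a temp directory and fetch '/' and '/index.html'."""
    with tempfile.TemporaryDirectory() as root:
        Path(root, "index.html").write_text(html, encoding="utf-8")
        handler = partial(QuietHandler, directory=root)
        # Port 0 asks the operating system for any free port.
        server = http.server.ThreadingHTTPServer(("127.0.0.1", 0), handler)
        threading.Thread(target=server.serve_forever, daemon=True).start()
        # Bypass any configured proxy for this local request.
        opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
        try:
            base = f"http://127.0.0.1:{server.server_address[1]}"
            with opener.open(base + "/") as response:
                bare = response.read()
            with opener.open(base + "/index.html") as response:
                explicit = response.read()
        finally:
            server.shutdown()
            server.server_close()
    return bare, explicit


bare, explicit = fetch_both()
print(bare == explicit)
```

Both requests return the same bytes, which is why the shorter address is enough.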

Any real-world examples of using the GitHub issue tracker?

Use the issue tracker as a Q&A forum:

Use the issue tracker as a blog post backend:

  • @fouber's blog, written in Chinese by a senior frontend engineer.

Use the issue tracker as a web comment store: