37 commits
607055b
Add file for listing used libraries
hechmik May 14, 2019
f0e242c
Add method for random headers generation, sleep between requests.
hechmik May 14, 2019
f68ea44
Externalise the get genre part, wrap all code into a function
hechmik May 14, 2019
65a0596
fixed bugs such as libraries, get_page usage and os random pick for u…
hechmik May 14, 2019
2abab60
first working version with random user agent and sleep between requests
hechmik May 14, 2019
ea44a69
Add basic documentation, modularized code
hechmik May 14, 2019
a55b284
Example property file
hechmik May 14, 2019
70922cf
Completed refactor, read many parameter from property file
hechmik May 14, 2019
e65e51a
Read vgchartz url from config json
hechmik May 15, 2019
d5ab1f4
Add entry for log filename
hechmik May 15, 2019
4ed64bf
Improved documentation, add logging to both stdout and file
hechmik May 15, 2019
8a3e28f
Upgraded to HTTPS
hechmik Mar 4, 2020
b31b21d
Use https in lambda for skipping first 10 elements
hechmik Mar 4, 2020
1c86411
Update README.md
Pelirrojo Mar 30, 2020
fb81a13
Update README.md
Pelirrojo Mar 30, 2020
dd64927
Create requirements.txt
Pelirrojo Mar 30, 2020
c732358
Update .gitignore
Pelirrojo Mar 30, 2020
35cff1d
Update README.md
Pelirrojo Mar 30, 2020
0f5aec3
Update README.md
Pelirrojo Mar 30, 2020
d8f2173
Update README.md
Pelirrojo Mar 30, 2020
6685ad3
Update .gitignore
Pelirrojo Mar 30, 2020
33a53d8
Refactor in functions
Pelirrojo Mar 30, 2020
05824e2
Merge branch 'develop-hechmik'
Pelirrojo Mar 30, 2020
414602c
Add an exception to manage the main loop wrapping the functions call
Pelirrojo Mar 30, 2020
43dbe5e
Explode query parameters
Pelirrojo Mar 30, 2020
779fcad
I love functions with named parameters sorry XD
Pelirrojo Mar 30, 2020
826476d
Fix padding spaces due to html parsing
Pelirrojo Mar 30, 2020
a8bf656
Updating doc
Pelirrojo Mar 30, 2020
6aa2b6c
Update README.md
Pelirrojo Mar 31, 2020
6a8c2a2
Folder reorganize and bump dependencies
Pelirrojo Mar 31, 2020
c4f9ff8
script to easy run
Pelirrojo Mar 31, 2020
0e48d8d
Update documentation and add some TODOs
Pelirrojo Mar 31, 2020
7f5719e
Improve script output
Pelirrojo Mar 31, 2020
214a853
Refactor data saving to add more data and parse full dates instead on…
Pelirrojo Mar 31, 2020
381c264
Updating Doc
Pelirrojo Mar 31, 2020
1b88322
Updating Doc
Pelirrojo Mar 31, 2020
510aa49
Merge pull request #1 from Machine-Learning-Labs/master
hechmik May 3, 2020
5 changes: 5 additions & 0 deletions .gitignore
@@ -1,3 +1,8 @@
.idea
.vcs

*.csv

# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
69 changes: 61 additions & 8 deletions README.md
@@ -1,13 +1,66 @@
vgchartzfull is a python script based on BeautifulSoup.
It creates a dataset based on data from
http://www.vgchartz.com/gamedb/
# vgchartzfull - A crawler to download data from Global Videogame Sales

The dataset is saved as vgsales.csv.
vgchartz-full-crawler.py is a Python 3 crawler script based on BeautifulSoup.
It creates a CSV dataset covering more than 57,000 games, based on data from the [VGChartz Site](http://www.vgchartz.com/gamedb/).

## Output

The dataset is saved to the file set by `output_filename` in cfg/resources.json, by default "dataset/vgsales.csv".

## Install & execution

You will need the dependencies listed in **requirements.txt**.

You will need to have BeautifulSoup added.
It can be installed by pip.

sudo pip install BeautifulSoup
```bash

# Install dependencies
$> pip install -r requirements.txt

# Run
$> python vgchartz-full-crawler.py


```

## Dictionary

The dataset is composed of the following fields. The data is collected according to this [methodology](https://www.vgchartz.com/methodology.php).

| Field | Description |
|-------|--------------------------|
| Rank | Ranking of overall sales |
| Name | The game's name |
| Genre | Genre of the game |
| Platform | Platform of the game's release (e.g. PC, PS4) |
| Developer | Developer of the game |
| Publisher | Publisher of the game |
| Vgchartz_Score | Score on the VGChartz site |
| Critic_Score | Score given by critics |
| User_Score | Score given by VGChartz site users |
| Total_Shipped | Total worldwide shipments (in millions) |
| Total_Sales | Total worldwide sales (in millions) |
| NA_Sales | Sales in North America (in millions) |
| EU_Sales | Sales in Europe (in millions) |
| JP_Sales | Sales in Japan (in millions) |
| Other_Sales | Sales in the rest of the world (in millions) |
| Release_Date | Year of the game's release |
| Last_Update | Last update of this record |
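
As a rough illustration of how the fields above might be consumed, the sketch below parses rows with Python's standard `csv` module. It assumes the header row uses the field names from the table, that the separator is the default `,`, and that missing values are written as empty cells; none of this is confirmed by the crawler itself. The sample rows are made-up placeholders, not real VGChartz figures, and the helper names `to_float` and `load_rows` are hypothetical.

```python
import csv
import io

# Made-up sample in the documented column layout (not real VGChartz data).
SAMPLE = """Rank,Name,Genre,Platform,Publisher,Total_Sales,NA_Sales
1,Example Game A,Sports,PC,Example Publisher,1.50,0.75
2,Example Game B,Puzzle,PS4,Example Publisher,,0.25
"""


def to_float(value):
    """Convert a sales cell to float; empty cells become None (assumed convention)."""
    value = value.strip()
    return float(value) if value else None


def load_rows(handle, separator=","):
    """Yield one dict per game, with the *_Sales columns converted to numbers."""
    for row in csv.DictReader(handle, delimiter=separator):
        for key in row:
            if key.endswith("_Sales"):
                row[key] = to_float(row[key])
        yield row


rows = list(load_rows(io.StringIO(SAMPLE)))
```

To read the real output instead of the sample, the same function could be given `open("dataset/vgsales.csv", encoding="utf-8", newline="")`, matching the default path and encoding in cfg/resources.json.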

## TODO

- [ ] Remap the columns according to the values selected in resources.json
- [ ] Add some unit tests
- [ ] Dockerize (w/ alpine-python) to ease use and avoid installations
- [ ] Publish at Docker hub

## Links

* [vgchartz.com](https://www.vgchartz.com)
* [Original Crawler](https://github.com/GregorUT/vgchartzScrape)
* [Kaggle Dataset](https://www.kaggle.com/gregorut/videogamesales)

## Acknowledgements

Thanks to Chris Albon.
http://chrisalbon.com/python/beautiful_soup_scrape_table.html
Thanks to [Chris Albon](http://chrisalbon.com/python/beautiful_soup_scrape_table.html)
44 changes: 44 additions & 0 deletions cfg/resources.json
@@ -0,0 +1,44 @@
{
  "application_log_filename": "log/app.log",
  "output_filename": "dataset/vgsales.csv",
  "separator": ",",
  "encoding": "utf-8",
  "start_page": 1,
  "end_page": 2,
  "include_genre": false,
  "base_page_url": "https://www.vgchartz.com/gamedb/?page=",
  "query_parameters": {
    "results": 100,
    "region": "All",
    "boxart": "Both",
    "banner": "Both",
    "ownership": "Both",
    "showmultiplat": "No",
    "order": "Sales",
    "showtotalsales": 1,
    "showpublisher": 1,
    "showvgchartzscore": 1,
    "shownasales": 1,
    "showdeveloper": 1,
    "showcriticscore": 1,
    "showpalsales": 1,
    "showreleasedate": 1,
    "showuserscore": 1,
    "showjapansales": 1,
    "showlastupdate": 1,
    "showothersales": 1,
    "showshipped": 1,
    "keyword": null,
    "console": null,
    "developer": null,
    "publisher": null,
    "goty_year": null,
    "genre": null
  },
  "minimum_sleep_time": 6,
  "maximum_sleep_time": 15,
  "minimum_major_version": 1,
  "maximum_major_version": 56,
  "minimum_minor_version": 1,
  "maximum_minor_version": 10
}
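
A minimal sketch of how these settings might be turned into a request URL and a delay between requests. This is not taken from the crawler's code: the function names `build_page_url` and `pick_sleep_seconds` are hypothetical, and it assumes that `null` query parameters are simply left out of the URL and that the sleep time is drawn uniformly between the two configured bounds. The real script may do either differently.

```python
import json
import random
from urllib.parse import urlencode

# Trimmed copy of the settings above; the real file has more keys.
CONFIG_TEXT = """
{
  "base_page_url": "https://www.vgchartz.com/gamedb/?page=",
  "query_parameters": {"results": 100, "order": "Sales", "keyword": null},
  "minimum_sleep_time": 6,
  "maximum_sleep_time": 15
}
"""


def build_page_url(config, page):
    """Append the page number, then every query parameter that is not null."""
    params = {k: v for k, v in config["query_parameters"].items() if v is not None}
    url = f"{config['base_page_url']}{page}"
    # Assumption: null parameters are omitted rather than sent as empty values.
    return f"{url}&{urlencode(params)}" if params else url


def pick_sleep_seconds(config, rng=random):
    """Pick a random delay in seconds between the configured bounds."""
    return rng.uniform(config["minimum_sleep_time"], config["maximum_sleep_time"])


config = json.loads(CONFIG_TEXT)
url = build_page_url(config, 1)
delay = pick_sleep_seconds(config)
```

A caller would then pass `delay` to `time.sleep` before fetching the next page, looping from `start_page` to `end_page`.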
1 change: 1 addition & 0 deletions dataset/.gitkeep
@@ -0,0 +1 @@
Git doesn't like empty folders
1 change: 1 addition & 0 deletions log/.gitkeep
@@ -0,0 +1 @@
Git doesn't like empty folders
8 changes: 8 additions & 0 deletions requirements.txt
@@ -0,0 +1,8 @@
beautifulsoup4==4.8.2
bs4==0.0.1
numpy==1.18.2
pandas==1.0.3
python-dateutil==2.8.1
pytz==2019.3
six==1.14.0
soupsieve==2.0
13 changes: 13 additions & 0 deletions run.sh
@@ -0,0 +1,13 @@
#!/usr/bin/env bash

python --version >/dev/null 2>&1 || { echo >&2 "I require python@3 utility but it's not installed. ¯\_(ツ)_/¯ Aborting."; exit 1; }
pip --version >/dev/null 2>&1 || { echo >&2 "I require pip utility but it's not installed. ¯\_(ツ)_/¯ Aborting."; exit 1; }

clear

# printf is used because bash's echo does not expand "\n" without -e
printf '\nInstalling deps...\n'
pip install -r requirements.txt

printf '\nStart crawling... (remember a crawler is the friend nobody likes)\n'
python vgchartz-full-crawler.py
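
The `python --version` check in this script confirms only that some interpreter is on the PATH, not that it is Python 3. One possible complement, sketched here on the assumption that it would sit at the top of the crawler script (the function name `is_python3` is hypothetical and not part of this change), is to test the version from Python itself:

```python
import sys


def is_python3(version_info=sys.version_info):
    """Return True when the interpreter's major version is 3 or newer."""
    return version_info[0] >= 3


# Stop early with a readable message instead of failing later on Python 2 syntax.
if not is_python3():
    sys.exit("Python 3 is required to run the crawler.")
```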
