Collectiontools

Collectiontools is a web scraping tool that retrieves only the text from sites such as FAQ pages. It is suited to sites that generate a separate page for each inquiry.

Tested environment

OS: CentOS 7.9 (3.10.0-1160.45.1.el7.x86_64)
Python Version: 3.6.8
pip Version: 9.0.3 from /usr/lib/python3.6/site-packages (python 3.6)

Prerequisites

yum install python3
pip3.6 install bs4
pip3.6 install urllib3
pip3.6 install chardet

How to use

git clone https://github.com/nw-engineer/collectiontools.git
cd collectiontools/bin
vim collect.sh

Set the following variables in collect.sh:

INT=1
The position of the first email to retrieve.
MAX=2
The number of emails to retrieve.
URL=http://xxxxxx.co.jp/list
The base URL of the inquiry site.

bash collect.sh
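
As a rough illustration of how the three settings relate, the sketch below lists the pages a run would visit. It rests on assumptions this README does not confirm: that each email lives at the base URL followed by its position (for example `/list/1`), and that MAX is a count rather than an end position. The function name `page_urls` is hypothetical and is not part of collect.sh or collect.py.

```python
def page_urls(base_url, start, count):
    """Return the URLs of `count` pages beginning at position `start`.

    Assumption: pages are addressed as <base_url>/<position>.
    Adjust the format string to match the real site.
    """
    if count < 0:
        raise ValueError("count must not be negative")
    base = base_url.rstrip("/")  # avoid a double slash when joining
    return [f"{base}/{n}" for n in range(start, start + count)]


if __name__ == "__main__":
    # Mirrors INT=1, MAX=2 and the placeholder URL from collect.sh.
    for url in page_urls("http://xxxxxx.co.jp/list", 1, 2):
        print(url)
```

If the target site numbers its pages differently (query parameters, zero padding), only the format string inside the list comprehension would need to change.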

The tool retrieves the contents of the <title> and <pre> elements. If the body text on your site is in an element other than <pre>, modify the following Python script to match the site's markup.

collectiontools/bin/collect.py
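
To show what switching the target element involves, here is a minimal sketch of title-plus-body extraction with the tag name as a parameter. It is not the code in collect.py: it uses Python's standard `html.parser` module instead of bs4 so that it runs without extra packages, and the names `TextExtractor`, `extract_text`, and `body_tag` are hypothetical.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect the text inside <title> and inside one configurable tag."""

    def __init__(self, body_tag="pre"):
        super().__init__(convert_charrefs=True)
        self.body_tag = body_tag
        self._current = None   # tag being captured right now, if any
        self._depth = 0        # nesting depth of the captured tag
        self.title = ""
        self.body_parts = []   # one string per top-level body_tag element

    def handle_starttag(self, tag, attrs):
        if self._current is None and tag in ("title", self.body_tag):
            self._current = tag
            self._depth = 1
            if tag == self.body_tag:
                self.body_parts.append("")
        elif tag == self._current:
            self._depth += 1   # same tag nested inside itself

    def handle_endtag(self, tag):
        if tag == self._current:
            self._depth -= 1
            if self._depth == 0:
                self._current = None

    def handle_data(self, data):
        if self._current == "title":
            self.title += data
        elif self._current is not None:
            self.body_parts[-1] += data


def extract_text(html, body_tag="pre"):
    """Return (title, [text of each body_tag element]) for an HTML string."""
    parser = TextExtractor(body_tag)
    parser.feed(html)
    parser.close()
    return parser.title.strip(), [part.strip() for part in parser.body_parts]
```

Calling `extract_text(page_html)` targets `<pre>`, while `extract_text(page_html, body_tag="div")` would target `<div>` instead. Markup nested inside the target element is dropped and only its text is kept.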

The output for each email is saved in the following directory.

collectiontools/collect-mail/
