All notable changes to this project will be documented in this file. The format of this file follows these guidelines. This project adheres to Semantic Versioning.
- Nothing
- selenium 4.1.0 -> 4.5.0
- CircleCI `setup_remote_docker` version 19.03.13 -> 20.10.17
- Nothing
- Nothing
- `install-chrome.sh` was failing - error message suggested it needs an `apt-get update -y` before `apt-get install` so added the update
- dev-env 0.6.19 -> 0.6.21
- change `generate-circleci-config.py` to start using CircleCI Scheduled Pipelines
- Nothing
- added `resource_class: medium` to the CircleCI config generated by `generate-circleci-config.py`
- dev-env 0.6.17 -> 0.6.19
- Nothing
- added sample spider `alpine_releases.py`
- added `--pretty` command line option to `run-sample.sh`
- simple approach to skipping CircleCI build, test and deploy of runtime and runtime lite docker images - very useful during development when upgrading major things like Python and/or OS versions
- added explicit resource class to CircleCI config
- dev-env 0.6.13 -> 0.6.17
- python-dateutil 2.8.1 -> 2.8.2
- selenium 3.141.0 -> 4.1.0
- `bin/install-chromedriver.sh` was failing for newer versions of Chromium because the format returned by `chromium-browser --version` changed - fixed this problem
- for runtime lite, Alpine base image 3.12 -> 3.15
- refined `bin/install-chromedriver.sh` output when installing on Alpine
- simonsdave/bionic-dev-env:v0.6.14 -> simonsdave/focal-dev-env:v0.6.16
- fixed `install-chrome.sh` usage message
- added 2022 to License
- removed LGTM workflows and badges in main README.md
- Nothing
- fixed how `generate-circleci-config.py` uses/calls `int-test-run-all-spiders-in-ci-pipeline.py`
- Nothing
- Nothing
- fixed a silly bug in `int-test-run-all-spiders-in-ci-pipeline.py` which made the command unusable - also put in real Python logging and real command line option handling for this command
- Nothing
- Nothing
- update `generate-circleci-config.py` to eliminate the need for `requirements.txt` in spider repos
- runtime docker images no longer have the samples' `__init__.py` marked as executable
- Nothing
- added optional `--samples` command line option to `spiders.py`
- added optional `samples` argument to `SpiderDiscovery()` constructor
- added `categories` to spider metadata - if no categories are specified then the name of the package containing the spider is assumed to be the category name - the only place that categories are currently used is in the API as a means to group spiders
- added `absoluteFilename` property to spider metadata - this value is generated by Cloudfeaster
- added `fullyQualifiedClassName` property to spider metadata - this value is generated by Cloudfeaster
- docker based development environment now parses the repo's `setup.py` for pre-reqs that need to be installed when the development docker image is built - this change enabled the removal of `requirements.txt` from the repo's root directory
- change format of metadata returned by `spiders.py` and `cloudfeaster.spider.Spider`
- Nothing
- CircleCI pipeline now saves generated Python distributions as CircleCI artifacts
- added `int-test-run-all-spiders-in-ci-pipeline.py` which is intended for use in spider repo CI pipelines
- Nothing
- Nothing
- use `update-alternatives` in runtime docker image so `python` "points to" `python3.7`
- `cloudfeaster-lite` docker image is now based on Alpine 3.12 (used to be Alpine 3.8)
- `install-chrome.sh` now able to install both Chrome and Chromium based on command line switches
- `install-chromedriver.sh` determines which version of chromedriver to install based on which version of Chrome or Chromium is installed - see this for a complete description of the version selection process
- default Chrome command line options are now
  - `--headless`
  - `--window-size=1280x1024`
  - `--no-sandbox`
  - `--disable-dev-shm-usage`
  - `--disable-gpu`
  - `--disable-software-rasterizer`
  - `--single-process`
  - `--user-agent=Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36`
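A sketch of how a set of default Chrome flags could be combined with per-spider overrides (the helper name and merge logic are assumptions, not Cloudfeaster's actual implementation; only a subset of the flags above is shown):

```python
# Hypothetical helper: merge default Chrome flags with caller overrides,
# later flags winning when the flag name (the part before '=') collides.
DEFAULT_CHROME_OPTIONS = [
    '--headless',
    '--window-size=1280x1024',
    '--no-sandbox',
    '--disable-dev-shm-usage',
    '--disable-gpu',
]

def chrome_options(overrides=None):
    """Return the default flags plus any caller-supplied flags."""
    merged = {}
    for flag in DEFAULT_CHROME_OPTIONS + list(overrides or []):
        merged[flag.split('=', 1)[0]] = flag
    return list(merged.values())
```

For example, `chrome_options(['--window-size=1920x1080'])` keeps `--headless` but replaces the default window size.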
- Nothing
- Nothing
- `install-dev-env-scripts.sh` now requires virtual env
- dev-env v0.6.12 -> v0.6.13
- Nothing
- Nothing
- dev-env 0.6.11 -> 0.6.12
- Nothing
- Nothing
- fixed CircleCI pipeline for releases
- Nothing
- generate `cloudfeaster` and `cloudfeaster-lite` docker images which can be used as the basis for building docker images of spiders
- remove extra whitespace at end of `generate-circleci-config.py` output
- `spiders.py` output now includes a `_metadata` property
- Nothing
- added `spiders.py` to enable infrastructure spider discovery
- added `get-clf-version.sh` to encapsulate the pattern of parsing `setup.py` to extract the cloudfeaster version
- added runtime docker image for running spiders
- `_metadata.spider.name` in spider output is now the name of the file containing the spider rather than the spider's class name. This change was made as a result of learning more about the spider hosting infrastructure.
- selenium 3.14.0 -> 3.141.0
- `generate-circleci-config.py` now generates a CircleCI config file that packages all spiders in a docker image
- `install-chromedriver.sh` now installs the ChromeDriver version based on the Google Chrome version
- removed `Browser.wait_for_login_to_complete()` and `Browser.wait_for_signin_to_complete()` because they used an old sync pattern and the methods really weren't being used anymore
- add CircleCI docker executor authenticated pull
- per this article, added explicit version to `setup_remote_docker` in CircleCI pipeline
- add CircleCI docker executor authenticated pull for CircleCI config generated by `generate-circleci-config.py`
- Nothing
- Nothing
- Nothing
- logging level in `generate-circleci-config.py` changed from `INFO` to `DEBUG` which is intended to make it simpler to debug CI pipeline crawl failures
- Nothing
- Nothing
- fix: `generate-circleci-config.py` was generating references to the docker image `simonsdave/cloudfeaster-bionic-dev-env` instead of `simonsdave/cloudfeaster-dev-env`
- fix: docker image badge in main `README.md`
- Nothing
- add clair-cicd docker image vulnerability assessment to CircleCI pipeline
- add LGTM badges to main `README.md`
- dev-env 0.6.10 -> 0.6.11
- changes to `generate-circleci-config.py` to improve reliability of capturing crawl results when crawls fail
- Nothing
- Nothing
- fix `bin/install-dev-env-scripts.sh` to work without `dev_env/dev-env-version.txt`
- Nothing
- Nothing
- dev-env v0.6.8 -> v0.6.10
- eliminated the nasty looking `Warning: apt-key output should not be parsed (stdout is not a terminal)` message generated by `bin/install-chrome.sh`
- Nothing
- Nothing
- `run-spider.sh` now outputs only JSON
- dev-env v0.6.7 -> v0.6.8
- Nothing
- Nothing
- `generate-circleci-config.py` adds back `run-pip-check.sh` to the generated CircleCI pipeline which now works after the upgrade to Python 3.7
- Nothing
- Nothing
- `pip3 install` -> `python3.7 -m pip install`
- dev-env v0.6.6 -> v0.6.7
- Nothing
- Nothing
- fix bug in `generate-circleci-config.py` which was generating a `KeyError: 'CRAWL_OUTPUT'` error
- Nothing
- add comprehensive artifact storage for spiders run by the CircleCI workflow generated by `generate-circleci-config.py`
- remove debugging statement from `run-all-spiders.sh`
- Nothing
- Nothing
- fix: `generate-circleci-config.py` had outstanding problems from Python 2.7 -> 3.7
- usability: improve usability of `run-all-spiders.sh` and `run-spider.sh` in spider repos
- docs: fix docker image badge in main README.md
- Nothing
- add `--verbose` command line argument to `docker_image_integration_tests.sh`
- add `--verbose` and `--debug` command line options to `run-sample.sh`
- add `CrawlDebugger` and use in sample spiders - start of improving debugging
- setting `CLF_DEBUG` can now be used to generate `spiderLog` and `chromeDriverLog` in spider output
- add `CrawlResponse.SC_UNKNOWN`
- when a spider fails all attempts are made to take a screenshot of the browser window
- `spiderArgs` in crawl results is now `crawlArgs`
- `run-spider.sh` now accepts the full file name of a spider rather than just the base name - so `run-spider.sh xe_exchange_rates` is now `run-spider.sh xe_exchange_rates.py`
- python-dateutil 2.8.0 -> 2.8.1
- dev-env v0.5.25 -> v0.6.6
- MATERIAL CHANGE: Python 2.7 -> Python 3.7
- Nothing
- Nothing
- Nothing
- remove Snyk from CI pipeline & docs
- add more `BeautifulSoup` and `Scrapy` doc references
- dev-env 0.5.21 -> 0.5.25
- add Codecov upload to CircleCI pipeline
- `SpiderCrawler` has `chromedriver_log_file` allowing callers access to the ChromeDriver debug logs when the `debug` property for `SpiderCrawler` is set to `True`
- `_debug` property in crawl response under all circumstances
- Nothing
- fix logging of CLF_CHROME value
- Nothing
- Nothing
- `bin/install-chromedriver.sh` installs chromedriver 2.46 -> 2.43 motivated by this
- Nothing
- Nothing
- selenium 3.141.0 -> selenium==3.14.0 motivated by this
- Nothing
- Nothing
- dev-env 0.5.20 -> 0.5.21
- `install-dev-env-scripts.sh` now uses the `install-dev-env.sh` `--dev-env-version` command line option
- Nothing
- Nothing
- install `dev-env` using `install-dev-env.sh` instead of `pip install`
- Nothing
- Nothing
- dev-env 0.5.19 -> 0.5.20
- Nothing
- add `bin/generate-circleci-config.py` to `setup.py`
- Nothing
- Nothing
- Nothing
- fix `bin/generate-circleci-config.py` that was generating incorrect CircleCI config
- Nothing
- add `bin/check-circleci-config.sh` to `setup.py` as a script - should have done this when adding `bin/check-circleci-config.sh`
- Nothing
- Nothing
- add `bin/check-consistent-clf-version.sh`
- add `bin/generate-circleci-config.py`
- `bin/install_chrome.sh` -> `bin/install-chrome.sh`
- `bin/install_chromedriver.sh` -> `bin/install-chromedriver.sh`
- remove `bin/chromedriver_version.sh` since this script was no longer used
- remove `dev-env-version` label from docker image since this label is no longer used
- add `check-consistent-clf-version.sh` to `setup.py` as a script which is installed as part of the Cloudfeaster Python package
- Nothing
- Nothing
- add `install-dev-env-scripts.sh` for use in CircleCI pipeline
- Nothing
- Nothing
- add `check-consistent-dev-env-version.sh` to CircleCI pipeline
- add `run-bandit.sh` to CircleCI pipeline
- add `.cut-release-version.sh` in support of using new revs of dev-env
- renamed `run_sample.sh` -> `run-sample.sh`
- the `ttlInSeconds` property in spider metadata is now `ttl` and the value associated with the property is now a string instead of an integer - the string has the form `<number><duration>` where `<number>` is a non-zero integer and `<duration>` is one of `s`, `m`, `h` or `d` representing seconds, minutes, hours and days respectively
- the `maxCrawlTimeInSeconds` spider metadata property is now `maxCrawlTime` and is also a string instead of an integer - the string has the form `<number><duration>` where `<number>` is a non-zero integer and `<duration>` is one of `s` or `m` representing seconds and minutes respectively
- dev-env 0.5.15 -> 0.5.19
- sha1 -> sha256 after running bandit
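The `<number><duration>` convention described above can be illustrated with a small parser (the function name is hypothetical; Cloudfeaster's actual parsing code may differ):

```python
import re

# Seconds per duration unit: s, m, h, d as described above.
_UNITS = {'s': 1, 'm': 60, 'h': 3600, 'd': 86400}

def duration_in_seconds(value):
    """Parse a '<number><duration>' string such as '30s', '5m', '2h' or '1d'
    into a number of seconds. Raises ValueError on malformed input."""
    match = re.fullmatch(r'([1-9]\d*)([smhd])', value)
    if not match:
        raise ValueError('expected <number><duration>, got %r' % value)
    number, unit = match.groups()
    return int(number) * _UNITS[unit]
```

For example, `duration_in_seconds('5m')` returns 300 and `duration_in_seconds('1d')` returns 86400.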
- Nothing
- `bin/install-dev-env-scripts.sh` can now be used by spider repos to add dev env host scripts to a spider repo's host env
- Nothing
- Nothing
- added `run-all-spiders.sh` and `run-spider.sh`
- by default Chrome is now started with `--no-sandbox` which should mean that Chrome can run as root, which eliminates a whole host of complexity
- Nothing
- Nothing
- `.travis.yml` now runs `run_repo_security_scanner.sh`
- added `xe_exchange_rates.py` sample spider
- added sha1 hash of spider args to spider output
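One plausible way to compute such a hash of the spider args (the exact canonicalization Cloudfeaster uses is not described here, so the joining scheme below is an assumption):

```python
import hashlib

def hash_spider_args(args):
    """Return a sha1 hex digest over a spider's arguments.
    Feeding a NUL byte between args avoids ['ab'] colliding with ['a', 'b'];
    this canonicalization is an assumption, not Cloudfeaster's actual scheme."""
    digest = hashlib.sha1()
    for arg in args:
        digest.update(arg.encode('utf-8'))
        digest.update(b'\x00')
    return digest.hexdigest()
```

Including such a digest in the crawl output lets args be correlated across runs without echoing potentially sensitive values.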
- ChromeDriver 2.38 -> 2.46
- Selenium 3.12.0 -> 3.141.0
- twine 1.11.0 -> 1.12.1
- dateutil 2.7.3 -> 2.7.5
- material simplification of the way to use `run_sample.sh`
- removed `bank_of_canada_daily_exchange_rates.py` sample spider
- removed `spiderhost.py`, `spiderhost.sh`, `spiders.py` and `spiders.sh`
- Nothing
- Selenium 3.11.0 -> 3.12.0
- python-dateutil 2.7.2 -> 2.7.3
- spider metadata changed to camel case instead of snake case to get closer to these JSON style guidelines
- crawl results metadata now grouped in the `_metadata` property and uses camel case instead of snake case
- crawl results are now validated against this jsonschema
- added spiders.sh and spiderhost.sh to enable the API for a docker image containing spiders to be expressed in a manner that's independent from Python and WebDriver
- Nothing
- support pip 10.x
- simonsdave/cloudfeaster docker image is now based on Ubuntu 16.04
- ChromeDriver 2.37 -> 2.38
- Nothing
- added cloudfeaster/samples/pypi.py sample spider
- spider metadata - `url` string property is now validated using jsonschema `uri` format instead of a pattern
- selenium 3.9.0 -> 3.11.0
- python-dateutil 2.6.1 -> 2.7.2
- ChromeDriver 2.35 -> 2.37
- twine 1.10.0 -> 1.11.0
- identifying_factors and authenticating_factors properties will now always appear in `spiders.py` output
- Nothing
- Nothing
- samples/pypi_spider.py -> samples/pythonwheels_spider.py
- spider metadata property name change = max_concurrency -> max_concurrent_crawls
- Nothing
- Nothing
- Selenium 3.8.1 -> 3.9.0
- Nothing
- Nothing
- ChromeDriver 2.34 -> 2.35
- Nothing
- `max concurrency` per spider property is now part of the output from `Spider.get_validated_metadata()` regardless of whether or not it is specified as part of the explicit spider metadata declaration
- added `paranoia_level` to spider metadata
- added `max_crawl_time_in_seconds` to spider metadata
- `ttl_in_seconds` now has an upper bound of 86,400 (1 day in seconds)
- `max_concurrency` now has an upper bound of 25
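A minimal sketch of the bounds described above (the property names come from these entries; the validation helper itself is hypothetical, not Cloudfeaster's actual jsonschema-based check):

```python
# Upper bounds taken from the changelog entries above.
MAX_TTL_IN_SECONDS = 86400  # 1 day in seconds
MAX_CONCURRENCY = 25

def check_metadata_bounds(metadata):
    """Return a list of bound violations for a spider metadata dict."""
    errors = []
    if metadata.get('ttl_in_seconds', 0) > MAX_TTL_IN_SECONDS:
        errors.append('ttl_in_seconds exceeds %d' % MAX_TTL_IN_SECONDS)
    if metadata.get('max_concurrency', 0) > MAX_CONCURRENCY:
        errors.append('max_concurrency exceeds %d' % MAX_CONCURRENCY)
    return errors
```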
- Selenium 3.7.0 -> 3.8.1
- ChromeDriver 2.33 -> 2.34
- breaking change: `ttl` -> `ttl_in_seconds` in spider metadata
- Nothing
- added `.prep-for-release-master-branch-changes.sh` so package version number is automatically bumped when cutting a release
- `.prep-for-release-master-branch-changes.sh` now generates Python packages for PyPI from the release branch
- bug fix in `.prep-for-release-release-branch-changes.sh` so links in main `README.md` work correctly after a release
- removed `cloudfeaster.util` module since it wasn't used
- added --log command line option to spiders.py
- added --samples command line option to spiders.py
- `cloudfeaster.webdriver_spider.WebElement` now has an `is_element_present()` method that functions just like `cloudfeaster.webdriver_spider.Browser`'s
- per this article, headless Chrome is now available and Cloudfeaster will use it by default, which means we're also able to remove the need for Xvfb - a really nice simplification and reduction in required crawling resources - also, because we're removing Xvfb, `bin/spiderhost.sh` was also removed
- selenium 3.3.3 -> 3.7.0
- requests 2.13.0 -> >=2.18.2
- ndg-httpsclient 0.4.2 -> 0.4.3
- ChromeDriver 2.29 -> 2.33
- simonsdave/cloudfeaster docker image now uses the latest version of pip
- removed all code related to Signal FX
- pypi_spider.py now included with distro in cloudfeaster.samples
- upgrade selenium 3.0.2 -> 3.3.3
- upgrade chromedriver 2.27 -> 2.29
- Nothing
- added _crawl_time to crawl results
- upgrade to ChromeDriver 2.27 from 2.24
- Nothing
- Nothing
- fix crawl response key errors - `_status` & `_status_code` in crawl response were missing the leading underscore for the following responses
- SC_CTR_RAISED_EXCEPTION
- SC_INVALID_CRAWL_RETURN_TYPE
- SC_CRAWL_RAISED_EXCEPTION
- SC_SPIDER_NOT_FOUND
- Nothing
- Nothing
- dev env upgraded to docker 1.12
- BREAKING CHANGE = selenium 2.53.6 -> 3.0.1 which resulted in requiring an upgrade to ChromeDriver 2.24 from 2.22 and it turns out 2.22 does not work with selenium 3.0.1
- spider version # in crawl results now include hash algo along with the hash value
- BREAKING CHANGE = the spidering infrastructure augments crawl results with data such as the time to crawl, spider name & version number, etc - in order to more easily differentiate crawl results from augmented data, the top level property names for all augment data is now prefixed with an underscore - as an example, below shows the new output from running the PyPI sample spider
```
>./pypi_spider.py | jq .
{
    "virtualenv": {
        "count": 46718553,
        "link": "http://pypi-ranking.info/module/virtualenv",
        "rank": 5
    },
    "_status_code": 0,
    "setuptools": {
        "count": 63758431,
        "link": "http://pypi-ranking.info/module/setuptools",
        "rank": 2
    },
    "simplejson": {
        "count": 182739575,
        "link": "http://pypi-ranking.info/module/simplejson",
        "rank": 1
    },
    "requests": {
        "count": 53961784,
        "link": "http://pypi-ranking.info/module/requests",
        "rank": 4
    },
    "six": {
        "count": 54950976,
        "link": "http://pypi-ranking.info/module/six",
        "rank": 3
    },
    "_spider": {
        "version": "sha1:ccb6a042dd11f2f7fb7b9541d4ec888fc908a8ef",
        "name": "__main__.PyPISpider"
    },
    "_crawl_time_in_ms": 4773,
    "_status": "Ok"
}
```
- upgrade dev env to docker 1.12
- Nothing
- Nothing
- fixed bug that was duplicating crawl response data in `CrawlResponseOk`
- Nothing
- support docker 1.12
- version bumps for dependencies:
- chromedriver 2.22
- selenium 2.53.6
- requests 2.11.0
- ndg-httpsclient 0.4.2
- set of simplifications in dev env setup
- temporary removal of authenticated proxy support
- Cloudfeaster spiders can be developed on pretty much any operating system/browser combination that can run Selenium, but Cloudfeaster Services always runs spiders on Ubuntu and Chrome; some web sites present different responses to browser requests based on the originating browser and/or operating system; if, for example, development of a spider is done on Mac OS X using Chrome, the xpath expressions embedded in a spider may not be valid when the spider is run on Ubuntu using Chrome; to address this disconnect, spider authors can force Cloudfeaster Services to use a user agent header that matches their development environment by providing a value for the `user_agent` argument of the `Browser` class' constructor.
- added proxy support to permit use of anonymity networks like those listed below - proxy support is exposed by 2 new flags in `spiderhost.py` (`--proxy` and `--proxy-user`)
```
>spiderhost.py --help
Usage: spiderhost.py <spider> [<arg1> ... <argN>]

spider hosts accept the name of a spider, the arguments to run the spider and
optionally proxy server details. armed with all this info the spider host runs
a spider and dumps the result to stdout.

Options:
  -h, --help            show this help message and exit
  --log=LOGGING_LEVEL   logging level
                        [DEBUG,INFO,WARNING,ERROR,CRITICAL,FATAL] - default =
                        ERROR
  --proxy=PROXY         proxy - default = None
  --proxy-user=PROXY_USER
                        proxy-user - default = None
>
>spiderhost.py --proxy=abc
Usage: spiderhost.py <spider> [<arg1> ... <argN>]

spiderhost.py: error: option --proxy: required format is host:port
>
>spiderhost.py --proxy-user=abc
Usage: spiderhost.py <spider> [<arg1> ... <argN>]

spiderhost.py: error: option --proxy-user: required format is user:password
>
```
- colorama now req'd to be @ least version 0.3.5 instead of only 0.3.5
- command line args to bin/spiderhost.sh have been simplified - now just take spider name and spider args just as you'd expect - no more url encoding of args and ----- indicating no spider args
- like the changes to bin/spiderhost.sh, bin/spiderhost.py now just accepts regular command line arguments of a spider name and spider args - much easier
- bin/spiders.sh is no longer needed - callers now access bin/spiders.py directly rather than getting at bin/spiders.py through bin/spiders.sh
- not really the initial release but intro'ed CHANGELOG.md late
- initial clf commit to github was 13 Oct '13