
Releases: thewizardplusplus/go-crawler

v1.11.2

14 Nov 02:04


Simplify the handlers.ConcurrentHandler structure via the github.com/thewizardplusplus/go-sync-utils package.
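
The notes do not show the resulting structure, but the goroutine-pool pattern behind such a concurrent handler can be sketched with the standard library alone. The `LinkHandler` interface and every other name below are hypothetical stand-ins, not the actual API of this package or of github.com/thewizardplusplus/go-sync-utils:

```go
package main

import (
	"context"
	"fmt"
	"sync"
)

// LinkHandler is a hypothetical stand-in for the handler contract.
type LinkHandler interface {
	HandleLink(ctx context.Context, link string)
}

// concurrentHandler fans incoming links out to a fixed pool of goroutines
// and can wait for the completion of all processing.
type concurrentHandler struct {
	links     chan string
	waitGroup sync.WaitGroup
}

func newConcurrentHandler(
	ctx context.Context,
	concurrency int,
	inner LinkHandler,
) *concurrentHandler {
	handler := &concurrentHandler{links: make(chan string)}
	handler.waitGroup.Add(concurrency)
	for i := 0; i < concurrency; i++ {
		go func() {
			defer handler.waitGroup.Done()
			for link := range handler.links {
				inner.HandleLink(ctx, link)
			}
		}()
	}
	return handler
}

// HandleLink enqueues a link for the pool.
func (handler *concurrentHandler) HandleLink(link string) {
	handler.links <- link
}

// Stop closes the queue and waits for the pool to drain.
func (handler *concurrentHandler) Stop() {
	close(handler.links)
	handler.waitGroup.Wait()
}

type printingHandler struct{}

func (printingHandler) HandleLink(_ context.Context, link string) {
	fmt.Println("handled:", link)
}

func main() {
	handler := newConcurrentHandler(context.Background(), 4, printingHandler{})
	handler.HandleLink("http://example.com/a")
	handler.HandleLink("http://example.com/b")
	handler.Stop()
}
```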

Change Log

Features

  • crawling of all relative links for specified ones:
    • names of tags and attributes of links may be configured;
    • supporting of an outer transformer for the extracted links (optional):
      • data passed to the transformer:
        • extracted links;
        • service data of the HTTP response;
        • content of the HTTP response as bytes;
      • transformers:
        • leading and trailing spaces trimming in the extracted links;
        • resolving of relative links:
          • by the base tag:
            • tag and attribute names may be configured (<base href="..." /> by default);
            • tag selection:
              • first occurrence;
              • last occurrence;
          • by the header list:
            • the headers are listed in descending order of priority;
            • Content-Base and Content-Location by default;
          • by the request URI;
      • supporting of grouping of transformers:
        • the transformers are processed sequentially, so one transformer can influence another one;
    • supporting of leading and trailing spaces trimming in extracted links (optional):
      • as the transformer for the extracted links (see above);
      • as the wrapper for a link extractor;
    • repeated extracting of relative links on error (optional):
      • only the specified repeat count;
      • supporting of a delay between repeats;
    • delayed extracting of relative links (optional):
      • reducing of a delay time by the time elapsed since the last request;
      • using of individual delays for each thread;
    • extracting links from a sitemap.xml file (optional):
      • in-memory caching of the loaded sitemap.xml files;
      • ignoring of the error on loading of the sitemap.xml file:
        • logging of the received error;
        • returning of the empty Sitemap instead;
      • supporting of several sitemap.xml files for a single link:
        • processing of each sitemap.xml file is done in a separate goroutine;
        • supporting of an outer generator for the sitemap.xml links:
          • generators:
            • hierarchical generator:
              • returns the suitable sitemap.xml file for each part of the URL path;
              • supporting of sanitizing of the base link before generating of the sitemap.xml links;
              • supporting of the restriction of the maximal depth;
            • generator based on the robots.txt file;
          • supporting of grouping of generators:
            • the result of group generating is the merged results of all generators in the group;
            • processing of each generator is done in a separate goroutine;
      • supporting of a Sitemap index file:
        • supporting of a delay before loading of each sitemap.xml file listed in the index;
      • supporting of gzip compression of a sitemap.xml file;
    • supporting of grouping of link extractors:
      • the result of group extracting is the merged results of all link extractors in the group;
      • processing of each link extractor is done in a separate goroutine;
  • calling of an outer handler for each extracted link:
    • handling of the extracted links directly during the crawling, i.e., immediately after they have been extracted;
    • data passed to the handler:
      • extracted link;
      • source link for the extracted link;
    • handling only of those extracted links that have been filtered by a link filter (see below; optional);
    • handling of the extracted links concurrently, i.e., in the goroutine pool (optional);
    • supporting of grouping of handlers:
      • processing of each handler is done in a separate goroutine;
  • filtering of the extracted links by an outer link filter:
    • by relativity of the extracted link (optional):
      • supporting of result inverting;
    • by uniqueness of the extracted link (optional):
      • supporting of sanitizing of the link before checking of uniqueness;
    • by a robots.txt file (optional):
      • customized user agent;
      • in-memory caching of the loaded robots.txt files;
    • supporting of grouping of link filters:
      • the link filters are processed sequentially, so one link filter can influence another one;
      • the result of group filtering is successful only when all link filters are successful;
      • an empty group of link filters always fails;
  • parallelization possibilities:
    • crawling of relative links concurrently, i.e., in the goroutine pool;
    • simulation of an unbounded channel of links to avoid a deadlock (see the sketch below);
    • waiting for the completion of processing of all extracted links;
    • supporting of stopping all operations via the context.
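
The unbounded-channel simulation referenced above is the classic buffering-goroutine pattern: pending links are kept in a slice between an input and an output channel, so senders never stay blocked and the crawler cannot deadlock on its own feedback loop. A minimal self-contained sketch, not the crawler's actual implementation:

```go
package main

import "fmt"

// unboundedLinks simulates an unbounded channel of links: a buffering
// goroutine shuttles links from the input channel into a slice and from
// the slice into the output channel.
func unboundedLinks() (chan<- string, <-chan string) {
	in := make(chan string)
	out := make(chan string)
	go func() {
		defer close(out)

		var input <-chan string = in
		var queue []string
		for input != nil || len(queue) > 0 {
			// offer an output only when the queue is non-empty;
			// sending to a nil channel blocks, which disables that case
			var head string
			var output chan string
			if len(queue) > 0 {
				head, output = queue[0], out
			}
			select {
			case link, ok := <-input:
				if !ok {
					input = nil // input closed; keep draining the queue
					continue
				}
				queue = append(queue, link)
			case output <- head:
				queue = queue[1:]
			}
		}
	}()
	return in, out
}

func main() {
	in, out := unboundedLinks()
	for i := 0; i < 3; i++ {
		in <- fmt.Sprintf("http://example.com/page-%d", i)
	}
	close(in)
	for link := range out {
		fmt.Println(link)
	}
}
```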

v1.11.1

15 Oct 23:46


Supporting of an outer transformer for the extracted links; waiting for the completion of the processing in the handlers.ConcurrentHandler structure; replacing of error producing with logging in the transformers.ResolvingTransformer.TransformLinks() method; adding of the Name field to the extractors.ExtractorGroup structure for use in the log messages as a prefix; using of relative link resolving in the examples; adding of the example with all the features; simplifying of the examples; completing of the documentation.
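
A minimal sketch of the transformer contract this release describes: a transformer receives the extracted links together with the HTTP response data and returns the transformed links, and a group applies its transformers sequentially, so one transformer can influence the next. The interface and type names are hypothetical stand-ins, not the actual signatures of the transformers package:

```go
package main

import (
	"fmt"
	"net/http"
	"strings"
)

// LinkTransformer is a hypothetical stand-in for the transformer contract.
type LinkTransformer interface {
	TransformLinks(
		links []string,
		response *http.Response,
		responseContent []byte,
	) ([]string, error)
}

// trimmingTransformer trims leading and trailing spaces in every link.
type trimmingTransformer struct{}

func (trimmingTransformer) TransformLinks(
	links []string, _ *http.Response, _ []byte,
) ([]string, error) {
	trimmed := make([]string, 0, len(links))
	for _, link := range links {
		trimmed = append(trimmed, strings.TrimSpace(link))
	}
	return trimmed, nil
}

// transformerGroup applies its transformers sequentially, feeding the
// output of one transformer into the next one.
type transformerGroup []LinkTransformer

func (group transformerGroup) TransformLinks(
	links []string, response *http.Response, content []byte,
) ([]string, error) {
	var err error
	for _, transformer := range group {
		links, err = transformer.TransformLinks(links, response, content)
		if err != nil {
			return nil, err
		}
	}
	return links, nil
}

func main() {
	group := transformerGroup{trimmingTransformer{}}
	links, _ := group.TransformLinks([]string{"  /page  "}, nil, nil)
	fmt.Println(links) // [/page]
}
```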

Change Log

  • crawling of all relative links for specified ones:
    • supporting of an outer transformer for the extracted links (optional):
      • data passed to the transformer:
        • extracted links;
        • service data of the HTTP response;
        • content of the HTTP response as bytes;
      • transformers:
        • leading and trailing spaces trimming in the extracted links;
        • resolving of relative links:
          • by the base tag:
            • tag and attribute names may be configured (<base href="..." /> by default);
            • tag selection:
              • first occurrence;
              • last occurrence;
          • by the header list:
            • the headers are listed in descending order of priority;
            • Content-Base and Content-Location by default;
          • by the request URI;
      • supporting of grouping of transformers:
        • the transformers are processed sequentially, so one transformer can influence another one;
  • minor improvements:
    • rename the transformers.BaseTagFilters variable to DefaultBaseTagFilters;
    • add waiting for the completion of the processing in the handlers.ConcurrentHandler structure;
    • error handling:
      • improve the error handling in the sitemap.HierarchicalGenerator.ExtractLinks() method;
      • simplify the error handling in the extractors.TrimmingExtractor.ExtractLinks() method;
      • replace error producing with logging in the transformers.ResolvingTransformer.TransformLinks() method;
    • logging:
      • move the logging from the registers.LinkRegister structure to the checkers.DuplicateChecker structure:
        • return the error instead of logging in the registers.LinkRegister structure;
        • add the logging to the checkers.DuplicateChecker structure;
      • improve the logging in the checkers.HostChecker.CheckLink() method;
      • add the Name field to the extractors.ExtractorGroup structure:
        • use it in the log messages as a prefix (optional);
    • refactoring:
      • use the transformers.TrimmingTransformer structure in the extractors.TrimmingExtractor.ExtractLinks() method;
      • simplify the extractors.DelayingExtractor.ExtractLinks() method;
      • add the explanatory comment to the extractors.DelayingExtractor.ExtractLinks() method;
      • use the builders.FlattenBuilder structure from the github.com/thewizardplusplus/go-html-selector package in the transformers.BaseTagBuilder structure;
    • unit testing:
      • complete the tests of the transformers.ResolvingTransformer.TransformLinks() method;
      • fix the tests of the transformers.BaseTagBuilder.IsSelectionTerminated() method;
  • examples:
    • use the relative link resolving;
    • add the explanatory comment to the example with the processing of a sitemap.xml file;
    • add the example with all the features;
    • simplify the examples:
      • simplify the renderTemplate() function;
      • remove the use:
        • of the extractors.RepeatingExtractor structure;
        • of the extractors.TrimmingExtractor structure;
      • remove the example:
        • with the delaying extracting;
        • with the processing of a robots.txt file on the handling;
        • with the crawler.CrawlByConcurrentHandler() function;
        • with the crawler.HandleLinksConcurrently() function;
  • documentation:
    • complete the README.md file:
      • describe the bibliography;
      • complete the description of the features.

Features

  • crawling of all relative links for specified ones:
    • names of tags and attributes of links may be configured;
    • supporting of an outer transformer for the extracted links (optional):
      • data passed to the transformer:
        • extracted links;
        • service data of the HTTP response;
        • content of the HTTP response as bytes;
      • transformers:
        • leading and trailing spaces trimming in the extracted links;
        • resolving of relative links:
          • by the base tag:
            • tag and attribute names may be configured (<base href="..." /> by default);
            • tag selection:
              • first occurrence;
              • last occurrence;
          • by the header list:
            • the headers are listed in descending order of priority;
            • Content-Base and Content-Location by default;
          • by the request URI;
      • supporting of grouping of transformers:
        • the transformers are processed sequentially, so one transformer can influence another one;
    • supporting of leading and trailing spaces trimming in extracted links (optional):
      • as the transformer for the extracted links (see above);
      • as the wrapper for a link extractor;
    • repeated extracting of relative links on error (optional):
      • only the specified repeat count;
      • supporting of a delay between repeats;
    • delayed extracting of relative links (optional):
      • reducing of a delay time by the time elapsed since the last request;
      • using of individual delays for each thread;
    • extracting links from a sitemap.xml file (optional):
      • in-memory caching of the loaded sitemap.xml files;
      • ignoring of the error on loading of the sitemap.xml file:
        • logging of the received error;
        • returning of the empty Sitemap instead;
      • supporting of several sitemap.xml files for a single link:
        • processing of each sitemap.xml file is done in a separate goroutine;
        • supporting of an outer generator for the sitemap.xml links:
          • generators:
            • hierarchical generator:
              • returns the suitable sitemap.xml file for each part of the URL path;
              • supporting of sanitizing of the base link before generating of the sitemap.xml links;
              • supporting of the restriction of the maximal depth;
            • generator based on the robots.txt file;
          • supporting of grouping of generators:
            • the result of group generating is the merged results of all generators in the group;
            • processing of each generator is done in a separate goroutine;
      • supporting of a Sitemap index file:
        • supporting of a delay before loading of each sitemap.xml file listed in the index;
      • supporting of gzip compression of a sitemap.xml file;
    • supporting of grouping of link extractors:
      • the result of group extracting is the merged results of all link extractors in the group;
      • processing of each link extractor is done in a separate goroutine;
  • calling of an outer handler for each extracted link:
    • handling of the extracted links directly during the crawling, i.e., immediately after they have been extracted;
    • data passed to the handler:
      • extracted link;
      • source link for the extracted link;
    • handling only of those extracted links that have been filtered by a link filter (see below; optional);
    • handling of the extracted links concurrently, i.e., in the goroutine pool (optional);
    • supporting of grouping of handlers:
      • processing of each handler is done in a separate goroutine;
  • filtering of the extracted links by an outer link filter:
    • by relativity of the extracted link (optional):
      • supporting of result inverting;
    • by uniqueness of the extracted link (optional):
      • supporting of sanitizing of the link before checking of uniqueness;
    • by a robots.txt file (optional):
      • customized user agent;
      • in-memory caching of the loaded robots.txt files;
    • supporting of grouping of link filters:
      • the link filters are processed sequentially, so one link filter can influence another one;
      • the result of group filtering is successful only when all link filters are successful;
      • an empty group of link filters always fails;
  • parallelization possibilities:
    • crawling of relative links concurrently, i.e., in the goroutine pool;
    • simulation of an unbounded channel of links to avoid a deadlock;
    • waiting for the completion of processing of all extracted links;
    • supporting of stopping all operations via the context.

v1.11

11 Sep 23:52


Resolving of relative links.
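
The resolution order described below (the base tag, then the Content-Base and Content-Location headers, then the request URI) can be illustrated with net/url from the standard library; the resolveBase() helper is a hypothetical sketch, not the library's actual code:

```go
package main

import (
	"fmt"
	"net/http"
	"net/url"
)

// resolveBase picks the base URI in priority order: the <base /> tag if
// one was found, then the Content-Base and Content-Location headers, and
// finally the request URI.
func resolveBase(baseTagHref string, headers http.Header, requestURI *url.URL) *url.URL {
	candidates := []string{
		baseTagHref, // href of the <base /> tag, empty if absent
		headers.Get("Content-Base"),
		headers.Get("Content-Location"),
	}
	for _, candidate := range candidates {
		if candidate == "" {
			continue
		}
		// a candidate may itself be relative,
		// so resolve it against the request URI
		if base, err := requestURI.Parse(candidate); err == nil {
			return base
		}
	}
	return requestURI
}

func main() {
	requestURI, _ := url.Parse("http://example.com/posts/42")
	headers := http.Header{}
	headers.Set("Content-Base", "http://example.com/base/")

	base := resolveBase("", headers, requestURI)
	link, _ := base.Parse("./relative") // net/url does the RFC 3986 resolution
	fmt.Println(link)                   // http://example.com/base/relative
}
```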

Change Log

  • crawling of all relative links for specified ones:
    • resolving of relative links:
      • by the base tag;
      • by the Content-Base and Content-Location headers;
      • by the request URI.

Features

  • crawling of all relative links for specified ones:
    • resolving of relative links:
      • by the base tag;
      • by the Content-Base and Content-Location headers;
      • by the request URI;
    • supporting of leading and trailing spaces trimming in extracted links (optional);
    • repeated extracting of relative links on error (optional):
      • only the specified repeat count;
      • supporting of a delay between repeats;
    • delayed extracting of relative links (optional):
      • reducing of a delay time by the time elapsed since the last request;
      • using of individual delays for each thread;
    • extracting links from a sitemap.xml file (optional):
      • ignoring of the error on loading of the sitemap.xml file:
        • logging of the received error;
        • returning of an empty Sitemap instead;
      • supporting of several sitemap.xml files for a single link:
        • processing of each sitemap.xml file is done in a separate goroutine;
        • supporting of an outer generator for sitemap.xml links:
          • generators:
            • simple generator (it returns the sitemap.xml file in the site root);
            • hierarchical generator (it returns the suitable sitemap.xml file for each part of the URL path);
            • generator based on the robots.txt file;
          • supporting of grouping of generators:
            • the result of group generating is the merged results of all generators in the group;
            • generating concurrently:
              • processing of each generator is done in a separate goroutine;
      • supporting of a Sitemap index file:
        • supporting of a delay before loading of each sitemap.xml file listed in the index;
      • supporting of gzip compression of a sitemap.xml file;
    • supporting of grouping of link extractors:
      • the result of group extracting is the merged results of all extractors in the group;
      • extracting links concurrently:
        • processing of each link extractor is done in a separate goroutine;
  • calling of an outer handler for each found link:
    • it's called directly during crawling;
    • handling of links immediately after they have been extracted;
    • passing of the source link to the outer handler;
    • handling links filtered by a custom link filter (optional);
    • handling links concurrently (optional);
    • supporting of grouping of outer handlers:
      • processing of each outer handler is done in a separate goroutine;
  • custom filtering of considered links:
    • by relativity of a link (optional):
      • supporting of result inverting;
    • by uniqueness of an extracted link (optional):
      • supporting of sanitizing of a link before checking of uniqueness (optional);
    • by a robots.txt file (optional):
      • customized user agent;
    • supporting of grouping of link filters:
      • the result of group filtering is successful only when all filters are successful;
  • parallelization possibilities:
    • crawling of relative links in parallel;
    • supporting of background working:
      • automatic completion after processing all filtered links;
    • simulation of an unbounded channel of links to avoid a deadlock.

v1.10.1

18 Jul 09:01


Perform the refactoring: add the extractors.TrimmingExtractor structure and ignore errors from each extractor in the group, instead of logging them.
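
The idea behind extractors.TrimmingExtractor — a wrapper that trims the links returned by an inner extractor — can be sketched as follows; the LinkExtractor interface and both types here are hypothetical stand-ins, not the package's actual declarations:

```go
package main

import (
	"context"
	"fmt"
	"strings"
)

// LinkExtractor is a hypothetical stand-in for the extractor contract.
type LinkExtractor interface {
	ExtractLinks(ctx context.Context, link string) ([]string, error)
}

// trimmingExtractor wraps another extractor and trims leading and
// trailing spaces in every extracted link.
type trimmingExtractor struct {
	inner LinkExtractor
}

func (extractor trimmingExtractor) ExtractLinks(
	ctx context.Context, link string,
) ([]string, error) {
	links, err := extractor.inner.ExtractLinks(ctx, link)
	if err != nil {
		return nil, err
	}
	for index, extractedLink := range links {
		links[index] = strings.TrimSpace(extractedLink)
	}
	return links, nil
}

// staticExtractor returns a fixed link list; it stands in for a real one.
type staticExtractor []string

func (extractor staticExtractor) ExtractLinks(context.Context, string) ([]string, error) {
	return append([]string(nil), extractor...), nil
}

func main() {
	extractor := trimmingExtractor{inner: staticExtractor{"  /a ", "/b"}}
	links, _ := extractor.ExtractLinks(context.Background(), "http://example.com/")
	fmt.Println(links) // [/a /b]
}
```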

Change Log

  • perform the refactoring:
    • link trimming:
      • add the extractors.TrimmingExtractor structure;
      • remove the link trimming from the extractors.DefaultExtractor structure;
    • fix the extractors.ExtractorGroup structure:
      • ignore errors from each extractor in the group, instead of logging them;
    • add the registers.BasicRegister structure:
      • use in the registers.RobotsTXTRegister structure;
      • use in the registers.SitemapRegister structure;
  • fix the bugs:
    • fix the tests of the checkers.HostChecker.CheckLink() method.

Features

  • crawling of all relative links for specified ones:
    • supporting of leading and trailing spaces trimming in extracted links (optional);
    • repeated extracting of relative links on error (optional):
      • only the specified repeat count;
      • supporting of a delay between repeats;
    • delayed extracting of relative links (optional):
      • reducing of a delay time by the time elapsed since the last request;
      • using of individual delays for each thread;
    • extracting links from a sitemap.xml file (optional):
      • ignoring of the error on loading of the sitemap.xml file:
        • logging of the received error;
        • returning of an empty Sitemap instead;
      • supporting of several sitemap.xml files for a single link:
        • processing of each sitemap.xml file is done in a separate goroutine;
        • supporting of an outer generator for sitemap.xml links:
          • generators:
            • simple generator (it returns the sitemap.xml file in the site root);
            • hierarchical generator (it returns the suitable sitemap.xml file for each part of the URL path);
            • generator based on the robots.txt file;
          • supporting of grouping of generators:
            • the result of group generating is the merged results of all generators in the group;
            • generating concurrently:
              • processing of each generator is done in a separate goroutine;
      • supporting of a Sitemap index file:
        • supporting of a delay before loading of each sitemap.xml file listed in the index;
      • supporting of gzip compression of a sitemap.xml file;
    • supporting of grouping of link extractors:
      • the result of group extracting is the merged results of all extractors in the group;
      • extracting links concurrently:
        • processing of each link extractor is done in a separate goroutine;
  • calling of an outer handler for each found link:
    • it's called directly during crawling;
    • handling of links immediately after they have been extracted;
    • passing of the source link to the outer handler;
    • handling links filtered by a custom link filter (optional);
    • handling links concurrently (optional);
    • supporting of grouping of outer handlers:
      • processing of each outer handler is done in a separate goroutine;
  • custom filtering of considered links:
    • by relativity of a link (optional):
      • supporting of result inverting;
    • by uniqueness of an extracted link (optional):
      • supporting of sanitizing of a link before checking of uniqueness (optional);
    • by a robots.txt file (optional):
      • customized user agent;
    • supporting of grouping of link filters:
      • the result of group filtering is successful only when all filters are successful;
  • parallelization possibilities:
    • crawling of relative links in parallel;
    • supporting of background working:
      • automatic completion after processing all filtered links;
    • simulation of an unbounded channel of links to avoid a deadlock.

v1.10

04 Jul 22:51


Supporting of leading and trailing spaces trimming in extracted links and supporting of grouping of outer handlers.
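
A minimal sketch of the grouping of outer handlers described here, with each handler running in its own goroutine; the LinkHandler interface and the handlerGroup type are hypothetical stand-ins, not the actual declarations of the handlers package:

```go
package main

import (
	"context"
	"fmt"
	"sync"
)

// LinkHandler is a hypothetical stand-in for the handler contract.
type LinkHandler interface {
	HandleLink(ctx context.Context, link string)
}

// handlerGroup calls every handler in its own goroutine and waits
// for all of them to finish.
type handlerGroup []LinkHandler

func (group handlerGroup) HandleLink(ctx context.Context, link string) {
	var waitGroup sync.WaitGroup
	waitGroup.Add(len(group))
	for _, handler := range group {
		go func(handler LinkHandler) {
			defer waitGroup.Done()
			handler.HandleLink(ctx, link)
		}(handler)
	}
	waitGroup.Wait()
}

type printingHandler string

func (handler printingHandler) HandleLink(_ context.Context, link string) {
	fmt.Printf("%s: %s\n", string(handler), link)
}

func main() {
	group := handlerGroup{printingHandler("first"), printingHandler("second")}
	group.HandleLink(context.Background(), "http://example.com/")
}
```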

Change Log

  • crawling of all relative links for specified ones:
    • supporting of leading and trailing spaces trimming in extracted links (optional);
  • calling of an outer handler for each found link:
    • supporting of grouping of outer handlers:
      • processing of each outer handler is done in a separate goroutine;
  • custom filtering of considered links:
    • by relativity of a link (optional):
      • supporting of result inverting;
  • extend the logging:
    • in the crawler.HandleLink() function;
    • in the extractors package:
      • in the RepeatingExtractor structure;
      • in the SitemapExtractor structure;
    • in the checkers package:
      • in the HostChecker structure;
      • in the RobotsTXTChecker structure;
    • in the registers.LinkRegister structure;
  • examples:
    • fix the output messages;
    • add the example with few handlers.

Features

  • crawling of all relative links for specified ones:
    • supporting of leading and trailing spaces trimming in extracted links (optional);
    • repeated extracting of relative links on error (optional):
      • only the specified repeat count;
      • supporting of a delay between repeats;
    • delayed extracting of relative links (optional):
      • reducing of a delay time by the time elapsed since the last request;
      • using of individual delays for each thread;
    • extracting links from a sitemap.xml file (optional):
      • ignoring of the error on loading of the sitemap.xml file:
        • logging of the received error;
        • returning of an empty Sitemap instead;
      • supporting of several sitemap.xml files for a single link:
        • processing of each sitemap.xml file is done in a separate goroutine;
        • supporting of an outer generator for sitemap.xml links:
          • generators:
            • simple generator (it returns the sitemap.xml file in the site root);
            • hierarchical generator (it returns the suitable sitemap.xml file for each part of the URL path);
            • generator based on the robots.txt file;
          • supporting of grouping of generators:
            • the result of group generating is the merged results of all generators in the group;
            • generating concurrently:
              • processing of each generator is done in a separate goroutine;
      • supporting of a Sitemap index file:
        • supporting of a delay before loading of each sitemap.xml file listed in the index;
      • supporting of gzip compression of a sitemap.xml file;
    • supporting of grouping of link extractors:
      • the result of group extracting is the merged results of all extractors in the group;
      • extracting links concurrently:
        • processing of each link extractor is done in a separate goroutine;
  • calling of an outer handler for each found link:
    • it's called directly during crawling;
    • handling of links immediately after they have been extracted;
    • passing of the source link to the outer handler;
    • handling links filtered by a custom link filter (optional);
    • handling links concurrently (optional);
    • supporting of grouping of outer handlers:
      • processing of each outer handler is done in a separate goroutine;
  • custom filtering of considered links:
    • by relativity of a link (optional):
      • supporting of result inverting;
    • by uniqueness of an extracted link (optional):
      • supporting of sanitizing of a link before checking of uniqueness (optional);
    • by a robots.txt file (optional):
      • customized user agent;
    • supporting of grouping of link filters:
      • the result of group filtering is successful only when all filters are successful;
  • parallelization possibilities:
    • crawling of relative links in parallel;
    • supporting of background working:
      • automatic completion after processing all filtered links;
    • simulation of an unbounded channel of links to avoid a deadlock.

v1.9.1

04 Jul 22:50


Perform the refactoring: replace the registers.LinkGenerator interface with models.LinkExtractor and add the urlutils.GenerateHierarchicalLinks() function.
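
The notes do not show the function's signature, but the idea behind urlutils.GenerateHierarchicalLinks() — one link per prefix of the URL path, from the site root down — can be sketched like this (a hypothetical stand-in, not the actual implementation):

```go
package main

import (
	"fmt"
	"net/url"
	"path"
	"strings"
)

// generateHierarchicalLinks yields, for every prefix of the URL path,
// a link ending with the given suffix.
func generateHierarchicalLinks(base *url.URL, suffix string) []string {
	link := *base
	link.RawQuery, link.Fragment = "", ""

	link.Path = "/" + suffix // the site root comes first
	links := []string{link.String()}

	prefix := ""
	dir := path.Dir(base.Path) // drop the last path segment
	for _, segment := range strings.Split(strings.Trim(dir, "/"), "/") {
		if segment == "" {
			continue
		}
		prefix += "/" + segment
		link.Path = prefix + "/" + suffix
		links = append(links, link.String())
	}
	return links
}

func main() {
	base, _ := url.Parse("http://example.com/one/two/page")
	for _, link := range generateHierarchicalLinks(base, "sitemap.xml") {
		fmt.Println(link)
	}
	// Output:
	// http://example.com/sitemap.xml
	// http://example.com/one/sitemap.xml
	// http://example.com/one/two/sitemap.xml
}
```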

Change Log

  • perform the refactoring:
    • move the interfaces of the models package to a separate file;
    • replace the registers.LinkGenerator interface with models.LinkExtractor:
      • replace the sitemap.GeneratorGroup type with extractors.ExtractorGroup;
    • pass a thread ID to the registers.SitemapRegister.RegisterSitemap() method;
    • rename the sanitizing package to urlutils;
    • add the urlutils.GenerateHierarchicalLinks() function:
      • use in the registers.RobotsTXTRegister structure;
      • use in the sitemap.HierarchicalGenerator structure:
        • replace the sitemap.SimpleGenerator structure with sitemap.HierarchicalGenerator;
  • simplify the examples.

Features

  • crawling of all relative links for specified ones:
    • repeated extracting of relative links on error (optional):
      • only the specified repeat count;
      • supporting of a delay between repeats;
    • delayed extracting of relative links (optional):
      • reducing of a delay time by the time elapsed since the last request;
      • using of individual delays for each thread;
    • extracting links from a sitemap.xml file (optional):
      • ignoring of the error on loading of the sitemap.xml file:
        • logging of the received error;
        • returning of an empty Sitemap instead;
      • supporting of several sitemap.xml files for a single link:
        • processing of each sitemap.xml file is done in a separate goroutine;
        • supporting of an outer generator for sitemap.xml links:
          • generators:
            • simple generator (it returns the sitemap.xml file in the site root);
            • hierarchical generator (it returns the suitable sitemap.xml file for each part of the URL path);
            • generator based on the robots.txt file;
          • supporting of grouping of generators:
            • the result of group generating is the merged results of all generators in the group;
            • generating concurrently:
              • processing of each generator is done in a separate goroutine;
      • supporting of a Sitemap index file:
        • supporting of a delay before loading of each sitemap.xml file listed in the index;
      • supporting of gzip compression of a sitemap.xml file;
    • supporting of grouping of link extractors:
      • the result of group extracting is the merged results of all extractors in the group;
      • extracting links concurrently:
        • processing of each link extractor is done in a separate goroutine;
  • calling of an outer handler for each found link:
    • it's called directly during crawling;
    • handling of links immediately after they have been extracted;
    • passing of the source link to the outer handler;
    • handling links filtered by a custom link filter (optional);
    • handling links concurrently (optional);
  • custom filtering of considered links:
    • by relativity of a link (optional);
    • by uniqueness of an extracted link (optional):
      • supporting of sanitizing of a link before checking of uniqueness (optional);
    • by a robots.txt file (optional):
      • customized user agent;
    • supporting of grouping of link filters:
      • the result of group filtering is successful only when all filters are successful;
  • parallelization possibilities:
    • crawling of relative links in parallel;
    • supporting of background working:
      • automatic completion after processing all filtered links;
    • simulation of an unbounded channel of links to avoid a deadlock.

v1.9

19 Jun 15:06


Supporting of gzip compression of a sitemap.xml file.
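
Transparent support for a gzip-compressed sitemap.xml file can be sketched with compress/gzip and encoding/xml from the standard library; decodeSitemap() below is a hypothetical illustration, not the library's actual code:

```go
package main

import (
	"bytes"
	"compress/gzip"
	"encoding/xml"
	"fmt"
	"io"
)

// urlSet mirrors the <urlset> layout of a sitemap.xml file.
type urlSet struct {
	URLs []struct {
		Loc string `xml:"loc"`
	} `xml:"url"`
}

// decodeSitemap transparently unwraps a gzip-compressed body (recognized
// by the gzip magic bytes 0x1f 0x8b) before decoding the XML.
func decodeSitemap(body []byte) (urlSet, error) {
	var reader io.Reader = bytes.NewReader(body)
	if len(body) >= 2 && body[0] == 0x1f && body[1] == 0x8b {
		gzipReader, err := gzip.NewReader(reader)
		if err != nil {
			return urlSet{}, err
		}
		defer gzipReader.Close()
		reader = gzipReader
	}

	var sitemap urlSet
	err := xml.NewDecoder(reader).Decode(&sitemap)
	return sitemap, err
}

func main() {
	const sitemap = `<urlset><url><loc>http://example.com/</loc></url></urlset>`

	// compress the sample sitemap to exercise the gzip branch
	var compressed bytes.Buffer
	gzipWriter := gzip.NewWriter(&compressed)
	gzipWriter.Write([]byte(sitemap))
	gzipWriter.Close()

	decoded, err := decodeSitemap(compressed.Bytes())
	if err != nil {
		panic(err)
	}
	fmt.Println(decoded.URLs[0].Loc) // http://example.com/
}
```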

Change Log

  • crawling of all relative links for specified ones:
    • extracting links from a sitemap.xml file (optional):
      • supporting of gzip compression of a sitemap.xml file.

Features

  • crawling of all relative links for specified ones:
    • repeated extracting of relative links on error (optional):
      • only the specified repeat count;
      • supporting of a delay between repeats;
    • delayed extracting of relative links (optional):
      • reducing of a delay time by the time elapsed since the last request;
      • using of individual delays for each thread;
    • extracting links from a sitemap.xml file (optional):
      • ignoring of the error on loading of the sitemap.xml file:
        • logging of the received error;
        • returning of an empty Sitemap instead;
      • supporting of several sitemap.xml files for a single link:
        • processing of each sitemap.xml file is done in a separate goroutine;
        • supporting of an outer generator for sitemap.xml links:
          • generators:
            • simple generator (it returns the sitemap.xml file in the site root);
            • hierarchical generator (it returns the suitable sitemap.xml file for each part of the URL path);
            • generator based on the robots.txt file;
          • supporting of grouping of generators:
            • the result of group generating is the merged results of all generators in the group;
            • generating concurrently:
              • processing of each generator is done in a separate goroutine;
      • supporting of a Sitemap index file:
        • supporting of a delay before loading of each sitemap.xml file listed in the index;
      • supporting of gzip compression of a sitemap.xml file;
    • supporting of grouping of link extractors:
      • the result of group extracting is the merged results of all extractors in the group;
      • extracting links concurrently:
        • processing of each link extractor is done in a separate goroutine;
  • calling of an outer handler for each found link:
    • it's called directly during crawling;
    • handling of links immediately after they have been extracted;
    • passing of the source link to the outer handler;
    • handling links filtered by a custom link filter (optional);
    • handling links concurrently (optional);
  • custom filtering of considered links:
    • by relativity of a link (optional);
    • by uniqueness of an extracted link (optional):
      • supporting of sanitizing of a link before checking of uniqueness (optional);
    • by a robots.txt file (optional):
      • customized user agent;
    • supporting of grouping of link filters:
      • the result of group filtering is successful only when all filters are successful;
  • parallelization possibilities:
    • crawling of relative links in parallel;
    • supporting of background working:
      • automatic completion after processing all filtered links;
    • simulation of an unbounded channel of links to avoid a deadlock.

v1.8

19 Jun 15:04


Supporting of grouping of sitemap.xml link generators and adding of the hierarchical generator and the generator based on the robots.txt file.
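
The generator based on the robots.txt file presumably collects the Sitemap: directives of that file; a minimal stand-alone sketch of the idea, not the actual implementation:

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// sitemapLinksFromRobotsTXT collects the links from the "Sitemap:" lines
// of a robots.txt file (a real parser would match the field name
// case-insensitively).
func sitemapLinksFromRobotsTXT(robotsTXT string) []string {
	var links []string
	scanner := bufio.NewScanner(strings.NewReader(robotsTXT))
	for scanner.Scan() {
		line := strings.TrimSpace(scanner.Text())
		if strings.HasPrefix(line, "Sitemap:") {
			link := strings.TrimSpace(strings.TrimPrefix(line, "Sitemap:"))
			links = append(links, link)
		}
	}
	return links
}

func main() {
	const robotsTXT = `User-agent: *
Disallow: /private/
Sitemap: http://example.com/sitemap.xml
Sitemap: http://example.com/news/sitemap.xml`

	fmt.Println(sitemapLinksFromRobotsTXT(robotsTXT))
}
```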

Change Log

  • crawling of all relative links for specified ones:
    • extracting links from a sitemap.xml file (optional):
      • supporting of several sitemap.xml files for a single link:
        • supporting of an outer generator for sitemap.xml links:
          • generators:
            • simple generator (it returns the sitemap.xml file in the site root);
            • hierarchical generator (it returns the suitable sitemap.xml file for each part of the URL path);
            • generator based on the robots.txt file;
          • supporting of grouping of generators:
            • the result of group generating is the merged results of all generators in the group;
            • generating concurrently:
              • processing of each generator is done in a separate goroutine.

Features

  • crawling of all relative links for specified ones:
    • repeated extracting of relative links on error (optional):
      • only the specified repeat count;
      • supporting of a delay between repeats;
    • delayed extracting of relative links (optional):
      • reducing of a delay time by the time elapsed since the last request;
      • using of individual delays for each thread;
    • extracting links from a sitemap.xml file (optional):
      • ignoring of the error on loading of the sitemap.xml file:
        • logging of the received error;
        • returning of an empty Sitemap instead;
      • supporting of several sitemap.xml files for a single link:
        • processing of each sitemap.xml file is done in a separate goroutine;
        • supporting of an outer generator for sitemap.xml links:
          • generators:
            • simple generator (it returns the sitemap.xml file in the site root);
            • hierarchical generator (it returns the suitable sitemap.xml file for each part of the URL path);
            • generator based on the robots.txt file;
          • supporting of grouping of generators:
            • the result of group generating is the merged results of all generators in the group;
            • generating concurrently:
              • processing of each generator is done in a separate goroutine;
      • supporting of a Sitemap index file:
        • supporting of a delay before loading of each sitemap.xml file listed in the index;
    • supporting of grouping of link extractors:
      • the result of group extracting is the merged results of all extractors in the group;
      • extracting links concurrently:
        • processing of each link extractor is done in a separate goroutine;
  • calling of an outer handler for each found link:
    • it's called directly during crawling;
    • handling of links immediately after they have been extracted;
    • passing of the source link to the outer handler;
    • handling links filtered by a custom link filter (optional);
    • handling links concurrently (optional);
  • custom filtering of considered links:
    • by relativity of a link (optional);
    • by uniqueness of an extracted link (optional):
      • supporting of sanitizing of a link before checking of uniqueness (optional);
    • by a robots.txt file (optional):
      • customized user agent;
    • supporting of grouping of link filters:
      • the result of group filtering is successful only when all filters are successful;
  • parallelization possibilities:
    • crawling of relative links in parallel;
    • supporting of background working:
      • automatic completion after processing all filtered links;
    • simulation of an unbounded channel of links to avoid a deadlock.

v1.7.1

29 May 21:58


Optimizing of the processing of the sitemap.xml files and ignoring of the errors that occur during it.
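
The error-ignoring behavior — log the loading error and fall back to an empty Sitemap so one broken file does not stop the crawling — can be sketched as follows; all names are hypothetical stand-ins, not the library's actual declarations:

```go
package main

import (
	"errors"
	"fmt"
	"log"
)

// sitemap is a minimal stand-in for a parsed sitemap.xml file.
type sitemap struct {
	Links []string
}

// loadSitemap pretends to fetch and parse a sitemap.xml file; here it
// always fails so that the fallback path is visible.
func loadSitemap(link string) (sitemap, error) {
	return sitemap{}, errors.New("unable to load the sitemap: " + link)
}

// loadSitemapIgnoringError logs the loading error and returns an empty
// Sitemap instead of propagating the failure.
func loadSitemapIgnoringError(link string) sitemap {
	loaded, err := loadSitemap(link)
	if err != nil {
		log.Print(err)
		return sitemap{} // the empty Sitemap
	}
	return loaded
}

func main() {
	fmt.Println(loadSitemapIgnoringError("http://example.com/sitemap.xml"))
}
```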

Change Log

  • crawling of all relative links for specified ones:
    • extracting links from a sitemap.xml file (optional):
      • ignoring of the error on loading of the sitemap.xml file:
        • logging of the received error;
        • returning of an empty Sitemap instead;
      • supporting of several sitemap.xml files for a single link:
        • processing of each sitemap.xml file is done in a separate goroutine;
    • supporting of grouping of link extractors:
      • extracting links concurrently:
        • processing of each link extractor is done in a separate goroutine.

Features

  • crawling of all relative links for specified ones:
    • repeated extracting of relative links on error (optional):
      • only the specified repeat count;
      • supporting of a delay between repeats;
    • delayed extracting of relative links (optional):
      • reducing of a delay time by the time elapsed since the last request;
      • using of individual delays for each thread;
    • extracting links from a sitemap.xml file (optional):
      • ignoring of the error on loading of the sitemap.xml file:
        • logging of the received error;
        • returning of an empty Sitemap instead;
      • supporting of several sitemap.xml files for a single link:
        • processing of each sitemap.xml file is done in a separate goroutine;
      • supporting of a Sitemap index file:
        • supporting of a delay before loading of each sitemap.xml file listed in the index;
    • supporting of grouping of link extractors:
      • the result of group extracting is the merged results of all extractors in the group;
      • extracting links concurrently:
        • processing of each link extractor is done in a separate goroutine;
  • calling of an outer handler for each found link:
    • it's called directly during crawling;
    • handling of links immediately after they have been extracted;
    • passing of the source link to the outer handler;
    • handling links filtered by a custom link filter (optional);
    • handling links concurrently (optional);
  • custom filtering of considered links:
    • by relativity of a link (optional);
    • by uniqueness of an extracted link (optional):
      • supporting of sanitizing of a link before checking of uniqueness (optional);
    • by a robots.txt file (optional):
      • customized user agent;
    • supporting of grouping of link filters:
      • the result of group filtering is successful only when all filters are successful;
  • parallelization possibilities:
    • crawling of relative links in parallel;
    • supporting of background working:
      • automatic completion after processing all filtered links;
    • simulation of an unbounded channel of links to avoid a deadlock.

v1.7

06 May 00:58


Supporting of grouping of link extractors and extracting links from a sitemap.xml file (optional).
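
A Sitemap index file is an XML document that lists further sitemap.xml files; decoding it with encoding/xml can be sketched like this (the sitemapIndex type is a hypothetical stand-in, not the library's actual model):

```go
package main

import (
	"encoding/xml"
	"fmt"
)

// sitemapIndex mirrors the <sitemapindex> layout that lists further
// sitemap.xml files.
type sitemapIndex struct {
	Sitemaps []struct {
		Loc string `xml:"loc"`
	} `xml:"sitemap"`
}

func main() {
	const index = `<sitemapindex>
	  <sitemap><loc>http://example.com/sitemap-posts.xml</loc></sitemap>
	  <sitemap><loc>http://example.com/sitemap-pages.xml</loc></sitemap>
	</sitemapindex>`

	var decoded sitemapIndex
	if err := xml.Unmarshal([]byte(index), &decoded); err != nil {
		panic(err)
	}
	// each listed sitemap.xml file would then be loaded in turn,
	// optionally with a delay before each request (as the notes describe)
	for _, entry := range decoded.Sitemaps {
		fmt.Println(entry.Loc)
	}
}
```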

Change Log

  • crawling of all relative links for specified ones:
    • extracting links from a sitemap.xml file (optional):
      • supporting of several sitemap.xml files for a single link;
      • supporting of a Sitemap index file;
      • supporting of a delay before loading of a specific sitemap.xml file;
    • supporting of grouping of link extractors:
      • the result of group extracting is the merged results of all extractors in the group.

Features

  • crawling of all relative links for specified ones:
    • repeated extracting of relative links on error (optional):
      • only the specified repeat count;
      • supporting of a delay between repeats;
    • delayed extracting of relative links (optional):
      • reducing of a delay time by the time elapsed since the last request;
      • using of individual delays for each thread;
    • extracting links from a sitemap.xml file (optional):
      • supporting of several sitemap.xml files for a single link;
      • supporting of a Sitemap index file;
      • supporting of a delay before loading of a specific sitemap.xml file;
    • supporting of grouping of link extractors:
      • the result of group extracting is the merged results of all extractors in the group;
  • calling of an outer handler for each found link:
    • it's called directly during crawling;
    • handling of links immediately after they have been extracted;
    • passing of the source link to the outer handler;
    • handling links filtered by a custom link filter (optional);
    • handling links concurrently (optional);
  • custom filtering of considered links:
    • by relativity of a link (optional);
    • by uniqueness of an extracted link (optional):
      • supporting of sanitizing of a link before checking of uniqueness (optional);
    • by a robots.txt file (optional):
      • customized user agent;
    • supporting of grouping of link filters:
      • the result of group filtering is successful only when all filters are successful;
  • parallelization possibilities:
    • crawling of relative links in parallel;
    • supporting of background working:
      • automatic completion after processing all filtered links;
    • simulation of an unbounded channel of links to avoid a deadlock.