
Releases: thewizardplusplus/go-crawler

v1.11.2

14 Nov 02:04


Simplify the handlers.ConcurrentHandler structure via the github.com/thewizardplusplus/go-sync-utils package.
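
The notes do not show the resulting structure, but the goroutine-pool pattern behind such a concurrent handler can be sketched with the standard library alone. The `LinkHandler` interface and every other name below are hypothetical stand-ins, not the actual API of this package or of github.com/thewizardplusplus/go-sync-utils:

```go
package main

import (
	"context"
	"fmt"
	"sync"
)

// LinkHandler is a hypothetical stand-in for the handler contract.
type LinkHandler interface {
	HandleLink(ctx context.Context, link string)
}

// concurrentHandler fans incoming links out to a fixed pool of goroutines
// and can wait for the completion of all processing.
type concurrentHandler struct {
	links     chan string
	waitGroup sync.WaitGroup
}

func newConcurrentHandler(
	ctx context.Context,
	concurrency int,
	inner LinkHandler,
) *concurrentHandler {
	handler := &concurrentHandler{links: make(chan string)}
	handler.waitGroup.Add(concurrency)
	for i := 0; i < concurrency; i++ {
		go func() {
			defer handler.waitGroup.Done()
			for link := range handler.links {
				inner.HandleLink(ctx, link)
			}
		}()
	}
	return handler
}

// HandleLink enqueues a link for the pool.
func (handler *concurrentHandler) HandleLink(link string) {
	handler.links <- link
}

// Stop closes the queue and waits for the pool to drain.
func (handler *concurrentHandler) Stop() {
	close(handler.links)
	handler.waitGroup.Wait()
}

type printingHandler struct{}

func (printingHandler) HandleLink(_ context.Context, link string) {
	fmt.Println("handled:", link)
}

func main() {
	handler := newConcurrentHandler(context.Background(), 4, printingHandler{})
	handler.HandleLink("http://example.com/a")
	handler.HandleLink("http://example.com/b")
	handler.Stop()
}
```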

Change Log

Features

  • crawling of all relative links for specified ones:
    • names of tags and attributes of links may be configured;
    • supporting of an outer transformer for the extracted links (optional):
      • data passed to the transformer:
        • extracted links;
        • service data of the HTTP response;
        • content of the HTTP response as bytes;
      • transformers:
        • leading and trailing spaces trimming in the extracted links;
        • resolving of relative links:
          • by the base tag:
            • tag and attribute names may be configured (<base href="..." /> by default);
            • tag selection:
              • first occurrence;
              • last occurrence;
          • by the header list:
            • the headers are listed in descending order of priority;
            • Content-Base and Content-Location by default;
          • by the request URI;
      • supporting of grouping of transformers:
        • the transformers are processed sequentially, so one transformer can influence another one;
    • supporting of leading and trailing spaces trimming in extracted links (optional):
      • as the transformer for the extracted links (see above);
      • as the wrapper for a link extractor;
    • repeated extracting of relative links on error (optional):
      • only the specified repeat count;
      • supporting of a delay between repeats;
    • delayed extracting of relative links (optional):
      • reducing of a delay time by the time elapsed since the last request;
      • using of individual delays for each thread;
    • extracting links from a sitemap.xml file (optional):
      • in-memory caching of the loaded sitemap.xml files;
      • ignoring of the error on loading of the sitemap.xml file:
        • logging of the received error;
        • returning of the empty Sitemap instead;
      • supporting of several sitemap.xml files for a single link:
        • processing of each sitemap.xml file is done in a separate goroutine;
        • supporting of an outer generator for the sitemap.xml links:
          • generators:
            • hierarchical generator:
              • returns the suitable sitemap.xml file for each part of the URL path;
              • supporting of sanitizing of the base link before generating of the sitemap.xml links;
              • supporting of the restriction of the maximal depth;
            • generator based on the robots.txt file;
          • supporting of grouping of generators:
            • the result of group generating is the merged results of all generators in the group;
            • processing of each generator is done in a separate goroutine;
      • supporting of a Sitemap index file:
        • supporting of a delay before loading of each sitemap.xml file listed in the index;
      • supporting of gzip compression of a sitemap.xml file;
    • supporting of grouping of link extractors:
      • the result of group extracting is the merged results of all link extractors in the group;
      • processing of each link extractor is done in a separate goroutine;
  • calling of an outer handler for each extracted link:
    • handling of the extracted links directly during the crawling, i.e., immediately after they have been extracted;
    • data passed to the handler:
      • extracted link;
      • source link for the extracted link;
    • handling only of those extracted links that have been filtered by a link filter (see below; optional);
    • handling of the extracted links concurrently, i.e., in the goroutine pool (optional);
    • supporting of grouping of handlers:
      • processing of each handler is done in a separate goroutine;
  • filtering of the extracted links by an outer link filter:
    • by relativity of the extracted link (optional):
      • supporting of result inverting;
    • by uniqueness of the extracted link (optional):
      • supporting of sanitizing of the link before checking of uniqueness;
    • by a robots.txt file (optional):
      • customized user agent;
      • in-memory caching of the loaded robots.txt files;
    • supporting of grouping of link filters:
      • the link filters are processed sequentially, so one link filter can influence another one;
      • the result of group filtering is successful only when all link filters are successful;
      • an empty group of link filters always fails;
  • parallelization possibilities:
    • crawling of relative links concurrently, i.e., in the goroutine pool;
    • simulation of an unbounded channel of links to avoid a deadlock (see the sketch below);
    • waiting for the completion of processing of all extracted links;
    • supporting of stopping all operations via the context.
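
The unbounded-channel simulation referenced above is the classic buffering-goroutine pattern: pending links are kept in a slice between an input and an output channel, so senders never stay blocked and the crawler cannot deadlock on its own feedback loop. A minimal self-contained sketch, not the crawler's actual implementation:

```go
package main

import "fmt"

// unboundedLinks simulates an unbounded channel of links: a buffering
// goroutine shuttles links from the input channel into a slice and from
// the slice into the output channel.
func unboundedLinks() (chan<- string, <-chan string) {
	in := make(chan string)
	out := make(chan string)
	go func() {
		defer close(out)

		var input <-chan string = in
		var queue []string
		for input != nil || len(queue) > 0 {
			// offer an output only when the queue is non-empty;
			// sending to a nil channel blocks, which disables that case
			var head string
			var output chan string
			if len(queue) > 0 {
				head, output = queue[0], out
			}
			select {
			case link, ok := <-input:
				if !ok {
					input = nil // input closed; keep draining the queue
					continue
				}
				queue = append(queue, link)
			case output <- head:
				queue = queue[1:]
			}
		}
	}()
	return in, out
}

func main() {
	in, out := unboundedLinks()
	for i := 0; i < 3; i++ {
		in <- fmt.Sprintf("http://example.com/page-%d", i)
	}
	close(in)
	for link := range out {
		fmt.Println(link)
	}
}
```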

v1.11.1

15 Oct 23:46


Supporting of an outer transformer for the extracted links; waiting for the completion of the processing in the handlers.ConcurrentHandler structure; replacing of error producing with logging in the transformers.ResolvingTransformer.TransformLinks() method; adding of the Name field to the extractors.ExtractorGroup structure for use in the log messages as a prefix; using of relative link resolving in the examples; adding of the example with all the features; simplifying of the examples; completing of the documentation.
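
A minimal sketch of the transformer contract this release describes: a transformer receives the extracted links together with the HTTP response data and returns the transformed links, and a group applies its transformers sequentially, so one transformer can influence the next. The interface and type names are hypothetical stand-ins, not the actual signatures of the transformers package:

```go
package main

import (
	"fmt"
	"net/http"
	"strings"
)

// LinkTransformer is a hypothetical stand-in for the transformer contract.
type LinkTransformer interface {
	TransformLinks(
		links []string,
		response *http.Response,
		responseContent []byte,
	) ([]string, error)
}

// trimmingTransformer trims leading and trailing spaces in every link.
type trimmingTransformer struct{}

func (trimmingTransformer) TransformLinks(
	links []string, _ *http.Response, _ []byte,
) ([]string, error) {
	trimmed := make([]string, 0, len(links))
	for _, link := range links {
		trimmed = append(trimmed, strings.TrimSpace(link))
	}
	return trimmed, nil
}

// transformerGroup applies its transformers sequentially, feeding the
// output of one transformer into the next one.
type transformerGroup []LinkTransformer

func (group transformerGroup) TransformLinks(
	links []string, response *http.Response, content []byte,
) ([]string, error) {
	var err error
	for _, transformer := range group {
		links, err = transformer.TransformLinks(links, response, content)
		if err != nil {
			return nil, err
		}
	}
	return links, nil
}

func main() {
	group := transformerGroup{trimmingTransformer{}}
	links, _ := group.TransformLinks([]string{"  /page  "}, nil, nil)
	fmt.Println(links) // [/page]
}
```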

Change Log

  • crawling of all relative links for specified ones:
    • supporting of an outer transformer for the extracted links (optional):
      • data passed to the transformer:
        • extracted links;
        • service data of the HTTP response;
        • content of the HTTP response as bytes;
      • transformers:
        • leading and trailing spaces trimming in the extracted links;
        • resolving of relative links:
          • by the base tag:
            • tag and attribute names may be configured (<base href="..." /> by default);
            • tag selection:
              • first occurrence;
              • last occurrence;
          • by the header list:
            • the headers are listed in descending order of priority;
            • Content-Base and Content-Location by default;
          • by the request URI;
      • supporting of grouping of transformers:
        • the transformers are processed sequentially, so one transformer can influence another one;
  • minor improvements:
    • rename the transformers.BaseTagFilters variable to DefaultBaseTagFilters;
    • add waiting for the completion of the processing in the handlers.ConcurrentHandler structure;
    • error handling:
      • improve the error handling in the sitemap.HierarchicalGenerator.ExtractLinks() method;
      • simplify the error handling in the extractors.TrimmingExtractor.ExtractLinks() method;
      • replace error producing with logging in the transformers.ResolvingTransformer.TransformLinks() method;
    • logging:
      • move the logging from the registers.LinkRegister structure to the checkers.DuplicateChecker structure:
        • return the error instead of logging in the registers.LinkRegister structure;
        • add the logging to the checkers.DuplicateChecker structure;
      • improve the logging in the checkers.HostChecker.CheckLink() method;
      • add the Name field to the extractors.ExtractorGroup structure:
        • use it in the log messages as a prefix (optional);
    • refactoring:
      • use the transformers.TrimmingTransformer structure in the extractors.TrimmingExtractor.ExtractLinks() method;
      • simplify the extractors.DelayingExtractor.ExtractLinks() method;
      • add the explanatory comment to the extractors.DelayingExtractor.ExtractLinks() method;
      • use the builders.FlattenBuilder structure from the github.com/thewizardplusplus/go-html-selector package in the transformers.BaseTagBuilder structure;
    • unit testing:
      • complete the tests of the transformers.ResolvingTransformer.TransformLinks() method;
      • fix the tests of the transformers.BaseTagBuilder.IsSelectionTerminated() method;
  • examples:
    • use the relative link resolving;
    • add the explanatory comment to the example with the processing of a sitemap.xml file;
    • add the example with all the features;
    • simplify the examples:
      • simplify the renderTemplate() function;
      • remove the use:
        • of the extractors.RepeatingExtractor structure;
        • of the extractors.TrimmingExtractor structure;
      • remove the example:
        • with the delaying extracting;
        • with the processing of a robots.txt file on the handling;
        • with the crawler.CrawlByConcurrentHandler() function;
        • with the crawler.HandleLinksConcurrently() function;
  • documentation:
    • complete the README.md file:
      • describe the bibliography;
      • complete the description of the features.

Features

  • crawling of all relative links for specified ones:
    • names of tags and attributes of links may be configured;
    • supporting of an outer transformer for the extracted links (optional):
      • data passed to the transformer:
        • extracted links;
        • service data of the HTTP response;
        • content of the HTTP response as bytes;
      • transformers:
        • leading and trailing spaces trimming in the extracted links;
        • resolving of relative links:
          • by the base tag:
            • tag and attribute names may be configured (<base href="..." /> by default);
            • tag selection:
              • first occurrence;
              • last occurrence;
          • by the header list:
            • the headers are listed in descending order of priority;
            • Content-Base and Content-Location by default;
          • by the request URI;
      • supporting of grouping of transformers:
        • the transformers are processed sequentially, so one transformer can influence another one;
    • supporting of leading and trailing spaces trimming in extracted links (optional):
      • as the transformer for the extracted links (see above);
      • as the wrapper for a link extractor;
    • repeated extracting of relative links on error (optional):
      • only the specified repeat count;
      • supporting of a delay between repeats;
    • delayed extracting of relative links (optional):
      • reducing of a delay time by the time elapsed since the last request;
      • using of individual delays for each thread;
    • extracting links from a sitemap.xml file (optional):
      • in-memory caching of the loaded sitemap.xml files;
      • ignoring of the error on loading of the sitemap.xml file:
        • logging of the received error;
        • returning of the empty Sitemap instead;
      • supporting of several sitemap.xml files for a single link:
        • processing of each sitemap.xml file is done in a separate goroutine;
        • supporting of an outer generator for the sitemap.xml links:
          • generators:
            • hierarchical generator:
              • returns the suitable sitemap.xml file for each part of the URL path;
              • supporting of sanitizing of the base link before generating of the sitemap.xml links;
              • supporting of the restriction of the maximal depth;
            • generator based on the robots.txt file;
          • supporting of grouping of generators:
            • the result of group generating is the merged results of all generators in the group;
            • processing of each generator is done in a separate goroutine;
      • supporting of a Sitemap index file:
        • supporting of a delay before loading of each sitemap.xml file listed in the index;
      • supporting of gzip compression of a sitemap.xml file;
    • supporting of grouping of link extractors:
      • the result of group extracting is the merged results of all link extractors in the group;
      • processing of each link extractor is done in a separate goroutine;
  • calling of an outer handler for each extracted link:
    • handling of the extracted links directly during the crawling, i.e., immediately after they have been extracted;
    • data passed to the handler:
      • extracted link;
      • source link for the extracted link;
    • handling only of those extracted links that have been filtered by a link filter (see below; optional);
    • handling of the extracted links concurrently, i.e., in the goroutine pool (optional);
    • supporting of grouping of handlers:
      • processing of each handler is done in a separate goroutine;
  • filtering of the extracted links by an outer link filter:
    • by relativity of the extracted link (optional):
      • supporting of result inverting;
    • by uniqueness of the extracted link (optional):
      • supporting of sanitizing of the link before checking of uniqueness;
    • by a robots.txt file (optional):
      • customized user agent;
      • in-memory caching of the loaded robots.txt files;
    • supporting of grouping of link filters:
      • the link filters are processed sequentially, so one link filter can influence another one;
      • the result of group filtering is successful only when all link filters are successful;
      • an empty group of link filters always fails;
  • parallelization possibilities:
    • crawling of relative links concurrently, i.e., in the goroutine pool;
    • simulation of an unbounded channel of links to avoid a deadlock;
    • waiting for the completion of processing of all extracted links;
    • supporting of stopping all operations via the context.

v1.11

11 Sep 23:52


Resolving of relative links.
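
The resolution order described below (the base tag, then the Content-Base and Content-Location headers, then the request URI) can be illustrated with net/url from the standard library; the resolveBase() helper is a hypothetical sketch, not the library's actual code:

```go
package main

import (
	"fmt"
	"net/http"
	"net/url"
)

// resolveBase picks the base URI in priority order: the <base /> tag if
// one was found, then the Content-Base and Content-Location headers, and
// finally the request URI.
func resolveBase(baseTagHref string, headers http.Header, requestURI *url.URL) *url.URL {
	candidates := []string{
		baseTagHref, // href of the <base /> tag, empty if absent
		headers.Get("Content-Base"),
		headers.Get("Content-Location"),
	}
	for _, candidate := range candidates {
		if candidate == "" {
			continue
		}
		// a candidate may itself be relative,
		// so resolve it against the request URI
		if base, err := requestURI.Parse(candidate); err == nil {
			return base
		}
	}
	return requestURI
}

func main() {
	requestURI, _ := url.Parse("http://example.com/posts/42")
	headers := http.Header{}
	headers.Set("Content-Base", "http://example.com/base/")

	base := resolveBase("", headers, requestURI)
	link, _ := base.Parse("./relative") // net/url does the RFC 3986 resolution
	fmt.Println(link)                   // http://example.com/base/relative
}
```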

Change Log

  • crawling of all relative links for specified ones:
    • resolving of relative links:
      • by the base tag;
      • by the Content-Base and Content-Location headers;
      • by the request URI.

Features

  • crawling of all relative links for specified ones:
    • resolving of relative links:
      • by the base tag;
      • by the Content-Base and Content-Location headers;
      • by the request URI;
    • supporting of leading and trailing spaces trimming in extracted links (optional);
    • repeated extracting of relative links on error (optional):
      • only the specified repeat count;
      • supporting of a delay between repeats;
    • delayed extracting of relative links (optional):
      • reducing of a delay time by the time elapsed since the last request;
      • using of individual delays for each thread;
    • extracting links from a sitemap.xml file (optional):
      • ignoring of the error on loading of the sitemap.xml file:
        • logging of the received error;
        • returning of an empty Sitemap instead;
      • supporting of several sitemap.xml files for a single link:
        • processing of each sitemap.xml file is done in a separate goroutine;
        • supporting of an outer generator for sitemap.xml links:
          • generators:
            • simple generator (it returns the sitemap.xml file in the site root);
            • hierarchical generator (it returns the suitable sitemap.xml file for each part of the URL path);
            • generator based on the robots.txt file;
          • supporting of grouping of generators:
            • the result of group generating is the merged results of all generators in the group;
            • generating concurrently:
              • processing of each generator is done in a separate goroutine;
      • supporting of a Sitemap index file:
        • supporting of a delay before loading of each sitemap.xml file listed in the index;
      • supporting of gzip compression of a sitemap.xml file;
    • supporting of grouping of link extractors:
      • the result of group extracting is the merged results of all extractors in the group;
      • extracting links concurrently:
        • processing of each link extractor is done in a separate goroutine;
  • calling of an outer handler for each found link:
    • it's called directly during crawling;
    • handling of links immediately after they have been extracted;
    • passing of the source link to the outer handler;
    • handling links filtered by a custom link filter (optional);
    • handling links concurrently (optional);
    • supporting of grouping of outer handlers:
      • processing of each outer handler is done in a separate goroutine;
  • custom filtering of considered links:
    • by relativity of a link (optional):
      • supporting of result inverting;
    • by uniqueness of an extracted link (optional):
      • supporting of sanitizing of a link before checking of uniqueness (optional);
    • by a robots.txt file (optional):
      • customized user agent;
    • supporting of grouping of link filters:
      • the result of group filtering is successful only when all filters are successful;
  • parallelization possibilities:
    • crawling of relative links in parallel;
    • supporting of background working:
      • automatic completion after processing all filtered links;
    • simulation of an unbounded channel of links to avoid a deadlock.

v1.10.1

18 Jul 09:01


Perform the refactoring: add the extractors.TrimmingExtractor structure and ignore errors from each extractor in the group, instead of logging them.
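
The idea behind extractors.TrimmingExtractor — a wrapper that trims the links returned by an inner extractor — can be sketched as follows; the LinkExtractor interface and both types here are hypothetical stand-ins, not the package's actual declarations:

```go
package main

import (
	"context"
	"fmt"
	"strings"
)

// LinkExtractor is a hypothetical stand-in for the extractor contract.
type LinkExtractor interface {
	ExtractLinks(ctx context.Context, link string) ([]string, error)
}

// trimmingExtractor wraps another extractor and trims leading and
// trailing spaces in every extracted link.
type trimmingExtractor struct {
	inner LinkExtractor
}

func (extractor trimmingExtractor) ExtractLinks(
	ctx context.Context, link string,
) ([]string, error) {
	links, err := extractor.inner.ExtractLinks(ctx, link)
	if err != nil {
		return nil, err
	}
	for index, extractedLink := range links {
		links[index] = strings.TrimSpace(extractedLink)
	}
	return links, nil
}

// staticExtractor returns a fixed link list; it stands in for a real one.
type staticExtractor []string

func (extractor staticExtractor) ExtractLinks(context.Context, string) ([]string, error) {
	return append([]string(nil), extractor...), nil
}

func main() {
	extractor := trimmingExtractor{inner: staticExtractor{"  /a ", "/b"}}
	links, _ := extractor.ExtractLinks(context.Background(), "http://example.com/")
	fmt.Println(links) // [/a /b]
}
```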

Change Log

  • perform the refactoring:
    • link trimming:
      • add the extractors.TrimmingExtractor structure;
      • remove the link trimming from the extractors.DefaultExtractor structure;
    • fix the extractors.ExtractorGroup structure:
      • ignore errors from each extractor in the group, instead of logging them;
    • add the registers.BasicRegister structure:
      • use in the registers.RobotsTXTRegister structure;
      • use in the registers.SitemapRegister structure;
  • fix the bugs:
    • fix the tests of the checkers.HostChecker.CheckLink() method.

Features

  • crawling of all relative links for specified ones:
    • supporting of leading and trailing spaces trimming in extracted links (optional);
    • repeated extracting of relative links on error (optional):
      • only the specified repeat count;
      • supporting of a delay between repeats;
    • delayed extracting of relative links (optional):
      • reducing of a delay time by the time elapsed since the last request;
      • using of individual delays for each thread;
    • extracting links from a sitemap.xml file (optional):
      • ignoring of the error on loading of the sitemap.xml file:
        • logging of the received error;
        • returning of an empty Sitemap instead;
      • supporting of several sitemap.xml files for a single link:
        • processing of each sitemap.xml file is done in a separate goroutine;
        • supporting of an outer generator for sitemap.xml links:
          • generators:
            • simple generator (it returns the sitemap.xml file in the site root);
            • hierarchical generator (it returns the suitable sitemap.xml file for each part of the URL path);
            • generator based on the robots.txt file;
          • supporting of grouping of generators:
            • the result of group generating is the merged results of all generators in the group;
            • generating concurrently:
              • processing of each generator is done in a separate goroutine;
      • supporting of a Sitemap index file:
        • supporting of a delay before loading of each sitemap.xml file listed in the index;
      • supporting of gzip compression of a sitemap.xml file;
    • supporting of grouping of link extractors:
      • the result of group extracting is the merged results of all extractors in the group;
      • extracting links concurrently:
        • processing of each link extractor is done in a separate goroutine;
  • calling of an outer handler for each found link:
    • it's called directly during crawling;
    • handling of links immediately after they have been extracted;
    • passing of the source link to the outer handler;
    • handling links filtered by a custom link filter (optional);
    • handling links concurrently (optional);
    • supporting of grouping of outer handlers:
      • processing of each outer handler is done in a separate goroutine;
  • custom filtering of considered links:
    • by relativity of a link (optional):
      • supporting of result inverting;
    • by uniqueness of an extracted link (optional):
      • supporting of sanitizing of a link before checking of uniqueness (optional);
    • by a robots.txt file (optional):
      • customized user agent;
    • supporting of grouping of link filters:
      • the result of group filtering is successful only when all filters are successful;
  • parallelization possibilities:
    • crawling of relative links in parallel;
    • supporting of background working:
      • automatic completion after processing all filtered links;
    • simulation of an unbounded channel of links to avoid a deadlock.

v1.10

04 Jul 22:51


Supporting of leading and trailing spaces trimming in extracted links and supporting of grouping of outer handlers.
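
A minimal sketch of the grouping of outer handlers described here, with each handler running in its own goroutine; the LinkHandler interface and the handlerGroup type are hypothetical stand-ins, not the actual declarations of the handlers package:

```go
package main

import (
	"context"
	"fmt"
	"sync"
)

// LinkHandler is a hypothetical stand-in for the handler contract.
type LinkHandler interface {
	HandleLink(ctx context.Context, link string)
}

// handlerGroup calls every handler in its own goroutine and waits
// for all of them to finish.
type handlerGroup []LinkHandler

func (group handlerGroup) HandleLink(ctx context.Context, link string) {
	var waitGroup sync.WaitGroup
	waitGroup.Add(len(group))
	for _, handler := range group {
		go func(handler LinkHandler) {
			defer waitGroup.Done()
			handler.HandleLink(ctx, link)
		}(handler)
	}
	waitGroup.Wait()
}

type printingHandler string

func (handler printingHandler) HandleLink(_ context.Context, link string) {
	fmt.Printf("%s: %s\n", string(handler), link)
}

func main() {
	group := handlerGroup{printingHandler("first"), printingHandler("second")}
	group.HandleLink(context.Background(), "http://example.com/")
}
```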

Change Log

  • crawling of all relative links for specified ones:
    • supporting of leading and trailing spaces trimming in extracted links (optional);
  • calling of an outer handler for each found link:
    • supporting of grouping of outer handlers:
      • processing of each outer handler is done in a separate goroutine;
  • custom filtering of considered links:
    • by relativity of a link (optional):
      • supporting of result inverting;
  • extend the logging:
    • in the crawler.HandleLink() function;
    • in the extractors package:
      • in the RepeatingExtractor structure;
      • in the SitemapExtractor structure;
    • in the checkers package:
      • in the HostChecker structure;
      • in the RobotsTXTChecker structure;
    • in the registers.LinkRegister structure;
  • examples:
    • fix the output messages;
    • add the example with few handlers.

Features

  • crawling of all relative links for specified ones:
    • supporting of leading and trailing spaces trimming in extracted links (optional);
    • repeated extracting of relative links on error (optional):
      • only the specified repeat count;
      • supporting of a delay between repeats;
    • delayed extracting of relative links (optional):
      • reducing of a delay time by the time elapsed since the last request;
      • using of individual delays for each thread;
    • extracting links from a sitemap.xml file (optional):
      • ignoring of the error on loading of the sitemap.xml file:
        • logging of the received error;
        • returning of an empty Sitemap instead;
      • supporting of several sitemap.xml files for a single link:
        • processing of each sitemap.xml file is done in a separate goroutine;
        • supporting of an outer generator for sitemap.xml links:
          • generators:
            • simple generator (it returns the sitemap.xml file in the site root);
            • hierarchical generator (it returns the suitable sitemap.xml file for each part of the URL path);
            • generator based on the robots.txt file;
          • supporting of grouping of generators:
            • the result of group generating is the merged results of all generators in the group;
            • generating concurrently:
              • processing of each generator is done in a separate goroutine;
      • supporting of a Sitemap index file:
        • supporting of a delay before loading of each sitemap.xml file listed in the index;
      • supporting of gzip compression of a sitemap.xml file;
    • supporting of grouping of link extractors:
      • the result of group extracting is the merged results of all extractors in the group;
      • extracting links concurrently:
        • processing of each link extractor is done in a separate goroutine;
  • calling of an outer handler for each found link:
    • it's called directly during crawling;
    • handling of links immediately after they have been extracted;
    • passing of the source link to the outer handler;
    • handling links filtered by a custom link filter (optional);
    • handling links concurrently (optional);
    • supporting of grouping of outer handlers:
      • processing of each outer handler is done in a separate goroutine;
  • custom filtering of considered links:
    • by relativity of a link (optional):
      • supporting of result inverting;
    • by uniqueness of an extracted link (optional):
      • supporting of sanitizing of a link before checking of uniqueness (optional);
    • by a robots.txt file (optional):
      • customized user agent;
    • supporting of grouping of link filters:
      • the result of group filtering is successful only when all filters are successful;
  • parallelization possibilities:
    • crawling of relative links in parallel;
    • supporting of background working:
      • automatic completion after processing all filtered links;
    • simulation of an unbounded channel of links to avoid a deadlock.

v1.9.1

04 Jul 22:50


Perform the refactoring: replace the registers.LinkGenerator interface with models.LinkExtractor and add the urlutils.GenerateHierarchicalLinks() function.
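
The notes do not show the function's signature, but the idea behind urlutils.GenerateHierarchicalLinks() — one link per prefix of the URL path, from the site root down — can be sketched like this (a hypothetical stand-in, not the actual implementation):

```go
package main

import (
	"fmt"
	"net/url"
	"path"
	"strings"
)

// generateHierarchicalLinks yields, for every prefix of the URL path,
// a link ending with the given suffix.
func generateHierarchicalLinks(base *url.URL, suffix string) []string {
	link := *base
	link.RawQuery, link.Fragment = "", ""

	link.Path = "/" + suffix // the site root comes first
	links := []string{link.String()}

	prefix := ""
	dir := path.Dir(base.Path) // drop the last path segment
	for _, segment := range strings.Split(strings.Trim(dir, "/"), "/") {
		if segment == "" {
			continue
		}
		prefix += "/" + segment
		link.Path = prefix + "/" + suffix
		links = append(links, link.String())
	}
	return links
}

func main() {
	base, _ := url.Parse("http://example.com/one/two/page")
	for _, link := range generateHierarchicalLinks(base, "sitemap.xml") {
		fmt.Println(link)
	}
	// Output:
	// http://example.com/sitemap.xml
	// http://example.com/one/sitemap.xml
	// http://example.com/one/two/sitemap.xml
}
```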

Change Log

  • perform the refactoring:
    • move the interfaces of the models package to a separate file;
    • replace the registers.LinkGenerator interface with models.LinkExtractor:
      • replace the sitemap.GeneratorGroup type with extractors.ExtractorGroup;
    • pass a thread ID to the registers.SitemapRegister.RegisterSitemap() method;
    • rename the sanitizing package to urlutils;
    • add the urlutils.GenerateHierarchicalLinks() function:
      • use in the registers.RobotsTXTRegister structure;
      • use in the sitemap.HierarchicalGenerator structure:
        • replace the sitemap.SimpleGenerator structure with sitemap.HierarchicalGenerator;
  • simplify the examples.

Features

  • crawling of all relative links for specified ones:
    • repeated extracting of relative links on error (optional):
      • only the specified repeat count;
      • supporting of a delay between repeats;
    • delayed extracting of relative links (optional):
      • reducing of a delay time by the time elapsed since the last request;
      • using of individual delays for each thread;
    • extracting links from a sitemap.xml file (optional):
      • ignoring of the error on loading of the sitemap.xml file:
        • logging of the received error;
        • returning of an empty Sitemap instead;
      • supporting of several sitemap.xml files for a single link:
        • processing of each sitemap.xml file is done in a separate goroutine;
        • supporting of an outer generator for sitemap.xml links:
          • generators:
            • simple generator (it returns the sitemap.xml file in the site root);
            • hierarchical generator (it returns the suitable sitemap.xml file for each part of the URL path);
            • generator based on the robots.txt file;
          • supporting of grouping of generators:
            • the result of group generating is the merged results of all generators in the group;
            • generating concurrently:
              • processing of each generator is done in a separate goroutine;
      • supporting of a Sitemap index file:
        • supporting of a delay before loading of each sitemap.xml file listed in the index;
      • supporting of gzip compression of a sitemap.xml file;
    • supporting of grouping of link extractors:
      • the result of group extracting is the merged results of all extractors in the group;
      • extracting links concurrently:
        • processing of each link extractor is done in a separate goroutine;
  • calling of an outer handler for each found link:
    • it's called directly during crawling;
    • handling of links immediately after they have been extracted;
    • passing of the source link to the outer handler;
    • handling links filtered by a custom link filter (optional);
    • handling links concurrently (optional);
  • custom filtering of considered links:
    • by relativity of a link (optional);
    • by uniqueness of an extracted link (optional):
      • supporting of sanitizing of a link before checking of uniqueness (optional);
    • by a robots.txt file (optional):
      • customized user agent;
    • supporting of grouping of link filters:
      • the result of group filtering is successful only when all filters are successful;
  • parallelization possibilities:
    • crawling of relative links in parallel;
    • supporting of background working:
      • automatic completion after processing all filtered links;
    • simulation of an unbounded channel of links to avoid a deadlock.

v1.9

19 Jun 15:06


Supporting of gzip compression of a sitemap.xml file.
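
Transparent support for a gzip-compressed sitemap.xml file can be sketched with compress/gzip and encoding/xml from the standard library; decodeSitemap() below is a hypothetical illustration, not the library's actual code:

```go
package main

import (
	"bytes"
	"compress/gzip"
	"encoding/xml"
	"fmt"
	"io"
)

// urlSet mirrors the <urlset> layout of a sitemap.xml file.
type urlSet struct {
	URLs []struct {
		Loc string `xml:"loc"`
	} `xml:"url"`
}

// decodeSitemap transparently unwraps a gzip-compressed body (recognized
// by the gzip magic bytes 0x1f 0x8b) before decoding the XML.
func decodeSitemap(body []byte) (urlSet, error) {
	var reader io.Reader = bytes.NewReader(body)
	if len(body) >= 2 && body[0] == 0x1f && body[1] == 0x8b {
		gzipReader, err := gzip.NewReader(reader)
		if err != nil {
			return urlSet{}, err
		}
		defer gzipReader.Close()
		reader = gzipReader
	}

	var sitemap urlSet
	err := xml.NewDecoder(reader).Decode(&sitemap)
	return sitemap, err
}

func main() {
	const sitemap = `<urlset><url><loc>http://example.com/</loc></url></urlset>`

	// compress the sample sitemap to exercise the gzip branch
	var compressed bytes.Buffer
	gzipWriter := gzip.NewWriter(&compressed)
	gzipWriter.Write([]byte(sitemap))
	gzipWriter.Close()

	decoded, err := decodeSitemap(compressed.Bytes())
	if err != nil {
		panic(err)
	}
	fmt.Println(decoded.URLs[0].Loc) // http://example.com/
}
```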

Change Log

  • crawling of all relative links for specified ones:
    • extracting links from a sitemap.xml file (optional):
      • supporting of gzip compression of a sitemap.xml file.

Features

  • crawling of all relative links for specified ones:
    • repeated extracting of relative links on error (optional):
      • only the specified repeat count;
      • supporting of a delay between repeats;
    • delayed extracting of relative links (optional):
      • reducing of a delay time by the time elapsed since the last request;
      • using of individual delays for each thread;
    • extracting links from a sitemap.xml file (optional):
      • ignoring of the error on loading of the sitemap.xml file:
        • logging of the received error;
        • returning of an empty Sitemap instead;
      • supporting of several sitemap.xml files for a single link:
        • processing of each sitemap.xml file is done in a separate goroutine;
        • supporting of an outer generator for sitemap.xml links:
          • generators:
            • simple generator (it returns the sitemap.xml file in the site root);
            • hierarchical generator (it returns the suitable sitemap.xml file for each part of the URL path);
            • generator based on the robots.txt file;
          • supporting of grouping of generators:
            • the result of group generating is the merged results of all generators in the group;
            • generating concurrently:
              • processing of each generator is done in a separate goroutine;
      • supporting of a Sitemap index file:
        • supporting of a delay before loading of each sitemap.xml file listed in the index;
      • supporting of gzip compression of a sitemap.xml file;
    • supporting of grouping of link extractors:
      • the result of group extracting is the merged results of all extractors in the group;
      • extracting links concurrently:
        • processing of each link extractor is done in a separate goroutine;
  • calling of an outer handler for each found link:
    • it's called directly during crawling;
    • handling of links immediately after they have been extracted;
    • passing of the source link to the outer handler;
    • handling links filtered by a custom link filter (optional);
    • handling links concurrently (optional);
  • custom filtering of considered links:
    • by relativity of a link (optional);
    • by uniqueness of an extracted link (optional):
      • supporting of sanitizing of a link before checking of uniqueness (optional);
    • by a robots.txt file (optional):
      • customized user agent;
    • supporting of grouping of link filters:
      • the result of group filtering is successful only when all filters are successful;
  • parallelization possibilities:
    • crawling of relative links in parallel;
    • supporting of background working:
      • automatic completion after processing all filtered links;
    • simulation of an unbounded channel of links to avoid a deadlock.

v1.8

19 Jun 15:04


Supporting of grouping of sitemap.xml link generators and adding of the hierarchical generator and the generator based on the robots.txt file.
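
The generator based on the robots.txt file presumably collects the Sitemap: directives of that file; a minimal stand-alone sketch of the idea, not the actual implementation:

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// sitemapLinksFromRobotsTXT collects the links from the "Sitemap:" lines
// of a robots.txt file (a real parser would match the field name
// case-insensitively).
func sitemapLinksFromRobotsTXT(robotsTXT string) []string {
	var links []string
	scanner := bufio.NewScanner(strings.NewReader(robotsTXT))
	for scanner.Scan() {
		line := strings.TrimSpace(scanner.Text())
		if strings.HasPrefix(line, "Sitemap:") {
			link := strings.TrimSpace(strings.TrimPrefix(line, "Sitemap:"))
			links = append(links, link)
		}
	}
	return links
}

func main() {
	const robotsTXT = `User-agent: *
Disallow: /private/
Sitemap: http://example.com/sitemap.xml
Sitemap: http://example.com/news/sitemap.xml`

	fmt.Println(sitemapLinksFromRobotsTXT(robotsTXT))
}
```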

Change Log

  • crawling of all relative links for specified ones:
    • extracting links from a sitemap.xml file (optional):
      • supporting of several sitemap.xml files for a single link:
        • supporting of an outer generator for sitemap.xml links:
          • generators:
            • simple generator (it returns the sitemap.xml file in the site root);
            • hierarchical generator (it returns the suitable sitemap.xml file for each part of the URL path);
            • generator based on the robots.txt file;
          • supporting of grouping of generators:
            • the result of group generating is the merged results of all generators in the group;
            • generating concurrently:
              • processing of each generator is done in a separate goroutine.

Features

  • crawling of all relative links for specified ones:
    • repeated extracting of relative links on error (optional):
      • only the specified repeat count;
      • supporting of a delay between repeats;
    • delayed extracting of relative links (optional):
      • reducing of a delay time by the time elapsed since the last request;
      • using of individual delays for each thread;
    • extracting links from a sitemap.xml file (optional):
      • ignoring of the error on loading of the sitemap.xml file:
        • logging of the received error;
        • returning of an empty Sitemap instead;
      • supporting of several sitemap.xml files for a single link:
        • processing of each sitemap.xml file is done in a separate goroutine;
        • supporting of an outer generator for sitemap.xml links:
          • generators:
            • simple generator (it returns the sitemap.xml file in the site root);
            • hierarchical generator (it returns the suitable sitemap.xml file for each part of the URL path);
            • generator based on the robots.txt file;
          • supporting of grouping of generators:
            • the result of group generating is the merged results of all generators in the group;
            • generating concurrently:
              • processing of each generator is done in a separate goroutine;
      • supporting of a Sitemap index file:
        • supporting of a delay before loading of each sitemap.xml file listed in the index;
    • supporting of grouping of link extractors:
      • the result of group extracting is the merged results of all extractors in the group;
      • extracting links concurrently:
        • processing of each link extractor is done in a separate goroutine;
  • calling of an outer handler for each found link:
    • it's called directly during crawling;
    • handling of links immediately after they have been extracted;
    • passing of the source link to the outer handler;
    • handling links filtered by a custom link filter (optional);
    • handling links concurrently (optional);
  • custom filtering of considered links:
    • by relativity of a link (optional);
    • by uniqueness of an extracted link (optional):
      • supporting of sanitizing of a link before checking of uniqueness (optional);
    • by a robots.txt file (optional):
      • customized user agent;
    • supporting of grouping of link filters:
      • the result of group filtering is successful only when all filters are successful;
  • parallelization possibilities:
    • crawling of relative links in parallel;
    • supporting of background working:
      • automatic completion after processing all filtered links;
    • simulation of an unbounded channel of links to avoid a deadlock.

v1.7.1

29 May 21:58


Optimizing of the processing of the sitemap.xml files and ignoring of the errors that occur during it.
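
The error-ignoring behavior — log the loading error and fall back to an empty Sitemap so one broken file does not stop the crawling — can be sketched as follows; all names are hypothetical stand-ins, not the library's actual declarations:

```go
package main

import (
	"errors"
	"fmt"
	"log"
)

// sitemap is a minimal stand-in for a parsed sitemap.xml file.
type sitemap struct {
	Links []string
}

// loadSitemap pretends to fetch and parse a sitemap.xml file; here it
// always fails so that the fallback path is visible.
func loadSitemap(link string) (sitemap, error) {
	return sitemap{}, errors.New("unable to load the sitemap: " + link)
}

// loadSitemapIgnoringError logs the loading error and returns an empty
// Sitemap instead of propagating the failure.
func loadSitemapIgnoringError(link string) sitemap {
	loaded, err := loadSitemap(link)
	if err != nil {
		log.Print(err)
		return sitemap{} // the empty Sitemap
	}
	return loaded
}

func main() {
	fmt.Println(loadSitemapIgnoringError("http://example.com/sitemap.xml"))
}
```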

Change Log

  • crawling of all relative links for specified ones:
    • extracting links from a sitemap.xml file (optional):
      • ignoring of the error on loading of the sitemap.xml file:
        • logging of the received error;
        • returning of an empty Sitemap instead;
      • supporting of several sitemap.xml files for a single link:
        • processing of each sitemap.xml file is done in a separate goroutine;
    • supporting of grouping of link extractors:
      • extracting links concurrently:
        • processing of each link extractor is done in a separate goroutine.

Features

  • crawling of all relative links for specified ones:
    • repeated extracting of relative links on error (optional):
      • only the specified repeat count;
      • supporting of a delay between repeats;
    • delayed extracting of relative links (optional):
      • reducing of a delay time by the time elapsed since the last request;
      • using of individual delays for each thread;
    • extracting links from a sitemap.xml file (optional):
      • ignoring of the error on loading of the sitemap.xml file:
        • logging of the received error;
        • returning of an empty Sitemap instead;
      • supporting of several sitemap.xml files for a single link:
        • processing of each sitemap.xml file is done in a separate goroutine;
      • supporting of a Sitemap index file:
        • supporting of a delay before loading of each sitemap.xml file listed in the index;
    • supporting of grouping of link extractors:
      • the result of group extracting is the merged results of all extractors in the group;
      • extracting links concurrently:
        • processing of each link extractor is done in a separate goroutine;
  • calling of an outer handler for each found link:
    • it's called directly during crawling;
    • handling of links immediately after they have been extracted;
    • passing of the source link to the outer handler;
    • handling links filtered by a custom link filter (optional);
    • handling links concurrently (optional);
  • custom filtering of considered links:
    • by relativity of a link (optional);
    • by uniqueness of an extracted link (optional):
      • supporting of sanitizing of a link before checking of uniqueness (optional);
    • by a robots.txt file (optional):
      • customized user agent;
    • supporting of grouping of link filters:
      • the result of group filtering is successful only when all filters are successful;
  • parallelization possibilities:
    • crawling of relative links in parallel;
    • supporting of background working:
      • automatic completion after processing all filtered links;
    • simulation of an unbounded channel of links to avoid a deadlock.

v1.7

06 May 00:58


Supporting of grouping of link extractors and extracting links from a sitemap.xml file (optional).
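
A Sitemap index file is an XML document that lists further sitemap.xml files; decoding it with encoding/xml can be sketched like this (the sitemapIndex type is a hypothetical stand-in, not the library's actual model):

```go
package main

import (
	"encoding/xml"
	"fmt"
)

// sitemapIndex mirrors the <sitemapindex> layout that lists further
// sitemap.xml files.
type sitemapIndex struct {
	Sitemaps []struct {
		Loc string `xml:"loc"`
	} `xml:"sitemap"`
}

func main() {
	const index = `<sitemapindex>
	  <sitemap><loc>http://example.com/sitemap-posts.xml</loc></sitemap>
	  <sitemap><loc>http://example.com/sitemap-pages.xml</loc></sitemap>
	</sitemapindex>`

	var decoded sitemapIndex
	if err := xml.Unmarshal([]byte(index), &decoded); err != nil {
		panic(err)
	}
	// each listed sitemap.xml file would then be loaded in turn,
	// optionally with a delay before each request (as the notes describe)
	for _, entry := range decoded.Sitemaps {
		fmt.Println(entry.Loc)
	}
}
```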

Change Log

  • crawling of all relative links for specified ones:
    • extracting links from a sitemap.xml file (optional):
      • supporting of several sitemap.xml files for a single link;
      • supporting of a Sitemap index file;
      • supporting of a delay before loading of a specific sitemap.xml file;
    • supporting of grouping of link extractors:
      • the result of group extracting is the merged results of all extractors in the group.

Features

  • crawling of all relative links for specified ones:
    • repeated extracting of relative links on error (optional):
      • only the specified repeat count;
      • supporting of a delay between repeats;
    • delayed extracting of relative links (optional):
      • reducing of a delay time by the time elapsed since the last request;
      • using of individual delays for each thread;
    • extracting links from a sitemap.xml file (optional):
      • supporting of several sitemap.xml files for a single link;
      • supporting of a Sitemap index file;
      • supporting of a delay before loading of a specific sitemap.xml file;
    • supporting of grouping of link extractors:
      • the result of group extracting is the merged results of all extractors in the group;
  • calling of an outer handler for each found link:
    • it's called directly during crawling;
    • handling of links immediately after they have been extracted;
    • passing of the source link to the outer handler;
    • handling links filtered by a custom link filter (optional);
    • handling links concurrently (optional);
  • custom filtering of considered links:
    • by relativity of a link (optional);
    • by uniqueness of an extracted link (optional):
      • supporting of sanitizing of a link before checking of uniqueness (optional);
    • by a robots.txt file (optional):
      • customized user agent;
    • supporting of grouping of link filters:
      • the result of group filtering is successful only when all filters are successful;
  • parallelization possibilities:
    • crawling of relative links in parallel;
    • supporting of background working:
      • automatic completion after processing all filtered links;
    • simulation of an unbounded channel of links to avoid a deadlock.