Releases: thewizardplusplus/go-crawler
v1.11.2
Simplify the handlers.ConcurrentHandler structure via the github.com/thewizardplusplus/go-sync-utils package.
Change Log
- refactoring:
- update the github.com/thewizardplusplus/go-sync-utils package in the dependencies;
- simplify the `handlers.ConcurrentHandler` structure via the github.com/thewizardplusplus/go-sync-utils package (see the sketch below).
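The gist of this release, as a minimal hand-rolled sketch: the real `handlers.ConcurrentHandler` delegates the worker pool and waiting logic to the github.com/thewizardplusplus/go-sync-utils package, so the `link` type, names, and signatures below are illustrative only.

```go
package main

import (
	"fmt"
	"sync"
)

// link pairs a source page with a link extracted from it, mirroring the
// data that the library passes to handlers.
type link struct{ source, extracted string }

// concurrentHandler fans incoming links out to a fixed pool of worker
// goroutines; Stop blocks until every queued link has been handled.
type concurrentHandler struct {
	links   chan link
	workers sync.WaitGroup
}

func newConcurrentHandler(concurrency int, handle func(link)) *concurrentHandler {
	h := &concurrentHandler{links: make(chan link)}
	h.workers.Add(concurrency)
	for i := 0; i < concurrency; i++ {
		go func() {
			defer h.workers.Done()
			for l := range h.links {
				handle(l)
			}
		}()
	}
	return h
}

func (h *concurrentHandler) HandleLink(l link) { h.links <- l }

// Stop closes the queue and waits for the workers to drain it.
func (h *concurrentHandler) Stop() {
	close(h.links)
	h.workers.Wait()
}

func main() {
	h := newConcurrentHandler(4, func(l link) {
		fmt.Printf("handled %s (extracted from %s)\n", l.extracted, l.source)
	})
	h.HandleLink(link{source: "http://example.com/", extracted: "http://example.com/a"})
	h.Stop()
}
```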
Features
- crawling of all relative links for specified ones:
- names of tags and attributes of links may be configured;
- supporting of an outer transformer for the extracted links (optional):
- data passed to the transformer:
- extracted links;
- service data of the HTTP response;
- content of the HTTP response as bytes;
- transformers:
- leading and trailing spaces trimming in the extracted links;
- resolving of relative links:
- by the base tag:
- tag and attribute names may be configured (`<base href="..." />` by default);
- tag selection:
- first occurrence;
- last occurrence;
- by the header list:
- the headers are listed in the descending order of the priority;
- `Content-Base` and `Content-Location` by default;
- by the request URI;
- supporting of grouping of transformers:
- the transformers are processed sequentially, so one transformer can influence another one;
- supporting of leading and trailing spaces trimming in extracted links (optional):
- as the transformer for the extracted links (see above);
- as the wrapper for a link extractor;
- repeated extracting of relative links on error (optional):
- only the specified repeat count;
- supporting of a delay between repeats;
- delayed extracting of relative links (optional):
- reducing of a delay time by the time elapsed since the last request;
- using of individual delays for each thread;
- extracting links from a `sitemap.xml` file (optional):
- in-memory caching of the loaded `sitemap.xml` files;
- ignoring of the error on loading of the `sitemap.xml` file:
- logging of the received error;
- returning of the empty Sitemap instead;
- supporting of few `sitemap.xml` files for a single link:
- processing of each `sitemap.xml` file is done in a separate goroutine;
- supporting of an outer generator for the `sitemap.xml` links:
- generators:
- hierarchical generator:
- returns the suitable `sitemap.xml` file for each part of the URL path;
- supporting of sanitizing of the base link before generating of the `sitemap.xml` links;
- supporting of the restriction of the maximal depth;
- generator based on the `robots.txt` file;
- supporting of grouping of generators:
- result of group generating is merged results of each generator in the group;
- processing of each generator is done in a separate goroutine;
- supporting of a Sitemap index file:
- supporting of a delay before loading of each `sitemap.xml` file listed in the index;
- supporting of a gzip compression of a `sitemap.xml` file;
- supporting of grouping of link extractors:
- result of group extracting is merged results of each link extractor in the group;
- processing of each link extractor is done in a separate goroutine;
- calling of an outer handler for each extracted link:
- handling of the extracted links directly during the crawling, i.e., immediately after they have been extracted;
- data passed to the handler:
- extracted link;
- source link for the extracted link;
- handling only of those extracted links that have been filtered by a link filter (see below; optional);
- handling of the extracted links concurrently, i.e., in the goroutine pool (optional);
- supporting of grouping of handlers:
- processing of each handler is done in a separate goroutine;
- filtering of the extracted links by an outer link filter:
- by relativity of the extracted link (optional):
- supporting of result inverting;
- by uniqueness of the extracted link (optional):
- supporting of sanitizing of the link before checking of uniqueness;
- by a `robots.txt` file (optional):
- customized user agent;
- in-memory caching of the loaded `robots.txt` files;
- supporting of grouping of link filters:
- the link filters are processed sequentially, so one link filter can influence another one;
- result of group filtering is successful only when all link filters are successful;
- an empty group of link filters always fails;
- parallelization possibilities:
- crawling of relative links concurrently, i.e., in the goroutine pool;
- simulation of an unbounded channel of links to avoid a deadlock (see the sketch after this list);
- waiting of completion of processing of all extracted links;
- supporting of stopping of all operations via the context.
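The unbounded-channel simulation mentioned above can be sketched as follows: a goroutine buffers links in a slice between an input and an output channel, so a handler that pushes newly extracted links back into the queue can never deadlock the crawler. This illustrates the idea only, not the library's actual implementation.

```go
package main

import "fmt"

// unboundedChannel couples an input and an output channel through a
// goroutine that buffers pending values in a slice, so senders are never
// blocked by a slow consumer.
func unboundedChannel() (chan<- string, <-chan string) {
	in, out := make(chan string), make(chan string)
	go func() {
		defer close(out)
		var queue []string
		for in != nil || len(queue) > 0 {
			// A nil channel disables the corresponding select case:
			// sending is enabled only while the queue is non-empty.
			var send chan string
			var head string
			if len(queue) > 0 {
				send, head = out, queue[0]
			}
			select {
			case link, ok := <-in:
				if !ok {
					in = nil // the input is closed; drain the queue
					continue
				}
				queue = append(queue, link)
			case send <- head:
				queue = queue[1:]
			}
		}
	}()
	return in, out
}

func main() {
	in, out := unboundedChannel()
	for i := 0; i < 3; i++ {
		in <- fmt.Sprintf("http://example.com/%d", i) // does not wait for the consumer
	}
	close(in)
	for link := range out {
		fmt.Println(link)
	}
}
```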
v1.11.1
Supporting of an outer transformer for the extracted links; waiting of the completion of the processing in the handlers.ConcurrentHandler structure; replacing of the error producing with the logging in the transformers.ResolvingTransformer.TransformLinks() method; adding of the Name field to the extractors.ExtractorGroup structure for using it in the log messages as a prefix; using of the relative link resolving in the examples; adding of the example with all the features; simplifying of the examples; completing of the documentation.
Change Log
- crawling of all relative links for specified ones:
- supporting of an outer transformer for the extracted links (optional; see the sketch after this change log):
- data passed to the transformer:
- extracted links;
- service data of the HTTP response;
- content of the HTTP response as bytes;
- transformers:
- leading and trailing spaces trimming in the extracted links;
- resolving of relative links:
- by the base tag:
- tag and attribute names may be configured (`<base href="..." />` by default);
- tag selection:
- first occurrence;
- last occurrence;
- by the header list:
- the headers are listed in the descending order of the priority;
- `Content-Base` and `Content-Location` by default;
- by the request URI;
- supporting of grouping of transformers:
- the transformers are processed sequentially, so one transformer can influence another one;
- minor improvements:
- rename the `transformers.BaseTagFilters` variable to the `DefaultBaseTagFilters`;
- add the waiting of the completion of the processing in the `handlers.ConcurrentHandler` structure;
- error handling:
- improve the error handling in the `sitemap.HierarchicalGenerator.ExtractLinks()` method;
- simplify the error handling in the `extractors.TrimmingExtractor.ExtractLinks()` method;
- replace the error producing with the logging in the `transformers.ResolvingTransformer.TransformLinks()` method;
- logging:
- move the logging from the `registers.LinkRegister` structure to the `checkers.DuplicateChecker` structure:
- return the error instead of the logging in the `registers.LinkRegister` structure;
- add the logging to the `checkers.DuplicateChecker` structure;
- improve the logging in the `checkers.HostChecker.CheckLink()` method;
- add the `Name` field to the `extractors.ExtractorGroup` structure:
- use it in the log messages as a prefix (optional);
- refactoring:
- use the `transformers.TrimmingTransformer` structure in the `extractors.TrimmingExtractor.ExtractLinks()` method;
- simplify the `extractors.DelayingExtractor.ExtractLinks()` method;
- add the explanatory comment to the `extractors.DelayingExtractor.ExtractLinks()` method;
- use the `builders.FlattenBuilder` structure from the github.com/thewizardplusplus/go-html-selector package in the `transformers.BaseTagBuilder` structure;
- unit testing:
- complete the tests of the `transformers.ResolvingTransformer.TransformLinks()` method;
- fix the tests of the `transformers.BaseTagBuilder.IsSelectionTerminated()` method;
- examples:
- use the relative link resolving;
- add the explanatory comment to the example with the processing of a `sitemap.xml` file;
- add the example with all the features;
- simplify the examples:
- simplify the `renderTemplate()` function;
- remove the use:
- of the `extractors.RepeatingExtractor` structure;
- of the `extractors.TrimmingExtractor` structure;
- remove the example:
- with the delaying extracting;
- with the processing of a `robots.txt` file on the handling;
- with the `crawler.CrawlByConcurrentHandler()` function;
- with the `crawler.HandleLinksConcurrently()` function;
- documentation:
- complete the `README.md` file:
- describe the bibliography;
- complete the description of the features.
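A minimal sketch of the outer-transformer hook this release introduces, under the assumption that a transformer receives the extracted links plus the HTTP response metadata and body; the `LinkTransformer` interface and names below are illustrative reductions, not the library's exact API.

```go
package main

import (
	"fmt"
	"net/http"
	"strings"
)

// LinkTransformer receives the extracted links together with the HTTP
// response and its body, and returns the transformed links.
type LinkTransformer interface {
	TransformLinks(links []string, response *http.Response, body []byte) ([]string, error)
}

// trimmingTransformer strips leading and trailing spaces from every link,
// like the built-in trimming transformer.
type trimmingTransformer struct{}

func (trimmingTransformer) TransformLinks(links []string, _ *http.Response, _ []byte) ([]string, error) {
	trimmed := make([]string, 0, len(links))
	for _, link := range links {
		trimmed = append(trimmed, strings.TrimSpace(link))
	}
	return trimmed, nil
}

// transformerGroup applies transformers sequentially, so an earlier
// transformer can influence a later one (e.g. trimming before resolving).
type transformerGroup []LinkTransformer

func (group transformerGroup) TransformLinks(links []string, response *http.Response, body []byte) ([]string, error) {
	var err error
	for _, transformer := range group {
		if links, err = transformer.TransformLinks(links, response, body); err != nil {
			return nil, err
		}
	}
	return links, nil
}

func main() {
	group := transformerGroup{trimmingTransformer{}}
	links, _ := group.TransformLinks([]string{"  /one ", "/two"}, nil, nil)
	fmt.Println(links) // [/one /two]
}
```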
Features
- crawling of all relative links for specified ones:
- names of tags and attributes of links may be configured;
- supporting of an outer transformer for the extracted links (optional):
- data passed to the transformer:
- extracted links;
- service data of the HTTP response;
- content of the HTTP response as bytes;
- transformers:
- leading and trailing spaces trimming in the extracted links;
- resolving of relative links:
- by the base tag:
- tag and attribute names may be configured (`<base href="..." />` by default);
- tag selection:
- first occurrence;
- last occurrence;
- by the header list:
- the headers are listed in the descending order of the priority;
- `Content-Base` and `Content-Location` by default;
- by the request URI;
- supporting of grouping of transformers:
- the transformers are processed sequentially, so one transformer can influence another one;
- supporting of leading and trailing spaces trimming in extracted links (optional):
- as the transformer for the extracted links (see above);
- as the wrapper for a link extractor;
- repeated extracting of relative links on error (optional):
- only the specified repeat count;
- supporting of a delay between repeats;
- delayed extracting of relative links (optional):
- reducing of a delay time by the time elapsed since the last request;
- using of individual delays for each thread;
- extracting links from a `sitemap.xml` file (optional):
- in-memory caching of the loaded `sitemap.xml` files;
- ignoring of the error on loading of the `sitemap.xml` file:
- logging of the received error;
- returning of the empty Sitemap instead;
- supporting of few `sitemap.xml` files for a single link:
- processing of each `sitemap.xml` file is done in a separate goroutine;
- supporting of an outer generator for the `sitemap.xml` links:
- generators:
- hierarchical generator:
- returns the suitable `sitemap.xml` file for each part of the URL path;
- supporting of sanitizing of the base link before generating of the `sitemap.xml` links;
- supporting of the restriction of the maximal depth;
- generator based on the `robots.txt` file;
- supporting of grouping of generators:
- result of group generating is merged results of each generator in the group;
- processing of each generator is done in a separate goroutine;
- supporting of a Sitemap index file:
- supporting of a delay before loading of each `sitemap.xml` file listed in the index;
- supporting of a gzip compression of a `sitemap.xml` file;
- supporting of grouping of link extractors:
- result of group extracting is merged results of each link extractor in the group;
- processing of each link extractor is done in a separate goroutine;
- calling of an outer handler for each extracted link:
- handling of the extracted links directly during the crawling, i.e., immediately after they have been extracted;
- data passed to the handler:
- extracted link;
- source link for the extracted link;
- handling only of those extracted links that have been filtered by a link filter (see below; optional);
- handling of the extracted links concurrently, i.e., in the goroutine pool (optional);
- supporting of grouping of handlers:
- processing of each handler is done in a separate goroutine;
- filtering of the extracted links by an outer link filter:
- by relativity of the extracted link (optional):
- supporting of result inverting;
- by uniqueness of the extracted link (optional):
- supporting of sanitizing of the link before checking of uniqueness;
- by a `robots.txt` file (optional):
- customized user agent;
- in-memory caching of the loaded `robots.txt` files;
- supporting of grouping of link filters:
- the link filters are processed sequentially, so one link filter can influence another one;
- result of group filtering is successful only when all link filters are successful;
- an empty group of link filters always fails;
- parallelization possibilities:
- crawling of relative links concurrently, i.e., in the goroutine pool;
- simulation of an unbounded channel of links to avoid a deadlock;
- waiting of completion of processing of all extracted links;
- supporting of stopping of all operations via the context.
v1.11
Resolving of relative links.
Change Log
- crawling of all relative links for specified ones:
- resolving of relative links (see the sketch after this change log):
- by the `base` tag;
- by the `Content-Base` and `Content-Location` headers;
- by the request URI.
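The three resolution strategies boil down to picking a base URL (from the `base` tag, from the `Content-Base`/`Content-Location` headers, or from the request URI) and resolving each extracted link against it. A minimal sketch with the standard library; the crawler wraps this logic in its `transformers` package.

```go
package main

import (
	"fmt"
	"net/url"
)

// resolveLink resolves an extracted link against a base URL, the way the
// crawler does once it has picked the base from the <base> tag, from the
// Content-Base/Content-Location headers, or from the request URI.
func resolveLink(base, link string) (string, error) {
	baseURL, err := url.Parse(base)
	if err != nil {
		return "", err
	}
	linkURL, err := url.Parse(link)
	if err != nil {
		return "", err
	}
	// ResolveReference returns linkURL unchanged if it is already absolute.
	return baseURL.ResolveReference(linkURL).String(), nil
}

func main() {
	resolved, err := resolveLink("http://example.com/blog/", "../about#team")
	if err != nil {
		panic(err)
	}
	fmt.Println(resolved) // http://example.com/about#team
}
```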
Features
- crawling of all relative links for specified ones:
- resolving of relative links:
- by the `base` tag;
- by the `Content-Base` and `Content-Location` headers;
- by the request URI;
- supporting of leading and trailing spaces trimming in extracted links (optional);
- repeated extracting of relative links on error (optional):
- only specified repeat count;
- supporting of delay between repeats;
- delayed extracting of relative links (optional):
- reducing of a delay time by the time elapsed since the last request;
- using of individual delays for each thread;
- extracting links from a `sitemap.xml` file (optional):
- ignoring of the error on loading of the `sitemap.xml` file:
- logging of the received error;
- returning of an empty Sitemap instead;
- supporting of few `sitemap.xml` files for a single link:
- processing of each `sitemap.xml` file is done in a separate goroutine;
- supporting of an outer generator for `sitemap.xml` links:
- generators:
- simple generator (it returns the `sitemap.xml` file in the site root);
- hierarchical generator (it returns the suitable `sitemap.xml` file for each part of the URL path);
- generator based on the `robots.txt` file;
- supporting of grouping of generators:
- result of group generating is merged results of each generator in the group;
- generating concurrently:
- processing of each generator is done in a separate goroutine;
- supporting of a Sitemap index file:
- supporting of a delay before loading of each `sitemap.xml` file listed in the index;
- supporting of a gzip compression of a `sitemap.xml` file;
- supporting of grouping of link extractors:
- result of group extracting is merged results of each extractor in the group;
- extracting links concurrently:
- processing of each link extractor is done in a separate goroutine;
- calling of an outer handler for each found link:
- it's called directly during crawling;
- handling of links immediately after they have been extracted;
- passing of the source link in the outer handler;
- handling links filtered by a custom link filter (optional);
- handling links concurrently (optional);
- supporting of grouping of outer handlers:
- processing of each outer handler is done in a separate goroutine;
- custom filtering of considered links:
- by relativity of a link (optional):
- supporting of result inverting;
- by uniqueness of an extracted link (optional):
- supporting of sanitizing of a link before checking of uniqueness (optional);
- by a `robots.txt` file (optional):
- customized user agent;
- supporting of grouping of link filters:
- result of group filtering is successful only when all filters are successful;
- parallelization possibilities:
- crawling of relative links in parallel;
- supporting of background working:
- automatic completion after processing all filtered links;
- simulation of an unbounded channel of links to avoid a deadlock.
v1.10.1
Perform the refactoring: add the extractors.TrimmingExtractor structure and ignore errors from each extractor in the group, instead of logging them.
Change Log
- perform the refactoring:
- link trimming:
- add the `extractors.TrimmingExtractor` structure (see the sketch after this change log);
- remove the link trimming from the `extractors.DefaultExtractor` structure;
- fix the `extractors.ExtractorGroup` structure:
- ignore errors from each extractor in the group, instead of logging them;
- add the `registers.BasicRegister` structure:
- use in the `registers.RobotsTXTRegister` structure;
- use in the `registers.SitemapRegister` structure;
- fix the bugs:
- fix the tests of the `checkers.HostChecker.CheckLink()` method.
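A sketch of the idea behind the new `extractors.TrimmingExtractor`: a decorator that wraps another extractor and trims the links it returns. The `LinkExtractor` interface and the stub extractor below are illustrative assumptions, not the library's exact contract.

```go
package main

import (
	"context"
	"fmt"
	"strings"
)

// LinkExtractor is a reduced version of the extractor contract used
// throughout the library.
type LinkExtractor interface {
	ExtractLinks(ctx context.Context, link string) ([]string, error)
}

// TrimmingExtractor decorates another extractor and trims leading and
// trailing spaces from every link it returns.
type TrimmingExtractor struct {
	Inner LinkExtractor
}

func (e TrimmingExtractor) ExtractLinks(ctx context.Context, link string) ([]string, error) {
	links, err := e.Inner.ExtractLinks(ctx, link)
	if err != nil {
		return nil, err
	}
	for i, l := range links {
		links[i] = strings.TrimSpace(l)
	}
	return links, nil
}

// staticExtractor is a stub inner extractor for the demonstration.
type staticExtractor []string

func (e staticExtractor) ExtractLinks(context.Context, string) ([]string, error) {
	return append([]string(nil), e...), nil
}

func main() {
	extractor := TrimmingExtractor{Inner: staticExtractor{"  /one ", "/two\n"}}
	links, _ := extractor.ExtractLinks(context.Background(), "http://example.com/")
	fmt.Printf("%q\n", links) // ["/one" "/two"]
}
```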
Features
- crawling of all relative links for specified ones:
- supporting of leading and trailing spaces trimming in extracted links (optional);
- repeated extracting of relative links on error (optional):
- only specified repeat count;
- supporting of delay between repeats;
- delayed extracting of relative links (optional):
- reducing of a delay time by the time elapsed since the last request;
- using of individual delays for each thread;
- extracting links from a `sitemap.xml` file (optional):
- ignoring of the error on loading of the `sitemap.xml` file:
- logging of the received error;
- returning of an empty Sitemap instead;
- supporting of few `sitemap.xml` files for a single link:
- processing of each `sitemap.xml` file is done in a separate goroutine;
- supporting of an outer generator for `sitemap.xml` links:
- generators:
- simple generator (it returns the `sitemap.xml` file in the site root);
- hierarchical generator (it returns the suitable `sitemap.xml` file for each part of the URL path);
- generator based on the `robots.txt` file;
- supporting of grouping of generators:
- result of group generating is merged results of each generator in the group;
- generating concurrently:
- processing of each generator is done in a separate goroutine;
- supporting of a Sitemap index file:
- supporting of a delay before loading of each `sitemap.xml` file listed in the index;
- supporting of a gzip compression of a `sitemap.xml` file;
- supporting of grouping of link extractors:
- result of group extracting is merged results of each extractor in the group;
- extracting links concurrently:
- processing of each link extractor is done in a separate goroutine;
- calling of an outer handler for each found link:
- it's called directly during crawling;
- handling of links immediately after they have been extracted;
- passing of the source link in the outer handler;
- handling links filtered by a custom link filter (optional);
- handling links concurrently (optional);
- supporting of grouping of outer handlers:
- processing of each outer handler is done in a separate goroutine;
- custom filtering of considered links:
- by relativity of a link (optional):
- supporting of result inverting;
- by uniqueness of an extracted link (optional):
- supporting of sanitizing of a link before checking of uniqueness (optional);
- by a `robots.txt` file (optional):
- customized user agent;
- supporting of grouping of link filters:
- result of group filtering is successful only when all filters are successful;
- parallelization possibilities:
- crawling of relative links in parallel;
- supporting of background working:
- automatic completion after processing all filtered links;
- simulation of an unbounded channel of links to avoid a deadlock.
v1.10
Supporting of leading and trailing spaces trimming in extracted links and supporting of grouping of outer handlers.
Change Log
- crawling of all relative links for specified ones:
- supporting of leading and trailing spaces trimming in extracted links (optional);
- calling of an outer handler for each found link:
- supporting of grouping of outer handlers (see the sketch after this change log):
- processing of each outer handler is done in a separate goroutine;
- custom filtering of considered links:
- by relativity of a link (optional):
- supporting of result inverting;
- extend the logging:
- in the `crawler.HandleLink()` function;
- in the `extractors` package:
- in the `RepeatingExtractor` structure;
- in the `SitemapExtractor` structure;
- in the `checkers` package:
- in the `HostChecker` structure;
- in the `RobotsTXTChecker` structure;
- in the `registers.LinkRegister` structure;
- examples:
- fix the output messages;
- add the example with few handlers.
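The grouping of outer handlers can be sketched like this: the group fans one link out to every handler, each in its own goroutine, and waits for all of them. The interface and names are illustrative; the real handler contract also receives a context and richer link data.

```go
package main

import (
	"fmt"
	"sync"
)

// LinkHandler is a reduced version of the handler contract.
type LinkHandler interface {
	HandleLink(sourceLink, extractedLink string)
}

// HandlerGroup calls every handler in the group, each in its own goroutine,
// and waits for all of them to finish.
type HandlerGroup []LinkHandler

func (group HandlerGroup) HandleLink(sourceLink, extractedLink string) {
	var waiter sync.WaitGroup
	waiter.Add(len(group))
	for _, handler := range group {
		go func(handler LinkHandler) {
			defer waiter.Done()
			handler.HandleLink(sourceLink, extractedLink)
		}(handler)
	}
	waiter.Wait()
}

// printingHandler is a stub handler for the demonstration.
type printingHandler string

func (h printingHandler) HandleLink(sourceLink, extractedLink string) {
	fmt.Printf("[%s] %s -> %s\n", h, sourceLink, extractedLink)
}

func main() {
	group := HandlerGroup{printingHandler("first"), printingHandler("second")}
	group.HandleLink("http://example.com/", "http://example.com/about")
}
```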
Features
- crawling of all relative links for specified ones:
- supporting of leading and trailing spaces trimming in extracted links (optional);
- repeated extracting of relative links on error (optional):
- only specified repeat count;
- supporting of delay between repeats;
- delayed extracting of relative links (optional):
- reducing of a delay time by the time elapsed since the last request;
- using of individual delays for each thread;
- extracting links from a `sitemap.xml` file (optional):
- ignoring of the error on loading of the `sitemap.xml` file:
- logging of the received error;
- returning of an empty Sitemap instead;
- supporting of few `sitemap.xml` files for a single link:
- processing of each `sitemap.xml` file is done in a separate goroutine;
- supporting of an outer generator for `sitemap.xml` links:
- generators:
- simple generator (it returns the `sitemap.xml` file in the site root);
- hierarchical generator (it returns the suitable `sitemap.xml` file for each part of the URL path);
- generator based on the `robots.txt` file;
- supporting of grouping of generators:
- result of group generating is merged results of each generator in the group;
- generating concurrently:
- processing of each generator is done in a separate goroutine;
- supporting of a Sitemap index file:
- supporting of a delay before loading of each `sitemap.xml` file listed in the index;
- supporting of a gzip compression of a `sitemap.xml` file;
- supporting of grouping of link extractors:
- result of group extracting is merged results of each extractor in the group;
- extracting links concurrently:
- processing of each link extractor is done in a separate goroutine;
- calling of an outer handler for each found link:
- it's called directly during crawling;
- handling of links immediately after they have been extracted;
- passing of the source link in the outer handler;
- handling links filtered by a custom link filter (optional);
- handling links concurrently (optional);
- supporting of grouping of outer handlers:
- processing of each outer handler is done in a separate goroutine;
- custom filtering of considered links:
- by relativity of a link (optional):
- supporting of result inverting;
- by uniqueness of an extracted link (optional):
- supporting of sanitizing of a link before checking of uniqueness (optional);
- by a `robots.txt` file (optional):
- customized user agent;
- supporting of grouping of link filters:
- result of group filtering is successful only when all filters are successful;
- parallelization possibilities:
- crawling of relative links in parallel;
- supporting of background working:
- automatic completion after processing all filtered links;
- simulation of an unbounded channel of links to avoid a deadlock.
v1.9.1
Perform the refactoring: replace the registers.LinkGenerator interface with models.LinkExtractor and add the urlutils.GenerateHierarchicalLinks() function.
Change Log
- perform the refactoring:
- move the interfaces of the `models` package to the separate file;
- replace the `registers.LinkGenerator` interface with `models.LinkExtractor`:
- replace the `sitemap.GeneratorGroup` type with `extractors.ExtractorGroup`;
- pass a thread ID to the `registers.SitemapRegister.RegisterSitemap()` method;
- rename the `sanitizing` package to `urlutils`;
- add the `urlutils.GenerateHierarchicalLinks()` function (see the sketch after this change log):
- use in the `registers.RobotsTXTRegister` structure;
- use in the `sitemap.HierarchicalGenerator` structure:
- replace the `sitemap.SimpleGenerator` structure with `sitemap.HierarchicalGenerator`;
- simplify the examples.
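A guess at the idea behind the new `urlutils.GenerateHierarchicalLinks()` function, based on how the hierarchical generator is described elsewhere in these notes (one candidate link per prefix of the URL path); the actual signature and edge-case handling in the library may differ.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// generateHierarchicalLinks builds one link per prefix of the URL path,
// from the site root down to the full path, appending the given suffix
// (e.g. "sitemap.xml") to each prefix.
func generateHierarchicalLinks(rawLink, suffix string) ([]string, error) {
	parsedLink, err := url.Parse(rawLink)
	if err != nil {
		return nil, err
	}

	var parts []string
	if trimmedPath := strings.Trim(parsedLink.Path, "/"); trimmedPath != "" {
		parts = strings.Split(trimmedPath, "/")
	}

	prefix := parsedLink.Scheme + "://" + parsedLink.Host
	links := []string{prefix + "/" + suffix} // the site root comes first
	for _, part := range parts {
		prefix += "/" + part
		links = append(links, prefix+"/"+suffix)
	}
	return links, nil
}

func main() {
	links, err := generateHierarchicalLinks("http://example.com/one/two", "sitemap.xml")
	if err != nil {
		panic(err)
	}
	for _, link := range links {
		fmt.Println(link)
	}
	// Prints:
	//   http://example.com/sitemap.xml
	//   http://example.com/one/sitemap.xml
	//   http://example.com/one/two/sitemap.xml
}
```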
Features
- crawling of all relative links for specified ones:
- repeated extracting of relative links on error (optional):
- only specified repeat count;
- supporting of delay between repeats;
- delayed extracting of relative links (optional):
- reducing of a delay time by the time elapsed since the last request;
- using of individual delays for each thread;
- extracting links from a `sitemap.xml` file (optional):
- ignoring of the error on loading of the `sitemap.xml` file:
- logging of the received error;
- returning of an empty Sitemap instead;
- supporting of few `sitemap.xml` files for a single link:
- processing of each `sitemap.xml` file is done in a separate goroutine;
- supporting of an outer generator for `sitemap.xml` links:
- generators:
- simple generator (it returns the `sitemap.xml` file in the site root);
- hierarchical generator (it returns the suitable `sitemap.xml` file for each part of the URL path);
- generator based on the `robots.txt` file;
- supporting of grouping of generators:
- result of group generating is merged results of each generator in the group;
- generating concurrently:
- processing of each generator is done in a separate goroutine;
- supporting of a Sitemap index file:
- supporting of a delay before loading of each `sitemap.xml` file listed in the index;
- supporting of a gzip compression of a `sitemap.xml` file;
- supporting of grouping of link extractors:
- result of group extracting is merged results of each extractor in the group;
- extracting links concurrently:
- processing of each link extractor is done in a separate goroutine;
- calling of an outer handler for each found link:
- it's called directly during crawling;
- handling of links immediately after they have been extracted;
- passing of the source link in the outer handler;
- handling links filtered by a custom link filter (optional);
- handling links concurrently (optional);
- custom filtering of considered links:
- by relativity of a link (optional);
- by uniqueness of an extracted link (optional):
- supporting of sanitizing of a link before checking of uniqueness (optional);
- by a `robots.txt` file (optional):
- customized user agent;
- supporting of grouping of link filters:
- result of group filtering is successful only when all filters are successful;
- parallelization possibilities:
- crawling of relative links in parallel;
- supporting of background working:
- automatic completion after processing all filtered links;
- simulation of an unbounded channel of links to avoid a deadlock.
v1.9
Supporting of a gzip compression of a sitemap.xml file.
Change Log
- crawling of all relative links for specified ones:
- extracting links from a `sitemap.xml` file (optional):
- supporting of a gzip compression of a `sitemap.xml` file (see the sketch after this change log).
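Loading a gzip-compressed `sitemap.xml` amounts to unwrapping the body with `compress/gzip` before parsing. A minimal sketch; the detection heuristics here (the `.gz` suffix and the `Content-Encoding` header) are assumptions, and the library hides this behind its sitemap register.

```go
package main

import (
	"compress/gzip"
	"fmt"
	"io"
	"net/http"
	"strings"
)

// readSitemap fetches a sitemap and transparently unwraps gzip when the
// link points at a .gz file or the server declares gzip encoding.
func readSitemap(link string) ([]byte, error) {
	response, err := http.Get(link)
	if err != nil {
		return nil, err
	}
	defer response.Body.Close()

	var reader io.Reader = response.Body
	if strings.HasSuffix(link, ".gz") ||
		response.Header.Get("Content-Encoding") == "gzip" {
		gzipReader, err := gzip.NewReader(response.Body)
		if err != nil {
			return nil, err
		}
		defer gzipReader.Close()
		reader = gzipReader
	}
	return io.ReadAll(reader)
}

func main() {
	data, err := readSitemap("http://example.com/sitemap.xml.gz")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("%d bytes of sitemap XML\n", len(data))
}
```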
Features
- crawling of all relative links for specified ones:
- repeated extracting of relative links on error (optional):
- only specified repeat count;
- supporting of delay between repeats;
- delayed extracting of relative links (optional):
- reducing of a delay time by the time elapsed since the last request;
- using of individual delays for each thread;
- extracting links from a `sitemap.xml` file (optional):
- ignoring of the error on loading of the `sitemap.xml` file:
- logging of the received error;
- returning of an empty Sitemap instead;
- supporting of few `sitemap.xml` files for a single link:
- processing of each `sitemap.xml` file is done in a separate goroutine;
- supporting of an outer generator for `sitemap.xml` links:
- generators:
- simple generator (it returns the `sitemap.xml` file in the site root);
- hierarchical generator (it returns the suitable `sitemap.xml` file for each part of the URL path);
- generator based on the `robots.txt` file;
- supporting of grouping of generators:
- result of group generating is merged results of each generator in the group;
- generating concurrently:
- processing of each generator is done in a separate goroutine;
- supporting of a Sitemap index file:
- supporting of a delay before loading of each `sitemap.xml` file listed in the index;
- supporting of a gzip compression of a `sitemap.xml` file;
- supporting of grouping of link extractors:
- result of group extracting is merged results of each extractor in the group;
- extracting links concurrently:
- processing of each link extractor is done in a separate goroutine;
- calling of an outer handler for each found link:
- it's called directly during crawling;
- handling of links immediately after they have been extracted;
- passing of the source link in the outer handler;
- handling links filtered by a custom link filter (optional);
- handling links concurrently (optional);
- custom filtering of considered links:
- by relativity of a link (optional);
- by uniqueness of an extracted link (optional):
- supporting of sanitizing of a link before checking of uniqueness (optional);
- by a `robots.txt` file (optional):
- customized user agent;
- supporting of grouping of link filters:
- result of group filtering is successful only when all filters are successful;
- parallelization possibilities:
- crawling of relative links in parallel;
- supporting of background working:
- automatic completion after processing all filtered links;
- simulation of an unbounded channel of links to avoid a deadlock.
v1.8
Supporting of grouping of sitemap.xml link generators and adding of the hierarchical generator and the generator based on the robots.txt file.
Change Log
- crawling of all relative links for specified ones:
- extracting links from a `sitemap.xml` file (optional):
- supporting of few `sitemap.xml` files for a single link:
- supporting of an outer generator for `sitemap.xml` links:
- generators:
- simple generator (it returns the `sitemap.xml` file in the site root);
- hierarchical generator (it returns the suitable `sitemap.xml` file for each part of the URL path);
- generator based on the `robots.txt` file;
- supporting of grouping of generators (see the sketch after this change log):
- result of group generating is merged results of each generator in the group;
- generating concurrently:
- processing of each generator is done in a separate goroutine.
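The grouping of generators can be sketched as follows: each generator runs in its own goroutine and the results are merged. The `SitemapLinkGenerator` interface, the error policy (a failed generator simply contributes nothing), and the stub generator are illustrative assumptions.

```go
package main

import (
	"fmt"
	"sync"
)

// SitemapLinkGenerator produces candidate sitemap.xml links for a page link.
type SitemapLinkGenerator interface {
	GenerateLinks(pageLink string) ([]string, error)
}

// GeneratorGroup runs every generator in its own goroutine and merges
// their results; a failed generator contributes nothing.
type GeneratorGroup []SitemapLinkGenerator

func (group GeneratorGroup) GenerateLinks(pageLink string) ([]string, error) {
	results := make(chan []string, len(group))

	var waiter sync.WaitGroup
	waiter.Add(len(group))
	for _, generator := range group {
		go func(generator SitemapLinkGenerator) {
			defer waiter.Done()
			if links, err := generator.GenerateLinks(pageLink); err == nil {
				results <- links
			}
		}(generator)
	}
	waiter.Wait()
	close(results)

	var merged []string
	for links := range results {
		merged = append(merged, links...)
	}
	return merged, nil
}

// simpleGenerator returns the sitemap.xml in the site root (a stub that
// assumes the link is already the site root).
type simpleGenerator struct{}

func (simpleGenerator) GenerateLinks(pageLink string) ([]string, error) {
	return []string{pageLink + "sitemap.xml"}, nil
}

func main() {
	group := GeneratorGroup{simpleGenerator{}}
	links, _ := group.GenerateLinks("http://example.com/")
	fmt.Println(links) // [http://example.com/sitemap.xml]
}
```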
Features
- crawling of all relative links for specified ones:
- repeated extracting of relative links on error (optional):
- only specified repeat count;
- supporting of delay between repeats;
- delayed extracting of relative links (optional):
- reducing of a delay time by the time elapsed since the last request;
- using of individual delays for each thread;
- extracting links from a `sitemap.xml` file (optional):
- ignoring of the error on loading of the `sitemap.xml` file:
- logging of the received error;
- returning of an empty Sitemap instead;
- supporting of few `sitemap.xml` files for a single link:
- processing of each `sitemap.xml` file is done in a separate goroutine;
- supporting of an outer generator for `sitemap.xml` links:
- generators:
- simple generator (it returns the `sitemap.xml` file in the site root);
- hierarchical generator (it returns the suitable `sitemap.xml` file for each part of the URL path);
- generator based on the `robots.txt` file;
- supporting of grouping of generators:
- result of group generating is merged results of each generator in the group;
- generating concurrently:
- processing of each generator is done in a separate goroutine;
- supporting of a Sitemap index file:
- supporting of a delay before loading of each `sitemap.xml` file listed in the index;
- supporting of grouping of link extractors:
- result of group extracting is merged results of each extractor in the group;
- extracting links concurrently:
- processing of each link extractor is done in a separate goroutine;
- calling of an outer handler for each found link:
- it's called directly during crawling;
- handling of links immediately after they have been extracted;
- passing of the source link in the outer handler;
- handling links filtered by a custom link filter (optional);
- handling links concurrently (optional);
- custom filtering of considered links:
- by relativity of a link (optional);
- by uniqueness of an extracted link (optional):
- supporting of sanitizing of a link before checking of uniqueness (optional);
- by a `robots.txt` file (optional):
- customized user agent;
- supporting of grouping of link filters:
- result of group filtering is successful only when all filters are successful;
- parallelization possibilities:
- crawling of relative links in parallel;
- supporting of background working:
- automatic completion after processing all filtered links;
- simulation of an unbounded channel of links to avoid a deadlock.
v1.7.1
Optimizing of the processing of the sitemap.xml files and ignoring of the errors that occur during it.
Change Log
- crawling of all relative links for specified ones:
- extracting links from a `sitemap.xml` file (optional):
- ignoring of the error on loading of the `sitemap.xml` file:
- logging of the received error;
- returning of an empty Sitemap instead;
- supporting of few `sitemap.xml` files for a single link:
- processing of each `sitemap.xml` file is done in a separate goroutine (see the sketch after this change log);
- supporting of grouping of link extractors:
- extracting links concurrently:
- processing of each link extractor is done in a separate goroutine.
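A sketch of the optimization this release describes: each `sitemap.xml` file is processed in its own goroutine, and a loading error is logged and treated as an empty sitemap instead of aborting the crawl. The `loadSitemap` stub and all names are illustrative.

```go
package main

import (
	"fmt"
	"log"
	"sync"
)

// loadSitemap is a stand-in for the real sitemap loader; assume it fetches
// and parses one sitemap.xml file into a list of links.
func loadSitemap(link string) ([]string, error) {
	return nil, fmt.Errorf("unable to load %s", link) // stub failure
}

// extractSitemapLinks processes every sitemap.xml file in its own goroutine
// and merges the results; a loading error is logged and treated as an
// empty sitemap.
func extractSitemapLinks(sitemapLinks []string) []string {
	results := make(chan []string, len(sitemapLinks))

	var waiter sync.WaitGroup
	waiter.Add(len(sitemapLinks))
	for _, sitemapLink := range sitemapLinks {
		go func(sitemapLink string) {
			defer waiter.Done()
			links, err := loadSitemap(sitemapLink)
			if err != nil {
				log.Printf("unable to process the sitemap: %v", err)
				links = nil // fall back to an empty sitemap
			}
			results <- links
		}(sitemapLink)
	}
	waiter.Wait()
	close(results)

	var merged []string
	for links := range results {
		merged = append(merged, links...)
	}
	return merged
}

func main() {
	links := extractSitemapLinks([]string{"http://example.com/sitemap.xml"})
	fmt.Printf("extracted %d links\n", len(links))
}
```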
Features
- crawling of all relative links for specified ones:
- repeated extracting of relative links on error (optional):
- only specified repeat count;
- supporting of delay between repeats;
- delayed extracting of relative links (optional):
- reducing of a delay time by the time elapsed since the last request;
- using of individual delays for each thread;
- extracting links from a `sitemap.xml` file (optional):
- ignoring of the error on loading of the `sitemap.xml` file:
- logging of the received error;
- returning of an empty Sitemap instead;
- supporting of few `sitemap.xml` files for a single link:
- processing of each `sitemap.xml` file is done in a separate goroutine;
- supporting of a Sitemap index file:
- supporting of a delay before loading of each `sitemap.xml` file listed in the index;
- supporting of grouping of link extractors:
- result of group extracting is merged results of each extractor in the group;
- extracting links concurrently:
- processing of each link extractor is done in a separate goroutine;
- calling of an outer handler for each found link:
- it's called directly during crawling;
- handling of links immediately after they have been extracted;
- passing of the source link in the outer handler;
- handling links filtered by a custom link filter (optional);
- handling links concurrently (optional);
- custom filtering of considered links:
- by relativity of a link (optional);
- by uniqueness of an extracted link (optional):
- supporting of sanitizing of a link before checking of uniqueness (optional);
- by a `robots.txt` file (optional):
- customized user agent;
- supporting of grouping of link filters:
- result of group filtering is successful only when all filters are successful;
- parallelization possibilities:
- crawling of relative links in parallel;
- supporting of background working:
- automatic completion after processing all filtered links;
- simulation of an unbounded channel of links to avoid a deadlock.
v1.7
Supporting of grouping of link extractors and extracting links from a sitemap.xml file (optional).
Change Log
- crawling of all relative links for specified ones:
- extracting links from a `sitemap.xml` file (optional; see the sketch after this change log):
- supporting of few `sitemap.xml` files for a single link;
- supporting of a Sitemap index file;
- supporting of a delay before loading of a specific `sitemap.xml` file;
- supporting of grouping of link extractors:
- result of group extracting is merged results of each extractor in the group.
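Extracting links from a `sitemap.xml` file reduces to parsing the Sitemap protocol's `<urlset>` document and collecting the `<loc>` entries. A minimal sketch with `encoding/xml`; the library additionally handles Sitemap index files and loading delays.

```go
package main

import (
	"encoding/xml"
	"fmt"
)

// urlSet mirrors the <urlset> document of the Sitemap protocol
// (https://www.sitemaps.org): a list of <url> entries with <loc> links.
type urlSet struct {
	URLs []struct {
		Loc string `xml:"loc"`
	} `xml:"url"`
}

// extractSitemapLinks pulls the page links out of raw sitemap.xml content.
func extractSitemapLinks(data []byte) ([]string, error) {
	var sitemap urlSet
	if err := xml.Unmarshal(data, &sitemap); err != nil {
		return nil, err
	}

	var links []string
	for _, url := range sitemap.URLs {
		links = append(links, url.Loc)
	}
	return links, nil
}

func main() {
	const data = `<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://example.com/</loc></url>
  <url><loc>http://example.com/about</loc></url>
</urlset>`

	links, err := extractSitemapLinks([]byte(data))
	if err != nil {
		panic(err)
	}
	fmt.Println(links) // [http://example.com/ http://example.com/about]
}
```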
Features
- crawling of all relative links for specified ones:
- repeated extracting of relative links on error (optional):
- only specified repeat count;
- supporting of delay between repeats;
- delayed extracting of relative links (optional):
- reducing of a delay time by the time elapsed since the last request;
- using of individual delays for each thread;
- extracting links from a `sitemap.xml` file (optional):
- supporting of few `sitemap.xml` files for a single link;
- supporting of a Sitemap index file;
- supporting of a delay before loading of a specific `sitemap.xml` file;
- supporting of grouping of link extractors:
- result of group extracting is merged results of each extractor in the group;
- calling of an outer handler for each found link:
- it's called directly during crawling;
- handling of links immediately after they have been extracted;
- passing of the source link in the outer handler;
- handling links filtered by a custom link filter (optional);
- handling links concurrently (optional);
- custom filtering of considered links:
- by relativity of a link (optional);
- by uniqueness of an extracted link (optional):
- supporting of sanitizing of a link before checking of uniqueness (optional);
- by a `robots.txt` file (optional):
- customized user agent;
- supporting of grouping of link filters:
- result of group filtering is successful only when all filters are successful;
- parallelization possibilities:
- crawling of relative links in parallel;
- supporting of background working:
- automatic completion after processing all filtered links;
- simulation of an unbounded channel of links to avoid a deadlock.