Change-log:
Release 1.5.1 (2016-01-04). This version contains many significant additions to all components and modules, including the hce-node network transport itself and the DC and DTM services. It also adds many performance optimizations, improvements to parallel computation and several profiling tools, as well as a structural refactoring of the source code that covers more than 50% of modules and algorithms.
- Core transport infrastructure application hce-node:
- Added support for the notification connections restore command.
- Added support for specifying the DRCE temporary directory in the node ini file.
- Improved the resource usage balancing modes by extending the random algorithm to prevent skew when several hosts have a similar load level.
- Fixed several bugs in DRCE functional object process task management and notifications.
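The balancing change above (a random choice that avoids skew when several hosts report a similar load) can be illustrated with a minimal sketch. This is not hce-node code: the function name `pick_host`, the `tolerance` parameter and the shape of the `loads` mapping are all hypothetical, chosen only to show the idea.

```python
import random


def pick_host(loads, tolerance=0.05, rng=random):
    """Pick a host at random among those with a near-minimal load.

    loads     -- hypothetical mapping of host name to load level (0.0..1.0)
    tolerance -- hypothetical threshold: loads this close to the minimum
                 are treated as "similar"
    rng       -- random source; pass random.Random(seed) for repeatable runs
    """
    if not loads:
        raise ValueError("no hosts available")
    lowest = min(loads.values())
    # Always taking the single least-loaded host would send every task to
    # the same node whenever loads are nearly equal, so gather all hosts
    # within the tolerance band and choose among them at random.
    candidates = sorted(h for h, load in loads.items() if load - lowest <= tolerance)
    return rng.choice(candidates)
```

With loads of 0.50, 0.52 and 0.90, the first two hosts share the incoming tasks and the third is skipped until its load drops into the band.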
- Updated “Distributed Crawler” (DC) service:
- Added support for a dynamic fetcher based on Selenium and Google Chrome.
- Added support for the multi-item TEMPLATE scraping algorithm, especially for product and search results pages, including smart correction of the correspondence of field values per item.
- Added support for csspath expressions in the rules of TEMPLATE scraping.
- Added support for common metrics calculation for all scraping/processing algorithms; basic metrics such as the number of bytes, characters and words can be used to choose the best result when several detection and extraction algorithms are applied sequentially.
- Added support for joining page chains and for multi-page articles.
- Added several new types of information tags that the TEMPLATE scraping algorithm can detect.
- Added support for several additional properties in TEMPLATE algorithm rule definitions, including regular expression, join delimiter, join modes, mandatory, typification and formatting.
- Added support for extended xpath definitions for the scrapy extractor in the NEWS SCRAPING algorithm.
- Added support for rotating proxies, configurable per project, domain name and so on.
- Added support for iterative use of the fetcher (static first, then possibly dynamic) and for an auto fetcher type.
- Added extended statistical data tracking for crawling and scraping.
- Added extended root URL logic, including different algorithms for root URL variation and schema macro support.
- Improved TEMPLATE scraping with support for the multi-template sequential algorithm.
- Improved the URL schema validation, canonicalization and generation algorithms.
- Improved the smart crawling algorithms by minimizing per-host request frequency to prevent site server overload.
- Improved detection of page changes in the crawling algorithm using the HTTP HEAD request and the Last-Modified header.
- Improved support for RSS sources in NEWS scraping; fixed RDF and RSS2 handling.
- Improved support for iterative crawling, especially for real-time requests and parallel batching.
- Improved support for request depth in real-time crawling, including the ability to collect links and crawl several levels of a site’s pages.
- Improved the text extraction algorithm to support recursive scanning of the DOM hierarchy, with detection of paragraph markup and configurable replacements, to produce well-formatted text.
- Improved the configurable robots.txt support.
- Improved the internal API for several commands, including URL_CONTENT, URL_DELETE and URL_UPDATE, extended with support for raw HTML and processed content manipulation, related data such as HTTP headers, manipulation of lists of objects, selection criteria and so on.
- Improved the external API, including HTTP web gateway limits and request structure validation.
- Improved the DB storage API and operations support.
- Improved the date detection and validation algorithm for all scraping types.
- Improved image detection and best-image selection for the NEWS scraping algorithm.
- Improved the NEWS scraping type sequential algorithm including the metrics and value extraction.
- Improved configuration and customization of the crawling and scraping to get maximum flexibility per request, project and so on.
- Improved the filter set to include regular expression support and support for several additional steps and stages.
- Improved the algorithms for calculating the unique content CRC, including the use of soundex and snowball stemming.
- Documentation updates and fixes.
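The common metrics entry in the DC list above (basic metrics used to choose the best result when several extraction algorithms run in sequence) can be sketched as follows. This is an illustration only, not the DC implementation: `basic_metrics`, `choose_best` and the layout of the `results` mapping are hypothetical names, and ranking by word count is an assumed default.

```python
def basic_metrics(text):
    """Return simple size metrics for an extracted text."""
    return {
        "bytes": len(text.encode("utf-8")),  # size of the UTF-8 encoding
        "chars": len(text),                  # number of code points
        "words": len(text.split()),          # whitespace-separated tokens
    }


def choose_best(results, metric="words"):
    """Return the name of the algorithm whose output scores highest.

    results -- hypothetical mapping of algorithm name to extracted text;
               empty or None outputs are ignored
    metric  -- which basic metric to rank by (assumed default: "words")
    """
    scored = {name: basic_metrics(text) for name, text in results.items() if text}
    if not scored:
        return None
    return max(scored, key=lambda name: scored[name][metric])
```

Ranking by a single metric keeps the comparison cheap; a real selection step would probably combine several metrics or apply per-field thresholds.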
- Updated “Distributed Tasks Manager” (DTM) application:
- Migrated from the SQLite DB backend to MySQL.
- Added support for different balancing types for different task types.
- Improved the CLI API with support for task lists and states.
- Fixed several bugs in task management and the API.