Stable release 1.5.1 (2016-01-04)

Change-log:

Release 1.5.1 (2016-01-04). This version contains many significant additions to all components and modules, including the hce-node network transport itself and the DC and DTM services. It also adds many performance optimizations, improvements to parallel computation and several profiling tools, as well as a structural refactoring of the source code that covers more than 50% of modules and algorithms.

  • Core transport infrastructure application hce-node:
    • Added support for the notification connections restore command.
    • Added support for specifying the DRCE temporary directory in the node ini file.
    • Improved the resource usage balancing modes by extending the random algorithm to prevent skew when several hosts have a similar load level.
    • Fixed several bugs in the DRCE functional object's process task management and notifications.
  • Updated “Distributed Crawler” (DC) service:
    • Added support for a dynamic fetcher based on Selenium and Google Chrome.
    • Added support for the multi-item TEMPLATE scraping algorithm, intended especially for product and search results pages, including smart correction of the correspondence of field values per item.
    • Added support for csspath expressions in the TEMPLATE scraping rules.
    • Added support for common metrics calculation for all scraping/processing algorithms; basic metrics such as the number of bytes, characters and words can be used to choose the best result when several detection and extraction algorithms are applied sequentially.
    • Added support for joining page chains and for multi-page articles.
    • Added several new types of information tags that the TEMPLATE scraping algorithm can detect.
    • Added support for several additional properties in the TEMPLATE algorithm rule definitions, including regular expression, join delimiter, join modes, mandatory, typification and formatting.
    • Added support for extended xpath definitions for the scrapy extractor in the NEWS SCRAPING algorithm.
    • Added support for rotating proxies configurable per project, domain name and so on.
    • Added support for iterative use of the fetcher (static first, then possibly dynamic) and for an auto fetcher type.
    • Added extended statistical data tracking for crawling and scraping.
    • Added extended root URL logic, including different algorithms for varying the root URLs and schema macro support.
    • Improved TEMPLATE scraping with support for the multi-template sequential algorithm.
    • Improved the URL schema validation, canonicalization and generation algorithms.
    • Improved the smart crawling algorithms by minimizing the request frequency per host to prevent site server overload.
    • Improved detection of page changes in the crawling algorithm using the HTTP HEAD request and the Last-Modified header.
    • Improved support for RSS sources in NEWS scraping; fixed RDF and RSS2 support.
    • Improved support for iterative crawling, especially for real-time requests and parallel batching.
    • Improved support for the request depth in real-time crawling requests, including the ability to collect links and crawl several levels of a site's pages.
    • Improved the text extraction algorithm to support recursive scanning of the DOM hierarchy, with detection of paragraph markup and configurable replacements, to produce well-formatted text.
    • Improved the configurable robots.txt support.
    • Improved the internal API for several commands, including URL_CONTENT, URL_DELETE and URL_UPDATE, which were extended with support for manipulating raw HTML and processed content, correlated data such as HTTP headers, lists of objects, selection criteria and so on.
    • Improved the external API, including the HTTP web gateway limitations and request structure validation.
    • Improved the DB storage API and operations support.
    • Improved the date detection and validation algorithm for all scraping types.
    • Improved image detection and best image selection for the NEWS scraping algorithm.
    • Improved the sequential algorithm of the NEWS scraping type, including metrics and value extraction.
    • Improved configuration and customization of crawling and scraping to provide maximum flexibility per request, project and so on.
    • Improved the filter set, including regular expression support and support for several additional steps and stages.
    • Improved the algorithms for calculating the unique content CRC, including the use of soundex and snowball stemming.
    • Documentation updates and fixes.
  • Updated “Distributed Tasks Manager” (DTM) application:
    • Migrated the DB backend from SQLite to MySQL.
    • Added support for different balancing types for different task types.
    • Improved the CLI API with support for task lists and states.
    • Fixed several bugs in task management and the API.
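
The DC entry on common metrics calculation says that basic metrics (bytes, characters, words) can be used to choose the best result when several extraction algorithms are applied in sequence. The following is only a minimal sketch of that idea, not the DC service code: the names ExtractionResult, basic_metrics and choose_best, the extractor labels and the default "words" metric are all hypothetical and chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class ExtractionResult:
    algorithm: str  # hypothetical label of the extractor that produced the text
    text: str


def basic_metrics(text: str) -> dict:
    """Compute the basic metrics named in the change-log: bytes, characters, words."""
    return {
        "bytes": len(text.encode("utf-8")),
        "chars": len(text),
        "words": len(text.split()),
    }


def choose_best(results, metric="words"):
    """Return the result with the highest value of the chosen metric.

    max() keeps the first maximal item, so ties go to the algorithm
    that was applied earlier in the sequence. Returns None for no results.
    """
    if not results:
        return None
    return max(results, key=lambda r: basic_metrics(r.text)[metric])


# Hypothetical outputs of three extractors applied one after another.
candidates = [
    ExtractionResult("extractor_a", "Short teaser."),
    ExtractionResult("extractor_b", "A longer article body with many more words in it."),
    ExtractionResult("extractor_c", ""),
]
best = choose_best(candidates)
print(best.algorithm)  # extractor_b
```

A real selection would likely combine several metrics or weight them per project; a single metric is used here only to keep the sketch short.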