HCE-node v1.4.0 stable release


Release 1.4.0 (2015-03-16)

  • Core transport infrastructure application hce-node:
    • Added support for load-balancing modes based on system resource usage.
    • Added support for fast accumulated updates of node property statistics.
    • Added support for flexible DRCE task scheduling, including retries and automatic removal (including default behavior after a DTM service restart).
    • Extended hce-node management through PHP CLI utilities, including run-time configuration changes and state checks.
    • Fixed DRCE task state notifications and internal task process management, including process state detection, termination, and calculation of the related system resource usage indicators, which are used in external monitoring and in the resource-usage load-balancing modes.
    • Fixed several bugs in the DRCE functional object's task process management and notifications.
    • Many fixes to the Python API, plus additions to support the new DRCE functionality.
  • Updated “Distributed Crawler” (DC) application:
    • Support for four types of asynchronous task processes: crawl, process (scraping), age, and purge. These are fundamental periodic tasks executed and managed by the DTM service, with a completely separate set of settings, default behavior definitions, parameters, and queue monitoring. The service's periodic process is split into four fundamental data flows: Crawling, Processing, Aging, and Purging.
    • Support for a multi-threaded re-crawl process model with isolated and parallel supervision.
    • Completely separated crawling and processing, with the ability to configure all options and schedules and to manage limits; the processing method or algorithm can optionally be selected, with scraping supported as one of several possible methods.
    • Improved real-time crawling processing, and updated the post-processing procedure and state management.
    • Improved processing algorithms, including support for a common unified procedure for selecting and using one of several configured or integrated algorithms.
    • Improved use of scraping algorithms, estimation of result indicators (tag quality and so on), and metrics support.
    • Extended the management automation scripts that start, check the state of, and stop the service; they now support task queue monitoring and waiting for tasks to finish.
    • Migrated the default local content storage from SQLite to MySQL, with support for the complete set of operations, including the ability to upload custom content and process it through the regular API.
    • Support for two modes of the resource delete operation, immediate and postponed, which makes mass data removal from the file system smoother and CPU I/O wait more predictable.
    • Support for RSS feeds, including scraping based on the feed’s data alone as well as with regular crawling of the actual web page sources.
    • Support for multiple contents/tags as the result of processing, produced by applying scrapers sequentially and saved in the local storage.
    • Extended and added to the set of functional tests.
    • Documentation updates and fixes.
  • Updated “Distributed Tasks Manager” (DTM) application:
    • Improved task management and state definitions, including rescheduling, retrying, and garbage removal at service start/stop.
    • Extended the client and management tools with the ability to get the task queue with the complete set of fields at run time.
    • Extended the management automation scripts that start, check the state of, and stop the service so that they support task queue checks and waiting for tasks to actually finish.
    • Fixed several bugs related to handling specific task states in the execution environment.
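
The retry and auto-removal behavior listed above for DRCE and DTM tasks can be pictured with a small standalone sketch. This is not the hce-node or DTM API: the `Task` class, its `max_retries` and `auto_remove` fields, the state names, and the `run_task` and `make_flaky` helpers are all hypothetical and exist only to illustrate the general idea of retrying a failed task a bounded number of times and then dropping it from the queue.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class Task:
    task_id: str
    max_retries: int = 2       # hypothetical: extra attempts allowed after the first
    auto_remove: bool = True   # hypothetical: drop the task from the queue when done
    attempts: int = 0
    state: str = "NEW"         # hypothetical state names, not the DTM ones


def run_task(task: Task, action: Callable[[], None], queue: Dict[str, Task]) -> str:
    """Run `action` with bounded retries, then optionally remove the task."""
    queue[task.task_id] = task
    while True:
        task.attempts += 1
        try:
            action()
            task.state = "FINISHED"
            break
        except Exception:
            if task.attempts > task.max_retries:
                # Retries exhausted: give up and mark the task as failed.
                task.state = "FAILED"
                break
            task.state = "RETRY"
    if task.auto_remove:
        # Auto-removal: tasks in a terminal state do not linger in the queue.
        del queue[task.task_id]
    return task.state


def make_flaky(failures: int) -> Callable[[], None]:
    """Return an action that raises `failures` times and then succeeds."""
    remaining = {"n": failures}

    def action() -> None:
        if remaining["n"] > 0:
            remaining["n"] -= 1
            raise RuntimeError("simulated task failure")

    return action


if __name__ == "__main__":
    queue: Dict[str, Task] = {}
    print(run_task(Task("t1"), make_flaky(2), queue), sorted(queue))
```

In this sketch a task with `auto_remove=False` stays in the queue in its terminal state, which is one way an operator could still inspect it after it finishes or fails.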