HCE-node v1.4.0 stable release

Changelog:

Release 1.4.0 (2015-03-16)

  • Core transport infrastructure application hce-node:
    • Added support for system resource usage balancing modes.
    • Added support for fast accumulated updates of node properties statistics.
    • Added support for flexible DRCE task scheduling, including retries and auto-removal (including default behavior after a DTM service restart).
    • Extended hce-node management with PHP CLI utilities, including run-time configuration changes and state checks.
    • Fixes for the DRCE task state notifications and internal task process management, including process state detection, termination, and calculation of the related system resource usage indicators for use in external monitoring and in resource usage load-balancing modes.
    • Fixed several bugs in DRCE functional object task process management and notifications.
    • Many fixes and additions to the Python API to support new DRCE functionality.
  • Updated “Distributed Crawler” (DC) application:
    • Support for four types of asynchronous task processes: crawl, process (scraping), age, and purge, as fundamental periodic tasks executed and managed via the DTM service, each with a completely separate set of settings, default behavior definitions, parameters, and queue monitoring. The service's periodic process is split into four fundamental data flows: Crawling, Processing, Aging, and Purging.
    • Support for a multi-threaded re-crawl process model with isolated and parallel supervision.
    • Completely separated crawling and processing, with the ability to configure all options, schedules, and limitations; the processing method or algorithm can optionally be selected, with scraping supported as one of several possibilities.
    • Improved real-time crawling and processing, with an updated post-processing procedure and state management.
    • Improved processing algorithms, including a common unified mechanism for selecting and using one of several configured or integrated algorithms.
    • Improved usage of scraping algorithms and estimation of result indicators, tag quality, and related metrics.
    • Extended management automation scripts to start, check the state of, and stop the service, with support for task queue monitoring and waiting for tasks to finish.
    • Migration of the default local content storage from SQLite to MySQL, with support for the complete set of operations, including the ability to upload custom content and process it via the regular API.
    • Support for two modes of the resource delete operation, immediate and postponed, making mass data removal from the file system smoother and CPU I/O wait more predictable.
    • Support for RSS feeds, including scraping based only on the feed's data as well as regular crawling of the real web page sources.
    • Support for multiple contents/tags as the result of processing, as part of sequential scraper application, with results saved in the local storage.
    • Extended and added to the set of functional tests.
    • Documentation updates and fixes.
  • Updated “Distributed Tasks Manager” (DTM) application:
    • Improved task management and state definitions, including re-scheduling, retrying, and garbage removal at service start/stop.
    • Extended client and management tools with the ability to get the task queue with the complete field set at run time.
    • Extended management automation scripts to start, check the state of, and stop the service, with support for task queue checks and waiting for actual task completion.
    • Fixed several bugs related to handling specific task states in the execution environment.

HCE-node updated to v1.4.0 with new dev. build

Changelog:

Release 1.4.0 (2015-02-20)

  • Added support for system resource usage balancing modes.
  • Added support for fast accumulated updates of node properties.
  • Added support for flexible DRCE task planning.
  • Extended hce-node management PHP CLI utilities, including support for configuration changes and state checks.
  • Fixes for the DRCE task state modification and management functionality.
  • Fixed several bugs in the DRCE functional object.
  • Many fixes and additions to the Python API to support new DRCE functionality.
  • Updated “Distributed Crawler” (DC) application:
    • Support for four types of asynchronous task processes: crawl, process (scraping), age, and purge.
    • Support for a multi-threaded re-crawl process.
    • Completely separated crawling and processing, with the ability to configure all options, schedules, and limitations.
    • Improved real-time crawling and processing, with an updated post-processing procedure and state management.
    • Improved processing algorithm support, including common unified algorithm selection and usage.
    • Improved usage of scraping algorithms and estimation of result indicators, tag quality, and related metrics.
    • Extended management automation scripts to start, check the state of, and stop the service, with support for task queue checks and waiting for actual task completion.
  • Updated “Distributed Tasks Manager” (DTM) application:
    • Improved task management and state definitions, including re-scheduling, retrying, and garbage removal at service start/stop.
    • Extended client and management tools with the ability to get the task queue with the complete field set at run time.
    • Extended management automation scripts to start, check the state of, and stop the service, with support for task queue checks and waiting for actual task completion.
    • Fixed several bugs related to handling specific task states in the execution environment.

Distributed Crawler Service v1.4 alpha available for developers

Changelog:

Pre-release 1.4 alpha (2015-02-09)

  • New feature: postponed deletion, cleanup, and purging of resource data, including content in disk storage and in the key-value DB, performed periodically and limited by load level and item count. Completely configurable schedule and selection of deletion candidates; a separate MySQL database with tables for each site; a balanced purging task with an optimized load level across a multi-host system, and so on…
  • New feature: completely separated crawling and processing task management, including task queue processing, scheduling, load-level balancing, task competition configuration, re-crawling and re-processing on demand and according to the schedule, and many more…
  • New feature: fully multi-threaded re-crawling management with support for resource balancing and site state protection, including configurable cleanup, optimization, and automatic tuning of the re-crawl period…
  • New feature: completely separated purging of deleted resources from the system, including load balancing and scheduling of purging tasks in a multi-host configuration…
  • New feature: support for MySQL-based locking of per-host DB operations, to protect database structures from overlapping multi-process operations.
  • Improvements to the scraping algorithms and the processing core, including support for fully customized real-time crawling and processing requests with fixed scraping templates and scraper selection.
  • Many fixes for crawling and scraping features.

The latest unstable bundle archive can be downloaded here.

CentOS installation recommendations

Install as a CentOS 7 amd64 package


This method requires root privileges or sudo for the user.

1) To ensure that we have the latest versions of the default system tools, let’s begin by running a base update on the system:

sudo yum -y update

2) Add the untrusted repository to /etc/yum.repos.d/hce.repo (replace the baseurl if needed):

[hce]
name=hce repo
baseurl=http://packages.hierarchical-cluster-engine.com/centos/7/$basearch/
gpgcheck=0
enabled=1
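
For example, the repository file can be created from the shell as shown below (a convenience sketch; creating the file with any text editor works just as well):

# Write the repository definition shown above to /etc/yum.repos.d/hce.repo
sudo tee /etc/yum.repos.d/hce.repo > /dev/null << 'EOF'
[hce]
name=hce repo
baseurl=http://packages.hierarchical-cluster-engine.com/centos/7/$basearch/
gpgcheck=0
enabled=1
EOF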

3) The HCE package has some dependencies which can be resolved by adding the EPEL repository.

# If you are on a 64-bit CentOS / RHEL based system:

 sudo rpm -ivh http://dl.fedoraproject.org/pub/epel/7/x86_64/e/epel-release-7-5.noarch.rpm

4) Run a base update on the system again:

sudo yum -y update

5) Install “hce-node” package:

 sudo yum install hce-node

6) To install the Bundle into the home directory, run the script included in the package distribution with regular user privileges:

 hce-node-bundle-install.sh

The hce-node-bundle directory will be created in the home directory of the current user. Please read the ~/hce-node-bundle/api/php/doc/readme.txt file to continue, install the Bundle Environment, and run HCE in demo and test mode.
7) Install Dev Tools:

yum groupinstall 'development tools'

Install the Bundle Environment for PHP language


1) Install the zmq library:

sudo yum install libzmq3-dev

If PHP is already installed, this step can be skipped.

2) Install PHP:

sudo yum install php php-devel
sudo yum install php-pear pkgconfig openpgm-devel zeromq3-devel
sudo pecl install --ignore-errors zmq-beta

After that, you may need to create the file /etc/php.d/zmq.ini and add the following line to it:

extension=zmq.so
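
For example (a convenience sketch; the file can also be created with any text editor):

# Create /etc/php.d/zmq.ini containing the single line above
echo 'extension=zmq.so' | sudo tee /etc/php.d/zmq.ini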

3) Install the Sphinx search engine:

sudo yum install sphinx sphinx-php

4) To test the DC service and the main crawling process (~/hce-node-bundle/api/python/ftests/dc_test_rnd_site.sh), install httpd:

yum -y install httpd
systemctl start httpd

And copy the test site files into the httpd root directory:

sudo cp ~/hce-node-bundle/api/python/data/ftests/test_site/* /var/www/html/

5) Install bc for DRCE tests:

sudo yum install bc

6) Install Java 7 for DRCE tests (optional):

sudo yum install java-1.7.0-openjdk

Please read the ~/hce-node-bundle/api/php/doc/readme.txt file to continue.

Install Bundle Environment for Python language, DC and DTM services


This method requires root privileges or sudo for the user.

1) CentOS package dependencies:

sudo yum install openpgm-devel mariadb-server mariadb python-pip python-devel python-flask python-flask-wtf ruby libffi-devel \
libxml2-devel libxslt-devel mariadb-devel mysql-connector-python libicu-devel gmp-devel libtidy-devel python-dateutil

Add mariadb to autorun:

systemctl enable mariadb.service

and start mariadb:

systemctl start mariadb

Run mysql_secure_installation and create a password for the MySQL root user:

mysql_secure_installation

2) Python module dependencies:

sudo pip install cement sqlalchemy Flask-SQLAlchemy scrapy gmpy lepl requests
sudo pip install urlnorm pyicu mysql-python newspaper goose-extractor
sudo pip install pytidylib uritools python-magic
sudo pip install pyzmq --install-option="--zmq=bundled"

For dynamic page crawling:

sudo pip install Ghost.py

3) Create the MySQL user and DB schema for the Distributed Crawler application:

cd ~/hce-node-bundle/api/python/manage/
sudo ./mysql_create_user.sh
./mysql_create_struct.sh

Install Distributed Crawler client Environment for Python language


This method requires root privileges or sudo for the user.

1) CentOS package dependencies:

sudo yum install python-pip python-devel libffi-devel libxml2-devel libxslt-devel

2) Python package dependencies:

sudo pip install cement scrapy w3lib
sudo pip install pyzmq --install-option="--zmq=bundled"

3) In case the DTS archive was downloaded directly, run the following after unzipping:

chmod 777 ~/hce-node-bundle/usr/bin/hce-node-permissions.sh
~/hce-node-bundle/usr/bin/hce-node-permissions.sh

Dual host HCE-node clusters configuration and management

In a dual-host cluster the node instances are located on two physical or logical hosts. The first host holds the router, manager, and data nodes; the second host holds only data nodes. This is the minimal multi-host configuration. The n-type cluster is configured the same way as on a single host: two replica node instances, but located on different hosts, one on the first host and one on the second. The m-type cluster is configured the same way as the n-type: two shard node instances, one on the first host and one on the second.

The configuration files for the first and second hosts are different. For the first host of the n-type cluster:

c112_localhost_n0_2h_cfg.sh

for the second host of the n-type cluster:

c112_localhost_n0_2h-data_cfg.sh

For the first host of the m-type cluster:

c112_localhost_m0_2h_cfg.sh

and for the second host of the m-type cluster:

c112_localhost_m0_2h-data_cfg.sh

For the first host of the r-type cluster:

c112_localhost_r0_2h_cfg.sh

and for the second host of the r-type cluster:

c112_localhost_r0_2h-data_cfg.sh

To activate the dual-host configuration, some changes need to be made to the current_cfg.sh file on each host, in a different way for each host. In each of the three sections for the n-, m-, and r-type clusters, the default single-host configuration line needs to be commented out and the corresponding first-host or second-host line of the dual-host configuration needs to be uncommented. As a result, on the first host only the three lines titled:

"Multi-host configuration for first host of dual hosts"

need to be uncommented, and on the second host only the three lines titled:

"Multi-host configuration for second host of dual hosts"

need to be uncommented, for the n-, m-, and r-type clusters.
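
As an illustration, the n-type section of current_cfg.sh on the first host might end up looking roughly like the sketch below. The way the configuration files are referenced (sourced with “.”) and the single-host file name are assumptions made for this example; only the dual-host file names and comment titles are taken from the text above, so follow the actual comments in your current_cfg.sh.

# Single-host configuration (default), commented out; assumed file name
#. ./c112_localhost_n0_cfg.sh
# Multi-host configuration for first host of dual hosts, uncommented on the first host
. ./c112_localhost_n0_2h_cfg.sh
# Multi-host configuration for second host of dual hosts, left commented on the first host
#. ./c112_localhost_n0_2h-data_cfg.sh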

The configuration files need to be modified to specify the IP addresses of the hosts. The configuration files for the first host, “c112_localhost_?0_2h_cfg.sh”, need to be modified with the definition of the second host's IP address:

REMOTE_HOSTS="10.0.0.2"

where 10.0.0.2 should be replaced with the second host's IP address. The configuration files for the second host, “c112_localhost_?0_2h-data_cfg.sh”, need to be modified with the definition of the manager host's IP address:

MANAGER="10.0.0.1"

where 10.0.0.1 should be replaced with the first host's IP address.

The node_pool1.ini file used by all data nodes needs to be modified to specify the node's own IP address and the notification service IP address. In the line:

node_host=localhost

localhost needs to be replaced with this host's IP address (different for the first and second hosts). In the line:

state_notification_host=127.0.0.1

127.0.0.1 needs to be replaced with the first host's IP address, because the DTM service is located on that host.
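
For example, using the same sample addresses as above (10.0.0.1 for the first host and 10.0.0.2 for the second), the relevant lines of node_pool1.ini on the second host would be:

# node_pool1.ini on the second host (sample addresses from this section)
node_host=10.0.0.2
state_notification_host=10.0.0.1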

After all these modifications are done, the cluster nodes can be started on both hosts in any order (first or second host).

To start all nodes of the n- and m-type clusters on the first host, use:

~/hce_hce-node-bundle/api/php/manage/start_nm.sh

and on the second host use:

~/hce_hce-node-bundle/api/php/manage/start_replicas_pool_nm.sh

because there are no router or manager nodes on that host.

To stop all nodes on the first host, use:

~/hce_hce-node-bundle/api/php/manage/stop_nm.sh

and on the second host:

~/hce_hce-node-bundle/api/php/manage/stop_replicas_nm.sh

HCE-node updated to v1.3.2 with new dev. build

Changelog:

Release 1.3.2 (2014-12-05)

  • Fixes for the node balancing modes functionality.
  • Fixes for the node routing functionality.
  • Fixes for the DRCE extended task management and statistical data functionality.
  • Extended hce-node management PHP CLI utilities, including support for configuration changes and state checks.
  • Fixed several bugs in the DRCE functional object.
  • Fixes and additions to the Python API to support new DRCE test sets.
  • Updated “hce-node”, “Distributed Crawler” and “Distributed Tasks Manager”.
  • Added a new informational client application for Android, “hce-dc-info”.
  • and many more…

HCE-node updated to v1.3 with new dev. build

Changelog:

Release 1.3 (2014-11-07)

  • Added new node functionality supporting balancing modes.
  • Added new node functionality supporting routing.
  • Added new DRCE functionality supporting extended task management and statistical data.
  • Extended hce-node management PHP CLI utilities, including support for configuration changes and state checks.
  • Fixed several bugs in the DRCE functional object.
  • Added PHP API additions to support new DRCE test sets.
  • Added Python API additions to support new DRCE test sets.
  • Updated “Distributed Crawler” and “Distributed Tasks Manager”.
  • DC service: support for several libraries added; crawling of RSS feeds and scraping of article tags improved.
  • and many more…

HCE-node v1.2-5 migrated from the Demo Test Suit (DTS) products set to the Bundle set

Changelog:

Release 1.2-5 (2014-09-11)

  • Fixed several bugs in the DRCE module functionality.
  • Improved DRCE statistics for tasks.
  • Completely updated the “Bundle” products, including the “Distributed Tasks Manager” and “Distributed Crawler” application services.
  • The “Distributed Crawler” (DC) service additions and fixes:
    • Support for dynamically rendered pages, including JavaScript (based on the Ghost library).
    • Improved scraper module with XPath rule set support.
    • Improved crawler module.
    • Extended tests of crawling Japanese sites.
    • Completely updated documentation.

HCE-node updated to v1.2-5 with new dev. build

Changelog:

Release 1.2-5 (2014-08-20)

  • Added new DRCE functionality supporting subtasks.
  • Added new DRCE functionality supporting task state logging.
  • Added new DRCE functionality supporting the delete tasks extension.
  • Extended hce-node management PHP CLI utilities, including support for configuration changes and state checks.
  • Fixed several bugs in the DRCE functional object.
  • Added PHP API additions to support new DRCE test sets.
  • Added Python API additions to support new DRCE test sets.
  • Completely updated the “Demo Test Suit”, including the Python application services “Distributed Tasks Manager” and “Distributed Crawler”.
  • New version of the “Distributed Crawler”, “chaika”, added.
  • DC service: tidy library support added; pytidylib is required for the Python environment.
  • and many more…

Distributed Crawler and Distributed Tasks Manager services pre-release 1.0.1 “Chaika”

Distributed Crawler service 1.0.1 “Chaika”


Changelog:

Pre-release “1.0.1-chaika” (2014-07-24)

  • Added automatic periodic re-crawling.
  • Added incremental crawling.
  • Added proportional crawling.
  • Added automatic removal.
  • Added resource host storage migration.
  • Fixed crawling and processing bugs.
  • Completely updated integration and deployment for Debian 7 OS.
  • and many more…

Distributed Tasks Manager service 1.0.1 “Chaika”


Changelog:

Pre-release “1.0.1-chaika” (2014-07-24)

  • Added task rescheduling.
  • Added task re-runs in case of resource limitations.
  • Added task resource limitation.
  • Fixed DRCE router protocol and DRCE hce-node tasks management.
  • Completely updated integration and deployment for Debian 7 OS.
  • and many more…