Stable release 1.5.1 (2016-01-04)

Changelog:

Release 1.5.1 (2016-01-04). This version contains many significant additions across all components and modules, including the hce-node network transport itself and the DC and DTM services. It also brings many performance optimizations and improvements for parallel computations, adds several profiling tools, and includes a structural refactoring of the source code covering more than 50% of modules and algorithms.

  • Core transport infrastructure application hce-node:
    • Added support for a command that restores notification connections.
    • Added support for specifying the DRCE temporary directory in the node ini file.
    • Improved the resource usage balancing modes, extending the random algorithm to prevent skew when several hosts have a similar load level.
    • Fixed several bugs in the DRCE functional object's task process management and notifications.
  • Updated “Distributed Crawler” (DC) service:
    • Added support for a dynamic fetcher based on Selenium and Google Chrome.
    • Added support for a multi-item TEMPLATE scraping algorithm, aimed especially at product and search results pages, including smart correction of the correspondence of field values per item.
    • Added support for CSS path expressions in TEMPLATE scraping rules.
    • Added support for common metrics calculation across all scraping/processing algorithms; basic metrics like the number of bytes, characters, words and so on can be used to choose the best result when several detection and extraction algorithms are applied sequentially.
    • Added support for joining page chains, as well as multi-page articles.
    • Added several new types of information tags for the TEMPLATE scraping algorithm to detect.
    • Added support for several additional properties in TEMPLATE algorithm rule definitions, including regular expressions, join delimiter, join modes, mandatory flag, typification and formatting.
    • Added support for extended xpath definitions for the scrapy extractor in the NEWS SCRAPING algorithm.
    • Added support for rotated proxies, configurable per project, domain name and so on.
    • Added support for iterative fetcher usage (static first, then dynamic if needed) as well as an automatic fetcher type.
    • Added extended statistical data tracking for crawling and scraping.
    • Added extended root URLs logic, including different algorithms for root URL variation and schema macro support.
    • Improved TEMPLATE scraping with support for a multi-template sequential algorithm.
    • Improved the URL schema validation, canonicalization and generation algorithms.
    • Improved the smart crawling algorithms, minimizing the request frequency to a single host to prevent overloading the site's server.
    • Improved the crawling algorithm's detection of page changes using HEAD HTTP requests and the Last-Modified header (see the sketch after this list).
    • Improved support for RSS sources in NEWS scraping; fixed RDF and RSS2 handling.
    • Improved support for iterative crawling, especially for real-time requests and parallel batching.
    • Improved support for request depth in real-time request crawling, including the possibility to collect links and crawl several levels of a site's pages.
    • Improved the text extraction algorithm to support recursive scanning of the DOM hierarchy, with detection of paragraph markup and configurable replacements to produce well-formatted text.
    • Improved the configurable robots.txt support.
    • Improved the internal API for several commands, including URL_CONTENT, URL_DELETE and URL_UPDATE, extended with support for raw HTML and processed content manipulation, correlated data like HTTP headers, manipulation of lists of objects, selection criteria and so on.
    • Improved the external API, including HTTP web gateway limitations and request structure validation.
    • Improved the DB storage API and operations support.
    • Improved the date detection and validation algorithm for all scraping types.
    • Improved image detection and best-image selection for the NEWS scraping algorithm.
    • Improved the sequential algorithm of the NEWS scraping type, including metrics and value extraction.
    • Improved configuration and customization of crawling and scraping for maximum flexibility per request, project and so on.
    • Improved the filter set, including regular expression support and several additional steps and stages.
    • Improved the algorithms for calculating a unique CRC of content, including usage of Soundex and Snowball stemming.
    • Documentation updates and fixes.
  • Updated “Distributed Tasks Manager” (DTM) service:
    • Migrated from the SQLite DB backend to MySQL.
    • Added support for different balancing types for different task types.
    • Improved the CLI API with support for task lists and states.
    • Fixed several bugs in task management and the API.
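The page-change detection mentioned above can be illustrated with a short shell sketch: a HEAD request retrieves only the response headers, so the Last-Modified value can be compared with the one stored from the previous crawl without downloading the body. This is an assumed illustration of the idea only, not code from the DC service; the URL and the stored value are placeholders:

# Fetch only the response headers (HEAD request, no body transfer).
last_modified=$(curl -sI http://example.com/page.html | grep -i '^Last-Modified:' | cut -d' ' -f2- | tr -d '\r')

# Compare with the value saved during the previous crawl (placeholder value).
previous='Mon, 04 Jan 2016 00:00:00 GMT'
if [ "$last_modified" != "$previous" ]; then
    echo "Page changed, schedule a re-crawl"
fi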

HCE-node v1.4.0 stable release

Changelog:

Release 1.4.0 (2015-03-16)

  • Core transport infrastructure application hce-node:
    • Added support of system resources usage balancing modes.
    • Added support for fast accumulated updates of node properties statistics.
    • Added support for flexible DRCE task scheduling, including retries and auto-removal (including default behavior after a DTM service restart).
    • Added extended hce-node management via PHP CLI utilities, including run-time configuration changes and state checks.
    • Fixed the DRCE functionality for task state notifications and internal task process management, including process state detection, termination, and calculation of the related system resource usage indicators used in external monitoring and in resource usage load-balancing modes.
    • Fixed several bugs in the DRCE functional object's task process management and notifications.
    • Many fixes and additions to the Python API to support new DRCE functionality.
  • Updated “Distributed Crawler” (DC) application:
    • Support for four types of asynchronous task processes: crawl, process (scraping), age and purge, as fundamental periodic tasks executed and managed by the DTM service, each with a completely separate set of settings, default behavior definitions, parameters and queue monitoring. The service's periodic processing is split into four fundamental data flows: Crawling, Processing, Aging and Purging.
    • Support for a multi-threaded re-crawl process model with isolated and parallel supervision.
    • Completely separated crawling and processing, with the possibility to configure all options and schedules, manage limitations, and optionally select the processing method or algorithm, with scraping supported as one of several possibilities.
    • Improved real-time crawling and processing, with an updated post-processing procedure and states management.
    • Improved processing algorithms, including support for a common unified algorithm to select and use one of several configured or integrated algorithms.
    • Improved usage of scraping algorithms and estimation of result indicators, tag quality and so on, with metrics support.
    • Extended the management automation scripts that start, check and stop the service, with support for task queue monitoring and waiting for tasks to finish.
    • Migrated the default local content storage from SQLite to MySQL, with support for the complete set of operations, including the possibility to upload custom content and process it via the regular API.
    • Support for two modes of the resource delete operation, immediate and postponed, making mass data removal from the file system smoother and CPU I/O wait more predictable.
    • Support for RSS feeds, including scraping based only on the feed's data as well as regular crawling of the real web page sources.
    • Support for multiple contents/tags as a processing result, as part of sequential scraper application, with results saved in the local storage.
    • Extensions and additions to the set of functional tests.
    • Documentation updates and fixes.
  • Updated “Distributed Tasks Manager” (DTM) application:
    • Improved task management and state definitions, including re-scheduling, retrying, and garbage removal at service start/stop.
    • Extended the client and management tools with the possibility to get the tasks queue with the complete field set at run time.
    • Extended the management automation scripts that start, check and stop the service to support task queue checks and waiting for real tasks to finish.
    • Fixed several bugs related to handling specific task states in the execution environment.

CentOS installation recommendations

Install as a CentOS 7 amd64 package


This method requires root privileges or sudo for the user.

1) To ensure that we have the latest versions of the default system tools, let's begin by running a base update on the system:

sudo yum -y update

2) Add the untrusted repository definition to /etc/yum.repos.d/hce.repo (replace baseurl as needed):

[hce]
name=hce repo
baseurl=http://packages.hierarchical-cluster-engine.com/centos/7/$basearch/
gpgcheck=0
enabled=1
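One way to create this file in a single step is with a tee heredoc (a sketch; the quoted 'EOF' keeps $basearch literal, which is what yum expects):

sudo tee /etc/yum.repos.d/hce.repo <<'EOF'
[hce]
name=hce repo
baseurl=http://packages.hierarchical-cluster-engine.com/centos/7/$basearch/
gpgcheck=0
enabled=1
EOF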

3) The HCE package has some dependencies which can be resolved by adding the EPEL repository.

# If you are on a 64-bit CentOS / RHEL based system:

 sudo rpm -ivh http://dl.fedoraproject.org/pub/epel/7/x86_64/e/epel-release-7-5.noarch.rpm
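On CentOS 7 the EPEL release package is usually also available from the standard extras repository, so as an alternative to the direct rpm URL above the following should work as well:

 sudo yum install -y epel-release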

4) Run a base update on our system again:

sudo yum -y update

5) Install “hce-node” package:

 sudo yum install hce-node
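To verify the installation, you can query the RPM database for the package:

 rpm -qi hce-node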

6) To install the Bundle into the home directory, run the script included in the package distribution with regular user privileges:

 hce-node-bundle-install.sh

The hce-node-bundle directory will be created in the home directory of the current user. Please read the ~/hce-node-bundle/api/php/doc/readme.txt file to continue, install the Bundle Environment, and run the demo and test modes of HCE.
7) Install Dev Tools:

sudo yum groupinstall 'development tools'

Install the Bundle Environment for PHP language


1) Install zmq library:

sudo yum install libzmq3-dev

If PHP is already installed, the following step can be skipped.

2) Install php:

sudo yum install php php-devel
sudo yum install php-pear pkgconfig openpgm-devel zeromq3-devel
sudo pecl install --ignore-errors zmq-beta

After that, you may need to create the file /etc/php.d/zmq.ini and add the following line to it:

extension=zmq.so
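The same can be done in one command, followed by a check that PHP sees the extension (a sketch, assuming the CLI php binary is on the PATH):

echo 'extension=zmq.so' | sudo tee /etc/php.d/zmq.ini
php -m | grep -i zmq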

3) Install Sphinx search engine:

sudo yum install sphinx sphinx-php

4) To test the DC service and the main crawling process (~/hce-node-bundle/api/python/ftests/dc_test_rnd_site.sh), install httpd:

sudo yum -y install httpd
sudo systemctl start httpd

Then copy the test site files into the httpd document root:

sudo cp ~/hce-node-bundle/api/python/data/ftests/test_site/* /var/www/html/
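As a quick sanity check (assuming the default httpd document root /var/www/html and local access), you can confirm the test site is being served and, optionally, enable httpd at boot:

# Fetch the first lines of the locally served page.
curl -s http://localhost/ | head
# Optional: start httpd automatically on boot.
sudo systemctl enable httpd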

5) Install bc for DRCE tests:

sudo yum install bc

6) Install Java 7 for DRCE tests (optional):

sudo yum install java-1.7.0-openjdk
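A simple version check confirms the Java runtime is available for the DRCE tests:

java -version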

Please read the ~/hce-node-bundle/api/php/doc/readme.txt file to continue.

Install the Bundle Environment for Python language, DC and DTM services


This method requires root privileges or sudo for the user.

1) CentOS packages dependencies:

sudo yum install openpgm-devel mariadb-server mariadb python-pip python-devel python-flask python-flask-wtf ruby libffi-devel \
libxml2-devel libxslt-devel mariadb-devel mysql-connector-python libicu-devel gmp-devel libtidy-devel python-dateutil

Add MariaDB to autostart:

sudo systemctl enable mariadb.service

and start it:

sudo systemctl start mariadb
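To confirm the service is running before continuing:

systemctl status mariadb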

Run mysql_secure_installation and set a password for the MySQL root user:

mysql_secure_installation

2) Python module dependencies:

sudo pip install cement sqlalchemy Flask-SQLAlchemy scrapy gmpy lepl requests
sudo pip install urlnorm pyicu mysql-python newspaper goose-extractor
sudo pip install pytidylib uritools python-magic
sudo pip install pyzmq --install-option="--zmq=bundled"
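A quick way to confirm the pyzmq build succeeded is to print the libzmq version it was linked against (zmq.zmq_version() is part of the standard pyzmq API):

python -c 'import zmq; print(zmq.zmq_version())'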

For dynamic page crawling:

sudo pip install Ghost.py

3) Create the MySQL user and DB schema for the Distributed Crawler application:

cd ~/hce-node-bundle/api/python/manage/
sudo ./mysql_create_user.sh
./mysql_create_struct.sh
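As an assumed sanity check (the schema names created by these scripts are not listed here), you can confirm the new databases exist:

mysql -u root -p -e 'SHOW DATABASES;'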

Install the Distributed Crawler client environment for Python language


This method requires root privileges or sudo for the user.

1) CentOS packages dependencies:

sudo yum install python-pip python-devel libffi-devel libxml2-devel libxslt-devel

2) Python package dependencies:

sudo pip install cement scrapy w3lib
sudo pip install pyzmq --install-option="--zmq=bundled"

3) If the DTS archive was downloaded directly, run after unpacking:

chmod 777 ~/hce-node-bundle/usr/bin/hce-node-permissions.sh
~/hce-node-bundle/usr/bin/hce-node-permissions.sh

Distributed Crawler and Distributed Tasks Manager services pre-release 1.0.1 “Chaika”

Distributed Crawler service 1.0.1 “Chaika”


Changelog:

Pre-release “1.0.1-chaika” (2014-07-24)

  • Added auto periodic re-crawling.
  • Added incremental crawling.
  • Added proportional crawling.
  • Added auto removal.
  • Added resources host storage migration.
  • Fixed crawling and processing bugs.
  • Completely updated integration and deployment for Debian 7 OS.
  • and many more…

Distributed Tasks Manager service 1.0.1 “Chaika”


Changelog:

Pre-release “1.0.1-chaika” (2014-07-24)

  • Added task rescheduling.
  • Added task re-runs in case of resource limitations.
  • Added task resource limitations.
  • Fixed the DRCE router protocol and DRCE hce-node task management.
  • Completely updated integration and deployment for Debian 7 OS.
  • and many more…

Demo Test Suite package and Python API bindings updated

Changelog:

Release “1.0-alpha” (2014-07-11)

  • Added Crawler URL normalization.
  • Added Crawler API results merging from N nodes.
  • Added Crawler incremental sites crawling.
  • Added additional Crawler support for structured formats.
  • Added Processor improvements to basic and predefined templates for scraping.
  • Added run-time change of logging level and polling timeouts.
  • Completely updated integration and deployment for Debian OS.
  • Completely updated the main networking and crawling engine.
  • and many more…

HCE-node updated to v1.2-3

Changelog:

Release 1.2-3 (2014-06-06)

  • Added new DRCE functionality: support for a set of new request types.
  • Added new DRCE functionality: support for task state callback notifications.
  • Added new DRCE functionality: support for the POCO logger with rotation.
  • Added new DRCE functionality: support for daemonize mode.
  • Added new DRCE statistics data in the task state response.
  • Fixed several bugs in the DRCE functional object.
  • Added PHP API extensions to support the new DRCE test sets.
  • Added Python API extensions to support the new DRCE test sets.
  • Completely updated the “Demo Test Suite”, including the Python application services “Distributed Tasks Manager” and “Distributed Crawler”.
  • Added new APIs and applications for the PHP and Python languages.
  • Completely updated integration and deployment for Debian and CentOS.
  • and many more…