Products

The HCE project products:

  • The hce-node core network transport cluster infrastructure engine.
  • The Bundle:
    • Distributed Crawler service (DC),
    • Distributed Tasks Manager service (DTM),
    • PHP language API and management tools,
    • Python language API and management tools.
  • Utilities.

All of them are the set of applications that can be used to construct different distributed solutions like: remote processes execution management, data processing (including the text mining with NLP), web sites crawling (including incremental, periodic, with flexible and adaptive scheduling, RSS feeds and custom structured), web sites data scraping (include pre-defined and custom scrapers, xpath templates, sequential and optimized scraping algorithms), web-search engine (complete cycle including the crawling, scraping and distributed search index based on the Sphinx indexing engine), corporate integrated full-text search based on distributed Sphinx engine index and many more another applied solutions with similar business logic…

The hce-node core network transport cluster infrastructure engine application


The node application “hce-node” is a core engine of the HCE project. It implements main network infrastructure support, MOM transport and integrates several applied functional object modules like Sphinx Search Functional Object (SS FO) and Distributed Remote Command Execution Functional Object (DRCE FO). It is main dependency component of all related end-user services and applications provided with the Bundle like Distributed Tasks Manager and Distributed Crawler. It is used for delivery of messages and tasks for target hosts and nodes of cluster infrastructure, remote execution of processes and management automation scripts as synchronous and asynchronous tasks and monitoring if hosts system’s resources as a part of tasks management and load balancing.

It supports client-side, front-end and back-end API based on the ZMQ messaging transport, router sockets, custom message routing and several messages balancing modes.

Distribution provided by separated tarball and package for Debian OS and Centos OS (temporary suspended) forms.

The Bundle


The Bundle archive is a set of end-user applied solutions and API bindings.

Now API implemented for the PHP and the Python languages and includes several layers of networking and functional objects interaction. Set of APIs includes demo applications and tests. More detailed descriptions see in readme.txt file from package archive.

Also the Bundle archive includes end-user services – Distributed Tasks Manager and Distributed Crawler.

They are implemented on the Python language and can be used as self-sufficient functional units in target projects.

Distributed Tasks Manager


It is a Linux OS multi-thread daemon application that implements business-logic functionality of tasks management that uses the DRCE Cluster Execution Environment to manage tasks as remote processes. It implements general management operations for distributed tasks scheduling, execution, state check, OS resources monitoring and so on. This application can be used for parallel tasks execution with state monitoring on hierarchical network cluster infrastructure with custom nodes connection schema. It is multipurpose application aimed to cover needs of projects with big data computations, distributed data processing, multi-host data processing with OS system resources balancing, limitations and so on. It supports several balancing modes including multicast, random, round-robin and system resource usage algorithms. Also, provides high level state check, statistics and diagnostic automation based on natural hierarchy and relations between nodes. Supports messages routing as a tasks and data balancing method or a tasks management.

Distributed Crawler


It is a Linux OS daemon application that implements business-logic functionality of distributed web crawler and document data processor. It is based on the DTM application main functionality and hce-node DRCE Functional Object functionality and uses web crawling, processing and another related tasks as an isolated session executable modules with common business logic encapsulation. Also, the crawler contains raw contents storage subsystem based on file system (can be customized to support key-value storage or SQL). This application uses several DRCE Clusters to construct network infrastructure, MySQL and sqlite back-end for indexed data (Sites, URLs, contents and configuration properties) as well as a key-value data storage for the processed contents of pages or documents.

Both DTM and DC applications provided with set of functional tests and demo operations automation scripts based on Linux shell.

The Bundle distribution provided as zip archive that needs some environmental support to get functionality be ready. Detailed description of dependencies see on installation section.

Utilities


It is set of different by role and functionality separated console applications that can be united in some chains to get sequential data processing on server-side functionality and can be used as self-sufficient tools.

Utilities designed to get common functional units for typical web projects that need to get huge data from web or another sources, pars, convert and process it. It supports unified input-output interface and json format of messages interaction.

First implementation of utilities application is a highlighter.

Planned implementation:

  • Key-value database manager.
  • Fast multi-thread text preparer (documents parser, data extractor, data and formats converter).