Client API bindings
The HCE core is server-side technology accessible via a network protocol defined by the ZMQ library, which is used as the main transport on top of TCP networking.
The ZMQ library provides bindings for about 30 languages, so the HCE Client API can potentially reach all of those language communities.
The general API structure consists of three main layers:
- Transport – implements the core networking operations: the router and admin network connections of a node, message interchange, and MOM (message-oriented middleware) as the basic transport principle. This is the low-level API. Under alpha testing.
- Application – implements the functionality and algorithms that cover the main natively supported subsystems, such as Sphinx Search, Sphinx Index Management, Distributed Remote Command Execution task management, and node management. This is the middle-level API. Under development.
- User – implements the functionality, algorithms, services, and CLI end-user applications that can be used as top-level interfaces and objects inside target projects, as well as finished applications for management and administration. This is the upper-level API and its utility tools. Under design.
Each layer can be used as a separate, dedicated API library and is implemented as self-sufficient modules for distribution and integration.
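As a rough illustration of how the three layers stack, here is a minimal Python sketch. All class, method, and node names below are invented for the example and do not come from the actual bindings; it only shows the layering principle, with each level built on the one beneath it.

```python
# Hedged sketch of the three-layer API structure. Names are illustrative,
# not the real HCE binding API.

class Transport:
    """Low-level layer: node connections and raw message interchange."""
    def request(self, node, message):
        # A real implementation would send `message` over ZMQ and await a reply.
        return {"node": node, "echo": message}

class ApplicationAPI:
    """Middle layer: subsystem operations built on the Transport layer."""
    def __init__(self, transport):
        self.transport = transport
    def run_command(self, node, command):
        # Wrap the command in a hypothetical DRCE-style message.
        return self.transport.request(node, {"type": "DRCE", "cmd": command})

class UserTool:
    """Upper layer: end-user operations built on the application layer."""
    def __init__(self, app):
        self.app = app
    def execute(self, command):
        return self.app.run_command("node-0", command)

tool = UserTool(ApplicationAPI(Transport()))
result = tool.execute("uname -a")
print(result["echo"]["cmd"])  # uname -a
```

Because each layer only depends on the one below it, any layer can be shipped and integrated as its own module, as described above.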
Top-level applications and end-user functional interfaces
On the ground of the three API levels, several end-user applications are planned for implementation this year. These applications are the first step in integrating the API bindings into target projects, or in creating stand-alone, closed-box services and sites for a wide range of goals and solutions. Four of the planned applications are the most important:
- Distributed Crawler Manager – or just Distributed Crawler – an application with REST and CLI interfaces for the complete life cycle of distributed web crawling. It includes basic web-crawler features: fetching web pages, extracting URLs, managing URL collections, storing resources in a custom-defined and configured storage DB, managing distributed crawling tasks and resource access, and managing related attached properties and collections such as parse templates. This application is based on the Distributed Tasks Manager functionality and extends it with a set of specific algorithms and data structures, while utilizing the unified cluster-level APIs. It is oriented toward further processing of resources, in a distributed way, on the same data nodes where they were crawled; first of all, this is data for the Distributed Content Processor Manager application. The aim of this approach is efficient data distribution at the fetching stage, keeping the storage location on the same side as the computational unit while remaining distributed. Such a solution helps avoid additional network traffic inside the cluster and between the user access point and the HCE system. The second benefit of distributed crawling is multi-IP-address crawling, with a proportional or more complex strategy for distributing TCP connections from crawler to target web server by crawler IP address and timing. Blocking of bots and robots by IP address and connection frequency is one of the best-known problems in web crawling.
- Distributed Content Processor Manager – this application is also based on the Distributed Tasks Manager and represents a collection of algorithms and libraries for a wide area of text-mining purposes, united by a common input/output specification for data flow and the call API. It is an end-user application with REST and CLI interfaces that makes it possible to apply one or several (chained) computational algorithms to computation units such as web pages, resources, and documents. The basic goal is to define and run, on a schedule, a sequence of data-processing procedures in a step-by-step manner, where the output of one algorithm becomes the input of the next on the same cluster node. In a distributed parallel solution this technique minimizes the time between steps and makes good use of host platform resources. The unified protocol for input/output message interchange makes it easy, with minimal additional interface development, to extend the initial algorithm collection with custom libraries and modules. The first collection will contain a set of Python text-mining libraries: HTML parsers, tokenization, data scraping, NLP, and so on.
- Distributed Tasks Manager – a general-purpose application that implements a complete set of algorithms to create and manage tasks executed inside the cluster environment via the DRCE functionality. Tasks can be any kind of executable: source code in various forms and formats, pre-compiled Java, or even binary modules. The only limitation is that the module type must be supported by the data nodes of the cluster. A scheduler with strategies for predicting task time and resource utilization will free end users from worrying about cluster overload or ill-timed task execution. Data for calculations can be provided with the task request message from the user, located on data nodes (for example, after web crawling), or fetched from an external data provider such as an external key-value DB. Computation results can also stay inside the cluster (on data nodes) until the user deletes them. Two basic task execution modes are provided: synchronous and asynchronous. The synchronous mode is simplest for the user and assumes that both source data and results are small, and that the goal of clusterization is parallel computation and efficient data delivery to the target computing unit (physical host and cluster node). The most common uses of this parallel computing mode are load balancing and a highly efficient network infrastructure. The asynchronous mode is more complex and delayed in time: scheduling, resource planning, resource and task state control, task management, results data access, and related algorithms are used to optimize the usage of the set of physical hosts and of distributed remote task execution.
- Cluster Manager – this application is an administration tool set. It implements complete management support for nodes, including all handlers and functional objects: simple cluster management such as state monitoring and statistical data tracking, as well as more complex operations like schema management, node state and role control, data mode management, automated attaching and detaching of new nodes, balancing control, and many others. This application and its user-level APIs are based on the application-level APIs. It also implements interfaces and solutions for visualizing cluster state indicators and provides a client interface for different management scenarios.
Under design.
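Two ideas from the applications above can be sketched together: the step-by-step chaining of the Distributed Content Processor Manager, and the synchronous vs. asynchronous execution modes of the Distributed Tasks Manager. The sketch below is a hedged, standard-library-only illustration; the step functions, document fields, and the thread pool standing in for cluster nodes are all assumptions for the example, not the real APIs.

```python
# Hedged sketch: (1) chaining, where one algorithm's output is the next
# one's input, and (2) synchronous vs. asynchronous task execution.
import re
from concurrent.futures import ThreadPoolExecutor

def strip_tags(doc):
    # Crude tag removal, standing in for a real HTML parser step.
    doc["text"] = re.sub(r"<[^>]+>", " ", doc["html"])
    return doc

def tokenize(doc):
    doc["tokens"] = doc["text"].split()
    return doc

def run_chain(doc, steps):
    # Output of each step becomes input of the next, as in a processing
    # chain executed on a single cluster node.
    for step in steps:
        doc = step(doc)
    return doc

chain = [strip_tags, tokenize]

with ThreadPoolExecutor(max_workers=2) as pool:
    # Synchronous mode: submit one task and block until its result is ready.
    sync_doc = pool.submit(run_chain, {"html": "<p>Hello HCE</p>"}, chain).result()

    # Asynchronous mode: submit several tasks, keep handles, collect later.
    pages = [{"html": "<p>page %d</p>" % i} for i in range(3)]
    futures = [pool.submit(run_chain, p, chain) for p in pages]
    async_docs = [f.result() for f in futures]

print(sync_doc["tokens"])  # ['Hello', 'HCE']
```

Keeping all steps of one chain on the same worker mirrors the design goal stated above: minimum time between steps and no extra data movement between computational units.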
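The proportional multi-IP strategy mentioned for the Distributed Crawler can be illustrated with a simple round-robin assignment. The IP addresses and URLs below are hypothetical; a real crawler would bind outgoing TCP sockets to the chosen addresses and add per-host timing to stay below blocking thresholds.

```python
# Hedged sketch: proportional (round-robin) distribution of crawler
# connections across several source IP addresses.
from itertools import cycle

source_ips = ["10.0.0.1", "10.0.0.2", "10.0.0.3"]  # hypothetical crawler IPs
ip_pool = cycle(source_ips)

def assign_source_ips(urls):
    """Pair each URL with the next source IP in round-robin order."""
    return [(url, next(ip_pool)) for url in urls]

plan = assign_source_ips(["http://example.com/page%d" % i for i in range(6)])
for url, ip in plan:
    print(ip, "->", url)
```

A more complex strategy would weight the rotation by per-IP request history and timing, which is the point of distributing connections by IP address and time in the first place.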
The PHP bindings are already implemented (alpha) in structural form and included in the Bundle package archive. Complete source code documentation for the structural model is available. An object-oriented version is currently under development.
The common object model for the transport level can be represented by the UML diagrams below:
The sequence diagram illustrates common usage of the transport for typical operations:
The object model and class dependencies of the admin bindings are shown in the class UML diagram below:
One possible algorithm for using the admin API is shown in the sequence UML diagram:
The second part of the API, under design for PHP in an OOP model, is the "Cluster Manager Application": a CLI administration tool for complete management operations, as well as a base for building external service-like client-side interfaces.
The user level is currently under design.
Bindings for the Python language differ slightly in the OOP model and in the implementation of library parts, because the Python binding of the ZMQ library differs from the PHP and C++ ones, and some language-specific principles and conventions are not exactly the same.
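As a hedged illustration of what the Python transport binding builds on, the following minimal pyzmq request/reply exchange runs a one-shot echo server over ZMQ's in-process transport. The endpoint name and message contents are made up for the example and are not HCE protocol messages.

```python
# Minimal pyzmq request/reply sketch. The endpoint and payload are
# illustrative only; real HCE nodes define their own TCP endpoints and
# message formats on top of ZMQ.
import threading
import zmq

context = zmq.Context.instance()

# Bind the reply socket first so the in-process endpoint exists.
server = context.socket(zmq.REP)
server.bind("inproc://hce-demo")

def serve_one_request():
    msg = server.recv()          # wait for a single request
    server.send(b"ack:" + msg)   # echo it back with a prefix
    server.close()

worker = threading.Thread(target=serve_one_request)
worker.start()

client = context.socket(zmq.REQ)
client.connect("inproc://hce-demo")
client.send(b"ping")
reply = client.recv()
worker.join()
client.close()

print(reply)  # b'ack:ping'
```

The same REQ/REP pattern works unchanged over `tcp://` endpoints, which is how a client would actually reach a node; `inproc://` is used here only to keep the sketch self-contained.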
The common class UML diagram:
Implementation is planned for the next stage.