Distributed Crawler Service

Main requirements specification. Phase A finished 2014-01-26; phase B started 2015-01-01.


Characterization


The Distributed Crawler (DC) application is general-purpose, service-class software. Its main functional goal is to let users fetch data from data sources and build large collections of resources. Basic features include an implementation of web-crawling algorithms with a standard (HTTP-based) or custom resource fetcher. DC was developed as a scalable, distributed web crawling service for automated data crawling and data extraction, enabling information integration from various websites.

This web crawling and data scraping technology can be used to extract relevant information and build business intelligence. Customized solutions of almost any kind can be assembled through flexible configuration and parameter tuning. The crawling engine is also capable of collecting relevant information that is several links away from the start set, while effectively eliminating many irrelevant pages and scanning a website's structure efficiently. The frequency of requests to websites can be defined by the client as hourly, daily, weekly, monthly or according to a more complex scheme, and a re-crawling schedule can be defined as well.

The DC web data crawling and tag extraction service helps convert raw, unstructured data into structured and useful content. The entire process of crawling, scraping, data collection and sorting is automated. The clean, structured data received from the DC service can be used immediately by the decision-making solutions and algorithms of target projects.

The service is based on the Distributed Tasks Manager (DTM) and implements the applied crawling algorithms and architecture as distributed tasks that perform batch processing in isolated OS processes. It additionally provides an interface for extended post-crawling content processing such as scraping, NLP, statistical analysis and so on.

The main features:

  • HTTP link collection,
  • resource collection and distributed storage,
  • distributed data fetcher IPs,
  • planned and managed crawling HTTP request frequency,
  • custom data fetchers, request protocols and resource formats,
  • flexible scalability and hot changes of processing capacity,
  • different models of resource sharding,
  • parameter templates and inheritance,
  • customizable target data storage, including native file system, key-value DB, SQL DB and others,
  • unified data processing, including scraping algorithms for news article tag detection,
  • multi-host, multi-task parallel crawling and processing,
  • durability and fail-safe principles, including replication with redundancy of local hosts' DB schemas and data.

Conditions


A standalone boxed application that works as a client-side, top user-level layer on top of the DTM application server.

Usage Domain


Used by developers and system integrators in target projects as server-side software and a tool set. The most common purposes are:

  • Distributed asynchronous parallel website crawling.
  • Distributed synchronous real-time web page crawling and scraping.
  • Distributed raw web resource data storage management.
  • Distributed URL database storage management.
  • Distributed asynchronous parallel post-crawling processing, including tag scraping.

Technology


A single-process, multi-threaded application implemented in Python, integrated with the DTM application server via the HCE Python API bindings.

Object model


Threaded control objects, MOM-based networking, ZMQ TCP and inproc sockets, message-based communication, message queues, JSON serialization protocol.

Functionality


Processing of client requests, managed automated website crawling, post-crawling processing, real-time crawling and processing (see the detailed protocol specification for the real-time gateway API) and resource management.

Philosophy


Asynchronous multi-host, multi-threaded, mutex-less message handling; parallel computations; asynchronous distributed remote task processing; automated managed batch processing; distributed resource management.

Architecture


Inter-process interactions use ZMQ TCP socket connections of server and client type. In-process interactions use ZMQ inproc socket connections, also as servers and clients. Asynchronous in-process resource access and request processing are provided by the MOM-based transport built on ZMQ sockets. No system or POSIX access-control objects such as mutexes or semaphores are used. The only kind of processing item is a message. Several message container types such as queues, dictionaries and lists can be used to accumulate and organize processing sequences. The key-value DB engine is SQLite.
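
The following is a minimal sketch of this mutex-less, message-only pattern using pyzmq. The socket names, endpoint, REQ/REP pattern and message fields are illustrative assumptions, not the actual DC service code; all shared state travels inside messages, so no locks are needed.

    import json
    import threading
    import zmq

    context = zmq.Context.instance()   # inproc endpoints must share one context
    ready = threading.Event()

    def worker():
        # In-process server socket: receives messages from other threads.
        sock = context.socket(zmq.REP)
        sock.bind("inproc://dc-worker")   # hypothetical internal endpoint
        ready.set()
        request = json.loads(sock.recv().decode("utf-8"))
        # State is carried entirely in the message, so no mutexes are used.
        sock.send(json.dumps({"echo": request}).encode("utf-8"))

    t = threading.Thread(target=worker)
    t.start()
    ready.wait()   # ensure the inproc server has bound before connecting

    # Client side of the same in-process transport.
    client = context.socket(zmq.REQ)
    client.connect("inproc://dc-worker")
    client.send(json.dumps({"cmd": "ping"}).encode("utf-8"))
    print(client.recv().decode("utf-8"))
    t.join()

External connections follow the same idea, only with tcp:// endpoints instead of inproc://.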

The DC service listens on two TCP ports:

  1. Administration.
  2. Client.

Administration – handles requests from tool client applications. Requests of this type fetch statistical information from all threading objects and can also be used to tune some configuration settings of the DC service directly at run time.

Client – handles requests from the client-side API or a tool application. Requests of this type are the main functional queries for any applied operation and statistical data collection, including Site and URL management CRUD as well as real-time crawling. For a more detailed description, see the documentation for the client application.

All requests are represented as JSON-formatted messages. The structure of requests and the main task logic principles are separated from the concrete execution environment. This potentially makes it possible to use several execution environment engines, not only the HCE DRCE cluster.
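
As an illustration only, a client request could be sent to the client TCP port roughly as follows. The port number, command name and field names here are assumptions made for the example; the actual message structure is defined in the client application documentation and the protocol specification.

    import json
    import zmq

    context = zmq.Context.instance()
    sock = context.socket(zmq.REQ)
    sock.connect("tcp://localhost:5501")   # hypothetical client port

    request = {
        "id": 1,
        "command": "SITE_STATUS",          # hypothetical command name
        "data": {"siteId": "example-site"},
    }
    sock.send_string(json.dumps(request))  # requests are JSON messages
    response = json.loads(sock.recv_string())
    print(response)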

For a more detailed description of how to use the service and client modules, see the main documentation.

Object model


All active networking objects (blue primitives on the architecture UML diagram) are based on a threaded class object. Each threaded object uses a list of transport-level items of two types: client and server. Both are based on ZMQ sockets; the server type can be TCP or inproc. TCP is used for external connections, inproc for all internal inter-thread event interactions.
[Figure: DC service architecture]

Site


The structural unit that represents a crawling item from the user's side. It has properties that are applied to each URL fetched from a page of this site.

Resource


The structural unit that represents the result of an HTTP request downloaded from a site by URL. Typically it is an HTTP resource such as an HTML page, an image or content of any other MIME type.

URL


The structural unit that represents a resource URL for a site. It is used as the basis of the main resource identifier, which is an md5 checksum calculated from the URL.
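
A minimal sketch of deriving a resource identifier from a URL via an md5 checksum, as described above; any URL normalization rules applied by DC before hashing are not shown here and may differ.

    import hashlib

    def url_to_resource_id(url: str) -> str:
        # 32-character hex md5 digest used as the resource identifier
        return hashlib.md5(url.encode("utf-8")).hexdigest()

    print(url_to_resource_id("http://example.com/index.html"))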

HTTP Request


The object used to perform HTTP querying and resource downloading algorithms and operations on data nodes.

HTTP Response


The object used to represent and parse downloaded resources on data nodes.
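
A minimal sketch of the request/response pair described above, using only the Python standard library. The real crawler supports pluggable fetchers, protocols and formats, so the function name and the shape of the returned dictionary are illustrative assumptions.

    import urllib.request

    def fetch(url: str, timeout: float = 10.0):
        # Issue an HTTP request and wrap the downloaded resource for parsing.
        request = urllib.request.Request(url, headers={"User-Agent": "dc-crawler-sketch"})
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return {
                "status": response.status,
                "headers": dict(response.headers),
                "content_type": response.headers.get("Content-Type"),
                "body": response.read(),
            }

    page = fetch("http://example.com/")
    print(page["status"], page["content_type"], len(page["body"]))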

Client interface service


This class accepts client request events from the TCP connection and sends them to the proper target threading class. It works mostly as a proxy between the target functional threaded classes and the clients. It connects to three target threaded classes, as shown in the application architecture schema.

Batch Tasks Manager


This is the main class that implements the iterations of continuous crawling (and, optionally, processing). It interacts with the DTM service, schedules crawling tasks for execution and periodically checks the state of those tasks. To start a crawling task it collects some URLs from the SitesManager and creates a crawling task Batch. The crawling task batch is sent as the main data set and input data of the main crawler application, which runs on a target data node in the HCE DRCE cluster as an asynchronous task.
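
A rough sketch of the batch-building step described above. The Batch structure, the field names and the DTM submission call are hypothetical stand-ins for illustration, not the real DC/DTM API.

    import json
    import uuid

    def build_crawl_batch(urls):
        """Wrap a list of URLs into a batch object used as crawler-task input."""
        return {
            "batchId": str(uuid.uuid4()),
            "items": [{"url": u} for u in urls],
        }

    def submit_to_dtm(batch, submit_task):
        # submit_task is a placeholder for the DTM client call that schedules
        # the crawler application on a data node as an asynchronous task.
        return submit_task(name="crawler-task", input_data=json.dumps(batch))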

Batch Tasks Manager Process


This is the main class that implements the processing iteration. It interacts with the DTM service, schedules processing tasks for execution and periodically checks the state of those tasks. To start a processing task it uses simple scheduling: the processing task batch is requested as a URLFetch DB-task, the class collects resources that are ready to be processed, converts the list into a Batch object and starts the processor task, which includes the scraper task and the applied scraping algorithms. All task applications run on target data nodes in the HCE DRCE cluster as asynchronous tasks.

Sites manager


The main client interface for Site and URL object operations. It also performs periodic re-crawling and aging actions, using the DRCE shard cluster directly in multicast mode.
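
A simple sketch of the periodic re-crawling check mentioned above: select the URLs whose last crawl is older than the site's re-crawl interval. The field names and time units are assumptions made for illustration.

    import time

    def urls_due_for_recrawl(urls, recrawl_interval_sec, now=None):
        # Return URL records that should be scheduled for re-crawling.
        now = now if now is not None else time.time()
        return [u for u in urls if now - u["lastCrawled"] >= recrawl_interval_sec]

    urls = [
        {"url": "http://example.com/a", "lastCrawled": time.time() - 90000},
        {"url": "http://example.com/b", "lastCrawled": time.time() - 100},
    ]
    print(urls_due_for_recrawl(urls, recrawl_interval_sec=86400))  # daily schedule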

The Crawler application


The application that acts as the web crawler. It uses a crawling batch as its main job order, executes HTTP requests, stores resources in the local file system cache and updates the state of resources in the local SQL DB. It is located on and runs on a data node.
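
A sketch of the per-resource bookkeeping described above: store the raw content in a file-system cache keyed by the URL's md5 and record its state in a local SQLite DB. The paths, table name and columns are assumptions for illustration; the real schema is described in the DB documentation.

    import hashlib
    import os
    import sqlite3

    CACHE_DIR = "./dc_cache"           # hypothetical local cache location
    DB_PATH = "./dc_local_state.db"    # hypothetical local SQLite DB

    def store_resource(url: str, body: bytes, http_status: int) -> str:
        rid = hashlib.md5(url.encode("utf-8")).hexdigest()
        os.makedirs(CACHE_DIR, exist_ok=True)
        with open(os.path.join(CACHE_DIR, rid), "wb") as f:
            f.write(body)                      # raw resource in the FS cache
        with sqlite3.connect(DB_PATH) as db:   # resource state in local SQL DB
            db.execute("CREATE TABLE IF NOT EXISTS urls "
                       "(id TEXT PRIMARY KEY, url TEXT, status INTEGER, size INTEGER)")
            db.execute("INSERT OR REPLACE INTO urls VALUES (?, ?, ?, ?)",
                       (rid, url, http_status, len(body)))
        return rid

    print(store_resource("http://example.com/", b"<html></html>", 200))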

Processor application


This application is a container and manager for applied processing modules such as the scraper application. It is located on and runs on a data node. It is the main API for post-crawling data processing business logic and applied algorithms.

Scraper application


This application represents a set of scraping algorithms that can be used to process locally stored data, with the results of processing kept on the same node. It is located on and runs on a data node.
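
A minimal sketch of a scraping algorithm of the kind described above: extracting the title tag from locally stored HTML. The real scraper modules implement richer news article tag detection; this only illustrates processing of locally cached content.

    from html.parser import HTMLParser

    class TitleScraper(HTMLParser):
        def __init__(self):
            super().__init__()
            self._in_title = False
            self.title = ""

        def handle_starttag(self, tag, attrs):
            if tag == "title":
                self._in_title = True

        def handle_endtag(self, tag):
            if tag == "title":
                self._in_title = False

        def handle_data(self, data):
            if self._in_title:
                self.title += data

    scraper = TitleScraper()
    scraper.feed("<html><head><title>Example article</title></head><body>...</body></html>")
    print(scraper.title)  # -> 'Example article'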

DB application


The application that acts as the main interface between the DTM application and the distributed DB units. The DB units manage the local databases of all objects and items defined at the local data node level. This application also fetches locally stored resources. It is located on and runs on a data node. Detailed information about the DB schema can be found in the main DB documentation.