Distributed Crawler Service v1.4 alpha is available for developers

Changelog:

Pre-release 1.4 alpha (2015-02-09)

  • New feature: postponed deletion, cleanup and purging of resource data, covering content on disk storage and in the key-value DB, run periodically with limits on load level and item count. The deletion schedule and selection of candidates are fully configurable; each site gets its tables in a separate MySQL database; purging tasks are balanced to optimize the load level of a multi-host system, and so on…
  • New feature: fully separated management of crawling and processing tasks, including task queue processing, scheduling, load-level balancing, task competition configuration, re-crawling and re-processing on demand and on a schedule, and many more…
  • New feature: fully multi-threaded re-crawling management with resource balancing and site-state protection, including configurable cleanup, optimization and auto-tuning of the re-crawl period…
  • New feature: fully separated purging of deleted resources from the system, including load balancing and scheduling of purging tasks in a multi-host configuration…
  • New feature: MySQL-based locking for per-host DB operations, protecting database structures from overlapping multi-process operations.
  • Improved scraping algorithms and processing core, including support for fully customized real-time crawling and processing requests with fixed scraping templates and scraper selection.
  • Many fixes for crawling and scraping features.
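
The per-host MySQL-based locking mentioned above can be illustrated with MySQL's named-lock functions (GET_LOCK/RELEASE_LOCK). This is a minimal sketch, not the service's actual implementation: the helper name, lock-name format and timeout are assumptions, and the `cursor` is any DB-API cursor (e.g. PyMySQL/MySQLdb).

```python
from contextlib import contextmanager

@contextmanager
def mysql_host_lock(cursor, lock_name, timeout=10):
    """Serialize per-host DB operations with a MySQL named lock.

    GET_LOCK returns 1 on success and 0 on timeout; the lock is
    released explicitly so overlapping multi-process operations on
    the same host's tables are blocked rather than interleaved.
    Helper name and lock-name convention are illustrative only.
    """
    cursor.execute("SELECT GET_LOCK(%s, %s)", (lock_name, timeout))
    (acquired,) = cursor.fetchone()
    if acquired != 1:
        raise TimeoutError("could not acquire lock %r" % lock_name)
    try:
        yield
    finally:
        cursor.execute("SELECT RELEASE_LOCK(%s)", (lock_name,))
```

A purge worker would then wrap its per-host statements in `with mysql_host_lock(cur, "purge:example.com"): ...`, so only one process at a time touches that host's tables.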

The latest unstable bundle archive can be downloaded here.