Install

Stable version 1.6.0 is coming soon…

HCE project components can be installed from source code tarballs (which must be built on the target host and depend on a C++ build environment) or as packages for Debian and CentOS Linux (now suspended). The HCE package includes the Bundle archive with a complete set of tools and scenarios for a simple single-host cluster (create, run, manage and run the full functional test), as well as API bindings for the PHP and Python languages. Two main applied solutions based on the DRCE cluster are also provided: the Distributed Crawler (DC) service and the Distributed Tasks Manager (DTM) service. Both are ready to be integrated into the target project environment; note that DC requires DTM as a dependency.

Several pre-configured VM images for VMware and VirtualBox are also available to make getting started faster. The user name is “root” and the password is the same as the user name. The target user for the DTS archive is “hce”, and its password is likewise the same as the user name. The zipped VM files are here: http://packages.hierarchical-cluster-engine.com/vm/

The source code tarballs and archives are also available, including the DTS archive as a separate file, hce-node-bundle.zip.

The Android client informational application, hce-dc-info, can be downloaded and used to check the state of crawling per site or per installation.

Install the Debian 7.x (wheezy) amd64 Linux package


This method requires root privileges or sudo access for the user.

1) Add the source URL to the Debian sources file by editing /etc/apt/sources.list. For example, add this line:

deb http://packages.hierarchical-cluster-engine.com/debian/7/stable/ wheezy main

or use:

sudo bash -c 'echo "deb http://packages.hierarchical-cluster-engine.com/debian/7/stable/ wheezy main" > /etc/apt/sources.list.d/hce.list'

* Note that the developer’s stable packages can be installed from here:

deb http://packages.hierarchical-cluster-engine.com/debian/7/dev/ wheezy main

* and the developer’s unstable packages can be installed from here:

deb http://packages.hierarchical-cluster-engine.com/debian/7/unstable/ wheezy main

2) Add apt-key:
  sudo apt-key adv --recv-keys --keyserver keyserver.ubuntu.com 04742D4B
 or
  sudo apt-key adv --recv-keys --keyserver hkp://keys.gnupg.net 04742D4B

3) Update the package sources in the system by running:

sudo apt-get update

4) Use the regular package manager to install the required component. For example, to install the “hce-node” application:

sudo apt-get install hce-node

When finished, the hce-node binary executable can be started and its build version checked:

hce-node -v

Install the Debian 8.x (jessie) amd64 Linux package


This method requires root privileges or sudo access for the user.

1) Add the source URL to the Debian sources file by editing /etc/apt/sources.list. For example, add this line:

deb http://packages.hierarchical-cluster-engine.com/debian/8/stable/ jessie main

or use:

sudo bash -c 'echo "deb http://packages.hierarchical-cluster-engine.com/debian/8/stable/ jessie main" > /etc/apt/sources.list.d/hce.list'

* Note that the developer’s stable packages can be installed from here:

deb http://packages.hierarchical-cluster-engine.com/debian/8/dev/ jessie main

* and the developer’s unstable packages can be installed from here:

deb http://packages.hierarchical-cluster-engine.com/debian/8/unstable/ jessie main

2) Add apt-key:
  sudo apt-key adv --recv-keys --keyserver keyserver.ubuntu.com 04742D4B
 or
  sudo apt-key adv --recv-keys --keyserver hkp://keys.gnupg.net 04742D4B

3) Update the package sources in the system by running:

sudo apt-get update

4) Use the regular package manager to install the required component. For example, to install the “hce-node” application:

sudo apt-get install hce-node

When finished, the hce-node binary executable can be started and its build version checked:

hce-node -v
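
The repository lines in the two Debian sections differ only in the release number, the codename and the channel (stable, dev or unstable). As a minimal sketch, the helper below prints the matching line for a given release and channel. The function name hce_apt_line is hypothetical and not part of the HCE distribution; only the URLs and codenames shown above are used.

```shell
# Print the apt source line for a Debian major release (7 or 8)
# and a channel (stable, dev or unstable; default: stable).
# hce_apt_line is a hypothetical helper, not shipped with HCE.
hce_apt_line() {
    release="$1"
    channel="${2:-stable}"
    case "$release" in
        7) codename="wheezy" ;;
        8) codename="jessie" ;;
        *) echo "unsupported Debian release: $release" >&2; return 1 ;;
    esac
    case "$channel" in
        stable|dev|unstable) ;;
        *) echo "unknown channel: $channel" >&2; return 1 ;;
    esac
    echo "deb http://packages.hierarchical-cluster-engine.com/debian/$release/$channel/ $codename main"
}
```

One possible use is to pipe the result into the sources file, for example: hce_apt_line 8 stable | sudo tee /etc/apt/sources.list.d/hce.list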

Install the Bundle Environment for PHP language


1) To install the Bundle to the home directory, run the deployment script included in the package distribution with regular user privileges:

hce-node-bundle-install.sh

The hce-node-bundle directory will be created in the home directory of the current user. After that, the dependency libraries need to be installed. This step requires root privileges or sudo access for the user.

* Note that in some cases unzip needs to be installed:

sudo apt-get install unzip

2) Install the zmq library:

sudo apt-get install libzmq3-dev

3) Install php (if php is already installed, this step can be skipped):

sudo apt-get install php5
sudo apt-get install php5-dev
php -v
sudo apt-get install php-pear
sudo apt-get install pkg-config
sudo apt-get install libpgm-dev
sudo pecl install --ignore-errors zmq-beta

After that, it may be necessary to add this line to the php.ini used for the command line:

extension=zmq.so

or in some cases, for example on Debian 8, execute:

sudo bash -c 'echo -e "; configuration for php ZMQ module \n; priority=20 \nextension=zmq.so" > /etc/php5/cli/conf.d/20-zmq.ini'

4) Optionally (only if search functionality needs to be available) install the Sphinx search engine:

sudo apt-get install sphinxsearch

5) Optionally (for common hce-node cluster tests only) install bc for DRCE tests:

sudo apt-get install bc

6) Optionally (for common hce-node cluster tests only) install Java 7 for DRCE tests:

sudo apt-get install openjdk-7-jdk

Please, read the ~/hce-node-bundle/api/php/doc/readme.txt file to continue.
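
Before continuing, it can be useful to confirm that the zmq extension installed in step 3 is actually loaded by the command line PHP. The sketch below filters a "php -m" style module listing read from standard input. The function name has_php_module is hypothetical, and the check assumes the extension is listed under the name zmq.

```shell
# Succeed if the named module appears as a whole line in a
# "php -m" style listing read from standard input.
# The match is case-insensitive and treats the name as a fixed string.
# has_php_module is a hypothetical helper, not shipped with HCE.
has_php_module() {
    grep -qixF -- "$1"
}
```

A possible invocation is: php -m | has_php_module zmq && echo "zmq extension is loaded"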

Install the Bundle Environment for Python language


This method requires root privileges or sudo access for the user.

1) Debian package dependencies:

sudo apt-get install libpgm-dev mysql-server-5.5 python-pip python-dev python-flask python-flaskext.wtf libffi-dev libxml2-dev libxslt1-dev mysql-client libmysqlclient-dev python-mysqldb libicu-dev libgmp3-dev libtidy-dev libjpeg-dev

*For Debian 8 and higher versions, some packages such as mysql may need to be replaced with newer ones, for example:

mysql-server-5.6

2) Python dependencies:

sudo easy_install --upgrade pip
sudo pip install cement sqlalchemy Flask-SQLAlchemy scrapy gmpy pyzmq lepl urlnorm pyicu mysql-python newspaper goose-extractor pytidylib uritools python-magic feedparser pillow beautifulsoup4 snowballstemmer soundex pycountry langdetect iso639 psutil email validators dateutils

* the lepl package is deprecated.

Preferably, the scrapy library should be version 0.24.x. To install it, use:

sudo pip install -U scrapy==0.24.4

Also, goose-extractor needs to be version 1.0.x. To install it, use:

sudo pip install -U goose-extractor==1.0.22

*For some Debian-based distributions such as Ubuntu 14.x, the psutil module may need to be installed at version 4.1.0 only, for example:

sudo pip install psutil==4.1.0

Install the requests library at version 2.4.3 or higher, up to 2.6.x or 2.7, for example:

sudo pip install requests[security]==2.7

*The pillow package may need to be reinstalled to get JPG support:

sudo pip install -I pillow

3) Configure the locale for the UTF-8 charset:

sudo dpkg-reconfigure locales

and select en_US.UTF-8. Then configure .bashrc by adding these lines:

export LC_ALL=en_US.UTF-8
export LANG=en_US.UTF-8
export LANGUAGE=en_US.UTF-8

Apply the configuration for the current user (run from the target user session):

source ~/.bashrc

4) Create MySQL user and DB schema for Distributed Crawler Application:

cd ~/hce-node-bundle/api/python/manage/
sudo ./mysql_create_user.sh
./mysql_create_db.sh
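
Step 2 pins some libraries to a version series (scrapy 0.24.x, goose-extractor 1.0.x). As a minimal sketch, the helper below tests whether an installed version falls inside an inclusive range. The function name version_in_range is hypothetical, the upper bounds such as 0.24.99 are illustrative assumptions rather than values from HCE, and the comparison relies on the version sort of GNU "sort -V".

```shell
# Succeed if version $1 satisfies min ($2) <= version <= max ($3).
# Uses "sort -V" (GNU coreutils version sort) for the comparison.
# version_in_range is a hypothetical helper, not shipped with HCE.
version_in_range() {
    version="$1"
    min="$2"
    max="$3"
    # The smaller of (min, version) must be min ...
    lowest=$(printf '%s\n%s\n' "$min" "$version" | sort -V | head -n 1)
    # ... and the larger of (version, max) must be max.
    highest=$(printf '%s\n%s\n' "$version" "$max" | sort -V | tail -n 1)
    [ "$lowest" = "$min" ] && [ "$highest" = "$max" ]
}
```

For example, the installed scrapy version could be checked with: version_in_range "$(pip show scrapy | awk '/^Version:/ {print $2}')" 0.24.0 0.24.99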

 

*Note: if the Python application has problems with the zmq library connection after start, or related exceptions are logged, a specific version of the zmq library may need to be installed as described below:

wget https://archive.org/download/zeromq_4.0.4/zeromq-4.0.4.tar.gz
tar -xvf zeromq-4.0.4.tar.gz
cd zeromq-4.0.4
./configure && make && sudo make install
sudo ldconfig
sudo pip install pyzmq --install-option="--zmq=4.0.3"
cd ..

Please read section “1.4. Hierarchical Cluster Engine (HCE) usage for the DC service” of the ~/hce-node-bundle/api/php/doc/readme.txt file to configure the proper hce-node cluster type. A brief manual for the dual-host configuration of the DC service is also available. Then read the ~/hce-node-bundle/api/python/doc/ftests.txt file to continue.
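
Before moving on to the functional tests, the locale settings from step 3 can be verified in the target user session. The sketch below reports any of the three exported variables that do not hold the expected value. The function name check_utf8_locale is hypothetical; the default value en_US.UTF-8 is the one used in this guide.

```shell
# Report which of LC_ALL, LANG and LANGUAGE are not set to the
# expected locale (default: en_US.UTF-8). Returns non-zero if any differ.
# check_utf8_locale is a hypothetical helper, not shipped with HCE.
check_utf8_locale() {
    expected="${1:-en_US.UTF-8}"
    status=0
    for name in LC_ALL LANG LANGUAGE; do
        # Read the variable whose name is stored in $name.
        eval "value=\${$name:-}"
        if [ "$value" != "$expected" ]; then
            echo "$name is '$value', expected '$expected'" >&2
            status=1
        fi
    done
    return $status
}
```

Running check_utf8_locale after "source ~/.bashrc" prints nothing when all three variables match.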

Install Distributed Crawler client Environment for Python language


This method requires root privileges or sudo access for the user.

1) Debian package dependencies:

sudo apt-get install python-pip python-dev libffi-dev libxml2-dev libxslt1-dev

2) Python package dependencies:

sudo pip install cement scrapy pyzmq w3lib

3) If the DTS archive was downloaded directly, run the following after unzipping:

chmod 777 ~/hce-node-bundle/usr/bin/hce-node-permissions.sh
~/hce-node-bundle/usr/bin/hce-node-permissions.sh

Dynamic pages fetcher support (now based on Python Selenium and the web driver for the Chrome browser)

Installation of the Chrome browser, driver and dependencies:

Download the chrome driver binary from:

https://sites.google.com/a/chromium.org/chromedriver/downloads

and place it at the following path:

~/hce-node-bundle/api/python/bin/chromedriver64

Set executable permissions:

chmod 777 ~/hce-node-bundle/api/python/bin/chromedriver64

Install Google Chrome by any method, for example as described here:

http://www.tecmint.com/install-google-chrome-in-debian-ubuntu-linux-mint/

Install the dependent library:

sudo apt-get install libexif-dev

Install the xvfb package:

sudo apt-get install xvfb

and possibly:

sudo apt-get install libxss1

Run Xvfb with the resolution 1024x768x16:

Xvfb :1 -screen 0 1024x768x16 &> xvfb.log &

or this way for an OpenVZ container:

Xvfb +extension RANDR :1 -screen 0 1024x768x16 &> xvfb.log &

Install the Python Selenium package:

pip install -U selenium

Configure the locale for the UTF-8 charset:

sudo dpkg-reconfigure locales

and select en_US.UTF-8. Then configure .bashrc by adding these lines:

export LC_ALL=en_US.UTF-8
export LANG=en_US.UTF-8
export LANGUAGE=en_US.UTF-8
DISPLAY=:1.0
export DISPLAY

Apply the configuration for the current user (run from the target user session):

source ~/.bashrc

*If the hce-node clusters were started before (start_nm.sh and start_r.sh were executed), the clusters need to be restarted.
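
The dynamic fetcher setup above depends on two things that are easy to get wrong: the driver binary must be executable, and DISPLAY must point at the local Xvfb display. The sketch below checks both. The function name check_fetcher_env is hypothetical; the default driver path is the one given in this section.

```shell
# Check that the chrome driver is executable and that DISPLAY
# points at a local X display such as :1 or :1.0.
# check_fetcher_env is a hypothetical helper, not shipped with HCE.
check_fetcher_env() {
    driver="${1:-$HOME/hce-node-bundle/api/python/bin/chromedriver64}"
    status=0
    if [ ! -x "$driver" ]; then
        echo "driver is missing or not executable: $driver" >&2
        status=1
    fi
    case "${DISPLAY:-}" in
        :[0-9]*) ;;  # local display, for example :1 or :1.0
        *) echo "DISPLAY is not a local display: '${DISPLAY:-}'" >&2
           status=1 ;;
    esac
    return $status
}
```

Running check_fetcher_env with no arguments in the target user session prints nothing when both conditions hold.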

Digest dependencies

Installation of wkhtmltox and its dependencies:

Download the patched wkhtmltox package from http://wkhtmltopdf.org/downloads.html or another source, then install the dependencies:

sudo apt-get install libxfont1 xfonts-encodings xfonts-utils xfonts-base xfonts-75dpi

and install the package:

sudo dpkg -i wkhtmltox-0.12.2.1_linux-wheezy-amd64.deb

Some font sets, such as Chinese and Japanese, may also need to be installed. On Debian 7:

sudo apt-get install ttf-arphic-ukai ttf-arphic-uming ttf-arphic-gbsn00lp ttf-arphic-bkai00mp ttf-arphic-bsmi00lp

fc-cache -f -v

or on Debian 8:

sudo apt-get install ttf-wqy-microhei fonts-arphic-ukai fonts-arphic-uming fonts-arphic-gbsn00lp fonts-arphic-bkai00mp fonts-arphic-bsmi00lp

fc-cache -f -v

If stop words need to be filtered with NLTK, its data packages need to be installed. To install all NLTK corpora and models for the current user:

python -m nltk.downloader all

Alternatively, at system level:

sudo python -m nltk.downloader -d /usr/local/share/nltk_data all

Or just the stop words corpus for the current user:

python -m nltk.downloader -f stopwords

To parse Japanese words more exactly, the tinysegmenter word tokenizer can optionally be installed:

sudo pip install tinysegmenter

Install romkan:

pip install romkan

Install pykakasi:

sudo pip install nose
sudo pip --no-cache-dir install pykakasi
wget https://pypi.python.org/packages/08/10/e8c7b6b7774b0941dcf583019dc032ecc63d5154bbbf53b6c814fa085f80/pykakasi-0.23-py2.7.egg

Copy the *.pickle and *.db files to the regular Python module location, for example:
/usr/local/lib/python2.7/dist-packages/pykakasi/

 

Install PHP dependencies for web administration console


This method requires root privileges or sudo access for the user.

sudo apt-get install php5-curl php5-gd php5-mcrypt dialog
sudo php5enmod mcrypt

Install as source code tarball archive


1) To install this way, download the latest tarball archive from here: http://packages.hierarchical-cluster-engine.com/src/

2) Extract the archive, for example, for hce-node:

tar -xzf hce-node-1.2.tar.gz

3) Run configure to create the makefile:

./configure

If missing dependencies are reported, install them.

4) Run make to build the application:

make

5) Run make install to install the application into the system:

sudo make install

6) Run ldconfig to update the system shared library registration:

sudo ldconfig
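
Steps 2 to 6 can be chained so that the build stops at the first failure. The sketch below is one way to do this. Both function names are hypothetical, and it assumes the tarball unpacks into a directory named after the archive (for example hce-node-1.2), which should be verified for the actual archive.

```shell
# Derive the directory name from a tarball name,
# e.g. hce-node-1.2.tar.gz -> hce-node-1.2.
# Assumption: the archive unpacks into a directory named after itself.
tarball_dir() {
    basename "$1" .tar.gz
}

# Extract, configure, build and install, stopping at the first failure.
# The build runs in a subshell so the caller's directory is unchanged.
build_from_tarball() {
    tarball="$1"
    dir=$(tarball_dir "$tarball")
    tar -xzf "$tarball" &&
    ( cd "$dir" && ./configure && make && sudo make install ) &&
    sudo ldconfig
}
```

A possible invocation, run from the directory holding the download, is: build_from_tarball hce-node-1.2.tar.gz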