Install on Debian 9-11 and Ubuntu 17 Linux

Debian 9.x-11.x; Ubuntu 17.x, 18.x amd64 Linux

For Debian 9+ a package is not ready yet, so a native installation with the package manager requires resolving the dependency list manually. To avoid this trouble, the binary distribution can be used. The binary build and a fresh hce-node bundle can be downloaded from here.
Extract the bundle directory. The binary hce-node application distribution archives are located in the “hce-node-bundle/bin/” directory for Debian 9.x and Ubuntu 14.x. On Ubuntu 17.x the “Debian 9.x” archive can be used.
Install the binary application “hce-node” into a proper binary directory, for example “/usr/bin”, and all libraries and symlinks into a library directory, for example “/usr/local/lib”.

This method requires root privileges or sudo access for the user.

1) Add the source URL to the Debian sources file by editing /etc/apt/sources.list. For example, add this line:

deb http://packages.hierarchical-cluster-engine.com/debian/8/stable/ jessie main

or use:

sudo bash -c 'echo "deb http://packages.hierarchical-cluster-engine.com/debian/8/stable/ jessie main" > /etc/apt/sources.list.d/hce.list'

* Note that the developer’s stable packages can be installed from here:

deb http://packages.hierarchical-cluster-engine.com/debian/8/dev/ jessie main

* and the developer’s unstable packages can be installed from here:

deb http://packages.hierarchical-cluster-engine.com/debian/8/unstable/ jessie main
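
The repository lines above differ only in the channel segment of the URL. As a minimal sketch, the line for a given channel could be generated like this. The function name hce_repo_line is an illustrative assumption and is not part of the HCE distribution:

```shell
# Hypothetical helper: print the apt source line for an HCE channel.
# Channel names (stable, dev, unstable) are taken from the URLs above.
hce_repo_line() {
    channel="${1:-stable}"
    case "$channel" in
        stable|dev|unstable) ;;
        *) echo "unknown channel: $channel" >&2; return 1 ;;
    esac
    printf 'deb http://packages.hierarchical-cluster-engine.com/debian/8/%s/ jessie main\n' "$channel"
}
```

For example, "hce_repo_line dev | sudo tee /etc/apt/sources.list.d/hce.list" would write the developer's stable channel. Piping into tee is used here so that the write to the file itself runs under sudo.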

2) Add apt-key:
  sudo apt-key adv --recv-keys --keyserver keyserver.ubuntu.com 04742D4B
 or
  sudo apt-key adv --recv-keys --keyserver hkp://keys.gnupg.net 04742D4B

3) Update the package sources in the system:

sudo apt-get update

4) Using the regular package manager, install the required component. For example, to install the “hce-node” application:

sudo apt-get install hce-node

When finished, the hce-node binary executable can be started and its build version checked:

hce-node -v
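
If the binary was copied by hand rather than installed from a package, it may not be on the PATH. A small sketch that guards the version check is shown below. The helper name have_cmd is an assumption made for illustration:

```shell
# Sketch: report whether a command is available on PATH.
# "command -v" is the POSIX way to look a command up.
have_cmd() {
    command -v "$1" >/dev/null 2>&1
}

# Only query the version when the binary is actually installed.
if have_cmd hce-node; then
    hce-node -v
else
    echo "hce-node not found on PATH; check the installation directory" >&2
fi
```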

Install the Bundle Environment for PHP language

1) To install the Bundle to the home directory, run the deployment script included in the package distribution with regular user privileges:

hce-node-bundle-install.sh

The hce-node-bundle directory will be created in the home directory of the current user. After that, the dependency libraries need to be installed. This step requires root privileges or sudo access for the user.

* Note: in some cases unzip needs to be installed:

sudo apt-get install unzip

2) Install the zmq library:

sudo apt-get install libzmq3-dev

3) Install php (version 7 or above is assumed to be the system default). If php is already installed, this step can be skipped:

sudo apt-get install php
sudo apt-get install php-dev
php -v
sudo apt-get install php-pear
sudo apt-get install pkg-config
sudo apt-get install libpgm-dev
sudo pecl install --ignore-errors zmq-beta

After that, it may be necessary to add this line to the command-line php.ini:

extension=zmq.so

or in some cases, for example on "Debian 8", execute:
sudo bash -c 'echo -e "; configuration for php ZMQ module \n; priority=20 \nextension=zmq.so" > /etc/php/7.0/cli/conf.d/20-zmq.ini'
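
The path in the command above contains the PHP version (7.0), which differs between releases. A minimal sketch that builds the path from a version string and prints the same file content with printf instead of "echo -e" follows. Both function names are hypothetical, and the directory layout is assumed from the example path:

```shell
# Hypothetical helpers; the directory layout /etc/php/<version>/cli/conf.d
# is assumed from the example path above.
zmq_ini_path() {
    printf '/etc/php/%s/cli/conf.d/20-zmq.ini\n' "$1"
}

# Print the three lines of the ini file, one argument per line.
zmq_ini_body() {
    printf '%s\n' '; configuration for php ZMQ module' '; priority=20' 'extension=zmq.so'
}
```

A possible use is: zmq_ini_body | sudo tee "$(zmq_ini_path 7.0)". Afterwards "php -m" lists the loaded modules, so zmq should appear there if the extension was picked up.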

4) Optionally (only if search functionality needs to be available), install the Sphinx search engine:

sudo apt-get install sphinxsearch

5) Optionally (for common hce-node cluster tests only), install bc for DRCE tests:

sudo apt-get install bc

6) Optionally (for common hce-node cluster tests only), install Java 7 for DRCE tests:

sudo apt-get install openjdk-7-jdk

Please, read the ~/hce-node-bundle/api/php/doc/readme.txt file to continue.

Install Bundle Environment for Python language

This method requires root privileges or sudo access for the user.

1) Debian package dependencies:

sudo apt-get install libpgm-dev mysql-server python-pip python-dev python-flask python-flaskext.wtf libffi-dev libxml2-dev libxslt1-dev mysql-client default-libmysqlclient-dev python-mysqldb libicu-dev libgmp3-dev libtidy-dev libjpeg-dev

2) Python dependencies:

sudo easy_install --upgrade pip

sudo pip install sqlalchemy Flask-SQLAlchemy scrapy gmpy pyzmq lepl urlnorm pyicu mysql-python goose-extractor pytidylib uritools python-magic pillow beautifulsoup4 snowballstemmer soundex pycountry langdetect iso639 psutil email validators dateutils jsonpath_ng

sudo pip install regex==2020.10.15
sudo pip install newspaper
sudo pip install -U feedparser==5.2.1
sudo pip install netifaces
sudo pip install cement==2.10.12

* The newspaper package has a broken dependency on nltk. If this is still not fixed, try it this way:

wget https://pypi.python.org/packages/source/d/distribute/distribute-0.6.21.tar.gz

tar xzf distribute-0.6.21.tar.gz
cd distribute-0.6.21

In the file "distribute_setup.py", change the default URL from http to https. Replace the line:
DEFAULT_URL = "http://pypi.python.org/packages/source/d/distribute/"

to:
DEFAULT_URL = "https://pypi.python.org/packages/source/d/distribute/"

run:

python distribute_setup.py

then run:

sudo pip install newspaper

* the lepl package is deprecated.

Preferably the scrapy library should be version 0.24.x. To install it, use:

sudo pip install -U scrapy==0.24.4

Also, goose-extractor needs to be version 1.0.x. To install it, use:

sudo pip install -U goose-extractor==1.0.22
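
The fixed versions named so far can be collected in one pip requirements list, which makes the set easier to reproduce on another host. This is only a sketch: the function name and the file name used below are assumptions, and only the unconditional pins from this section are included:

```shell
# Sketch: print the version pins mentioned in this section as a pip
# requirements list. Conditional pins are deliberately left out.
hce_python_pins() {
    cat <<'EOF'
regex==2020.10.15
feedparser==5.2.1
cement==2.10.12
scrapy==0.24.4
goose-extractor==1.0.22
EOF
}
```

For example: hce_python_pins > hce-pins.txt, then sudo pip install -r hce-pins.txt.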

* For some Debian-based distributions like Ubuntu 14.x, the psutil module may need to be installed specifically as version 4.1.0, for example:

sudo pip install psutil==4.1.0

Install the requests library, version 2.4.3 or higher but no newer than 2.6.x or 2.7, for example:

sudo pip install requests[security]==2.7
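
To verify that an installed version falls inside a required range, a version-aware comparison is needed, because a plain string comparison sorts "2.10" before "2.4". The sketch below relies on "sort -V" from GNU coreutils, which is an assumption about the target system. The function names are illustrative, and the bounds are passed as parameters:

```shell
# version_ge A B: succeeds when version A >= version B.
# Relies on "sort -V" (version sort), available in GNU coreutils.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# version_in_range V LOW HIGH: succeeds when LOW <= V <= HIGH.
version_in_range() {
    version_ge "$1" "$2" && version_ge "$3" "$1"
}
```

A possible check is: version_in_range "$(python -c 'import requests; print(requests.__version__)')" 2.4.3 2.7 && echo "requests version OK".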

* The pillow package may need to be reinstalled to get JPG support:

sudo pip install -I pillow

* The cement package may need to be installed with apt-get:

sudo apt-get install python-cement

3) Configure the locale for the UTF-8 charset:

sudo dpkg-reconfigure locales

and select en_US.UTF-8.
Configure the .bashrc by adding these lines:
export LC_ALL=en_US.UTF-8
export LANG=en_US.UTF-8
export LANGUAGE=en_US.UTF-8
Apply the configuration for the current user (run from the target user's session):
source ~/.bashrc
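
When this step is repeated, for example on reinstallation, the same export lines can pile up in .bashrc. A minimal sketch of an idempotent append is shown here. The function name append_once is an assumption made for illustration:

```shell
# append_once FILE LINE: add LINE to FILE unless an identical line exists.
append_once() {
    file="$1"
    line="$2"
    touch "$file"
    # -x matches the whole line, -F treats the text as a fixed string
    grep -qxF -e "$line" "$file" || printf '%s\n' "$line" >> "$file"
}
```

For example: for v in LC_ALL LANG LANGUAGE; do append_once ~/.bashrc "export $v=en_US.UTF-8"; done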

4) Create MySQL user and DB schema for Distributed Crawler Application:

cd ~/hce-node-bundle/api/python/manage/
sudo ./mysql_create_user.sh
./mysql_create_db.sh

* Note: if the Python application has problems with the zmq library connection after start, or related exceptions are logged, a specific version of the zmq library may need to be installed as described below:

wget https://archive.org/download/zeromq_4.0.4/zeromq-4.0.4.tar.gz
tar -xvf zeromq-4.0.4.tar.gz
cd zeromq-4.0.4
sudo ./configure && make && sudo make install
sudo ldconfig
sudo pip install pyzmq --install-option="--zmq=4.0.3"
cd ..

Please read the section “1.4. Hierarchical Cluster Engine (HCE) usage for the DC service” of the ~/hce-node-bundle/api/php/doc/readme.txt file to configure the proper hce-node cluster type. A brief manual on the dual-host configuration for the DC service can be read here. Please read the ~/hce-node-bundle/api/python/doc/ftests.txt file to continue.

Install Distributed Crawler client Environment for Python language

This method requires root privileges or sudo access for the user.

1) Debian package dependencies:

sudo apt-get install python-pip python-dev libffi-dev libxml2-dev libxslt1-dev

2) Python package dependencies:

sudo pip install scrapy pyzmq w3lib
sudo pip install cement==2.10.12

3) If the DTS archive was downloaded directly, run this after unzipping it:

chmod 777 ~/hce-node-bundle/usr/bin/hce-node-permissions.sh && ~/hce-node-bundle/usr/bin/hce-node-permissions.sh

To enable HTTP/2 support in the crawler, install hyper:

sudo pip install hyper

Dynamic pages fetcher support (currently based on the Python Selenium package and the web driver for the Chrome browser)

Installation of the Chrome browser, driver and dependencies:
Download the chrome driver binary:
https://sites.google.com/a/chromium.org/chromedriver/downloads
and put it into the directory:
~/hce-node-bundle/api/python/bin/chromedriver64
Set executable permissions:
chmod 777 ~/hce-node-bundle/api/python/bin/chromedriver64
Install Google Chrome in any way, or as described here:
http://www.tecmint.com/install-google-chrome-in-debian-ubuntu-linux-mint/
Or use:

sudo wget -q -O - https://dl-ssl.google.com/linux/linux_signing_key.pub | sudo apt-key add -
sudo sh -c 'echo "deb http://dl.google.com/linux/chrome/deb/ stable main" >> /etc/apt/sources.list.d/google.list'
sudo apt-get update
sudo apt-get install google-chrome-stable

Install the dependent library:
sudo apt-get install libexif-dev
Install the xvfb package:
sudo apt-get install xvfb
and possibly:
sudo apt-get install libxss1

Run Xvfb with the resolution 1024x768x16:
Xvfb :1 -screen 0 1024x768x16 &> xvfb.log &
or this way for an OpenVZ container:
Xvfb +extension RANDR :1 -screen 0 1024x768x16 &> xvfb.log &

Install the python Selenium package:
sudo pip install -U selenium==3.4.3

Configure the locale for the UTF-8 charset:
dpkg-reconfigure locales
and select en_US.UTF-8.

Configure the .bashrc by adding these lines:
export LC_ALL=en_US.UTF-8
export LANG=en_US.UTF-8
export LANGUAGE=en_US.UTF-8
DISPLAY=:1.0
export DISPLAY

Apply the configuration for the current user (run from the target user's session):
source ~/.bashrc

* If the hce-node clusters were started before (start_nm.sh and start_r.sh executed), the clusters need to be restarted.

* In some cases an additional header needs to be added to the “ini/crawler-task_headers.txt” for Google Chrome sandboxing:
--no-sandbox:

Optional curl fetcher support

sudo apt install python-pycurl

Digest dependencies

Installation of wkhtmltox and dependencies:
Download the patched wkhtmltox package from here http://wkhtmltopdf.org/downloads.html

or another source, then install the dependencies:

sudo apt-get install libxfont1 xfonts-encodings xfonts-utils xfonts-base xfonts-75dpi

Install the package:

sudo dpkg -i wkhtmltox-0.12.2.1_linux-wheezy-amd64.deb

sudo apt-get install ttf-wqy-microhei fonts-arphic-ukai fonts-arphic-uming fonts-arphic-gbsn00lp fonts-arphic-bkai00mp fonts-arphic-bsmi00lp

sudo fc-cache -f -v

If stop words need to be filtered with NLTK, the NLTK data packages need to be installed. To install all NLTK corpora and models for the current user:

python -m nltk.downloader all

Alternatively, at the system level:

sudo python -m nltk.downloader -d /usr/local/share/nltk_data all

Or just the stop words corpus for the current user:

python -m nltk.downloader -f stopwords

Also, if there are errors with the current version of NLTK, a fixed version can be used:

pip uninstall nltk
pip install -U distribute
pip install nltk==3.4.5

In the same way, to get extended lemma support for the English language (snowball is in the base), install “wordnet” from NLTK:

python -m nltk.downloader wordnet

To get more accurate stemming for the German language, install NLTK version 3.4.5 at least:

pip install --user -U nltk

To check the NLTK version:

python -c "import nltk; print(nltk.__version__)"

To check the German stemmer:

python -c "from nltk.stem.cistem import Cistem"

Modern NLTK probably requires numpy:

pip install --user -U numpy

Also, to get additional lemma support from Mystem for the Russian language (an internal (c) solution is in the base as the main one), install the pymystem3 package:

sudo pip install pymystem3

To parse Japanese words more precisely, the tinysegmenter, romkan, pykakasi and MeCab word tokenizers can optionally be installed:

sudo pip install tinysegmenter

Install romkan:

sudo pip install romkan

Install pykakasi:

sudo pip install nose

sudo pip install -U pykakasi==0.23
wget https://pypi.python.org/packages/08/10/e8c7b6b7774b0941dcf583019dc032ecc63d5154bbbf53b6c814fa085f80/pykakasi-0.23-py2.7.egg

Extract the *.pickle and *.db files and copy them to the regular python module location, for example:
/usr/local/lib/python2.7/dist-packages/pykakasi/
This can be done as follows (the patterns are quoted so that the shell does not expand them):
sudo unzip -d /usr/local/lib/python2.7/dist-packages pykakasi-0.23-py2.7.egg '*.db' '*.pickle'

Install MeCab:

Download the mecab .tgz source from http://taku910.github.io/mecab/

Extract mecab-0.996.tar.gz, build and install:

tar -xzf mecab-0.996.tar.gz
cd mecab-0.996
./configure --with-charset=utf8 --enable-utf8-only
make
sudo make install
sudo ldconfig

Download and install the neologd dictionary manually:

sudo apt install mecab libmecab-dev mecab-ipadic-utf8 git make curl xz-utils file
git clone --depth 1 https://github.com/neologd/mecab-ipadic-neologd.git
cd mecab-ipadic-neologd/
sudo bin/install-mecab-ipadic-neologd -n -u

or follow the manual:
https://github.com/neologd/mecab-ipadic-neologd

Download and install Python support for MeCab 0.996 from here:
https://pypi.python.org/pypi/mecab-python/0.996

tar -xzf mecab-python-0.996.tar.gz
cd mecab-python-0.996
python setup.py build
sudo python setup.py install

Or try for Python3:

sudo pip install mecab-python

Check that MeCab with the neologd dictionary is working properly:

echo "志村けん" | mecab --dicdir /usr/local/lib/mecab/dic/mecab-ipadic-neologd

If everything works properly, it produces output like this:

志村けん 名詞,固有名詞,人名,一般,*,*,志村けん,シムラケン,シムラケン
EOS
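
The sample output shows the whole name kept as a single token, which is what the neologd dictionary is expected to provide. As a rough sketch, the check can be automated by comparing the surface form on the first output line with the input word. The function name is hypothetical, and fields are assumed to be separated by whitespace or a tab:

```shell
# Sketch: read MeCab output on stdin and succeed only when the first
# output line starts with WORD as one whole token.
mecab_single_token() {
    awk -v w="$1" 'NR == 1 { ok = ($1 == w) } END { exit(ok ? 0 : 1) }'
}
```

For example: echo "志村けん" | mecab --dicdir /usr/local/lib/mecab/dic/mecab-ipadic-neologd | mecab_single_token "志村けん" && echo "neologd OK"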

For extended support of the Arabic language, a POS tagger needs to be installed:

sudo pip install naftawayh
sudo pip install pyarabic

For memory consumption profiling, Pympler needs to be installed:

sudo pip install Pympler

Install PHP dependencies for web administration console

This method requires root privileges or sudo access for the user.

sudo apt-get install php5-curl php5-gd php5-mcrypt dialog
sudo php5enmod mcrypt

Install from a source code tarball archive

1) To install this way, download the latest tarball archive from here http://packages.hierarchical-cluster-engine.com/src/

2) Extract everything from the archive, for example, for hce-node:

tar -xzf hce-node-1.2.tar.gz

3) Run configure to create the makefile:

./configure

If missing dependencies are reported, install the required dependencies.

4) Run make to build the application:

make

5) Run make install to install the application into the system:

sudo make install

6) Run ldconfig to update the system shared libraries registration:

sudo ldconfig
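
The build steps above can be chained so that a failure at any stage stops the sequence. The sketch below assumes that the archive unpacks into a directory named after the tarball (for example hce-node-1.2), which should be verified for the actual archive. Both function names are hypothetical:

```shell
# Hypothetical helper: derive the directory name from a tarball name,
# assuming the archive unpacks into a directory named after itself.
src_dir_from_tarball() {
    name="$(basename "$1")"
    printf '%s\n' "${name%.tar.gz}"
}

# build_from_tarball TARBALL: extract, configure and build (no install).
# The body runs in a subshell, so the caller's directory is unchanged.
build_from_tarball() (
    tar -xzf "$1" &&
    cd "$(src_dir_from_tarball "$1")" &&
    ./configure &&
    make
)
```

For example: build_from_tarball hce-node-1.2.tar.gz && (cd hce-node-1.2 && sudo make install) && sudo ldconfig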