Debian 9.x-11.x; Ubuntu 17.x, 18.x amd64 Linux
Native installation through the package manager is not yet available for Debian 9+, because the package is not ready; it is possible only by resolving the dependency list manually. To avoid this kind of trouble, the binary distribution can be used instead. A binary build and a fresh hce-node bundle can be downloaded from here.
Extract the bundle directory. The binary hce-node distribution archives for Debian 9.x and Ubuntu 14.x are located in the “hce-node-bundle/bin/” directory. On Ubuntu 17.x the “Debian 9.x” archive can be used.
Install the “hce-node” binary into a suitable binary directory, for example “/usr/bin”, and install all libraries and symlinks into a library directory, for example “/usr/local/lib”.
This method requires root privileges or sudo access for the user.
1) Add the source URL to the Debian sources file by editing /etc/apt/sources.list. For example, add this line:
deb http://packages.hierarchical-cluster-engine.com/debian/8/stable/ jessie main
or use:
sudo bash -c 'echo "deb http://packages.hierarchical-cluster-engine.com/debian/8/stable/ jessie main" > /etc/apt/sources.list.d/hce.list'
* Note that the developer’s stable packages can be installed from here:
deb http://packages.hierarchical-cluster-engine.com/debian/8/dev/ jessie main
* and the developer’s unstable packages can be installed from here:
deb http://packages.hierarchical-cluster-engine.com/debian/8/unstable/ jessie main
2) Add apt-key:
sudo apt-key adv --recv-keys --keyserver keyserver.ubuntu.com 04742D4B
or
sudo apt-key adv --recv-keys --keyserver hkp://keys.gnupg.net 04742D4B
3) Update the sources in the system, run:
sudo apt-get update
4) Install the required component with the regular package manager. For example, to install the “hce-node” application:
sudo apt-get install hce-node
When the installation has finished, the hce-node executable can be started and its build version checked:
hce-node -v
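The three repository lines in step 1 differ only in the channel segment of the URL, so the line can be generated rather than typed. The following is a minimal sketch and not part of the HCE distribution: the function name hce_repo_line is hypothetical, and the URL layout is assumed to be exactly the one shown in step 1.

```shell
#!/bin/sh
# Hypothetical helper: print the APT source line for one HCE channel.
# The URL layout is copied from step 1; the function name is illustrative only.
hce_repo_line() {
    channel="${1:-stable}"
    case "$channel" in
        stable|dev|unstable) ;;
        *) echo "unknown channel: $channel" >&2; return 1 ;;
    esac
    echo "deb http://packages.hierarchical-cluster-engine.com/debian/8/${channel}/ jessie main"
}

# Print the stable channel line; nothing is written to the system here.
hce_repo_line stable
```

For example, piping the result of "hce_repo_line dev" into "sudo tee /etc/apt/sources.list.d/hce.list" would write the developer channel line into the same file that step 1 uses.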
Install the Bundle Environment for PHP language
1) To install the Bundle into the home directory, run the deployment script included in the package distribution with regular user privileges:
hce-node-bundle-install.sh
The hce-node-bundle directory will be created in the home directory of the current user. After that, the dependency libraries need to be installed. This step requires root privileges or sudo access for the user.
* Note that in some cases unzip needs to be installed:
sudo apt-get install unzip
2) Install zmq library
sudo apt-get install libzmq3-dev
3) Install PHP (version 7 or above is assumed to be the system default). If PHP is already installed, the PHP package installation can be skipped:
sudo apt-get install php
sudo apt-get install php-dev
php -v
sudo apt-get install php-pear
sudo apt-get install pkg-config
sudo apt-get install libpgm-dev
sudo pecl install --ignore-errors zmq-beta
After that, it may be necessary to add this line to the php.ini used for the command line:
extension=zmq.so
or in some cases, for example "Debian 8", execute:
sudo bash -c 'echo -e "; configuration for php ZMQ module \n; priority=20 \nextension=zmq.so" > /etc/php/7.0/cli/conf.d/20-zmq.ini'
4) Optionally (only if search functionality needs to be available) install the Sphinx search engine
sudo apt-get install sphinxsearch
5) Optionally (for common hce-node cluster tests only) install bc for DRCE tests
sudo apt-get install bc
6) Optionally (for common hce-node cluster tests only) install Java 7 for DRCE tests
sudo apt-get install openjdk-7-jdk
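The ini file from step 3 can also be written by a small function that takes the target directory as an argument, which allows a trial run on a scratch directory first. This is a sketch under assumptions: the function name write_zmq_ini is hypothetical, and the file name and the three lines of content are copied from the command in step 3.

```shell
#!/bin/sh
# Hypothetical helper: write the ZMQ module ini file into a conf.d directory.
# The directory is an argument, so the function can be tried on a scratch
# directory before it is pointed at /etc/php/7.0/cli/conf.d (which needs root).
write_zmq_ini() {
    conf_dir="$1"
    [ -d "$conf_dir" ] || { echo "no such directory: $conf_dir" >&2; return 1; }
    printf '%s\n' \
        '; configuration for php ZMQ module' \
        '; priority=20' \
        'extension=zmq.so' > "$conf_dir/20-zmq.ini"
}
```

Called as root with /etc/php/7.0/cli/conf.d as the argument, it writes the same three lines as the command in step 3. Afterwards "php -m" should list zmq among the loaded modules.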
Please, read the ~/hce-node-bundle/api/php/doc/readme.txt file to continue.
Install Bundle Environment for Python language
This method requires root privileges or sudo access for the user.
1) Debian packages dependencies:
sudo apt-get install libpgm-dev mysql-server python-pip python-dev python-flask python-flaskext.wtf libffi-dev libxml2-dev libxslt1-dev mysql-client default-libmysqlclient-dev python-mysqldb libicu-dev libgmp3-dev libtidy-dev libjpeg-dev
2) Python dependencies:
sudo easy_install --upgrade pip
sudo pip install sqlalchemy Flask-SQLAlchemy scrapy gmpy pyzmq lepl urlnorm pyicu mysql-python goose-extractor pytidylib uritools python-magic pillow beautifulsoup4 snowballstemmer soundex pycountry langdetect iso639 psutil email validators dateutils jsonpath_ng
sudo pip install regex==2020.10.15
sudo pip install newspaper
sudo pip install -U feedparser==5.2.1
sudo pip install netifaces
sudo pip install cement==2.10.12
* The newspaper package has a broken nltk dependency. If this is still not fixed, try the following:
wget https://pypi.python.org/packages/source/d/distribute/distribute-0.6.21.tar.gz
tar xzf distribute-0.6.21.tar.gz
cd distribute-0.6.21
Edit the default URL in the file "distribute_setup.py", changing http to https in the line:
DEFAULT_URL = "http://pypi.python.org/packages/source/d/distribute/"
so that it reads:
DEFAULT_URL = "https://pypi.python.org/packages/source/d/distribute/"
Run:
python distribute_setup.py
Then run:
sudo pip install newspaper
* the lepl package is deprecated.
The scrapy library should preferably be version 0.24.x. To install it, use:
sudo pip install -U scrapy==0.24.4
Also, goose-extractor needs to be version 1.0.x. To install it, use:
sudo pip install -U goose-extractor==1.0.22
*On some Debian-based distributions, such as Ubuntu 14.x, the psutil module may need to be installed at exactly version 4.1.0, for example:
sudo pip install psutil==4.1.0
Install the requests library, version 2.4.3 or higher but no later than 2.6.x or 2.7, for example:
sudo pip install requests[security]==2.7
*Pillow may need to be reinstalled to get JPG support:
sudo pip install -I pillow
*The cement package may need to be installed with apt-get:
sudo apt-get install python-cement
3) Configure locale for the UTF8 charset
sudo dpkg-reconfigure locales
and select en_US.UTF-8.
Configure .bashrc by adding these lines:
export LC_ALL=en_US.UTF-8
export LANG=en_US.UTF-8
export LANGUAGE=en_US.UTF-8
Apply the configuration for the current user (run from target user session):
source ~/.bashrc
4) Create MySQL user and DB schema for Distributed Crawler Application:
cd ~/hce-node-bundle/api/python/manage/
sudo ./mysql_create_user.sh
./mysql_create_db.sh
*Note: if the Python application has problems with the zmq library connection after start, or related exceptions are logged, a specific version of the zmq library may need to be installed as described below:
wget https://archive.org/download/zeromq_4.0.4/zeromq-4.0.4.tar.gz
tar -xvf zeromq-4.0.4.tar.gz
cd zeromq-4.0.4
sudo ./configure && make && sudo make install
sudo ldconfig
sudo pip install pyzmq --install-option="--zmq=4.0.3"
cd ..
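Several Python dependencies in step 2 are pinned to a version or a version range, for example requests. A range check of this kind can be sketched in shell as shown below. This is only an illustration: the function names are hypothetical, the comparison relies on the -V (version sort) option of GNU sort, and treating the upper bound as inclusive is an assumption based on the example command in step 2.

```shell
#!/bin/sh
# Hypothetical helpers for comparing dotted version strings.
# They rely on GNU sort -V, which is available on Debian and Ubuntu.

# True when $1 >= $2 in version order.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# True when $2 <= $1 <= $3 (both bounds inclusive).
# Note: sort -V orders 2.7 before 2.7.0, so give the bounds with the same
# number of components as the version being checked.
version_in_range() {
    version_ge "$1" "$2" && version_ge "$3" "$1"
}
```

The installed version can be read from the "Version:" line printed by "pip show requests" and passed as the first argument, for example: version_in_range 2.7.0 2.4.3 2.7.0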
Please, read the section “1.4. Hierarchical Cluster Engine (HCE) usage for the DC service” of ~/hce-node-bundle/api/php/doc/readme.txt file to configure the proper hce-node cluster type. The dual hosts configuration for the DC service brief manual can be read here. Please, read the ~/hce-node-bundle/api/python/doc/ftests.txt file to continue.
Install Distributed Crawler client Environment for Python language
This method requires root privileges or sudo access for the user.
1) Debian packages dependencies:
sudo apt-get install python-pip python-dev libffi-dev libxml2-dev libxslt1-dev
2) Python package dependencies:
sudo pip install scrapy pyzmq w3lib
sudo pip install cement==2.10.12
3) If the DTS archive was downloaded directly, run the following after unzipping it:
chmod 777 ~/hce-node-bundle/usr/bin/hce-node-permissions.sh
~/hce-node-bundle/usr/bin/hce-node-permissions.sh
To enable HTTP/2 support in the crawler, install hyper:
sudo pip install hyper
Dynamic pages fetcher support (now based on the python Selenium and web-driver for the chrome browser)
Installation of the chrome browser, driver and dependencies:
Download the chrome driver binary:
https://sites.google.com/a/chromium.org/chromedriver/downloads
and put it in to the directory:
~/hce-node-bundle/api/python/bin/chromedriver64
Set executable permissions:
chmod 777 ~/hce-node-bundle/api/python/bin/chromedriver64
Install Google Chrome by any method, or as described here:
http://www.tecmint.com/install-google-chrome-in-debian-ubuntu-linux-mint/
Or use:
sudo wget -q -O - https://dl-ssl.google.com/linux/linux_signing_key.pub | sudo apt-key add -
sudo sh -c 'echo "deb http://dl.google.com/linux/chrome/deb/ stable main" >> /etc/apt/sources.list.d/google.list'
sudo apt-get update
sudo apt-get install google-chrome-stable
Install dependent library:
sudo apt-get install libexif-dev
Install the xvfb package:
sudo apt-get install xvfb
and possibly:
sudo apt-get install libxss1
Run the Xvfb for the resolution 1024x768x16:
Xvfb :1 -screen 0 1024x768x16 &> xvfb.log &
or this way for OpenVZ container:
Xvfb +extension RANDR :1 -screen 0 1024x768x16 &> xvfb.log &
Install the python Selenium package:
sudo pip install -U selenium==3.4.3
Configure locale for the UTF8 charset:
sudo dpkg-reconfigure locales
and select en_US.UTF-8.
Configure .bashrc by adding these lines:
export LC_ALL=en_US.UTF-8
export LANG=en_US.UTF-8
export LANGUAGE=en_US.UTF-8
DISPLAY=:1.0
export DISPLAY
Apply the configuration for the current user (run from the target user session):
source ~/.bashrc
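The .bashrc changes above can be applied idempotently, so that re-running the setup does not append the same lines twice. The following is a minimal sketch with hypothetical function names; it writes "export DISPLAY=:1.0" as a single line, which is equivalent to the two-line form shown above.

```shell
#!/bin/sh
# Hypothetical helper: append a line to a file only if it is not there yet.
add_line_once() {
    target="$1"
    line="$2"
    touch "$target"
    # -x matches whole lines only, -F disables pattern characters
    grep -qxF -e "$line" "$target" || printf '%s\n' "$line" >> "$target"
}

# Hypothetical helper: add the locale and display settings listed above.
# The file defaults to ~/.bashrc; pass another path to try it on a copy.
configure_fetcher_env() {
    rc_file="${1:-$HOME/.bashrc}"
    for name in LC_ALL LANG LANGUAGE; do
        add_line_once "$rc_file" "export ${name}=en_US.UTF-8"
    done
    add_line_once "$rc_file" "export DISPLAY=:1.0"
}
```

Running configure_fetcher_env without arguments edits ~/.bashrc of the current user; the new settings still have to be loaded with "source ~/.bashrc" afterwards.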
*If the hce-node clusters were started before (start_nm.sh and start_r.sh executed), the clusters need to be restarted.
* In some cases an additional header needs to be added to the “ini/crawler-task_headers.txt” file for Google Chrome sandboxing:
--no-sandbox:
Optional curl fetcher support
sudo apt install python-pycurl
Digest dependencies
Installation of the wkhtmltox and dependencies:
Download the patched wkhtmltox package from http://wkhtmltopdf.org/downloads.html
or another source, then install the dependencies:
sudo apt-get install libxfont1 xfonts-encodings xfonts-utils xfonts-base xfonts-75dpi
Install the package:
sudo dpkg -i wkhtmltox-0.12.2.1_linux-wheezy-amd64.deb
sudo apt-get install ttf-wqy-microhei fonts-arphic-ukai fonts-arphic-uming fonts-arphic-gbsn00lp fonts-arphic-bkai00mp fonts-arphic-bsmi00lp
sudo fc-cache -f -v
If stop words need to be filtered with NLTK, the NLTK data packages need to be installed. To install all NLTK corpora and models for the current user:
python -m nltk.downloader all
Alternatively at system level:
sudo python -m nltk.downloader -d /usr/local/share/nltk_data all
Or only the stop words corpus for the current user:
python -m nltk.downloader -f stopwords
Also, if the current version of NLTK produces errors, a fixed version can be used:
pip uninstall nltk
pip install -U distribute
pip install nltk==3.4.5
In the same way, to get extended lemma support for the English language (snowball is in the base installation), install “wordnet” from NLTK:
python -m nltk.downloader wordnet
To get more accurate stemming for the German language, install NLTK version 3.4.5 or later:
pip install --user -U nltk
To check a NLTK version:
python -c "import nltk; print nltk.__version__"
To check a German stemmer:
python -c "from nltk.stem.cistem import Cistem"
Modern NLTK probably requires a numpy:
pip install --user -U numpy
Also, to get additional lemma support for the Russian language from Mystem (the internal (c) solution in the base installation remains the main one), install the pymystem3 package:
sudo pip install pymystem3
To parse Japanese words more accurately, the tinysegmenter, romkan, pykakasi and MeCab word tokenizers can optionally be installed:
sudo pip install tinysegmenter
Install romkan:
sudo pip install romkan
Install pykakasi:
sudo pip install nose
sudo pip install -U pykakasi==0.23
wget https://pypi.python.org/packages/08/10/e8c7b6b7774b0941dcf583019dc032ecc63d5154bbbf53b6c814fa085f80/pykakasi-0.23-py2.7.egg
Extract the *.pickle and *.db files and copy them to the regular Python module location, for example /usr/local/lib/python2.7/dist-packages/pykakasi/:
sudo unzip -d /usr/local/lib/python2.7/dist-packages pykakasi-0.23-py2.7.egg '*.db' '*.pickle'
Install MeCab:
Download mecab .tgz source from http://taku910.github.io/mecab/
Extract mecab-0.996.tar.gz, then build and install:
tar -xzf mecab-0.996.tar.gz
cd mecab-0.996
./configure --with-charset=utf8 --enable-utf8-only
make
sudo make install
sudo ldconfig
Download and install neologd dictionary manually:
sudo apt install mecab libmecab-dev mecab-ipadic-utf8 git make curl xz-utils file
git clone --depth 1 https://github.com/neologd/mecab-ipadic-neologd.git
cd mecab-ipadic-neologd/
sudo bin/install-mecab-ipadic-neologd -n -u
or follow the manual:
https://github.com/neologd/mecab-ipadic-neologd
Download and install Python support for MeCab 0.996 from here:
https://pypi.python.org/pypi/mecab-python/0.996
tar -xzf mecab-python-0.996.tar.gz
cd mecab-python-0.996
python setup.py build
sudo python setup.py install
Or try for Python3:
sudo pip install mecab-python
Check that MeCab works properly with the neologd dictionary:
echo "志村けん" | mecab --dicdir /usr/local/lib/mecab/dic/mecab-ipadic-neologd
If everything works properly, the output looks like this:
志村けん 名詞,固有名詞,人名,一般,*,*,志村けん,シムラケン,シムラケン
EOS
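The check above can be automated by testing that the whole name comes back as a single token. The following is a sketch with a hypothetical function name; it assumes the default MeCab output format, with one token per line, the surface form first, and a closing EOS line.

```shell
#!/bin/sh
# Hypothetical helper: succeed when the tokenizer output consists of exactly
# one token and that token equals the input word.
# $1 is the word, $2 is the text printed by mecab for that word.
is_single_token() {
    word="$1"
    output="$2"
    # Count the non-empty lines that come before the EOS marker.
    tokens="$(printf '%s\n' "$output" | awk '/^EOS$/ {exit} NF {n++} END {print n+0}')"
    # The surface form is the first field; accept a tab or a space after it.
    first="$(printf '%s\n' "$output" | head -n 1 | cut -f 1 | cut -d ' ' -f 1)"
    [ "$tokens" -eq 1 ] && [ "$first" = "$word" ]
}
```

To use it, capture the output of the mecab command shown above in a variable and pass it as the second argument, with the name as the first. A dictionary without this entry would typically split the name into several tokens, and the function would then return a non-zero status.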
For extended support of the Arabic language, a POS tagger needs to be installed:
sudo pip install naftawayh
sudo pip install pyarabic
For memory consumption profiling, Pympler needs to be installed:
sudo pip install Pympler
Install PHP dependencies for web administration console
This method requires root privileges or sudo access for the user.
sudo apt-get install php5-curl php5-gd php5-mcrypt dialog
sudo php5enmod mcrypt
Install as source code tarball archive
1) To install this way – download latest tarball archive from here http://packages.hierarchical-cluster-engine.com/src/
2) Extract everything from the archive, for example, for hce-node:
tar -xzf hce-node-1.2.tar.gz
3) Run configure to create make file:
./configure
If configure reports missing dependencies, install them and run it again.
4) Run make to build the application:
make
5) Run make install to install the application into the system:
sudo make install
6) Run ldconfig to upgrade system shared libraries registration:
sudo ldconfig
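Steps 2 to 6 can be collected into one function that stops at the first failing command. This is a sketch, not a script shipped with HCE: the function name is hypothetical, and it assumes that the archive unpacks into a directory named after the tarball without its .tar.gz suffix. make is called as a system command, since it is not a script inside the source directory.

```shell
#!/bin/sh
# Hypothetical helper: unpack, configure, build and install a source tarball.
# $1 is the tarball; pass "echo" as $2 to print the commands instead of
# running them (dry run).
build_from_tarball() {
    tarball="$1"
    run="${2:-}"
    # Assumption: the top-level directory matches the tarball base name.
    src_dir="$(basename "$tarball" .tar.gz)"
    $run tar -xzf "$tarball" &&
    $run cd "$src_dir" &&
    $run ./configure &&
    $run make &&
    $run sudo make install &&
    $run sudo ldconfig
}

# Dry run: only prints the six commands for the example archive name.
build_from_tarball hce-node-1.2.tar.gz echo
```

Without the second argument the commands are executed for real, and the calling shell is left inside the source directory.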