HCE project C++ developers source code library  1.1.1
HCE project developer library
HCE::component::Refine Class Reference

#include <Refine.hpp>

Inheritance diagram for HCE::component::Refine:
Collaboration diagram for HCE::component::Refine:

Public Member Functions

 Refine (ComponentType inType=CT_DEFAULT)
virtual ~Refine ()
Poco::SharedPtr< DataBase > process (const Poco::SharedPtr< DataBase > inData)
- Public Member Functions inherited from HCE::component::ComponentBase
 ComponentBase (ComponentType inType=CT_DEFAULT)
const std::atomic_bool & getIsBusy ()
void setIsBusy (bool isBusy)
virtual ~ComponentBase ()
bool getIsBusy ()
- Public Member Functions inherited from HCE::DataBase
 DataBase (ComponentType inType=CT_DEFAULT)
ComponentType getType ()
virtual ~DataBase ()

Additional Inherited Members

- Protected Attributes inherited from HCE::component::ComponentBase
std::atomic_bool _isBusy
bool _isBusy

Detailed Description

Definition at line 45 of file Refine.hpp.

Constructor & Destructor Documentation

HCE::component::Refine::Refine ( ComponentType  inType = CT_DEFAULT)


Defines the content processing schema. If the input message does not provide its own content processing schema, the Refine component applies the default one:

  1. Reduce (strip) tags from the raw content
  2. Split the raw content into tokens
  3. Detect the language mask for each token in the split content
  4. Normalize Japanese tokens
  5. Normalize tokens of European languages (Russian, English, etc.)
  6. Determine the part of speech of each token
  7. Compute the CRC64 of each token's normalized form
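The default schema above is an ordered sequence of steps applied to one piece of content. A minimal sketch of that idea follows. Every name in it (`Document`, `Token`, `Schema`, `makeDefaultSchema`, and the step functions) is hypothetical and not part of the HCE library, and only three simplified stand-in steps are wired in; language detection, part-of-speech tagging and CRC64 are left out.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <functional>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-ins; none of these names come from the HCE library.
struct Token {
    std::string text;        // original token form
    std::string normalized;  // filled by the normalization step
};

struct Document {
    std::string raw;            // raw input content
    std::vector<Token> tokens;  // filled by the split step
};

using Step = std::function<void(Document&)>;

// A schema is an ordered list of named steps, run front to back.
class Schema {
public:
    void add(std::string name, Step step) {
        steps_.emplace_back(std::move(name), std::move(step));
    }
    void run(Document& doc) const {
        for (const auto& s : steps_) s.second(doc);
    }
    std::size_t size() const { return steps_.size(); }

private:
    std::vector<std::pair<std::string, Step>> steps_;
};

// Stand-in for step 1: drop everything between '<' and '>'.
// A space replaces each tag so adjacent words are not fused.
inline void stripTags(Document& doc) {
    std::string out;
    bool inTag = false;
    for (char c : doc.raw) {
        if (c == '<') {
            inTag = true;
        } else if (c == '>') {
            inTag = false;
            out += ' ';
        } else if (!inTag) {
            out += c;
        }
    }
    doc.raw = out;
}

// Stand-in for step 2: split on whitespace.
inline void splitOnSpaces(Document& doc) {
    std::istringstream in(doc.raw);
    std::string word;
    doc.tokens.clear();
    while (in >> word) doc.tokens.push_back(Token{word, std::string()});
}

// Stand-in for steps 4 and 5: ASCII lowercasing instead of real normalization.
inline void lowercaseAscii(Document& doc) {
    for (Token& t : doc.tokens) {
        t.normalized = t.text;
        for (char& c : t.normalized)
            c = static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
    }
}

// Builds the simplified default schema used when the input supplies none.
inline Schema makeDefaultSchema() {
    Schema s;
    s.add("reduce-tags", stripTags);
    s.add("split", splitOnSpaces);
    s.add("normalize", lowercaseAscii);
    return s;
}
```

Keeping the steps in an ordered container is one way to let an input message replace the default sequence with its own; whether Refine stores its schema this way is not stated in this reference.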

Tagger POS reduce.

Split the content into tokens. Sets the tokenizer type used to split the content. Available tokenizers:

  1. ICU
  2. Boost (methods: split and tokenizer)
  3. MeCab
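One way to picture a selectable tokenizer type is a small factory keyed by an enumeration. The sketch below is an assumption about the shape of such a selection, not the HCE API: `TokenizerKind`, `Tokenizer`, `WhitespaceTokenizer` and `makeTokenizer` are hypothetical names, and a plain whitespace splitter stands in for the Boost backend so that the example needs only the standard library. The ICU and MeCab cases are deliberately not wired in.

```cpp
#include <cassert>
#include <memory>
#include <sstream>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical enumeration mirroring the three backends listed above.
enum class TokenizerKind { ICU, Boost, MeCab };

// Hypothetical common interface for all backends.
class Tokenizer {
public:
    virtual ~Tokenizer() = default;
    virtual std::vector<std::string> tokenize(const std::string& text) const = 0;
};

// Standard-library stand-in for a real backend: splits on whitespace.
class WhitespaceTokenizer : public Tokenizer {
public:
    std::vector<std::string> tokenize(const std::string& text) const override {
        std::vector<std::string> out;
        std::istringstream in(text);
        std::string tok;
        while (in >> tok) out.push_back(tok);
        return out;
    }
};

// Returns the backend for the requested kind. Only the Boost slot is
// filled (with the stand-in); the other kinds throw in this sketch.
inline std::unique_ptr<Tokenizer> makeTokenizer(TokenizerKind kind) {
    switch (kind) {
        case TokenizerKind::Boost:
            return std::unique_ptr<Tokenizer>(new WhitespaceTokenizer());
        case TokenizerKind::ICU:
        case TokenizerKind::MeCab:
            break;
    }
    throw std::invalid_argument("tokenizer backend not wired into this sketch");
}
```

A caller would pick the kind once and then use the returned object for every piece of content, for example `makeTokenizer(TokenizerKind::Boost)->tokenize("one two")`.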


Detect the language of each token.

Normalize Japanese tokens.

Normalize tokens of other languages.

Part of speech.

CRC64.
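For the CRC64 step, a bitwise implementation is short enough to show in full. This reference does not say which CRC-64 variant HCE uses, so the sketch below assumes the CRC-64/XZ parameters (reflected ECMA-182 polynomial, all-ones initial value and final XOR). The function name `crc64` is hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// ASSUMPTION: CRC-64/XZ parameters. The variant actually used by the
// Refine component is not documented on this page.
inline std::uint64_t crc64(const std::string& data) {
    const std::uint64_t poly = 0xC96C5795D7870F42ULL;  // reflected ECMA-182 polynomial
    std::uint64_t crc = ~0ULL;                         // initial value: all ones
    for (unsigned char byte : data) {
        crc ^= byte;
        // Process the byte one bit at a time, least significant bit first.
        for (int bit = 0; bit < 8; ++bit)
            crc = (crc & 1) ? (crc >> 1) ^ poly : (crc >> 1);
    }
    return ~crc;  // final XOR with all ones
}
```

Hashing the normalized form rather than the original one means that surface variants mapped to the same normal form share a checksum. A table-driven version would be faster for long inputs; the bitwise loop is used here only for brevity.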

Definition at line 33 of file Refine.cpp.

HCE::component::Refine::~Refine ( )
virtual

Definition at line 114 of file Refine.cpp.

Member Function Documentation

Poco::SharedPtr< DataBase > HCE::component::Refine::process ( const Poco::SharedPtr< DataBase > inData)
virtual

Timer statistics.

Main processing loop.

Fill OutDataRefine.

For each token extracted from the content:

cword instance.

The following fields must be set:

  unsigned char black;                //!< refine
  unsigned short simClass;            //!< refine two bytes morphology ( MorphChangeGrad )
  unsigned int hCrc;                  //!< refine CRC32 word ( CRC word for highlight on CDR )
  unsigned int offset;                //!< refine
  unsigned int sentenceNumber;        //!< refine (deprecated) number word's sentence, start from begin
  unsigned char lingIntegrity;        //!< refine valuable of the word in the content ( val/unval content )
  unsigned int initWordLen;           //!< refine
  std::string normWord;               //!< refine
  POSMaskBitset<POS_NUM> _posMask;

  1. Set the word blacklist
  2. Set the word morphology
  3. Set the word CRC for highlighting
  4. Set the word offset
  5. Set the word's sentence number
  6. Set the word's linguistic integrity
  7. Set the initial word length
  8. Set the original word form
  9. Set the normalized word form
  10. Set the word's part-of-speech mask
  11. Set the word's type
  12. Insert the cword into the vector
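The per-token part of the loop can be pictured as: build one record, set its fields, append it to a vector. The sketch below illustrates that shape under several assumptions. `CWordSketch`, `TokenInfo`, `fillCWords` and `kPosNum` are hypothetical names, `std::bitset` stands in for `POSMaskBitset<POS_NUM>`, the `word` member stands in for the original word form, and only the fields that can be derived from the token itself are filled.

```cpp
#include <bitset>
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

constexpr std::size_t kPosNum = 16;  // ASSUMPTION: stand-in for POS_NUM

// Hypothetical record mirroring the field list above.
struct CWordSketch {
    unsigned char black = 0;
    unsigned short simClass = 0;
    unsigned int hCrc = 0;
    unsigned int offset = 0;
    unsigned int sentenceNumber = 0;
    unsigned char lingIntegrity = 0;
    unsigned int initWordLen = 0;
    std::string word;              // original form (hypothetical member name)
    std::string normWord;
    std::bitset<kPosNum> posMask;  // stand-in for POSMaskBitset<POS_NUM>
};

// Hypothetical input: what the earlier steps are assumed to produce per token.
struct TokenInfo {
    std::string text;        // original token
    std::string normalized;  // normalized token
    unsigned int offset;     // position of the token in the content
};

// Builds one record per token and appends it to the output vector.
inline std::vector<CWordSketch> fillCWords(const std::vector<TokenInfo>& tokens) {
    std::vector<CWordSketch> out;
    out.reserve(tokens.size());
    for (const TokenInfo& t : tokens) {
        CWordSketch w;
        w.offset = t.offset;
        // ASSUMPTION: length is counted in bytes, not in characters.
        w.initWordLen = static_cast<unsigned int>(t.text.size());
        w.word = t.text;
        w.normWord = t.normalized;
        out.push_back(std::move(w));
    }
    return out;
}
```

The remaining fields (blacklist, morphology, highlight CRC, sentence number, linguistic integrity, part-of-speech mask) keep their zero defaults here because this page does not describe how they are computed.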

rword instance.

The following fields must be set:

  std::string _word;
  unsigned long long _crc64;
  POSMaskBitset<POS_NUM> _posMask;
  MorphChangeGradBitset<MCG_NUM> _morphChangeGrad;

Set the word blacklist.

Implements HCE::component::ComponentBase.

Definition at line 117 of file Refine.cpp.

The documentation for this class was generated from the following files: