HCE Project Python language Distributed Tasks Manager Application, Distributed Crawler Application and client API bindings.  2.0.0-chaika
Hierarchical Cluster Engine Python language binding
dc_processor.base_extractor Namespace Reference

Classes

class  BaseExtractor
 

Functions

def signal_handler (signum, frame)
 
def adjustPubDate (dates)
 
def adjustMedia (medias)
 
def adjustContentUTF8Encoded (data)
 
def adjustLink (data)
 
def adjustNone (data)
 

Variables

 logger = Utils.MPLogger().getLogger()
 
string ERR_MSG_ADJUST_PUB_DATE = "Error in adjustPubDate: "
 
string ERR_MSG_ADJUST_MEDIA = "Error in adjustMedia: "
 
string ERR_MSG_ADJUST_CONTENT_UTF8_ENCODED = "Error in adjustContentUTF8Encoded: "
 
string ERR_MSG_OK = ""
 
string EMPTY_DATE = ""
 

Detailed Description

File: base_extractor.py
Author: Alexey, bgv <developers.hce@gmail.com>
Link: http://hierarchical-cluster-engine.com/
Copyright: Copyright © 2013 IOIX Ukraine
License: http://hierarchical-cluster-engine.com/license/
Package: HCE project node API
Since: 0.1

Function Documentation

◆ adjustContentUTF8Encoded()

def dc_processor.base_extractor.adjustContentUTF8Encoded (   data)

Definition at line 81 of file base_extractor.py.

81 def adjustContentUTF8Encoded(data):
82     return data
83 
84 
85 # Adjust data in content_encoded tag
86 # If the content is not meaningful, adjust it
87 # @param data content extracted from content

◆ adjustLink()

def dc_processor.base_extractor.adjustLink (   data)

Definition at line 88 of file base_extractor.py.

88 def adjustLink(data):
89     if isinstance(data, list) and len(data) > 1:
90         data = data[0]
91     return data
92 
93 
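The behaviour of adjustLink() on different inputs is easy to misread, so here is a minimal standalone sketch that mirrors the listing above. The name adjust_link_sketch is hypothetical and is not part of the module.

```python
def adjust_link_sketch(data):
    """Mirror of adjustLink(): collapse a multi-element list to its first item."""
    # Only lists with two or more elements are reduced to their first element.
    if isinstance(data, list) and len(data) > 1:
        data = data[0]
    # Everything else (strings, None, empty or one-element lists) passes through.
    return data


if __name__ == "__main__":
    print(adjust_link_sketch(["http://example.com/a", "http://example.com/b"]))
    print(adjust_link_sketch(["http://example.com/a"]))
    print(adjust_link_sketch("http://example.com/a"))
```

Note that, because the condition is `len(data) > 1`, a one-element list is returned as a list rather than as its single element. The source does not say whether this is intentional.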

◆ adjustMedia()

def dc_processor.base_extractor.adjustMedia (   medias)

Definition at line 64 of file base_extractor.py.

64 def adjustMedia(medias):
65     return medias
66     # valid_http_url = HttpUrl()
67     # res = []
68     # try:
69     #     if isinstance(medias, list):
70     #         for media in medias:
71     #             if valid_http_url(media):
72     #                 res.append(media)
73     # except Exception as err:
74     #     logger.error(ERR_MSG_ADJUST_MEDIA + err.message)
75     # return res
76 
77 
78 # Adjust data in content_encoded tag
79 # If the content is not meaningful, adjust it
80 # @param data content extracted from content
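As listed, adjustMedia() returns its argument unchanged; the URL filtering is commented out. The sketch below illustrates what that disabled filter appears to do, under loud assumptions: the origin of HttpUrl is not shown in this listing, so the hypothetical helper is_http_url() uses the standard library's urllib.parse as a stand-in, and filter_media_sketch() is likewise a hypothetical name.

```python
from urllib.parse import urlparse


def is_http_url(value):
    """Stand-in for the HttpUrl validator (assumption, not the original check)."""
    if not isinstance(value, str):
        return False
    parts = urlparse(value)
    # Accept only absolute http/https URLs that carry a host part.
    return parts.scheme in ("http", "https") and bool(parts.netloc)


def filter_media_sketch(medias):
    """Keep only the entries of a list that look like absolute HTTP(S) URLs."""
    res = []
    if isinstance(medias, list):
        for media in medias:
            if is_http_url(media):
                res.append(media)
    # Like the commented-out code, non-list input yields an empty list.
    return res


if __name__ == "__main__":
    print(filter_media_sketch(["http://example.com/a.jpg", "/img/b.png"]))
```

With this filter, relative paths such as "/img/b.png" are dropped, which would differ from the current pass-through behaviour of adjustMedia().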

◆ adjustNone()

def dc_processor.base_extractor.adjustNone (   data)

Definition at line 94 of file base_extractor.py.

94 def adjustNone(data):
95     return data
96 
97 
98 # #The BaseExtractor class
99 # This is the base class for custom extractors
100 # Provides basic functionality such as adding tags, etc.

◆ adjustPubDate()

def dc_processor.base_extractor.adjustPubDate (   dates)

Definition at line 42 of file base_extractor.py.

42 def adjustPubDate(dates):
43     # logger.debug("dates: %s", dates)
44     pub_date = EMPTY_DATE
45     try:
46         # TODO: improve to return most appropriate
47         # if dates and any(i.isdigit() for i in dates):
48         if isinstance(dates, list) and len(dates):
49             # pub_date = dates[0]
50             pub_date = " ".join(dates)
51         else:
52             pub_date = dates
53         if pub_date and len(dates) and not re.search(r'\d+', pub_date):
54             pub_date = EMPTY_DATE
55     except Exception as err:
56         ExceptionLog.handler(logger, err, ERR_MSG_ADJUST_PUB_DATE)
57 
58     return pub_date
59 
60 
61 # Adjust data in media tag
62 # If media are PR (partial reference) adjust path
63 # @param medias media extracted from content
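To make the edge cases of adjustPubDate() concrete, the following is a minimal standalone sketch of the listing above. It assumes the digit check on line 53 sits directly inside the try block, since the flattened listing does not preserve indentation. It also drops the logger and ExceptionLog calls so that it runs on its own. The name adjust_pub_date_sketch is hypothetical.

```python
import re

EMPTY_DATE = ""


def adjust_pub_date_sketch(dates):
    """Approximation of adjustPubDate() without logging (assumed nesting)."""
    pub_date = EMPTY_DATE
    try:
        if isinstance(dates, list) and len(dates):
            # A non-empty list is joined into one space-separated string.
            pub_date = " ".join(dates)
        else:
            pub_date = dates
        # A candidate without any digit is not treated as a date.
        if pub_date and len(dates) and not re.search(r"\d+", pub_date):
            pub_date = EMPTY_DATE
    except Exception:
        # The original reports the error through ExceptionLog.handler() here.
        pass
    return pub_date


if __name__ == "__main__":
    print(repr(adjust_pub_date_sketch(["Mon,", "12 May 2014"])))
    print(repr(adjust_pub_date_sketch(["yesterday"])))
```

Under these assumptions, falsy inputs such as None or an empty list are returned unchanged rather than replaced by EMPTY_DATE, and a list containing non-string items yields EMPTY_DATE because the join fails before pub_date is reassigned.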

◆ signal_handler()

def dc_processor.base_extractor.signal_handler (   signum, frame)

Definition at line 24 of file base_extractor.py.

24 def signal_handler(signum, frame):
25     del signum, frame
26     logger.debug("Time execution limit was reached: %s seconds.", str(CONSTS.TIME_EXECUTION_LIMIT))
27     raise Exception("Timed out!")
28 
29 
30 # Local class constants
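The listing does not show where signal_handler() is registered. One common way to use such a handler on Unix is to pair it with SIGALRM, as in the sketch below. This is an assumption about usage, not the module's documented behaviour. The names timeout_handler, run_with_limit and the TIME_EXECUTION_LIMIT value are hypothetical stand-ins; the real limit comes from CONSTS.TIME_EXECUTION_LIMIT.

```python
import signal
import time

TIME_EXECUTION_LIMIT = 0.2  # seconds; hypothetical value for illustration


def timeout_handler(signum, frame):
    """Same shape as signal_handler(): discard the arguments and raise."""
    del signum, frame
    raise Exception("Timed out!")


def run_with_limit(func, limit=TIME_EXECUTION_LIMIT):
    """Run func() and abort it with an exception after `limit` seconds.

    Unix only, and it must be called from the main thread.
    """
    previous = signal.signal(signal.SIGALRM, timeout_handler)
    signal.setitimer(signal.ITIMER_REAL, limit)
    try:
        return func()
    finally:
        # Cancel the timer and restore whatever handler was installed before.
        signal.setitimer(signal.ITIMER_REAL, 0)
        signal.signal(signal.SIGALRM, previous)


if __name__ == "__main__":
    try:
        run_with_limit(lambda: time.sleep(1))
    except Exception as err:
        print("aborted:", err)
```

Raising a plain Exception, as the original handler does, means callers cannot distinguish a timeout from other failures except by the message text.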

Variable Documentation

◆ EMPTY_DATE

string dc_processor.base_extractor.EMPTY_DATE = ""

Definition at line 37 of file base_extractor.py.

◆ ERR_MSG_ADJUST_CONTENT_UTF8_ENCODED

string dc_processor.base_extractor.ERR_MSG_ADJUST_CONTENT_UTF8_ENCODED = "Error in adjustContentUTF8Encoded: "

Definition at line 33 of file base_extractor.py.

◆ ERR_MSG_ADJUST_MEDIA

string dc_processor.base_extractor.ERR_MSG_ADJUST_MEDIA = "Error in adjustMedia: "

Definition at line 32 of file base_extractor.py.

◆ ERR_MSG_ADJUST_PUB_DATE

string dc_processor.base_extractor.ERR_MSG_ADJUST_PUB_DATE = "Error in adjustPubDate: "

Definition at line 31 of file base_extractor.py.

◆ ERR_MSG_OK

string dc_processor.base_extractor.ERR_MSG_OK = ""

Definition at line 35 of file base_extractor.py.

◆ logger

dc_processor.base_extractor.logger = Utils.MPLogger().getLogger()

Definition at line 20 of file base_extractor.py.