[Inheritance and collaboration diagrams for dc_processor.base_extractor.BaseExtractor]

Public Member Functions
def	__init__ (self, config, templ=None, domain=None, processorProperties=None)

def	__str__ (self)

def	__repr__ (self)

def	loadScraperProperties (self, scraperPropFileName)

def	isTagNotFilled (self, result, tagName)

def	isTagValueNotEmpty (self, tagValue)

def	tagValueElemValidate (self, tagValueElem, conditionElem)

def	tagValueValidate (self, tagName, tagValue)

def	addTag (self, result, tag_name, tag_value, xpath="", isDefaultTag=False, callAdjustment=True, tagType=None, allowNotFilled=False)

def	calculateMetrics (self, response)

def	rankReading (self, exctractorName)

Public Attributes
	config

	processorProperties

	name

	rank

	process_mode

	modules

	data

	db_dc_scraper_db

	DBConnector

	imgDelimiter

	tagsValidator

Static Public Attributes
	properties = None

dictionary	tag

dictionary	tagsMask

Detailed Description

Definition at line 101 of file base_extractor.py.

Constructor & Destructor Documentation

◆ __init__()

def dc_processor.base_extractor.BaseExtractor.__init__	(	self,
		config,
		templ = `None`,
		domain = `None`,
		processorProperties = `None`
	)

Definition at line 161 of file base_extractor.py.

   def __init__(self, config, templ=None, domain=None, processorProperties=None):  # pylint: disable=W0612,W0613
     self.config = config
     self.processorProperties = processorProperties
     self.properties = None
     scraperPropFileName = self.config.get("Application", "property_file_name")
 
     if scraperPropFileName is not None:
       self.loadScraperProperties(scraperPropFileName)
 
     self.name = "Base extractor"
     self.rank = CONSTS.SCRAPER_RANK_INIT
 
     # support processing modes
     self.process_mode = CONSTS.PROCESS_ALGORITHM_REGULAR
     self.modules = {}
 
     self.data = {"extractor":"Base extractor", "data":"", "name":""}
     self.db_dc_scraper_db = None
     self.DBConnector = None
     if processorProperties is not None and "SCRAPER_TAG_ITEMS_DELIMITER" in processorProperties:
       self.imgDelimiter = processorProperties["SCRAPER_TAG_ITEMS_DELIMITER"]
     else:
       self.imgDelimiter = ' '
     self.tagsValidator = None
     if processorProperties is not None and "tagsValidator" in processorProperties:
       try:
         self.tagsValidator = json.loads(processorProperties["tagsValidator"])
       except Exception as excp:
         ExceptionLog.handler(logger, excp, '>>> tagsValidator wrong json format', (), \
                            {ExceptionLog.LEVEL_NAME_ERROR:ExceptionLog.LEVEL_VALUE_DEBUG})
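
As a rough illustration of how the constructor reads processorProperties, the sketch below mirrors the two lookups against a plain dict. The function name parse_processor_properties and the sample values are hypothetical. Parse failures are silently ignored here, whereas the original logs them through ExceptionLog.

```python
import json


def parse_processor_properties(processor_properties):
    """Return (delimiter, tags_validator) the way the constructor derives them."""
    delimiter = " "        # default used when no delimiter is configured
    tags_validator = None  # stays None when the entry is missing or malformed

    if processor_properties is not None:
        if "SCRAPER_TAG_ITEMS_DELIMITER" in processor_properties:
            delimiter = processor_properties["SCRAPER_TAG_ITEMS_DELIMITER"]
        if "tagsValidator" in processor_properties:
            try:
                tags_validator = json.loads(processor_properties["tagsValidator"])
            except (TypeError, ValueError):
                pass  # the original logs the error; the validator stays None

    return delimiter, tags_validator


# Hypothetical sample inputs
defaults = parse_processor_properties(None)
configured = parse_processor_properties({
    "SCRAPER_TAG_ITEMS_DELIMITER": ",",
    "tagsValidator": '{"Base extractor": {"link": {"type": "include", "RE": "https?://"}}}',
})
broken = parse_processor_properties({"tagsValidator": "{not json"})
```

A malformed tagsValidator string leaves the validator unset, so tagValueValidate later passes every value through unchanged.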
 
 

Member Function Documentation

◆ __repr__()

def dc_processor.base_extractor.BaseExtractor.__repr__ ( self )

Definition at line 197 of file base_extractor.py.

   def __repr__(self):
     return repr((self.name, self.rank))

◆ __str__()

def dc_processor.base_extractor.BaseExtractor.__str__ ( self )

Definition at line 193 of file base_extractor.py.

   def __str__(self):
     return "%s" % (self.name)

◆ addTag()

def dc_processor.base_extractor.BaseExtractor.addTag	(	self,
		result,
		tag_name,
		tag_value,
		xpath = `""`,
		isDefaultTag = `False`,
		callAdjustment = `True`,
		tagType = `None`,
		allowNotFilled = `False`
	)

Definition at line 291 of file base_extractor.py.

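   def addTag(self, result, tag_name, tag_value, xpath="", isDefaultTag=False, callAdjustment=True, tagType=None,
              allowNotFilled=False):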
     ret = False
     if tag_name not in result.blockedByXpathTags:
       tag_value = self.tagValueValidate(tag_name, tag_value)
       if tag_value is not None:
         if callAdjustment:
           try:
             if tag_value and not isinstance(tag_value, list):
               pass
             if tag_value and isinstance(tag_value, list):
               pass
             tag_value = self.tag[tag_name](tag_value)
           except Exception as err:
             logger.debug('No tag name in result template: %s', str(err))
 
         result.errorCode = 0
         result.errorMessage = ERR_MSG_OK
 
         if (tag_name not in result.tags.keys() and self.isTagValueNotEmpty(tag_value) is not None) or \
         (self.isTagNotFilled(result, tag_name) and self.isTagValueNotEmpty(tag_value) is not None) or \
         allowNotFilled:
           data = {"extractor": "Base extractor", "data": "", "name": ""}
           data["data"] = tag_value
           data["name"] = tag_name
           data["xpath"] = xpath
           data["type"] = tagType
           data["lang"] = dc_processor.scraper_result.Result.TAGS_LANG_DEFAULT
           data["lang_suffix"] = dc_processor.scraper_result.Result.TAGS_LANG_SUFFIX_DEFAULT
           data["extractor"] = self.__class__.__name__
           result.tags[tag_name] = data
           if isDefaultTag and tag_name not in result.defaultTags:
             result.defaultTags.append(tag_name)
           ret = True
     else:
       logger.debug(">>> BaseExtractor.addTag, tags in break list; tag is = " + tag_name)
     return ret
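
The write decision in addTag can be sketched in isolation. This is a simplified model, not the real method: FakeResult is a hypothetical stand-in for dc_processor.scraper_result.Result, validation and adjustment are left out, and the "already filled" check is reduced to a truthiness test on the stored data.

```python
class FakeResult:
    """Hypothetical stand-in for dc_processor.scraper_result.Result."""

    def __init__(self):
        self.tags = {}
        self.defaultTags = []
        self.blockedByXpathTags = []


def is_not_filled(result, tag_name):
    # Simplified stand-in for isTagNotFilled: missing or falsy data counts as unfilled
    entry = result.tags.get(tag_name)
    return entry is None or not entry["data"]


def add_tag(result, tag_name, tag_value, is_default=False, allow_not_filled=False):
    # Blocked tags and None values are never written
    if tag_name in result.blockedByXpathTags or tag_value is None:
        return False
    # Mirrors "isTagValueNotEmpty(tag_value) is not None": only an empty list fails
    has_value = not (isinstance(tag_value, list) and len(tag_value) == 0)
    if (is_not_filled(result, tag_name) and has_value) or allow_not_filled:
        result.tags[tag_name] = {"name": tag_name, "data": tag_value}
        if is_default and tag_name not in result.defaultTags:
            result.defaultTags.append(tag_name)
        return True
    return False


result = FakeResult()
first = add_tag(result, "title", "First")        # tag is new, so it is written
second = add_tag(result, "title", "Second")      # tag already filled, so it is kept
kept = result.tags["title"]["data"]
forced = add_tag(result, "title", "Third", allow_not_filled=True)  # overwrite forced
result.blockedByXpathTags.append("author")
blocked = add_tag(result, "author", "Ann")       # blocked tags are skipped
empty = add_tag(result, "media", [])             # an empty list is not a value
```

Under these assumptions the first extractor to fill a tag wins unless a later call passes allowNotFilled=True.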
 
 

◆ calculateMetrics()

def dc_processor.base_extractor.BaseExtractor.calculateMetrics	(	self,
		response
	)

Definition at line 331 of file base_extractor.py.

   def calculateMetrics(self, response):
     try:
       for metric in response.metrics:
         logger.debug("response.tags:\n%s\nmetric:\n%s", varDump(response.tags), varDump(metric))
         metric.calculateMetricValue(response.tags)
     except Exception as err:
       ExceptionLog.handler(logger, err, CONSTS.MSG_ERROR_CALC_METRICS)
       raise
 
 

◆ isTagNotFilled()

def dc_processor.base_extractor.BaseExtractor.isTagNotFilled	(	self,
		result,
		tagName
	)

Definition at line 217 of file base_extractor.py.

   def isTagNotFilled(self, result, tagName):
     ret = True
     if tagName in result.tags:
       if isinstance(result.tags[tagName], basestring):
         ret = (result.tags[tagName].strip() == "")
       elif isinstance(result.tags[tagName], list):
         if len(result.tags[tagName]) > 0:
           ret = False
       elif isinstance(result.tags[tagName], dict):
         if "data" in result.tags[tagName]:
           if isinstance(result.tags[tagName]["data"], basestring):
             ret = (result.tags[tagName]["data"].strip() == "")
           elif isinstance(result.tags[tagName]["data"], list):
             for elem in result.tags[tagName]["data"]:
               ret = (elem.strip() == "")
               if not ret:
                 break
 
     return ret
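
The branches of isTagNotFilled are easier to compare side by side. The sketch below is a Python 3 mirror of the listing that takes a plain dict instead of a Result object (str replaces basestring); the function name and sample tags are hypothetical.

```python
def is_tag_not_filled(tags, tag_name):
    """Return True when the tag is absent or holds only blank content."""
    ret = True
    if tag_name in tags:
        entry = tags[tag_name]
        if isinstance(entry, str):
            ret = entry.strip() == ""
        elif isinstance(entry, list):
            # A top-level list counts as filled as soon as it has any element
            ret = len(entry) == 0
        elif isinstance(entry, dict) and "data" in entry:
            data = entry["data"]
            if isinstance(data, str):
                ret = data.strip() == ""
            elif isinstance(data, list):
                # A "data" list is filled only if some element is non-blank
                ret = all(elem.strip() == "" for elem in data)
    return ret


# Hypothetical sample tags
tags = {
    "title": "  ",
    "media": [""],
    "link": {"data": "http://example.com"},
    "keywords": {"data": ["", " "]},
    "author": {"data": ["", "Ann"]},
    "guid": {"name": "guid"},
}
status = {name: is_tag_not_filled(tags, name) for name in list(tags) + ["missing"]}
```

Note the asymmetry: a top-level list containing one blank string is treated as filled, while the same list under "data" is treated as not filled.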
 
 

◆ isTagValueNotEmpty()

def dc_processor.base_extractor.BaseExtractor.isTagValueNotEmpty	(	self,
		tagValue
	)

Definition at line 240 of file base_extractor.py.

   def isTagValueNotEmpty(self, tagValue):
     full = None
     if isinstance(tagValue, list):
       if len(tagValue) == 0:
         full = None
       else:
         full = tagValue
     else:
       full = tagValue
     return full
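
Callers compare the return value of isTagValueNotEmpty against None, so it helps to see which inputs count as empty. A minimal standalone mirror (the function name is hypothetical):

```python
def is_tag_value_not_empty(tag_value):
    """Return None for an empty list, otherwise the value itself."""
    if isinstance(tag_value, list) and len(tag_value) == 0:
        return None
    return tag_value


# How each input fares under the "is not None" test used by addTag
checks = {
    "empty_list": is_tag_value_not_empty([]) is not None,
    "list": is_tag_value_not_empty(["a"]) is not None,
    "empty_string": is_tag_value_not_empty("") is not None,
    "none": is_tag_value_not_empty(None) is not None,
}
```

Only an empty list and None are reported as empty; an empty string is returned unchanged and therefore passes the check.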
 
 

◆ loadScraperProperties()

def dc_processor.base_extractor.BaseExtractor.loadScraperProperties	(	self,
		scraperPropFileName
	)

Definition at line 205 of file base_extractor.py.

   def loadScraperProperties(self, scraperPropFileName):
     if scraperPropFileName is not None:
       try:
         with open(scraperPropFileName, "rb") as fd:
           scraperProperies = json.loads(fd.read())
           self.properties = scraperProperies[self.__class__.__name__][CONSTS.PROPERTIES_KEY]
       except Exception as excp:
         logger.debug(">>> Some error with scraper property loads = " + str(excp))
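
The lookup above implies a property file whose top level is keyed by extractor class name. The sketch below writes such a file and reads it back. The value of CONSTS.PROPERTIES_KEY is not shown in this page, so "properties" is an assumption, as are the function name and the sample content.

```python
import json
import os
import tempfile

PROPERTIES_KEY = "properties"  # assumption: stand-in for CONSTS.PROPERTIES_KEY


def load_scraper_properties(file_name, class_name):
    """Return the properties block for class_name, or None on any failure."""
    try:
        with open(file_name, "r") as fd:
            scraper_properties = json.load(fd)
        return scraper_properties[class_name][PROPERTIES_KEY]
    except (OSError, ValueError, KeyError, TypeError):
        return None  # the original logs the error and leaves self.properties as is


# Hypothetical property file: one section per extractor class
sample = {"BaseExtractor": {PROPERTIES_KEY: {"rank": 5}}}
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as tmp:
    json.dump(sample, tmp)
    path = tmp.name

loaded = load_scraper_properties(path, "BaseExtractor")
missing = load_scraper_properties(path, "UnknownExtractor")
os.remove(path)
```

Because the real method uses self.__class__.__name__, each subclass reads its own section of the same file.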
 
 

◆ rankReading()

def dc_processor.base_extractor.BaseExtractor.rankReading	(	self,
		exctractorName
	)

Definition at line 343 of file base_extractor.py.

   def rankReading(self, exctractorName):
     wasSet = False
     if self.processorProperties is not None and exctractorName is not None and \
     CONSTS.RANK_KEY in self.processorProperties:
       try:
         rankProp = json.loads(self.processorProperties[CONSTS.RANK_KEY])
         if exctractorName in rankProp:
           self.rank = rankProp[exctractorName]
           wasSet = True
       except Exception:
         logger.debug(">>> Wrong json string in processorProperties[\"%s\"]", CONSTS.RANK_KEY)
 
     if not wasSet and self.properties is not None and CONSTS.RANK_KEY in self.properties:
       self.rank = self.properties[CONSTS.RANK_KEY]
 
     logger.debug(">>> Rank is : %s", str(self.rank))
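
The precedence that rankReading appears to aim for can be sketched as a pure function. This assumes the rank entry in processorProperties is a JSON string mapping extractor names to ranks, which is what the log message suggests. RANK_KEY, DEFAULT_RANK and the function name are hypothetical stand-ins for CONSTS.RANK_KEY, CONSTS.SCRAPER_RANK_INIT and the method.

```python
import json

RANK_KEY = "rank"   # assumption: stand-in for CONSTS.RANK_KEY
DEFAULT_RANK = 10   # assumption: stand-in for CONSTS.SCRAPER_RANK_INIT


def read_rank(processor_properties, file_properties, extractor_name):
    """Per-extractor rank from processor properties wins over the file-level rank."""
    rank = DEFAULT_RANK
    was_set = False
    if processor_properties and extractor_name and RANK_KEY in processor_properties:
        try:
            rank_prop = json.loads(processor_properties[RANK_KEY])
            if extractor_name in rank_prop:
                rank = rank_prop[extractor_name]
                was_set = True
        except (TypeError, ValueError):
            pass  # malformed JSON: fall through to the file-level rank
    if not was_set and file_properties and RANK_KEY in file_properties:
        rank = file_properties[RANK_KEY]
    return rank


# Hypothetical inputs
from_processor = read_rank({RANK_KEY: '{"NewsExtractor": 3}'}, {RANK_KEY: 7}, "NewsExtractor")
from_file = read_rank({RANK_KEY: '{"Other": 3}'}, {RANK_KEY: 7}, "NewsExtractor")
from_default = read_rank({RANK_KEY: "not json"}, None, "NewsExtractor")
```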
 

◆ tagValueElemValidate()

def dc_processor.base_extractor.BaseExtractor.tagValueElemValidate	(	self,
		tagValueElem,
		conditionElem
	)

Definition at line 254 of file base_extractor.py.

   def tagValueElemValidate(self, tagValueElem, conditionElem):
     ret = True
     if conditionElem["type"] == "include":
       ret = False
       if re.compile(conditionElem["RE"]).match(tagValueElem) is not None:
         ret = True
     elif conditionElem["type"] == "exclude":
       if re.compile(conditionElem["RE"]).match(tagValueElem) is not None:
         ret = False
     return ret
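
A standalone mirror of the include/exclude check makes its matching rules concrete. The function name and the sample conditions are hypothetical; the logic follows the listing above.

```python
import re


def tag_value_elem_validate(tag_value_elem, condition_elem):
    """Apply one include/exclude condition to a single value."""
    matched = re.compile(condition_elem["RE"]).match(tag_value_elem) is not None
    if condition_elem["type"] == "include":
        return matched       # keep only values that match
    if condition_elem["type"] == "exclude":
        return not matched   # drop values that match
    return True              # any other condition type accepts the value


# Hypothetical conditions
include_http = {"type": "include", "RE": r"https?://"}
exclude_tracking = {"type": "exclude", "RE": r".*utm_"}

results = [
    tag_value_elem_validate("http://example.com/a.png", include_http),
    tag_value_elem_validate("ftp://example.com/a.png", include_http),
    tag_value_elem_validate("http://example.com/?utm_source=x", exclude_tracking),
    tag_value_elem_validate("see http://example.com", include_http),
]
```

Since re.match anchors at the start of the string, the last call is rejected even though the text contains a URL; a pattern meant to match anywhere needs a leading `.*`.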
 
 

◆ tagValueValidate()

def dc_processor.base_extractor.BaseExtractor.tagValueValidate	(	self,
		tagName,
		tagValue
	)

Definition at line 268 of file base_extractor.py.

   def tagValueValidate(self, tagName, tagValue):
     ret = tagValue
     if self.tagsValidator is not None and self.name in self.tagsValidator and tagName in self.tagsValidator[self.name]:
       try:
         if isinstance(tagValue, list):
           ret = []
           for elem in tagValue:
             if self.tagValueElemValidate(elem, self.tagsValidator[self.name][tagName]):
               ret.append(elem)
           if len(ret) == 0:
             ret = None
         elif isinstance(tagValue, basestring):
           if not self.tagValueElemValidate(tagValue, self.tagsValidator[self.name][tagName]):
             ret = None
       except Exception as excp:
         ExceptionLog.handler(logger, excp, '>>> something wrong in tagValueValidate method', (), \
                            {ExceptionLog.LEVEL_NAME_ERROR:ExceptionLog.LEVEL_VALUE_DEBUG})
     return ret
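
The lookups in tagValueValidate imply a validator layout keyed first by the extractor's name attribute and then by tag name. The sketch below shows that layout with a hypothetical rule set and function names. It is a Python 3 mirror without the logging and exception handling of the original.

```python
import re


def elem_passes(value, condition):
    """Single-value include/exclude check, as in tagValueElemValidate."""
    matched = re.match(condition["RE"], value) is not None
    if condition["type"] == "include":
        return matched
    if condition["type"] == "exclude":
        return not matched
    return True


def tag_value_validate(tags_validator, extractor_name, tag_name, tag_value):
    """Filter a tag value; None means nothing survived validation."""
    rules = (tags_validator or {}).get(extractor_name, {})
    if tag_name not in rules:
        return tag_value  # no rule for this extractor and tag: pass through
    condition = rules[tag_name]
    if isinstance(tag_value, list):
        kept = [elem for elem in tag_value if elem_passes(elem, condition)]
        return kept if kept else None
    if isinstance(tag_value, str):
        return tag_value if elem_passes(tag_value, condition) else None
    return tag_value


# Hypothetical validator: drop GIF images from the "media" tag
validator = {"Base extractor": {"media": {"type": "exclude", "RE": r".*\.gif$"}}}

filtered = tag_value_validate(validator, "Base extractor", "media", ["a.gif", "b.png"])
all_rejected = tag_value_validate(validator, "Base extractor", "media", ["a.gif"])
untouched = tag_value_validate(validator, "Base extractor", "title", "Hello")
other_extractor = tag_value_validate(validator, "Other", "media", ["a.gif"])
```

The rules are looked up under self.name (for example "Base extractor"), not under the class name that loadScraperProperties uses.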
 
 

Member Data Documentation

◆ config

dc_processor.base_extractor.BaseExtractor.config

Definition at line 162 of file base_extractor.py.

◆ data

dc_processor.base_extractor.BaseExtractor.data

Definition at line 177 of file base_extractor.py.

◆ db_dc_scraper_db

dc_processor.base_extractor.BaseExtractor.db_dc_scraper_db

Definition at line 178 of file base_extractor.py.

◆ DBConnector

dc_processor.base_extractor.BaseExtractor.DBConnector

Definition at line 179 of file base_extractor.py.

◆ imgDelimiter

dc_processor.base_extractor.BaseExtractor.imgDelimiter

Definition at line 181 of file base_extractor.py.

◆ modules

dc_processor.base_extractor.BaseExtractor.modules

Definition at line 175 of file base_extractor.py.

◆ name

dc_processor.base_extractor.BaseExtractor.name

Definition at line 170 of file base_extractor.py.

◆ process_mode

dc_processor.base_extractor.BaseExtractor.process_mode

Definition at line 174 of file base_extractor.py.

◆ processorProperties

dc_processor.base_extractor.BaseExtractor.processorProperties

Definition at line 163 of file base_extractor.py.

◆ properties

dc_processor.base_extractor.BaseExtractor.properties = None

static

Definition at line 103 of file base_extractor.py.

◆ rank

dc_processor.base_extractor.BaseExtractor.rank

Definition at line 171 of file base_extractor.py.

◆ tag

dictionary dc_processor.base_extractor.BaseExtractor.tag

static

Initial value:

=  {CONSTS.TAG_MEDIA: adjustMedia,
         CONSTS.TAG_CONTENT_UTF8_ENCODED: adjustContentUTF8Encoded,
         CONSTS.TAG_PUB_DATE: adjustPubDate,
         CONSTS.TAG_TITLE: adjustNone,
         CONSTS.TAG_LINK: adjustLink,
         CONSTS.TAG_DESCRIPTION: adjustNone,
         CONSTS.TAG_DC_DATE: adjustNone,
         CONSTS.TAG_AUTHOR: adjustNone,
         CONSTS.TAG_GUID: adjustNone,
         CONSTS.TAG_KEYWORDS: adjustNone,
         CONSTS.TAG_MEDIA_THUMBNAIL: adjustNone,
         CONSTS.TAG_ENCLOSURE: adjustNone,
         CONSTS.TAG_MEDIA_CONTENT: adjustNone,
         CONSTS.TAG_GOOGLE: adjustNone,
         CONSTS.TAG_GOOGLE_TOTAL: adjustNone,
         CONSTS.HTML_LANG: adjustNone
        }

Definition at line 105 of file base_extractor.py.
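
The tag dictionary acts as a dispatch table from tag name to adjustment function, which addTag applies when callAdjustment is true. A minimal sketch of that pattern follows. The adjuster bodies and tag names are hypothetical, since the real adjust* functions are not shown in this page.

```python
def adjust_none(value):
    """Hypothetical no-op adjuster: returns the value unchanged."""
    return value


def adjust_link(value):
    """Hypothetical adjuster: trims surrounding whitespace from a link."""
    return value.strip()


# Dispatch table keyed by tag name (stand-in for BaseExtractor.tag)
TAG_ADJUSTERS = {
    "title": adjust_none,
    "link": adjust_link,
}


def apply_adjustment(tag_name, tag_value):
    """Run the adjuster registered for tag_name; keep the value if that fails."""
    try:
        return TAG_ADJUSTERS[tag_name](tag_value)
    except Exception:
        # As in addTag, an unknown tag or a failing adjuster leaves the value as is
        return tag_value


adjusted_link = apply_adjustment("link", "  http://example.com ")
unknown_tag = apply_adjustment("custom", "value")
```

One consequence of this design is that tags missing from the table are still stored, only without adjustment.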

◆ tagsMask

dictionary dc_processor.base_extractor.BaseExtractor.tagsMask

static

Definition at line 124 of file base_extractor.py.

◆ tagsValidator

dc_processor.base_extractor.BaseExtractor.tagsValidator

Definition at line 184 of file base_extractor.py.

The documentation for this class was generated from the following file:

sources/hce/dc_processor/base_extractor.py
