HCE Project Python language Distributed Tasks Manager Application, Distributed Crawler Application and client API bindings.  2.0.0-chaika
Hierarchical Cluster Engine Python language binding
dc.SitesManager.SitesManager Class Reference
Inheritance diagram for dc.SitesManager.SitesManager:
Collaboration diagram for dc.SitesManager.SitesManager:

Public Member Functions

def __init__ (self, configParser, connectionBuilderLight=None)
 
def on_poll_timeout (self)
 
def onEventsHandler (self, event)
 
def eventsHandlerTS (self, event, loggingObj)
 
def getDRCEConnectionParamsFromPool (self, eventType)
 
def prepareDRCERequest (self, eventType, eventObj)
 
def processDRCERequest (self, taskExecuteStruct, persistentDCREConnection=True, timeout=-1, connectionParams=None)
 
def sendToDRCERouter (self, request, persistentDCREConnection=True, timeout=-1, connectionParams=None)
 
def processRecrawling (self, loggingObj)
 
def getSitesFromClientResponseItems (self, clientResponseItems)
 
def recalculateRecrawlPeriod (self, siteObj)
 
def logGeneralResponseResults (self, clientResponse)
 
def fixFields (self, event, nodes)
 
def fixField (self, value, divider, comment)
 
- Public Member Functions inherited from app.BaseServerManager.BaseServerManager
def __init__ (self, poller_manager=None, admin_connection=None, conectionLightBuilder=None, exceptionForward=False, dumpStatVars=True)
 constructor More...
 
def addConnection (self, name, connection)
 
def setEventHandler (self, eventType, handler)
 set event handler rewrite the current handler for eventType More...
 
def send (self, connect_name, event)
 send event More...
 
def reply (self, event, reply_event)
 wrapper for sending event in reply for event More...
 
def poll (self)
 poll function polling connections receive as multipart msg, the second argument is pickled pyobj More...
 
def process (self, event)
 process event call the event handler method that was set by user or on_unhandled_event method if not set More...
 
def run (self)
 
def is_connection_registered (self, name)
 check is a connection was registered in a instance of BaseServerManager i object More...
 
def on_poll_timeout (self)
 function will call every time when ConnectionTimeout exception arrive More...
 
def on_unhandled_event (self, event)
 function will call every time when arrive doesn't set handler for event type of event.evenType More...
 
def build_poller_list (self)
 
def clear_poller (self)
 
def onAdminState (self, event)
 onAdminState event handler process admin SHUTDOWN command More...
 
def onAdminFetchStatData (self, event)
 onAdminState event handler process admin command More...
 
def onAdminSuspend (self, event)
 onAdminState event handler process admin command More...
 
def getStatDataFields (self, fields)
 getStatDataFields returns stat data from storage More...
 
def getSystemStat (self)
 getSystemStat returns stat data for system indicators: RAMV, RAMR and CPU More...
 
def getConfigVarsFields (self, fields)
 getConfigVarsFields returns config vars from storage More...
 
def onAdminGetConfigVars (self, event)
 onAdminGetConfigVars event handler process getConfigVars admin command, fill and return config vars array from internal storage More...
 
def onAdminSetConfigVars (self, event)
 onAdminSetConfigVars event handler process setConfigVars admin command More...
 
def setConfigVars (self, setConfigVars)
 processSetConfigVars sets config vars in storage More...
 
def sendAdminReadyEvent (self)
 send ready event to notify adminInterfaceService More...
 
def createLogMsg (self, event)
 from string message from event object More...
 
def initStatFields (self, connect_name)
 add record in statFields More...
 
def updateStatField (self, field_name, value, operation=STAT_FIELDS_OPERATION_ADD)
 update values of stat field - default sum More...
 
def processSpecialConfigVars (self, name, value)
 send ready event to notify adminInterfaceService More...
 
def getLogLevel (self)
 Get log level from first of existing loggers. More...
 
def setLogLevel (self, level)
 Set log level for all loggers. More...
 
def saveStatVarsDump (self)
 Save stat vars in json file. More...
 
def loadStatVarsDump (self)
 Load stat vars in json file. More...
 
def getStatVarsDumpFileName (self)
 Get stat vars file name. More...
 
def createDBIDict (self, configParser)
 

Public Attributes

 serverName
 
 DRCEDBAppName
 
 eventTypes
 
 drceHost
 
 drcePort
 
 drceManager
 
 drceIdGenerator
 
 drceCommandConvertor
 
 processRecrawlLastTs
 
 recrawlSiteslQueue
 
 drceConnectionsPool
 
- Public Attributes inherited from app.BaseServerManager.BaseServerManager
 dumpStatVars
 
 poller_manager
 
 eventBuilder
 
 exit_flag
 
 pollTimeout
 
 connections
 
 event_handlers
 
 statFields
 stat fields container More...
 
 configVars
 
 exceptionForward
 

Static Public Attributes

int DRCE_REDUCER_TTL = 3000000
 
string SITE_PROPERTIES_RECRAWL_WHERE_NAME = "RECRAWL_WHERE"
 
string SITE_PROPERTIES_RECRAWL_DELETE_WHERE_NAME = "RECRAWL_DELETE_WHERE"
 
string SITE_PROPERTIES_RECRAWL_DELETE_NAME = "RECRAWL_DELETE"
 
string SITE_PROPERTIES_RECRAWL_OPTIMIZE_NAME = "RECRAWL_OPTIMIZE"
 
string SITE_PROPERTIES_RECRAWL_PERIOD_MODE_NAME = "RECRAWL_PERIOD_MODE"
 
string SITE_PROPERTIES_RECRAWL_PERIOD_MIN_NAME = "RECRAWL_PERIOD_MIN"
 
string SITE_PROPERTIES_RECRAWL_PERIOD_MAX_NAME = "RECRAWL_PERIOD_MAX"
 
string SITE_PROPERTIES_RECRAWL_PERIOD_STEP_NAME = "RECRAWL_PERIOD_STEP"
 
string SITE_RECRAWL_THREAD_NAME_PREFIX = 'ReCrawl_'
 
string CONFIG_SERVER = "server"
 
string CONFIG_DRCE_HOST = "DRCEHost"
 
string CONFIG_DRCE_PORT = "DRCEPort"
 
string CONFIG_DRCE_TIMEOUT = "DRCETimeout"
 
string CONFIG_DRCE_DB_APP_NAME = "DRCEDBAppName"
 
string CONFIG_RECRAWL_SITES_MAX = "RecrawlSiteMax"
 
string CONFIG_RECRAWL_SITES_ITER_PERIOD = "RecrawlSiteIterationPeriod"
 
string CONFIG_RECRAWL_SITES_PERIOD_MODE = "RecrawlSitePeriodMode"
 
string CONFIG_RECRAWL_SITES_PERIOD_MIN = "RecrawlSitePeriodMin"
 
string CONFIG_RECRAWL_SITES_PERIOD_MAX = "RecrawlSitePeriodMax"
 
string CONFIG_RECRAWL_SITES_PERIOD_STEP = "RecrawlSitePeriodStep"
 
string CONFIG_RECRAWL_SITES_RECRAWL_DATE_EXP = "RecrawlSiteRecrawlDateExpression"
 
string CONFIG_RECRAWL_SITES_SELECT_CRITERION = "RecrawlSiteSelectCriterion"
 
string CONFIG_RECRAWL_SITES_SELECT_ORDER = "RecrawlSiteSelectOrder"
 
string CONFIG_RECRAWL_SITES_MAX_THREADS = "RecrawlSiteMaxThreads"
 
string CONFIG_RECRAWL_SITES_LOCK_STATE = "RecrawlSiteLockState"
 
string CONFIG_RECRAWL_SITES_OPTIMIZE = "RecrawlSiteOptimize"
 
string CONFIG_RECRAWL_SITES_DRCE_TIMEOUT = "RecrawlSiteDRCETimeout"
 
string CONFIG_RECRAWL_SITES_MODE = "RecrawlSiteMode"
 
string CONFIG_RECRAWL_DELAY_BEFORE = "RecrawlDelayBefore"
 
string CONFIG_RECRAWL_DELAY_AFTER = "RecrawlDelayAfter"
 
string CONFIG_POLLING_TIMEOUT = "PollingTimeout"
 
string CONFIG_DEFAULT_RECRAWL_UPDATE_CRITERION = "DefaultRecrawUpdatelCriterion"
 
string CONFIG_DEFAULT_RECRAWL_DELETE_OLD = "DefaultRecrawDeleteOld"
 
string CONFIG_DEFAULT_RECRAWL_DELETE_OLD_CRITERION = "DefaultRecrawDeleteOldCriterion"
 
string CONFIG_DRCE_ROUTE = "DRCERoute"
 
string CONFIG_PURGE_METHOD = "PurgeMethod"
 
string CONFIG_DRCE_NODES = "DRCENodes"
 
string CONFIG_COMMON_COMMANDS_THREADING_MODE = "CommonCommandsThreadingMode"
 
string DRCE_CONNECTIONS_POOL = "DRCEConnectionsPool"
 
int COMMON_COMMANDS_THREADING_SIMPLE = 0
 
int COMMON_COMMANDS_THREADING_MULTI = 1
 
string COMMON_COMMANDS_THREAD_NAME_PREFIX = 'Common_'
 
- Static Public Attributes inherited from app.BaseServerManager.BaseServerManager
string ADMIN_CONNECT_ENDPOINT = "Admin"
 
string ADMIN_CONNECT_CLIENT = "Admin"
 
int POLL_TIMEOUT_DEFAULT = 3000
 
int STAT_FIELDS_OPERATION_ADD = 0
 
int STAT_FIELDS_OPERATION_SUB = 1
 
int STAT_FIELDS_OPERATION_SET = 2
 
int STAT_FIELDS_OPERATION_INIT = 3
 
string POLL_TIMEOUT_CONFIG_VAR_NAME = "POLL_TIMEOUT"
 
string LOG_LEVEL_CONFIG_VAR_NAME = "LOG_LEVEL"
 
string STAT_DUMPS_DEFAULT_DIR = "/tmp/"
 
string STAT_DUMPS_DEFAULT_NAME = "%APP_NAME%_%CLASS_NAME%_stat_vars.dump"
 
dictionary LOGGERS_NAMES = {APP_CONSTS.LOGGER_NAME, "dc", "dtm", "root", ""}
 

Detailed Description

Definition at line 58 of file SitesManager.py.

Constructor & Destructor Documentation

◆ __init__()

def dc.SitesManager.SitesManager.__init__ (   self,
  configParser,
  connectionBuilderLight = None 
)

Definition at line 114 of file SitesManager.py.

114  def __init__(self, configParser, connectionBuilderLight=None):
115  super(SitesManager, self).__init__()
116 
117  # Sites re-crawl counter name for stat variables
118  self.updateStatField(DC_CONSTS.SITES_RECRAWL_COUNTER_NAME, 0, self.STAT_FIELDS_OPERATION_INIT)
119  # Sites re-crawl sites updated counter name for stat variables
120  self.updateStatField(DC_CONSTS.SITES_RECRAWL_UPDATED_COUNTER_NAME, 0, self.STAT_FIELDS_OPERATION_INIT)
121  # Sites re-crawl sites deleted counter name for stat variables
122  self.updateStatField(DC_CONSTS.SITES_RECRAWL_DELETED_COUNTER_NAME, 0, self.STAT_FIELDS_OPERATION_INIT)
123  # Sites DRCE requests counter name for stat variables
124  self.updateStatField(DC_CONSTS.SITES_DRCE_COUNTER_NAME, 0, self.STAT_FIELDS_OPERATION_INIT)
125 
126  # Instantiate the connection builder light if not set
127  if connectionBuilderLight is None:
128  connectionBuilderLight = ConnectionBuilderLight()
129 
130  className = self.__class__.__name__
131 
132  # Get configuration settings
133  self.serverName = configParser.get(className, self.CONFIG_SERVER)
134  self.DRCEDBAppName = configParser.get(className, self.CONFIG_DRCE_DB_APP_NAME)
135 
136  # Create connections and raise bind or connect actions for correspondent connection type
137  serverConnection = connectionBuilderLight.build(TRANSPORT_CONSTS.SERVER_CONNECT, self.serverName)
138 
139  # Add connections to the polling set
140  self.addConnection(self.serverName, serverConnection)
141 
142  # Set handlers for all events of SITE and URL operations
143  self.eventTypes = {EVENT_TYPES.SITE_NEW:EVENT_TYPES.SITE_NEW_RESPONSE,
144  EVENT_TYPES.SITE_UPDATE:EVENT_TYPES.SITE_UPDATE_RESPONSE,
145  EVENT_TYPES.SITE_STATUS:EVENT_TYPES.SITE_STATUS_RESPONSE,
146  EVENT_TYPES.SITE_DELETE:EVENT_TYPES.SITE_DELETE_RESPONSE,
147  EVENT_TYPES.SITE_CLEANUP:EVENT_TYPES.SITE_CLEANUP_RESPONSE,
148  EVENT_TYPES.SITE_FIND:EVENT_TYPES.SITE_FIND_RESPONSE,
149  EVENT_TYPES.URL_NEW:EVENT_TYPES.URL_NEW_RESPONSE,
150  EVENT_TYPES.URL_STATUS:EVENT_TYPES.URL_STATUS_RESPONSE,
151  EVENT_TYPES.URL_UPDATE:EVENT_TYPES.URL_UPDATE_RESPONSE,
152  EVENT_TYPES.URL_DELETE:EVENT_TYPES.URL_DELETE_RESPONSE,
153  EVENT_TYPES.URL_FETCH:EVENT_TYPES.URL_FETCH_RESPONSE,
154  EVENT_TYPES.URL_CLEANUP:EVENT_TYPES.URL_CLEANUP_RESPONSE,
155  EVENT_TYPES.URL_CONTENT:EVENT_TYPES.URL_CONTENT_RESPONSE,
156  EVENT_TYPES.SQL_CUSTOM:EVENT_TYPES.SQL_CUSTOM_RESPONSE,
157  EVENT_TYPES.URL_PUT:EVENT_TYPES.URL_PUT_RESPONSE,
158  EVENT_TYPES.URL_HISTORY:EVENT_TYPES.URL_HISTORY_RESPONSE,
159  EVENT_TYPES.URL_STATS:EVENT_TYPES.URL_STATS_RESPONSE,
160  EVENT_TYPES.PROXY_NEW:EVENT_TYPES.PROXY_NEW_RESPONSE,
161  EVENT_TYPES.PROXY_UPDATE:EVENT_TYPES.PROXY_UPDATE_RESPONSE,
162  EVENT_TYPES.PROXY_DELETE:EVENT_TYPES.PROXY_DELETE_RESPONSE,
163  EVENT_TYPES.PROXY_STATUS:EVENT_TYPES.PROXY_STATUS_RESPONSE,
164  EVENT_TYPES.PROXY_FIND:EVENT_TYPES.PROXY_FIND_RESPONSE,
165  EVENT_TYPES.ATTR_SET:EVENT_TYPES.ATTR_SET_RESPONSE,
166  EVENT_TYPES.ATTR_UPDATE:EVENT_TYPES.ATTR_UPDATE_RESPONSE,
167  EVENT_TYPES.ATTR_DELETE:EVENT_TYPES.ATTR_DELETE_RESPONSE,
168  EVENT_TYPES.ATTR_FETCH:EVENT_TYPES.ATTR_FETCH_RESPONSE}
169  for ret in self.eventTypes:
170  self.setEventHandler(ret, self.onEventsHandler)
171 
172  # Initialize DRCE API
173  self.configVars[self.CONFIG_DRCE_TIMEOUT] = configParser.getint(className, self.CONFIG_DRCE_TIMEOUT)
174  self.configVars[self.CONFIG_DRCE_ROUTE] = configParser.get(className, self.CONFIG_DRCE_ROUTE)
175  try:
176  self.configVars[self.CONFIG_DRCE_NODES] = configParser.get(APP_CONSTS.CONFIG_APPLICATION_SECTION_NAME,
177  self.CONFIG_DRCE_NODES)
178  except ConfigParser.NoOptionError:
179  self.configVars[self.CONFIG_DRCE_NODES] = 1
180  self.drceHost = configParser.get(className, self.CONFIG_DRCE_HOST)
181  self.drcePort = configParser.get(className, self.CONFIG_DRCE_PORT)
182  hostParams = HostParams(self.drceHost, self.drcePort)
183  self.drceManager = DRCEManager()
184  self.drceManager.activate_host(hostParams)
185  self.drceIdGenerator = UIDGenerator()
186  self.drceCommandConvertor = CommandConvertor()
187 
188  # Init config vars storage for auto re-crawl
189  self.configVars[self.CONFIG_RECRAWL_SITES_MAX] = configParser.getint(className, self.CONFIG_RECRAWL_SITES_MAX)
190  self.configVars[self.CONFIG_RECRAWL_SITES_ITER_PERIOD] = \
191  configParser.getint(className, self.CONFIG_RECRAWL_SITES_ITER_PERIOD)
192  self.configVars[self.CONFIG_RECRAWL_SITES_RECRAWL_DATE_EXP] = \
193  configParser.get(className, self.CONFIG_RECRAWL_SITES_RECRAWL_DATE_EXP)
194  self.configVars[self.CONFIG_RECRAWL_SITES_SELECT_CRITERION] = \
195  configParser.get(className, self.CONFIG_RECRAWL_SITES_SELECT_CRITERION)
196  self.configVars[self.CONFIG_RECRAWL_SITES_SELECT_ORDER] = \
197  configParser.get(className, self.CONFIG_RECRAWL_SITES_SELECT_ORDER)
198  self.configVars[self.CONFIG_RECRAWL_SITES_LOCK_STATE] = \
199  configParser.getint(className, self.CONFIG_RECRAWL_SITES_LOCK_STATE)
200  self.configVars[self.CONFIG_RECRAWL_SITES_OPTIMIZE] = \
201  configParser.getint(className, self.CONFIG_RECRAWL_SITES_OPTIMIZE)
202  self.configVars[self.CONFIG_RECRAWL_SITES_DRCE_TIMEOUT] = \
203  configParser.getint(className, self.CONFIG_RECRAWL_SITES_DRCE_TIMEOUT)
204  self.configVars[self.CONFIG_RECRAWL_DELAY_BEFORE] = configParser.getint(className, self.CONFIG_RECRAWL_DELAY_BEFORE)
205  self.configVars[self.CONFIG_RECRAWL_DELAY_AFTER] = configParser.getint(className, self.CONFIG_RECRAWL_DELAY_AFTER)
206  self.processRecrawlLastTs = time.time()
207 
208  # Set connections poll timeout, defines period of HCE cluster monitoring cycle, msec
209  self.configVars[self.POLL_TIMEOUT_CONFIG_VAR_NAME] = configParser.getint(className, self.CONFIG_POLLING_TIMEOUT)
210 
211  # Init default re-crawl update criterion
212  self.configVars[self.CONFIG_DEFAULT_RECRAWL_UPDATE_CRITERION] = \
213  configParser.get(className, self.CONFIG_DEFAULT_RECRAWL_UPDATE_CRITERION)
214 
215  # Init default re-crawl delete old URLs operation
216  self.configVars[self.CONFIG_DEFAULT_RECRAWL_DELETE_OLD] = \
217  configParser.getint(className, self.CONFIG_DEFAULT_RECRAWL_DELETE_OLD)
218  # Init default re-crawl delete old URLs operation's criterion
219  self.configVars[self.CONFIG_DEFAULT_RECRAWL_DELETE_OLD_CRITERION] = \
220  configParser.get(className, self.CONFIG_DEFAULT_RECRAWL_DELETE_OLD_CRITERION)
221  # Init default re-crawl threads max number
222  self.configVars[self.CONFIG_RECRAWL_SITES_MAX_THREADS] = \
223  configParser.getint(className, self.CONFIG_RECRAWL_SITES_MAX_THREADS)
224  # Init re-crawl mode
225  self.configVars[self.CONFIG_RECRAWL_SITES_MODE] = \
226  configParser.getint(className, self.CONFIG_RECRAWL_SITES_MODE)
227 
228  # Init sites re-crawl queue
229  self.recrawlSiteslQueue = {}
230  # Init sites re-crawl threads counter
231  self.updateStatField(DC_CONSTS.RECRAWL_THREADS_COUNTER_QUEUE_NAME, 0, self.STAT_FIELDS_OPERATION_SET)
232  # Init sites re-crawl threads counter
233  self.updateStatField(DC_CONSTS.RECRAWL_SITES_QUEUE_NAME, 0, self.STAT_FIELDS_OPERATION_SET)
234  # Init sites re-crawl threads created counter
235  self.updateStatField(DC_CONSTS.RECRAWL_THREADS_CREATED_COUNTER_NAME, 0, self.STAT_FIELDS_OPERATION_SET)
236 
237  # Purge algorithm init
238  self.configVars[self.CONFIG_PURGE_METHOD] = configParser.getint(className, self.CONFIG_PURGE_METHOD)
239 
240  # Recrawl period mode
241  self.configVars[self.CONFIG_RECRAWL_SITES_PERIOD_MODE] = \
242  configParser.getint(className, self.CONFIG_RECRAWL_SITES_PERIOD_MODE)
243  self.configVars[self.CONFIG_RECRAWL_SITES_PERIOD_MIN] = \
244  configParser.getint(className, self.CONFIG_RECRAWL_SITES_PERIOD_MIN)
245  self.configVars[self.CONFIG_RECRAWL_SITES_PERIOD_MAX] = \
246  configParser.getint(className, self.CONFIG_RECRAWL_SITES_PERIOD_MAX)
247  self.configVars[self.CONFIG_RECRAWL_SITES_PERIOD_STEP] = \
248  configParser.getint(className, self.CONFIG_RECRAWL_SITES_PERIOD_STEP)
249 
250  # Init common operations variables
251  # Init common threads created counter
252  self.updateStatField(DC_CONSTS.COMMON_THREADS_CREATED_COUNTER_NAME, 0, self.STAT_FIELDS_OPERATION_SET)
253  # Init sites re-crawl threads counter
254  self.updateStatField(DC_CONSTS.COMMON_THREADS_COUNTER_QUEUE_NAME, 0, self.STAT_FIELDS_OPERATION_SET)
255  # Common operations counter name for stat variables
256  self.updateStatField(DC_CONSTS.COMMON_OPERATIONS_COUNTER_NAME, 0, self.STAT_FIELDS_OPERATION_INIT)
257 
258  # Init DRCE connections pool and event types / commands assignment
259  try:
260  self.drceConnectionsPool = json.loads(configParser.get(className, self.DRCE_CONNECTIONS_POOL))
261  except ConfigParser.NoOptionError:
262  self.drceConnectionsPool = None
263  except Exception as err:
264  logger.error("Error de-serialize json of connection parameters for DRCE connections pool: %s", err)
265  self.drceConnectionsPool = None
266 
267  # Init common commands threading model
268  try:
269  self.configVars[self.CONFIG_COMMON_COMMANDS_THREADING_MODE] = \
270  configParser.getint(className, self.CONFIG_COMMON_COMMANDS_THREADING_MODE)
271  except ConfigParser.NoOptionError:
272  self.configVars[self.CONFIG_COMMON_COMMANDS_THREADING_MODE] = self.COMMON_COMMANDS_THREADING_SIMPLE
273 
274 
275 
def __init__(self)
constructor
Definition: UIDGenerator.py:19
Here is the call graph for this function:

Member Function Documentation

◆ eventsHandlerTS()

def dc.SitesManager.SitesManager.eventsHandlerTS (   self,
  event,
  loggingObj 
)

Definition at line 344 of file SitesManager.py.

344  def eventsHandlerTS(self, event, loggingObj):
345  global logger # pylint: disable=W0603
346 
347  lock.acquire()
348  logger = loggingObj.getLogger(DC_CONSTS.LOGGER_NAME)
349  if self.configVars[self.CONFIG_COMMON_COMMANDS_THREADING_MODE] == self.COMMON_COMMANDS_THREADING_MULTI:
350  self.updateStatField(DC_CONSTS.COMMON_THREADS_COUNTER_QUEUE_NAME, 1, self.STAT_FIELDS_OPERATION_ADD)
351  self.updateStatField(DC_CONSTS.COMMON_OPERATIONS_COUNTER_NAME, 1, self.STAT_FIELDS_OPERATION_ADD)
352  # Fix some fields values (limits) of event object using the nodes number if >1
353  if self.configVars[self.CONFIG_DRCE_NODES] > 1:
354  self.fixFields(event, self.configVars[self.CONFIG_DRCE_NODES])
355  logger.debug("Request event:\n" + Utils.varDump(event))
356  # Prepare DRCE request
357  drceRequest = self.prepareDRCERequest(event.eventType, event.eventObj)
358  if self.configVars[self.CONFIG_COMMON_COMMANDS_THREADING_MODE] == self.COMMON_COMMANDS_THREADING_MULTI:
359  connectionParams, timeout = self.getDRCEConnectionParamsFromPool(event.eventType)
360  persistentConnection = False
361  else:
362  connectionParams = None
363  timeout = -1
364  persistentConnection = True
365  lock.release()
366 
367  # Send DRCE request
368  clientResponseObj = self.processDRCERequest(drceRequest, persistentConnection, timeout, connectionParams)
369  logger.debug("Response ClientResponseObj:\n" + Utils.varDump(clientResponseObj))
370 
371  lock.acquire()
372  # Prepare reply event
373  replyEvent = self.eventBuilder.build(self.eventTypes[event.eventType], clientResponseObj)
374 
375  # Append source event cookies because they will be copied to reply event
376  if event.eventType == EVENT_TYPES.URL_CONTENT:
377  if event.cookie is None:
378  event.cookie = {}
379  if isinstance(event.cookie, dict):
380  event.cookie[EventObjects.URLFetch.CRITERION_ORDER] = []
381  for urlContentRequestItem in event.eventObj:
382  if urlContentRequestItem.urlFetch is not None and\
383  EventObjects.URLFetch.CRITERION_ORDER in urlContentRequestItem.urlFetch.urlsCriterions:
384  event.cookie[EventObjects.URLFetch.CRITERION_ORDER].append(
385  urlContentRequestItem.urlFetch.urlsCriterions[EventObjects.URLFetch.CRITERION_ORDER]) # pylint: disable=C0330
386  else:
387  event.cookie[EventObjects.URLFetch.CRITERION_ORDER].append("")
388  if len(event.cookie) == 0:
389  event.cookie = None
390 
391  if event.eventType == EVENT_TYPES.URL_FETCH:
392  if event.cookie is None:
393  event.cookie = {}
394  if isinstance(event.cookie, dict):
395  event.cookie[EventObjects.URLFetch.CRITERION_ORDER] = []
396  for urlFetchRequestItem in event.eventObj:
397  if EventObjects.URLFetch.CRITERION_ORDER in urlFetchRequestItem.urlsCriterions:
398  event.cookie[EventObjects.URLFetch.CRITERION_ORDER].append(
399  urlFetchRequestItem.urlsCriterions[EventObjects.URLFetch.CRITERION_ORDER]) # pylint: disable=C0330
400  else:
401  event.cookie[EventObjects.URLFetch.CRITERION_ORDER].append("")
402  if len(event.cookie) == 0:
403  event.cookie = None
404 
405  # Send reply
406  self.reply(event, replyEvent)
407  logger.info("Reply sent")
408  lock.release()
409 
410  lock.acquire()
411  if self.configVars[self.CONFIG_COMMON_COMMANDS_THREADING_MODE] == self.COMMON_COMMANDS_THREADING_MULTI:
412  self.updateStatField(DC_CONSTS.COMMON_THREADS_COUNTER_QUEUE_NAME, 1, self.STAT_FIELDS_OPERATION_SUB)
413  lock.release()
414 
415 
416 
Here is the call graph for this function:
Here is the caller graph for this function:

◆ fixField()

def dc.SitesManager.SitesManager.fixField (   self,
  value,
  divider,
  comment 
)

Definition at line 939 of file SitesManager.py.

939  def fixField(self, value, divider, comment):
940  if value is not None:
941  d = int(divider)
942  v = int(value)
943 
944  ret = int(int(v) / int(d))
945 
946  if ret < 1 and v > 0:
947  ret = 1
948 
949  if ret * d < v:
950  ret = ret + 1
951 
952  logger.debug("Initial value of field `%s` from %s was fixed to %s, divider %s", comment, str(value), str(ret),
953  str(divider))
954  else:
955  ret = value
956 
957  return ret
958 
959 
Here is the caller graph for this function:

◆ fixFields()

def dc.SitesManager.SitesManager.fixFields (   self,
  event,
  nodes 
)

Definition at line 922 of file SitesManager.py.

922  def fixFields(self, event, nodes):
923  try:
924  if (event.eventType == DC_CONSTS.EVENT_TYPES.SITE_NEW and isinstance(event.eventObj, EventObjects.Site)) or\
925  (event.eventType == DC_CONSTS.EVENT_TYPES.SITE_UPDATE and isinstance(event.eventObj, EventObjects.SiteUpdate)):
926  fieldsList = ["maxURLs", "maxResources", "maxErrors"]
927  for fieldName in fieldsList:
928  setattr(event.eventObj, fieldName, self.fixField(getattr(event.eventObj, fieldName), nodes, fieldName))
929  except Exception, e:
930  logger.error("Error %s", str(e))
931 
932 
933 
Here is the call graph for this function:
Here is the caller graph for this function:

◆ getDRCEConnectionParamsFromPool()

def dc.SitesManager.SitesManager.getDRCEConnectionParamsFromPool (   self,
  eventType 
)

Definition at line 421 of file SitesManager.py.

421  def getDRCEConnectionParamsFromPool(self, eventType):
422  ret = (None, -1)
423 
424  if self.drceConnectionsPool is not None:
425  try:
426  for item in self.drceConnectionsPool:
427  if eventType in self.drceConnectionsPool[item]:
428  parts = item.split(':')
429  if len(parts) > 2:
430  ret = ((parts[0], int(parts[1])), int(parts[2]))
431  logger.info("Connection options found for event %s: %s", str(eventType), str(ret))
432  break
433  else:
434  logger.error("Wrong items number 'host:port:timeout' in DRCE connections pool key: %s", str(item))
435  except Exception as err:
436  logger.error("Error get DRCE connection parameters, possible wrong ini value for DRCE connections pool: %s\n%s",
437  err, str(self.drceConnectionsPool))
438 
439  return ret
440 
441 
442 
Here is the caller graph for this function:

◆ getSitesFromClientResponseItems()

def dc.SitesManager.SitesManager.getSitesFromClientResponseItems (   self,
  clientResponseItems 
)

Definition at line 804 of file SitesManager.py.

804  def getSitesFromClientResponseItems(self, clientResponseItems):
805  batchItemsCounter = 0
806  batchItemsTotalCounter = 0
807  uniqueSitesDic = {}
808 
809  for item in clientResponseItems:
810  if item.errorCode == EventObjects.ClientResponseItem.STATUS_OK:
811  if isinstance(item.itemObject, list):
812  for site in item.itemObject:
813  batchItemsTotalCounter = batchItemsTotalCounter + 1
814  if isinstance(site, EventObjects.Site):
815  if str(site.id) not in uniqueSitesDic:
816  uniqueSitesDic[str(site.id)] = site
817  batchItemsCounter = batchItemsCounter + 1
818  else:
819  # Sum for counters fields
820  uniqueSitesDic[str(site.id)].newURLs = uniqueSitesDic[str(site.id)].newURLs + site.newURLs
821  uniqueSitesDic[str(site.id)].collectedURLs = uniqueSitesDic[str(site.id)].collectedURLs + \
822  site.collectedURLs
823  uniqueSitesDic[str(site.id)].deletedURLs = uniqueSitesDic[str(site.id)].deletedURLs + site.deletedURLs
824  uniqueSitesDic[str(site.id)].contents = uniqueSitesDic[str(site.id)].contents + site.contents
825  uniqueSitesDic[str(site.id)].resources = uniqueSitesDic[str(site.id)].resources + site.resources
826  else:
827  logger.error("Wrong object type in the itemObject.item: " + str(type(site)) + \
828  " but 'Site' expected")
829  else:
830  logger.error("Wrong object type in the ClientResponseItem.itemObject: " + str(type(item.itemObject)) + \
831  " but 'list' expected")
832  else:
833  logger.debug("ClientResponseItem error: " + str(item.errorCode) + " : " + item.errorMessage)
834 
835  logger.debug("Unique sites: " + str(batchItemsCounter) + ", total sites: " + str(batchItemsTotalCounter))
836 
837  return uniqueSitesDic
838 
839 
840 
Here is the caller graph for this function:

◆ logGeneralResponseResults()

def dc.SitesManager.SitesManager.logGeneralResponseResults (   self,
  clientResponse 
)

Definition at line 900 of file SitesManager.py.

900  def logGeneralResponseResults(self, clientResponse):
901  if isinstance(clientResponse, EventObjects.ClientResponse):
902  if clientResponse.errorCode > 0:
903  logger.error("clientResponse.errorCode:" + str(clientResponse.errorCode) + ":" + clientResponse.errorMessage)
904  for clientResponseItem in clientResponse.itemsList:
905  if isinstance(clientResponseItem, EventObjects.ClientResponseItem):
906  if clientResponseItem.errorCode != EventObjects.ClientResponseItem.STATUS_OK:
907  logger.error("ClientResponseItem error: " + str(clientResponseItem.errorCode) + " : " + \
908  clientResponseItem.errorMessage + "\n" + Utils.varDump(clientResponseItem))
909  else:
910  logger.error("Wrong type: " + str(type(clientResponseItem)) + ", expected ClientResponseItem\n" + \
911  Utils.varDump(clientResponseItem))
912  else:
913  logger.error("Wrong type: " + str(type(clientResponse)) + ", expected ClientResponse\n" + \
914  Utils.varDump(clientResponse))
915 
916 
917 
Here is the caller graph for this function:

◆ on_poll_timeout()

def dc.SitesManager.SitesManager.on_poll_timeout (   self)

Definition at line 279 of file SitesManager.py.

279  def on_poll_timeout(self):
280  logger.debug("Periodic iteration started.")
281  try:
282  # Process periodic re-crawling
283  if self.configVars[self.CONFIG_RECRAWL_SITES_ITER_PERIOD] > 0 and\
284  time.time() - self.processRecrawlLastTs > self.configVars[self.CONFIG_RECRAWL_SITES_ITER_PERIOD]:
285  if self.configVars[self.CONFIG_RECRAWL_SITES_MODE] == 1:
286  logger.debug("Now time to try to perform re-crawl, interval %s",
287  str(self.configVars[self.CONFIG_RECRAWL_SITES_ITER_PERIOD]))
288  if self.configVars[self.CONFIG_RECRAWL_SITES_MAX_THREADS] > \
289  self.statFields[DC_CONSTS.RECRAWL_THREADS_COUNTER_QUEUE_NAME]:
290  self.processRecrawlLastTs = time.time()
291  logger.info("Forking new recrawl thread")
292  self.updateStatField(DC_CONSTS.RECRAWL_THREADS_CREATED_COUNTER_NAME, 1, self.STAT_FIELDS_OPERATION_ADD)
293  t1 = threading.Thread(target=self.processRecrawling, args=(logging,))
294  t1.setName(self.SITE_RECRAWL_THREAD_NAME_PREFIX + \
295  str(self.statFields[DC_CONSTS.RECRAWL_THREADS_CREATED_COUNTER_NAME]))
296  t1.start()
297  logger.info("New recrawl thread forked")
298  else:
299  # Max threads limit reached
300  logger.debug("Max recrawl threads limit reached %s",
301  str(self.configVars[self.CONFIG_RECRAWL_SITES_MAX_THREADS]))
302  else:
303  logger.debug("Re-crawl disabled!")
304  except IOError as e:
305  del e
306  except Exception as err:
307  Utils.ExceptionLog.handler(logger, err, "Exception:")
308 
309  logger.debug("Periodic iteration finished.")
310 
311 
312 
Here is the call graph for this function:

◆ onEventsHandler()

def dc.SitesManager.SitesManager.onEventsHandler (   self,
  event 
)

Definition at line 316 of file SitesManager.py.

316  def onEventsHandler(self, event):
317  try:
318  if self.configVars[self.CONFIG_COMMON_COMMANDS_THREADING_MODE] == self.COMMON_COMMANDS_THREADING_SIMPLE:
319  logger.info("Common command in simple mode")
320  # Call threa-safe handler direct way as simple single m-type hce-node cluster
321  self.eventsHandlerTS(event, logging)
322  else:
323  # Call threa-safe handler multi-thread way as multi m-type hce-node cluster
324  logger.info("Forking new common commands thread")
325  self.updateStatField(DC_CONSTS.COMMON_THREADS_CREATED_COUNTER_NAME, 1, self.STAT_FIELDS_OPERATION_ADD)
326  t2 = threading.Thread(target=self.eventsHandlerTS, args=(event, logging,))
327  t2.setName(self.COMMON_COMMANDS_THREAD_NAME_PREFIX + \
328  str(self.statFields[DC_CONSTS.COMMON_THREADS_CREATED_COUNTER_NAME]))
329  t2.start()
330  logger.info("New common commands thread forked")
331  except IOError as e:
332  del e
333  except Exception as err:
334  Utils.ExceptionLog.handler(logger, err, "Exception:")
335 
336  self.on_poll_timeout()
337 
338 
339 
Here is the call graph for this function:

◆ prepareDRCERequest()

def dc.SitesManager.SitesManager.prepareDRCERequest (   self,
  eventType,
  eventObj 
)

Definition at line 448 of file SitesManager.py.

448  def prepareDRCERequest(self, eventType, eventObj):
449  # Prepare DRCE request data
450  drceSyncTasksCoverObj = DC_CONSTS.DRCESyncTasksCover(eventType, eventObj)
451  # Prepare DRCE request object
452  taskExecuteStruct = TaskExecuteStruct()
453  taskExecuteStruct.command = self.DRCEDBAppName
454  taskExecuteStruct.input = pickle.dumps(drceSyncTasksCoverObj)
455  taskExecuteStruct.session = Session(Session.TMODE_SYNC)
456  logger.debug("DRCE taskExecuteStruct:\n" + Utils.varDump(taskExecuteStruct))
457 
458  return taskExecuteStruct
459 
460 
461 
Here is the caller graph for this function:

◆ processDRCERequest()

def dc.SitesManager.SitesManager.processDRCERequest (   self,
  taskExecuteStruct,
  persistentDCREConnection = True,
  timeout = -1,
  connectionParams = None 
)

Definition at line 467 of file SitesManager.py.

467  def processDRCERequest(self, taskExecuteStruct, persistentDCREConnection=True, timeout=-1, connectionParams=None):
468  lock.acquire()
469  # Create DRCE TaskExecuteRequest object
470  idGenerator = IDGenerator()
471  taskId = ctypes.c_uint32(zlib.crc32(idGenerator.get_connection_uid(), int(time.time()))).value
472  taskExecuteRequest = TaskExecuteRequest(taskId)
473  if self.configVars[self.CONFIG_DRCE_ROUTE] != "":
474  taskExecuteRequest.route = self.configVars[self.CONFIG_DRCE_ROUTE]
475  # Set taskExecuteRequest fields
476  taskExecuteRequest.data = taskExecuteStruct
477  lock.release()
478 
479  logger.info("Sending sync task id:" + str(taskId) + " to DRCE router!")
480  # Send request to DRCE Cluster router
481  response = self.sendToDRCERouter(taskExecuteRequest, persistentDCREConnection, timeout, connectionParams)
482  logger.info("Received response on sync task from DRCE router!")
483  logger.debug("Response: %s", Utils.varDump(response))
484 
485  # Create new client response object
486  clientResponse = EventObjects.ClientResponse()
487  # Check response returned
488  if response is None:
489  clientResponse.errorCode = EventObjects.ClientResponse.STATUS_ERROR_NONE
490  clientResponse.errorMessage = "Response error, None returned from DRCE, possible timeout " + \
491  str(self.configVars[self.CONFIG_DRCE_TIMEOUT]) + " msec!"
492  logger.error(clientResponse.errorMessage)
493  else:
494  if len(response.items) == 0:
495  clientResponse.errorCode = EventObjects.ClientResponse.STATUS_ERROR_EMPTY_LIST
496  clientResponse.errorMessage = "Response error, empty list returned from DRCE, possible no one node in cluster!"
497  logger.error(clientResponse.errorMessage)
498  else:
499  for item in response.items:
500  # New ClientResponseItem object
501  clientResponseItem = EventObjects.ClientResponseItem(None)
502  # If some error in response item or cli application exit status
503  if item.error_code > 0 or item.exit_status > 0:
504  clientResponseItem.errorCode = clientResponseItem.STATUS_ERROR_DRCE
505  clientResponseItem.errorMessage = "Response item error error_message=" + item.error_message + \
506  ", error_code=" + str(item.error_code) + \
507  ", exit_status=" + str(item.exit_status) + \
508  ", stderror=" + str(item.stderror)
509  logger.error(clientResponseItem.errorMessage)
510  else:
511  # Try to restore serialized response object from dump
512  try:
513  drceSyncTasksCover = pickle.loads(item.stdout)
514  clientResponseItem.itemObject = drceSyncTasksCover.eventObject
515  except Exception as e:
516  clientResponseItem.errorCode = EventObjects.ClientResponseItem.STATUS_ERROR_RESTORE_OBJECT
517  clientResponseItem.errorMessage = EventObjects.ClientResponseItem.MSG_ERROR_RESTORE_OBJECT + "\n" + \
518  str(e.message) + "\nstdout=" + str(item.stdout) + \
519  ", stderror=" + str(item.stderror)
520  logger.error(clientResponseItem.errorMessage)
521  # Set all information fields in any case
522  clientResponseItem.id = item.id
523  clientResponseItem.host = item.host
524  clientResponseItem.port = item.port
525  clientResponseItem.node = item.node
526  clientResponseItem.time = item.time
527  # Add ClientResponseItem object
528  clientResponse.itemsList.append(clientResponseItem)
529 
530  return clientResponse
531 
532 
533 
Here is the call graph for this function:
Here is the caller graph for this function:

◆ processRecrawling()

def dc.SitesManager.SitesManager.processRecrawling (   self,
  loggingObj 
)

Definition at line 586 of file SitesManager.py.

586  def processRecrawling(self, loggingObj):
587  try:
588  global logger # pylint: disable=W0603
589  lock.acquire()
590  logger = loggingObj.getLogger(DC_CONSTS.LOGGER_NAME)
591  self.updateStatField(DC_CONSTS.RECRAWL_THREADS_COUNTER_QUEUE_NAME, 1, self.STAT_FIELDS_OPERATION_ADD)
592  self.updateStatField(DC_CONSTS.SITES_RECRAWL_COUNTER_NAME, 1, self.STAT_FIELDS_OPERATION_ADD)
593  lock.release()
594  logger.info("RECRAWL_THREAD_STARTED")
595  # Set timeout in msecs
596  timeout = self.configVars[self.CONFIG_RECRAWL_SITES_DRCE_TIMEOUT] * 1000
597 
598  # Find sites that need to be re-crawled
599  crit = {EventObjects.SiteFind.CRITERION_WHERE: self.configVars[self.CONFIG_RECRAWL_SITES_SELECT_CRITERION],
600  EventObjects.SiteFind.CRITERION_ORDER: self.configVars[self.CONFIG_RECRAWL_SITES_SELECT_ORDER],
601  EventObjects.SiteFind.CRITERION_LIMIT: str(self.configVars[self.CONFIG_RECRAWL_SITES_MAX])}
602  siteFind = EventObjects.SiteFind(None, crit)
603  logger.debug("Send DRCE request SITE_FIND")
604  clientResponse = self.processDRCERequest(self.prepareDRCERequest(EVENT_TYPES.SITE_FIND, siteFind), False, timeout)
605  logger.debug("clientResponse:" + Utils.varDump(clientResponse))
606 
607  lock.acquire()
608  # Process results and create unique sites dict
609  sites = self.getSitesFromClientResponseItems(clientResponse.itemsList)
610  lock.release()
611 
612  sitesQueue = {}
613  for siteId in sites.keys():
614  if siteId in self.recrawlSiteslQueue:
615  # Site is already in progress of recrawl by some thread
616  logger.debug("Site %s is already in progress of recrawl by some thread", str(siteId))
617  continue
618 
619  lock.acquire()
620  # Push site item in to the cross-threading queue
621  t1 = time.time()
622  self.recrawlSiteslQueue[str(siteId)] = {"siteObj":sites[siteId], "time":t1}
623  # Push Site item in to the local queue
624  sitesQueue[str(siteId)] = {"time":t1}
625  # Refresh stat queue size
626  self.updateStatField(DC_CONSTS.RECRAWL_SITES_QUEUE_NAME, len(self.recrawlSiteslQueue),
627  self.STAT_FIELDS_OPERATION_SET)
628  lock.release()
629 
630  urlUpdateList = []
631  urlDeleteList = []
632  logger.debug("Site selected for recrawl, site[" + str(siteId) + "]:\n" + Utils.varDump(sites[siteId]))
633  # Save prev. site state to restore after re-crawl will be finished
634  sitePrevState = sites[siteId].state
635  # Update site including lock by state
636  siteUpdate = EventObjects.SiteUpdate(siteId)
637  siteUpdate.iterations = EventObjects.SQLExpression("Iterations+1")
638  siteUpdate.tcDate = EventObjects.SQLExpression("Now()")
639  siteUpdate.uDate = siteUpdate.tcDate
640  if sites[siteId].recrawlPeriod > 0 and self.configVars[self.CONFIG_RECRAWL_SITES_RECRAWL_DATE_EXP] != "":
641  # Set next re-crawl date if periodic re-crawl
642  siteUpdate.recrawlDate = \
643  EventObjects.SQLExpression(self.configVars[self.CONFIG_RECRAWL_SITES_RECRAWL_DATE_EXP])
644  else:
645  # Disable re-crawl
646  siteUpdate.recrawlDate = EventObjects.SQLExpression("NULL")
647  if self.configVars[self.CONFIG_RECRAWL_SITES_LOCK_STATE] != "":
648  # Set state to lock site
649  siteUpdate.state = int(self.configVars[self.CONFIG_RECRAWL_SITES_LOCK_STATE])
650  else:
651  logger.debug("Site is not locked due empty string value of configuration parameter %s",
652  self.CONFIG_RECRAWL_SITES_LOCK_STATE)
653  logger.debug("Update site request including lock state if configured, id=%s", str(siteId))
654  # Update site fields
655  clientResponse = self.processDRCERequest(self.prepareDRCERequest(EVENT_TYPES.SITE_UPDATE, siteUpdate), False,
656  timeout)
657  logger.debug("Update site request done, id=%s", str(siteId))
658  # Check responses on errors
659  lock.acquire()
660  self.logGeneralResponseResults(clientResponse)
661  lock.release()
662 
663  if self.configVars[self.CONFIG_RECRAWL_DELAY_BEFORE] > 0:
664  # Delay after set site state SUSPENDED
665  logger.debug("Delay before, %s sec...", str(self.configVars[self.CONFIG_RECRAWL_DELAY_BEFORE]))
666  time.sleep(self.configVars[self.CONFIG_RECRAWL_DELAY_BEFORE])
667 
668  # Prepare URLs update
669  urlUpdate = EventObjects.URLUpdate(siteId, "")
670  urlUpdate.url = None
671  urlUpdate.urlMd5 = None
672  urlUpdate.status = EventObjects.URLUpdate.STATUS_NEW
673  crit = EventObjects.Site.getFromProperties(sites[siteId].properties, self.SITE_PROPERTIES_RECRAWL_WHERE_NAME)
674  if crit is None or crit == "":
675  crit = self.configVars[self.CONFIG_DEFAULT_RECRAWL_UPDATE_CRITERION]
676  logger.debug("Default update criterion: " + str(crit))
677  else:
678  logger.debug("Custom site update criterion: " + str(crit))
679  urlUpdate.criterions = {EventObjects.URLFetch.CRITERION_WHERE : crit}
680  urlUpdateList.append(urlUpdate)
681  # Delete old URLs from prev. re-crawling iteration
682  if self.configVars[self.CONFIG_DEFAULT_RECRAWL_DELETE_OLD] > 0:
683  sqlExpression = EventObjects.SQLExpression(self.configVars[self.CONFIG_DEFAULT_RECRAWL_DELETE_OLD_CRITERION])
684  siteDelOld = EventObjects.Site.getFromProperties(sites[siteId].properties,
685  self.SITE_PROPERTIES_RECRAWL_DELETE_NAME)
686  if siteDelOld is None or siteDelOld != "0":
687  siteSqlExpression = EventObjects.Site.getFromProperties(sites[siteId].properties,
688  self.SITE_PROPERTIES_RECRAWL_DELETE_WHERE_NAME)
689  logger.debug("Site expression: " + str(siteSqlExpression))
690  if siteSqlExpression is not None and siteSqlExpression != "":
691  sqlExpression = siteSqlExpression
692  logger.debug("Custom expression set: " + str(sqlExpression))
693  else:
694  logger.debug("Site delete old: " + str(siteDelOld))
695  urlDelete = EventObjects.URLDelete(siteId, None, EventObjects.URLStatus.URL_TYPE_MD5,
696  {EventObjects.URLFetch.CRITERION_WHERE:sqlExpression},
697  reason=EventObjects.URLDelete.REASON_RECRAWL)
698  urlDelete.urlType = None
699  urlDelete.delayedType = self.configVars[self.CONFIG_PURGE_METHOD]
700  logger.debug("Old URLs delete due re-crawl: " + Utils.varDump(urlDelete))
701  urlDeleteList.append(urlDelete)
702  logger.debug("URLDelete request, id=%s", str(siteId))
703  # Delete old URLs from prev. re-crawling
704  clientResponse = self.processDRCERequest(self.prepareDRCERequest(EVENT_TYPES.URL_DELETE, urlDeleteList),
705  False, timeout)
706  logger.debug("URLDelete request done, id=%s", str(siteId))
707  lock.acquire()
708  self.updateStatField(DC_CONSTS.SITES_RECRAWL_DELETED_COUNTER_NAME, len(urlDeleteList),
709  self.STAT_FIELDS_OPERATION_ADD)
710  # Check responses on errors
711  self.logGeneralResponseResults(clientResponse)
712  lock.release()
713 
714  # Optimize urls table
715  optimize = EventObjects.Site.getFromProperties(sites[siteId].properties,
716  self.SITE_PROPERTIES_RECRAWL_OPTIMIZE_NAME)
717  if (int(self.configVars[self.CONFIG_RECRAWL_SITES_OPTIMIZE]) == 1 and optimize is None) or int(optimize) == 1:
718  sqlQuery = "OPTIMIZE TABLE `urls_" + str(siteId) + "`"
719  logger.debug("CustomRequest query: %s", sqlQuery)
720  customRequest = CustomRequest(1, sqlQuery, "dc_urls")
721  # OPTIMIZE TABLE request
722  customResponse = self.processDRCERequest(self.prepareDRCERequest(EVENT_TYPES.SQL_CUSTOM, customRequest),
723  False, timeout)
724  logger.debug("CustomRequest request done, id=%s, customResponse:\n%s", str(siteId),
725  Utils.varDump(customResponse))
726 
727  # Auto tune RecrawlPeriod
728  recrawlPeriod = self.recalculateRecrawlPeriod(sites[siteId])
729  if recrawlPeriod is not None:
730  siteUpdate.recrawlPeriod = recrawlPeriod
731 
732  # Update URLs to push them to start crawling
733  clientResponse = self.processDRCERequest(self.prepareDRCERequest(EVENT_TYPES.URL_UPDATE, urlUpdateList), False,
734  timeout)
735  lock.acquire()
736  self.updateStatField(DC_CONSTS.SITES_RECRAWL_UPDATED_COUNTER_NAME, len(urlUpdateList),
737  self.STAT_FIELDS_OPERATION_ADD)
738  # Check responses on errors
739  self.logGeneralResponseResults(clientResponse)
740  lock.release()
741 
742  if self.configVars[self.CONFIG_RECRAWL_DELAY_AFTER] > 0:
743  # Delay after re-crawl processing done
744  logger.debug("Delay after, %s sec...", str(self.configVars[self.CONFIG_RECRAWL_DELAY_AFTER]))
745  time.sleep(self.configVars[self.CONFIG_RECRAWL_DELAY_AFTER])
746 
747  # Return site to previous state, i.e. unlock
748  if self.configVars[self.CONFIG_RECRAWL_SITES_LOCK_STATE] != "":
749  # Set state to unlock site
750  siteUpdate.state = sitePrevState
751  siteUpdate.iterations = None
752  siteUpdate.tcDate = EventObjects.SQLExpression("Now()")
753  siteUpdate.uDate = None
754  siteUpdate.recrawlDate = None
755  logger.debug("Unlock site request, id=%s", str(siteId))
756  # Update site fields
757  clientResponse = self.processDRCERequest(self.prepareDRCERequest(EVENT_TYPES.SITE_UPDATE, siteUpdate), False,
758  timeout)
759  logger.debug("Unlock site request done, id=%s", str(siteId))
760  # Check responses on errors
761  lock.acquire()
762  self.logGeneralResponseResults(clientResponse)
763  lock.release()
764  else:
765  logger.debug("Site is not unlocked due empty string value of configuration parameter %s",
766  self.CONFIG_RECRAWL_SITES_LOCK_STATE)
767  # lock.acquire()
768  # self.totalTime = self.totalTime + (time.time() - t)
769  # self.updateStatField(DC_CONSTS.BATCHES_CRAWL_COUNTER_TIME_AVG_NAME,
770  # str(self.totalTime / float(1 + self.statFields[DC_CONSTS.BATCHES_CRAWL_COUNTER_TOTAL_NAME])),
771  # self.STAT_FIELDS_OPERATION_SET)
772  # lock.release()
773  except Exception as err:
774  lock.acquire()
775  logger.error("Recrawl thread exception:" + str(err))
776  lock.release()
777  except: # pylint: disable=W0702
778  lock.acquire()
779  logger.error("Recrawl thread unknown exception!")
780  lock.release()
781 
782  lock.acquire()
783  # Remove processed sites from global the cross-threading queue
784  tmpQueue = {}
785  for siteId in self.recrawlSiteslQueue.keys():
786  if siteId not in sitesQueue:
787  # Accumulate only items that is not processed by this thread
788  tmpQueue[str(siteId)] = self.recrawlSiteslQueue[str(siteId)]
789  # Replace current queue with accumulated queue
790  self.recrawlSiteslQueue = tmpQueue
791  # Decrement of counter of threads
792  self.updateStatField(DC_CONSTS.RECRAWL_THREADS_COUNTER_QUEUE_NAME, 1, self.STAT_FIELDS_OPERATION_SUB)
793  # Refresh stat queue size
794  self.updateStatField(DC_CONSTS.RECRAWL_SITES_QUEUE_NAME, len(self.recrawlSiteslQueue),
795  self.STAT_FIELDS_OPERATION_SET)
796  logger.info("RECRAWL_THREAD_FINISHED")
797  lock.release()
798 
799 
800 
Here is the call graph for this function:
Here is the caller graph for this function:

◆ recalculateRecrawlPeriod()

def dc.SitesManager.SitesManager.recalculateRecrawlPeriod (   self,
  siteObj 
)

Definition at line 844 of file SitesManager.py.

844  def recalculateRecrawlPeriod(self, siteObj):
845  recrawlPeriod = None
846 
847  # Init per site settings
848  mode = EventObjects.Site.getFromProperties(siteObj.properties, self.SITE_PROPERTIES_RECRAWL_PERIOD_MODE_NAME)
849  if mode is None:
850  mode = self.configVars[self.CONFIG_RECRAWL_SITES_PERIOD_MODE]
851  else:
852  mode = int(mode)
853  minv = EventObjects.Site.getFromProperties(siteObj.properties, self.SITE_PROPERTIES_RECRAWL_PERIOD_MIN_NAME)
854  if minv is None:
855  minv = self.configVars[self.CONFIG_RECRAWL_SITES_PERIOD_MIN]
856  else:
857  minv = int(minv)
858  maxv = EventObjects.Site.getFromProperties(siteObj.properties, self.SITE_PROPERTIES_RECRAWL_PERIOD_MAX_NAME)
859  if maxv is None:
860  maxv = self.configVars[self.CONFIG_RECRAWL_SITES_PERIOD_MAX]
861  else:
862  maxv = int(maxv)
863  step = EventObjects.Site.getFromProperties(siteObj.properties, self.SITE_PROPERTIES_RECRAWL_PERIOD_STEP_NAME)
864  if step is None:
865  step = self.configVars[self.CONFIG_RECRAWL_SITES_PERIOD_STEP]
866  else:
867  step = int(step)
868 
869  logger.debug("RecrawlPeriod auto recalculate, siteId:%s, mode:%s, minv:%s, maxv:%s, step:%s", str(siteObj.id),
870  str(mode), str(minv), str(maxv), str(step))
871 
872  # If recrawl period recalculation mode is ON
873  if mode > 0:
874  logger.debug("RecrawlPeriod auto recalculate is ON, siteId:%s, current value:%s",
875  str(siteObj.id), str(siteObj.recrawlPeriod))
876  # If not all URLS in state NEW are crawled or processed (siteObj.resources - means crawled but not processed)
877  if siteObj.newURLs > 0 or siteObj.resources > 0:
878  # If maximum value of period is not reached
879  if siteObj.recrawlPeriod < maxv:
880  # Increase period on step value
881  recrawlPeriod = siteObj.recrawlPeriod + step
882  else:
883  logger.debug("Max value of RecrawlPeriod reached:%s", str(maxv))
884  else:
885  # If minimum value of period is not reached
886  if siteObj.recrawlPeriod > minv:
887  # Decrease the period on step value
888  recrawlPeriod = siteObj.recrawlPeriod - step
889  else:
890  logger.debug("Min value of RecrawlPeriod reached:%s", str(minv))
891  logger.debug("New RecrawlPeriod value for site %s is:%s", str(siteObj.id), str(recrawlPeriod))
892 
893  return recrawlPeriod
894 
895 
896 
Here is the caller graph for this function:

◆ sendToDRCERouter()

def dc.SitesManager.SitesManager.sendToDRCERouter (   self,
  request,
  persistentDCREConnection = True,
  timeout = -1,
  connectionParams = None 
)

Definition at line 540 of file SitesManager.py.

540  def sendToDRCERouter(self, request, persistentDCREConnection=True, timeout=-1, connectionParams=None):
541  lock.acquire()
542  if timeout == -1:
543  timeout = int(self.configVars[self.CONFIG_DRCE_TIMEOUT])
544  self.updateStatField(DC_CONSTS.SITES_DRCE_COUNTER_NAME, 1, self.STAT_FIELDS_OPERATION_ADD)
545  if persistentDCREConnection:
546  logger.info("DRCE router sending via persistent connection with timeout=%s", str(timeout))
547  drceManager = self.drceManager
548  else:
549  drceManager = DRCEManager()
550  if connectionParams is None:
551  drceManager.activate_host(HostParams(self.drceHost, self.drcePort))
552  logger.info("DRCE router sending via temporary connection with timeout=" + str(timeout) + \
553  ", and regular host:" + str(self.drceHost) + ", port:" + str(self.drcePort))
554  else:
555  logger.info("DRCE router sending via temporary connection with timeout=" + str(timeout) + \
556  ", and DRCE connections pool host:" + str(connectionParams[0]) + \
557  ", port:" + str(connectionParams[1]))
558  drceManager.activate_host(HostParams(connectionParams[0], int(connectionParams[1])))
559  lock.release()
560 
561  # Try to execute request
562  try:
563  response = drceManager.process(request, timeout, self.DRCE_REDUCER_TTL)
564  except (ConnectionTimeout, TransportInternalErr, CommandExecutorErr) as err:
565  response = None
566  logger.error("DRCE router transport send error : " + str(err.message))
567  except Exception as err:
568  response = None
569  logger.error("DRCE router common error : " + str(err.message))
570 
571  logger.info("DRCE router sent!")
572 
573  if not persistentDCREConnection:
574  lock.acquire()
575  drceManager.clear_host()
576  lock.release()
577 
578  return response
579 
580 
581 
Here is the call graph for this function:
Here is the caller graph for this function:

Member Data Documentation

◆ COMMON_COMMANDS_THREAD_NAME_PREFIX

string dc.SitesManager.SitesManager.COMMON_COMMANDS_THREAD_NAME_PREFIX = 'Common_'
static

Definition at line 106 of file SitesManager.py.

◆ COMMON_COMMANDS_THREADING_MULTI

int dc.SitesManager.SitesManager.COMMON_COMMANDS_THREADING_MULTI = 1
static

Definition at line 105 of file SitesManager.py.

◆ COMMON_COMMANDS_THREADING_SIMPLE

int dc.SitesManager.SitesManager.COMMON_COMMANDS_THREADING_SIMPLE = 0
static

Definition at line 104 of file SitesManager.py.

◆ CONFIG_COMMON_COMMANDS_THREADING_MODE

string dc.SitesManager.SitesManager.CONFIG_COMMON_COMMANDS_THREADING_MODE = "CommonCommandsThreadingMode"
static

Definition at line 101 of file SitesManager.py.

◆ CONFIG_DEFAULT_RECRAWL_DELETE_OLD

string dc.SitesManager.SitesManager.CONFIG_DEFAULT_RECRAWL_DELETE_OLD = "DefaultRecrawDeleteOld"
static

Definition at line 96 of file SitesManager.py.

◆ CONFIG_DEFAULT_RECRAWL_DELETE_OLD_CRITERION

string dc.SitesManager.SitesManager.CONFIG_DEFAULT_RECRAWL_DELETE_OLD_CRITERION = "DefaultRecrawDeleteOldCriterion"
static

Definition at line 97 of file SitesManager.py.

◆ CONFIG_DEFAULT_RECRAWL_UPDATE_CRITERION

string dc.SitesManager.SitesManager.CONFIG_DEFAULT_RECRAWL_UPDATE_CRITERION = "DefaultRecrawUpdatelCriterion"
static

Definition at line 95 of file SitesManager.py.

◆ CONFIG_DRCE_DB_APP_NAME

string dc.SitesManager.SitesManager.CONFIG_DRCE_DB_APP_NAME = "DRCEDBAppName"
static

Definition at line 77 of file SitesManager.py.

◆ CONFIG_DRCE_HOST

string dc.SitesManager.SitesManager.CONFIG_DRCE_HOST = "DRCEHost"
static

Definition at line 74 of file SitesManager.py.

◆ CONFIG_DRCE_NODES

string dc.SitesManager.SitesManager.CONFIG_DRCE_NODES = "DRCENodes"
static

Definition at line 100 of file SitesManager.py.

◆ CONFIG_DRCE_PORT

string dc.SitesManager.SitesManager.CONFIG_DRCE_PORT = "DRCEPort"
static

Definition at line 75 of file SitesManager.py.

◆ CONFIG_DRCE_ROUTE

string dc.SitesManager.SitesManager.CONFIG_DRCE_ROUTE = "DRCERoute"
static

Definition at line 98 of file SitesManager.py.

◆ CONFIG_DRCE_TIMEOUT

string dc.SitesManager.SitesManager.CONFIG_DRCE_TIMEOUT = "DRCETimeout"
static

Definition at line 76 of file SitesManager.py.

◆ CONFIG_POLLING_TIMEOUT

string dc.SitesManager.SitesManager.CONFIG_POLLING_TIMEOUT = "PollingTimeout"
static

Definition at line 94 of file SitesManager.py.

◆ CONFIG_PURGE_METHOD

string dc.SitesManager.SitesManager.CONFIG_PURGE_METHOD = "PurgeMethod"
static

Definition at line 99 of file SitesManager.py.

◆ CONFIG_RECRAWL_DELAY_AFTER

string dc.SitesManager.SitesManager.CONFIG_RECRAWL_DELAY_AFTER = "RecrawlDelayAfter"
static

Definition at line 93 of file SitesManager.py.

◆ CONFIG_RECRAWL_DELAY_BEFORE

string dc.SitesManager.SitesManager.CONFIG_RECRAWL_DELAY_BEFORE = "RecrawlDelayBefore"
static

Definition at line 92 of file SitesManager.py.

◆ CONFIG_RECRAWL_SITES_DRCE_TIMEOUT

string dc.SitesManager.SitesManager.CONFIG_RECRAWL_SITES_DRCE_TIMEOUT = "RecrawlSiteDRCETimeout"
static

Definition at line 90 of file SitesManager.py.

◆ CONFIG_RECRAWL_SITES_ITER_PERIOD

string dc.SitesManager.SitesManager.CONFIG_RECRAWL_SITES_ITER_PERIOD = "RecrawlSiteIterationPeriod"
static

Definition at line 79 of file SitesManager.py.

◆ CONFIG_RECRAWL_SITES_LOCK_STATE

string dc.SitesManager.SitesManager.CONFIG_RECRAWL_SITES_LOCK_STATE = "RecrawlSiteLockState"
static

Definition at line 88 of file SitesManager.py.

◆ CONFIG_RECRAWL_SITES_MAX

string dc.SitesManager.SitesManager.CONFIG_RECRAWL_SITES_MAX = "RecrawlSiteMax"
static

Definition at line 78 of file SitesManager.py.

◆ CONFIG_RECRAWL_SITES_MAX_THREADS

string dc.SitesManager.SitesManager.CONFIG_RECRAWL_SITES_MAX_THREADS = "RecrawlSiteMaxThreads"
static

Definition at line 87 of file SitesManager.py.

◆ CONFIG_RECRAWL_SITES_MODE

string dc.SitesManager.SitesManager.CONFIG_RECRAWL_SITES_MODE = "RecrawlSiteMode"
static

Definition at line 91 of file SitesManager.py.

◆ CONFIG_RECRAWL_SITES_OPTIMIZE

string dc.SitesManager.SitesManager.CONFIG_RECRAWL_SITES_OPTIMIZE = "RecrawlSiteOptimize"
static

Definition at line 89 of file SitesManager.py.

◆ CONFIG_RECRAWL_SITES_PERIOD_MAX

string dc.SitesManager.SitesManager.CONFIG_RECRAWL_SITES_PERIOD_MAX = "RecrawlSitePeriodMax"
static

Definition at line 82 of file SitesManager.py.

◆ CONFIG_RECRAWL_SITES_PERIOD_MIN

string dc.SitesManager.SitesManager.CONFIG_RECRAWL_SITES_PERIOD_MIN = "RecrawlSitePeriodMin"
static

Definition at line 81 of file SitesManager.py.

◆ CONFIG_RECRAWL_SITES_PERIOD_MODE

string dc.SitesManager.SitesManager.CONFIG_RECRAWL_SITES_PERIOD_MODE = "RecrawlSitePeriodMode"
static

Definition at line 80 of file SitesManager.py.

◆ CONFIG_RECRAWL_SITES_PERIOD_STEP

string dc.SitesManager.SitesManager.CONFIG_RECRAWL_SITES_PERIOD_STEP = "RecrawlSitePeriodStep"
static

Definition at line 83 of file SitesManager.py.

◆ CONFIG_RECRAWL_SITES_RECRAWL_DATE_EXP

string dc.SitesManager.SitesManager.CONFIG_RECRAWL_SITES_RECRAWL_DATE_EXP = "RecrawlSiteRecrawlDateExpression"
static

Definition at line 84 of file SitesManager.py.

◆ CONFIG_RECRAWL_SITES_SELECT_CRITERION

string dc.SitesManager.SitesManager.CONFIG_RECRAWL_SITES_SELECT_CRITERION = "RecrawlSiteSelectCriterion"
static

Definition at line 85 of file SitesManager.py.

◆ CONFIG_RECRAWL_SITES_SELECT_ORDER

string dc.SitesManager.SitesManager.CONFIG_RECRAWL_SITES_SELECT_ORDER = "RecrawlSiteSelectOrder"
static

Definition at line 86 of file SitesManager.py.

◆ CONFIG_SERVER

string dc.SitesManager.SitesManager.CONFIG_SERVER = "server"
static

Definition at line 73 of file SitesManager.py.

◆ DRCE_CONNECTIONS_POOL

string dc.SitesManager.SitesManager.DRCE_CONNECTIONS_POOL = "DRCEConnectionsPool"
static

Definition at line 103 of file SitesManager.py.

◆ DRCE_REDUCER_TTL

int dc.SitesManager.SitesManager.DRCE_REDUCER_TTL = 3000000
static

Definition at line 60 of file SitesManager.py.

◆ drceCommandConvertor

dc.SitesManager.SitesManager.drceCommandConvertor

Definition at line 186 of file SitesManager.py.

◆ drceConnectionsPool

dc.SitesManager.SitesManager.drceConnectionsPool

Definition at line 260 of file SitesManager.py.

◆ DRCEDBAppName

dc.SitesManager.SitesManager.DRCEDBAppName

Definition at line 134 of file SitesManager.py.

◆ drceHost

dc.SitesManager.SitesManager.drceHost

Definition at line 180 of file SitesManager.py.

◆ drceIdGenerator

dc.SitesManager.SitesManager.drceIdGenerator

Definition at line 185 of file SitesManager.py.

◆ drceManager

dc.SitesManager.SitesManager.drceManager

Definition at line 183 of file SitesManager.py.

◆ drcePort

dc.SitesManager.SitesManager.drcePort

Definition at line 181 of file SitesManager.py.

◆ eventTypes

dc.SitesManager.SitesManager.eventTypes

Definition at line 143 of file SitesManager.py.

◆ processRecrawlLastTs

dc.SitesManager.SitesManager.processRecrawlLastTs

Definition at line 206 of file SitesManager.py.

◆ recrawlSiteslQueue

dc.SitesManager.SitesManager.recrawlSiteslQueue

Definition at line 229 of file SitesManager.py.

◆ serverName

dc.SitesManager.SitesManager.serverName

Definition at line 133 of file SitesManager.py.

◆ SITE_PROPERTIES_RECRAWL_DELETE_NAME

string dc.SitesManager.SitesManager.SITE_PROPERTIES_RECRAWL_DELETE_NAME = "RECRAWL_DELETE"
static

Definition at line 63 of file SitesManager.py.

◆ SITE_PROPERTIES_RECRAWL_DELETE_WHERE_NAME

string dc.SitesManager.SitesManager.SITE_PROPERTIES_RECRAWL_DELETE_WHERE_NAME = "RECRAWL_DELETE_WHERE"
static

Definition at line 62 of file SitesManager.py.

◆ SITE_PROPERTIES_RECRAWL_OPTIMIZE_NAME

string dc.SitesManager.SitesManager.SITE_PROPERTIES_RECRAWL_OPTIMIZE_NAME = "RECRAWL_OPTIMIZE"
static

Definition at line 64 of file SitesManager.py.

◆ SITE_PROPERTIES_RECRAWL_PERIOD_MAX_NAME

string dc.SitesManager.SitesManager.SITE_PROPERTIES_RECRAWL_PERIOD_MAX_NAME = "RECRAWL_PERIOD_MAX"
static

Definition at line 68 of file SitesManager.py.

◆ SITE_PROPERTIES_RECRAWL_PERIOD_MIN_NAME

string dc.SitesManager.SitesManager.SITE_PROPERTIES_RECRAWL_PERIOD_MIN_NAME = "RECRAWL_PERIOD_MIN"
static

Definition at line 67 of file SitesManager.py.

◆ SITE_PROPERTIES_RECRAWL_PERIOD_MODE_NAME

string dc.SitesManager.SitesManager.SITE_PROPERTIES_RECRAWL_PERIOD_MODE_NAME = "RECRAWL_PERIOD_MODE"
static

Definition at line 66 of file SitesManager.py.

◆ SITE_PROPERTIES_RECRAWL_PERIOD_STEP_NAME

string dc.SitesManager.SitesManager.SITE_PROPERTIES_RECRAWL_PERIOD_STEP_NAME = "RECRAWL_PERIOD_STEP"
static

Definition at line 69 of file SitesManager.py.

◆ SITE_PROPERTIES_RECRAWL_WHERE_NAME

string dc.SitesManager.SitesManager.SITE_PROPERTIES_RECRAWL_WHERE_NAME = "RECRAWL_WHERE"
static

Definition at line 61 of file SitesManager.py.

◆ SITE_RECRAWL_THREAD_NAME_PREFIX

string dc.SitesManager.SitesManager.SITE_RECRAWL_THREAD_NAME_PREFIX = 'ReCrawl_'
static

Definition at line 70 of file SitesManager.py.


The documentation for this class was generated from the following file: