dc_crawler.OwnRobots.RobotExclusionRulesParser Class Reference

Public Member Functions

def __init__ (self)
 
def source_url (self)
 
def response_code (self)
 
def sitemap (self)
 
def sitemaps (self)
 
def is_expired (self)
 
def is_allowed (self, user_agent, url, syntax=GYM2008)
 
def get_crawl_delay (self, user_agent)
 
def fetch (self, url, timeout=None)
 
def parse (self, s)
 
def __str__ (self)
 
def __unicode__ (self)
 

Public Attributes

 user_agent
 
 use_local_time
 
 expiration_date
 

Private Member Functions

def _now (self)
 

Private Attributes

 _source_url
 
 _response_code
 
 _sitemaps
 
 __rulesets
 

Detailed Description

A parser for robots.txt files.

Definition at line 300 of file OwnRobots.py.
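For context, a minimal usage sketch (the URL and user-agent string below are hypothetical; the import assumes the class is reachable as dc_crawler.OwnRobots.RobotExclusionRulesParser, as documented above):

    from dc_crawler.OwnRobots import RobotExclusionRulesParser

    parser = RobotExclusionRulesParser()
    parser.user_agent = "MyCrawler/1.0"              # hypothetical crawler identity, sent as the User-Agent header
    parser.fetch("http://example.com/robots.txt")    # hypothetical robots.txt URL

    # Ask whether this user agent may visit a given URL on that site.
    if parser.is_allowed("MyCrawler/1.0", "http://example.com/private/page.html"):
        print("allowed to crawl")
    else:
        print("disallowed by robots.txt")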

Constructor & Destructor Documentation

◆ __init__()

def dc_crawler.OwnRobots.RobotExclusionRulesParser.__init__ (   self)

Definition at line 302 of file OwnRobots.py.

def __init__(self):
    self._source_url = ""
    self.user_agent = None
    self.use_local_time = True
    self.expiration_date = self._now() + SEVEN_DAYS
    self._response_code = 0
    self._sitemaps = [ ]
    self.__rulesets = [ ]

Member Function Documentation

◆ __str__()

def dc_crawler.OwnRobots.RobotExclusionRulesParser.__str__ (   self)

Definition at line 651 of file OwnRobots.py.

def __str__(self):
    s = self.__unicode__()
    if PY_MAJOR_VERSION == 2:
        s = s.encode("utf-8")

    return s

◆ __unicode__()

def dc_crawler.OwnRobots.RobotExclusionRulesParser.__unicode__ (   self)

Definition at line 658 of file OwnRobots.py.

def __unicode__(self):
    if self._sitemaps:
        s = "Sitemaps: %s\n\n" % self._sitemaps
    else:
        s = ""
    if PY_MAJOR_VERSION < 3:
        s = unicode(s)
    # I also need to string-ify each ruleset. The function for doing so
    # varies under Python 2/3.
    stringify = (unicode if (PY_MAJOR_VERSION == 2) else str)
    return s + '\n'.join( [stringify(ruleset) for ruleset in self.__rulesets] )
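As the two listings above show, str() on a parser concatenates the sitemap list (if any) with the string form of every parsed ruleset, which is convenient for debugging; a small sketch (the rules are illustrative):

    from dc_crawler.OwnRobots import RobotExclusionRulesParser

    parser = RobotExclusionRulesParser()
    parser.parse("User-agent: *\nDisallow: /tmp/\n")
    print(str(parser))    # dumps the rulesets reconstructed from the parsed rules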

◆ _now()

def dc_crawler.OwnRobots.RobotExclusionRulesParser._now (   self)
private

Definition at line 344 of file OwnRobots.py.

def _now(self):
    if self.use_local_time:
        return time.time()
    else:
        # What the heck is timegm() doing in the calendar module?!?
        return calendar.timegm(time.gmtime())

◆ fetch()

def dc_crawler.OwnRobots.RobotExclusionRulesParser.fetch (self, url, timeout=None)
Attempts to fetch the URL requested which should refer to a 
robots.txt file, e.g. http://example.com/robots.txt.

Definition at line 399 of file OwnRobots.py.

def fetch(self, url, timeout=None):
    """Attempts to fetch the URL requested which should refer to a
    robots.txt file, e.g. http://example.com/robots.txt.
    """

    # ISO-8859-1 is the default encoding for text files per the specs for
    # HTTP 1.0 (RFC 1945 sec 3.6.1) and HTTP 1.1 (RFC 2616 sec 3.7.1).
    # ref: http://www.w3.org/Protocols/rfc2616/rfc2616-sec3.html#sec3.7.1
    encoding = "iso-8859-1"
    content = ""
    expires_header = None
    content_type_header = None
    self._response_code = 0
    self._source_url = url

    if self.user_agent:
        req = urllib_request.Request(url, None, { 'User-Agent' : self.user_agent })
    else:
        req = urllib_request.Request(url)

    try:
        if timeout:
            f = urllib_request.urlopen(req, timeout=timeout)
        else:
            f = urllib_request.urlopen(req)

        content = f.read(MAX_FILESIZE)
        # As of Python 2.5, f.info() looks like it returns the HTTPMessage
        # object created during the connection.
        expires_header = f.info().get("expires")
        content_type_header = f.info().get("Content-Type")
        # As of Python 2.4, this file-like object reports the response
        # code, too.
        if hasattr(f, "code"):
            self._response_code = f.code
        else:
            self._response_code = 200
        f.close()
    except urllib_error.URLError:
        # This is a slightly convoluted way to get the error instance,
        # but it works under Python 2 & 3.
        error_instance = sys.exc_info()
        if len(error_instance) > 1:
            error_instance = error_instance[1]
        if hasattr(error_instance, "code"):
            self._response_code = error_instance.code

    # MK1996 section 3.4 says, "...robots should take note of Expires
    # header set by the origin server. If no cache-control directives
    # are present robots should default to an expiry of 7 days".

    # This code is lazy and looks at the Expires header but not
    # Cache-Control directives.
    self.expiration_date = None
    if self._response_code >= 200 and self._response_code < 300:
        # All's well.
        if expires_header:
            self.expiration_date = email_utils.parsedate_tz(expires_header)

            if self.expiration_date:
                # About time zones -- the call to parsedate_tz() returns a
                # 10-tuple with the time zone offset in the 10th element.
                # There are 3 valid formats for HTTP dates, and one of
                # them doesn't contain time zone information. (UTC is
                # implied since all HTTP header dates are UTC.) When given
                # a date that lacks time zone information, parsedate_tz()
                # returns None in the 10th element. mktime_tz() interprets
                # None in the 10th (time zone) element to mean that the
                # date is *local* time, not UTC.
                # Therefore, if the HTTP timestamp lacks time zone info
                # and I run that timestamp through parsedate_tz() and pass
                # it directly to mktime_tz(), I'll get back a local
                # timestamp which isn't what I want. To fix this, I simply
                # convert a time zone of None to zero. It's much more
                # difficult to explain than to fix. =)
                # ref: http://www.w3.org/Protocols/rfc2616/rfc2616-sec3.html#sec3.3.1
                if self.expiration_date[9] == None:
                    self.expiration_date = self.expiration_date[:9] + (0,)

                self.expiration_date = email_utils.mktime_tz(self.expiration_date)
                if self.use_local_time:
                    # I have to do a little more converting to get this
                    # UTC timestamp into localtime.
                    self.expiration_date = time.mktime(time.gmtime(self.expiration_date))
            #else:
                # The expires header was garbage.

    if not self.expiration_date: self.expiration_date = self._now() + SEVEN_DAYS

    if (self._response_code >= 200) and (self._response_code < 300):
        # All's well.
        media_type, encoding = _parse_content_type_header(content_type_header)
        # RFC 2616 sec 3.7.1 --
        # When no explicit charset parameter is provided by the sender,
        # media subtypes of the "text" type are defined to have a default
        # charset value of "ISO-8859-1" when received via HTTP.
        # http://www.w3.org/Protocols/rfc2616/rfc2616-sec3.html#sec3.7.1
        if not encoding:
            encoding = "iso-8859-1"
    elif self._response_code in (401, 403):
        # 401 or 403 ==> Go away or I will taunt you a second time!
        # (according to MK1996)
        content = "User-agent: *\nDisallow: /\n"
    elif self._response_code == 404:
        # No robots.txt ==> everyone's welcome
        content = ""
    else:
        # Uh-oh. I punt this up to the caller.
        _raise_error(urllib_error.URLError, self._response_code)

    if ((PY_MAJOR_VERSION == 2) and isinstance(content, str)) or \
       ((PY_MAJOR_VERSION > 2) and (not isinstance(content, str))):
        # This ain't Unicode yet! It needs to be.

        # Unicode decoding errors are another point of failure that I punt
        # up to the caller.
        try:
            content = content.decode(encoding)
        except UnicodeError:
            _raise_error(UnicodeError,
                "Robots.txt contents are not in the encoding expected (%s)." % encoding)
        except (LookupError, ValueError):
            # LookupError ==> Python doesn't have a decoder for that encoding.
            # One can also get a ValueError here if the encoding starts with
            # a dot (ASCII 0x2e). See Python bug 1446043 for details. This
            # bug was supposedly fixed in Python 2.5.
            _raise_error(UnicodeError, "I don't understand the encoding \"%s\"." % encoding)

    # Now that I've fetched the content and turned it into Unicode, I
    # can parse it.
    self.parse(content)
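A hedged sketch of a typical fetch() call (the URL and timeout value are illustrative; response_code and source_url are the "Read only" accessors documented below and are assumed here to be attribute-style properties rather than plain methods):

    from dc_crawler.OwnRobots import RobotExclusionRulesParser

    parser = RobotExclusionRulesParser()
    parser.user_agent = "MyCrawler/1.0"                         # optional; sent as the User-Agent header

    # fetch() downloads and parses the file. Per the listing above, a 401/403
    # response is treated as "Disallow: /" and a 404 as an empty robots.txt.
    parser.fetch("http://example.com/robots.txt", timeout=10)   # hypothetical URL

    print(parser.response_code)   # HTTP status of the robots.txt request (assumed property)
    print(parser.source_url)      # URL the rules were fetched from (assumed property)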

◆ get_crawl_delay()

def dc_crawler.OwnRobots.RobotExclusionRulesParser.get_crawl_delay (self, user_agent)
Returns a float representing the crawl delay specified for this 
user agent, or None if the crawl delay was unspecified or not a float.

Definition at line 384 of file OwnRobots.py.

def get_crawl_delay(self, user_agent):
    """Returns a float representing the crawl delay specified for this
    user agent, or None if the crawl delay was unspecified or not a float.
    """
    # See is_allowed() comment about the explicit unicode conversion.
    if (PY_MAJOR_VERSION < 3) and (not isinstance(user_agent, unicode)):
        user_agent = user_agent.decode()

    for ruleset in self.__rulesets:
        if ruleset.does_user_agent_match(user_agent):
            return ruleset.crawl_delay

    return None
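A short sketch of honoring the crawl delay between requests (the robots.txt content and user-agent name are illustrative):

    import time

    from dc_crawler.OwnRobots import RobotExclusionRulesParser

    parser = RobotExclusionRulesParser()
    parser.parse("User-agent: *\nCrawl-delay: 2.5\nDisallow: /private/\n")

    delay = parser.get_crawl_delay("MyCrawler/1.0")   # 2.5 from the wildcard ruleset above
    if delay is not None:
        time.sleep(delay)                             # wait the number of seconds the site requested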

◆ is_allowed()

def dc_crawler.OwnRobots.RobotExclusionRulesParser.is_allowed (self, user_agent, url, syntax=GYM2008)
True if the user agent is permitted to visit the URL. The syntax 
parameter can be GYM2008 (the default) or MK1996 for strict adherence 
to the traditional standard.

Definition at line 352 of file OwnRobots.py.

def is_allowed(self, user_agent, url, syntax=GYM2008):
    """True if the user agent is permitted to visit the URL. The syntax
    parameter can be GYM2008 (the default) or MK1996 for strict adherence
    to the traditional standard.
    """
    # The robot rules are stored internally as Unicode. The two lines
    # below ensure that the parameters passed to this function are
    # also Unicode. If those lines were not present and the caller
    # passed a non-Unicode user agent or URL string to this function,
    # Python would silently convert it to Unicode before comparing it
    # to the robot rules. Such conversions use the default encoding
    # (usually US-ASCII) and if the string couldn't be converted using
    # that encoding, Python would raise a UnicodeError later on in the
    # guts of this code which would be confusing.
    # Converting the strings to Unicode here doesn't make the problem
    # go away but it does make the conversion explicit so that
    # failures are easier to understand.
    if not isinstance(user_agent, unicode):
        user_agent = user_agent.decode()
    if not isinstance(url, unicode):
        url = url.decode()

    if syntax not in (MK1996, GYM2008):
        _raise_error(ValueError, "Syntax must be MK1996 or GYM2008")

    for ruleset in self.__rulesets:
        if ruleset.does_user_agent_match(user_agent):
            return ruleset.is_url_allowed(url, syntax)

    return True
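A sketch of the effect of the syntax argument (the rules and user agent are illustrative; GYM2008 and MK1996 are assumed to be importable module-level constants of OwnRobots, as the default value above suggests):

    from dc_crawler.OwnRobots import RobotExclusionRulesParser, GYM2008, MK1996

    parser = RobotExclusionRulesParser()
    parser.parse("User-agent: *\nDisallow: /*.pdf$\n")

    # GYM2008 (the default) treats '*' and '$' as wildcards; MK1996 takes them literally.
    print(parser.is_allowed("MyCrawler/1.0", "/docs/report.pdf"))           # False under GYM2008
    print(parser.is_allowed("MyCrawler/1.0", "/docs/report.pdf", MK1996))   # True, no literal match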

◆ is_expired()

def dc_crawler.OwnRobots.RobotExclusionRulesParser.is_expired (   self)
True if the difference between now and the last call to fetch()
exceeds the robots.txt expiration. Read only.

Definition at line 337 of file OwnRobots.py.

def is_expired(self):
    """True if the difference between now and the last call to fetch()
    exceeds the robots.txt expiration. Read only.
    """
    return self.expiration_date <= self._now()
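A sketch of a cache-refresh check built on this expiration logic (the URL is hypothetical; is_expired is assumed to be an attribute-style "Read only" property, so it is not called with parentheses here):

    from dc_crawler.OwnRobots import RobotExclusionRulesParser

    robots_url = "http://example.com/robots.txt"   # hypothetical URL
    parser = RobotExclusionRulesParser()
    parser.fetch(robots_url)

    # ... later, before the next crawl pass ...
    if parser.is_expired:
        # The Expires header (or the seven-day default) has elapsed; refresh the rules.
        parser.fetch(robots_url)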

◆ parse()

def dc_crawler.OwnRobots.RobotExclusionRulesParser.parse (self, s)
Parses the passed string as a set of robots.txt rules.

Definition at line 532 of file OwnRobots.py.

def parse(self, s):
    """Parses the passed string as a set of robots.txt rules."""
    self._sitemaps = []
    self.__rulesets = []

    if ((PY_MAJOR_VERSION > 2) and (isinstance(s, bytes) or isinstance(s, bytearray))) or \
       ((PY_MAJOR_VERSION == 2) and (not isinstance(s, unicode))):
        s = s.decode("iso-8859-1")

    # Normalize newlines.
    s = _end_of_line_regex.sub("\n", s)

    lines = s.split("\n")

    previous_line_was_a_user_agent = False
    current_ruleset = None

    for line in lines:
        line = line.strip()

        if line and line[0] == '#':
            # "Lines containing only a comment are discarded completely,
            # and therefore do not indicate a record boundary." (MK1994)
            pass
        else:
            # Remove comments
            i = line.find("#")
            if i != -1: line = line[:i]

            line = line.strip()

            if not line:
                # An empty line indicates the end of a ruleset.
                if current_ruleset and current_ruleset.is_not_empty():
                    self.__rulesets.append(current_ruleset)

                current_ruleset = None
                previous_line_was_a_user_agent = False
            else:
                # Each non-empty line falls into one of six categories:
                # 1) User-agent: blah blah blah
                # 2) Disallow: blah blah blah
                # 3) Allow: blah blah blah
                # 4) Crawl-delay: blah blah blah
                # 5) Sitemap: blah blah blah
                # 6) Everything else
                # 1 - 5 are interesting and I find them with the regex
                # below. Category 6 I discard as directed by the MK1994
                # ("Unrecognised headers are ignored.")
                # Note that 4 & 5 are specific to GYM2008 syntax, but
                # respecting them here is not a problem. They're just
                # additional information that the caller is free to ignore.
                matches = _directive_regex.findall(line)

                # Categories 1 - 5 produce two matches, #6 produces none.
                if matches:
                    field, data = matches[0]
                    field = field.lower()
                    data = _scrub_data(data)

                    # Matching "useragent" is a deviation from the
                    # MK1994/96 which permits only "user-agent".
                    if field in ("useragent", "user-agent"):
                        if previous_line_was_a_user_agent:
                            # Add this UA to the current ruleset
                            if current_ruleset and data:
                                current_ruleset.add_robot_name(data)
                        else:
                            # Save the current ruleset and start a new one.
                            if current_ruleset and current_ruleset.is_not_empty():
                                self.__rulesets.append(current_ruleset)
                            #else:
                                # (is_not_empty() == False) ==> malformed
                                # robots.txt listed a UA line but provided
                                # no name or didn't provide any rules
                                # for a named UA.
                            current_ruleset = _Ruleset()
                            if data:
                                current_ruleset.add_robot_name(data)

                        previous_line_was_a_user_agent = True
                    elif field == "allow":
                        previous_line_was_a_user_agent = False
                        if current_ruleset:
                            current_ruleset.add_allow_rule(data)
                    elif field == "sitemap":
                        previous_line_was_a_user_agent = False
                        self._sitemaps.append(data)
                    elif field == "crawl-delay":
                        # Only Yahoo documents the syntax for Crawl-delay.
                        # ref: http://help.yahoo.com/l/us/yahoo/search/webcrawler/slurp-03.html
                        previous_line_was_a_user_agent = False
                        if current_ruleset:
                            try:
                                current_ruleset.crawl_delay = float(data)
                            except ValueError:
                                # Invalid crawl-delay -- ignore.
                                pass
                    else:
                        # This is a disallow line
                        previous_line_was_a_user_agent = False
                        if current_ruleset:
                            current_ruleset.add_disallow_rule(data)

    if current_ruleset and current_ruleset.is_not_empty():
        self.__rulesets.append(current_ruleset)

    # Now that I have all the rulesets, I want to order them in a way
    # that makes comparisons easier later. Specifically, any ruleset that
    # contains the default user agent '*' should go at the end of the list
    # so that I only apply the default as a last resort. According to
    # MK1994/96, there should only be one ruleset that specifies * as the
    # user-agent, but you know how these things go.
    not_defaults = [r for r in self.__rulesets if not r.is_default()]
    defaults = [r for r in self.__rulesets if r.is_default()]

    self.__rulesets = not_defaults + defaults
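parse() can also be fed robots.txt content obtained by other means, for example from the crawler's own HTTP layer; a minimal sketch with inline content (sitemaps is assumed to be an attribute-style "Read only" property):

    from dc_crawler.OwnRobots import RobotExclusionRulesParser

    parser = RobotExclusionRulesParser()
    parser.parse(
        "# comment-only lines are discarded per MK1994\n"
        "User-agent: *\n"
        "Disallow: /cgi-bin/\n"
        "Sitemap: http://example.com/sitemap.xml\n"    # hypothetical sitemap URL
    )

    print(parser.is_allowed("AnyBot", "/cgi-bin/test")) # False: matches the Disallow rule
    print(parser.sitemaps)                              # ['http://example.com/sitemap.xml']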

◆ response_code()

def dc_crawler.OwnRobots.RobotExclusionRulesParser.response_code (   self)
The remote server's response code. Read only.

Definition at line 318 of file OwnRobots.py.

def response_code(self):
    """The remote server's response code. Read only."""
    return self._response_code

◆ sitemap()

def dc_crawler.OwnRobots.RobotExclusionRulesParser.sitemap (   self)
Deprecated; use 'sitemaps' instead. Returns the sitemap URL present
in the robots.txt, if any. Defaults to None. Read only.

Definition at line 323 of file OwnRobots.py.

def sitemap(self):
    """Deprecated; use 'sitemaps' instead. Returns the sitemap URL present
    in the robots.txt, if any. Defaults to None. Read only."""
    _raise_error(DeprecationWarning, "The sitemap property is deprecated. Use 'sitemaps' instead.")

◆ sitemaps()

def dc_crawler.OwnRobots.RobotExclusionRulesParser.sitemaps (   self)
The sitemap URLs present in the robots.txt, if any. Defaults 
to an empty list. Read only.

Definition at line 329 of file OwnRobots.py.

def sitemaps(self):
    """The sitemap URLs present in the robots.txt, if any. Defaults
    to an empty list. Read only."""
    # I return a copy of the list so the caller can manipulate the list
    # without affecting self._sitemaps.
    return self._sitemaps[:]

◆ source_url()

def dc_crawler.OwnRobots.RobotExclusionRulesParser.source_url (   self)
The URL from which this robots.txt was fetched. Read only.

Definition at line 313 of file OwnRobots.py.

def source_url(self):
    """The URL from which this robots.txt was fetched. Read only."""
    return self._source_url

Member Data Documentation

◆ __rulesets

dc_crawler.OwnRobots.RobotExclusionRulesParser.__rulesets
private

Definition at line 309 of file OwnRobots.py.

◆ _response_code

dc_crawler.OwnRobots.RobotExclusionRulesParser._response_code
private

Definition at line 307 of file OwnRobots.py.

◆ _sitemaps

dc_crawler.OwnRobots.RobotExclusionRulesParser._sitemaps
private

Definition at line 308 of file OwnRobots.py.

◆ _source_url

dc_crawler.OwnRobots.RobotExclusionRulesParser._source_url
private

Definition at line 303 of file OwnRobots.py.

◆ expiration_date

dc_crawler.OwnRobots.RobotExclusionRulesParser.expiration_date

Definition at line 306 of file OwnRobots.py.

◆ use_local_time

dc_crawler.OwnRobots.RobotExclusionRulesParser.use_local_time

Definition at line 305 of file OwnRobots.py.
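use_local_time controls whether _now() and the computed expiration_date are expressed in local time or UTC (see the _now() and fetch() listings above); a small sketch of opting into UTC bookkeeping (URL hypothetical):

    from dc_crawler.OwnRobots import RobotExclusionRulesParser

    parser = RobotExclusionRulesParser()
    parser.use_local_time = False                    # keep expiration bookkeeping in UTC
    parser.fetch("http://example.com/robots.txt")    # hypothetical URL

    print(parser.expiration_date)                    # timestamp after which is_expired becomes True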

◆ user_agent

dc_crawler.OwnRobots.RobotExclusionRulesParser.user_agent

Definition at line 304 of file OwnRobots.py.


The documentation for this class was generated from the following file: OwnRobots.py