dc_crawler.OwnRobots.RobotExclusionRulesParser Class Reference

Public Member Functions

def __init__ (self)
 
def source_url (self)
 
def response_code (self)
 
def sitemap (self)
 
def sitemaps (self)
 
def is_expired (self)
 
def is_allowed (self, user_agent, url, syntax=GYM2008)
 
def get_crawl_delay (self, user_agent)
 
def fetch (self, url, timeout=None)
 
def parse (self, s)
 
def __str__ (self)
 
def __unicode__ (self)
 

Public Attributes

 user_agent
 
 use_local_time
 
 expiration_date
 

Private Member Functions

def _now (self)
 

Private Attributes

 _source_url
 
 _response_code
 
 _sitemaps
 
 __rulesets
 

Detailed Description

A parser for robots.txt files.

Definition at line 300 of file OwnRobots.py.
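For context, a minimal usage sketch (the URL and user-agent string below are hypothetical; the import assumes the class is reachable as dc_crawler.OwnRobots.RobotExclusionRulesParser, as documented above):

    from dc_crawler.OwnRobots import RobotExclusionRulesParser

    parser = RobotExclusionRulesParser()
    parser.user_agent = "MyCrawler/1.0"              # hypothetical crawler identity, sent as the User-Agent header
    parser.fetch("http://example.com/robots.txt")    # hypothetical robots.txt URL

    # Ask whether this user agent may visit a given URL on that site.
    if parser.is_allowed("MyCrawler/1.0", "http://example.com/private/page.html"):
        print("allowed to crawl")
    else:
        print("disallowed by robots.txt")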

Constructor & Destructor Documentation

◆ __init__()

def dc_crawler.OwnRobots.RobotExclusionRulesParser.__init__ (   self)

Definition at line 302 of file OwnRobots.py.

def __init__(self):
    self._source_url = ""
    self.user_agent = None
    self.use_local_time = True
    self.expiration_date = self._now() + SEVEN_DAYS
    self._response_code = 0
    self._sitemaps = [ ]
    self.__rulesets = [ ]

Member Function Documentation

◆ __str__()

def dc_crawler.OwnRobots.RobotExclusionRulesParser.__str__ (   self)

Definition at line 651 of file OwnRobots.py.

def __str__(self):
    s = self.__unicode__()
    if PY_MAJOR_VERSION == 2:
        s = s.encode("utf-8")

    return s

◆ __unicode__()

def dc_crawler.OwnRobots.RobotExclusionRulesParser.__unicode__ (   self)

Definition at line 658 of file OwnRobots.py.

def __unicode__(self):
    if self._sitemaps:
        s = "Sitemaps: %s\n\n" % self._sitemaps
    else:
        s = ""
    if PY_MAJOR_VERSION < 3:
        s = unicode(s)
    # I also need to string-ify each ruleset. The function for doing so
    # varies under Python 2/3.
    stringify = (unicode if (PY_MAJOR_VERSION == 2) else str)
    return s + '\n'.join( [stringify(ruleset) for ruleset in self.__rulesets] )
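As the two listings above show, str() on a parser concatenates the sitemap list (if any) with the string form of every parsed ruleset, which is convenient for debugging; a small sketch (the rules are illustrative):

    from dc_crawler.OwnRobots import RobotExclusionRulesParser

    parser = RobotExclusionRulesParser()
    parser.parse("User-agent: *\nDisallow: /tmp/\n")
    print(str(parser))    # dumps the rulesets reconstructed from the parsed rules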

◆ _now()

def dc_crawler.OwnRobots.RobotExclusionRulesParser._now (   self)
private

Definition at line 344 of file OwnRobots.py.

def _now(self):
    if self.use_local_time:
        return time.time()
    else:
        # What the heck is timegm() doing in the calendar module?!?
        return calendar.timegm(time.gmtime())

◆ fetch()

def dc_crawler.OwnRobots.RobotExclusionRulesParser.fetch (self, url, timeout=None)
Attempts to fetch the URL requested which should refer to a 
robots.txt file, e.g. http://example.com/robots.txt.

Definition at line 399 of file OwnRobots.py.

def fetch(self, url, timeout=None):
    """Attempts to fetch the URL requested which should refer to a
    robots.txt file, e.g. http://example.com/robots.txt.
    """

    # ISO-8859-1 is the default encoding for text files per the specs for
    # HTTP 1.0 (RFC 1945 sec 3.6.1) and HTTP 1.1 (RFC 2616 sec 3.7.1).
    # ref: http://www.w3.org/Protocols/rfc2616/rfc2616-sec3.html#sec3.7.1
    encoding = "iso-8859-1"
    content = ""
    expires_header = None
    content_type_header = None
    self._response_code = 0
    self._source_url = url

    if self.user_agent:
        req = urllib_request.Request(url, None, { 'User-Agent' : self.user_agent })
    else:
        req = urllib_request.Request(url)

    try:
        if timeout:
            f = urllib_request.urlopen(req, timeout=timeout)
        else:
            f = urllib_request.urlopen(req)

        content = f.read(MAX_FILESIZE)
        # As of Python 2.5, f.info() looks like it returns the HTTPMessage
        # object created during the connection.
        expires_header = f.info().get("expires")
        content_type_header = f.info().get("Content-Type")
        # As of Python 2.4, this file-like object reports the response
        # code, too.
        if hasattr(f, "code"):
            self._response_code = f.code
        else:
            self._response_code = 200
        f.close()
    except urllib_error.URLError:
        # This is a slightly convoluted way to get the error instance,
        # but it works under Python 2 & 3.
        error_instance = sys.exc_info()
        if len(error_instance) > 1:
            error_instance = error_instance[1]
        if hasattr(error_instance, "code"):
            self._response_code = error_instance.code

    # MK1996 section 3.4 says, "...robots should take note of Expires
    # header set by the origin server. If no cache-control directives
    # are present robots should default to an expiry of 7 days".

    # This code is lazy and looks at the Expires header but not
    # Cache-Control directives.
    self.expiration_date = None
    if self._response_code >= 200 and self._response_code < 300:
        # All's well.
        if expires_header:
            self.expiration_date = email_utils.parsedate_tz(expires_header)

            if self.expiration_date:
                # About time zones -- the call to parsedate_tz() returns a
                # 10-tuple with the time zone offset in the 10th element.
                # There are 3 valid formats for HTTP dates, and one of
                # them doesn't contain time zone information. (UTC is
                # implied since all HTTP header dates are UTC.) When given
                # a date that lacks time zone information, parsedate_tz()
                # returns None in the 10th element. mktime_tz() interprets
                # None in the 10th (time zone) element to mean that the
                # date is *local* time, not UTC.
                # Therefore, if the HTTP timestamp lacks time zone info
                # and I run that timestamp through parsedate_tz() and pass
                # it directly to mktime_tz(), I'll get back a local
                # timestamp which isn't what I want. To fix this, I simply
                # convert a time zone of None to zero. It's much more
                # difficult to explain than to fix. =)
                # ref: http://www.w3.org/Protocols/rfc2616/rfc2616-sec3.html#sec3.3.1
                if self.expiration_date[9] == None:
                    self.expiration_date = self.expiration_date[:9] + (0,)

                self.expiration_date = email_utils.mktime_tz(self.expiration_date)
                if self.use_local_time:
                    # I have to do a little more converting to get this
                    # UTC timestamp into localtime.
                    self.expiration_date = time.mktime(time.gmtime(self.expiration_date))
            #else:
                # The expires header was garbage.

    if not self.expiration_date: self.expiration_date = self._now() + SEVEN_DAYS

    if (self._response_code >= 200) and (self._response_code < 300):
        # All's well.
        media_type, encoding = _parse_content_type_header(content_type_header)
        # RFC 2616 sec 3.7.1 --
        # When no explicit charset parameter is provided by the sender,
        # media subtypes of the "text" type are defined to have a default
        # charset value of "ISO-8859-1" when received via HTTP.
        # http://www.w3.org/Protocols/rfc2616/rfc2616-sec3.html#sec3.7.1
        if not encoding:
            encoding = "iso-8859-1"
    elif self._response_code in (401, 403):
        # 401 or 403 ==> Go away or I will taunt you a second time!
        # (according to MK1996)
        content = "User-agent: *\nDisallow: /\n"
    elif self._response_code == 404:
        # No robots.txt ==> everyone's welcome
        content = ""
    else:
        # Uh-oh. I punt this up to the caller.
        _raise_error(urllib_error.URLError, self._response_code)

    if ((PY_MAJOR_VERSION == 2) and isinstance(content, str)) or \
       ((PY_MAJOR_VERSION > 2) and (not isinstance(content, str))):
        # This ain't Unicode yet! It needs to be.

        # Unicode decoding errors are another point of failure that I punt
        # up to the caller.
        try:
            content = content.decode(encoding)
        except UnicodeError:
            _raise_error(UnicodeError,
                "Robots.txt contents are not in the encoding expected (%s)." % encoding)
        except (LookupError, ValueError):
            # LookupError ==> Python doesn't have a decoder for that encoding.
            # One can also get a ValueError here if the encoding starts with
            # a dot (ASCII 0x2e). See Python bug 1446043 for details. This
            # bug was supposedly fixed in Python 2.5.
            _raise_error(UnicodeError, "I don't understand the encoding \"%s\"." % encoding)

    # Now that I've fetched the content and turned it into Unicode, I
    # can parse it.
    self.parse(content)
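A hedged sketch of a typical fetch() call (the URL and timeout value are illustrative; response_code and source_url are the "Read only" accessors documented below and are assumed here to be attribute-style properties rather than plain methods):

    from dc_crawler.OwnRobots import RobotExclusionRulesParser

    parser = RobotExclusionRulesParser()
    parser.user_agent = "MyCrawler/1.0"                         # optional; sent as the User-Agent header

    # fetch() downloads and parses the file. Per the listing above, a 401/403
    # response is treated as "Disallow: /" and a 404 as an empty robots.txt.
    parser.fetch("http://example.com/robots.txt", timeout=10)   # hypothetical URL

    print(parser.response_code)   # HTTP status of the robots.txt request (assumed property)
    print(parser.source_url)      # URL the rules were fetched from (assumed property)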

◆ get_crawl_delay()

def dc_crawler.OwnRobots.RobotExclusionRulesParser.get_crawl_delay (self, user_agent)
Returns a float representing the crawl delay specified for this 
user agent, or None if the crawl delay was unspecified or not a float.

Definition at line 384 of file OwnRobots.py.

def get_crawl_delay(self, user_agent):
    """Returns a float representing the crawl delay specified for this
    user agent, or None if the crawl delay was unspecified or not a float.
    """
    # See is_allowed() comment about the explicit unicode conversion.
    if (PY_MAJOR_VERSION < 3) and (not isinstance(user_agent, unicode)):
        user_agent = user_agent.decode()

    for ruleset in self.__rulesets:
        if ruleset.does_user_agent_match(user_agent):
            return ruleset.crawl_delay

    return None
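A short sketch of honoring the crawl delay between requests (the robots.txt content and user-agent name are illustrative):

    import time

    from dc_crawler.OwnRobots import RobotExclusionRulesParser

    parser = RobotExclusionRulesParser()
    parser.parse("User-agent: *\nCrawl-delay: 2.5\nDisallow: /private/\n")

    delay = parser.get_crawl_delay("MyCrawler/1.0")   # 2.5 from the wildcard ruleset above
    if delay is not None:
        time.sleep(delay)                             # wait the number of seconds the site requested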

◆ is_allowed()

def dc_crawler.OwnRobots.RobotExclusionRulesParser.is_allowed (self, user_agent, url, syntax=GYM2008)
True if the user agent is permitted to visit the URL. The syntax 
parameter can be GYM2008 (the default) or MK1996 for strict adherence 
to the traditional standard.

Definition at line 352 of file OwnRobots.py.

def is_allowed(self, user_agent, url, syntax=GYM2008):
    """True if the user agent is permitted to visit the URL. The syntax
    parameter can be GYM2008 (the default) or MK1996 for strict adherence
    to the traditional standard.
    """
    # The robot rules are stored internally as Unicode. The two lines
    # below ensure that the parameters passed to this function are
    # also Unicode. If those lines were not present and the caller
    # passed a non-Unicode user agent or URL string to this function,
    # Python would silently convert it to Unicode before comparing it
    # to the robot rules. Such conversions use the default encoding
    # (usually US-ASCII) and if the string couldn't be converted using
    # that encoding, Python would raise a UnicodeError later on in the
    # guts of this code which would be confusing.
    # Converting the strings to Unicode here doesn't make the problem
    # go away but it does make the conversion explicit so that
    # failures are easier to understand.
    if not isinstance(user_agent, unicode):
        user_agent = user_agent.decode()
    if not isinstance(url, unicode):
        url = url.decode()

    if syntax not in (MK1996, GYM2008):
        _raise_error(ValueError, "Syntax must be MK1996 or GYM2008")

    for ruleset in self.__rulesets:
        if ruleset.does_user_agent_match(user_agent):
            return ruleset.is_url_allowed(url, syntax)

    return True
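A sketch of the effect of the syntax argument (the rules and user agent are illustrative; GYM2008 and MK1996 are assumed to be importable module-level constants of OwnRobots, as the default value above suggests):

    from dc_crawler.OwnRobots import RobotExclusionRulesParser, GYM2008, MK1996

    parser = RobotExclusionRulesParser()
    parser.parse("User-agent: *\nDisallow: /*.pdf$\n")

    # GYM2008 (the default) treats '*' and '$' as wildcards; MK1996 takes them literally.
    print(parser.is_allowed("MyCrawler/1.0", "/docs/report.pdf"))           # False under GYM2008
    print(parser.is_allowed("MyCrawler/1.0", "/docs/report.pdf", MK1996))   # True, no literal match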

◆ is_expired()

def dc_crawler.OwnRobots.RobotExclusionRulesParser.is_expired (   self)
True if the difference between now and the last call to fetch()
exceeds the robots.txt expiration. Read only.

Definition at line 337 of file OwnRobots.py.

def is_expired(self):
    """True if the difference between now and the last call to fetch()
    exceeds the robots.txt expiration. Read only.
    """
    return self.expiration_date <= self._now()
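A sketch of a cache-refresh check built on this expiration logic (the URL is hypothetical; is_expired is assumed to be an attribute-style "Read only" property, so it is not called with parentheses here):

    from dc_crawler.OwnRobots import RobotExclusionRulesParser

    robots_url = "http://example.com/robots.txt"   # hypothetical URL
    parser = RobotExclusionRulesParser()
    parser.fetch(robots_url)

    # ... later, before the next crawl pass ...
    if parser.is_expired:
        # The Expires header (or the seven-day default) has elapsed; refresh the rules.
        parser.fetch(robots_url)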

◆ parse()

def dc_crawler.OwnRobots.RobotExclusionRulesParser.parse (self, s)
Parses the passed string as a set of robots.txt rules.

Definition at line 532 of file OwnRobots.py.

def parse(self, s):
    """Parses the passed string as a set of robots.txt rules."""
    self._sitemaps = []
    self.__rulesets = []

    if ((PY_MAJOR_VERSION > 2) and (isinstance(s, bytes) or isinstance(s, bytearray))) or \
       ((PY_MAJOR_VERSION == 2) and (not isinstance(s, unicode))):
        s = s.decode("iso-8859-1")

    # Normalize newlines.
    s = _end_of_line_regex.sub("\n", s)

    lines = s.split("\n")

    previous_line_was_a_user_agent = False
    current_ruleset = None

    for line in lines:
        line = line.strip()

        if line and line[0] == '#':
            # "Lines containing only a comment are discarded completely,
            # and therefore do not indicate a record boundary." (MK1994)
            pass
        else:
            # Remove comments
            i = line.find("#")
            if i != -1: line = line[:i]

            line = line.strip()

            if not line:
                # An empty line indicates the end of a ruleset.
                if current_ruleset and current_ruleset.is_not_empty():
                    self.__rulesets.append(current_ruleset)

                current_ruleset = None
                previous_line_was_a_user_agent = False
            else:
                # Each non-empty line falls into one of six categories:
                # 1) User-agent: blah blah blah
                # 2) Disallow: blah blah blah
                # 3) Allow: blah blah blah
                # 4) Crawl-delay: blah blah blah
                # 5) Sitemap: blah blah blah
                # 6) Everything else
                # 1 - 5 are interesting and I find them with the regex
                # below. Category 6 I discard as directed by the MK1994
                # ("Unrecognised headers are ignored.")
                # Note that 4 & 5 are specific to GYM2008 syntax, but
                # respecting them here is not a problem. They're just
                # additional information that the caller is free to ignore.
                matches = _directive_regex.findall(line)

                # Categories 1 - 5 produce two matches, #6 produces none.
                if matches:
                    field, data = matches[0]
                    field = field.lower()
                    data = _scrub_data(data)

                    # Matching "useragent" is a deviation from the
                    # MK1994/96 which permits only "user-agent".
                    if field in ("useragent", "user-agent"):
                        if previous_line_was_a_user_agent:
                            # Add this UA to the current ruleset
                            if current_ruleset and data:
                                current_ruleset.add_robot_name(data)
                        else:
                            # Save the current ruleset and start a new one.
                            if current_ruleset and current_ruleset.is_not_empty():
                                self.__rulesets.append(current_ruleset)
                            #else:
                                # (is_not_empty() == False) ==> malformed
                                # robots.txt listed a UA line but provided
                                # no name or didn't provide any rules
                                # for a named UA.
                            current_ruleset = _Ruleset()
                            if data:
                                current_ruleset.add_robot_name(data)

                        previous_line_was_a_user_agent = True
                    elif field == "allow":
                        previous_line_was_a_user_agent = False
                        if current_ruleset:
                            current_ruleset.add_allow_rule(data)
                    elif field == "sitemap":
                        previous_line_was_a_user_agent = False
                        self._sitemaps.append(data)
                    elif field == "crawl-delay":
                        # Only Yahoo documents the syntax for Crawl-delay.
                        # ref: http://help.yahoo.com/l/us/yahoo/search/webcrawler/slurp-03.html
                        previous_line_was_a_user_agent = False
                        if current_ruleset:
                            try:
                                current_ruleset.crawl_delay = float(data)
                            except ValueError:
                                # Invalid crawl-delay -- ignore.
                                pass
                    else:
                        # This is a disallow line
                        previous_line_was_a_user_agent = False
                        if current_ruleset:
                            current_ruleset.add_disallow_rule(data)

    if current_ruleset and current_ruleset.is_not_empty():
        self.__rulesets.append(current_ruleset)

    # Now that I have all the rulesets, I want to order them in a way
    # that makes comparisons easier later. Specifically, any ruleset that
    # contains the default user agent '*' should go at the end of the list
    # so that I only apply the default as a last resort. According to
    # MK1994/96, there should only be one ruleset that specifies * as the
    # user-agent, but you know how these things go.
    not_defaults = [r for r in self.__rulesets if not r.is_default()]
    defaults = [r for r in self.__rulesets if r.is_default()]

    self.__rulesets = not_defaults + defaults
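parse() can also be fed robots.txt content obtained by other means, for example from the crawler's own HTTP layer; a minimal sketch with inline content (sitemaps is assumed to be an attribute-style "Read only" property):

    from dc_crawler.OwnRobots import RobotExclusionRulesParser

    parser = RobotExclusionRulesParser()
    parser.parse(
        "# comment-only lines are discarded per MK1994\n"
        "User-agent: *\n"
        "Disallow: /cgi-bin/\n"
        "Sitemap: http://example.com/sitemap.xml\n"    # hypothetical sitemap URL
    )

    print(parser.is_allowed("AnyBot", "/cgi-bin/test")) # False: matches the Disallow rule
    print(parser.sitemaps)                              # ['http://example.com/sitemap.xml']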

◆ response_code()

def dc_crawler.OwnRobots.RobotExclusionRulesParser.response_code (   self)
The remote server's response code. Read only.

Definition at line 318 of file OwnRobots.py.

def response_code(self):
    """The remote server's response code. Read only."""
    return self._response_code

◆ sitemap()

def dc_crawler.OwnRobots.RobotExclusionRulesParser.sitemap (   self)
Deprecated; use 'sitemaps' instead. Returns the sitemap URL present
in the robots.txt, if any. Defaults to None. Read only.

Definition at line 323 of file OwnRobots.py.

def sitemap(self):
    """Deprecated; use 'sitemaps' instead. Returns the sitemap URL present
    in the robots.txt, if any. Defaults to None. Read only."""
    _raise_error(DeprecationWarning, "The sitemap property is deprecated. Use 'sitemaps' instead.")

◆ sitemaps()

def dc_crawler.OwnRobots.RobotExclusionRulesParser.sitemaps (   self)
The sitemap URLs present in the robots.txt, if any. Defaults 
to an empty list. Read only.

Definition at line 329 of file OwnRobots.py.

def sitemaps(self):
    """The sitemap URLs present in the robots.txt, if any. Defaults
    to an empty list. Read only."""
    # I return a copy of the list so the caller can manipulate the list
    # without affecting self._sitemaps.
    return self._sitemaps[:]

◆ source_url()

def dc_crawler.OwnRobots.RobotExclusionRulesParser.source_url (   self)
The URL from which this robots.txt was fetched. Read only.

Definition at line 313 of file OwnRobots.py.

def source_url(self):
    """The URL from which this robots.txt was fetched. Read only."""
    return self._source_url

Member Data Documentation

◆ __rulesets

dc_crawler.OwnRobots.RobotExclusionRulesParser.__rulesets
private

Definition at line 309 of file OwnRobots.py.

◆ _response_code

dc_crawler.OwnRobots.RobotExclusionRulesParser._response_code
private

Definition at line 307 of file OwnRobots.py.

◆ _sitemaps

dc_crawler.OwnRobots.RobotExclusionRulesParser._sitemaps
private

Definition at line 308 of file OwnRobots.py.

◆ _source_url

dc_crawler.OwnRobots.RobotExclusionRulesParser._source_url
private

Definition at line 303 of file OwnRobots.py.

◆ expiration_date

dc_crawler.OwnRobots.RobotExclusionRulesParser.expiration_date

Definition at line 306 of file OwnRobots.py.

◆ use_local_time

dc_crawler.OwnRobots.RobotExclusionRulesParser.use_local_time

Definition at line 305 of file OwnRobots.py.
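use_local_time controls whether _now() and the computed expiration_date are expressed in local time or UTC (see the _now() and fetch() listings above); a small sketch of opting into UTC bookkeeping (URL hypothetical):

    from dc_crawler.OwnRobots import RobotExclusionRulesParser

    parser = RobotExclusionRulesParser()
    parser.use_local_time = False                    # keep expiration bookkeeping in UTC
    parser.fetch("http://example.com/robots.txt")    # hypothetical URL

    print(parser.expiration_date)                    # timestamp after which is_expired becomes True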

◆ user_agent

dc_crawler.OwnRobots.RobotExclusionRulesParser.user_agent

Definition at line 304 of file OwnRobots.py.


The documentation for this class was generated from the following file: OwnRobots.py