HCE Project Python language Distributed Tasks Manager Application, Distributed Crawler Application and client API bindings.  2.0.0-chaika
Hierarchical Cluster Engine Python language binding
dc_crawler.OwnRobots._Ruleset Class Reference

Public Member Functions

def __init__ (self)
 
def __str__ (self)
 
def __unicode__ (self)
 
def add_robot_name (self, bot)
 
def add_allow_rule (self, path)
 
def add_disallow_rule (self, path)
 
def is_not_empty (self)
 
def is_default (self)
 
def does_user_agent_match (self, user_agent)
 
def is_url_allowed (self, url, syntax=GYM2008)
 

Public Attributes

 robot_names
 
 rules
 
 crawl_delay
 

Static Public Attributes

int ALLOW = 1
 
int DISALLOW = 2
 

Detailed Description

_Ruleset represents a set of allow/disallow rules (and possibly a
crawl delay) that apply to a set of user agents.

Users of this module don't need this class. It's available at the module
level only because RobotExclusionRulesParser() instances can't be
pickled if _Ruleset isn't visible at the module level.

Definition at line 187 of file OwnRobots.py.

Constructor & Destructor Documentation

◆ __init__()

def dc_crawler.OwnRobots._Ruleset.__init__ (   self)

Definition at line 198 of file OwnRobots.py.

def __init__(self):
    self.robot_names = [ ]
    self.rules = [ ]
    self.crawl_delay = None


Member Function Documentation

◆ __str__()

def dc_crawler.OwnRobots._Ruleset.__str__ (   self)

Definition at line 203 of file OwnRobots.py.

def __str__(self):
    s = self.__unicode__()
    if PY_MAJOR_VERSION == 2:
        s = s.encode("utf-8")

    return s


◆ __unicode__()

def dc_crawler.OwnRobots._Ruleset.__unicode__ (   self)

Definition at line 210 of file OwnRobots.py.

def __unicode__(self):
    d = { self.ALLOW : "Allow", self.DISALLOW : "Disallow" }

    s = ''.join( ["User-agent: %s\n" % name for name in self.robot_names] )

    if self.crawl_delay:
        s += "Crawl-delay: %s\n" % self.crawl_delay

    s += ''.join( ["%s: %s\n" % (d[rule_type], path) for rule_type, path in self.rules] )

    return s

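As a rough illustration of the serialization above, the following standalone sketch rebuilds the same robots.txt text from plain values. The function name render_ruleset and the module-level constants are hypothetical stand-ins introduced for this example; they are not part of OwnRobots.py.

```python
# Hypothetical stand-ins for _Ruleset.ALLOW and _Ruleset.DISALLOW.
ALLOW = 1
DISALLOW = 2


def render_ruleset(robot_names, rules, crawl_delay=None):
    """Render a ruleset as robots.txt text, mirroring __unicode__ above."""
    labels = {ALLOW: "Allow", DISALLOW: "Disallow"}
    lines = ["User-agent: %s\n" % name for name in robot_names]
    # A crawl delay of None (or 0) is falsy and is therefore omitted.
    if crawl_delay:
        lines.append("Crawl-delay: %s\n" % crawl_delay)
    lines.extend("%s: %s\n" % (labels[kind], path) for kind, path in rules)
    return "".join(lines)


text = render_ruleset(
    ["examplebot"],
    [(DISALLOW, "/private/"), (ALLOW, "/")],
    crawl_delay=5,
)
print(text)
```

Note that the output order is fixed: user-agent lines first, then the optional crawl delay, then the rules in the order they were added.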

◆ add_allow_rule()

def dc_crawler.OwnRobots._Ruleset.add_allow_rule (   self,
  path 
)

Definition at line 225 of file OwnRobots.py.

def add_allow_rule(self, path):
    self.rules.append((self.ALLOW, _unquote_path(path)))

References _unquote_path() (definition at line 142 of file OwnRobots.py).

◆ add_disallow_rule()

def dc_crawler.OwnRobots._Ruleset.add_disallow_rule (   self,
  path 
)

Definition at line 228 of file OwnRobots.py.

def add_disallow_rule(self, path):
    self.rules.append((self.DISALLOW, _unquote_path(path)))

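Both rule methods store a (type, path) tuple after passing the path through _unquote_path(), and is_url_allowed() applies the same helper to the URL before comparing. The body of _unquote_path() is not shown on this page, so the sketch below assumes plain percent-decoding via urllib.parse.unquote as a stand-in; the name normalize and the constants are hypothetical.

```python
from urllib.parse import unquote

ALLOW, DISALLOW = 1, 2  # hypothetical stand-ins for the class constants


def normalize(path):
    # Assumption: plain percent-decoding stands in for _unquote_path(),
    # whose real implementation may treat some characters differently.
    return unquote(path)


# Rules are kept in insertion order as (type, decoded path) tuples.
rules = []
rules.append((DISALLOW, normalize("/my%20files/")))
rules.append((ALLOW, normalize("/")))

# After normalization an encoded URL path lines up with the decoded rule.
matches_first_rule = normalize("/my%20files/report.html").startswith(rules[0][1])
print(rules, matches_first_rule)
```

Decoding both sides the same way is what lets an encoded URL and a rule written with a literal character compare equal.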

◆ add_robot_name()

def dc_crawler.OwnRobots._Ruleset.add_robot_name (   self,
  bot 
)

Definition at line 222 of file OwnRobots.py.

def add_robot_name(self, bot):
    self.robot_names.append(bot)


◆ does_user_agent_match()

def dc_crawler.OwnRobots._Ruleset.does_user_agent_match (   self,
  user_agent 
)

Definition at line 237 of file OwnRobots.py.

def does_user_agent_match(self, user_agent):
    match = False

    for robot_name in self.robot_names:
        # MK1994 says, "A case insensitive substring match of the name
        # without version information is recommended." MK1996 3.2.1
        # states it even more strongly: "The robot must obey the first
        # record in /robots.txt that contains a User-Agent line whose
        # value contains the name token of the robot as a substring.
        # The name comparisons are case-insensitive."
        match = match or (robot_name == '*') or (robot_name.lower() in user_agent.lower())

    return match

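The matching rule above can be tried in isolation with a small standalone function. This is only a sketch of the same test (wildcard name, or case-insensitive substring); user_agent_matches is a hypothetical name and takes the robot names as an argument instead of reading them from an instance.

```python
def user_agent_matches(robot_names, user_agent):
    """Return True if any robot name is '*' or a case-insensitive substring."""
    agent = user_agent.lower()
    return any(name == "*" or name.lower() in agent for name in robot_names)


full_agent = "Mozilla/5.0 (compatible; examplebot/2.1)"
print(user_agent_matches(["ExampleBot"], full_agent))
print(user_agent_matches(["otherbot"], full_agent))
print(user_agent_matches(["*"], full_agent))
```

Because the comparison is a substring test, a short robot name such as "bot" would also match this user agent string.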

◆ is_default()

def dc_crawler.OwnRobots._Ruleset.is_default (   self)

Definition at line 234 of file OwnRobots.py.

def is_default(self):
    return bool('*' in self.robot_names)


◆ is_not_empty()

def dc_crawler.OwnRobots._Ruleset.is_not_empty (   self)

Definition at line 231 of file OwnRobots.py.

def is_not_empty(self):
    return bool(len(self.rules)) and bool(len(self.robot_names))


◆ is_url_allowed()

def dc_crawler.OwnRobots._Ruleset.is_url_allowed (   self,
  url,
  syntax = GYM2008 
)

Definition at line 251 of file OwnRobots.py.

def is_url_allowed(self, url, syntax=GYM2008):
    allowed = True

    # Schemes and host names are not part of the robots.txt protocol,
    # so I ignore them. It is the caller's responsibility to make
    # sure they match.
    _, _, path, parameters, query, fragment = urllib_urlparse(url)
    url = urllib_urlunparse(("", "", path, parameters, query, fragment))

    url = _unquote_path(url)

    done = False
    i = 0
    while not done:
        rule_type, path = self.rules[i]

        if (syntax == GYM2008) and ("*" in path or path.endswith("$")):
            # GYM2008-specific syntax applies here
            # http://www.google.com/support/webmasters/bin/answer.py?hl=en&answer=40360
            if path.endswith("$"):
                appendix = "$"
                path = path[:-1]
            else:
                appendix = ""
            parts = path.split("*")
            pattern = "%s%s" % (".*".join([re.escape(p) for p in parts]), appendix)
            if re.match(pattern, url):
                # Ding!
                done = True
                allowed = (rule_type == self.ALLOW)
        else:
            # Wildcards are either not present or are taken literally.
            if url.startswith(path):
                # Ding!
                done = True
                allowed = (rule_type == self.ALLOW)
                # A blank path means "nothing", so that effectively
                # negates the value above.
                # e.g. "Disallow: " means allow everything
                if not path:
                    allowed = not allowed

        i += 1
        if i == len(self.rules):
            done = True

    return allowed

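The listing above combines two ideas: translating a GYM2008 rule path ("*" wildcards, optional trailing "$") into a regular expression, and stopping at the first rule that matches. The sketch below shows both for the GYM2008 case only. The names gym2008_pattern and first_match_allowed are hypothetical, rules are given as (is_allow, path) pairs instead of the class constants, and the URL is assumed to be already reduced to its decoded path.

```python
import re


def gym2008_pattern(path):
    """Translate a rule path with '*' and a trailing '$' into a regex."""
    anchored = path.endswith("$")
    if anchored:
        path = path[:-1]
    # Escape the literal pieces, then let each '*' match any run of characters.
    body = ".*".join(re.escape(part) for part in path.split("*"))
    return body + ("$" if anchored else "")


def first_match_allowed(rules, url_path):
    """Apply the first matching (is_allow, path) rule."""
    for is_allow, path in rules:
        if "*" in path or path.endswith("$"):
            if re.match(gym2008_pattern(path), url_path):
                return is_allow
        elif url_path.startswith(path):
            # An empty path matches everything and inverts the rule type,
            # so an empty Disallow rule allows everything.
            return is_allow if path else not is_allow
    # Assumption of this sketch: no matching rule (or no rules) means allowed.
    return True


rules = [(False, "/*.pdf$"), (True, "/docs/"), (False, "/")]
for candidate in ("/files/a.pdf", "/docs/index.html", "/other"):
    print(candidate, first_match_allowed(rules, candidate))
```

Since evaluation stops at the first match, more specific rules have to be added before broader ones such as "/".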

Member Data Documentation

◆ ALLOW

int dc_crawler.OwnRobots._Ruleset.ALLOW = 1
static

Definition at line 195 of file OwnRobots.py.

◆ crawl_delay

dc_crawler.OwnRobots._Ruleset.crawl_delay

Definition at line 201 of file OwnRobots.py.

◆ DISALLOW

int dc_crawler.OwnRobots._Ruleset.DISALLOW = 2
static

Definition at line 196 of file OwnRobots.py.

◆ robot_names

dc_crawler.OwnRobots._Ruleset.robot_names

Definition at line 199 of file OwnRobots.py.

◆ rules

dc_crawler.OwnRobots._Ruleset.rules

Definition at line 200 of file OwnRobots.py.


The documentation for this class was generated from the following file: