HCE Project Python language Distributed Tasks Manager Application, Distributed Crawler Application and client API bindings.  2.0.0-chaika
Hierarchical Cluster Engine Python language binding
dc_crawler.Fetcher.URLLibFetcher Class Reference

Public Member Functions

def open (self, url, **kwargs)
 
- Public Member Functions inherited from dc_crawler.Fetcher.BaseFetcher
def __init__ (self)
 
def open (self, url, method='get', headers=None, timeout=100, allow_redirects=True, proxies=None, auth=None, data=None, log=None, allowed_content_types=None, max_resource_size=None, max_redirects=CONSTS.MAX_HTTP_REDIRECTS_LIMIT, filters=None, executable_path=None, depth=None, macro=None)
 
def should_have_meta_res (self)
 
def getDomainNameFromURL (self, url, default='')
 

Additional Inherited Members

- Static Public Member Functions inherited from dc_crawler.Fetcher.BaseFetcher
def init (dbWrapper=None, siteId=None)
 
def get_fetcher (typ, dbWrapper=None, siteId=None)
 
- Public Attributes inherited from dc_crawler.Fetcher.BaseFetcher
 connectionTimeout
 
 logger
 
- Static Public Attributes inherited from dc_crawler.Fetcher.BaseFetcher
 fetchers = None
 
int TYP_NORMAL = 1
 
int TYP_DYNAMIC = 2
 
int TYP_URLLIB = 5
 
int TYP_CONTENT = 6
 
int TYP_AUTO = 7
 
float CONNECTION_TIMEOUT = 1.0
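
The inherited members above suggest that BaseFetcher keeps a class-level `fetchers` registry that `init()` fills and `get_fetcher()` reads by type constant. The following is only a minimal sketch of that pattern, not the actual Fetcher.py code: `BaseFetcherSketch` and `URLLibFetcherSketch` are hypothetical stand-ins, the lazy call to `init()` is an assumption, and only the `TYP_*` values are taken from the listing.

```python
class BaseFetcherSketch(object):
    """Hypothetical stand-in for BaseFetcher; not the real implementation."""

    # Type constants copied from the attribute listing above.
    TYP_NORMAL = 1
    TYP_DYNAMIC = 2
    TYP_URLLIB = 5
    TYP_CONTENT = 6
    TYP_AUTO = 7

    # Registry of fetcher instances keyed by type; None until init() runs.
    fetchers = None

    @staticmethod
    def init(dbWrapper=None, siteId=None):
        # Assumption: init() builds one instance per supported type.
        # Only the urllib-based fetcher is registered in this sketch.
        BaseFetcherSketch.fetchers = {
            BaseFetcherSketch.TYP_URLLIB: URLLibFetcherSketch(),
        }

    @staticmethod
    def get_fetcher(typ, dbWrapper=None, siteId=None):
        # Assumption: the registry is created lazily on first lookup.
        if BaseFetcherSketch.fetchers is None:
            BaseFetcherSketch.init(dbWrapper, siteId)
        if typ not in BaseFetcherSketch.fetchers:
            raise ValueError("no fetcher registered for type %r" % (typ,))
        return BaseFetcherSketch.fetchers[typ]


class URLLibFetcherSketch(BaseFetcherSketch):
    """Hypothetical stand-in for URLLibFetcher."""

    def open(self, url, **kwargs):
        # Placeholder body; the real method performs the HTTP request.
        return "opened %s" % url


fetcher = BaseFetcherSketch.get_fetcher(BaseFetcherSketch.TYP_URLLIB)
print(type(fetcher).__name__)
```

With a registry like this, callers select a fetcher by constant and never instantiate a concrete class directly.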
 

Detailed Description

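Fetcher that retrieves a URL with urllib2.urlopen() and fills a Response object only when the response's content type is listed in allowed_content_types.

Definition at line 1511 of file Fetcher.py.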

Member Function Documentation

◆ open()

def dc_crawler.Fetcher.URLLibFetcher.open (   self,
  url,
  **kwargs 
)

Definition at line 1523 of file Fetcher.py.

  def open(self, url, **kwargs):
    import urllib2

    # Use the caller's logger when one is supplied, else the module-level logger.
    if 'logger' in kwargs:
      log = kwargs['logger']
    else:
      log = logger
    allowed_content_types = kwargs['allowed_content_types']
    # max_resource_size = kwargs["max_resource_size"]

    res = Response()
    log.debug("url: <%s>", url)
    response = None
    try:
      response = urllib2.urlopen(url)
      headers_info = response.info()
      if headers_info is not None:
        # Read the body only when the content type is on the allow list.
        if headers_info.type in allowed_content_types:
          if response is not None:
            # res.encoding = impl_res.encoding
            # res.cookies = requests.utils.dict_from_cookiejar(impl_res.cookies)
            res.url = response.geturl()
            res.status_code = response.getcode()
            content_response = response.read()
            res.unicode_content = content_response
            res.str_content = content_response
            res.rendered_unicode_content = content_response
            res.content_size = len(content_response)
            headers = {}
            headers["content-length"] = res.content_size
            headers["content-type"] = headers_info.type
            res.headers = headers
            history = []
            res.redirects = history
          else:
            log.debug("URLLib return empty response.")
        else:
          log.debug("Content-Type not allowed. headers_info.type: %s", str(headers_info.type))
      else:
        log.debug("URLLib info is empty.")
    except urllib2.HTTPError as err:
      # except Exception, err:
      log.debug("Exception <%s>", str(err.code))

    # On any failure path the Response is returned unfilled.
    return res
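
The listing above targets Python 2 (`urllib2`). For readers on Python 3, the same content-type gate can be sketched with `urllib.request`. This is an illustrative sketch only and is not part of Fetcher.py: `SimpleResponse` and `fetch_if_allowed` are hypothetical names, and `SimpleResponse` carries just a subset of the fields that the original code sets on `Response`.

```python
import logging
import urllib.error
import urllib.request

log = logging.getLogger(__name__)


class SimpleResponse(object):
    """Hypothetical stand-in for the crawler's Response class."""

    def __init__(self):
        self.url = None
        self.status_code = None
        self.str_content = None
        self.content_size = 0
        self.headers = {}
        self.redirects = []


def fetch_if_allowed(url, allowed_content_types, timeout=10):
    """Fetch url and fill a SimpleResponse only for allowed content types."""
    res = SimpleResponse()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            # Python 3 equivalent of headers_info.type in urllib2.
            content_type = response.headers.get_content_type()
            if content_type not in allowed_content_types:
                log.debug("Content-Type not allowed: %s", content_type)
                return res
            body = response.read()
            res.url = response.geturl()
            res.status_code = response.getcode()
            res.str_content = body
            res.content_size = len(body)
            res.headers = {
                "content-length": res.content_size,
                "content-type": content_type,
            }
    except urllib.error.HTTPError as err:
        # As in the original, only HTTP errors are caught and logged.
        log.debug("HTTP error <%s>", err.code)
    return res


# A data: URL exercises the function without network access.
result = fetch_if_allowed("data:text/plain,hello", ["text/plain"])
print(result.content_size, result.headers["content-type"])
```

As in the original method, a disallowed content type or an HTTP error yields an unfilled response object rather than an exception, so callers need to check fields such as `str_content` before use.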

The documentation for this class was generated from the following file: