Public Member Functions
def	__init__ (self, content=None)

def	detect (self, content=None, contentType="html")

def	xmlCharsetDetector (self, fp, buff=None)

Public Attributes
	content

Detailed Description

Definition at line 1650 of file Fetcher.py.

Constructor & Destructor Documentation

◆ __init__()

def dc_crawler.Fetcher.SimpleCharsetDetector.__init__	(	self,
		content = `None`
	)

Definition at line 1653 of file Fetcher.py.

   def __init__(self, content=None):
     # content
     self.content = content
 

Member Function Documentation

◆ detect()

def dc_crawler.Fetcher.SimpleCharsetDetector.detect	(	self,
		content = `None`,
		contentType = `"html"`
	)

Definition at line 1657 of file Fetcher.py.

   def detect(self, content=None, contentType="html"):
     ret = None
 
     if content is None:
       cnt = self.content
     else:
       cnt = content
 
     try:
       if contentType == 'html':
         pattern = r'<meta(?!\s*(?:name|value)\s*=)(?:[^>]*?content\s*=[\s"\']*)?([^>]*?)[\s"\';]*charset\s*=[\s"\']*([^\s"\'/>]*)'  #  pylint: disable=C0301
         matchObj = re.search(pattern, cnt, re.I | re.M | re.S)
         if matchObj:
           ret = matchObj.group(2)
       elif contentType == 'xml':
         ret = self.xmlCharsetDetector(None, cnt)
 
     except Exception as err:
       logger.error("Exception: %s", str(err))
 
     if ret is not None and ret in CONSTS.charsetDetectorMap:
       logger.debug("Extracted wrong encoding '%s' from page replace to correct '%s'", ret,
                    CONSTS.charsetDetectorMap[ret])
       ret = CONSTS.charsetDetectorMap[ret]
 
     return ret
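
The HTML branch of detect() can be tried in isolation. The following is a minimal sketch, not the crawler's own code: it reuses the regular expression from the listing above, while the names detect_html_charset, META_CHARSET_PATTERN and ALIAS_MAP are hypothetical, and ALIAS_MAP is only a stand-in for CONSTS.charsetDetectorMap, whose real contents are not shown on this page. It is written to run on Python 3.

```python
import re

# Same pattern as in detect() above, split over two lines for readability.
META_CHARSET_PATTERN = (
    r'<meta(?!\s*(?:name|value)\s*=)(?:[^>]*?content\s*=[\s"\']*)?'
    r'([^>]*?)[\s"\';]*charset\s*=[\s"\']*([^\s"\'/>]*)'
)

# Hypothetical alias table standing in for CONSTS.charsetDetectorMap;
# the real mapping is an assumption here.
ALIAS_MAP = {"utf8": "utf-8"}


def detect_html_charset(html, alias_map=ALIAS_MAP):
    """Return the charset named in the first matching <meta> tag, or None."""
    match = re.search(META_CHARSET_PATTERN, html, re.I | re.M | re.S)
    if not match:
        return None
    charset = match.group(2)
    # Replace a known-bad spelling with its corrected name, if one is listed.
    return alias_map.get(charset, charset)


print(detect_html_charset('<meta charset="utf8">'))
print(detect_html_charset(
    '<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">'))
print(detect_html_charset('<meta name="description" content="no encoding here">'))
```

The negative lookahead after `<meta` rejects tags whose first attribute is name or value, which is why the third call finds no charset.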
 
 

◆ xmlCharsetDetector()

def dc_crawler.Fetcher.SimpleCharsetDetector.xmlCharsetDetector	(	self,
		fp,
		buff = `None`
	)

Attempts to detect the character encoding of the XML file
given by the file object fp. fp must not be a codec-wrapped file
object. If fp is None, BOM detection is skipped and buff is
searched for the XML declaration instead.

The return value can be:
- If BOM detection succeeds, the codec name of the
corresponding Unicode charset is returned.

- If BOM detection fails, the XML declaration is searched for
the encoding attribute and its value is returned. The "<"
character must then be the very first character in the file,
as the XML standard requires.

- If both BOM detection and the XML declaration fail, None is
returned. According to XML 1.0 the encoding should then be
utf_8, but that was not detected by the means offered here. At
least one can be fairly sure that a character encoding
including most of ASCII is in use.
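
The BOM step described above can be sketched on its own. This is an illustrative Python 3 variant, not the method's actual implementation: sniff_bom and BOMS are hypothetical names, and the byte patterns come from the standard codecs module instead of a hand-written table. It also copes with files shorter than four bytes, a case in which the tuple unpacking in the listing below would raise ValueError.

```python
import codecs
import io

# BOMs ordered so that the 4-byte UTF-32 marks are tested before the
# 2-byte UTF-16 marks they begin with (FF FE 00 00 vs. FF FE).
BOMS = [
    (codecs.BOM_UTF32_BE, "utf_32_be"),
    (codecs.BOM_UTF32_LE, "utf_32_le"),
    (codecs.BOM_UTF8, "utf_8"),
    (codecs.BOM_UTF16_BE, "utf_16_be"),
    (codecs.BOM_UTF16_LE, "utf_16_le"),
]


def sniff_bom(fp):
    """Return the codec name for the BOM at the start of fp, or None.

    fp is assumed to be a seekable binary file object; its position
    is restored before returning.
    """
    old_pos = fp.tell()
    fp.seek(0)
    head = fp.read(4)  # may be shorter than 4 bytes for tiny files
    fp.seek(old_pos)
    for bom, name in BOMS:
        if head.startswith(bom):
            return name
    return None


print(sniff_bom(io.BytesIO(codecs.BOM_UTF8 + b"<?xml version='1.0'?>")))
print(sniff_bom(io.BytesIO(b"<a/>")))
```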

Definition at line 1685 of file Fetcher.py.

   def xmlCharsetDetector(self, fp, buff=None):
     """ Attempts to detect the character encoding of the xml file
     given by a file object fp. fp must not be a codec wrapped file
     object!
 
     The return value can be:
         - if detection of the BOM succeeds, the codec name of the
         corresponding unicode charset is returned
 
         - if BOM detection fails, the xml declaration is searched for
         the encoding attribute and its value returned. the "<"
         character has to be the very first in the file then (it's xml
         standard after all).
 
         - if BOM and xml declaration fail, None is returned. According
         to xml 1.0 it should be utf_8 then, but it wasn't detected by
         the means offered here. at least one can be pretty sure that a
         character coding including most of ASCII is used :-/
     """
     # ## detection using BOM
 
     # # the BOMs we know, by their pattern
     bomDict = {  # bytepattern : name
              (0x00, 0x00, 0xFE, 0xFF) : "utf_32_be",
              (0xFF, 0xFE, 0x00, 0x00) : "utf_32_le",
              (0xFE, 0xFF, None, None) : "utf_16_be",
              (0xFF, 0xFE, None, None) : "utf_16_le",
              (0xEF, 0xBB, 0xBF, None) : "utf_8",
             }
 
     if fp is not None:
       # # go to beginning of file and get the first 4 bytes
       oldFP = fp.tell()
       fp.seek(0)
       (byte1, byte2, byte3, byte4) = tuple(map(ord, fp.read(4)))
 
       # # try bom detection using 4 bytes, 3 bytes, or 2 bytes
       bomDetection = bomDict.get((byte1, byte2, byte3, byte4))
       if not bomDetection :
           bomDetection = bomDict.get((byte1, byte2, byte3, None))
           if not bomDetection :
               bomDetection = bomDict.get((byte1, byte2, None, None))
 
       # # if BOM detected, we're done :-)
       if bomDetection :
           fp.seek(oldFP)
           return bomDetection
 
       # # still here? BOM detection failed.
       # #  now that BOM detection has failed we assume one byte character
       # #  encoding behaving ASCII - of course one could think of nice
       # #  algorithms further investigating on that matter, but I won't for now.
 
       # # assume xml declaration fits into the first 2 KB (*cough*)
       fp.seek(0)
       buff = fp.read(2048)
 
     # # set up regular expression
     xmlDeclPattern = r"""
     ^<\?xml             # w/o BOM, xmldecl starts with <?xml at the first byte
     .+?                 # some chars (version info), matched minimal
     encoding=           # encoding attribute begins
     ["']                # attribute start delimiter
     (?P<encstr>         # what's matched in the brackets will be named encstr
      [^"']+              # every character not delimiter (not overly exact!)
     )                   # closes the brackets pair for the named group
     ["']                # attribute end delimiter
     .*?                 # some chars optionally (standalone decl or whitespace)
     \?>                 # xmldecl end
     """
 
     xmlDeclRE = re.compile(xmlDeclPattern, re.VERBOSE)
 
     # # search and extract encoding string
     match = xmlDeclRE.search(buff)
     if fp is not None:
       fp.seek(oldFP)
     if match :
         return match.group("encstr")
     else :
         return None
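
The fallback to the XML declaration can likewise be exercised without a file object, which mirrors calling the method with fp set to None and the data passed in buff. The sketch below is an assumption-laden Python 3 adaptation: sniff_xml_decl and XML_DECL_RE are hypothetical names, the pattern is the one from the listing compiled as a bytes pattern, and the 2048-byte limit follows the listing's own assumption about where the declaration ends.

```python
import re

# Bytes version of the verbose pattern used in xmlCharsetDetector().
XML_DECL_RE = re.compile(rb"""
    ^<\?xml             # without a BOM, the declaration starts at byte 0
    .+?                 # version info, matched minimally
    encoding=           # encoding attribute begins
    ["']                # opening quote
    (?P<encstr>[^"']+)  # the encoding name itself
    ["']                # closing quote
    .*?                 # optional standalone declaration or whitespace
    \?>                 # end of the declaration
    """, re.VERBOSE)


def sniff_xml_decl(buff):
    """Return the encoding named in the XML declaration of buff, or None.

    buff is assumed to be a bytes object; only its first 2048 bytes
    are searched.
    """
    match = XML_DECL_RE.search(buff[:2048])
    if match is None:
        return None
    return match.group("encstr").decode("ascii", "replace")


print(sniff_xml_decl(b'<?xml version="1.0" encoding="ISO-8859-1"?><a/>'))
print(sniff_xml_decl(b'<?xml version="1.0"?><a/>'))
```

Because the pattern is anchored with ^, any whitespace or other bytes before `<?xml` make the search fail, which matches the requirement that "<" be the very first character.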
 

Member Data Documentation

◆ content

dc_crawler.Fetcher.SimpleCharsetDetector.content

Definition at line 1655 of file Fetcher.py.

The documentation for this class was generated from the following file:

sources/hce/dc_crawler/Fetcher.py
