HCE Project Python language Distributed Tasks Manager Application, Distributed Crawler Application and client API bindings. 2.0.0-chaika
Hierarchical Cluster Engine Python language binding
search_engine_parser Namespace Reference

Functions

def process (input_data)
 
def getContent (url)
 

Variables

 filename
 
 filemode
 
 logger = logging.getLogger("search_engine")
 
 output = process(input_url)
 

Detailed Description

HCE project, Python bindings, Distributed Tasks Manager application.
Search engine page fetch and scrape definitions.

@package: dc
@file search_engine_parser.py
@author Oleksii <developers.hce@gmail.com>
@link: http://hierarchical-cluster-engine.com/
@copyright: Copyright © 2013-2014 IOIX Ukraine
@license: http://hierarchical-cluster-engine.com/license/
@since: 0.1

Function Documentation

◆ getContent()

def search_engine_parser.getContent (url)

Definition at line 62 of file search_engine_parser.py.

62 def getContent(url):
63  # wget -S --no-check-certificate -U "Mozilla/5.0 (Windows; U; Windows NT 5.1; de; rv:1.9.2.3) Gecko/20100401 Firefox/3.6.3" "https://www.google.com/search?q=mac+os"
64  cmd = "wget -qO- -S --no-check-certificate -U 'Mozilla/5.0 (Windows; U; Windows NT 5.1; de; rv:1.9.2.3) Gecko/20100401 Firefox/3.6.3' '" + url + "'"
65  # cmd = "wget -qO- -S --no-check-certificate -U 'Mozilla/5.0 (Windows; U; Windows NT 5.1; de; rv:1.9.2.3) Gecko/20100401 Firefox/3.6.3' 'https://www.google.com/search?q=mac+os'"
66  process = Popen(cmd, stdout=PIPE, stdin=PIPE, stderr=PIPE, shell=True, close_fds=True)
67  (output, err) = process.communicate()
68  exit_code = process.wait()
69  # output = open("google.out", "rb").read()
70  # raw_html = output
71  # open("/tmp/google.out", "wb").write(output)
72  #logger.debug("Raw content output: %s", output)
73  # logger.debug("Raw content error: %s", str(err))
74  # print raw_html
75 
76  # headers = {}
77  # headers["User-Agent"] = "Mozilla/5.0 (Windows; U; Windows NT 5.1; de; rv:1.9.2.3) Gecko/20100401 Firefox/3.6.3"
78  # content = requests.get(url=url, headers=headers, verify=False)
79  # open("del.txt", "wb").write(output)
80  # logger.debug("request response: %s", content.text)
81  return output
82 
83 
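Note: getContent() shells out to wget, so wget must be installed on the host, and building the command by string concatenation leaves URL quoting entirely to the caller. As a point of comparison only, here is a minimal standard-library sketch of the same fetch, assuming Python 2 (which the commented-out print statement above suggests) and Python 2.7.9+ for the ssl context argument; get_content_stdlib is a hypothetical helper, not part of this module:

    import ssl
    import urllib2

    def get_content_stdlib(url):
        # Same User-Agent string that getContent() passes to wget via -U.
        headers = {"User-Agent": "Mozilla/5.0 (Windows; U; Windows NT 5.1; de; "
                                 "rv:1.9.2.3) Gecko/20100401 Firefox/3.6.3"}
        # An unverified context mirrors wget's --no-check-certificate flag.
        context = ssl._create_unverified_context()
        request = urllib2.Request(url, headers=headers)
        return urllib2.urlopen(request, context=context).read()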

◆ process()

def search_engine_parser.process (input_data)

Definition at line 38 of file search_engine_parser.py.

38 def process(input_data):
39  logger.debug("input: %s" % input_data)
40  splitted_data = input_data.split(',')
41  url = splitted_data[0]
42  site_id = "d57f144e7b26c9976769ea94f18b9064" if "google" in url else "1fe592caf03fd50c5f065c30f82b13bb"
43  #site_id = hashlib.md5(app.Utils.UrlParser.generateDomainUrl(url)).hexdigest()
44  logger.debug("site_id: %s" % str(site_id))
45  template = None
46  if len(splitted_data)==2:
47    template = splitted_data[1]
48  content = getContent(url)
49  lastModified = datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S")
50  input = ScraperInData(url, None, site_id, content, "", None, lastModified, None)
51  input_pickled_object = pickle.dumps(input)
52  #logger.debug("scraper input: %s", str(input_pickled_object))
53  cmd = "./scraper.py --config=../ini/scraper_search_engine.ini"
54  process = Popen(cmd, stdout=PIPE, stdin=PIPE, stderr=PIPE, shell=True, close_fds=True)
55  (output, err) = process.communicate(input=input_pickled_object)
56  logger.debug("scraper response output: %s", str(output))
57  logger.debug("scraper response error: %s", str(err))
58  exit_code = process.wait()
59  return output
60 
61 
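Usage sketch: process() takes one comma-separated string, "url" or "url,template". It fetches the page with getContent(), wraps it in a ScraperInData object, and pipes the pickled object to scraper.py over stdin; the return value is whatever scraper.py prints to stdout (the scraper response). The query URLs and template name below are illustrative only:

    # A Google URL selects the first hard-coded site_id hash,
    # any other URL the second.
    result = process("https://www.google.com/search?q=mac+os")

    # An optional second field after the comma is picked up as a template name.
    result = process("https://www.google.com/search?q=mac+os,some_template")

Note that because the input is split on every comma, a URL that itself contains a comma would be truncated at the first one.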

Variable Documentation

◆ filemode

search_engine_parser.filemode

Definition at line 33 of file search_engine_parser.py.

◆ filename

search_engine_parser.filename

Definition at line 33 of file search_engine_parser.py.

◆ logger

search_engine_parser.logger = logging.getLogger("search_engine")

Definition at line 34 of file search_engine_parser.py.
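Since filename and filemode are both defined on line 33, they are most likely keyword arguments captured from a single logging.basicConfig() call made just before the getLogger() call on line 34. A plausible reconstruction, with assumed values (the actual log file name, mode, and level are not shown on this page):

    import logging

    # Assumed values; only the keyword names are documented above.
    logging.basicConfig(filename="search_engine_parser.log", filemode="w",
                        level=logging.DEBUG)
    logger = logging.getLogger("search_engine")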

◆ output

search_engine_parser.output = process(input_url)

Definition at line 86 of file search_engine_parser.py.
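The assignment output = process(input_url) runs at module level (line 86), so the file doubles as a script: obtain a URL, run the fetch-and-scrape pipeline, and leave the scraper response in output. Where input_url comes from is not shown on this page; the stdin read below is purely an assumption, written in the Python 2 style the module uses:

    import sys

    # Hypothetical driver; reading the URL from stdin is an assumption.
    input_url = sys.stdin.read().strip()
    output = process(input_url)
    print output  # Python 2 print statement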