Associative Search Engine
The Associative Search Engine (referred to below as ASM) is a distributed computational cluster system that integrates several modern technologies: multi-threaded, highly specialized binary applications, a web server, script-based web applications, a relational SQL database, the Linux OS, and many specialized and well-known protocols, data formats, algorithms and technologies.
The main purpose of ASM is to serve as the kernel of thematic web search systems, such as thematic web portals with powerful full-text search and flexible, fast data crawling. It implements all the components and subsystems needed to deploy a natural web search engine:
- Multi-threaded distributed incremental indexation subsystem for very deep web sites with a large number of pages. It consists of a set of multi-threaded crawler applications (including a dedicated image crawler) and indexers that implement high-performance, flexibly configurable crawling and indexation processes.
- Distributed index storage engine: a subsystem of search machines/nodes for fast access to indexed data and for search, using many optimized, highly specialized algorithms. It implements a pipeline (conveyor) architecture for processing search requests and uses hybrid incremental multi-parametric full-text search algorithms that mix typified search, fuzzy logic and elements of artificial intelligence to balance the quality of result ranking against the time needed to answer a search query.
- Distributed textual data repository storage engine that implements a set of algorithms for storing and quickly accessing indexed and searched text fragments, used for client-side visualization of search results.
- Multi-threaded high-performance text-mining subsystem that implements a set of multi-level text parsers, including template-based semantic web algorithms, and so on.
- Linguistic kernel subsystem that implements morphological analysis, structural analysis and normalization of words and phrases. It is based on algorithms for reconstructing word paradigms, detecting languages and analyzing short phrases.
- Hierarchical multi-threaded search query handler subsystem that processes client-side search queries, interacts with the search machines/nodes of the distributed index storage, collects their result sets, sorts and classifies them according to the relevancy rules, merges and filters them into one array, and returns it to the client after applying custom-defined template substitutions, with pagination and caching. This high-performance subsystem makes it possible to build vertical, hierarchically structured systems that unite sets of single ASM clusters into a huge tree-structured network.
- Highly specialized service applications, such as the “Related Words” service, which returns thematic pairs of words from indexed content ordered by popularity, and the “Resource Data” service, which provides fast direct access (without a search process) to the data of indexed web pages.
- Backend management application, implemented as a regular DB-driven web application, that executes all administration tasks and provides a set of client-side interfaces for web portals, for tools such as widgets.
The key feature of ASM is its search and indexation algorithms for a distributed computing cluster. Because this is an associative search engine, its methods of candidate search and selection, relevancy calculation and result ordering are completely different from those used by global web search engines such as Google. They are based on analysis of word sequences and morphological attributes, and do not use pre-calculated rates such as citation or cross-linking to select candidates and order results. Instead, semantic-web-based search is mixed with parametric typified search ordering, and elements of artificial intelligence are used both to filter results and to avoid a full scan of candidates in the manner of an SQL database. These complex hybrid algorithms can be tuned in many ways and by many criteria, but the main principle is always to show at the top the resources that are textually closest to the search query: those that contain the largest number of the longest chains of searched words, in forms closest to the phrase the user entered; that contain them in the resource title and in the bolded text of the page's source HTML; and that maximize detection of associated data such as quantitative and qualitative attributes.
The set of search optimization algorithms increases search performance, allows the system to process hundreds of searches per second, and maximizes the interest rate of the returned resources. ASM makes it possible to create groups of sites for indexation and to search across all sites in a group, a single site, a domain name, or all sites in one installation's data center. A group of sites is tied to a user's account, which makes a multi-user indexation environment possible.
Template-based indexation makes it possible to index only defined areas of a web page, given as a set of marked fragments in a template. Each site can have many different templates, each with many fragments defined.
Extended indexation algorithms can separate the significant and negligible parts of a web page. This reduces the noise from words that come from frequently repeated page areas such as menus, headers and footers. Significant or negligible parts can be searched separately, or can be ignored and not indexed.
The index storage supports eight main and eight extended content sources and a set of special attributes. Main content sources are typified, and each of them stores data of a particular kind, such as URL, HREFs, TITLE, H and other parts of a regular web page. Extended content sources are integers or logical bits/masks and can store any numbers obtained by direct detection and conversion or by additional complex algorithms such as mapped indexation.
ASM's indexation and search support detection of many textual formats such as HTML, XML, RSS, PDF and plain text; graphical formats such as JPG, PNG and GIF images; video content in regular HTML pages; and raw SQL database data. In addition, an external data filtration gateway can extend the data sources to any other kind, including local files, network sources and third-party application data. Crawling and indexation have more than fifty configuration parameters, which allow flexible settings for each indexed site.
The template-based search response generation subsystem can produce responses in any textual format, such as HTML, XML, JSON or plain text, and even in most binary formats. The network subsystem uses fast, high-performance socket polling, which allows it to handle a huge number of simultaneous client connections.
HTTP-based networking supports HTTP 1.0 and 1.1, chunked transfer encoding, and gzip and deflate compression, and mostly acts as a regular web server in client-side interactions.
The linguistic subsystem supports multiple languages (up to 32 simultaneously). Currently implemented are English, German, French, Russian, Ukrainian and Japanese as basic dictionaries, with extended morphological analysis support for Russian, English and Japanese. The integrated administration web application provides many automated administration actions and statistical reports on internal and external system activity.
The main idea of resource data indexing in ASM is word associations in the resource context. Search queries are processed by search machines that work with the distributed data repository. The results are then sorted by relevancy, combined from different parts of the distributed repository, filtered by the filter criteria and formatted for the client response. Finally, the results are returned to the user as a response containing a list of resource-related data such as the link, title, part or all of the source web page content, image and video links, and so on.
Crawling and processing
Web crawling is, naturally, the process of traversing the web and fetching resources from it. An implementation of crawling can, however, include additional kinds of processing or pre-processing of the fetched data.
ASM supports two different kinds of crawling and indexation: natural web and typified template-based.
In the natural web-crawling mode, the ASM crawler supports several main sources of textual data from the source web page. These sources are:
- Title text – content of the HTML TITLE tag
- H text – content of the HTML Hn tags
- Alt text – content of the HTML ALT and TITLE attributes
- Body text – content of the HTML BODY tag
- URL text – content of the URLs captured from the page
- Keywords – content of the HTML META tag named KEYWORDS
- Description – content of the HTML META tag named DESCRIPTION
In the template-based crawling mode, ASM can detect and evaluate eight extended numerical fields. These fields can be used as extended search criteria in combinations of additional conditions such as equal, less, greater and bitwise. A field can hold any kind of single numerical value, such as a quantity, number, time, date, string Id/CRC or counter, or a set/mask of one or more bits.
During the crawling process, the web page content is prepared for indexing and split into several parts by text source. Then, during the search process, occurrences of searched words are included in the relevancy calculations according to the order of the sources listed above. So if a searched term is found in the Title text of some resources and in the Body text of others, the resources with the occurrence in the title are moved up in the order and displayed first.
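The source-priority rule described above can be sketched as follows. This is a minimal illustration, not ASM's actual algorithm: the names `SOURCE_PRIORITY`, `best_source_rank` and `order_by_source` are hypothetical, resources are modeled as plain dictionaries, and the priority order is simply assumed to follow the list of sources given earlier.

```python
# Assumed priority order, copied from the list of text sources above.
SOURCE_PRIORITY = ["title", "h", "alt", "body", "url", "keywords", "description"]


def best_source_rank(term, resource):
    """Return the index of the highest-priority source that contains the term."""
    for rank, source in enumerate(SOURCE_PRIORITY):
        if term in resource.get(source, "").lower().split():
            return rank
    return len(SOURCE_PRIORITY)  # the term was not found in any source


def order_by_source(term, resources):
    """Sort resources so that matches in higher-priority sources come first."""
    return sorted(resources, key=lambda r: best_source_rank(term, r))


pages = [
    {"id": 1, "body": "the quick fox"},
    {"id": 2, "title": "Fox facts"},
]
ranked = order_by_source("fox", pages)
```

Here the page with "fox" in its title is placed before the page that only mentions it in the body.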
The next part of the preparation is dictionary normalization and word index calculation.
Normalization is the transformation of a word from its source form to a normalized one, as a result of which words take a more common form that exists in the main dictionary. Part of this process is language-dependent processing, such as splitting Japanese text into words. (At present the ASM dictionary uses only simple dictionary-based, template-oriented algorithms, but in the future deep morphological linguistic analysis will also be used.)
Indexation is the most important part of the ASM engine. It includes pre-relevance calculations for each resource and text source, analysis and calculation of relations and frequencies, and so on. As a result, the index data and calculated parameters are stored in the repository and become accessible to the search machines for the search process.
Search requests are accepted by the search handler and, after preparation similar to that done for web page content, are transferred to the search machines.
ASM currently supports two forms of search request, simple and complex, and several different search algorithms.
The simple form of a query contains no expressions or operators, only the searched terms. In this case all terms are searched with AND logic. For example, by default and unless specified otherwise, the query:
quick brown fox
will find the resources that contain all three words. Relevance is calculated according to the following ranking of criteria:
- all words in the “Title text” source
- maximum length of word chains (maximizing the count of words with the minimal distance between them)
- words in a highly rated textual data source
- maximum count of searched words in the resource
- maximum frequency or value of extended fields
and the resources found are ranked by relevance index and presented as an ordered list. This type of search applies a blacklist (stop-word list) to many English words such as a, and, or, etc., and skips dictionary words of one or two characters. (A “dictionary word” is a word that exists in the main ASM dictionary.)
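The term preparation for a simple query can be sketched as below. This is only an illustration under stated assumptions: `STOP_WORDS` and `DICTIONARY` are tiny stand-ins for ASM's real blacklist and main dictionary (including "the" in the blacklist is itself an assumption), and the function names are hypothetical.

```python
STOP_WORDS = {"a", "and", "or", "the"}  # illustrative subset, not the real blacklist
DICTIONARY = {"a", "an", "ox", "quick", "brown", "fox"}  # stand-in for the main dictionary


def prepare_simple_query(query):
    """Drop stop words and one- or two-character dictionary words."""
    terms = []
    for word in query.lower().split():
        if word in STOP_WORDS:
            continue
        if len(word) <= 2 and word in DICTIONARY:
            continue
        terms.append(word)
    return terms


def matches_all(terms, text):
    """AND logic: every prepared term must occur in the text."""
    words = set(text.lower().split())
    return all(term in words for term in terms)


terms = prepare_simple_query("The quick brown fox and an ox")
```

Short words that are not in the dictionary (for example an unknown two-letter code) are kept, which matches the remark that only dictionary words are skipped.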
Fast search – changes the main search algorithm described above and searches using only a limited set of simple criteria, such as the presence of the searched terms in certain main content source fields, the frequencies of the searched terms, and so on, without using detailed information about chains and sequences of words in context. This method still preserves the priority of main content sources, such as an occurrence in the title over an occurrence in the body of the web page. The method is called “fast” because it does not use data typically located on disk, only memory-resident data, and as a result is processed very rapidly, without unpredictable delays caused by OS I/O.
Include single words – changes the main search algorithm so that it uses only the single-word occurrence condition instead of relations between words and word chains. In this case resources with at least one searched word are included in the results.
Quoted text is searched “as is”, without skipping blacklisted words. In this case only resources containing the full text are added to the results, but the relevance index calculation algorithm is the same. So resources with the searched phrase in the HTML document title are rated higher than those with it in the body, those with the searched phrase in an HTML Hn tag higher than those with it in regular text, and so on. Multiple quoted phrases are combined with AND logic.
Logical operators
Logical operators combine the searched terms and phrases into expressions. The following logical operators are supported:
“&” – logical AND (spaces are treated as AND by default, when the “include single words” modifier is Off). This operator means that all terms must be found in the same resource.
“|” – logical OR (spaces are treated as OR when the “include single words” modifier is On). This operator means that at least one of the terms must be present in the resource.
“-” – logical NOT. All logical NOTs are combined into one list of terms that is used as a condition to skip resources containing these terms. NOT is a unary operator that precedes the term or phrase.
“+” – ADD operator. It means that the term following the “+” sign must be deliberately included in the search process, ignoring the default blacklist, stop-word and filter rules. Blacklisted words can be included this way.
Examples of complex requests:
+The quick “brown fox”
“brown fox” | “lazy dog”
“brown fox jumps over” ! “lazy dog”
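How a request with quoted phrases and the unary “+” and “-” operators might be broken into parts can be sketched as follows. This is a simplified illustration, not ASM's parser: it assumes straight double quotes, uses a hypothetical stop-word subset, and deliberately leaves out the binary “&” and “|” operators because the text does not state their precedence.

```python
import re

STOP_WORDS = {"a", "and", "or", "the"}  # illustrative subset, not the real blacklist

# An optional +/- prefix followed by either a quoted phrase or a bare word.
TOKEN = re.compile(r'([+-]?)(?:"([^"]*)"|(\S+))')


def parse_query(query):
    """Split a query into plain terms, quoted phrases and excluded items."""
    parsed = {"terms": [], "phrases": [], "excluded": []}
    for prefix, phrase, word in TOKEN.findall(query):
        text = (phrase or word).lower()
        if not text:
            continue  # ignore empty quotes
        if prefix == "-":
            parsed["excluded"].append(text)  # NOT list, used to skip resources
        elif phrase:
            parsed["phrases"].append(text)  # quoted text is searched as is
        elif prefix == "+" or text not in STOP_WORDS:
            parsed["terms"].append(text)  # "+" overrides the stop-word list
    return parsed


parsed = parse_query('+The quick "brown fox" -"lazy dog" the')
```

In this sketch the first "The" survives because of its “+” prefix, while the trailing "the" is dropped as a stop word.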
Distance – an additional condition that imposes stricter rules on the search process and helps select words from a more appropriate area of the context. Distances can be set between any two words or for all words in the search request.
If the distance is set for a pair (or pairs) of words, it limits the maximum number of other words between the two searched words in the context. This limit is applied not as a filter on results but during resource selection, so only resources that satisfy the distance limit are chosen.
If the distance is set for all words in the request, it limits the total count of words between all words in the request. This helps to find a more compact localization of the searched words and maximizes word grouping.
Distance syntax example:
Brown <5> fox jumps over <15> “lazy dog”
This request means that the words “Brown” and “fox” must have no more than five words between them, and the words “over” and “lazy” must have no more than 15 words between them. All other words can be located anywhere in the context.
Brown fox jumps over +the lazy dog <25>
In the example above, all words must be located within a span of about 25 words.
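Both kinds of distance condition can be sketched on a tokenized text. This is a minimal illustration with hypothetical function names; it assumes the pair distance is independent of word order and that the overall distance is measured as the length of the shortest span containing every searched word, neither of which the description states explicitly.

```python
def within_pair_distance(words, first, second, max_between):
    """True if some occurrences of the two words have at most max_between words between them."""
    first_positions = [i for i, w in enumerate(words) if w == first]
    second_positions = [i for i, w in enumerate(words) if w == second]
    return any(
        i != j and abs(j - i) - 1 <= max_between
        for i in first_positions
        for j in second_positions
    )


def smallest_window(words, terms):
    """Length in words of the shortest span containing every term, or None."""
    needed = set(terms)
    if not needed:
        return 0
    counts = {}
    best = None
    left = 0
    for right, word in enumerate(words):
        if word in needed:
            counts[word] = counts.get(word, 0) + 1
        # Shrink the window from the left while it still covers all terms.
        while len(counts) == len(needed):
            span = right - left + 1
            if best is None or span < best:
                best = span
            leftmost = words[left]
            if leftmost in counts:
                counts[leftmost] -= 1
                if counts[leftmost] == 0:
                    del counts[leftmost]
            left += 1
    return best


text = "the brown and very quick fox jumps".split()
```

A resource would then satisfy an overall limit such as `<25>` when `smallest_window` returns a value within that limit.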
Content source – a filter criterion that selects only resources in which the searched words occur in the selected main text sources. While the search system crawls a resource, its content is split into several types according to the HTML tag the text came from:
- title – text from HTML TITLE tags
- keyword – text from the HTML META “keywords” tag attribute
- description – text from the HTML META “description” tag attribute
- H – text from HTML H tags
- alt – text from the “alt” and “title” attributes of the HTML IMG and A tags
- reference URL – text from the href and src attributes of the HTML A, IMG, FRAME and other tags, including the resource's own URL
- body – text from the other parts of the HTML document
Each type of source text is accumulated in one separate content source. So, for example, all text from H tags (H1-H6) is concatenated into one long text sequence and can be searched separately. The user can choose any combination of content source types to search. For each site the user defines the list of supported content sources. Data from unlisted content sources is indexed as body.
Search in dedicated content fields
ASM supports eight main textual and eight extended numeric or logical content fields. Each field can be searched with its own keyword set or search string, in combination with the logical NOT operation. This makes it possible to combine complex logical conditions and to both include and exclude words, for example to require a word in the title and exclude it from the document body.
Another use of search in content fields is the use of extended numeric fields in combination with mapped indexation (MI). Qualitative attributes detected during the indexation process can be used in search. Several operations are available: equal, less, greater, and masked bit AND and bit OR. Masked bit operations act on the set of bits defined by a mask. Typically each bit represents one qualitative attribute and can serve as a criterion to include a resource in the search results or exclude it from them.
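The conditions on extended numeric fields can be sketched as a small table of operations. This is an illustration only: the operation names, the `(field index, operation, operand)` condition format and the reading of masked AND as "all mask bits set" and masked OR as "at least one mask bit set" are assumptions, not ASM's documented semantics.

```python
import operator


def masked_and(value, mask):
    """Assumed meaning: every bit of the mask is set in the value."""
    return (value & mask) == mask


def masked_or(value, mask):
    """Assumed meaning: at least one bit of the mask is set in the value."""
    return (value & mask) != 0


OPERATIONS = {
    "eq": operator.eq,
    "lt": operator.lt,
    "gt": operator.gt,
    "and": masked_and,
    "or": masked_or,
}


def matches(fields, conditions):
    """Check every (field index, operation, operand) condition against a resource."""
    return all(OPERATIONS[op](fields[index], operand) for index, op, operand in conditions)


# Eight extended fields: field 0 holds a quantity, field 1 a set of attribute bits.
fields = [120, 0b0101, 0, 0, 0, 0, 0, 0]
```

For example, `matches(fields, [(0, "lt", 200), (1, "and", 0b0101)])` accepts the resource, while a mask that needs a bit the resource lacks rejects it.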
Document types – this filter selects only resources with the specified HTTP MIME types. For each site the user defines the list of supported HTTP MIME types. Data of unsupported MIME types is not indexed.
At present only the PDF document type has its own dedicated parsing algorithm. Resources of other document types are parsed as HTML text or plain text.
Languages – a filter that selects only resources that match a set of languages, called the language mask. While a resource is crawled, a language mask (set of languages) is detected and set for each word of the content. When searching, the user can specify a set of languages. If the user chooses English, the search system takes this to mean that the user wants only resources in English. For the other supported languages, a resource is included if it has even one of the languages chosen for the search. So the English language criterion works with AND logic, and all the others with OR logic. English is treated differently because almost all sites contain some English text.
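One possible reading of the language rule is sketched below. It is an interpretation rather than the documented behavior: the bit assignments and function name are hypothetical, the mask is kept per resource instead of per word for brevity, and "AND logic" for English is assumed to mean that the resource must contain English only.

```python
# Hypothetical bit assignments for the currently implemented languages.
ENGLISH = 1 << 0
GERMAN = 1 << 1
FRENCH = 1 << 2
RUSSIAN = 1 << 3
UKRAINIAN = 1 << 4
JAPANESE = 1 << 5


def passes_language_filter(resource_mask, request_mask):
    """Apply OR logic to non-English languages and a strict match to English."""
    other_languages = request_mask & ~ENGLISH
    if resource_mask & other_languages:
        return True  # the resource has at least one requested non-English language
    if request_mask & ENGLISH:
        return resource_mask == ENGLISH  # assumed: English-only resources
    return False
```

Under this reading, a page mixing Russian and English is returned for a Russian search but not for an English one.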
Date – a filter that selects only resources added within a given time period. The period can be defined by one or two time bounds, and resources are filtered by whether their add date falls in the range.
Site and user – filters that select only resources belonging to the corresponding site or user, by specifying the site or user Id in the request; specifying a user Id restricts the search to the resources of the sites that belong to that user. By default, if no Id is specified, the search is done across all resources. The user Id can be replaced with the prefix of the domain name of the server where the search engine resides.
Similarity – a filter that eliminates similar resources from the list of resources found. A similar resource is one whose title or body content is fully or partially identical to that of a resource already added to the results list. All resources with the same title are collected in a separate list (which can be displayed indented), and the number of resources with different bodies in this list is limited.
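The grouping behind the similarity filter can be sketched as follows. This is a deliberately reduced illustration: it only detects exactly identical titles and bodies (the partial matching mentioned above is not attempted), and the function name and the `max_per_title` parameter are hypothetical.

```python
def filter_similar(resources, max_per_title=3):
    """Group results by title, drop duplicate bodies and cap each group."""
    groups = {}
    for resource in resources:
        group = groups.setdefault(resource["title"], [])
        if any(resource["body"] == kept["body"] for kept in group):
            continue  # same title and same body: treated as a duplicate
        if len(group) < max_per_title:
            group.append(resource)
    return groups


found = [
    {"title": "News", "body": "a"},
    {"title": "News", "body": "a"},
    {"title": "News", "body": "b"},
    {"title": "News", "body": "c"},
    {"title": "About", "body": "a"},
]
grouped = filter_similar(found, max_per_title=2)
```

Each group can then be shown as one main result with the remaining same-title resources indented below it.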
Results number, pagination and results cache
The number of resources returned by a search can be limited and ranged by page number and page size. The search process differs significantly from selecting records in a regular SQL database. Because of these differences, the number of result candidates is in most cases larger than the number of resources actually returned: filtering and selection by the various criteria are applied after selection by word occurrences, which is the main part of the full-text search process. The response cache stores the results of a search query to avoid repeating the full-text search for the same or a very similar query. The search cache can be filled step by step as the client requests subsequent pages of results, or it can be pre-filled with the maximum possible number of results after the first search query of a given kind. Pre-filling is optional.
Because search processing acts as an incremental search, the number of potential results can grow step by step as the client requests subsequent pages. This approach gives the search system an additional way to decrease load and free some resources of the computational unit; when the client side requires the exact number of search results, the second method of cache filling (pre-fill) can be used instead. Also, for searches on very popular words, search systems usually return not all possible results but a predefined maximum, which can be less than 5% of the candidates. The cache pre-fill method also works this way, but pre-fills the cache with the maximum possible number of items per query.
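The two cache-fill strategies can be sketched with a small class. This is a hypothetical model, not ASM's cache: the `ResultCache` name, the `search(query, offset, limit)` callback signature and the default sizes are all assumptions made for illustration.

```python
class ResultCache:
    """Caches search results per query, filled incrementally or in one pre-fill."""

    def __init__(self, search, page_size=10, max_results=100):
        self.search = search  # assumed callable: (query, offset, limit) -> list
        self.page_size = page_size
        self.max_results = max_results  # predefined maximum returned per query
        self.cache = {}

    def page(self, query, number):
        """Return page `number` (1-based), searching only for missing results."""
        results = self.cache.setdefault(query, [])
        needed = min(number * self.page_size, self.max_results)
        if len(results) < needed:
            results.extend(self.search(query, len(results), needed - len(results)))
        start = (number - 1) * self.page_size
        return results[start:needed]

    def prefill(self, query):
        """Fill the cache with the maximum number of results in one search."""
        self.cache[query] = list(self.search(query, 0, self.max_results))
        return len(self.cache[query])


calls = []


def fake_search(query, offset, limit):
    """Stand-in search backend over 35 numbered results; records each call."""
    calls.append((offset, limit))
    return list(range(35))[offset:offset + limit]


cache = ResultCache(fake_search, page_size=10, max_results=30)
first = cache.page("fox", 1)
second = cache.page("fox", 2)
again = cache.page("fox", 1)  # served from the cache, no new search
```

With incremental filling the backend is asked only for the results not yet cached, while `prefill` trades one larger search for an exact result count up to the predefined maximum.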