Love Shanghai Lee: An Overview of the Search Engine Index System


As everyone knows, the main stages of a search engine are crawling, storage, page analysis, indexing, and retrieval. Over the past few weeks we briefly introduced the crawling stage. Today we briefly introduce the index system. Finding specific keywords in a library of billions of pages is like looking for a needle in the sea. Given enough time the search might complete, but users cannot afford to wait: from a user-experience perspective, we must return satisfactory results within milliseconds, otherwise users will simply leave. How can this requirement be met?

(3) Once the preparatory work above is complete, the next step is to build the inverted index, forming the mapping {term → doc}. It can be roughly understood through one question: why is the mapping [term → doc], rather than a direct [doc → term]?
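A minimal sketch of this step, assuming each page has already been reduced to a list of terms. The document IDs, the sample terms, and the function name build_inverted_index are illustrative assumptions and do not come from the official announcement.

```python
from collections import defaultdict


def build_inverted_index(forward_index):
    """Invert {doc_id: [term, ...]} into {term: sorted list of doc_ids}."""
    postings = defaultdict(set)
    for doc_id, terms in forward_index.items():
        for term in terms:
            postings[term].add(doc_id)
    # Sorted posting lists make later intersection cheap.
    return {term: sorted(doc_ids) for term, doc_ids in postings.items()}


# Hypothetical forward index (doc -> term), as produced by steps (1) and (2).
forward_index = {
    1: ["search", "engine", "index"],
    2: ["inverted", "index"],
    3: ["search", "index", "retrieval"],
}

inverted_index = build_inverted_index(forward_index)
print(inverted_index["search"])  # [1, 3]
print(inverted_index["index"])   # [1, 2, 3]
```

With only doc → term, answering a query means scanning every page for the word; with term → doc, the pages containing a word are found with a single lookup.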

(2) The segmentation process actually includes word splitting, word segmentation, synonym conversion, synonym replacement, and so on. Taking the segmentation of a page's title as an example, the data obtained will be: term text, termId,


(1) Page analysis is actually the process of recognizing and labeling the different parts of the original page, such as: title, keywords, content, link, anchor, comments, other non-important areas, and so on;
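As a rough illustration of what labeling page regions could look like, the sketch below uses Python's standard html.parser module. The class name ZoneLabeler, the choice of zones, and the sample page are assumptions made for this example; the announcement does not describe how the real analysis is implemented.

```python
from html.parser import HTMLParser


class ZoneLabeler(HTMLParser):
    """Collect text from a few page zones: title, keywords, link, anchor, content."""

    def __init__(self):
        super().__init__()
        self.zones = {"title": [], "keywords": [], "link": [],
                      "anchor": [], "content": []}
        self._open = []  # stack of currently open <title> / <a> tags

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and (attrs.get("name") or "").lower() == "keywords":
            self.zones["keywords"].append(attrs.get("content") or "")
        if tag == "a" and attrs.get("href"):
            self.zones["link"].append(attrs["href"])
        if tag in ("title", "a"):
            self._open.append(tag)

    def handle_endtag(self, tag):
        if self._open and self._open[-1] == tag:
            self._open.pop()

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        current = self._open[-1] if self._open else None
        if current == "title":
            self.zones["title"].append(text)
        elif current == "a":
            self.zones["anchor"].append(text)
        else:
            self.zones["content"].append(text)


# Hypothetical page used only for this example.
page = (
    "<html><head><title>Index overview</title>"
    '<meta name="keywords" content="search, index"></head>'
    "<body><p>Inverted index basics.</p>"
    '<a href="/crawl">Crawling notes</a></body></html>'
)

labeler = ZoneLabeler()
labeler.feed(page)
labeler.close()
print(labeler.zones)
```

Keeping the zones separate lets later steps treat a word found in the title differently from the same word found in body text.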

Well, that is what Love Shanghai has released. It is, of course, very brief. If you want to know more, see the wooden Shanghai dragon article "Not understanding the principle of search engines is running naked", which I think you will be able to follow and which goes into more detail. In addition, the article above contains a few words that may be unfamiliar. Put simply: term is the word text, namely the keyword; term>

This is the inverted index process of the index system. The inverted index is a very important part of how a search engine achieves millisecond-level retrieval.


Two months have passed since Lee of the Love Shanghai Webmaster Platform published information on search engine crawling last August, and Lee has now gone on to publish information on the search engine index system. In any case, the wooden Shanghai dragon believes that we still need to understand and analyze Love Shanghai's official announcements. The following is the official announcement:

word class, part of speech, and so on;

If we know which pages each of the user's keywords (the terms produced by segmenting the query) appears in, then retrieval can be thought of as intersecting the sets of pages that contain each segmented part of the query, and retrieval becomes a matter of comparing and intersecting page names. In this way, retrieval within milliseconds over collections measured in units of 100 million pages becomes possible. This is what is commonly referred to as inverted indexing with intersection-based retrieval. The basic process for building an inverted index is as follows:
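The intersection-based retrieval described in this paragraph can be sketched as below. This is a simplified illustration, not the engine's actual implementation: it assumes each term maps to a sorted list of page IDs, and the sample index and the function names intersect_postings and retrieve are hypothetical.

```python
def intersect_postings(a, b):
    """Intersect two sorted lists of page IDs with a two-pointer merge."""
    i = j = 0
    result = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return result


def retrieve(query_terms, index):
    """Return the page IDs that contain every term of the segmented query."""
    lists = [index.get(term, []) for term in query_terms]
    if not lists:
        return []
    # Start from the shortest list so intermediate results stay small.
    lists.sort(key=len)
    result = lists[0]
    for other in lists[1:]:
        result = intersect_postings(result, other)
        if not result:
            break  # no page can match once the intersection is empty
    return result


# Hypothetical inverted index: term -> sorted page IDs.
index = {
    "search": [1, 3, 5, 8],
    "engine": [1, 2, 5, 9],
    "index": [1, 5, 7],
}

print(retrieve(["search", "engine", "index"], index))  # [1, 5]
print(retrieve(["search", "missing"], index))          # []
```

Each comparison only touches page IDs, never page text, which is why the work per query stays small even when the library itself is very large.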

