An Architecture Overview

 
  

Google's architecture has three main parts.

First, web crawlers download web pages from the Internet. The URL Server sends lists of URLs to the crawlers for fetching. The fetched pages are then passed to the Store Server, which compresses them and stores them in a repository. Every web page has an associated docID.

| URLServer | --> | Crawlers | --> | Store Server | --> | Repository |
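
The crawl stage can be sketched roughly as follows. This is a minimal illustration, not Google's implementation: the `StoreServer` class, the `fake_fetch` stand-in for a real download, the use of `zlib` for compression, and the point at which a docID is assigned are all assumptions made for the example.

```python
import zlib


class StoreServer:
    """Hypothetical store server: compresses pages and assigns docIDs."""

    def __init__(self):
        self.repository = {}  # docID -> compressed page bytes
        self.doc_ids = {}     # URL -> docID

    def store(self, url, html):
        # Reuse the docID if the URL was seen before, else assign the next one.
        doc_id = self.doc_ids.setdefault(url, len(self.doc_ids))
        self.repository[doc_id] = zlib.compress(html.encode("utf-8"))
        return doc_id

    def load(self, doc_id):
        return zlib.decompress(self.repository[doc_id]).decode("utf-8")


def fake_fetch(url):
    # Stand-in for a real HTTP download (assumption for this sketch).
    return f"<html><body>Page at {url}</body></html>"


def crawl(url_list, fetch, store_server):
    # The URL server's role is played by url_list; the crawler fetches each URL.
    return [store_server.store(url, fetch(url)) for url in url_list]


store = StoreServer()
ids = crawl(["http://example.com/a", "http://example.com/b"], fake_fetch, store)
```

Here `ids` holds the docIDs of the stored pages, and `store.load(doc_id)` returns the decompressed page that the indexer would read next.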

Second, an indexer and a sorter perform the indexing function.

The indexer reads documents from the repository and parses them. Each document is converted into a set of word occurrences called "hits". A hit records the word along with properties such as its position in the document and its font size. The indexer then distributes these hits into a set of "barrels", creating a partially sorted forward index.

The indexer also parses out the links in each page and writes them to an anchors file, which records which page links to which page.

The sorter then takes the barrels and re-sorts them by wordID to generate the inverted index.

| Repository | --> | Indexer | --> | Anchor, Barrels (Forward Index) |
| Barrels | --> | Sorter | --> | Inverted Index |
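
As a rough sketch of the indexer and sorter, the code below turns documents into hits, spreads them over barrels, and then re-sorts each barrel by wordID. The function names, the tuple layout of a hit, and the choice of barrel by `wordID % num_barrels` are assumptions for illustration; font size and other hit properties are left out.

```python
import re
from collections import defaultdict

lexicon = {}  # word -> wordID (hypothetical in-memory lexicon)


def word_id(word):
    # Assign the next free wordID the first time a word is seen.
    return lexicon.setdefault(word, len(lexicon))


def index_document(doc_id, text, barrels, num_barrels=2):
    """Indexer: append (docID, wordID, position) hits to the forward barrels."""
    for position, word in enumerate(re.findall(r"\w+", text.lower())):
        wid = word_id(word)
        barrels[wid % num_barrels].append((doc_id, wid, position))


def sort_barrel(hits):
    """Sorter: re-sort one barrel by wordID to get wordID -> [(docID, position)]."""
    inverted = defaultdict(list)
    for doc_id, wid, position in sorted(hits, key=lambda h: (h[1], h[0], h[2])):
        inverted[wid].append((doc_id, position))
    return dict(inverted)


documents = {0: "the quick fox", 1: "the lazy dog"}
forward_barrels = defaultdict(list)
for doc_id, text in documents.items():
    index_document(doc_id, text, forward_barrels)

inverted_index = {}
for hits in forward_barrels.values():
    inverted_index.update(sort_barrel(hits))
```

After this runs, looking up the wordID for "the" in `inverted_index` gives the two documents that contain it, each with the position of the word.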

Third, the URLresolver converts the relative URLs in the anchors file into absolute URLs and then into docIDs, producing a pair of docIDs for each link. This links database is then used to calculate the PageRank of each page.

| Anchor | --> | URLresolver | --> | Links | --> | PageRank |
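
The link resolution and ranking steps might look like the sketch below. It uses the standard `urllib.parse.urljoin` to make URLs absolute; the function names, the example URLs, the damping factor of 0.85, the fixed iteration count, and the normalized form of the PageRank update (ranks summing to 1) are assumptions for this example rather than details taken from the text.

```python
from urllib.parse import urljoin


def resolve_links(anchors, doc_ids):
    """Turn (source URL, href) pairs into (source docID, target docID) pairs."""
    links = []
    for source_url, href in anchors:
        target_url = urljoin(source_url, href)  # relative -> absolute URL
        if source_url in doc_ids and target_url in doc_ids:
            links.append((doc_ids[source_url], doc_ids[target_url]))
    return links


def pagerank(links, num_docs, damping=0.85, iterations=50):
    """Simple iterative PageRank over docID pairs."""
    out_links = {d: [] for d in range(num_docs)}
    for src, dst in links:
        out_links[src].append(dst)
    rank = [1.0 / num_docs] * num_docs
    for _ in range(iterations):
        new_rank = [(1.0 - damping) / num_docs] * num_docs
        for src, targets in out_links.items():
            # Pages without out-links spread their rank over all pages.
            receivers = targets if targets else range(num_docs)
            share = damping * rank[src] / len(receivers)
            for dst in receivers:
                new_rank[dst] += share
        rank = new_rank
    return rank


doc_ids = {
    "http://example.com/a.html": 0,
    "http://example.com/b.html": 1,
    "http://example.com/c.html": 2,
}
anchors = [
    ("http://example.com/a.html", "b.html"),
    ("http://example.com/a.html", "/c.html"),
    ("http://example.com/b.html", "c.html"),
    ("http://example.com/c.html", "http://example.com/a.html"),
]
links = resolve_links(anchors, doc_ids)
ranks = pagerank(links, num_docs=len(doc_ids))
```

In this small graph, page b receives only half of page a's rank, so it ends up with the lowest score of the three.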

Finally, a program called DumpLexicon takes the inverted index together with the lexicon produced by the indexer and generates a new lexicon for the searcher. The searcher uses the lexicon built by DumpLexicon, together with the inverted index and the PageRank values, to answer queries.
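
To show how these pieces fit together at query time, here is a simplified searcher. It is only a sketch: the `search` function, the sample lexicon, inverted index and rank values are made up for illustration, and results are ordered by PageRank alone, whereas a real ranking would also weigh the hit information.

```python
def search(query, lexicon, inverted_index, pagerank):
    """Return docIDs containing every query word, highest PageRank first."""
    doc_sets = []
    for word in query.lower().split():
        wid = lexicon.get(word)
        if wid is None:
            return []  # a word missing from the lexicon matches nothing
        doc_sets.append(set(inverted_index.get(wid, [])))
    if not doc_sets:
        return []
    matches = set.intersection(*doc_sets)
    return sorted(matches, key=lambda d: pagerank[d], reverse=True)


# Hypothetical sample data: word -> wordID, wordID -> docIDs, docID -> rank.
sample_lexicon = {"google": 0, "search": 1, "engine": 2}
sample_index = {0: [0, 2], 1: [0, 1, 2], 2: [1, 2]}
sample_ranks = [0.2, 0.3, 0.5]

results = search("search engine", sample_lexicon, sample_index, sample_ranks)
```

With this data, `results` lists document 2 before document 1, because both contain the two query words and document 2 has the higher rank.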

phpbb_admin, Mon, 2006-03-20 14:40