An Architecture Overview
In Google, there are three main part in its architecture.
First, there is a web crawler to download web page from the Internet. This is done by the URL Server that sends a list of URL to the crawler for fetching. The fetched page are then stored in the Store Server which compresses and stores the web page into a repository. For every web page, there is an docID associated with.
| URLServer | --> | Crawlers | --> | Store Server | --> | Repository |
Second, there are an indexer and sorter then performs the indexing function.
The indexer reads the document and then parses them. Each document is converted into a set of word occurences called "hits". The hit records the word's properties like position, font size etc. The indexer then distributes these hits into a set of "barrels", creating a partially sorted forward index.
The indexer also parse out the information from the link forming the anchor files which contains the information of which page connects to which page.
The sorter then takes the barrels and resorted them to generate the inverted index with the wordID.
| Repository | --> | Indexer | --> | Anchor, Barrels (Forward Index) |
| Barrels | --> | Sorter | --> | Inverted Index |
Third, the URLresolver then responsible for converting the URL to absolute URL and finally to docID, forming a pairs of docID with link associated with. This link database is then used by calculating the PageRank of each page.
| Anchor | --> | URLresolver | --> | Links | --> | PageRank |
Finally, there is a program called DumpLexicon, it takes this inverted list together with the lexicon produced by the index and generates a new lexicon for the query. The searcher uses the lexicon built by the DumpLexicon and the PageRank to answer the query.
Crawl the web!
A web crawler (also known as a web spider or ant) is a program which browses the World Wide Web in an automated manner.
Google uses a distributed crawler for fetching web document from the Internet. A single URLserver serves a list of URLs to a number of crawlers (around 3 at the same time).
The crawlers download the web page and then send them to a store server. The store server then compresses and stores the page in the repository.
What does Repository store?
In Google, the respository contains full HTML of every web page fetched by the crawler. Google uses the zlib (RFC 1950) for compressing the data with the trade off of compression ratio. For every web page, it is stored as a packet with the following format.
| docid | encode | url-lenght | page-len | url | page |
The respository plays an important role for the origin of the information sources in the search engine. Data can be re-built from this database.
Your searching keyword - Lexicon
Lexicon is a word of Greek origin meaning vocabulary, and this is what your query in the search engine.
In Google, the lexicon is a database of words used for search query from the user. This database contains around 14 million of words. For every lexicon, there is a wordID in associated with it.
| wordID | lexicon |
The breakdown of your page - Hit lists
After the web page has been crawled into the repository, the indexer parses through every web page and break down it into a logical structure called hit lists.
A hit list respresents a list of word occurence in the web document fetching from the Internet, it records the position, font and capitalization information.
There are two main types of hits: fancy hits and plain hits. Fancy hits include hit occurring in a URL, title, anchor text, or meta tag. Plain hits include everything else.
In the barrel, the indexer builds up a forward indexer for every web page with a list of hits associated with it. With the docID, the search engine can query what information of that page composed of with ease.
Example: a list of doc in the barrel with forward index
| docID |
| wordID | no-of-hit | hit, hit, hit |
| wordID | no-of-hit | hit, hit, hit, hit|
| null wordID |
| docID |
| wordID | no-of-hit | hit, hit, hit |
| wordID | no-of-hit | hit, hit, hit |
| wordID | no-of-hit | hit, hit, hit |
| null wordID |
Building an Inverted Index!
The inverted index are from the same barrels, but it is built by the sorter. For every lexicon, there is a pointer pointed into barrel that the wordID falls into and the wordID points to a list of docID, and finally to a list of hit lists.
| Lexicon | --> | wordID | --> | docID | --> | hit list |
With this inverted index, Google can locate the document from the lexicon submitted by the searching query.