The breakdown of your page - Hit lists
After a web page has been crawled into the repository, the indexer parses it and breaks it down into a logical structure called hit lists.
A hit list represents the occurrences of a word in a web document fetched from the Internet; each hit records the word's position, font, and capitalization.
There are two main types of hits: fancy hits and plain hits. Fancy hits are hits occurring in a URL, title, anchor text, or meta tag. Plain hits are everything else.
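To make the distinction concrete, here is a minimal Python sketch of a hit and its classification. The names `Hit`, `FANCY_SOURCES`, and `hit_list`, the field layout, and the token format are all assumptions made for illustration; they are not the indexer's actual representation.

```python
from dataclasses import dataclass

# Assumption: each token is tagged with where it appeared in the page.
FANCY_SOURCES = {"url", "title", "anchor", "meta"}


@dataclass(frozen=True)
class Hit:
    position: int      # index of the word within the document
    font_size: int     # illustrative font size of the word
    capitalized: bool  # whether the word starts with an uppercase letter
    source: str        # e.g. "title", "anchor", "body" (hypothetical labels)

    @property
    def kind(self) -> str:
        # Fancy hits come from a URL, title, anchor text, or meta tag;
        # everything else is a plain hit.
        return "fancy" if self.source in FANCY_SOURCES else "plain"


def hit_list(word, tokens):
    """Collect the hits for `word` from (text, font_size, source) tokens."""
    return [
        Hit(position=i, font_size=size,
            capitalized=text[:1].isupper(), source=source)
        for i, (text, size, source) in enumerate(tokens)
        if text.lower() == word.lower()
    ]


tokens = [
    ("Search", 3, "title"),
    ("engines", 3, "title"),
    ("A", 1, "body"),
    ("search", 1, "body"),
    ("engine", 1, "body"),
]
hits = hit_list("search", tokens)
for hit in hits:
    print(hit.kind, hit)
```

In this example the word "search" produces two hits: one fancy hit from the title and one plain hit from the body text.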
In the barrel, the indexer builds a forward index for every web page, with a list of hits associated with each word in it. Given a docID, the search engine can easily look up which words that page contains and where they occur.
Example: a list of documents in a barrel, stored as a forward index
| docID |
| wordID | no-of-hit | hit, hit, hit |
| wordID | no-of-hit | hit, hit, hit, hit |
| null wordID |
| docID |
| wordID | no-of-hit | hit, hit, hit |
| wordID | no-of-hit | hit, hit, hit |
| wordID | no-of-hit | hit, hit, hit |
| null wordID |
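
The layout above can be mimicked with a short Python sketch. This is an in-memory illustration only, not the engine's actual storage format: the function names `build_forward_index` and `records_for`, the tagged tuples, the sample lexicon, and the use of bare word positions as hits are all assumptions.

```python
from collections import defaultdict

# Assumption: a tagged tuple stands in for the "null wordID" terminator.
END_OF_DOC = ("null wordID",)


def build_forward_index(docs, lexicon):
    """Build a flat barrel from {doc_id: [word, ...]} and {word: word_id}."""
    barrel = []
    for doc_id, words in docs.items():
        hits_by_word = defaultdict(list)
        for position, word in enumerate(words):
            # Here a hit is just the word position (a simplification).
            hits_by_word[lexicon[word]].append(position)
        barrel.append(("docID", doc_id))
        for word_id, hits in hits_by_word.items():
            barrel.append(("wordID", word_id, len(hits), hits))
        barrel.append(END_OF_DOC)
    return barrel


def records_for(barrel, doc_id):
    """Return (word_id, no_of_hits, hits) records for one document."""
    records = []
    inside = False
    for record in barrel:
        if record[0] == "docID":
            inside = record[1] == doc_id
        elif record == END_OF_DOC:
            inside = False
        elif inside:
            records.append(record[1:])
    return records


lexicon = {"search": 1, "engine": 2, "index": 3}
docs = {
    7: ["search", "engine", "search"],
    9: ["index", "engine"],
}
barrel = build_forward_index(docs, lexicon)
for record in barrel:
    print(record)
print(records_for(barrel, 9))
```

Each document contributes one docID record, one record per distinct word with its hit count and hits, and a closing null wordID record, so a lookup by docID only needs to scan until the terminator.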