IMPLEMENTATION
DOCUMENT
Introduction
This document describes the project implementation for the CSE 345/445 Class Project, LSWWW. The aim of the project is to implement a back-end parser and indexer, and to provide a basic search functionality. The project has been developed in Java. The project is capable of running on standard UNIX machines. However, to handle the large amount of crawled data, it is recommended to run it on a high performance server. A command line interface is presented to the user in order to access the build and search APIs. The GUI will be designed and implemented in the later phases of the project.
Definitions
DocIndex: This contains the document id, URL and title which are used for display purposes. The DocIndex is written onto the file named Ôdoc_index.txtÕ.
WordIndex: This contains the entire hash table which is stored onto disk. The hash key is the actual term to be hashed. The WordIndex contains only the hash keys generated for the terms. Each hash key consumes 2 bytes. Hence, the WordIndex is of the size 2 * (number of terms to be indexed).
Postings list: This list contains the postings for every term that is hashed. A separate file is made for every word. This makes available direct access to the word in case of update and retrieval. The postings list is a list of triplet fields of the form <doc id, position, weight>. Weight is assigned based on the HTML tag the word belongs to. Title is assigned the weight 2 as it is considered to have the highest priority. Plain text is assigned the least priority, and headings <h1-h6>, bold and italics are assigned the weights 1.
Functionality
Function call for build: java lswww_package/lswww -build summary0-25.warc.gz
This command is to be executed from the Lswww package which contains all of the
class files.
The word and document indexes are built upon execution of the above command.
Function call for search: java lswww_package/lswww -search
This command takes in a single term as input and lists all documents (URLs and
titles) which have an occurrence of the word. A semicolon (Ô;Õ) needs to
be entered by the user as a query in order to quit the search.
Function call for searching documents on their id: java lswww_package/lswww
-searchdoc <document id>
This command takes as input the document id and returns the corresponding URL
and Title.
Build:
When the program is run from the command line with the build argument and the
name of the crawled '.gz' file, the readGZIPFile module is called which
extracts html documents, one by one, from the archive and writes them to a
temporary file. This file is read by the HTMLParser which then separates the
tags and gives a call to the lineParse module for every document. The lineParse
module separates the string into words and calls the hash function for every
word. It then calls the lineParserHelper module which checks for the particular
word in the stop list and adds the word accordingly by giving a call to the
hash function for every word. Next the hash function calls either the Add or
the Update function depending on whether the hash key for that word exists or
not. The generated hash table is written to the disk as a word index after
every 1000 documents in order to facilitate searching even if the program
crashes due to an abnormal termination.
Usage: java lswww -build summary0-25.warc.gz
Search:
When the program is run from the command line with the search argument, the
following actions occur in the course of program execution. The entire word index is loaded into
memory (maintained as a hash table) and a message appears on the user screen
asking for a user term query. Once the word has been entered by the user, the
hash key for the word is checked for in the hash table. If it is not found in
the hash table, the key is checked for in the stop list. If the word is in the
stop list a message is thrown to the user asking him to refine his search as
the word entered by him is very common. However, if the word is not in the
stop list too, then the 'word not found in any document' message is displayed
to the user. If the word if present in the hash table, the URL and the title of
the documents in which it is present, and the frequency of occurrence in these
documents are displayed.
Usage: java lswww -search
Searchdoc:
Given the document id, while running the API with the searchdoc argument, this module reads from the document index file and displays the URL and the title.
Usage: java lswww -searchdoc
Flow Chart


Modules
1. GzipReader Module
This module handles the extraction of individual html files from the archived
file. Files are saved in a temporary document which is then sent to the
HtmlParser for further processing. The next file is extracted once the current
file has been processed and indexed.
Module GzipReader
Input: gzip file
Output: temporary html document
Read file with extension Ò.gzÓ
While !EOF
Do{
Read line from file
Write line to temp.html
}until line=Ó</html>Ó
Call HtmlParser(temp.html)
End of while
End of Module
2. HtmlParser Module
This module reads the temporary document sent by the GzipReader module and
extracts out the plaintext which is then sent to the LineParser module. During
this time the HtmlParser checks for all anchor text as well as font styles,
headings, title etc. which are later assigned special priorities.
Module HtmlParser
Input: temporary html document
Output: plaintext line
Read file with extension Ò.htmlÓ
While !EOF
Do{
Extract line from file
Remove HTML tags from line
Call LineParser(string)
}until line=Ó</html>Ó
End of while
End of Module
3. LineParser Module
The LineParser module reads strings and separates them into words. As the words are being extracted they are sent to the LineParserHelper module which checks for stop words. The word is then sent to the Hash module to be hashed and indexed.
Module LineParser
Input: plaintext string
Output: word to be hashed
Read string
While !EOL
Extract word from string
Call LineParserHelper(word)
If(LineParserHelper(word)!=1) // Not a stopword
Hash(word)
else
skip word // Is a stop word
End If
End of while
End of Module
4. LineParserHelper Module
This module checks if the word is found in the stop words
list and returns an appropriate value to the LineParser.
Module LineParserHelper
Input: plaintext word
Output: 0 if word is not a stop word, 1 otherwise
Read word
If(word != stopword) // Not a stopword
Return 0
else
Return 1 // Is a stop word
End If
End of Module
5. Hash Module
The Hash module hashes every word received from the LineParser and stores it in a hashtable. The hash function calls for an update when the hash key for the word is found in the hash table and calls the Add module in case the hashkey for the specific word is not already present. This hashtable is written onto the disk after every 1000 documents are parsed.
Module Hash
Input: plaintext word
Output: hashkey
Read word
Hash(word)
If hashkey exists in hash table
Call Update
Else
Call Add
End If
If documents parsed = 1000 then
Write hash table to disk
Else
Continue
End if
End of Module
6. Add/Update Module
These modules are used for adding to or updating the
postings list.
Module Add
Input: Term
Output: New postings list for the term
Create a new postings list for the term.
End of Module
Module Update
Input: Term
Output: Modified postings list for the term
Update the postings list for the term.
End of Module
Testing
Testing was carried out on individual modules of our programs followed by an
integrated testing which was done by combining all the different modules into
one program. our testing was first conducted initially on 25 documents and then
we scaled it up to 100 documents to observe its working on a slightly larger
set.
Performance
The program works in a smart way as it has a direct access to the term by
looking for that term in the hash table. A separate file is created for every
word on disk. The performance factor was tweaked by storing only the hash keys
for the terms in the index. This allows us to update the postings list for the
term and to retrieve the term during search in just a single access. This also reduces
the space on disk to a considerable amount.