IMPLEMENTATION DOCUMENT

 


Introduction

 

This document describes the project implementation for the CSE 345/445 Class Project, LSWWW. The aim of the project is to implement a back-end parser and indexer, and to provide a basic search functionality. The project has been developed in Java. The project is capable of running on standard UNIX machines. However, to handle the large amount of crawled data, it is recommended to run it on a high performance server. A command line interface is presented to the user in order to access the build and search APIs. The GUI will be designed and implemented in the later phases of the project.

 

Definitions

 

DocIndex: This contains the document id, URL and title which are used for display purposes. The DocIndex is written onto the file named Ôdoc_index.txtÕ.

 

WordIndex: This contains the entire hash table which is stored onto disk. The hash key is the actual term to be hashed. The WordIndex contains only the hash keys generated for the terms. Each hash key consumes 2 bytes. Hence, the WordIndex is of the size 2 * (number of terms to be indexed).

 

Postings list: This list contains the postings for every term that is hashed. A separate file is made for every word. This makes available direct access to the word in case of update and retrieval. The postings list is a list of triplet fields of the form <doc id, position, weight>. Weight is assigned based on the HTML tag the word belongs to. Title is assigned the weight 2 as it is considered to have the highest priority. Plain text is assigned the least priority, and headings <h1-h6>, bold and italics are assigned the weights 1.

 

Functionality

Function call for build: java lswww_package/lswww -build summary0-25.warc.gz
This command is to be executed from the Lswww package which contains all of the class files.
The word and document indexes are built upon execution of the above command.

Function call for search: java lswww_package/lswww -search
This command takes in a single term as input and lists all documents (URLs and titles) which have an occurrence of the word.  A semicolon (Ô;Õ) needs to be entered by the user as a query in order to quit the search.

Function call for searching documents on their id: java lswww_package/lswww -searchdoc <document id>
This command takes as input the document id and returns the corresponding URL and Title.


Build:


When the program is run from the command line with the build argument and the name of the crawled '.gz' file, the readGZIPFile module is called which extracts html documents, one by one, from the archive and writes them to a temporary file. This file is read by the HTMLParser which then separates the tags and gives a call to the lineParse module for every document. The lineParse module separates the string into words and calls the hash function for every word. It then calls the lineParserHelper module which checks for the particular word in the stop list and adds the word accordingly by giving a call to the hash function for every word. Next the hash function calls either the Add or the Update function depending on whether the hash key for that word exists or not. The generated hash table is written to the disk as a word index after every 1000 documents in order to facilitate searching even if the program crashes due to an abnormal termination.
Usage: java lswww -build summary0-25.warc.gz

Search:


When the program is run from the command line with the search argument, the following actions occur in the course of program execution.  The entire word index is loaded into memory (maintained as a hash table) and a message appears on the user screen asking for a user term query. Once the word has been entered by the user, the hash key for the word is checked for in the hash table. If it is not found in the hash table, the key is checked for in the stop list. If the word is in the stop list a message is thrown to the user asking him to refine his search as the word entered by him is very common. However, if the word is not in the stop list too, then the 'word not found in any document' message is displayed to the user. If the word if present in the hash table, the URL and the title of the documents in which it is present, and the frequency of occurrence in these documents are displayed.

Usage: java lswww -search

 

 

Searchdoc:

 

Given the document id, while running the API with the searchdoc argument, this module reads from the document index file and displays the URL and the title.

Usage: java lswww -searchdoc

 


Flow Chart







Modules

1. GzipReader Module
This module handles the extraction of individual html files from the archived file. Files are saved in a temporary document which is then sent to the HtmlParser for further processing. The next file is extracted once the current file has been processed and indexed.

Module GzipReader

            Input: gzip file

            Output: temporary html document

 

Read file with extension Ò.gzÓ

While !EOF

            Do{

                        Read line from file

                        Write line to temp.html

            }until line=Ó</html>Ó

            Call HtmlParser(temp.html)

End of while

End of Module


2. HtmlParser Module
This module reads the temporary document sent by the GzipReader module and extracts out the plaintext which is then sent to the LineParser module. During this time the HtmlParser checks for all anchor text as well as font styles, headings, title etc. which are later assigned special priorities.

 

Module HtmlParser

            Input: temporary html document

            Output: plaintext line

 

Read file with extension Ò.htmlÓ

While !EOF

            Do{    

                        Extract line from file

                        Remove HTML tags from line

                        Call LineParser(string)

            }until line=Ó</html>Ó

           

End of while

End of Module




3. LineParser Module

The LineParser module reads strings and separates them into words.  As the words are being extracted they are sent to the LineParserHelper module which checks for stop words. The word is then sent to the Hash module to be hashed and indexed.

 

Module LineParser

            Input: plaintext string

            Output: word to be hashed

 

Read string

While !EOL

                        Extract word from string

                        Call LineParserHelper(word)

                        If(LineParserHelper(word)!=1) // Not a stopword

                                    Hash(word)

                        else     

                                    skip word //  Is a stop word   

                        End If

End of while

End of Module

 

4. LineParserHelper Module

This module checks if the word is found in the stop words list and returns an appropriate value to the LineParser.

Module LineParserHelper

            Input: plaintext word

            Output: 0 if word is not a stop word, 1 otherwise

 

Read word

If(word != stopword) // Not a stopword

            Return 0

else     

            Return 1 //  Is a stop word                  

End If

End of Module


5. Hash Module

The Hash module hashes every word received from the LineParser and stores it in a hashtable. The hash function calls for an update when the hash key for the word is found in the hash table and calls the Add module in case the hashkey for the specific word is not already present. This hashtable is written onto the disk after every 1000 documents are parsed.

 

Module Hash

            Input: plaintext word

            Output: hashkey

 

            Read word

            Hash(word)

            If hashkey exists in hash table

                        Call Update

            Else

                        Call Add

            End If

            If documents parsed = 1000 then

                        Write hash table to disk

            Else

                        Continue

            End if

End of Module

 

6. Add/Update Module

These modules are used for adding to or updating the postings list.

Module Add

            Input: Term

            Output: New postings list for the term

 

            Create a new postings list for the term.

End of Module

 

Module Update

            Input: Term

            Output: Modified postings list for the term

 

            Update the postings list for the term.

End of Module


Testing
Testing was carried out on individual modules of our programs followed by an integrated testing which was done by combining all the different modules into one program. our testing was first conducted initially on 25 documents and then we scaled it up to 100 documents to observe its working on a slightly larger set.

Performance
The program works in a smart way as it has a direct access to the term by looking for that term in the hash table. A separate file is created for every word on disk. The performance factor was tweaked by storing only the hash keys for the terms in the index. This allows us to update the postings list for the term and to retrieve the term during search in just a single access. This also reduces the space on disk to a considerable amount.