Site hosted by Angelfire.com: Build your free website today!
next up previous
Next: About this document ... Up: CS2012. Algorithms and Data Previous: CS2012 Lab 2: Timing

Subsections

CS2012 Lab 3: A simple spell checker (2 lab sessions)

Purpose

This lab is an exercise in implementing some common datatypes, together with a framework in which they can be used.


Background

The exercise is to implement a very simple spelling checker which uses a set in which to store all the words in a dictionary. A text is then read and words which are in the text which can not be found in the dictionary are reported to the user, together with the line numbers on which these words occur in the text.

The dictionary is to be contained in a collection which implements the following very simplified collection interface

public interface Coll
{
  public boolean add(Object o);
  public boolean contains(Object o);
  public void printColl();
}

The behaviour of the add and contains methods should be the same as the corresponding methods in Set. printColl should print the contents of collection to standard output in a suitable manner.


Tasks

Your tasks for this lab are to provide two different implementations of this interface, both using Hash Tables, one using Open Hashing and the other using Closed Hashing. Although these classes are to be used to hold String data, they should be general enough to contain any Object. You do not need to implement any hash functions, you should use the method hashCode provided in the JDK.

You should then implement a class SpellCheck which uses either of these implementations to produce a simple spell checker, as described above.

There is just one deadline for this lab, at the end of the second lab session, when both implementations will be marked together.


Task 1: Open Hashing based implementation

The first task is to create a class HashSetOpen which implements the Coll interface using hash tables based on open hashing, so that collisions are dealt with by using a list of entries. You can probably reuse a modified form of your LinkedList code from lab 1 for list handling purposes.

In addition to the methods required for the interface you will need suitable constructors and a main method which includes code to test your methods.


Task 2: Closed Hashing based implementation

The second task is to reimplement the Coll interface using hash tables based on closed hashing, so that collisions are dealt with by using a collision resolution mechanism, such as linear probing. You could alternatively use quadratic probing or double hashing. This class should be called HashSetClosed.

For this style of hash table you will also need to implement a rehash method, which is invoked by add when the the table reaches a predetermined load factor (0.5 is a suitable value), otherwise the table risks overfilling.

Your implementation should, of course, include suitable testing.

For both types of hash table it is preferable to use hash tables whose size is a prime number. A class Primes is provided in $CS2012/labs/lab3/Primes.java, to help you identify suitable prime numbers. In particular the method nextPrime returns the first prime number greater than or equal to its argument.


Task 3: The Spell Checker

The final task is to write a class SpellCheck which uses either of these implementations to produce a simple spell checker, as described above. The text file to be checked should be specified via a command line argument, as should the Coll implementation to be used and the dictionary. One possible example of how this might be used is:-
java SpellCheck -o  -d /usr/share/dict/words test-text
This should use the dictionary /usr/share/dict/words and the open hashing based hash table implementation to spell check the file test-text.

The simpler command

java SpellCheck test-text
might use a default dictionary and text file.

The recommended way to switch between the two implementations is to have a SpellCheck constructor which takes an argument of type Coll which is then used to store the dictionary. This can then be used in conjunction with a method to do the spell check, which can be invoked by that object.

To populate the dictionary and also to read the text to be checked, you need to read a file and identify the words in it. For this purpose we regard a word as a sequence of consecutive alphabetic characters. Any non-alphabetic character can be regarded as a separator. You may find the StringTokenizer class useful here. The dictionary is just a text file containing white space separated words. It may, or may not, have only one word per line.

A sample dictionary and text file to be spell checked can be found in the directory $CS2012/labs/lab3/, together with a Java file containing some examples of command line handling. There is a good tutorial on this subject at http://www.ecs.umass.edu/ece/wireless/people/emmanuel/java/java/cmdLineArgs/cmdLineArgs.html, from where the above Java source was obtained.

The output from the program should be in the form

3: thiss
4: iss
5: ze
7: owtput
where the numbers are the line numbers in the file where the wrongly spelt words occur.

Your spell checker should also deal sensibly with the fact that words are sometimes written in a mixture of upper and lower case, so that, if word is in the dictionary, the checker should not complain about Word, or even wOrD. This aspect should be handled in the spell checker itself, and not in either of the HashSet implementations.

Optional extras

There are many ways in which this program could be enhanced. One optional extra (not assessed) is to give your spelling checker the capability to read from a user's personal dictionary in addition to the system dictionary, and to add `incorrect' words to this personal dictionary if the user so chooses.

Another is to add extra constructors to the SpellCheck class to use the implementations of Set provided by the JDK, for example HashSet and TreeSet; this would also involve creating small wrapper classes which implemented Coll.


Marking Scheme

HashSetOpen implementation 6
HashSetOpen testing 1
HashSetClosed implementation 6
HashSetClosed testing 1
Basic SpellCheck implementation 3
Command line handling 2
SpellCheck testing 1
Total 20


next up previous
Next: About this document ... Up: CS2012. Algorithms and Data Previous: CS2012 Lab 2: Timing
Graham Gough
2004-04-30