
Text Classification using Identification Trees

I- Classes:
The classes used in the classification are:
1-    Fuzzy.
2-    Learning.
3-    Other.
The classes fuzzy and learning are not disjoint; they have some documents and keywords in common. This overlap makes the classification more challenging and therefore more meaningful.
The class other is added to avoid misclassifying a document that belongs to neither of the first two classes.
Each class is represented by a reasonable number of keywords that occur frequently in the documents of that class.

II- Documents extraction
Documents are extracted using web search engines. The class of each document is known in advance from the specific query submitted to the search engine.
The number of retrieved documents should be large enough that the ID3 learning algorithm can learn from a sufficient number of examples and construct rules that choose the appropriate attributes.
The documents extracted for learning the class other are documents that do not belong to either of the first two classes.

II –1 Extraction of attributes (keywords):
Using an indexing program that gives the occurrence of each word in a document, we choose the words with a high occurrence in the documents, excluding the words in the stop list (in, and, out …).
The keywords we have chosen are very significant in the classification.
Examples of useful keywords in the classification:

1-    In the class learning: supervised, space and networks.
2-    In the class fuzzy: fuzzy, logic.

II –2 How to calculate the frequency of the keywords?
We store the keywords in a linked list. Each time we extract a word from a document, we compare it against this list; if the word is found in the linked list, we increment the frequency of the keyword it matches.
Using a keyword list makes sense because ID3 must know the attributes in advance (attribute = keyword; value of the attribute = keyword frequency).
The document vectors represent the frequency of each keyword.
Example:

              (logic=20, supervised=30, fuzzy=25, …)
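The matching step above can be sketched in Python. This is a minimal illustration: the keyword list, stop list and sample text are invented for the example, and a plain dictionary stands in for the linked list.

```python
# Count occurrences of a fixed keyword list in a document, skipping
# stop-list words, to build one document vector (keyword -> frequency).

STOP_WORDS = {"in", "and", "out", "the", "of", "a"}   # illustrative stop list
KEYWORDS = ["fuzzy", "logic", "supervised", "networks"]  # illustrative subset

def document_vector(text, keywords=KEYWORDS, stop_words=STOP_WORDS):
    freq = {kw: 0 for kw in keywords}        # attribute -> frequency
    for word in text.lower().split():
        word = word.strip(".,;:!?()")        # drop trailing punctuation
        if word in stop_words:
            continue
        if word in freq:                     # the word matches a keyword
            freq[word] += 1
    return freq

doc = "Fuzzy logic and fuzzy sets: supervised networks use fuzzy logic."
print(document_vector(doc))
# → {'fuzzy': 3, 'logic': 2, 'supervised': 1, 'networks': 1}
```

The resulting dictionary is exactly the document vector used below, one frequency per keyword.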
III- Implementation of the ID3 learning:
1- Because ID3 does not deal with continuous attribute values, we enumerate all the possible values each attribute can take.
2- We give ID3 the attributes as well as their possible values.
3- We put a reasonable number of document vectors in the training set.
4- We also put an appropriate number in the testing set.
5- Then we start testing.
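Step 1 amounts to discretizing each keyword frequency into a small set of ranges before handing the vectors to ID3. A minimal sketch, where the bin boundaries are assumptions chosen for illustration:

```python
def discretize(freq, bins=(0, 10, 25, 50)):
    """Map a continuous keyword frequency onto a range label so that
    ID3 can treat the attribute as having a few discrete values."""
    for i in range(len(bins) - 1):
        if bins[i] <= freq < bins[i + 1]:
            return f"{bins[i]}-{bins[i + 1] - 1}"
    return f"{bins[-1]}+"                       # anything past the last bin

# Discretize one document vector before training.
vector = {"logic": 20, "supervised": 30, "fuzzy": 25}
discrete = {kw: discretize(f) for kw, f in vector.items()}
print(discrete)  # → {'logic': '10-24', 'supervised': '25-49', 'fuzzy': '25-49'}
```

With this mapping, the set of possible attribute values given to ID3 is simply the set of range labels.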

Example of a rule:
If the value of feature one is 52, then the example belongs to class fuzzy.
The number of keywords is 30:
Learning, logic, supervised, impurity, clustering, inductive, fuzzy, artificial intelligence, knowledge, knowledge base, ID3, neural networks, membership, entropy function, genetic algorithms, version space, uncertainty, vagueness, fuzzy logic, attribute, random, probability, space, algorithm, unsupervised, training set, training, layers. 
The number of documents is 54: 24 in the class fuzzy, 22 in the class learning and 8 in the class other.
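ID3 grows its tree by repeatedly splitting on the attribute with the highest information gain (decrease of impurity). The core computation can be sketched as follows; the four-example training set is a toy stand-in, not the 54 documents above:

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Impurity of a set of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(examples, attribute):
    """Decrease of impurity after splitting on one attribute.
    Each example is (attribute_values_dict, class_label)."""
    labels = [label for _, label in examples]
    before = entropy(labels)
    after = 0.0
    for value in {vals[attribute] for vals, _ in examples}:
        subset = [label for vals, label in examples if vals[attribute] == value]
        after += len(subset) / len(examples) * entropy(subset)
    return before - after

# Toy training set: discretized 'fuzzy' frequency vs. class label.
data = [({"fuzzy": "high"}, "fuzzy"), ({"fuzzy": "high"}, "fuzzy"),
        ({"fuzzy": "low"}, "learning"), ({"fuzzy": "low"}, "other")]
print(information_gain(data, "fuzzy"))  # → 1.0
```

A split with gain 1.0 cleanly separates the fuzzy examples from the rest, which is how rules like the one above ("feature one = 52 ⇒ class fuzzy") arise.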

IV Performance of classification
*  We consider a training set of 42 document vectors: 18 from the class fuzzy, 18 from the class learning and 6 from the last class.

*   The testing set consists of 12 patterns: 6 from the class fuzzy, 4 from the class learning and 2 from the last class.
*   Each time, we move some patterns from the testing set into the training set, and move some other patterns from the training set back into the testing set.
*   We have done this procedure 7 times.
*   The overall performance is a misclassification rate of 14.32%.

V- Comments on the ID3
* The performance of the learning algorithm is increased not only by the number of examples in the training set but also by the choice of the vectors. Considering class 1, the percentage of misclassification is very low even when using the same number of labelled examples from each class.
* Considering the four experiments, we can observe a difference in performance even with the same number of vectors in the training set: the choice of vectors can either improve or degrade performance depending on how good that choice is, because the learning program learns better from some vectors than from others.
* Is this just memorizing? No, because new examples are classified in the appropriate class.
The ID3 learning algorithm makes the rules more compact and reduces their number by selecting the most appropriate features.
* The following comments are extracted from www documents:
1- ID3 does not directly deal with continuous attribute values; to handle them we have to add some modifications, for example replacing continuous values by ranges.
2- In building the decision tree, we can deal with training sets whose records have unknown attribute values by evaluating the gain ratio for an attribute using only the records where that attribute is defined.
3- Gain-ratio(D, T) = Gain(D, T) / split-info(D, T),
where D is an attribute and Gain is the decrease of impurity.
4- In using a decision tree, we can classify records that have unknown attribute values by estimating the probabilities of the various possible results.
5- Pruning of the decision tree is done by replacing a whole sub-tree by a leaf node. The replacement takes place if the decision tree establishes that the expected error rate in the sub-tree is greater than in the single leaf.
6- ID3 classifies a document into one class, but in our project we need to classify a document into more than one class.
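The gain-ratio formula quoted above can be sketched in Python, taking split-info as the entropy of the partition induced by the attribute's values (the standard C4.5 definition); the four-example data set is a toy illustration:

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Impurity of a set of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def gain_ratio(examples, attribute):
    """Gain-ratio(D, T) = Gain(D, T) / split-info(D, T), where D is the
    attribute, Gain is the decrease of impurity, and split-info is the
    entropy of the partition induced by D's values."""
    labels = [label for _, label in examples]
    gain = entropy(labels)
    split_info = 0.0
    for value in {vals[attribute] for vals, _ in examples}:
        subset = [label for vals, label in examples if vals[attribute] == value]
        p = len(subset) / len(examples)
        gain -= p * entropy(subset)        # subtract impurity after the split
        split_info -= p * log2(p)          # entropy of the split itself
    return gain / split_info if split_info else 0.0

data = [({"fuzzy": "high"}, "fuzzy"), ({"fuzzy": "high"}, "fuzzy"),
        ({"fuzzy": "low"}, "learning"), ({"fuzzy": "low"}, "other")]
print(gain_ratio(data, "fuzzy"))  # → 1.0
```

Dividing by split-info penalizes attributes with many distinct values, which plain information gain tends to favor.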