Text Classification using Identification Trees
I- Classes:
II- How to calculate the frequency of the keywords?
We store the keywords in a linked list. Each time we extract a word from a document, we compare it against this list; if the word exists in the linked list, we increment the frequency of the keyword that it matches.
The use of a keyword list makes sense because in ID3 we must know the attributes in advance (attributes = keywords; value of an attribute = keyword frequency).
The document vectors represent the frequency of each keyword.
Example: (Logic=20, supervised=30, fuzzy=25, …)
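The counting step described above can be sketched in Python as follows. A dictionary is used here in place of the linked list (the lookup logic is the same), and the keyword list and sample document are illustrative placeholders, not the project's actual 30 keywords.

```python
# Count keyword frequencies in a document, as described above.
# The keyword list and sample document are illustrative placeholders.
keywords = ["logic", "supervised", "fuzzy", "learning", "entropy"]

def keyword_frequencies(document, keywords):
    # One counter per keyword, initialised to zero.
    freq = {k: 0 for k in keywords}
    for word in document.lower().split():
        word = word.strip(".,;:()")
        if word in freq:      # the word matches a keyword
            freq[word] += 1   # increment that keyword's frequency
    return freq

doc = "Fuzzy logic extends classical logic with degrees of truth."
print(keyword_frequencies(doc, keywords))
# → {'logic': 2, 'supervised': 0, 'fuzzy': 1, 'learning': 0, 'entropy': 0}
```

The resulting dictionary is exactly one document vector in the sense used above: one frequency per known keyword.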
III- Implementation of the ID3 learning:
1- Because ID3 does not deal with continuous attribute values, we give ID3 all possible values that each attribute can take.
2- We give ID3 the attributes as well as their possible values.
3- We put a reasonable number of document vectors in the training set.
4- We also put an appropriate number in the testing set.
5- Then we start testing.
Example of a rule:
If the value of feature one is 52, then all examples are in class fuzzy.
The number of keywords is 30:
Learning, logic, supervised, impurity, clustering, inductive, fuzzy,
artificial intelligence, knowledge, knowledge base, ID3, neural networks, membership,
entropy function, genetic algorithms, version space, uncertainty, vagueness,
fuzzy logic, attribute, random, probability, space, algorithm, unsupervised,
training set, training, layers.
The number of documents is 54:
24 in the class fuzzy, 22 in the class learning and 8 in the class other.
IV- Performance of classification:
We consider a training set of 42 document vectors: 18 from the class fuzzy, 18 from the class learning and 6 from the last class.
The testing set consists
of 12 patterns: 6 from the class fuzzy, 4 from the class learning and 2 from
the last class.
Each time, we move some patterns from the testing set to the training set and move some other patterns from the training set back to the testing set. We repeated this procedure 7 times.
The overall misclassification rate is 14.32%.
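The rotation procedure above can be sketched as follows. The patterns are represented here by plain indices, and the number of patterns swapped per round is an illustrative assumption (the report does not state it).

```python
import random

# Sketch of the rotation procedure: repeatedly swap a few patterns
# between the training and testing sets. Patterns are just indices here,
# and the swap size k is an illustrative assumption.
random.seed(0)
training = list(range(42))       # 42 training patterns
testing = list(range(42, 54))    # 12 testing patterns

def rotate(training, testing, k=2):
    # Move k patterns from testing into training, and k the other way.
    to_train = [testing.pop(random.randrange(len(testing))) for _ in range(k)]
    to_test = [training.pop(random.randrange(len(training))) for _ in range(k)]
    training.extend(to_train)
    testing.extend(to_test)

for _ in range(7):               # the report repeats the procedure 7 times
    rotate(training, testing)

print(len(training), len(testing))  # → 42 12 (set sizes are preserved)
```

Each round keeps the set sizes fixed while changing which labelled examples the tree is trained on, which is what lets the report compare performance across the seven runs.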
V- Comments on the ID3:
The performance of the learning algorithm is increased not only by the number of examples in the training set but also by the choice of the vectors. Considering class 1, the percentage of misclassification is very low even when using the same number of labelled examples from each class.
Considering the four experiments, we can observe a difference in performance even with the same number of examples in the training set: the choice of vectors can either improve or decrease the performance depending on how good the choice is, since the learning program can learn from some vectors better than from others.
Is this just memorizing? No, because some new examples are classified into the appropriate class.
The ID3 learning algorithm makes the rules more compact and reduces their number by selecting the most appropriate features.
The following comments are extracted from web documents:
6. ID3 does not directly deal with continuous attribute values; to do so, we have to add some modifications, for example replacing continuous values by ranges.
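A minimal sketch of the ranges idea, applied to the keyword frequencies used in this project; the bin edges and range labels are illustrative assumptions, not values from the report.

```python
# Replace a continuous attribute value (a keyword frequency) by a
# discrete range label, so ID3 can use it. Bin edges are assumptions.
def to_range(frequency, edges=(10, 20, 30)):
    if frequency < edges[0]:
        return "low"
    elif frequency < edges[1]:
        return "medium"
    elif frequency < edges[2]:
        return "high"
    return "very high"

print([to_range(f) for f in (5, 12, 25, 40)])
# → ['low', 'medium', 'high', 'very high']
```

After this step, each attribute has a small fixed set of possible values ("low", "medium", …), which is exactly the form ID3 expects.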
7. In building a decision tree, we can deal with training sets that have records with unknown attribute values by evaluating the gain ratio for an attribute while considering only the records where that attribute is defined.
8. Gain-ratio(D, T) = Gain(D, T) / Split-info(D, T), where D is an attribute and Gain is the decrease of impurity.
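The formula above can be made concrete with a short Python sketch; the toy class labels and attribute values below are illustrative assumptions.

```python
import math
from collections import Counter

# Sketch of the gain-ratio formula above, using entropy as the
# impurity measure. The toy labels and attribute values are assumptions.
def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain_ratio(attr_values, labels):
    n = len(labels)
    gain = entropy(labels)           # impurity before the split
    split_info = 0.0
    for v in set(attr_values):
        subset = [l for a, l in zip(attr_values, labels) if a == v]
        p = len(subset) / n
        gain -= p * entropy(subset)  # Gain = decrease of impurity
        split_info -= p * math.log2(p)
    return gain / split_info         # Gain-ratio = Gain / Split-info

labels = ["fuzzy", "fuzzy", "learning", "learning"]
attr = ["high", "high", "low", "low"]
print(gain_ratio(attr, labels))  # → 1.0 (the attribute splits the classes perfectly)
```

Dividing the gain by the split information penalizes attributes with many distinct values, which would otherwise be favored by plain information gain.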
9. In using a decision tree, we can classify records that have unknown attribute values by estimating the probability of the various possible results.
10. Pruning of the decision tree is done by replacing a whole sub-tree by a leaf node. The replacement takes place if the decision tree establishes that the expected error rate in the sub-tree is greater than in the single leaf.
11. ID3 classifies a document into one class, but in our project we need to classify a document into more than one class.