I Motivation
The purpose is to compare two algorithms, C4.5 with rules and NEFCLASS
(a neural network), and then to choose the method that best meets the needs of
the business: the one that performs well on the data (low error rate) and
offers an explanation of its decisions.
II Description of Data
The number of examples is 1000.
The number of classes is 2.
In class 1 we have 700 examples.
In class 2 we have 300 examples.
There are 20 attributes: 7 numerical and 13 categorical.
II-1 Examples of attributes
Duration of the credit.
The amount of the credit.
The purpose of the credit.
The job of the customer.
Number of existing credits at this bank.
Other debtors and guarantors….
We observe that these attributes focus on the lifestyle of the customer and the
guarantees that he offers, such as a high balance in his savings account.
II-2 Difficulties Faced Before Building Models
1. The data must be put in a specific format.
2. The categorical attributes must be converted to numerical
ones so that they can be used by the neural network (NEFCLASS).
3. The training and testing sets must be chosen at random.
4. The rules generated by C4.5 must be translated into an explicit format.
II-3 To overcome these difficulties I have implemented three programs:
1. A program that converts categorical attributes to numerical ones.
2. A program that randomly generates the training and testing sets and puts
them in an appropriate format to be processed by the methods.
3. A program that transforms the rules into an explicit format.
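
The report does not say which conversion scheme the first program uses. The following is a minimal sketch of one common choice, integer coding, in which each distinct categorical value receives the next unused integer. The function name `encode_categorical`, the column indices and the sample values are all assumptions made for illustration.

```python
def encode_categorical(rows, categorical_columns):
    """Replace each categorical value with an integer code (1, 2, 3, ...).

    Assumption: codes are assigned in order of first appearance; the
    report does not specify the scheme that was actually used.
    """
    codebooks = {col: {} for col in categorical_columns}
    encoded = []
    for row in rows:
        new_row = list(row)
        for col in categorical_columns:
            codes = codebooks[col]
            # Assign the next unused code the first time a value is seen.
            new_row[col] = codes.setdefault(row[col], len(codes) + 1)
        encoded.append(new_row)
    return encoded, codebooks


# Made-up records: (checking account status, duration in months, purpose)
rows = [
    ["negative", 6, "car"],
    ["positive", 48, "car"],
    ["negative", 12, "education"],
]
encoded, codebooks = encode_categorical(rows, categorical_columns=[0, 2])
print(encoded)
```

Integer coding imposes an artificial order on the categories, which is one way the choice of conversion could influence a network trained on the result.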
II-4 Description of Training and Testing Sets
The training set contains 2/3 of the whole set (666 examples), while the
testing set contains the remaining 1/3 (334 examples). Both sets are generated randomly.
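
A random 2/3 and 1/3 split of this kind can be sketched as follows. This is an illustration only, not the original program; the function name `split_train_test` and the `seed` parameter are assumptions.

```python
import random


def split_train_test(examples, train_fraction=2 / 3, seed=None):
    """Shuffle the examples and cut them into a training and a testing set."""
    rng = random.Random(seed)      # a seed makes the split reproducible
    shuffled = list(examples)      # copy, so the caller's data is untouched
    rng.shuffle(shuffled)
    n_train = int(len(shuffled) * train_fraction)   # 666 for 1000 examples
    return shuffled[:n_train], shuffled[n_train:]


# Stand-in data: the integers 0..999 play the role of the 1000 examples.
train, test = split_train_test(range(1000), seed=0)
print(len(train), len(test))
```

Every example ends up in exactly one of the two sets, so the testing error is always measured on patterns that were not seen during training.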
III C4.5
The default class, used when no rule can classify a pattern, is class 1
(don’t give a loan).
III-1 Output Format
For each rule, the output gives:
1. The size of the rule (the number of antecedents in its premise).
2. The misclassification rate of the rule.
3. How many times the rule was used to classify patterns.
4. How many patterns the rule classified incorrectly, and the ratio of
incorrectly classified patterns to the number of times the rule was used.
5. The advantage: the number of correctly classified patterns that would be
misclassified if the rule were not used, minus the number of patterns
misclassified by the rule that would be correctly classified without it.
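
The per-rule error ratio and the advantage can be written as two small functions. This is a sketch with made-up counts. It assumes the second term of the advantage counts patterns that the rule misclassifies and that would be classified correctly if the rule were dropped. The function and parameter names are hypothetical.

```python
def error_ratio(times_wrong, times_used):
    """Incorrectly classified patterns divided by the number of uses."""
    if times_used == 0:
        return 0.0   # a rule that never fired has no observed errors
    return times_wrong / times_used


def advantage(lost_without_rule, fixed_without_rule):
    """Net benefit of keeping a rule.

    lost_without_rule:  correctly classified patterns that would become
                        misclassified if the rule were not used.
    fixed_without_rule: patterns misclassified by the rule that would be
                        classified correctly if the rule were not used.
    """
    return lost_without_rule - fixed_without_rule


# Made-up counts for one rule: used 40 times, wrong 5 times.
print(error_ratio(5, 40))
print(advantage(30, 2))
```

A rule with a positive advantage helps the rule set as a whole; a rule whose advantage is zero or negative could be dropped without hurting accuracy on these patterns.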
III-2 Generated rules
The rules after translation to an explicit format. Example:
IF status of the existing checking account < 0 $
THEN don’t give a loan (class 1) with 88.3%
The percentage is the predicted accuracy of the rule: the classification is
expected to be correct for 88.3% of unseen cases that satisfy the rule’s
left-hand side.
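
To illustrate how a rule set with a default class is applied, here is a minimal sketch. It assumes the rules are tried in order and the first one whose premise holds decides the class. The field name `checking_balance`, the `RULES` list and the `classify` function are hypothetical; only the example rule, its 88.3% figure and the default class come from the text.

```python
DEFAULT_CLASS = 1   # class 1 (don't give a loan), used when no rule fires

# Each rule: (explicit description, premise test, class, predicted accuracy)
RULES = [
    (
        "status of the existing checking account < 0",
        lambda customer: customer["checking_balance"] < 0,
        1,
        0.883,
    ),
]


def classify(customer, rules=RULES, default=DEFAULT_CLASS):
    """Return (class, explanation) for one customer record."""
    for description, premise_holds, label, _accuracy in rules:
        if premise_holds(customer):
            return label, description    # the fired rule is the explanation
    return default, "default class"      # no rule matched


print(classify({"checking_balance": -50}))
print(classify({"checking_balance": 200}))
```

Returning the description of the rule that fired alongside the class is what makes the decision explainable to the customer.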
IV-1 NEFCLASS output
1. The error rate.
2. The number of misclassified patterns.
3. The total error and the mean error.
Note: after 15 epochs of training, the error rate remains constant.
V The Overall Performance
Both methods were trained on the same training set and tested on the same
testing set, after which the error rate was evaluated. This procedure was
repeated 30 times.
| Method | Average Training Error % | Average Testing Error % |
| C4.5 | 21.43% | 26.95% |
| NEFCLASS | 28.70% | 30.14% |

| Method | Max Testing Error % | Min Testing Error % |
| C4.5 | 28.53% | 25.71% |
| NEFCLASS | 34.73% | 27.25% |

| Method | Max Training Error % | Min Training Error % |
| C4.5 | 29.1% | 21.0% |
| NEFCLASS | 31.38% | 27.78% |
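
The average, maximum and minimum error over the 30 runs can be gathered with a loop like the one below. This is a sketch, not the original experiment code. It assumes a fresh random split is drawn for every run. The `train_and_test` callable stands in for C4.5 or NEFCLASS, and the majority-class baseline used here is only a placeholder so that the example runs.

```python
import random
from statistics import mean


def repeated_evaluation(examples, train_and_test, runs=30, seed=0):
    """Run train_and_test on `runs` random 2/3 and 1/3 splits and summarise
    the testing error (in percent) it returns."""
    rng = random.Random(seed)
    n_train = int(len(examples) * 2 / 3)
    errors = []
    for _ in range(runs):
        shuffled = list(examples)
        rng.shuffle(shuffled)
        errors.append(train_and_test(shuffled[:n_train], shuffled[n_train:]))
    return {"average": mean(errors), "max": max(errors), "min": min(errors)}


def majority_baseline(train, test):
    """Placeholder learner: always predict the most frequent training class."""
    majority = max(set(train), key=train.count)
    wrong = sum(1 for label in test if label != majority)
    return 100.0 * wrong / len(test)


# Stand-in data: class labels only, 700 of class 1 and 300 of class 2.
labels = [1] * 700 + [2] * 300
summary = repeated_evaluation(labels, majority_baseline)
print(summary)
```

Reporting the maximum and minimum next to the average shows how much the result depends on the particular random split.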
C4.5 offers an explicit explanation of the decision made for each pattern, so
its rules can be used to explain to the customer every decision made by the model.
C4.5 generates a limited number of rules.
The rules generated meet the criteria required by banks to accept a loan request.
C4.5 is well suited to categorical attributes, but less well suited to
continuous values.
One reason for the lower performance of NEFCLASS is that it is not well suited
to categorical attributes, which dominate this data set (13 categorical versus
7 numerical); as a neural network, however, it can work well with continuous values.
The way the categorical attributes are converted to numerical ones can also
affect the performance of NEFCLASS.
Looking at the rules generated by C4.5, we can conclude that the criteria
required by the bank are based on the guarantees offered by the customer. This
can lead the bank to lose some good customers who could repay the loan even
though they do not have sufficient guarantees.
Both methods offer a quick answer to the customer, which is a very important
criterion for business strategy.
Both methods offer transparent decisions, so decisions are not taken
arbitrarily, even if some cases are exceptions.