
 

 

Project Title: A Study of Credit Acceptance Using C4.5 and NN

 

 

 

I Motivation
The purpose of this project is to compare two algorithms, C4.5 with rules and NEFCLASS (a neural network), and then to choose the method that best meets the needs of the business: that is, the method that performs well on the data (low error rate) and offers an explanation of its decisions.

II Description of Data
The number of examples is 1000.
The number of classes is 2.
In class 1 we have 700 examples.
In class 2 we have 300 examples.
We have 20 attributes divided into 7 numerical and 13 categorical attributes.

II-1 Examples of attributes
*      Duration of the credit.
*      The amount of the credit.
*      The purpose of the credit.
*      The job of the customer.
*      Number of existing credits at this bank.
*      Other debtors and guarantors…
We observe that these attributes focus on the lifestyle of the customer and the guarantees that he offers, such as a high balance in his savings account.

II-2 Difficulties Faced Before Building Models
1.   The data should be put in a specific format.
2.   The categorical attributes should be converted to numerical ones to be used by the Neural Network (NEFCLASS).
3.   A random choice of the training and testing sets.
4.   An explicit format for the rules to be generated (C4.5).

II-3 To overcome these difficulties I have implemented three programs:
1.   A program that converts categorical attributes to numerical ones.
2.   A program that randomly generates the training and testing sets and puts them in an appropriate format to be processed by the methods.
3.   A program that transforms the rules to an explicit format.
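
The report does not say which conversion scheme the first program uses. As a minimal sketch, assuming a simple ordinal coding in which each distinct value of a categorical attribute receives the next unused integer, the conversion could look like this (the function name, column indices and sample values are hypothetical):

```python
def encode_categorical(rows, categorical_columns):
    """Replace categorical values with integer codes, column by column.

    Assumption: ordinal coding by order of first appearance. The report
    does not describe the scheme that was actually used.
    """
    codebooks = {col: {} for col in categorical_columns}
    encoded = []
    for row in rows:
        new_row = list(row)
        for col in categorical_columns:
            book = codebooks[col]
            # The first time a value is seen it gets the next unused integer.
            new_row[col] = book.setdefault(row[col], len(book) + 1)
        encoded.append(new_row)
    return encoded, codebooks


# Hypothetical sample: purpose, duration in months, job.
rows = [
    ["car", 24, "skilled"],
    ["education", 12, "unskilled"],
    ["car", 36, "skilled"],
]
encoded, codebooks = encode_categorical(rows, categorical_columns=[0, 2])
print(encoded)
print(codebooks)
```

An ordinal coding imposes an artificial order on the categories, which is one way the choice of conversion might influence a numerical method.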

II-4 Description of Training and Testing Sets
The training set contains 2/3 (666) of the whole set, while the testing set contains 1/3 (334). These two sets are generated randomly.
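
A random split of this kind can be sketched as follows. This is only an illustration, assuming a plain shuffle followed by a cut at 666 examples; the function name and the seed parameter are hypothetical and the original program may work differently.

```python
import random


def split_train_test(examples, train_size, seed=None):
    """Shuffle a copy of the examples and cut it into training and testing sets."""
    rng = random.Random(seed)  # a fixed seed makes the split reproducible
    shuffled = list(examples)
    rng.shuffle(shuffled)
    return shuffled[:train_size], shuffled[train_size:]


# Stand-in for the 1000 credit examples: here just their indices.
examples = list(range(1000))
train, test = split_train_test(examples, train_size=666, seed=0)
print(len(train), len(test))
```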

III C4.5

The default class, used when no rule can classify a pattern, is class 1 (don’t give a loan).
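
The role of the default class can be illustrated with a small sketch. It assumes that rules are tried in order and that the first rule whose condition holds decides the class. The rule list, attribute names and thresholds below are hypothetical and are not the rules produced in this study.

```python
DEFAULT_CLASS = 1  # class 1 = don't give a loan, as stated in the text


def classify(pattern, rules, default_class=DEFAULT_CLASS):
    """Return the class of the first rule whose condition holds, else the default."""
    for condition, predicted_class in rules:
        if condition(pattern):
            return predicted_class
    return default_class


# Hypothetical rules: (condition on the pattern, predicted class).
rules = [
    (lambda p: p["checking_status"] < 0, 1),
    (lambda p: p["savings"] >= 1000, 2),
]

first = classify({"checking_status": -50, "savings": 0}, rules)     # first rule fires
second = classify({"checking_status": 200, "savings": 5000}, rules)  # second rule fires
third = classify({"checking_status": 200, "savings": 0}, rules)     # no rule fires
print(first, second, third)
```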

III-1 Output Format
1.    The size of the rule (how many antecedents its premise contains).
2.    The misclassification rate when using this rule.
3.    How many times it was used to classify patterns.
4.    How many patterns the rule classified incorrectly, and the ratio = incorrectly classified patterns / number of times the rule was used to classify.
5.    The advantage: the number of correctly classified patterns that would be misclassified if the rule were not used, minus the number of misclassified patterns that would be correctly classified if the rule were not used.

III-2 Generated rules
The rules after the translation to an explicit format:

Example
                      IF status of the existing checking account < 0 $
                      THEN don’t give a loan (class 1) with 88.3%


The final percentage is the predicted accuracy: the classification is expected to be correct for 88.3% of unseen cases that satisfy this rule’s left-hand side.

The size of the rule is 1 (one condition in its left-hand side).
The rule was used to classify 97 patterns.
The prediction error of this rule is 11.7% (100% - 88.3%).
The error rate is 18.6% (18/97) because 18 patterns were misclassified by this rule.
The advantage of this rule is 7.
If a rule has a negative advantage it is omitted, so the only rules used for classification are those with a positive advantage.
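
The two percentages for this rule follow directly from the figures quoted above: the observed error rate is the number of misclassified patterns divided by the number of times the rule was used, and the prediction error is the complement of the predicted accuracy. A short check (variable names are my own):

```python
times_used = 97             # patterns classified by the rule
misclassified = 18          # patterns the rule classified incorrectly
predicted_accuracy = 88.3   # percentage reported with the rule

# Observed error rate: misclassified patterns / times the rule was used.
observed_error_rate = 100 * misclassified / times_used
# Prediction error: complement of the predicted accuracy.
prediction_error = 100 - predicted_accuracy

print(round(observed_error_rate, 1))  # 18.6
print(round(prediction_error, 1))     # 11.7
```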


IV NEFCLASS
NEFCLASS was trained on each training set for 30 epochs.

IV-1 NEFCLASS output
1.    The error rate.
2.    The number of misclassified patterns.
3.    The total error and the mean error.
Note: After 15 epochs of training the error rate remains constant.

V The Overall Performance
Each method was trained and then tested on the same training and testing sets, after which the error rate was evaluated. This was repeated 30 times.
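
The evaluation procedure can be sketched as a repeated random hold-out. The sketch below is an assumption about how the 30 runs were organised (a fresh random 666/334 split per run). Since C4.5 and NEFCLASS are external programs, a trivial majority-class predictor stands in for them, and the data are synthetic labels with the 700/300 class balance described earlier. All function names are hypothetical.

```python
import random
from statistics import mean


def error_rate(predict, examples):
    """Percentage of examples whose predicted class differs from the label."""
    wrong = sum(1 for features, label in examples if predict(features) != label)
    return 100 * wrong / len(examples)


def train_majority(train):
    """Stand-in learner: always predicts the most frequent training class."""
    labels = [label for _, label in train]
    majority = max(set(labels), key=labels.count)
    return lambda features: majority


def repeated_holdout(examples, train_fn, runs=30, train_size=666, seed=0):
    """Train and test on a fresh random split in each run; collect both errors."""
    rng = random.Random(seed)
    train_errors, test_errors = [], []
    for _ in range(runs):
        shuffled = list(examples)
        rng.shuffle(shuffled)
        train, test = shuffled[:train_size], shuffled[train_size:]
        predict = train_fn(train)
        train_errors.append(error_rate(predict, train))
        test_errors.append(error_rate(predict, test))
    return train_errors, test_errors


# Synthetic stand-in data: empty feature tuples, 700 labels of class 1, 300 of class 2.
examples = [((), 1)] * 700 + [((), 2)] * 300
train_errors, test_errors = repeated_holdout(examples, train_majority)
print("average:", mean(test_errors), "max:", max(test_errors), "min:", min(test_errors))
```

Replacing `train_majority` with a wrapper around each real method would yield the average, maximum and minimum error figures of the kind reported in the tables.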

Table 1

Method      Average Training Error %    Average Testing Error %
C4.5        21.43%                      26.95%
NEFCLASS    28.70%                      30.14%


Table 2

Method      Max Testing Error %    Min Testing Error %
C4.5        28.53%                 25.71%
NEFCLASS    34.73%                 27.25%


Table 3

Method      Max Training Error %    Min Training Error %
C4.5        29.1%                   21.0%
NEFCLASS    31.38%                  27.78%


VI Conclusions and Remarks

*      C4.5 shows better average performance on the data than NEFCLASS (average testing error of 26.95% versus 30.14%).

*      C4.5 offers an explicit explanation of the decision made for each pattern, so its rules can be used to explain to the customer each decision the model makes.

*      C4.5 generates a limited number of rules.

*      The rules generated meet the criteria required by banks to accept a loan request.

*      C4.5 is well suited to categorical attributes but not to continuous values.

*      One reason for the low performance of NEFCLASS is that it is not well suited to categorical attributes (13 categorical versus 7 numerical), although as a neural network it can work well with continuous values.

*      The way the categorical attributes are converted to numerical ones can have an impact on the performance of NEFCLASS.

*      Looking at the rules generated by C4.5, we can conclude that the criteria required by the bank are based on the guarantees offered by the customer. This can lead the bank to lose some good customers who could repay their loans even though they do not have enough guarantees.

*      Both methods offer a quick answer to the customer. This is a very important criterion for business strategy.

*      The methods offer decision transparency, so decisions are not taken arbitrarily, even if there are exceptions in some cases.

*      Some attributes are rarely involved in the decision making.