I am writing a C4.5-type decision tree algorithm and wondering if I
should account for the intrinsic information when splitting over a
numeric attribute.
Consider the following sample:
A1   A2
 0   play
 3   no play
 5   play      <- split just above this
 9   play
15   no play
To evaluate candidate splits at different values, one can calculate the
information gain and the gain ratio (i.e. information gain / intrinsic
information) for each. Then the best split point must be selected.
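To make the procedure concrete, here is a rough Python sketch that scores every candidate split of the sample by both measures. The names info and evaluate_splits are made up for illustration, and using the midpoint between adjacent values as the threshold is my own assumption, not necessarily what C4.5 does.

```python
from collections import Counter
from math import log2


def info(labels):
    """Entropy, in bits, of a list of labels."""
    total = len(labels)
    return -sum(n / total * log2(n / total) for n in Counter(labels).values())


def evaluate_splits(rows):
    """Return (threshold, gain, gain_ratio) for each candidate binary split.

    Assumption: thresholds are midpoints between adjacent distinct values.
    """
    rows = sorted(rows)
    labels = [label for _, label in rows]
    base = info(labels)  # information before splitting
    results = []
    for i in range(1, len(rows)):
        lo, hi = rows[i - 1][0], rows[i][0]
        if lo == hi:
            continue  # no threshold fits between equal values
        left, right = labels[:i], labels[i:]
        weight = i / len(rows)
        gain = base - (weight * info(left) + (1 - weight) * info(right))
        # Intrinsic information: entropy of the branch sizes alone.
        intrinsic = info(["left"] * i + ["right"] * (len(rows) - i))
        results.append(((lo + hi) / 2, gain, gain / intrinsic))
    return results


sample = [(0, "play"), (3, "no play"), (5, "play"), (9, "play"), (15, "no play")]
for threshold, gain, ratio in evaluate_splits(sample):
    print(f"A1 <= {threshold:4.1f}  gain = {gain:.4f}  gain ratio = {ratio:.4f}")
```

On this sample the intrinsic information is about 0.72 for the 1/4 splits and about 0.97 for the 2/3 splits, so it is not constant even though every split has two branches. Both measures happen to rank the split between 9 and 15 highest here.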
For example, when splitting just above A1 = 5, we have:
info = 2/5 * info([1,1]) + 3/5 * info([2,1])
    // above split = (1 play + 1 no play), below split = (2 play + 1 no play)
intrinsic_info = info([2,3])
    // above split = 2 rows, below split = 3 rows
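Plugging in the numbers for this one split, as a quick check of the arithmetic. This is only a sketch: info here is assumed to be the usual entropy in bits over class counts, and [3, 2] is the class distribution of the whole sample (3 play, 2 no play).

```python
from math import log2


def info(counts):
    """Entropy, in bits, of a list of class counts (assumed definition)."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)


info_before = info([3, 2])                                # whole sample
info_after = 2 / 5 * info([1, 1]) + 3 / 5 * info([2, 1])  # weighted by branch size
gain = info_before - info_after
intrinsic_info = info([2, 3])                             # branch sizes only
gain_ratio = gain / intrinsic_info

print(f"info after split = {info_after:.4f}")
print(f"gain             = {gain:.4f}")
print(f"intrinsic info   = {intrinsic_info:.4f}")
print(f"gain ratio       = {gain_ratio:.4f}")
```

This gives an information of roughly 0.951 after the split, a gain of roughly 0.020, and a gain ratio of roughly 0.021.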
Given that a numeric split always yields only 2 branches, I am
wondering how important it is to incorporate the intrinsic information
(i.e. to use the gain ratio rather than the plain information gain)
when deciding which split point is best.
Thoughts?
[ comp.ai is moderated ... your article may take a while to appear. ]

