Title: | C5.0 Decision Trees and Rule-Based Models |
---|---|
Description: | C5.0 decision trees and rule-based models for pattern recognition that extend the work of Quinlan (1993, ISBN:1-55860-238-0). |
Authors: | Max Kuhn [aut, cre], Steve Weston [ctb], Mark Culp [ctb], Nathan Coulter [ctb], Ross Quinlan [aut] (Author of imported C code), RuleQuest Research [cph] (Copyright holder of imported C code), Rulequest Research Pty Ltd. [cph] (Copyright holder of imported C code) |
Maintainer: | Max Kuhn <[email protected]> |
License: | GPL-3 |
Version: | 0.1.8 |
Built: | 2024-10-27 04:04:45 UTC |
Source: | https://github.com/topepo/c5.0 |
Fit classification tree models or rule-based models using Quinlan's C5.0 algorithm
## Default S3 method:
C5.0(
  x,
  y,
  trials = 1,
  rules = FALSE,
  weights = NULL,
  control = C5.0Control(),
  costs = NULL,
  ...
)

## S3 method for class 'formula'
C5.0(formula, data, weights, subset, na.action = na.pass, ...)
x |
a data frame or matrix of predictors. |
y |
a factor vector with 2 or more levels |
trials |
an integer specifying the number of boosting iterations. A value of one indicates that a single model is used. |
rules |
A logical: should the tree be decomposed into a rule-based model? |
weights |
an optional numeric vector of case weights. Note that the data used for the case weights will not be used as a splitting variable in the model (see http://www.rulequest.com/see5-win.html#CASEWEIGHT for Quinlan's notes on case weights). |
control |
a list of control parameters; see C5.0Control() |
costs |
a matrix of costs associated with the possible errors. The matrix should have C columns and rows where C is the number of class levels. |
... |
other options to pass into the function (not currently used with default method) |
formula |
a formula, with a response and at least one predictor. |
data |
an optional data frame in which to interpret the variables named in the formula. |
subset |
optional expression saying that only a subset of the rows of the data should be used in the fit. |
na.action |
a function which indicates what should happen when the data contain missing values (NAs) |
This model extends the C4.5 classification algorithms described in Quinlan (1993). The details of the extensions are largely undocumented. The model can take the form of a full decision tree or a collection of rules (or boosted versions of either).
When using the formula method, factors and other classes are preserved (i.e. dummy variables are not automatically created). This particular model handles non-numeric data of some types (such as character, factor and ordered data).
The cost matrix should be C x C, where C is the number of classes. Diagonal elements are ignored. Columns should correspond to the true classes and rows to the predicted classes. For example, if C = 3 with classes Red, Blue and Green (in that order), a value of 5 in the (2,3) element of the matrix would indicate that the cost of predicting a Green sample as Blue is five times the usual value (of one). Note that when costs are used, class probabilities cannot be generated using predict.C5.0().
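As a rough illustration of the layout described above, the following base R sketch builds the 3 x 3 cost matrix for the Red, Blue and Green example. The class names come from the text; the object name cost_mat and the choice to zero the diagonal are assumptions made here for readability.

```r
# Class levels, in the order used in the example above
classes <- c("Red", "Blue", "Green")

# Start from the usual cost of one for every error.
# Rows are predicted classes and columns are true classes.
cost_mat <- matrix(1, nrow = 3, ncol = 3, dimnames = list(classes, classes))

# Diagonal elements are ignored, so zero them for readability
diag(cost_mat) <- 0

# Predicting a Green sample (column 3) as Blue (row 2) costs five times as much
cost_mat["Blue", "Green"] <- 5

cost_mat
```

A matrix like this would then be supplied through the costs argument of C5.0() for an outcome whose levels are Red, Blue and Green.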
Internally, the code will attempt to halt boosting if it appears to be ineffective. For this reason, the value of trials may be different from what the model actually produced. There is an option to turn this off in C5.0Control().
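One way to see the effect of early stopping is to compare the number of boosting iterations kept with and without it. The sketch below assumes the C50 package is installed and uses the built-in iris data purely for illustration; the object names are hypothetical, and the counts printed will depend on the data.

```r
library(C50)

# Request 10 boosting iterations; early stopping is on by default
mod_default <- C5.0(Species ~ ., data = iris, trials = 10)

# Same request, but with the internal stopping rule switched off
mod_full <- C5.0(
  Species ~ .,
  data = iris,
  trials = 10,
  control = C5.0Control(earlyStopping = FALSE)
)

# The "Actual" element holds the number of iterations the model kept
mod_default$trials["Actual"]
mod_full$trials["Actual"]
```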
An object of class C5.0 with elements:
boostResults |
a parsed version of the boosting table(s) shown in the output |
call |
the function call |
caseWeights |
not currently supported. |
control |
an echo of the specifications from the control argument |
cost |
the text version of the cost matrix (or "") |
costMatrix |
an echo of the model argument |
dims |
original dimensions of the predictor matrix or data frame |
levels |
a character vector of factor levels for the outcome |
names |
a string version of the names file |
output |
a string version of the command line output |
predictors |
a character vector of predictor names |
rbm |
a logical for rules |
rules |
a character version of the rules file |
size |
an integer vector of the tree/rule size (or sizes in the case of boosting) |
tree |
a string version of the tree file |
trials |
a named vector with elements Requested (an echo of the function call) and Actual (how many the model used) |
The command line version currently supports more data types than the R port. Currently, numeric, factor and ordered factors are allowed as predictors.
Original GPL C code by Ross Quinlan, R code and modifications to C by Max Kuhn, Steve Weston and Nathan Coulter
Quinlan R (1993). C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers, http://www.rulequest.com/see5-unix.html
C5.0Control(), summary.C5.0(), predict.C5.0(), C5imp()
library(modeldata)
data(mlc_churn)

treeModel <- C5.0(x = mlc_churn[1:3333, -20], y = mlc_churn$churn[1:3333])
treeModel
summary(treeModel)

ruleModel <- C5.0(churn ~ ., data = mlc_churn[1:3333, ], rules = TRUE)
ruleModel
summary(ruleModel)
Various parameters that control aspects of the C5.0 fit.
C5.0Control(
  subset = TRUE,
  bands = 0,
  winnow = FALSE,
  noGlobalPruning = FALSE,
  CF = 0.25,
  minCases = 2,
  fuzzyThreshold = FALSE,
  sample = 0,
  seed = sample.int(4096, size = 1) - 1L,
  earlyStopping = TRUE,
  label = "outcome"
)
subset |
A logical: should the model evaluate groups of discrete predictors for splits? Note: the C5.0 command line version defaults this parameter to FALSE, whereas the default here is TRUE. |
bands |
An integer between 2 and 1000. If |
winnow |
A logical: should predictor winnowing (i.e., feature selection) be used? |
noGlobalPruning |
A logical: should the final, global pruning step that simplifies the tree be skipped? |
CF |
A number in (0, 1) for the confidence factor. |
minCases |
an integer for the smallest number of samples that must be put in at least two of the splits. |
fuzzyThreshold |
A logical toggle to evaluate possible advanced splits of the data. See Quinlan (1993) for details and examples. |
sample |
A value between (0, .999) that specifies the random proportion of the data that should be used to train the model. By default, all the samples are used for model training. Samples not used for training are used to evaluate the accuracy of the model in the printed output. |
seed |
An integer for the random number seed within the C code. |
earlyStopping |
A logical to toggle whether the internal method for stopping boosting should be used. |
label |
A character label for the outcome used in the output. |
A list of options.
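To show how several of these options fit together, here is a minimal sketch that builds a control list and passes it to C5.0(). It assumes the C50 package is installed and uses the built-in iris data; the specific values are illustrative only, not recommendations, and the object names are hypothetical.

```r
library(C50)

# Illustrative settings only
ctrl <- C5.0Control(
  CF = 0.1,      # confidence factor, a number in (0, 1)
  minCases = 5,  # smallest number of samples in at least two of the splits
  seed = 123     # fix the random number seed used within the C code
)

# The control list is handed to the model through the control argument
mod <- C5.0(Species ~ ., data = iris, control = ctrl)
summary(mod)
```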
Original GPL C code by Ross Quinlan, R code and modifications to C by Max Kuhn, Steve Weston and Nathan Coulter
Quinlan R (1993). C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers, http://www.rulequest.com/see5-unix.html
C5.0(), predict.C5.0(), summary.C5.0(), C5imp()
library(modeldata)
data(mlc_churn)

treeModel <- C5.0(x = mlc_churn[1:3333, -20],
                  y = mlc_churn$churn[1:3333],
                  control = C5.0Control(winnow = TRUE))
summary(treeModel)
This function calculates the variable importance (aka attribute usage) for C5.0 models.
C5imp(object, metric = "usage", pct = TRUE, ...)
object |
an object of class C5.0 |
metric |
either 'usage' or 'splits' (see Details below) |
pct |
a logical: should the importance values be converted to be between 0 and 100? |
... |
other options (not currently used) |
By default, C5.0 measures predictor importance by determining the percentage of training set samples that fall into all the terminal nodes after the split (this is used when metric = "usage"). For example, the predictor in the first split automatically has an importance measurement of 100 percent. Other predictors may be used frequently in splits, but if the terminal nodes cover only a handful of training set samples, the importance scores may be close to zero. The same strategy is applied to rule-based models as well as the corresponding boosted versions of the model.
There is a difference in the attribute usage numbers between this output and the nominal command line output. Although the calculations are almost exactly the same (we do not add 1/2 to everything), the C code does not display that an attribute was used if the percentage of training samples covered by the corresponding splits is very low. Here, the threshold was lowered and the fractional usage is shown.
When metric = "splits", the percentage of splits associated with each predictor is calculated.
A data frame with a column Overall with the predictor usage values. The row names indicate the predictor.
Original GPL C code by Ross Quinlan, R code and modifications to C by Max Kuhn, Steve Weston and Nathan Coulter
Quinlan R (1993). C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers, http://www.rulequest.com/see5-unix.html
C5.0(), C5.0Control(), summary.C5.0(), predict.C5.0()
library(modeldata)
data(mlc_churn)

treeModel <- C5.0(x = mlc_churn[1:3333, -20], y = mlc_churn$churn[1:3333])
C5imp(treeModel)
C5imp(treeModel, metric = "splits")
Plot a decision tree.
## S3 method for class 'C5.0'
plot(x, trial = 0, subtree = NULL, ...)
x |
an object of class C5.0 |
trial |
an integer for how many boosting iterations are
used for prediction. NOTE: the internals of |
subtree |
an optional integer that can be used to isolate
nodes below the specified split. See
|
... |
options passed to |
No value is returned; a plot is rendered.
Mark Culp, Max Kuhn
Quinlan R (1993). C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers, http://www.rulequest.com/see5-unix.html
mod1 <- C5.0(Species ~ ., data = iris)
plot(mod1)
plot(mod1, subtree = 3)

mod2 <- C5.0(Species ~ ., data = iris, trials = 10)
plot(mod2) ## should be the same as above

## plot first weighted tree
plot(mod2, trial = 1)
This function produces predicted classes or confidence values from a C5.0 model.
## S3 method for class 'C5.0'
predict(
  object,
  newdata = NULL,
  trials = object$trials["Actual"],
  type = "class",
  na.action = na.pass,
  ...
)
object |
an object of class C5.0 |
newdata |
a matrix or data frame of predictors |
trials |
an integer for how many boosting iterations are used for prediction. See the note below. |
type |
either "class" for the predicted class or "prob" for model confidence values |
na.action |
when using a formula for the original model fit, how should missing values be handled? |
... |
other options (not currently used) |
Note that the number of trials in the object may be less than what was specified originally (unless earlyStopping = FALSE was used in C5.0Control()). If the number requested is larger than the actual number available, the maximum actual is used and a warning is issued.
Model confidence values reflect the distribution of the classes in terminal nodes or within rules.
For rule-based models (i.e. not boosted), the predicted confidence value is the confidence value from the most specific, active rule. Note that C4.5 sorts the rules, and uses the first active rule for prediction. However, the default in the original sources did not normalize the confidence values. For example, for two classes it was possible to get confidence values of (0.3815, 0.8850) or (0.0000, 0.922), which do not add to one. For rules, this code divides the values by their sum. The previous values would be converted to (0.3012, 0.6988) and (0, 1). There are also cases where no rule is activated. Here, equal values are assigned to each class.
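The normalization described above can be checked with a few lines of base R. This is only a sketch of the arithmetic, not the package's internal code; the function name normalize_confidence is hypothetical, and treating "no rule is activated" as an all-zero vector is an assumption made for the illustration.

```r
# Divide raw confidence values by their sum so that they add to one.
# Assumption: an all-zero vector stands in for the case where no rule is
# activated, in which case equal values are assigned to each class.
normalize_confidence <- function(conf) {
  total <- sum(conf)
  if (total == 0) {
    return(rep(1 / length(conf), length(conf)))
  }
  conf / total
}

# The two examples from the text
round(normalize_confidence(c(0.3815, 0.8850)), 4)  # 0.3012 0.6988
round(normalize_confidence(c(0.0000, 0.922)), 4)   # 0 1

# No active rule: equal values for each class
normalize_confidence(c(0, 0))                      # 0.5 0.5
```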
For boosting, the per-class confidence values are aggregated over all of the trees created during the boosting process and these aggregate values are normalized so that the overall per-class confidence values sum to one.
When the costs argument is used in the main function, class probabilities derived from the class distribution in the terminal nodes may not be consistent with the final predicted class. For this reason, requesting class probabilities from a model using unequal costs will throw an error.
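The following sketch illustrates the behavior described above: class predictions remain available for a cost-sensitive model, while a request for confidence values is expected to fail. It assumes the C50 package is installed and uses the built-in iris data; the cost values and object names are hypothetical, and the error is captured with tryCatch() rather than allowed to stop the script.

```r
library(C50)

# Cost matrix named after the outcome levels
# (rows = predicted class, columns = true class)
lvls <- levels(iris$Species)
cost_mat <- matrix(1, nrow = 3, ncol = 3, dimnames = list(lvls, lvls))
diag(cost_mat) <- 0
cost_mat["versicolor", "virginica"] <- 5

cost_mod <- C5.0(Species ~ ., data = iris, costs = cost_mat)

# Class predictions are still available
head(predict(cost_mod, newdata = iris, type = "class"))

# Requesting confidence values is expected to fail; capture the message
prob_result <- tryCatch(
  predict(cost_mod, newdata = iris, type = "prob"),
  error = function(e) conditionMessage(e)
)
prob_result
```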
When type = "class", a factor vector is returned. When type = "prob", a matrix of confidence values is returned (one column per class).
Original GPL C code by Ross Quinlan, R code and modifications to C by Max Kuhn, Steve Weston and Nathan Coulter
Quinlan R (1993). C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers, http://www.rulequest.com/see5-unix.html
C5.0(), C5.0Control(), summary.C5.0(), C5imp()
library(modeldata)
data(mlc_churn)

treeModel <- C5.0(x = mlc_churn[1:3333, -20], y = mlc_churn$churn[1:3333])
predict(treeModel, mlc_churn[3334:3350, -20])
predict(treeModel, mlc_churn[3334:3350, -20], type = "prob")
This function prints out detailed summaries for C5.0 models.
## S3 method for class 'C5.0'
summary(object, ...)
object |
an object of class C5.0 |
... |
other options (not currently used) |
The output of this function mirrors the output of the C5.0 command line version.
The terminal nodes have text indicating the number of samples covered by the node and the number that were incorrectly classified. Note that, due to how the model handles missing values, the sample numbers may be fractional.
There is a difference in the attribute usage numbers between this output and the nominal command line output. Although the calculations are almost exactly the same (we do not add 1/2 to everything), the C code does not display that an attribute was used if the percentage of training samples covered by the corresponding splits is very low. Here, the threshold was lowered and the fractional usage is shown.
A list with values
output |
a single text string with the model output |
comp2 |
the call to this function |
Original GPL C code by Ross Quinlan, R code and modifications to C by Max Kuhn, Steve Weston and Nathan Coulter
Quinlan R (1993). C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers, http://www.rulequest.com/see5-unix.html
C5.0(), C5.0Control(), predict.C5.0(), C5imp()
library(modeldata)
data(mlc_churn)

treeModel <- C5.0(x = mlc_churn[1:3333, -20], y = mlc_churn$churn[1:3333])
summary(treeModel)