---
title: "C5.0 Classification Models"
vignette: >
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteIndexEntry{C5.0 Classification Models}
output:
  knitr:::html_vignette:
    toc: yes
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(echo = TRUE)
library(C50)
library(modeldata)
```

The `C50` package contains an interface to the C5.0 classification model. The two main modes for this model are:

* a basic tree-based model
* a rule-based model

Many of the details of this model can be found in [Quinlan (1993)](https://books.google.com/books?id=b3ujBQAAQBAJ&lpg=PP1&ots=sQ2nTTEpC1&dq=C4.5%3A%20Programs%20for%20Machine%20Learning&lr&pg=PR6#v=onepage&q=C4.5:%20Programs%20for%20Machine%20Learning&f=false), although the model has new features that are described in [Kuhn and Johnson (2013)](http://appliedpredictivemodeling.com/). The main public resource on this model is the [RuleQuest website](http://www.rulequest.com/see5-info.html).

To demonstrate a simple model, we'll use the credit data that can be accessed in the [`modeldata` package](https://github.com/tidymodels/modeldata):

```{r credit-data}
library(modeldata)
data(credit_data)
```

The outcome is in a column called `Status`; to demonstrate a simple model, the `Home` and `Seniority` predictors will be used.
```{r credit-vars}
vars <- c("Home", "Seniority")
str(credit_data[, c(vars, "Status")])

# a simple split
set.seed(2411)
in_train   <- sample(1:nrow(credit_data), size = 3000)
train_data <- credit_data[ in_train, ]
test_data  <- credit_data[-in_train, ]
```

## Classification Trees

To fit a simple classification tree model, we can start with the non-formula method:

```{r tree-mod}
library(C50)
tree_mod <- C5.0(x = train_data[, vars], y = train_data$Status)
tree_mod
```

To understand the model, the `summary` method can be used to get the default `C5.0` command-line output:

```{r tree-summ}
summary(tree_mod)
```

A graphical representation of the model can be generated by the `plot` method:

```{r tree-plot, fig.width = 10}
plot(tree_mod)
```

A variety of options are outlined in the documentation for the `C5.0Control` function. Another option is the `trials` argument, which enables a boosting procedure. This method is more similar to AdaBoost than to more statistical approaches such as stochastic gradient boosting. For example, using three iterations of boosting:

```{r tree-boost}
tree_boost <- C5.0(x = train_data[, vars], y = train_data$Status, trials = 3)
summary(tree_boost)
```

Note that the counting is zero-based. The `plot` method can also show a specific tree in the ensemble using the `trial` option.

## Rule-Based Models

C5.0 can create an initial tree model and then decompose the tree structure into a set of mutually exclusive rules. These rules can then be pruned and modified into a smaller set of _potentially_ overlapping rules. The rules can be created using the `rules` option:

```{r rule-mod}
rule_mod <- C5.0(x = train_data[, vars], y = train_data$Status, rules = TRUE)
rule_mod
summary(rule_mod)
```

Note that no pruning was warranted for this model. There is no `plot` method for rule-based models.

## Predictions

The `predict` method can be used to get hard class predictions or class probability estimates (also known as "confidence values" in the documentation).
```{r pred}
predict(rule_mod, newdata = test_data[1:3, vars])
predict(tree_boost, newdata = test_data[1:3, vars], type = "prob")
```

## Cost-Sensitive Models

A cost matrix can also be used to emphasize certain classes over others. For example, to get more of the "bad" samples correct:

```{r cost}
cost_mat <- matrix(c(0, 2, 1, 0), nrow = 2)
rownames(cost_mat) <- colnames(cost_mat) <- c("bad", "good")
cost_mat

cost_mod <- C5.0(x = train_data[, vars], y = train_data$Status,
                 costs = cost_mat)
summary(cost_mod)

# more samples predicted as "bad"
table(predict(cost_mod, test_data[, vars]))

# than previously
table(predict(tree_mod, test_data[, vars]))
```
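
# Other Options

The `C5.0Control` options mentioned earlier are passed through the `control` argument of `C5.0`. As one sketch, the `minCases` option (the minimum number of training samples required in at least two of the splits at each branch) can be raised to produce a smaller, simpler tree; the value of 75 used here is an arbitrary illustration, not a recommended setting:

```{r control}
library(C50)

# refit the tree, requiring more samples per split than the default
# (minCases = 2); larger values generally yield smaller trees
ctrl_mod <- C5.0(x = train_data[, vars],
                 y = train_data$Status,
                 control = C5.0Control(minCases = 75))
ctrl_mod
```

As with `trials` and `rules`, the `summary` and `predict` methods work the same way for a model fit with non-default control options.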