---
title: "C5.0 Classification Models"
vignette: >
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteIndexEntry{C5.0 Classification Models}
output:
  knitr:::html_vignette:
    toc: yes
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(echo = TRUE)
library(C50)
library(modeldata)
```

The `C50` package contains an interface to the C5.0 classification model. The two main modes for this model are:

* a basic tree-based model
* a rule-based model

Many of the details of this model can be found in [Quinlan (1993)](https://books.google.com/books?id=b3ujBQAAQBAJ&lpg=PP1&ots=sQ2nTTEpC1&dq=C4.5%3A%20Programs%20for%20Machine%20Learning&lr&pg=PR6#v=onepage&q=C4.5:%20Programs%20for%20Machine%20Learning&f=false), although the model has new features that are described in [Kuhn and Johnson (2013)](http://appliedpredictivemodeling.com/). The main public resource on this model is the [RuleQuest website](http://www.rulequest.com/see5-info.html).

To demonstrate a simple model, we'll use the credit data that can be accessed in the [`modeldata` package](https://github.com/tidymodels/modeldata):

```{r credit-data}
library(modeldata)
data(credit_data)
```

The outcome is in a column called `Status`; to demonstrate a simple model, the `Home` and `Seniority` predictors will be used.
```{r credit-vars}
vars <- c("Home", "Seniority")
str(credit_data[, c(vars, "Status")])

# a simple split
set.seed(2411)
in_train   <- sample(1:nrow(credit_data), size = 3000)
train_data <- credit_data[ in_train, ]
test_data  <- credit_data[-in_train, ]
```

## Classification Trees

To fit a simple classification tree model, we can start with the non-formula method:

```{r tree-mod}
library(C50)
tree_mod <- C5.0(x = train_data[, vars], y = train_data$Status)
tree_mod
```

To understand the model, the `summary` method can be used to get the default `C5.0` command-line output:

```{r tree-summ}
summary(tree_mod)
```

A graphical representation of the model can be generated by the `plot` method:

```{r tree-plot, fig.width = 10}
plot(tree_mod)
```

A variety of options are outlined in the documentation for the `C5.0Control` function. Another option is the `trials` argument, which enables a boosting procedure. This method is more similar to AdaBoost than to more statistical approaches such as stochastic gradient boosting. For example, using three iterations of boosting:

```{r tree-boost}
tree_boost <- C5.0(x = train_data[, vars], y = train_data$Status, trials = 3)
summary(tree_boost)
```

Note that the counting is zero-based. The `plot` method can also show a specific tree in the ensemble using the `trial` option.

## Rule-Based Models

C5.0 can create an initial tree model and then decompose the tree structure into a set of mutually exclusive rules. These rules can then be pruned and modified into a smaller set of _potentially_ overlapping rules. The rules can be created using the `rules` option:

```{r rule-mod}
rule_mod <- C5.0(x = train_data[, vars], y = train_data$Status, rules = TRUE)
rule_mod
summary(rule_mod)
```

Note that no pruning was warranted for this model. There is no `plot` method for rule-based models.

## Predictions

The `predict` method can be used to get hard class predictions or class probability estimates (also known as "confidence values" in the documentation).
```{r pred}
predict(rule_mod, newdata = test_data[1:3, vars])
predict(tree_boost, newdata = test_data[1:3, vars], type = "prob")
```

## Cost-Sensitive Models

A cost matrix can also be used to emphasize certain classes over others. For example, to get more of the "bad" samples correct:

```{r cost}
cost_mat <- matrix(c(0, 2, 1, 0), nrow = 2)
rownames(cost_mat) <- colnames(cost_mat) <- c("bad", "good")
cost_mat

cost_mod <- C5.0(x = train_data[, vars], y = train_data$Status,
                 costs = cost_mat)
summary(cost_mod)

# more samples predicted as "bad"
table(predict(cost_mod, test_data[, vars]))

# than previously
table(predict(tree_mod, test_data[, vars]))
```
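
# Other Options

The `C5.0Control` options mentioned earlier are passed through the `control` argument of `C5.0`. As one sketch, the `minCases` option (the minimum number of training samples required in at least two of the splits at each branch) can be raised to produce a smaller, simpler tree; the value of 75 used here is an arbitrary illustration, not a recommended setting:

```{r control}
library(C50)

# refit the tree, requiring more samples per split than the default
# (minCases = 2); larger values generally yield smaller trees
ctrl_mod <- C5.0(x = train_data[, vars],
                 y = train_data$Status,
                 control = C5.0Control(minCases = 75))
ctrl_mod
```

As with `trials` and `rules`, the `summary` and `predict` methods work the same way for a model fit with non-default control options.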