Iris Dataset

A simple, low-dimensional, and nearly linearly separable case to see how SVM classification works. In this multi-class example, the algorithm uses a one-against-one approach, in which k(k-1)/2 binary classifiers are trained, one for each pair of classes (with k = 3 species, that is 3 classifiers); the appropriate class is then chosen by a voting scheme.

##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa

Build SVM model with linear kernel
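The exact code is not echoed above; below is a minimal sketch of how such a model could be fit with e1071, assuming an 80/20 train/test split (the split actually used here is not shown).

# Load the e1071 SVM implementation
library(e1071)

# Assumed 80/20 split of the 150 Iris observations
set.seed(1)
idx   <- sample(nrow(iris), 0.8 * nrow(iris))
train <- iris[idx, ]
test  <- iris[-idx, ]

# C-classification SVM with a linear kernel (cost and gamma left at their defaults)
svm.lin <- svm(Species ~ ., data = train, kernel = "linear")

# Confusion table on the held-out test set
pred.test <- predict(svm.lin, newdata = test)
table(pred.test, y.tst = test$Species)

Changing kernel = "linear" to kernel = "polynomial" in the same call produces the polynomial model shown further below.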

## 
## Call:
## svm(formula = Species ~ ., data = train, kernel = "linear")
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  linear 
##        cost:  1 
##       gamma:  0.25 
## 
## Number of Support Vectors:  26
## [1] "Linear Output Table: "
##             y.tst
## pred.test    setosa versicolor virginica
##   setosa         10          0         0
##   versicolor      0          8         1
##   virginica       0          0        11

Build SVM model with polynomial kernel

## 
## Call:
## svm(formula = Species ~ ., data = train, kernel = "polynomial")
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  polynomial 
##        cost:  1 
##      degree:  3 
##       gamma:  0.25 
##      coef.0:  0 
## 
## Number of Support Vectors:  44
## [1] "Polynomial Output Table:"
##             y.tst
## pred.test    setosa versicolor virginica
##   setosa         10          0         0
##   versicolor      0          8         3
##   virginica       0          0         9

Tuning

Tuning parameters: kernel type, classification type, C (the cost/regularization term), and gamma.
Larger values of C penalize misclassified training points more heavily, which fits the training data more closely but risks overfitting; smaller values of C tolerate more training error in exchange for a wider margin and better generalization.
Larger values of gamma shrink each training point's radius of influence, so the margin is determined mainly by the points closest to the decision boundary; smaller values of gamma let points farther from the boundary influence it as well, producing a smoother fit.
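To make the effect of gamma concrete, here is a small sketch of the RBF (Gaussian) kernel used by the radial option, evaluated at two hypothetical points.

# RBF (Gaussian) kernel: K(x, z) = exp(-gamma * ||x - z||^2)
rbf <- function(x, z, gamma) exp(-gamma * sum((x - z)^2))

x <- c(0, 0); z <- c(1, 1)      # squared distance ||x - z||^2 = 2
rbf(x, z, gamma = 0.01)         # ~0.98: distant points still look very similar
rbf(x, z, gamma = 10)           # ~2e-9: similarity dies off almost immediately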

# Declare vectors for tuning
kernels <- c("linear", "polynomial", "radial", "sigmoid")
types <- c("C-classification", "nu-classification")
gam.vect <- c(0.01, 0.1, 0.25, 0.5, 1, 5, 10)
c.vect <- c(0.01, 0.1, 0.5, 1, 2, 5)

Grid-search to find optimal model
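The grid-search code itself is not shown; below is a sketch of its assumed structure, reusing the vectors declared above and measuring accuracy on the held-out test set (the actual selection criterion may differ).

# For each kernel/type pair, try every (gamma, cost) combination
results <- expand.grid(kernel = kernels, type = types,
                       gamma = gam.vect, cost = c.vect,
                       stringsAsFactors = FALSE)
results$test.acc <- NA_real_

for (i in seq_len(nrow(results))) {
  fit  <- svm(Species ~ ., data = train,
              kernel = results$kernel[i], type = results$type[i],
              gamma = results$gamma[i], cost = results$cost[i])
  pred <- predict(fit, newdata = test)
  results$test.acc[i] <- mean(pred == test$Species)
}

# Keep the best (gamma, cost) for each kernel/type combination
by.combo <- split(results, list(results$kernel, results$type))
best <- do.call(rbind, lapply(by.combo, function(d) d[which.max(d$test.acc), ]))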

Iris Grid-Search Results
kernel type best.gamma best.c training.acc test.acc num.sp.vects
linear C-classification 0.01 1 0.967 0.967 26
linear nu-classification 0.01 0.01 0.983 0.933 86
polynomial C-classification 1 0.1 0.967 0.933 28
polynomial nu-classification 0.1 0.01 0.9 0.833 81
radial C-classification 0.25 1 0.983 0.933 48
radial nu-classification 0.01 0.01 0.983 0.933 86
sigmoid C-classification 0.1 1 0.967 0.933 60
sigmoid nu-classification 0.01 0.01 0.983 0.933 86

In the case of the Iris dataset, the linear kernel with gamma = 0.01 and C = 1 performs the best of all the models: it is the most accurate and requires the fewest support vectors. Generally, when selecting among models of similar accuracy, we prefer the one with relatively few support vectors, since that indicates a more generalized decision boundary that is less likely to be overfit to the training data.


Wisconsin Breast Cancer Dataset

Compared to the Iris dataset, this dataset has only two classes, the binary setting that SVM classification is designed for. It was chosen to show that even in a high-dimensional setting with a moderate number of observations, an SVM classifier is still effective. This is because in a high-dimensional space the decision boundary (or separating hyperplane) has many more directions in which to find a separation between the two classes of observations.

##         id diagnosis radius_mean texture_mean perimeter_mean area_mean smoothness_mean
## 1   842302         M       17.99        10.38         122.80    1001.0         0.11840
## 2   842517         M       20.57        17.77         132.90    1326.0         0.08474
## 3 84300903         M       19.69        21.25         130.00    1203.0         0.10960
## 4 84348301         M       11.42        20.38          77.58     386.1         0.14250
## 5 84358402         M       20.29        14.34         135.10    1297.0         0.10030
## 6   843786         M       12.45        15.70          82.57     477.1         0.12780
##   compactness_mean concavity_mean concave.points_mean symmetry_mean fractal_dimension_mean
## 1          0.27760         0.3001             0.14710        0.2419                0.07871
## 2          0.07864         0.0869             0.07017        0.1812                0.05667
## 3          0.15990         0.1974             0.12790        0.2069                0.05999
## 4          0.28390         0.2414             0.10520        0.2597                0.09744
## 5          0.13280         0.1980             0.10430        0.1809                0.05883
## 6          0.17000         0.1578             0.08089        0.2087                0.07613
##   radius_se texture_se perimeter_se area_se smoothness_se compactness_se concavity_se
## 1    1.0950     0.9053        8.589  153.40      0.006399        0.04904      0.05373
## 2    0.5435     0.7339        3.398   74.08      0.005225        0.01308      0.01860
## 3    0.7456     0.7869        4.585   94.03      0.006150        0.04006      0.03832
## 4    0.4956     1.1560        3.445   27.23      0.009110        0.07458      0.05661
## 5    0.7572     0.7813        5.438   94.44      0.011490        0.02461      0.05688
## 6    0.3345     0.8902        2.217   27.19      0.007510        0.03345      0.03672
##   concave.points_se symmetry_se fractal_dimension_se radius_worst texture_worst perimeter_worst
## 1           0.01587     0.03003             0.006193        25.38         17.33          184.60
## 2           0.01340     0.01389             0.003532        24.99         23.41          158.80
## 3           0.02058     0.02250             0.004571        23.57         25.53          152.50
## 4           0.01867     0.05963             0.009208        14.91         26.50           98.87
## 5           0.01885     0.01756             0.005115        22.54         16.67          152.20
## 6           0.01137     0.02165             0.005082        15.47         23.75          103.40
##   area_worst smoothness_worst compactness_worst concavity_worst concave.points_worst symmetry_worst
## 1     2019.0           0.1622            0.6656          0.7119               0.2654         0.4601
## 2     1956.0           0.1238            0.1866          0.2416               0.1860         0.2750
## 3     1709.0           0.1444            0.4245          0.4504               0.2430         0.3613
## 4      567.7           0.2098            0.8663          0.6869               0.2575         0.6638
## 5     1575.0           0.1374            0.2050          0.4000               0.1625         0.2364
## 6      741.6           0.1791            0.5249          0.5355               0.1741         0.3985
##   fractal_dimension_worst
## 1                 0.11890
## 2                 0.08902
## 3                 0.08758
## 4                 0.17300
## 5                 0.07678
## 6                 0.12440

Below is a sample plot of two variables in the dataset. Many of the pairwise plots in this dataset are similar to this plot in the sense that the two classes are not linearly separable.

Perform grid-search to tune method

Breast Cancer Grid-Search Results
kernel type best.gamma best.c training.acc test.acc num.sp.vects
linear C-classification 0.01 0.1 0.985 0.965 55
linear nu-classification 0.01 0.01 0.934 0.93 230
polynomial C-classification 0.1 1 0.985 0.965 84
polynomial nu-classification 0.01 0.01 0.826 0.789 238
radial C-classification 0.01 2 0.985 0.982 78
radial nu-classification 0.01 0.01 0.945 0.921 234
sigmoid C-classification 0.01 1 0.967 0.965 101
sigmoid nu-classification 0.1 0.01 0.947 0.93 229

In this case, the SVM classifier with the radial basis (Gaussian) kernel performs better than the classifier with the linear kernel, but does so with a larger number of support vectors. Below, we compare the SVM classifier with the Naive Bayes and Random Forest classifiers.

The two plots above show how the classifier behaves on two randomly selected pairs of variables in the dataset. Because each plot is only a two-dimensional projection of the full decision boundary, the classifier may separate the classes well on other variable pairs even if it appears to struggle on these two.

Compare against Naive Bayes model
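A minimal sketch of the Naive Bayes comparison, assuming the same train/test split, that the non-predictive id column has been dropped, and that diagnosis is stored as a factor:

# Fit a Naive Bayes classifier (from e1071) on the same training data
nb.fit <- naiveBayes(diagnosis ~ ., data = train)

# Confusion table on the held-out test set
pred.test <- predict(nb.fit, newdata = test)
table(pred.test, y.tst = test$diagnosis)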

## [1] "Output Table: "
##          y.tst
## pred.test  B  M
##         B 60  5
##         M  6 43

Compare against Random Forest model
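A hedged sketch of how the Random Forest grid search over mtry and nodesize might look with the randomForest package, selecting the pair with the lowest out-of-bag (OOB) error (the actual tuning code is not shown):

library(randomForest)

# Assumed grid over mtry and minimum node size
grid <- expand.grid(mtry = c(5, 10, 30), nodesize = c(1, 5, 25, 100))
grid$oob <- NA_real_

for (i in seq_len(nrow(grid))) {
  rf <- randomForest(diagnosis ~ ., data = train,
                     mtry = grid$mtry[i], nodesize = grid$nodesize[i])
  grid$oob[i] <- rf$err.rate[rf$ntree, "OOB"]   # final OOB error estimate
}

# Refit the lowest-OOB configuration and evaluate on the test set
best <- grid[which.min(grid$oob), ]
rf.best   <- randomForest(diagnosis ~ ., data = train,
                          mtry = best$mtry, nodesize = best$nodesize)
pred.test <- predict(rf.best, newdata = test)
table(pred.test, y.tst = test$diagnosis)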

Cancer RF Grid Search Results
mtryUsed nodeSize OOB
5 1 0.0608897
5 5 0.0673867
5 25 0.0601345
5 100 0.0681473
30 1 0.0600910
30 5 0.0672174
30 25 0.0794593
30 100 0.0723158
10 1 0.0636826
10 5 0.0556701
10 25 0.0796607
10 100 0.0505888
## [1] "Random Forest Tuning Results:"
##   node.size mtry
## 1       100   10
## [1] "Output Table:"
##          y.tst
## pred.test  B  M
##         B 62  2
##         M  4 46
Comparison of Various Models
Model Train Acc Test Acc TPR TNR
Optimal SVM 0.958 0.921 0.985 0.833
Naive Bayes 0.947 0.904 0.909 0.896
Optimal Random Forest 0.952 0.947 0.939 0.958
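The TPR and TNR columns above are computed from each model's confusion table; the reported values are consistent with treating the benign class ("B") as the positive class. A small sketch of that calculation:

# Confusion table: rows are predictions, columns are true labels
cm <- table(pred.test, y.tst = test$diagnosis)

tpr <- cm["B", "B"] / sum(cm[, "B"])   # true positive rate (recall for class "B")
tnr <- cm["M", "M"] / sum(cm[, "M"])   # true negative rate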

The accuracies of the SVM classifier and the Random Forest classifier are relatively similar, and both outperform the Naive Bayes classifier. In such cases, choosing between the SVM model and the Random Forest model would come down to the nature and requirements of the problem being solved; for example, a need for high interpretability or high computational efficiency. The nature of the data, such as high dimensionality or a large number of observations, would also influence the final choice of model.


Glass Dataset

The Glass dataset was chosen to show how the SVM classifier performs on data with multiple classes, relatively low dimensionality, and without linear separability.

##        RI    Na   Mg   Al    Si    K   Ca Ba   Fe Type
## 1 1.52101 13.64 4.49 1.10 71.78 0.06 8.75  0 0.00    1
## 2 1.51761 13.89 3.60 1.36 72.73 0.48 7.83  0 0.00    1
## 3 1.51618 13.53 3.55 1.54 72.99 0.39 7.78  0 0.00    1
## 4 1.51766 13.21 3.69 1.29 72.61 0.57 8.22  0 0.00    1
## 5 1.51742 13.27 3.62 1.24 73.08 0.55 8.07  0 0.00    1
## 6 1.51596 12.79 3.61 1.62 72.97 0.64 8.07  0 0.26    1

Glass Grid-Search Results
kernel type best.gamma best.c training.acc test.acc num.sp.vects
linear C-classification 0.01 2 0.678 0.628 130
polynomial C-classification 1 2 0.959 0.767 107
radial C-classification 0.25 1 0.778 0.791 147
sigmoid C-classification 0.25 0.5 0.561 0.535 154

Compare against Naive Bayes model

## [1] "Output Table:"
##          y.tst
## pred.test  1  2  3  5  6  7
##         1 14  5  2  0  0  0
##         2  2  2  0  1  0  0
##         3  4  5  0  0  0  0
##         5  0  1  0  0  0  0
##         6  0  1  0  2  2  1
##         7  0  0  0  0  0  1

Compare against Random Forest model

Glass RF Grid Search Results
mtryUsed nodeSize OOB
3 1 0.3420645
3 5 0.3622585
3 25 0.3208567
3 100 0.3536544
9 1 0.4050419
9 5 0.3528957
9 25 0.3684672
9 100 0.3251291
3 1 0.3229690
3 5 0.3946096
3 25 0.3608916
3 100 0.3349995
## [1] "Random Forest Tuning Results:"
##   node.size mtry
## 1        25    3
## [1] "Output Table: "
##          y.tst
## pred.test  1  2  3  5  6  7
##         1 18  5  1  0  1  0
##         2  2  8  0  3  1  1
##         3  0  0  1  0  0  0
##         5  0  1  0  0  0  1
##         6  0  0  0  0  0  0
##         7  0  0  0  0  0  0
Comparison of Various Models
Model Train Acc Test Acc
Optimal SVM 0.959 0.767
Naive Bayes 0.497 0.442
Optimal Random Forest 0.655 0.628

Our SVM model classifies the glass types much better than random guessing. Although its test error rate is upwards of 20%, it still performs better than the Random Forest model, and significantly better than the Naive Bayes model.


Binary-Class Concentric Circle Problem

The example below illustrates the difference in effectiveness among the different kernels on data that is not linearly separable.
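The underlying data is not shown; below is a sketch of how two-class concentric-circle data of this kind could be simulated in R (the radii, noise level, and sample sizes are assumptions).

set.seed(1)

# One noisy ring of points centered at the origin with a given radius
make.circle <- function(n, radius, noise = 0.1) {
  theta <- runif(n, 0, 2 * pi)
  r     <- radius + rnorm(n, sd = noise)
  data.frame(x1 = r * cos(theta), x2 = r * sin(theta))
}

# Inner ring = class 1, outer ring = class 2; no straight line can separate them
circles <- rbind(cbind(make.circle(200, radius = 1), class = 1),
                 cbind(make.circle(200, radius = 3), class = 2))
circles$class <- factor(circles$class)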

Linear Kernel

## [1] "Output Table:"
##          y.tst
## pred.test  1  2
##         1 21 21
##         2 19 19
## 
## Call:
## svm(formula = class ~ ., data = train, kernel = "linear", type = "C-classification")
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  linear 
##        cost:  1 
##       gamma:  0.5 
## 
## Number of Support Vectors:  299
## 
##  ( 150 149 )
## 
## 
## Number of Classes:  2 
## 
## Levels: 
##  1 2

Polynomial Kernel

## [1] "Output Table:"
##          y.tst
## pred.test  1  2
##         1 23 20
##         2 17 20
## 
## Call:
## svm(formula = class ~ ., data = train, kernel = "polynomial", type = "C-classification")
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  polynomial 
##        cost:  1 
##      degree:  3 
##       gamma:  0.5 
##      coef.0:  0 
## 
## Number of Support Vectors:  304
## 
##  ( 152 152 )
## 
## 
## Number of Classes:  2 
## 
## Levels: 
##  1 2

Radial Kernel

## [1] "Output Table:"
##          y.tst
## pred.test  1  2
##         1 40  0
##         2  0 40
## 
## Call:
## svm(formula = class ~ ., data = train, kernel = "radial", type = "C-classification")
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  radial 
##        cost:  1 
##       gamma:  0.5 
## 
## Number of Support Vectors:  151
## 
##  ( 76 75 )
## 
## 
## Number of Classes:  2 
## 
## Levels: 
##  1 2

Comparison of Various Kernels
Kernel Num Sp Vects TPR TNR
linear 299 0.525 0.475
polynomial 304 0.575 0.5
radial 151 1 1

Though this case is purely illustrative and unlikely to appear in real data, the radial kernel is clearly more effective at classifying the observations than the linear and polynomial kernels: it achieves 100% accuracy and requires significantly fewer support vectors to do so.