A simple, low-dimensional, linearly separable case to see how SVM classification works. In this multi-class example, the algorithm uses a one-against-one approach, in which k(k-1)/2 binary classifiers are trained; the predicted class is then chosen by a voting scheme.
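A minimal sketch of how this might be done with the e1071 package is shown below; the 80/20 split, the seed, and the object names train and tst are assumptions, not necessarily the exact code that produced the output that follows.

library(e1071)

# Assumed 80/20 split of the iris data into training and test sets
set.seed(1)
idx <- sample(nrow(iris), floor(0.8 * nrow(iris)))
train <- iris[idx, ]
tst <- iris[-idx, ]

# Fit a linear-kernel SVM; with k = 3 species, k(k-1)/2 = 3 binary classifiers
# are trained internally and combined by voting. Then inspect the data,
# the fitted model, and the confusion table on the test set.
fit.lin <- svm(Species ~ ., data = train, kernel = "linear")
pred.test <- predict(fit.lin, tst)
head(iris)
summary(fit.lin)
table(pred.test, y.tst = tst$Species)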
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
##
## Call:
## svm(formula = Species ~ ., data = train, kernel = "linear")
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: linear
## cost: 1
## gamma: 0.25
##
## Number of Support Vectors: 26
## [1] "Linear Output Table: "
## y.tst
## pred.test setosa versicolor virginica
## setosa 10 0 0
## versicolor 0 8 1
## virginica 0 0 11
##
## Call:
## svm(formula = Species ~ ., data = train, kernel = "polynomial")
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: polynomial
## cost: 1
## degree: 3
## gamma: 0.25
## coef.0: 0
##
## Number of Support Vectors: 44
## [1] "Polynomial Output Table:"
## y.tst
## pred.test setosa versicolor virginica
## setosa 10 0 0
## versicolor 0 8 3
## virginica 0 0 9
Tuning parameters: kernel type, classification type, C (the regularization/cost term), and gamma.
Larger values of C penalize margin violations more heavily, so the model fits the training data more tightly but may overfit; smaller values of C tolerate more violations, giving a wider margin and a higher training error rate but often better generalization.
For the radial kernel, K(x, x') = exp(-gamma * ||x - x'||^2), so a larger gamma makes each training point's influence more local and the margin is determined mostly by the closest points, while a smaller gamma lets points farther from the boundary influence it, producing a smoother decision surface.
# Declare vectors for tuning
kernels <- c("linear", "polynomial", "radial", "sigmoid")
types <- c("C-classification", "nu-classification")
gam.vect <- c(0.01, 0.1, 0.25, 0.5, 1, 5, 10)
c.vect <- c(0.01, 0.1, 0.5, 1, 2, 5)
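One way the grid search over these vectors could be carried out is with e1071's tune.svm(), which cross-validates over the supplied gamma and cost values. The sketch below covers a single kernel/type combination and is an assumption about the tuning loop, not necessarily the exact code used to produce the table that follows.

# Assumed tuning procedure for one kernel/type combination;
# tune.svm() performs 10-fold cross-validation on the training set by default
tuned <- tune.svm(Species ~ ., data = train,
                  gamma = gam.vect, cost = c.vect,
                  kernel = kernels[1], type = types[1])
best <- tuned$best.model
train.acc <- mean(predict(best, train) == train$Species)
test.acc <- mean(predict(best, tst) == tst$Species)
c(best$gamma, best$cost, train.acc, test.acc, best$tot.nSV)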
kernel | type | best.gamma | best.c | training.acc | test.acc | num.sp.vects |
---|---|---|---|---|---|---|
linear | C-classification | 0.01 | 1 | 0.967 | 0.967 | 26 |
linear | nu-classification | 0.01 | 0.01 | 0.983 | 0.933 | 86 |
polynomial | C-classification | 1 | 0.1 | 0.967 | 0.933 | 28 |
polynomial | nu-classification | 0.1 | 0.01 | 0.9 | 0.833 | 81 |
radial | C-classification | 0.25 | 1 | 0.983 | 0.933 | 48 |
radial | nu-classification | 0.01 | 0.01 | 0.983 | 0.933 | 86 |
sigmoid | C-classification | 0.1 | 1 | 0.967 | 0.933 | 60 |
sigmoid | nu-classification | 0.01 | 0.01 | 0.983 | 0.933 | 86 |
In the case of the Iris dataset, the linear kernel with gamma = 0.01 and cost = 1 performs the best of all the models: it is the most accurate and requires the fewest support vectors. Generally, when selecting a model, we prefer one whose number of support vectors is relatively low, since this indicates a more general decision boundary that is not overfitted to the training data.
Compared to the Iris dataset, this dataset has only two classes, which is ideal for SVM classification. It was chosen to show that even in a high-dimensional setting with a moderate number of observations, an SVM classifier is still effective, because in high dimensions the separating hyperplane has many more opportunities to find a boundary between the two classes of observations.
## id diagnosis radius_mean texture_mean perimeter_mean area_mean smoothness_mean
## 1 842302 M 17.99 10.38 122.80 1001.0 0.11840
## 2 842517 M 20.57 17.77 132.90 1326.0 0.08474
## 3 84300903 M 19.69 21.25 130.00 1203.0 0.10960
## 4 84348301 M 11.42 20.38 77.58 386.1 0.14250
## 5 84358402 M 20.29 14.34 135.10 1297.0 0.10030
## 6 843786 M 12.45 15.70 82.57 477.1 0.12780
## compactness_mean concavity_mean concave.points_mean symmetry_mean fractal_dimension_mean
## 1 0.27760 0.3001 0.14710 0.2419 0.07871
## 2 0.07864 0.0869 0.07017 0.1812 0.05667
## 3 0.15990 0.1974 0.12790 0.2069 0.05999
## 4 0.28390 0.2414 0.10520 0.2597 0.09744
## 5 0.13280 0.1980 0.10430 0.1809 0.05883
## 6 0.17000 0.1578 0.08089 0.2087 0.07613
## radius_se texture_se perimeter_se area_se smoothness_se compactness_se concavity_se
## 1 1.0950 0.9053 8.589 153.40 0.006399 0.04904 0.05373
## 2 0.5435 0.7339 3.398 74.08 0.005225 0.01308 0.01860
## 3 0.7456 0.7869 4.585 94.03 0.006150 0.04006 0.03832
## 4 0.4956 1.1560 3.445 27.23 0.009110 0.07458 0.05661
## 5 0.7572 0.7813 5.438 94.44 0.011490 0.02461 0.05688
## 6 0.3345 0.8902 2.217 27.19 0.007510 0.03345 0.03672
## concave.points_se symmetry_se fractal_dimension_se radius_worst texture_worst perimeter_worst
## 1 0.01587 0.03003 0.006193 25.38 17.33 184.60
## 2 0.01340 0.01389 0.003532 24.99 23.41 158.80
## 3 0.02058 0.02250 0.004571 23.57 25.53 152.50
## 4 0.01867 0.05963 0.009208 14.91 26.50 98.87
## 5 0.01885 0.01756 0.005115 22.54 16.67 152.20
## 6 0.01137 0.02165 0.005082 15.47 23.75 103.40
## area_worst smoothness_worst compactness_worst concavity_worst concave.points_worst symmetry_worst
## 1 2019.0 0.1622 0.6656 0.7119 0.2654 0.4601
## 2 1956.0 0.1238 0.1866 0.2416 0.1860 0.2750
## 3 1709.0 0.1444 0.4245 0.4504 0.2430 0.3613
## 4 567.7 0.2098 0.8663 0.6869 0.2575 0.6638
## 5 1575.0 0.1374 0.2050 0.4000 0.1625 0.2364
## 6 741.6 0.1791 0.5249 0.5355 0.1741 0.3985
## fractal_dimension_worst
## 1 0.11890
## 2 0.08902
## 3 0.08758
## 4 0.17300
## 5 0.07678
## 6 0.12440
Below is a sample plot of two variables in the dataset. Many of the pairwise plots in this dataset are similar to this plot in the sense that the two classes are not linearly separable.
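The kind of plot described above can be reproduced with base graphics; the data frame name bc and the chosen pair of variables (radius_mean and texture_mean) are illustrative assumptions.

# Example pairwise plot, colouring points by diagnosis (M = malignant, B = benign)
plot(bc$radius_mean, bc$texture_mean,
     col = ifelse(bc$diagnosis == "M", "red", "blue"),
     pch = 19, xlab = "radius_mean", ylab = "texture_mean")
legend("topright", legend = c("M", "B"), col = c("red", "blue"), pch = 19)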
kernel | type | best.gamma | best.c | training.acc | test.acc | num.sp.vects |
---|---|---|---|---|---|---|
linear | C-classification | 0.01 | 0.1 | 0.985 | 0.965 | 55 |
linear | nu-classification | 0.01 | 0.01 | 0.934 | 0.93 | 230 |
polynomial | C-classification | 0.1 | 1 | 0.985 | 0.965 | 84 |
polynomial | nu-classification | 0.01 | 0.01 | 0.826 | 0.789 | 238 |
radial | C-classification | 0.01 | 2 | 0.985 | 0.982 | 78 |
radial | nu-classification | 0.01 | 0.01 | 0.945 | 0.921 | 234 |
sigmoid | C-classification | 0.01 | 1 | 0.967 | 0.965 | 101 |
sigmoid | nu-classification | 0.1 | 0.01 | 0.947 | 0.93 | 229 |
In this case, the SVM classifier with the radial basis (Gaussian) kernel performs better than the classifier with the linear kernel, but does so with a larger number of support vectors. Below, we compare the SVM classifier with the Naive Bayes and Random Forest classifiers.
The above two plots are examples of how the classifier performs on two randomly selected variable pairs in the dataset. Because the decision boundary lives in the full feature space, the classifier may perform well on other variable pairs even if it does not appear to separate these two cleanly.
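The comparison models can be fit with naiveBayes() from e1071 and with the randomForest package; this is a hedged sketch in which the train/tst split and the diagnosis response column are assumptions, and the Random Forest hyperparameters are taken from the tuning result reported below (node.size = 100, mtry = 10).

library(randomForest)

# Naive Bayes comparison model (train/tst split assumed as before)
nb.fit <- naiveBayes(diagnosis ~ ., data = train)
nb.pred <- predict(nb.fit, tst)
table(pred.test = nb.pred, y.tst = tst$diagnosis)

# Random Forest comparison model, using the tuned values reported below
rf.fit <- randomForest(diagnosis ~ ., data = train, nodesize = 100, mtry = 10)
rf.pred <- predict(rf.fit, tst)
table(pred.test = rf.pred, y.tst = tst$diagnosis)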
## [1] "Output Table: "
## y.tst
## pred.test B M
## B 60 5
## M 6 43
## [1] "Random Forest Tuning Results:"
## node.size mtry
## 1 100 10
## [1] "Output Table:"
## y.tst
## pred.test B M
## B 62 2
## M 4 46
Model | Train Acc | Test Acc | TPR | TNR |
---|---|---|---|---|
Optimal SVM | 0.958 | 0.921 | 0.985 | 0.833 |
Naive Bayes | 0.947 | 0.904 | 0.909 | 0.896 |
Optimal Random Forest | 0.952 | 0.947 | 0.939 | 0.958 |
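The TPR and TNR columns treat the benign class B as the positive class (for the Random Forest table above, 62/66 = 0.939 and 46/48 = 0.958). A short sketch of the computation, reusing the rf.pred predictions from the assumed comparison code:

# Sensitivity and specificity from a confusion table, with "B" as the positive class
cm <- table(pred.test = rf.pred, y.tst = tst$diagnosis)
TPR <- cm["B", "B"] / sum(cm[, "B"])   # correctly predicted B / all actual B
TNR <- cm["M", "M"] / sum(cm[, "M"])   # correctly predicted M / all actual M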
The accuracies of the SVM classifier and the Random Forest classifier are relatively similar, and both outperform the Naive Bayes classifier. In such cases, choosing between the SVM model and the Random Forest model comes down to the nature and requirements of the problem being solved; for example, a need for high interpretability or high computational efficiency. Properties of the data, such as high dimensionality or a large number of observations, would also influence which model is ultimately chosen.
The Glass dataset was chosen to show how the SVM classifier performs on data with multiple classes, relatively low dimensionality, and without linear separability.
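The Glass data can be obtained, for example, from the mlbench package; the source and the 80/20 split below are assumptions, though a 20% test set does match the 43 test observations in the confusion tables that follow.

# Assumed loading and splitting of the Glass data
library(mlbench)
data(Glass)
set.seed(1)
idx <- sample(nrow(Glass), floor(0.8 * nrow(Glass)))
train <- Glass[idx, ]
tst <- Glass[-idx, ]
head(Glass)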
## RI Na Mg Al Si K Ca Ba Fe Type
## 1 1.52101 13.64 4.49 1.10 71.78 0.06 8.75 0 0.00 1
## 2 1.51761 13.89 3.60 1.36 72.73 0.48 7.83 0 0.00 1
## 3 1.51618 13.53 3.55 1.54 72.99 0.39 7.78 0 0.00 1
## 4 1.51766 13.21 3.69 1.29 72.61 0.57 8.22 0 0.00 1
## 5 1.51742 13.27 3.62 1.24 73.08 0.55 8.07 0 0.00 1
## 6 1.51596 12.79 3.61 1.62 72.97 0.64 8.07 0 0.26 1
kernel | type | best.gamma | best.c | training.acc | test.acc | num.sp.vects |
---|---|---|---|---|---|---|
linear | C-classification | 0.01 | 2 | 0.678 | 0.628 | 130 |
polynomial | C-classification | 1 | 2 | 0.959 | 0.767 | 107 |
radial | C-classification | 0.25 | 1 | 0.778 | 0.791 | 147 |
sigmoid | C-classification | 0.25 | 0.5 | 0.561 | 0.535 | 154 |
## [1] "Output Table:"
## y.tst
## pred.test 1 2 3 5 6 7
## 1 14 5 2 0 0 0
## 2 2 2 0 1 0 0
## 3 4 5 0 0 0 0
## 5 0 1 0 0 0 0
## 6 0 1 0 2 2 1
## 7 0 0 0 0 0 1
## [1] "Random Forest Tuning Results:"
## node.size mtry
## 1 25 3
## [1] "Output Table: "
## y.tst
## pred.test 1 2 3 5 6 7
## 1 18 5 1 0 1 0
## 2 2 8 0 3 1 1
## 3 0 0 1 0 0 0
## 5 0 1 0 0 0 1
## 6 0 0 0 0 0 0
## 7 0 0 0 0 0 0
Model | Train Acc | Test Acc |
---|---|---|
Optimal SVM | 0.959 | 0.767 |
Naive Bayes | 0.497 | 0.442 |
Optimal Random Forest | 0.655 | 0.628 |
Our SVM model performs well at classifying the glass types, much better than random guessing. Although its test error rate is upwards of 20%, it still outperforms the Random Forest model and performs significantly better than the Naive Bayes model.
The example below illustrates the difference in effectiveness between the different kernels.
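One way data of this kind might be generated is as two concentric rings in two dimensions; this is an assumption about the simulation rather than the actual generating process, but it produces the same qualitative behaviour seen below: a radial boundary can enclose the inner ring, while no hyperplane (linear kernel) separates the two classes.

# Assumed synthetic data: two concentric rings, radially but not linearly separable
set.seed(1)
n <- 200                               # points per class (assumed)
r <- c(runif(n, 0, 1), runif(n, 2, 3)) # inner-ring and outer-ring radii
a <- runif(2 * n, 0, 2 * pi)           # random angles
dat <- data.frame(x1 = r * cos(a), x2 = r * sin(a),
                  class = factor(rep(c(1, 2), each = n)))
plot(dat$x1, dat$x2, col = as.integer(dat$class), pch = 19,
     xlab = "x1", ylab = "x2")         # no straight line separates the rings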
## [1] "Output Table:"
## y.tst
## pred.test 1 2
## 1 21 21
## 2 19 19
##
## Call:
## svm(formula = class ~ ., data = train, kernel = "linear", type = "C-classification")
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: linear
## cost: 1
## gamma: 0.5
##
## Number of Support Vectors: 299
##
## ( 150 149 )
##
##
## Number of Classes: 2
##
## Levels:
## 1 2
## [1] "Output Table:"
## y.tst
## pred.test 1 2
## 1 23 20
## 2 17 20
##
## Call:
## svm(formula = class ~ ., data = train, kernel = "polynomial", type = "C-classification")
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: polynomial
## cost: 1
## degree: 3
## gamma: 0.5
## coef.0: 0
##
## Number of Support Vectors: 304
##
## ( 152 152 )
##
##
## Number of Classes: 2
##
## Levels:
## 1 2
## [1] "Output Table:"
## y.tst
## pred.test 1 2
## 1 40 0
## 2 0 40
##
## Call:
## svm(formula = class ~ ., data = train, kernel = "radial", type = "C-classification")
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: radial
## cost: 1
## gamma: 0.5
##
## Number of Support Vectors: 151
##
## ( 76 75 )
##
##
## Number of Classes: 2
##
## Levels:
## 1 2
Kernel | Num Sp Vects | TPR | TNR |
---|---|---|---|
linear | 299 | 0.525 | 0.475 |
polynomial | 304 | 0.575 | 0.5 |
radial | 151 | 1 | 1 |
Though this case is constructed for illustrative purposes and is unlikely to arise in practice, it shows that the radial kernel is far more effective at classifying these observations than the linear and polynomial kernels: it achieves 100% accuracy and requires significantly fewer support vectors to do so.