[中文提示] 我最近从优化角度推导了SVM的Dual和Primal问题，在一个医学数据集上进行了测试，比较了两者差异。这篇文章记录了问题和我的推导过程。

[English Tips] In a lesson about SVM of optimization course, I take two methods to solve it. And there are my records.

## Background

Suppose that you have a set of m data points representing objects that you have classified into two distinct categories. It can be measurements of radius, area, smoothness etc. for tumor cells that have been classified as either malignant or benign. If n properties are measured for each cell, plotting the data points in R^n would form two clusters representing either benign or malignant cells. Pattern classification deals with the problem of setting up a rule by which you can judge to which class you should attribute a new data point so that information from a medical examination can be used to judge whether a patient has the disease or not. In its simplest form, a linear hyper-plane is used to separate one class from the other. The coefficients of the hyper-plane constitute the rule from which automatic classification is done; Which side of the plane a new data point corresponds to determines its classification. The range of applications is enormous: in-line quality control in manufacturing processes, face recognition and combustion engine knock detection are only a few scattered examples.

## The Support Vector Machine (SVM)

A two dimensional illustration of the clustering problem is presented in Fig. 1. The black and the white circles represent known data points classified as either having the property (y_i = -1) or ( y_i = +1 ), respectively. These points are called the training set. Suppose that it is possible to find a line1 (a hyper-plane in the general case) w^{T}x+b = 0 separating the two clusters, like the one indicated in the right figure. A new data point would then be classified as either "black" or "white" depending on the sign of ( w^{T}x+b ). The problem is to find good values of w and b. Ideally, we would like to have a total separation between the points, so that for all points in the training set.

w^Tx_i + b \geq 1 \ for \ y_i = 1\tag{1}

w^Tx_i + b \leq 1 \ for \ y_i = -1\tag{2}

Then every point on or north-east of the line w^Tx + b = +1 belongs to the class of white circles, and all points on or south-west of w^Tx + b = -1 belong to the class of black points. The coefficients of the right hand side are not special in any way. They merely represent a scaling of the problem, but are conventionally chosen to the values \pm1.

A reasonable requirement on w and b is that the line w^Tx + b = 0 they define should separate the two classes as distinctly as possible. This occurs when the distance between the lines w^Tx + b = 1 and w^Tx + b = -1 , the separation margin, is large as possible. From basic linear algebra it can be shown that this distance is \frac{2}{\left| w \right|}. Our requirements can now be formulated as the optimization problem.

minimise \ f(w,b)= \frac{1}{2}w^Tw \tag{3}

subject \ to \ y_i(w^Tx_i + b) \geq 1, \ i=1...m \tag{4}

Some of the points will lie on the dashed lines/hyper-planes. These are the so-called support vectors, and the support vector machine is nothing but the mathematical rule by which new data points can be classified.

The presented approach works as long as the data clusters are separable by a hyper-plane. In reality, this is not always the case. Some measurement points will be "outliers", there can be misclassification etc. bringing about a situation where there is no clear demarcation line between the classes. In order to handle also this situation, we need to allow for some of the points to "trespass" the separating plane and violate Eq. (1)-(2), but of course we would like to find the separating plane that requires as little violation as possible. One way of doing this is to relax the requirement of Eq. (1)-(2) by changing them to

w^Tx_i + b \geq 1-\xi_i \ for \ y_i = 1\tag{5}

w^Tx_i + b \leq -1+\xi_i \ for \ y_i = -1\tag{6}

where \xi_i \geq 0, which effectively moves the two dashed lines against and past each other. In addition, we add a term in the objective function of Eq. (3) that penalizes the violation, so that we get the new optimization problem

minimise \ f(w,b)= \frac{1}{2}w^Tw +C\sum_{i=1}^m\xi_i

subject \ to \ y_i(w^Tx_i + b) \geq 1-\xi_i, \ i=1...m

\xi_i \geq 0

The value of the constant C determines how much emphasis we should put on violation of separation and has to be chosen empirically.

## Dual problem

Most solutions in the internet are about the dual problem. They solve SVM classification by solving the dual problem. Let me show you this work again from optimization.

### Formulation

Firstly, let me assume that we have n data point and each one have m feature. I am going to classify them.

Secondly, define variables.

Denote by x the n×m matrix whose columns are the vector (x_i).

\chi=\left[\begin{array}{ccc}{\chi_{11}}&{\cdots}&{\chi_{m 1}}\\{\vdots}&{\vdots}&{\vdots}\\{x_{1 n}}&{\cdots}&{x_{m n}}\end{array}\right]

Y is the m×1 vector

Y=\left[\begin{array}{c}{y_{1}}\\{y_{2}}\\{\vdots}\\{y_{m-1}}\\{y_{m}}\end{array}\right]

Y_d is the m×m matrix whose diagonal is y_1 \ y_2 \ … \ y_m

\left[\begin{array}{ccccc}{y_{1}}&{}&{\cdots}&{}&{0}\\{}&{y_{2}}&{}&{}&{}\\{\vdots}&{}&{\cdots}&{}&{}&{\vdots}\\{}&{}&{}&{y_{m-1}}&{}\\{0}&{}&{}&{\cdots}&{}&{y_{m}}\end{array}\right]

b is a 1×1 vector

b=[b]

e is a m×1 vector

e=\left[\begin{array}{c}{1}\\{1}\\{\vdots}\\{1}\\{1}\end{array}\right]

(\epsilon) is a m×1 vector

\varepsilon=\left[\begin{array}{c}{\varepsilon_{1}}\\{\varepsilon_{2}}\\{\vdots}\\{\varepsilon_{m-1}}\\{\varepsilon_{m}}\end{array}\right]

W is the n×1 vector

\omega=\left[\begin{array}{c}{\omega_{1}}\\{\omega_{2}}\\{\vdots}\\{\omega_{n-1}}\\{\omega_{n}}\end{array}\right]

so, we can obtain the matrix form of problem

minimise \ f(W, B)=\frac{1}{2} W^{T} W+C e^{T} \varepsilon

subject \ to \ \begin{array}{l}{Y_{d} X^{T} W+b Y \geq e-\varepsilon}\\{\varepsilon \geq 0}\end{array}

Let α and η be the m-dimensional vectors of Lagrange multipliers for the constraints,

\alpha=\left[\begin{array}{c}{\alpha_{1}}\\{\alpha_{2}}\\{\vdots}\\{\alpha_{m-1}}\\{\alpha_{m}}\end{array}\right]

\eta=\left[\begin{array}{c}{\eta_{1}}\\{\eta_{2}}\\{\vdots}\\{\eta_{m-1}}\\{\eta_{m}}\end{array}\right]

Then we obtain the Lagrangian function,

\mathcal{L}(W, \varepsilon, b)=\frac{1}{2} w^{T} w+C e^{T} \varepsilon-\alpha\left(Y_{d} X^{T} W+Y b-e+\varepsilon\right)-\eta \varepsilon

It is easy to get the gradient of the Lagrangian function

\nabla \mathcal{L}(W, \varepsilon, b)=\left[\begin{array}{c}{W^{T}-X Y_{d} \alpha}\\{-\alpha Y_{d}}\\{C e^{T}-\alpha \varepsilon-\eta \varepsilon}\end{array}\right]

Let \nabla \mathcal{L}(W, \varepsilon, b)=0

Hence, we have

W^{T}=X Y_{d} \alpha=\sum_{i}^{m} \alpha_{i} y_{i} x_{i}

Here is its dual problem

maximize \ e^{T} \alpha-\frac{1}{2} \alpha^{T}\left(Y_{d} X^{T} X Y_{d}\right) \alpha

which equals to

minimize \ \frac{1}{2} \alpha^{T}\left(Y_{d} X^{T} X Y_{d}\right) \alpha-e^{T} \alpha

s.t \ \begin{array}{l}{\sum_{i}^{m} y_{i} \alpha_{i}=0}\\{0 \leq \alpha_{i} \leq C}\end{array}

Let G=YX^T XY, which is a symmetric matrix because

G_{i j}=y_{i} y_{\mathrm{j}} X_{j}^{T} X_{i}, \text { where } i, j=1 \ldots m

The dual is a quadratic problem, and we can use **quadprog, **a buit-in function of matlab**,** to solve this problem.

But how to get **w **?

W can be calculated according to

W^{T}=X Y_{d} \alpha=\sum_{i}^{m} \alpha_{i} y_{i} x_{i} \quad \text { where } x_{i} \in \text { support vectors }

For any support vectors,

b=W^{T} x_{i}-y_{i}

We calculate it in this way,

b=\frac{\sum_{i=1}^{S} W^{T} x_{i}-y_{i}}{S}, \text { where } s \text { is the number of support vectors }

## Experiments

Data set is split into train set and test set(500:59).

Here, we set C= 2000 initially.

This is the result in matlab

Iter Fval Primal Infeas Dual Infeas Complementarity

0 4.155697e+15 8.880184e+03 4.527066e+11 1.000000e+01

1 7.151771e+14 3.688063e+03 1.878027e+11 2.569397e+01

2 2.633964e+14 2.247702e+03 1.139724e+11 3.705138e+01

3 1.711919e+13 5.560420e+02 2.905605e+10 3.385880e+01

4 2.704926e+12 2.160981e+02 1.154975e+10 1.839445e+01

5 3.234615e+11 7.472820e+01 3.993983e+09 1.144330e+01

6 3.591546e+06 2.499224e-01 1.335755e+07 6.066012e+00

7 6.009199e+05 1.043848e-01 5.579033e+06 4.344830e+00

8 1.175806e+05 5.093911e-02 2.722532e+06 3.220132e+00

9 -1.256615e+04 1.950497e-02 1.042478e+06 2.383303e+00

10 -3.652240e+04 2.120723e-04 1.133459e+04 7.872474e-01

11 -3.675338e+04 2.178049e-05 1.164097e+03 3.774315e-01

12 -3.678213e+04 1.089029e-08 5.820716e-01 1.889869e-03

13 -3.678232e+04 4.547474e-12 2.812992e-05 1.041803e-09

### Result

I also test this SVM in test data set when \xi_{SV}=1e-15 and get accuracy 95.65%.

## Primal problem

TODO

## 文章评论