September 4, 2023

Support Vector Machines

Support Vector Machines (SVMs) are a type of machine learning model. They are simpler than neural networks, as on their own they can only separate data linearly.

The animation above shows the training process for an SVM learning to separate the blue points from the orange. You can move each point around to get a feel for how the model reacts.

The faint gray lines on either side of the dividing line represent the margin. This is how the SVM decides which line best separates the data: the best line is the one that allows the largest margin without touching a data point.

Margin Hardness

The hardness of the margin refers to how lenient the model is towards outliers, in effect specifying how closely the line should fit the data. Tuning this can lead to better performance, as it gives control over whether the model overfits or underfits, and by how much.

The margin hardness can be controlled in the first demo using the slider below it.
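For a rough feel of the same knob in code, here is a minimal sketch using scikit-learn (not the library behind the demo above), where the regularisation parameter C plays the role of margin hardness: a small C gives a soft, forgiving margin, while a large C forces a hard fit. The data and the values of C are illustrative choices.

```python
# Sketch: margin "hardness" via scikit-learn's C parameter (illustrative only).
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Two slightly overlapping 2D blobs, labelled -1 and +1.
X = np.vstack([rng.normal(-1, 1.0, (50, 2)), rng.normal(1, 1.0, (50, 2))])
y = np.array([-1] * 50 + [1] * 50)

for C in (0.01, 100.0):              # small C = soft margin, large C = hard margin
    model = SVC(kernel="linear", C=C).fit(X, y)
    w = model.coef_[0]
    margin_width = 2 / np.linalg.norm(w)   # distance between the two faint margin lines
    print(f"C={C:>6}: margin width = {margin_width:.2f}, "
          f"support vectors = {len(model.support_)}")
```

With the softer margin the model tolerates points inside (or on the wrong side of) the margin in exchange for a wider one; hardening it shrinks the margin to hug the data more tightly.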

The Kernel Trick

Since linearly separable data is scarce in practice, SVMs are often used with a kernel function applied to the data before it is processed. This function adds an extra dimension to the data, which can make it linearly separable again.

Here the data is initially one-dimensional, and the second dimension is added by the kernel function, allowing the model to find a line separating the two classes where this wasn't possible before.
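A minimal sketch of the same idea, assuming the simple feature map x → (x, x²) (an illustrative choice, not necessarily the one used in the demo): 1D data that no single threshold can split becomes separable once the derived dimension is added.

```python
# Sketch: 1D data made linearly separable by adding a derived dimension x^2.
import numpy as np

x = np.array([-3.0, -2.5, -0.5, 0.0, 0.5, 2.5, 3.0])
y = np.array([1, 1, -1, -1, -1, 1, 1])    # outer points vs inner points

# No single threshold on x separates the classes...
phi = np.column_stack([x, x ** 2])        # ...but in (x, x^2) space the horizontal
                                          # line x^2 = 2 separates them cleanly.
separable = np.all(phi[y == 1, 1] > 2) and np.all(phi[y == -1, 1] < 2)
print(separable)   # True
```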

The Maths

SVMs generally use gradient descent to find the separating line. To define a general separating line in many dimensions (a hyperplane), the line is described using the dot product: the weight vector is perpendicular (normal) to the line, and any point satisfying the equation below is counted as part of the border.

Specifically, the line is defined as $$ \vec{w} \cdot \vec{x} + b = 0 $$ where \( \vec{w} \) is the weights of the SVM (defining the normal), \( \vec{x} \) is a point in space, and \(b\) is the bias. The margin is determined by the length of \( \vec{w} \): the two margin lines are defined as \( \vec{w} \cdot \vec{x} + b = \pm 1 \), so the gap between them has width \( 2 / \lVert \vec{w} \rVert \). This formula generalises easily to higher dimensions (more features), defining a plane in 3D, a hyperplane in 4D, and so on.
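As a small numeric check of the definitions above, here is a sketch with hand-picked weights and bias (illustrative values, not taken from the demo):

```python
# Sketch: evaluating the hyperplane equation and the margin width.
import numpy as np

w = np.array([2.0, 1.0])     # the SVM's weights (the normal to the line)
b = -1.0                     # the bias

point = np.array([0.25, 0.5])
print(w @ point + b)          # 0.0 -> this point lies exactly on the dividing line
print(2 / np.linalg.norm(w))  # width of the margin, 2 / ||w||
```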

The SVM's predictions are then based on which side of the hyperplane a given sample lands on. This can be tested using the sign of the expression above, with the two classes labelled as "1" and "-1".
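In code, prediction is just the sign of \( \vec{w} \cdot \vec{x} + b \); a sketch reusing the same illustrative w and b as above:

```python
# Sketch: classifying a sample by the sign of w.x + b.
import numpy as np

def predict(w, b, x):
    """Return the predicted class, +1 or -1, for a sample x."""
    return 1 if w @ x + b >= 0 else -1

w, b = np.array([2.0, 1.0]), -1.0
print(predict(w, b, np.array([2.0, 2.0])))    #  1 (above the line)
print(predict(w, b, np.array([-1.0, -1.0])))  # -1 (below the line)
```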

Training

The SVM is trained using gradient descent, typically using the hinge loss function, defined as follows: $$ \text{loss}(\vec{x}) = \begin{cases} 0 & y (\vec{w} \cdot \vec{x} + b) \ge 1 \\ 1-y(\vec{w} \cdot \vec{x} + b) & \text{otherwise} \end{cases} $$ $$ y = \pm1 \text{ (the label)} $$ This encodes both the margin, which is what is being maximised, and whether a data point is on the correct side of the line to be classified properly (handled by the multiplication by \(y\)): the loss is zero only when a sample is correctly classified and sits outside the margin.
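Below is a minimal sketch of this training loop: sub-gradient descent on the hinge loss plus a regularisation term that keeps \( \lVert \vec{w} \rVert \) small (and so the margin wide). The learning rate, regularisation strength and epoch count are illustrative choices, not values from the demo above.

```python
# Sketch: training a linear SVM with sub-gradient descent on the hinge loss.
import numpy as np

def train_svm(X, y, lr=0.01, lam=0.01, epochs=1000):
    """X: (n_samples, n_features) array, y: labels in {-1, +1}."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            if yi * (w @ xi + b) >= 1:
                # Correct side of the margin: only the regulariser pulls on w,
                # which keeps ||w|| small and therefore the margin wide.
                w -= lr * (2 * lam * w)
            else:
                # Inside the margin or misclassified: the hinge loss is active.
                w -= lr * (2 * lam * w - yi * xi)
                b -= lr * (-yi)
    return w, b

# Toy usage: two separable clusters.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 0.5, (20, 2)), rng.normal(2, 0.5, (20, 2))])
y = np.array([-1] * 20 + [1] * 20)
w, b = train_svm(X, y)
print(np.mean(np.sign(X @ w + b) == y))   # training accuracy, should be 1.0
```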