The K nearest neighbours algorithm is a machine learning classification algorithm: it takes a set of input data and determines a classification for each item. K nearest neighbours requires two sets of data: the training dataset and the test data. It also takes a parameter known as k, the main value you will tune to make the algorithm work well for your problem.
The Training Data
The training data mentioned above is a pre-labelled dataset which the algorithm compares the test data against in order to determine the most appropriate labels. Unlike some other algorithms, K nearest neighbours has no designated training stage: the model is built at testing time, which is known as ‘lazy learning’.
Below is an example of some training data. It has 2 dimensions that we can use to build our model (height and weight) as well as a class column (Sport). The data comes from the Wikipedia pages of a selection of professionals from each sport. It is considered labelled because the Sport column already contains the correct values.
training_data = [
    [1.85, 71, "Cyclist"],
    [1.83, 71, "Cyclist"],
    [1.67, 58, "Cyclist"],
    [1.78, 67, "Cyclist"],
    [1.88, 86, "Cyclist"],
    [1.73, 68, "Boxer"],
    [1.66, 65, "Boxer"],
    [1.78, 109, "Boxer"],
    [1.79, 72, "Boxer"],
    [1.91, 102, "Boxer"],
]
Visualising The Model
One of the things that makes the K nearest neighbours algorithm easy to understand is that it is very easy to visualise. The basic premise of the model is that it plots the feature vectors from the training dataset in N-dimensional space (where N is the number of features we have for each row in our data). The more features a dataset has, the harder this becomes to visualise, as we can only comprehend 3 dimensions, but a good K nearest neighbours implementation should be able to work with as many dimensions as are needed. Below is a graph plotting the points from our training data in the table above; it is very easy to understand and visualise because we only have 2 features and therefore 2 dimensions per vector.
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np

training_data = np.array(training_data)  # Make the training data a numpy array (note: mixing numbers and labels makes every element a string)
cyclist_rows = np.where(training_data[:, 2] == "Cyclist")  # Indices of the Cyclist rows so they can be plotted with a different marker
boxer_rows = np.where(training_data[:, 2] == "Boxer")  # Indices of the Boxer rows

# The feature columns must be converted back to floats before plotting,
# otherwise matplotlib treats the string values as categories
plt.scatter(x=training_data[cyclist_rows, 0].astype(float), y=training_data[cyclist_rows, 1].astype(float), c="b", marker="o", label="Cyclist")  # Cyclists as blue circles
plt.scatter(x=training_data[boxer_rows, 0].astype(float), y=training_data[boxer_rows, 1].astype(float), c="g", marker="^", label="Boxer")  # Boxers as green triangles
plt.title("Sports personalities weight against height")  # Set the title of the graph
plt.xlabel("Height (m)")  # Set the label for the x axis
plt.ylabel("Weight (kg)")  # Set the label for the y axis
plt.legend(loc="upper right")  # Add the legend to the upper right hand corner
plt.show()  # Display the graph
From this graph we can see that the two classes overlap considerably, so this dataset may not give the best results. However, the more data you add for each classification, the more likely you are to see clusters forming, and therefore the more reliably you can classify new, unseen data.
Classifying a new feature vector
Now that we have plotted our training data, we can put the classifier to use! We will introduce a new feature vector which we want to classify. We don’t yet know whether this person is a cyclist or a boxer, but we do know their height and weight, so we can use the data we already have to estimate the most likely classification. Below I have defined a test vector with only the two features that we know.
test_vector = [1.9, 69]
We are now able to plot this vector in the same space as the training dataset. The point of this is to let us find the training points closest to our test vector and then look up the most common label among them. So, let’s plot the graph again with the previous data, adding the new test vector as a red square.
plt.scatter(x=training_data[cyclist_rows, 0].astype(float), y=training_data[cyclist_rows, 1].astype(float), c="b", marker="o", label="Cyclist")  # Cyclists as blue circles
plt.scatter(x=training_data[boxer_rows, 0].astype(float), y=training_data[boxer_rows, 1].astype(float), c="g", marker="^", label="Boxer")  # Boxers as green triangles
plt.scatter(x=test_vector[0], y=test_vector[1], c="r", marker="s", label="Unknown")  # Add the new test vector as a red square
plt.title("Sports personalities weight against height")  # Set the title of the graph
plt.xlabel("Height (m)")  # Set the label for the x axis
plt.ylabel("Weight (kg)")  # Set the label for the y axis
plt.legend(loc="upper right")  # Add the legend to the upper right hand corner
plt.show()  # Display the graph
It is important to remember at this point that the scales of the two axes are dramatically different. It may immediately appear that the closest vectors are cyclists, suggesting that the test vector should also be labelled as a cyclist. However, the label we actually predict will depend on the true distances between the vectors as well as our value of k (the number of closest vectors to look at).
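A quick way to see the effect of these mismatched scales is to rescale both features to the same range before measuring distances. Min-max scaling is one option; this is a sketch of the idea, not part of the original pipeline:

```python
import numpy as np

# Feature columns from the training table: height (m) and weight (kg)
X = np.array([[1.85, 71], [1.83, 71], [1.67, 58], [1.78, 67], [1.88, 86],
              [1.73, 68], [1.66, 65], [1.78, 109], [1.79, 72], [1.91, 102]])

# Min-max scale each column to the range [0, 1] so that neither feature
# dominates the distance purely because of its units
X_scaled = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

print(X_scaled.min(axis=0))  # [0. 0.]
print(X_scaled.max(axis=0))  # [1. 1.]
```

With the raw data, weight differences of several kilograms dwarf height differences of a few centimetres; after scaling, both features contribute comparably to the distance.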
The usual way to calculate the Euclidean distance between two vectors is to find the difference in each dimension, square each of those differences, sum the squares, and then take the square root of the sum. I’m sure that anyone who has read this far will be familiar with this method of finding the distance between vectors.
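These steps can be sketched in plain Python (the function name euclidean_distance is my own):

```python
import math

def euclidean_distance(a, b):
    # Square the difference in each dimension, sum the squares,
    # then take the square root of that sum
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Distance from the test vector to the first Cyclist row above
print(euclidean_distance([1.9, 69], [1.85, 71]))  # ~2.0006
```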
The distances of the training vectors from our test vector are as follows.

Sport     Height (m)  Weight (kg)  Distance
Boxer     1.73        68           1.0143
Cyclist   1.85        71           2.0006
Cyclist   1.83        71           2.0012
Cyclist   1.78        67           2.0036
Boxer     1.79        72           3.0020
Boxer     1.66        65           4.0072
Cyclist   1.67        58           11.0024
Cyclist   1.88        86           17.0000
Boxer     1.91        102          33.0000
Boxer     1.78        109          40.0002
This table has already been sorted by distance in ascending order, so we simply take the k smallest distances and predict the most common sport among them. Let’s assume that we have decided that the best value of k for our model is 5. This gives us these results:

Sport     Distance
Boxer     1.0143
Cyclist   2.0006
Cyclist   2.0012
Cyclist   2.0036
Boxer     3.0020
Here we have 3 cyclists and 2 boxers, so the output of our algorithm for this test vector should be the label “Cyclist”. This is a good illustration of the importance of tuning the value of k: a much smaller value (e.g. 2) could give an equal number of each classification, and the same tie would happen if k were too high (e.g. 6 or 10, both of which split the votes evenly here). Finding a value of k that gives a decisive answer is easy with a dataset this small, but it becomes much harder as the number of rows in the training data and the number of features being compared increase.
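Putting all of the steps above together, a minimal end-to-end classifier might look like the sketch below (the helper name knn_classify is my own; math.dist computes the Euclidean distance between two points):

```python
import math
from collections import Counter

training_data = [[1.85, 71, "Cyclist"], [1.83, 71, "Cyclist"], [1.67, 58, "Cyclist"],
                 [1.78, 67, "Cyclist"], [1.88, 86, "Cyclist"], [1.73, 68, "Boxer"],
                 [1.66, 65, "Boxer"], [1.78, 109, "Boxer"], [1.79, 72, "Boxer"],
                 [1.91, 102, "Boxer"]]

def knn_classify(test_vector, training_data, k):
    # Pair each training row's distance to the test vector with its label
    distances = [(math.dist(test_vector, row[:2]), row[2]) for row in training_data]
    # Sort by distance, take the k nearest labels, and return the most common one
    nearest = [label for _, label in sorted(distances)[:k]]
    return Counter(nearest).most_common(1)[0][0]

print(knn_classify([1.9, 69], training_data, k=5))  # Cyclist
```

Running this reproduces the hand-worked result above: with k=5 the three cyclists outvote the two boxers.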
What is different in an implementation?
When implementing a k nearest neighbours algorithm, you don’t actually need the true Euclidean distance. Calculating a square root is an expensive computation, so working out the differences for each training vector individually and then square rooting the sum of their squares is often not the optimal way to find the closest vectors to a test vector. One way to do this more efficiently is to use matrix subtraction rather than taking each vector in the dataset separately. This gives you a matrix of differences; if you then take the dot product of the difference matrix with its own transpose, the diagonal of the resulting matrix holds the squared Euclidean distances. Since the squared distance increases monotonically with the distance, ranking vectors by this value gives exactly the same nearest neighbours. In Python it is possible to do this in only a few lines without writing any explicit loops.
diff_matrix = np.array(test_vector) - training_data[:, [0, 1]].astype(float)  # Matrix of differences between the test vector and every training row
distance_squared = np.diag(diff_matrix.dot(diff_matrix.T))  # The diagonal of the dot product holds the squared Euclidean distances
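An alternative sketch that avoids building the full dot-product matrix is to sum the squared differences row by row, then use np.argsort to pick out the labels of the k nearest rows (the data is restated here so the snippet runs on its own; the row-wise sum is my substitution for the diagonal trick):

```python
import numpy as np
from collections import Counter

training_data = np.array([[1.85, 71, "Cyclist"], [1.83, 71, "Cyclist"],
                          [1.67, 58, "Cyclist"], [1.78, 67, "Cyclist"],
                          [1.88, 86, "Cyclist"], [1.73, 68, "Boxer"],
                          [1.66, 65, "Boxer"], [1.78, 109, "Boxer"],
                          [1.79, 72, "Boxer"], [1.91, 102, "Boxer"]])
test_vector = [1.9, 69]

diff_matrix = np.array(test_vector) - training_data[:, [0, 1]].astype(float)
# Summing the squared differences along each row gives the squared distances
# directly, without computing the off-diagonal entries of the dot product
distance_squared = np.sum(diff_matrix ** 2, axis=1)

k = 5
nearest_labels = training_data[np.argsort(distance_squared)[:k], 2]
print(Counter(nearest_labels).most_common(1)[0][0])  # Cyclist
```

The row-wise sum does the same job as the diagonal of the dot product, but only ever computes the n values that are actually needed.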
So there it is, a k nearest neighbours algorithm: one of the simplest machine learning algorithms but also a highly powerful one. It is non-parametric, which means that it makes no assumptions about how the data is distributed. It can also be used whether the features are dependent or independent, which makes it a very flexible algorithm to have in your arsenal.