The Naive Bayes algorithm is a common machine learning classifier, using conditional probability to find the most appropriate class for a set of features. It is based on Bayes theorem to calculate the probability of seeing a class given the feature vector, this can have an added utility depending on the use case in that you may want to be returning the probability that this feature vector belongs to each class. This classifier can be split into two distinct stages, the training stage and the testing stage. The training stage is where the parameters for the calculations are found and then the testing stage is where these calculations are put to use.

## Bayes Theorem

To understand this algorithm, we must first understand Bayes’ theorem. Bayes’ theorem is a way of calculating conditional probability, in our case we will use it to find the probability of seeing a class given the feature vector that we already have.

*P(c | x) = ( P(c) * P(x | c) ) / p(x)*

Bayes theorem can be written mathematically using the formula above, where c is the class and x is the feature. Now, this can also be used with a feature vector. For this case we do not really need to worry about the denominator as it is going to be the same irrelevant of the class which we are trying to calculate the probability for and we only want to know which is the biggest, therefore we are able to use direct proportionality as below.

*P(c | x1,x2,…,xn) ∝ ( P(c) * P(x1 | c) * P(x2 | c) * … * P(xn | c) )*

In this representation x1, x2, … xn represent the individual features from the feature vector which we can use to build up our probability for the class.

## Applying the theorem

So why do we need this formula? As mentioned above, this algorithm is built on these formulae but this may be much more complex probability than you are used to seeing. Let’s take an example dataset and apply the algorithm to show which class a test vector should be given. Let’s use a series of data about gymnatsics medals won in the 2016 Rio Olympics for team GB and the USA.

```
training_data = [
["Male", "Silver", "GB"],
["Male", "Bronze", "GB"],
["Male", "Gold", "GB"],
["Male", "Gold", "GB"],
["Male", "Bronze", "GB"],
["Female", "Bronze", "GB"],
["Male", "Silver", "USA"],
["Male", "Silver", "USA"],
["Male", "Bronze", "USA"],
["Female", "Gold", "USA"],
["Female", "Gold", "USA"],
["Female", "Bronze", "USA"],
["Female", "Gold", "USA"],
["Female", "Silver", "USA"],
["Female", "Silver", "USA"],
["Female", "Silver", "USA"],
["Female", "Silver", "USA"]
]
```

Gender | Medal | Country |
---|---|---|

Male | Silver | GB |

Male | Bronze | GB |

Male | Gold | GB |

Male | Gold | GB |

Male | Bronze | GB |

Female | Bronze | GB |

Male | Silver | USA |

Male | Silver | USA |

Male | Bronze | USA |

Female | Gold | USA |

Female | Gold | USA |

Female | Bronze | USA |

Female | Gold | USA |

Female | Silver | USA |

Female | Silver | USA |

Female | Silver | USA |

Female | Silver | USA |

We will use this dataset to attempt to try and predict the country which an example athlete would have represented based on their gender and the medal they achieved. I have chosen to use only discrete data for this example as it is the easiest way to represent the way that the probabilities are calculated. It is possible to use continuous data in a naive bayes model, this will be discussed later in this article.

As we know from Bayes Theorem, to calculate the probability that a test vector will belong to a class, the parameters whcih we are going to need are P(Ck) and P(Xi | Ck) where Ck is the class k and Xi is a feature. We are going to need the probability of seeing each class as well as the probability of having a feature given that they belong to each class. We are able to do this before we begin the testing stage, to do this we need to identify the unique possible values for each feature as well as the possible classes each feature vector can belong to. The unique values for the gender are “Male” and “Female”, the unique values for the medal column are “Bronze”, “Silver” and “Gold” and the possible classes are “GB” and “USA”.

Now that we have the unique values for features, we are able to calculate the probabilities that we need. Let’s start with P(Ck), to do this we need to divide the number of instances of each class k and divide it by the total number of rows.

*P(C_gb) = 6 / 17*

*P(C_usa) = 11 / 17*

Country | Probability |
---|---|

GB | 0.3529 |

USA | 0.6471 |

The conditional probability of a feature being a certain value, given the class can be calculated in a similar manner. For example, the probability of an athlete being female given that they belong to the country GB can be calculated by dividing the number of female athletes in team GB by the number of athletes in team GB.

*P(gender=female | country=GB) = 1 / 6*

Country | Gender | P(g | C) |
---|---|---|

GB | Male | 0.8333 |

GB | Female | 0.1667 |

USA | Male | 0.2727 |

USA | Female | 0.7273 |

Country | Medal | P(m | C) |
---|---|---|

GB | Bronze | 0.5 |

GB | Silver | 0.1667 |

GB | Gold | 0.3333 |

USA | Bronze | 0.1818 |

USA | Silver | 0.5455 |

USA | Gold | 0.2727 |

Now that we have the probability for each class and the conditional probabilities for seeing a feature given each class, we can calculate the probability of a test vector belonging to a class. Below is a new test vector which we will try and classify using Bayes theorem.

```
test_vector = ["Male", "Bronze"] # A new test vector with a gender of male, and a bronze medal
```

To be able to find which classification this best represents, we must find P(Ck | X) for each possible class and then the class with the highest probability iwll be the prediction that we give to the test vector. As we have already calculated the parameters we could need, we simply need to put these values together.

*P(C_gb | X) ∝ P(C_gb) * P(g=male | C_gb) * P(m=Bronze | C_gb)
= 0.3529 * 0.8333 * 0.5*

*P(C_usa | X) ∝ P(C_usa) * P(g=male | C_usa) * P(m=Bronze | C_usa)
= 0.6471 * 0.2727 * 0.1818*

Country | Probability |
---|---|

GB | 0.1470 |

USA | 0.0321 |

Here we have the probability that the vector could belong to either country, and we can see that this athlete is most likely to be from GB. The output from our algorithm given this test vector as the input should be the prediction of the country being GB.

## Why is it “Naive”?

The naivety of this algorithm refers to the independence of the features which is assumed, unlike some other algorithms which can make no assumptions about the input features. This does not necessarily mean that the algorithm is any less effective than another. The best algorithm to use on a dataset must be decided based on the features and size of the dataset, and it may be that you find that the Naive Bayes classifier gives the best accuracy for some purposes.

## Using continuous data

As mentioned before, it is possible to use continuous data with this algorithm as well as the discrete data. To do this, some extra information must be decided or calculated.

The easiest way to use continuous data is to attempt to transform it into discrete data by setting boundary points manually in order to divide the continuous dataset into segments. For example, with continuous data such as height it may be possible to set up three separate categories: “Short”, “Medium” and “Tall”. However, this is likely to lose some information as different people will consider the boundaries of each category to be at different values (e.g. some people may consider tall to be over 6’, whereas others may consider tall to be over 6’ 5”).

Another method is to assume some sort of distribution over the feature values. To do this, it is necessary to find the mean value for the data as well as the standard deviation for each possible class. There are many articles and explanations on how this can be done online.

## To Conclude

The Naive Bayes classifier is an extremely powerful algorithm which can be used to classify feature vectors without explicitly defining the parameters from which the output is decided. This can be useful for predicting a wide range of things for a wide range of reasons.