Logistic regression is a powerful binary machine learning classifier. It operates on numerical data where the labels are either 1 or 0, although discrete data can often be encoded into a numerical format to fit the algorithm. The reason you cannot use a standard regression algorithm for this task is that simple regression can return a number much larger than 1 or below 0. Logistic regression resolves this by using a sigmoid function for the hypothesis: the range of a sigmoid is 0 < y < 1, and thus our problem is solved! Simple regression also solves a slightly different problem. It can be used as part of a classification pipeline, but its goal is to find a line that fits all of the data, not one that divides the data in two. This makes it unclear where the threshold between classes should sit, and that threshold can shift unpredictably as new data arrives.

In order to use the sigmoid function, it helps to know exactly what it is.

*σ(x) = 1 / (1 + e^{-x})*

Plotting this function shows that it behaves the way we want it to.

```
import matplotlib.pyplot as plt # Import the plotting library
import numpy as np # Import numpy for array manipulation
%matplotlib inline

def sigmoid(x): # Define the sigmoid function
    return 1 / (1 + np.exp(-x))

x_values = np.arange(-10, 11) # Values of x from -10 to 10
plt.plot(x_values, sigmoid(x_values)) # Plot the sigmoid function
plt.ylabel("sigmoid(x)") # Add a label to the y axis
plt.xlabel("x") # Add a label to the x axis
plt.show() # Show the graph
```

As can be seen from the graph, the sigmoid function approaches but never actually touches 0 or 1. These are the asymptotes of the graph. Therefore, no matter what we pass into our sigmoid function, it will only respond with a number between 0 and 1. This can also be thought of as the probability of the input belonging to class 1. If this value is above 0.5 then the probability of the input belonging to class 1 is higher than the probability of it belonging to class 0.
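As a quick check, we can evaluate the sigmoid at a few points and apply the 0.5 threshold (the input values here are just illustrative):

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

inputs = np.array([-4.0, -0.5, 0.0, 0.5, 4.0])  # illustrative inputs
probabilities = sigmoid(inputs)                  # every value lands in (0, 1)
predicted_classes = (probabilities > 0.5).astype(int)  # class 1 when p > 0.5
print(probabilities.round(3))
print(predicted_classes)
```

Note that an input of exactly 0 gives a probability of exactly 0.5, the point of maximum uncertainty.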

Finding a function to fit our data can be hard to picture if you have never worked with a sigmoid function before, but a fairly simple graph comparing a few variants to the standard sigmoid makes it clearer.

*f(x) = σ(θ^{T}x)*

This is very similar to the function found in linear regression, except that it is wrapped in the sigmoid function to keep the range limited to between 0 and 1.

```
x_values = np.arange(-10, 11) # Create a numpy array of values from -10 to 10
plt.plot(x_values, sigmoid(x_values), c='b', label='sigmoid(x)') # Plot the original sigmoid function in blue
plt.plot(x_values, sigmoid(x_values*3), c='r', label='sigmoid(3x)') # Plot sigmoid(3x) in red
plt.plot(x_values, sigmoid(x_values*3 + 2), c='g', label='sigmoid(3x + 2)') # Plot sigmoid(3x + 2) in green
plt.ylabel("f(x)") # Add a label to the y axis
plt.xlabel("x") # Add a label to the x axis
plt.legend(loc="lower right") # Add the legend in the lower right corner
plt.show() # Show the graph
```

Unfortunately, this graph doesn’t help much when trying to visualise the problem. However, we can draw on linear regression to understand what this actually means.

```
training_data = np.array([
    [1, 5, 1],
    [2, 3, 1],
    [3, 7, 1],
    [4, 5, 1],
    [5, 6, 1],
    [6, 4, 0],
    [7, 3, 0],
    [8, 2, 0],
    [9, 2, 0],
    [10, 1, 0],
])
training_class_1 = training_data[training_data[:,2] == 1, 0:2] # All of the rows with class 1
training_class_0 = training_data[training_data[:,2] == 0, 0:2] # All of the rows with class 0
plt.scatter(training_class_1[:, 0], training_class_1[:,1], c='g') # Plot class 1 in green
plt.scatter(training_class_0[:, 0], training_class_0[:,1], c='b') # Plot class 0 in blue
plt.xlabel("x") # Add a label to the x axis
plt.ylabel("y") # Add a label to the y axis
plt.show() # Show the graph
```

Here we have some example training data, plotted as a scatter plot with the different classes in different colours. In essence, what we are trying to find with this algorithm is a line separating the two classes, where any values on one side of the line belong to one class and any values on the other side belong to the other class. An example line that fits this requirement is below.

```
training_class_1 = training_data[training_data[:,2] == 1, 0:2] # All of the rows with class 1
training_class_0 = training_data[training_data[:,2] == 0, 0:2] # All of the rows with class 0
plt.scatter(training_class_1[:, 0], training_class_1[:,1], c='g') # Plot class 1 in green
plt.scatter(training_class_0[:, 0], training_class_0[:,1], c='b') # Plot class 0 in blue
separator = np.arange(1, 11) # Define the x and y values the same, y = x
plt.plot(separator, separator, c='r') # Plot the separation line of y = x
plt.xlabel("x") # Add a label to the x axis
plt.ylabel("y") # Add a label to the y axis
plt.show() # Show the graph
```

This graph shows an example of the kind of linear function that would be passed into the sigmoid, which limits the output to between 0 and 1. The output of the sigmoid function can be interpreted as the predicted probability of the input vector belonging to class 1. Therefore we can say that anything with an output greater than 0.5 is in class 1, and otherwise it belongs to class 0.
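Putting that interpretation into code, a prediction for a two-feature input looks like the following. The separating line y = x from the plot above corresponds roughly to the parameters θ = (−1, 1) and b = 0, since σ(−x₁ + x₂) > 0.5 exactly when x₂ > x₁ (these values are illustrative, not fitted):

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def predict(features, coefficients, bias):
    """Return the predicted probability of class 1 and the class label."""
    probability = sigmoid(np.dot(coefficients, features) + bias)
    return probability, int(probability > 0.5)

coefficients = np.array([-1.0, 1.0])  # illustrative parameters for the y = x separator
bias = 0.0

# The point (2, 3) lies above the line y = x, so it should come out as class 1
probability, label = predict(np.array([2.0, 3.0]), coefficients, bias)
print(probability, label)
```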

## Updating the parameters

The method for optimising the parameters that I am using in this article is gradient descent, though others are available. There are three main pieces of information needed from the model: the prediction, the cost, and the gradient of the cost. These values are used to optimise the parameters so that a more accurate function can be found to fit the data. With real-world data it is always important to try to avoid overfitting the model. Overfitting is when you train the model so closely to your training data that it loses its ability to give accurate outputs for unseen data.

The prediction is simply the output of the function when it is passed the feature vector and the current parameters (the coefficients and the bias). In order to calculate the cost, however, an error function must be chosen. There are many error functions available, but the one used in this article is a common choice: the squared error.

*E = 0.5 * (prediction - real value)^{2}*

In this calculation, “real value” is the expected output from the training data (the third column in our case).

To update the parameters, we need the gradient of the error with respect to the parameters, which we get by differentiating the error function. Strictly, differentiating the squared error through the sigmoid introduces an extra factor of prediction * (1 - prediction); the simplified form below is the gradient you would get with the cross-entropy (log) loss, and it is the form commonly used in practice.

*dE/dθ = (prediction - real value) * x*
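To make the gradient concrete, here is a single evaluation for one training point, using the first row of the training data above and arbitrary zero-valued starting parameters:

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

features = np.array([1.0, 5.0])  # first row of the training data above
real_value = 1                   # its class label
coefficients = np.zeros(2)       # arbitrary starting parameters
bias = 0.0

prediction = sigmoid(np.dot(coefficients, features) + bias)  # sigmoid(0) = 0.5
error = 0.5 * (prediction - real_value) ** 2                 # the squared-error cost
gradient = (prediction - real_value) * features              # dE/dtheta for each coefficient
print(prediction, error, gradient)
```

The gradient is negative for both coefficients, so the update step will increase them, pushing the prediction for this class-1 point up towards 1.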

We can also define the update function.

*θ = θ - δ * (prediction - real value) * x*

Here θ denotes the parameters, while δ is the learning rate (a small step size so that we do not jump too far and miss the optimal parameters). As the bias doesn’t have a corresponding value in the feature vector, the x is replaced by 1 (and so doesn’t need to be written in the equation). The bias is written as b and the coefficients as θ_{1}, θ_{2}, θ_{3} … θ_{n}.

*b = b - δ * (prediction - real value)*

*θ _{1} = θ_{1} - δ * (prediction - real value) * x_{1}*

Gradient descent should be repeated until either a preset number of iterations has been completed or the error falls below a preset threshold. Once one of these conditions is met, the parameters you are left with are the ones to use as your optimal parameters.

## Making a prediction

Now that we’ve got our optimised parameters, we can make predictions for inputs whose outputs are not already known. These predictions will still carry some error, but they give a good estimate of the class, and the error measured during training gives us a sense of how much to trust them. Because the model works directly on continuous features and is cheap to evaluate, it can be an incredibly effective predictor.
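As a sketch, suppose training produced parameters close to the y = x separator from earlier (the exact values below are illustrative, not actual fitted output):

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

coefficients = np.array([-1.0, 1.0])  # illustrative parameters near the y = x separator
bias = 0.0

new_points = np.array([[2.0, 6.0], [8.0, 3.0]])  # unseen inputs
probabilities = sigmoid(new_points @ coefficients + bias)
classes = (probabilities > 0.5).astype(int)
print(probabilities.round(3), classes)
```

The first point lies well above the line y = x and the second well below it, so the predicted probabilities sit close to 1 and 0 respectively, reflecting the model’s confidence in each classification.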