In this post, you will learn the concepts of stochastic gradient descent (SGD) using a Python example. The Perceptron machine learning algorithm is used to demonstrate how SGD works. Recall that the Perceptron is also called a single-layer neural network. Before getting into the details, let's quickly review the Perceptron and the learning algorithm, SGD, that is used to train it. For background on gradient descent, you may want to check out this page: Gradient Descent explained with examples. This post covers SGD for learning a Perceptron model, Perceptron Python code representing SGD, and the advantages of using SGD for learning weights.

## Stochastic Gradient Descent (SGD) for Learning Perceptron Model

The Perceptron algorithm can be used to train a binary classifier that classifies each record as either 1 or 0. It is based on the following steps:

• Gather data: First and foremost, one or more features are defined. The data for those features is then collected along with the class label representing the binary class of each record.
• Invoke activation function: The activation function computes the weighted sum of the input data, that is, the sum of the products of each weight $$w_i$$ and the corresponding feature $$x_i$$. The formula is $$\sum_{i=0}^{n} w_i x_i$$, where $$x_0 = 1$$ so that $$w_0$$ acts as the bias term.
• Use unit step function to predict class 0 or 1: The output of the activation function is compared with 0. If the output is greater than or equal to 0, the prediction is 1; otherwise the prediction is 0.
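
The last two steps can be sketched in plain Python as follows. The function names `weighted_sum` and `unit_step`, the weights, and the sample record are illustrative assumptions and are not part of the code used later in this post.

```python
def weighted_sum(weights, features):
    """Return w0*x0 + w1*x1 + ... + wn*xn with x0 fixed to 1."""
    inputs = [1.0] + list(features)  # prepend x0 = 1 for the bias weight w0
    return sum(w * x for w, x in zip(weights, inputs))

def unit_step(value):
    """Predict class 1 when the weighted sum is >= 0, otherwise class 0."""
    return 1 if value >= 0.0 else 0

# Hypothetical weights [w0, w1, w2] and one record with two features
weights = [-1.0, 0.5, 0.25]
record = [2.0, 4.0]

z = weighted_sum(weights, record)  # -1.0 + 0.5*2.0 + 0.25*4.0 = 1.0
prediction = unit_step(z)          # 1.0 >= 0, so the predicted class is 1
print(z, prediction)
```

Note that a weighted sum of exactly 0 is also mapped to class 1, because the comparison is "greater than or equal to".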

To achieve the above, the unknowns are the weights, which are also called coefficients in the case of linear regression. The weights need to be learned as part of training (fitting) the model. In other words, the model is trained on the data set to learn its weights, parameters, or coefficients. The algorithm used here to learn the weights is called stochastic gradient descent.

Stochastic gradient descent is a type of gradient descent algorithm in which the weights of the model are learned (updated) after every training example, so that the next prediction can be more accurate. This is unlike batch gradient descent, where the weights are updated only after all the training examples have been visited.
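
The difference in update timing can be sketched as below, assuming the Perceptron update rule (learning rate times error times input). The helper names `step`, `sgd_epoch`, and `batch_epoch` and the toy data are hypothetical and only serve to contrast the two schedules.

```python
import numpy as np

def step(z):
    # Unit step: 1 where z >= 0, else 0
    return np.where(z >= 0.0, 1, 0)

def sgd_epoch(X, y, w, b, lr=0.01):
    """One pass over the data with an update after every training example."""
    n_updates = 0
    for xi, target in zip(X, y):
        error = target - step(np.dot(xi, w) + b)  # uses the latest weights
        w = w + lr * error * xi
        b = b + lr * error
        n_updates += 1
    return w, b, n_updates

def batch_epoch(X, y, w, b, lr=0.01):
    """One pass over the data with a single update from all examples."""
    errors = y - step(X @ w + b)  # every prediction uses the same weights
    w = w + lr * (X.T @ errors)
    b = b + lr * errors.sum()
    return w, b, 1

# Toy data with AND-style labels (an assumption for illustration)
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
y = np.array([0, 0, 0, 1])
w0, b0 = np.zeros(2), 0.0

_, _, n_sgd = sgd_epoch(X, y, w0, b0)
_, _, n_batch = batch_epoch(X, y, w0, b0)
print(n_sgd, n_batch)  # 4 updates versus 1 update per pass
```

In the stochastic version, each example sees weights that already reflect the examples before it; in the batch version, all examples in a pass are evaluated against the same weights.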

Here is the Python code that shows how the weights are learned (updated) after each training example. Pay attention to the following in order to understand how stochastic gradient descent works:

1. The fit method runs multiple iterations of the weight-learning process. The number of iterations is set using n_iterations.
2. In each iteration, every training example is used in turn to update the weights. Notice the loop for xi, expected_value in zip(X, y).
3. The delta value added to the weights is calculated as the product of the learning rate (set to 0.01), the difference between the expected value and the predicted value, and the feature values. Note that the predicted value is obtained by comparing the output of the activation function with 0: if the output is greater than or equal to 0, the prediction is 1, otherwise 0.
4. The weights are updated with the delta value calculated in the previous step.
5. The new weights are applied to the next training example.
6. Steps 2 through 5 together are what is called stochastic gradient descent.

```python
import numpy as np

n_iterations = 100
learning_rate = 0.01

def predict(X, coef):
    # Activation function: w0 + w1*x1 + w2*x2 + ... + wn*xn
    output = np.dot(X, coef[1:]) + coef[0]
    # Unit step function: predict 1 if output >= 0 else 0
    return np.where(output >= 0.0, 1, 0)

def fit(X, y):
    rgen = np.random.RandomState(1)
    # One weight per feature plus the bias term stored in coef_[0]
    coef_ = rgen.normal(loc=0.0, scale=0.01, size=1 + X.shape[1])
    for _ in range(n_iterations):
        for xi, expected_value in zip(X, y):
            predicted_value = predict(xi, coef_)
            # Update the weights right after each training example
            coef_[1:] += learning_rate * (expected_value - predicted_value) * xi
            coef_[0] += learning_rate * (expected_value - predicted_value) * 1
    return coef_
```
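
To see what a single update does, here is a small worked example using the same update expressions. The starting weights, the training example, and the helper name `predict_one` are hypothetical values chosen so the arithmetic is easy to follow.

```python
import numpy as np

learning_rate = 0.01
coef = np.array([0.0, 0.0, 0.0])  # [w0 (bias), w1, w2], hypothetical start
xi = np.array([2.0, 4.0])         # one hypothetical training example
expected_value = 0                # its class label

def predict_one(x, coef):
    # Weighted sum followed by the unit step function
    return int(np.dot(x, coef[1:]) + coef[0] >= 0.0)

predicted_value = predict_one(xi, coef)    # weighted sum is 0.0, so prediction is 1
error = expected_value - predicted_value   # 0 - 1 = -1

coef[1:] += learning_rate * error * xi     # adds [-0.02, -0.04] to w1, w2
coef[0] += learning_rate * error * 1       # adds -0.01 to the bias w0
print(coef)

# The same example is now classified as 0, so a second update would be zero
print(predict_one(xi, coef))
```

When the prediction matches the expected value, the error term is 0 and the weights are left unchanged; only misclassified examples move the weights.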


## Perceptron Python Code representing SGD

Here is the Perceptron code implementing the stochastic gradient descent algorithm. Pay attention to the fit method, which consists of the same code as described in the previous section.

```python
import numpy as np

class CustomPerceptron(object):

    def __init__(self, n_iterations=100, random_state=1, learning_rate=0.01):
        self.n_iterations = n_iterations
        self.random_state = random_state
        self.learning_rate = learning_rate

    # 1. Weights are updated based on each training example
    # 2. Learning of weights can continue for multiple iterations
    # 3. Learning rate needs to be defined
    def fit(self, X, y):
        rgen = np.random.RandomState(self.random_state)
        # One weight per feature plus the bias term stored in coef_[0]
        self.coef_ = rgen.normal(loc=0.0, scale=0.01, size=1 + X.shape[1])
        for _ in range(self.n_iterations):
            for xi, expected_value in zip(X, y):
                predicted_value = self.predict(xi)
                self.coef_[1:] += self.learning_rate * (expected_value - predicted_value) * xi
                self.coef_[0] += self.learning_rate * (expected_value - predicted_value) * 1

    # Activation function calculates the weighted sum of the input values
    def activation(self, X):
        return np.dot(X, self.coef_[1:]) + self.coef_[0]

    # Prediction is made on the basis of the unit step function
    def predict(self, X):
        output = self.activation(X)
        return np.where(output >= 0.0, 1, 0)

    # Model score is calculated by comparing the expected values
    # with the predicted values
    def score(self, X, y):
        misclassified_data_count = 0
        for xi, target in zip(X, y):
            output = self.predict(xi)
            if target != output:
                misclassified_data_count += 1
        total_data_count = len(X)
        self.score_ = (total_data_count - misclassified_data_count) / total_data_count
        return self.score_
```
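
The score computed above is the fraction of correctly classified records, that is, the accuracy. As a hedged aside, the same quantity can be written without an explicit loop; the function name `accuracy` and the sample labels below are assumptions made for illustration.

```python
import numpy as np

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the expected labels."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    return float(np.mean(y_true == y_pred))

# Hypothetical expected and predicted labels
y_true = [1, 0, 1, 1]
y_pred = [1, 0, 0, 1]
print(accuracy(y_true, y_pred))  # 3 of 4 correct -> 0.75
```

Because predict accepts a whole feature matrix as well as a single record, a call such as accuracy(y, model.predict(X)) would be expected to give the same value as the loop-based score method.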


You could use the following code to train a model using the CustomPerceptron implementation and calculate its score. Note that the Sklearn breast cancer data set is used for training the model.

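```python
from sklearn import datasets
from sklearn.model_selection import train_test_split
#
# Load the Sklearn breast cancer data set
#
bc = datasets.load_breast_cancer()
X = bc.data
y = bc.target
#
# Create training and test split
#
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)
#
# Instantiate CustomPerceptron
#
prcptrn = CustomPerceptron()
#
# Fit the model
#
prcptrn.fit(X_train, y_train)
#
# Score the model on the test and training data
#
prcptrn.score(X_test, y_test), prcptrn.score(X_train, y_train)
```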


## What are the advantages of using Stochastic Gradient Descent (SGD) for learning weights?

Here are a couple of advantages of using SGD for learning model parameters (not hyperparameters), that is, weights.

• Empirically, SGD helps the model converge quickly when the training data set is large, because the weights start improving after each example instead of after a full pass over the data.
• Models trained with SGD are often found to generalize better when predicting on the population or on unseen data.

## Conclusions

Here is a summary of what you learned about stochastic gradient descent, along with its Python implementation and a related example:

• Stochastic gradient descent (SGD) is a gradient descent algorithm used for learning the weights / parameters / coefficients of a model, be it a Perceptron or linear regression.
• SGD updates the weights of the model after each training example.
• SGD is particularly useful when the training data set is large.
• Models trained with an algorithm that applies stochastic gradient descent for learning weights are often found to generalize better on unseen data.