In this post, you will learn about why and when do we use random seed values while training machine learning models. This is a question most likely asked by beginners data scientist/machine learning enthusiasts.
We use random seed value while creating training and test data set. The goal is to make sure we get the same training and validation data set while we use different hyperparameters or machine learning algorithms in order to assess the performance of different models. This is where the random seed value comes into the picture. Different Python libraries such as scikit-learn etc have different ways of assigning random seeds.
While training machine learning models using Scikit-learn, the function, train_test_split imported from the module sklearn.model_selection takes input for random seed using the parameter such as random_state. Here is the code demonstrating the usage of random_state for passing the value of the random seed.
from sklearn import datasets from sklearn.model_selection import train_test_split iris = datasets.load_iris() X = iris.data y = iris.target X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
random_state=42 sets the random seed to the same value every time you run the above code. This implies that you get the same validation set (X_test, y_test) every time you execute the above code. In this manner, if you change your model by either changing hyperparameters or ML algorithms and retrain it, you can be assured that any differences happen due to the changes to the model, and not due to having a different random validation set.
Conceptually, the seed value is used to generate the random number generator. And, every time you use the same seed value, you will get the same random values. In Python, the method is random.seed(a, version). Numpy provides a similar method such as numpy.random.seed().