In this post, you will learn about some usefulĀ random datasets generators provided by Python Sklearn. There are many methods provided as part of Sklearn.datasets package. In this post, we will take the most common ones such as some of the following which could be used for creating data sets for doing proof-of-concepts solution for regression, classification and clustering machine learning algorithms. As data scientists, you must get familiar with these methods in order to quickly create the datasets for training models using different machine learning algorithms.
The following is the list of methods which can be used to generate datasets which could be used to train classification models.
import numpy as np
from sklearn import datasets
import matplotlib.pyplot as plt
#
# 200 records with noise set as 0.3
#
X, y = datasets.make_moons(200, noise=0.3, random_state=42)
#
# Create the plot
#
fig, ax = plt.subplots(figsize=(6, 6))
plt.xlabel("X0", fontsize=20)
plt.ylabel("X1", fontsize=20)
plt.scatter(X[:,0], X[:,1], s=60, c=y)
Here is the plot for the above dataset.
import numpy as np
from sklearn import datasets
import matplotlib.pyplot as plt
#
# 200 records, 5 features, number of classes = 3
# weights for each class (proportions of samples assigned to each class)
#
X, y = datasets.make_classification(n_samples=300, n_features=5, n_classes=2, n_redundant=0,
n_clusters_per_class=1, weights=[0.5, 0.3, 0.2], random_state=42)
#
# Create the plot
#
fig, ax = plt.subplots(figsize=(9, 6))
plt.xlabel("X0", fontsize=20)
plt.ylabel("X1", fontsize=20)
plt.scatter(X[:,0], X[:,1], s=50, c=y)
The output of above will show up the following plot:
The data set can as well be used for different purposes such as training model. Here is the sample code which demonstrates how a LogisticRegression model is fit on the random dataset generated using make_classification method:
import numpy as np
from sklearn import datasets
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
#
# 200 records, 5 features, number of classes = 3
# weights for each class (proportions of samples assigned to each class)
#
X, y = datasets.make_classification(n_samples=300, n_features=5, n_classes=2, n_redundant=0,
n_clusters_per_class=1, weights=[0.5, 0.3, 0.2], random_state=42)
#
# Training / test split
#
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)
#
# Create pipeline
#
pipeline = make_pipeline(StandardScaler(), LogisticRegression())
#
# Fit the model
#
pipeline.fit(X_train, y_train)
#
# Score the model
#
pipeline.score(X_test, y_test), pipeline.score(X_train, y_train)
The following is the list of methods which can be used to generate datasets which could be used to train regression models.
import pandas as pd
import seaborn as sns
#
# Create regression datasets
#
X, y = datasets.make_regression(n_samples=200, n_features=5, n_informative=2, random_state=42)
#
# Create Pandas Dataframe and processes correlation
# You could also use numpy corrcoef method for same
#
df = pd.DataFrame(X)
df.columns = ['ftre1', 'ftre2', 'ftre3', 'ftre4', 'ftre5']
df['target'] = y
#
# Determine correlations
#
corr = df.corr()
#
# Draw the correlation heatmap
#
f, ax = plt.subplots(figsize=(9, 6))
mask = np.triu(np.ones_like(corr, dtype=bool))
cmap = sns.diverging_palette(230, 20, as_cmap=True)
sns.heatmap(corr, annot=True, mask = mask, cmap=cmap)
Here is how the correlation heatmap will look like for the randomly generated datasets.
Here is how you could fit a linear regression model using randomly generated regression datasets using make_regression method:
from sklearn.linear_model import LinearRegression
from sklearn import datasets
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split
#
# Create regression datasets
#
X, y = datasets.make_regression(n_samples=200, n_features=5, n_informative=2, random_state=42)
#
# Training / test split
#
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
#
# Create pipeline
#
pipeline = make_pipeline(StandardScaler(), LinearRegression())
#
# Fit the model
#
pipeline.fit(X_train, y_train)
#
# Score the model
#
pipeline.score(X_test, y_test), pipeline.score(X_train, y_train)
Here is the summary of what you learned in this post in relation to generating random datasets using Python Sklearn methods:
Artificial Intelligence (AI) agents have started becoming an integral part of our lives. Imagine asking…
In the ever-evolving landscape of agentic AI workflows and applications, understanding and leveraging design patterns…
In this blog, I aim to provide a comprehensive list of valuable resources for learning…
Have you ever wondered how systems determine whether to grant or deny access, and how…
What revolutionary technologies and industries will define the future of business in 2025? As we…
For data scientists and machine learning researchers, 2024 has been a landmark year in AI…