In this post, you will learn about some useful random dataset generators provided by Python Sklearn. There are many such methods in the sklearn.datasets package. We will walk through the most common ones, which can be used to create datasets for proof-of-concept work on regression, classification, and clustering machine learning algorithms. As a data scientist, you should get familiar with these methods so that you can quickly create datasets for training models with different machine learning algorithms.
The following methods can be used to generate datasets for training classification models. The first one, make_moons, generates two interleaving half circles:
import numpy as np
from sklearn import datasets
import matplotlib.pyplot as plt
#
# 200 records with noise set as 0.3
#
X, y = datasets.make_moons(200, noise=0.3, random_state=42)
#
# Create the plot
#
fig, ax = plt.subplots(figsize=(6, 6))
plt.xlabel("X0", fontsize=20)
plt.ylabel("X1", fontsize=20)
plt.scatter(X[:,0], X[:,1], s=60, c=y)
Here is the plot for the above dataset.
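Because the two moons are not linearly separable, this dataset is handy for trying out non-linear classifiers. Here is a minimal sketch (the choice of an RBF-kernel SVC is illustrative, not from the original post):
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
#
# Recreate the moons data and fit an RBF-kernel SVC
#
X, y = datasets.make_moons(200, noise=0.3, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)
clf = SVC(kernel="rbf")
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))
The next generator, make_classification, gives finer control over the number of features, classes, and class proportions: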
import numpy as np
from sklearn import datasets
import matplotlib.pyplot as plt
#
# 300 records, 5 features, number of classes = 3
# weights give the proportion of samples assigned to each class
#
X, y = datasets.make_classification(n_samples=300, n_features=5, n_classes=3, n_redundant=0,
                                    n_clusters_per_class=1, weights=[0.5, 0.3, 0.2], random_state=42)
#
# Create the plot
#
fig, ax = plt.subplots(figsize=(9, 6))
plt.xlabel("X0", fontsize=20)
plt.ylabel("X1", fontsize=20)
plt.scatter(X[:,0], X[:,1], s=50, c=y)
The above code will produce the following plot (showing the first two of the five features):
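Since the weights parameter controls the proportion of samples assigned to each class, you can sanity-check the generated labels (a quick check, assuming the X, y from the snippet above):
#
# Class proportions should roughly match weights=[0.5, 0.3, 0.2]
#
print(np.bincount(y) / len(y))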
The generated dataset can also be used for other purposes, such as training a model. Here is sample code which demonstrates how a LogisticRegression model is fit on the random dataset generated using the make_classification method:
import numpy as np
from sklearn import datasets
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
#
# 300 records, 5 features, number of classes = 3
# weights give the proportion of samples assigned to each class
#
X, y = datasets.make_classification(n_samples=300, n_features=5, n_classes=3, n_redundant=0,
                                    n_clusters_per_class=1, weights=[0.5, 0.3, 0.2], random_state=42)
#
# Training / test split
#
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)
#
# Create pipeline
#
pipeline = make_pipeline(StandardScaler(), LogisticRegression())
#
# Fit the model
#
pipeline.fit(X_train, y_train)
#
# Score the model
#
print(pipeline.score(X_test, y_test), pipeline.score(X_train, y_train))
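Beyond the accuracy scores above, you may want per-class precision and recall, especially since the classes are imbalanced by design. Here is a minimal sketch using sklearn's classification_report (not part of the original post):
from sklearn.metrics import classification_report
#
# Per-class precision / recall / f1-score on the test set
#
y_pred = pipeline.predict(X_test)
print(classification_report(y_test, y_pred))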
The following method, make_regression, can be used to generate datasets for training regression models.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import datasets
#
# Create regression datasets
#
X, y = datasets.make_regression(n_samples=200, n_features=5, n_informative=2, random_state=42)
#
# Create a Pandas DataFrame and compute correlations
# You could also use the numpy corrcoef method for the same
#
df = pd.DataFrame(X)
df.columns = ['ftre1', 'ftre2', 'ftre3', 'ftre4', 'ftre5']
df['target'] = y
#
# Determine correlations
#
corr = df.corr()
#
# Draw the correlation heatmap
#
f, ax = plt.subplots(figsize=(9, 6))
mask = np.triu(np.ones_like(corr, dtype=bool))
cmap = sns.diverging_palette(230, 20, as_cmap=True)
sns.heatmap(corr, annot=True, mask = mask, cmap=cmap)
Here is what the correlation heatmap looks like for the randomly generated dataset.
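As noted in the code comments, the same correlations can be computed with the numpy corrcoef method; a minimal equivalent sketch:
#
# Equivalent correlation matrix using numpy
# rowvar=False treats each column as a variable
#
corr_np = np.corrcoef(df.values, rowvar=False)
print(np.allclose(corr_np, corr.values))  # should print True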
Here is how you could fit a linear regression model on a random dataset generated using the make_regression method:
from sklearn.linear_model import LinearRegression
from sklearn import datasets
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split
#
# Create regression datasets
#
X, y = datasets.make_regression(n_samples=200, n_features=5, n_informative=2, random_state=42)
#
# Training / test split
#
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
#
# Create pipeline
#
pipeline = make_pipeline(StandardScaler(), LinearRegression())
#
# Fit the model
#
pipeline.fit(X_train, y_train)
#
# Score the model
#
print(pipeline.score(X_test, y_test), pipeline.score(X_train, y_train))
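Since the dataset was generated with n_informative=2, only two of the five features actually drive the target. You can confirm this by inspecting the fitted coefficients (a minimal sketch; the step name follows make_pipeline's lowercased class-name convention):
#
# Roughly two coefficients should dominate, matching n_informative=2
#
lr = pipeline.named_steps['linearregression']
print(lr.coef_)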
Here is the summary of what you learned in this post in relation to generating random datasets using Python Sklearn methods: make_moons and make_classification can be used to create datasets for classification models, make_regression can be used to create datasets for regression models, and the generated data can be fed straight into Sklearn pipelines to fit models such as LogisticRegression and LinearRegression.