In this post, you will learn about some useful random dataset generators provided by Python Sklearn. The sklearn.datasets package provides many such methods. This post covers the most common ones, which can be used to create datasets for proof-of-concept solutions with regression, classification and clustering machine learning algorithms. As a data scientist, you should get familiar with these methods so that you can quickly create datasets for training models with different machine learning algorithms.
The following methods can be used to generate datasets for training classification models. The first example uses make_moons:
import numpy as np
from sklearn import datasets
import matplotlib.pyplot as plt
#
# 200 records with noise set as 0.3
#
X, y = datasets.make_moons(200, noise=0.3, random_state=42)
#
# Create the plot
#
fig, ax = plt.subplots(figsize=(6, 6))
plt.xlabel("X0", fontsize=20)
plt.ylabel("X1", fontsize=20)
plt.scatter(X[:,0], X[:,1], s=60, c=y)
Here is the plot for the above dataset.
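As a complement to make_moons, here is a minimal sketch of make_circles, another generator in sklearn.datasets that produces a two-class dataset a linear model cannot separate. The parameter values (noise=0.05, factor=0.5) and the variable names are illustrative assumptions, not taken from the example above.

```python
import numpy as np
from sklearn import datasets

# make_circles draws two concentric rings; factor is the ratio of the
# inner ring's radius to the outer ring's radius (assumed value: 0.5)
X_circles, y_circles = datasets.make_circles(
    n_samples=200, noise=0.05, factor=0.5, random_state=42
)

# make_moons with the same settings as the example above, for comparison
X_moons, y_moons = datasets.make_moons(n_samples=200, noise=0.3, random_state=42)

# Both generators return two features and split the samples evenly
# between the two classes
print(X_circles.shape, np.bincount(y_circles))
print(X_moons.shape, np.bincount(y_moons))
```

Either array pair can be passed to the same plt.scatter call used above to inspect the shape of the data.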
import numpy as np
from sklearn import datasets
import matplotlib.pyplot as plt
#
# 300 records, 5 features, number of classes = 3
# weights for each class (proportions of samples assigned to each class)
#
X, y = datasets.make_classification(n_samples=300, n_features=5, n_classes=3, n_redundant=0,
                                    n_clusters_per_class=1, weights=[0.5, 0.3, 0.2], random_state=42)
#
# Create the plot
#
fig, ax = plt.subplots(figsize=(9, 6))
plt.xlabel("X0", fontsize=20)
plt.ylabel("X1", fontsize=20)
plt.scatter(X[:,0], X[:,1], s=50, c=y)
Running the code above produces the following plot:
The dataset can also be used for other purposes, such as training a model. The following sample code fits a LogisticRegression model on the random dataset generated using the make_classification method:
import numpy as np
from sklearn import datasets
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
#
# 300 records, 5 features, number of classes = 3
# weights for each class (proportions of samples assigned to each class)
#
X, y = datasets.make_classification(n_samples=300, n_features=5, n_classes=3, n_redundant=0,
                                    n_clusters_per_class=1, weights=[0.5, 0.3, 0.2], random_state=42)
#
# Training / test split
#
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)
#
# Create pipeline
#
pipeline = make_pipeline(StandardScaler(), LogisticRegression())
#
# Fit the model
#
pipeline.fit(X_train, y_train)
#
# Score the model
#
pipeline.score(X_test, y_test), pipeline.score(X_train, y_train)
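Since the weights argument makes the classes imbalanced, it can be worth checking that the split preserves the class proportions. The sketch below assumes three classes with weights 0.5, 0.3 and 0.2, and the variable names (full_props, train_props, test_props) are hypothetical. The proportions are only approximately equal to the weights, because make_classification randomly flips a small fraction of labels by default (flip_y=0.01).

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split

# Assumed setup: 300 records, 3 imbalanced classes
X, y = datasets.make_classification(n_samples=300, n_features=5, n_classes=3, n_redundant=0,
                                    n_clusters_per_class=1, weights=[0.5, 0.3, 0.2], random_state=42)

# stratify=y asks the splitter to keep class proportions similar in both parts
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)

# Fraction of samples per class in the full, training and test sets
full_props = np.bincount(y) / len(y)
train_props = np.bincount(y_train) / len(y_train)
test_props = np.bincount(y_test) / len(y_test)

print("full :", full_props.round(3))
print("train:", train_props.round(3))
print("test :", test_props.round(3))
```

Without stratify=y, a small test set could by chance contain noticeably fewer samples of the rarest class, which makes the test score less reliable.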
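The following methods can be used to generate datasets for training regression models. The example below uses make_regression and plots the correlations between the generated features and the target:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn import datasets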
#
# Create regression datasets
#
X, y = datasets.make_regression(n_samples=200, n_features=5, n_informative=2, random_state=42)
#
# Create a Pandas DataFrame and compute the correlations
# You could also use the numpy corrcoef method for the same
#
df = pd.DataFrame(X)
df.columns = ['ftre1', 'ftre2', 'ftre3', 'ftre4', 'ftre5']
df['target'] = y
#
# Determine correlations
#
corr = df.corr()
#
# Draw the correlation heatmap
#
f, ax = plt.subplots(figsize=(9, 6))
mask = np.triu(np.ones_like(corr, dtype=bool))
cmap = sns.diverging_palette(230, 20, as_cmap=True)
sns.heatmap(corr, annot=True, mask = mask, cmap=cmap)
Here is what the correlation heatmap looks like for the randomly generated dataset.
Here is how you could fit a linear regression model on a randomly generated regression dataset created using the make_regression method:
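The comment in the code above mentions numpy's corrcoef as an alternative to DataFrame.corr(). A minimal sketch of that route follows. It also passes coef=True, an optional make_regression argument that returns the coefficients of the underlying linear model, so you can see which features were generated as informative. The variable names (target_corr, informative) are hypothetical.

```python
import numpy as np
from sklearn import datasets

# coef=True additionally returns the ground-truth coefficients;
# non-informative features get a coefficient of exactly 0
X, y, coef = datasets.make_regression(n_samples=200, n_features=5, n_informative=2,
                                      coef=True, random_state=42)

# Stack the features and the target side by side; rowvar=False tells
# corrcoef that each column (not each row) is a variable
corr = np.corrcoef(np.column_stack([X, y]), rowvar=False)

# Last row, excluding the final entry: correlation of each feature with the target
target_corr = corr[-1, :-1]

# Indices of the features with a non-zero ground-truth coefficient
informative = np.flatnonzero(coef)

print("feature/target correlations:", target_corr.round(2))
print("informative feature indices:", informative)
```

The informative features will typically show the largest correlations with the target, although a feature with a small coefficient may not stand out from sampling noise.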
from sklearn.linear_model import LinearRegression
from sklearn import datasets
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split
#
# Create regression datasets
#
X, y = datasets.make_regression(n_samples=200, n_features=5, n_informative=2, random_state=42)
#
# Training / test split
#
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
#
# Create pipeline
#
pipeline = make_pipeline(StandardScaler(), LinearRegression())
#
# Fit the model
#
pipeline.fit(X_train, y_train)
#
# Score the model
#
pipeline.score(X_test, y_test), pipeline.score(X_train, y_train)
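The introduction also mentions clustering. As a rough sketch of that use case, make_blobs from sklearn.datasets generates Gaussian blobs that can be fed to a clustering algorithm such as KMeans. The parameter values (three centers, cluster_std=0.8) and the use of the adjusted Rand index as a check are assumptions made for illustration.

```python
from sklearn import datasets
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

# 300 records, 2 features, 3 Gaussian blobs; y_true holds the blob each
# sample was drawn from
X, y_true = datasets.make_blobs(n_samples=300, centers=3, n_features=2,
                                cluster_std=0.8, random_state=42)

# Fit KMeans with the same number of clusters as generated blobs
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)

# Adjusted Rand index compares the found clusters with the true blobs
# (1.0 means a perfect match, regardless of how the labels are numbered)
ari = adjusted_rand_score(y_true, labels)
print("adjusted Rand index:", round(ari, 3))
```

Because the true blob membership is known, generated data like this makes it easy to sanity-check a clustering algorithm before applying it to real data.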
In summary, this post showed how to generate random datasets using Python Sklearn methods: make_moons and make_classification for classification problems, and make_regression for regression problems.