In pandas, data is usually stored in a DataFrame object, and we often need to pick some of its rows or columns at random. In this tutorial, we will discuss how to randomize a dataframe object.
We can use pandas.DataFrame.sample() to randomize a dataframe object.
DataFrame.sample(self: ~FrameOrSeries, n=None, frac=None, replace=False, weights=None, random_state=None, axis=None)
This function returns a random sample of items from one axis of a dataframe object.
Important parameters explained
n: int, the number of items to return from the axis.
replace: boolean, whether to sample with replacement, that is, whether the same item may be returned more than once. The default is False.
weights: the weight of each item in the dataframe when sampling. By default, all items have equal probability.
axis: the axis to sample from. 0 samples rows and 1 samples columns.
We will use some examples to illustrate how to use this function.
Preliminary
First, we need a dataframe object. We will read a csv file to create it.
import pandas as pd

df = pd.read_csv("test_member.csv", sep='\t')
print(df)
The df is:
   No     Name  Age
0   1      Tom   24
1   2     Kate   22
2   3    Alexa   34
3   4     Kate   23
4   5     John   45
5   6     Lily   41
6   7    Bruce   23
7   8      Lin   33
8   9    Brown   31
9  10  Alibama   20
To learn more about reading csv files with pandas, you can read this tutorial:
A Beginner Guide to Python Pandas Read CSV
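If you do not have test_member.csv at hand, the same dataframe can be built in memory. This is only a sketch: the values are copied from the printed table above, and the integer dtypes are an assumption about how the csv file would be parsed.

```python
import pandas as pd

# Rebuild the ten rows shown above without reading test_member.csv.
df = pd.DataFrame({
    "No": list(range(1, 11)),
    "Name": ["Tom", "Kate", "Alexa", "Kate", "John",
             "Lily", "Bruce", "Lin", "Brown", "Alibama"],
    "Age": [24, 22, 34, 23, 45, 41, 23, 33, 31, 20],
})

print(df.shape)  # (10, 3)
```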
Here are some examples to show how to randomize a dataframe object.
1.Get a random element from dataframe
d = df.sample(n=1)
print(type(d))
print(d)
We can set n=1 to get a random row from a dataframe.
The random result is:
<class 'pandas.core.frame.DataFrame'>
   No   Name  Age
8   9  Brown   31
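Because the sample is random, the row you get will usually differ from the one shown here. If you need repeatable results, the random_state parameter from the signature above accepts an integer seed. The sketch below rebuilds the dataframe in memory (an assumption, so that no csv file is needed), and the seed 42 is an arbitrary choice.

```python
import pandas as pd

# In-memory copy of the tutorial data (assumed values from the printed table).
df = pd.DataFrame({
    "No": list(range(1, 11)),
    "Name": ["Tom", "Kate", "Alexa", "Kate", "John",
             "Lily", "Bruce", "Lin", "Brown", "Alibama"],
    "Age": [24, 22, 34, 23, 45, 41, 23, 33, 31, 20],
})

# The same seed gives the same sampled row on every call.
first = df.sample(n=1, random_state=42)
second = df.sample(n=1, random_state=42)

print(first.equals(second))  # True
```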
2.Randomize all rows in dataframe
d = df.sample(n=len(df))
print(type(d))
print(d)
We can set n = len(df) to randomize a dataframe.
The new random dataframe is:
<class 'pandas.core.frame.DataFrame'>
   No     Name  Age
5   6     Lily   41
2   3    Alexa   34
7   8      Lin   33
4   5     John   45
6   7    Bruce   23
0   1      Tom   24
9  10  Alibama   20
3   4     Kate   23
1   2     Kate   22
8   9    Brown   31
All rows in the new dataframe are unique. Because replace is False by default, each original row appears exactly once and only the order changes.
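An equivalent way to shuffle every row is frac=1, which asks for 100% of the rows instead of an explicit count. The sketch below also renumbers the index with reset_index(drop=True), which is optional. The dataframe is rebuilt in memory as an assumption, and the names shuffled and shuffled_reset are only illustrative.

```python
import pandas as pd

# In-memory copy of the tutorial data (assumed values from the printed table).
df = pd.DataFrame({
    "No": list(range(1, 11)),
    "Name": ["Tom", "Kate", "Alexa", "Kate", "John",
             "Lily", "Bruce", "Lin", "Brown", "Alibama"],
    "Age": [24, 22, 34, 23, 45, 41, 23, 33, 31, 20],
})

# frac=1 returns all rows in random order, keeping the original index labels.
shuffled = df.sample(frac=1)

# Optionally discard the old labels and renumber the rows 0..n-1.
shuffled_reset = shuffled.reset_index(drop=True)

print(shuffled.index.is_unique)    # True
print(list(shuffled_reset.index))  # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
```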
3.Get random dataframe with duplicated rows
d = df.sample(len(df), replace=True)
print(type(d))
print(d)
If you set replace = True, rows are sampled with replacement, so the new dataframe may contain duplicated rows.
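One way to see the effect of replace is to ask for more rows than the dataframe has. The sketch below requests 15 rows from the 10-row dataframe, so at least one row must repeat. The dataframe is rebuilt in memory as an assumption, and the flag oversample_failed is only an illustrative name.

```python
import pandas as pd

# In-memory copy of the tutorial data (assumed values from the printed table).
df = pd.DataFrame({
    "No": list(range(1, 11)),
    "Name": ["Tom", "Kate", "Alexa", "Kate", "John",
             "Lily", "Bruce", "Lin", "Brown", "Alibama"],
    "Age": [24, 22, 34, 23, 45, 41, 23, 33, 31, 20],
})

# 15 draws from 10 rows is only possible with replacement,
# so at least one index label has to appear more than once.
d = df.sample(n=15, replace=True)
print(d.index.duplicated().any())  # True

# Without replacement, pandas rejects the same request.
try:
    df.sample(n=15)
    oversample_failed = False
except ValueError:
    oversample_failed = True
print(oversample_failed)  # True
```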
4.Get random dataframe with weights
d = df.sample(n=4, weights=[1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
print(type(d))
print(d)
In this code, we extract 4 (n = 4) random rows from df, and each row is drawn according to the weights [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]. The weights do not sum to 1, so they are normalized to sum to 1. With these weights, rows near the end of df are more likely to be sampled, and the last row is the most likely.
The result is:
<class 'pandas.core.frame.DataFrame'>
   No     Name  Age
9  10  Alibama   20
5   6     Lily   41
6   7    Bruce   23
8   9    Brown   31
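To make the normalization concrete: the weights 1 to 10 sum to 55, so row i is drawn with probability i/55. The sketch below compares these expected probabilities with the frequencies seen in a large sample. It rebuilds the dataframe in memory, and the sample size 10000, replace=True and random_state=0 are assumptions chosen only to make the comparison stable.

```python
import pandas as pd

# In-memory copy of the tutorial data (assumed values from the printed table).
df = pd.DataFrame({
    "No": list(range(1, 11)),
    "Name": ["Tom", "Kate", "Alexa", "Kate", "John",
             "Lily", "Bruce", "Lin", "Brown", "Alibama"],
    "Age": [24, 22, 34, 23, 45, 41, 23, 33, 31, 20],
})

weights = list(range(1, 11))

# Normalized weights: 1/55, 2/55, ..., 10/55, keyed by the No column.
expected = pd.Series(weights, index=df["No"]) / sum(weights)

# Draw many rows with replacement and measure how often each row appears.
draws = df.sample(n=10000, replace=True, weights=weights, random_state=0)
observed = draws["No"].value_counts(normalize=True)

# Side-by-side comparison, aligned on the No values.
comparison = pd.DataFrame({"expected": expected, "observed": observed})
print(comparison)
```

The observed column should be close to the expected one, with row No 10 sampled roughly ten times as often as row No 1.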
5.How to get a row in the new random dataframe
Take the new random dataframe from the previous example:
<class 'pandas.core.frame.DataFrame'>
   No     Name  Age
9  10  Alibama   20
5   6     Lily   41
6   7    Bruce   23
8   9    Brown   31
We can see that the row index labels (9, 5, 6, 8) are kept from the original dataframe and now appear in random order. However, we can still get a row by its position, from 0 to n-1, using iloc.
For example:
print(d.iloc[1])
The result is:
No         6
Name    Lily
Age       41
Name: 5, dtype: object
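The difference between position and label can be sketched as follows. To keep the example deterministic, the four rows are selected with df.loc[[9, 5, 6, 8]] instead of df.sample (an assumption for illustration; a real sample would give a different order), and the dataframe is rebuilt in memory.

```python
import pandas as pd

# In-memory copy of the tutorial data (assumed values from the printed table).
df = pd.DataFrame({
    "No": list(range(1, 11)),
    "Name": ["Tom", "Kate", "Alexa", "Kate", "John",
             "Lily", "Bruce", "Lin", "Brown", "Alibama"],
    "Age": [24, 22, 34, 23, 45, 41, 23, 33, 31, 20],
})

# Mimic the sampled dataframe above: rows labelled 9, 5, 6, 8 in that order.
d = df.loc[[9, 5, 6, 8]]

# iloc selects by position: position 1 is the second row.
by_position = d.iloc[1]
# loc selects by index label: label 5 is the same row.
by_label = d.loc[5]

print(by_position["Name"])  # Lily
print(by_label["Name"])     # Lily

# After reset_index(drop=True), labels and positions coincide again.
d_reset = d.reset_index(drop=True)
print(list(d_reset.index))  # [0, 1, 2, 3]
```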
6.Get random dataframe based on axis
In this tutorial, the dataframe df has shape (10, 3). axis = 0 means we sample the dataframe by row, and axis = 1 means we sample it by column.
Here is an example:
# replace=True is required here: df has only 3 columns but we ask for 4
d = df.sample(4, axis=1, replace=True)
print(type(d))
print(d)
The result is:
<class 'pandas.core.frame.DataFrame'>
      Name  Age  Age  No
0      Tom   24   24   1
1     Kate   22   22   2
2    Alexa   34   34   3
3     Kate   23   23   4
4     John   45   45   5
5     Lily   41   41   6
6    Bruce   23   23   7
7      Lin   33   33   8
8    Brown   31   31   9
9  Alibama   20   20  10
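Column sampling also works without replacement, as long as no more columns are requested than the dataframe has. The sketch below shuffles all three columns and then picks two distinct ones. The dataframe is rebuilt in memory as an assumption, and the names cols_shuffled and two_cols are only illustrative.

```python
import pandas as pd

# In-memory copy of the tutorial data (assumed values from the printed table).
df = pd.DataFrame({
    "No": list(range(1, 11)),
    "Name": ["Tom", "Kate", "Alexa", "Kate", "John",
             "Lily", "Bruce", "Lin", "Brown", "Alibama"],
    "Age": [24, 22, 34, 23, 45, 41, 23, 33, 31, 20],
})

# Shuffle the column order: every column appears exactly once.
cols_shuffled = df.sample(frac=1, axis=1)
print(list(cols_shuffled.columns))

# Pick two distinct columns; axis="columns" is the same as axis=1.
two_cols = df.sample(n=2, axis="columns")
print(list(two_cols.columns))
```

In both cases all ten rows are kept in their original order, since only the column axis is sampled.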