Use the Pandas String-Only get_dummies Method to Instantly Restructure your Data

In this post, you’ll learn how to use the fantastic str.get_dummies Pandas Series method to instantly restructure trapped data within a string. We begin by reading in a small sample dataset containing people's favorite fruits.
import pandas as pddf = pd.read_csv('data/fruits.csv')
df

Notice that the fruit column has multiple fruit names in each cell separated by a pipe character. While this format compactly represents the data, it isn’t the most suitable for answering basic questions such as:
What is the number of fruit per person?How many people enjoy banana?Which people enjoy both oranges and bananas?How any fruits are in-common with each person?Attempt to answer questions in current formIt’s possible to answer these questions in the current format using Pandas, though, we will see how these ways are sub-optimal. Here we find the number of fruit per person by adding one to the count of the pipe characters.
s = df.set_index('name')['fruits']s.str.count(r'\|') + 1name
Ana 5
Bill 2
Calvin 2
Dean 2
Elias 3
Felicia 2
George 5
Henry 2
Name: fruits, dtype: int64
The number of people who enjoy banana.
s.str.contains('banana').sum()3The people that enjoy both oranges and banana.
s.str.contains('(?=.*orange)(?=.*banana)')nameAna True
Bill False
Calvin False
Dean False
Elias False
Felicia False
George True
Henry True
Name: fruits, dtype: bool
Finding the number of fruits in-common with both people is particularly difficult and is a clear case for reformatting the data.
Better formatting of the dataAll of these questions can be bettered answered if the data is in a different format. The get_dummies string-only method will split all values in a single cell into their own columns creating 0/1 indicator variables. Here, we pass in the pipe character to get_dummies, producing the following DataFrame.
df1 = s.str.get_dummies('|')df1

We can now answer the same questions as before. Here, we count the number of fruit for each person.
df1.sum(axis=1)nameAna 5
Bill 2
Calvin 2
Dean 2
Elias 3
Felicia 2
George 5
Henry 2
dtype: int64
We sum up a single column to count the total number of people who enjoy bananas.
df1['banana'].sum()3Here, we use the query method to select each person who likes both oranges and bananas.
df1.query('orange + banana == 2')
Finding the number of fruits in-common with each person is where the largest gain from restructuring comes from. Here, we use the matrix multiplication operator to multiply the DataFrame to itself.
df1 @ df1.T
Master Data Analysis with PythonIf you enjoyed this tip and area looking to become an expert with Pandas, then check out my extremely comprehensive book, Master Data Analysis with Python. It is the most comprehensive Pandas book available, comes with 500+ exercises, video tutorials, and certification exams.
[image error]Use the Pandas String-Only get_dummies Method to Instantly Restructure your Data was originally published in Dunder Data on Medium, where people are continuing the conversation by highlighting and responding to this story.


