Efficient Data Analysis with Pandas Groupby

Getting Started with Pandas Groupby

The Python library Pandas has become one of the most essential tools for data manipulation and analysis. Its groupby method lets you split data into groups and summarize each one concisely. In this article, we'll go through the basics of Pandas groupby and explore how to use it in combination with count, value_counts, and other functions for efficient data analysis.

The groupby function in Pandas divides a DataFrame into groups based on one or more columns. You can then perform aggregation, transformation, or other operations on these groups. Here's a step-by-step breakdown of how it works, followed by a minimal example:

  • Split: You specify one or more columns by which you want to group your data. These columns are often referred to as “grouping keys.”
  • Apply: You apply an aggregation function, transformation, or any custom function to each group. Common aggregation functions include sum, mean, count, max, min, and more.
  • Combine: Pandas combines the results of the applied function for each group, giving you a new DataFrame or Series with the summarized data.
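
To make the split-apply-combine pattern concrete, here is a minimal sketch on a tiny, made-up DataFrame (the city and sales columns are purely illustrative):

import pandas as pd

toy = pd.DataFrame({
    'city': ['Oslo', 'Oslo', 'Bergen', 'Bergen'],
    'sales': [10, 20, 5, 15],
})

# Split into one group per unique 'city', sum 'sales' within each group,
# and combine the per-group results into a single Series
print(toy.groupby('city')['sales'].sum())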

Before we dive deep into groupby functionality, let’s create a hypothetical dataset to work with. Imagine we are analyzing a dataset containing sales data for the past month. Our goal is to analyze this data to gain insights into product sales, customer behavior, and revenue generation. The dataset might look something like this:

import pandas as pd
import numpy as np

# Create a sample sales dataset
np.random.seed(42)
date_rng = pd.date_range(start='2023-08-01', end='2023-12-31', freq='D')
data = {
    'Date': np.random.choice(date_rng, size=100),
    'Product_Category': np.random.choice(['Electronics', 'Clothing', 'Groceries'], size=100),
    'Product_Name': np.random.choice(['Laptop', 'T-shirt', 'Banana', 'Phone', 'Jeans', 'TV'], size=100),
    'Customer_ID': np.random.randint(1, 101, size=100),
    'Quantity_Sold': np.random.randint(1, 10, size=100),
    'Price_Per_Unit': np.random.uniform(10, 1000, size=100).round(2),
}

df = pd.DataFrame(data)

# Calculate 'Total_Revenue' for each transaction
df['Total_Revenue'] = df['Quantity_Sold'] * df['Price_Per_Unit']

Now that we have our dataset, let’s explore how to use groupby for data analysis.

First of all, we might want to group the data by Product_Category. After grouping the data, we can apply various aggregate functions to each group, for example, count to see how many transactions fall into each category:

product_categories = df.groupby('Product_Category')
product_categories['Date'].count()

Here is the output of our example code:

Product_Category
Clothing       26
Electronics    35
Groceries      39

Counting Data with Pandas Groupby

The first aggregation function we have already mentioned is count. It counts the number of non-null entries in a DataFrame or Series. In our example, we can group by Product_Name or Customer_ID and count the entries in each group; the resulting DataFrame shows, for every unique value of the grouping key, the number of non-null values in each of the other columns. Let's try it with Product_Name:

df.groupby('Product_Name').count()

The resulting DataFrame would look like this:

              Date  Product_Category  Customer_ID  Quantity_Sold  Price_Per_Unit  Total_Revenue
Product_Name
Banana          19                19           19             19              19             19
Jeans            7                 7            7              7               7              7
Laptop          26                26           26             26              26             26
Phone           19                19           19             19              19             19
T-shirt         14                14           14             14              14             14
TV              15                15           15             15              15             15
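
Note that count only tallies non-null values, so the numbers can differ across columns if some cells are missing. If you just want the number of rows in each group, regardless of missing data, size is the more direct choice:

# size() counts rows per group; count() counts non-null cells per column
df.groupby('Product_Name').size()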

Value Counting Data with Pandas Groupby

Sometimes, you might be interested in counting the occurrences of unique values within each group. In this case, you need the value_counts function. It is useful for understanding the distribution of categorical or discrete data within subsets of your dataset.

In our sales dataset, let’s say we want to find the product that has been sold the least number of times. We can use the value_counts function to achieve this.

product_occurrences = df['Product_Name'].value_counts()
min_occurrence_product = product_occurrences.idxmin()
print(min_occurrence_product)

In this code, we first use value_counts to count the occurrences of each product, and then we use idxmin to find the product with the minimum occurrence. In our dataset, this would return ‘Jeans’.

Another way to use value_counts is to analyze the distribution of customers by counting how many transactions each customer made:

customer_purchase_counts = df['Customer_ID'].value_counts()

print("Distribution of customers by number of transactions:")
print(customer_purchase_counts.describe())

This example uses describe to provide summary statistics of customer purchase counts, including mean, standard deviation, and quartiles:

Distribution of customers by number of transactions:

count    61.000000
mean      1.639344
std       1.000546
min       1.000000
25%       1.000000
50%       1.000000
75%       2.000000
max       5.000000
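
value_counts can also be applied per group, which is where it pairs naturally with groupby. For example, to see how often each product name appears within each category:

# Occurrences of each product name within each product category
df.groupby('Product_Category')['Product_Name'].value_counts()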

Applying Other Aggregate Functions with Pandas Groupby

In addition to the count function, Pandas groupby features allow you to apply various other aggregate functions to perform in-depth data analysis. These functions can help you gain valuable insights into your data by summarizing, calculating statistics, or applying custom aggregation operations to grouped data. In this section, we’ll explore some commonly used aggregate functions you can apply with groupby.

Applying standard aggregation functions: sum, mean, median, min, and max

You can use the sum function to calculate the sum of numeric values within each group. For example, if you want to know which product category generated the most revenue in the past month, you can use groupby to group by the ‘Product_Category’ column and calculate the sum of ‘Total_Revenue’ for each category:

category_revenue = df.groupby('Product_Category')['Total_Revenue'].sum()

print("Total revenue by category:")
print(category_revenue)

This will give us a Series with the total revenue for each product category:

Total revenue by category:
Product_Category
Clothing        69969.61
Electronics    100362.45
Groceries       96215.87

Other statistical functions can be used in the same way as sum to calculate statistics within each group, for instance to find the average (mean), median (median), minimum (min), and maximum (max) quantity sold for each product, as in the example below.
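
A compact way to compute several of these statistics at once is to pass their names to agg. Here is one possible version for the quantity sold per product:

# Mean, median, min, and max quantity sold for each product
df.groupby('Product_Name')['Quantity_Sold'].agg(['mean', 'median', 'min', 'max'])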


Applying custom aggregation with agg

The agg function allows you to apply custom aggregation functions to different columns within each group. For example, you might want to find both the total quantity sold and the average price per unit for each product:

product_groups = df.groupby('Product_Name')
custom_aggregations = {
    'Quantity_Sold': 'sum',    # calculate total quantity sold
    'Price_Per_Unit': 'mean'   # calculate average price per unit
}
product_groups.agg(custom_aggregations)

This will give us the following result:

              Quantity_Sold  Price_Per_Unit
Product_Name
Banana                  103      583.950526
Jeans                    32      592.764286
Laptop                  115      559.973077
Phone                    98      493.672632
T-shirt                  59      623.602857
TV                       85      497.064667

Handling Missing Data with Pandas Groupby

Combined with groupby, Pandas provides several ways to handle missing data within groups: counting missing values, filling them with specific strategies, dropping rows that contain them, and applying custom handling. These techniques let you manage missing data efficiently when performing group-wise analysis, ensuring that your analyses are both accurate and insightful. In this section, we'll explore each of them. We will use the same dataset we generated before, but with several values in "Quantity_Sold" and "Price_Per_Unit" replaced by missing data.
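
The exact positions of the missing values are not important for what follows; one possible way to introduce comparable gaps (shown here only for "Price_Per_Unit", so your counts may differ from the ones below) is:

# Illustrative only: randomly blank out some prices to simulate missing data
rng = np.random.default_rng(0)
missing_rows = rng.choice(df.index, size=20, replace=False)
df.loc[missing_rows, 'Price_Per_Unit'] = np.nan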

Counting missing values with isna and sum

The first step in handling missing data is to get the count of missing values in a dataset. You can count missing values within each unique group using the isna function in combination with sum. For instance, to count the number of missing values in the “Price_Per_Unit” column for each category:

category_groups = df.groupby('Product_Category')
category_groups['Price_Per_Unit'].apply(lambda x: x.isna().sum())

This gives us the following Series with the count of missing values in each category:

Product_Category
Clothing        4
Electronics    11
Groceries       8

Filling missing values with fillna

Once you find missing values, you might want to fill them with specific values or strategies within each group. The fillna function can be used for this purpose. For example, to fill in missing values in the “Price_Per_Unit” column with the mean of the non-missing values within each category:

category_groups = df.groupby('Product_Category')
df['Price_Per_Unit'] = category_groups['Price_Per_Unit'].transform(lambda x: x.fillna(x.mean()))

This code will replace missing values in “Price_Per_Unit” with the mean of non-missing values.
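
A quick way to confirm the fill worked is to recount the missing values, which should now be zero (assuming every category had at least one non-missing price to compute a mean from):

# Should print 0 after the group-wise fill
print(df['Price_Per_Unit'].isna().sum())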

Dropping rows with missing values with dropna

Sometimes, it is necessary to remove rows containing missing values within each group. In this case, you can use the dropna function. For instance, to drop rows with missing values within each category:

category_groups = df.groupby('Product_Category')
df_cleaned = category_groups.apply(lambda x: x.dropna())

This code returns a DataFrame in which rows containing missing values have been removed within each category group. Note that the result carries the group key as an extra index level; in this simple case, df.dropna() would remove the same rows without the extra index.

Applying custom handling for missing values

In some cases, you may want to apply custom handling for missing values within each group. You can use a custom aggregation function with agg to achieve this. For example, to calculate the mean of non-missing values and count the number of missing values for the “Price_Per_Unit” column within each category:

category_groups = df.groupby('Product_Category')
custom_aggregations = {
    'Price_Per_Unit': ['mean', lambda x: x.isna().sum()]
}
category_groups.agg(custom_aggregations)

This code will create the following DataFrame:

                 Price_Per_Unit
                           mean <lambda_0>
Product_Category
Clothing             595.775331          4
Electronics          571.908380          8
Groceries            494.863710          9

Grouping Time Series Data with Pandas Groupby

groupby can also be used for grouping and analyzing time series data. Time series data typically includes timestamps or dates; you can use these temporal values to group and aggregate your data effectively. Whether you’re interested in aggregating data by specific time periods, custom intervals, or components of timestamps, Pandas makes it easy to extract valuable insights from time series datasets. In this section, we’ll explore how to group time series data using the same dataset for examples.

Grouping by year and month

Time series data can be grouped by year and month (or month and day, day and hour, etc.) for a more granular analysis. To group the data by year and month and calculate the maximum “Price_Per_Unit” for each period, we use the following code:

year_month_groups = df.groupby([df['Date'].dt.year, df['Date'].dt.month])
year_month_groups['Price_Per_Unit'].max()

Running the code, we get the following:

Date  Date
2023  8       990.60
      9       993.04
      10      969.61
      11      931.45
      12      986.77
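
An equivalent and often more readable approach is to group by calendar periods with pd.Grouper. The sketch below groups by month end; note that recent pandas versions spell the monthly alias 'ME' instead of 'M':

# Group by calendar month using pd.Grouper instead of the dt accessors
monthly_groups = df.groupby(pd.Grouper(key='Date', freq='M'))
monthly_groups['Price_Per_Unit'].max()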

Applying custom functions

Custom aggregation functions can be used in combination with grouped time series data for more sophisticated analysis. For instance, we could calculate the difference between the maximum and minimum “Price_Per_Unit” for each month:

custom_aggregation = lambda x: x.max() - x.min()
year_month_groups['Price_Per_Unit'].apply(custom_aggregation)

This gives us the following:

Date  Date
2023  8       969.36
      9       972.31
      10      947.58
      11      838.78
      12      896.23

These examples demonstrate that using groupby for your data analysis can make it more efficient and insightful. It is very flexible and offers a wide range of aggregation and transformation functions that can be applied to support even complex data investigations. On top of that, groupby relies on vectorized, memory-efficient operations under the hood, so grouped computations stay fast even on fairly large datasets. All this makes groupby a powerful tool for data exploration, particularly when working with structured data. Whether you're exploring sales data or any other dataset, it will help you efficiently summarize and gain valuable insights from your data.
