Stackify is now BMC. Read theBlog

Developer’s Guide to Getting Started with Pandas Profiling

By: Stackify
  |  August 6, 2024
Developer’s Guide to Getting Started with Pandas Profiling

Exploratory data analysis is a key component of the machine learning pipeline that helps in understanding various aspects of a dataset. For example, you can learn about statistical properties, types of data, the presence of null values, the correlation among different variables, etc. But to get these details, you need to use different types of Python methods and write multiple lines of code.

What if there’s some tool or library that can help you understand all these properties from a dataset with a few lines of code and with less complexity? Well, you’re in luck. The pandas profiling library from Python can help you get all this information in detail with very little effort.

In this post, you’ll learn about the pandas profiling library with examples, best practices, and practical implementation on how the library extracts more information out of your datasets in the real world.

What Is Pandas Profiling?

Pandas profiling is a Python library that generates interactive HTML reports containing a comprehensive dataset summary. It automates the exploratory data analysis (EDA) process, saving time and effort for data scientists and analysts.

Pandas profiling empowers users to make informed decisions and accelerate the data analysis pipeline by offering insights into data quality, distribution, relationships, and potential issues. Profiling capabilities are built on top of the pandas library, leveraging its data manipulation capabilities and offering a wide range of features:

  • Descriptive statistics for numerical and categorical variables
  • Correlations matrices and scatter plots to explore relationships between variables
  • Data type information for each column
  • Identification and visualization of missing values
  • Categorical variable analysis using frequency distributions, mode, and top categories
  • Numerical variable analysis with quantiles, mean, standard deviation, histograms, and box plots

Practical Example

Let’s understand the pandas profiling advantage by looking at an example with Python code.

To illustrate pandas profiling’s capabilities, you should consider a hypothetical dataset containing information about meteorites, which is available in pandas profiling itself.

Let’s begin with importing the necessary Python modules:

import numpy as np
import pandas as pd
import requests

from pathlib import Path
from ydata_profiling.utils.cache import cache_file

# load dataset
file_name = cache_file(
    "meteorites.csv",
    "https://data.nasa.gov/api/views/gh4g-9sfh/rows.csv?accessType=DOWNLOAD",
)

df = pd.read_csv(file_name)

# preprocess dataset
df["year"] = pd.to_datetime(df["year"], errors="coerce")

# Example: Constant variable
df["source"] = "NASA"

# Example: Boolean variable
df["boolean"] = np.random.choice([True, False], df.shape[0])

# Example: Mixed with base types
df["mixed"] = np.random.choice([1, "A"], df.shape[0])

# Example: Highly correlated variables
df["reclat_city"] = df["reclat"] + np.random.normal(scale=5, size=(len(df)))

# Example: Duplicate observations
duplicates_to_add = pd.DataFrame(df.iloc[0:10])
duplicates_to_add["name"] = duplicates_to_add["name"] + " copy"

df = pd.concat([df, duplicates_to_add], ignore_index=True)

# generate report
report = df.profile_report(
    sort=None, html={"style": {"full_width": True}}, progress_bar=False
)

profile_report = df.profile_report(html={"style": {"full_width": True}})
profile_report.to_file("/tmp/example.html")

Upon running the above code, an HTML report will be stored in the “temp/” folder. Don’t worry if you don’t understand the code here; we’ll break it down in the upcoming section for better understanding. The generated HTML report will provide a detailed overview of the dataset, including:

  • Overview – summary statistics of the dataset (number of rows, columns, missing values)
  • Variables – detailed information about each column (type, counts, unique values, missing values, quantiles)
  • Correlations correlation matrix between numerical variables
  • Missing values – visualization of missing values patterns
  • Sample – a random sample of the data

Detailed Report Analysis

The Overview section provides a high-level dataset summary, including the number of rows, columns, and missing values. This information is crucial for understanding the dataset’s size and completeness.

The dataset has 14 columns, more than 45,000 rows, and almost 29,000 missing rows, as shown in the above image.

The Variables section offers detailed insights into each column, including data type, unique values, missing values, and statistical summaries. This section helps identify potential data-quality issues, such as inconsistent data types or excessive missing values. You also get the option to choose a particular column from the drop-down menu.

The Correlations section reveals relationships between numerical variables. A high correlation between two variables suggests a robust linear relationship, which you can explore further using scatter plots.

The coefficient near 1 shows a high positive correlation between the two variables, while -1 shows a negative correlation.

The Missing values section visualizes the missing values pattern, helping to identify potential causes and implications for data analysis.

All columns have high missing values, as shown in the image above.

The Sample section provides a random sample of the data, allowing for a quick visual inspection of the data distribution and identifying potential outliers or anomalies.

This image shows the first 10 rows of the dataset. However, you also can check the last 10 rows by clicking “Last rows.”

How to Get Started with Pandas Profiling

Now, let’s go through the steps to work with the pandas profiling installation to generate and analyze the report.

Setup

The pandas profiling project setup requires you to install pandas profiling with other libraries.

Requirements

You need to install the following libraries to work with the project to use pandas profiling:

  • Python (version 3.6 or later)
  • Pandas
  • Jinja
  • HTML
  • Plotly
  • NumPy
  • SciPy

Installation

To install pandas profiling, use the following command:

!pip install pandas-profiling

Basic Functions

After installing the necessary libraries, be ready to play with some Python code to perform analysis quickly.

Loading Data

Before generating a profile, load your data into a pandas DataFrame and clean the dataset to perform the EDA using pandas profiling.

# define filename
file_name = cache_file(
    "meteorites.csv",
    "https://data.nasa.gov/api/views/gh4g-9sfh/rows.csv?accessType=DOWNLOAD",
)

# read dataset
df = pd.read_csv(file_name)

Here, the read_csv() function from the pandas library is loading the CSV file as a dataframe.

Generating a Profile Report

Create a profile report using the pandas function (an implementation of pandas profiling).

report = df.profile_report(
    sort=None, html={"style": {"full_width": True}}, progress_bar=False
)
report

Here, the profile_report() function creates the profile report of the dataframe.

Advanced Usage

Now that you’ve generated the report using basic features, you should be curious about what more you can do with pandas profiling. This section provides more advanced usage of pandas profiling to overcome your curiosity.

Customization

Pandas profiling offers several options for customizing the report:

  • explorative – enables more in-depth analysis (default: true)
  • minimal – creates a more concise report (default: false)
  • HTML – save the report as an HTML file (default: true)
  • title – set a custom title for the report

Handling Large Datasets

For large datasets, consider using the explorative=False option to improve performance. Additionally, you can sample the data before generating the profile to reduce processing time.

Integrations

Pandas profiling can be integrated with other tools and libraries to enhance the functionality. For example, you can embed the report in a web application framework like Flask or Django.

Interpreting the Report

Reports provide valuable insights into data quality, distribution, and relationships. Here are some key considerations:

  • Missing values – Identify columns with high missing value percentages and investigate potential causes
  • Data types – Ensure data types are correct and consistent across columns
  • Outliers – Detect extreme values that might affect analysis and consider appropriate handling techniques
  • Correlations – Explore relationships between variables and identify potential dependencies
  • Distributions – Understand data distributions to inform modeling and feature engineering

The code used in this article can be found here.

Pandas Profiling and Data Analysis

Pandas profiling is a powerful tool for accelerating data exploration and analysis. You can use it to:

  • Understand data quickly: Gain insights into data characteristics and structure
  • Identify data-quality issues: Detect inconsistencies, missing values, and outliers
  • Explore relationships: Discover correlations and dependencies between variables
  • Communicate findings: Share informative reports with stakeholders

Focus on Developers

Developers can leverage pandas profiling to:

  • Streamline data exploration: Quickly understand new datasets and their properties
  • Accelerate development: Use profiling to inform data cleaning, pre-processing, and feature engineering
  • Build data-driven applications: Integrate profiling into application development workflows

Real-World Use Cases

Pandas profiling has been used across various industries, including:

  • finance – Detect anomalies in financial data, assess credit risk, and optimize investment portfolios
  • health care – Analyze patient data to identify disease patterns, optimize treatment plans, and improve patient outcomes
  • marketing – Understand customer behavior, predict customer churn, and optimize marketing campaigns
  • e-commerce – Analyze sales data to identify product trends, optimize inventory management, and personalize customer experiences

You can also check out some fascinating applications of Python where pandas profiling can be used.

Best Practices

To maximize its benefits of this valuable tool, which can significantly enhance data exploration quality and understanding, consider the following best practices:

Early Adoption

Employ immediately upon data ingestion to establish a baseline understanding of the dataset’s structure, quality, and potential issues. Also, you can utilize pandas profiling as a foundational tool for EDA, uncovering patterns, anomalies, and relationships within the data.

Integration with Development Tools

Incorporate generated reports into version control systems (e.g., Git) to track data quality and distribution drift over time. Additionally, you can integrate pandas profiling into continuous integration and continuous delivery (CI/CD) pipelines to ensure data-quality checks are automated and consistently applied.

Collaboration and Knowledge Sharing

Establish a consistent reporting format using to facilitate effective collaboration and knowledge sharing among team members. Moreover, document the insights you discovered from pandas profiling reports to provide valuable context for future analysis and model development.

Maintenance and Evolution

Re-run frequently on updated datasets to monitor data characteristic changes and identify potential issues. Additionally, as the project progresses, consider refining the pandas profiling configuration to focus on specific areas of interest or to optimize performance for larger datasets.

Limitations and Alternatives of Pandas Profiling

While pandas profiling is a valuable tool, it has some limitations:

  • Performance – May be slow for large datasets
  • Customization – Options are limited compared with some other tools
  • Depth – Provides a general overview but may not delve into specific analysis needs

Several alternative tools offer similar functionalities, including:

  • Sweetviz – Provides interactive visualizations and comparisons between datasets
  • DataProfiler – Offers detailed reports with customization options
  • AutoViz – Automatically generates visualizations for exploratory data analysis

When choosing a profiling tool, consider the size of your dataset, the level of customization required, and the specific insights you need to extract.

Pandas Profiling vs. Y-Data Profiling

Pandas profiling is being renamed to ydata-profiling with version 4.0, focusing on performance and flexibility.

Improve All Your Python Application Monitoring

For more advanced tips and best practices for monitoring all your Python applications, check out Stackify’s guide on optimizing Python code. Better still, start your free trial of Stackify Retrace today and see how full lifecycle APM helps you maintain code quality and performance when using Python or any other programming language.

Conclusion

Pandas profiling is an essential library for data scientists and analysts to explore data efficiently and a valuable tool for developers of finance, health care, marketing, e-commerce, and other applications benefitting from data analysis. A comprehensive dataset overview accelerates the data analysis process and enables informed decision-making.

By effectively utilizing pandas profiling, you can improve the quality and efficiency of your data-driven application development projects. You’ll also gain a significant advantage in your data-driven projects by mastering pandas profiling. Moreover, you can improve data quality and build robust data-driven applications by leveraging the profiling insights you get.

Improve Your Code with Retrace APM

Stackify's APM tools are used by thousands of .NET, Java, PHP, Node.js, Python, & Ruby developers all over the world.
Explore Retrace's product features to learn more.

Learn More

Want to contribute to the Stackify blog?

If you would like to be a guest contributor to the Stackify blog please reach out to [email protected]