Diwali Sales Analysis

Hello, readers!

We are excited to kick off an insightful project focused on analyzing data related to Diwali sales. This initiative aims to delve deep into the dataset and uncover valuable trends and patterns. By conducting thorough analyses, we will explore how various factors contribute to sales during the festive season.

The dataset we will be using contains several columns, each representing key aspects of the sales data. Our analysis will systematically examine these columns to provide a comprehensive understanding of the dataset. From sales performance across different categories to regional trends, we’ll explore all the significant details this data has to offer.

For your convenience, the dataset in CSV format is attached below. Feel free to download it and follow along with us as we embark on this exciting journey of data analysis.

Let’s uncover the story behind the numbers!

Diwali Sales Analysis GitHub file 🔗

We will use Jupyter Notebook and Python for the analysis portion of this project.

Setting Up the Project Environment

To begin, we need to set up the working environment for our analysis in Jupyter Notebook. Follow these steps:

  1. Choose the Folder: Select the folder where you want to create your Python 3 file. This will serve as the workspace for the project.

  2. Create a New File: Click the “New” button, as shown in the provided image, and choose "Python 3" to create a new notebook file.

  3. Rename the File: After creating the file, you can rename it to a meaningful title of your choice. For this tutorial, I’ve already created a file named “Diwali_sales_analysis.ipynb.”

Installing Required Packages

Before diving into the data analysis, we need to ensure that all the necessary Python libraries are installed. These packages are essential for handling and visualizing data effectively. We'll walk through the installation process and confirm that our environment is ready for analysis.

Once this setup is complete, we’ll move on to loading and exploring the dataset!

!pip install numpy
!pip install pandas
!pip install matplotlib
!pip install seaborn

Here’s a brief overview of the working and purpose of each package:

1. NumPy (Numerical Python)

NumPy is a fundamental library for numerical computing in Python.

  • Key Features:

    • Efficient handling of large, multi-dimensional arrays and matrices.

    • Provides mathematical functions for linear algebra, random numbers, and more.

  • Use in Data Analysis:
    NumPy is often used to perform calculations and transformations on numerical data efficiently. It forms the foundation for other libraries like Pandas.


2. Pandas

Pandas is a library designed for data manipulation and analysis.

  • Key Features:

    • Provides data structures like DataFrame (2D) and Series (1D) to organize data.

    • Includes functionality to clean, filter, and transform data.

    • Supports reading and writing data in multiple formats like CSV, Excel, SQL, etc.

  • Use in Data Analysis:
    Pandas is widely used for exploring and preparing datasets. It helps in operations like grouping, sorting, merging, and reshaping data.


3. Matplotlib

Matplotlib is a comprehensive library for creating static, animated, and interactive visualizations.

  • Key Features:

    • Supports a wide variety of plots, such as line plots, bar plots, histograms, and scatter plots.

    • Offers control over plot elements like labels, legends, axes, and styles.

  • Use in Data Analysis:
    Matplotlib helps in visualizing data trends and distributions, which is essential for understanding the dataset and presenting findings.


4. Seaborn

Seaborn is a data visualization library built on top of Matplotlib.

  • Key Features:

    • Simplifies creating complex visualizations with minimal code.

    • Includes specialized plots like heatmaps, violin plots, pair plots, and categorical plots.

    • Offers enhanced aesthetics with built-in themes and color palettes.

  • Use in Data Analysis:
    Seaborn is ideal for creating attractive and informative statistical graphics, making data interpretation intuitive and visually appealing.

Importing and Setting Up Packages for Analysis

In this step, we import all the required Python libraries needed for our analysis. Importing packages ensures that we can access their full range of functionalities, such as numerical computations, data manipulation, and visualization tools.

# import python libraries

import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt # visualizing data
%matplotlib inline
import seaborn as sns

To streamline our workflow and make the code easier to read, we often rename packages using standard conventions. This allows us to reference these libraries more conveniently in our analysis. For instance:

  • NumPy is typically imported as np.

  • Pandas is commonly imported as pd.

  • Matplotlib.pyplot is often shortened to plt.

  • Seaborn is usually referred to as sns.

Renaming these packages helps to maintain clarity and consistency, especially in larger projects where they are referenced multiple times.

With all the packages successfully imported, we’ll proceed to load the dataset and prepare it for analysis. Let’s get started!


Importing the CSV File into Python

To load the dataset for analysis, we use the Pandas library, which provides a convenient function called pd.read_csv() for importing CSV files into Python. Below is the command we are using:


# Reading the CSV file
data = pd.read_csv('Diwali Sales Data.csv', encoding='unicode_escape')

Explanation of the Code

  1. pd.read_csv():
    This is a Pandas function used to read a CSV (Comma-Separated Values) file and load it into a DataFrame, a two-dimensional table-like data structure in Pandas.

  2. File Location:

    • 'Diwali Sales Data.csv': This is the file name (and location, if applicable) we want to load. In this example, the file should be in the current working directory. If it’s located elsewhere, you must provide the full file path, like 'C:/Users/YourName/Documents/Diwali Sales Data.csv'.
  3. Encoding:

    • The parameter encoding='unicode_escape' ensures that special or non-ASCII characters in the dataset are handled correctly without errors.

    • Without proper encoding, reading files with special characters might lead to issues like decoding errors.

  4. Storing the Data:

    • The command assigns the imported data to the variable data, which becomes a Pandas DataFrame. This allows us to access and manipulate the dataset efficiently for our analysis.

Why Use Pandas for Importing Data?

The pd.read_csv() function offers several advantages:

  • It is fast and efficient, even with large datasets.

  • Automatically infers column headers, data types, and formats.

  • Provides options to handle missing values, delimiters, encodings, and more.
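To make the loading step concrete, here is a minimal sketch of pd.read_csv using an in-memory CSV via io.StringIO instead of the real 'Diwali Sales Data.csv' (the column names and values here are invented for illustration):

```python
import io
import pandas as pd

csv_text = "Gender,State,Amount\nF,Maharashtra,23952\nM,Delhi,23934\n"

# For a file on disk you would instead write:
# data = pd.read_csv('Diwali Sales Data.csv', encoding='unicode_escape')
data = pd.read_csv(io.StringIO(csv_text))

print(data.shape)  # (2, 3): two rows, three columns
```

The same call works identically whether the source is a file path, a URL, or a file-like object.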


Exploring the Dataset Using Pandas Inbuilt Functions

Pandas provides several powerful inbuilt functions that allow us to quickly understand the structure and contents of a dataset. These functions make it easy to perform an initial exploration of the data. Below are the key functions we will use, along with their purpose and detailed explanation:


1. df.shape

  • Usage:

      df.shape
    
  • Purpose:
    The shape attribute is used to determine the dimensions of the dataset.

    • It returns a tuple: (number_of_rows, number_of_columns).

    • This helps you quickly understand the size of your dataset.

Example Output:
If your dataset contains 11251 rows and 15 columns, df.shape will return:

(11251, 15)


2. df.head()

  • Usage:

      df.head()
    

    or

      df.head(n)
    

    Here, n is an optional argument specifying the number of rows to display.

  • Purpose:

    • This function shows the first few rows of the dataset, providing a snapshot of the data.

    • By default, it displays the first 5 rows, but you can specify any number of rows you want to see by passing it as an argument, e.g., df.head(10) will show the first 10 rows.

Example:
To see the first 10 rows of the dataset:

df.head(10)

Output Example:
You will get a table with the first 10 rows and all available columns. This is helpful for verifying that the data has been imported correctly and getting a glimpse of its content.


3. df.info()

  • Usage:

      df.info()
    
  • Purpose:

    • Provides a summary of the dataset, including:

      • The number of non-null entries in each column (i.e., data presence).

      • The names of all the columns.

      • The data types of each column (e.g., integers, floats, strings, etc.).

      • The total number of rows and columns.

    • This function is essential for understanding the dataset's structure and identifying any missing or inconsistent data.

Example Output:

Running df.info() prints a summary that shows:

  • The range of the index.

  • Column details, including non-null counts and data types.

  • Memory usage, which helps gauge the size of the dataset.
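To see these three functions side by side, here is a sketch on a small hand-made DataFrame (the columns mimic the sales data, but the values are invented):

```python
import pandas as pd

df = pd.DataFrame({
    'Gender': ['F', 'M', 'F'],
    'State': ['Maharashtra', 'Delhi', 'Karnataka'],
    'Amount': [23952.0, 23934.0, None],   # one missing value on purpose
})

print(df.shape)    # (3, 3): three rows, three columns
print(df.head(2))  # first two rows only
df.info()          # column names, non-null counts, dtypes, memory usage
```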

Why Use These Functions?

These functions provide a quick and efficient way to perform Exploratory Data Analysis (EDA) by:

  1. Verifying the dataset's dimensions and structure.

  2. Getting a snapshot of the data.

  3. Checking for missing values, incorrect data types, or inconsistencies.


Dropping Unnecessary Columns from the Dataset

The Pandas library provides the df.drop() method, which allows us to remove specific rows or columns from the dataset. Let’s break down the code:


Code Explanation

df.drop(['Status', 'unnamed1'], axis=1, inplace=True)

Parameters in the Code:

  1. ['Status', 'unnamed1']:

    • This is a list of column names we want to remove from the dataset. In this example, the columns named 'Status' and 'unnamed1' are being dropped.

    • These columns may contain redundant, irrelevant, or placeholder data, and removing them simplifies our analysis.

  2. axis=1:

    • The axis parameter specifies whether to drop rows or columns:

      • axis=0: Drops rows.

      • axis=1: Drops columns.

    • Here, axis=1 indicates that we’re targeting columns for removal.

  3. inplace=True:

    • The inplace parameter specifies whether to modify the original DataFrame directly or create a new one:

      • True: Changes are applied directly to the original DataFrame.

      • False (default): Returns a new DataFrame with the specified columns/rows removed, leaving the original DataFrame unchanged.

    • By setting inplace=True, we make the changes permanent in the current DataFrame without needing to reassign it to a new variable.


What Happens After Execution?

  • The Status and unnamed1 columns will be permanently removed from the dataset.

  • This operation makes the dataset cleaner and eliminates unnecessary information that could clutter or complicate the analysis.


Why Drop Columns?

  • Relevance: The columns might not contribute to the objectives of the analysis.

  • Cleaning: Removing placeholder or automatically generated columns (like 'unnamed1') helps maintain clarity.

  • Optimization: Reducing the dataset’s size can improve computational efficiency for large datasets.
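The effect of df.drop() can be sketched on a tiny hypothetical frame containing the two throwaway columns from the tutorial:

```python
import pandas as pd

# invented data; 'Status' and 'unnamed1' stand in for the redundant columns
df = pd.DataFrame({
    'Status': ['ok', 'ok'],
    'unnamed1': [None, None],
    'Amount': [100, 200],
})

df.drop(['Status', 'unnamed1'], axis=1, inplace=True)
print(list(df.columns))  # ['Amount']
```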


Checking for Null Values in the Dataset

When working with datasets, it’s essential to identify and handle missing or null values, as they can significantly impact data analysis and model performance. In this step, we use the Pandas isnull() and sum() functions to detect null values in the dataset.


Code Explanation

# Checking for null values
pd.isnull(df).sum()

Step-by-Step Explanation:

  1. pd.isnull(df):

    • The isnull() function checks each cell in the DataFrame and returns a boolean value:

      • True if the value is null (i.e., missing or NaN).

      • False if the value is non-null.

    • This creates a DataFrame of the same dimensions as the input, but filled with boolean values indicating the presence of nulls.

  2. .sum():

    • The sum() function is then applied to the boolean DataFrame to compute the total number of True values (null values) for each column.

    • The result is a Pandas Series, where:

      • The index represents the column names.

      • The corresponding values indicate the count of null values in that column.


What Does the Output Look Like?

After running the code, the output is a Series listing the null count for each column. For this dataset, it tells us:

  • All columns except Amount have no missing values.

  • Amount has 12 missing values.


Why Check for Null Values?

  • Data Quality: Missing values can affect the accuracy of data analysis and machine learning models.

  • Decision-Making: Helps decide how to handle null values (e.g., removal, imputation, etc.).

  • Dataset Understanding: Provides insight into potential gaps in data collection.
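The isnull/sum combination can be sketched on a small invented frame with a single deliberately missing value:

```python
import pandas as pd

df = pd.DataFrame({
    'Gender': ['F', 'M', 'F'],
    'Amount': [100.0, None, 300.0],  # one missing Amount
})

null_counts = pd.isnull(df).sum()
print(null_counts)  # Gender: 0, Amount: 1
```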


Dropping Null Values from the Dataset

Null values (missing data) can hinder the analysis process and may lead to errors or inaccurate results. To address this, we can remove rows containing null values using the Pandas dropna() method.


Code Explanation

# Dropping null values
df.dropna(inplace=True)

Step-by-Step Explanation:

  1. dropna() Function:

    • The dropna() function is used to remove rows or columns with missing values (NaN).

    • By default, it removes rows containing at least one null value.

  2. inplace=True:

    • When inplace=True, the changes are applied directly to the original DataFrame without creating a new one.

    • This makes the operation permanent unless you reload the DataFrame or revert the changes manually.


What Happens After Execution?

  • Any rows containing null values in any column are permanently removed from the dataset.

  • The DataFrame becomes cleaner and contains only complete rows of data.


Why Drop Null Values?

  • Simplifies Analysis: Ensures that missing values don’t interfere with calculations or visualizations.

  • Maintains Data Consistency: Helps avoid errors caused by incomplete data during operations like aggregation or machine learning model training.


Checking the Results

It’s important to verify the changes after dropping null values: running pd.isnull(df).sum() again should now report zero missing values in every column.
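A quick sketch of dropping nulls and then verifying the result, again on invented data:

```python
import pandas as pd

df = pd.DataFrame({
    'Gender': ['F', 'M', 'F'],
    'Amount': [100.0, None, 300.0],
})

df.dropna(inplace=True)        # removes the row with the missing Amount
print(df.shape)                # (2, 2): one row fewer than before

# verification: every column should now report zero nulls
print(pd.isnull(df).sum())
```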


Changing the Data Type of a Column in the Dataset

Sometimes, a column in the dataset might have an incorrect or non-optimal data type. Converting such columns to the appropriate type ensures that the data is handled correctly during analysis or computation. Here, we use the astype() method to change the data type of the column Amount to an integer.


Code Explanation

# Changing the data type of the 'Amount' column to integer
df['Amount'] = df['Amount'].astype('int')

Step-by-Step Explanation:

  1. Selecting the Column:

    • The df['Amount'] part refers to the Amount column in the DataFrame.
  2. Converting Data Type:

    • The .astype('int') method changes the data type of the selected column to integer (int).

    • If the column contains floating-point numbers or strings that represent numbers, they will be converted to integers.

  3. Assigning Back:

    • The transformed column is assigned back to the same column in the DataFrame, updating its values and data type.

Why Change Data Types?

  • Consistency: Ensures uniform data types, which are critical for accurate calculations or comparisons.

  • Optimization: Reduces memory usage and processing time by using the most efficient data type.

  • Error Prevention: Avoids runtime errors when performing numeric operations on non-numeric types.


Handling Potential Issues

  • Non-convertible Values:
    If the column contains non-numeric values (e.g., text), the conversion will raise a ValueError.

Expected Output

  1. Before Conversion: the Amount column is stored as float64 (it originally contained NaN values, which force a floating-point type).

  2. After Conversion: the Amount column holds whole numbers with an integer dtype.


This process ensures that the Amount column is converted into an integer type, making it ready for numerical computations like aggregation, comparison, or visualization.
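A sketch of the conversion on invented values, plus one way to guard against the ValueError mentioned above (pd.to_numeric with errors='coerce', which is an alternative to astype, not the tutorial's own step):

```python
import pandas as pd

df = pd.DataFrame({'Amount': [23952.0, 23934.0]})  # invented float values
df['Amount'] = df['Amount'].astype('int')
print(df['Amount'].tolist())  # [23952, 23934], now whole integers

# If a column might contain non-numeric text, pd.to_numeric with
# errors='coerce' turns bad values into NaN instead of raising ValueError:
messy = pd.Series(['100', 'abc'])
cleaned = pd.to_numeric(messy, errors='coerce')
print(cleaned)  # 100.0 and NaN
```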


Accessing Column Names in the Dataset

The df.columns property is used to display all the column names of a DataFrame. It helps in understanding the structure of the dataset.


Code Explanation

df.columns

  • What It Does:

    • This command lists all the column names in the DataFrame.

Example Output:

df.columns returns an Index object listing every column name in the dataset, such as Gender, Age Group, Marital_Status, State, Occupation, Orders, and Amount.

Why Use It?

  • Quick Overview: Provides an understanding of the available data at a glance.

  • Ease of Reference: Helps you use column names correctly when performing operations.


Renaming a Column

The rename() function allows you to modify column names to make them more descriptive or easier to understand.


Code Explanation

df.rename(columns={'Marital_Status': 'Shaadi'})

  • Parameters in the Code:

    1. columns={'Marital_Status': 'Shaadi'}

      • This specifies the column to rename. In this case, Marital_Status is renamed to Shaadi.
    2. Optional Parameter: inplace=True

      • If True, the change will be directly applied to the DataFrame. If omitted, a new DataFrame with updated column names is returned.


Why Rename Columns?

  • Clarity: Renaming columns makes their purpose more evident.

  • Adaptability: Simplifies working with datasets in different contexts or regions (e.g., translating to a local language).
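A short sketch showing that rename() without inplace=True returns a new DataFrame and leaves the original untouched (toy data):

```python
import pandas as pd

df = pd.DataFrame({'Marital_Status': [0, 1]})

renamed = df.rename(columns={'Marital_Status': 'Shaadi'})
print(list(renamed.columns))  # ['Shaadi']

# without inplace=True the original is untouched
print(list(df.columns))       # ['Marital_Status']
```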


Describing the Data

The describe() function provides summary statistics for numerical columns in the DataFrame, such as count, mean, standard deviation, minimum, and maximum values.


Code Explanation

df.describe()

  • What It Does:

    • Calculates and returns a summary of the numerical columns in the DataFrame.

Output Example:

For a DataFrame with numerical columns such as Age and Amount, the output is a table with one row each for count, mean, std, min, 25%, 50%, 75%, and max.


Specific Column Descriptions

You can use the describe() function on selected columns to focus on specific attributes.

df[['Age', 'Orders', 'Amount']].describe()

  • What It Does:

    • Provides summary statistics for the columns Age, Orders, and Amount only.
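Here is the column-wise describe() call on a tiny invented frame, so you can see the shape of the summary it returns:

```python
import pandas as pd

df = pd.DataFrame({
    'Age': [25, 30, 35],
    'Orders': [1, 2, 3],
    'Amount': [100, 200, 300],
})

stats = df[['Age', 'Orders', 'Amount']].describe()
print(stats)  # count, mean, std, min, 25%, 50%, 75%, max per column
```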


Exploratory Data Analysis:

Plotting a Bar Chart for Gender and Its Count

Visualizing data is a crucial part of data analysis, as it allows us to identify patterns and insights easily. In this step, we use the Seaborn library to create a bar chart that shows the distribution of gender counts in the dataset. Additionally, we label each bar with its respective count for better clarity.


Code Explanation

# Plotting a bar chart for Gender and its count
ax = sns.countplot(x='Gender', data=df)

# Adding labels to the bars
for bars in ax.containers:
    ax.bar_label(bars)

Step-by-Step Breakdown

  1. sns.countplot():

    • This function from Seaborn creates a bar chart displaying the counts of unique values in the Gender column.

    • Parameters:

      • x='Gender': Specifies the column to visualize on the x-axis.

      • data=df: Specifies the DataFrame containing the data.

  2. ax.containers:

    • Retrieves all the rectangular bar objects from the plot.
  3. ax.bar_label():

    • Adds numeric labels to the top of each bar, showing its count value.

Why Use a Bar Chart?

  • Categorical Data Representation: Ideal for visualizing the frequency distribution of categorical variables like gender.

  • Clear Insights: Displays the proportion of each category, making comparisons easy.


Verifying the Output

For a DataFrame with a Gender column containing values ['Male', 'Female', 'Male', 'Female', 'Male'], the bar chart would show:

  • The x-axis represents gender categories (e.g., Male, Female).

  • The y-axis represents the count for each category.

  • Each bar is labeled with its respective count (e.g., Male = 3, Female = 2).


Plotting a Bar Chart for Gender vs. Total Amount

A bar chart is an effective way to visualize the total sales amount contributed by each gender in your dataset. By grouping the data and summing the sales for each gender, you can generate insightful comparisons using the Seaborn library.

Code Explanation

# Grouping data by Gender and summing the Amount
sales_gen = df.groupby(['Gender'], as_index=False)['Amount'].sum().sort_values(by='Amount', ascending=False)

# Plotting the bar chart
sns.barplot(x='Gender', y='Amount', data=sales_gen)

Step-by-Step Breakdown

  1. groupby(['Gender']):

    • Groups the DataFrame by the Gender column.

    • The grouped data will aggregate the rows for each unique gender.

  2. ['Amount'].sum():

    • For each group, it calculates the sum of the values in the Amount column.

    • This gives the total sales amount for each gender.

  3. as_index=False:

    • Ensures the grouped column (Gender) remains a regular column in the resulting DataFrame instead of becoming the index.
  4. sort_values(by='Amount', ascending=False):

    • Sorts the DataFrame in descending order of the total sales amount for clearer visualization and comparison.
  5. sns.barplot():

    • Creates a bar chart with:

      • x='Gender': Gender categories on the x-axis.

      • y='Amount': Total sales amount on the y-axis.

      • data=sales_gen: Uses the grouped and aggregated dataset for the plot.

Expected Output

  • The x-axis displays the gender categories (e.g., Male, Female).

  • The y-axis displays the total sales amount corresponding to each gender.

  • The bars' heights represent the total sales amount for each gender, clearly indicating which gender contributed the most.


Plotting a Grouped Bar Chart: Age Group vs Gender

This visualization showcases the distribution of gender across different age groups using a grouped bar chart. The hue parameter allows for an additional categorization by gender, while bar labels enhance clarity by displaying exact counts on each bar.

Code Explanation

# Plotting a grouped bar chart for Age Group and Gender
ax = sns.countplot(data=df, x='Age Group', hue='Gender')

# Adding bar labels
for bars in ax.containers:
    ax.bar_label(bars)

Step-by-Step Breakdown

  1. sns.countplot():

    • Plots a count of occurrences for unique values in the Age Group column, while differentiating by Gender using the hue parameter.

    • Parameters:

      • data=df: Specifies the DataFrame to use.

      • x='Age Group': Age groups are displayed on the x-axis.

      • hue='Gender': Adds a secondary categorical distinction (Male/Female) within each age group.

  2. ax.containers:

    • Accesses the bar containers, which store individual bar objects for both genders.
  3. ax.bar_label():

    • Annotates each bar with its corresponding count, adding precise values to enhance readability.


Why Use Grouped Bar Charts?

  • Comparative Analysis: Easily compares subcategories (e.g., Male vs Female) within each age group.

  • Better Insights: Identifies trends such as gender-dominated age groups.

  • Appealing Visuals: Organizes data for intuitive interpretation.

Insights Derived

  1. Age Group Patterns: See which age group has the highest or lowest population for each gender.

  2. Marketing Decisions: Helps in targeting age groups effectively based on gender-specific preferences.

  3. Demographic Analysis: Useful for understanding customer distribution.

This approach provides a clear and insightful representation of gender distribution within different age groups.


Plotting Total Amount vs Age Group

Visualizing total sales by age group helps in identifying which age demographic contributes the most to revenue. This can be achieved using a bar chart, where the sales data is grouped by age group and aggregated. The resulting plot provides clear insights into spending trends across different age categories.


Code Explanation

Data Preparation

sales_age = df.groupby(['Age Group'], as_index=False)['Amount'].sum().sort_values(by='Amount', ascending=False)
sns.barplot(x='Age Group', y='Amount', data=sales_age)

  1. groupby(['Age Group']):

    • Groups the data by the Age Group column to aggregate values.
  2. ['Amount'].sum():

    • Sums up the Amount column for each age group to calculate total sales.
  3. as_index=False:

    • Keeps the Age Group column as part of the DataFrame rather than setting it as the index.
  4. sort_values(by='Amount', ascending=False):

    • Sorts the results in descending order of total sales, highlighting the top-performing age groups.
  5. sns.barplot(): Creates a bar chart:

      • x='Age Group': Age groups on the x-axis.

      • y='Amount': Total sales amount on the y-axis.

      • data=sales_age: Uses the grouped data.

From the above graphs, we can see that most buyers are women in the 26-35 age group.


Plotting the Total Number of Orders from the Top 10 States

This bar chart visualizes the total number of orders from the top 10 states, providing insight into regional demand and sales distribution. Using groupby to aggregate the data, followed by visualization with Seaborn, this chart makes it easy to identify the states generating the most orders.


Code Explanation

Data Preparation

sales_state = df.groupby(['State'], as_index=False)['Orders'].sum().sort_values(by='Orders', ascending=False).head(10)

  1. groupby(['State']):

    • Groups the dataset by the State column to analyze orders state-wise.
  2. ['Orders'].sum():

    • Calculates the total number of orders for each state.
  3. sort_values(by='Orders', ascending=False):

    • Sorts states by the total number of orders in descending order, placing the highest at the top.
  4. head(10):

    • Retrieves the top 10 states with the highest number of orders.

Plotting the Bar Chart

sns.set(rc={'figure.figsize': (15, 5)})  # Sets figure size
sns.barplot(data=sales_state, x='State', y='Orders')

  1. sns.set(rc={'figure.figsize': (15, 5)}):

    • Adjusts the dimensions of the figure to make it wider, ensuring all 10 states are visible.
  2. sns.barplot() Parameters:

    • data=sales_state: Specifies the DataFrame to use for the chart.

    • x='State': States on the x-axis.

    • y='Orders': Total number of orders on the y-axis.

Expected Output

  • The x-axis lists the top 10 states.

  • The y-axis represents the total number of orders.

  • Bar heights display the relative contribution of each state in terms of order volume.


Plotting Total Sales (Amount) from the Top 10 States

This visualization shows the total sales revenue contributed by the top 10 states. Using Seaborn to create a bar chart, the data highlights regional sales leaders and helps uncover geographic trends in revenue distribution.


Code Explanation

Data Preparation

sales_state = df.groupby(['State'], as_index=False)['Amount'].sum().sort_values(by='Amount', ascending=False).head(10)

  1. groupby(['State']):

    • Groups the data by State to aggregate sales per state.
  2. ['Amount'].sum():

    • Calculates the total sales amount for each state.
  3. sort_values(by='Amount', ascending=False):

    • Sorts states in descending order of total sales, showing the top-performing ones first.
  4. head(10):

    • Selects the top 10 states with the highest sales revenue.

Plotting the Bar Chart

sns.set(rc={'figure.figsize': (15, 5)})  # Adjust figure dimensions
sns.barplot(data=sales_state, x='State', y='Amount')

  1. sns.set(rc={'figure.figsize': (15, 5)}):

    • Defines the chart dimensions, ensuring all states fit comfortably on the x-axis.
  2. sns.barplot() Parameters:

    • data=sales_state: Uses the aggregated sales data.

    • x='State': Maps states to the x-axis.

    • y='Amount': Displays total sales (amount) on the y-axis.




Expected Output

  • X-axis: The states with the highest total sales (top 10).

  • Y-axis: Total revenue contributed by each state.

  • Bars: Height reflects the total sales amount, ordered by performance.


Plotting a Countplot for Marital Status

This plot visualizes the count of records based on marital status, showing how many individuals in the dataset are married versus unmarried. A countplot from Seaborn is used, along with bar labels to highlight the exact count values on each bar.


Code Explanation

Plotting the Countplot

ax = sns.countplot(data=df, x='Marital_Status')

  1. sns.countplot() Parameters:

    • data=df: Specifies the dataset to use.

    • x='Marital_Status': Maps the Marital_Status column to the x-axis.

  2. Marital_Status:

    • The column indicates whether individuals are married or single, typically containing categories like 'Married' or 'Unmarried'.

Customizing Figure Size

sns.set(rc={'figure.figsize': (7, 5)})

  • This sets the default plot dimensions to 7 inches wide and 5 inches high for better readability. (Note that sns.set changes defaults only for figures created after it runs, so call it before plotting.)

Adding Bar Labels

for bars in ax.containers:
    ax.bar_label(bars)

  1. ax.containers:

    • Accesses the bar containers (individual bars in the plot).
  2. ax.bar_label(bars):

    • Adds labels showing the height (count) of each bar on top of the bars for better clarity.


Plotting Total Sales (Amount) by Marital Status and Gender

This visualization depicts the total sales amount categorized by marital status and further segregated by gender, enabling insights into how different demographic segments contribute to overall sales. The use of a grouped bar chart helps analyze the combined influence of marital status and gender on purchasing behavior.


Code Explanation

Data Preparation

sales_state = df.groupby(['Marital_Status', 'Gender'], as_index=False)['Amount'].sum().sort_values(by='Amount', ascending=False)

  1. groupby(['Marital_Status', 'Gender']):

    • Groups the data by Marital_Status and Gender to create subcategories within each marital status.
  2. ['Amount'].sum():

    • Aggregates the total sales (amount) for each marital status-gender combination.
  3. sort_values(by='Amount', ascending=False):

    • Sorts the aggregated data in descending order of the total sales amount for easy interpretation.

Plotting the Bar Chart

sns.set(rc={'figure.figsize': (6, 5)})  # Sets figure dimensions
sns.barplot(data=sales_state, x='Marital_Status', y='Amount', hue='Gender')
  1. sns.set(rc={'figure.figsize': (6, 5)}):

    • Configures the figure size to balance visual clarity without clutter.
  2. sns.barplot() Parameters:

    • data=sales_state: Provides the dataset with aggregated sales values.

    • x='Marital_Status': Sets the marital status on the x-axis.

    • y='Amount': Displays the total sales (amount) on the y-axis.

    • hue='Gender': Adds bars for each gender within each marital status category, distinguishing them using color.

Expected Output

  • X-axis: Displays marital status categories (e.g., 'Married', 'Unmarried').

  • Y-axis: Represents the total sales (amount).

  • Bars: Show the sales contribution of different genders within each marital status.

  • Hue (Color Coding): Differentiates between genders.
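The aggregation step can be verified numerically before plotting. A small sketch on hypothetical rows (column names mirror the tutorial's dataset):

```python
import pandas as pd

# Hypothetical sample rows; the real dataset has many more
df = pd.DataFrame({
    'Marital_Status': ['Married', 'Married', 'Married', 'Unmarried', 'Unmarried'],
    'Gender':         ['F',       'F',       'M',       'F',         'M'],
    'Amount':         [100.0,     50.0,      250.0,     300.0,       175.0],
})

# Same pipeline as in the tutorial: group, sum, then sort descending
sales_state = (df.groupby(['Marital_Status', 'Gender'], as_index=False)['Amount']
                 .sum()
                 .sort_values(by='Amount', ascending=False))

print(sales_state['Amount'].tolist())  # [300.0, 250.0, 175.0, 150.0]
```

Passing sales_state to sns.barplot() with hue='Gender' then draws one bar per marital-status/gender pair.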


Plotting a Countplot for Occupation

This plot visualizes the distribution of occupations within the dataset by counting the number of records for each occupation. A countplot from Seaborn is used, combined with bar labels for clarity and a wide figure size to accommodate numerous categories on the x-axis.


Code Explanation

Plotting the Countplot

ax = sns.countplot(data=df, x='Occupation')
  1. sns.countplot() Parameters:

    • data=df: Specifies the dataset to use.

    • x='Occupation': Maps the column Occupation to the x-axis, where each unique occupation will be represented as a bar.


Adjusting Figure Size

sns.set(rc={'figure.figsize': (20, 5)})
  • This ensures the plot is wide enough to display all occupation labels on the x-axis without overlap.

  • A width of 20 inches and a height of 5 inches work well when visualizing many categories.


Adding Bar Labels

for bars in ax.containers:
    ax.bar_label(bars)
  1. ax.containers:

    • Accesses the bar elements in the plot.
  2. ax.bar_label(bars):

    • Adds labels at the top of each bar, showing the count for that occupation, improving the interpretability of the plot.

Expected Output

  • X-axis: Occupations.

  • Y-axis: Count of records for each occupation.

  • Bars: Represent the count for each occupation, with labels for exact numbers.
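With many occupations on the x-axis, ordering the bars by frequency often makes the chart easier to read. Seaborn's countplot accepts an order parameter, and a most-frequent-first ordering can be built from value_counts(). A sketch using hypothetical sample data:

```python
import pandas as pd

# Hypothetical sample; the real dataset has many more occupations
df = pd.DataFrame({'Occupation': ['IT', 'Healthcare', 'IT', 'Aviation',
                                  'IT', 'Healthcare']})

# Most-frequent-first ordering for the x-axis, usable as
# sns.countplot(data=df, x='Occupation', order=order)
order = df['Occupation'].value_counts().index.tolist()
print(order)  # ['IT', 'Healthcare', 'Aviation']
```

This keeps the tallest bars on the left, which makes the long tail of rare occupations easier to spot.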


Plotting Total Sales Amount by Occupation

This plot visualizes the total sales (amount) contributed by each occupation. By using a barplot from Seaborn, the chart displays the relationship between different occupations and their respective sales contributions, making it easier to identify top-performing occupations in terms of sales.


Code Explanation

Data Aggregation

sales_state = df.groupby(['Occupation'], as_index=False)['Amount'].sum().sort_values(by='Amount', ascending=False)
  1. groupby(['Occupation']):

    • Groups the data by the Occupation column to aggregate sales for each occupation.
  2. ['Amount'].sum():

    • Calculates the total sales (amount) for each occupation group.
  3. sort_values(by='Amount', ascending=False):

    • Sorts the occupations in descending order of total sales, placing the highest contributors at the top.

Creating the Bar Chart

sns.set(rc={'figure.figsize': (20, 5)})
sns.barplot(data=sales_state, x='Occupation', y='Amount')
  1. sns.set(rc={'figure.figsize': (20, 5)}):

    • Sets a wide figure size to prevent overlapping x-axis labels when there are many occupations.
  2. sns.barplot() Parameters:

    • data=sales_state: Uses the prepared dataset with aggregated sales.

    • x='Occupation': Occupations are displayed along the x-axis.

    • y='Amount': Total sales are plotted along the y-axis.

Expected Output

  • X-axis: Displays various occupations.

  • Y-axis: Represents the total sales (amount) generated by each occupation.

  • Bars: Illustrate the relative contribution of each occupation to total sales.

From the above graphs, we can see that most of the buyers work in the IT, Healthcare, and Aviation sectors.


Plotting a Countplot for Product Category

This visualization showcases the distribution of records within different product categories, displaying the count of items across various product categories in the dataset. A countplot from Seaborn is used here, and the bar labels are added for precise information about each category's count.


Code Explanation

Plotting the Countplot

ax = sns.countplot(data=df, x='Product_Category')
  1. sns.countplot() Parameters:

    • data=df: Specifies the dataset to use for plotting.

    • x='Product_Category': Maps the Product_Category column to the x-axis, allowing for categories to be shown as individual bars.


Adjusting Figure Size

sns.set(rc={'figure.figsize': (20, 5)})
  • Sets the figure dimensions, ensuring it is wide enough to display all product categories on the x-axis without overlap. This allows clearer visualization, especially if there are many product categories.

Adding Bar Labels

for bars in ax.containers:
    ax.bar_label(bars)
  1. ax.containers:

    • Accesses all the bars (containers) in the countplot, allowing further manipulation.
  2. ax.bar_label(bars):

    • Adds labels above each bar to display the actual count for each category, providing immediate visual feedback about the data.


Plotting Total Sales Amount by Top 10 Product Categories

This visualization shows the total sales (amount) contributed by the top 10 product categories. By grouping the data based on product category and aggregating the total sales, this bar chart helps to highlight the most revenue-generating product categories. It provides valuable insights into which product categories are the highest performers in terms of sales.


Code Explanation

Data Aggregation

sales_state = df.groupby(['Product_Category'], as_index=False)['Amount'].sum().sort_values(by='Amount', ascending=False).head(10)
  1. groupby(['Product_Category']):

    • Groups the dataset by the Product_Category column to analyze sales within each product category.
  2. ['Amount'].sum():

    • Sums up the total sales (amount) within each product category.
  3. sort_values(by='Amount', ascending=False):

    • Sorts the categories in descending order, with the product categories that generate the most sales at the top.
  4. head(10):

    • Selects only the top 10 categories based on total sales, focusing the plot on the highest-grossing categories.

Creating the Bar Chart

sns.set(rc={'figure.figsize': (20, 5)})
sns.barplot(data=sales_state, x='Product_Category', y='Amount')
  1. sns.set(rc={'figure.figsize':(20,5)}):

    • Adjusts the figure size, making it wide to accommodate the labels of the top 10 product categories without overlap.
  2. sns.barplot() Parameters:

    • data=sales_state: The data used for the bar plot, which contains the top 10 product categories with aggregated sales.

    • x='Product_Category': Displays the product categories along the x-axis.

    • y='Amount': Total sales amounts are plotted along the y-axis.
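The group-sum-sort-head(10) chain can be checked on toy data before running it on the full dataset. A minimal sketch with three hypothetical categories (so head(10) keeps everything; on the real data it caps the result at ten rows):

```python
import pandas as pd

# Hypothetical toy data; the real dataset has more categories
df = pd.DataFrame({
    'Product_Category': ['Food', 'Clothing', 'Food', 'Electronics', 'Clothing'],
    'Amount':           [500.0,  200.0,      300.0,  400.0,         100.0],
})

top = (df.groupby(['Product_Category'], as_index=False)['Amount']
         .sum()
         .sort_values(by='Amount', ascending=False)
         .head(10))  # keeps all 3 groups here; caps larger data at 10

print(top['Product_Category'].tolist())  # ['Food', 'Electronics', 'Clothing']
```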


Plotting Total Orders for Top 10 Products

This visualization displays the number of orders for the top 10 most frequently ordered products. By analyzing and grouping the data by product IDs, we can identify the products with the highest order volumes, which provides valuable insights for product performance analysis.


Code Explanation

Data Aggregation

sales_state = df.groupby(['Product_ID'], as_index=False)['Orders'].sum().sort_values(by='Orders', ascending=False).head(10)
  1. groupby(['Product_ID']):

    • Groups the dataset by the Product_ID column to calculate order totals for each product.
  2. ['Orders'].sum():

    • Aggregates the total number of orders for each product ID.
  3. sort_values(by='Orders', ascending=False):

    • Sorts the products in descending order based on their total order counts.
  4. head(10):

    • Selects only the top 10 products with the highest number of orders for focused visualization.

Creating the Bar Chart

sns.set(rc={'figure.figsize': (20, 5)})
sns.barplot(data=sales_state, x='Product_ID', y='Orders')
  1. sns.set(rc={'figure.figsize':(20,5)}):

    • Sets the figure's size to accommodate longer product IDs on the x-axis, ensuring no label overlaps.
  2. sns.barplot() Parameters:

    • data=sales_state: Specifies the dataset containing top product IDs and their order counts.

    • x='Product_ID': Displays product IDs on the x-axis.

    • y='Orders': Total order counts are plotted on the y-axis.

Expected Output

  • X-axis: Displays the product IDs of the top 10 most-ordered products.

  • Y-axis: Represents the total number of orders for each product.

  • Bars: Visualize the order counts for each product, helping identify the most popular items.
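As a side note, the sort_values(ascending=False).head(10) chain used here is equivalent to pandas' nlargest(10), which appears in the next chart of this project. A quick sketch on hypothetical data confirming the equivalence:

```python
import pandas as pd

# Hypothetical toy data mirroring the Product_ID / Orders columns
df = pd.DataFrame({
    'Product_ID': ['A', 'B', 'A', 'C'],
    'Orders':     [5,   2,   1,   4],
})

grouped = df.groupby('Product_ID')['Orders'].sum()

# sort + head and nlargest produce the same top-N selection
via_sort = grouped.sort_values(ascending=False).head(10)
via_nlargest = grouped.nlargest(10)
print(via_sort.equals(via_nlargest))  # True
```

nlargest is slightly more direct and can be faster on large data, but either form is fine.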


Visualizing the Top 10 Most Sold Products

This chart represents the top 10 most sold products in terms of their total order counts. The data is grouped by the Product_ID, and a bar chart is generated to show the sales volumes of these leading products.


Code Explanation

Data Grouping and Aggregation

df.groupby('Product_ID')['Orders'].sum()
  • groupby('Product_ID'): Groups the dataset by unique product IDs.

  • ['Orders'].sum(): Aggregates the total number of orders for each product.

Selecting the Top 10 Products

.nlargest(10)
  • Extracts the top 10 products with the highest order counts.

Sorting Values

.sort_values(ascending=False)
  • Keeps the result in descending order so that the product with the most orders appears first. Note that nlargest(10) already returns values sorted in descending order, so this extra sort is redundant; it is harmless and kept here only for explicitness.

Creating the Bar Chart

fig1, ax1 = plt.subplots(figsize=(12, 7))
  • plt.subplots(figsize=(12, 7)): Creates a figure and axis object with dimensions suitable for visualizing the top 10 products.
.plot(kind='bar')
  • Specifies the type of chart as a bar chart.

Output

  1. X-axis: Displays the top 10 product IDs based on sales volume.

  2. Y-axis: Indicates the total number of orders.

  3. Bars: Represent sales volumes for the products, helping compare their popularity visually.
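Putting the fragments above together, the whole chart can be produced by one short script. A runnable sketch on hypothetical toy data (the Agg backend is selected so it also runs headless; call plt.show() in a notebook to display the figure):

```python
import pandas as pd
import matplotlib
matplotlib.use('Agg')  # non-interactive backend so the script runs headless
import matplotlib.pyplot as plt

# Hypothetical toy data; the real dataset's Product_ID values are longer codes
df = pd.DataFrame({
    'Product_ID': ['P1', 'P2', 'P1', 'P3', 'P2', 'P1'],
    'Orders':     [2,    1,    3,    4,    2,    1],
})

# Group by product, keep the 10 largest totals (already sorted descending)
top_10 = df.groupby('Product_ID')['Orders'].sum().nlargest(10)

fig1, ax1 = plt.subplots(figsize=(12, 7))
top_10.plot(kind='bar', ax=ax1)  # bar chart of the top products
ax1.set_ylabel('Total Orders')
fig1.tight_layout()

print(top_10.index.tolist())  # ['P1', 'P3', 'P2']
```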


Conclusion

In this project, we explored and analyzed Diwali sales data using Python and powerful libraries like Pandas, NumPy, Matplotlib, and Seaborn. Through step-by-step data preprocessing, visualization, and analysis, we gained actionable insights into customer demographics, product performance, and sales trends. Key findings included identifying the most popular products, understanding purchasing patterns across different age groups, and analyzing sales contributions by states and customer profiles such as gender and marital status. This analysis can help businesses make data-driven decisions, optimize marketing strategies, and enhance inventory management. By leveraging the tools and techniques demonstrated here, similar datasets can be analyzed to uncover valuable patterns and drive growth effectively.