Bank Loan Risk Analysis

Mitigating Financial Risk with Exploratory Data Analysis in Excel

Profile Picture of Anirudh Chaudhary

Anirudh Chaudhary

Data Analyst | Financial Risk Specialist

This project showcases a comprehensive Exploratory Data Analysis (EDA) on a bank loan dataset to identify key factors influencing loan defaults. By leveraging advanced Excel functions, statistical analysis, and data visualization, I uncovered critical patterns to help the company make more informed, data-driven lending decisions, ultimately minimizing financial losses and ensuring sustainable growth.

Total Loan Applications
50002
Rows Analyzed
Identified Default Rate
8.05%
Clients with Payment Difficulties
Data Points Analyzed
3.6M+
(73 columns x 50k+ rows)
Highest Risk Group
Younger Applicants
(Age 20-30)
Bank Loan Analysis Dashboard Thumbnail

Task A: Missing Data Handling

+

Objective & Approach

To ensure the accuracy of the analysis, it was crucial to identify and handle missing values within the loan application dataset. My approach involved a three-step process to systematically clean the data.

Methodology

  1. Identify Missing Percentage: Calculated the percentage of blank cells for every column. The formula below was used in a new row (50001) to get the fill rate, which was then subtracted from 1.
    =COUNT(B2:B50001)/COUNTA(B2:B50001)
  2. Column Deletion: Used Conditional Formatting to highlight and subsequently delete all columns with over 40% missing data, as they would not provide reliable insights.
  3. Data Imputation: For remaining missing cells, I imputed the median value of the respective column. The median was chosen over the mean to avoid distortion from outliers present in the data.

Result

After cleaning, the dataset was reduced to 73 columns and 50,002 rows, with no blank cells or duplicates, creating a robust foundation for analysis.

Task B: Outlier Detection

+

Objective & Approach

Identified and flagged outliers in numerical variables to prevent them from skewing analytical results. The Interquartile Range (IQR) method was used for its robustness against extreme values.

Methodology in Excel

  1. Quartiles: Calculated Q1 and Q3 using =QUARTILE(ARRAY, 1) and =QUARTILE(ARRAY, 3).
  2. IQR: Calculated the Interquartile Range as IQR = Q3 - Q1.
  3. Bounds: Determined the upper and lower bounds for outlier detection using the standard formula:
    Lower Bound = $Q1 - 1.5 \times IQR$
    Upper Bound = $Q3 + 1.5 \times IQR$
  4. Flagging: Created a new column with an IF statement to flag any value outside these bounds as an outlier.

Task C: Data Imbalance Analysis

+

Objective & Approach

Assessed the distribution of the target variable (loan default status) to check for class imbalance, which can bias predictive models. I used an Excel Pivot Table to count occurrences of each class.

Key Findings

The analysis revealed a significant data imbalance:

  • Paid on Time (Target=0): 91.95% of applicants.
  • Payment Difficulties (Target=1): 8.05% of applicants.

This results in a ratio of approximately 11.4 to 1, indicating that models trained on this data might be heavily biased towards predicting non-defaulters. This is a critical insight for any future machine learning implementation.

Task D: Univariate & Segmented Analysis

+

Objective & Approach

To understand the characteristics of loan applicants, I performed univariate analysis on key attributes. To uncover patterns related to loan defaults, I then conducted segmented univariate analysis, comparing these attributes against the target variable (payment status). This was achieved using Pivot Tables and creating various charts for visualization.

Key Applicant Demographics

Segmented Analysis: Default Rate by Demographics

Task E: Correlation Analysis

+

Objective & Approach

To identify the strongest indicators of loan default, I segmented the data into two groups: applicants with on-time payments (Target=0) and those with payment difficulties (Target=1). I then calculated correlation matrices for numerical variables within each segment using Excel's Data Analysis ToolPak.

Key Findings

The analysis highlights different relationships within each group. For clients with payment difficulties, certain variables show stronger correlations. For instance, there's a notable negative correlation between age (`DAYS_BIRTH`) and the likelihood of default, suggesting younger clients are riskier. Conversely, for clients who pay on time, variables like income and loan amount are more interrelated.

Download Full Report (PDF) View Excel File

Technical Skills Applied

Advanced Excel

Pivot Tables, CORREL, QUARTILE, statistical functions.

Data Cleaning

Missing data imputation (Median) & outlier treatment (IQR).

Risk Analytics

Identifying key drivers of financial risk and loan default.

Data Visualization

Creating insightful, interactive charts to communicate findings.