Objective & Approach
To ensure the accuracy of the analysis, it was crucial to identify and handle missing values within the loan application dataset. My approach involved a three-step process to systematically clean the data.
Methodology
- Identify Missing Percentage: Calculated the percentage of blank cells for every column. The formula below was used in a new row (50001) to get the fill rate, which was then subtracted from 1.
=COUNT(B2:B50001)/COUNTA(B2:B50001) - Column Deletion: Used Conditional Formatting to highlight and subsequently delete all columns with over 40% missing data, as they would not provide reliable insights.
- Data Imputation: For remaining missing cells, I imputed the median value of the respective column. The median was chosen over the mean to avoid distortion from outliers present in the data.
Result
After cleaning, the dataset was reduced to 73 columns and 50,002 rows, with no blank cells or duplicates, creating a robust foundation for analysis.