Data Overview

Featured Image

Author: Liangxi Dou, Jialin Guo, Yutong Du and Jiaqi Yin.

Variables

  • The first dataset contains 23 variables. We selected the below 9 variables as our primary interests for EDA and Big Picture:

    • Attrition_Flag: Internal event (customer activity) variable (binary: Existing Customer or Attrited Customer)
    • Customer_Age: Customer’s age in years (numeric: from 26 to 73)
    • Gender: Customer’s gender (binary: F-Female or M-Male)
    • Education_Level: Educational Qualification of the account holder (nominal: College, Doctorate, Graduate, High School, Post-Graduate, Uneducated or Unknown)
    • Income_Category: Annual Income Category of the account holder (nominal: Less than 40K,40K, 40K-60K,60K, 60K-80K,80K, 80K-120Kor120K or 120K+)
    • Total_Trans_Ct: Total Transaction Count (Last 12 months) (numeric: from 10 to 139)
    • Total_Trans_Amt: Total Transaction Amount (Last 12 months) (numeric: from 510 to 18484)
    • Total_Revolving_Bal: Total Revolving Balance on the Credit Card (numeric: from 0 to 2517)
    • Avg_Utilization_Ratio: Average Card Utilization Ratio (numeric: from 0 to 1)
  • The second dataset contains 14 variables. We selected the common ones shared in the two datasets:

    • Exited: whether or not the customer left the bank (binary: 1-left or 0-not left)
    • Age: Customer’s age in years (numeric: from 18 to 92)
    • Gender: Customer’s gender (binary: Female or Male)
    • EstimatedSalary: Estimated customer salaries (numeric: from 11.58 to 200k)

Data Import & Cleaning

We first import two datasets and check importing issues by using the function problems.

## [1] row      col      expected actual  
## <0 rows> (or 0-length row.names)
## [1] row      col      expected actual  
## <0 rows> (or 0-length row.names)

Then we detect the missing values and outlier. From boxplots, we realized that variables “Cusomer_Age”, “Total_Trans_Ct”, “Total_Trans_Amt” in dataset 1 and “Age” in dataset 2 have outliers. We removed the outliers which are defined by the IQR method.

## [1] 0
## [1] 0

This link direct you to the detailed data cleaning process. load_and_clean_data.R