library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.3 ✓ purrr 0.3.4
## ✓ tibble 3.0.5 ✓ dplyr 1.0.3
## ✓ tidyr 1.1.2 ✓ stringr 1.4.0
## ✓ readr 1.4.0 ✓ forcats 0.5.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
Churner <- read_csv("churn.csv")
##
## ── Column specification ────────────────────────────────────────────────────────
## cols(
## RowNumber = col_double(),
## CustomerId = col_double(),
## Surname = col_character(),
## CreditScore = col_double(),
## Geography = col_character(),
## Gender = col_character(),
## Age = col_double(),
## Tenure = col_double(),
## Balance = col_double(),
## NumOfProducts = col_double(),
## HasCrCard = col_double(),
## IsActiveMember = col_double(),
## EstimatedSalary = col_double(),
## Exited = col_double()
## )
The description of the datasets we are combining: We looked at a second dataset, churn.csv. Its dependent variable is similar to the one in our old dataset: whether a bank customer leaves the bank or not. It consists of 10,000 observations and 14 columns, as listed in the column specification above. The independent variables contain information about the customers, while the dependent variable records customer abandonment. The two datasets share similar columns, such as Attrition_Flag/Exited (whether the customer has churned or not), Age, Gender, Tenure/Months on Book, and EstimatedSalary/Income. We aim to join these two datasets.
How we are combining them: We plan to combine the two datasets vertically, matching the variable “Attrition_Flag” in our old dataset to “Exited” in our new dataset. Both are categorical variables with two levels: existing customers and attrited customers. Since the new dataset describes new and different customers, each with an individual Customer_ID, we can only combine the two datasets vertically, which increases the number of rows rather than the number of columns. Many variables have the same meaning in both datasets (like gender and age), though some have different names. Hence, we want to rename the variables in the second dataset to match the names in the first dataset before stacking them. For example, we will rename “Exited” in the second dataset to “Attrition_Flag,” where 1 represents attrited customers and 0 represents existing customers. Likewise, the variable “estimated salary” has the same meaning as the variable “income,” and the variable “tenure” has a similar meaning to the variable “month on book,” but “tenure” is measured in years while “month on book” is measured in months. We can convert both to a common unit, for example by multiplying tenure by 12 to express it in months.
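The renaming, recoding, and unit conversion described above could be sketched as follows. This is a minimal illustration under stated assumptions, not our final pipeline: the two tibbles are small invented stand-ins for the real data, and the old-dataset column names `Customer_Age` and `Months_on_book`, the level labels "Existing Customer" and "Attrited Customer", and the "F"/"M" gender coding are assumptions rather than names confirmed above.

```r
library(dplyr)

# Toy stand-in for the old dataset (values invented; column names assumed).
old_data <- tibble(
  Attrition_Flag = c("Existing Customer", "Attrited Customer"),
  Customer_Age   = c(45, 52),
  Gender         = c("M", "F"),
  Months_on_book = c(39, 44)
)

# Toy stand-in for churn.csv, using column names from its column specification.
new_data <- tibble(
  Exited = c(0, 1, 0),
  Age    = c(42, 41, 39),
  Gender = c("Female", "Female", "Male"),
  Tenure = c(2, 1, 8)  # measured in years
)

# Rename and recode the new dataset so its columns line up with the old one.
new_harmonized <- new_data %>%
  transmute(
    Attrition_Flag = if_else(Exited == 1, "Attrited Customer", "Existing Customer"),
    Customer_Age   = Age,
    Gender         = if_else(Gender == "Female", "F", "M"),
    Months_on_book = Tenure * 12  # convert years to months
  )

# Stack the rows; the "source" column records which dataset each row came from.
combined <- bind_rows(old = old_data, new = new_harmonized, .id = "source")
```

Keeping a `source` column makes it possible to check later whether a pattern holds in both datasets or only in one of them.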
Our initial findings: At a glance, the second dataset shares similar findings with the first: older customers and customers with a higher salary or income are less likely to leave the bank. The new dataset also introduces a new variable, credit score, and we find that customers with a higher credit score are less likely to leave.
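One way to check a pattern like the credit-score finding is to compare churn rates across score bands. The sketch below uses a tiny invented table, so its numbers are not results from churn.csv; the band cut-offs (300, 600, 700, 850) and the band labels are assumptions chosen only for illustration.

```r
library(dplyr)

# Invented rows that only mimic the CreditScore and Exited columns.
toy <- tibble(
  CreditScore = c(520, 580, 610, 700, 720, 810),
  Exited      = c(1, 1, 1, 0, 0, 0)
)

churn_by_score <- toy %>%
  # Bin scores into three right-closed bands: (300,600], (600,700], (700,850].
  mutate(score_band = cut(CreditScore,
                          breaks = c(300, 600, 700, 850),
                          labels = c("low", "medium", "high"))) %>%
  group_by(score_band) %>%
  # Exited is coded 0/1, so its mean is the share of customers who left.
  summarise(n = n(), churn_rate = mean(Exited), .groups = "drop")
```

The same grouping could be applied to age or estimated salary by swapping the column passed to `cut()` and choosing suitable breaks.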
Difficulties we faced when combining the data: When we tried to find new datasets that could be joined with our old dataset, we faced some obstacles. The major issue is that each row of our original dataset holds one bank customer’s personal data, which makes it really difficult to find additional information about each customer. None of the independent variables is unique to a row, so we have no variable, such as a date, that could serve as a key to link to other datasets. We tried to find datasets that also have information on education level or income level, but they could hardly provide useful variables for our analysis. For instance, we could find the average income level for each education level, but this would not add meaningful value, since the new variable “average income level” would simply mirror the education level we already have. As a result, we believe that the only way to add a new dataset is to join it vertically. We therefore chose this dataset, which shares the same dependent variable as our original dataset.