The dataset was imported from CSV into Excel and converted to an Excel file. The data was first explored in Excel, then imported into RStudio for analysis and visualization.
- library(readxl): used to import the Excel file into RStudio.
- library(lubridate): used to manipulate and work with the date data type.
- library(tidyverse): used the dplyr and ggplot2 sub-packages for manipulation and visuals.

Several new variables were created from existing ones for analysis and visualization; the mutate() and ifelse() functions were used at this stage to achieve this.
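Assuming a workbook name (the actual file name is not stated in the text), the setup described above amounts to:

```r
library(readxl)     # read_excel(): import the Excel workbook into RStudio
library(lubridate)  # year(), month(): manipulate the date data type
library(tidyverse)  # dplyr and ggplot2: manipulation and visuals

# Hypothetical file name, for illustration only
insurance_data <- read_excel("insurance_data.xlsx")
```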
```r
# Set the current year and month
current_year <- 2024
current_month <- 6

# Extract the birth year and month for policyholders
insurance_data$birth_year <- year(insurance_data$birthdate)
insurance_data$birth_month <- month(insurance_data$birthdate)

# Calculate the age, subtracting a year if the birthday has not yet occurred this year
insurance_data <- insurance_data %>%
  mutate(
    age = ifelse(birth_month > current_month, current_year - birth_year - 1, current_year - birth_year)
  )
```
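The age rule above (subtract an extra year when the birthday has not yet occurred by the current month) can be checked on a couple of hand-picked birthdates; the dates below are illustrative, not from the dataset:

```r
# Check the age rule on two sample birthdates (values are illustrative)
current_year  <- 2024
current_month <- 6

birthdate   <- as.Date(c("1990-03-15", "1990-09-15"))
birth_year  <- as.integer(format(birthdate, "%Y"))
birth_month <- as.integer(format(birthdate, "%m"))

age <- ifelse(birth_month > current_month,
              current_year - birth_year - 1,
              current_year - birth_year)
age
# The March birthday has passed by June, the September one has not: 34 and 33
```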
```r
# Group the ages into intervals for exploration
age_breaks <- c(21, 31, 41, 51, 61, 71, 75)
age_labels <- c("21-30", "31-40", "41-50", "51-60", "61-70", "71-74")
insurance_data <- insurance_data %>%
  mutate(
    age_group = cut(age, breaks = age_breaks, labels = age_labels, right = FALSE)
  )

# Encode the character columns as numeric for correlation analysis
insurance_data_encoded <- insurance_data %>%
  mutate(make_encoded = as.numeric(factor(car_make)),
         model_encoded = as.numeric(factor(car_model)),
         color_encoded = as.numeric(factor(car_color)))
```

New variables were then derived from claim_freq, claim_amt, and household_income.

```r
# Define the risk profile of policyholders based on claim frequency and claim amount
insurance_data <- insurance_data %>%
  mutate(risk_profile = ifelse(claim_freq >= 2 | claim_amt > 75000, "High", "Low"))

# Convert risk_profile to a binary variable for logistic regression
insurance_data <- insurance_data %>%
  mutate(high_risk = ifelse(risk_profile == "High", 1, 0))

# Define a threshold for what a frequent claim is
insurance_data <- insurance_data %>%
  mutate(claim_freq_status = ifelse(claim_freq >= 2, "Frequent", "Infrequent"))

# Convert claim_freq_status to a binary variable for logistic regression
insurance_data <- insurance_data %>%
  mutate(frequent_claimer = ifelse(claim_freq_status == "Frequent", 1, 0))

# Define what low claim frequency and high household income mean
insurance_data <- insurance_data %>%
  mutate(low_claim_high_income = ifelse(claim_freq < 2
                                        & household_income > 198000, "Yes", "No"))

# Convert low_claim_high_income to a binary variable for logistic regression
insurance_data <- insurance_data %>%
  mutate(highi_lowc = ifelse(low_claim_high_income == "Yes", 1, 0))

# Estimate premiums based on claim amounts and frequency
insurance_data <- insurance_data %>%
  mutate(estimated_premium = 0.1 * claim_amt + 100 * claim_freq)
```

The thresholds used to define the risk profiles, claim status, and the low-claim/high-income flag were not chosen by judgement alone. Descriptive statistics of claim_freq, claim_amt, and household_income (mean, standard deviation, minimum, maximum, and quartiles) were used to determine appropriate cut-offs. Premiums were estimated as a percentage of the claim amount that Splendor Insurance would charge, plus a fixed amount per claim: policyholders who claim more often are higher risk, so their premium is adjusted upwards, and likewise low-risk policyholders pay less.
Several methods were used to extract insights from the dataset:

- library(tidyverse): used the dplyr and ggplot2 sub-packages for manipulation and visuals.
- library(caret): used for regression analysis.
- library(SmartEDA): used for automated EDA.
- library(skimr): also used for quick EDA and missing-value checks.

An automated EDA was performed to explore the dataset and understand frequency distributions, data type attributes, relationships, missing values, and more.
```r
# Automated data exploration
# ExpReport(insurance_data, op_file = "EDA_report.html")
skim_without_charts(insurance_data)
```

| Name | insurance_data |
|---|---|
| Number of rows | 37542 |
| Number of columns | 28 |
| Column type frequency: | |
| character | 14 |
| factor | 1 |
| numeric | 12 |
| POSIXct | 1 |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| ID | 0 | 1 | 10 | 10 | 0 | 37541 | 0 |
| marital_status | 0 | 1 | 6 | 9 | 0 | 4 | 0 |
| car_use | 0 | 1 | 7 | 10 | 0 | 2 | 0 |
| gender | 0 | 1 | 4 | 6 | 0 | 2 | 0 |
| kids_driving_ch | 0 | 1 | 3 | 5 | 0 | 4 | 0 |
| parent | 0 | 1 | 2 | 3 | 0 | 2 | 0 |
| education | 0 | 1 | 3 | 11 | 0 | 4 | 0 |
| car_make | 0 | 1 | 2 | 13 | 0 | 78 | 0 |
| car_model | 0 | 1 | 1 | 23 | 0 | 1011 | 0 |
| car_color | 0 | 1 | 3 | 10 | 0 | 19 | 0 |
| coverage_zone | 0 | 1 | 5 | 12 | 0 | 5 | 0 |
| risk_profile | 0 | 1 | 3 | 4 | 0 | 2 | 0 |
| claim_freq_status | 0 | 1 | 8 | 10 | 0 | 2 | 0 |
| low_claim_high_income | 0 | 1 | 2 | 3 | 0 | 2 | 0 |
Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| age_group | 0 | 1 | FALSE | 6 | 61-: 7186, 41-: 7087, 31-: 7071, 51-: 7062 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 |
|---|---|---|---|---|---|---|---|---|---|
| kids_driving | 0 | 1 | 0.42 | 0.74 | 0.00 | 0.00 | 0.00 | 1.00 | 3.00 |
| car_year | 0 | 1 | 2000.29 | 9.05 | 1909.00 | 1995.00 | 2002.00 | 2007.00 | 2013.00 |
| claim_freq | 0 | 1 | 0.51 | 1.02 | 0.00 | 0.00 | 0.00 | 1.00 | 4.00 |
| claim_amt | 0 | 1 | 50028.51 | 28706.52 | 0.04 | 25439.41 | 49455.89 | 74974.93 | 99997.70 |
| household_income | 0 | 1 | 147247.41 | 59145.59 | 45004.91 | 96162.18 | 146674.90 | 198277.42 | 249991.11 |
| birth_year | 0 | 1 | 1975.85 | 15.30 | 1949.00 | 1963.00 | 1976.00 | 1989.00 | 2002.00 |
| birth_month | 0 | 1 | 6.53 | 3.44 | 1.00 | 4.00 | 7.00 | 10.00 | 12.00 |
| age | 0 | 1 | 47.65 | 15.30 | 21.00 | 34.00 | 48.00 | 61.00 | 74.00 |
| high_risk | 0 | 1 | 0.34 | 0.47 | 0.00 | 0.00 | 0.00 | 1.00 | 1.00 |
| frequent_claimer | 0 | 1 | 0.12 | 0.32 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 |
| highi_lowc | 0 | 1 | 0.22 | 0.42 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 |
| estimated_premium | 0 | 1 | 5053.88 | 2872.65 | 0.00 | 2597.43 | 5005.02 | 7546.02 | 10398.03 |
Variable type: POSIXct
| skim_variable | n_missing | complete_rate | min | max | median | n_unique |
|---|---|---|---|---|---|---|
| birthdate | 0 | 1 | 1949-10-21 | 2002-10-18 | 1976-04-21 | 16525 |
These were used to explore the data for patterns and trends among the various categorical and numeric variables. The functions used were ggplot() for visuals; summarise() to apply an aggregation such as sum or mean to a numeric variable; group_by() to group by a categorical variable; and filter() to narrow out observations not needed. This technique was applied across every question the project sought to answer. Below is a code sample:
```r
# Common characteristics among policyholders who make frequent claims
# Gender
insurance_data %>%
  group_by(claim_freq_status, gender) %>%
  filter(claim_freq_status == "Frequent") %>%
  summarise(Count = n(), .groups = 'drop') %>%
  ggplot(aes(x = claim_freq_status, y = Count, fill = gender)) +
  geom_col(position = 'dodge', width = 0.4) +
  labs(title = "Distribution of Policyholders that Claim Frequently based on Gender",
       x = "Frequent Claimers",
       y = "Value") +
  theme_minimal()

# Key characteristics of policyholders with low claim frequencies and high household incomes
insurance_data %>%
  group_by(low_claim_high_income, marital_status) %>%
  filter(low_claim_high_income == "Yes") %>%
  summarise(Count = n(), .groups = 'drop') %>%
  ggplot(aes(x = low_claim_high_income, y = Count, fill = marital_status)) +
  geom_col(position = 'dodge', width = 0.4) +
  labs(title = "Distribution of Policyholders with Low Claim Frequency and
High Household Income based on Marital Status",
       x = "Low Claim & High Income",
       y = "Value") +
  theme_minimal()
```

Logistic regression was used to provide insights for certain questions, including the risk assessment and customer segmentation questions. Below is a code sample:
```r
# A. Factors that are most indicative of high-risk policyholders
# Using logistic regression for other features
# logistic_model_1 <- glm(high_risk ~ household_income + education + kids_driving_ch + parent, data = insurance_data, family = binomial)

# Based on car features
# logistic_model_2 <- glm(high_risk ~ car_use + car_make + car_model + car_color, data = insurance_data, family = binomial)

# Based on demographics
logistic_model_3 <- glm(high_risk ~ age_group + gender + marital_status +
                          coverage_zone,
                        data = insurance_data, family = binomial)
```

Correlation analysis was done to understand and provide insights into the relationships that existed between variables, specifically the question: are there any specific vehicle characteristics (e.g., make, model, year) that correlate with higher claim frequencies or amounts?
```r
# Select only the needed numeric variables out of the encoded data
numeric_data <- insurance_data_encoded %>%
  select(claim_amt, claim_freq, car_year, make_encoded, model_encoded, color_encoded)

# Calculate the correlation among the variables
cor(numeric_data)
```

```
##                  claim_amt   claim_freq     car_year  make_encoded
## claim_amt      1.000000000  0.002058273  0.001201309 -0.0066980780
## claim_freq     0.002058273  1.000000000  0.003845844  0.0032000747
## car_year       0.001201309  0.003845844  1.000000000 -0.0292848318
## make_encoded  -0.006698078  0.003200075 -0.029284832  1.0000000000
## model_encoded -0.001034638  0.001940056  0.092446074  0.0289658081
## color_encoded -0.003110628 -0.006654749 -0.002555016  0.0003793521
##               model_encoded color_encoded
## claim_amt      -0.001034638 -0.0031106285
## claim_freq      0.001940056 -0.0066547487
## car_year        0.092446074 -0.0025550157
## make_encoded    0.028965808  0.0003793521
## model_encoded   1.000000000 -0.0011830157
## color_encoded  -0.001183016  1.0000000000
```
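Given the near-zero values above, one way to confirm that no vehicle characteristic stands out is to rank the off-diagonal correlations by absolute size. A small base-R sketch on a made-up matrix (the random columns below merely stand in for numeric_data):

```r
set.seed(7)
# Stand-in for cor(numeric_data): correlation matrix of random columns
m <- cor(matrix(rnorm(300), ncol = 3,
                dimnames = list(NULL, c("claim_amt", "claim_freq", "car_year"))))

# Keep each pair once (upper triangle), then sort by absolute correlation
pairs  <- which(upper.tri(m), arr.ind = TRUE)
ranked <- data.frame(
  var1 = rownames(m)[pairs[, 1]],
  var2 = colnames(m)[pairs[, 2]],
  r    = m[pairs]
)
ranked <- ranked[order(-abs(ranked$r)), ]
print(ranked)
```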