Zurich Insurance Plc
Zurich is a leading multi-line insurer serving people and businesses in more than 200 countries and territories and has about 60,000 employees. Founded more than 150 years ago, Zurich is transforming insurance.
Helped grow the Data Science department and its capabilities at Zurich Insurance Spain.
Worked directly under the Head of Data and collaborated closely with the CEO and other department heads, depending on the topics and projects at hand.
Was among the promising young talent in the company, involved in strategic and innovative company projects.
Business Motivation: Retaining existing customers is more cost-effective than acquiring new ones due to lower marketing, incentive, and manpower costs. A small increase in retention rates can significantly boost profits. For example, a 5% increase in loyalty can lead to a 25-85% profit increase.
Customer Value: Long-term customers are less sensitive to competitor pricing, making retention a strategic priority.
Reducing Churn and Retention Costs: Predicting which customers are likely to leave allows targeted retention campaigns.
Targeted Marketing: Enables the development of personalized marketing strategies.
Customer Segmentation: Identifies high-value, at-risk customers for focused interventions.
Customer Lifetime Value Calculation: Helps optimize pricing and portfolio management (a simplified CLV formula is sketched after this list).
Agent Performance Metrics: Assists in evaluating and improving agent contributions to retention.
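As a minimal illustration of the lifetime-value point above, a common textbook simplification discounts expected future margins by the retention probability. This is a generic formula with hypothetical figures, not Zurich's internal model:

```python
def customer_lifetime_value(annual_margin, retention_rate, discount_rate):
    """Simplified CLV: expected discounted margin over an infinite horizon,
    CLV = margin * retention / (1 + discount - retention).
    A textbook approximation, not Zurich's internal model."""
    return annual_margin * retention_rate / (1 + discount_rate - retention_rate)

# Hypothetical figures: a 5-point retention lift raises CLV substantially.
print(customer_lifetime_value(200, 0.80, 0.10))  # ~533
print(customer_lifetime_value(200, 0.85, 0.10))  # ~680
```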
Lead Data Scientist responsible for overseeing the end-to-end data science project lifecycle, including data collection, defining problem statements, and developing robust predictive models. Key responsibilities include:
- Develop a predictive model for car insurance churn using supervised machine learning. Key Questions: Which customers are most likely to cancel? What are the main reasons for cancellation?
Business Understanding: The objective is to predict churn probability at policy renewal.
CRISP-DM Framework: Follows the standard phases: business understanding, data understanding, data preparation, modeling, evaluation, and deployment.
Data Infrastructure: Utilizes a Hadoop-based Data Lake with Spark, Hive, and HBase for large-scale data processing.
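A minimal sketch of pulling policy data out of the Hive layer of such a Data Lake with PySpark; the table and column names are hypothetical:

```python
from pyspark.sql import SparkSession

# Hive-enabled Spark session; table and column names are hypothetical.
spark = (SparkSession.builder
         .appName("churn-data-pull")
         .enableHiveSupport()
         .getOrCreate())

# Read policy snapshots from a Hive table in the Data Lake.
policies = spark.sql("""
    SELECT policy_id, customer_id, premium, renewal_date, status
    FROM datalake.car_policies
""")
policies.show(5)
```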
Churn Definition Challenge: Churn is defined based on customer behavior within a 12-month window around policy renewal.
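One way to operationalize that definition is to flag a policy as churned if it was cancelled inside the 12-month window centred on its renewal date; the column names and exact window logic below are assumptions for illustration:

```python
import pandas as pd

def label_churn(df: pd.DataFrame, window_months: int = 12) -> pd.DataFrame:
    """Flag a policy as churned if it was cancelled within +/- half of
    window_months around its renewal date (hypothetical column names)."""
    half = pd.DateOffset(months=window_months // 2)
    cancelled = df["cancel_date"].notna()
    in_window = ((df["cancel_date"] >= df["renewal_date"] - half)
                 & (df["cancel_date"] <= df["renewal_date"] + half))
    df["churn"] = (cancelled & in_window).astype(int)
    return df
```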
Customer Data: Demographics, relationship length, address.
Policy Data: Status, coverage, premium, historical claims.
Interaction Data: Complaints, call center interactions, satisfaction scores.
Claims Data: Number, type, and timing of claims.
Intermediary Data: Agent information and history.
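These five source groups would typically be joined into a single modeling table keyed on policy and customer; a pandas sketch with hypothetical frames and keys:

```python
# policies, customers, interactions, claims_agg, agents are hypothetical
# extracts of the five source groups listed above.
model_df = (
    policies                                            # status, coverage, premium
    .merge(customers, on="customer_id")                 # demographics, tenure
    .merge(interactions, on="customer_id", how="left")  # complaints, call center
    .merge(claims_agg, on="policy_id", how="left")      # claim counts and timing
    .merge(agents, on="agent_id", how="left")           # intermediary history
)
model_df = model_df.fillna({"n_complaints": 0, "n_claims": 0})
```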
Model Used: XGBoost (eXtreme Gradient Boosting), implemented in R and Python.
Feature Importance: Variables such as premium changes, vehicle value, bonus, policy history, and claim history were significant predictors.
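A hedged sketch of the Python side of this step: fitting an XGBoost classifier and ranking feature importances of the kind listed above. The hyperparameters are illustrative and the feature matrix X_train is hypothetical:

```python
import pandas as pd
import xgboost as xgb

# X_train, y_train: hypothetical feature matrix and churn labels.
model = xgb.XGBClassifier(n_estimators=300, max_depth=5, learning_rate=0.1)
model.fit(X_train, y_train)

# Rank predictors (e.g. premium change, vehicle value, claim history).
importance = pd.Series(model.feature_importances_, index=X_train.columns)
print(importance.sort_values(ascending=False).head(10))
```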
Performance Metrics:
Test Error: 0.442
Accuracy: 0.558
Recall: 0.315
Precision: 0.143
F-measure: 0.197
These metrics indicate moderate predictive power, with room for improvement in recall and precision.
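For reference, all of these figures follow from the confusion matrix, and the reported F-measure is consistent with the reported precision and recall (2 · 0.143 · 0.315 / (0.143 + 0.315) ≈ 0.197). A scikit-learn sketch with hypothetical held-out labels and predictions:

```python
from sklearn.metrics import (accuracy_score, f1_score,
                             precision_score, recall_score)

# y_test, y_pred: hypothetical held-out labels and model predictions.
acc = accuracy_score(y_test, y_pred)    # reported: 0.558
rec = recall_score(y_test, y_pred)      # reported: 0.315
prec = precision_score(y_test, y_pred)  # reported: 0.143
f1 = f1_score(y_test, y_pred)           # 2*prec*rec/(prec+rec) ~= 0.197
print(f"test error = {1 - acc:.3f}")    # reported: 0.442
```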
Claims and Churn: Policies with more claims tend to have lower churn rates.
Open Claims: Customers with open claims at renewal are less likely to churn.
Multiple Policies: Customers with more car policies are less likely to churn.
Inactive Policies: More inactive policies in the last 12 months correlate with lower churn.
Recent Cancellations: Customers whose policies were cancelled more recently are more likely to churn.
Summary: The model helps identify at-risk customers, enabling targeted retention strategies and improved business outcomes.
Future Directions: Further improvements may include advanced feature engineering, semi-supervised learning, and deeper integration with business rules for campaign targeting.
1. Data Processing and Manipulation
Pandas: Used for loading, cleaning, and manipulating structured data. It enabled efficient handling of large datasets, such as customer records, policy details, and claims history.
NumPy: Provided support for numerical operations and array-based computations, which are foundational for data preprocessing and feature engineering.
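A brief sketch of the kind of pandas/NumPy preprocessing this refers to; the file name and columns are hypothetical:

```python
import numpy as np
import pandas as pd

# Hypothetical extract of customer, policy, and claims fields.
df = pd.read_csv("car_policies.csv", parse_dates=["renewal_date"])

# Typical cleaning: drop duplicate policies, fill missing claim counts.
df = df.drop_duplicates(subset="policy_id")
df["n_claims"] = df["n_claims"].fillna(0)

# NumPy-backed feature engineering, e.g. log-scaling a skewed premium.
df["log_premium"] = np.log1p(df["premium"])
```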
2. Data Visualization
Matplotlib: Used to create a variety of static, animated, and interactive plots to visualize data distributions, feature relationships, and model results.
Seaborn: Built on top of matplotlib, this library was used for more advanced statistical visualizations, such as count plots and heatmaps, to explore categorical distributions and correlation matrices.
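For example, the count plots and heatmaps mentioned above could be produced as follows (column names hypothetical):

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Class balance of the churn label.
sns.countplot(data=df, x="churn")
plt.title("Churned vs. retained policies")
plt.show()

# Correlation heatmap across numeric features.
sns.heatmap(df.select_dtypes("number").corr(), cmap="coolwarm")
plt.title("Feature correlation matrix")
plt.show()
```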
3. Machine Learning and Modeling
Scikit-learn: Served as the primary library for machine learning tasks, including data splitting (e.g., train_test_split), feature scaling (e.g., Z-score normalization), and model evaluation (e.g., accuracy, precision, recall, F1-score). It also provided implementations of algorithms such as logistic regression, decision trees, and random forests (a combined modeling sketch follows this list).
XGBoost: Implemented the eXtreme Gradient Boosting algorithm, which was the main model used for churn prediction due to its high performance in classification tasks.
LightGBM: Another gradient boosting framework, used for its speed and efficiency, especially with large datasets and high-dimensional features.
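A combined sketch of how these three libraries fit together in such a pipeline: scikit-learn for splitting and scaling, then XGBoost and LightGBM fitted side by side. The parameters are illustrative defaults, not the tuned production values:

```python
import lightgbm as lgb
import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# X, y: hypothetical feature matrix and churn labels.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# Z-score normalization, fitted on the training split only.
scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)

# Two gradient-boosting baselines fitted side by side.
xgb_clf = xgb.XGBClassifier(n_estimators=200).fit(X_train_s, y_train)
lgb_clf = lgb.LGBMClassifier(n_estimators=200).fit(X_train_s, y_train)
print(xgb_clf.score(X_test_s, y_test), lgb_clf.score(X_test_s, y_test))
```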
4. Model Evaluation and Experimentation
Scikit-learn: In addition to modeling, it was used for generating evaluation metrics (e.g., confusion matrices, classification reports) and for techniques like cross-validation and hyperparameter tuning.
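A minimal sketch of that evaluation loop: GridSearchCV over an XGBoost classifier plus a cross-validated recall estimate. The grid is illustrative, not the search space actually used:

```python
import xgboost as xgb
from sklearn.model_selection import GridSearchCV, cross_val_score

# Illustrative grid; the actual search space is not documented here.
param_grid = {"max_depth": [3, 5, 7], "learning_rate": [0.05, 0.1]}
search = GridSearchCV(xgb.XGBClassifier(n_estimators=200),
                      param_grid, scoring="f1", cv=5)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)

# 5-fold cross-validated recall for the tuned model.
scores = cross_val_score(search.best_estimator_, X_train, y_train,
                         scoring="recall", cv=5)
print(scores.mean())
```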
5. Computing Environment
Google Colaboratory (Colab): Provided a cloud-based environment for running Python code, facilitating collaboration and access to GPU resources for faster model training.
These libraries collectively enabled the end-to-end process of data ingestion, exploration, modeling, and evaluation for predicting customer churn in the car insurance context.