Zurich Insurance Plc
Zurich is a leading multi-line insurer serving people and businesses in more than 200 countries and territories, with about 60,000 employees. Founded more than 150 years ago, Zurich is transforming insurance.
Helped grow the Data Science department and capabilities in Zurich Insurance Spain.
Worked directly under the Head of Data and in close collaboration with the CEO and the other heads of departments, depending on the topics and projects handled.
Was among the promising young employees in the company involved in strategic and innovative company projects.
In Commercial Insurance, there is a need to analyze historical property claims in order to reduce large claim losses. This is currently very hard to do because structured data is missing for certain fields of information.
So, what is structured data?
Structured data is data that can be easily analysed with software like Excel, for example, or with a visualization tool such as Tableau. Basically, data we can easily draw conclusions from.
Unstructured data normally refers to, but is not limited to, textual data, images, PDFs, web pages, videos, etc.
The main objective is to make better decisions in underwriting risk; by that we mean risk appetite, wording, deductibles, etc.
Also, as secondary objectives we wanted to:
have a complete property loss map: by claim type, claim cause, activity, segment, customer and region;
improve Risk Engineering efficiency by focusing on claim reduction strategies;
provide insights to our customers: mapping typical claims by activity, geographical location and segment.
That’s how Loss Inspector was born!
By working in an agile and methodical way, we made use of open-source tools and programming languages to build a complete solution comprising three parts:
First, we captured important data fields from unstructured data (in our case claims documentation: loss adjuster reports) using the Intelligent Document Capture software that we developed in-house;
Second, we created an interactive dashboard to identify Risk Insights;
Last but not least, we will deploy a machine learning model that can predict the risk of property policies at renewal through the subproject Pockets of Loss.
"In God we trust; All others must bring data." - W. Eduard Demming
Now let's take them one by one, as I believe there was quite a lot of information on the last slide.
What is Intelligent Document Capture? Intelligent Document Capture is a piece of software we developed in-house. It takes data from loss adjuster reports (for the IT geeks: in different image formats and PDFs) and, together with claims and policy information, extracts the claim type, claim cause and location as written in the report. This extraction is only possible after a step involving OCR, or Optical Character Recognition, which translates an image into its textual representation so that a machine can understand the information inside (here we used an open-source solution, Tesseract OCR, although we could also have used Google Cloud Vision OCR or any of the other paid OCR engines on the market). Also, without advanced NLP, or Natural Language Processing, techniques it would have been close to impossible to extract the relevant information from the text.
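To make the OCR step concrete, here is a minimal sketch using Tesseract through the pytesseract library; the file name, the PDF-to-image conversion via pdf2image and the Spanish language pack are illustrative assumptions, not the actual in-house pipeline.

```python
# Minimal sketch of the OCR step: turn a scanned loss adjuster report
# (PDF or image) into raw text with Tesseract. Paths are placeholders,
# and lang="spa" assumes the Spanish tessdata pack is installed.
from pdf2image import convert_from_path  # renders PDF pages as PIL images
from PIL import Image
import pytesseract


def ocr_report(path: str, lang: str = "spa") -> str:
    """Return the concatenated text of a loss adjuster report."""
    if path.lower().endswith(".pdf"):
        pages = convert_from_path(path, dpi=300)
    else:
        pages = [Image.open(path)]
    return "\n".join(pytesseract.image_to_string(p, lang=lang) for p in pages)


text = ocr_report("loss_adjuster_report_001.pdf")
print(text[:500])
```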
In order to standardize the extracted information, we classify it according to types and causes predefined by the risk engineering and portfolio teams. This classification step uses a hybrid approach: an expert rule-based system combined with machine learning. The machine learning step involved close to 600 reports manually classified by the risk engineering team, plus continuous feedback to improve accuracy.
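A hedged sketch of what such a hybrid classifier can look like: a few expert keyword rules applied first, with a TF-IDF plus logistic-regression model trained on the manually labelled reports as fallback. The keywords, labels and tiny training sample below are illustrative, not the actual Zurich rules or data.

```python
# Hybrid classification sketch: expert rules first, ML fallback otherwise.
from typing import Optional
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

RULES = {              # illustrative expert rules: keyword -> claim cause
    "incendio": "Fire",
    "inundaci": "Flood",
    "robo": "Theft",
}

def rule_based_cause(text: str) -> Optional[str]:
    low = text.lower()
    for keyword, cause in RULES.items():
        if keyword in low:
            return cause
    return None

# In practice ~600 manually labelled reports would be loaded here
train_texts = [
    "fire damage to warehouse roof after electrical fault",
    "stock stolen from retail unit during break-in",
]
train_labels = ["Fire", "Theft"]

ml_model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
ml_model.fit(train_texts, train_labels)

def classify_cause(text: str) -> str:
    """Expert rules take priority; machine learning covers the rest."""
    return rule_based_cause(text) or ml_model.predict([text])[0]
```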
Last but not least, the location needed to be normalized. This last step was done using the Google Maps geocoding service, which returns the address in a standard form, including the geolocation.
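The geocoding call itself is simple; below is a sketch using the googlemaps Python client, with a placeholder API key and an illustrative address.

```python
# Location normalization sketch via the Google Maps Geocoding API.
import googlemaps

gmaps = googlemaps.Client(key="YOUR_API_KEY")  # placeholder key

def normalize_location(raw_address: str) -> dict:
    """Return a standardized address plus latitude/longitude."""
    results = gmaps.geocode(raw_address)
    if not results:
        return {}
    best = results[0]
    loc = best["geometry"]["location"]
    return {
        "formatted_address": best["formatted_address"],
        "lat": loc["lat"],
        "lng": loc["lng"],
    }

print(normalize_location("Gran Via 28, Madrid"))  # illustrative address
```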
We are proud to tell you that we finally reached an accuracy of more than 90% for both the claim type and the claim cause, and of 65% for the location. If the results sound too good, well, they sound very good to us as well, as previous research shows that an accuracy of more than 70-80% is considered very good for text mining.
So far we have learnt how to convert unstructured data into structured data and performed classification in order to understand the type and cause of the claims.
With this valuable information we moved to Phase 2 of the project, which is to understand risk, and we do so using two methodologies:
I will explain the first methodology here, which is 'Risk Insights'. Here we gathered all the information related to the claims: among the data sources we used the output of Phase 1 of the project, Intelligent Document Capture. From there, as I explained earlier, we took the geolocation of the claim together with its type and cause.
Along with this, we also integrated many other data sources, such as structured policy data and some auxiliary tables in an Access database. These additional sources provided enriched data about the policies, customers and customer segmentation, as well as basic information about the claims.
There were many challenges in integrating these different data sources due to the format and granularity of the information stored in them. We also faced many data quality issues, most of which were fixed using data cleaning techniques and by implementing some business logic.
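As a rough illustration of this integration and cleaning step, here is a pandas sketch that joins the Intelligent Document Capture output with claims and policy data; the file names, column names and cleaning rules are assumptions made for the example.

```python
# Data integration sketch: join IDC output with claims and policy data.
import pandas as pd

idc = pd.read_csv("idc_output.csv")        # claim_id, claim_type, claim_cause, lat, lng
claims = pd.read_csv("claims.csv")         # claim_id, policy_id, loss_amount, loss_date
policies = pd.read_csv("policies.csv")     # policy_id, customer_id, segment, activity

df = (
    claims.merge(idc, on="claim_id", how="left")
          .merge(policies, on="policy_id", how="left")
)

# Simple examples of cleaning and business logic
df["loss_amount"] = pd.to_numeric(df["loss_amount"], errors="coerce").fillna(0)
df = df[df["loss_amount"] > 0]                          # keep claims with an actual loss
df["claim_cause"] = df["claim_cause"].str.strip().str.title()

df.to_csv("risk_insights_input.csv", index=False)       # feeds the Tableau dashboard
```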
Once we had all the required information, we built a Tableau dashboard to visualize the data in an easily understandable format. The dashboard contains multiple reports, each giving risk insights along different dimensions.
This tool helps the Risk Engineers to explore risk data and get some valuable insights.
The second methodology we call 'Pockets of Loss'. Here we use machine learning techniques not only to understand risk but also to predict future risk. This project is a work in progress; it has not been finished yet, so we will not be presenting results here but will focus more on the methodology.
As in any other machine learning project, the first step is to gather and clean the data. The data we used here is very similar to the data in our Risk Insights dashboard, with of course a few additions.
The most important thing here is that the model predicts risk at policy-location level. In commercial insurance one policy can cover several locations, and each location comes with a different type of risk; therefore it is important to take the location into consideration when predicting risk.
So our algorithm first clusters the existing policy-locations based on severity and frequency using machine learning clustering methods such as K-means, and then characterizes the risk of these clusters along those two dimensions (severity and frequency).
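A minimal sketch of that clustering step with scikit-learn, assuming a table with one row per policy-location; the column names and the number of clusters are illustrative.

```python
# Clustering sketch: group policy-locations by claim frequency and severity.
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

pl = pd.read_csv("policy_locations.csv")                 # one row per policy-location
features = pl[["claim_frequency", "claim_severity"]]

scaled = StandardScaler().fit_transform(features)        # put both dimensions on the same scale
kmeans = KMeans(n_clusters=4, random_state=42, n_init=10)
pl["risk_cluster"] = kmeans.fit_predict(scaled)

# Inspect average frequency/severity per cluster to understand each one's risk
print(pl.groupby("risk_cluster")[["claim_frequency", "claim_severity"]].mean())
```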
The second step is to use attributes such as customer segment, activity, hazard grade, etc. and see how these attributes contribute to risk. We do this by building a machine learning model, but I will not go into much detail on that at this point.
The end result is a model where you feed in all the information related to the policy and the location, and it outputs a risk score.
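Since the presentation does not give the actual model, here is one hedged way such a scoring step could look: a gradient boosting classifier trained to predict the cluster label from the policy-location attributes, with the probability of the assumed high-risk cluster used as the risk score. Feature names, the model choice and the cluster id are assumptions.

```python
# Supervised risk-scoring sketch: attributes in, risk score out.
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

pl = pd.read_csv("policy_locations.csv")                 # includes the risk_cluster label
features = pd.get_dummies(pl[["customer_segment", "activity", "hazard_grade"]])
target = pl["risk_cluster"]

X_train, X_test, y_train, y_test = train_test_split(
    features, target, test_size=0.2, random_state=42
)
model = GradientBoostingClassifier().fit(X_train, y_train)

# "Risk score": probability of landing in the assumed high-risk cluster
high_risk_cluster = 3            # illustrative id of the high-frequency/high-severity cluster
col = list(model.classes_).index(high_risk_cluster)
new_policy = X_test.iloc[[0]]    # a policy-location to score, e.g. at renewal
risk_score = model.predict_proba(new_policy)[0, col]
print(f"risk score: {risk_score:.2f}")
```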
We can use this model to understand the risk of existing policies and to predict the risk of new policies without risk engineers having to visit the location.
Intelligent Document Capture - Better quality data & new historical information for pricing & risk engineering
Risk Insights - Improve risk understanding and exploration through a central location where users can access, interact with and analyze up-to-date risk information
Pockets of Loss - Help identify and assess the risk of current and new portfolios, which in turn helps minimize and treat risks
Do not underestimate business knowledge & involvement
Have a well-defined scope of the analysis (which policies/claims are included, which business lines, completeness of data)
Multiple iterations needed for best results
Less is More: Focus on extracting specific important fields of information rather than extracting all fields.
TODO