Zurich Insurance Plc
Zurich is a leading multi-line insurer serving people and businesses in more than 200 countries and territories and has about 60,000 employees. Founded more than 150 years ago, Zurich is transforming insurance.
Helped grow the Data Science department and its capabilities at Zurich Insurance Spain.
Reported directly to the Head of Data and worked closely with the CEO and other department heads, depending on the topics and projects at hand.
Selected among the company's promising young talent for involvement in strategic and innovative company projects.
Led a project aimed at automating document classification and indexing within the insurance sector, addressing the challenges posed by extensive communication between insurers, brokers, and agents. Recognizing that much of this communication occurs through emails, the project focused on enhancing the efficiency of claims management by automating the classification of emails and their attachments. This initiative is expected to eliminate third-party costs of approximately €500K annually during 2020-2021.
Automated Document Classification for Operations (HEGEO): Implemented automation strategies that achieved a 90% automation rate for manual classification tasks, resulting in an estimated savings of €2.1 million over five years.
This project ultimately aimed to enhance operational efficiency and reduce costs while ensuring timely responses to client communications.
Lead Data Scientist responsible for overseeing the end-to-end data science project lifecycle, including data collection, defining problem statements, and developing robust predictive models. Key responsibilities include:
- Collaborating with cross-functional teams to identify business challenges and translate them into actionable data-driven solutions.
- Designing and implementing data collection strategies to gather relevant and high-quality datasets.
- Building, validating, and deploying machine learning models to address specific business needs.
- Conducting exploratory data analysis to uncover insights and inform decision-making.
- Establishing and maintaining strong relationships with stakeholders, ensuring alignment between data initiatives and organizational goals.
- Mentoring and guiding junior data scientists in best practices, methodologies, and analytical techniques.
- Communicating complex findings and recommendations to non-technical stakeholders in a clear and impactful manner.
- Continuously monitoring model performance and iterating on solutions to optimize outcomes.
Automated the manual classification of operational emails, resulting in savings of €1.5 million over five years by eliminating the manual workload previously handled by 10 full-time employees.
Deployed the solution both on-premises and on the AWS cloud.
Thesis ranked among the top 3 master's thesis papers.
Paper on the project produced in cooperation between the Universitat Politècnica de Catalunya and Zurich.
Pytesseract: For Optical Character Recognition (OCR) to extract text from images or scanned documents. Example: pytesseract.image_to_string() to convert images to text.
Pandas: For data manipulation, preprocessing, and handling dataframes. Example: Managing and cleaning data input from various sources.
NumPy: For numerical operations and handling arrays efficiently. Example: Performing matrix operations and calculations.
NLTK (Natural Language Toolkit): For natural language processing tasks such as tokenization, stemming, and stop word removal. Example: Preprocessing text data extracted from documents.
spaCy: For advanced NLP tasks, including named entity recognition and dependency parsing. Example: Extracting meaningful features from processed text.
Scikit-learn: For machine learning algorithms including SVM, Random Forest, and PCA. Example: RandomForestClassifier() for classification tasks and PCA() for dimensionality reduction.
Imbalanced-learn: For handling imbalanced datasets with techniques like over-sampling (SMOTE) and under-sampling. Example: SMOTE() to create synthetic samples for minority classes.
TensorFlow/Keras or PyTorch: For building and training neural network models. Example: Implementing deep learning architectures for more complex classification tasks.
OpenCV: For image processing tasks that may be necessary to enhance images before OCR. Example: Preprocessing images (e.g., resizing, denoising) to improve OCR accuracy.
Matplotlib/Seaborn: For data visualization and model evaluation. Example: Plotting results to analyze classification performance.
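As a small illustration of how two of these libraries fit together, the sketch below stems text with NLTK's PorterStemmer and stores the result in a pandas column, as one might when consolidating text inputs from several sources. The sample strings are invented.

```python
# Minimal sketch (illustrative data): NLTK stemming applied to a pandas column.
import pandas as pd
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

def stem_text(text: str) -> str:
    # Lowercase, split on whitespace, and stem each token.
    return " ".join(stemmer.stem(tok) for tok in text.lower().split())

df = pd.DataFrame({"raw": ["Claims processed quickly", "Processing the claim"]})
df["stemmed"] = df["raw"].map(stem_text)
print(df)
```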
Explanation:
OCR - optical character recognition (Google Cloud Vision, Tesseract - open source, Rossum AI, Amazon Textract, ABBYY FineReader)
PDF splitting into page images: Ghostscript, ImageMagick.
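A minimal sketch of this splitting-plus-OCR step, assuming the Ghostscript (`gs`) and Tesseract binaries are installed; the file names and the 300 dpi setting are illustrative.

```python
import subprocess
from pathlib import Path

def ghostscript_cmd(pdf_path: str, out_dir: str, dpi: int = 300) -> list:
    """Build the Ghostscript command that rasterizes each PDF page to a PNG."""
    out_pattern = str(Path(out_dir) / "page-%03d.png")
    return ["gs", "-dBATCH", "-dNOPAUSE", "-sDEVICE=png16m",
            f"-r{dpi}", f"-sOutputFile={out_pattern}", pdf_path]

def pdf_to_text(pdf_path: str, out_dir: str) -> list:
    """Rasterize the PDF, then OCR every page image with pytesseract."""
    subprocess.run(ghostscript_cmd(pdf_path, out_dir), check=True)
    import pytesseract  # requires the tesseract binary on PATH
    from PIL import Image
    return [pytesseract.image_to_string(Image.open(p))
            for p in sorted(Path(out_dir).glob("page-*.png"))]
```

ImageMagick's `convert -density 300 input.pdf page.png` is an equivalent one-liner for the rasterization step.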
Data Preparation:
- Computer Vision: OCR to extract raw text from page images.
- NLP: tokenization, stop-word removal, stemming, equivalence classes, lowercasing, data cleaning (dictionary matching, removal of spurious characters), data decryption, and handling of multiple languages; TF-IDF to convert the collection of raw documents into a matrix of TF-IDF features.
- Feature extraction: e.g., flags for whether specific keywords appear in the text.
- Principal Component Analysis: linear dimensionality reduction using randomized SVD, keeping only the most significant vectors to project the data into a lower-dimensional space.
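The TF-IDF and dimensionality-reduction steps above can be sketched with scikit-learn; `TruncatedSVD` plays the PCA role here because it uses randomized SVD and accepts the sparse TF-IDF matrix directly. The toy corpus is invented.

```python
# Sketch: clean text, build a TF-IDF matrix, reduce it with randomized SVD.
import re
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

def clean(text: str) -> str:
    text = text.lower()                    # lowercasing / equivalence classes
    text = re.sub(r"[^a-z\s]", " ", text)  # removal of non-letter characters
    return re.sub(r"\s+", " ", text).strip()

docs = [clean(d) for d in [
    "Claim form attached, policy 12345.",
    "Broker request: update policy holder address.",
    "Invoice for repair services after the claim.",
    "Agent forwarded a new claim notification email.",
]]

tfidf = TfidfVectorizer(stop_words="english")  # stop-word removal + TF-IDF
X = tfidf.fit_transform(docs)                  # sparse document-term matrix

svd = TruncatedSVD(n_components=2, random_state=0)  # randomized SVD
X_reduced = svd.fit_transform(X)                    # lower-dimensional projection
print(X.shape, "->", X_reduced.shape)
```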
Model training:
- Linear classifiers
- SVM
- Random Forest
- Neural Network
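A hedged sketch of fitting the four listed model families on synthetic data; the hyperparameters are illustrative, not the values used in the project.

```python
# Compare linear, SVM, random forest, and neural-network classifiers.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=400, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

models = {
    "linear": LogisticRegression(max_iter=1000),
    "svm": LinearSVC(),
    "random_forest": RandomForestClassifier(random_state=0),
    "neural_net": MLPClassifier(hidden_layer_sizes=(32,), max_iter=500,
                                random_state=0),
}
# Held-out accuracy per model family (a first pass; see metrics below).
scores = {name: m.fit(X_tr, y_tr).score(X_te, y_te)
          for name, m in models.items()}
print(scores)
```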
Model Selection:
- Precision, Recall, F1
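With imbalanced labels, raw accuracy can look high while the minority class is missed entirely; precision, recall, and F1 expose this, which is why they drive model selection here. The labels below are invented.

```python
# Precision/recall/F1 on a small imbalanced example (3 positives, 5 negatives).
from sklearn.metrics import precision_score, recall_score, f1_score

y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 1]

p = precision_score(y_true, y_pred)  # 2 of 3 predicted positives are correct
r = recall_score(y_true, y_pred)     # 2 of 3 actual positives were found
f1 = f1_score(y_true, y_pred)        # harmonic mean of precision and recall
print(p, r, f1)
```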
Imbalance:
- Oversampling: SMOTE, ADASYN, random oversampling.
- Undersampling: random undersampling, Near Miss, Cluster Centroids.
Challenges: OCR quality, non-standardized data input, NLP data cleaning, document misclassification, time and memory complexity, dealing with class imbalance, and dealing with similar and overlapping classes.
Contributions:
- Single model for a multi-language context
- Used the class/subclass hierarchy in prediction
- Used synthetic groups of subclasses for prediction
- Combined an email's multiple fields in the classification
- Handled class imbalance through over- and undersampling
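The multiple-fields contribution can be sketched with a `ColumnTransformer` that vectorizes each email field separately and concatenates the features before classification; the field names, toy emails, and labels are assumptions, not the production setup.

```python
# Combine subject and body into one feature matrix, then classify.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

emails = pd.DataFrame({
    "subject": ["New claim", "Policy change", "Claim invoice", "Address update"],
    "body": ["please open a claim file", "update the policy terms",
             "invoice for claim repairs", "new address for the holder"],
    "label": ["claims", "policy", "claims", "policy"],
})

# One TF-IDF vectorizer per field; outputs are concatenated column-wise.
features = ColumnTransformer([
    ("subject", TfidfVectorizer(), "subject"),
    ("body", TfidfVectorizer(), "body"),
])
clf = Pipeline([("features", features), ("model", LogisticRegression())])
clf.fit(emails[["subject", "body"]], emails["label"])
preds = clf.predict(emails[["subject", "body"]])
print(list(preds))
```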