How to Build a Fraud Detection System using Machine Learning Models

Today, I would like to discuss how to build a complete fraud detection system from a technical perspective. I will walk through the fundamental steps for developing such a system, as well as the key drivers associated with each step.

Step One: Define project goals and measurement metrics, and assign resources

The first step in any data science project is to define the project goals:

What are the fraud cases that we want to identify?
What kinds of analytics techniques have already been implemented to combat fraud?
What are the key measurement metrics that we want to focus on when assessing the effectiveness of our fraud detection system?
What kinds of developers, and how many of them, do we need to develop the fraud detection system?
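
To make the question about measurement metrics more concrete: fraud is usually rare, so plain accuracy can look good even when most fraud is missed, and precision and recall are common alternatives. The following is a minimal sketch, assuming binary labels where 1 means fraud; the function name and the sample labels are hypothetical and only for illustration.

```python
def precision_recall(y_true, y_pred):
    """Return (precision, recall) for binary labels where 1 means fraud."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # fraud correctly flagged
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # legitimate case flagged
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # fraud that was missed
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical data: 2 real fraud cases among 10 transactions.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]

precision, recall = precision_recall(y_true, y_pred)
accuracy = sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)
```

In this made-up sample, 8 of 10 predictions are correct, yet only one of the two fraud cases is caught, which is why the choice of metric should be agreed on with the business up front.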

Step Two: Identify proper data sources

Once the business objectives have been confirmed and communicated, we start to identify and collect proper data sources for the fraud detection system.

Common data sources for detecting fraud include:

client profile
risk profile
product usage
billing data

Additional data may also be available from third-party data vendors. For example, in the financial services industry, we can incorporate government compliance data (such as the American and Canadian sanctions lists and regulatory rules) when building the fraud model.
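
As a rough illustration of how compliance data can become a model feature, the sketch below flags clients whose names appear on a sanctions list. It assumes the list is available as plain names and uses exact matching after normalization, which is a simplification; real screening typically needs fuzzy matching and more identifiers than a name. The field names (`client_id`, `name`, `on_sanction_list`) are hypothetical.

```python
import unicodedata


def normalize_name(name):
    """Lowercase, strip accents and punctuation, and collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", name)
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    cleaned = "".join(ch if ch.isalnum() or ch.isspace() else " " for ch in no_accents)
    return " ".join(cleaned.lower().split())


def flag_sanctioned(clients, sanctioned_names):
    """Return client records with an added boolean 'on_sanction_list' feature."""
    lookup = {normalize_name(n) for n in sanctioned_names}
    return [
        {**c, "on_sanction_list": normalize_name(c["name"]) in lookup}
        for c in clients
    ]


# Hypothetical records, for illustration only.
clients = [
    {"client_id": 1, "name": "José  Pérez"},
    {"client_id": 2, "name": "Alice Smith"},
]
flagged = flag_sanctioned(clients, ["JOSE PEREZ"])
```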

Step Three: Design the fraud detection system architecture

Several key factors need to be considered when designing the fraud detection system architecture.

Detection frequency determines how often we run the new data through our fraud scoring model.

Fraud-prevention operation flow impacts how and when we flag different events as suspicious, and how we handle and confirm those suspicious cases afterwards.
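
One simple way to picture such an operation flow is a pair of score thresholds that route each event to a different action. This is only a sketch: the three actions and the threshold values are assumptions for illustration, and in practice they would come out of the discussion with the fraud operations team.

```python
def route_event(score, review_threshold=0.5, block_threshold=0.9):
    """Map a fraud score in [0, 1] to an action (hypothetical thresholds)."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be between 0 and 1")
    if score >= block_threshold:
        return "block"    # stop the event and open a case immediately
    if score >= review_threshold:
        return "review"   # let it through, but queue it for manual confirmation
    return "approve"      # no action needed


actions = [route_event(s) for s in (0.10, 0.60, 0.95)]
```

Lowering `review_threshold` catches more fraud at the cost of a longer manual review queue, so the thresholds are as much an operational decision as a modeling one.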

Scoring accuracy baseline helps us assess the quality of our fraud scoring model.
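
One possible baseline, which I am assuming here rather than prescribing, is random selection: if reviewers picked cases at random, the share of fraud they would find equals the overall fraud rate. The sketch below compares the precision among the top-k scored cases against that rate. The function names and the sample data are hypothetical.

```python
def precision_at_k(scores, labels, k):
    """Share of true fraud (label 1) among the k highest-scored cases."""
    if not 0 < k <= len(scores):
        raise ValueError("k must be between 1 and the number of cases")
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    return sum(label for _, label in ranked[:k]) / k


def lift_over_random(scores, labels, k):
    """How many times better than random selection the top-k list is."""
    base_rate = sum(labels) / len(labels)
    if base_rate == 0:
        raise ValueError("no fraud cases in the evaluation data")
    return precision_at_k(scores, labels, k) / base_rate


# Hypothetical evaluation set: 2 fraud cases among 10.
scores = [0.9, 0.8, 0.1, 0.7, 0.2, 0.3, 0.1, 0.2, 0.4, 0.1]
labels = [1, 0, 0, 1, 0, 0, 0, 0, 0, 0]
lift = lift_over_random(scores, labels, k=2)
```

A lift close to 1 would mean the model is doing no better than random review, so a minimum acceptable lift can serve as the baseline for deciding whether a model is worth deploying.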

Step Four: Develop the data engineering, transformation, and modeling pipelines

Key activities

Once we have envisioned the architecture of the fraud detection solution, we start developing the data engineering, transformation, and modeling pipelines. The key activities for each of those pipelines are summarized below.

For the data engineering pipeline, we need to ingest and merge the data from different sources, aggregate the data based on business metrics, and set up batch processes.
For the data transformation pipeline, the main goal is to improve the data quality, deal with data issues such as missing and incorrect data, and convert the data so that it can be fed into machine learning models.
For the machine learning model pipeline, we focus on building and comparing a diverse set of ML models based on key business metrics. A module for automated model accuracy testing and re-training is a necessity in the production environment to avoid model drift.
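
The three pipelines above can be sketched in a few lines each. This is a minimal illustration with in-memory records, not a production design: the field names (`client_id`, `amount`, `age`), the median imputation, and the 0.05 tolerance for triggering re-training are all assumptions chosen for the example.

```python
from collections import defaultdict
from statistics import median


def build_features(profiles, billing_rows):
    """Data engineering: merge billing into client profiles, aggregated per client."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in billing_rows:
        totals[row["client_id"]] += row["amount"]
        counts[row["client_id"]] += 1
    return [
        {**p, "total_billed": totals[p["client_id"]], "n_bills": counts[p["client_id"]]}
        for p in profiles
    ]


def impute_median(rows, field):
    """Data transformation: replace missing (None) values with the field median."""
    observed = [r[field] for r in rows if r[field] is not None]
    fill = median(observed) if observed else 0.0
    return [{**r, field: fill if r[field] is None else r[field]} for r in rows]


def needs_retraining(recent_metric, baseline_metric, tolerance=0.05):
    """Model monitoring: True when the metric has dropped by more than `tolerance`."""
    return (baseline_metric - recent_metric) > tolerance


# Hypothetical inputs from two sources.
profiles = [
    {"client_id": 1, "age": 40},
    {"client_id": 2, "age": None},
    {"client_id": 3, "age": 30},
]
billing = [
    {"client_id": 1, "amount": 100.0},
    {"client_id": 1, "amount": 50.0},
    {"client_id": 2, "amount": 20.0},
]
features = impute_median(build_features(profiles, billing), "age")
```

In a batch setup, something like `needs_retraining` would run on a schedule against recently confirmed cases, comparing the current value of the chosen business metric with the value measured at deployment.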

Step Five: Integrate the model into the case management system

The final step is to incorporate our best-performing ML model into the case management system. We can rank the risk level of each case based on the risk score that the model generates. A list of highly suspicious cases is then assigned to relationship managers for further review through the case management system.
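
The ranking and assignment described above could look roughly like the sketch below. The score cutoff for "highly suspicious" and the round-robin assignment are my assumptions for illustration; a real case management system will have its own assignment rules and interfaces. All names and values are hypothetical.

```python
from itertools import cycle


def build_review_queue(cases, managers, min_score=0.8):
    """Rank cases by risk score and assign the highly suspicious ones round-robin."""
    if not managers:
        raise ValueError("at least one relationship manager is required")
    suspicious = sorted(
        (c for c in cases if c["risk_score"] >= min_score),
        key=lambda c: c["risk_score"],
        reverse=True,  # highest risk first
    )
    return [
        {**case, "assigned_to": manager}
        for case, manager in zip(suspicious, cycle(managers))
    ]


# Hypothetical scored cases and reviewers.
cases = [
    {"case_id": "A", "risk_score": 0.95},
    {"case_id": "B", "risk_score": 0.50},
    {"case_id": "C", "risk_score": 0.85},
    {"case_id": "D", "risk_score": 0.99},
]
queue = build_review_queue(cases, ["rm_a", "rm_b"])
```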