HDSC August ’21 Capstone Project Presentation: Credit Card Fraud Detection: Machine Learning Solutions

HamoyeHQ
Nov 22, 2021

A project by Team Vispy

Ever since e-commerce payment systems came into existence, there have been people looking for ways to gain unauthorized access to other people’s finances. This has become a major problem of the modern era, as transactions can now be completed online by simply entering card information.

Credit card fraud is the most common type of identity theft and can be a nightmare scenario for any individual or business that falls victim to it. Unauthorized card operations affected an astonishing 16.7 million victims in 2017. Additionally, as reported by the Federal Trade Commission (FTC), the number of credit card fraud claims in 2017 was 40% higher than in the previous year. There were around 13,000 reported cases in California and 8,000 in Florida, the states with the highest per-capita rates of this crime. In 2019, payment card fraud losses reached $28.65 billion worldwide, according to Nilson Report data. The United States, with an estimated 1.5 billion credit cards, alone accounts for more than a third of the total global loss, making it the most card-fraud-prone country in the world. The coronavirus pandemic also contributed a great deal to the sharp increase in card fraud activity.

Credit card fraud occurs when someone other than the account owner carries out unauthorized card transactions or otherwise uses the account without permission. In other words, credit card fraud is a case where a person uses someone else’s credit card for personal reasons while the owner and the card-issuing authorities are unaware that the card is being used. This negatively impacts consumers, issuers, and merchants alike: businesses often spend millions to protect themselves from fraud, which significantly increases the cost of running a business, especially a small one. The behaviour behind such fraudulent practices can be studied in order to minimise them and to formulate preventive measures against similar occurrences in the future. Expertise in data science and machine learning is now in demand, since solutions to this problem can be automated.

Credit Card Fraud Techniques

Credit card fraud techniques carried out by fraudsters include credit card skimming, clone transactions, account theft, false application fraud, account takeover, etc. Credit card fraud can also happen due to the card owner’s negligence.

What is Credit Card Fraud Detection?

“This is defined as a set of activities or measures taken to prevent money or property from being obtained through false pretenses.” Fraud detection involves monitoring the activities of populations of users in order to estimate, perceive, or avoid objectionable behaviour, which consists of fraud, intrusion, and defaulting. Fraud can be committed in different ways and in many industries, so fraud detection methods are continuously developed to keep pace with criminals as they adapt their strategies. The types of fraud are classified as:

• Credit Card Fraud: Online and Offline

• Card Theft

• Account Bankruptcy

• Device Intrusion

• Application Fraud

• Counterfeit Card

• Telecommunication Fraud

There are a number of approaches currently used in credit card fraud detection, some of which are:

• Artificial Neural Network

• Fuzzy Logic

• Genetic Algorithm

• Logistic Regression

• Decision Tree

• Support Vector Machines

• Bayesian Networks

• Hidden Markov Model

• K-Nearest Neighbor

Credit Card Fraud Detection Systems and the Steps to Implement Artificial Intelligence Fraud Detection Systems

Credit card fraud detection systems include:

1. Predictive machine learning models that can learn from preceding data and estimate the probability of a fraudulent credit card transaction (which is what our project set out to implement).

2. Risk scores obtained from third-party sources (e.g. MicroBilt or LexisNexis).

3. Business rules that set conditions that the transaction must pass to be approved (e.g. SSN matches, below deposit/withdrawal limit, two-step verification, etc.).

Of these fraud detection techniques, predictive Machine Learning models are the ones regarded as smart, data-driven solutions.

Steps to Implement an AI fraud detection system:

1. Data mining. This involves the classification, segmentation, and grouping of data from millions of transactions, on which the Machine Learning model will be trained to trace patterns and detect fraud.

2. Pattern recognition. This involves detecting the clusters, classes, and patterns of suspicious behaviour. For example, the neural network approach helps automatically identify the characteristics most often found in fraudulent transactions; this method is very effective when a large number of transaction samples is available.

3. Integration. As soon as the Machine Learning fraud detection module is integrated into the e-commerce platform, it begins to keep track of transactions. Whenever a user requests a transaction, it is held for a short time while it is scored. Depending on the level of predicted fraud probability, there are three possible outcomes:

• If the probability is less than 10%, the transaction is allowed.

• If the probability is between 10% and 80%, an additional authentication factor (e.g. a one-time SMS code, a fingerprint, or a secret question) should be applied.

• If the probability is more than 80%, the transaction is frozen and must be processed manually.
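The three outcomes described above can be sketched as a small routing function. This is only an illustration: the names `Decision` and `route_transaction` are hypothetical, the 10% and 80% cut-offs are taken from the text, and treating the exact boundary values as "additional authentication" is our assumption.

```python
from enum import Enum


class Decision(Enum):
    ALLOW = "allow"                  # let the transaction through
    CHALLENGE = "challenge"          # ask for an extra authentication factor
    MANUAL_REVIEW = "manual_review"  # freeze and hand over to a human


# Cut-offs taken from the text; in practice they would be tuned per business.
ALLOW_BELOW = 0.10
FREEZE_ABOVE = 0.80


def route_transaction(fraud_probability: float) -> Decision:
    """Map a predicted fraud probability to one of the three outcomes."""
    if not 0.0 <= fraud_probability <= 1.0:
        raise ValueError("fraud_probability must be between 0 and 1")
    if fraud_probability < ALLOW_BELOW:
        return Decision.ALLOW
    if fraud_probability > FREEZE_ABOVE:
        return Decision.MANUAL_REVIEW
    # Everything from 10% to 80% (inclusive) gets an extra check.
    return Decision.CHALLENGE


for p in (0.03, 0.45, 0.92):
    print(p, route_transaction(p).value)
```

Keeping the cut-offs as named constants makes it easy to adjust how cautious the system is without touching the model itself.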

Requirements for Payment Fraud Detection with AI-based Methods

To run an AI-driven strategy for Credit Card Fraud Analytics, and to ensure that the model reaches its best detection score, some critical requirements have to be met:

1. Machine Learning models have to be trained with high-quality internal historical data. If there is insufficient data on past fraudulent and normal transactions, the model will perform poorly, because the quality of its training depends on the quality of its inputs.

2. The data used to train the model should be clean and well organized, and should contain both fraudulent and normal transactions, so that the model’s results are not biased.

If the above requirements are met and the business logic matches the machine learning model, there is a very high chance that fraud detection will work satisfactorily.

Advanced Credit Card Fraud Identification Methods

Advanced Credit Card Fraud Identification Methods are split into:

1. Supervised, such as tree-based models (e.g. Decision Trees, Random Forest, and the gradient-boosted XGBoost and LightGBM) and KNN.

2. Unsupervised, such as PCA, LOF, One-class SVM, and Isolation Forest.

Supervised Learning means that a model learns from previous examples and is trained on labeled data. In other words, the dataset has tags that tell the model which patterns are related to fraud and which represent normal behavior. “Banks and payment systems typically accumulate tons of data on different fraudulent schemes that can be used to train a model,” says Alexander Konduforov. Such models are constantly updated and improved to produce accurate results. Unfortunately, they tend to miss new fraud schemes that were not represented in their training data. That’s when unsupervised learning comes into the picture.

Unsupervised Learning, used here for anomaly detection, automatically captures unusual patterns. In this case, training datasets come without any labels or instructions. This approach usually lags behind supervised learning in terms of accuracy, but it is valuable when a business needs to find hidden fraud patterns and useful insights.

As a rule, fraud detection systems combine both approaches, which complement each other.

Our Project

Our Goal

For our project, we worked to create an efficient Machine Learning model that can predict whether a credit card transaction is fraudulent. We made use of a dataset from <https://www.kaggle.com/>.

Our Approach

  • Data cleaning and profiling
  • Exploratory Data Analysis
  • Feature Engineering
  • Machine learning model for inferences
  • Deciding the best model

Data Cleaning and Profiling

For security reasons, the features in the dataset had been anonymized using a Principal Component Analysis (PCA) transformation. The dataset was clean, as it contained no null values. We observed a huge class imbalance, as non-fraudulent transactions greatly outnumbered fraudulent ones. This imbalance was later corrected during feature engineering.

Exploratory Data Analysis

Using statistical tools, we were able to see from a pie chart the percentage of fraudulent transactions (pink) among non-fraudulent ones (tiny blue).

Data Visualization

A correlation check was carried out between the features, and no redundancies were observed (no pair of features was strongly correlated).
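A correlation check of this kind can be sketched in plain Python as follows. The helper names `pearson` and `redundant_pairs`, the 0.9 threshold, and the toy columns are all assumptions made for illustration; the project’s actual tooling and threshold are not stated.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column has no defined correlation
    return cov / sqrt(var_x * var_y)


def redundant_pairs(columns, threshold=0.9):
    """Return feature pairs whose absolute correlation exceeds the threshold."""
    flagged = []
    for (name_a, a), (name_b, b) in combinations(columns.items(), 2):
        r = pearson(a, b)
        if abs(r) > threshold:
            flagged.append((name_a, name_b, round(r, 3)))
    return flagged


# Toy data: "B" is just "A" doubled, so it carries no new information.
toy = {
    "A": [1.0, 2.0, 3.0, 4.0, 5.0],
    "B": [2.0, 4.0, 6.0, 8.0, 10.0],
    "C": [2.0, -1.0, 0.0, 1.0, -2.0],
}
print(redundant_pairs(toy))
```

An empty result from `redundant_pairs` would correspond to the "no redundancies" observation; any flagged pair would be a candidate for dropping one of the two features.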

Feature Engineering

  • To balance the data, we used the imblearn library to upsample the minority class with the SMOTE method until both classes were in equal proportion.
  • An ExtraTreesClassifier model was built and trained on the data sample, with the aim of minimizing overfitting.
  • The importance of each feature was computed and plotted as a bar chart for comparison. We picked the top 20 features for our predictions.
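The idea behind SMOTE can be illustrated with a short standard-library sketch. This is not the imblearn implementation used in the project; `smote_sketch`, its parameters `k` and `seed`, and the toy data are hypothetical. Each synthetic sample is placed at a random point on the line segment between a minority sample and one of its nearest minority neighbours.

```python
import random
from math import dist


def smote_sketch(minority, n_new, k=3, seed=0):
    """Create n_new synthetic minority samples by SMOTE-style interpolation."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base point (excluding itself)
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: dist(base, p),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


# Toy data: 8 normal transactions and only 3 fraudulent ones (2 features each).
normal = [(float(i), float(i % 3)) for i in range(8)]
fraud = [(10.0, 10.0), (11.0, 10.5), (10.5, 11.5)]

# Upsample the minority class until both classes have the same size.
extra = smote_sketch(fraud, n_new=len(normal) - len(fraud), k=2)
fraud_balanced = fraud + extra
print(len(normal), len(fraud_balanced))
```

Because the new points are interpolated rather than copied, the minority class gains variety instead of exact duplicates. Resampling of this kind is normally applied to the training split only, so that evaluation still reflects the real class ratio.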

Fitting different models to dataset

We tried to fit 4 models to the data.

  • Decision Tree Classifier
  • Random Forest Classifier
  • XGBoost Classifier
  • LightGBM Classifier

Because the dataset is imbalanced, accuracy alone is misleading; the more appropriate metric for measuring a model’s success is the Area Under the Receiver Operating Characteristic curve (ROC AUC).
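A small standard-library sketch shows why accuracy can mislead on imbalanced data. The functions `accuracy` and `roc_auc` and the toy labels are illustrative assumptions, not the project’s code. Here ROC AUC is computed through its rank interpretation: the probability that a randomly chosen fraudulent transaction receives a higher score than a randomly chosen normal one, with ties counted as one half.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def roc_auc(y_true, scores):
    """ROC AUC via pairwise comparison of positive and negative scores."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy labels: 990 normal transactions (0) and 10 fraudulent ones (1).
y_true = [0] * 990 + [1] * 10

# A "lazy" model that gives every transaction the same score and label.
lazy_scores = [0.0] * 1000
lazy_pred = [0] * 1000

print(accuracy(y_true, lazy_pred))   # 0.99, although no fraud is caught
print(roc_auc(y_true, lazy_scores))  # 0.5, no better than random ranking
```

The lazy model looks excellent on accuracy but the AUC exposes that it cannot separate the two classes at all.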

Machine Learning Model

The best model was found to be the RandomForestClassifier. The area under the ROC curve was used as the metric because of the class imbalance in the dataset.

Based on our data analysis, we found that linear models could not predict the outcome effectively. We therefore applied Decision Tree, Random Forest, XGBoost, and LightGBM classifiers to the data, and concluded that the best model for our dataset is the Random Forest classifier, which had the best area under the ROC curve.

Challenges

  • Presence of PCA-transformed (anonymized) data
  • Not knowing what the actual features are
  • Time constraints
  • Team management

Recommendation

  • Knowing the actual features would help a lot in further analysis and optimization of the model.
  • Statistics can also help: for example, a probability threshold can be used so that an alert is raised when the model is even 60% sure that a transaction might be fraudulent.
  • A Random Forest classifier is a strong choice of model for this kind of prediction, as it performed best in our comparison.

CONCLUSION

Credit card fraud is a menace to the entire card industry, and it continues to grow as electronic transfer of money becomes increasingly popular. However, the implementation of advanced credit card fraud detection and prevention methods by credit issuers can help curb the criminal activities that lead to identity theft, the loss of billions of dollars annually, the loss of customers’ trust and loyalty, and more. Machine Learning-based methods can continuously improve the accuracy of fraud prevention based on information about each cardholder’s behavior.

REFERENCES

Maniraj S P September 2019, ResearchGate GmbH, accessed 10th November 2021, <https://www.researchgate.net/publication/336800562_Credit_Card_Fraud_Detection_using_Machine_Learning_and_Data_Science>

SPD-Group 2021, SPD-Group, accessed 10th November 2021, <https://spd.group/machine-learning/credit-card-fraud-detection/#What_is_Credit_Card_Fraud_Detection>

altexsoft 2021, altexsoft, accessed 10th November 2021, <https://www.altexsoft.com/blog/credit-card-fraud-detection/>
