Using Machine Learning for credit card fraud detection

As more of daily life moves online, cybersecurity is becoming a crucial part of it. One of the hardest tasks in digital security is detecting unusual activity. Many people prefer to pay by credit card for online transactions, in part because a card's credit limit lets them make purchases even when they do not have the funds available at the moment. Cyber attackers, on the other hand, take advantage of these features.

Our case study

The program that we built takes as input a dataset of credit card transactions made by European cardholders in September 2013. The dataset covers a two-day period and contains 284,807 transactions, of which 492 are fraudulent. Its features are purely numerical because a PCA (Principal Component Analysis) transformation had already been applied to the dataset.
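To make the class imbalance concrete, here is a minimal arithmetic sketch using only the two counts quoted above. The variable names are illustrative, and the "all-valid baseline" is an added point of comparison rather than something the program computes.

```python
# Counts quoted in the text above.
total_transactions = 284_807
fraud_transactions = 492

valid_transactions = total_transactions - fraud_transactions
fraud_fraction = fraud_transactions / total_transactions

print(f"valid transactions: {valid_transactions}")
print(f"fraud share: {fraud_fraction:.4%}")

# A detector that labels every transaction as valid would still reach this
# accuracy, which is why accuracy alone says little on data this imbalanced.
baseline_accuracy = valid_transactions / total_transactions
print(f"all-valid baseline accuracy: {baseline_accuracy:.4%}")
```

Because fraud makes up well under one percent of the transactions, precision and recall on the fraud class are more informative than raw accuracy.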

First, the program shows some statistical information about the dataset: the number of fraudulent and valid transactions, the amount of each type of transaction in dollars, and the time of day at which fraudulent and valid transactions took place. Since the dataset has been transformed with PCA, the statistical part ends with a correlation matrix that shows the relationship between each pair of features.
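The statistics described above could be computed roughly as follows. This is a sketch, not the program's actual code: it runs on a small synthetic table, and the column names (Time, V1, V2, Amount, Class), the use of pandas, and the encoding of Time as seconds are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 1000

# Synthetic stand-in for the real data; every column name here is an assumption.
df = pd.DataFrame({
    "Time": rng.uniform(0, 2 * 24 * 3600, n),     # seconds within a two-day window
    "V1": rng.normal(size=n),                     # placeholder for a PCA component
    "V2": rng.normal(size=n),                     # placeholder for a PCA component
    "Amount": rng.exponential(scale=80.0, size=n),
    "Class": (rng.random(n) < 0.02).astype(int),  # 1 = fraud, 0 = valid
})

# Number of valid and fraudulent transactions.
counts = df["Class"].value_counts()

# Summary statistics of the transaction amount for each class.
amount_stats = df.groupby("Class")["Amount"].describe()

# Hour of day (0-23) of each transaction, then transactions per class and hour.
df["Hour"] = ((df["Time"] // 3600) % 24).astype(int)
per_hour = df.groupby(["Class", "Hour"]).size()

# Pairwise Pearson correlation between all numeric columns.
corr = df.corr()

print(counts, amount_stats, corr.round(2), sep="\n\n")
```

On real data, the row and column for the label in the correlation matrix are usually the most interesting part, since they show which features move together with fraud.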

Afterward, four different machine learning algorithms are applied with the help of the sklearn (scikit-learn) Python library: Isolation Forest, Support Vector Machine, Random Forest Classifier, and Local Outlier Factor. Each works on a different principle. For computational reasons, the first three use only 5000 samples from the dataset. This sample is then split into training data and testing data with train_test_split(), a function provided by sklearn.
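The sampling and splitting step for the first three algorithms might look like the sketch below. It uses synthetic data from make_classification in place of the real transactions, and the 80/20 split, the stratification, the random seeds, and all model hyperparameters are assumptions rather than the program's actual settings.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import IsolationForest, RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic, imbalanced stand-in for the transaction data (1 = fraud).
X_all, y_all = make_classification(
    n_samples=20_000, n_features=10, n_informative=5,
    weights=[0.98, 0.02], random_state=0,
)

# Draw 5000 rows at random, as the text describes.
rng = np.random.default_rng(0)
idx = rng.choice(len(X_all), size=5000, replace=False)
X, y = X_all[idx], y_all[idx]

# stratify=y keeps the fraud share similar in the training and testing parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# The two supervised models learn from the training labels.
models = {
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "SVM": SVC(kernel="rbf", gamma="scale"),
}
predictions = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    predictions[name] = model.predict(X_test)

# Isolation Forest is unsupervised: fit() ignores labels, and predict()
# returns -1 for outliers and 1 for inliers, so map -1 to the fraud label 1.
iso = IsolationForest(contamination=float(y_train.mean()), random_state=0)
iso.fit(X_train)
predictions["Isolation Forest"] = (iso.predict(X_test) == -1).astype(int)

for name, y_pred in predictions.items():
    print(f"{name}: accuracy {(y_pred == y_test).mean():.2%}")
```

Stratifying the split is a design choice worth considering here: with so few fraud cases, an unstratified split of a small sample can leave the test set with almost none.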

The training data includes the label stating whether or not each transaction is a fraud, whereas the testing data omits it. The Local Outlier Factor, on the other hand, takes in 5121 data samples and does not split them into training and testing sets.

Instead, this algorithm compares each transaction with its “neighbors”, the transactions closest to it, and selects as outliers those that look strange relative to their neighbors. After each algorithm runs, the accuracy of the detection and the actual fraud-to-valid ratio are reported as percentages. Additionally, a classification report is presented, showing the precision, recall, and f1-score of what the algorithm selected as fraud against the known labels. Moreover, each algorithm is timed in order to check its efficiency on this dataset.
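The Local Outlier Factor step, together with the accuracy, classification report, and timing described above, can be sketched as follows. The data is synthetic (a dense cluster of "valid" points plus a few scattered "fraud" points), and n_neighbors, the contamination value, and the use of time.perf_counter() are assumptions, not the program's actual settings.

```python
import time

import numpy as np
from sklearn.metrics import accuracy_score, classification_report
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
n_valid, n_fraud = 980, 20

# Synthetic stand-in: valid points form a dense cluster, fraud points are scattered.
X = np.vstack([
    rng.normal(0.0, 1.0, size=(n_valid, 5)),
    rng.uniform(-8.0, 8.0, size=(n_fraud, 5)),
])
y_true = np.array([0] * n_valid + [1] * n_fraud)  # 1 = fraud

# LOF compares the local density around each point with the density around
# its nearest neighbors; points in much sparser regions are flagged.
lof = LocalOutlierFactor(n_neighbors=20, contamination=n_fraud / len(X))

start = time.perf_counter()
raw = lof.fit_predict(X)  # no train/test split: -1 = outlier, 1 = inlier
elapsed = time.perf_counter() - start

y_pred = (raw == -1).astype(int)  # map outliers to the fraud label

print(f"accuracy: {accuracy_score(y_true, y_pred):.2%}")
print(f"actual fraud share: {y_true.mean():.2%}")
print(classification_report(
    y_true, y_pred, target_names=["valid", "fraud"], zero_division=0
))
print(f"elapsed: {elapsed:.3f} s")
```

Note that the labels are used only for scoring; the detector itself never sees them.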

Overall, the program is meant to be run on different datasets in which the fraudulent transactions are already known. Then, when given a new dataset whose transactions are not labeled as valid or fraudulent, it should be able to detect most of the fraudulent transactions.

Challenges  

One challenge that we faced was the computational speed of the algorithms; another was the scarcity of credit card transaction datasets.

Solution 

We reduced the dataset to a smaller sample to address the speed problem. The precision measured on this sample, however, was higher than that of the actual detection.

Output 

The program outputs the statistical information about the dataset, the accuracy of fraud detection compared to the actual fraud cases (as a percentage), the classification report, and the time in seconds that each step takes to compute.