Hello dear reader, today we will be doing financial fraud detection with two major machine learning algorithms, logistic regression and random forest. In addition, we will also use a feed-forward artificial neural network (ANN) and compare it to the ML models.
As financial systems become increasingly digitized, the sophistication of fraudulent activities also escalates, necessitating advanced detection mechanisms. This article delves into the strategies and technologies employed to safeguard financial integrity, exploring how artificial intelligence, machine learning, and data analytics are revolutionizing the detection of fraudulent activities.
Download the dataset here https://www.kaggle.com/datasets/ealaxi/paysim1
Let's import our libraries and load the dataset.
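Here is a minimal setup sketch with pandas. The CSV filename is whatever the Kaggle download gives you (commonly PS_20174392719_1491204439457_log.csv), so adjust it to your local copy.

```python
import pandas as pd

# Load the PaySim transactions; update the filename to match your Kaggle download.
df = pd.read_csv("PS_20174392719_1491204439457_log.csv")

# Peek at the first few rows.
print(df.head())
```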
If you have run the code, you can see the head of the dataset and the columns available. Let us view the columns and their data types.
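A quick way to do this with pandas, continuing from the DataFrame `df` loaded above:

```python
# Show column names, non-null counts, and dtypes in one summary.
df.info()
```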
We will go ahead with data preprocessing. Our dataset has three categorical columns: ‘type’, ‘nameOrig’, and ‘nameDest’. In my view, ‘nameOrig’ and ‘nameDest’ have no impact on determining whether a transaction is fraudulent for our simple model, so I opted to ignore them. For the ‘type’ column, we will check the unique values and map them to numbers.
We have seen that the ‘type’ column has five unique values, so let us map them to numerical values. Remember, we are doing this because machine learning models do not understand categorical features directly. I will assign PAYMENT as 1, TRANSFER as 2, and so on.
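A possible mapping, assuming the five PaySim transaction types; only PAYMENT = 1 and TRANSFER = 2 follow from the text above, the remaining codes are my own choice:

```python
# Check which transaction types exist.
print(df["type"].unique())
# Expected: ['PAYMENT' 'TRANSFER' 'CASH_OUT' 'DEBIT' 'CASH_IN']

# Replace each type with an integer code.
type_map = {"PAYMENT": 1, "TRANSFER": 2, "CASH_OUT": 3, "DEBIT": 4, "CASH_IN": 5}
df["type"] = df["type"].map(type_map)
```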
Now we are done with the categorical columns, since we said we would ignore the remaining two (you can also choose to drop them). Next, let us check for missing values and handle them.
Since we do not have many missing values, we will simply drop the rows that contain them.
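One way to do this with pandas:

```python
# Count missing values per column, then drop any rows that contain them.
print(df.isnull().sum())
df = df.dropna()
```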
For good practice, we will go ahead and drop ‘nameDest’, ‘nameOrig’, and also ‘isFlaggedFraud’ so that we can train our model.
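A one-line sketch for the column drops, using the PaySim column names:

```python
# Remove the account identifiers and the isFlaggedFraud column before training.
df = df.drop(columns=["nameOrig", "nameDest", "isFlaggedFraud"])
```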
If you look at our dataset, you can see that the values in different columns vary widely in scale: some columns have values above 20,000 while others range only from 0 to 5. This is not good for a model. So what do we do?
Feature scaling. This is like giving all your variables a fair shot by putting them in the same numerical ballpark. With our dataset, we will use standardization: transforming each feature to have a mean of zero and a standard deviation of one. It's like making sure everyone's height is measured from the same ground level.
Let us apply the scaler and take a quick look at our data now.
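A minimal sketch of the scaling step with scikit-learn's StandardScaler, assuming the standard PaySim column names; the isFraud target is deliberately left unscaled.

```python
from sklearn.preprocessing import StandardScaler

# Standardize the feature columns to zero mean and unit variance.
feature_cols = ["step", "type", "amount", "oldbalanceOrg", "newbalanceOrig",
                "oldbalanceDest", "newbalanceDest"]
scaler = StandardScaler()
df[feature_cols] = scaler.fit_transform(df[feature_cols])

# Quick look at the scaled data.
print(df.head())
```

Strictly speaking, the scaler should be fit on the training split only and then applied to the other splits to avoid leakage; the sketch above follows the simpler flow of this walkthrough.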
Now we will select the features and the target, split our data into train and test sets, and then split the test portion again into a validation set and a holdout set for the final prediction. We keep the holdout aside so that the model never sees it; when we later predict on the holdout data, the model treats it as genuinely new data.
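A sketch of the two-stage split with scikit-learn; the 60/20/20 proportions and the random_state are assumptions, since exact ratios are not stated. Stratifying on the target keeps the rare fraud class represented in every split.

```python
from sklearn.model_selection import train_test_split

# Features are every column except the target.
X = df.drop(columns=["isFraud"])
y = df["isFraud"]

# First split: 60% training data, 40% test pool.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=42, stratify=y)

# Second split: divide the test pool into a validation set and an untouched holdout set.
X_val, X_holdout, y_val, y_holdout = train_test_split(
    X_test, y_test, test_size=0.5, random_state=42, stratify=y_test)
```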
Now we will import our models.
Logistic Regression Model
We will start with the logistic regression model and then see its results.
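A minimal training-and-evaluation sketch, assuming plain scikit-learn defaults (max_iter is raised only so the solver converges on this large dataset):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Fit logistic regression on the training data.
log_reg = LogisticRegression(max_iter=1000)
log_reg.fit(X_train, y_train)

# Evaluate on the validation split.
y_pred = log_reg.predict(X_val)
print("Accuracy :", accuracy_score(y_val, y_pred))
print("Precision:", precision_score(y_val, y_pred))
print("Recall   :", recall_score(y_val, y_pred))
print("F1 Score :", f1_score(y_val, y_pred))
```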
We can see that our model did not quite perform well despite having 99% accuracy. To break the metrics down:
Accuracy: 0.999 (almost perfect, but might be misleading if data’s imbalanced)
Precision: 0.858 (85.8% of your positive predictions were correct)
Recall: 0.314 (Only caught 31.4% of actual positives)
F1 Score: 0.46 (Balances precision and recall, shows model’s effectiveness when class imbalance exists)
High accuracy with low recall suggests our model might be great at predicting negatives but misses a lot of positives.
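If you want to see this effect directly, a confusion matrix on the validation predictions makes the imbalance obvious:

```python
from sklearn.metrics import confusion_matrix

# Rows are true classes, columns are predicted classes: the top-left cell (legitimate
# transactions predicted as legitimate) dominates, which is why accuracy alone looks perfect.
print(confusion_matrix(y_val, y_pred))
```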
Random Forest Model
Let's now train the random forest and predict on the holdout test data.
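A sketch of this step, assuming a RandomForestClassifier (the reported precision and recall are classification metrics) with 100 trees and a fixed random_state:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Fit the random forest on the training data.
rf = RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1)
rf.fit(X_train, y_train)

# Final evaluation on the untouched holdout set.
y_holdout_pred = rf.predict(X_holdout)
print("Accuracy :", accuracy_score(y_holdout, y_holdout_pred))
print("Precision:", precision_score(y_holdout, y_holdout_pred))
print("Recall   :", recall_score(y_holdout, y_holdout_pred))
print("F1 Score :", f1_score(y_holdout, y_holdout_pred))
```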
Accuracy: 0.999 (Extremely high, model predicts almost perfectly)
Precision: 0.992 (Of the things it says are positive, it’s right 99.2% of the time)
Recall: 0.735 (It catches 73.5% of all actual positives)
F1 Score: 0.844 (Balances precision and recall, indicating good overall performance but with room for improvement in recall)
Our model is very accurate and precise but still misses about 26.5% of actual positive cases.
From this, we can conclude that the random forest model performed better.
As I wind up, the future of financial integrity lies in our ability to adapt and advance detection methodologies faster than fraudsters can devise new schemes.
Access my code notebook for the models here https://colab.research.google.com/drive/11zpTV1rIDcFTLfUJ8hlEpPJQFUCZ6M9g?usp=sharing
Thank you for leaving a clap.
Have a nice time, happy learning!