I had an interesting conversation on Twitter/X recently. It was about security in the world of AI. You can check it out here https://x.com/sahilmalik/status/1960549394425520621. This is, in fact, something that I've been thinking about for a long time. I've been a developer for a few decades now, and although I've watched productivity increase with every new layer of abstraction, none of those layers scares me as much as artificial intelligence.

We are almost turning from developers into psychologists, where we give computers instructions in plain English and the computer generates a lot of code for us. How good is this code, really? There are some really smart people on the other side who are looking for the tiniest vulnerability in your code, one that you may not have even thought of.

I'm used to a world where developers spend a lot of time polishing every bit of their code, even the parts nobody sees, making it nearly perfect so there are no security holes. And yet hackers smarter than they are end up finding those holes anyway.

How is this picture going to change in a world where developers are using artificial intelligence to generate a lot of code? It's becoming increasingly clear to me that the way to fight the security battle in this artificial intelligence world is with artificial intelligence.

In practice, I don't see this happening. Security always seems to be an afterthought, and productivity seems to be the forethought. Security is important, and in this AI world, it's even more important. So, I thought I would write an article on this topic.

Productivity seems to be a priority with security as an afterthought.

This landscape of security using artificial intelligence, or, frankly, security as it applies to artificial intelligence, is a vast topic in itself. I always strive to give you working examples, something you can try right away, and that will be my focus in this article as well.

In this article, I'm going to walk you through building a simple, entirely local anomaly detection system using Python. You'll simulate system log data, train an Isolation Forest model, and use it to identify suspicious activities. The best part? Everything runs right on your Mac, no cloud services required. Of course, as is the theme of many of my articles, you can leverage cloud-based services to greatly enhance these capabilities. But the code here works 100% locally, which gives you isolation and control advantages and, frankly, sometimes cost advantages too.

Everything I'm about to show in this article will run on macOS. You could also make things work on a Windows machine, but I developed this codebase on a Mac. You'll need Python 3 and a relatively beefy modern processor. That's about it.

With that, let's get started.

Why Isolation Forest for Anomaly Detection?

I chose Isolation Forest for anomaly detection for several reasons.

First, it's extremely efficient, even with large datasets. It's efficient primarily because of its linear time complexity and low memory requirements. This is essential because I intend to run everything locally.

Secondly, Isolation Forest is particularly effective for high-dimensional data because it doesn't rely on distance or density metrics to detect anomalies. Many traditional anomaly detection algorithms, like k-Nearest Neighbors or Local Outlier Factor (LOF), rely on calculating distances between data points. In high-dimensional spaces, this approach suffers from the curse of dimensionality, where the distance between all data points becomes nearly uniform, making it impossible to identify meaningful neighbors or density variations. Isolation Forest bypasses this problem by using a tree-based approach.

Finally, Isolation Forest doesn't require labeled data. It's an unsupervised learning algorithm, meaning it doesn't need pre-labeled “normal” or “malicious” data to learn. It simply identifies data points that are “different.” You can literally throw the code I'm going to show in this article at any dataset and it will find anomalies for you.
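To make that “no labels” point concrete, here's a minimal sketch (separate from the project files you'll build below) that fits an Isolation Forest on a plain NumPy array. Notice that fit_predict receives only the data itself; there's no label column anywhere.

import numpy as np
from sklearn.ensemble import IsolationForest

# 200 ordinary points plus a handful of obvious outliers, no labels anywhere
X = np.vstack([np.random.normal(0, 1, (200, 2)),
               np.random.normal(8, 1, (5, 2))])

labels = IsolationForest(contamination=0.03,
                         random_state=42).fit_predict(X)
print((labels == -1).sum(), "points flagged as anomalies")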

Setting Up Your Environment

In order to follow along with this article, I expect that you're at least a basic Python developer. I'm not going to explain basics like pip or installing pip packages. I'm also going to assume familiarity with an IDE like VS Code.

With that out of the way, set up a Python project for yourself and open it in VS Code. Create a virtual environment in it as well.
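If you need a quick reminder, here's one way to create and activate a virtual environment from the terminal on macOS; the folder name .venv is simply my preference:

python3 -m venv .venv
source .venv/bin/activate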

In this virtual environment, you're going to need certain dependencies. The contents of my requirements.txt can be seen below:

pandas
scikit-learn
numpy
matplotlib
seaborn

With your requirements.txt saved, install your requirements using the command below:

pip install -r requirements.txt

Excellent, now you're going to need some test data.

Creating Test Data

The code I'm about to show will work on any input data. One of the challenges of writing in the form of a published article is figuring out what data I can show you that's somewhat real world and also doesn't intrude on anybody's intellectual property.

For that reason, I decided to mimic a real-world scenario using some sample data that I'm going to generate next. In the real world, you'd ingest actual system logs, such as authentication logs, network traffic, process data, etc.

For our purposes, let's generate some synthetic data that mimics normal user activity and let's inject some anomalies for fun.

My sample log file contains the following columns:

  • login_attempts_per_min: Number of login attempts per minute
  • files_accessed_per_min: Number of unique files accessed per minute
  • cpu_usage_avg: Average CPU usage, as a percentage
  • network_out_mb: MB of data sent over the network
  • process_count: Number of active processes

Create a file called generate_data.py and let's start writing some code to generate this data. To start with, add some imports at the top of this file, as shown below:

import pandas as pd
import numpy as np
import random

The pandas library is a powerful tool for working with tables of data, which are called DataFrames. The NumPy library is used for advanced math and for creating arrays of numbers, which is great for generating large amounts of data. Python's built-in random module is imported alongside them, although in this script the random numbers come from NumPy and the shuffling you'll see later is handled by pandas.

Next let's create some normal data. The code for generating normal data can be seen in Listing 1. As you can see from Listing 1, the generate_normal_data function acts like a factory for typical system activity. It creates a dictionary (a set of key-value pairs) where each key is a feature of system behavior, like login_attempts_per_min or cpu_usage_avg. The values are lists of random numbers generated by NumPy that fall within a normal, expected range. For example, it creates 1,000 data points where the cpu_usage_avg is a random number between 5% and 40%. The function then converts this dictionary into a pandas DataFrame, which is essentially a spreadsheet-like table.

Listing 1: Normal data generation code

def generate_normal_data(num_samples=1000):
    """Generates synthetic 'normal' system activity data."""
    data = {
        'login_attempts_per_min': np.random.randint(0, 3, num_samples),
        'files_accessed_per_min': np.random.randint(1, 20, num_samples),
        'cpu_usage_avg': np.random.uniform(5, 40, num_samples),
        'network_out_mb': np.random.uniform(0.1, 5.0, num_samples),
        'process_count': np.random.randint(30, 80, num_samples)
    }
    return pd.DataFrame(data)

Now, in a similar vein, let's write up a method to generate anomalous data. This can be seen in Listing 2. As you can see from Listing 2, the generate_anomalous_data function is similar to generate_normal_data, but it creates data that's unusual or suspicious. It generates a smaller number of data points (20 by default) where the values are outside the normal range. For instance, cpu_usage_avg is set to a high value between 80% and 100%, and network_out_mb is also very high. These high values simulate a potential security breach or a system error, making them anomalies. This is crucial for training an anomaly detection model later.

Listing 2: Anomalous data generation code

def generate_anomalous_data(num_anomalies=20):
    """Generates synthetic 'anomalous' system activity data."""

    anomalies = {
        # Many failed logins
        'login_attempts_per_min': np.random.randint(10, 50, num_anomalies),
        # Accessing many files
        'files_accessed_per_min': np.random.randint(50, 200, num_anomalies),
        # High CPU
        'cpu_usage_avg': np.random.uniform(80, 100, num_anomalies),
        # High network egress
        'network_out_mb': np.random.uniform(50.0, 200.0, num_anomalies),
        # Many processes
        'process_count': np.random.randint(100, 200, num_anomalies)
    }
    return pd.DataFrame(anomalies)

Now that you have functions to generate normal and anomalous data, let's use the __main__ block to generate the data and mix it up. This can be seen in Listing 3. As you can see from Listing 3, the __main__ block first calls the two functions created earlier to get a table of normal data and a table of anomalous data. It then uses pd.concat to stack the two tables on top of each other, creating one large table. The combined table is then shuffled randomly using .sample(frac=1) so that the normal and anomalous data are mixed together, making the dataset more realistic.

Listing 3: Generating, mixing up, and saving data

if __name__ == "__main__": 
    normal_df = generate_normal_data()
    anomalous_df = generate_anomalous_data()

    # Combine and shuffle for a realistic dataset
    combined_df = pd.concat([normal_df, anomalous_df], 
        ignore_index=True)
    combined_df = combined_df.sample(frac=1).reset_index(
        drop=True) # Shuffle rows

    # Save to CSV
    combined_df.to_csv("system_activity_logs.csv", index=False)
    print(f"Generated {len(normal_df)} normal samples "
          f"and {len(anomalous_df)} anomalies.")
    print("Data saved to system_activity_logs.csv")
    print("\nFirst 5 rows of generated data:")
    print(combined_df.head())

Finally, combined_df.to_csv("system_activity_logs.csv", index=False) saves the complete dataset into a file named system_activity_logs.csv. The index=False part just means that it doesn't add an extra column with row numbers to the file.

Run this file using the following command:

python ./generate_data.py

If all is correct so far, you should see some random data generated in the system_activity_logs.csv file, and sample output, as can be seen in Figure 1.

Figure 1: Sample data is ready to go.

Building the Anomaly Detection System

Nicely done so far, if you're following along. Now that you have some sample data ready to go, let's write a simple anomaly detection system. As I mentioned earlier, you're going to write some code that uses a machine learning algorithm called Isolation Forest to find anomalies (unusual data points) in a dataset of system activity logs. The main goal is to spot suspicious activities, like a sudden spike in CPU usage or network traffic, that don't fit the normal pattern.

There will be four main parts to this code. The first part is going to load the data. The second part is going to train the model. Then you'll predict anomalies. Finally, you'll visualize the results. And all this will be coordinated with a main method.

Create a new file called detect_anomalies.py.

First, let's get the imports out of the way.

import pandas as pd
import numpy as np
from sklearn.ensemble import IsolationForest
import matplotlib.pyplot as plt
import seaborn as sns

The two new libraries you see here are matplotlib and seaborn. You're going to use them to perform data visualization. The primary purpose is to create informative and visually appealing statistical graphics, simplifying the process of exploring and presenting data.

Next, let's define some constants at the top.

DATA_FILE = "system_activity_logs.csv"
# For saving/loading the model
MODEL_PATH = "isolation_forest_model.pkl"
# Expected proportion of anomalies in the dataset (2%)
CONTAMINATION_RATE = 0.02

Now let's perform the first step, which is to load the data:

df = pd.read_csv(DATA_FILE)

Okay, that was simple enough. Let's move on to the second step, where you train the model. The code for training the model can be seen in Listing 4. As can be seen in Listing 4, the train_model function creates and trains the Isolation Forest model. Think of this model as a smart detective that learns what's considered normal behavior based on the data you provide. The contamination setting tells the detective to expect a certain percentage (in this case, 2%) of the data to be anomalies, which helps it set a good baseline for what to look for. You can tweak this parameter to make your model more or less sensitive to anomalies.

Listing 4: Training the model

def train_model(df, contamination=CONTAMINATION_RATE):
    print("Training Isolation Forest model...")
    model = IsolationForest(contamination=contamination, random_state=42)
    model.fit(df)
    print("Model training complete.")
    return model

Now that the model is trained, the next step is to use it to predict anomalies. This can be seen in Listing 5. As you can see from Listing 5, the predict_anomalies function uses the trained model to go through the data and make a prediction for each data point. For each activity log, the model assigns an anomaly score, which measures how “weird” or isolated that data point is. It then gives a final label: -1 for an anomaly and 1 for a normal data point.

Listing 5: Predict anomalies

def predict_anomalies(model, df):
    """Predicts anomalies using the trained model."""
    print("Predicting anomalies...")

    # Create a copy of the DataFrame to add new columns to
    df_with_predictions = df.copy()

    # IF returns -1 for anomalies and 1 for normal
    df_with_predictions['anomaly_score'] = model.decision_function(df)
    df_with_predictions['is_anomaly'] = model.predict(df)

    return df_with_predictions

This seems almost too simple, doesn't it? Now let's add some visualization to this data, so it looks like I did something smart and intelligent in this article. As you can see in Listing 6, the code prints out the number of anomalies found and lists some of the most unusual ones. The visualize_results function then creates two helpful graphs.

Listing 6: Visualizing the data

def visualize_results(df):
    print("Generating visualizations...")

    plt.figure(figsize=(16, 6))

    # Scatter plot of two features, colored by anomaly status
    plt.subplot(1, 2, 1)
    sns.scatterplot(
        x='cpu_usage_avg',
        y='network_out_mb',
        hue='is_anomaly',
        palette={1: 'green', -1: 'red'},
        data=df, s=100, alpha=0.7)
    plt.title('CPU Usage vs. Network Out (Anomalies in Red)')
    plt.xlabel('Average CPU Usage (%)')
    plt.ylabel('Network Out (MB)')

    # Distribution of Anomaly Scores
    plt.subplot(1, 2, 2)
    sns.histplot(df['anomaly_score'], kde=True, bins=50)
    plt.axvline(
        x=df[df['is_anomaly'] == -1]['anomaly_score'].max(),
        color='red',
        linestyle='--',
        label='Anomaly Threshold')
    plt.title('Distribution of Anomaly Scores')
    plt.xlabel('Anomaly Score')
    plt.ylabel('Count')
    plt.legend()

    plt.tight_layout()
    plt.show()

The first is a scatter plot that plots two key features (e.g., CPU usage vs. network traffic) and colors the unusual points red so you can easily see them.

The second is the anomaly score distribution that shows how many data points have each score. You can see a clear line that separates the few data points with very low (anomalous) scores from the large group of normal ones.

Now let's put all this together in a main() function to load the data, train the model, predict using the model, and visualize the results. This can be seen in Listing 7.

Listing 7: Putting together the anomaly detection logic

def main():
    # 1. Load Data
    try:
        df = pd.read_csv(DATA_FILE)
        print(f"Loaded data from {DATA_FILE}. Shape: {df.shape}")
    except FileNotFoundError:
        print(f"Error: {DATA_FILE} not found.")
        return

    # 2. Train Model
    model = train_model(df)

    # 3. Predict Anomalies
    # Pass the original DataFrame here
    df_results = predict_anomalies(model, df)

    # 4. Display Anomalies
    anomalies = df_results[df_results['is_anomaly'] == -1]
    normal_data = df_results[df_results['is_anomaly'] == 1]

    print(f"\nFound {len(anomalies)} anomalies "
          f"and {len(normal_data)} normal data points.")
    if not anomalies.empty:
        print("\n--- Detected Anomalies ---")

        # Show top 5 most anomalous
        print(anomalies.sort_values(by='anomaly_score').head())

        # here you can implement an alerting mechanism
        # for _, row in anomalies.iterrows():send_alert(..)

    else:
        print("\nNo anomalies detected.")

    # 5. Visualize Results
    visualize_results(df_results)

# Entry point: without this, running the script from the command line does nothing
if __name__ == "__main__":
    main()

Run the anomaly detection code as follows:

python ./detect_anomalies.py

This should produce an output, as can be seen in Figure 2.

Figure 2: Anomaly detection in action.

Because you added a visual to the anomalies, you can see that in action in Figure 3. You'll see output indicating the model training, prediction, and a summary of detected anomalies. Specifically, you can see two plots in Figure 3: a scatter plot showing your data points with detected anomalies highlighted in red, and a histogram of anomaly scores with a line indicating the threshold where points are considered anomalous. This visual feedback is incredibly helpful for understanding why certain points were flagged.

Finally, if you observe Listing 7 closely, you'll also see a commented line where you could send alerts on the anomalies to an interested party.

Figure 3: Visualization of the anomalies.

Taking This Further

It's quite incredible that without much difficulty or code, I was able to produce an anomaly detection engine that could work against almost any data. Throw any logs at this code, like Entra ID sign-in logs, and see if it detects anomalies. You could take this much further in the real world; after the short sketch below, I'll list some enhancements or tweaks you could make.
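As a rough illustration of what “any data” means here, the snippet below assumes a hypothetical export called signin_logs.csv. The only real requirement is that the model sees numeric columns, so everything non-numeric is dropped before training:

import pandas as pd
from sklearn.ensemble import IsolationForest

# Hypothetical export; substitute whatever log file you actually have
logs = pd.read_csv("signin_logs.csv")

# Isolation Forest needs numeric features, so keep only those columns
numeric_logs = logs.select_dtypes(include="number").dropna()

model = IsolationForest(contamination=0.02, random_state=42)
predictions = model.fit_predict(numeric_logs)

# Show the rows the model considered unusual
print(numeric_logs[predictions == -1].head())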

You could adjust the CONTAMINATION_RATE. The contamination parameter in Isolation Forest is crucial. It's your estimate of the proportion of anomalies in your dataset. If you set it too high, you might flag normal data as anomalous (false positives). If you set it too low, you might miss real threats (false negatives). Experiment with this value to fine-tune your model.
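A quick way to get a feel for that sensitivity is to rerun the prediction at a few different contamination values against the same CSV and compare how many rows get flagged. This is only a sketch for experimentation, not something you'd ship:

import pandas as pd
from sklearn.ensemble import IsolationForest

df = pd.read_csv("system_activity_logs.csv")

# Compare how many rows get flagged at different sensitivity levels
for rate in (0.01, 0.02, 0.05, 0.10):
    model = IsolationForest(contamination=rate, random_state=42)
    flagged = (model.fit_predict(df) == -1).sum()
    print(f"contamination={rate}: {flagged} rows flagged")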

You could look into feature engineering. In a real scenario, you'd spend significant time on feature engineering. Instead of raw counts, you might calculate ratios (e.g., failed_logins/total_logins), rates of change, or aggregate data over different time windows.
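As a sketch of the idea, using pandas and some hypothetical column names (timestamp, failed_logins, total_logins, plus the CPU and network columns from earlier), derived features might look something like this:

import pandas as pd

# Hypothetical raw log with a timestamp and per-interval counters
raw = pd.read_csv("raw_auth_logs.csv", parse_dates=["timestamp"])
raw = raw.set_index("timestamp").sort_index()

features = pd.DataFrame(index=raw.index)

# A ratio instead of raw counts (clip guards against division by zero)
features["failed_login_ratio"] = (
    raw["failed_logins"] / raw["total_logins"].clip(lower=1))

# Rate of change of network egress between consecutive samples
features["network_out_delta"] = raw["network_out_mb"].diff().fillna(0)

# Rolling average over a 10-sample window to smooth spiky CPU readings
features["cpu_rolling_avg"] = (
    raw["cpu_usage_avg"].rolling(window=10, min_periods=1).mean())

You'd then train the Isolation Forest on this features table rather than on the raw counts.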

You could add alerting mechanisms. In the main function, I've left a commented-out section for send_alert. In a production system, this could integrate with a messaging service such as Slack or PagerDuty, a SIEM (Security Information and Event Management) system, or simply log to a persistent file.
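Here's one minimal shape send_alert could take. This sketch just records anomalies to a local file with Python's standard logging module; the real integration with Slack, PagerDuty, or a SIEM is left as a comment because those details depend entirely on your environment:

import logging

# Persist alerts to a local file; swap this handler for your real channel
logging.basicConfig(filename="anomaly_alerts.log",
                    level=logging.WARNING,
                    format="%(asctime)s %(levelname)s %(message)s")

def send_alert(row):
    """Records one anomalous row. In production, replace the body with a
    call to Slack, PagerDuty, or your SIEM of choice."""
    logging.warning("Anomaly detected: %s", row.to_dict())

# In main(): for _, row in anomalies.iterrows(): send_alert(row)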

For a real-world system, you'd probably want a continuous monitoring system, so you wouldn't have to retrain the model every time new data shows up. For that, you'd want to look into model persistence. You'd train it once, save it (joblib.dump(model, MODEL_PATH)), and then load it for predictions (joblib.load(MODEL_PATH)).
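A minimal sketch of that persistence step is shown below. Note that joblib isn't listed in requirements.txt; it comes along as a dependency of scikit-learn, though you can add it to the file explicitly if you prefer:

import joblib
import pandas as pd
from sklearn.ensemble import IsolationForest

MODEL_PATH = "isolation_forest_model.pkl"

# Train once and save the fitted model to disk
df = pd.read_csv("system_activity_logs.csv")
model = IsolationForest(contamination=0.02, random_state=42).fit(df)
joblib.dump(model, MODEL_PATH)

# Later (or in a separate monitoring process), load it and score new data
loaded_model = joblib.load(MODEL_PATH)
new_predictions = loaded_model.predict(df)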

And of course, you could look into leveraging more sophisticated models. For more complex data, you might explore other unsupervised anomaly detection algorithms like Local Outlier Factor (LOF), One-Class SVM, or even deep learning approaches like autoencoders.
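Because scikit-learn keeps a consistent fit/predict interface across these estimators, trying one of them is mostly a one-line swap. Here's a rough sketch against the same CSV; treat the parameters as starting points, not tuned values:

import pandas as pd
from sklearn.neighbors import LocalOutlierFactor
from sklearn.svm import OneClassSVM

df = pd.read_csv("system_activity_logs.csv")

# Local Outlier Factor: density-based, labels come from fit_predict
lof_labels = LocalOutlierFactor(n_neighbors=20,
                                contamination=0.02).fit_predict(df)

# One-Class SVM: learns a boundary around the "normal" region
# (it usually benefits from scaling the features first)
svm_labels = OneClassSVM(nu=0.02, gamma="scale").fit(df).predict(df)

print("LOF flagged:", (lof_labels == -1).sum())
print("One-Class SVM flagged:", (svm_labels == -1).sum())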

Summary

It's simply amazing what we can achieve as developers today. The journey from a simple tweet to a thought to a completely functioning code example took me just a few hours. Converting it into a real-world application that can actively start detecting threats isn't that much of a leap. I've built a fully functional, fundamental anomaly detection system, running entirely on my off-the-shelf Mac!

This project demonstrates how machine learning, even in its simplest forms, can be a powerful ally in the fight against cyber threats. By understanding “normal” and flagging “abnormal,” you're equipped to detect new and evolving dangers that traditional methods might miss. This local setup provides a fantastic sandbox for further experimentation, allowing you to explore the fascinating intersection of ML and security right from your developer workstation.

I'm curious. How are you using AI in your real-world applications? From what I can tell, the true potential of AI is only beginning to be tapped. What do you think?

Until next time, happy coding. Securely of course.