Your SIEM detection rules are breaking silently and continuously.
Unless you're actively testing them, you have no way of knowing which ones still work and which ones stopped firing three months ago after a log source update that nobody thought to validate against.
This isn't a hypothetical. During purple team engagements, I've watched detection rules fail to fire on techniques they were explicitly written to catch. Rules that passed review, worked perfectly when they were deployed, and that the team had full confidence in. The telemetry was there in the logs and the activity was generating events, but the detection logic was no longer matching because something underneath it had changed. A field got renamed. A log format shifted. Someone tuned the rule to cut down on false positives and accidentally scoped out the real attack path in the process.
The part that makes this genuinely dangerous is that nobody notices. False positives are loud. Analysts complain about them, leadership hears about alert fatigue, and resources get allocated to tuning. But false negatives are completely silent. No one submits a ticket that says "hey, I didn't get an alert for that credential dump that just happened." You find out about false negatives during an incident review, during a red team report, or during a purple team engagement when someone on my side of the table asks why a specific rule didn't fire and the answer is that it hasn't been firing for weeks.
Software engineers solved this problem a long time ago. They don't ship code and hope it keeps working. They write automated tests that run on every commit, every merge, every deployment. When a test fails, the build breaks and someone fixes it before the change goes live. If a refactor accidentally changes the behavior of a function that other parts of the system depend on, the test suite catches it immediately. That feedback loop isn't optional in modern software development. It's fundamental.
Detection rules are software. They're logic statements that process data and produce outputs, with dependencies on schema, field names, log source formats, and parsing pipelines. When any of those dependencies change, the rule can break just like a function breaks when its inputs change. And yet most detection engineering programs treat rule deployment as the finish line rather than the starting point.
The concept isn't complicated. You take a detection rule, pair it with a known-good attack simulation that should trigger it, run that simulation against your production SIEM, and check whether the alert actually fires. Did the rule do what it was supposed to do? That's detection regression testing, and it's the feedback loop that most detection programs are missing entirely.
We built an open-source pipeline called sigma-regression-testing to operationalize this. The project ties together Sigma rules, attack emulations, and SIEM alert validation into a single automated workflow. Every detection rule gets mapped to a specific emulation based on the ATT&CK technique ID, the test GUID, and the Sigma rule names that should fire when the atomic runs. Each mapping was validated manually by checking the actual emulation executor command against the detection logic to confirm the test genuinely triggers the rule.
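To make the shape of a mapping concrete, here is a minimal sketch in Python. The structure, field names, GUID, and rule title below are illustrative placeholders, not the repo's actual format:

```python
# Hypothetical shape of a rule-to-emulation mapping; the actual project's format,
# field names, GUIDs, and rule titles may differ.
from dataclasses import dataclass, field

@dataclass
class DetectionTestMapping:
    technique_id: str                 # ATT&CK technique the atomic emulates
    atomic_guid: str                  # GUID of the specific Atomic Red Team test
    expected_rules: list[str] = field(default_factory=list)  # Sigma rules that should fire

mappings = [
    DetectionTestMapping(
        technique_id="T1003.001",     # OS Credential Dumping: LSASS Memory
        atomic_guid="00000000-0000-0000-0000-000000000000",  # placeholder GUID
        expected_rules=["LSASS Memory Dump via ProcDump"],   # hypothetical rule title
    ),
]
```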
The regression test script connects to a Windows target over WinRM, executes each mapped atomic, waits for logs to land in Splunk, and queries for matching alerts. You get a pass or fail for every single rule. The HTML report gives you summary stats, a filterable results table, and for every failure you can see exactly which rules were expected versus which ones actually triggered, with clickable Splunk links so you can jump straight into debugging.
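Stripped to its essentials, one iteration of that loop looks roughly like the sketch below. It assumes pywinrm for remote execution, Invoke-AtomicRedTeam already installed on the target, and the Splunk REST search API; hostnames, credentials, and the alert query are placeholders, and the real script goes well beyond this, including the HTML reporting described above.

```python
# Condensed sketch of one regression-test iteration. Hosts, credentials, the alert
# search, and the fixed wait are illustrative, not the project's actual implementation.
import time
import requests
import winrm  # pip install pywinrm

SPLUNK_API = "https://splunk.example.local:8089"
SPLUNK_AUTH = ("svc_regression", "changeme")

def run_atomic(target: winrm.Session, technique_id: str, guid: str) -> None:
    # Assumes Invoke-AtomicRedTeam is already installed on the Windows target.
    result = target.run_ps(f"Invoke-AtomicTest {technique_id} -TestGuids {guid}")
    if result.status_code != 0:
        raise RuntimeError(result.std_err.decode(errors="replace"))

def alert_fired(rule_title: str, earliest: str = "-15m") -> bool:
    # How alerts are stored varies by environment; this assumes they land in an
    # index you can search by rule title.
    search = f'search index=notable earliest={earliest} search_name="{rule_title}" | head 1'
    resp = requests.post(
        f"{SPLUNK_API}/services/search/jobs/export",
        auth=SPLUNK_AUTH,
        data={"search": search, "output_mode": "json"},
        verify=False,  # lab certificates only
    )
    resp.raise_for_status()
    return bool(resp.text.strip())

if __name__ == "__main__":
    target = winrm.Session("win-target.example.local", auth=("Administrator", "changeme"))
    tests = [  # (technique, atomic GUID, expected Sigma rules) -- placeholder mapping
        ("T1003.001", "00000000-0000-0000-0000-000000000000",
         ["LSASS Memory Dump via ProcDump"]),
    ]
    for technique, guid, expected in tests:
        run_atomic(target, technique, guid)
        time.sleep(300)  # let ingestion and scheduled searches catch up
        for rule in expected:
            print(f"{rule}: {'PASS' if alert_fired(rule) else 'FAIL'}")
```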
Here is what the full validation cycle looks like:
The whole pipeline runs in GitHub Actions. Push a rule change, and the validation workflow runs sigma check to catch formatting errors. If validation passes, the Splunk pipeline kicks off automatically to convert, deploy, and optionally run the full regression suite. A detection engineer can write a rule, add a test mapping, push to the repo, and have the entire lifecycle executed without any manual intervention.
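To make those automated steps concrete, here is a rough sketch of what the validate-and-convert stages amount to, written as a standalone script rather than the actual workflow file. It assumes sigma-cli is installed, and the flags and paths are illustrative:

```python
# Rough sketch of the validate-and-convert stages as a standalone script, assuming
# sigma-cli (pip install sigma-cli) and rules stored under rules/. The real pipeline
# drives equivalent steps from GitHub Actions; flags and paths here are illustrative.
import subprocess
import sys

def run(cmd: list[str]) -> None:
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)  # a non-zero exit fails the build

if __name__ == "__main__":
    try:
        run(["sigma", "check", "rules/"])                    # catch formatting/schema errors
        run(["sigma", "convert", "-t", "splunk", "rules/"])  # convert rules to Splunk queries
    except subprocess.CalledProcessError as exc:
        sys.exit(exc.returncode)
```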
On the first full regression run against 57 mapped detection rules, the pass rate came back at 87.7%. Roughly 1 in 8 rules had silently failed. These weren't obscure edge cases or experimental detections. They covered credential theft, lateral movement, and privilege escalation, the kinds of techniques that show up in real intrusions and that the team had assumed were fully covered.
An 87.7% pass rate sounds decent until you sit with what it actually means. If you have 57 detection rules and 7 of them are broken, that's 7 gaps in your coverage that no dashboard, no heatmap, and no AI-assisted triage tool is going to surface. Those gaps persist until someone runs a test or an attacker walks through them.
That experience is what convinced me that continuous validation isn't a nice-to-have. It's a fundamental requirement for any detection program that claims to provide real coverage. You can't claim coverage for a technique if you aren't regularly proving that the detection actually fires.
Once you have regression testing running, you can start making measurable commitments about your detection program. I think about these in three categories.
Latency is about time to detect. A rule should fire within five minutes of technique execution. If it takes longer, something is off with your log ingestion, your indexing pipeline, or your search scheduling, and you need to know about it.
Health is your pass rate, and the target is 100% of your mapped tests passing at all times. A failing detection test should carry the same urgency as a broken build in a CI pipeline. Not next sprint, fix it now.
Coverage ensures that your highest-priority ATT&CK techniques each have at least one detection with a passing regression test. Credential dumping, command-line abuse, log tampering, lateral movement via remote services. If you don't have a tested, validated rule for those, you've got a gap in your most critical coverage. These are the techniques that appear in nearly every real-world intrusion, and they need to be airtight.
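Once regression results exist as structured data, all three SLAs reduce to simple calculations. Here is a sketch using illustrative result records and an assumed priority-technique list; the field names are placeholders, not the pipeline's actual output format:

```python
# Sketch of the three SLA measurements over regression results. Records, field
# names, and the priority-technique list below are illustrative.
from datetime import datetime, timedelta

results = [
    {"technique": "T1003.001", "rule": "LSASS Memory Dump via ProcDump", "passed": True,
     "executed_at": datetime(2024, 1, 1, 12, 0), "alert_at": datetime(2024, 1, 1, 12, 3)},
    {"technique": "T1021.002", "rule": "Lateral Movement via Admin Shares", "passed": False,
     "executed_at": datetime(2024, 1, 1, 12, 10), "alert_at": None},
]
priority_techniques = {"T1003.001", "T1021.002", "T1070.001"}

# Health: every mapped test should pass, all the time.
health = sum(r["passed"] for r in results) / len(results)

# Latency: an alert should land within five minutes of technique execution.
slo = timedelta(minutes=5)
latency_breaches = [r["rule"] for r in results
                    if r["alert_at"] and r["alert_at"] - r["executed_at"] > slo]

# Coverage: each priority technique needs at least one detection with a passing test.
covered = {r["technique"] for r in results if r["passed"]}
coverage_gaps = priority_techniques - covered

print(f"health={health:.0%}, latency breaches={latency_breaches}, gaps={sorted(coverage_gaps)}")
```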
These SLAs turn "we have good detection coverage" from a claim that sounds good in a slide deck into something auditable with real data behind it. Track them over time, present them to leadership, and use them to prioritize engineering work. That's how you move detection engineering from an art into an operational discipline.
This isn't just about tooling. It's a mindset shift in how we approach detection engineering as a discipline. The industry has gotten very good at writing rules and deploying them. The part we've collectively underinvested in is proving that those rules continue to work over time. We treat deployment as the finish line when it should really be the starting point of ongoing validation.
The pipeline is built around Splunk today, but the approach is SIEM-agnostic. Sigma converts to any backend, so if you're running Elastic, Sentinel, or anything else that pySigma supports, the detection logic and test mappings carry over. The principle is the same regardless of platform: write the rule, map it to a simulation, test it, and keep testing it.
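Retargeting is essentially a one-line change in the conversion step. Here is a sketch assuming pySigma with the Splunk backend package and a toy rule; swapping the backend class aims the same logic at a different platform:

```python
# Sketch of backend conversion with pySigma, assuming the pySigma-backend-splunk
# package is installed. The rule below is a toy example; swap the backend class
# to target another supported SIEM.
from sigma.collection import SigmaCollection
from sigma.backends.splunk import SplunkBackend

rule_yaml = r"""
title: Suspicious LSASS Access via ProcDump
logsource:
    category: process_creation
    product: windows
detection:
    selection:
        Image|endswith: '\procdump.exe'
        CommandLine|contains: 'lsass'
    condition: selection
"""

for query in SplunkBackend().convert(SigmaCollection.from_yaml(rule_yaml)):
    print(query)
```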
The project is public at github.com/scythe-io/sigma-regression-testing. Clone it, install the dependencies, and you can start running regression tests against your own environment. There's a full walkthrough that takes you through every step.
Detection engineering doesn't end when you deploy a rule. It ends when you can prove, with data, that the rule fires today against the technique it was built to catch. That proof has to be automated, and it has to be continuous. Otherwise you're just maintaining a list of rules and hoping they work.
Hope is not a detection strategy.
– Tyler Casey