
Is Automated Triage of 100% of Alerts Possible with Today’s Tools?


What is the acceptable number of alerts to ignore? 20%? 50%? The answer for many SOCs is around 70%. It’s hard to place the blame in one area.

There are a lot of tools generating a lot of alerts, and because no one wants to miss something important, these tools are prone to generating false positives. This makes analysts’ queues endless, filled with benign, low-fidelity, and redundant alerts, each of which takes analyst time to rule out as a serious threat.

Wasn’t security automation supposed to solve this problem? Automation has helped, but significantly reducing alert noise at scale has proved to be very challenging. Doing it with a SIEM is extremely difficult. XDR has not yet been up to the task. Legacy SOAR tools have struggled to match the necessary scale and reliability.

To automate the dismissal of alerts, you need high confidence, and high confidence requires massive amounts of processing power. Let’s consider why. A workflow that provides sufficient confidence about the nature of an alert might include as many as 500 actions. If that sounds crazy, we will dig into it in the next section. If a large MSSP is monitoring one million alerts per day, that’s 500 million steps to process every day, without ever stopping.
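To make the sustained rate concrete, here is the back-of-the-envelope arithmetic in Python, using the numbers straight from the scenario above:

    alerts_per_day = 1_000_000          # a large MSSP's daily alert volume
    actions_per_alert = 500             # high-confidence triage workflow
    seconds_per_day = 24 * 60 * 60      # 86,400

    total_actions = alerts_per_day * actions_per_alert
    print(f"{total_actions:,} actions per day")                       # 500,000,000
    print(f"{total_actions / seconds_per_day:,.0f} actions per sec")  # ~5,787

That is roughly 5,800 workflow actions every second, around the clock, before any headroom for spikes.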

There are not many tools that automate reliably at that scale. Why not, and what are the missing pieces?

The Workflow Engine

High-confidence triage at scale relies on the strength of the workflow engine you are using, which needs to be able to handle hundreds of actions at a time. Why does it need this much capacity? If an alert requires advanced queries to collect all indicators of compromise (IOCs), the workflow needs to unpack the alert, run the queries, and analyze the incoming IOCs. This analysis requires unique enrichment and logic. Any dismissed alerts need to be closed in the original data source as well. These actions quickly add up and are all necessary to avoid dismissing serious alerts.
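As a rough illustration of why the actions add up, here is a minimal, self-contained Python sketch of that flow. The threat-intel lookup is a toy stand-in, and none of the function names come from any particular product:

    # Minimal sketch of a high-confidence triage workflow.
    # The "threat intel" here is a toy stand-in for real enrichment integrations.

    MALICIOUS_DOMAINS = {"evil.example.com"}

    def unpack_iocs(alert: dict) -> list[str]:
        # In a real workflow, collecting all IOCs may itself require advanced queries.
        return alert.get("iocs", [])

    def enrich(ioc: str) -> str:
        return "malicious" if ioc in MALICIOUS_DOMAINS else "benign"

    def close_in_source(alert_id: str) -> None:
        print(f"closing {alert_id} in the original data source")

    def triage(alert: dict) -> str:
        for ioc in unpack_iocs(alert):
            if enrich(ioc) == "malicious":
                return "escalate"          # a single bad verdict blocks auto-dismissal
        close_in_source(alert["id"])       # dismissed alerts must be closed upstream too
        return "dismissed"

    print(triage({"id": "A-1", "iocs": ["mail.example.org"]}))  # -> dismissed

Multiply the unpacking, querying, per-source enrichment, and close-out steps by every IOC in every alert, and a 500-action workflow stops sounding crazy.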

Some types of alerts naturally lead to expansive workflows, as more and more relevant data is brought into the playbook. For example, a phishing playbook might pull in hundreds of related emails that are part of the same campaign, each with its own IOCs to investigate.

Even before triage, some workflows must be able to unwind hundreds of alerts. To take one example, in a tool that does not automatically deduplicate alerts, an ingestion playbook might look like this:

  1. The playbook queries an integrated AWS S3 Bucket to ingest a list of 500 alerts into the automation tool.
  2. The playbook then unwinds that list, creating a parallel path for each alert.
  3. For each alert, the playbook then executes a series of 15-20 actions to parse it, normalize its fields, and check the global list to see whether the alert ID is already in the system.
  4. If the alert ID is not in the global list, then a new event is created, and triage can begin.

That simple playbook can represent as many as 10,000 actions, just to ingest, deduplicate, and correlate alerts.
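A minimal Python sketch of that ingest-and-deduplicate pattern might look like the following. The S3 query is stubbed with generated data (in a live playbook it would be a call to the S3 integration, e.g. boto3’s get_object), and a production engine would synchronize the duplicate check properly rather than relying on a shared set:

    import json
    from concurrent.futures import ThreadPoolExecutor

    SEEN_ALERT_IDS: set[str] = set()   # the "global list" of known alert IDs

    def fetch_alert_batch() -> list[dict]:
        # Stand-in for step 1: querying the integrated AWS S3 bucket.
        return [{"id": f"alert-{i}", "raw": json.dumps({"sev": i % 5})}
                for i in range(500)]

    def process(alert: dict) -> str | None:
        parsed = json.loads(alert["raw"])           # parse (part of step 3's 15-20 actions)
        normalized = {"id": alert["id"], "severity": parsed["sev"]}
        if normalized["id"] in SEEN_ALERT_IDS:      # dedupe against the global list
            return None                             # already in the system, drop it
        SEEN_ALERT_IDS.add(normalized["id"])
        return normalized["id"]                     # step 4: create a new event

    with ThreadPoolExecutor() as pool:              # step 2: a parallel path per alert
        new_events = [e for e in pool.map(process, fetch_alert_batch()) if e]
    print(f"{len(new_events)} new events created")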

Bottlenecks and Points of Failure

Automation tools don’t just have to process alerts, they have to process them fast. Otherwise, they can’t keep up with incoming volume and you’re back to square one. There are a few common points of failure that can bring automated triage playbooks to a halt.

One is custom Python scripts. Automation tools that rely on users to script actions and integrations slow down processing and introduce new variables that can cause errors. The faster and more scalable option is prebuilt utility commands, which replace those scripts with code that is faster and more reliable because it has already been tested and approved by the vendor.

Another point of failure is inefficient normalization of alert data. We were recently speaking to a security architect about the challenge of alert noise reduction, and he said, “well, if you have automated normalization, then the rest is easy.” What did he mean by that? When you are ingesting alerts from several different detection tools, the fields won’t consistently match, so you need to transform the data so that, for example, the field that represents an email sender’s domain in each detection tool maps to the same field in the automation tool.

Without normalization, you are comparing apples to oranges, and can’t trust that you are correlating properly across alerts from different sources. Building your own normalization workflows is possible, but difficult, so having normalization built into your automation tool greatly increases your chances of achieving automated triage at scale.
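At its core, normalization is a field-mapping step. The toy Python sketch below shows the idea; the source names and field names are illustrative, not any specific product’s schema:

    # Toy normalization: map vendor-specific field names onto one common schema.
    FIELD_MAPS = {
        "email_gateway_a": {"sndr_domain": "sender_domain", "subj": "subject"},
        "email_gateway_b": {"from_domain": "sender_domain", "subject_line": "subject"},
    }

    def normalize(source: str, alert: dict) -> dict:
        mapping = FIELD_MAPS[source]
        return {mapping[k]: v for k, v in alert.items() if k in mapping}

    a = normalize("email_gateway_a", {"sndr_domain": "phish.example", "subj": "Invoice"})
    b = normalize("email_gateway_b", {"from_domain": "phish.example", "subject_line": "Invoice"})
    assert a["sender_domain"] == b["sender_domain"]   # now directly comparable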

A third point of failure when scaling up your automated triage is the automation tool’s queueing system, which manages the ordering of alerts and processes in real time. When done poorly, such as when automation vendors offload it onto an unreliable third-party tool, queueing becomes a bottleneck that causes the tool to lag during spikes in alert volume. A good queueing system can be the difference between handling a few thousand alerts per day and handling 100,000+.
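Here is a stripped-down, in-process Python illustration of the principle: a bounded queue absorbs a spike while a fixed pool of workers drains it. A real SOAR queue would be a distributed, durable service, not threads in one process:

    import queue
    import threading

    alert_queue: queue.Queue[str | None] = queue.Queue(maxsize=10_000)  # backpressure limit

    def worker() -> None:
        while True:
            alert_id = alert_queue.get()
            if alert_id is None:        # sentinel: shut down
                break
            # ... the triage workflow would run here ...
            alert_queue.task_done()

    threads = [threading.Thread(target=worker) for _ in range(8)]
    for t in threads:
        t.start()

    for i in range(50_000):             # simulated spike in alert volume
        alert_queue.put(f"alert-{i}")   # blocks when the queue is full (backpressure)

    alert_queue.join()                  # wait until the spike is fully drained
    for _ in threads:
        alert_queue.put(None)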

Autoscaling

To answer the question posed by this article’s title, yes, we believe 100% automated triage can be achieved with today’s technology. Build around a strong workflow engine, avoid common points of failure, and you shouldn’t have to ignore any percentage of alerts. There’s one other necessary component we should mention, which is architecture.

Cloud-based autoscaling is integral to processing at the scale we are talking about. Leveraging resources like MongoDB and Kubernetes clusters that can dynamically scale their power and capacity to handle spikes in activity protects the system from lagging behind when you need it most.
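The decision logic behind that elasticity is conceptually simple. The Python sketch below mimics the kind of calculation a Kubernetes Horizontal Pod Autoscaler performs, sizing the worker pool to the backlog; the numbers are purely illustrative:

    import math

    def desired_workers(queue_depth: int, target_per_worker: int = 1_000,
                        min_workers: int = 2, max_workers: int = 100) -> int:
        # Same spirit as a Kubernetes HPA: scale replicas to match current load.
        desired = math.ceil(queue_depth / target_per_worker)
        return max(min_workers, min(max_workers, desired))

    print(desired_workers(1_500))    # quiet period -> 2 workers
    print(desired_workers(80_000))   # spike -> 80 workers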

About D3 Smart SOAR for MSSPs

D3 Security supports MSSPs around the world with our Smart SOAR platform, which offers full multi-tenancy so you can keep client sites, data, and playbooks completely segregated.

Importantly, we’re vendor-agnostic and independent, so no matter what tools your clients use, our unlimited integrations will meet their needs. D3’s Event Pipeline can automate the alert-handling capacity of dozens of analysts, while reducing alert volume by 90% or more. Watch our case study video with High Wire Networks to see how a master MSSP uses Smart SOAR.

Guest blog courtesy of D3 Security. Read more D3 Security guest blogs and news here. Regularly contributed guest blogs are part of MSSP Alert’s sponsorship program.