Search This Blog

Powered by Blogger.

Blog Archive

Labels

Footer About

Footer About

Labels

AI Models Trained on Incomplete Data Can't Protect Against Threats

Threat hunting powered by AI, automation, or human investigation will only ever be as effective as the data infrastructure it stands on.


In cybersecurity, AI is being called the future of threat finder. However, AI has its hands tied, they are only as good as their data pipeline. But this principle is not stopping at academic machine learning, as it is also applicable for cybersecurity.

AI-powered threat hunting will only be successful if the data infrastructure is strong too.

Threat hunting powered by AI, automation, or human investigation will only ever be as effective as the data infrastructure it stands on. Sometimes, security teams build AI over leaked data or without proper data care. This can create issues later. It can affect both AI and humans. Even sophisticated algorithms can't handle inconsistent or incomplete data. AI that is trained on poor data will also lead to poor results. 

The importance of unified data 

A correlated data controls the operation. It reduces noise and helps in noticing patterns that manual systems can't.

Correlating and pre-transforming the data makes it easy for LLMs and other AI tools. It also allows connected components to surface naturally. 

A same person may show up under entirely distinct names as an IAM principal in AWS, a committer in GitHub, and a document owner in Google Workspace. You only have a small portion of the truth when you look at any one of those signs. 

You have behavioral clarity when you consider them collectively. While downloading dozens of items from Google Workspace may look strange on its own, it becomes obviously malevolent if the same user also clones dozens of repositories to a personal laptop and launches a public S3 bucket minutes later.

Finding threat via correlation 

Correlations that previously took hours or were impossible become instant when data from logs, configurations, code repositories, and identification systems are all housed in one location. 

For instance, lateral movement that uses short-lived credentials that have been stolen frequently passes across multiple systems before being discovered. A hacked developer laptop might take on several IAM roles, launch new instances, and access internal databases. Endpoint logs show the local compromise, but the extent of the intrusion cannot be demonstrated without IAM and network data.


Share it:

AI

ChatGPT

Cloud

Cyber Security

Data

Windows