Jair Santanna

Principal Cybersecurity Researcher at Northwave & Assistant Professor at University of Twente

Dr Jair Santanna is an enthusiastic and passionate Principal Cybersecurity Researcher (@Northwave) and an Assistant Professor (@University of Twente). He is a practical, data-driven, and extremely curious person. He loves to share knowledge with the scientific community and with cybersecurity practitioners. He prepares his presentations with you, the audience, in mind, and promises an engaging, enthusiastic, and to-the-point presentation.

Talk: Fast and Reproducible Data Analysis Beyond State Actors' Leaked Data – Utilizing Classic Techniques and the Latest AI LLMs

In February 2024, a significant data leak occurred involving a Chinese company known as I-Soon or Anxun. The data consisted of 578 files, mostly in simplified Chinese, and included images, documents, tables, presentation slides, and conversations between individuals. The language barrier led some Cyber Threat Intelligence (CTI) teams to step back from analyzing the data. Others used private or open-source solutions to translate the data, aware of the potential for errors. After preliminary reviews of a selection of files, some teams posited that I-Soon is an Advanced Persistent Threat (APT) group with possible ties to the Chinese government. Months later, analyses by CTI teams suggest that (1) the methodologies behind some conclusions are opaque and non-replicable, (2) significant data portions remain unanalyzed or unpublished, and (3) extensive time and human resources are required for a thorough analysis. To address these issues, we have developed a methodology and scripts to drastically accelerate the data analysis process, increasing comprehensiveness, reproducibility, scalability, and actionability.

Our methodology incorporates proven techniques, including (1) the use of Regular Expressions (RegEx) to extract particular types of information such as IP addresses, URLs, hashes, crypto wallets, financial values, and geographic names, and (2) enrichment and correlation with CTI databases. Existing tools, such as the Microsoft Threat Intelligence Python Security Tools (MSTIC or msticpy), and scripts for examining Distributed Denial of Service (DDoS) attack providers' ecosystems, provide similar capabilities. The most noteworthy innovation of our method is using private and open-source Artificial Intelligence (AI) Large Language Models (LLMs) to annotate and classify data without human intervention.
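To make the RegEx extraction step concrete, here is a minimal Python sketch. It is not the authors' script: the pattern set, the `PATTERNS` name, and the `extract_indicators` helper are illustrative assumptions, and the patterns are deliberately simple. Lookarounds are used instead of `\b` because, in Python, Chinese characters count as word characters, so `\b` would miss an IP address written directly after Chinese text.

```python
import re

# Hypothetical, deliberately simple patterns (assumptions for illustration only).
OCTET = r"(?:25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)"
HEX = "0-9a-fA-F"

PATTERNS = {
    # Four octets in the range 0-255, not glued to other digits or dots.
    "ipv4": re.compile(rf"(?<![\d.]){OCTET}(?:\.{OCTET}){{3}}(?![\d.])"),
    # ASCII-only URL characters, so adjacent Chinese text is not swallowed.
    "url": re.compile(r"https?://[A-Za-z0-9\-._~:/?#\[\]@!$&()*+,;=%]+"),
    # Hex strings of exactly 32 (MD5) or 64 (SHA-256) characters.
    "md5": re.compile(rf"(?<![{HEX}])[{HEX}]{{32}}(?![{HEX}])"),
    "sha256": re.compile(rf"(?<![{HEX}])[{HEX}]{{64}}(?![{HEX}])"),
}


def extract_indicators(text):
    """Return a dict mapping each indicator type to its sorted unique matches."""
    return {
        name: sorted(set(pattern.findall(text)))
        for name, pattern in PATTERNS.items()
    }
```

Calling `extract_indicators` on the text of each leaked file yields a per-file dictionary of candidate indicators, which can then be passed to the enrichment and correlation step. Real-world use would need more patterns (for example crypto wallets and financial values) and validation of the matches.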

This methodology is adaptable: it is not limited to I-Soon's leaked data but applies to any dataset that requires in-depth analysis. We have already successfully applied our techniques to the leaked data from the now-dissolved Conti ransomware group. We envision further impactful applications, such as analyzing terabytes or petabytes of data from ongoing ransomware incidents to help victims rapidly determine the scope of data compromised by threat actors. Another significant use case is accelerating the analysis of data on devices seized by law enforcement agencies. Currently, this type of in-depth data analysis can take weeks or months; our tools can reduce this to minutes or hours.