
Threat Explorer: A Comparative Study of Agentic Architectures and Visualization Strategies for Conversational Cybersecurity Analytics

Kostadin Devedzhiev
University of Cambridge
Cambridge, United Kingdom
kgd26@cam.ac.uk
Abstract

This paper presents Threat Explorer, a conversational AI system for cybersecurity threat analysis that translates natural language queries into SQL over a 40,000-record attack dataset. We evaluate three agentic architectures—an LLM Chain, a ReAct agent, and a multi-agent system—across retrieval accuracy, latency, cost, and perceived quality. A separate within-subjects user study (N=12) compares text-only and chart-augmented responses using Likert-scale surveys with Holm–Bonferroni-corrected Wilcoxon tests. The ReAct agent achieves the highest query validity (100%) while the LLM Chain offers the best cost–speed trade-off; chart-based output yields statistically significant improvements in usability, clarity, and efficiency. We discuss the socio-technical risks of visually compelling but potentially inaccurate LLM output and describe transparency mechanisms to support human-in-the-loop validation.

1. Use-Case and Goals

Threat Explorer is a chatbot for analyzing cybersecurity threats in a database using natural language, inspired by Stellar Cyber's AI Investigator [Stellar Cyber, 2025]. The system lets security experts analyze data quickly, without deep knowledge of the schema or query language, in collaboration with AI. The database uses Inscribo's dataset from Kaggle [Inscribo, 2024], with 40,000 records and 25 columns (e.g., Timestamp, Source IP Address, Attack Type, Anomaly Score, IDS/IPS Alerts).

Threat Explorer can answer queries like "show the last 10 attacks with an anomaly score over 75" or "show the number of high severity attacks by type." The system uses three agents: a custom LLM chain, a ReAct agent [Yao et al., 2022], and a multi-agent architecture. This paper explores the system in two main dimensions:

  1. Technical Robustness: Evaluating the retrieval performance, speed, cost efficiency, and perceived usefulness of different agent architectures for response generation in multi-turn dialogues.
  2. Design Effectiveness: Comparing structured output strategies (plain text vs. charts) in terms of usability, helpfulness, and cognitive load through a controlled user study.
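As a concrete illustration of the translation target, the two example queries above map to SQL along these lines. This is a sketch against an in-memory SQLite table: the table name `attacks` and the `Severity Level` column are assumptions, while the other column names follow the dataset description.

```python
import sqlite3

# Illustrative schema: column names follow the dataset description;
# the table name ("attacks") and "Severity Level" are assumptions.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE attacks (
    "Timestamp" TEXT, "Source IP Address" TEXT,
    "Attack Type" TEXT, "Anomaly Score" REAL, "Severity Level" TEXT)""")
conn.executemany(
    "INSERT INTO attacks VALUES (?, ?, ?, ?, ?)",
    [("2023-01-01 10:00", "10.0.0.1", "DDoS", 92.5, "High"),
     ("2023-01-02 11:30", "10.0.0.2", "Malware", 40.1, "Low"),
     ("2023-01-03 09:15", "10.0.0.3", "DDoS", 81.0, "High")])

# "show the last 10 attacks with an anomaly score over 75"
recent = conn.execute(
    'SELECT * FROM attacks WHERE "Anomaly Score" > 75 '
    'ORDER BY "Timestamp" DESC LIMIT 10').fetchall()

# "show the number of high severity attacks by type"
by_type = conn.execute(
    'SELECT "Attack Type", COUNT(*) FROM attacks '
    "WHERE \"Severity Level\" = 'High' "
    'GROUP BY "Attack Type"').fetchall()
```

Quoted identifiers are needed because the dataset's column names contain spaces.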

2. Baseline System Design

Threat Explorer is a Retrieval-Augmented Generation (RAG) [Lewis et al., 2021] system powered by OpenAI's GPT-4o mini [OpenAI, 2024] for its cost-effectiveness. The workflow consists of four stages: (1) receiving a natural language prompt, (2) constructing an SQL query via an agent, (3) executing the query to retrieve relevant records, and (4) generating a natural language report of the results. The baseline agent is a predefined LLM chain equipped with tools for executing SQL queries and inspecting the database schema. The backend is built with FastAPI [FastAPI, 2023] and uses LangChain [Chase, 2022] and CrewAI [CrewAI, 2025] for agent orchestration. The frontend is built with React [Meta, 2025], and the database is SQLite [Python Software Foundation, 2024].
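The four-stage workflow can be sketched as a single function. This is a minimal illustration, not the production implementation (which uses LangChain tool-calling agents); `llm` stands in for any prompt-to-completion callable, such as a GPT-4o mini wrapper.

```python
import sqlite3

def answer(prompt: str, llm, conn: sqlite3.Connection) -> str:
    """Minimal sketch of the four-stage RAG workflow; `llm` is any
    callable mapping a prompt string to a completion string."""
    # (1) receive the natural language prompt; (2) construct SQL via the agent,
    # grounding it in the database schema
    schema = "\n".join(r[0] for r in conn.execute(
        "SELECT sql FROM sqlite_master WHERE sql IS NOT NULL"))
    sql = llm(f"Schema:\n{schema}\n\nWrite one SQLite query for: {prompt}")
    # (3) execute the query to retrieve relevant records
    rows = conn.execute(sql).fetchall()
    # (4) generate a natural language report of the results
    return llm(f"Question: {prompt}\nRows: {rows}\nSummarize for an analyst.")
```

The real baseline additionally gives the LLM chain tools for schema inspection and query execution rather than inlining them as above.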

3. Technical Experimental Study

3.1 Research Question

To what extent do different agentic architectures affect the accuracy, speed, cost, and perceived utility of the RAG-based dialogue system?

3.2 Setup and Evaluation Metrics

A test set of 10 dialogues with 30 system turns was annotated, each containing a rubric and ground-truth SQL queries of varying complexity across use cases including attack analysis, protocol investigation, severity analysis, and temporal analysis.

Figure 1: Example of a test rubric for IDS/IPS Alerts Analysis

Each turn is evaluated on the following metrics: SQL query validity (the query executes without error), pattern match against the ground-truth query, latency, token usage and cost, and LLM-as-a-judge generation quality (factuality, helpfulness, and overall quality).
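Query validity and pattern matching can be scored per turn with a small harness. The sketch below treats validity as error-free execution and matching as result-set equality against the ground-truth SQL; the paper's exact pattern-match criterion may differ.

```python
import sqlite3

def evaluate_turn(conn, generated_sql: str, gold_sql: str) -> dict:
    """Score one system turn. 'valid' = the generated SQL executes
    without error; 'match' = its result set equals the ground truth's
    (one plausible operationalization of pattern matching)."""
    try:
        got = conn.execute(generated_sql).fetchall()
    except sqlite3.Error:
        return {"valid": False, "match": False}
    gold = conn.execute(gold_sql).fetchall()
    # Compare as order-insensitive multisets of stringified rows
    return {"valid": True,
            "match": sorted(map(str, got)) == sorted(map(str, gold))}
```

Aggregating these booleans over the 30 annotated turns yields the percentages reported in Table 1.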

3.3 Results

Table 1: Retrieval Performance Across Agents (N=30 turns)

| Metric         | LLM Chain | ReAct  | Multi-Agent |
|----------------|-----------|--------|-------------|
| Query Validity | 96.7%     | 100.0% | 70.0%       |
| Pattern Match  | 86.7%     | 86.7%  | 43.3%       |
Figure 2: Retrieval Performance by Agent Architecture (query validity and pattern match accuracy for the LLM Chain, ReAct, and Multi-Agent systems)
Table 2: Efficiency and Token Distribution Across Agents (N=30 turns)

| Metric         | LLM Chain | ReAct   | Multi-Agent |
|----------------|-----------|---------|-------------|
| Total Time (s) | 306.9     | 475.8   | 2741.4      |
| Input Tokens   | 84,265    | 190,517 | 2,047,357   |
| Output Tokens  | 13,275    | 23,259  | 444,435     |
| Total Tokens   | 97,540    | 213,776 | 2,491,792   |
| Total Cost ($) | 0.0206    | 0.0425  | 0.5738      |
Figure 3: Cost and Speed Comparison Across Agents (total time: LLM Chain 306.9 s, ReAct 475.8 s, Multi-Agent 2741.4 s; total cost: $0.021, $0.043, $0.574 respectively)
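The costs in Table 2 can be reproduced from the token counts, assuming OpenAI's published GPT-4o mini rates of $0.15 per 1M input tokens and $0.60 per 1M output tokens (the rates at the time of the study):

```python
# Assumed GPT-4o mini pricing: $0.15 / 1M input, $0.60 / 1M output tokens
PRICE_IN, PRICE_OUT = 0.15 / 1e6, 0.60 / 1e6

def run_cost(input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one evaluation run from its token counts."""
    return input_tokens * PRICE_IN + output_tokens * PRICE_OUT

# Token counts taken directly from Table 2
costs = {name: round(run_cost(i, o), 4) for name, (i, o) in {
    "LLM Chain":   (84_265, 13_275),
    "ReAct":       (190_517, 23_259),
    "Multi-Agent": (2_047_357, 444_435),
}.items()}
# costs == {"LLM Chain": 0.0206, "ReAct": 0.0425, "Multi-Agent": 0.5738}
```

The reconstruction matches Table 2 to four decimal places, confirming the cost figures follow directly from token usage.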
Table 3: Generation Quality (LLM-as-a-Judge Score, 1–5)

| Metric          | LLM Chain | ReAct | Multi-Agent |
|-----------------|-----------|-------|-------------|
| Factuality      | 4.67      | 4.53  | 3.70        |
| Helpfulness     | 4.40      | 4.57  | 4.17        |
| Overall Quality | 4.37      | 4.47  | 3.67        |

3.4 Discussion

The ReAct agent achieved 100% query validity: every generated SQL query executed successfully against the database, thanks to its iterative reasoning and refinement. It also tied with the LLM Chain for the highest pattern match accuracy (86.7%). The LLM Chain had the fastest response times and the lowest token consumption and cost.

The multi-agent architecture performed worst across all metrics due to poor orchestration. Improving it was not a priority, however, since the simpler agents already achieved high accuracy at lower cost and latency. The LLM judge scored the LLM Chain highest for factuality but favored ReAct for helpfulness and overall quality.

Key Insight

The trade-offs between agents motivate a system design that supports switching between them based on user preference, accuracy requirements, and budget.

The multi-agent orchestration consists of three specialized agents: an SQL Query Analyst, a Cybersecurity Threat Analyst, and a Report Formatter, each with a system prompt specifying its role, tools, schema knowledge, reasoning style, and expected output markup. In one example, the multi-agent system hallucinated a convincing but incorrect output, whereas the ReAct agent returned accurate results under identical prompting, highlighting how complex orchestration can introduce unreliable behavior.

Figure 4: Comparison of ReAct vs. Multi-Agent outputs for the same query; (a) ReAct correct result, (b) Multi-Agent hallucination

4. Design-Focused Experimental Study

4.1 Research Question

Does the structured-output strategy (LLM-generated visualizations versus text only) affect the usability, helpfulness, clarity, trust, and perceived efficiency of a cybersecurity data analytics chatbot?

4.2 Setup

The experiment added a header button that toggles visualizations by updating the agents' structured output. Twelve postgraduate students at the University of Cambridge used the chatbot in two conditions, text (plain-text summary) and chart (visual data presentation), for five minutes each. Half held undergraduate degrees in computer science, and none had formal cybersecurity experience.

Figure 5: Comparison of Text-Only vs. Chart-Based Responses in the Design Study; (a) text-only response, (b) chart response with accompanying table

The post-interaction survey included 5 questions, each on a 5-point Likert scale, comparing the two systems across five dimensions:

Table 4: Post-Interaction Survey Questions

| Dimension      | Survey Question                                              |
|----------------|--------------------------------------------------------------|
| A. Usability   | The responses were presented in a well-organized way.        |
| B. Helpfulness | The system gave me the right level of detail.                |
| C. Clarity     | I could identify the key evidence supporting the conclusion. |
| D. Trust       | I trust the chatbot's output for the tasks I performed.      |
| E. Efficiency  | This version helped me understand the output quickly.        |

4.3 Results and Statistical Analysis

Likert-scale responses were analyzed with a one-sided Wilcoxon signed-rank test (α = 0.05), chosen given the a priori expectation of improvement and the small sample size. The chart version scored higher across all dimensions, with statistically significant differences (padj < 0.05) after Holm–Bonferroni adjustment in usability, clarity, and efficiency.
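The Holm–Bonferroni step-down adjustment behind the padj values is straightforward to implement (scipy's `wilcoxon` would supply the raw per-dimension p-values). Applied to the raw p-values in Table 5, the sketch below reproduces the adjusted values exactly:

```python
def holm_bonferroni(pvals):
    """Holm-Bonferroni step-down adjustment: sort p-values ascending,
    multiply the i-th smallest (0-indexed) by (m - i), cap at 1, and
    enforce monotonicity over the sorted sequence."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adj, running = [0.0] * m, 0.0
    for rank, i in enumerate(order):
        running = max(running, min(1.0, pvals[i] * (m - rank)))
        adj[i] = running
    return adj

# Raw p-values for dimensions A-E (Table 5)
p_raw = [0.002, 0.045, 0.002, 0.039, 0.001]
p_adj = [round(p, 3) for p in holm_bonferroni(p_raw)]
# p_adj == [0.008, 0.078, 0.008, 0.078, 0.005], matching Table 5
```

Note how helpfulness (0.045) and trust (0.039) lose significance after correction, which is why only usability, clarity, and efficiency are starred.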

Table 5: Design Experiment Results with Holm–Bonferroni Correction (1–5 Likert Scale)

| Dimension      | Mean Text | Mean Chart | SD Text | SD Chart | p     | padj   |
|----------------|-----------|------------|---------|----------|-------|--------|
| A. Usability   | 3.42      | 4.67       | 0.79    | 0.49     | 0.002 | 0.008* |
| B. Helpfulness | 3.50      | 4.42       | 1.09    | 0.67     | 0.045 | 0.078  |
| C. Clarity     | 3.17      | 4.50       | 0.83    | 0.67     | 0.002 | 0.008* |
| D. Trust       | 3.83      | 4.58       | 0.94    | 0.67     | 0.039 | 0.078  |
| E. Efficiency  | 2.58      | 4.75       | 1.31    | 0.45     | 0.001 | 0.005* |
Figure 6: Text vs. Chart Scores Across Dimensions (usability*, helpfulness, clarity*, trust, efficiency*; * padj < 0.05)

4.4 Discussion

The results show that structured visual output provides user experience value in Threat Explorer, supporting the decision to make visualizations the default output setting.

Socio-Technical Reflection and Trust

Trust and helpfulness in the chart condition warrant critical reflection. Users may perceive data visualizations as more trustworthy than plain text, which is risky in an LLM-powered system that can generate inaccurate or incomplete queries, hallucinate explanations, or misinterpret data. The risk is greatest for users without domain expertise, who may over-rely on visually compelling output and accept it without question.

Caution

Visually compelling LLM-generated charts may increase user trust even when the underlying data retrieval is incorrect. This creates a risk of automation bias—users accepting AI output without critical evaluation, particularly when they lack domain expertise.

Mitigation and Transparency

To improve transparency, the agents always include the executed SQL queries in their output, rendered with syntax highlighting so the queries are easy to inspect, and they provide an explanation of their reasoning. In future designs, the query should be editable so users can explore the data even when the agents cannot fulfill a request. Such interactive features shift the user's role from passive recipient to active validator, promoting human-in-the-loop ML [Wu et al., 2022] and human–AI collaboration [Vats et al., 2024].
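One way to realize this transparency contract is to make the SQL and reasoning first-class fields of every response. A hypothetical payload sketch (field names are illustrative, not the system's actual schema):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class AgentResponse:
    """Sketch of a transparency-first response payload: the SQL and
    the agent's reasoning ship alongside the answer so users can
    audit (and, in future designs, edit) the query."""
    answer: str                    # natural language report for the user
    sql: str                       # exact query executed, shown with syntax highlighting
    reasoning: str                 # the agent's explanation of its thinking
    chart: Optional[dict] = None   # optional chart spec consumed by the frontend
```

Keeping the chart spec optional lets the same payload serve both the text-only and chart-augmented conditions of the design study.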

5. Final System Design

The final system supports switching between agents as well as logging, uploading, and downloading dialogues in JSON format. The backend is a FastAPI server that uses LangChain and CrewAI for orchestration; the frontend is built with React, and the database is SQLite. The repository includes documentation for running the system and scripts for evaluation and report generation. The source code is available at github.com/kostadindev/Threat-Explorer.
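Agent switching can be as simple as a registry of interchangeable callables keyed by name. A minimal sketch (names illustrative, not the actual implementation):

```python
# Registry of interchangeable agents; the frontend selects one by key.
AGENTS = {}

def register(name):
    """Decorator registering an agent callable under a name."""
    def wrap(fn):
        AGENTS[name] = fn
        return fn
    return wrap

@register("llm-chain")
def llm_chain_agent(prompt: str) -> str:
    return f"[llm-chain] {prompt}"   # placeholder for the LLM chain

@register("react")
def react_agent(prompt: str) -> str:
    return f"[react] {prompt}"       # placeholder for the ReAct agent

def handle(prompt: str, agent: str = "llm-chain") -> str:
    """Dispatch a user prompt to the selected agent."""
    return AGENTS[agent](prompt)
```

Because every agent shares one signature, the trade-offs from Section 3 reduce to a per-request configuration choice.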

Figure 7: Dialogue examples, including a table query with SQL, a user investigation query, a severity distribution bar chart, a detailed record view, a pie chart visualization, and an elaboration on findings
Figure 8: Logs Output from Threat Explorer

6. Conclusion

This work explored technical and design aspects of Threat Explorer, a conversational AI cybersecurity analysis tool. The technical study showed that while the ReAct agent is most reliable, the LLM Chain is most efficient, leading to the design decision to support switching between agents.

The design experiment yielded statistically significant results showing that visualizations enhance usability, clarity, and efficiency, making them a core feature of Threat Explorer. While helpful, these visualizations pose a risk of misleading users through inaccurate LLM-generated content. Threat Explorer aims to improve transparency and mitigate over-reliance by providing its thought process and SQL queries.

Future work should focus on stronger guardrails against unrelated or dangerous prompts; editable inline SQL queries and charts; support for more chart types; faster responses by streaming completed segments over WebSockets; adaptive agent routing; improved prompting and orchestration; short- and long-term memory; and observability.

References