
Threat Explorer: A Comparative Study of Agentic Architectures and Visualization Strategies for Conversational Cybersecurity Analytics

Kostadin Devedzhiev
University of Cambridge
Cambridge, United Kingdom
kgd26@cam.ac.uk
Abstract

This paper presents Threat Explorer, a conversational AI system for cybersecurity threat analysis that translates natural language queries into SQL over a 40,000-record attack dataset. We evaluate three agentic architectures—an LLM Chain, a ReAct agent, and a multi-agent system—across retrieval accuracy, latency, cost, and perceived quality. A separate within-subjects user study (N=12) compares text-only and chart-augmented responses using Likert-scale surveys with Holm–Bonferroni-corrected Wilcoxon tests. The ReAct agent achieves the highest query validity (100%) while the LLM Chain offers the best cost–speed trade-off; chart-based output yields statistically significant improvements in usability, clarity, and efficiency. We discuss the socio-technical risks of visually compelling but potentially inaccurate LLM output and describe transparency mechanisms to support human-in-the-loop validation.

1. Use-Case and Goals

Threat Explorer is a chatbot for analyzing cybersecurity threats in a database using natural language, inspired by Stellar Cyber's AI Investigator [Stellar Cyber, 2025]. The system lets security experts analyze data quickly, without deep knowledge of the schema or query language, in collaboration with AI. The database uses Inscribo's dataset from Kaggle [Inscribo, 2024], with 40,000 records and 25 columns (e.g., Timestamp, Source IP Address, Attack Type, Anomaly Score, IDS/IPS Alerts).

Threat Explorer can answer queries like "show the last 10 attacks with an anomaly score over 75" or "show the number of high severity attacks by type." The system uses three agents: a custom LLM chain, a ReAct agent [Yao et al., 2022], and a multi-agent architecture. This paper explores the system in two main dimensions:

  1. Technical Robustness: Evaluating the retrieval performance, speed, cost efficiency, and perceived usefulness of different agent architectures for response generation in multi-turn dialogues.
  2. Design Effectiveness: Comparing structured output strategies (plain text vs. charts) in terms of usability, helpfulness, and cognitive load through a controlled user study.
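As a concrete illustration of the translation target, the two example queries above map to SQL along these lines. This is a sketch against an in-memory SQLite table: the table name `attacks` and the `Severity Level` column are assumptions, while the other column names follow the dataset description.

```python
import sqlite3

# Illustrative schema: column names follow the dataset description;
# the table name ("attacks") and "Severity Level" are assumptions.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE attacks (
    "Timestamp" TEXT, "Source IP Address" TEXT,
    "Attack Type" TEXT, "Anomaly Score" REAL, "Severity Level" TEXT)""")
conn.executemany(
    "INSERT INTO attacks VALUES (?, ?, ?, ?, ?)",
    [("2023-01-01 10:00", "10.0.0.1", "DDoS", 92.5, "High"),
     ("2023-01-02 11:30", "10.0.0.2", "Malware", 40.1, "Low"),
     ("2023-01-03 09:15", "10.0.0.3", "DDoS", 81.0, "High")])

# "show the last 10 attacks with an anomaly score over 75"
recent = conn.execute(
    'SELECT * FROM attacks WHERE "Anomaly Score" > 75 '
    'ORDER BY "Timestamp" DESC LIMIT 10').fetchall()

# "show the number of high severity attacks by type"
by_type = conn.execute(
    'SELECT "Attack Type", COUNT(*) FROM attacks '
    "WHERE \"Severity Level\" = 'High' "
    'GROUP BY "Attack Type"').fetchall()
```

Quoted identifiers are needed because the dataset's column names contain spaces.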

2. Baseline System Design

Threat Explorer is a Retrieval-Augmented Generation (RAG) [Lewis et al., 2021] system powered by OpenAI's GPT-4o mini [OpenAI, 2024] for its cost-effectiveness. The workflow consists of four stages: (1) receiving a natural language prompt, (2) constructing an SQL query via an agent, (3) executing the query to retrieve relevant records, and (4) generating a natural language report of the results. The baseline agent is a predefined LLM chain equipped with tools for executing SQL queries and inspecting the database schema. The backend is built with FastAPI [FastAPI, 2023] and uses LangChain [Chase, 2022] and CrewAI [CrewAI, 2025] for agent orchestration. The frontend is built with React [Meta, 2025], and the database is SQLite [Python Software Foundation, 2024].
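The four-stage workflow can be sketched as a single function. This is a minimal illustration, not the production implementation (which uses LangChain tool-calling agents); `llm` stands in for any prompt-to-completion callable, such as a GPT-4o mini wrapper.

```python
import sqlite3

def answer(prompt: str, llm, conn: sqlite3.Connection) -> str:
    """Minimal sketch of the four-stage RAG workflow; `llm` is any
    callable mapping a prompt string to a completion string."""
    # (1) receive the natural language prompt; (2) construct SQL via the agent,
    # grounding it in the database schema
    schema = "\n".join(r[0] for r in conn.execute(
        "SELECT sql FROM sqlite_master WHERE sql IS NOT NULL"))
    sql = llm(f"Schema:\n{schema}\n\nWrite one SQLite query for: {prompt}")
    # (3) execute the query to retrieve relevant records
    rows = conn.execute(sql).fetchall()
    # (4) generate a natural language report of the results
    return llm(f"Question: {prompt}\nRows: {rows}\nSummarize for an analyst.")
```

The real baseline additionally gives the LLM chain tools for schema inspection and query execution rather than inlining them as above.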

3. Technical Experimental Study

3.1 Research Question

To what extent do different agentic architectures affect the accuracy, speed, cost, and perceived utility of the RAG-based dialogue system?

3.2 Setup and Evaluation Metrics

A test set of 10 dialogues with 30 system turns was annotated, each containing a rubric and ground-truth SQL queries of varying complexity across use cases including attack analysis, protocol investigation, severity analysis, and temporal analysis.

Figure 1: Example of a test rubric for IDS/IPS Alerts Analysis

Each turn is evaluated on the following metrics: SQL query validity (the query executes without error), pattern match against the ground-truth query, latency, token usage and cost, and LLM-as-a-judge generation quality (factuality, helpfulness, and overall quality).
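Query validity and pattern matching can be scored per turn with a small harness. The sketch below treats validity as error-free execution and matching as result-set equality against the ground-truth SQL; the paper's exact pattern-match criterion may differ.

```python
import sqlite3

def evaluate_turn(conn, generated_sql: str, gold_sql: str) -> dict:
    """Score one system turn. 'valid' = the generated SQL executes
    without error; 'match' = its result set equals the ground truth's
    (one plausible operationalization of pattern matching)."""
    try:
        got = conn.execute(generated_sql).fetchall()
    except sqlite3.Error:
        return {"valid": False, "match": False}
    gold = conn.execute(gold_sql).fetchall()
    # Compare as order-insensitive multisets of stringified rows
    return {"valid": True,
            "match": sorted(map(str, got)) == sorted(map(str, gold))}
```

Aggregating these booleans over the 30 annotated turns yields the percentages reported in Table 1.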

3.3 Results

Table 1: Retrieval Performance Across Agents (N=30 turns)

| Metric         | LLM Chain | ReAct  | Multi-Agent |
|----------------|-----------|--------|-------------|
| Query Validity | 96.7%     | 100.0% | 70.0%       |
| Pattern Match  | 86.7%     | 86.7%  | 43.3%       |
Figure 2: Retrieval Performance by Agent Architecture (query validity and pattern match accuracy for the LLM Chain, ReAct, and Multi-Agent systems)
Table 2: Efficiency and Token Distribution Across Agents (N=30 turns)

| Metric         | LLM Chain | ReAct   | Multi-Agent |
|----------------|-----------|---------|-------------|
| Total Time (s) | 306.9     | 475.8   | 2741.4      |
| Input Tokens   | 84,265    | 190,517 | 2,047,357   |
| Output Tokens  | 13,275    | 23,259  | 444,435     |
| Total Tokens   | 97,540    | 213,776 | 2,491,792   |
| Total Cost ($) | 0.0206    | 0.0425  | 0.5738      |
Figure 3: Cost and Speed Comparison Across Agents (total time: LLM Chain 306.9 s, ReAct 475.8 s, Multi-Agent 2741.4 s; total cost: $0.021, $0.043, $0.574 respectively)
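The costs in Table 2 can be reproduced from the token counts, assuming OpenAI's published GPT-4o mini rates of $0.15 per 1M input tokens and $0.60 per 1M output tokens (the rates at the time of the study):

```python
# Assumed GPT-4o mini pricing: $0.15 / 1M input, $0.60 / 1M output tokens
PRICE_IN, PRICE_OUT = 0.15 / 1e6, 0.60 / 1e6

def run_cost(input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one evaluation run from its token counts."""
    return input_tokens * PRICE_IN + output_tokens * PRICE_OUT

# Token counts taken directly from Table 2
costs = {name: round(run_cost(i, o), 4) for name, (i, o) in {
    "LLM Chain":   (84_265, 13_275),
    "ReAct":       (190_517, 23_259),
    "Multi-Agent": (2_047_357, 444_435),
}.items()}
# costs == {"LLM Chain": 0.0206, "ReAct": 0.0425, "Multi-Agent": 0.5738}
```

The reconstruction matches Table 2 to four decimal places, confirming the cost figures follow directly from token usage.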
Table 3: Generation Quality (LLM-as-a-Judge Score, 1–5)

| Metric          | LLM Chain | ReAct | Multi-Agent |
|-----------------|-----------|-------|-------------|
| Factuality      | 4.67      | 4.53  | 3.70        |
| Helpfulness     | 4.40      | 4.57  | 4.17        |
| Overall Quality | 4.37      | 4.47  | 3.67        |

3.4 Discussion

The ReAct agent achieved 100% query validity: every generated SQL query executed successfully against the database, thanks to its iterative reasoning and refinement. It also tied with the LLM Chain for the highest pattern match accuracy (86.7%). The LLM Chain had the fastest response times and the lowest token consumption and cost.

The multi-agent architecture performed worst across all metrics due to poor orchestration. Improving it was not a priority, however, since the simpler agents already achieved high accuracy at lower cost and latency. The LLM judge scored the LLM Chain highest for factuality but favored ReAct for helpfulness and overall quality.

Key Insight

The trade-offs between agents motivate a system design that supports switching between them based on user preference, accuracy requirements, and budget.

The multi-agent orchestration consists of three specialized agents: an SQL Query Analyst, a Cybersecurity Threat Analyst, and a Report Formatter, each with a system prompt specifying its role, tools, schema knowledge, reasoning style, and expected output markup. In one example, the multi-agent system hallucinated a convincing but incorrect output, whereas the ReAct agent returned accurate results under identical prompting, highlighting how complex orchestration can introduce unreliable behavior.

Figure 4: Comparison of ReAct vs. Multi-Agent outputs for the same query; (a) ReAct correct result, (b) Multi-Agent hallucination

4. Design-Focused Experimental Study

4.1 Research Question

Does the structured-output strategy (LLM-generated visualizations versus text only) affect the usability, helpfulness, clarity, trust, and perceived efficiency of a cybersecurity data analytics chatbot?

4.2 Setup

The experiment added a header button that toggles visualizations by updating the agents' structured output. Twelve postgraduate students at the University of Cambridge used the chatbot in two conditions, text (plain-text summary) and chart (visual data presentation), for five minutes each. Half held undergraduate degrees in computer science, and none had formal cybersecurity experience.

Figure 5: Comparison of Text-Only vs. Chart-Based Responses in the Design Study; (a) text-only response, (b) chart response with accompanying table

The post-interaction survey included 5 questions, each on a 5-point Likert scale, comparing the two systems across five dimensions:

Table 4: Post-Interaction Survey Questions

| Dimension      | Survey Question                                              |
|----------------|--------------------------------------------------------------|
| A. Usability   | The responses were presented in a well-organized way.        |
| B. Helpfulness | The system gave me the right level of detail.                |
| C. Clarity     | I could identify the key evidence supporting the conclusion. |
| D. Trust       | I trust the chatbot's output for the tasks I performed.      |
| E. Efficiency  | This version helped me understand the output quickly.        |

4.3 Results and Statistical Analysis

Likert-scale responses were analyzed with a one-sided Wilcoxon signed-rank test (α = 0.05), chosen given the a priori expectation of improvement and the small sample size. The chart version scored higher across all dimensions, with statistically significant differences (padj < 0.05) after Holm–Bonferroni adjustment in usability, clarity, and efficiency.
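The Holm–Bonferroni step-down adjustment behind the padj values is straightforward to implement (scipy's `wilcoxon` would supply the raw per-dimension p-values). Applied to the raw p-values in Table 5, the sketch below reproduces the adjusted values exactly:

```python
def holm_bonferroni(pvals):
    """Holm-Bonferroni step-down adjustment: sort p-values ascending,
    multiply the i-th smallest (0-indexed) by (m - i), cap at 1, and
    enforce monotonicity over the sorted sequence."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adj, running = [0.0] * m, 0.0
    for rank, i in enumerate(order):
        running = max(running, min(1.0, pvals[i] * (m - rank)))
        adj[i] = running
    return adj

# Raw p-values for dimensions A-E (Table 5)
p_raw = [0.002, 0.045, 0.002, 0.039, 0.001]
p_adj = [round(p, 3) for p in holm_bonferroni(p_raw)]
# p_adj == [0.008, 0.078, 0.008, 0.078, 0.005], matching Table 5
```

Note how helpfulness (0.045) and trust (0.039) lose significance after correction, which is why only usability, clarity, and efficiency are starred.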

Table 5: Design Experiment Results with Holm–Bonferroni Correction (1–5 Likert Scale)

| Dimension      | Mean Text | Mean Chart | SD Text | SD Chart | p     | padj   |
|----------------|-----------|------------|---------|----------|-------|--------|
| A. Usability   | 3.42      | 4.67       | 0.79    | 0.49     | 0.002 | 0.008* |
| B. Helpfulness | 3.50      | 4.42       | 1.09    | 0.67     | 0.045 | 0.078  |
| C. Clarity     | 3.17      | 4.50       | 0.83    | 0.67     | 0.002 | 0.008* |
| D. Trust       | 3.83      | 4.58       | 0.94    | 0.67     | 0.039 | 0.078  |
| E. Efficiency  | 2.58      | 4.75       | 1.31    | 0.45     | 0.001 | 0.005* |
Figure 6: Text vs. Chart Scores Across Dimensions (usability*, helpfulness, clarity*, trust, efficiency*; * padj < 0.05)

4.4 Discussion

The results show that structured visual output provides user experience value in Threat Explorer, supporting the decision to make visualizations the default output setting.

Socio-Technical Reflection and Trust

Trust and helpfulness in the chart condition warrant critical reflection. Users may perceive data visualizations as more trustworthy than plain text, which is risky in an LLM-powered system that can generate inaccurate or incomplete queries, hallucinate explanations, or misinterpret data. The risk is greatest for users without domain expertise, who may over-rely on visually compelling output and accept it without question.

Caution

Visually compelling LLM-generated charts may increase user trust even when the underlying data retrieval is incorrect. This creates a risk of automation bias—users accepting AI output without critical evaluation, particularly when they lack domain expertise.

Mitigation and Transparency

To improve transparency, the agents always include the executed SQL queries in their output, rendered with syntax highlighting so the queries are easy to inspect, and they provide an explanation of their reasoning. In future designs, the query should be editable so users can explore the data even when the agents cannot fulfill a request. Such interactive features shift the user's role from passive recipient to active validator, promoting human-in-the-loop ML [Wu et al., 2022] and human–AI collaboration [Vats et al., 2024].
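One way to realize this transparency contract is to make the SQL and reasoning first-class fields of every response. A hypothetical payload sketch (field names are illustrative, not the system's actual schema):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class AgentResponse:
    """Sketch of a transparency-first response payload: the SQL and
    the agent's reasoning ship alongside the answer so users can
    audit (and, in future designs, edit) the query."""
    answer: str                    # natural language report for the user
    sql: str                       # exact query executed, shown with syntax highlighting
    reasoning: str                 # the agent's explanation of its thinking
    chart: Optional[dict] = None   # optional chart spec consumed by the frontend
```

Keeping the chart spec optional lets the same payload serve both the text-only and chart-augmented conditions of the design study.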

5. Final System Design

The final system supports switching between agents as well as logging, uploading, and downloading dialogues in JSON format. The backend is a FastAPI server that uses LangChain and CrewAI for orchestration; the frontend is built with React, and the database is SQLite. The repository includes documentation for running the system and scripts for evaluation and report generation. The source code is available at github.com/kostadindev/Threat-Explorer.
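Agent switching can be as simple as a registry of interchangeable callables keyed by name. A minimal sketch (names illustrative, not the actual implementation):

```python
# Registry of interchangeable agents; the frontend selects one by key.
AGENTS = {}

def register(name):
    """Decorator registering an agent callable under a name."""
    def wrap(fn):
        AGENTS[name] = fn
        return fn
    return wrap

@register("llm-chain")
def llm_chain_agent(prompt: str) -> str:
    return f"[llm-chain] {prompt}"   # placeholder for the LLM chain

@register("react")
def react_agent(prompt: str) -> str:
    return f"[react] {prompt}"       # placeholder for the ReAct agent

def handle(prompt: str, agent: str = "llm-chain") -> str:
    """Dispatch a user prompt to the selected agent."""
    return AGENTS[agent](prompt)
```

Because every agent shares one signature, the trade-offs from Section 3 reduce to a per-request configuration choice.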

Figure 7: Dialogue examples, including a table query with SQL, a user investigation query, a severity distribution bar chart, a detailed record view, a pie chart visualization, and an elaboration on findings
Figure 8: Logs Output from Threat Explorer

6. Conclusion

This work explored technical and design aspects of Threat Explorer, a conversational AI cybersecurity analysis tool. The technical study showed that while the ReAct agent is most reliable, the LLM Chain is most efficient, leading to the design decision to support switching between agents.

The design experiment yielded statistically significant results showing that visualizations enhance usability, clarity, and efficiency, making them a core feature of Threat Explorer. While helpful, these visualizations pose a risk of misleading users through inaccurate LLM-generated content. Threat Explorer aims to improve transparency and mitigate over-reliance by providing its thought process and SQL queries.

Future work should focus on stronger guardrails against unrelated or dangerous prompts; editable inline SQL queries and charts; support for more chart types; faster responses by streaming completed segments over WebSockets; adaptive agent routing; improved prompting and orchestration; short- and long-term memory; and observability.

References