What is the difference between using BERT-class and GPT-class models for CTI classification?

BERT-class encoder models are well-suited for classification tasks where the label set is fixed and known at training time — such as assigning a threat report to a MITRE ATT&CK technique or a malware family taxonomy. They are fast, cost-efficient at inference, and fine-tune effectively on labeled CTI corpora of a few thousand examples. GPT-class generative models excel at enrichment tasks where the output is open-ended: summarizing a raw IOC report, extracting structured fields from unformatted threat actor prose, or synthesizing a narrative intelligence brief from structured graph data. In production CTI pipelines, the two model types are used at different stages rather than competing: the encoder classifies, the generative model enriches.

How should confidence thresholds be set for CTI classification in a SOC context?

Confidence threshold selection is a precision-recall tradeoff, and in a SOC context the costs are asymmetric. A false negative (a genuine threat event that is not surfaced to an analyst) can have severe consequences if it involves critical infrastructure or an advanced persistent threat group. A false positive (a misclassified event that enters the analyst queue) costs analyst time but causes no direct harm. For high-severity sectors (critical infrastructure, defense, energy), thresholds should be set lower (0.60–0.70) to prioritize recall, with the additional analyst review load accepted as the cost. For broader monitoring, thresholds of 0.78–0.85 reduce queue volume. The threshold should be calibrated against a held-out labeled dataset from your specific threat landscape, not carried over from a generic benchmark.

What training data sources are most effective for fine-tuning LLMs on CTI classification?

The MITRE ATT&CK knowledge base provides the most reliable labeled data for technique-level classification: each technique entry contains detailed prose descriptions, procedure examples, and detection guidance that can serve as positive examples. AlienVault OTX pulse exports and MISP event feeds provide labeled malware family and threat actor data at scale. VirusTotal Intelligence reports offer file-level and network-level IOC context. For adversarial TTP labeling specifically, CTI reports published by security vendors (CrowdStrike, Mandiant, Recorded Future) contain high-quality technique attributions but require entity normalization before use as training labels. The critical quality control step is ensuring label consistency across sources — the same technique should carry the same ATT&CK ID regardless of the source document's terminology.

How do LLM-based CTI pipelines handle STIX 2.1 and MISP output formats?

LLM classification produces structured JSON records with extracted fields (threat actor, malware family, technique IDs, IOC values, confidence scores). These records are mapped to STIX 2.1 objects in a post-classification serialization step: threat actors become STIX Threat Actor objects, malware families become Malware objects, techniques map to Attack Pattern objects with ATT&CK external references, and relationships between them are expressed as STIX Relationship objects. The full set is bundled into a STIX Bundle for export or TAXII sharing. For MISP, the same structured records map to MISP Events with Attributes and Objects; the MISP ATT&CK galaxy provides the technique taxonomy mapping. Both serialization layers should be implemented as separate post-processing modules downstream of the LLM classification stage, not baked into the classification prompt, to allow format updates without retraining.

What evaluation metrics should be used for CTI classification models beyond overall accuracy?

Overall accuracy is a misleading metric for CTI classification because threat label distributions are highly imbalanced: common techniques like T1566 (Phishing) appear orders of magnitude more frequently than rare but high-value techniques. Per-technique precision and recall, reported separately, give a more accurate picture of model performance across the label space. Macro-averaged F1 (the unweighted average across all technique classes) weights errors on rare and common classes equally, making it more informative than micro-averaged F1 for imbalanced CTI corpora. For operational use, the metric that matters most is technique-level recall for the techniques on your priority monitoring list: a model that misses 20% of T1055 (Process Injection) events is operationally unacceptable regardless of its overall accuracy score.

LLM threat classification for CTI pipelines

Cyber threat intelligence teams face a compounding data problem. The volume of raw threat data – IOC feeds from ISACs, OSINT scraped from paste sites and Telegram channels, dark web forum exports, vendor intelligence reports – has grown faster than analyst headcount at every organization that takes CTI seriously. The result is a backlog: threat data that arrives in time to be actionable but is not classified, enriched, or correlated before the window closes. Manual classification at scale is not a workflow problem. It is a structural problem that cannot be solved by hiring more analysts.

Large language models offer a genuine solution – not as a replacement for analyst judgment, but as a classification and enrichment layer that converts unstructured threat data into structured records at machine speed. This article covers the architectural decisions that matter when integrating LLMs into a CTI pipeline: which model class to use for which task, how to structure the ingest-to-output pipeline with STIX 2.1 and MITRE ATT&CK, what training data produces reliable technique-level classifiers, how to evaluate performance in a SOC context, and how to design the analyst-in-the-loop controls that keep the system trustworthy under adversarial conditions.

Why manual CTI classification does not scale

The scale problem is both quantitative and qualitative. On the quantitative side, a mid-tier defense organization monitoring a realistic set of threat feeds – two or three ISAC feeds, AlienVault OTX, several MISP community servers, and passive DNS and certificate transparency log enrichment – receives tens of thousands of raw indicators per day. Manually classifying each IOC by threat actor, malware family, and relevant ATT&CK technique would consume more analyst-hours per day than most CTI teams have.

The qualitative problem is source heterogeneity. ISACs deliver structured STIX bundles with relatively clean labels. OSINT feeds deliver unstructured prose: blog posts, forum threads, Telegram channel exports. Dark web data arrives in formats that require significant preprocessing before any classification attempt is meaningful. Each source requires a different extraction approach, and maintaining reliable rule-based extractors across all of them – while keeping pace with the way threat actors deliberately vary their language to evade detection – is a maintenance burden that compounds over time.

Analyst burnout is the downstream consequence. When the classification queue is permanently deep, analysts stop reviewing individual records and start processing only the highest-severity pre-filtered items. The result is systematic blind spots in the threat picture – not because the data was not collected, but because it was never classified and correlated. An LLM classification layer does not eliminate the need for analyst judgment; it eliminates the part of the workflow where analysts are doing work that can be automated reliably.

LLM architecture for CTI: encoder vs generative models

The most consequential architectural choice in a CTI LLM pipeline is which model class to use at each stage. Encoder models (BERT-class) and generative models (GPT-class) have fundamentally different strengths, and using the wrong class for a task produces either poor accuracy or unnecessary cost.

Encoder models for classification

BERT-class encoder models – especially domain-adapted variants fine-tuned on security text, such as SecBERT or CySecBERT – are the right choice for fixed-taxonomy classification tasks. Given a CTI document and a predefined label set (ATT&CK technique IDs, malware family names, threat actor groups), a fine-tuned encoder produces classification scores across the label space in under 500 milliseconds on modest hardware. Fine-tuning on labeled CTI corpora of 5,000 to 20,000 examples typically reaches production-ready accuracy.

The critical constraint is that the label set must be fixed and known at training time. Encoder models cannot generalize to labels not seen during training. For MITRE ATT&CK technique classification, this is not a limitation in practice: the ATT&CK technique taxonomy is version-controlled, and updates can trigger a targeted fine-tuning run. For malware family classification, where new families emerge continuously, the encoder should be paired with an out-of-distribution detection mechanism that routes unknown-family candidates to an analyst rather than forcing a nearest-match classification.
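One simple way to approximate the out-of-distribution routing described above is to threshold the classifier's top softmax probability. The sketch below assumes the encoder exposes raw logits per malware family; the function name `route_family`, the label names, and the 0.5 cutoff are illustrative assumptions, and maximum-softmax thresholding is only one of several possible OOD techniques.

```python
import math

def softmax(logits):
    """Convert raw logits to probabilities (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def route_family(logits, labels, min_top_prob=0.5):
    """Return (label, prob) when the top class is confident enough,
    otherwise ("ANALYST_REVIEW", prob) instead of forcing a nearest match."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    if probs[best] < min_top_prob:
        return "ANALYST_REVIEW", probs[best]
    return labels[best], probs[best]

# Hypothetical label set and logits, for illustration only.
labels = ["family_a", "family_b", "family_c"]
confident = route_family([4.0, 0.5, 0.2], labels)
uncertain = route_family([1.0, 0.9, 1.1], labels)
```

In this sketch a flat score distribution, which is what an unseen family often produces, falls below the cutoff and is sent to an analyst. The cutoff itself would need calibrating on held-out data.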

Generative models for enrichment

Generative models are the right choice when the output is open-ended or requires reasoning across document context. Extracting structured IOC fields from an unformatted threat actor report, synthesizing a narrative brief from a set of structured event records, inferring victim geography from implicit cues rather than explicit country names – these tasks require capabilities that encoder classification cannot provide.

The key discipline when using generative models in a CTI pipeline is constraining the output format. A generative model left to produce free-text output will introduce synonymy and inconsistency that make downstream aggregation unreliable. The solution is structured output prompting: the model is instructed to produce a JSON response conforming to a strict schema, with schema validation applied on receipt. Response parsing failures trigger an automatic retry with corrective instructions. This discipline converts a probabilistic generative system into a reliable structured data source.
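The validate-and-retry loop can be sketched as follows. This is a minimal illustration, not a specific vendor's API: `call_model` is a hypothetical caller-supplied function that takes a prompt string and returns the model's raw text, and the three-field `SCHEMA` is an assumed stand-in for a real schema.

```python
import json

# Hypothetical schema: field name -> required Python type after JSON parsing.
SCHEMA = {"threat_actor": str, "sector": str, "technique_ids": list}

def validate(raw):
    """Parse raw model output; return (record, None) or (None, error message)."""
    try:
        record = json.loads(raw)
    except json.JSONDecodeError as exc:
        return None, f"invalid JSON: {exc.msg}"
    if not isinstance(record, dict):
        return None, "top-level value must be a JSON object"
    for field, expected in SCHEMA.items():
        if not isinstance(record.get(field), expected):
            return None, f"field '{field}' missing or not {expected.__name__}"
    return record, None

def enrich_with_retry(call_model, document, max_attempts=3):
    """Call the model and validate each reply; on failure, retry with the
    validation error appended to the prompt as a corrective instruction."""
    prompt = f"Return JSON with keys {sorted(SCHEMA)} for this report:\n{document}"
    for _ in range(max_attempts):
        record, error = validate(call_model(prompt))
        if record is not None:
            return record
        prompt += f"\nYour previous reply was rejected ({error}). Return only valid JSON."
    raise ValueError(f"no valid response after {max_attempts} attempts")

# Stand-in model for demonstration: fails once, then returns valid JSON.
replies = iter([
    "not json",
    '{"threat_actor": "Example Actor", "sector": "energy", "technique_ids": ["T1566"]}',
])
result = enrich_with_retry(lambda prompt: next(replies), "sample report text")
```

Keeping validation outside the model call means a record that never validates raises an error instead of silently entering the pipeline.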

Generative enrichment is also the right place for confidence scoring. The model is prompted to return a per-field confidence score between 0 and 1, representing genuine epistemic uncertainty given the source document content. A message that explicitly names a victim organization and country produces high-confidence geography and organization fields; a message that implies a sector without naming an organization produces lower confidence. These scores drive downstream routing decisions in the pipeline.

Pipeline design: from raw IOC to MITRE ATT&CK mapping

A production CTI classification pipeline has five distinct stages, each with specific inputs, outputs, and failure modes.

Stage 1 – Ingest and normalize. Raw threat data arrives in heterogeneous formats: STIX 2.1 bundles from ISAC feeds, MISP event exports, JSON from commercial threat intelligence APIs, and unstructured text from OSINT sources. The ingest stage normalizes all inputs to a canonical internal document format before any LLM processing. For STIX and MISP inputs, this is primarily field extraction. For unstructured text, this includes language detection, encoding normalization, and minimum-length filtering (documents below approximately 50 tokens carry insufficient context for reliable classification). Source metadata – feed identifier, ingestion timestamp, confidence score from the upstream provider if present – is preserved as envelope fields throughout the pipeline.

Stage 2 – Binary relevance gate. Not all ingested documents are candidates for full LLM classification. A lightweight binary classifier (a fine-tuned encoder model at 350M parameters or smaller) runs first to filter out documents that do not contain operational threat content: news summaries, administrative bulletins, false positive IOCs already known-clean. This gate reduces LLM inference volume by 60–80% in typical feed configurations, directly reducing per-day cost. The gate is calibrated for high recall – missing a genuine threat document is more costly than sending a non-operational document to the LLM stage.

Stage 3 – LLM classification and enrichment. Documents passing the binary gate enter the classification stage. A fine-tuned encoder assigns ATT&CK technique IDs and malware family labels. A generative enrichment pass extracts structured fields: threat actor group, victim organization, sector (from a fixed eight-category taxonomy), geography (ISO 3166-1 alpha-2), attack vector, and per-field confidence scores. The two passes can run concurrently since they operate on the same input document.

Stage 4 – MITRE ATT&CK mapping and entity resolution. Technique IDs from the classifier are mapped to ATT&CK objects with full enrichment: tactic association, platform applicability, and detection guidance references. Threat actor and victim organization names are resolved against the existing entity index using fuzzy name matching and country-code disambiguation. Known aliases are canonicalized. New entities trigger provisional record creation for analyst review rather than silent insertion.

Stage 5 – STIX 2.1 serialization and output. Enriched records are serialized as STIX 2.1 Bundles – Threat Actor, Malware, Attack Pattern, Indicator, and Relationship objects with proper external references to ATT&CK technique IDs. Bundles are validated against the STIX 2.1 schema before storage or export. For MISP integration, the same structured records map to MISP Events via the ATT&CK galaxy. For SIEM integration, CEF and structured JSON formats are supported for direct alert ingestion.
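As a rough illustration of the Stage 5 serialization step, the sketch below builds a minimal STIX 2.1 bundle from a classified record using plain dictionaries. The input record shape (`threat_actor`, `technique_ids`) is an assumption, only the required common properties are populated, and no schema validation is performed here; a production system would more likely rely on a maintained STIX library plus a validator.

```python
import uuid
from datetime import datetime, timezone

def _sdo(stix_type, **props):
    """Build a minimal STIX 2.1 object with the common required properties."""
    now = datetime.now(timezone.utc).isoformat(timespec="milliseconds")
    now = now.replace("+00:00", "Z")
    return {
        "type": stix_type,
        "spec_version": "2.1",
        "id": f"{stix_type}--{uuid.uuid4()}",
        "created": now,
        "modified": now,
        **props,
    }

def to_bundle(record):
    """Map one classified record to Threat Actor, Attack Pattern,
    and 'uses' Relationship objects wrapped in a bundle."""
    actor = _sdo("threat-actor", name=record["threat_actor"])
    objects = [actor]
    for technique_id in record["technique_ids"]:
        pattern = _sdo(
            "attack-pattern",
            name=technique_id,  # a real pipeline would look up the technique name
            external_references=[
                {"source_name": "mitre-attack", "external_id": technique_id}
            ],
        )
        uses = _sdo(
            "relationship",
            relationship_type="uses",
            source_ref=actor["id"],
            target_ref=pattern["id"],
        )
        objects += [pattern, uses]
    return {"type": "bundle", "id": f"bundle--{uuid.uuid4()}", "objects": objects}

# Hypothetical record, for illustration only.
bundle = to_bundle({"threat_actor": "Example Actor",
                    "technique_ids": ["T1566", "T1055.012"]})
```

Because the function consumes an already-structured record, the output format can change without touching the classification stage.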

Training data for adversarial TTP classification

The quality of a CTI classification model is determined primarily by the quality and coverage of its training data. Three sources provide the most reliable labeled data for ATT&CK technique classification.

The MITRE ATT&CK knowledge base is the canonical starting point. Each technique entry contains prose descriptions, procedure examples drawn from real-world threat actor reports, and detection guidance. Procedure examples – descriptions of how specific threat actor groups have used a technique in confirmed operations – are the highest-quality training signal because they capture the natural language patterns analysts use when describing TTP activity. The ATT&CK corpus is maintained under version control; each release adds new techniques and refines existing ones, so fine-tuning pipelines should be aligned to specific ATT&CK versions.

AlienVault OTX pulse exports provide labeled threat actor and malware family data at scale. Each pulse contains a title, description, and associated IOCs tagged with the threat actor or malware family the submitter attributes them to. Label quality varies by submitter; filtering to pulses from verified organizations significantly improves training signal. OTX exports in STIX format enable consistent ingestion.

For adversarial TTP labeling, vendor intelligence reports (published under permissive terms) contain high-quality technique attributions stated explicitly: "The group used T1055.012 (Process Hollowing) to inject into legitimate Windows processes." These statements provide direct technique-level labels with contextual prose. Extracting them requires a one-time annotation pass to align report text to ATT&CK technique IDs, but the resulting labeled examples are among the most reliable available for fine-tuning.
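Explicit attributions of this kind can be pre-labeled mechanically before the annotation pass. The sketch below pulls ATT&CK technique IDs out of a sentence with a regular expression; the helper name `extract_labels` is an assumption, and the pattern only finds IDs that are written out, so implied techniques still need a human annotator.

```python
import re

# ATT&CK technique IDs: "T" + four digits, with an optional ".###" sub-technique.
TECHNIQUE_RE = re.compile(r"\bT\d{4}(?:\.\d{3})?\b")

def extract_labels(sentence):
    """Return the sorted, de-duplicated technique IDs mentioned in a sentence."""
    return sorted(set(TECHNIQUE_RE.findall(sentence)))

sentence = ("The group used T1055.012 (Process Hollowing) to inject into "
            "legitimate Windows processes.")
labels = extract_labels(sentence)
```

Each (sentence, labels) pair then becomes a candidate training example for the fine-tuning corpus.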

The labeling strategy for rare techniques requires special attention. ATT&CK contains over 600 techniques and sub-techniques, and many appear in fewer than 20 labeled examples in any available corpus. For these rare classes, data augmentation (paraphrasing procedure example descriptions) and few-shot prompting with a generative model as a fallback classifier are both viable approaches. The minimum practical floor for reliable fine-tuned classification is approximately 80 labeled examples per class; classes below this threshold should be routed to a generative model with a few-shot prompt rather than a fine-tuned encoder.
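The routing rule for rare classes can be expressed as a simple split of the label space by example count. The sketch below uses the floor of roughly 80 examples per class stated above; the function name and the toy label list are illustrative assumptions.

```python
from collections import Counter

MIN_EXAMPLES = 80  # approximate per-class floor from the text

def split_label_space(training_labels, min_examples=MIN_EXAMPLES):
    """Return (encoder_classes, fallback_classes): classes with enough labeled
    examples for fine-tuning versus classes sent to a few-shot generative prompt."""
    counts = Counter(training_labels)
    encoder = {label for label, n in counts.items() if n >= min_examples}
    fallback = set(counts) - encoder
    return encoder, fallback

# Toy corpus: one common technique, one rare one.
toy_labels = ["T1566"] * 120 + ["T1600"] * 12
encoder_classes, fallback_classes = split_label_space(toy_labels)
```

Re-running the split after each data refresh lets a class move to the fine-tuned encoder once it crosses the floor.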

Evaluation metrics in a SOC context

Standard accuracy metrics mislead when applied to CTI classification because the threat technique label distribution is heavily imbalanced. Techniques like T1566 (Phishing) and T1059 (Command and Scripting Interpreter) appear in a large share of real-world incident reports. Rare but high-value techniques – T1195 (Supply Chain Compromise), T1600 (Weaken Encryption) – appear far less frequently. A model that achieves 92% overall accuracy by concentrating performance on common techniques while failing on rare high-value ones is operationally useless.

The metrics that matter for production CTI classification are per-technique precision and recall, reported separately across the full technique taxonomy. Macro-averaged F1 – the unweighted average of per-class F1 across all technique labels – is the summary metric that best represents overall performance on an imbalanced label distribution. For a CTI pipeline serving a SOC, recall at the technique level for priority monitoring classes (the specific techniques relevant to the threat actors targeting your sector and geography) is the single most operationally important number. Missing 20% of T1055 events at a defense organization monitoring for advanced persistent threats is not an acceptable precision-recall tradeoff, regardless of what the macro F1 score looks like.
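A small worked example shows how overall accuracy can hide a complete miss on a rare technique. The sketch below assumes single-label classification for simplicity (a multi-label pipeline would compute the same quantities per label), and the ten toy predictions are invented for illustration.

```python
from collections import Counter

def per_class_metrics(y_true, y_pred):
    """Compute precision, recall, and F1 for every class seen in either list."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1
            fn[truth] += 1
    metrics = {}
    for label in set(y_true) | set(y_pred):
        predicted = tp[label] + fp[label]
        actual = tp[label] + fn[label]
        precision = tp[label] / predicted if predicted else 0.0
        recall = tp[label] / actual if actual else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        metrics[label] = {"precision": precision, "recall": recall, "f1": f1}
    return metrics

def macro_f1(metrics):
    """Unweighted mean of per-class F1."""
    return sum(m["f1"] for m in metrics.values()) / len(metrics)

# Nine phishing reports and one supply-chain report; the model predicts
# phishing for everything.
y_true = ["T1566"] * 9 + ["T1195"]
y_pred = ["T1566"] * 10
metrics = per_class_metrics(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
macro = macro_f1(metrics)
```

Here accuracy is 0.9 while recall on T1195 is zero and macro F1 falls to roughly 0.47, which is the gap the per-technique view is meant to expose.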

Error costs in a SOC context are asymmetric. A false positive – a document classified as containing a specific ATT&CK technique when it does not – costs analyst time reviewing a spurious record. The cost is bounded and manageable. A false negative – a genuine ATT&CK technique not surfaced by the classifier – can mean a threat actor TTP goes undetected until an incident occurs. Calibrating confidence thresholds to accept higher false positive rates in exchange for lower false negative rates is the correct operating point for high-stakes monitoring scenarios.

Operational integration: real-time, batch, and analyst-in-the-loop design

CTI classification pipelines operate in two modes with different latency and throughput requirements. Real-time classification is required when the source is a live stream – Telegram channel monitoring, live threat feed subscriptions, active network telemetry. The pipeline must classify each document as it arrives, with end-to-end latency measured in seconds rather than minutes. This constrains model selection: the encoder classification stage must run in under 500 milliseconds; the generative enrichment stage should average under 15 seconds per document. Async processing with a message queue between stages prevents backpressure from the generative stage from blocking ingestion.
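The decoupling between ingestion and the slower generative stage can be sketched with a bounded in-process queue. This is a simplified stand-in: a production deployment would more likely use a durable message broker, and the sleep call, queue size, and uppercase "enrichment" are placeholders rather than real model calls.

```python
import asyncio

async def ingest(docs, queue):
    """Producer: push documents as they arrive; waits only if the buffer is full."""
    for doc in docs:
        await queue.put(doc)
    await queue.put(None)  # sentinel: no more documents

async def enrich_worker(queue, results):
    """Consumer: drain the queue at the generative stage's own pace."""
    while True:
        doc = await queue.get()
        if doc is None:
            break
        await asyncio.sleep(0.01)  # stand-in for the slow generative call
        results.append(doc.upper())

async def run(docs):
    queue = asyncio.Queue(maxsize=100)  # the buffer absorbs bursts from ingestion
    results = []
    await asyncio.gather(ingest(docs, queue), enrich_worker(queue, results))
    return results

processed = asyncio.run(run(["doc a", "doc b"]))
```

The buffer lets ingestion keep accepting documents while enrichment catches up; adding more consumer workers is the usual way to raise throughput.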

Batch classification is appropriate for historical corpus analysis – re-classifying an existing IOC database against a new ATT&CK version, enriching a legacy MISP instance with structured fields, or processing a bulk export from a commercial threat intelligence platform. Batch mode can use larger, more accurate models since latency constraints are relaxed, and can run overnight without impacting real-time pipeline capacity.

Analyst-in-the-loop design is not optional for production CTI classification systems. LLM classifiers make systematic errors on edge cases, novel threat actor language patterns, and deliberately obfuscated content. Without a correction mechanism, these errors accumulate in the downstream graph and degrade the quality of intelligence products over time. The analyst queue – records routed for human review based on confidence thresholds – must include an inline correction interface that captures field-level edits as labeled training data. Corrections should feed a fine-tuning feedback loop that runs on a regular schedule, continuously improving model calibration on the specific threat landscape being monitored.
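Capturing field-level edits as training data can be as simple as diffing the model's record against the analyst's corrected version. The sketch below assumes both are flat dictionaries keyed by field name; the output example format is illustrative, not a fixed standard.

```python
def corrections_to_examples(model_record, analyst_record, document_text):
    """Emit one labeled example for each field the analyst changed."""
    examples = []
    for field, corrected in analyst_record.items():
        predicted = model_record.get(field)
        if predicted != corrected:
            examples.append({
                "text": document_text,
                "field": field,
                "predicted": predicted,
                "label": corrected,
            })
    return examples

# Hypothetical records: the analyst fixes geography and leaves sector unchanged.
model_out = {"sector": "energy", "geography": "FR"}
analyst_out = {"sector": "energy", "geography": "DE"}
examples = corrections_to_examples(model_out, analyst_out, "sample report text")
```

Storing the original prediction next to the corrected label also makes it possible to track which fields drift most between fine-tuning runs.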

Confidence threshold configuration is the primary operational control. For high-severity sectors (critical infrastructure, defense), lower thresholds (0.60–0.70) maximize recall at the cost of higher analyst queue volume. For broad monitoring where the primary objective is trend analysis rather than individual event alerting, thresholds of 0.78–0.85 reduce queue volume to a manageable level. Thresholds should be calibrated separately per field – geography confidence and technique confidence have different accuracy profiles across the model's evaluation set – and reviewed quarterly against analyst correction rates to detect distribution shift.
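Per-field thresholds can be kept as a small configuration table per deployment profile. In the sketch below the profile names and the specific values are assumptions chosen from within the ranges given above, and a field is surfaced to the analyst queue when its confidence meets the threshold for that field.

```python
# Hypothetical per-field thresholds for two deployment profiles.
PROFILES = {
    "high_severity": {"technique": 0.65, "geography": 0.70},
    "broad_monitoring": {"technique": 0.80, "geography": 0.85},
}

def surfaced_fields(confidences, profile):
    """Return the fields whose confidence meets the profile's threshold."""
    thresholds = PROFILES[profile]
    return sorted(
        field for field, score in confidences.items()
        if field in thresholds and score >= thresholds[field]
    )

scores = {"technique": 0.72, "geography": 0.68}
high = surfaced_fields(scores, "high_severity")
broad = surfaced_fields(scores, "broad_monitoring")
```

The same record surfaces its technique field under the high-severity profile and nothing under broad monitoring, which is the recall-versus-queue-volume trade described above.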

For a deeper look at how CTI platforms integrate structured threat data across multi-source environments, see our guide to defense-grade CTI platform architecture.

Integrating LLM classification with OSINT monitoring pipelines

LLM classification does not operate in isolation. In a mature CTI program, it is one stage in a larger pipeline that begins with source monitoring and ends with analyst-ready intelligence products and SIEM-integrated alerts. The integration points that require specific engineering attention are the handoffs between stages.

OSINT source monitoring – passive DNS, certificate transparency log scanning, dark web forum indexing, and open messaging platform channel monitoring – generates the raw document stream that feeds the classification pipeline. Each source type introduces different data quality issues. Passive DNS data is structured but high-volume with many benign records. Dark web forum content is unstructured, multilingual, and requires entity disambiguation to separate genuine threat actors from impersonators. Open messaging platform channels mix high-signal attack announcements with noise, propaganda, and disinformation at a ratio that varies significantly by channel.

The classification pipeline's binary gate stage is the primary mechanism for handling source noise. A gate model fine-tuned on labeled examples from each source type will significantly outperform a generic relevance classifier. Investing in per-source gate models is the highest-ROI tuning investment available in a CTI classification pipeline because it directly reduces the LLM inference cost that dominates per-day operating expense.
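A lighter-weight variant of per-source tuning is to keep one gate model but calibrate a separate decision threshold for each source against that source's own validation set. The sketch below does not fine-tune separate models; it illustrates only the high-recall calibration step, and the source names, scores, and 0.75 recall target are toy assumptions (a real gate would target much higher recall).

```python
import math

def threshold_for_recall(scores, labels, target_recall):
    """Return the highest threshold that still passes at least target_recall
    of the relevant documents (label == 1). Documents pass when score >= threshold."""
    positives = sorted((s for s, y in zip(scores, labels) if y == 1), reverse=True)
    if not positives:
        raise ValueError("validation set needs at least one relevant document")
    needed = max(1, math.ceil(target_recall * len(positives)))
    return positives[needed - 1]

# Toy validation sets: (gate scores, relevance labels) per source.
VALIDATION = {
    "passive_dns": ([0.9, 0.8, 0.4, 0.2, 0.7, 0.1], [1, 1, 1, 1, 0, 0]),
    "telegram": ([0.95, 0.6, 0.55, 0.5, 0.3, 0.2], [1, 1, 1, 1, 0, 0]),
}
GATE_THRESHOLDS = {
    source: threshold_for_recall(scores, labels, target_recall=0.75)
    for source, (scores, labels) in VALIDATION.items()
}

def passes_gate(source, score):
    """Apply the threshold calibrated for this document's source."""
    return score >= GATE_THRESHOLDS[source]
```

With these toy numbers the same score of 0.5 passes the passive DNS gate but not the Telegram gate, since each source has a different score distribution.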

SIEM integration at the output end of the pipeline requires careful schema mapping. Most enterprise SIEMs ingest CEF (Common Event Format) or structured JSON over syslog or a REST webhook. STIX 2.1 bundles are not natively ingested by most SIEMs without a translation layer. The practical approach is to maintain two output streams from the classification pipeline: a STIX bundle stream for CTI platform ingestion and inter-organization sharing, and a SIEM-native alert stream that maps the most operationally relevant fields (technique ID, actor, severity, affected organization) to the SIEM schema. Correlation rules in the SIEM should reference ATT&CK technique IDs as the join key between CTI-derived alerts and endpoint/network telemetry events.

The operational maturity of OSINT-based threat monitoring at defense organizations has increased substantially in the past three years, driven largely by the practical accessibility of LLM-based text processing. What required a team of analysts and a significant rules maintenance burden a few years ago can now be addressed with a well-engineered classification pipeline running on modest infrastructure.

Corvus.Sense applies LLM-based CTI classification to real-time Telegram channel monitoring and threat actor profiling – converting unstructured open-source intelligence into structured threat actor records, ATT&CK-mapped technique timelines, and STIX-exportable intelligence products. If your team is managing CTI at scale and needs a production-ready classification layer, Corvus.Sense is built for that problem.

