Commercial threat intelligence feeds have a well-documented lag problem. By the time an indicator of compromise (IOC) — a malicious IP address, a command-and-control domain, a file hash associated with a new malware sample — appears in a paid feed, it has often already been active for 24 to 72 hours. Adversaries set up infrastructure, conduct reconnaissance, and post operational details in open-access channels long before any feed vendor picks up the signal. For defense software engineers and procurement teams evaluating CTI tooling, this lag is not an edge case: it is the default condition.
The practical response is to build, or procure, a pipeline that extracts IOCs directly from the open sources where they appear first. This article covers the source landscape, the extraction and normalization architecture, false positive handling, real-time streaming mechanics, and the enrichment steps that transform a raw extracted indicator into actionable threat intelligence.
The speed advantage of open-source IOC collection
The gap between first open-source mention and commercial feed publication is well established in the threat intelligence community. A domain registered to serve as a C2 endpoint is often announced — or at least detectable — in Telegram channels operated by threat actors within hours of going live. The same domain may take 24 to 96 hours to appear in a premium feed after a vendor analyst processes and validates it. For high-tempo operations where threat actors rotate infrastructure frequently, this window represents the entire operational lifetime of some indicators.
Open sources also surface IOC types that commercial feeds structurally underrepresent. Paste sites receive data dumps from breaches within minutes of exfiltration. Telegram channels operated by hacktivist groups and state-aligned actors announce targets, claim credit, and post proof-of-compromise material that includes hashes, IPs, and domains not yet associated with any known campaign in commercial databases. Reddit communities and specialized Discord servers host discussions of newly discovered malware samples, often including hash values and behavioral descriptions, before formal analysis is published.
The value is not that open sources replace commercial feeds — they do not. Commercial feeds provide validated, structured, high-confidence indicators at scale. Open sources provide speed and coverage of sources too volatile or too niche for commercial collection operations to monitor systematically. A production CTI pipeline needs both.
Source landscape: where IOCs appear first
Telegram channels. Since 2022, Telegram has become the primary public-facing coordination and announcement platform for a broad spectrum of threat actors including state-aligned groups, hacktivist collectives, ransomware operators, and initial access brokers. Relevant channels publish target lists before attacks, claim credit immediately after, and post screenshots or data samples that contain extractable IOCs. The volume is high and the signal density is uneven: a single active channel may produce dozens of high-value IOCs per week alongside large volumes of propaganda content with no extractable intelligence. Systematic collection requires channel selection, message filtering, and language-aware processing for channels operating in Russian, Ukrainian, Arabic, Chinese, and other languages.
Paste sites. Pastebin and its functional equivalents (Ghostbin, PrivateBin instances, and purpose-built leak sites) receive high volumes of data dumps. Content ranges from stolen credential lists containing domain names, email addresses, and hashed passwords to more operationally significant dumps including network diagrams, configuration files with embedded IPs, and tool output logs containing reconnaissance data. Public paste site APIs and RSS feeds enable near-real-time collection. The challenge is volume: tens of thousands of new pastes per day, the majority of which are irrelevant to any given monitoring target.
Twitter/X threat intelligence accounts. A population of security researchers and vendors use Twitter/X as a primary publication channel for newly discovered IOCs. First-publication hash values, C2 domain registrations, and malware sample analyses frequently appear as tweets before any other publication. Filtered stream access with keyword and account filters targeting known high-signal accounts enables near-real-time IOC collection from this source. The format constraints of the platform (short text, URLs, use of defanging conventions) require specific parsing handling.
Dark web forums. Access broker forums — where initial access to compromised networks is sold — and ransomware group leak sites publish content that contains extractable IOCs: victim organization domain names, infrastructure details, and stolen file samples. Collection requires Tor-proxied HTTP scraping and is operationally more complex than surface web collection, but the intelligence value for defense organizations (advance warning of network access being listed for sale, or identification of a compromise before public disclosure) justifies the complexity.
Reddit and technical security communities. Subreddits covering malware analysis, reverse engineering, and incident response host discussions of newly discovered samples. Hash values, behavioral indicators, and C2 infrastructure details appear in these discussions, often before formal reports are published. The discourse format requires NER-based extraction rather than simple regex matching, as IOC values are embedded in free-form text.
NLP extraction pipeline: regex, NER, and normalization
An IOC extraction pipeline operates in two parallel tracks: pattern-based extraction for typed indicators and model-based extraction for unstructured entity mentions.
Refanging as a preprocessing step. Before any pattern matching, the raw text must be refanged. Security practitioners defang IOCs in text to prevent accidental activation — replacing "http" with "hxxp", inserting brackets around dots (e.g., "198.51.100[.]1"), substituting "[at]" for "@" in email addresses, and similar conventions. A refanging preprocessor restores canonical form before pattern application. Missing this step causes systematic extraction failure: defanged indicators are extremely common on Twitter/X and security forums, and a pipeline that skips refanging will miss a significant fraction of available IOCs.
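A minimal sketch of such a preprocessor follows. The rule set is an illustrative assumption covering the conventions named above plus a few common variants; a production list would be longer and tuned to the sources being monitored.

```python
import re

# Illustrative, non-exhaustive defanging conventions (assumed rule set).
# Bracketed separators are restored first so that "hxxp[://]" becomes
# "hxxp://" before the scheme rule runs.
_REFANG_RULES = [
    (re.compile(r"\[://\]"), "://"),
    (re.compile(r"\[:\]"), ":"),
    (re.compile(r"\[\.\]|\(\.\)|\{\.\}|\[dot\]|\(dot\)", re.IGNORECASE), "."),
    (re.compile(r"\[at\]|\(at\)|\[@\]", re.IGNORECASE), "@"),
    (re.compile(r"\bhxxp(s?)://", re.IGNORECASE), r"http\1://"),
]


def refang(text: str) -> str:
    """Restore defanged indicators to canonical form before pattern matching."""
    for pattern, replacement in _REFANG_RULES:
        text = pattern.sub(replacement, text)
    return text


if __name__ == "__main__":
    print(refang("hxxps[://]evil[.]example[.]com/a and 198.51.100[.]1"))
```

Keeping the rules in an ordered list makes it easy to add source-specific conventions without touching the extraction patterns that run afterwards.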
Regex patterns for typed IOCs. After refanging, regex patterns extract:
- IPv4 addresses: standard dotted-quad pattern with exclusions for documentation ranges (192.0.2.0/24, 198.51.100.0/24, 203.0.113.0/24) and private ranges
- IPv6 addresses: full and compressed forms
- Domains: pattern matching registry-valid hostnames, with TLD validation against the Public Suffix List to reduce false positives from word fragments that match the hostname pattern
- URLs: full URL including scheme, optional credentials, host, path, and query string
- File hashes: MD5 (32 hex chars), SHA-1 (40 hex chars), SHA-256 (64 hex chars) — distinguished by length; a broader hex-string pattern generates too many false positives and should not be used
- CVE identifiers: CVE-YYYY-NNNN format, where the sequence number has four or more digits, with year validation
- Email addresses: standard RFC 5322 pattern with defang handling
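Several of the patterns above can be sketched as follows, assuming the text has already been refanged. The function name `extract_typed_iocs` and the tiny `KNOWN_TLDS` set are hypothetical; the set stands in for a real Public Suffix List lookup. IPv6, URL, and email patterns are omitted for brevity.

```python
import ipaddress
import re
from datetime import date
from typing import List, Optional, Tuple

IPV4_RE = re.compile(r"(?<![\d.])(?:\d{1,3}\.){3}\d{1,3}(?!\.?\d)")
# Word boundaries stop a 32-char window from matching inside a longer hex run.
HASH_RE = re.compile(r"\b(?:[a-fA-F0-9]{64}|[a-fA-F0-9]{40}|[a-fA-F0-9]{32})\b")
CVE_RE = re.compile(r"\bCVE-(\d{4})-(\d{4,})\b", re.IGNORECASE)
DOMAIN_RE = re.compile(
    r"\b(?:[a-z0-9](?:[a-z0-9-]{0,61}[a-z0-9])?\.)+([a-z]{2,63})\b", re.IGNORECASE
)

DOC_RANGES = [
    ipaddress.ip_network(n)
    for n in ("192.0.2.0/24", "198.51.100.0/24", "203.0.113.0/24")
]
HASH_TYPES = {32: "md5", 40: "sha1", 64: "sha256"}
KNOWN_TLDS = {"com", "net", "org", "io", "ru"}  # stand-in for the Public Suffix List


def extract_typed_iocs(text: str, max_year: Optional[int] = None) -> List[Tuple[str, str]]:
    """Return (type, value) pairs for IPv4, hash, CVE, and domain indicators."""
    max_year = max_year or date.today().year
    found: List[Tuple[str, str]] = []
    for m in IPV4_RE.finditer(text):
        try:
            ip = ipaddress.ip_address(m.group())
        except ValueError:
            continue  # not a valid dotted quad (for example, an octet above 255)
        if ip.is_private or ip.is_loopback or any(ip in net for net in DOC_RANGES):
            continue
        found.append(("ipv4", str(ip)))
    for m in HASH_RE.finditer(text):
        found.append((HASH_TYPES[len(m.group())], m.group().lower()))
    for m in CVE_RE.finditer(text):
        if 1999 <= int(m.group(1)) <= max_year:
            found.append(("cve", m.group().upper()))
    for m in DOMAIN_RE.finditer(text):
        if m.group(1).lower() in KNOWN_TLDS:
            found.append(("domain", m.group().lower()))
    return found
```

Validating candidate IPs with the standard `ipaddress` module, rather than encoding octet ranges in the regex, keeps the pattern readable and lets the exclusion lists be maintained as data.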
NER for unstructured entity mentions. Regex patterns do not capture threat actor names, malware family names, campaign identifiers, or contextual references to targeted organizations. A named entity recognition model trained on cybersecurity corpora extracts these entities. Pre-trained models such as those available from the CyberSecBERT or SecBERT families significantly outperform general NLP models on this vocabulary. Entity normalization — mapping aliases and variant spellings to canonical identifiers — is a separate post-processing step backed by a lookup table maintained by the threat intelligence team.
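The normalization step can be sketched independently of whichever NER model produces the mentions. The table below is a small illustrative example, not a maintained alias list, and the names `CANONICAL` and `normalize_entity` are hypothetical.

```python
import re
from typing import Dict, List, Optional

# Illustrative alias table; a real one would be maintained by the CTI team.
CANONICAL: Dict[str, List[str]] = {
    "APT28": ["Fancy Bear", "Sofacy"],
    "APT29": ["Cozy Bear"],
}


def _key(name: str) -> str:
    """Case-fold and strip separators so 'APT-28' and 'apt 28' collide."""
    return re.sub(r"[\s_\-]+", "", name).lower()


# Every canonical name and alias maps to its canonical identifier.
_LOOKUP: Dict[str, str] = {
    _key(alias): canonical
    for canonical, aliases in CANONICAL.items()
    for alias in [canonical, *aliases]
}


def normalize_entity(mention: str) -> Optional[str]:
    """Map an extracted mention to its canonical identifier, or None if unknown."""
    return _LOOKUP.get(_key(mention))
```

Returning `None` for unknown mentions, instead of guessing, gives the team a natural queue of new aliases to review and add to the table.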
Deduplication. The same IOC value extracted from multiple sources within a short time window must be deduplicated before analyst delivery. At the value level, exact deduplication is straightforward. At the document level, MinHash locality-sensitive hashing identifies near-duplicate posts — the same announcement reshared across multiple Telegram channels — and collapses them to a single canonical record with a provenance list rather than generating separate alerts per channel.
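The document-level step can be illustrated with a small pure-Python MinHash. This sketch compares signatures directly and omits the banding index that a real locality-sensitive hashing deployment would use to avoid pairwise comparison. The shingle size, signature length, and 0.7 threshold are assumptions.

```python
import hashlib
import re
from typing import Dict, List, Set, Tuple

NUM_PERM = 128  # signature length (assumed)


def shingles(text: str, k: int = 5) -> Set[str]:
    """Character k-shingles over lowercased, whitespace-collapsed text."""
    norm = re.sub(r"\s+", " ", text.lower()).strip()
    return {norm[i:i + k] for i in range(max(1, len(norm) - k + 1))}


def minhash(text: str) -> List[int]:
    """One minimum per salted hash function approximates a random permutation."""
    sh = shingles(text)
    sig = []
    for seed in range(NUM_PERM):
        salt = seed.to_bytes(8, "little")
        sig.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "big",
            )
            for s in sh
        ))
    return sig


def similarity(a: List[int], b: List[int]) -> float:
    """Fraction of matching signature slots estimates Jaccard similarity."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def collapse(posts: List[Tuple[str, str]], threshold: float = 0.7) -> List[Dict]:
    """Group (channel, text) posts into canonical records with provenance lists."""
    clusters: List[Dict] = []
    for channel, text in posts:
        sig = minhash(text)
        for cluster in clusters:
            if similarity(sig, cluster["sig"]) >= threshold:
                cluster["provenance"].append(channel)
                break
        else:
            clusters.append({"text": text, "sig": sig, "provenance": [channel]})
    return clusters
```

Each resulting record keeps the first-seen text as canonical and accumulates the channels that reshared it, so one alert carries the full provenance.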
False positive handling: context scoring and source credibility
Raw regex extraction applied to social media text produces large numbers of false positives. An IP address mentioned as a known-good DNS resolver, a domain cited as a legitimate reference, or a hash value included as a benign example all match extraction patterns but carry zero intelligence value. Filtering these requires a scoring layer applied to each candidate IOC.
Context window scoring. For each extracted candidate, a 100-character window surrounding the match is analyzed for contextual signals. Positive-signal terms — "C2", "beacon", "payload", "infected", "dropped", "malicious", "compromised", "callback" — increase the confidence score. Negative-signal terms — "sinkhole", "benign", "example", "test", "legitimate", "documented safe" — decrease it. The context window also checks for negation patterns: "not malicious" should score differently than "malicious".
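A simple version of this scorer is sketched below. The base score, the per-term weight of 0.15, the split of the 100-character window into 50 characters on each side, and the negation word list are all assumptions; the single-word terms come from the lists above, and multi-word phrases such as "documented safe" would need phrase matching that is omitted here.

```python
import re

POSITIVE = {"c2", "beacon", "payload", "infected", "dropped",
            "malicious", "compromised", "callback"}
NEGATIVE = {"sinkhole", "benign", "example", "test", "legitimate"}
NEGATIONS = {"not", "no", "never", "non"}  # assumed negation cues


def context_score(text: str, start: int, end: int,
                  window: int = 100, base: float = 0.5,
                  weight: float = 0.15) -> float:
    """Score the text around text[start:end]; the match itself is excluded."""
    half = window // 2
    context = text[max(0, start - half):start] + " " + text[end:end + half]
    tokens = re.findall(r"[a-z0-9']+", context.lower())
    score = base
    for i, token in enumerate(tokens):
        negated = i > 0 and tokens[i - 1] in NEGATIONS
        if token in POSITIVE:
            score += -weight if negated else weight
        elif token in NEGATIVE:
            score += weight if negated else -weight
    return max(0.0, min(1.0, score))


if __name__ == "__main__":
    # 203.0.113.7 is a documentation address used only as a placeholder.
    msg = "Payload beacons to 203.0.113.7 as C2 callback"
    i = msg.index("203.0.113.7")
    print(context_score(msg, i, i + len("203.0.113.7")))
```

Exact token matching means "beacons" does not hit "beacon"; a stemmer or a wider term list would address that, at the cost of more false matches.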
Source credibility weighting. A researcher with a documented history of accurate IOC publication contributes a higher base confidence than an anonymous account on a low-reputation paste site. Source credibility scores are maintained per-source and per-account, updated based on feedback loops: when a previously extracted IOC is later confirmed in a verified incident, the source credibility score increases; when an extracted IOC is confirmed benign, it decreases. Over time this creates a self-calibrating source reputation system.
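One way to realize this feedback loop is an exponential moving average, sketched below. The article does not prescribe an update rule, so the starting score of 0.5, the learning rate, and the class name `SourceReputation` are assumptions.

```python
from dataclasses import dataclass


@dataclass
class SourceReputation:
    """Per-source credibility in [0, 1], nudged by analyst feedback."""
    score: float = 0.5          # neutral prior for an unseen source (assumed)
    learning_rate: float = 0.1  # how strongly one outcome moves the score (assumed)

    def record(self, confirmed_malicious: bool) -> float:
        """Move the score toward 1.0 on a confirmed hit, toward 0.0 on a benign one."""
        target = 1.0 if confirmed_malicious else 0.0
        self.score += self.learning_rate * (target - self.score)
        return self.score


if __name__ == "__main__":
    rep = SourceReputation()
    for outcome in (True, True, False):
        print(round(rep.record(outcome), 4))
```

Because each update moves the score a fixed fraction of the remaining distance, the value stays within bounds and recent outcomes count more than old ones, which suits sources whose quality drifts over time.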
Structural heuristics. Some false positive classes are catchable with lightweight heuristics independent of context text. IPv4 addresses in documentation ranges are never actionable. Domains registered more than five years ago with no other malicious association are unlikely to be newly active C2 infrastructure. A 32-character match for the MD5 pattern that sits inside a longer run of hex characters is likely a truncated fragment of that longer string rather than a standalone hash. A heuristic filter layer applied before context scoring reduces the candidate set without the computational cost of full context analysis.
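These checks are cheap enough to express as plain predicates. In the sketch below the function names are hypothetical, and the registration date and malicious-history flag are assumed to come from enrichment data gathered elsewhere.

```python
import ipaddress
from datetime import date, timedelta
from typing import Optional

DOC_RANGES = [
    ipaddress.ip_network(n)
    for n in ("192.0.2.0/24", "198.51.100.0/24", "203.0.113.0/24")
]
HEX_CHARS = set("0123456789abcdefABCDEF")
FIVE_YEARS = timedelta(days=5 * 365)


def is_documentation_ip(value: str) -> bool:
    """True for addresses in the reserved documentation ranges."""
    try:
        ip = ipaddress.ip_address(value)
    except ValueError:
        return False
    return any(ip in net for net in DOC_RANGES)


def is_old_clean_domain(registered_on: date, has_malicious_history: bool,
                        today: Optional[date] = None) -> bool:
    """True for long-registered domains with no prior malicious association."""
    today = today or date.today()
    return not has_malicious_history and (today - registered_on) > FIVE_YEARS


def is_hex_fragment(text: str, start: int, end: int) -> bool:
    """True when text[start:end] is bordered by more hex characters."""
    before = text[start - 1] if start > 0 else ""
    after = text[end] if end < len(text) else ""
    return before in HEX_CHARS or after in HEX_CHARS
```

A candidate that trips any predicate can be dropped before the context scorer runs, which is where the computational saving comes from.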
Real-time streaming: Kafka-based pipeline architecture
At production volumes — monitoring hundreds of Telegram channels, multiple paste site feeds, and high-frequency social media streams simultaneously — a synchronous processing architecture cannot maintain low latency. A message queue architecture decouples collection from processing and enables horizontal scaling of each stage independently.
The typical architecture places Apache Kafka at the core. Collection adapters publish raw messages to a source-specific Kafka topic. A preprocessing consumer reads from these topics, performs refanging and language detection, and publishes normalized documents to a processing topic. The extraction and scoring consumer reads normalized documents, runs regex and NER extraction, applies context scoring, and publishes candidate IOCs to an extraction-results topic. An enrichment consumer reads high-confidence candidates and fires async lookups to external services (VirusTotal, Shodan, passive DNS providers). Enriched IOC records are published to a final output topic consumed by the MISP integration and analyst alerting systems.
This architecture provides several operational properties critical for a production threat intelligence pipeline. Stage failures are isolated — a VirusTotal API outage stops enrichment but does not block extraction or collection. Backpressure is handled by Kafka's consumer offset model: if extraction falls behind collection during a spike, the backlog accumulates in Kafka and processes when capacity recovers. Replay is available: any stage can reprocess historical messages by resetting consumer offsets, enabling retrospective analysis when new extraction patterns are added.
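The offset model behind these properties can be illustrated without a broker. The sketch below is an in-memory stand-in, not the Kafka client API: `TopicLog` and `run_stage` are hypothetical names, and a real deployment would use a Kafka client library with consumer groups.

```python
from collections import defaultdict
from typing import Callable, Dict, List


class TopicLog:
    """Append-only log with one read offset per consumer group."""

    def __init__(self) -> None:
        self._records: List[dict] = []
        self._offsets: Dict[str, int] = defaultdict(int)

    def publish(self, record: dict) -> None:
        self._records.append(record)

    def poll(self, group: str, max_records: int = 10) -> List[dict]:
        """Return the next batch for this group and advance its offset."""
        start = self._offsets[group]
        batch = self._records[start:start + max_records]
        self._offsets[group] = start + len(batch)
        return batch

    def lag(self, group: str) -> int:
        """Records published but not yet consumed by this group."""
        return len(self._records) - self._offsets[group]

    def seek_to_beginning(self, group: str) -> None:
        """Reset the offset so the group reprocesses the full history."""
        self._offsets[group] = 0


def run_stage(source: TopicLog, sink: TopicLog, group: str,
              transform: Callable[[dict], dict], max_records: int = 10) -> int:
    """Consume one batch from source, transform it, and publish to sink."""
    batch = source.poll(group, max_records)
    for record in batch:
        sink.publish(transform(record))
    return len(batch)
```

When a stage polls fewer records than collectors publish, `lag` grows and later drains, which is the backlog behavior described above; `seek_to_beginning` corresponds to resetting offsets for replay after new extraction patterns are added.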
End-to-end latency from a Telegram message being posted to a high-confidence IOC reaching the analyst alert queue is typically under 90 seconds in a well-tuned deployment, with the majority of that time spent on enrichment API calls. For paste sites with polling-based collection, the latency floor is the polling interval — commonly one to five minutes for high-priority paste sources.
Feed enrichment: adding operational context
A bare extracted IOC — an IP address, a domain name, a file hash — is not yet actionable intelligence. Enrichment transforms it into a contextual record that an analyst can use to make a blocking or investigation decision without additional manual lookups.
VirusTotal reputation lookup provides the collective verdict of dozens of antivirus and threat intelligence vendors on a given indicator. A domain or hash with zero detections at extraction time may still be flagged within hours as other vendors process the same indicator. The pipeline caches VirusTotal results with a short TTL (typically 24 hours for IPs and domains, longer for file hashes) and re-queries on cache expiry to surface updated verdicts.
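The caching behavior can be sketched with the lookup function injected, so the example makes no assumption about any vendor's client library. The seven-day TTL for hashes is an assumed value for "longer", and `EnrichmentCache` is a hypothetical name.

```python
import time
from typing import Any, Callable, Dict, Tuple

DAY = 24 * 3600
TTL_SECONDS = {
    "ipv4": DAY,
    "domain": DAY,
    "md5": 7 * DAY,     # assumed value for the longer hash TTL
    "sha256": 7 * DAY,  # assumed value for the longer hash TTL
}


class EnrichmentCache:
    """Caches reputation lookups per (type, value) and re-queries on expiry."""

    def __init__(self, lookup: Callable[[str, str], Any],
                 clock: Callable[[], float] = time.time) -> None:
        self._lookup = lookup  # caller-supplied function that calls the service
        self._clock = clock    # injectable for testing
        self._entries: Dict[Tuple[str, str], Tuple[float, Any]] = {}

    def get(self, ioc_type: str, value: str) -> Any:
        now = self._clock()
        key = (ioc_type, value)
        cached = self._entries.get(key)
        if cached is not None and cached[0] > now:
            return cached[1]
        result = self._lookup(ioc_type, value)
        self._entries[key] = (now + TTL_SECONDS.get(ioc_type, DAY), result)
        return result
```

Expiry-driven re-querying is what surfaces a verdict that changed after the first lookup, since a fresh indicator with zero detections is exactly the case where the cached answer goes stale fastest.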
Passive DNS provides the resolution history of a domain or IP: which domains have resolved to this IP, which IPs this domain has resolved to, and when those resolutions occurred. Passive DNS is essential for identifying infrastructure reuse across campaigns: a new C2 domain that resolves to an IP previously associated with a known threat actor is a strong attribution signal that would be invisible from the domain record alone.
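The reuse check itself is a small join over resolution records. The sketch below assumes records have already been fetched from a passive DNS provider and flattened into dictionaries with `domain` and `ip` keys; that record shape and the function name are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set


def infrastructure_overlap(new_domain: str,
                           pdns_records: Iterable[dict],
                           known_bad_domains: Set[str]) -> Dict[str, List[str]]:
    """Map each IP the new domain resolved to onto known-bad domains seen there."""
    records = list(pdns_records)
    ips_for_new = {r["ip"] for r in records if r["domain"] == new_domain}
    overlap: Dict[str, Set[str]] = defaultdict(set)
    for r in records:
        if (r["ip"] in ips_for_new
                and r["domain"] != new_domain
                and r["domain"] in known_bad_domains):
            overlap[r["ip"]].add(r["domain"])
    return {ip: sorted(domains) for ip, domains in overlap.items()}


if __name__ == "__main__":
    records = [
        {"domain": "new-c2.example.net", "ip": "203.0.113.50"},
        {"domain": "old-c2.example.org", "ip": "203.0.113.50"},
        {"domain": "unrelated.example.com", "ip": "198.51.100.9"},
    ]
    print(infrastructure_overlap("new-c2.example.net", records,
                                 {"old-c2.example.org"}))
```

A non-empty result is a lead for attribution rather than proof, since shared hosting can place unrelated domains on the same address.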
Shodan lookups for IP-type IOCs provide the open-port profile, running services, and certificate data visible on that address at collection time. An IP that is running an unbranded HTTPS service on a non-standard port, has a recently issued self-signed certificate, and shows no other hosting history is a substantially more suspicious C2 candidate than an IP running a major CDN's standard service stack.
WHOIS and registration recency. Domains registered within the past 30 days are significantly more likely to be malicious infrastructure than domains with multi-year registration histories. The WHOIS registration date is a low-cost, high-signal enrichment that should be standard for every domain-type IOC.
For an in-depth look at how Telegram specifically serves as both a collection source and a signal medium for threat actors, see our earlier article on building a Telegram threat intelligence monitoring capability. For the broader platform context in which IOC extraction sits, the cyber threat intelligence platform architecture for defense article covers the downstream workflows that consume extracted IOC feeds.
Operational note: The highest-value IOCs from open-source extraction are often not the indicators themselves but the timing signal — the fact that a specific threat actor is mentioning your organization's domain, IP range, or system names before any network activity is detected. Building keyword alerting around organization-specific identifiers (internal project names, supplier domains, technology stack component names) turns the extraction pipeline into an early-warning system that no commercial feed can replicate.
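Keyword alerting of this kind can be sketched as a compiled watchlist applied to every collected message. The identifiers shown are placeholders and the function names are hypothetical.

```python
import re
from typing import Iterable, List, Pattern


def build_watchlist(identifiers: Iterable[str]) -> Pattern[str]:
    """Compile organization-specific identifiers into one case-insensitive pattern."""
    # Longest first, so a longer identifier wins over one that is its prefix.
    escaped = sorted((re.escape(i) for i in identifiers), key=len, reverse=True)
    if not escaped:
        raise ValueError("watchlist needs at least one identifier")
    # Lookarounds reject hits embedded in a longer word or hyphenated token.
    return re.compile(r"(?<![\w-])(?:" + "|".join(escaped) + r")(?![\w-])",
                      re.IGNORECASE)


def watchlist_hits(pattern: Pattern[str], message: str) -> List[str]:
    """Return the distinct watchlist identifiers mentioned in a message."""
    return sorted({m.group().lower() for m in pattern.finditer(message)})


if __name__ == "__main__":
    # Placeholder identifiers; a real list would hold internal project names,
    # supplier domains, and technology stack component names.
    watch = build_watchlist(["project-halcyon", "supplier.example.org"])
    print(watchlist_hits(watch, "Next: Supplier.Example.org and PROJECT-HALCYON"))
```

Running the watchlist on the preprocessed text, before IOC scoring, means a mention raises an alert even when the message contains no extractable indicator at all.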
MISP integration and analyst delivery
The output of the extraction and enrichment pipeline should integrate natively with the analyst's existing threat intelligence workflow rather than creating a separate data silo. MISP (Malware Information Sharing Platform) is the standard open platform for structured IOC management in defense and government CTI environments.
Each cluster of related IOCs extracted from a single source document, such as a Telegram post or a paste site entry, is submitted as a MISP event. The event carries the source text as a free-text attribute, the extracted IOCs as typed attributes (ip-dst, domain, md5, sha256, url, vulnerability), and contextual tags: TLP classification (typically TLP:CLEAR, which replaced TLP:WHITE in TLP 2.0, or TLP:GREEN for unclassified OSINT), source credibility tag, confidence level tag, and any MITRE ATT&CK technique tags derived from the context text. The enrichment metadata (VirusTotal scores, passive DNS records, Shodan data) is attached as additional attributes or object relationships.
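The shape of such an event can be sketched as a plain dictionary ready for JSON serialization. The field names and category pairings below follow the MISP event JSON format as commonly documented, but they should be verified against the MISP version in use; the function name and the credibility and confidence tag names are hypothetical.

```python
import json
from typing import Dict, List, Tuple

# Attribute type to category pairing (verify against your MISP instance).
CATEGORY_BY_TYPE = {
    "ip-dst": "Network activity",
    "domain": "Network activity",
    "url": "Network activity",
    "md5": "Payload delivery",
    "sha256": "Payload delivery",
    "vulnerability": "External analysis",
}


def build_misp_event(info: str, source_text: str,
                     iocs: List[Tuple[str, str]], tags: List[str]) -> Dict:
    """Assemble one event per source document with typed attributes and tags."""
    attributes = [{
        "type": "text",
        "category": "Other",
        "value": source_text,
        "to_ids": False,
        "comment": "Source document text",
    }]
    for ioc_type, value in iocs:
        attributes.append({
            "type": ioc_type,
            "category": CATEGORY_BY_TYPE[ioc_type],
            "value": value,
            "to_ids": ioc_type != "vulnerability",
        })
    return {"Event": {
        "info": info,
        "distribution": 0,     # assumed default: restrict to own organisation
        "threat_level_id": 4,  # assumed default: undefined until triaged
        "analysis": 0,         # initial
        "Attribute": attributes,
        "Tag": [{"name": name} for name in tags],
    }}


if __name__ == "__main__":
    event = build_misp_event(
        "OSINT: Telegram post with C2 indicators",
        "new panel at evil.example.com",
        [("domain", "evil.example.com")],
        ["tlp:green", "osint-pipeline:confidence=high"],  # second tag is hypothetical
    )
    print(json.dumps(event, indent=2))
```

Keeping the source text as the first attribute lets an analyst see the original wording next to the extracted values without leaving the event.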
For high-confidence IOCs from high-credibility sources, the MISP integration triggers an immediate SOAR alert, pushing the indicator to the analyst's queue with a priority flag. Bulk lower-confidence IOCs accumulate in a triage queue for periodic analyst review. This two-track delivery model prevents alert fatigue while ensuring that genuinely time-sensitive indicators receive immediate attention.
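The two-track split reduces to a routing decision on two scores. The thresholds and the track names in this sketch are assumptions, and each IOC record is assumed to carry `confidence` and `source_credibility` values in the range 0 to 1.

```python
from typing import Dict, List

ALERT_TRACK = "soar_alert"
TRIAGE_TRACK = "triage_queue"


def route(ioc: dict, conf_threshold: float = 0.8,
          cred_threshold: float = 0.7) -> str:
    """Immediate alert only when both confidence and source credibility are high."""
    if (ioc["confidence"] >= conf_threshold
            and ioc["source_credibility"] >= cred_threshold):
        return ALERT_TRACK
    return TRIAGE_TRACK


def dispatch(iocs: List[dict]) -> Dict[str, List[dict]]:
    """Split a batch of enriched IOC records into the two delivery tracks."""
    tracks: Dict[str, List[dict]] = {ALERT_TRACK: [], TRIAGE_TRACK: []}
    for ioc in iocs:
        tracks[route(ioc)].append(ioc)
    return tracks
```

Requiring both conditions keeps a confident-looking indicator from an unproven source out of the immediate queue, which is the main lever against alert fatigue.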
Corvus.Sense provides automated real-time IOC extraction from Telegram, paste sites, and open-source threat feeds — with enrichment, MISP integration, and analyst-facing alert delivery built in. If you are evaluating a production OSINT IOC pipeline for a defense or government CTI program, Corvus.Sense is designed for exactly this use case.