Why use a graph database instead of a relational database for intelligence analysis?

Intelligence questions are fundamentally about relationships: who communicates with whom, which accounts share a device, how money moves between entities. In a relational database, answering 'find everyone within three hops of this person' requires repeated self-joins whose cost grows with each hop, often becoming intractable at depth four or five. A graph database stores relationships as first-class edges with direct pointers between connected records, so a multi-hop traversal follows adjacent edges at a per-hop cost that depends on each node's degree rather than on total dataset size. For deep relationship queries, this is the difference between a sub-second answer and a query that never returns.

What is entity resolution and why does it matter for the graph?

Entity resolution is the process of deciding whether two records (a phone number from SIGINT and a contact from a HUMINT report, for example) refer to the same real-world entity. It matters because the entire value of a graph depends on it: if the same person exists as five separate nodes, their network is split across five disconnected fragments and every link-analysis query returns an incomplete picture. Conversely, over-merging two distinct people into one node fabricates connections that do not exist. Entity resolution is therefore the highest-leverage and highest-risk step in building an intelligence graph.

What query languages are used for intelligence graphs?

The two dominant options are Cypher (created for Neo4j, published as the openCypher specification, and a major influence on the ISO GQL standard) and Gremlin (the traversal language of Apache TinkerPop, supported by JanusGraph and many cloud graph stores). Cypher is declarative and pattern-based, well suited to expressing 'find this shape of subgraph.' Gremlin is imperative and step-based, well suited to expressing complex, conditional traversals. SPARQL is used where the data is modeled as RDF triples. Most intelligence platforms expose Cypher or a Cypher-like pattern syntax to analysts because the pattern-matching style maps naturally onto link-analysis questions.

How do you keep an intelligence graph from becoming a hairball?

Visualizing every node and edge at once produces an unreadable hairball that hides rather than reveals. The discipline is to never render the whole graph: start from a seed entity, expand a bounded neighborhood, and apply filters by edge type, time window, and confidence before drawing. Centrality measures and community detection are run server-side to rank what is worth showing, and the visualization surfaces only the top-ranked subgraph. The analytic value comes from the queries and metrics, not from the picture — the picture is a final, filtered communication layer, not the analysis itself.

How is uncertainty and source confidence handled in a graph model?

Every edge should carry provenance and confidence as properties: the source that asserted it, the collection time, and a confidence score. This lets queries filter to high-confidence relationships only, weight path-finding by reliability, and present analysts with the evidentiary basis for any inferred connection. Treating confidence as a first-class edge property — rather than collapsing everything into binary 'connected or not' — is what keeps a graph analytically rigorous and auditable, so an analyst can always trace why two entities appear linked.

Graph databases for intelligence analysis

Intelligence analysis is, at its core, the study of relationships. The questions analysts ask are almost never about a single record in isolation – they are about connection. Who does this person communicate with? Which accounts share a device? How does money move from this organization to that one, and through which intermediaries? These are graph questions, and a graph database is the data structure that answers them natively. This article examines how graph databases underpin modern intelligence analysis: how entities and links are modeled, how duplicate entities are resolved into a coherent picture, how traversal queries answer relationship questions that defeat conventional databases, and how the resulting networks are visualized without losing analytic rigor.

Why relationships break relational databases

A relational database stores relationships implicitly, as foreign keys spread across tables. To answer "who is connected to this person, and who is connected to them in turn," the engine must join a table to itself once per hop. Each self-join multiplies the working set, and the cost compounds with depth. A two-hop query is usually fine; a three-hop query is slow; a five-hop query against a large dataset frequently never returns within any useful time. The relationships exist in the data, but the storage model makes asking about them expensive.

A graph database inverts this. Relationships are stored as first-class edges – physical pointers from one record to its neighbors – rather than being recomputed by matching key columns at query time. Walking from a node to its neighbors is a pointer traversal whose cost depends on the local degree of the node, not on the total size of the dataset. This property, sometimes called index-free adjacency, is what makes deep relationship queries tractable. A traversal five hops out from a seed entity touches only the nodes and edges reachable within those five hops, regardless of whether the graph holds a thousand nodes or a billion.
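The adjacency idea can be illustrated with a minimal Python sketch. It is a toy in-memory model, not how any particular graph engine is implemented; the node names and the within_hops helper are hypothetical and exist only for this example.

```python
from collections import deque

# Hypothetical toy graph: each node maps directly to its neighbors,
# standing in for the physical pointers a native graph store keeps.
ADJACENCY = {
    "alice": ["bob", "carol"],
    "bob": ["alice", "dave"],
    "carol": ["alice"],
    "dave": ["bob", "erin"],
    "erin": ["dave"],
    "zed": [],  # unrelated node; a traversal from alice never touches it
}

def within_hops(adjacency, seed, max_hops):
    """Return {node: hop distance} for every node within max_hops of seed."""
    distance = {seed: 0}
    queue = deque([seed])
    while queue:
        node = queue.popleft()
        if distance[node] == max_hops:
            continue  # do not expand past the hop limit
        for neighbor in adjacency[node]:  # work done here is the local degree
            if neighbor not in distance:
                distance[neighbor] = distance[node] + 1
                queue.append(neighbor)
    return distance

reachable = within_hops(ADJACENCY, "alice", 2)
```

The loop only ever reads the neighbor lists of nodes it has already reached, so adding a million unrelated nodes like "zed" would not change the work done for this query.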

For intelligence work, where the high-value questions are almost always multi-hop relationship questions, this is not a marginal optimization. It is the difference between a question that can be asked interactively and one that cannot be asked at all.

Modeling the intelligence graph: entities and links

The dominant model for intelligence graphs is the labeled property graph. It has two structural elements and one universal mechanism for attaching meaning.

Nodes (entities). Each node carries a label that names its type – person, organization, location, device, account, vehicle, event – and a set of key-value properties. A person node might hold a canonical name, date of birth, and known identifiers; a device node might hold an IMEI and a manufacturer. The label drives both query selectivity and the visual encoding analysts see later.

Edges (relationships). Each edge has a type (communicates-with, owns, located-at, member-of, transacted-with), a direction, and – critically – its own properties. An edge is not merely a wire between two nodes; it is an observation with a source, a timestamp, and a confidence. The edge "Person A communicates-with Person B" should record which collection asserted it, when the communication occurred, and how reliable the assertion is.

This insistence on edges-as-observations is what separates a rigorous intelligence graph from a naive one. A graph that records only "A is connected to B" throws away the evidentiary basis for the connection. A graph that records "A communicated with B on 2026-03-14, asserted by source X, confidence 0.7" can be filtered, weighted, time-sliced, and audited. The same discipline that governs multi-source data fusion – propagating provenance and confidence through every record – applies directly to graph edges.
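As a rough illustration of edges-as-observations, the sketch below models an edge as a small Python record and filters on its evidentiary properties. The Observation class, its field names, the 0.0 to 1.0 confidence scale, and the sample data are assumptions made for this example rather than a schema from any specific platform.

```python
from dataclasses import dataclass
from datetime import date

@dataclass(frozen=True)
class Observation:
    """One asserted relationship, kept together with its evidentiary basis."""
    subject: str
    relation: str
    target: str
    observed_on: date
    source: str
    confidence: float  # assumed scale: 0.0 (no trust) to 1.0 (certain)

# Hypothetical sample data.
EDGES = [
    Observation("A", "communicates-with", "B", date(2026, 3, 14), "source X", 0.7),
    Observation("A", "communicates-with", "C", date(2026, 3, 20), "source Y", 0.3),
    Observation("B", "member-of", "Org 1", date(2026, 1, 5), "source X", 0.9),
]

def filter_edges(edges, min_confidence=0.0, relation=None, source=None):
    """Keep observations that satisfy every criterion that was supplied."""
    return [
        e for e in edges
        if e.confidence >= min_confidence
        and (relation is None or e.relation == relation)
        and (source is None or e.source == source)
    ]

reliable = filter_edges(EDGES, min_confidence=0.5)
from_source_x = filter_edges(EDGES, source="source X")
```

Because each edge keeps its source and date, any filtered result can still be traced back to the observation that produced it.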

Temporal modeling

Relationships are rarely static. A person belongs to one unit, then transfers; two accounts transact once, then never again. A graph that collapses all of this into timeless edges cannot answer "who was connected to this entity in March." The standard remedy is to treat time as an edge property – start and end validity, or an observation timestamp – and to push time filters into the traversal so that a query reconstructs the network as it stood at a chosen moment. This temporal dimension is what links graph analysis to pattern-of-life analysis, where the rhythm of relationships over time is itself the signal.
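A minimal sketch of time as an edge property follows, assuming each edge carries a validity interval. The TimedEdge class, the field names valid_from and valid_to, and the sample records are hypothetical; a real store would apply the same filter inside the traversal rather than over a Python list.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

@dataclass(frozen=True)
class TimedEdge:
    subject: str
    relation: str
    target: str
    valid_from: date
    valid_to: Optional[date]  # None means the relationship is still valid

def active_on(edge, moment):
    """True if the edge was valid on the given date (bounds inclusive)."""
    return edge.valid_from <= moment and (
        edge.valid_to is None or moment <= edge.valid_to
    )

def network_as_of(edges, moment):
    """Reconstruct the set of relationships as they stood at one moment."""
    return [e for e in edges if active_on(e, moment)]

# Hypothetical sample data: a transfer between units and a one-off transaction.
EDGES = [
    TimedEdge("person-1", "member-of", "unit-1", date(2025, 6, 1), date(2026, 2, 28)),
    TimedEdge("person-1", "member-of", "unit-2", date(2026, 3, 1), None),
    TimedEdge("acct-1", "transacted-with", "acct-2", date(2026, 3, 10), date(2026, 3, 10)),
]

mid_march = network_as_of(EDGES, date(2026, 3, 15))
january = network_as_of(EDGES, date(2026, 1, 15))
```

The same person appears attached to a different unit depending on the date supplied, which is the behavior a timeless edge cannot express.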

Entity resolution: the foundation everything rests on

The single most consequential step in building an intelligence graph is entity resolution – deciding whether two records refer to the same real-world entity. A phone number intercepted by SIGINT, a name written in a HUMINT report, and an account flagged in financial data may all describe one person, or three. Get this right and the graph reveals genuine networks. Get it wrong and every downstream query is corrupted.

The two failure modes mirror each other, and both are severe. Under-merging leaves one real person scattered across several disconnected nodes; their network fragments, and a link-analysis query returns a partial, misleading picture. Over-merging fuses two distinct people into a single node, fabricating connections that do not exist and potentially implicating the wrong person. Because over-merging manufactures false intelligence, entity resolution must err toward caution and remain reversible.

In practice, resolution combines two techniques. Deterministic matching uses strong identifiers: a shared passport number, IMEI, or government ID is treated as a confident merge. Probabilistic matching scores weaker evidence – similar names, shared locations, overlapping contacts – and merges only above a conservative threshold. Every merge decision should be recorded with the evidence that justified it, so an analyst can later see why two records became one node and can split them if new information contradicts the merge.
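The two-stage approach might be sketched as follows. Everything specific here is an assumption made for illustration: the record fields, the 0.6 and 0.4 weights, the 0.85 threshold, and the use of the standard library's difflib for name similarity. Production resolvers typically use richer features and tuned models.

```python
from difflib import SequenceMatcher

STRONG_IDENTIFIERS = ("passport", "imei", "government_id")
MERGE_THRESHOLD = 0.85  # assumed conservative cut-off, not a recommended value
merge_log = []  # every decision is recorded so it can be reviewed or reversed

def name_similarity(a, b):
    """Similarity in [0, 1] from the standard library's SequenceMatcher."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def contact_overlap(a, b):
    """Jaccard overlap of two contact collections; 0.0 when both are empty."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0

def resolve(rec_a, rec_b):
    """Decide whether two records describe one entity, logging the evidence."""
    decision = None
    # Deterministic stage: a shared strong identifier is a confident merge.
    for key in STRONG_IDENTIFIERS:
        if rec_a.get(key) and rec_a.get(key) == rec_b.get(key):
            decision = {"merge": True, "method": "deterministic",
                        "evidence": f"shared {key}"}
            break
    # Probabilistic stage: score weaker evidence against the threshold.
    if decision is None:
        score = (0.6 * name_similarity(rec_a["name"], rec_b["name"])
                 + 0.4 * contact_overlap(rec_a.get("contacts", ()),
                                         rec_b.get("contacts", ())))
        decision = {"merge": score >= MERGE_THRESHOLD, "method": "probabilistic",
                    "evidence": f"weighted score {score:.2f}"}
    merge_log.append((rec_a["id"], rec_b["id"], decision))
    return decision

# Hypothetical records; the identifier values are made up.
sigint = {"id": "S-1", "name": "Jon Smith", "imei": "000000000000001"}
humint = {"id": "H-7", "name": "Jonathan Smith", "imei": "000000000000001"}
financial = {"id": "F-3", "name": "Joan Smyth", "contacts": ["x"]}

strong = resolve(sigint, humint)
weak = resolve(sigint, financial)
```

With no shared contacts, the weaker pair cannot score above 0.6 under these weights, so it stays unmerged. The log keeps both decisions with their evidence, which is what makes a later split possible.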

Key insight: The quality of an intelligence graph is set almost entirely at the entity-resolution layer, not at the query layer. A perfectly tuned traversal engine running over a poorly resolved graph produces confident, fast, wrong answers. Invest in resolution accuracy and merge auditability before optimizing query performance – a fragmented or over-merged graph cannot be rescued by a better query.

Traversal queries: asking the relationship questions

Once entities are resolved and edges loaded, the analytic value is unlocked through traversal queries. Two query languages dominate. Cypher – the declarative, pattern-matching language of Neo4j, published as the openCypher specification and a major influence on the ISO GQL standard – lets an analyst describe the shape of a subgraph to find: a pattern like a person connected through two intermediaries to an organization. Gremlin, the imperative step-based language of Apache TinkerPop, expresses traversals as an explicit sequence of steps and excels at complex, conditional walks. Where the underlying store is RDF, SPARQL queries triples. Most analyst-facing platforms expose a Cypher-like pattern syntax because describing "this shape of network" maps cleanly onto how analysts think.

The recurring query patterns in intelligence work are a small, powerful set:

Shortest path. What is the closest chain of relationships linking two entities? A short path between a known hostile actor and an otherwise innocuous account is a strong lead; the path length itself is an analytic signal.

Neighborhood expansion. Starting from a seed entity, return everything within k hops, filtered by edge type, time window, and confidence. This is the workhorse of link analysis – bounded, filtered expansion rather than unconstrained exploration.

Common neighbors and shared attributes. Which entities sit between two seeds? Which devices, locations, or accounts do several persons share? Co-occurrence on a shared resource is one of the most reliable signals that two entities are operationally connected.

Centrality and community detection. Graph algorithms – betweenness and degree centrality, PageRank, Louvain community detection – rank which nodes are structurally important and which clusters form natural groups. These are run server-side over the stored graph, not in the analyst's head, and they are what turn a tangle of edges into a ranked, prioritized set of leads.

All of these patterns begin by locating a seed. Efficient retrieval of the seed and its neighborhood depends on the same indexing discipline that governs large-scale geospatial queries: without an index on the identifying property to locate the entry node quickly, even a graph store falls back to a full scan.
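Three of the patterns above (shortest path, common neighbors, and degree centrality) can be sketched in plain Python over a toy undirected graph. This is an illustration of the logic only: the graph, node names, and helper functions are hypothetical, and a real platform would run the equivalent inside the graph engine or its algorithm library.

```python
from collections import deque

# Hypothetical undirected graph stored as neighbor sets.
GRAPH = {
    "hostile": {"broker"},
    "broker": {"hostile", "account", "phone-1"},
    "account": {"broker", "phone-1"},
    "phone-1": {"broker", "account"},
    "bystander": set(),
}

def shortest_path(graph, start, goal):
    """Breadth-first search; returns the node list of one shortest path or None."""
    parents = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            path = []
            while node is not None:  # walk parent links back to the start
                path.append(node)
                node = parents[node]
            return path[::-1]
        for neighbor in sorted(graph[node]):  # sorted for repeatable output
            if neighbor not in parents:
                parents[neighbor] = node
                queue.append(neighbor)
    return None

def common_neighbors(graph, a, b):
    """Entities directly connected to both a and b."""
    return graph[a] & graph[b]

def degree_centrality(graph):
    """Raw degree per node; the simplest of the centrality measures."""
    return {node: len(neighbors) for node, neighbors in graph.items()}

path = shortest_path(GRAPH, "hostile", "account")
shared = common_neighbors(GRAPH, "broker", "account")
ranking = sorted(degree_centrality(GRAPH).items(), key=lambda kv: (-kv[1], kv[0]))
```

Here the two-edge path from the hostile actor to the account runs through the broker, and the shared phone is the common neighbor that ties the broker and the account together.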

The supernode problem

Real intelligence graphs contain supernodes – entities with enormous degree, such as a shared public phone line, a popular messaging channel, or a common service address. A naive traversal that expands through a supernode explodes combinatorially and distorts centrality, because everything appears connected to everything through that one hub. Production systems handle supernodes deliberately: capping expansion degree, treating high-degree hubs as weak evidence, or excluding known shared-resource nodes from path-finding. Failing to do so produces traversals that either never finish or return a network where the supernode swamps every real signal.
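One of the mitigations, capping expansion degree, could look like the sketch below. The max_degree cut-off, the expand_capped helper, and the sample graph are assumptions for illustration. The choice made here is that a high-degree hub is still reported as reached but is never expanded through.

```python
from collections import deque

def expand_capped(adjacency, seed, max_hops, max_degree):
    """Bounded expansion that reaches supernodes but does not expand through them."""
    distance = {seed: 0}
    skipped = set()  # hubs that were reached but deliberately not expanded
    queue = deque([seed])
    while queue:
        node = queue.popleft()
        if distance[node] == max_hops:
            continue
        neighbors = adjacency.get(node, [])
        if node != seed and len(neighbors) > max_degree:
            skipped.add(node)
            continue
        for neighbor in neighbors:
            if neighbor not in distance:
                distance[neighbor] = distance[node] + 1
                queue.append(neighbor)
    return distance, skipped

# Hypothetical data: a shared public phone used by fifty unrelated callers.
callers = [f"caller-{i}" for i in range(50)]
ADJACENCY = {
    "suspect": ["public-phone", "associate"],
    "associate": ["suspect", "courier"],
    "courier": ["associate"],
    "public-phone": ["suspect"] + callers,
}
ADJACENCY.update({caller: ["public-phone"] for caller in callers})

capped, skipped_hubs = expand_capped(ADJACENCY, "suspect", max_hops=2, max_degree=10)
uncapped, _ = expand_capped(ADJACENCY, "suspect", max_hops=2, max_degree=1000)
```

With the cap in place the two-hop neighborhood holds four entities. Without it, the same query pulls in all fifty callers whose only link to the suspect is the shared line.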

Visualization without losing rigor

The instinct to draw the whole graph is the most common way analysts mislead themselves. Render every node and edge at once and the result is a hairball – a dense, unreadable mass that hides structure rather than exposing it. The discipline is to never visualize the whole graph. Start from a seed, expand a bounded neighborhood, filter by edge type and confidence and time, and draw only what survives the filter.

The analytic work happens in the queries and the graph algorithms; the visualization is the final, filtered communication layer that conveys a conclusion the analyst has already reached through querying. Centrality scores decide which nodes are drawn large; community detection decides how they are colored and clustered; confidence properties decide which edges are drawn solid versus dashed. A good intelligence-graph interface treats the picture as the output of an analytic pipeline, not as the analysis itself. This is the same separation of concerns that distinguishes a rigorous fusion picture from a raw sensor dump – the system presents conclusions with their evidence attached, ready to be challenged.
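The mapping from analytic output to visual encoding might be sketched as a small function that takes scores already computed by the query layer and returns drawing instructions. The build_view name, the size formula, the 0.6 solid-line threshold, and the sample scores are all hypothetical choices for this example.

```python
def build_view(centrality, edges, top_k=3, solid_threshold=0.6):
    """Turn precomputed scores into drawing instructions for a bounded subgraph.

    centrality: {node: score}; edges: iterable of (source, target, confidence).
    """
    # Keep only the top-ranked nodes; ties are broken by name for repeatability.
    keep = sorted(centrality, key=lambda n: (-centrality[n], n))[:top_k]
    if not keep:
        return {}, []
    top_score = max(centrality[n] for n in keep) or 1  # avoid dividing by zero
    nodes = {n: {"size": 10 + 30 * centrality[n] / top_score} for n in keep}
    # Draw an edge only when both endpoints survived the ranking filter.
    drawn = [
        {"source": s, "target": t,
         "style": "solid" if conf >= solid_threshold else "dashed"}
        for s, t, conf in edges
        if s in nodes and t in nodes
    ]
    return nodes, drawn

# Hypothetical scores and edges produced earlier in the pipeline.
scores = {"broker": 3, "account": 2, "phone-1": 2, "hostile": 1, "bystander": 0}
links = [("broker", "account", 0.9), ("broker", "phone-1", 0.4), ("broker", "hostile", 0.8)]

view_nodes, view_edges = build_view(scores, links)
```

The function contains no analysis of its own. It only encodes rankings and confidence values that were decided upstream, which is the separation the text describes.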

Connecting graph-derived networks to communications data also intersects with adjacent tradecraft such as threat-actor profiling on messaging platforms, where the same entity-resolution and link-analysis methods apply to online identities, channels, and the relationships among them.

Operational considerations: scale, security, and accreditation

Choosing a graph store for intelligence work is rarely a pure performance decision. Several operational constraints shape the architecture as much as raw traversal speed does.

Scale and storage model. Native graph engines such as Neo4j store edges as physical adjacency and excel at deep traversal on a single large graph. Distributed stores such as JanusGraph layer a graph model over a partitioned backend and scale horizontally, at the cost of traversals that must hop between machines whenever a path crosses a partition boundary. The right choice depends on the dominant query: a workload dominated by deep, interactive link-analysis favors a native engine, while a workload of shallow lookups over an enormous, sharded dataset favors a distributed store. Misjudging this is one of the most expensive architectural mistakes, because migrating between graph storage models late in a program is costly.

Classification and need-to-know. An intelligence graph routinely combines edges of differing classification. A relationship asserted by a sensitive source may be more highly classified than the entities it connects. The system must propagate classification to the edge level and enforce need-to-know at query time, so that two analysts running the same traversal see different subgraphs according to their clearance. Enforcing access at ingestion – by simply excluding any data that the least-cleared user cannot see – destroys the analytic value for cleared users and is the wrong layer for the control.
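Query-time enforcement can be sketched as a filter applied to edges before any traversal sees them. The level names and ordering, the compartment labels, and the record layout below are placeholders invented for this example and do not reflect any real marking scheme.

```python
# Hypothetical ordering of classification levels, lowest to highest.
LEVELS = {"UNCLASSIFIED": 0, "SECRET": 1, "TOP SECRET": 2}

# Hypothetical edges, each carrying its own classification and compartments.
EDGES = [
    {"pair": ("A", "B"), "classification": "UNCLASSIFIED", "compartments": set()},
    {"pair": ("B", "C"), "classification": "SECRET", "compartments": {"ALPHA"}},
    {"pair": ("C", "D"), "classification": "TOP SECRET", "compartments": set()},
    {"pair": ("A", "D"), "classification": "SECRET", "compartments": {"BRAVO"}},
]

def visible_edges(edges, analyst):
    """Return only the edges this analyst is cleared and read in to see."""
    ceiling = LEVELS[analyst["clearance"]]
    return [
        e for e in edges
        if LEVELS[e["classification"]] <= ceiling
        # Need-to-know: the analyst must hold every compartment on the edge.
        and e["compartments"] <= analyst["compartments"]
    ]

analyst_one = {"clearance": "SECRET", "compartments": {"ALPHA"}}
analyst_two = {"clearance": "TOP SECRET", "compartments": {"ALPHA", "BRAVO"}}

view_one = [e["pair"] for e in visible_edges(EDGES, analyst_one)]
view_two = [e["pair"] for e in visible_edges(EDGES, analyst_two)]
```

Both analysts query the same stored graph, and the difference in what they see comes entirely from the filter, so nothing has to be discarded at ingestion.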

Auditability and accountability. Because graph-derived conclusions can drive consequential decisions, every inferred connection must be traceable to the observations that produced it. This means edges retain their source and collection metadata, merge decisions are logged, and any path the system surfaces can be expanded into the underlying evidence. A graph that asserts "A is connected to B" without being able to show why is not usable as intelligence – it is an unaccountable claim. The accreditation regimes governing defense intelligence systems formalize this requirement, and a graph platform that cannot satisfy it will not be fielded regardless of its query performance.

Keeping the graph current. Relationships decay and change; an intelligence graph that is loaded once and never updated quickly diverges from reality. Production systems treat the graph as a living store fed by continuous ingestion, with new observations resolved against existing entities in near-real time and stale edges aged out or down-weighted by confidence. The same provenance discipline that governs the initial load governs every incremental update, so the graph remains auditable as it evolves rather than accumulating unattributed edges over time.
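Down-weighting stale edges could be implemented in many ways. The sketch below assumes an exponential decay with a fixed half-life, which is one common choice rather than anything prescribed here. The 180-day half-life, the 0.1 drop floor, and the record layout are illustrative assumptions.

```python
from datetime import date, timedelta

HALF_LIFE_DAYS = 180  # assumed: confidence halves every 180 days without refresh
DROP_BELOW = 0.1      # assumed floor under which an edge is aged out

def decayed_confidence(confidence, observed_on, today, half_life_days=HALF_LIFE_DAYS):
    """Exponentially decay a confidence score by the age of the observation."""
    age_days = max((today - observed_on).days, 0)
    return confidence * 0.5 ** (age_days / half_life_days)

def refresh(edges, today):
    """Return edges with a current confidence, dropping those that aged out.

    The original confidence, date, and source stay on each edge so the
    provenance of the observation is preserved.
    """
    current = []
    for edge in edges:
        score = decayed_confidence(edge["confidence"], edge["observed_on"], today)
        if score >= DROP_BELOW:
            current.append({**edge, "current_confidence": score})
    return current

# Hypothetical sample data.
today = date(2026, 6, 30)
edges = [
    {"pair": ("A", "B"), "confidence": 0.8, "source": "source X",
     "observed_on": today - timedelta(days=180)},
    {"pair": ("A", "C"), "confidence": 0.3, "source": "source Y",
     "observed_on": today - timedelta(days=720)},
]

live = refresh(edges, today)
```

The first edge is one half-life old and falls from 0.8 to 0.4. The second is four half-lives old, falls below the floor, and is aged out of the working view.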

Build relationship intelligence into your analytic picture

Corvus HEAD fuses multi-source intelligence into a queryable entity-and-relationship graph – entity resolution, link analysis, and bounded visualization built for analysts who need to trace connections, not browse hairballs.

This analysis was prepared by Corvus Intelligence engineers who build mission-critical intelligence and data-integration systems for defense and government organizations.
