ODKE+ pipeline

Methodology

Six phases turn raw STIX / EUR-Lex / HIBP / LOLBAS / GTFOBins feeds into a typed, scored, citation-backed cybersecurity graph. Authored by Adam Lundqvist, Founder at SQUR.

Phase 1

Schema + graph primitives

EdgePredicate, EntityType, ConfidenceTier enums + PREDICATE_TYPES invariants (from-type / to-type pairing rules). Deterministic edge_id = sha256(from | predicate | to)[:16]. Corroborator scoring curve: 1 source → 0.65, 2 → 0.80, 3 → 0.95, 4+ → 1.0. Authoritative-source override (MITRE STIX, CWE XML, CISA KEV, NVD, EPSS, EUR-Lex) bypasses the curve at 1.0. Ports ODKE+ (arXiv 2509.04696) section 1.1.
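The ID and scoring rules above can be sketched as follows. This is a minimal illustration, not the pipeline's actual code: the function names are hypothetical, the triple is assumed to be joined with a bare "|", sources are assumed to be counted after de-duplication, and an empty source list is assumed to score 0 (the text does not say).

```typescript
import { createHash } from "node:crypto";

// Sources that bypass the curve, as listed in the text above.
const AUTHORITATIVE_SOURCES = new Set([
  "MITRE STIX", "CWE XML", "CISA KEV", "NVD", "EPSS", "EUR-Lex",
]);

// Deterministic edge ID: first 16 hex characters of SHA-256 over the triple.
// Assumption: the three parts are joined with "|" and no padding.
function edgeId(from: string, predicate: string, to: string): string {
  return createHash("sha256")
    .update(`${from}|${predicate}|${to}`)
    .digest("hex")
    .slice(0, 16);
}

// Corroborator curve: 1 source -> 0.65, 2 -> 0.80, 3 -> 0.95, 4+ -> 1.0.
// Any authoritative source short-circuits to 1.0.
// Assumption: duplicates count once, and zero sources scores 0.
function corroboratorScore(sources: string[]): number {
  const distinct = sources.filter((s, i) => sources.indexOf(s) === i);
  if (distinct.some((s) => AUTHORITATIVE_SOURCES.has(s))) return 1.0;
  if (distinct.length >= 4) return 1.0;
  if (distinct.length === 3) return 0.95;
  if (distinct.length === 2) return 0.8;
  if (distinct.length === 1) return 0.65;
  return 0;
}
```

Because the ID depends only on the triple, re-ingesting the same relationship from a second feed yields the same edge_id, so the new feed can be added as a corroborating source rather than creating a duplicate edge.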

Phase 2

Compliance mega-mapping

127 compliance controls across 14 frameworks (DORA, NIS2, GDPR, ISO 27001, NIST CSF, OWASP Top 10 / API / LLM, CIS v8, PCI DSS v4, AI Act, CRA, ISO 27701, TIBER-EU). Each control is fed through Vertex Gemini 2.5 Flash to extract applicable_techniques, defending_mitigations, and underlying_weaknesses arrays. Output: compliance_mappings/ collection with citation_count + cost_eur per control.
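As a rough sketch of the extraction step's output handling: the snippet below assumes the model is asked to return a JSON object carrying the three arrays named above, and validates that shape before anything is stored. The parseExtraction helper and the choice to drop non-string entries and duplicates are illustrative assumptions; the model call itself is omitted.

```typescript
// The three arrays named in the text above.
const ARRAY_FIELDS = [
  "applicable_techniques",
  "defending_mitigations",
  "underlying_weaknesses",
] as const;

type ArrayField = (typeof ARRAY_FIELDS)[number];
type Extraction = Record<ArrayField, string[]>;

// Hypothetical helper: parse and validate the raw JSON text of one control's
// extraction. Assumption: the model replies with a single JSON object.
function parseExtraction(raw: string): Extraction {
  const parsed: unknown = JSON.parse(raw);
  if (typeof parsed !== "object" || parsed === null) {
    throw new Error("expected a JSON object");
  }
  const obj = parsed as Record<string, unknown>;
  const out = {} as Extraction;
  for (const field of ARRAY_FIELDS) {
    const value = obj[field];
    if (!Array.isArray(value)) throw new Error(`missing array: ${field}`);
    // Keep only string entries and drop duplicates, preserving order.
    const strings = value.filter((v): v is string => typeof v === "string");
    out[field] = strings.filter((s, i) => strings.indexOf(s) === i);
  }
  return out;
}
```

Rejecting malformed replies at this point keeps compliance_mappings/ uniform, so the later promotion phase can iterate the arrays without defensive checks.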

Phase 3

Edge promotion + body-text mining

Lift compliance_mappings arrays into first-class edges via deterministic ID + corroborator scoring. Body-text mining extracts MITRE URL references from Software / ThreatActor / Group bodies via matchAll() over /attack\.mitre\.org\/(techniques|groups)\/([TG]\d{4})/gi. Both passes land in edge_candidates/ with source='body-text-mining'.
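The body-text mining pass might look like the sketch below. The regular expression is the one quoted above; the extractMitreRefs name, the upper-casing of IDs, the de-duplication, and the check that the path segment agrees with the ID prefix are assumptions added for illustration.

```typescript
// Same pattern as in the text; the "g" flag is required by matchAll().
const MITRE_URL = /attack\.mitre\.org\/(techniques|groups)\/([TG]\d{4})/gi;

interface MitreRef {
  kind: "techniques" | "groups";
  id: string; // upper-cased, e.g. "T1059" or "G0016"
}

// Hypothetical helper: collect distinct MITRE references from a node body.
function extractMitreRefs(body: string): MitreRef[] {
  const seen = new Set<string>();
  const refs: MitreRef[] = [];
  for (const m of body.matchAll(MITRE_URL)) {
    const kind = m[1].toLowerCase() as MitreRef["kind"];
    const id = m[2].toUpperCase();
    // Assumption: skip mismatches such as /groups/T1059, which the pattern
    // alone would accept because the two capture groups are independent.
    const expectedPrefix = kind === "techniques" ? "T" : "G";
    if (!id.startsWith(expectedPrefix) || seen.has(id)) continue;
    seen.add(id);
    refs.push({ kind, id });
  }
  return refs;
}
```

Note that the pattern captures only the four-digit base ID, so a sub-technique URL such as /techniques/T1059/001 resolves to T1059.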

Phase 4

Same-as deduplication

Detect canonical-duplicate nodes across source feeds (MISP-Galaxy ThreatActors ↔ MITRE Groups). Heuristics: MITRE G#### token in alias array → confidence 0.98; normalised name overlap → 0.82. 137 same_as aliases written, 87% MITRE Group coverage.
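The two heuristics can be sketched as a single scoring function. The record shapes, the normalise helper (lower-case, strip non-alphanumerics), and the reading of "overlap" as an exact match between any normalised name or alias on both sides are assumptions; only the 0.98 and 0.82 confidences come from the text.

```typescript
interface ThreatActor {
  name: string;
  aliases: string[];
}

interface MitreGroup {
  id: string; // e.g. "G0016"
  name: string;
  aliases: string[];
}

// Assumed normalisation: lower-case and drop everything but letters and digits.
function normalise(name: string): string {
  return name.toLowerCase().replace(/[^a-z0-9]/g, "");
}

// Returns a same_as confidence, or null when neither heuristic fires.
function sameAsConfidence(actor: ThreatActor, group: MitreGroup): number | null {
  // Heuristic 1: the group's G#### token appears in the actor's alias array.
  const groupId = group.id.toUpperCase();
  const hasIdToken = actor.aliases.some((alias) => {
    const tokens = alias.toUpperCase().match(/\bG\d{4}\b/g) ?? [];
    return tokens.includes(groupId);
  });
  if (hasIdToken) return 0.98;

  // Heuristic 2: any normalised name or alias is shared by both nodes.
  const actorNames = new Set([actor.name, ...actor.aliases].map(normalise));
  const overlap = [group.name, ...group.aliases]
    .map(normalise)
    .some((n) => n !== "" && actorNames.has(n));
  return overlap ? 0.82 : null;
}
```

Checking the ID token first means the stronger signal wins when both heuristics would match.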

Phase 5

Vertex-grounded verification

For each candidate edge, ask Vertex Gemini 2.5 Flash, with the Google Search tool enabled ({ google_search: {} }), whether the claim holds. Extract groundingMetadata.groundingChunks[] as distinct evidence URIs (capped at 5 per edge), then re-score via the corroborator. Promote edges at ≥0.85 confidence to edges/; keep those from 0.6 up to 0.85 in edge_candidates/; discard those below 0.6, with an audit entry. Typical promotion rate: 67%. Spend: ~€0.0007 per edge.
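A minimal sketch of the evidence extraction and threshold routing, with the model call left out. It assumes each grounding chunk exposes its URL at web.uri; the interface definitions, function names, and route labels are illustrative, not the pipeline's own.

```typescript
// Assumed response shape: each grounding chunk may carry a web.uri.
interface GroundingChunk {
  web?: { uri?: string; title?: string };
}

interface GroundingMetadata {
  groundingChunks?: GroundingChunk[];
}

const MAX_EVIDENCE = 5; // cap per edge, from the text above

// Collect distinct evidence URIs in response order, stopping at the cap.
function evidenceUris(meta: GroundingMetadata | undefined): string[] {
  const uris = new Set<string>();
  for (const chunk of meta?.groundingChunks ?? []) {
    const uri = chunk.web?.uri;
    if (uri) uris.add(uri);
    if (uris.size === MAX_EVIDENCE) break;
  }
  return [...uris];
}

type Route = "promote" | "keep" | "discard";

// Thresholds from the text: >= 0.85 goes to edges/, below 0.6 is discarded
// (with an audit entry), everything in between stays in edge_candidates/.
function route(confidence: number): Route {
  if (confidence >= 0.85) return "promote";
  if (confidence >= 0.6) return "keep";
  return "discard";
}
```

Read against the Phase 1 curve, an edge needs at least three distinct non-authoritative sources (0.95) to clear the 0.85 promotion bar, while one or two sources (0.65 or 0.80) leave it in edge_candidates/.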

Phase 6

Deterministic enrichment agents

Per-node-type body enrichers (LOLBin, ThreatActor, CAPEC, Software) and the completeness sweep. Each agent reads structured fields (LOLBin function categories, ThreatActor country + sectors, CAPEC abstraction + likelihood + related CWEs, Software platforms + attribution) and templates a richer body. No LLM cost. The healing_queue/ collection tracks nodes still below the 60-point threshold.
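To illustrate what a deterministic enricher could look like, here is a sketch for the CAPEC case. The field names, sentence templates, and helper names are hypothetical; only the fields read (abstraction, likelihood, related CWEs) and the 60-point threshold come from the text, and how the completeness score itself is computed is not covered here.

```typescript
// Hypothetical node shape carrying the structured fields named above.
interface CapecNode {
  id: string;
  name: string;
  abstraction?: string;
  likelihood?: string;
  relatedCwes?: string[];
}

// Template a body from whichever structured fields are present.
// No model call is involved, so the same node always yields the same text.
function capecBody(node: CapecNode): string {
  const parts = [`${node.id} (${node.name}).`];
  if (node.abstraction) parts.push(`Abstraction level: ${node.abstraction}.`);
  if (node.likelihood) parts.push(`Likelihood of attack: ${node.likelihood}.`);
  if (node.relatedCwes && node.relatedCwes.length > 0) {
    parts.push(`Related weaknesses: ${node.relatedCwes.join(", ")}.`);
  }
  return parts.join(" ");
}

const HEALING_THRESHOLD = 60; // completeness points, from the text above

// Nodes scoring below the threshold stay in the healing queue.
function needsHealing(completenessScore: number): boolean {
  return completenessScore < HEALING_THRESHOLD;
}
```

Skipping absent fields rather than emitting placeholders keeps the generated body free of empty claims, at the price of shorter text for sparsely populated nodes.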

For the live tier breakdown + source refresh state see /audit. Reference paper: ODKE+ (arXiv 2509.04696).