ddigraph

Transform DDI metadata into knowledge graphs

A Python toolkit for converting DDI (Data Documentation Initiative) XML metadata into queryable knowledge graphs. It reads Codebook, Lifecycle, and CDI formats as a stream and loads the results into Neo4j, RDF triplestores, Gremlin-compatible databases, NetworkX, or pandas, all from the same parser output.



Why ddigraph?

  • Parse once, write anywhere


    The streaming parser tier emits typed records you can persist to Neo4j out of the box, or to RDF/SPARQL triplestores, Gremlin databases, NetworkX, or pandas through small backend-specific adapters. The parsing code stays the same; only the writer changes. See the demo/load_*.py scripts.

  • Streaming XML Processing


    The parser is built on iterparse and keeps memory bounded, so large files do not need to fit in RAM. Nodes and relationships are yielded incrementally, and memory use stays roughly constant as input size grows.

  • Async I/O with Back-Pressure


    Concurrent parsing and batched writes are coordinated through async queues with configurable back-pressure, keeping throughput high while preventing memory spikes.

  • Format Auto-Detection


    Point ddigraph at any DDI file. It determines whether the input is DDI Codebook, DDI Lifecycle, or DDI-CDI and selects the matching parser, with no manual configuration.

  • Unified Schema Definitions


    Node labels, property keys, relationship types, and constraints are defined in a single schema module. Every backend receives the same canonical structure, ensuring consistency across graph targets.

  • Production-Ready


    Built-in retry logic with exponential back-off, structured logging, health checks, and OpenTelemetry-compatible observability hooks make ddigraph suitable for automated pipelines and CI/CD workflows.
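The streaming and back-pressure ideas in the list above can be illustrated with a minimal sketch. It uses only the standard library and is not ddigraph's implementation: the function names, the record shape, the queue size, and the namespace-free sample document are all assumptions made for illustration. A producer parses with iterparse and puts records on a bounded asyncio.Queue, and a consumer groups them into batches.

```python
import asyncio
import io
import xml.etree.ElementTree as ET

# Simplified, namespace-free sample; real DDI files declare XML namespaces.
SAMPLE = b"""<codeBook><dataDscr>
<var name="age"/><var name="sex"/><var name="income"/>
<var name="region"/><var name="weight"/>
</dataDscr></codeBook>"""


def iter_variables(source):
    """Yield one small dict per <var> element as soon as it is fully parsed."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "var":
            yield {"label": "Variable", "name": elem.get("name")}
            elem.clear()  # release the element's attributes and children


async def produce(source, queue):
    for record in iter_variables(source):
        # put() waits while the queue is full, which is the back-pressure.
        await queue.put(record)
    await queue.put(None)  # sentinel: no more records


async def consume(queue, batch_size, written):
    batch = []
    while True:
        record = await queue.get()
        if record is None:
            break
        batch.append(record)
        if len(batch) >= batch_size:
            written.append(batch)  # stand-in for one batched database write
            batch = []
    if batch:
        written.append(batch)  # flush the final partial batch


async def run(source, queue_size=2, batch_size=2):
    queue = asyncio.Queue(maxsize=queue_size)
    written = []
    await asyncio.gather(
        produce(source, queue),
        consume(queue, batch_size, written),
    )
    return written


batches = asyncio.run(run(io.BytesIO(SAMPLE)))
print([len(b) for b in batches])  # [2, 2, 1]
```

Because the queue holds at most queue_size records, a slow writer pauses the parser instead of letting parsed records pile up in memory. In this sketch the parsing itself still runs synchronously inside the event loop; a real pipeline would likely move it to a thread or process.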


Installation

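# Install the latest release from PyPI
pip install ddigraph

# Or, from a source checkout, install in editable mode with the dev and docs extras
pip install -e ".[dev,docs]"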

Backend extras

The base install includes the Neo4j driver. For other backends, install the corresponding extras:

pip install "ddigraph[rdf]"       # RDFLib + SPARQLWrapper
pip install "ddigraph[gremlin]"   # Gremlin-Python

Quick Example

# 1. Configure your Neo4j connection
export DDIGRAPH_NEO4J_URI=bolt://localhost:7687
export DDIGRAPH_NEO4J_USER=neo4j
export DDIGRAPH_NEO4J_PASSWORD=secret

# 2. Create indexes and constraints
ddigraph bootstrap

# 3. Load any DDI file (format is auto-detected)
ddigraph load path/to/survey.xml --dataset-id my-survey

# 4. Verify what was loaded (use Cypher in the Neo4j Browser)
#    MATCH (n) RETURN labels(n) AS labels, count(*) AS total ORDER BY total DESC

Python API

The same workflow is available programmatically. See the Quick Start guide for a full Python example using DDILoader and DDIFragmentLoader.


Supported DDI Formats

Format | Version | Root Element | Description
DDI Codebook | 2.5 / 2.6 | <codeBook> | Traditional flat structure with a central Dataset node linking to variables, questions, and study-level metadata. Widely used by survey archives and data catalogs.
DDI Lifecycle | 3.2 / 3.3 | <FragmentInstance> | Modular format built around reusable fragments. Supports questionnaire flows, CAPI/CAWI instruments, and complex study designs with fine-grained versioning.
DDI-CDI | 1.0 | <Wrapper> | Cross-Domain Integration model. Captures conceptual variables, represented variables, instance variables, and the relationships between them for cross-study harmonisation.

Format detection

You do not need to specify which format a file uses. The detect_ddi_format() function reads the root element and namespace, then selects the matching parser.
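As a rough illustration of root-element detection, the sketch below reads only the first start event of a document and maps the root's local name to a format label. It is not the body of detect_ddi_format(): the function name sketch_detect_format, the label strings, and the return shape are assumptions. The root element names come from the table above.

```python
import io
import xml.etree.ElementTree as ET

# Root local name -> format label. The labels are illustrative assumptions.
ROOT_TO_FORMAT = {
    "codeBook": "codebook",
    "FragmentInstance": "lifecycle",
    "Wrapper": "cdi",
}


def sketch_detect_format(source):
    """Return (format_label, namespace) based on the document's root element."""
    for _event, elem in ET.iterparse(source, events=("start",)):
        tag = elem.tag
        # ElementTree writes namespaced tags as "{namespace}localname".
        if tag.startswith("{"):
            namespace, _, local = tag[1:].partition("}")
        else:
            namespace, local = "", tag
        # The first start event is always the root, so stop here.
        return ROOT_TO_FORMAT.get(local, "unknown"), namespace
    return "unknown", ""


sample = io.BytesIO(
    b'<codeBook xmlns="ddi:codebook:2_5" version="2.5"><stdyDscr/></codeBook>'
)
fmt, ns = sketch_detect_format(sample)
print(fmt, ns)  # codebook ddi:codebook:2_5
```

Returning on the first start event means the rest of the file is never walked, so detection cost does not grow with the number of elements in the document.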


Supported Graph Backends

Backend | Library | Best For | Protocol
Neo4j | neo4j (Bolt driver) | Production deployments, complex traversals, full-text search | Bolt
RDF / SPARQL | rdflib, SPARQLWrapper | Linked data publishing, ontology alignment, federated queries | HTTP / SPARQL 1.1
Gremlin | gremlinpython | JanusGraph, Amazon Neptune, Azure Cosmos DB | WebSocket
NetworkX | networkx | Local analysis, prototyping, unit testing | In-process
pandas | pandas | Tabular analysis, CSV/Excel export, data validation | In-process

Neo4j is wired through the GraphWriteAdapter protocol in ddigraph.schema.adapter. The other backends listed above do not go through GraphWriteAdapter. Instead, they consume the output of the parser tier (DDILoader, DDIFragmentLoader, DDIFragmentParser) and write it through their own small backend-specific adapters. The demo/load_*.py scripts show the pattern.
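The "same records, different writer" pattern might look roughly like the sketch below. The real method names of GraphWriteAdapter are not shown on this page, so WriterSketch, write_nodes, and both writer classes are hypothetical names, and the records are hard-coded stand-ins for parser output rather than anything ddigraph actually emits.

```python
from typing import Iterable, List, Protocol


class WriterSketch(Protocol):
    """Hypothetical writer interface; not the real GraphWriteAdapter."""

    def write_nodes(self, nodes: Iterable[dict]) -> int:
        ...


class DictGraphWriter:
    """Stores nodes in a dict keyed by id, standing in for a graph backend."""

    def __init__(self) -> None:
        self.nodes: dict = {}

    def write_nodes(self, nodes: Iterable[dict]) -> int:
        count = 0
        for node in nodes:
            self.nodes[node["id"]] = node
            count += 1
        return count


class RowWriter:
    """Flattens nodes into (id, label) rows, standing in for a tabular backend."""

    def __init__(self) -> None:
        self.rows: List[tuple] = []

    def write_nodes(self, nodes: Iterable[dict]) -> int:
        before = len(self.rows)
        self.rows.extend((n["id"], n["label"]) for n in nodes)
        return len(self.rows) - before


def parsed_records() -> List[dict]:
    # Hard-coded stand-in for records a parser tier might emit.
    return [
        {"id": "v1", "label": "Variable"},
        {"id": "q1", "label": "Question"},
    ]


def load(records: Iterable[dict], writer: WriterSketch) -> int:
    """Pass the same records to whichever writer is supplied."""
    return writer.write_nodes(records)


graph_writer = DictGraphWriter()
row_writer = RowWriter()
graph_count = load(parsed_records(), graph_writer)
row_count = load(parsed_records(), row_writer)
print(graph_count, row_count)  # 2 2
```

The load function never changes between backends; only the object passed as writer does, which is the property the demo scripts are described as relying on.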


ddigraph is released under the MIT License.

GitHub | PyPI | Issues