ddigraph

Transform DDI metadata into knowledge graphs

A Python toolkit for converting DDI (Data Documentation Initiative) XML metadata into queryable knowledge graphs. Read Codebook, Lifecycle, and CDI formats as a stream, then load the results into Neo4j, RDF triplestores, Gremlin-compatible databases, NetworkX, or pandas, all through a single, consistent interface.

Why ddigraph?

Parse once, write anywhere

The streaming parser tier emits typed records you can persist to Neo4j out of the box, or to RDF/SPARQL triplestores, Gremlin databases, NetworkX, or pandas through small backend-specific adapters. The parsing code stays the same; only the writer changes. See the demo/load_*.py scripts.
Streaming XML Processing

Memory-bounded iterparse-based parsing handles files of any size. Nodes and relationships are yielded incrementally, so RAM usage stays constant regardless of input volume.
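
As a rough illustration of the iterparse pattern described above, the sketch below uses only the standard library. It is not ddigraph's parser: the function name `iter_variables`, the sample document, and the choice to extract `<var>` names and labels are assumptions made for this example.

```python
import io
import xml.etree.ElementTree as ET
from typing import Iterator, Tuple

# Tiny stand-in for a DDI Codebook file (illustrative content only).
SAMPLE = b"""<codeBook xmlns="ddi:codebook:2_5">
  <dataDscr>
    <var name="age"><labl>Age in years</labl></var>
    <var name="sex"><labl>Sex of respondent</labl></var>
  </dataDscr>
</codeBook>"""


def iter_variables(source) -> Iterator[Tuple[str, str]]:
    """Yield (name, label) for each <var> element as soon as it is complete."""
    for _, elem in ET.iterparse(source, events=("end",)):
        # Tags arrive as "{namespace}local"; compare on the local name only.
        if elem.tag.rsplit("}", 1)[-1] != "var":
            continue
        # "{*}labl" matches <labl> in any namespace (Python 3.8+).
        label = (elem.findtext("{*}labl") or "").strip()
        yield elem.get("name", ""), label
        # Release the element's children, text and attributes. The emptied
        # element itself stays attached to its parent.
        elem.clear()


if __name__ == "__main__":
    for name, label in iter_variables(io.BytesIO(SAMPLE)):
        print(name, label)
```

Because the function is a generator, the caller decides how many records are held at once. The per-element `clear()` call is what keeps already-processed subtrees from accumulating.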
Async I/O with Back-Pressure

Concurrent parsing and batched writes are coordinated through async queues with configurable back-pressure, keeping throughput high while preventing memory spikes.
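
The back-pressure idea can be sketched with a bounded `asyncio.Queue`. This is a minimal illustration under stated assumptions, not ddigraph's implementation: `produce`, `consume`, `run`, and their parameters are hypothetical names, and the "writer" only records batch sizes instead of talking to a database.

```python
import asyncio
from typing import List

_DONE = object()  # sentinel marking the end of the stream


async def produce(queue: asyncio.Queue, n_records: int) -> None:
    """Stand-in for the parser: put() suspends whenever the queue is full."""
    for i in range(n_records):
        await queue.put({"id": i})
    await queue.put(_DONE)


async def consume(queue: asyncio.Queue, batch_size: int) -> List[int]:
    """Stand-in for the writer: drain the queue and flush fixed-size batches."""
    batch: list = []
    flushed: List[int] = []
    while True:
        item = await queue.get()
        if item is _DONE:
            break
        batch.append(item)
        if len(batch) == batch_size:
            flushed.append(len(batch))  # a real writer would send the batch here
            batch = []
    if batch:
        flushed.append(len(batch))  # flush the final partial batch
    return flushed


async def run(n_records: int = 10, max_queue: int = 4, batch_size: int = 3) -> List[int]:
    # maxsize bounds how far the producer can run ahead of the consumer.
    queue: asyncio.Queue = asyncio.Queue(maxsize=max_queue)
    _, sizes = await asyncio.gather(
        produce(queue, n_records), consume(queue, batch_size)
    )
    return sizes


if __name__ == "__main__":
    print(asyncio.run(run()))
```

With the defaults, ten records are flushed as batches of 3, 3, 3, and 1. The producer never holds more than `max_queue` unconsumed records in memory, which is the property the queue bound is there to guarantee.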
Format Auto-Detection

Point ddigraph at any DDI file and it determines whether the input is DDI Codebook, DDI Lifecycle, or DDI-CDI, then selects the matching parser. No manual configuration is needed.
Unified Schema Definitions

Node labels, property keys, relationship types, and constraints are defined in a single schema module. Every backend receives the same canonical structure, ensuring consistency across graph targets.
Production-Ready

Built-in retry logic with exponential back-off, structured logging, health checks, and OpenTelemetry-compatible observability hooks make ddigraph suitable for automated pipelines and CI/CD workflows.
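
For readers unfamiliar with exponential back-off, here is a generic sketch of the technique. It does not reproduce ddigraph's retry logic: the `retry` helper, its parameters, the use of "full jitter", and the decision to retry only on `ConnectionError` are all assumptions chosen for illustration.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry(
    fn: Callable[[], T],
    attempts: int = 5,
    base_delay: float = 0.5,
    max_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn until it succeeds, doubling the delay cap after each failure."""
    for attempt in range(attempts):
        try:
            return fn()
        except ConnectionError:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            # The cap grows as base_delay * 1, 2, 4, ... up to max_delay.
            cap = min(max_delay, base_delay * 2 ** attempt)
            # "Full jitter": sleep a random time in [0, cap] so that many
            # clients retrying at once do not hit the server in lockstep.
            sleep(random.uniform(0, cap))
    raise ValueError("attempts must be at least 1")
```

The `sleep` parameter is injectable so the helper can be exercised in tests without real waiting, for example by passing `delays.append` to record the delays that would have been used.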

Installation

pip:

pip install ddigraph

Development (editable install from a source checkout):

pip install -e ".[dev,docs]"

Backend extras

The base install includes the Neo4j driver. For other backends, install the corresponding extras:

pip install "ddigraph[rdf]"      # RDFLib + SPARQLWrapper
pip install "ddigraph[gremlin]"  # Gremlin-Python

Quick Example

# 1. Configure your Neo4j connection
export DDIGRAPH_NEO4J_URI=bolt://localhost:7687
export DDIGRAPH_NEO4J_USER=neo4j
export DDIGRAPH_NEO4J_PASSWORD=secret

# 2. Create indexes and constraints
ddigraph bootstrap

# 3. Load any DDI file (format is auto-detected)
ddigraph load path/to/survey.xml --dataset-id my-survey

# 4. Verify what was loaded (use Cypher in the Neo4j Browser)
#    MATCH (n) RETURN labels(n) AS type, count(*) AS n ORDER BY n DESC

Python API

The same workflow is available programmatically. See the Quick Start guide for a full Python example using DDILoader and DDIFragmentLoader.

Documentation

Getting Started

Install ddigraph, run your first load, and explore the graph in under ten minutes.
User Guide

Understand the architecture, learn how DDI elements map to graph structures, and extend ddigraph with custom adapters.
Graph Backends

Detailed guides for each supported backend, including connection setup, schema bootstrapping, and query examples.
- Neo4j
- RDF / SPARQL
- Gremlin
- NetworkX
Reference

Complete CLI command reference and configuration option catalog.
- CLI Reference
- Configuration
Advanced

Tune performance, integrate with AI/LLM pipelines, and understand how ddigraph fits into the broader DDI ecosystem.
Project

Contribute to ddigraph, review the changelog, or find answers to common questions.

Supported DDI Formats

Format	Version	Root Element	Description
DDI Codebook	2.5 / 2.6	`<codeBook>`	Traditional flat structure with a central Dataset node linking to variables, questions, and study-level metadata. Widely used by survey archives and data catalogs.
DDI Lifecycle	3.2 / 3.3	`<FragmentInstance>`	Modular format built around reusable fragments. Supports questionnaire flows, CAPI/CAWI instruments, and complex study designs with fine-grained versioning.
DDI-CDI	1.0	`<Wrapper>`	Cross-Domain Integration model. Captures conceptual variables, represented variables, instance variables, and the relationships between them for cross-study harmonisation.

Format detection

You do not need to specify which format a file uses. The detect_ddi_format() function reads the root element and namespace, then selects the matching parser.
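
To make the idea concrete, root-element sniffing can be sketched with the standard library. This is not the source of detect_ddi_format(): the helper name `sniff_format`, the returned labels, and the mapping are assumptions, with the root element names taken from the table above. For brevity the sketch looks only at the local name and skips the namespace check.

```python
import io
import xml.etree.ElementTree as ET
from typing import Optional

# Root local names come from the format table; the labels are made up here.
ROOT_TO_FORMAT = {
    "codeBook": "codebook",
    "FragmentInstance": "lifecycle",
    "Wrapper": "cdi",
}


def sniff_format(source) -> Optional[str]:
    """Return a format label based on the root element, or None if unknown."""
    # The first "start" event is always the root element, so parsing stops
    # as soon as its opening tag has been read.
    for _, root in ET.iterparse(source, events=("start",)):
        local_name = root.tag.rsplit("}", 1)[-1]  # strip "{namespace}"
        return ROOT_TO_FORMAT.get(local_name)
    return None


if __name__ == "__main__":
    doc = io.BytesIO(b'<codeBook xmlns="ddi:codebook:2_5"></codeBook>')
    print(sniff_format(doc))
```

Returning `None` for an unrecognised root leaves the decision about how to report the error to the caller.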

Supported Graph Backends

Backend	Library	Best For	Protocol
Neo4j	`neo4j` (Bolt driver)	Production deployments, complex traversals, full-text search	Bolt
RDF / SPARQL	`rdflib`, `SPARQLWrapper`	Linked data publishing, ontology alignment, federated queries	HTTP / SPARQL 1.1
Gremlin	`gremlinpython`	JanusGraph, Amazon Neptune, Azure Cosmos DB	WebSocket
NetworkX	`networkx`	Local analysis, prototyping, unit testing	In-process
pandas	`pandas`	Tabular analysis, CSV/Excel export, data validation	In-process

Neo4j is wired through the GraphWriteAdapter protocol in ddigraph.schema.adapter. The other backends listed above are not adapter-driven; they consume the parser tier (DDILoader, DDIFragmentLoader, DDIFragmentParser) and write through their own small adapters. The demo/load_*.py scripts show the pattern.
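
The general shape of "same parser output, different writer" can be sketched as follows. Every name in this block is hypothetical: `Node`, `Edge`, `RecordWriter`, `InMemoryWriter`, `load`, and the `HAS_VARIABLE` relationship type are illustrative stand-ins and do not reflect the actual GraphWriteAdapter protocol or the record types ddigraph emits.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Protocol, Union


@dataclass
class Node:  # hypothetical record type
    id: str
    label: str


@dataclass
class Edge:  # hypothetical record type
    source: str
    target: str
    type: str


Record = Union[Node, Edge]


class RecordWriter(Protocol):
    """Anything with a write() method that accepts records qualifies."""

    def write(self, records: Iterable[Record]) -> None: ...


class InMemoryWriter:
    """Toy backend: keeps nodes in a dict and edges in a list."""

    def __init__(self) -> None:
        self.nodes: Dict[str, Node] = {}
        self.edges: List[Edge] = []

    def write(self, records: Iterable[Record]) -> None:
        for record in records:
            if isinstance(record, Node):
                self.nodes[record.id] = record
            else:
                self.edges.append(record)


def load(records: Iterable[Record], writer: RecordWriter) -> None:
    """The producing side stays the same; only the writer argument changes."""
    writer.write(records)


if __name__ == "__main__":
    writer = InMemoryWriter()
    load(
        [Node("ds1", "Dataset"), Node("v1", "Variable"),
         Edge("ds1", "v1", "HAS_VARIABLE")],
        writer,
    )
    print(len(writer.nodes), len(writer.edges))
```

A structural `Protocol` is used so that a backend-specific writer does not need to inherit from a shared base class. Consult the demo/load_*.py scripts for the pattern the project itself uses.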

ddigraph is released under the MIT License.

GitHub | PyPI | Issues