How It Loads Your Data
Optional deep-dive
You don't need to read this page to use ddigraph. It explains the internal design — useful when you want to tune performance or understand why something behaves a certain way. If you're just getting started, go to Your First Queries.
ddigraph converts DDI XML files into a graph database. It is built around three goals:
- Scalability — works on files of any size without running out of memory
- Observability — you can see what is happening during a load
- Fault isolation — if one record fails, the rest still succeed
Supported DDI Formats
ddigraph supports three DDI format families, each with its own loader:
DDI Codebook (DDILoader)
The older, simpler format. All metadata connects to a central Dataset node.
```mermaid
flowchart LR
    subgraph Input
        XML[DDI Codebook XML]
    end
    subgraph Parser
        P[lxml.iterparse] <--> Q[asyncio.Queue]
    end
    subgraph Writer
        B[DDIBatch]
        C[Async Cypher UNWIND]
    end
    DB[(Neo4j)]
    XML --> P --> Q --> B --> C --> DB
```
The file is read piece-by-piece (Parser), buffered in a queue, then written to Neo4j in batches (Writer).
DDI-L FragmentInstance (DDIFragmentLoader)
The newer DDI Lifecycle 3.x format. Each <Fragment> is a reusable piece of metadata that
can reference other fragments, naturally forming a graph structure.
```mermaid
flowchart LR
    subgraph Input
        XML[FragmentInstance XML]
    end
    subgraph Parser
        P[iterparse streaming]
        FB[FragmentBatch by type]
    end
    subgraph Writer
        C[Batched UNWIND Cypher]
    end
    DB[(Neo4j)]
    XML --> P --> FB --> C --> DB
```
DDI-CDI 1.0 (CDILoader)
DDI Cross-Domain Integration — the newest DDI standard. It describes a wider range of data types: concepts, variables, classifications, data structures, and processes.
```mermaid
flowchart LR
    subgraph Input
        XML[DDI-CDI XML]
    end
    subgraph Parser
        P[iterparse streaming]
        EB[EntityBatch by type]
    end
    subgraph Writer
        C[Batched UNWIND Cypher]
    end
    DB[(Neo4j)]
    XML --> P --> EB --> C --> DB
```
Format Auto-Detection
ddigraph can detect which format your file uses automatically. It reads the root XML element:
- `FragmentInstance` → uses the lifecycle loader
- `codeBook`, `codebook`, `DDIInstance` → uses the codebook loader
- DDI-CDI namespace or root elements → uses the CDI loader
You can also call detect_ddi_format("your_file.xml") in Python, or let the CLI choose
automatically with ddigraph load.
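The root-element rules above can be sketched with the standard library alone. This is not the implementation behind detect_ddi_format: the function name `sniff_ddi_format`, the returned strings, and the rule that a CDI namespace contains the text "cdi" are all assumptions made for illustration.

```python
import io
import xml.etree.ElementTree as ET

# Root element names taken from the list above.
LIFECYCLE_ROOTS = {"FragmentInstance"}
CODEBOOK_ROOTS = {"codeBook", "codebook", "DDIInstance"}


def sniff_ddi_format(source) -> str:
    """Return 'lifecycle', 'codebook', 'cdi', or 'unknown' (hypothetical labels)."""
    # Only the first "start" event is inspected, so the rest of the file is never parsed.
    for _, root in ET.iterparse(source, events=("start",)):
        # Namespaced tags look like "{namespace}LocalName".
        namespace, _, local_name = root.tag.rpartition("}")
        namespace = namespace.lstrip("{")
        if local_name in LIFECYCLE_ROOTS:
            return "lifecycle"
        if local_name in CODEBOOK_ROOTS:
            return "codebook"
        if "cdi" in namespace.lower():  # assumption: CDI namespaces contain "cdi"
            return "cdi"
        return "unknown"
    return "unknown"


if __name__ == "__main__":
    sample = io.BytesIO(b"<codeBook><stdyDscr/></codeBook>")
    print(sniff_ddi_format(sample))
```

Because only the opening tag of the root is read, a check like this costs about the same for a 5 KB file as for a 5 GB one.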
What Gets Loaded
DDI Codebook
The Codebook loader extracts these categories of metadata into graph nodes:
- Datasets & studies — the overall study and its data collection events
- Files & structures — data files, logical records, and physical data structures
- Variables & questions — survey questions, question grids, question flows, and variables (including links to universes, categories, and source questions)
- Codes & concepts — code lists, categories, concepts, universes, and representations
- Groups & organizations — organizations, series, groups, and category groups
- Methodology & processing — sampling procedures, weights, methodology notes, and processing events
- Extended coverage — citations, geographic/temporal coverage, funding, collection instruments, and access policies
DDI-L FragmentInstance
The FragmentInstance loader preserves the graph structure of DDI-L 3.x files:
| Node Type | What it represents |
|---|---|
| `Instrument` | The survey instrument (entry point) |
| `Sequence` | An ordered list of questions |
| `IfThenElse` | A conditional branch (show question only if...) |
| `Loop` | A repeated section |
| `QuestionConstruct` | A question reference within the flow |
| `QuestionItem` | A single question |
| `QuestionGrid` | A grid/matrix of questions |
| `CodeList` | A list of answer choices |
| `Category` | One answer choice |
| `StatementItem` | Display text (not a question) |
| `ComputationItem` | A derived or calculated variable |
| `Universe` | The population this question applies to |
| `Concept` | A conceptual definition |
| `Variable` | A data variable |
| `StudyUnit` | Overall study metadata |
| `DataCollection` | Data collection metadata |
See DDI-L FragmentInstance for detailed coverage.
DDI-CDI 1.0
The CDI loader covers all 210 concrete top-level entity elements declared in ddi-cdi.xsd.
The ~35 most commonly referenced types use hand-tuned record classes; the remaining entities
round-trip through CDIGenericRecord. A representative sample of the bespoke types by domain:
| Node Type | Domain |
|---|---|
| `Concept`, `ConceptualDomain`, `ConceptSystem`, `UnitType`, `Universe`, `Population` | Concepts |
| `InstanceVariable`, `RepresentedVariable`, `ConceptualVariable` | Variables |
| `Category`, `Code`, `ClassificationIndex`, `ClassificationItem`, `CodeList`, `StatisticalClassification` | Classifications |
| `WideDataSet`, `WideDataStructure`, `DataStore`, `LogicalRecord` | Data structures |
| `Agent`, `Activity`, `CorrespondenceTable` | Agents & process |
Configuration
ddigraph reads its settings from environment variables (or a .env file). The main settings are:
- Neo4j connection: URI, username, password, database name
- Batching: `chunk_size` (records per batch), `queue_maxsize`, `writer_concurrency`
- Retry: `write_retry_attempts`, `write_retry_base_delay`, `write_retry_jitter`
- Parsing: `strict_parsing` (fail on invalid XML), `dry_run`, `replace`
See Configuration for all available settings.
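As a rough illustration of environment-driven settings with validation at startup, the sketch below reads a few of the batching and retry values named above. The `DDIGRAPH_` prefix, the `LoaderSettings` class, and every default value are hypothetical; the real variable names and defaults are listed on the Configuration page.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class LoaderSettings:
    # Field names follow the settings listed above; defaults are illustrative only.
    chunk_size: int = 1000
    queue_maxsize: int = 10
    writer_concurrency: int = 4
    write_retry_attempts: int = 3

    def __post_init__(self) -> None:
        # Reject bad values when the settings object is built, not mid-load.
        for name in self.__dataclass_fields__:
            value = getattr(self, name)
            if value < 1:
                raise ValueError(f"{name} must be >= 1, got {value}")


def settings_from_env(env=os.environ, prefix="DDIGRAPH_") -> LoaderSettings:
    """Build settings from variables such as DDIGRAPH_CHUNK_SIZE (hypothetical names)."""
    overrides = {}
    for field_name in LoaderSettings.__dataclass_fields__:
        raw = env.get(prefix + field_name.upper())
        if raw is not None:
            overrides[field_name] = int(raw)  # raises ValueError on non-numeric input
    return LoaderSettings(**overrides)


if __name__ == "__main__":
    print(settings_from_env({"DDIGRAPH_CHUNK_SIZE": "250"}))
```

Passing a plain dict instead of `os.environ` keeps the function easy to test.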
How the Schema Works
All node types, relationships, and database constraints are defined in one place:
ddigraph.schema.definitions. This single source of truth is used by the loaders,
the schema bootstrap, and the adapters.
```python
from ddigraph.schema import DDISchema

# Generate all index/constraint queries
queries = DDISchema.generate_all_schema_queries(include_fragments=True, include_cdi=True)

# Look up the relationship type for a DDI-L reference field
rel_type = DDISchema.get_fragment_relationship_type("ControlConstructReference")
# Returns: "HAS_CONSTRUCT"
```
Schema Bootstrap
Run ddigraph bootstrap once before your first load. It creates:
- Unique constraints on identity fields (`id` for Codebook nodes, `fragment_id` for FragmentInstance nodes), which prevent duplicate nodes
- Secondary indexes on `name`, `label`, and `description`, which make lookups fast
It is safe to run more than once. If the index or constraint already exists, nothing changes.
```shell
ddigraph bootstrap                          # Codebook + DDI-L (default)
ddigraph bootstrap --no-include-fragments   # Codebook only
```
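To show why repeating the bootstrap is harmless, here is a minimal sketch of idempotent schema statements. It assumes Neo4j 5 Cypher syntax; the helper functions and the constraint and index names are hypothetical, and the statements ddigraph actually runs come from `DDISchema.generate_all_schema_queries`.

```python
def unique_constraint_query(label: str, prop: str) -> str:
    """Build an idempotent uniqueness constraint statement (Neo4j 5 syntax)."""
    name = f"{label.lower()}_{prop}_unique"  # hypothetical naming scheme
    return (
        f"CREATE CONSTRAINT {name} IF NOT EXISTS "
        f"FOR (n:{label}) REQUIRE n.{prop} IS UNIQUE"
    )


def index_query(label: str, prop: str) -> str:
    """Build an idempotent secondary index statement (Neo4j 5 syntax)."""
    name = f"{label.lower()}_{prop}_index"  # hypothetical naming scheme
    return f"CREATE INDEX {name} IF NOT EXISTS FOR (n:{label}) ON (n.{prop})"


if __name__ == "__main__":
    print(unique_constraint_query("Dataset", "id"))
    print(unique_constraint_query("Sequence", "fragment_id"))
    print(index_query("Sequence", "name"))
```

The `IF NOT EXISTS` clause turns each statement into a no-op when the constraint or index is already present.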
How Large Files Stay Fast
Streaming Parsing
ddigraph uses iterparse — a streaming XML parser that reads one element at a time.
After each element is processed, it is deleted from memory. This keeps memory usage
constant no matter how large the file is.
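```python
from lxml import etree

# Each element is processed, then immediately cleared from memory.
# xml_file and process() are placeholders for the input path and the per-element handler.
for _, elem in etree.iterparse(xml_file, events=("end",)):
    process(elem)
    elem.clear()
    # Drop already-processed siblings so the parent stops holding references to them
    while elem.getprevious() is not None:
        del elem.getparent()[0]
```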
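Without streaming, parsing a 5 GB file would need at least 5 GB of RAM, and usually more, because an in-memory XML tree is larger than the file it was parsed from. With streaming, the same file uses only a small, roughly fixed amount of memory.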
Back-Pressure (Codebook Loader)
A queue sits between the parser and the database writer. If the database is writing slowly, the queue fills up. Once full, it pauses the parser until the writer catches up. This prevents memory from growing without limit when writing is slower than reading.
Think of it like a buffer at a car wash: if the wash bay is full, no more cars are admitted until one exits.
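The mechanism can be sketched with a bounded `asyncio.Queue`. This is a simplified model, not the loader's code: the `parser` and `writer` coroutines, the `None` sentinel, and the 1 ms sleep standing in for a database write are all assumptions.

```python
import asyncio


async def demo_back_pressure(total: int = 20, maxsize: int = 3) -> tuple:
    """Run a fast producer against a slow consumer; return (records written, peak queue size)."""
    queue: asyncio.Queue = asyncio.Queue(maxsize=maxsize)
    peak = 0
    written = 0

    async def parser() -> None:
        nonlocal peak
        for record in range(total):
            await queue.put(record)          # suspends here while the queue is full
            peak = max(peak, queue.qsize())
        await queue.put(None)                # sentinel: no more records

    async def writer() -> None:
        nonlocal written
        while (await queue.get()) is not None:
            await asyncio.sleep(0.001)       # stand-in for a slow database write
            written += 1

    await asyncio.gather(parser(), writer())
    return written, peak


if __name__ == "__main__":
    print(asyncio.run(demo_back_pressure()))
```

However fast the parser runs, the number of buffered records never exceeds `maxsize`, which is what bounds memory.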
Batched Writes
Instead of writing one record to Neo4j at a time, ddigraph groups records into batches and
sends them all in a single query using Cypher's UNWIND clause:
```cypher
UNWIND $batch AS props
MERGE (n:Sequence {fragment_id: props.fragment_id})
SET n += props
```
This reduces database round trips by 10–100x. Instead of 1,000 queries (one per record), there might be 10 queries (100 records each).
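The arithmetic above can be checked with a small chunking helper. In this sketch `run_query` is a hypothetical callable standing in for whatever sends a query and its parameters to the database; `chunked` and `write_all` are illustrative names, not part of ddigraph.

```python
from itertools import islice
from typing import Callable, Iterable, Iterator

UNWIND_MERGE = """
UNWIND $batch AS props
MERGE (n:Sequence {fragment_id: props.fragment_id})
SET n += props
"""


def chunked(records: Iterable[dict], chunk_size: int) -> Iterator[list]:
    """Yield lists of at most chunk_size records without materializing the whole input."""
    iterator = iter(records)
    while batch := list(islice(iterator, chunk_size)):
        yield batch


def write_all(records: Iterable[dict], chunk_size: int, run_query: Callable) -> int:
    """Send one UNWIND query per batch and return the number of round trips."""
    round_trips = 0
    for batch in chunked(records, chunk_size):
        run_query(UNWIND_MERGE, {"batch": batch})
        round_trips += 1
    return round_trips


if __name__ == "__main__":
    records = ({"fragment_id": f"seq-{i}"} for i in range(1000))
    sent = []
    print(write_all(records, 100, lambda query, params: sent.append(params)))
```

With 1,000 records and a chunk size of 100, `write_all` issues 10 queries instead of 1,000.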
Async I/O
- Codebook loader — reads and writes at the same time using multiple concurrent writer tasks
- FragmentInstance loader — writes asynchronously while parsing proceeds in batches
Fault Tolerance
| Feature | What it means |
|---|---|
| Schema bootstrap is safe to repeat | Uses IF NOT EXISTS — running it twice won't break anything |
| Batch-level isolation | Each batch is its own transaction — a failure in one batch doesn't affect others |
| `MERGE` semantics | Retrying a failed batch won't create duplicate nodes |
| Retry with backoff | Temporary Neo4j errors are retried automatically with increasing wait times |
| Input validation | Bad configuration values are caught at startup, not mid-load |
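
Retry with backoff and batch-level isolation can be sketched together. The code below is an assumption about how such logic might look, not ddigraph's implementation: `TransientWriteError` is a hypothetical exception, and the `attempts`, `base_delay`, and `jitter` parameters are only loosely modeled on the `write_retry_*` settings.

```python
import random
import time


class TransientWriteError(Exception):
    """Hypothetical stand-in for a temporary database error."""


def write_with_retry(write_batch, batch, attempts=3, base_delay=0.5, jitter=0.1,
                     sleep=time.sleep):
    """Call write_batch(batch), retrying transient errors with exponential backoff."""
    for attempt in range(1, attempts + 1):
        try:
            return write_batch(batch)
        except TransientWriteError:
            if attempt == attempts:
                raise                        # out of attempts: let the caller decide
            # Delay doubles on every attempt, plus a little random jitter.
            sleep(base_delay * 2 ** (attempt - 1) + random.uniform(0, jitter))


def load_batches(write_batch, batches, **retry_options):
    """Write every batch independently; return the indexes of batches that failed."""
    failed = []
    for index, batch in enumerate(batches):
        try:
            write_with_retry(write_batch, batch, **retry_options)
        except Exception:
            failed.append(index)             # isolate the failure and keep going
    return failed
```

Because each write uses MERGE, rerunning a batch that partly succeeded before the error does not create duplicates.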
Adding a Custom Backend
The loader sends data through a GraphWriteAdapter interface (a standard set of methods).
You can plug in any database by writing a class that implements this interface.
See Custom Adapters for examples.
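The general shape of such a plug-in can be sketched with a `typing.Protocol`. Everything here is hypothetical: the `BatchWriter` protocol and its single `write_nodes` method are invented for illustration, and the real GraphWriteAdapter may define different methods and signatures (see Custom Adapters).

```python
import asyncio
from typing import Protocol, runtime_checkable


@runtime_checkable
class BatchWriter(Protocol):
    """Hypothetical interface; not the actual GraphWriteAdapter definition."""

    async def write_nodes(self, label: str, key: str, batch: list) -> int: ...


class InMemoryAdapter:
    """Toy backend that stores nodes in a dict, keyed by (label, identity value)."""

    def __init__(self) -> None:
        self.nodes: dict = {}

    async def write_nodes(self, label: str, key: str, batch: list) -> int:
        for props in batch:
            # MERGE-like behavior: create the node if missing, then update its properties.
            self.nodes.setdefault((label, props[key]), {}).update(props)
        return len(batch)


async def demo() -> int:
    adapter = InMemoryAdapter()
    await adapter.write_nodes("Sequence", "fragment_id", [{"fragment_id": "s1", "name": "A"}])
    await adapter.write_nodes("Sequence", "fragment_id", [{"fragment_id": "s1", "label": "Intro"}])
    return len(adapter.nodes)  # the second write updates the same node


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

An in-memory backend like this is mainly useful for tests, where a loader can run without a database.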
Adding Custom Node Types
To add your own node types or relationships:
- Extend `DDISchema` in `src/ddigraph/schema/definitions.py`
- Update the appropriate loader to extract the new XML elements
- Regenerate bootstrap queries from the schema
CI/CD
GitHub Actions runs Ruff (code style), mypy (type checking), and pytest (tests) on every push and pull request. This prevents regressions and keeps code quality consistent.