Architecture

datamodel-code-generator is organized around one central idea: many input formats are normalized into a shared generation graph, then rendered through output-model-specific backends.

This page is partly generated from source code. The generated inventory is intentionally small so the narrative can stay hand-written while release-time details such as parser routes and output backends stay synchronized.

Generation Pipeline

flowchart TD
    subgraph entry["Entry points"]
        direction TB
        cli["CLI\n__main__.py"]
        api["Python API\ngenerate()"]
    end

    subgraph setup["Configuration and input handling"]
        direction TB
        config["GenerateConfig\nParserConfig"]
        normalize["Input normalization\ninfer / fetch / convert"]
    end

    subgraph parsing["Parsing"]
        direction TB
        parser["Parser.parse()"]
        raw["parse_raw()\nformat-specific parser"]
        model_graph["Generation graph\nDataModel / DataType / Reference"]
    end

    subgraph rendering["Rendering"]
        direction TB
        render["Templates\nmodel/template"]
        format["CodeFormatter"]
        output["Python files\nstdout / return value"]
    end

    cli --> api --> config --> normalize --> parser
    parser --> raw --> model_graph
    parser --> model_graph
    model_graph --> render --> format --> output

The CLI builds a Config from command-line arguments, pyproject.toml, and presets. It then calls the same generate() API that library users can call directly. generate() selects the parser, runs it, and either returns code or writes files.
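
The shared-entry-point idea can be sketched with the standard library alone. Every name below (SketchConfig, generate_sketch, main_sketch, and both flags) is a hypothetical stand-in rather than the project's real API; the point is only that the CLI wrapper builds a config and then delegates to the same function a library caller would use.

```python
import argparse
from dataclasses import dataclass
from pathlib import Path
from typing import Optional, Sequence


@dataclass
class SketchConfig:
    """Hypothetical stand-in for the CLI-side configuration object."""

    input_text: str
    output: Optional[Path] = None


def generate_sketch(input_text: str, output: Optional[Path] = None) -> Optional[str]:
    """Hypothetical stand-in for generate(): return code, or write it to a file."""
    code = f"# generated from {len(input_text)} characters of input\n"
    if output is None:
        return code
    output.write_text(code, encoding="utf-8")
    return None


def main_sketch(argv: Sequence[str]) -> int:
    """CLI wrapper: parse arguments, build a config, delegate to the shared API."""
    parser = argparse.ArgumentParser(prog="sketch")
    parser.add_argument("--input-text", required=True)
    parser.add_argument("--output", type=Path, default=None)
    args = parser.parse_args(argv)
    config = SketchConfig(input_text=args.input_text, output=args.output)
    result = generate_sketch(config.input_text, output=config.output)
    if result is not None:
        print(result, end="")
    return 0
```

A library caller would skip main_sketch() and call generate_sketch() directly, which is the relationship the paragraph above describes.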

Entry Points

The executable entry point is defined in pyproject.toml:

datamodel-codegen = "datamodel_code_generator.__main__:main"

src/datamodel_code_generator/__main__.py owns CLI-only behavior:

fast paths for --version, --help, prompt helpers, and JSON Schema output
argument parsing and shell completion
pyproject.toml profile loading and inheritance
--check diff generation
--watch regeneration
structured JSON command output
--input-model loading before normal generation

src/datamodel_code_generator/__init__.py owns the public generation API:

generate()
input type inference
raw JSON/YAML/CSV/Dict conversion through genson
MCP tool schema conversion
parser construction
generated headers and output file writing
optional model metadata emission

Parser Model

Each parser turns its input into DataModel objects. The parser-specific part is parse_raw(). The shared part is Parser.parse(), which handles sorting, module layout, imports, rendering, exports, formatting, and optional metadata.

flowchart TD
    parse["Parser.parse()"]
    raw["parse_raw()\nimplemented by subclasses"]
    sort["sort_data_models()"]
    modules["_build_module_structure()"]
    process["_process_single_module()"]
    render["_generate_module_output()"]
    metadata["_build_model_metadata()"]
    result["str or module map"]

    parse --> raw --> sort --> modules --> process --> render --> metadata --> result
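
The split between a shared parse() and a format-specific parse_raw() is a template-method arrangement. A minimal sketch, assuming hypothetical class names (SketchParser, SketchJsonSchemaParser) and a greatly simplified notion of a model as just a class name:

```python
from __future__ import annotations

from abc import ABC, abstractmethod


class SketchParser(ABC):
    """Hypothetical base class: parse() is shared, parse_raw() is format-specific."""

    def __init__(self, source: dict) -> None:
        self.source = source
        self.models: list[str] = []

    @abstractmethod
    def parse_raw(self) -> None:
        """Populate self.models from self.source."""

    def parse(self) -> str:
        self.parse_raw()  # format-specific step, supplied by the subclass
        ordered = sorted(self.models)  # stand-in for the real ordering step
        return "\n\n".join(f"class {name}:\n    pass" for name in ordered)


class SketchJsonSchemaParser(SketchParser):
    def parse_raw(self) -> None:
        # Treat each key under "definitions" as one model.
        self.models.extend(self.source.get("definitions", {}))
```

Usage: SketchJsonSchemaParser({"definitions": {"Pet": {}, "Owner": {}}}).parse() renders Owner before Pet, because ordering happens in the shared step rather than in the subclass.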

JSON Schema is the main reusable parser surface. OpenAPI and AsyncAPI extend it. Avro, XML Schema, Protocol Buffers, raw data, and MCP tools convert to a JSON Schema-shaped document before using the same model-building machinery. GraphQL is the main parser that builds models directly from its own schema API.

Ordering and module splitting use small graph helpers. parser/_scc.py finds strongly connected components with an iterative Tarjan traversal so large graphs do not hit Python recursion limits. parser/_graph.py provides stable topological ordering. Together they keep sort_data_models() deterministic and let _build_module_structure() move circular module SCCs into _internal.py forwarder modules when imports would otherwise cycle.
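
For readers unfamiliar with the technique, the following is a generic iterative Tarjan traversal, not the code in parser/_scc.py. The function name and the dict-of-lists graph shape are assumptions made for illustration. An explicit work stack replaces recursion, so a dependency chain thousands of nodes deep does not raise RecursionError.

```python
from typing import Dict, List, Sequence


def strongly_connected_components(graph: Dict[str, Sequence[str]]) -> List[List[str]]:
    """Iterative Tarjan: SCCs are emitted dependencies-first (reverse topological order)."""
    index: Dict[str, int] = {}
    lowlink: Dict[str, int] = {}
    on_stack = set()
    stack: List[str] = []
    result: List[List[str]] = []
    counter = 0

    for root in graph:
        if root in index:
            continue
        index[root] = lowlink[root] = counter
        counter += 1
        stack.append(root)
        on_stack.add(root)
        # Each work item pairs a node with an iterator over its successors.
        work = [(root, iter(graph.get(root, ())))]
        while work:
            node, successors = work[-1]
            descended = False
            for nxt in successors:
                if nxt not in index:
                    # First visit: descend by pushing a new work item.
                    index[nxt] = lowlink[nxt] = counter
                    counter += 1
                    stack.append(nxt)
                    on_stack.add(nxt)
                    work.append((nxt, iter(graph.get(nxt, ()))))
                    descended = True
                    break
                if nxt in on_stack:
                    lowlink[node] = min(lowlink[node], index[nxt])
            if descended:
                continue
            # All successors handled: propagate lowlink to the parent, then close the SCC.
            work.pop()
            if work:
                parent = work[-1][0]
                lowlink[parent] = min(lowlink[parent], lowlink[node])
            if lowlink[node] == index[node]:
                component = []
                while True:
                    member = stack.pop()
                    on_stack.discard(member)
                    component.append(member)
                    if member == node:
                        break
                result.append(sorted(component))  # sorted for deterministic output
    return result
```

Sorting each component and iterating the graph in insertion order keeps the result deterministic for a given input, which is the property the ordering helpers are described as preserving.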

Generated Inventory

This section is generated by scripts/build_architecture_docs.py from the current source tree. Edit the surrounding prose by hand, then run the script before release.

Parser Inheritance

classDiagram
    JsonSchemaParser <|-- AvroParser
    JsonSchemaParser <|-- OpenAPIParser
    JsonSchemaParser <|-- ProtobufParser
    JsonSchemaParser <|-- XMLSchemaParser
    OpenAPIParser <|-- AsyncAPIParser
    Parser <|-- GraphQLParser
    Parser <|-- JsonSchemaParser

Input Routes

Input file type	Parser route	Notes
`auto`	`pre-parser inference`	Resolved before parser selection by content inference.
`openapi`	`OpenAPIParser`	Routed directly by `_build_parser()`.
`asyncapi`	`AsyncAPIParser`	Routed directly by `_build_parser()`.
`jsonschema`	`JsonSchemaParser`	Routed directly by `_build_parser()`.
`mcp-tools`	`JsonSchemaParser after conversion`	MCP tool input/output schemas are hoisted into JSON Schema definitions first.
`xmlschema`	`XMLSchemaParser`	Routed directly by `_build_parser()`.
`protobuf`	`ProtobufParser`	Routed directly by `_build_parser()`.
`avro`	`AvroParser`	Routed directly by `_build_parser()`.
`json`	`JsonSchemaParser after conversion`	Sample data is converted to JSON Schema with genson first.
`yaml`	`JsonSchemaParser after conversion`	Sample data is converted to JSON Schema with genson first.
`dict`	`JsonSchemaParser after conversion`	In-memory mapping is converted to JSON Schema with genson first.
`csv`	`JsonSchemaParser after conversion`	The header and first data row are converted to JSON Schema with genson first.
`graphql`	`GraphQLParser`	Routed directly by `_build_parser()`.

Output Backends

Output model type	Data model	Root model	Field model	Type manager
`pydantic_v2.BaseModel`	`model.pydantic_v2.base_model.BaseModel`	`model.pydantic_v2.root_model.RootModel`	`model.pydantic_v2.base_model.DataModelField`	`model.pydantic_v2.types.DataTypeManager`
`pydantic_v2.dataclass`	`model.pydantic_v2.dataclass.DataClass`	`model.type_alias.TypeAliasTypeBackport`	`model.pydantic_v2.dataclass.DataModelField`	`model.pydantic_v2.types.DataTypeManager`
`dataclasses.dataclass`	`model.dataclass.DataClass`	`model.type_alias.TypeAlias`	`model.dataclass.DataModelField`	`model.dataclass.DataTypeManager`
`typing.TypedDict`	`model.typed_dict.TypedDict`	`model.type_alias.TypeAlias`	`model.typed_dict.DataModelFieldBackport`	`model.types.DataTypeManager`
`msgspec.Struct`	`model.msgspec.Struct`	`model.type_alias.TypeAlias`	`model.msgspec.DataModelField`	`model.msgspec.DataTypeManager`

Configuration Surface

Config model	Field count	Purpose
`BaseGenerateConfig`	135	Shared generation options.
`GenerateConfig`	150	Public `generate()` configuration.
`ParserConfig`	132	Base parser dependency injection and parser options.
`JSONSchemaParserConfig`	134	JSON Schema parser options.
`OpenAPIParserConfig`	140	OpenAPI-specific parser options.
`AsyncAPIParserConfig`	141	AsyncAPI-specific parser options.
`XMLSchemaParserConfig`	135	XML Schema-specific parser options.
`ProtobufParserConfig`	135	Protocol Buffers-specific parser options.
`AvroParserConfig`	134	Avro-specific parser options.
`GraphQLParserConfig`	135	GraphQL-specific parser options.

Formatter Names

Formatter	Default when unspecified
`builtin`	no
`black`	yes
`isort`	yes
`ruff-check`	no
`ruff-format`	no

Intermediate Model Graph

The generation graph is built from a small set of core objects:

DataModel: a generated class, root model, type alias, enum, scalar alias, or union alias.
DataModelFieldBase: one field on a generated model, including defaults, aliases, constraints, and metadata.
DataType: a Python type annotation tree, including containers, unions, literals, generated-model references, and imports.
Reference: a schema reference path and generated Python name.
GenerationStore: the parser-owned model list plus a query index over model and type dependencies.

classDiagram
    class Parser
    class GenerationStore
    class GenerationIndex
    class ModelResolver
    class DataModel
    class DataModelFieldBase
    class DataType
    class Reference

    Parser --> GenerationStore
    Parser --> ModelResolver
    GenerationStore --> GenerationIndex
    GenerationStore --> DataModel
    DataModel --> DataModelFieldBase
    DataModelFieldBase --> DataType
    DataModel --> Reference
    DataType --> Reference
    ModelResolver --> Reference

GenerationStore is the preferred mutation boundary for parser-side changes that affect dependency facts. Parser code should register models and update references, fields, bases, names, and paths through store methods instead of mutating the live objects directly. GenerationIndex rebuilds stable facts from the model list and gives later phases efficient queries such as "which data types point at this reference?".
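
The mutation-boundary idea can be illustrated with a toy store. SketchModel, SketchStore, and their methods are hypothetical and far simpler than GenerationStore and GenerationIndex; the sketch only shows why routing changes through the store lets a reverse index be invalidated and rebuilt reliably.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class SketchModel:
    name: str
    # Maps a field name to the name of the model that field references.
    field_refs: Dict[str, str] = field(default_factory=dict)


class SketchStore:
    """Hypothetical store: owns the model list and a lazily rebuilt reverse index."""

    def __init__(self) -> None:
        self._models: List[SketchModel] = []
        self._index: Optional[Dict[str, List[Tuple[str, str]]]] = None

    def register(self, model: SketchModel) -> None:
        self._models.append(model)
        self._index = None  # any mutation invalidates the index

    def set_field_ref(self, model_name: str, field_name: str, target: str) -> None:
        for model in self._models:
            if model.name == model_name:
                model.field_refs[field_name] = target
                self._index = None
                return
        raise KeyError(model_name)

    def users_of(self, target: str) -> List[Tuple[str, str]]:
        """Answer 'which (model, field) pairs point at this reference?'."""
        if self._index is None:
            index: Dict[str, List[Tuple[str, str]]] = defaultdict(list)
            for model in self._models:
                for field_name, ref in model.field_refs.items():
                    index[ref].append((model.name, field_name))
            self._index = dict(index)
        return list(self._index.get(target, []))
```

If calling code mutated model.field_refs directly, the cached index would go stale without the store noticing, which is the failure mode a store-mediated update avoids.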

References And Names

ModelResolver is the naming and reference authority. It tracks the current root, base path, base URL, root IDs, and known references while parsers traverse documents. It also applies naming options such as aliases, model_name_map, prefixes, suffixes, duplicate suffixes, enum member normalization, and field name safety.
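
Two of the naming concerns listed above, duplicate handling and field name safety, can be sketched as follows. SketchNameResolver and safe_field_name are hypothetical, and the specific rules (a numeric suffix for duplicates, a trailing underscore for keywords, a field_ prefix for leading digits) are assumptions for illustration rather than ModelResolver's documented behavior.

```python
import keyword
import re
from typing import Set


class SketchNameResolver:
    """Hypothetical resolver: builds unique class names with optional prefix/suffix."""

    def __init__(self, prefix: str = "", suffix: str = "") -> None:
        self.prefix = prefix
        self.suffix = suffix
        self._used: Set[str] = set()

    def class_name(self, raw: str) -> str:
        # Split on anything that is not an ASCII letter or digit, then capitalize each part.
        parts = [p for p in re.split(r"[^0-9A-Za-z]+", raw) if p]
        base = "".join(p[0].upper() + p[1:] for p in parts) or "Model"
        if base[0].isdigit():
            base = f"Model{base}"
        base = f"{self.prefix}{base}{self.suffix}"
        # Append a counter until the name is unique among names already handed out.
        candidate, counter = base, 1
        while candidate in self._used:
            candidate = f"{base}{counter}"
            counter += 1
        self._used.add(candidate)
        return candidate


def safe_field_name(raw: str) -> str:
    """Turn an arbitrary property name into a valid, non-keyword identifier."""
    name = re.sub(r"\W", "_", raw)
    if name[:1].isdigit():
        name = f"field_{name}"
    if keyword.iskeyword(name):
        name += "_"
    return name or "field"
```

Keeping this logic in one object means every parser produces names under the same rules, regardless of input format.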

Reference.children still links references back to their users, but newer parser post-processing should prefer GenerationIndex when it needs dependency facts. The index is rebuilt from live models, so it does not rely solely on legacy side effects.

Output Backends

Parsers do not hard-code Pydantic, dataclass, TypedDict, or msgspec classes. get_data_model_types() returns a DataModelSet for the selected DataModelType. That set injects:

the model class
the root model or type alias class
the field class
the type manager
optional reference dumping behavior
GraphQL scalar and union model classes

The same parser output can therefore render into different Python model styles while sharing the same reference and module-generation pipeline.
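
The injection pattern can be sketched as a lookup from an enum member to a bundle of backend pieces. SketchModelType, SketchModelSet, and get_sketch_model_set() are hypothetical simplifications; the real DataModelSet carries classes for models, fields, and type managers rather than a single render function.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Dict


class SketchModelType(Enum):
    PYDANTIC = "pydantic_v2.BaseModel"
    DATACLASS = "dataclasses.dataclass"


def _body(fields: Dict[str, str]) -> str:
    return "\n".join(f"    {name}: {hint}" for name, hint in fields.items()) or "    pass"


def render_pydantic(name: str, fields: Dict[str, str]) -> str:
    return f"class {name}(BaseModel):\n{_body(fields)}"


def render_dataclass(name: str, fields: Dict[str, str]) -> str:
    return f"@dataclass\nclass {name}:\n{_body(fields)}"


@dataclass(frozen=True)
class SketchModelSet:
    """Hypothetical bundle of backend-specific pieces handed to a parser."""

    render_model: Callable[[str, Dict[str, str]], str]
    base_import: str


_MODEL_SETS = {
    SketchModelType.PYDANTIC: SketchModelSet(render_pydantic, "from pydantic import BaseModel"),
    SketchModelType.DATACLASS: SketchModelSet(render_dataclass, "from dataclasses import dataclass"),
}


def get_sketch_model_set(model_type: SketchModelType) -> SketchModelSet:
    return _MODEL_SETS[model_type]
```

Parser-side code only sees the SketchModelSet interface, so switching the enum member changes the rendered style without touching parsing logic.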

Rendering And Formatting

Every DataModel renders through a Jinja2 template. Built-in templates live in src/datamodel_code_generator/model/template. A custom template directory can override a built-in template by path.
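
Override-by-path amounts to checking the custom directory before the built-in one. The helper below is a hypothetical sketch of that lookup using only pathlib; it does not reproduce how the project wires the chosen path into Jinja2.

```python
from pathlib import Path
from typing import Optional


def resolve_template(
    relative_path: str,
    builtin_dir: Path,
    custom_dir: Optional[Path] = None,
) -> Path:
    """Prefer a custom template at the same relative path, else use the built-in one."""
    if custom_dir is not None:
        candidate = custom_dir / relative_path
        if candidate.is_file():
            return candidate
    return builtin_dir / relative_path
```

Because the match is on the relative path, a custom directory only needs to contain the templates it wants to replace; everything else still resolves to the built-in copy.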

Imports are collected separately from rendering through Import and Imports. The import layer handles grouping, aliases, reference-bound imports, future imports, unused import removal, and __all__ generation.
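
Collecting imports apart from rendering can be pictured as a small accumulator that groups names by module and prunes unused ones before emitting text. SketchImports is a hypothetical reduction of that idea and ignores aliases, future imports, and __all__ handling.

```python
from collections import defaultdict
from typing import DefaultDict, Iterable, Optional, Set


class SketchImports:
    """Hypothetical import accumulator: group by module, prune, then render."""

    def __init__(self) -> None:
        # A key of None means a plain "import name" statement.
        self._names: DefaultDict[Optional[str], Set[str]] = defaultdict(set)

    def add(self, module: Optional[str], name: str) -> None:
        self._names[module].add(name)

    def remove_unused(self, used_names: Iterable[str]) -> None:
        used = set(used_names)
        for module in list(self._names):
            self._names[module] &= used
            if not self._names[module]:
                del self._names[module]

    def render(self) -> str:
        lines = []
        for module in sorted(self._names, key=lambda m: m or ""):
            names = sorted(self._names[module])
            if module is None:
                lines.extend(f"import {name}" for name in names)
            else:
                lines.append(f"from {module} import {', '.join(names)}")
        return "\n".join(lines)
```

Rendering imports last, after all models have declared what they need, is what allows duplicates to merge and unused names to be dropped in one place.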

CodeFormatter then applies the configured formatter pipeline. The formatter layer supports the built-in formatter, black, isort, ruff check, ruff format, and user-supplied custom formatters.
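
A formatter pipeline is, at its simplest, a sequence of string-to-string callables applied in order. The two formatters below are toy stand-ins, not black, isort, or ruff, and apply_formatters() is a hypothetical name; the sketch only shows the composition.

```python
from typing import Callable, Iterable

Formatter = Callable[[str], str]


def strip_trailing_whitespace(code: str) -> str:
    return "\n".join(line.rstrip() for line in code.splitlines()) + "\n"


def collapse_blank_lines(code: str) -> str:
    """Allow at most two consecutive blank lines."""
    kept, blanks = [], 0
    for line in code.splitlines():
        blanks = blanks + 1 if line == "" else 0
        if blanks <= 2:
            kept.append(line)
    return "\n".join(kept) + "\n"


def apply_formatters(code: str, formatters: Iterable[Formatter]) -> str:
    """Run each formatter in order, feeding each one the previous result."""
    for formatter in formatters:
        code = formatter(code)
    return code
```

Order matters in such a pipeline: here whitespace stripping runs first so that lines containing only spaces count as blank in the second step.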

Metadata Output

When --emit-model-metadata is enabled, Parser.parse() records source-reference information while producing models. generate() then serializes the resulting payload through model_metadata.py. The JSON Schema for that payload is stored in src/datamodel_code_generator/resources/model_metadata.schema.json and exposed through --output-format-json-schema model-metadata.

Runtime Dynamic Models

generate_dynamic_models() uses the normal generate() API to produce code, executes that code in temporary modules, and returns real Pydantic v2 model classes. Multi-module output is topologically sorted by relative imports before execution. The result is cached by schema and config hash when caching is enabled.
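
The execute-and-cache step can be sketched with types.ModuleType and a hash over the schema and config. This is not the implementation of generate_dynamic_models(): cache_key(), load_models(), and the render callback are hypothetical, and the sketch handles a single module only.

```python
import hashlib
import json
import types
from typing import Callable, Dict

_CACHE: Dict[str, types.ModuleType] = {}


def cache_key(schema: dict, config: dict) -> str:
    """Hash schema and config together; sort_keys makes key order irrelevant."""
    payload = json.dumps({"schema": schema, "config": config}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def load_models(
    schema: dict,
    config: dict,
    render: Callable[[dict, dict], str],
) -> types.ModuleType:
    """Render code once per (schema, config) pair and execute it in a fresh module."""
    key = cache_key(schema, config)
    module = _CACHE.get(key)
    if module is None:
        module = types.ModuleType(f"sketch_dynamic_{key[:12]}")
        # Executing generated code is only appropriate for trusted schemas.
        exec(render(schema, config), module.__dict__)
        _CACHE[key] = module
    return module
```

A second call with an equal schema and config returns the cached module without rendering again, so classes obtained from both calls are the same objects.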

Performance-Sensitive Paths

Recent optimization work has focused on doing less per run without changing the generation model:

fast CLI paths avoid importing heavy modules
format.py keeps format-related types in a lighter helper module
local schema sources are reused during $ref resolution
YAML unsupported-tag scanning is skipped when no unsupported tag marker is present
JSON Schema constraint extraction avoids dumping whole schema objects when only selected constraint keys are needed
simple field import collection avoids rendering a full type hint when DataType facts are sufficient

These optimizations keep the architecture stable: parsers still build the same graph, and renderers still produce the same code, but hot paths do less incidental work.
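
The YAML item in the list above follows a common pattern: a cheap substring test guards a more expensive scan. The sketch below assumes that "!" is the marker and uses a deliberately naive regular expression; find_yaml_tags() is hypothetical and is not the project's scanner.

```python
import re
from typing import List

# Naive pattern: "!" or "!!" at the start of a token, followed by tag characters.
_TAG_PATTERN = re.compile(r"(?<!\S)!{1,2}[\w/.:-]+")


def find_yaml_tags(text: str) -> List[str]:
    """Return tag-like tokens, skipping the regex scan when no '!' is present."""
    if "!" not in text:  # cheap guard: most documents never reach the regex
        return []
    return _TAG_PATTERN.findall(text)
```

The result is identical with or without the guard; only the amount of work on the common tag-free input changes, which matches the claim that these optimizations leave output untouched.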

Keeping This Page Synchronized

Run this before release or whenever parser routes, output backends, config models, or formatter names change:

python scripts/build_architecture_docs.py

CI can validate the generated section without rewriting files:

python scripts/build_architecture_docs.py --check

The repository test suite includes this check through tests/test_build_architecture_docs_script.py, so pull requests fail when the generated inventory is stale. The generated-docs tox environment also runs this script before build_llms_txt.py, which keeps the architecture page and LLM documentation in the same release-time sync flow.
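
A check mode of this kind usually regenerates the section in memory and compares it with what is on disk. The sketch below assumes hypothetical marker comments and function names; it does not describe how scripts/build_architecture_docs.py delimits or compares its output.

```python
# Hypothetical markers delimiting the generated region of the page.
START = "<!-- generated:start -->"
END = "<!-- generated:end -->"


def replace_generated(document: str, generated: str) -> str:
    """Return the document with the region between the markers replaced."""
    start = document.index(START) + len(START)
    end = document.index(END, start)
    return f"{document[:start]}\n{generated}\n{document[end:]}"


def is_stale(document: str, generated: str) -> bool:
    """Check mode: report whether a rewrite would change the document."""
    return replace_generated(document, generated) != document
```

A write mode would save the result of replace_generated(), while a check mode would exit non-zero when is_stale() returns True and leave the file untouched.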