
Duffy SDK Vision

Status: Draft
Last updated: 2026-02-27


The Problem

A data scientist or AI engineer who wants to work with graph-structured data today faces a fragmented stack:

  1. Graph database (Neo4j, or NetworkX until it falls over at 5M nodes)
  2. Vector database (Pinecone, Weaviate, Chroma) — separate system, separate API
  3. Relational database (Postgres) — where their "real" data lives
  4. Glue code — bespoke ETL to move data between all three, convert results into DataFrames, wrangle embeddings, build feature tensors for training

Each boundary is a place where time is wasted, bugs hide, and semantic context is lost.

The AGE + pgvector + PostgreSQL combination eliminates the infrastructure fragmentation — graph, vector, relational, time-series, geospatial, documents, all in one engine. But that potential is unrealized because there is no SDK that makes it accessible. The current AGE Python driver (v0.0.7, last released Sep 2023) provides raw Cypher execution over psycopg2 with no DataFrame output, no async, no type hints, no vector awareness, no query builder, and no connection pooling.

Duffy is the SDK that makes the unified PostgreSQL stack usable for data science and AI workflows.


Target Personas

Primary: The Data Scientist / ML Engineer

Works in Jupyter. Thinks in DataFrames and tensors. Knows enough SQL to be productive. Has heard of Cypher but hasn't committed to a graph database because the setup cost is too high and integration with their existing tools (pandas, PyTorch, scikit-learn) is poor.

What they want:

  • Graph query results as DataFrames, not custom objects they have to parse
  • Embeddings managed alongside graph structure, not in a separate system
  • Subgraph extraction directly into PyTorch Geometric for GNN training
  • One pip install and a Postgres connection string, not three databases to operate

Secondary: The AI/RAG Engineer

Building retrieval-augmented generation pipelines. Currently using LangChain or LlamaIndex with a vector store and maybe a knowledge graph, but the two retrieval paths are duct-taped together.

What they want:

  • A single backend that serves as both knowledge graph AND vector store
  • Graph-aware retrieval: "find semantically similar nodes within this graph neighborhood"
  • Drop-in LangChain/LlamaIndex integration that replaces separate Neo4j + Pinecone configs
  • Entity resolution and knowledge graph construction pipelines that actually work

Tertiary: The Domain Expert

Knows their domain deeply. Recognizes that their problems are graph problems. Doesn't want to become a graph database expert to solve them.

What they want:

  • Domain-specific APIs that encode expert graph modeling patterns
  • "Model my attack surface," not "write Cypher to traverse your network topology"
  • Compliance frameworks, case law networks, financial dependency graphs — as library primitives


Architecture: Three Layers

┌─────────────────────────────────────────────────────────────┐
│                   LAYER 3: DOMAIN APIs                       │
│                                                              │
│  Security & Risk │ Knowledge Graphs │ Financial │ Legal      │
│  Attack graphs   │ Entity resolution│ Fraud     │ Case law   │
│  Vuln propagation│ Ontology mgmt    │ TBM/ROI   │ Precedent  │
│  Compliance      │ Graph-aware RAG  │ Lineage   │ Filing     │
├─────────────────────────────────────────────────────────────┤
│                 LAYER 2: TOOL INTEGRATION                    │
│                                                              │
│  pandas/Arrow │ PyTorch Geometric │ LangChain/LlamaIndex    │
│  NetworkX     │ DuckDB            │ MLflow/W&B              │
│  Jupyter viz  │ scikit-learn      │ Hugging Face            │
├─────────────────────────────────────────────────────────────┤
│                 LAYER 1: SMART PRIMITIVES                    │
│                                                              │
│  Graph ops    │ Vector ops    │ Hybrid ops   │ Bulk ops      │
│  Cypher query │ Embed mgmt   │ Graph+vector │ Ingest/export │
│  Query builder│ Index mgmt   │ Constrained  │ Schema mgmt   │
│  Traversal    │ Similarity   │ similarity   │ Migration     │
│  Algorithms   │ Versioning   │              │               │
├─────────────────────────────────────────────────────────────┤
│                       FOUNDATION                             │
│                                                              │
│  Connection pool │ Async (psycopg3) │ Type safety           │
│  Transaction mgmt│ Result parsing   │ Error handling         │
│  Schema intro-   │ AGE + pgvector   │ DataFrame/Arrow       │
│  spection        │ dialect mgmt     │ native output         │
└─────────────────────────────────────────────────────────────┘
                   PostgreSQL 18 + AGE + pgvector

Layer 1: Smart Primitives

The foundation. Every operation a data scientist does with graph + vector data, without glue code.

1.1 Modern Driver (replaces apache-age-python)

The current AGE Python driver is a thin psycopg2 wrapper that parses AGType results. Duffy's foundation replaces it entirely:

  • psycopg3 (async-native) — connection pooling, pipeline mode, COPY protocol
  • DataFrame-native output — query() returns pd.DataFrame by default, not custom objects. Optional pyarrow.Table output for zero-copy interop.
  • Type-safe — full type hints, dataclass result mapping, IDE completion
  • Query builder — programmatic Cypher construction with parameterization, not string concatenation
  • Schema introspection — discover graphs, labels, property types, indexes programmatically
  • Transaction management — context managers, savepoints, read/write routing
import duffy

db = duffy.connect("postgresql://localhost:5433/duffy")

# Cypher results as DataFrame — zero conversion
df = db.cypher("MATCH (p:Person)-[:KNOWS]->(f) RETURN p.name, f.name, f.age")
# Returns pd.DataFrame with columns: [p.name, f.name, f.age]

# Query builder
from duffy import match, node, rel
q = match(node("p", "Person")).rel("KNOWS").node("f").where("f.age > 30").ret("p", "f")
df = db.execute(q)

# Async
async with duffy.connect_async(dsn) as db:
    df = await db.cypher("MATCH (n:Asset) RETURN n")

Depends on upstream: agtype-to-JSONB cast (#350) would simplify the driver significantly but is not blocking — we can parse AGType directly like the current driver does, just better.
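The "parse AGType directly" path is simpler than it sounds. A minimal sketch, assuming AGE's wire format of JSON values annotated with `::vertex` / `::edge` / `::path` suffixes (the real parser must also handle edge cases this one ignores):

```python
import json
import re

# Assumption: AGType is JSON with ::vertex / ::edge / ::path annotations
# appended to composite values; stripping them leaves valid JSON.
_AGTYPE_ANNOTATION = re.compile(r"::(vertex|edge|path)")

def parse_agtype(raw: str):
    """Strip AGType type annotations, then decode the remaining JSON."""
    return json.loads(_AGTYPE_ANNOTATION.sub("", raw))

row = '{"id": 1, "label": "Person", "properties": {"name": "Ann"}}::vertex'
vertex = parse_agtype(row)
print(vertex["properties"]["name"])  # Ann
```

A production parser would track annotations per value (for typed result mapping) rather than discarding them, but the decode path stays this shape.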

1.2 Graph Operations

Graph algorithms and traversal patterns as callable functions, not raw Cypher that the user has to write and debug:

  • Centrality — degree, betweenness, closeness, PageRank (push computation to PG where possible)
  • Community detection — Louvain, label propagation
  • Path analysis — shortest path, all paths, weighted paths
  • Subgraph extraction — by label, by property filter, by neighborhood (k-hop)
  • Graph metrics — density, diameter, clustering coefficient, component analysis
# PageRank on a subgraph
ranks = db.graph("social").pagerank(
    label="Person",
    rel_type="KNOWS",
    damping=0.85,
    iterations=20
)
# Returns DataFrame: [node_id, name, rank]

# K-hop neighborhood extraction
subgraph = db.graph("network").neighborhood(
    start={"name": "router-01"},
    hops=3,
    rel_types=["CONNECTS_TO", "DEPENDS_ON"]
)
# Returns a Subgraph object with .nodes_df, .edges_df, .to_pyg(), .to_networkx()

Implementation strategy: Start with Python-side computation over query results. As we identify hot paths, push them into PG functions or AGE extensions. The API stays the same either way.
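To make the "Python-side first" strategy concrete, here is a hedged sketch of PageRank computed over an edge list pulled from a Cypher query — illustrative only, not the Duffy API, and it folds dangling-node mass into the base term for brevity:

```python
def pagerank(edges, damping=0.85, iterations=20):
    """Power-iteration PageRank over a list of (src, dst) edges."""
    nodes = {n for edge in edges for n in edge}
    out_deg = {n: 0 for n in nodes}
    for src, _ in edges:
        out_deg[src] += 1
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        incoming = {n: 0.0 for n in nodes}
        for src, dst in edges:
            incoming[dst] += rank[src] / out_deg[src]
        # dangling nodes simply stop contributing here; a faithful
        # implementation would redistribute their rank uniformly
        rank = {n: (1 - damping) / len(nodes) + damping * incoming[n]
                for n in nodes}
    return rank
```

The SDK version would return the same scores as a DataFrame keyed by node id, so swapping in an in-database implementation later changes nothing for the caller.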

1.3 Vector Operations

Embedding lifecycle management — generate, store, index, search, update — as a coherent API instead of raw pgvector SQL:

  • Embedding management — store, retrieve, update, delete embeddings on graph nodes/edges
  • Index management — create/manage HNSW and IVFFlat indexes, tune parameters
  • Similarity search — k-NN, filtered search, range search
  • Embedding generation — pluggable providers (sentence-transformers, OpenAI, Cohere, local models)
  • Versioning — track embedding model versions, A/B test different embeddings
# Attach embeddings to graph nodes
from duffy.embeddings import SentenceTransformerProvider

embedder = SentenceTransformerProvider("all-MiniLM-L6-v2")

# Embed a property of existing nodes
db.graph("legal").embed(
    label="Case",
    source_property="summary",
    embedding_property="summary_vec",
    provider=embedder,
    batch_size=256
)

# Similarity search
similar = db.vector.search(
    label="Case",
    embedding_property="summary_vec",
    query="wrongful termination in financial services",
    k=20,
    provider=embedder
)
# Returns DataFrame: [node_id, case_name, summary, similarity_score]

Depends on upstream: pgvector integration (#1121) at the AGE level would make hybrid queries cleaner, but the SDK can work today using SQL-bridge patterns (Cypher for graph structure, SQL for vector ops, joined at the SDK layer).
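For intuition about what `db.vector.search` delegates to pgvector, here is the brute-force equivalent in pure Python — cosine-similarity k-NN over stored embeddings. The helper names are illustrative; pgvector does this with an HNSW or IVFFlat index instead of a full scan:

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def knn(query_vec, rows, k):
    """rows: list of (node_id, embedding); returns top-k by similarity."""
    scored = [(node_id, cosine(query_vec, vec)) for node_id, vec in rows]
    return sorted(scored, key=lambda t: t[1], reverse=True)[:k]
```

The DataFrame the SDK returns is just this ranking joined back onto the nodes' graph properties.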

1.4 Hybrid Operations (the differentiator)

The thing nobody else provides: graph traversal constrained by vector similarity, and vector search scoped by graph structure, in a single API call.

# "Find cases semantically similar to this query,
#  cited by cases in the same court"
results = db.hybrid.graph_then_vector(
    # Graph: traverse to candidates
    match="(c:Case {court: 'NYSC'})-[:CITES]->(p:Precedent)",
    # Vector: rank candidates by similarity
    embed_property="p.summary_vec",
    query="wrongful termination in financial services",
    k=10,
    provider=embedder
)

# "Find the 20 most similar assets to this compromised one,
#  then find everything within 3 hops of those"
results = db.hybrid.vector_then_graph(
    # Vector: find similar starting points
    label="Asset",
    embed_property="config_vec",
    query_node={"name": "compromised-server-01"},
    k=20,
    # Graph: expand neighborhood
    expand_hops=3,
    expand_rels=["CONNECTS_TO", "RUNS_ON"]
)

Implementation: Under the hood, this generates a SQL query that combines ag_catalog.cypher() with pgvector operators. The SDK handles the impedance mismatch so the user never writes that SQL.
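As a rough sketch of that generated SQL — not the actual output, and the table/column names and the agtype-to-bigint cast are assumptions — the bridge pattern looks like: the `ag_catalog.cypher()` call produces candidate node ids, which are joined against a pgvector column and ranked with the `<=>` cosine-distance operator:

```python
def graph_then_vector_sql(graph, cypher_pattern, embed_table, k):
    """Compose the graph-then-vector bridge query as a parameterized string.

    Hypothetical helper: assumes an `embed_table` with (node_id, embedding)
    columns and that the Cypher pattern binds the candidate node as `p`.
    """
    return (
        "SELECT c.node_id, 1 - (e.embedding <=> %(query_vec)s) AS similarity\n"
        f"FROM ag_catalog.cypher('{graph}', $$ MATCH {cypher_pattern} "
        "RETURN id(p) $$) AS c(node_id agtype)\n"
        f"JOIN {embed_table} e ON e.node_id = c.node_id::bigint\n"
        "ORDER BY e.embedding <=> %(query_vec)s\n"
        f"LIMIT {k}"
    )
```

Ordering by the raw `<=>` distance (rather than the derived similarity) lets pgvector use its index for the scan.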

1.5 Bulk Operations

Data scientists work with datasets, not individual rows. Bulk operations must be first-class:

  • Ingest from DataFrame — db.graph("g").load_nodes(df, label="Person", id_col="email")
  • Ingest from CSV/Parquet — direct file loading with schema inference
  • Upsert semantics — MERGE-based idempotent loading (depends on MERGE ON SET #1619)
  • Export — graph to CSV/Parquet/Arrow for downstream tools
  • Schema migration — add properties, rename labels, evolve the graph schema over time
import pandas as pd

# DataFrame -> graph
people_df = pd.read_csv("people.csv")
edges_df = pd.read_csv("relationships.csv")

db.graph("social").load_nodes(people_df, label="Person", id_col="email")
db.graph("social").load_edges(
    edges_df,
    rel_type="KNOWS",
    source_col="from_email", source_label="Person",
    target_col="to_email", target_label="Person"
)

# Graph -> DataFrame
nodes_df, edges_df = db.graph("social").export(labels=["Person"], rel_types=["KNOWS"])

Depends on upstream: Bulk load via client connection (#432) and edge creation performance (#2198) are relevant. The SDK can use COPY protocol via psycopg3 regardless, bypassing AGE's load_from_file.
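The COPY path reduces to serializing DataFrame rows into the CSV stream that `COPY ... FROM STDIN` consumes (via psycopg3's `copy()` API in the real driver). A minimal sketch of that serialization step, with illustrative column names:

```python
import csv
import io

def rows_to_copy_payload(rows, columns):
    """Serialize dict rows to an in-memory CSV payload for COPY FROM STDIN."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    for row in rows:
        writer.writerow([row[c] for c in columns])
    return buf.getvalue()

payload = rows_to_copy_payload(
    [{"email": "ann@example.com", "name": "Ann"}],
    columns=["email", "name"],
)
```

In practice the driver streams batches rather than buffering the whole payload, but the row-to-wire mapping is the same.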


Layer 2: Tool Integration

Meet data scientists where they already work. These aren't wrappers — they're native integrations that understand the semantics of graph + vector data.

2.1 PyTorch Geometric

Implement PyG's FeatureStore and GraphStore interfaces backed by AGE + pgvector. This makes Duffy a training data source for GNNs without loading the entire graph into memory.

from duffy.integrations.pyg import DuffyGraphStore, DuffyFeatureStore
from torch_geometric.loader import NeighborLoader

graph_store = DuffyGraphStore(db, graph_name="social")
feature_store = DuffyFeatureStore(db, graph_name="social")

# Neighbor sampling happens in-database, not in Python
loader = NeighborLoader(
    data=(feature_store, graph_store),
    num_neighbors=[10, 5],
    batch_size=128,
    input_nodes="Person"
)

for batch in loader:
    # batch is a standard PyG HeteroData object
    out = model(batch.x_dict, batch.edge_index_dict)

No upstream dependency. This is pure SDK work. Kuzu had the best implementation of this pattern before the project was archived — we can learn from their design.
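For reference, the sampling semantics NeighborLoader expects look like this layer-wise fanout — a pure-Python sketch of what Duffy would push into SQL rather than run client-side:

```python
import random

def sample_neighbors(adj, seeds, num_neighbors, rng):
    """Layer-wise neighbor sampling.

    adj: {node: [neighbors]}; num_neighbors: fanout per hop, e.g. [10, 5].
    Returns the set of sampled nodes (seeds plus sampled neighborhood).
    """
    frontier, visited = list(seeds), set(seeds)
    for fanout in num_neighbors:
        next_frontier = []
        for node in frontier:
            nbrs = adj.get(node, [])
            # sample without replacement up to the hop's fanout
            picked = nbrs if len(nbrs) <= fanout else rng.sample(nbrs, fanout)
            next_frontier.extend(n for n in picked if n not in visited)
            visited.update(picked)
        frontier = next_frontier
    return visited
```

The in-database version would express each hop as a Cypher/SQL expansion with a per-node LIMIT, returning only the sampled edge index and features to Python.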

2.2 LangChain / LlamaIndex

Drop-in backends that make AGE+pgvector a combined knowledge graph and vector store:

# LlamaIndex
from duffy.integrations.llamaindex import DuffyGraphVectorStore

store = DuffyGraphVectorStore(db, graph_name="knowledge")
index = KnowledgeGraphIndex.from_documents(documents, storage_context=store)
# Hybrid retrieval: graph structure + vector similarity in one query

# LangChain
from duffy.integrations.langchain import DuffyGraphStore, DuffyVectorStore

graph = DuffyGraphStore(db)
vectors = DuffyVectorStore(db, embedding=embedder)
# Both backed by the same PostgreSQL instance

Addresses AGE issue #1783 directly. Can be contributed upstream to LlamaIndex and used within Duffy.

2.3 pandas / Arrow (built into Layer 1)

Not a separate integration — this IS the default output format. Every query returns a DataFrame. Every input accepts a DataFrame. Arrow is available for zero-copy handoff to DuckDB, Polars, or any Arrow-compatible tool.

2.4 NetworkX / igraph

Export for visualization, analysis, and interop with the existing Python graph ecosystem:

G = db.graph("social").to_networkx(labels=["Person"], rel_types=["KNOWS"])
# Standard NetworkX graph, works with all nx algorithms and matplotlib viz

ig = db.graph("social").to_igraph(labels=["Person"], rel_types=["KNOWS"])
# igraph for large-graph analysis (faster than NetworkX)

2.5 Jupyter

Native notebook visualization — graph rendering in cells without exporting to external tools:

# In a Jupyter cell
db.graph("social").viz(
    labels=["Person", "Company"],
    rel_types=["WORKS_AT", "KNOWS"],
    layout="force",
    color_by="label",
    size_by="pagerank"
)
# Renders an interactive graph visualization inline

Implementation: Likely wraps PyGraphistry or pyvis for rendering. The value is the integration with the Duffy query layer, not the rendering engine.


Layer 3: Domain APIs

Thin, opinionated layers over Layers 1 and 2 that encode how an expert in each domain would model and query graph + vector data. The user doesn't need to become a graph expert — the API encodes the expertise.

3.1 Security & Risk (duffy.domains.security)

from duffy.domains.security import AttackGraph, ComplianceGraph

# Build attack graph from network topology + vulnerability scan
ag = AttackGraph(db, graph_name="attack_surface")
ag.ingest_topology(nmap_df)           # network scan results
ag.ingest_vulnerabilities(cve_df)     # vulnerability scan results
ag.compute_attack_paths(target="dc-01", max_depth=5)

# Query: what's the blast radius if this host is compromised?
blast = ag.blast_radius("web-server-03")
# Returns subgraph of reachable assets with propagation scores

# Compliance posture as a graph
cg = ComplianceGraph(db, graph_name="compliance")
cg.load_framework("NIST_CSF_2.0")    # reference layer
cg.map_controls(controls_df)          # instance layer
cg.assess_coverage()
# Returns DataFrame: [control_id, requirement, status, evidence_count, gaps]
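The blast-radius query above amounts to a reachability traversal with a per-hop propagation score. A hedged sketch of that core — the decay scoring model is an illustrative assumption, not the shipped algorithm:

```python
from collections import deque

def blast_radius(adj, start, decay=0.5):
    """BFS from a compromised node; score decays by `decay` per hop.

    adj: {asset: [dependent assets]}. Returns {asset: propagation score}.
    """
    scores = {start: 1.0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nbr in adj.get(node, []):
            if nbr not in scores:  # first (shortest) path wins
                scores[nbr] = scores[node] * decay
                queue.append(nbr)
    return scores
```

The SDK version runs the traversal in Cypher and returns the scored subgraph; this shows only the scoring semantics.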

3.2 Knowledge Graphs & RAG (duffy.domains.knowledge)

from duffy.domains.knowledge import KnowledgeGraph

kg = KnowledgeGraph(db, graph_name="docs")

# Build KG from documents with entity resolution
kg.ingest_documents(
    documents,
    extractor="llm",                  # LLM-based entity/relation extraction
    embedder=embedder,                # embed chunks alongside graph structure
    resolve_entities=True             # deduplicate entities across documents
)

# Graph-aware RAG retrieval
results = kg.retrieve(
    query="What are the compliance requirements for PCI DSS in cloud environments?",
    strategy="hybrid",                # vector similarity + graph traversal
    graph_hops=2,                     # expand to 2-hop neighbors of vector matches
    k=10
)

3.3 Financial (duffy.domains.financial)

from duffy.domains.financial import DependencyGraph, CostModel

# Technology dependency graph with cost propagation
dg = DependencyGraph(db, graph_name="tech_stack")
dg.load_assets(assets_df)
dg.load_dependencies(deps_df)
dg.load_costs(costs_df)              # star schema: measures stay relational

# What-if analysis: "what happens if we decommission Oracle?"
impact = dg.decommission_impact("Oracle Database", cascade=True)
# Returns: affected_assets, cost_delta, migration_paths, risk_score

3.4 Legal (duffy.domains.legal)

from duffy.domains.legal import CaseLawGraph

clg = CaseLawGraph(db, graph_name="case_law")
clg.ingest_cases(cases_df, embed_property="summary")

# Find persuasive precedent
precedents = clg.find_precedent(
    query="wrongful termination constructive dismissal",
    court="NYSC",
    strategy="citation_weighted",     # weight by citation graph PageRank
    k=10
)
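The "citation_weighted" strategy blends vector similarity with authority derived from the citation graph. A sketch under stated assumptions — the 0.7/0.3 blend and min-free normalization are illustrative, not the shipped formula:

```python
def citation_weighted(candidates, pagerank, alpha=0.7):
    """Blend similarity with normalized citation-graph PageRank.

    candidates: list of (case_id, similarity); pagerank: {case_id: score}.
    Returns candidates re-ranked by the blended score, highest first.
    """
    max_pr = max(pagerank.values(), default=0.0) or 1.0
    scored = [
        (cid, alpha * sim + (1 - alpha) * pagerank.get(cid, 0.0) / max_pr)
        for cid, sim in candidates
    ]
    return sorted(scored, key=lambda t: t[1], reverse=True)
```

A heavily cited precedent can therefore outrank a marginally more similar but rarely cited one, which matches how lawyers actually weigh authority.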

What Must Exist in AGE for This to Work

The SDK can work around AGE limitations at the Python layer, but certain upstream improvements make the SDK dramatically better. These are our Track B feature contributions — AGE PRs driven by what the SDK needs:

  SDK need                     AGE gap                                     Issue      Priority
  Idempotent graph loading     MERGE ON CREATE/MATCH SET                   #1619      Immediate (we own this)
  Graph traversal filtering    Predicate functions (all/any/none/single)   #552-556   Next sprint
  Pattern-based filtering      Pattern matching in WHERE                   #1577      Next sprint
  Complex ETL pipelines        CALL subquery                               #1690      Medium-term
  Data integrity               Unique constraints                          #45        Medium-term
  Driver simplification        agtype → JSONB cast                         #350       Quick win
  Hybrid graph+vector queries  pgvector integration                        #1121      Strategic (our differentiator)
  Bulk data loading            Client-side bulk load                       #432       SDK can work around
  Real-world graph modeling    Multiple labels                             #2082      Help existing PR land

Build Sequence

Phase 1: Foundation + First Proof of Value (Months 1-2)

Build the driver and one integration that proves the concept.

Milestone 1a — Modern Driver:

  • [ ] psycopg3 async connection pool
  • [ ] Cypher execution with DataFrame output (pd.DataFrame, pyarrow.Table)
  • [ ] AGType parsing (port + modernize existing parser)
  • [ ] Type hints, schema introspection
  • [ ] Basic query builder
  • [ ] Package structure, CI, pyproject.toml

Milestone 1b — Hybrid Query Proof of Concept:

  • [ ] Vector operations: embed, store, search on graph nodes
  • [ ] One hybrid query pattern working end-to-end (graph_then_vector)
  • [ ] Jupyter notebook demo: "Graph + Vector search in one PostgreSQL instance"

Milestone 1c — Parallel AGE Contributions:

  • [ ] MERGE ON CREATE/MATCH SET PR submitted (#1619)
  • [ ] Predicate functions PR submitted (#552-556)
  • [ ] agtype → JSONB cast PR submitted (#350)

Ship: pip install duffy with basic graph queries returning DataFrames, vector search on graph nodes, and one hybrid query pattern. Enough for a blog post and a demo.

Phase 2: Integrations (Months 3-4)

Make Duffy work with the tools people already use.

  • [ ] PyTorch Geometric FeatureStore/GraphStore backend
  • [ ] LlamaIndex GraphStore integration (contributes to issue #1783)
  • [ ] LangChain vector store + graph store backends
  • [ ] NetworkX/igraph export
  • [ ] Bulk operations (DataFrame → graph, graph → DataFrame)
  • [ ] Graph algorithm library (PageRank, community detection, centrality)

Ship: Duffy as a viable alternative to Neo4j's Python ecosystem for data science and RAG workflows.

Phase 3: Domain APIs (Months 5-8)

Build the opinionated layers that make specific verticals productive immediately.

  • [ ] duffy.domains.security — attack graphs, compliance graphs, vulnerability propagation
  • [ ] duffy.domains.knowledge — entity resolution, KG construction, graph-aware RAG
  • [ ] duffy.domains.financial — dependency graphs, cost models, what-if analysis
  • [ ] duffy.domains.legal — case law graphs, precedent search, citation analysis

Ship: Domain-specific demos and documentation that show Duffy solving real problems, not just providing infrastructure.

Phase 4: Production Hardening (Months 9-12)

  • [ ] Benchmark suite (vs Neo4j Python driver, vs raw psycopg3, vs standalone pgvector)
  • [ ] Connection pool tuning, query caching, materialized view integration
  • [ ] Observability (query logging, performance metrics, OpenTelemetry)
  • [ ] Migration tools (Neo4j → Duffy, standalone pgvector → Duffy)
  • [ ] Comprehensive documentation, tutorials, example notebooks

Competitive Position

  Capability            Duffy                           Neo4j Python           Kuzu (archived)          Raw psycopg3 + pgvector
  Graph queries         Cypher via AGE                  Cypher (native)        Cypher (embedded)        No graph support
  Vector search         pgvector (same DB)              Bolt-on                No                       pgvector only
  Hybrid graph+vector   First-class API                 Limited                No                       Manual SQL
  DataFrame output      Default                         Supported              Manual                   Manual
  PyG integration       FeatureStore/GraphStore         Via GDS (enterprise)   Had it (best in class)   No
  LangChain/LlamaIndex  Drop-in backend                 Drop-in (Neo4j only)   No                       Separate vector store
  Domain APIs           Security, Legal, Financial, KG  No                     No                       No
  Operational burden    One PostgreSQL instance         Separate DB server     Embedded (no server)     One PostgreSQL instance
  License               Apache 2.0                      GPL/Commercial         MIT (archived)           PostgreSQL + Apache 2.0
  Async support         Yes (psycopg3)                  Yes                    No                       Yes

The pitch: Neo4j's Python ecosystem is the gold standard, but it requires a separate database engine and the commercial license is expensive. Duffy provides the same developer experience on the PostgreSQL stack you already run, with the added capability of hybrid graph+vector queries that Neo4j can't match natively.


Open Design Questions

  1. Package name: duffy on PyPI? Or age-toolkit / pg-graph for discoverability? The name needs to signal "graph + vector + PostgreSQL" to someone scanning PyPI.
  2. Minimum viable driver scope: How thin can Milestone 1a be while still being useful? The temptation is to build everything — resist it.
  3. Graph algorithm execution: Python-side vs PG function vs AGE extension? Start Python-side for speed, push to PG for performance-critical paths. But where's the line?
  4. Embedding provider abstraction: How tightly should we couple to specific embedding providers? Pluggable interface, but which providers are first-class?
  5. Domain API granularity: Each domain could be its own package (duffy-security, duffy-legal) or part of the monorepo. Separate packages reduce install size but increase maintenance.

This document is the product vision for the Duffy SDK. It should be updated as design decisions are made and milestones are completed. See docs/research/age-feature-prioritization.md for the AGE upstream backlog that feeds Track B contributions.