Graph Database Capacity Experiment Plan

Date: 2026-02-16
Author: f.luo
Status: Draft, awaiting review

See also:

Action Plan — sequenced steps, decision gates, and timeline

Infrastructure Solidification Roadmap — phased approach to improving EC2 infra and managed DB options analysis

Results from this experiment plan inform the Action Plan’s decision gates — specifically whether to invest further in EC2 (solidification Phases 1-2) or migrate to a managed service (Phase 3).

1. Goals

Find the breaking point of EC2 Neo4j — How many users can the current r6i.xlarge (32GB) hold before performance degrades? What about with vectors? (Informs Action Plan — Decision Gate A)
Evaluate AuraDB — Is managed Neo4j viable? What’s the cost/performance tradeoff vs self-managed EC2? (Informs Action Plan — Decision Gate B)
Evaluate Neptune Analytics — Can it replace Neo4j? How does graph+vector unification compare to Neo4j + external vector store? (Informs Action Plan — Decision Gate B)
Compare vector storage options — Neo4j native, Neptune Analytics, Valkey VSS, OpenSearch Serverless
Establish cost models — $/user/month for each backend at 100K, 1M, 5M users

2. Decision: Repository Strategy

Recommendation: New repo — `graph-capacity-experiments`

Why not keep it in consumer-graph-worker?

| Concern | consumer-graph-worker | New repo |
| --- | --- | --- |
| Lifecycle | Long-lived production service | Throwaway experiments |
| Dependencies | Go + Neo4j driver + Kafka | Go + Neo4j driver + Neptune SDK + OpenSearch SDK + vector libs |
| CI/CD | Build → Docker → ECS deploy | Build → run locally or on EC2 |
| Data generators | Not appropriate in prod code | Core purpose |
| Risk | Benchmark code could accidentally ship | Isolated |

Repo structure:

graph-capacity-experiments/
├── cmd/
│   ├── datagen/           # Synthetic user + relationship generator (uses real catalog)
│   │   └── main.go
│   ├── loader/            # Multi-backend data loader
│   │   └── main.go
│   └── benchmark/         # Benchmark runner
│       └── main.go
├── scripts/
│   ├── export_catalog.sh              # Export real products/categories/retailers from Neo4j
│   ├── export_snowflake_catalog.py    # Export larger catalog from Snowflake
│   └── embed.py                       # Generate embeddings (OpenAI or sentence-transformer)
├── internal/
│   ├── datagen/           # Data generation logic
│   │   ├── users.go       # Synthetic user generation
│   │   ├── purchases.go   # Purchase relationship generation (assigns real products to fake users)
│   │   └── distributions.go  # Statistical distributions matching prod
│   ├── loader/            # Backend-specific loaders
│   │   ├── neo4j.go       # Bolt protocol (EC2 + AuraDB)
│   │   ├── neptune.go     # openCypher over HTTPS
│   │   └── vectors.go     # Vector-specific loaders (Valkey VSS, OpenSearch)
│   ├── benchmark/         # Benchmark queries + harness
│   │   ├── queries.go     # Standard query set
│   │   ├── runner.go      # Execution + timing
│   │   └── report.go      # Results formatting
│   └── model/             # Shared data model (mirrors consumer-graph-worker types)
│       └── types.go
├── data/
│   └── catalog/           # Real product/category/retailer data (exported, gitignored)
├── infra/                 # FSD configs for experiment instances
│   ├── experiment-neo4j-ec2.yml
│   └── experiment-neptune.yml
├── results/               # Benchmark results (committed for reference)
│   └── .gitkeep
├── go.mod
├── go.sum
├── Makefile
└── README.md

Alternative considered: Keep in consumer-graph-worker under experiments/. Rejected because it adds unnecessary dependencies to the production module and blurs the boundary.

3. Infrastructure Plan

3a. EC2 Neo4j Stress Test Instance

Clone the existing consumer-graph-neo4j-ec2.yml with modifications:

variables:
  default:
    instance_type: 'r6i.xlarge'   # Start with same as prod (32GB)
    data_volume_size: '500'
  stage:
    instance_type: 'r6i.xlarge'
    data_volume_size: '500'

tags:
  service: consumer-graph-neo4j-experiment
  purpose: capacity-testing
  ttl: 30d   # Remind to tear down

Deploy to stage account only (cheaper, no prod risk):

fsd service ec2 deploy --env stage --account stage-services experiment-neo4j-ec2.yml

Later, to test larger instances: Change instance_type to r6i.2xlarge (64GB) or r6i.4xlarge (128GB) and redeploy.

Cost: r6i.xlarge on-demand = ~$0.252/hr = ~$6/day. Budget ~$200 for a month of experiments.
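The daily and monthly figures follow directly from the hourly rate. A minimal sketch of that arithmetic, assuming on-demand billing around the clock and a 30-day month (the function name is illustrative, not part of the planned tooling):

```python
def instance_cost(hourly_usd: float, days: float) -> float:
    """On-demand cost for an instance that runs 24 hours a day."""
    return hourly_usd * 24 * days


# r6i.xlarge at the ~$0.252/hr quoted above
daily = instance_cost(0.252, 1)
monthly = instance_cost(0.252, 30)
print(f"daily=${daily:.2f} monthly=${monthly:.2f}")
```

The same function can be reused for the larger instance types once their hourly rates are confirmed.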

3b. AuraDB

Use AuraDB Professional (not Free — 200K node limit is too restrictive):

Create via AuraDB Console
Region: us-east-1 (same as our infra)
Size: Start with 2GB RAM, scale up as needed
Cost: ~$65/mo for 2GB, ~$130/mo for 4GB
Connection: Bolt protocol (same Neo4j Go driver, different connection URI)

No FSD config needed — AuraDB is fully managed by Neo4j Inc.

3c. Neptune Analytics

Create a Neptune Analytics graph (fully managed; capacity is provisioned in m-NCUs rather than by instance type):

aws neptune-graph create-graph \
  --graph-name consumer-graph-experiment \
  --provisioned-memory 128 \
  --vector-search-configuration dimension=1024 \
  --region us-east-1

Uses openCypher (compatible with Neo4j Cypher, with caveats)
Has native vector search built in
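Pricing: billed per provisioned m-NCU-hour for as long as the graph exists, not per query; check the Appendix A estimate against current AWS pricing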
No FSD config needed — use AWS CLI or CloudFormation

3d. Vector Storage Instances

| Backend | Setup | Notes |
| --- | --- | --- |
| Neo4j native (HNSW) | Already on the EC2 experiment instance | CREATE VECTOR INDEX |
| Neptune Analytics vectors | Already included in Neptune graph | Built-in |
| Valkey VSS | Create a separate Valkey node or use existing stage cache | Needs the vector search module loaded on the server; redis-cli is only the client |
| OpenSearch Serverless | Create a vector search collection | `aws opensearchserverless create-collection` |

4. Synthetic Data Generation

4a. Data Model

Matches the production schema exactly:

(:User {user_id, timezone, created_at, last_updated_at})
  -[:PURCHASED {times, last, timestamps[], receipt_ids[], avg_interval_days, repurchase_likelihood}]->
(:Product {product_id, name, brand, category, created_at})
  -[:IN_CATEGORY]->
(:Category {category_id, name})

(:User)-[:SHOPS_IN {purchase_count}]->(:Category)
(:User)-[:SHOPS_AT {frequency, last_visit}]->(:Retailer {name, venue_type})
(:User)-[:MEMBER_OF]->(:Community {community_id, name, primary_category, member_count, zip_code})

4b. Distribution Matching

Use distributions measured from production (from capacity planning doc):

| Relationship | Distribution | Params |
| --- | --- | --- |
| PURCHASED per user | Log-normal | mean=5.8, median=3, p95=20, max=103 |
| SHOPS_IN per user | Log-normal | mean=5.7, p95=17 |
| MEMBER_OF per user | Log-normal | mean=2.6, p95=6 |
| SHOPS_AT per user | Log-normal | mean=1.6, p95=4 |
| Products (shared) | Power-law | ~2.5 products per user (amortized), popular products purchased by many users |
| Categories | Fixed catalog | ~50 realistic categories (Dairy, Bakery, Snacks, etc.) |
| Retailers | Fixed catalog | ~500 realistic retailer names |
| Communities | Derived | ~3 per zip code × category combination |
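As a sanity check on the PURCHASED row, the log-normal parameters can be recovered from the stated mean and median. This is a minimal Python sketch of the math only, not the datagen implementation (which is planned in Go); the function names and the clamping to [1, max] are assumptions. The other rows give mean and p95 rather than median, so they would need a different fit.

```python
import math
import random


def lognormal_params(mean: float, median: float) -> tuple[float, float]:
    """Solve for (mu, sigma) from median = exp(mu) and mean = exp(mu + sigma^2 / 2)."""
    mu = math.log(median)
    sigma = math.sqrt(2.0 * math.log(mean / median))
    return mu, sigma


def sample_count(rng: random.Random, mu: float, sigma: float, max_count: int) -> int:
    """Draw one per-user relationship count, clamped to [1, max_count]."""
    value = rng.lognormvariate(mu, sigma)
    return max(1, min(max_count, round(value)))


mu, sigma = lognormal_params(mean=5.8, median=3.0)
p95 = math.exp(mu + 1.6449 * sigma)  # 95th percentile of the fitted log-normal
rng = random.Random(42)              # fixed seed for reproducible datasets
counts = [sample_count(rng, mu, sigma, max_count=103) for _ in range(10_000)]
```

With mean=5.8 and median=3 the fitted p95 comes out just under 20, which agrees with the measured p95 in the table. Clamping and rounding shift the sampled mean slightly, so the generated data should be compared against the targets after generation.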

4c. Real Catalog Data (Products, Categories, Retailers)

Products, categories, and retailers use real Fetch data — not synthetic names. This ensures embeddings reflect actual product semantics and similarity searches return meaningful results.

Current prod Neo4j catalog (as of 2026-02-16, from 30 backfilled users)

| Entity | Count | Properties |
| --- | --- | --- |
| Products | 4,586 | product_id, name, brand, category |
| Categories | 2,970 | category_id, name (3-level hierarchy: `GROCERY\|DAIRY\|MILK`) |
| Retailers | 212 | retailer_id, name, venue_type |

Data source strategy

| Scale | Products Needed | Source | Method |
| --- | --- | --- | --- |
| Quick-start (≤50K users) | ~5K | Prod Neo4j export | Cypher query → CSV |
| Medium (100K–500K users) | ~60K | Snowflake export | SQL query → CSV |
| Large (1M+ users) | ~400K+ | Snowflake export | SQL query → CSV |

Source 1: Prod Neo4j export (immediate, no extra access needed)

// Export products
MATCH (p:Product)
RETURN p.product_id AS product_id, p.name AS name, p.brand AS brand, p.category AS category

// Export categories
MATCH (c:Category)
RETURN c.category_id AS category_id, c.name AS name

// Export retailers
MATCH (r:Retailer)
RETURN r.retailer_id AS retailer_id, r.name AS name, r.venue_type AS venue_type

// Export product→category mapping
MATCH (p:Product)-[:IN_CATEGORY]->(c:Category)
RETURN p.product_id AS product_id, c.category_id AS category_id

A scripts/export_catalog.sh script runs these via the Neo4j HTTP API and writes CSVs. This gives us ~4.5K real products with names like “Hormel Black Label Thick Cut Maple Bacon - 12 Oz” and real brands like HORMEL, GREAT VALUE, CELSIUS.

Source 2: Snowflake export (for larger catalogs)

-- Unique products from receipt items (full Fetch catalog)
SELECT DISTINCT
    i.FIDO AS product_id,
    i.DESCRIPTION AS name,
    i.BRAND AS brand,
    COALESCE(i.CATEGORY_1, 'UNCATEGORIZED') AS category_l1,
    i.CATEGORY_2 AS category_l2,
    i.CATEGORY_3 AS category_l3
FROM FETCH_SERVICES_PROD.RECEIPT_SERVICE.RECEIPT_ITEMS i
WHERE i.FIDO IS NOT NULL
  AND i.DESCRIPTION IS NOT NULL
  AND i.DESCRIPTION != ''
LIMIT 500000;

-- Unique retailers
SELECT DISTINCT
    r.STORE_NAME AS name,
    r.RETAILER_CHANNEL AS venue_type
FROM FETCH_SERVICES_PROD.RECEIPT_SERVICE.RECEIPTS r
WHERE r.STORE_NAME IS NOT NULL
  AND r.STORE_NAME != '';

Run via Snowflake CLI (snowsql) or the Snowflake Python connector. Export to CSV, then use in datagen.

Source 3: Purchase History API (alternative, slower)

If Snowflake access is not available, we can discover more products by calling the Purchase History API for a batch of user IDs. Each user averages ~5.8 unique products. Querying ~30K users at 10 req/s (~50 min) would yield ~50-60K unique products. This uses the existing backfill infrastructure but is slower than a direct Snowflake query.

How `datagen` uses the real catalog

Load real product/category/retailer CSVs (exported from Neo4j or Snowflake)
Generate synthetic users with fake user_ids
Assign real products to fake users following a power-law (Zipf) distribution — popular products (bananas, eggs, milk) purchased by many users, long-tail products by few
Build SHOPS_IN, SHOPS_AT, MEMBER_OF relationships from the assigned purchases
Communities are derived from category + zip code combinations (synthetic zip codes, real categories)
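The step that builds SHOPS_IN from assigned purchases can be sketched as a simple aggregation. This is an illustrative Python sketch, not the Go datagen code; it assumes `purchase_count` means the number of purchased products per category, which the plan does not spell out, and the function name is hypothetical.

```python
from collections import Counter, defaultdict


def derive_shops_in(purchases, product_category):
    """Count, per user, how many purchased products fall in each category.

    purchases: iterable of (user_id, product_id) pairs
    product_category: dict mapping product_id -> category_id
    """
    shops_in = defaultdict(Counter)
    for user_id, product_id in purchases:
        category_id = product_category.get(product_id)
        if category_id is not None:  # skip products with no category mapping
            shops_in[user_id][category_id] += 1
    return shops_in


purchases = [("u1", "p1"), ("u1", "p2"), ("u1", "p3"), ("u2", "p1")]
product_category = {"p1": "DAIRY", "p2": "DAIRY", "p3": "BAKERY"}
result = derive_shops_in(purchases, product_category)
```

SHOPS_AT could be derived the same way if each synthetic purchase is also assigned a retailer.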

Product popularity distribution from prod (top products by buyer count):

| Product | Buyers | Category |
| --- | --- | --- |
| Fresh Fruits | 12 | Pantry |
| Fresh Vegetables | 10 | Pantry |
| Eggs | 10 | Pantry |
| Fresh Bananas | 9 | Pantry |
| Fresh Blueberries | 8 | Pantry |
| Avocados | 6 | PRODUCE\|FRUITS\|AVOCADOS |

The Zipf distribution in datagen should match this pattern: ~40% of purchases hit the top 5% of products.
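One way to turn the "40% of purchases in the top 5% of products" target into a concrete Zipf exponent is to bisect on the exponent. The sketch below is illustrative Python under that assumption (rank-k weight proportional to k to the power of minus the exponent); the function names are hypothetical, and the catalog size of 4,586 is taken from the prod export above.

```python
import math


def top_share(n_products: int, exponent: float, top_fraction: float) -> float:
    """Fraction of purchases landing on the top `top_fraction` of products
    when the product at rank k is chosen with weight k ** -exponent."""
    weights = [k ** -exponent for k in range(1, n_products + 1)]
    top_n = math.ceil(top_fraction * n_products)
    return sum(weights[:top_n]) / sum(weights)


def fit_exponent(n_products: int, top_fraction: float, target: float) -> float:
    """Bisect on the exponent until the top share matches `target`."""
    lo, hi = 0.0, 3.0  # share rises monotonically with the exponent
    for _ in range(60):
        mid = (lo + hi) / 2
        if top_share(n_products, mid, top_fraction) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


s = fit_exponent(n_products=4586, top_fraction=0.05, target=0.40)
```

The fitted exponent depends on catalog size, so it should be recomputed when switching from the ~5K Neo4j export to a larger Snowflake catalog.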

4d. Vector Embeddings

All embeddings use real models so that kNN results reflect actual semantic similarity (not random noise). Three embedding approaches:

| Model | Dimensions | Speed | Cost | Where it runs |
| --- | --- | --- | --- | --- |
| OpenAI `text-embedding-3-small` | 512 (via `dimensions` param) | ~3,000 items/min | ~$0.02/1M tokens | API call |
| OpenAI `text-embedding-3-large` | 1024 (via `dimensions` param) | ~2,500 items/min | ~$0.13/1M tokens | API call |
| Sentence Transformer (`all-MiniLM-L6-v2`) | 384 | ~10,000 items/min | Free | Local (Python) |

OpenAI embedding (512-dim) — good quality, low cost:

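# Using text-embedding-3-small with dimensions=512
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
response = client.embeddings.create(
    model="text-embedding-3-small",
    input="Organic Whole Milk, Horizon, Dairy",
    dimensions=512,
)
embedding = response.data[0].embedding  # list of 512 floats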

Input text per product: "{name}, {brand}, {category}" (e.g. “Organic Whole Milk, Horizon, Dairy”)
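Input text per user (only needed if users are embedded from text; the embedding pipeline described later in this section averages product embeddings instead): concatenation of their top-5 purchased product names + top-3 categories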
Cost estimate: 1M products × ~10 tokens each = 10M tokens → ~$0.20

OpenAI embedding (1024-dim) — highest quality, tests scaling with larger vectors:

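# Using text-embedding-3-large with dimensions=1024
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
response = client.embeddings.create(
    model="text-embedding-3-large",
    input="Organic Whole Milk, Horizon, Dairy",
    dimensions=1024,
)
embedding = response.data[0].embedding  # list of 1024 floats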

Same input text format as 512-dim
Cost estimate: 1M products × ~10 tokens each = 10M tokens → ~$1.30
2× storage and index overhead vs 512-dim — tests whether higher quality justifies the cost
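Both cost estimates are the same calculation with a different per-token price. A minimal sketch, assuming the ~10 tokens per item used above (the helper name is illustrative):

```python
def embedding_cost_usd(items: int, tokens_per_item: float, usd_per_million_tokens: float) -> float:
    """Estimated API cost for embedding `items` short texts."""
    total_tokens = items * tokens_per_item
    return total_tokens / 1_000_000 * usd_per_million_tokens


small = embedding_cost_usd(1_000_000, 10, 0.02)  # text-embedding-3-small
large = embedding_cost_usd(1_000_000, 10, 0.13)  # text-embedding-3-large
print(f"small=${small:.2f} large=${large:.2f}")
```

Long product names with brand and category appended may tokenize to more than 10 tokens, so the estimates are probably a lower bound; even several times higher, the cost stays small relative to the infrastructure budget.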

Sentence Transformer — free, local, 384-dim:

from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')  # 384-dim
# encode() takes a list of strings and returns one 384-dim vector per input
embeddings = model.encode(["Organic Whole Milk, Horizon, Dairy"])

Same input text format as OpenAI
Runs on local machine (CPU is fine for <1M items)
No API cost, fully offline

Embedding pipeline:

Real product catalog is exported from Neo4j or Snowflake (see section 4c)
A Python script (scripts/embed.py) reads the product CSV and generates embeddings via OpenAI API or local sentence-transformer
Embeddings are saved as .npy files (product_id → float32 array)
datagen assigns real products to synthetic users; user embeddings = weighted average of their purchased product embeddings (weighted by purchase count)
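The user-embedding step (weighted average of purchased product embeddings) can be sketched in plain Python as follows. This is an illustration of the calculation, not the Go datagen code; the function name is hypothetical.

```python
def weighted_average(vectors, weights):
    """Weighted mean of equal-length vectors, e.g. product embeddings
    weighted by how many times the user bought each product."""
    if not vectors or len(vectors) != len(weights):
        raise ValueError("need one weight per vector")
    total = float(sum(weights))
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    dim = len(vectors[0])
    return [
        sum(w * vec[i] for vec, w in zip(vectors, weights)) / total
        for i in range(dim)
    ]


# Two toy 2-dim product embeddings; the first product was bought 3 times.
user_vec = weighted_average([[1.0, 0.0], [0.0, 1.0]], [3, 1])
```

The averaged vector is generally not unit length. That does not matter for cosine similarity, but it would for a dot-product or Euclidean index.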

Recommended approach:

Use OpenAI 512-dim (text-embedding-3-small) as the primary embedding for most experiments (good quality, low cost)
Use OpenAI 1024-dim (text-embedding-3-large) to test whether higher dimensionality improves recall enough to justify 2× storage/index overhead
Use sentence-transformer 384-dim as a free alternative for rapid iteration and local development
Compare all three on HNSW index size, kNN recall, and query latency to determine the best quality/cost/performance tradeoff

Embeddings go on:

Product nodes: Embed "{name}, {brand}, {category}" — represents the product semantically
User nodes: Weighted average of purchased product embeddings — represents the user’s purchase behavior profile
Category nodes (optional): Embed category name — enables category-level similarity search

4e. CLI Interface

# Step 0: Export real catalog from prod Neo4j (one-time)
./scripts/export_catalog.sh --env prod --output data/catalog/
# Or from Snowflake for a larger catalog:
python scripts/export_snowflake_catalog.py --output data/catalog/ --limit 500000

# Catalog output:
#   data/catalog/products.csv       (product_id, name, brand, category)
#   data/catalog/categories.csv     (category_id, name)
#   data/catalog/retailers.csv      (retailer_id, name, venue_type)
#   data/catalog/in_category.csv    (product_id, category_id)

# Step 1: Generate embeddings for real products (Python, one-time per model)
python scripts/embed.py \
  --products data/catalog/products.csv \
  --model openai-small --dimensions 512 \
  --output data/catalog/embeddings-512/

python scripts/embed.py \
  --products data/catalog/products.csv \
  --model openai-large --dimensions 1024 \
  --output data/catalog/embeddings-1024/

python scripts/embed.py \
  --products data/catalog/products.csv \
  --model sentence-transformer \
  --output data/catalog/embeddings-384/

# Embedding output (per model):
#   data/catalog/embeddings-512/products.npy      (float32 array)
#   data/catalog/embeddings-512/manifest.json     (model, dimensions, count, cost)

# Step 2: Generate synthetic users + purchase relationships (Go)
./datagen \
  --users 100000 \
  --seed 42 \
  --catalog data/catalog/ \
  --embeddings data/catalog/embeddings-512/ \
  --output data/100k/

# Generated output (synthetic):
#   data/100k/users.csv                    (user_id, timezone, created_at)
#   data/100k/embeddings/users.npy         (weighted avg of product embeddings)
#   data/100k/embeddings/manifest.json     (model, dimensions, count)
#   data/100k/communities.csv              (community_id, name, primary_category, zip_code)
#   data/100k/purchased.csv                (user_id, product_id, times, last, ...)
#   data/100k/shops_in.csv                 (user_id, category_id, purchase_count)
#   data/100k/shops_at.csv                 (user_id, retailer_name, frequency, last_visit)
#   data/100k/member_of.csv                (user_id, community_id)
#   data/100k/manifest.json                (metadata: counts, seed, generation time)
#
# Products, categories, retailers, and in_category are real data from
# data/catalog/ — shared across all dataset sizes, not regenerated.

5. Experiment 1: EC2 Neo4j Stress Test

Objective

Find the maximum user count where:

Read query p95 < 100ms
Write throughput > 100 nodes/sec
Page cache hit ratio > 95%
No OOM crashes

Procedure

Deploy fresh experiment EC2 (r6i.xlarge, 32GB)
Load data incrementally:

| Checkpoint | Users | Est. Graph Size | Est. + Vectors (512-dim) | Est. + Vectors (1024-dim) |
| --- | --- | --- | --- | --- |
| C1 | 10,000 | ~43 MB | ~150 MB | ~260 MB |
| C2 | 50,000 | ~215 MB | ~650 MB | ~1.1 GB |
| C3 | 100,000 | ~430 MB | ~1.3 GB | ~2.2 GB |
| C4 | 250,000 | ~1.1 GB | ~3.1 GB | ~5.3 GB |
| C5 | 500,000 | ~2.1 GB | ~6 GB | ~10.5 GB |
| C6 | 1,000,000 | ~4.3 GB | ~11.5 GB | ~20 GB |
| C7 | 2,000,000 | ~8.6 GB | ~23 GB | ~40 GB |

Vector estimates (users + products both embedded, HNSW index overhead ≈ 1.5× vector storage):

512-dim float32 = 2,048 bytes (~2 KB) per embedding
1024-dim float32 = 4,096 bytes (~4 KB) per embedding

At each checkpoint, run the benchmark suite
Stop when performance degrades below thresholds
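The vector size estimates in the checkpoint table can be sanity-checked with a small helper. This sketch treats the 1.5× HNSW figure as a multiplier on raw vector storage, which is one reading of the note above; the number of embedded nodes per checkpoint is not stated in this plan and has to be supplied by the caller.

```python
def raw_vector_bytes(count: int, dimensions: int, bytes_per_value: int = 4) -> int:
    """Raw storage for `count` float32 embeddings, before any index overhead."""
    return count * dimensions * bytes_per_value


def with_index_overhead(raw_bytes: int, overhead_factor: float = 1.5) -> float:
    """Scale raw storage by an assumed HNSW overhead factor."""
    return raw_bytes * overhead_factor


per_512 = raw_vector_bytes(1, 512)    # bytes for one 512-dim embedding
per_1024 = raw_vector_bytes(1, 1024)  # bytes for one 1024-dim embedding
```

Measured index sizes from the first checkpoints should replace the assumed factor before extrapolating to C6 and C7.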

Benchmark Query Set

// Q1: Single user lookup (point query)
MATCH (u:User {user_id: $uid})-[r:PURCHASED]->(p:Product)
RETURN u, r, p

// Q2: 2-hop category aggregation
MATCH (u:User {user_id: $uid})-[:PURCHASED]->(p:Product)-[:IN_CATEGORY]->(c:Category)
RETURN c.name, count(p) AS products ORDER BY products DESC

// Q3: Community-based recommendation (expensive)
MATCH (u:User {user_id: $uid})-[:MEMBER_OF]->(comm:Community)<-[:MEMBER_OF]-(other:User)
MATCH (other)-[:PURCHASED]->(p:Product)
WHERE NOT (u)-[:PURCHASED]->(p)
RETURN p.name, count(DISTINCT other) AS score
ORDER BY score DESC LIMIT 10

// Q4: User's full profile (all relationship types)
MATCH (u:User {user_id: $uid})
OPTIONAL MATCH (u)-[pur:PURCHASED]->(prod:Product)
OPTIONAL MATCH (u)-[si:SHOPS_IN]->(cat:Category)
OPTIONAL MATCH (u)-[sa:SHOPS_AT]->(ret:Retailer)
OPTIONAL MATCH (u)-[mo:MEMBER_OF]->(comm:Community)
RETURN count(DISTINCT pur) AS purchases,
       count(DISTINCT si) AS categories,
       count(DISTINCT sa) AS retailers,
       count(DISTINCT mo) AS communities

// Q5: Global aggregation (stress test)
MATCH (u:User)-[:PURCHASED]->(p:Product)
WITH p, count(u) AS buyers
ORDER BY buyers DESC LIMIT 20
RETURN p.name, buyers

// Q6: Vector similarity (only when vectors loaded)
CALL db.index.vector.queryNodes('product-embedding-index', 10, $queryVector)
YIELD node, score
RETURN node.product_id, node.name, score

// Q7: Graph + vector combined (vector candidates → graph filter on already-purchased products)
MATCH (u:User {user_id: $uid})-[:PURCHASED]->(p:Product)
WITH collect(p) AS purchased
CALL db.index.vector.queryNodes('product-embedding-index', 50, $queryVector)
YIELD node, score
WHERE NOT node IN purchased
RETURN node.product_id, node.name, score LIMIT 10

Each query runs 100 iterations with random user IDs. Record p50, p95, p99, max latency.
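The reported statistics can be computed from the raw latency samples with a nearest-rank percentile. The sketch below is illustrative Python (the benchmark runner itself is planned in Go); the function names are hypothetical and the key names mirror the metrics record used later in this plan.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample with at least p% of
    the samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(latencies_ms):
    """Collapse one query's latency samples into the reported statistics."""
    return {
        "p50_ms": percentile(latencies_ms, 50),
        "p95_ms": percentile(latencies_ms, 95),
        "p99_ms": percentile(latencies_ms, 99),
        "max_ms": max(latencies_ms),
    }


stats = summarize(list(range(1, 101)))  # 100 samples: 1 ms .. 100 ms
```

With only 100 samples, p99 is the second-slowest run, so a single outlier can dominate it. More iterations for the cheap queries (Q1, Q2, Q4) would give a steadier tail estimate.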

Run Matrix

| Run | Instance | Vectors | Embedding Model | Max Users |
| --- | --- | --- | --- | --- |
| R1 | r6i.xlarge (32GB) | No | n/a | Until degradation |
| R2 | r6i.xlarge (32GB) | Yes (512-dim) | OpenAI text-embedding-3-small | Until degradation |
| R3 | r6i.xlarge (32GB) | Yes (384-dim) | Sentence Transformer all-MiniLM-L6-v2 | Until degradation |
| R4 | r6i.xlarge (32GB) | Yes (1024-dim) | OpenAI text-embedding-3-large | Until degradation |
| R5 | r6i.2xlarge (64GB) | Yes (512-dim) | OpenAI text-embedding-3-small | Until degradation |
| R6 | r6i.2xlarge (64GB) | Yes (1024-dim) | OpenAI text-embedding-3-large | Until degradation |

6. Experiment 2: AuraDB Evaluation

Objective

Compare AuraDB Professional vs self-managed EC2 Neo4j on latency, throughput, and cost.

Procedure

Create AuraDB Professional instance (us-east-1, 4GB RAM)
Load 100K users (same dataset as EC2 experiment)
Run identical benchmark query set
Scale to 500K, 1M if 100K passes
Test vector support (AuraDB Professional supports vector indexes)

Key Questions

Cypher compatibility: Is our production Cypher 100% compatible? (MERGE, UNWIND, CASE WHEN, datetime, array properties)
Write throughput: How does Bolt-over-internet compare to Bolt-over-VPC?
Latency: Network hop to AuraDB vs local VPC EC2
Cost: AuraDB pricing vs EC2 + ops overhead
Vector support: Same HNSW API as Community Edition?

Loader Differences

Same Neo4j Go driver, different connection string:

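// auth is a neo4j.AuthToken, e.g. neo4j.BasicAuth(user, password, "")

// EC2 (plain Bolt routing inside the VPC); check err in real code
ec2Driver, err := neo4j.NewDriverWithContext("neo4j://10.4.19.205:7687", auth)

// AuraDB (TLS-encrypted: neo4j+s scheme)
auraDriver, err := neo4j.NewDriverWithContext("neo4j+s://xxxxx.databases.neo4j.io", auth)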

7. Experiment 3: Neptune Analytics

Objective

Evaluate Neptune Analytics as a unified graph + vector database, replacing both Neo4j and a separate vector store.

Procedure

Create Neptune Analytics graph (128 m-NCUs, 1024-dim vector config)
Load data via openCypher endpoint or Bulk Load from S3
Test Cypher compatibility (Neptune uses openCypher, not full Cypher)
Run benchmark queries (translate where needed)
Test vector search with neptune.algo.vectors.topKByNode() and topKByEmbedding()
Measure combined graph traversal + vector similarity queries

Cypher Compatibility Risks

Neptune’s openCypher has known gaps vs Neo4j Cypher:

| Feature | Neo4j | Neptune | Risk |
| --- | --- | --- | --- |
| MERGE with ON CREATE/ON MATCH | Yes | Yes | Low |
| UNWIND | Yes | Yes | Low |
| CASE WHEN in SET | Yes | Partial | Medium: test the receipt_ids dedup pattern |
| datetime() | Yes | timestamp() | High: different function names |
| Array properties | Yes | Yes (with caveats) | Medium: test append operations |
| CALL procedures | Yes (APOC, vector) | Different API | High: vector queries use different syntax |
| CREATE VECTOR INDEX | Yes (Neo4j 5.x) | Not applicable | N/A: vectors are built-in differently |

Neptune Vector Search API

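// Neptune Analytics vector similarity (different from Neo4j)
// Takes the query embedding plus an options map; verify the exact form against the current Neptune Analytics docs
CALL neptune.algo.vectors.topKByEmbedding($queryVector, {topK: 10})
YIELD node, score
RETURN node.product_id, score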

Data Loading

Neptune supports bulk loading from S3 (CSV format with specific headers):

# Upload generated CSV to S3
aws s3 cp data/100k/ s3://experiment-bucket/neptune/100k/ --recursive

# Start bulk load
aws neptune-graph start-import-task \
  --graph-identifier consumer-graph-experiment \
  --source s3://experiment-bucket/neptune/100k/ \
  --role-arn arn:aws:iam::<account-id>:role/NeptuneBulkLoadRole \
  --format OPEN_CYPHER

8. Experiment 4: Vector Storage Comparison

Objective

Compare vector storage options for kNN similarity search at scale. Find the best option for our latency, cost, and operational requirements.

Test Matrix

| Backend | Setup | kNN API |
| --- | --- | --- |
| Neo4j HNSW | Vector index on experiment EC2 | `db.index.vector.queryNodes()` |
| Neptune Analytics | Built into Neptune graph | `neptune.algo.vectors.topKByEmbedding()` |
| Valkey VSS | `FT.CREATE` with HNSW | `FT.SEARCH idx "*=>[KNN k @vec $blob]"` |
| OpenSearch Serverless | Vector search collection | `knn` query via REST API |

Benchmark

For each backend, at 100K and 1M vectors (1024-dim):

Insert throughput: vectors/sec for bulk load
kNN latency: k=10, k=50, k=100 — p50, p95, p99
Recall@10: Against brute-force exact results (measures HNSW approximation quality)
Combined query: kNN candidates → graph enrichment (two-hop) — end-to-end latency
Memory usage: Index size in RAM
Cost: $/month at steady state
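Recall@k needs a brute-force ground truth to compare each backend against. A minimal Python sketch of both pieces, using cosine similarity to match the index configuration elsewhere in this plan (function names and the toy data are illustrative):

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def exact_top_k(query, vectors, k):
    """Brute-force kNN over a dict of id -> vector; the ground truth."""
    ranked = sorted(vectors, key=lambda vid: cosine_similarity(query, vectors[vid]), reverse=True)
    return ranked[:k]


def recall_at_k(approx_ids, exact_ids, k):
    """Share of the exact top-k that the approximate index also returned."""
    return len(set(approx_ids[:k]) & set(exact_ids[:k])) / k


vectors = {"a": [1.0, 0.0], "b": [0.9, 0.1], "c": [0.0, 1.0], "d": [-1.0, 0.0]}
truth = exact_top_k([1.0, 0.0], vectors, k=2)  # ids of the two nearest vectors
recall = recall_at_k(["a", "c"], truth, k=2)   # a backend that returned a and c
```

Brute force is quadratic in practice (every query scans every vector), so at 1M vectors the ground truth is best computed once for a fixed sample of query vectors and cached.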

Combined Graph + Vector Query Pattern

The production use case is:

Vector search → Find 50 candidate products similar to user’s embedding
Graph filter → Remove products the user already purchased
Graph enrich → Get category, brand, retailer info for remaining candidates
Graph rerank → Boost candidates purchased by users in the same community

This requires either:

Single-engine (Neptune Analytics, Neo4j native vectors): One query does it all
Two-engine (Neo4j + Valkey, Neo4j + OpenSearch): Vector search → ID list → graph query

Measure end-to-end latency for both patterns.
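In the two-engine pattern, the filter, enrich, and rerank steps run as application glue between the vector backend and the graph. The sketch below shows that glue with in-memory stand-ins for both backends. The ranking rule (community buyer count first, similarity as tie-breaker) is an assumption, since the plan only says community purchases should boost a candidate, and all names are hypothetical.

```python
def recommend(candidates, purchased, catalog, community_buyers, limit=10):
    """Steps 2-4 of the pattern, applied to the kNN candidates from step 1.

    candidates: list of (product_id, similarity) from the vector backend
    purchased: set of product_ids the user already bought
    catalog: dict product_id -> {"category": ..., "brand": ...}
    community_buyers: dict product_id -> buyers in the user's communities
    """
    results = []
    for product_id, similarity in candidates:
        if product_id in purchased:          # graph filter
            continue
        info = catalog.get(product_id, {})   # graph enrich
        boost = community_buyers.get(product_id, 0)
        results.append({"product_id": product_id, "similarity": similarity,
                        "community_buyers": boost, **info})
    # graph rerank: community popularity first, similarity as the tie-breaker
    results.sort(key=lambda r: (r["community_buyers"], r["similarity"]), reverse=True)
    return results[:limit]


ranked = recommend(
    candidates=[("p1", 0.95), ("p2", 0.90), ("p3", 0.80)],
    purchased={"p1"},
    catalog={"p2": {"category": "DAIRY"}, "p3": {"category": "BAKERY"}},
    community_buyers={"p3": 4},
)
```

In the benchmark, the two dictionaries would be replaced by one graph query keyed on the candidate ID list, and the end-to-end timer should cover the vector call, the graph call, and this merge step.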

9. Metrics Framework

What to Capture at Every Checkpoint

Example record (the values are illustrative placeholders, not measurements):

{
  "experiment": "ec2-neo4j-stress",
  "checkpoint": "C3",
  "users": 100000,
  "timestamp": "2026-02-20T14:30:00Z",
  "backend": {
    "type": "neo4j-ec2",
    "instance": "r6i.xlarge",
    "memory_gb": 32,
    "version": "5.15.0"
  },
  "data": {
    "total_nodes": 180000,
    "total_relationships": 1570000,
    "store_size_mb": 430,
    "vector_index_size_mb": 0
  },
  "write_metrics": {
    "load_duration_sec": 120,
    "nodes_per_sec": 1500,
    "rels_per_sec": 13000,
    "batch_size": 1000
  },
  "read_metrics": {
    "Q1_point_lookup":    {"p50_ms": 2, "p95_ms": 5, "p99_ms": 12},
    "Q2_2hop_category":   {"p50_ms": 5, "p95_ms": 15, "p99_ms": 30},
    "Q3_community_rec":   {"p50_ms": 50, "p95_ms": 120, "p99_ms": 250},
    "Q4_full_profile":    {"p50_ms": 3, "p95_ms": 8, "p99_ms": 20},
    "Q5_global_agg":      {"p50_ms": 200, "p95_ms": 500, "p99_ms": 1000},
    "Q6_vector_knn":      {"p50_ms": 0, "p95_ms": 0, "p99_ms": 0},
    "Q7_graph_plus_vector": {"p50_ms": 0, "p95_ms": 0, "p99_ms": 0}
  },
  "system_metrics": {
    "page_cache_hit_ratio": 0.99,
    "heap_used_mb": 3200,
    "heap_max_mb": 8192,
    "cpu_percent": 15,
    "disk_iops": 120
  },
  "cost": {
    "instance_hourly": 0.252,
    "estimated_monthly": 181
  }
}

Results are committed to results/ in the experiment repo for historical comparison.
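A checkpoint record can be checked mechanically against the Experiment 1 thresholds (read p95 < 100ms, write throughput > 100 nodes/sec, page cache hit ratio > 95%). The sketch below is illustrative Python; applying the p95 limit to every query is an assumption, and the heavier queries (Q3, Q5) may warrant their own limits.

```python
def threshold_failures(record, p95_limit_ms=100, min_cache_hit=0.95, min_nodes_per_sec=100):
    """Return a list of human-readable threshold violations for one checkpoint."""
    failures = []
    for query, stats in record["read_metrics"].items():
        if stats["p95_ms"] >= p95_limit_ms:
            failures.append(f"{query}: p95 {stats['p95_ms']} ms")
    if record["system_metrics"]["page_cache_hit_ratio"] <= min_cache_hit:
        failures.append("page cache hit ratio too low")
    if record["write_metrics"]["nodes_per_sec"] <= min_nodes_per_sec:
        failures.append("write throughput too low")
    return failures


record = {
    "read_metrics": {"Q1_point_lookup": {"p95_ms": 5}, "Q3_community_rec": {"p95_ms": 120}},
    "system_metrics": {"page_cache_hit_ratio": 0.99},
    "write_metrics": {"nodes_per_sec": 1500},
}
failures = threshold_failures(record)
```

An empty list means the checkpoint passed and loading can continue; a non-empty list marks the candidate breaking point.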

10. Gradual Data Feeding Strategy

Approach: Incremental Loading

Don’t load all data at once. Load in checkpoints so we can measure the system at each scale.

Checkpoint C1:   10K users  ──┐
                              ├── Benchmark suite
Checkpoint C2:   50K users  ──┤   (runs at each checkpoint)
                              ├──
Checkpoint C3:  100K users  ──┤
                              ├──
Checkpoint C4:  250K users  ──┤
                              ├──
Checkpoint C5:  500K users  ──┤
                              ├──
Checkpoint C6:    1M users  ──┤
                              ├──
Checkpoint C7:    2M users  ──┘  (EC2 may OOM here with vectors)

Loading Protocol (Same for All Backends)

Pre-generate all data using datagen (deterministic seed, so regeneration is identical)
Load in batches of 1,000 users — each batch includes the user + all their relationships
Use MERGE (same as production) — allows re-running without duplicates
After each checkpoint: Run full benchmark suite, save results, continue loading
If backend crashes/degrades: Record the failure point, stop loading, note as the limit
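The loading protocol amounts to walking the user range in fixed-size batches and pausing at each checkpoint. A minimal Python sketch of that schedule (the loader itself is planned in Go; the function name and default checkpoint list are illustrative):

```python
def plan_batches(total_users, batch_size=1000, checkpoints=(10_000, 50_000, 100_000)):
    """Yield (start, end, checkpoint) for each load batch.

    `checkpoint` is the user count reached when a batch ends exactly on a
    checkpoint boundary, otherwise None. Batches never straddle a checkpoint.
    """
    boundaries = sorted(c for c in checkpoints if c <= total_users)
    start = 0
    while start < total_users:
        end = min(start + batch_size, total_users)
        # cut the batch short if a checkpoint falls inside it
        for boundary in boundaries:
            if start < boundary < end:
                end = boundary
                break
        yield start, end, (end if end in boundaries else None)
        start = end


batches = list(plan_batches(50_000))
reached = [cp for _, _, cp in batches if cp is not None]
```

Whenever the third element is not None, the loader would stop, run the benchmark suite, save the results, and then resume from `end`.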

Per-Backend Loading

| Backend | Protocol | Batch Method |
| --- | --- | --- |
| EC2 Neo4j | Bolt (Go driver) | UNWIND + MERGE (same as production writer) |
| AuraDB | Bolt (Go driver) | Same UNWIND + MERGE |
| Neptune Analytics | openCypher HTTPS or S3 bulk load | Bulk load for initial, incremental MERGE for checkpoints |
| Valkey VSS | Redis protocol | `FT.CREATE` index, `HSET` for each vector |
| OpenSearch Serverless | REST API | `_bulk` API for batch indexing |

11. Vector Storage Deep Dive

What to Try

A. Neo4j Native Vectors (HNSW)

// Create vector index
CREATE VECTOR INDEX `product-embedding-index`
FOR (p:Product) ON (p.embedding)
OPTIONS {indexConfig: {
  `vector.dimensions`: 512,
  `vector.similarity_function`: 'cosine'
}}

// Query
CALL db.index.vector.queryNodes('product-embedding-index', 10, $queryVector)
YIELD node, score
RETURN node.product_id, score

Pros: Single query engine, simplest architecture
Cons: HNSW index must fit in RAM, scaling requires bigger instance

B. Neptune Analytics Vectors

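// Embeddings are attached via the vectors.upsert algorithm (or at bulk load), not with SET on a property.
// Verify the exact signatures against the current Neptune Analytics docs.
MATCH (p:Product {product_id: $id})
CALL neptune.algo.vectors.upsert(p, $vector)
YIELD success
RETURN success

// Query (Neptune-specific syntax)
CALL neptune.algo.vectors.topKByEmbedding($queryVector, {topK: 10})
YIELD node, score
RETURN node.product_id, score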

Pros: Graph + vector in one engine, managed scaling
Cons: Different query syntax, potential Cypher compatibility gaps

C. Valkey VSS (Vector Similarity Search)

# Create index
FT.CREATE product-idx ON HASH PREFIX 1 "product:" \
  SCHEMA embedding VECTOR HNSW 6 TYPE FLOAT32 DIM 1024 DISTANCE_METRIC COSINE

# Insert
HSET product:abc123 embedding <binary_vector> name "Product Name" category "Dairy"

# kNN search
FT.SEARCH product-idx "*=>[KNN 10 @embedding $query_vec AS score]" \
  PARAMS 2 query_vec <binary_vector> \
  RETURN 2 name score \
  SORTBY score ASC

Pros: Sub-millisecond latency, reuses existing Valkey infrastructure
Cons: Two-engine pattern (Valkey kNN → Neo4j graph enrichment), no graph traversal in vector query

D. OpenSearch Serverless (Vector Search Collection)

// Create collection
{
  "name": "product-vectors",
  "type": "VECTORSEARCH"
}

// Index document
{
  "product_id": "abc123",
  "name": "Product Name",
  "embedding": [0.1, 0.2, ...]
}

// kNN query
{
  "query": {
    "knn": {
      "embedding": {
        "vector": [0.1, 0.2, ...],
        "k": 10
      }
    }
  }
}

Pros: Fully managed, scales independently, FAISS-backed
Cons: Higher latency (~10-50ms), separate service, two-engine pattern

12. Order of Operations

Phase 1: Foundation

Create graph-capacity-experiments repo
Export real product catalog from prod Neo4j (~4.5K products) via scripts/export_catalog.sh
(Optional) Export larger catalog from Snowflake (~500K products) via scripts/export_snowflake_catalog.py
Generate product embeddings for all three models (512-dim, 1024-dim, 384-dim) via scripts/embed.py
Implement datagen tool (synthetic users + relationships using real catalog)
Implement loader tool (Neo4j Bolt backend first)
Implement benchmark tool (query runner + metrics capture)
Deploy experiment EC2 Neo4j instance (stage account)

Phase 2: EC2 Neo4j Stress Test (Graph Only)

Generate datasets: 10K → 2M users (no vectors)
Run incremental load + benchmark (run R1)
Find the graph-only breaking point
Document results

Phase 3: EC2 Neo4j + Vectors

Create Neo4j vector indexes on experiment instance
Run incremental load + benchmark with 512-dim (run R2)
Run with 384-dim sentence-transformer (run R3)
Run with 1024-dim on r6i.xlarge (run R4) — expect earlier OOM than 512-dim
Test with larger instance r6i.2xlarge: 512-dim (run R5) and 1024-dim (run R6)
Document vector overhead and breaking points per dimension

Phase 4: AuraDB

Create AuraDB Professional instance
Add AuraDB connection to loader
Run 100K → 1M user benchmarks
Compare with EC2 results
Document cost model

Phase 5: Neptune Analytics

Create Neptune Analytics graph
Add Neptune openCypher loader
Test Cypher compatibility (especially MERGE + CASE WHEN patterns)
Run benchmarks (graph + vector combined)
Document compatibility gaps and performance

Phase 6: Vector Comparison

Set up Valkey VSS + OpenSearch Serverless
Load same vector dataset to all four backends
Run kNN benchmarks
Run combined graph+vector end-to-end benchmarks
Produce comparison matrix

Phase 7: Final Report

Compile all results into a recommendation document
Cost projections at 100K, 1M, 5M, 10M users
Recommended architecture per scale tier
Migration effort estimate for each option

13. Infrastructure Teardown

All experiment infrastructure should be tagged with purpose: capacity-testing and torn down after experiments:

EC2 Neo4j experiment instance
AuraDB Professional instance
Neptune Analytics graph
OpenSearch Serverless collection
Any experiment Valkey nodes
S3 buckets with generated data

Appendix A: Estimated Costs

| Resource | Duration | Cost |
| --- | --- | --- |
| EC2 r6i.xlarge (experiment Neo4j) | 30 days | ~$180 |
| EC2 r6i.2xlarge (upgrade test) | 7 days | ~$85 (2× the xlarge hourly rate × 168 hr) |
| AuraDB Professional (4GB) | 30 days | ~$130 |
| Neptune Analytics (128 m-NCUs) | 30 days | ~$100-200 (usage-based) |
| OpenSearch Serverless (2 OCUs) | 7 days | ~$60 |
| S3 storage for datasets | 30 days | ~$5 |
| Total estimated | | ~$560-660 |

Appendix B: Key Dependencies

Go (datagen, loader, benchmark):

// go.mod for graph-capacity-experiments
github.com/neo4j/neo4j-go-driver/v5   // Neo4j + AuraDB
github.com/aws/aws-sdk-go-v2          // Neptune, OpenSearch, S3
github.com/redis/go-redis/v9          // Valkey VSS
golang.org/x/time/rate                // Rate limiting for loaders
gonum.org/v1/gonum                    // Statistical distributions

Python (embedding generation + catalog export):

openai>=1.0                            # OpenAI text-embedding-3-small/large
sentence-transformers>=2.0             # all-MiniLM-L6-v2 (384-dim, local)
numpy>=1.24                            # Vector I/O (.npy format)
pandas>=2.0                            # CSV reading
snowflake-connector-python>=3.0        # Snowflake catalog export (optional)
