Graph Database Capacity Experiment Plan
Date: 2026-02-16 Author: f.luo Status: Draft, awaiting review
See also:
- Action Plan — sequenced steps, decision gates, and timeline
- Infrastructure Solidification Roadmap — phased approach to improving EC2 infra and managed DB options analysis
Results from this experiment plan inform the Action Plan’s decision gates — specifically whether to invest further in EC2 (solidification Phases 1-2) or migrate to a managed service (Phase 3).
1. Goals
- Find the breaking point of EC2 Neo4j: How many users can the current r6i.xlarge (32GB) hold before performance degrades? What about with vectors? (Informs Action Plan, Decision Gate A)
- Evaluate AuraDB: Is managed Neo4j viable? What’s the cost/performance tradeoff vs self-managed EC2? (Informs Action Plan, Decision Gate B)
- Evaluate Neptune Analytics: Can it replace Neo4j? How does graph+vector unification compare to Neo4j + external vector store? (Informs Action Plan, Decision Gate B)
- Compare vector storage options: Neo4j native, Neptune Analytics, Valkey VSS, OpenSearch Serverless
- Establish cost models: $/user/month for each backend at 100K, 1M, 5M users
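The $/user/month figure in the last goal is simple arithmetic once a backend's hourly price is known. A minimal sketch, assuming a flat hourly rate and a 30-day month (the function name and the single-instance assumption are illustrative, not part of the plan):

```python
HOURS_PER_MONTH = 24 * 30  # 30-day month, matching the ~$181/month figure for r6i.xlarge


def cost_per_user_month(hourly_rate_usd: float, users: int) -> float:
    """Return $/user/month for a backend billed at a flat hourly rate."""
    if users <= 0:
        raise ValueError("users must be positive")
    return hourly_rate_usd * HOURS_PER_MONTH / users


if __name__ == "__main__":
    # 0.252 is the r6i.xlarge on-demand rate quoted in section 3a. Whether one
    # instance can actually hold each tier is what the experiments must answer.
    for users in (100_000, 1_000_000, 5_000_000):
        print(f"{users:>9,} users: ${cost_per_user_month(0.252, users):.6f}/user/month")
```

Usage-priced backends would need their own cost function; only the division by user count carries over.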
2. Decision: Repository Strategy
Recommendation: New repo (graph-capacity-experiments)
Why not keep it in consumer-graph-worker?
| Concern | consumer-graph-worker | New repo |
|---|---|---|
| Lifecycle | Long-lived production service | Throwaway experiments |
| Dependencies | Go + Neo4j driver + Kafka | Go + Neo4j driver + Neptune SDK + OpenSearch SDK + vector libs |
| CI/CD | Build → Docker → ECS deploy | Build → run locally or on EC2 |
| Data generators | Not appropriate in prod code | Core purpose |
| Risk | Benchmark code could accidentally ship | Isolated |
Repo structure:
    graph-capacity-experiments/
    ├── cmd/
    │   ├── datagen/          # Synthetic user + relationship generator (uses real catalog)
    │   │   └── main.go
    │   ├── loader/           # Multi-backend data loader
    │   │   └── main.go
    │   └── benchmark/        # Benchmark runner
    │       └── main.go
    ├── scripts/
    │   ├── export_catalog.sh             # Export real products/categories/retailers from Neo4j
    │   ├── export_snowflake_catalog.py   # Export larger catalog from Snowflake
    │   └── embed.py                      # Generate embeddings (OpenAI or sentence-transformer)
    ├── internal/
    │   ├── datagen/          # Data generation logic
    │   │   ├── users.go          # Synthetic user generation
    │   │   ├── purchases.go      # Purchase relationship generation (assigns real products to fake users)
    │   │   └── distributions.go  # Statistical distributions matching prod
    │   ├── loader/           # Backend-specific loaders
    │   │   ├── neo4j.go          # Bolt protocol (EC2 + AuraDB)
    │   │   ├── neptune.go        # openCypher over HTTPS
    │   │   └── vectors.go        # Vector-specific loaders (Valkey VSS, OpenSearch)
    │   ├── benchmark/        # Benchmark queries + harness
    │   │   ├── queries.go        # Standard query set
    │   │   ├── runner.go         # Execution + timing
    │   │   └── report.go         # Results formatting
    │   └── model/            # Shared data model (mirrors consumer-graph-worker types)
    │       └── types.go
    ├── data/
    │   └── catalog/          # Real product/category/retailer data (exported, gitignored)
    ├── infra/                # FSD configs for experiment instances
    │   ├── experiment-neo4j-ec2.yml
    │   └── experiment-neptune.yml
    ├── results/              # Benchmark results (committed for reference)
    │   └── .gitkeep
    ├── go.mod
    ├── go.sum
    ├── Makefile
    └── README.md

Alternative considered: Keep in consumer-graph-worker under experiments/. Rejected because it adds unnecessary dependencies to the production module and blurs the boundary.
3. Infrastructure Plan
3a. EC2 Neo4j Stress Test Instance
Clone the existing consumer-graph-neo4j-ec2.yml with modifications:

    variables:
      default:
        instance_type: 'r6i.xlarge'  # Start with same as prod (32GB)
        data_volume_size: '500'
      stage:
        instance_type: 'r6i.xlarge'
        data_volume_size: '500'

    tags:
      service: consumer-graph-neo4j-experiment
      purpose: capacity-testing
      ttl: 30d  # Remind to tear down

Deploy to stage account only (cheaper, no prod risk):

    fsd service ec2 deploy --env stage --account stage-services experiment-neo4j-ec2.yml

Later, to test larger instances: change instance_type to r6i.2xlarge (64GB) or r6i.4xlarge (128GB) and redeploy.
Cost: r6i.xlarge on-demand = ~$0.252/hr = ~$6/day. Budget ~$200 for a month of experiments.
3b. AuraDB
Use AuraDB Professional (not Free: its 200K node limit is too restrictive):
- Create via AuraDB Console
- Region: us-east-1 (same as our infra)
- Size: Start with 2GB RAM, scale up as needed
- Cost: ~$65/mo for 2GB, ~$130/mo for 4GB
- Connection: Bolt protocol (same Neo4j Go driver, different connection URI)
No FSD config needed — AuraDB is fully managed by Neo4j Inc.
3c. Neptune Analytics
Create a Neptune Analytics graph (fully managed, no instance provisioning; capacity is set as provisioned memory in m-NCUs):

    aws neptune-graph create-graph \
      --graph-name consumer-graph-experiment \
      --provisioned-memory 128 \
      --vector-search-configuration dimension=1024 \
      --region us-east-1

- Uses openCypher (compatible with Neo4j Cypher, with caveats)
- Has native vector search built in
- Pricing: billed per m-NCU-hour of provisioned memory for as long as the graph exists, not per query; confirm the current rate before sizing the budget
- No FSD config needed — use AWS CLI or CloudFormation
3d. Vector Storage Instances
| Backend | Setup | Notes |
|---|---|---|
| Neo4j native (HNSW) | Already on the EC2 experiment instance | CREATE VECTOR INDEX |
| Neptune Analytics vectors | Already included in Neptune graph | Built-in |
| Valkey VSS | Create a separate Valkey node or use existing stage cache | Needs a Valkey server with the vector search module loaded; valkey-cli or redis-cli works as the client |
| OpenSearch Serverless | Create a vector search collection | aws opensearchserverless create-collection |
4. Synthetic Data Generation
4a. Data Model
Matches the production schema exactly:

    (:User {user_id, timezone, created_at, last_updated_at})
      -[:PURCHASED {times, last, timestamps[], receipt_ids[], avg_interval_days, repurchase_likelihood}]->
    (:Product {product_id, name, brand, category, created_at})
      -[:IN_CATEGORY]->
    (:Category {category_id, name})

    (:User)-[:SHOPS_IN {purchase_count}]->(:Category)
    (:User)-[:SHOPS_AT {frequency, last_visit}]->(:Retailer {name, venue_type})
    (:User)-[:MEMBER_OF]->(:Community {community_id, name, primary_category, member_count, zip_code})

4b. Distribution Matching
Use distributions measured from production (from capacity planning doc):
| Relationship | Distribution | Params |
|---|---|---|
| PURCHASED per user | Log-normal | mean=5.8, median=3, p95=20, max=103 |
| SHOPS_IN per user | Log-normal | mean=5.7, p95=17 |
| MEMBER_OF per user | Log-normal | mean=2.6, p95=6 |
| SHOPS_AT per user | Log-normal | mean=1.6, p95=4 |
| Products (shared) | Power-law | ~2.5 products per user (amortized), popular products purchased by many users |
| Categories | Fixed catalog | ~50 realistic categories (Dairy, Bakery, Snacks, etc.) |
| Retailers | Fixed catalog | ~500 realistic retailer names |
| Communities | Derived | ~3 per zip code × category combination |
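The PURCHASED row gives enough to pin down a log-normal: the median fixes mu and the 95th percentile fixes sigma. A minimal sketch of that derivation and a sampler, assuming every synthetic user gets at least one purchase (that floor, and all function names, are assumptions of this sketch):

```python
import math
import random

Z_95 = 1.6449  # standard normal quantile for the 95th percentile


def lognormal_params(median: float, p95: float) -> tuple[float, float]:
    """Derive (mu, sigma) of a log-normal from its median and 95th percentile."""
    mu = math.log(median)
    sigma = (math.log(p95) - mu) / Z_95
    return mu, sigma


def sample_purchase_count(rng: random.Random, mu: float, sigma: float, cap: int = 103) -> int:
    """Draw one per-user PURCHASED count, floored at 1 and clamped to the observed max."""
    value = round(rng.lognormvariate(mu, sigma))
    return max(1, min(cap, value))


mu, sigma = lognormal_params(median=3, p95=20)
# Mean implied by the fitted parameters; it lands close to the measured mean of 5.8.
implied_mean = math.exp(mu + sigma ** 2 / 2)
```

The implied mean agreeing with the measured one suggests the three published statistics are mutually consistent with a log-normal. Rounding and clamping shift the sampled mean slightly, so datagen output should still be checked against the table.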
4c. Real Catalog Data (Products, Categories, Retailers)
Products, categories, and retailers use real Fetch data, not synthetic names. This ensures embeddings reflect actual product semantics and similarity searches return meaningful results.
Current prod Neo4j catalog (as of 2026-02-16, from 30 backfilled users)
| Entity | Count | Properties |
|---|---|---|
| Products | 4,586 | product_id, name, brand, category |
| Categories | 2,970 | category_id, name (3-level hierarchy: GROCERY\|DAIRY\|MILK) |
| Retailers | 212 | retailer_id, name, venue_type |
Data source strategy
| Scale | Products Needed | Source | Method |
|---|---|---|---|
| Quick-start (≤50K users) | ~5K | Prod Neo4j export | Cypher query → CSV |
| Medium (100K–500K users) | ~60K | Snowflake export | SQL query → CSV |
| Large (1M+ users) | ~400K+ | Snowflake export | SQL query → CSV |
Source 1: Prod Neo4j export (immediate, no extra access needed)
    // Export products
    MATCH (p:Product)
    RETURN p.product_id AS product_id, p.name AS name, p.brand AS brand, p.category AS category

    // Export categories
    MATCH (c:Category)
    RETURN c.category_id AS category_id, c.name AS name

    // Export retailers
    MATCH (r:Retailer)
    RETURN r.retailer_id AS retailer_id, r.name AS name, r.venue_type AS venue_type

    // Export product→category mapping
    MATCH (p:Product)-[:IN_CATEGORY]->(c:Category)
    RETURN p.product_id AS product_id, c.category_id AS category_id

A scripts/export_catalog.sh script runs these via the Neo4j HTTP API and writes CSVs. This gives us ~4.5K real products with names like “Hormel Black Label Thick Cut Maple Bacon - 12 Oz” and real brands like HORMEL, GREAT VALUE, CELSIUS.
Source 2: Snowflake export (for larger catalogs)
    -- Unique products from receipt items (full Fetch catalog)
    SELECT DISTINCT
      i.FIDO AS product_id,
      i.DESCRIPTION AS name,
      i.BRAND AS brand,
      COALESCE(i.CATEGORY_1, 'UNCATEGORIZED') AS category_l1,
      i.CATEGORY_2 AS category_l2,
      i.CATEGORY_3 AS category_l3
    FROM FETCH_SERVICES_PROD.RECEIPT_SERVICE.RECEIPT_ITEMS i
    WHERE i.FIDO IS NOT NULL
      AND i.DESCRIPTION IS NOT NULL
      AND i.DESCRIPTION != ''
    LIMIT 500000;

    -- Unique retailers
    SELECT DISTINCT
      r.STORE_NAME AS name,
      r.RETAILER_CHANNEL AS venue_type
    FROM FETCH_SERVICES_PROD.RECEIPT_SERVICE.RECEIPTS r
    WHERE r.STORE_NAME IS NOT NULL
      AND r.STORE_NAME != '';

Run via Snowflake CLI (snowsql) or the Snowflake Python connector. Export to CSV, then use in datagen.
Source 3: Purchase History API (alternative, slower)
If Snowflake access is not available, we can discover more products by calling the Purchase History API for a batch of user IDs. Each user averages ~5.8 unique products. Querying ~30K users at 10 req/s (~50 min) would yield ~50-60K unique products. This uses the existing backfill infrastructure but is slower than a direct Snowflake query.
How datagen uses the real catalog
- Load real product/category/retailer CSVs (exported from Neo4j or Snowflake)
- Generate synthetic users with fake user_ids
- Assign real products to fake users following a power-law (Zipf) distribution — popular products (bananas, eggs, milk) purchased by many users, long-tail products by few
- Build SHOPS_IN, SHOPS_AT, MEMBER_OF relationships from the assigned purchases
- Communities are derived from category + zip code combinations (synthetic zip codes, real categories)
Product popularity distribution from prod (top products by buyer count):
| Product | Buyers | Category |
|---|---|---|
| Fresh Fruits | 12 | Pantry |
| Fresh Vegetables | 10 | Pantry |
| Eggs | 10 | Pantry |
| Fresh Bananas | 9 | Pantry |
| Fresh Blueberries | 8 | Pantry |
| Avocados | 6 | PRODUCE\|FRUITS\|AVOCADOS |
The Zipf distribution in datagen should match this pattern: ~40% of purchases hit the top 5% of products.
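The "40% of purchases in the top 5% of products" target can be turned into a concrete Zipf exponent by searching for the value that reproduces it. A minimal sketch, assuming weights proportional to 1/rank^s over the current catalog size (function names and the bisection bounds are choices made for this example):

```python
def top_share(n_products: int, exponent: float, top_fraction: float = 0.05) -> float:
    """Share of total purchase weight held by the top `top_fraction` of products."""
    weights = [1.0 / rank ** exponent for rank in range(1, n_products + 1)]
    top_n = max(1, int(n_products * top_fraction))
    return sum(weights[:top_n]) / sum(weights)


def calibrate_exponent(n_products: int, target_share: float = 0.40) -> float:
    """Bisect for the Zipf exponent whose top-5% share matches the target."""
    lo, hi = 0.0, 3.0  # the top share grows monotonically with the exponent
    for _ in range(50):
        mid = (lo + hi) / 2
        if top_share(n_products, mid) < target_share:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


# 4586 is the product count in the current prod catalog (section 4c).
exponent = calibrate_exponent(4586)
```

The exponent depends on catalog size, so it would need recalibrating for the larger Snowflake catalogs rather than being hard-coded.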
4d. Vector Embeddings
All embeddings use real models so that kNN results reflect actual semantic similarity (not random noise). Three embedding approaches:
| Model | Dimensions | Speed | Cost | Where it runs |
|---|---|---|---|---|
| OpenAI text-embedding-3-small | 512 (via dimensions param) | ~3,000 items/min | ~$0.02/1M tokens | API call |
| OpenAI text-embedding-3-large | 1024 (via dimensions param) | ~2,500 items/min | ~$0.13/1M tokens | API call |
| Sentence Transformer (all-MiniLM-L6-v2) | 384 | ~10,000 items/min | Free | Local (Python) |
OpenAI embedding (512-dim) — good quality, low cost:
    # Using text-embedding-3-small with dimensions=512
    import openai  # openai>=1.0; reads OPENAI_API_KEY from the environment

    response = openai.embeddings.create(
        model="text-embedding-3-small",
        input="Organic Whole Milk, Horizon, Dairy",
        dimensions=512,
    )

- Input text per product: "{name}, {brand}, {category}" (e.g. “Organic Whole Milk, Horizon, Dairy”)
- Input text per user: Concatenation of their top-5 purchased product names + top-3 categories
- Cost estimate: 1M products × ~10 tokens each = 10M tokens → ~$0.20
OpenAI embedding (1024-dim) — highest quality, tests scaling with larger vectors:
    # Using text-embedding-3-large with dimensions=1024
    import openai  # openai>=1.0; reads OPENAI_API_KEY from the environment

    response = openai.embeddings.create(
        model="text-embedding-3-large",
        input="Organic Whole Milk, Horizon, Dairy",
        dimensions=1024,
    )

- Same input text format as 512-dim
- Cost estimate: 1M products × ~10 tokens each = 10M tokens → ~$1.30
- 2× storage and index overhead vs 512-dim — tests whether higher quality justifies the cost
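Both cost estimates follow the same formula, which is worth keeping as a helper so the numbers can be redone for the larger Snowflake catalog. A small sketch, assuming the ~10 tokens per product and the per-million-token prices quoted in the table above (the function name is illustrative):

```python
def embedding_cost_usd(items: int, tokens_per_item: float, usd_per_million_tokens: float) -> float:
    """API cost of embedding `items` texts averaging `tokens_per_item` tokens each."""
    total_tokens = items * tokens_per_item
    return total_tokens / 1_000_000 * usd_per_million_tokens


small_cost = embedding_cost_usd(1_000_000, 10, 0.02)  # text-embedding-3-small
large_cost = embedding_cost_usd(1_000_000, 10, 0.13)  # text-embedding-3-large
```

The token count per product is the softest input here; measuring it on a sample of real product strings would tighten the estimate.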
Sentence Transformer — free, local, 384-dim:
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer('all-MiniLM-L6-v2')  # 384-dim
    embeddings = model.encode(["Organic Whole Milk, Horizon, Dairy", ...])

- Same input text format as OpenAI
- Runs on local machine (CPU is fine for <1M items)
- No API cost, fully offline
Embedding pipeline:
- Real product catalog is exported from Neo4j or Snowflake (see section 4c)
- A Python script (scripts/embed.py) reads the product CSV and generates embeddings via OpenAI API or local sentence-transformer
- Embeddings are saved as .npy files (product_id → float32 array)
- datagen assigns real products to synthetic users; user embeddings = weighted average of their purchased product embeddings (weighted by purchase count)
Recommended approach:
- Use OpenAI 512-dim (text-embedding-3-small) as the primary embedding for most experiments (good quality, low cost)
- Use OpenAI 1024-dim (text-embedding-3-large) to test whether higher dimensionality improves recall enough to justify 2× storage/index overhead
- Use sentence-transformer 384-dim as a free alternative for rapid iteration and local development
- Compare all three on HNSW index size, kNN recall, and query latency to determine the best quality/cost/performance tradeoff
Embeddings go on:
- Product nodes: Embed "{name}, {brand}, {category}", which represents the product semantically
- User nodes: Weighted average of purchased product embeddings, which represents the user’s purchase behavior profile
- Category nodes (optional): Embed category name — enables category-level similarity search
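The user-node embedding described above is a purchase-count-weighted mean of product vectors. A minimal pure-Python sketch of that computation (datagen itself is planned in Go; the function name and the dict-based inputs are assumptions for illustration):

```python
def user_embedding(
    purchases: dict[str, int],
    product_vectors: dict[str, list[float]],
) -> list[float]:
    """Purchase-count-weighted mean of the product vectors one user bought."""
    # Ignore products without an embedding and non-positive counts.
    known = {pid: n for pid, n in purchases.items() if pid in product_vectors and n > 0}
    if not known:
        raise ValueError("user has no purchases with a known embedding")
    total = sum(known.values())
    dims = len(next(iter(product_vectors.values())))
    out = [0.0] * dims
    for pid, count in known.items():
        weight = count / total
        for i, x in enumerate(product_vectors[pid]):
            out[i] += x * weight
    return out
```

For example, a user with three milk purchases and one egg purchase ends up three quarters of the way toward the milk vector. Because the indexes use cosine similarity, re-normalizing the averaged vector is optional.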
4e. CLI Interface
    # Step 0: Export real catalog from prod Neo4j (one-time)
    ./scripts/export_catalog.sh --env prod --output data/catalog/
    # Or from Snowflake for a larger catalog:
    python scripts/export_snowflake_catalog.py --output data/catalog/ --limit 500000

    # Catalog output:
    #   data/catalog/products.csv     (product_id, name, brand, category)
    #   data/catalog/categories.csv   (category_id, name)
    #   data/catalog/retailers.csv    (retailer_id, name, venue_type)
    #   data/catalog/in_category.csv  (product_id, category_id)

    # Step 1: Generate embeddings for real products (Python, one-time per model)
    python scripts/embed.py \
      --products data/catalog/products.csv \
      --model openai-small --dimensions 512 \
      --output data/catalog/embeddings-512/

    python scripts/embed.py \
      --products data/catalog/products.csv \
      --model openai-large --dimensions 1024 \
      --output data/catalog/embeddings-1024/

    python scripts/embed.py \
      --products data/catalog/products.csv \
      --model sentence-transformer \
      --output data/catalog/embeddings-384/

    # Embedding output (per model):
    #   data/catalog/embeddings-512/products.npy    (float32 array)
    #   data/catalog/embeddings-512/manifest.json   (model, dimensions, count, cost)

    # Step 2: Generate synthetic users + purchase relationships (Go)
    ./datagen \
      --users 100000 \
      --seed 42 \
      --catalog data/catalog/ \
      --embeddings data/catalog/embeddings-512/ \
      --output data/100k/

    # Generated output (synthetic):
    #   data/100k/users.csv                 (user_id, timezone, created_at)
    #   data/100k/embeddings/users.npy      (weighted avg of product embeddings)
    #   data/100k/embeddings/manifest.json  (model, dimensions, count)
    #   data/100k/communities.csv           (community_id, name, primary_category, zip_code)
    #   data/100k/purchased.csv             (user_id, product_id, times, last, ...)
    #   data/100k/shops_in.csv              (user_id, category_id, purchase_count)
    #   data/100k/shops_at.csv              (user_id, retailer_name, frequency, last_visit)
    #   data/100k/member_of.csv             (user_id, community_id)
    #   data/100k/manifest.json             (metadata: counts, seed, generation time)
    #
    # Products, categories, retailers, and in_category are real data from
    # data/catalog/, shared across all dataset sizes and not regenerated.

5. Experiment 1: EC2 Neo4j Stress Test
Objective
Find the maximum user count where:
- Read query p95 < 100ms
- Write throughput > 100 nodes/sec
- Page cache hit ratio > 95%
- No OOM crashes
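These thresholds can be checked mechanically against each checkpoint's metrics record. A minimal sketch, assuming the record layout shown in the Metrics Framework section and that the 100ms limit applies to every query in `read_metrics` (the plan does not say which queries count; the `oom` flag and function name are hypothetical):

```python
READ_P95_MAX_MS = 100.0
WRITE_MIN_NODES_PER_SEC = 100.0
PAGE_CACHE_MIN_HIT_RATIO = 0.95


def failed_thresholds(checkpoint: dict, oom: bool = False) -> list[str]:
    """Return the reasons a checkpoint record misses the stress-test thresholds."""
    failures = []
    for query, stats in checkpoint["read_metrics"].items():
        if stats["p95_ms"] >= READ_P95_MAX_MS:
            failures.append(f"{query}: p95 {stats['p95_ms']}ms")
    if checkpoint["write_metrics"]["nodes_per_sec"] <= WRITE_MIN_NODES_PER_SEC:
        failures.append("write throughput")
    if checkpoint["system_metrics"]["page_cache_hit_ratio"] <= PAGE_CACHE_MIN_HIT_RATIO:
        failures.append("page cache hit ratio")
    if oom:
        failures.append("OOM crash")
    return failures
```

An empty list means the checkpoint passes. If the global aggregation query is meant as a stress probe rather than a production read path, it would need to be excluded before applying the read limit.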
Procedure
- Deploy fresh experiment EC2 (r6i.xlarge, 32GB)
- Load data incrementally:
| Checkpoint | Users | Est. Graph Size | Est. + Vectors (512-dim) | Est. + Vectors (1024-dim) |
|---|---|---|---|---|
| C1 | 10,000 | ~43 MB | ~150 MB | ~260 MB |
| C2 | 50,000 | ~215 MB | ~650 MB | ~1.1 GB |
| C3 | 100,000 | ~430 MB | ~1.3 GB | ~2.2 GB |
| C4 | 250,000 | ~1.1 GB | ~3.1 GB | ~5.3 GB |
| C5 | 500,000 | ~2.1 GB | ~6 GB | ~10.5 GB |
| C6 | 1,000,000 | ~4.3 GB | ~11.5 GB | ~20 GB |
| C7 | 2,000,000 | ~8.6 GB | ~23 GB | ~40 GB |
Vector estimates (users + products both embedded, HNSW index overhead ≈ 1.5× vector storage):
- 512-dim float32 = ~2.1 KB/embedding
- 1024-dim float32 = ~4.2 KB/embedding
- At each checkpoint, run the benchmark suite
- Stop when performance degrades below thresholds
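The per-embedding sizes in the vector estimates come straight from the dimension count, so they are easy to recompute for other models. A rough sketch, reading the index overhead as 1.5× added on top of the raw vectors (that reading is an assumption; the sketch does not try to reproduce the checkpoint table, whose embedding counts per checkpoint are not stated):

```python
BYTES_PER_FLOAT32 = 4


def raw_embedding_bytes(dimensions: int) -> int:
    """Size of one float32 embedding, excluding any property or index overhead."""
    return dimensions * BYTES_PER_FLOAT32


def vector_footprint_mb(embeddings: int, dimensions: int, index_overhead: float = 1.5) -> float:
    """Raw vectors plus an HNSW index assumed to cost `index_overhead` x the raw size."""
    raw = embeddings * raw_embedding_bytes(dimensions)
    return raw * (1 + index_overhead) / 1_000_000
```

Raw storage comes to 2,048 bytes per 512-dim vector and 4,096 bytes per 1024-dim vector, slightly under the rounded figures above. Measured `vector_index_size_mb` from the first checkpoints should replace the assumed overhead as soon as it is available.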
Benchmark Query Set
    // Q1: Single user lookup (point query)
    MATCH (u:User {user_id: $uid})-[r:PURCHASED]->(p:Product)
    RETURN u, r, p

    // Q2: 2-hop category aggregation
    MATCH (u:User {user_id: $uid})-[:PURCHASED]->(p:Product)-[:IN_CATEGORY]->(c:Category)
    RETURN c.name, count(p) AS products
    ORDER BY products DESC

    // Q3: Community-based recommendation (expensive)
    MATCH (u:User {user_id: $uid})-[:MEMBER_OF]->(comm:Community)<-[:MEMBER_OF]-(other:User)
    MATCH (other)-[:PURCHASED]->(p:Product)
    WHERE NOT (u)-[:PURCHASED]->(p)
    RETURN p.name, count(DISTINCT other) AS score
    ORDER BY score DESC LIMIT 10

    // Q4: User's full profile (all relationship types)
    MATCH (u:User {user_id: $uid})
    OPTIONAL MATCH (u)-[pur:PURCHASED]->(prod:Product)
    OPTIONAL MATCH (u)-[si:SHOPS_IN]->(cat:Category)
    OPTIONAL MATCH (u)-[sa:SHOPS_AT]->(ret:Retailer)
    OPTIONAL MATCH (u)-[mo:MEMBER_OF]->(comm:Community)
    RETURN count(DISTINCT pur) AS purchases,
           count(DISTINCT si) AS categories,
           count(DISTINCT sa) AS retailers,
           count(DISTINCT mo) AS communities

    // Q5: Global aggregation (stress test)
    MATCH (u:User)-[:PURCHASED]->(p:Product)
    WITH p, count(u) AS buyers
    ORDER BY buyers DESC LIMIT 20
    RETURN p.name, buyers

    // Q6: Vector similarity (only when vectors loaded)
    CALL db.index.vector.queryNodes('product-embedding-index', 10, $queryVector)
    YIELD node, score
    RETURN node.product_id, node.name, score

    // Q7: Graph + vector combined (vector candidates filtered by the user's purchase history)
    MATCH (u:User {user_id: $uid})-[:PURCHASED]->(p:Product)
    WITH collect(p) AS purchased
    CALL db.index.vector.queryNodes('product-embedding-index', 50, $queryVector)
    YIELD node, score
    WHERE NOT node IN purchased
    RETURN node.product_id, node.name, score LIMIT 10

Each query runs 100 iterations with random user IDs. Record p50, p95, p99, max latency.
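Turning the raw timings of one query into the recorded statistics could look like the following. This is a sketch using the nearest-rank percentile definition, which is one of several common conventions (the benchmark tool itself is planned in Go; the function names are illustrative):

```python
import math


def percentile(sorted_ms: list[float], pct: int) -> float:
    """Nearest-rank percentile of an ascending list of latencies."""
    rank = max(1, math.ceil(pct * len(sorted_ms) / 100))
    return sorted_ms[rank - 1]


def summarize(latencies_ms: list[float]) -> dict[str, float]:
    """Collapse per-iteration latencies into the fields recorded per query."""
    if not latencies_ms:
        raise ValueError("no latencies recorded")
    ordered = sorted(latencies_ms)
    return {
        "p50_ms": percentile(ordered, 50),
        "p95_ms": percentile(ordered, 95),
        "p99_ms": percentile(ordered, 99),
        "max_ms": ordered[-1],
    }
```

With only 100 iterations, p99 is determined by the two slowest samples, so it will be noisy; raising the iteration count for the cheap queries would make that column more trustworthy.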
Run Matrix
| Run | Instance | Vectors | Embedding Model | Max Users |
|---|---|---|---|---|
| R1 | r6i.xlarge (32GB) | No | — | Until degradation |
| R2 | r6i.xlarge (32GB) | Yes (512-dim) | OpenAI text-embedding-3-small | Until degradation |
| R3 | r6i.xlarge (32GB) | Yes (384-dim) | Sentence Transformer all-MiniLM-L6-v2 | Until degradation |
| R4 | r6i.xlarge (32GB) | Yes (1024-dim) | OpenAI text-embedding-3-large | Until degradation |
| R5 | r6i.2xlarge (64GB) | Yes (512-dim) | OpenAI text-embedding-3-small | Until degradation |
| R6 | r6i.2xlarge (64GB) | Yes (1024-dim) | OpenAI text-embedding-3-large | Until degradation |
6. Experiment 2: AuraDB Evaluation
Objective
Compare AuraDB Professional vs self-managed EC2 Neo4j on latency, throughput, and cost.
Procedure
- Create AuraDB Professional instance (us-east-1, 4GB RAM)
- Load 100K users (same dataset as EC2 experiment)
- Run identical benchmark query set
- Scale to 500K, 1M if 100K passes
- Test vector support (AuraDB Professional supports vector indexes)
Key Questions
- Cypher compatibility: Is our production Cypher 100% compatible? (MERGE, UNWIND, CASE WHEN, datetime, array properties)
- Write throughput: How does Bolt-over-internet compare to Bolt-over-VPC?
- Latency: Network hop to AuraDB vs local VPC EC2
- Cost: AuraDB pricing vs EC2 + ops overhead
- Vector support: Same HNSW API as Community Edition?
Loader Differences
Same Neo4j Go driver, different connection string:

    // EC2
    driver, _ := neo4j.NewDriverWithContext("neo4j://10.4.19.205:7687", auth)

    // AuraDB
    driver, _ := neo4j.NewDriverWithContext("neo4j+s://xxxxx.databases.neo4j.io", auth)

7. Experiment 3: Neptune Analytics
Objective
Evaluate Neptune Analytics as a unified graph + vector database, replacing both Neo4j and a separate vector store.
Procedure
- Create Neptune Analytics graph (128 m-NCUs, 1024-dim vector config)
- Load data via openCypher endpoint or Bulk Load from S3
- Test Cypher compatibility (Neptune uses openCypher, not full Cypher)
- Run benchmark queries (translate where needed)
- Test vector search with neptune.algo.vectors.topKByNode() and topKByEmbedding()
- Measure combined graph traversal + vector similarity queries
Cypher Compatibility Risks
Neptune’s openCypher has known gaps vs Neo4j Cypher:
| Feature | Neo4j | Neptune | Risk |
|---|---|---|---|
| MERGE with ON CREATE/ON MATCH | Yes | Yes | Low |
| UNWIND | Yes | Yes | Low |
| CASE WHEN in SET | Yes | Partial | Medium — test the receipt_ids dedup pattern |
| datetime() | Yes | timestamp() | High — different function names |
| Array properties | Yes | Yes (with caveats) | Medium — test append operations |
| CALL procedures | Yes (APOC, vector) | Different API | High — vector queries use different syntax |
| CREATE VECTOR INDEX | Yes (Neo4j 5.x) | Not applicable | N/A — vectors are built-in differently |
Neptune Vector Search API
    // Neptune Analytics vector similarity (different from Neo4j).
    // topKByEmbedding takes the query vector and an options map; it needs no preceding MATCH.
    CALL neptune.algo.vectors.topKByEmbedding($queryVector, {topK: 10})
    YIELD node, score
    RETURN node.product_id, score

Data Loading
Neptune supports bulk loading from S3 (CSV format with specific headers):

    # Upload generated CSV to S3
    aws s3 cp data/100k/ s3://experiment-bucket/neptune/100k/ --recursive

    # Start bulk load
    aws neptune-graph start-import-task \
      --graph-identifier consumer-graph-experiment \
      --source s3://experiment-bucket/neptune/100k/ \
      --role-arn arn:aws:iam::<account-id>:role/NeptuneBulkLoadRole \
      --format OPEN_CYPHER

8. Experiment 4: Vector Storage Comparison
Objective
Compare vector storage options for kNN similarity search at scale. Find the best option for our latency, cost, and operational requirements.
Test Matrix
| Backend | Setup | kNN API |
|---|---|---|
| Neo4j HNSW | Vector index on experiment EC2 | db.index.vector.queryNodes() |
| Neptune Analytics | Built into Neptune graph | neptune.algo.vectors.topKByEmbedding() |
| Valkey VSS | FT.CREATE with HNSW, FT.SEARCH with KNN | FT.SEARCH idx "*=>[KNN 10 @vec $blob]" |
| OpenSearch Serverless | Vector search collection | knn query via REST API |
Benchmark
For each backend, at 100K and 1M vectors (1024-dim):
- Insert throughput: vectors/sec for bulk load
- kNN latency: k=10, k=50, k=100 — p50, p95, p99
- Recall@10: Against brute-force exact results (measures HNSW approximation quality)
- Combined query: kNN candidates → graph enrichment (two-hop) — end-to-end latency
- Memory usage: Index size in RAM
- Cost: $/month at steady state
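Recall@10 needs an exact baseline to compare each backend's approximate results against. A minimal pure-Python sketch of brute-force cosine kNN and the recall calculation (fine for spot-checking a sample of queries; a vectorized version would be needed at 1M vectors, and all names here are illustrative):

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def exact_top_k(query: list[float], vectors: dict[str, list[float]], k: int) -> list[str]:
    """Ground-truth neighbours: score every vector and keep the k most similar IDs."""
    ranked = sorted(vectors, key=lambda vid: cosine(query, vectors[vid]), reverse=True)
    return ranked[:k]


def recall_at_k(approx_ids: list[str], exact_ids: list[str], k: int) -> float:
    """Fraction of the true top-k that the approximate index returned."""
    truth = set(exact_ids[:k])
    if not truth:
        raise ValueError("exact result set is empty")
    return len(set(approx_ids[:k]) & truth) / len(truth)
```

Averaging `recall_at_k` over a few hundred random query vectors per backend gives a single comparable number for the test matrix.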
Combined Graph + Vector Query Pattern
The production use case is:
- Vector search → Find 50 candidate products similar to user’s embedding
- Graph filter → Remove products the user already purchased
- Graph enrich → Get category, brand, retailer info for remaining candidates
- Graph rerank → Boost candidates purchased by users in the same community
This requires either:
- Single-engine (Neptune Analytics, Neo4j native vectors): One query does it all
- Two-engine (Neo4j + Valkey, Neo4j + OpenSearch): Vector search → ID list → graph query
Measure end-to-end latency for both patterns.
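For the two-engine pattern, the application has to stitch the steps together itself. A sketch of that glue logic with in-memory stand-ins for both engines (the dict inputs, the additive community boost, and its 0.1 weight are all assumptions for illustration; in the real benchmark each input would come from a vector store or Neo4j query):

```python
def recommend(
    user_id: str,
    candidates: list[tuple[str, float]],   # (product_id, similarity) from the vector store
    purchased: dict[str, set[str]],        # user_id -> product_ids already bought
    community_buyers: dict[str, int],      # product_id -> buyers in the user's communities
    catalog: dict[str, dict],              # product_id -> category/brand/retailer info
    limit: int = 10,
    boost: float = 0.1,
) -> list[dict]:
    """Filter, rerank, and enrich vector-search candidates for one user."""
    owned = purchased.get(user_id, set())
    # Graph filter: drop products the user already purchased.
    fresh = [(pid, score) for pid, score in candidates if pid not in owned]
    # Graph rerank: boost candidates bought by users in the same community.
    reranked = sorted(
        ((pid, score + boost * community_buyers.get(pid, 0)) for pid, score in fresh),
        key=lambda item: item[1],
        reverse=True,
    )
    # Graph enrich: attach catalog attributes to the surviving candidates.
    return [{"product_id": pid, "score": score, **catalog.get(pid, {})} for pid, score in reranked[:limit]]
```

In the single-engine case the same four steps collapse into one query, so timing this function end to end (including its two network round trips) is what makes the comparison fair.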
9. Metrics Framework
What to Capture at Every Checkpoint

    {
      "experiment": "ec2-neo4j-stress",
      "checkpoint": "C3",
      "users": 100000,
      "timestamp": "2026-02-20T14:30:00Z",
      "backend": {
        "type": "neo4j-ec2",
        "instance": "r6i.xlarge",
        "memory_gb": 32,
        "version": "5.15.0"
      },
      "data": {
        "total_nodes": 180000,
        "total_relationships": 1570000,
        "store_size_mb": 430,
        "vector_index_size_mb": 0
      },
      "write_metrics": {
        "load_duration_sec": 120,
        "nodes_per_sec": 1500,
        "rels_per_sec": 13000,
        "batch_size": 1000
      },
      "read_metrics": {
        "Q1_point_lookup": {"p50_ms": 2, "p95_ms": 5, "p99_ms": 12},
        "Q2_2hop_category": {"p50_ms": 5, "p95_ms": 15, "p99_ms": 30},
        "Q3_community_rec": {"p50_ms": 50, "p95_ms": 120, "p99_ms": 250},
        "Q4_full_profile": {"p50_ms": 3, "p95_ms": 8, "p99_ms": 20},
        "Q5_global_agg": {"p50_ms": 200, "p95_ms": 500, "p99_ms": 1000},
        "Q6_vector_knn": {"p50_ms": 0, "p95_ms": 0, "p99_ms": 0},
        "Q7_graph_plus_vector": {"p50_ms": 0, "p95_ms": 0, "p99_ms": 0}
      },
      "system_metrics": {
        "page_cache_hit_ratio": 0.99,
        "heap_used_mb": 3200,
        "heap_max_mb": 8192,
        "cpu_percent": 15,
        "disk_iops": 120
      },
      "cost": {
        "instance_hourly": 0.252,
        "estimated_monthly": 181
      }
    }

Results are committed to results/ in the experiment repo for historical comparison.
10. Gradual Data Feeding Strategy
Approach: Incremental Loading
Don’t load all data at once. Load in checkpoints so we can measure the system at each scale.

    Checkpoint C1:  10K users ──┐
                                ├── Benchmark suite
    Checkpoint C2:  50K users ──┤   (runs at each checkpoint)
                                ├──
    Checkpoint C3: 100K users ──┤
                                ├──
    Checkpoint C4: 250K users ──┤
                                ├──
    Checkpoint C5: 500K users ──┤
                                ├──
    Checkpoint C6:   1M users ──┤
                                ├──
    Checkpoint C7:   2M users ──┘   (EC2 may OOM here with vectors)

Loading Protocol (Same for All Backends)
- Pre-generate all data using datagen (deterministic seed, so regeneration is identical)
- Load in batches of 1,000 users: each batch includes the user + all their relationships
- Use MERGE (same as production) — allows re-running without duplicates
- After each checkpoint: Run full benchmark suite, save results, continue loading
- If backend crashes/degrades: Record the failure point, stop loading, note as the limit
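The protocol boils down to walking the user list in fixed-size batches and pausing whenever a checkpoint boundary is reached. A minimal sketch of that schedule (the loader is planned in Go; the generator name and tuple shape are assumptions of this example):

```python
from typing import Iterator

# Cumulative user counts for checkpoints C1..C7.
CHECKPOINTS = [10_000, 50_000, 100_000, 250_000, 500_000, 1_000_000, 2_000_000]


def load_plan(total_users: int, batch_size: int = 1_000) -> Iterator[tuple[int, int, bool]]:
    """Yield (start, end, is_checkpoint) index ranges covering users [0, total_users)."""
    marks = set(CHECKPOINTS)
    for start in range(0, total_users, batch_size):
        end = min(start + batch_size, total_users)
        # A checkpoint is reached when the cumulative count lands exactly on a mark.
        yield start, end, end in marks
```

A loader loop would MERGE users `start..end` for each tuple and run the benchmark suite whenever `is_checkpoint` is true. This relies on every checkpoint being a multiple of the batch size, which holds for 1,000-user batches.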
Per-Backend Loading
| Backend | Protocol | Batch Method |
|---|---|---|
| EC2 Neo4j | Bolt (Go driver) | UNWIND + MERGE (same as production writer) |
| AuraDB | Bolt (Go driver) | Same UNWIND + MERGE |
| Neptune Analytics | openCypher HTTPS or S3 bulk load | Bulk load for initial, incremental MERGE for checkpoints |
| Valkey VSS | Redis protocol | FT.CREATE index, HSET for each vector |
| OpenSearch Serverless | REST API | _bulk API for batch indexing |
11. Vector Storage Deep Dive
What to Try
A. Neo4j Native Vectors (HNSW)

    // Create vector index
    CREATE VECTOR INDEX `product-embedding-index`
    FOR (p:Product) ON (p.embedding)
    OPTIONS {indexConfig: {
      `vector.dimensions`: 512,
      `vector.similarity_function`: 'cosine'
    }}

    // Query
    CALL db.index.vector.queryNodes('product-embedding-index', 10, $queryVector)
    YIELD node, score
    RETURN node.product_id, score

Pros: Single query engine, simplest architecture
Cons: HNSW index must fit in RAM, scaling requires bigger instance
B. Neptune Analytics Vectors
    // Embeddings are written through the vectors API rather than as an ordinary node property
    MATCH (p:Product {product_id: $id})
    CALL neptune.algo.vectors.upsert(p, $vector)
    YIELD node, success
    RETURN success

    // Query (Neptune-specific syntax)
    CALL neptune.algo.vectors.topKByEmbedding($queryVector, {topK: 10})
    YIELD node, score
    RETURN node.product_id, score

Pros: Graph + vector in one engine, serverless scaling
Cons: Different query syntax, potential Cypher compatibility gaps
C. Valkey VSS (Vector Similarity Search)
    # Create index
    FT.CREATE product-idx ON HASH PREFIX 1 "product:" \
      SCHEMA embedding VECTOR HNSW 6 TYPE FLOAT32 DIM 1024 DISTANCE_METRIC COSINE

    # Insert
    HSET product:abc123 embedding <binary_vector> name "Product Name" category "Dairy"

    # kNN search
    FT.SEARCH product-idx "*=>[KNN 10 @embedding $query_vec AS score]" \
      PARAMS 2 query_vec <binary_vector> \
      RETURN 2 name score \
      SORTBY score ASC

Pros: Sub-millisecond latency, reuses existing Valkey infrastructure
Cons: Two-engine pattern (Valkey kNN → Neo4j graph enrichment), no graph traversal in vector query
D. OpenSearch Serverless (Vector Search Collection)
    // Create collection
    {
      "name": "product-vectors",
      "type": "VECTORSEARCH"
    }

    // Index document
    {
      "product_id": "abc123",
      "name": "Product Name",
      "embedding": [0.1, 0.2, ...]
    }

    // kNN query
    {
      "query": {
        "knn": {
          "embedding": {
            "vector": [0.1, 0.2, ...],
            "k": 10
          }
        }
      }
    }

Pros: Fully managed, scales independently, FAISS-backed
Cons: Higher latency (~10-50ms), separate service, two-engine pattern
12. Order of Operations
Phase 1: Foundation
- Create graph-capacity-experiments repo
- Export real product catalog from prod Neo4j (~4.5K products) via scripts/export_catalog.sh
- (Optional) Export larger catalog from Snowflake (~500K products) via scripts/export_snowflake_catalog.py
- Generate product embeddings for all three models (512-dim, 1024-dim, 384-dim) via scripts/embed.py
- Implement datagen tool (synthetic users + relationships using real catalog)
- Implement loader tool (Neo4j Bolt backend first)
- Implement benchmark tool (query runner + metrics capture)
- Deploy experiment EC2 Neo4j instance (stage account)
Phase 2: EC2 Neo4j Stress Test (Graph Only)
- Generate datasets: 10K → 2M users (no vectors)
- Run incremental load + benchmark (runs R1)
- Find the graph-only breaking point
- Document results
Phase 3: EC2 Neo4j + Vectors
- Create Neo4j vector indexes on experiment instance
- Run incremental load + benchmark with 512-dim (run R2)
- Run with 384-dim sentence-transformer (run R3)
- Run with 1024-dim on r6i.xlarge (run R4) — expect earlier OOM than 512-dim
- Test with larger instance r6i.2xlarge: 512-dim (run R5) and 1024-dim (run R6)
- Document vector overhead and breaking points per dimension
Phase 4: AuraDB
- Create AuraDB Professional instance
- Add AuraDB connection to loader
- Run 100K → 1M user benchmarks
- Compare with EC2 results
- Document cost model
Phase 5: Neptune Analytics
- Create Neptune Analytics graph
- Add Neptune openCypher loader
- Test Cypher compatibility (especially MERGE + CASE WHEN patterns)
- Run benchmarks (graph + vector combined)
- Document compatibility gaps and performance
Phase 6: Vector Comparison
- Set up Valkey VSS + OpenSearch Serverless
- Load same vector dataset to all four backends
- Run kNN benchmarks
- Run combined graph+vector end-to-end benchmarks
- Produce comparison matrix
Phase 7: Final Report
- Compile all results into a recommendation document
- Cost projections at 100K, 1M, 5M, 10M users
- Recommended architecture per scale tier
- Migration effort estimate for each option
13. Infrastructure Teardown
All experiment infrastructure should be tagged with purpose: capacity-testing and torn down after experiments:
- EC2 Neo4j experiment instance
- AuraDB Professional instance
- Neptune Analytics graph
- OpenSearch Serverless collection
- Any experiment Valkey nodes
- S3 buckets with generated data
Appendix A: Estimated Costs
| Resource | Duration | Cost |
|---|---|---|
| EC2 r6i.xlarge (experiment Neo4j) | 30 days | ~$180 |
| EC2 r6i.2xlarge (upgrade test) | 7 days | ~$85 |
| AuraDB Professional (4GB) | 30 days | ~$130 |
| Neptune Analytics (128 m-NCUs) | 30 days | ~$100-200 (verify against m-NCU-hour pricing) |
| OpenSearch Serverless (2 OCUs) | 7 days | ~$60 |
| S3 storage for datasets | 30 days | ~$5 |
| Total estimated | | ~$560-660 |
Appendix B: Key Dependencies
Go (datagen, loader, benchmark):

    // go.mod for graph-capacity-experiments
    github.com/neo4j/neo4j-go-driver/v5  // Neo4j + AuraDB
    github.com/aws/aws-sdk-go-v2         // Neptune, OpenSearch, S3
    github.com/redis/go-redis/v9         // Valkey VSS
    golang.org/x/time/rate               // Rate limiting for loaders
    gonum.org/v1/gonum                   // Statistical distributions

Python (embedding generation + catalog export):

    openai>=1.0                       # OpenAI text-embedding-3-small/large
    sentence-transformers>=2.0        # all-MiniLM-L6-v2 (384-dim, local)
    numpy>=1.24                       # Vector I/O (.npy format)
    pandas>=2.0                       # CSV reading
    snowflake-connector-python>=3.0   # Snowflake catalog export (optional)