Neo4j Action Plan

Date: 2026-02-23
Author: f.luo
Status: Draft

Context

Neo4j runs on a self-managed EC2 instance with operational pain points (IP changes, SG resets, no health checks). We need to stabilize the current setup while evaluating whether to stay on EC2 long-term or migrate to a managed service.

Two supporting documents provide the detail behind this plan:

Infrastructure Solidification Roadmap — phased approach to stabilizing EC2 infra (Phases 1-2) and managed DB overview (Phase 3)
Capacity Experiment Plan — detailed experiment design for benchmarking EC2, AuraDB, and Neptune at scale

Sequenced Action Plan

Step 1: Stabilize EC2 (Fixed IP + Security Groups) — DONE

What: Automate post-deploy steps (SG rules + URI update + worker restart) in the “Deploy Neo4j EC2” GitHub Actions workflow. Worker reads Neo4j address from Secrets Manager instead of hardcoded IPs. EC2 user_data preserves existing data on redeploy.
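
A minimal sketch of how a worker might resolve the Neo4j address from Secrets Manager rather than a hardcoded IP. The secret layout (a JSON object with "uri", "username", "password" keys) and the function names are assumptions for illustration, not the contents of PR #49; `get_secret_value` is the standard boto3 Secrets Manager call.

```python
import json

ALLOWED_SCHEMES = ("bolt://", "neo4j://", "bolt+s://", "neo4j+s://")


def parse_neo4j_secret(secret_string):
    """Extract connection settings from a JSON secret payload.

    The key names ("uri", "username", "password") are assumed, not taken
    from the real secret.
    """
    payload = json.loads(secret_string)
    uri = payload["uri"]
    if not uri.startswith(ALLOWED_SCHEMES):
        raise ValueError(f"unexpected Neo4j URI scheme: {uri!r}")
    return {
        "uri": uri,
        "username": payload["username"],
        "password": payload["password"],
    }


def load_neo4j_settings(secret_id, client=None):
    """Fetch the secret and parse it; `client` can be injected for tests."""
    if client is None:
        import boto3  # imported lazily so the parser works without AWS access

        client = boto3.client("secretsmanager")
    response = client.get_secret_value(SecretId=secret_id)
    return parse_neo4j_secret(response["SecretString"])
```

Reading the secret at worker start (or on reconnect) means a redeploy only has to update one value, which is what makes the worker restart in the workflow sufficient.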

Why now: This is the most frequent manual step today. Zero risk, low effort, immediate value regardless of long-term direction.

Effort: Low (1-2 days) Depends on: Nothing PR: #49 — feat: automate Neo4j EC2 post-deploy and preserve data on redeploy Details: Solidification Roadmap — Phase 1

Step 2: Run EC2 Capacity Experiments

What: Stand up an experiment EC2 instance and stress-test it with synthetic data (10K → 2M users), with and without vector embeddings, across multiple instance sizes.
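
One way to enumerate the runs, as a rough sketch. Only the 10K and 2M endpoints, the 512/1024 dimensions, and the two r6i sizes come from this plan; the intermediate user counts are assumptions. The footprint figure assumes one float32 vector per user and ignores index and page-cache overhead, so it is a lower bound rather than a sizing estimate.

```python
from itertools import product

USER_COUNTS = [10_000, 50_000, 250_000, 1_000_000, 2_000_000]  # middle steps assumed
VECTOR_DIMS = [0, 512, 1024]  # 0 means the run has no embeddings
INSTANCE_TYPES = ["r6i.xlarge", "r6i.2xlarge"]
BYTES_PER_FLOAT32 = 4


def raw_vector_bytes(users, dims):
    """Raw embedding payload, assuming one float32 vector per user."""
    return users * dims * BYTES_PER_FLOAT32


def experiment_matrix():
    """One entry per (instance type, user count, vector dimension) run."""
    return [
        {
            "instance": instance,
            "users": users,
            "dims": dims,
            "raw_vector_gib": raw_vector_bytes(users, dims) / 2**30,
        }
        for instance, users, dims in product(INSTANCE_TYPES, USER_COUNTS, VECTOR_DIMS)
    ]
```

Under these assumptions, 2M users with 1024-dim vectors carry about 7.6 GiB of raw embedding data before any index is built.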

Why now: Before investing further in EC2 infrastructure (Route 53, health checks) or evaluating managed alternatives, we need to know the breaking point. If EC2 tops out at 250K users and we need 5M, migration becomes urgent and further EC2 work is hard to justify.

Key questions answered:

How many users can r6i.xlarge (32GB) hold before p95 latency exceeds 100ms?
How much does adding 512-dim / 1024-dim vectors reduce capacity?
Does r6i.2xlarge (64GB) meaningfully extend the ceiling?
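
The "breaking point" in the first question can be made concrete with a small helper. This is a sketch that assumes latencies are collected per dataset size as lists of milliseconds and uses the nearest-rank definition of p95; the Capacity Experiment Plan may define the percentile differently.

```python
def p95(samples_ms):
    """Nearest-rank 95th percentile of a non-empty list of latencies."""
    if not samples_ms:
        raise ValueError("no latency samples")
    ordered = sorted(samples_ms)
    rank = (95 * len(ordered) + 99) // 100  # integer ceil(0.95 * n)
    return ordered[rank - 1]


def breaking_point(results, threshold_ms=100.0):
    """Smallest user count whose p95 exceeds the threshold, else None.

    `results` maps user count -> list of latency samples in milliseconds.
    """
    for users in sorted(results):
        if p95(results[users]) > threshold_ms:
            return users
    return None
```

Running the same helper on the r6i.xlarge and r6i.2xlarge results gives a like-for-like answer to whether the larger instance extends the ceiling.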

Effort: Medium (1-2 weeks)
Depends on: Step 1 (for prod stability while experimenting on a separate instance)
Details: Capacity Experiment Plan, Experiments 1 & 5

Decision Gate A: EC2 Capacity Assessment

EC2 Result	Next Steps
EC2 handles target scale comfortably	→ Step 3 (solidify EC2 further), defer Step 4
EC2 hits limits near target scale	→ Step 3 (quick stabilization) + Step 4 (evaluate alternatives urgently)
EC2 hits limits well below target	→ Skip Step 3, go directly to Step 4
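
The gate can be expressed as a ratio of measured ceiling to target scale. The sketch below is illustrative only: the plan does not define "comfortably", "near", or "well below", so the 1.5x and 0.75x cut-offs are placeholder assumptions the team would need to set.

```python
def gate_a(ceiling_users, target_users, comfortable_ratio=1.5, near_ratio=0.75):
    """Map the measured EC2 ceiling to the next steps in the table above.

    Both ratio thresholds are hypothetical defaults, not agreed values.
    """
    if target_users <= 0:
        raise ValueError("target_users must be positive")
    ratio = ceiling_users / target_users
    if ratio >= comfortable_ratio:
        return ["Step 3"]  # Step 4 is deferred
    if ratio >= near_ratio:
        return ["Step 3", "Step 4"]  # quick stabilization plus urgent evaluation
    return ["Step 4"]  # skip Step 3
```

With the example from Step 2 (a 250K ceiling against a 5M target), the ratio is 0.05 and the function returns only Step 4.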

Step 3: Solidify EC2 (Route 53 + Health Checks)

What: Set up a Route 53 private hosted zone so that EC2 redeploys are fully automated (the instance registers its own DNS record), and add health checks on port 7687.
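
A sketch of the two pieces, under assumptions: the record name and TTL are hypothetical, and the health check here is a plain TCP connect, which only shows that port 7687 accepts connections, not that Neo4j can serve queries. The change batch follows the shape expected by the Route 53 `ChangeResourceRecordSets` API.

```python
import socket


def bolt_port_open(host, port=7687, timeout=2.0):
    """Return True if a TCP connection to the Bolt port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def upsert_a_record(record_name, private_ip, ttl=60):
    """Build a ChangeBatch that points `record_name` at the instance IP.

    UPSERT creates the record on first boot and overwrites it on redeploy.
    """
    return {
        "Comment": "Neo4j EC2 self-registration",
        "Changes": [
            {
                "Action": "UPSERT",
                "ResourceRecordSet": {
                    "Name": record_name,
                    "Type": "A",
                    "TTL": ttl,
                    "ResourceRecords": [{"Value": private_ip}],
                },
            }
        ],
    }
```

On boot, the instance could pass the batch to `boto3.client("route53").change_resource_record_sets(HostedZoneId=..., ChangeBatch=batch)`; a short TTL keeps stale addresses from lingering after a redeploy.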

Why: Eliminates all manual steps for EC2 redeploys. Only worth doing if EC2 capacity experiments show it has enough headroom for the medium term.

Effort: Medium (3-5 days)
Depends on: Decision Gate A (confirms EC2 is viable at target scale)
Details: Solidification Roadmap, Phase 2

Step 4: Evaluate Managed Alternatives (AuraDB + Neptune)

What: Benchmark AuraDB Professional and Neptune Analytics with the same datasets and queries used for EC2. Test Cypher compatibility, write throughput, read latency, vector support, and cost.

Why: Determines whether a managed service is worth the migration. AuraDB requires zero code changes (Bolt protocol); Neptune requires a data access layer rewrite but offers AWS-native integration.

Key questions answered:

Is AuraDB latency comparable to EC2-in-VPC?
Does Neptune’s openCypher support our MERGE + CASE WHEN patterns?
What’s the cost per user per month for each option at 1M and 5M users?
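
The cost question reduces to simple arithmetic once monthly costs are known. The sketch below assumes a monthly cost is recorded per option and per scale, since sizing (and therefore price) is likely to change between 1M and 5M users; the input structure and function names are illustrative.

```python
def cost_per_user(monthly_cost_usd, users):
    """Monthly cost divided by user count, in USD per user per month."""
    if users <= 0:
        raise ValueError("users must be positive")
    return monthly_cost_usd / users


def cost_table(monthly_costs, scales=(1_000_000, 5_000_000)):
    """Per-user monthly cost for every option at every scale.

    `monthly_costs` maps option name -> {user count: monthly cost in USD}.
    """
    return {
        option: {users: cost_per_user(by_scale[users], users) for users in scales}
        for option, by_scale in monthly_costs.items()
    }
```

The inputs should come from the benchmark runs and the providers' pricing pages; no figures are assumed here.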

Effort: Medium (2-3 weeks)
Depends on: Decision Gate A (determines urgency: leisurely evaluation vs. urgent migration)
Details:

Decision Gate B: Backend Selection

Result	Recommendation
AuraDB matches EC2 perf, lower ops cost	→ Migrate to AuraDB (config change, no code rewrite)
Neptune outperforms on graph+vector combined	→ Migrate to Neptune (data access layer rewrite needed)
EC2 is cheapest and meets scale needs	→ Stay on EC2 with Phase 2 solidification

Step 5: Vector Storage Decision

What: Compare Neo4j native HNSW, Neptune Analytics vectors, Valkey VSS, and OpenSearch Serverless for kNN similarity search.

Why: The consumer graph will need vector similarity for recommendations. The vector backend choice may be independent of the graph backend choice (e.g., Neo4j for graph + Valkey for vectors).
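
Because HNSW and similar indexes are approximate, comparing backends fairly needs an exact baseline to measure recall against. A minimal pure-Python sketch follows; it assumes cosine similarity as the metric (the plan does not name one) and is meant for small verification samples, not for the full dataset.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def exact_knn(query, vectors, k):
    """Brute-force top-k ids by cosine similarity; `vectors` maps id -> vector."""
    ranked = sorted(
        vectors,
        key=lambda vid: cosine_similarity(query, vectors[vid]),
        reverse=True,
    )
    return ranked[:k]


def recall_at_k(approx_ids, exact_ids):
    """Fraction of the exact top-k that the approximate backend returned."""
    if not exact_ids:
        raise ValueError("exact_ids must not be empty")
    return len(set(approx_ids) & set(exact_ids)) / len(exact_ids)
```

Reporting recall alongside latency avoids rewarding a backend that is fast only because it returns poorer neighbours.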

Effort: Medium (1-2 weeks, can run in parallel with Step 4)
Depends on: Step 2 results (establishes Neo4j vector baseline)
Details: Capacity Experiment Plan, Vector Comparison

Step 6: Final Report + Migration

What: Compile experiment results into a recommendation document with cost projections at 100K, 1M, 5M, 10M users. Execute the chosen migration path.

Depends on: Decision Gates A and B
Details: Capacity Experiment Plan, Phase 7

Timeline

Week 1        Step 1: Fixed IP + SG rules
Week 1-2      Step 2: EC2 capacity experiments
              ── Decision Gate A ──
Week 3        Step 3: Route 53 (if EC2 viable)
Week 3-5      Step 4: AuraDB + Neptune benchmarks
Week 3-4      Step 5: Vector storage comparison (parallel with Step 4)
              ── Decision Gate B ──
Week 6        Step 6: Final report + migration plan

Estimated Cost

Item	Cost
EC2 experiment instance (30 days)	~$180
EC2 r6i.2xlarge upgrade test (7 days)	~$68
AuraDB Professional (30 days)	~$130
Neptune Analytics (30 days)	~$100-200
OpenSearch Serverless (7 days)	~$60
OpenAI embeddings	~$2
Total	~$540-640
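
Re-adding the line items is a quick check on the total. The sketch below copies the figures from the table as (low, high) pairs in USD; items with a single estimate use the same value for both.

```python
LINE_ITEMS = {
    "EC2 experiment instance (30 days)": (180, 180),
    "EC2 r6i.2xlarge upgrade test (7 days)": (68, 68),
    "AuraDB Professional (30 days)": (130, 130),
    "Neptune Analytics (30 days)": (100, 200),
    "OpenSearch Serverless (7 days)": (60, 60),
    "OpenAI embeddings": (2, 2),
}


def total_range(items):
    """Sum the low and high ends of every line item."""
    low = sum(lo for lo, _ in items.values())
    high = sum(hi for _, hi in items.values())
    return low, high


print(total_range(LINE_ITEMS))  # (540, 640)
```

Valkey VSS from Step 5 has no line item in the table, so it is not part of this sum.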