Neo4j Action Plan
Date: 2026-02-23 Author: f.luo Status: Draft
Context
Neo4j runs on a self-managed EC2 instance with operational pain points (IP changes, SG resets, no health checks). We need to stabilize the current setup while evaluating whether to stay on EC2 long-term or migrate to a managed service.
Two supporting documents provide the detail behind this plan:
- Infrastructure Solidification Roadmap — phased approach to stabilizing EC2 infra (Phases 1-2) and managed DB overview (Phase 3)
- Capacity Experiment Plan — detailed experiment design for benchmarking EC2, AuraDB, and Neptune at scale
Sequenced Action Plan
Step 1: Stabilize EC2 (Fixed IP + Security Groups) — DONE
What: Automate post-deploy steps (SG rules + URI update + worker restart) in the “Deploy Neo4j EC2” GitHub Actions workflow. The worker reads the Neo4j address from Secrets Manager instead of relying on hardcoded IPs. EC2 user_data preserves existing data on redeploy.
Why now: This is the most frequent manual step today. Zero risk, low effort, immediate value regardless of long-term direction.
Effort: Low (1-2 days)
Depends on: Nothing
PR: #49 — feat: automate Neo4j EC2 post-deploy and preserve data on redeploy
Details: Solidification Roadmap — Phase 1
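A minimal sketch of how the worker might resolve the Neo4j address at startup, assuming the secret is stored as a JSON string. The secret name `neo4j/connection` and the keys `host` and `port` are hypothetical; only `get_secret_value` and its `SecretString` response field come from the boto3 Secrets Manager client. The client is injected so the lookup can be exercised without AWS access.

```python
import json


def bolt_uri_from_secret(secret_string):
    """Build a Bolt URI from a JSON secret payload.

    Assumed (hypothetical) payload shape: {"host": "...", "port": 7687}.
    """
    payload = json.loads(secret_string)
    host = payload["host"]
    port = int(payload.get("port", 7687))  # 7687 is the default Bolt port
    return f"bolt://{host}:{port}"


def load_neo4j_uri(client, secret_id="neo4j/connection"):
    """Fetch the secret through a Secrets Manager client and return the URI.

    `client` is expected to behave like boto3.client("secretsmanager").
    """
    response = client.get_secret_value(SecretId=secret_id)
    return bolt_uri_from_secret(response["SecretString"])
```

In the worker this could be called as `load_neo4j_uri(boto3.client("secretsmanager"))` on startup, so a redeploy that changes the instance address only requires the secret to be updated and the worker restarted.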
Step 2: Run EC2 Capacity Experiments
What: Stand up an experiment EC2 instance and stress-test it with synthetic data (10K → 2M users), with and without vector embeddings, across multiple instance sizes.
Why now: Before investing further in EC2 infrastructure (Route 53, health checks) or evaluating managed alternatives, we need to know the breaking point. If EC2 tops out at 250K users and we need 5M, migration becomes urgent rather than optional.
Key questions answered:
- How many users can r6i.xlarge (32GB) hold before p95 > 100ms?
- How much does adding 512-dim / 1024-dim vectors reduce capacity?
- Does r6i.2xlarge (64GB) meaningfully extend the ceiling?
Effort: Medium (1-2 weeks)
Depends on: Step 1 (for prod stability while experimenting on a separate instance)
Details: Capacity Experiment Plan — Experiments 1 & 5
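The "p95 > 100ms" criterion in the key questions can be turned into a small analysis helper. The sketch below assumes each experiment run yields a list of per-query latencies in milliseconds keyed by user count; the function names and the nearest-rank percentile method are illustrative choices, not taken from the Capacity Experiment Plan.

```python
def p95(latencies_ms):
    """Nearest-rank 95th percentile of a non-empty list of latencies."""
    ordered = sorted(latencies_ms)
    # Integer ceiling of 0.95 * n, giving a 1-based nearest rank.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def capacity_ceiling(runs, threshold_ms=100.0):
    """Return the largest user count whose p95 stays within the threshold.

    `runs` maps user count -> list of query latencies in milliseconds.
    Scanning stops at the first dataset size that breaches the threshold,
    so a lucky pass at a larger size does not hide an earlier failure.
    Returns None if even the smallest dataset breaches it.
    """
    ceiling = None
    for users in sorted(runs):
        if p95(runs[users]) > threshold_ms:
            break
        ceiling = users
    return ceiling
```

Running the same helper over results from each instance size, with and without vectors, would give directly comparable ceilings for Decision Gate A.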
Decision Gate A: EC2 Capacity Assessment
| EC2 Result | Next Steps |
|---|---|
| EC2 handles target scale comfortably | → Step 3 (solidify EC2 further), defer Step 4 |
| EC2 hits limits near target scale | → Step 3 (quick stabilization) + Step 4 (evaluate alternatives urgently) |
| EC2 hits limits well below target | → Skip Step 3, go directly to Step 4 |
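The gate above could be made mechanical once "comfortably", "near", and "well below" are given numbers. The sketch below is one way to do that; the headroom thresholds (1.5x and 0.75x of target) are placeholder assumptions that the team would need to agree on, not values from this plan.

```python
def gate_a_next_steps(ceiling_users, target_users, comfortable=1.5, near=0.75):
    """Map a measured EC2 ceiling to the next steps in the plan.

    headroom = measured ceiling / target scale. The two thresholds are
    hypothetical cut-offs for the three rows of the Decision Gate A table.
    """
    headroom = ceiling_users / target_users
    if headroom >= comfortable:
        return ["Step 3"]            # solidify EC2 further, defer Step 4
    if headroom >= near:
        return ["Step 3", "Step 4"]  # quick stabilization + urgent evaluation
    return ["Step 4"]                # skip Step 3, evaluate alternatives now
```

For example, a ceiling of 250K users against a 5M target gives a headroom of 0.05 and falls into the last row.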
Step 3: Solidify EC2 (Route 53 + Health Checks)
What: Set up a Route 53 private hosted zone so EC2 redeploys are fully automated (DNS self-registration), and add health checks on port 7687.
Why: Eliminates all manual steps for EC2 redeploys. Only worth doing if EC2 capacity experiments show it has enough headroom for the medium term.
Effort: Medium (3-5 days)
Depends on: Decision Gate A (confirms EC2 is viable at target scale)
Details: Solidification Roadmap — Phase 2
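As an illustration of the simplest form a port 7687 health check could take, the sketch below attempts a TCP connection with a timeout using only the Python standard library. The function name and the 2-second default are assumptions. A successful connect only shows that the Bolt port accepts connections, not that Neo4j can serve queries, so a production check may want to go further.

```python
import socket


def bolt_port_open(host, port=7687, timeout_s=2.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS resolution failures.
        return False
```

Pointing this at the Route 53 record name rather than an IP would exercise DNS self-registration and the port in one call.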
Step 4: Evaluate Managed Alternatives (AuraDB + Neptune)
What: Benchmark AuraDB Professional and Neptune Analytics with the same datasets and queries used for EC2. Test Cypher compatibility, write throughput, read latency, vector support, and cost.
Why: Determines whether a managed service is worth the migration. AuraDB requires zero code changes (Bolt protocol); Neptune requires a data access layer rewrite but offers AWS-native integration.
Key questions answered:
- Is AuraDB latency comparable to EC2-in-VPC?
- Does Neptune’s openCypher support our MERGE + CASE WHEN patterns?
- What’s the cost per user per month for each option at 1M and 5M users?
Effort: Medium (2-3 weeks)
Depends on: Decision Gate A (determines urgency — leisurely evaluation vs urgent migration)
Details:
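One way to approach the Cypher compatibility question is a small probe harness that runs representative statements against each backend and records which ones are accepted. In the sketch below, the probe query is a made-up MERGE + CASE WHEN example, not one of the project's actual queries, and `run_query` is a hypothetical caller-supplied function (for instance a thin wrapper around a driver session).

```python
# Hypothetical probe set: name -> (Cypher text, parameters).
PROBES = {
    "merge_with_case": (
        "MERGE (u:User {id: $id}) "
        "SET u.tier = CASE WHEN $spend > 100 THEN 'gold' ELSE 'standard' END "
        "RETURN u.tier AS tier",
        {"id": "probe-1", "spend": 150},
    ),
}


def run_probes(run_query, probes=PROBES):
    """Run each probe and record whether the backend accepted it.

    `run_query(cypher, params)` is supplied by the caller; any exception
    it raises is treated as "not supported" and kept for the report.
    """
    results = {}
    for name, (cypher, params) in probes.items():
        try:
            run_query(cypher, params)
            results[name] = "ok"
        except Exception as exc:
            results[name] = f"failed: {exc}"
    return results
```

Because the probes write data, they would be run against the experiment databases only, with the real production query patterns substituted for the placeholder.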
Decision Gate B: Backend Selection
| Result | Recommendation |
|---|---|
| AuraDB matches EC2 perf, lower ops cost | → Migrate to AuraDB (config change, no code rewrite) |
| Neptune outperforms on graph+vector combined | → Migrate to Neptune (data access layer rewrite needed) |
| EC2 is cheapest and meets scale needs | → Stay on EC2 with Phase 2 solidification |
Step 5: Vector Storage Decision
What: Compare Neo4j native HNSW, Neptune Analytics vectors, Valkey VSS, and OpenSearch Serverless for kNN similarity search.
Why: The consumer graph will need vector similarity for recommendations. The vector backend choice may be independent of the graph backend choice (e.g., Neo4j for graph + Valkey for vectors).
Effort: Medium (1-2 weeks, can run in parallel with Step 4)
Depends on: Step 2 results (establishes Neo4j vector baseline)
Details: Capacity Experiment Plan — Vector Comparison
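Since the candidate backends use approximate indexes, a comparison may benefit from an exact reference to measure result quality against. The sketch below is a brute-force cosine kNN over a small sample plus a recall helper; using recall@k as the accuracy measure is an assumption here, not something specified in the Capacity Experiment Plan, and vectors are assumed to be non-zero.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def exact_knn(query, vectors, k):
    """Return the ids of the k vectors most similar to `query` (brute force).

    `vectors` maps id -> embedding. Cost is O(n * d) per query, so this is
    only meant for a sampled ground-truth set, not for production search.
    """
    ranked = sorted(
        vectors,
        key=lambda vid: cosine_similarity(query, vectors[vid]),
        reverse=True,
    )
    return ranked[:k]


def recall_at_k(approx_ids, exact_ids):
    """Fraction of the exact top-k that the approximate result recovered."""
    return len(set(approx_ids) & set(exact_ids)) / len(exact_ids)
```

Each backend's top-k result for the same query set could then be scored with `recall_at_k` alongside its latency, so speed and accuracy are compared together.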
Step 6: Final Report + Migration
What: Compile experiment results into a recommendation document with cost projections at 100K, 1M, 5M, 10M users. Execute the chosen migration path.
Depends on: Decision Gates A and B
Details: Capacity Experiment Plan — Phase 7
Timeline
Week 1     Step 1: Fixed IP + SG rules
Week 1-2   Step 2: EC2 capacity experiments
           ── Decision Gate A ──
Week 3     Step 3: Route 53 (if EC2 viable)
Week 3-5   Step 4: AuraDB + Neptune benchmarks
Week 3-4   Step 5: Vector storage comparison (parallel with Step 4)
           ── Decision Gate B ──
Week 6     Step 6: Final report + migration plan
Estimated Cost
| Item | Cost |
|---|---|
| EC2 experiment instance (30 days) | ~$180 |
| EC2 r6i.2xlarge upgrade test (7 days) | ~$68 |
| AuraDB Professional (30 days) | ~$130 |
| Neptune Analytics (30 days) | ~$100-200 |
| OpenSearch Serverless (7 days) | ~$60 |
| OpenAI embeddings | ~$2 |
| Total | ~$540-640 |
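The total can be re-derived from the line items with a few lines of Python, which is handy when an estimate changes. The figures below are copied from the table as (low, high) pairs in USD; summing them gives roughly $540 to $640.

```python
# (low, high) estimates in USD, copied from the cost table.
line_items = {
    "EC2 experiment instance (30 days)": (180, 180),
    "EC2 r6i.2xlarge upgrade test (7 days)": (68, 68),
    "AuraDB Professional (30 days)": (130, 130),
    "Neptune Analytics (30 days)": (100, 200),
    "OpenSearch Serverless (7 days)": (60, 60),
    "OpenAI embeddings": (2, 2),
}


def total_range(items):
    """Sum the low and high ends of every line item separately."""
    low = sum(lo for lo, _ in items.values())
    high = sum(hi for _, hi in items.values())
    return low, high
```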