Neo4j Infrastructure Solidification
Neo4j Infrastructure Roadmap
This document outlines a phased approach to improving Neo4j infrastructure for the consumer-graph-worker service, moving from the current self-managed EC2 setup toward a fully managed solution.
See also:
- Action Plan — sequenced steps, decision gates, and timeline
- Capacity Experiment Plan — benchmarking EC2, AuraDB, and Neptune to inform Phase 3 decision
Summary
| Phase | Approach | Effort | Solves | Monthly Cost |
|---|---|---|---|---|
| Current | Hardcoded private IP on EC2 | — | — | ~$226 (r6i.xlarge + 500GB gp3) |
| 1 | Fixed private IP | Low | IP changes on redeploy, SG resets | Same |
| 2 | Route 53 private hosted zone | Medium | Automated deploys, health checks, DNS abstraction | +$1.50 |
| 3 | Managed DB (AuraDB or Neptune — pending experiments) | Low–High | All EC2 ops (patching, backups, HA) | ~$65-250+ |
Recommended path: Phase 1 now (immediate pain relief) -> EC2 capacity experiments (determine urgency) -> Phase 2 if EC2 is viable at target scale -> Phase 3 informed by experiment results. See Action Plan for sequencing and decision gates.
Current State
Neo4j runs on a single EC2 instance deployed via FSD (consumer-graph-neo4j-ec2.yml). The worker connects using a hardcoded private IP in the config YAML. This creates several operational issues:
- IP changes on redeploy — every EC2 redeployment assigns a new private IP, requiring a config update + worker redeployment
- Security group resets — FSD resets the EC2 security group on redeploy, requiring manual re-addition of port 7687/7474 inbound rules
- No health checks — if Neo4j goes down, the worker crashes with no automatic recovery
- No HA/failover — single instance, single AZ
- Manual setup — password configuration, security groups, and port rules require manual steps after each deploy
Phase 1: Fixed Private IP
Goal: Eliminate IP changes on EC2 redeployment.
What Changes
Assign a static private IP to the Neo4j EC2 instance within the subnet’s CIDR range. Update the FSD EC2 config to specify the private IP.
Implementation
- Choose a private IP within the subnet’s available range that isn’t in use
- Add `private_ip` to `consumer-graph-neo4j-ec2.yml` (if FSD supports it), or specify it in the EC2 launch configuration
- Update stage/prod config YAMLs with the fixed IPs (one-time change)
- Add security group ingress rules to the FSD config (ports 7687, 7474 from the ECS worker security group) so they persist across redeploys
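If FSD turns out not to support these settings directly, the same steps can be sketched with the AWS CLI. This is a minimal sketch under stated assumptions, not the FSD mechanism: the subnet ID, security group IDs, and IP address are hypothetical placeholders, and the `run` and `plan_fixed_ip` helpers are introduced here for illustration. By default it only prints the commands it would run.

```shell
# Sketch only: every ID and the IP below are hypothetical placeholders.
SUBNET_ID="${SUBNET_ID:-subnet-xxx}"
NEO4J_SG_ID="${NEO4J_SG_ID:-sg-neo4j-xxx}"
WORKER_SG_ID="${WORKER_SG_ID:-sg-worker-xxx}"
FIXED_IP="${FIXED_IP:-10.0.1.50}"
DRY_RUN="${DRY_RUN:-1}"   # 1 = print commands instead of executing them

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

plan_fixed_ip() {
  # 1. Check whether any network interface in the subnet already holds the IP
  #    (empty output means the address is free).
  run aws ec2 describe-network-interfaces \
    --filters "Name=subnet-id,Values=$SUBNET_ID" \
              "Name=addresses.private-ip-address,Values=$FIXED_IP" \
    --query 'NetworkInterfaces[].NetworkInterfaceId' --output text

  # 2. Allow Bolt (7687) and HTTP (7474) from the worker security group.
  for port in 7687 7474; do
    run aws ec2 authorize-security-group-ingress \
      --group-id "$NEO4J_SG_ID" --protocol tcp --port "$port" \
      --source-group "$WORKER_SG_ID"
  done
}

plan_fixed_ip
```

Codifying the ingress rules in the FSD config remains the preferred route, since rules added out of band are exactly what the redeploy resets today.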
What This Solves
- No more IP changes on redeploy
- No more config updates + worker redeployment after Neo4j EC2 redeploy
- Security group rules codified in infrastructure config
What This Doesn’t Solve
- Still a single point of failure (single EC2 instance)
- No automatic health checks or failover
- IP is tied to a specific subnet — moving to a different subnet requires a new IP
- Still requires manual Neo4j version upgrades, patching, backups
- The deploy fails if the chosen IP is already allocated to another resource in the subnet
- The IP must fall within the subnet’s CIDR range and must not be one of the addresses AWS reserves (the first four addresses and the last address in each subnet)
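The CIDR and reserved-address constraints can be checked before a deploy. The following is a minimal sketch: `ip_to_int` and `is_assignable_ip` are hypothetical helper names, and the addresses in the usage lines are illustrative. It only checks the range rules; whether the address is already allocated still has to be checked against the subnet itself.

```shell
# Convert a dotted-quad IPv4 address to an integer.
ip_to_int() {
  local IFS=.
  set -- $1
  echo $(( ($1 << 24) | ($2 << 16) | ($3 << 8) | $4 ))
}

# Succeed if IP ($1) lies inside CIDR ($2) and is not one of the addresses
# AWS reserves in every subnet (the first four and the last one).
is_assignable_ip() {
  local ip_n net_n size prefix
  prefix="${2#*/}"
  size=$(( 1 << (32 - prefix) ))
  ip_n=$(ip_to_int "$1")
  net_n=$(( $(ip_to_int "${2%/*}") & ~(size - 1) ))
  # Usable window: network+4 through network+size-2.
  [ "$ip_n" -ge $(( net_n + 4 )) ] && [ "$ip_n" -le $(( net_n + size - 2 )) ]
}

is_assignable_ip 10.0.1.50 10.0.1.0/24 && echo "10.0.1.50 is assignable"
is_assignable_ip 10.0.1.3 10.0.1.0/24 || echo "10.0.1.3 is reserved or out of range"
```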
Phase 2: Route 53 Private Hosted Zone
Goal: Decouple the worker from any specific IP address using DNS.
Why Move Beyond Fixed IP
Fixed private IPs are tied to a single subnet. If the infrastructure changes (VPC restructuring, subnet changes, multi-AZ deployment), the fixed IP approach breaks. DNS provides an abstraction layer that survives infrastructure changes.
Additionally, Route 53 enables:
- Health checks that can trigger alerts or automated failover
- A path toward multi-instance setups (multiple A records behind a single DNS name)
- Self-registration from the EC2 `user_data` script, making the deploy fully automated
What Changes
- Create a Route 53 private hosted zone (e.g., `consumer-graph.internal`) associated with the VPC
- The Neo4j EC2 `user_data` script registers its own IP as a DNS A record after boot (e.g., `neo4j.consumer-graph.internal`)
- Worker config uses the DNS hostname instead of an IP address
- Add Route 53 health checks on port 7687 (Route 53 health checkers run outside the VPC and cannot reach a private IP directly, so in practice this likely means a health check backed by a CloudWatch alarm)
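Because the DNS record should only be registered once Neo4j is accepting connections, the `user_data` script needs some way to wait for the Bolt port. One way to sketch that is a small retry helper. `retry_until` and `bolt_ready` are hypothetical names, and the probe assumes bash is available on the instance (it relies on bash's `/dev/tcp` redirection) along with the coreutils `timeout` command.

```shell
# Run a command until it succeeds, up to a maximum number of attempts.
# Usage: retry_until ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
retry_until() {
  local attempts="$1" delay="$2" i=1
  shift 2
  while :; do
    "$@" && return 0
    [ "$i" -ge "$attempts" ] && return 1
    i=$(( i + 1 ))
    sleep "$delay"
  done
}

# Probe: succeeds once something is listening on the local Bolt port.
bolt_ready() {
  timeout 2 bash -c 'exec 3<>/dev/tcp/127.0.0.1/7687' 2>/dev/null
}

# In user_data, before the Route 53 update (waits up to about 5 minutes):
# retry_until 60 5 bolt_ready || { echo "Neo4j did not start" >&2; exit 1; }
```

Failing the script when the port never opens keeps a broken instance from advertising itself in DNS.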
Implementation
Create the private hosted zone via Terraform/FSD or the AWS CLI:

```shell
aws route53 create-hosted-zone \
  --name consumer-graph.internal \
  --vpc VPCRegion=us-east-1,VPCId=vpc-xxx \
  --caller-reference "$(date +%s)"
```

Add DNS self-registration to the `user_data` in `consumer-graph-neo4j-ec2.yml` (after Neo4j starts). The instance role needs permission to call `route53:ChangeResourceRecordSets` on the hosted zone.

```shell
# Request an IMDSv2 session token (skip if user_data already sets IMDS_TOKEN)
IMDS_TOKEN=$(curl -s -X PUT "http://169.254.169.254/latest/api/token" \
  -H "X-aws-ec2-metadata-token-ttl-seconds: 300")

# Get instance private IP
PRIVATE_IP=$(curl -s -H "X-aws-ec2-metadata-token: $IMDS_TOKEN" \
  http://169.254.169.254/latest/meta-data/local-ipv4)

# Update Route 53 record
HOSTED_ZONE_ID="Z0123456789"  # from hosted zone creation
aws route53 change-resource-record-sets \
  --hosted-zone-id "$HOSTED_ZONE_ID" \
  --change-batch '{
    "Changes": [{
      "Action": "UPSERT",
      "ResourceRecordSet": {
        "Name": "neo4j.consumer-graph.internal",
        "Type": "A",
        "TTL": 60,
        "ResourceRecords": [{"Value": "'"$PRIVATE_IP"'"}]
      }
    }]
  }'
```

Update worker configs:

```yaml
neo4j:
  uri: "neo4j://neo4j.consumer-graph.internal:7687"
```

What This Solves
- Fully automated deploy: no manual IP updates, no worker redeployment after Neo4j redeploy
- DNS abstraction survives infrastructure changes
- Health checks provide monitoring and alerting
- Foundation for multi-instance (multiple A records)
What This Doesn’t Solve
- Still self-managed Neo4j (patching, upgrades, backups)
- No built-in HA or clustering
- DNS TTL means brief connectivity gaps during redeploy (mitigated by low TTL)
- EC2 operational burden remains
Cost
- Route 53 hosted zone: $0.50/month
- Health checks: $0.50-$1.00/month
- Negligible compared to EC2 instance cost
Phase 3: Managed Graph Database
Goal: Eliminate EC2 management entirely with a fully managed graph database service.
Why Consider Moving Beyond EC2
Even with Route 53, EC2 Neo4j carries operational burden: manual version upgrades, no automated backups, single-AZ deployment, OS/Java patching. A managed service eliminates these concerns.
Candidates
Two managed options are under evaluation:
| Option | Migration Effort | Key Tradeoff |
|---|---|---|
| Neo4j AuraDB | Low (config change only, Bolt protocol compatible) | Third-party managed, less AWS integration |
| Amazon Neptune | High (data access layer rewrite, openCypher subset) | AWS-native, serverless option |
Decision: Pending Experiment Results
The Phase 3 backend choice depends on empirical performance and cost data from the Capacity Experiment Plan, which benchmarks both options at scale (Experiments 2 and 3). Key questions only experiments can answer:
- Does AuraDB latency over the internet match EC2-in-VPC latency?
- Does Neptune’s openCypher support our MERGE + CASE WHEN patterns?
- What’s the real cost per user per month at 1M+ users?
- How do graph+vector combined queries perform on each platform?
See Action Plan — Decision Gate B for the decision criteria.
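For the cost-per-user question, the arithmetic is simply monthly cost divided by user count. The sketch below applies it to the monthly figures from the Summary table at 1M users. `cost_per_user` is a hypothetical helper, and the table figures reflect current sizing rather than measured cost at that scale, so the outputs only show the calculation the experiments would feed.

```shell
# Monthly cost per user in USD, printed with six decimal places.
# Usage: cost_per_user MONTHLY_COST_USD USER_COUNT
cost_per_user() {
  awk -v cost="$1" -v users="$2" 'BEGIN { printf "%.6f\n", cost / users }'
}

cost_per_user 226 1000000   # current EC2 figure        -> 0.000226
cost_per_user 65 1000000    # low end of Phase 3 range  -> 0.000065
cost_per_user 250 1000000   # high end of Phase 3 range -> 0.000250
```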
Migration Timeline
```text
Current State          Phase 1             Phase 2          Phase 3
(Hardcoded IP) -> (Fixed Private IP) -> (Route 53 DNS) -> (Managed DB)
                                              ↑                ↑
                                        EC2 capacity      Capacity
                                        experiments       experiments
                                        determine if      (AuraDB + Neptune)
                                        this is needed    determine which
```

Each phase is independently valuable and can be deployed without committing to subsequent phases. The progression is driven by operational pain and informed by experiment results:
- Phase 1 is worth doing immediately — it eliminates the most frequent manual step (IP updates)
- EC2 capacity experiments determine how much headroom the current setup has, and whether Phase 2 or Phase 3 is the right next investment
- Phase 2 is worth doing if EC2 has enough capacity for the medium term — adds DNS health checks and fully hands-off EC2 redeploys
- Phase 3 is worth doing when EC2 capacity or operational overhead becomes a bottleneck — the Capacity Experiment Plan will provide the data to choose between AuraDB and Neptune
See Action Plan for the full sequenced timeline and decision gates.