Neo4j Infrastructure Solidification

Neo4j Infrastructure Roadmap

This document outlines a phased approach to improving Neo4j infrastructure for the consumer-graph-worker service, moving from the current self-managed EC2 setup toward a fully managed solution.

See also:

Action Plan — sequenced steps, decision gates, and timeline

Capacity Experiment Plan — benchmarking EC2, AuraDB, and Neptune to inform Phase 3 decision

Summary

Phase	Approach	Effort	Solves	Monthly Cost
Current	Hardcoded private IP on EC2	—	—	~$226 (r6i.xlarge + 500GB gp3)
1	Fixed private IP	Low	IP changes on redeploy, SG resets	Same
2	Route 53 private hosted zone	Medium	Automated deploys, health checks, DNS abstraction	+$1.50
3	Managed DB (AuraDB or Neptune — pending experiments)	Low–High	All EC2 ops (patching, backups, HA)	~$65-250+

Recommended path: Phase 1 now (immediate pain relief) -> EC2 capacity experiments (determine urgency) -> Phase 2 if EC2 is viable at target scale -> Phase 3 informed by experiment results. See Action Plan for sequencing and decision gates.

Current State

Neo4j runs on a single EC2 instance deployed via FSD (consumer-graph-neo4j-ec2.yml). The worker connects using a hardcoded private IP in the config YAML. This creates several operational issues:

IP changes on redeploy — every EC2 redeployment assigns a new private IP, requiring a config update + worker redeployment
Security group resets — FSD resets the EC2 security group on redeploy, requiring manual re-addition of port 7687/7474 inbound rules
No health checks — if Neo4j goes down, the worker crashes with no automatic recovery
No HA/failover — single instance, single AZ
Manual setup — password configuration, security groups, and port rules require manual steps after each deploy

Phase 1: Fixed Private IP

Goal: Eliminate IP changes on EC2 redeployment.

What Changes

Assign a static private IP to the Neo4j EC2 instance within the subnet’s CIDR range. Update the FSD EC2 config to specify the private IP.

Implementation

Choose a private IP within the subnet’s available range that isn’t in use
Add private_ip to consumer-graph-neo4j-ec2.yml (if FSD supports it), or specify it in the EC2 launch configuration
Update stage/prod config YAMLs with the fixed IPs (one-time change)
Add security group ingress rules to the FSD config (ports 7687, 7474 from the ECS worker security group) so they persist across redeploys

What This Solves

No more IP changes on redeploy
No more config updates + worker redeployment after Neo4j EC2 redeploy
Security group rules codified in infrastructure config

What This Doesn’t Solve

Still a single point of failure (single EC2 instance)
No automatic health checks or failover
IP is tied to a specific subnet — moving to a different subnet requires a new IP
Still requires manual Neo4j version upgrades, patching, backups

Risk

If the chosen IP is already allocated to another resource in the subnet, the deploy will fail
IP must be within the subnet’s CIDR range and not reserved by AWS (first 4 and last 1 address in each subnet)
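The two constraints above can be checked before a deploy. The following is a minimal sketch in bash, not part of the existing tooling: the function names ip_to_int and is_usable_ip are hypothetical, and the addresses in the usage note are placeholder values, not the real subnet. It only checks the CIDR range and the AWS-reserved addresses; whether the IP is already allocated to another resource still has to be confirmed against the subnet itself.

```bash
#!/usr/bin/env bash
# Hypothetical helpers (not part of FSD): validate a candidate fixed private IP.

# Convert a dotted-quad IPv4 address to a 32-bit integer.
ip_to_int() {
  local a b c d
  IFS=. read -r a b c d <<< "$1"
  echo $(( (a << 24) | (b << 16) | (c << 8) | d ))
}

# Succeed (exit 0) if IP $1 lies inside subnet CIDR $2 and is not one of the
# addresses AWS reserves: the first four and the last one in the subnet.
is_usable_ip() {
  local ip=$1 cidr=$2
  local base=${cidr%/*} prefix=${cidr#*/}
  local ip_int net_int size offset
  ip_int=$(ip_to_int "$ip")
  size=$(( 1 << (32 - prefix) ))                        # addresses in the subnet
  net_int=$(( $(ip_to_int "$base") & ~(size - 1) & 0xFFFFFFFF ))
  offset=$(( ip_int - net_int ))                        # position within the subnet
  (( offset >= 4 && offset < size - 1 ))
}
```

For example, with a placeholder subnet of 10.0.1.0/24, `is_usable_ip 10.0.1.50 10.0.1.0/24` succeeds, while 10.0.1.3 (reserved) and 10.0.1.255 (last address) are rejected.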

Phase 2: Route 53 Private Hosted Zone

Goal: Decouple the worker from any specific IP address using DNS.

Why Move Beyond Fixed IP

Fixed private IPs are tied to a single subnet. If the infrastructure changes (VPC restructuring, subnet changes, multi-AZ deployment), the fixed IP approach breaks. DNS provides an abstraction layer that survives infrastructure changes.

Additionally, Route 53 enables:

Health checks that can trigger alerts or automated failover
A path toward multi-instance setups (multiple A records behind a single DNS name)
Self-registration from the EC2 user_data script, making the deploy fully automated

What Changes

Create a Route 53 private hosted zone (e.g., consumer-graph.internal) associated with the VPC
The Neo4j EC2 user_data script registers its own IP as a DNS A record after boot (e.g., neo4j.consumer-graph.internal)
Worker config uses the DNS hostname instead of an IP address
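Add a Route 53 health check for Neo4j on port 7687. Route 53 health checkers run outside the VPC and cannot reach a private IP directly, so this needs to be a health check based on a CloudWatch alarm (fed by an in-VPC probe or metric) rather than a plain TCP endpoint check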

Implementation

Create private hosted zone via Terraform/FSD or AWS CLI:

aws route53 create-hosted-zone --name consumer-graph.internal --vpc VPCRegion=us-east-1,VPCId=vpc-xxx --caller-reference $(date +%s)

Add DNS self-registration to consumer-graph-neo4j-ec2.yml user_data (after Neo4j starts):

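# Get an IMDSv2 session token (skip if user_data already sets IMDS_TOKEN), then the instance private IP
IMDS_TOKEN=$(curl -s -X PUT "http://169.254.169.254/latest/api/token" -H "X-aws-ec2-metadata-token-ttl-seconds: 300")
PRIVATE_IP=$(curl -s -H "X-aws-ec2-metadata-token: $IMDS_TOKEN" http://169.254.169.254/latest/meta-data/local-ipv4)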

# Update Route 53 record
HOSTED_ZONE_ID="Z0123456789"  # from hosted zone creation
aws route53 change-resource-record-sets --hosted-zone-id "$HOSTED_ZONE_ID" --change-batch '{
  "Changes": [{
    "Action": "UPSERT",
    "ResourceRecordSet": {
      "Name": "neo4j.consumer-graph.internal",
      "Type": "A",
      "TTL": 60,
      "ResourceRecords": [{"Value": "'"$PRIVATE_IP"'"}]
    }
  }]
}'
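Because the registration is meant to run after Neo4j starts, the user_data script needs some way to wait for the Bolt port before publishing the record. One possible sketch is below. The function name wait_for_port, its defaults, and the retry counts are assumptions for illustration, not existing script contents. It relies on bash's /dev/tcp pseudo-device and the coreutils timeout command.

```bash
#!/usr/bin/env bash
# Hypothetical helper: block until a TCP port accepts connections.
# Usage: wait_for_port HOST PORT [ATTEMPTS] [DELAY_SECONDS]
wait_for_port() {
  local host=$1 port=$2 attempts=${3:-30} delay=${4:-2}
  local i
  for (( i = 1; i <= attempts; i++ )); do
    # Opening /dev/tcp/HOST/PORT makes bash attempt a TCP connection;
    # timeout bounds each attempt so an unreachable host cannot hang the loop.
    if timeout 3 bash -c "exec 3<>/dev/tcp/${host}/${port}" 2>/dev/null; then
      return 0
    fi
    # Only sleep if another attempt is coming.
    if (( i < attempts )); then
      sleep "$delay"
    fi
  done
  return 1
}

# Example gate in front of the Route 53 UPSERT (values are illustrative):
#   if ! wait_for_port localhost 7687 30 2; then
#     echo "Neo4j did not open port 7687; skipping DNS registration" >&2
#     exit 1
#   fi
```

Gating on the port means a failed Neo4j start leaves the previous DNS record in place instead of pointing the worker at an instance that is not serving Bolt.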

Update worker configs:

neo4j:
  uri: "neo4j://neo4j.consumer-graph.internal:7687"

What This Solves

Fully automated deploy — no manual IP updates, no worker redeployment after Neo4j redeploy
DNS abstraction survives infrastructure changes
Health checks provide monitoring and alerting
Foundation for multi-instance (multiple A records)

What This Doesn’t Solve

Still self-managed Neo4j (patching, upgrades, backups)
No built-in HA or clustering
DNS TTL means brief connectivity gaps during redeploy (mitigated by low TTL)
EC2 operational burden remains
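The brief connectivity gap during a redeploy can be absorbed on the client side by retrying for at least as long as the record's TTL. The sketch below shows the idea in bash to match the snippets above. The worker's own language and retry mechanism are not described here, so treat retry_with_backoff and its parameters as hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical helper: retry a command with exponential backoff.
# Usage: retry_with_backoff MAX_ATTEMPTS INITIAL_DELAY_SECONDS COMMAND [ARGS...]
retry_with_backoff() {
  local max_attempts=$1 delay=$2
  shift 2
  local attempt=1
  until "$@"; do
    if (( attempt >= max_attempts )); then
      return 1                      # give up after the final failed attempt
    fi
    sleep "$delay"
    delay=$(( delay * 2 ))          # double the wait each round
    attempt=$(( attempt + 1 ))
  done
}

# Example (illustrative): 5 attempts starting at 5s sleep 5+10+20+40 = 75s in
# total, which covers a 60s TTL.
#   retry_with_backoff 5 5 getent hosts neo4j.consumer-graph.internal
```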

Cost

Route 53 hosted zone: $0.50/month
Health checks: $0.50-$1.00/month
Negligible compared to EC2 instance cost

Phase 3: Managed Graph Database

Goal: Eliminate EC2 management entirely with a fully managed graph database service.

Why Consider Moving Beyond EC2

Even with Route 53, EC2 Neo4j carries operational burden: manual version upgrades, no automated backups, single-AZ deployment, OS/Java patching. A managed service eliminates these concerns.

Candidates

Two managed options are under evaluation:

Option	Migration Effort	Key Tradeoff
Neo4j AuraDB	Low (config change only, Bolt protocol compatible)	Third-party managed, less AWS integration
Amazon Neptune	High (data access layer rewrite, openCypher subset)	AWS-native, serverless option

Decision: Pending Experiment Results

The Phase 3 backend choice depends on empirical performance and cost data from the Capacity Experiment Plan, which benchmarks both options at scale (Experiments 2 and 3). Key questions only experiments can answer:

Does AuraDB latency over the internet match EC2-in-VPC latency?
Does Neptune’s openCypher support our MERGE + CASE WHEN patterns?
What’s the real cost per user per month at 1M+ users?
How do graph+vector combined queries perform on each platform?

See Action Plan — Decision Gate B for the decision criteria.

Migration Timeline

Current State          Phase 1              Phase 2              Phase 3
(Hardcoded IP)   ->  (Fixed Private IP) -> (Route 53 DNS)   -> (Managed DB)
                                                 ↑                   ↑
                                         EC2 capacity          Capacity experiments
                                         experiments           (AuraDB + Neptune)
                                         determine if          determine which
                                         this is needed        backend to adopt

Each phase is independently valuable and can be deployed without committing to subsequent phases. The progression is driven by operational pain and informed by experiment results:

Phase 1 is worth doing immediately — it eliminates the most frequent manual step (IP updates)
EC2 capacity experiments determine how much headroom the current setup has, and whether Phase 2 or Phase 3 is the right next investment
Phase 2 is worth doing if EC2 has enough capacity for the medium term — adds DNS health checks and fully hands-off EC2 redeploys
Phase 3 is worth doing when EC2 capacity or operational overhead becomes a bottleneck — the Capacity Experiment Plan will provide the data to choose between AuraDB and Neptune

See Action Plan for the full sequenced timeline and decision gates.