Neo4j Infrastructure Solidification

This document outlines a phased approach to improving Neo4j infrastructure for the consumer-graph-worker service, moving from the current self-managed EC2 setup toward a fully managed solution.

| Phase | Approach | Effort | Solves | Monthly Cost |
|---|---|---|---|---|
| Current | Hardcoded private IP on EC2 | - | - | ~$226 (r6i.xlarge + 500GB gp3) |
| 1 | Fixed private IP | Low | IP changes on redeploy, SG resets | Same |
| 2 | Route 53 private hosted zone | Medium | Automated deploys, health checks, DNS abstraction | +$1.50 |
| 3 | Managed DB (AuraDB or Neptune, pending experiments) | Low–High | All EC2 ops (patching, backups, HA) | ~$65-250+ |

Recommended path: Phase 1 now (immediate pain relief) -> EC2 capacity experiments (determine urgency) -> Phase 2 if EC2 is viable at target scale -> Phase 3 informed by experiment results. See Action Plan for sequencing and decision gates.

Current State

Neo4j runs on a single EC2 instance deployed via FSD (consumer-graph-neo4j-ec2.yml). The worker connects using a hardcoded private IP in the config YAML. This creates several operational issues:

  • IP changes on redeploy — every EC2 redeployment assigns a new private IP, requiring a config update + worker redeployment
  • Security group resets — FSD resets the EC2 security group on redeploy, requiring manual re-addition of port 7687/7474 inbound rules
  • No health checks — if Neo4j goes down, the worker crashes with no automatic recovery
  • No HA/failover — single instance, single AZ
  • Manual setup — password configuration, security groups, and port rules require manual steps after each deploy

Phase 1: Fixed Private IP

Goal: Eliminate IP changes on EC2 redeployment.

Assign a static private IP to the Neo4j EC2 instance within the subnet’s CIDR range. Update the FSD EC2 config to specify the private IP.

Steps:

  1. Choose a private IP within the subnet’s available range that isn’t in use
  2. Add private_ip to consumer-graph-neo4j-ec2.yml (if FSD supports it), or specify it in the EC2 launch configuration
  3. Update stage/prod config YAMLs with the fixed IPs (one-time change)
  4. Add security group ingress rules to the FSD config (ports 7687, 7474 from the ECS worker security group) so they persist across redeploys

Benefits:

  • No more IP changes on redeploy
  • No more config updates + worker redeployment after Neo4j EC2 redeploy
  • Security group rules codified in infrastructure config

Limitations:

  • Still a single point of failure (single EC2 instance)
  • No automatic health checks or failover
  • IP is tied to a specific subnet; moving to a different subnet requires a new IP
  • Still requires manual Neo4j version upgrades, patching, and backups

Risks:

  • If the chosen IP is already allocated to another resource in the subnet, the deploy will fail
  • IP must be within the subnet’s CIDR range and not reserved by AWS (the first four addresses and the last address in each subnet)
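
The range and reserved-address constraints above can be checked before a deploy. The sketch below is one way to do that in POSIX shell, assuming IPv4 only; `ip_to_int` and `ip_usable_in_subnet` are hypothetical helper names, and the 10.0.1.0/24 subnet is a made-up example. It does not detect whether the address is already allocated, which still has to be confirmed against the VPC.

```shell
#!/bin/sh
# Sketch only: helper names and the example subnet are hypothetical.

# Convert a dotted-quad IPv4 address to an integer.
ip_to_int() {
  old_ifs=$IFS
  IFS=.
  set -- $1            # split the address into four octets
  IFS=$old_ifs
  echo $(( $1 * 16777216 + $2 * 65536 + $3 * 256 + $4 ))
}

# Succeed if IP ($1) lies in CIDR ($2) and is not one of the addresses AWS
# reserves in every subnet (the first four and the last one).
ip_usable_in_subnet() {
  prefix=${2#*/}
  size=1
  bits=$prefix
  while [ "$bits" -lt 32 ]; do
    size=$(( size * 2 ))
    bits=$(( bits + 1 ))
  done
  ip_n=$(ip_to_int "$1")
  base_n=$(ip_to_int "${2%/*}")
  net_n=$(( base_n - base_n % size ))   # align to the network address
  [ "$ip_n" -ge $(( net_n + 4 )) ] && [ "$ip_n" -le $(( net_n + size - 2 )) ]
}

# Example with made-up values:
if ip_usable_in_subnet 10.0.1.50 10.0.1.0/24; then
  echo "10.0.1.50 is in range and not AWS-reserved"
fi
```

Running a check like this in CI before the FSD deploy would turn a failed deploy into an early, readable error.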

Phase 2: Route 53 Private Hosted Zone

Goal: Decouple the worker from any specific IP address using DNS.

Fixed private IPs are tied to a single subnet. If the infrastructure changes (VPC restructuring, subnet changes, multi-AZ deployment), the fixed IP approach breaks. DNS provides an abstraction layer that survives infrastructure changes.

Additionally, Route 53 enables:

  • Health checks that can trigger alerts or automated failover
  • A path toward multi-instance setups (multiple A records behind a single DNS name)
  • Self-registration from the EC2 user_data script, making the deploy fully automated

Steps:

  1. Create a Route 53 private hosted zone (e.g., consumer-graph.internal) associated with the VPC
  2. The Neo4j EC2 user_data script registers its own IP as a DNS A record after boot (e.g., neo4j.consumer-graph.internal)
  3. Worker config uses the DNS hostname instead of an IP address
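  4. Add Route 53 health checks for port 7687. Route 53's health checkers run outside the VPC and cannot reach a private IP directly, so for a private endpoint this typically means a health check based on a CloudWatch alarm rather than a direct TCP check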
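
The check in step 4 boils down to whether the Bolt port accepts TCP connections. The same probe is also useful in user_data, to hold off any post-boot step until Neo4j is actually listening. A minimal sketch, assuming bash and coreutils `timeout` are available on the instance image; `port_open` and `wait_for_port` are hypothetical names.

```shell
#!/bin/sh
# Sketch only: function names are hypothetical; requires bash and coreutils timeout.

# Succeed if a TCP connection to host ($1) and port ($2) opens within 2 seconds.
# Uses bash's /dev/tcp pseudo-device, so the probe itself runs under bash.
port_open() {
  timeout 2 bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$1" "$2" 2>/dev/null
}

# Poll host ($1) port ($2) up to $3 times, sleeping $4 seconds after each
# failed try. Succeeds as soon as one probe connects.
wait_for_port() {
  attempt=1
  while [ "$attempt" -le "$3" ]; do
    port_open "$1" "$2" && return 0
    attempt=$(( attempt + 1 ))
    sleep "$4"
  done
  return 1
}

# Usage in user_data (values are examples): wait up to about a minute for Bolt.
# wait_for_port 127.0.0.1 7687 30 2 || echo "Neo4j did not come up" >&2
```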

Create private hosted zone via Terraform/FSD or AWS CLI:

aws route53 create-hosted-zone --name consumer-graph.internal --vpc VPCRegion=us-east-1,VPCId=vpc-xxx --caller-reference $(date +%s)
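
The self-registration script needs this zone's ID. Rather than pasting it in by hand, it could be looked up by name. A minimal sketch, assuming the AWS CLI is configured with Route 53 read access; `normalize_zone_id` and `lookup_zone_id` are hypothetical helper names. Route 53 reports IDs in the form `/hostedzone/Z...`, so the sketch strips that prefix.

```shell
#!/bin/sh
# Sketch only: function names are illustrative, not part of the AWS CLI.

# Strip the "/hostedzone/" prefix that Route 53 responses put on zone IDs.
normalize_zone_id() {
  echo "${1#/hostedzone/}"
}

# Look up a hosted zone ID by DNS name ($1).
# list-hosted-zones-by-name returns zones starting at the given name, so the
# first result may be a different zone (or a public zone with the same name);
# verify it before relying on it.
lookup_zone_id() {
  raw_id=$(aws route53 list-hosted-zones-by-name \
    --dns-name "$1" \
    --query 'HostedZones[0].Id' \
    --output text) || return 1
  normalize_zone_id "$raw_id"
}

# Usage (requires AWS credentials, so left commented out):
# HOSTED_ZONE_ID=$(lookup_zone_id consumer-graph.internal)
```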

Add DNS self-registration to consumer-graph-neo4j-ec2.yml user_data (after Neo4j starts):

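# Get an IMDSv2 session token (skip if user_data already sets IMDS_TOKEN)
IMDS_TOKEN=$(curl -s -X PUT "http://169.254.169.254/latest/api/token" -H "X-aws-ec2-metadata-token-ttl-seconds: 60")
# Get instance private IP
PRIVATE_IP=$(curl -s -H "X-aws-ec2-metadata-token: $IMDS_TOKEN" http://169.254.169.254/latest/meta-data/local-ipv4)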
# Update Route 53 record
HOSTED_ZONE_ID="Z0123456789" # from hosted zone creation
aws route53 change-resource-record-sets --hosted-zone-id "$HOSTED_ZONE_ID" --change-batch '{
  "Changes": [{
    "Action": "UPSERT",
    "ResourceRecordSet": {
      "Name": "neo4j.consumer-graph.internal",
      "Type": "A",
      "TTL": 60,
      "ResourceRecords": [{"Value": "'"$PRIVATE_IP"'"}]
    }
  }]
}'
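
If the metadata call fails, PRIVATE_IP is empty and the inline JSON above would submit a record with no address. One hedged alternative is to validate the value first and build the change batch with a here-document, which also avoids the nested quoting. `is_ipv4` and `build_change_batch` are hypothetical helpers, not existing tooling.

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical.

# Succeed only if $1 looks like a dotted-quad IPv4 address.
is_ipv4() {
  case $1 in
    ''|*[!0-9.]*|.*|*.|*..*) return 1 ;;
  esac
  old_ifs=$IFS
  IFS=.
  set -- $1            # split into octets
  IFS=$old_ifs
  [ $# -eq 4 ] || return 1
  for octet in "$@"; do
    [ "$octet" -le 255 ] || return 1
  done
}

# Print an UPSERT change batch for an A record: name ($1), IP ($2), TTL ($3).
build_change_batch() {
  cat <<EOF
{
  "Changes": [{
    "Action": "UPSERT",
    "ResourceRecordSet": {
      "Name": "$1",
      "Type": "A",
      "TTL": $3,
      "ResourceRecords": [{"Value": "$2"}]
    }
  }]
}
EOF
}

# Usage inside user_data (left commented out because it calls AWS):
# if is_ipv4 "$PRIVATE_IP"; then
#   aws route53 change-resource-record-sets \
#     --hosted-zone-id "$HOSTED_ZONE_ID" \
#     --change-batch "$(build_change_batch neo4j.consumer-graph.internal "$PRIVATE_IP" 60)"
# else
#   echo "could not determine private IP; skipping DNS registration" >&2
# fi
```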

Update worker configs:

neo4j:
  uri: "neo4j://neo4j.consumer-graph.internal:7687"

Benefits:

  • Fully automated deploy: no manual IP updates, no worker redeployment after Neo4j redeploy
  • DNS abstraction survives infrastructure changes
  • Health checks provide monitoring and alerting
  • Foundation for multi-instance (multiple A records)

Limitations:

  • Still self-managed Neo4j (patching, upgrades, backups)
  • No built-in HA or clustering
  • DNS TTL means brief connectivity gaps during redeploy (mitigated by low TTL)
  • EC2 operational burden remains

Cost:

  • Route 53 hosted zone: $0.50/month
  • Health checks: $0.50-$1.00/month
  • Negligible compared to EC2 instance cost
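
Summing the two line items gives the +$1.50 upper bound quoted in the summary table. The arithmetic below is only a worked check, done in integer cents because POSIX shell has no floating point; the $226 EC2 figure is the one from the summary table.

```shell
#!/bin/sh
# Worked check of the Phase 2 cost figures, in cents.
zone_cents=50        # hosted zone: $0.50/month
hc_min_cents=50      # health checks, low end: $0.50/month
hc_max_cents=100     # health checks, high end: $1.00/month
ec2_cents=22600      # current EC2 cost: ~$226/month

total_min=$(( zone_cents + hc_min_cents ))
total_max=$(( zone_cents + hc_max_cents ))

# Upper bound as a share of the EC2 bill, in hundredths of a percent
# (integer division, so the result is rounded down).
share_bp=$(( total_max * 10000 / ec2_cents ))

printf 'Route 53 adds $%d.%02d to $%d.%02d per month, at most about %d.%02d%% of the EC2 cost\n' \
  $(( total_min / 100 )) $(( total_min % 100 )) \
  $(( total_max / 100 )) $(( total_max % 100 )) \
  $(( share_bp / 100 )) $(( share_bp % 100 ))
```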

Phase 3: Managed Database

Goal: Eliminate EC2 management entirely with a fully managed graph database service.

Even with Route 53, EC2 Neo4j carries operational burden: manual version upgrades, no automated backups, single-AZ deployment, OS/Java patching. A managed service eliminates these concerns.

Two managed options are under evaluation:

| Option | Migration Effort | Key Tradeoff |
|---|---|---|
| Neo4j AuraDB | Low (config change only, Bolt protocol compatible) | Third-party managed, less AWS integration |
| Amazon Neptune | High (data access layer rewrite, openCypher subset) | AWS-native, serverless option |

The Phase 3 backend choice depends on empirical performance and cost data from the Capacity Experiment Plan, which benchmarks both options at scale (Experiments 2 and 3). Key questions only experiments can answer:

  • Does AuraDB latency over the internet match EC2-in-VPC latency?
  • Does Neptune’s openCypher support our MERGE + CASE WHEN patterns?
  • What’s the real cost per user per month at 1M+ users?
  • How do graph+vector combined queries perform on each platform?

See Action Plan — Decision Gate B for the decision criteria.

Current State     Phase 1               Phase 2           Phase 3
(Hardcoded IP) -> (Fixed Private IP) -> (Route 53 DNS) -> (Managed DB)
                                              ↑                 ↑
                                        EC2 capacity      Capacity experiments
                                        experiments       (AuraDB + Neptune)
                                        determine if      determine which
                                        this is needed

Each phase is independently valuable and can be deployed without committing to subsequent phases. The progression is driven by operational pain and informed by experiment results:

  • Phase 1 is worth doing immediately — it eliminates the most frequent manual step (IP updates)
  • EC2 capacity experiments determine how much headroom the current setup has, and whether Phase 2 or Phase 3 is the right next investment
  • Phase 2 is worth doing if EC2 has enough capacity for the medium term — adds DNS health checks and fully hands-off EC2 redeploys
  • Phase 3 is worth doing when EC2 capacity or operational overhead becomes a bottleneck — the Capacity Experiment Plan will provide the data to choose between AuraDB and Neptune

See Action Plan for the full sequenced timeline and decision gates.