Skip to content

Backfill Runbook

How to backfill a set of user IDs into Neo4j and enable real-time Kafka ingestion for them.

Use case: You have a file of user IDs (e.g., experiment cohort, new segment). You want to:

  1. Load their 365-day purchase history into Neo4j
  2. Enable real-time Kafka ingestion so their future receipts/factoids flow into the graph

  • Unified worker is deployed with the FullLoad dedup fix (overwrites SHOPS_IN/SHOPS_AT aggregates on backfill)
  • Unified worker is deployed with category enrichment code (backfill now enriches Product nodes with category_id and category_hierarchy via FPS + category-service)
  • You have AWS credentials with access to the target environment (stage or prod)
  • You have tempadmin credentials for the target AWS account (needed for graph audit and backfill CLI):
  • Your user ID file is newline-delimited (one user ID per line)

Both consumer groups must be caught up to the tip of their topics before proceeding.

Terminal window
# Factoids consumer lag
aws cloudwatch get-metric-statistics \
--namespace AWS/Kafka \
--metric-name SumOffsetLag \
--dimensions "Name=Consumer Group,Value=unified-worker-factoids" \
"Name=Cluster Name,Value=prod-factoids" \
"Name=Topic,Value=factoid-stream" \
--start-time $(date -u -v-10M +%Y-%m-%dT%H:%M:%S) \
--end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
--period 60 \
--statistics Maximum \
--profile prod-services-admin \
--region us-east-1
# Receipts consumer lag
aws cloudwatch get-metric-statistics \
--namespace AWS/Kafka \
--metric-name SumOffsetLag \
--dimensions "Name=Consumer Group,Value=unified-worker-receipts" \
"Name=Cluster Name,Value=prod-receipt-infra" \
"Name=Topic,Value=pipeline-v1-processed-receipt-events" \
--start-time $(date -u -v-10M +%Y-%m-%dT%H:%M:%S) \
--end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
--period 60 \
--statistics Maximum \
--profile prod-services-admin \
--region us-east-1

For stage, replace the following in both commands:

ParameterProdStage
Cluster Name (factoids)prod-factoidsstage-factoids
Cluster Name (receipts)prod-receipt-infrastage-receipt-infra
AWS profileprod-services-adminstage-services-admin

All Maximum datapoints should be 0 (or near-zero, e.g., 1-2 from real-time messages). If datapoints are empty, the consumer group is not active.

Do not proceed if lag > 0. If consumers are replaying a backlog, enabling the flag would cause double-counting of SHOPS_IN/SHOPS_AT aggregates.

Step 2: Add user IDs to the feature flag that gates Kafka ingestion

Section titled “Step 2: Add user IDs to the feature flag that gates Kafka ingestion”

Add the user IDs as a segment in Feature Flipper on the flag that gates Kafka ingestion for your users. This ensures future receipts/factoids flow into the graph after the backfill.

There are two approaches depending on the number of users:

Option A: Inline IDs (small sets, <1K users)

Section titled “Option A: Inline IDs (small sets, <1K users)”

In the Feature Flipper UI, add user IDs directly to a segment’s allowedIDs field (comma-separated).

Option B: S3-backed segment (large sets, >1K users)

Section titled “Option B: S3-backed segment (large sets, >1K users)”

Upload the user ID file to the Feature Flipper S3 bucket, then point a segment’s allowedS3IDs to the file.

Terminal window
# Upload to prod
aws s3 cp <user-file> \
s3://prod-feature-flipper/<filename>.csv \
--profile prod-services-admin
# Upload to stage
aws s3 cp <user-file> \
s3://stage-feature-flipper/<filename>.csv \
--profile stage-services-admin

Then in the Feature Flipper UI:

  1. Go to the flag
  2. Add a new segment (or update an existing S3-backed segment)
  3. Set allowedS3IDs to <filename>.csv
  4. Set rollout to 100

Do this BEFORE the backfill. It’s safe because:

  • Kafka consumers are already at the tip of the topic (verified in Step 1)
  • The 7-day retained messages were already consumed and discarded when the flag was off for these users
  • No retained messages will replay for newly-enabled users
  • There is no wait needed between enabling the flag and running the backfill

Note: For experiment-enrolled users (via /consumer-graph-worker/enroll), the Valkey key experiment:enrolled:{user_id} serves as an alternative Kafka gate alongside the feature flag. Those users do not need to be added to the flag. This step is only needed for users being backfilled outside the enrollment API flow.

Run /graph-audit <env> (Claude Code skill) before the backfill to capture a baseline. Key metrics to record:

  • Total nodes and relationships
  • Products with category_id / category_hierarchy counts
  • IN_CATEGORY relationship count
  • Any existing CRITICAL or WARN issues

This baseline is compared against the post-backfill audit to verify the backfill wrote data correctly.

Section titled “Option A: Cloud endpoint (recommended — no TempAdmin needed)”

The unified worker exposes POST /admin/backfill for cloud-based execution. No local AWS credentials required.

Terminal window
# Backfill specific users
curl -X POST http://<ecs-service>:8080/admin/backfill \
-d '{"user_ids": ["user1", "user2"], "rate_limit": 10, "lookback_days": 365}'
# Backfill from S3 file
curl -X POST http://<ecs-service>:8080/admin/backfill \
-d '{"s3_uri": "s3://bucket/user-ids.txt"}'
# Backfill all users currently in Neo4j
curl -X POST http://<ecs-service>:8080/admin/backfill \
-d '{"source": "neo4j"}'
# Check progress
curl http://<ecs-service>:8080/admin/backfill/<job_id>
# Cancel a running backfill
curl -X POST http://<ecs-service>:8080/admin/backfill/<job_id>/cancel
Terminal window
# Dry run
./bin/backfill \
--user-file <user-file> \
--queue-url "https://sqs.us-east-1.amazonaws.com/705747465351/prod-consumer-graph-queue" \
--rate-limit 10 \
--lookback-days 365 \
--dry-run
# Actual run
AWS_PROFILE=prod-services-admin ./bin/backfill \
--user-file <user-file> \
--queue-url "https://sqs.us-east-1.amazonaws.com/705747465351/prod-consumer-graph-queue" \
--rate-limit 10 \
--lookback-days 365

Stage queue URL: https://sqs.us-east-1.amazonaws.com/643098206224/stage-consumer-graph-queue

Timing estimates at 10 req/s:

UsersTime
1,000~2 min
10,000~17 min
75,000~2 hours
200,000~5.5 hours

The backfill sends SQS messages. The unified worker’s SQS consumer picks them up and runs two backfill steps per user:

  1. Receipt backfill — calls Purchase History API for 365 days of receipt data, enriches with FPS + category-service, writes to Neo4j
  2. Shop order backfill — calls Button Transaction Service (BTS) for 365 days of shop order data, enriches FIDOs with FPS, writes to Neo4j

Each step has an independent idempotency key (receipt-{requestID} / shop-{requestID}), so partial failures can be retried without reprocessing the successful step. Shop order dedup uses buttonOrderId as receiptID, which matches the factoid Kafka stream’s summaryID.

Watch the queue drain to confirm messages are being processed:

Terminal window
watch -n 5 'aws sqs get-queue-attributes \
--queue-url "https://sqs.us-east-1.amazonaws.com/705747465351/prod-consumer-graph-queue" \
--attribute-names ApproximateNumberOfMessages ApproximateNumberOfMessagesNotVisible \
--profile prod-services-admin \
--output table'
  • ApproximateNumberOfMessages = messages waiting to be picked up
  • ApproximateNumberOfMessagesNotVisible = messages currently being processed

Both should trend to 0 as the backfill completes.

Check consumer-graph-worker logs for backfill progress:

{service_name="consumer-graph-worker", deployment_environment="prod"} |= "user_load"

Spot-check a few user IDs to confirm their graphs were written:

Terminal window
curl -s -u 'neo4j:<password>' \
-H "Content-Type: application/json" \
-d '{"statements":[{"statement":"MATCH (u:User {user_id: $uid})-[r]-() RETURN type(r) AS rel, count(r) AS cnt","parameters":{"uid":"<user_id>"}}]}' \
http://<neo4j-host>:7474/db/neo4j/tx/commit

Run /graph-audit <env> again after the SQS queue drains to 0/0. Compare against the pre-backfill baseline:

MetricWhat to check
User countShould match number of unique user IDs in the backfill file
Product countShould increase (new products discovered from purchase history)
Products with category_idShould increase — FPS returns categoryId for ~47% of products
Products with category_hierarchyShould match category_id count (category-service resolves all valid IDs)
IN_CATEGORY countShould increase proportionally to new products
PURCHASED countShould increase (new purchase relationships)
Future timestampsShould remain 0
Health scoreShould remain HEALTHY (no new CRITICAL issues)

If category_hierarchy count is significantly lower than category_id count, investigate category-service connectivity.


Use Case: Backfill AI Assistant Beta Users (~50K from Feature Flipper)

Section titled “Use Case: Backfill AI Assistant Beta Users (~50K from Feature Flipper)”

The ai_assistant_enabled Feature Flipper flag contains ~50K users across multiple segments (internal employees, beta testers, active users via S3). To backfill these users into the consumer graph and enable Kafka ingestion:

Terminal window
cd ~/src/consumer-graph-worker
# Fetch all user IDs from ai_assistant_enabled segments + add extra IDs
python3 scripts/fetch-beta-users.py \
--flag ai_assistant_enabled \
--extra-ids <comma-separated-extra-ids> \
-o ~/src/data/ai-assistant-beta-users.csv

The script pulls user IDs from all segments (inline allowedIDs and S3-backed allowedS3IDs in s3://prod-feature-flipper/). Use --env stage for staging.

Expected output: ~50K user IDs written to ~/src/data/ai-assistant-beta-users.csv.

Follow Step 1. All datapoints must be 0 or near-zero before proceeding.

Follow Step 3. Record baseline metrics for comparison.

4. Upload to consumer_graph_ingestion flag via S3

Section titled “4. Upload to consumer_graph_ingestion flag via S3”

Follow Step 2 using Option B (S3-backed segment):

Terminal window
aws s3 cp ~/src/data/ai-assistant-beta-users.csv \
s3://prod-feature-flipper/ai-assistant-beta-users.csv \
--profile prod-services-admin

Then in the Feature Flipper UI, add a segment to the consumer_graph_ingestion flag:

  1. Add a new segment (e.g., “AI Assistant Beta Users”)
  2. Set allowedS3IDs to ai-assistant-beta-users.csv
  3. Set rollout to 100

The consumer_graph_ingestion flag currently has two segments:

  • Internal Users (internalUsers=true) — internal employees
  • Team Users — 1 inline ID

The new S3-backed segment adds the ~50K AI assistant beta users.

Follow Step 4 with the user file:

Terminal window
# Dry run
./bin/backfill \
--user-file ~/src/data/ai-assistant-beta-users.csv \
--queue-url "https://sqs.us-east-1.amazonaws.com/705747465351/prod-consumer-graph-queue" \
--rate-limit 10 \
--lookback-days 365 \
--dry-run
# Actual run
AWS_PROFILE=prod-services-admin ./bin/backfill \
--user-file ~/src/data/ai-assistant-beta-users.csv \
--queue-url "https://sqs.us-east-1.amazonaws.com/705747465351/prod-consumer-graph-queue" \
--rate-limit 10 \
--lookback-days 365

At 10 req/s, ~50K users takes ~1.5 hours.

Follow Step 5. Watch SQS queue depth, Loki logs, and spot-check Neo4j.

Once the queue drains to 0/0, follow Step 6 to compare against the baseline.


Backfills are safe to re-run at any time as long as Kafka lag is zero:

  • PURCHASED: Receipt-ID dedup guard prevents duplicate edges (receipts use receipt_id, shop orders use buttonOrderId/summaryID)
  • SHOPS_IN / SHOPS_AT: Receipt backfill sets FullLoad=true (overwrites aggregates). Shop order backfill uses FullLoad=false (incremental — adds to receipt-based aggregates)
  • User node: MERGE with ON MATCH SET updates last_updated_at
  • Product nodes: category_id and category_hierarchy are updated via ON MATCH SET with CASE guards — only overwrites when the new value is non-empty, so re-runs and Kafka writes coexist safely
  • Request IDs: Each run generates new backfill-<uuid> request IDs with independent receipt/shop idempotency keys, so the orchestrator won’t block re-runs

As of the category enrichment update, the backfill path now populates category_id and category_hierarchy on Product nodes. The enrichment chain is:

  1. Purchase History API returns purchase data with FIDO UUIDs
  2. FIDO Product Service (FPS) returns product metadata including categoryId (a UUID)
  3. category-service resolves the categoryId into a hierarchy array (list of UUIDs from root to leaf)
  4. Neo4j write persists category_id and category_hierarchy on the Product node

This only runs when the orchestrator has a categoryClient configured — which is the case in the unified worker (both SQS consumer component and main). The standalone cmd/worker passes nil for the category client (no category enrichment).

Kafka path is not affected. The Kafka consumer has its own category enrichment pipeline (PR #134). The Cypher ON MATCH SET with CASE guards ensures safe coexistence:

  • If backfill writes first, Kafka overwrites with its own data (always non-empty from real-time events)
  • If Kafka writes first, backfill only overwrites if its data is non-empty (which it will be)
  • Both paths produce the same result for the same product

Deployment required. The worker must be deployed with the category enrichment code before running backfill. Without it, Product nodes will have category_id and category_hierarchy only from the Kafka path (real-time events after flag enablement).


SymptomCauseFix
Backfill CLI exits with resume file writtenSQS send failuresUse --resume-from <file> to continue from where it stopped
Users in Neo4j but no new Kafka data flowingUser not in feature flag segmentAdd user IDs to consumer_graph_ingestion flag
Duplicate SHOPS_IN/SHOPS_AT countsKafka replayed retained messages (lag was not zero)Re-run backfill — FullLoad=true will overwrite with correct totals
loaded_users cache blocking re-backfill7-day TTL on loaded_users:{user_id}Wait for TTL expiry, or delete the key manually in Valkey
Purchase History API 429 rate limitsToo many concurrent requestsReduce --rate-limit (default 10 is safe)

1. Verify Kafka lag = 0 ← consumers must be caught up
2. Upload users to FF S3 + add segment ← enables Kafka gate for future events
3. Run pre-backfill graph audit ← baseline metrics
4. Run backfill (dry-run first) ← loads 365-day history + category enrichment into Neo4j
5. Monitor SQS + Loki + Neo4j ← confirm completion
6. Run post-backfill graph audit ← compare against baseline, verify category data

See also: