Large-scale data acquisition is not about writing a scraper. It is about engineering a system that survives scale, concurrency, failures, and time.
This article documents the architecture and execution of a production-grade data acquisition and ingestion system that processed:
- 30M+ menu records
- 352K business listings
- 2.4M menu item categories
- ~70GB of structured data
- Multi-day continuous execution
- Less than 2% initial failure rate, resolved via retry pipelines
The system was intentionally engineered for control, observability, and reliability, not shortcuts.
Design Philosophy
From day one, the system was designed around three principles:
- Predictable behavior beats raw speed
- Every failure must be recoverable
- Data correctness matters more than volume
At this scale, any weak assumption eventually breaks.
High-Level Architecture
The system consisted of four tightly coordinated layers:
- Frontend execution layer (JavaScript)
- Request/session & networking layer
- Processing & normalization layer
- Persistence & indexing layer (MySQL)
Each layer was isolated to prevent cascading failure while remaining fully observable.
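A simplified sketch of how the layers composed (the function names here are illustrative, not the production code): each task flows through fetch, normalization, and persistence, and a failure in any layer is quarantined for later retry instead of cascading into the others.

```javascript
// Hypothetical layer interfaces; the real system wired these as separate
// components (JS execution, request/session layer, normalization, MySQL).
async function runTask(task, { fetchLayer, normalizeLayer, persistLayer, failureLog }) {
  try {
    const raw = await fetchLayer(task);      // request/session & networking layer
    const records = normalizeLayer(raw);     // processing & normalization layer
    await persistLayer(records);             // persistence & indexing layer (MySQL)
    return { task, status: 'done' };
  } catch (err) {
    // Failures are recorded and quarantined for the retry pipeline
    // instead of propagating upward and stalling other workers.
    await failureLog(task, err);
    return { task, status: 'failed' };
  }
}
```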
Frontend Execution Layer (Raw JavaScript)
One of the most critical components was the frontend execution logic.
Instead of browser automation frameworks, I used raw JavaScript, injected and executed as controlled code snippets.
Execution characteristics:
- Pure JavaScript (no frameworks)
- Batch-based execution
- Recursive resolution until completion
- Explicit state tracking
- Zero reliance on UI rendering
Why this mattered:
- Minimal overhead
- Fine-grained control over execution
- Faster iteration
- Easier recovery on partial failures
Each batch processed a known subset of tasks and recursively continued until all targets were resolved.
This ensured:
- No orphaned tasks
- Deterministic progress
- Safe restarts at any point in the pipeline
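As a rough illustration of the batch-and-recurse pattern (the endpoint, batch size, and state shape are assumptions, not the original snippets):

```javascript
// Sketch of batch-based recursive resolution in plain browser JavaScript.
const TARGET_ENDPOINT = 'https://example.com/api/menu'; // placeholder

async function resolveAll(taskIds, batchSize = 25, state = { done: [], failed: [] }) {
  if (taskIds.length === 0) return state;                 // every target resolved

  const batch = taskIds.slice(0, batchSize);
  const results = await Promise.allSettled(
    batch.map(id => fetch(`${TARGET_ENDPOINT}/${id}`).then(r => r.json()))
  );

  results.forEach((res, i) => {
    if (res.status === 'fulfilled') state.done.push({ id: batch[i], data: res.value });
    else state.failed.push(batch[i]);                     // handed to the retry pipeline
  });

  // Recurse on the remainder: no orphaned tasks, progress is deterministic,
  // and a restart can resume from whatever is left in `taskIds`.
  return resolveAll(taskIds.slice(batchSize), batchSize, state);
}
```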
Distributed Network & Proxy Layer
Network reliability was handled through a distributed proxy architecture.
Infrastructure setup:
- 20+ lightweight VPS instances (DigitalOcean)
- Each VPS acted as a controlled outbound proxy
- Health checks disabled unstable nodes
- Traffic distribution balanced across workers
This architecture provided:
- Predictable network behavior
- Isolation of network failures
- Stable long-running execution
- Easy replacement of unhealthy nodes
The system favored many small, disposable nodes over complex centralized infrastructure.
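A Node-style sketch of the health-check and rotation idea (proxy addresses, the health-check URL, and the interval are made up; undici's ProxyAgent is used here for illustration):

```javascript
// Rotate across many small proxy nodes and sideline unhealthy ones.
const { fetch, ProxyAgent } = require('undici');

const proxies = ['http://10.0.0.1:3128', 'http://10.0.0.2:3128'].map(url => ({
  url,
  agent: new ProxyAgent(url),
  healthy: true,
}));

async function healthCheck() {
  for (const p of proxies) {
    try {
      const res = await fetch('https://example.com/health', {
        dispatcher: p.agent,
        signal: AbortSignal.timeout(5000),
      });
      p.healthy = res.ok;
    } catch {
      p.healthy = false;                       // node is sidelined until it recovers
    }
  }
}

let cursor = 0;
function nextProxy() {
  const healthy = proxies.filter(p => p.healthy);
  if (healthy.length === 0) throw new Error('no healthy proxy nodes');
  return healthy[cursor++ % healthy.length];   // simple round-robin
}

setInterval(healthCheck, 30_000);              // re-check periodically; nodes can rejoin
// usage: fetch(url, { dispatcher: nextProxy().agent })
```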
Concurrency Model & Thread Coordination
The core execution ran with:
- 12 concurrent worker threads
- Shared task queues
- Centralized state tracking
- Coordinated lifecycle management
The real challenge was that every worker thread was simultaneously:
- Actively reading tasks
- Processing complex relational data
- Writing to the same MySQL database
This required:
- Strict transaction boundaries
- Idempotent writes
- Collision-safe schema design
- Careful index strategy
Concurrency was tuned to avoid:
- Write contention
- Lock escalation
- Memory exhaustion
- Deadlocks
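The production write path was raw PHP, but the transaction-boundary and idempotency pattern is language-agnostic; a minimal Node/mysql2 sketch (table and column names assumed) of a collision-safe write looks like this:

```javascript
// Idempotent, transaction-bounded write: a stable unique key plus
// INSERT ... ON DUPLICATE KEY UPDATE makes concurrent workers and
// retries safe against collisions and double-processing.
const mysql = require('mysql2/promise');

async function writeListing(pool, listing) {
  const conn = await pool.getConnection();
  try {
    await conn.beginTransaction();             // strict, short-lived transaction scope
    await conn.query(
      `INSERT INTO business_listings (listing_id, name, lat, lng)
       VALUES (?, ?, ?, ?)
       ON DUPLICATE KEY UPDATE name = VALUES(name), lat = VALUES(lat), lng = VALUES(lng)`,
      [listing.id, listing.name, listing.lat, listing.lng]
    );
    await conn.commit();
  } catch (err) {
    await conn.rollback();                     // nothing half-written on failure
    throw err;
  } finally {
    conn.release();                            // hand the connection back to the pool
  }
}
```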
Native Session Handling
Native session management proved to be a major performance multiplier.
Session strategy:
- Sessions persisted per worker lifecycle
- Automatic invalidation on error signals
- Controlled recreation logic
- Scoped reuse across deep navigation paths
Benefits:
- Reduced repeated negotiation overhead
- Improved request consistency
- Lower failure rates
- Faster overall throughput
Stateless execution was deliberately avoided where it harmed efficiency.
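A generic sketch of the per-worker session idea (class, endpoint, and error signals are hypothetical): the session is reused across many requests and only rebuilt, in a controlled way, when an error signal suggests it has gone stale.

```javascript
// Per-worker session reuse with controlled invalidation and recreation.
class WorkerSession {
  constructor(baseUrl) {
    this.baseUrl = baseUrl;
    this.cookie = null;                        // session token reused across requests
  }

  async ensureSession() {
    if (this.cookie) return;
    const res = await fetch(`${this.baseUrl}/session`);   // placeholder bootstrap call
    this.cookie = res.headers.get('set-cookie');
  }

  async get(path, retried = false) {
    await this.ensureSession();
    const res = await fetch(`${this.baseUrl}${path}`, {
      headers: this.cookie ? { cookie: this.cookie } : {},
    });
    if ((res.status === 401 || res.status === 403) && !retried) {
      this.cookie = null;                      // invalidate on error signals
      return this.get(path, true);             // controlled recreation, single retry
    }
    return res;
  }
}
```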
Data Modeling & Schema Normalization
The dataset was deeply relational and required strict normalization.
Core entities:
- Business listings
- Menus
- Menu categories
- Menu items
- Add-ons
- Variants
- Pricing rules
- Metadata
- Geospatial attributes
Each entity was:
- Modeled independently
- Assigned stable identifiers
- Linked via foreign keys
- Designed for incremental processing
This enabled:
- Deduplication
- Partial reprocessing
- Safe retries
- Long-term maintainability
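For illustration, here is what the normalized shape of two of these entities can look like (table and column names are assumptions, not the production schema); stable identifiers plus foreign keys are what make deduplication and partial reprocessing safe.

```javascript
// Illustrative DDL for two core entities, applied from a migration script.
const statements = [
  `CREATE TABLE IF NOT EXISTS business_listings (
     listing_id  BIGINT UNSIGNED NOT NULL,
     name        VARCHAR(255)    NOT NULL,
     lat         DECIMAL(9,6),
     lng         DECIMAL(9,6),
     PRIMARY KEY (listing_id)
   ) ENGINE=InnoDB`,
  `CREATE TABLE IF NOT EXISTS menu_items (
     item_id     BIGINT UNSIGNED NOT NULL,
     listing_id  BIGINT UNSIGNED NOT NULL,
     category_id BIGINT UNSIGNED NOT NULL,
     name        VARCHAR(255)    NOT NULL,
     price_cents INT UNSIGNED,
     PRIMARY KEY (item_id),
     KEY idx_listing (listing_id),
     FOREIGN KEY (listing_id) REFERENCES business_listings (listing_id)
   ) ENGINE=InnoDB`,
];

// Apply with any MySQL client, e.g. mysql2/promise:
// for (const sql of statements) await conn.query(sql);
```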
Geolocation & Spatial Indexing
Location accuracy was a first-class requirement.
Geospatial handling:
- Latitude & longitude normalized per listing
- Precision validation during ingestion
- Spatial indexes applied at the database level
- Optimized queries for radius-based search
This allowed:
- Accurate geographic filtering
- Fast proximity queries
- Scalable location-based operations
Geolocation was engineered as infrastructure, not metadata.
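A sketch of the spatial pieces (names assumed, MySQL 5.7+/8.0): the listings table carries a POINT column with a SPATIAL INDEX, populated as POINT(longitude, latitude) during ingestion, and radius lookups use ST_Distance_Sphere.

```javascript
// Radius query against a spatially indexed listings table.
// `conn` is assumed to be a mysql2/promise connection or pool.
async function findNearby(conn, lat, lng, radiusMeters) {
  // ST_Distance_Sphere returns meters between two lng/lat points.
  const [rows] = await conn.query(
    `SELECT listing_id, name,
            ST_Distance_Sphere(location, POINT(?, ?)) AS distance_m
       FROM business_listings
      WHERE ST_Distance_Sphere(location, POINT(?, ?)) <= ?
      ORDER BY distance_m`,
    [lng, lat, lng, lat, radiusMeters]          // POINT(x, y) = POINT(lng, lat)
  );
  return rows;
}
// A bounding-box prefilter (MBRContains) can be added so the spatial index
// narrows candidates before the exact distance check.
```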
High-Throughput Persistence Layer (MySQL + Raw PHP)
At this scale, the write path determines success or failure.
Persistence strategy:
- Raw PHP write layer
- No heavy ORM abstractions
- Batch inserts where possible
- Deferred index creation
- Controlled transaction scopes
- Streaming data writes to limit memory usage
Why raw PHP:
- Predictable performance
- Minimal overhead
- Full control over execution
- Better handling of massive write volumes
This layer handled millions of writes reliably, where abstraction-heavy approaches would have collapsed.
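The production layer was raw PHP; the following Node/mysql2 sketch only illustrates the batching pattern it relied on (names assumed): chunk the rows, issue one multi-row INSERT per chunk, and keep transaction and memory footprints bounded.

```javascript
// Chunked, idempotent bulk inserts.
const mysql = require('mysql2/promise');

async function insertMenuItems(pool, items, chunkSize = 1000) {
  for (let i = 0; i < items.length; i += chunkSize) {
    const chunk = items.slice(i, i + chunkSize)
      .map(it => [it.itemId, it.listingId, it.categoryId, it.name, it.priceCents]);

    // mysql2 expands `VALUES ?` with a nested array into a multi-row insert.
    await pool.query(
      `INSERT INTO menu_items (item_id, listing_id, category_id, name, price_cents)
       VALUES ?
       ON DUPLICATE KEY UPDATE name = VALUES(name), price_cents = VALUES(price_cents)`,
      [chunk]
    );
  }
}
```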
Failure Detection, Retry & Recovery
Failures were treated as normal operating conditions.
Recovery strategy:
- Every request outcome logged
- Failed tasks persisted separately
- Independent retry pipelines
- Multiple retry passes supported
- Idempotent writes ensured safe reprocessing
Results:
- Initial failure rate: ~2%
- Post-retry unresolved failures: near zero
The system could be stopped, restarted, or partially rerun without data corruption or duplication.
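As a sketch of what a retry pass can look like (the failed-task table, its columns, and processTask are assumptions): because writes are idempotent, reprocessing a task that actually succeeded earlier is harmless.

```javascript
// One retry pass over persisted failures; callers can loop until it returns 0.
async function retryPass(pool, processTask, maxAttempts = 3) {
  const [failed] = await pool.query(
    'SELECT task_id, payload, attempts FROM failed_tasks WHERE attempts < ?',
    [maxAttempts]
  );

  for (const task of failed) {
    try {
      await processTask(JSON.parse(task.payload));
      await pool.query('DELETE FROM failed_tasks WHERE task_id = ?', [task.task_id]);
    } catch (err) {
      await pool.query(
        'UPDATE failed_tasks SET attempts = attempts + 1, last_error = ? WHERE task_id = ?',
        [String(err), task.task_id]
      );
    }
  }
  return failed.length;
}
```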
Observability & Control
Visibility was essential for multi-day execution.
Observability features:
- Structured logging
- Per-thread metrics
- Progress checkpoints
- Failure categorization
- Throughput monitoring
This allowed proactive intervention before issues escalated.
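A minimal structured-logging sketch (field names illustrative): one JSON line per event, so multi-day runs can be grepped, counted, and charted per thread and per stage.

```javascript
const fs = require('fs');

function logEvent(event) {
  const line = JSON.stringify({
    ts: new Date().toISOString(),
    worker: process.pid,
    ...event,                        // e.g. { stage: 'persist', taskId, outcome, ms }
  });
  fs.appendFileSync('ingest.log.jsonl', line + '\n');
}

// Example: logEvent({ stage: 'fetch', taskId: 123, outcome: 'ok', ms: 420 });
```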
Why This Architecture Worked
This system succeeded because it prioritized:
- Engineering discipline
- Controlled complexity
- Explicit state management
- Reliable persistence
- Recovery-first thinking
No single component was extraordinary — the composition was.
Final Results
- 70GB+ of structured data
- Millions of normalized records
- Accurate geospatial indexing
- Multi-day uninterrupted execution
- Recoverable, restart-safe pipeline
This was not about scraping faster. It was about building systems that don't break when they matter most.
Why This Matters for Real Businesses
Businesses face similar challenges when:
- Aggregating external data
- Processing large datasets
- Running long batch jobs
- Scaling ingestion pipelines
- Maintaining data integrity under load
This project demonstrates how engineering-first thinking enables scale — without unnecessary infrastructure or fragile shortcuts.
Want to Build Systems Like This?
If you're dealing with:
- Large-scale data acquisition
- Distributed execution
- High-concurrency write workloads
- Geospatial data
- Long-running batch pipelines
I help design systems that stay reliable under pressure.
👉 Reach out for a technical discussion.
