Tags: data-engineering, system-design, javascript, mysql, distributed-systems, performance

Engineering a Large-Scale Data Acquisition System

December 15, 2025
6 min read
Zaman Tauhid

Large-scale data acquisition is not about writing a scraper. It is about engineering a system that survives scale, concurrency, failures, and time.

This article documents the architecture and execution of a production-grade data acquisition and ingestion system. The headline numbers:

  • 30M+ menu records
  • 352K business listings
  • 2.4M menu item categories
  • ~70GB of structured data
  • Multi-day continuous execution
  • Less than 2% initial failure rate, resolved via retry pipelines

The system was intentionally engineered for control, observability, and reliability, not shortcuts.


Design Philosophy

From day one, the system was designed around three principles:

  1. Predictable behavior beats raw speed
  2. Every failure must be recoverable
  3. Data correctness matters more than volume

At this scale, any weak assumption eventually breaks.


High-Level Architecture

The system consisted of four tightly coordinated layers:

  1. Frontend execution layer (JavaScript)
  2. Request/session & networking layer
  3. Processing & normalization layer
  4. Persistence & indexing layer (MySQL)

Each layer was isolated to prevent cascading failure while remaining fully observable.


Frontend Execution Layer (Raw JavaScript)

One of the most critical components was the frontend execution logic.

Instead of browser automation frameworks, I used raw JavaScript, injected and executed as controlled code snippets.

Execution characteristics:

  • Pure JavaScript (no frameworks)
  • Batch-based execution
  • Recursive resolution until completion
  • Explicit state tracking
  • Zero reliance on UI rendering

Why this mattered:

  • Minimal overhead
  • Fine-grained control over execution
  • Faster iteration
  • Easier recovery on partial failures

Each batch processed a known subset of tasks and recursively continued until all targets were resolved.

This ensured:

  • No orphaned tasks
  • Deterministic progress
  • Safe restarts at any point in the pipeline
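
As a rough illustration, the batch-and-recurse loop looks something like this; the batch size, endpoint, and `processTask` body are placeholders rather than the production code:

```javascript
// Sketch: batch execution with explicit state and recursive continuation.
// BATCH_SIZE, the endpoint, and processTask are illustrative placeholders.
const BATCH_SIZE = 50;

const state = {
  pending: [],        // task ids not yet attempted
  done: new Set(),
  failed: new Set(),  // persisted separately and handled by the retry pass
};

async function processTask(id) {
  // Placeholder for the real per-task work (fetch, parse, hand off for persistence).
  const res = await fetch(`/api/items/${id}`);
  if (!res.ok) throw new Error(`HTTP ${res.status}`);
  return res.json();
}

async function runBatch() {
  if (state.pending.length === 0) return;              // all targets resolved

  const batch = state.pending.splice(0, BATCH_SIZE);
  const results = await Promise.allSettled(batch.map(processTask));

  results.forEach((result, i) => {
    const id = batch[i];
    if (result.status === 'fulfilled') state.done.add(id);
    else state.failed.add(id);
  });

  console.log(`progress: ${state.done.size} done, ${state.failed.size} failed, ${state.pending.length} left`);
  return runBatch();                                    // recurse until the queue drains
}

// Usage sketch: state.pending = targetIds.slice(); runBatch();
```

Because each batch settles before the next one starts, progress is deterministic and the loop can be restarted from whatever remains in `state.pending`.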

Distributed Network & Proxy Layer

Network reliability was handled through a distributed proxy architecture.

Infrastructure setup:

  • 20+ lightweight VPS instances (DigitalOcean)
  • Each VPS acted as a controlled outbound proxy
  • Health checks disabled unstable nodes
  • Traffic distribution balanced across workers

This architecture provided:

  • Predictable network behavior
  • Isolation of network failures
  • Stable long-running execution
  • Easy replacement of unhealthy nodes

The system favored many small, disposable nodes over complex centralized infrastructure.
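
A simplified sketch of the node-rotation idea; the addresses and failure threshold below are illustrative, not the real fleet:

```javascript
// Sketch: round-robin over disposable proxy nodes with simple health tracking.
// Addresses and the failure threshold are illustrative.
class ProxyPool {
  constructor(hosts, maxFailures = 3) {
    this.nodes = hosts.map(host => ({ host, failures: 0, healthy: true }));
    this.maxFailures = maxFailures;
    this.cursor = 0;
  }

  next() {
    const healthy = this.nodes.filter(n => n.healthy);
    if (healthy.length === 0) throw new Error('no healthy proxy nodes left');
    return healthy[this.cursor++ % healthy.length];
  }

  reportSuccess(node) {
    node.failures = 0;
  }

  reportFailure(node) {
    // Disable a node after repeated failures; a fresh VPS can replace it later.
    if (++node.failures >= this.maxFailures) node.healthy = false;
  }
}

// Usage sketch:
const pool = new ProxyPool(['10.0.0.11:3128', '10.0.0.12:3128', '10.0.0.13:3128']);
const node = pool.next();
// ...route the outbound request through node.host, then:
pool.reportSuccess(node); // or pool.reportFailure(node) on error
```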


Concurrency Model & Thread Coordination

The core execution ran with:

  • 12 concurrent worker threads
  • Shared task queues
  • Centralized state tracking
  • Coordinated lifecycle management

The real challenge: every worker thread was simultaneously:

  • Actively reading tasks
  • Processing complex relational data
  • Writing to the same MySQL database

This required:

  • Strict transaction boundaries
  • Idempotent writes
  • Collision-safe schema design
  • Careful index strategy

Concurrency was tuned to avoid:

  • Write contention
  • Lock escalation
  • Memory exhaustion
  • Deadlocks
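
One way to sketch this shape is with Node's worker_threads and a central queue on the main thread; the original coordination mechanism may have differed, and the task payloads here are placeholders:

```javascript
// Sketch: a fixed pool of worker threads pulling from a shared task queue.
// Single-file form: the same script runs as coordinator and as worker.
const { Worker, isMainThread, parentPort } = require('worker_threads');

const THREADS = 12;

if (isMainThread) {
  const queue = Array.from({ length: 1000 }, (_, i) => ({ taskId: i }));  // illustrative tasks

  for (let t = 0; t < THREADS; t++) {
    const worker = new Worker(__filename);

    const dispatch = () => {
      const task = queue.shift();       // the queue lives on the main thread only
      if (task) worker.postMessage(task);
      else worker.terminate();          // queue drained: retire the worker
    };

    worker.on('message', () => dispatch());  // hand out the next task on completion
    dispatch();
  }
} else {
  parentPort.on('message', async task => {
    try {
      // Placeholder for the real work: read, normalize, write inside a bounded
      // transaction using idempotent (upsert-style) statements.
      await new Promise(resolve => setTimeout(resolve, 10));
      parentPort.postMessage({ status: 'done', taskId: task.taskId });
    } catch (err) {
      parentPort.postMessage({ status: 'failed', taskId: task.taskId });
    }
  });
}
```

Keeping the queue on the coordinator side avoids shared-memory contention between threads; the database remains the only shared write target.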

Native Session Handling

Native session management proved to be a major performance multiplier.

Session strategy:

  • Sessions persisted per worker lifecycle
  • Automatic invalidation on error signals
  • Controlled recreation logic
  • Scoped reuse across deep navigation paths

Benefits:

  • Reduced repeated negotiation overhead
  • Improved request consistency
  • Lower failure rates
  • Faster overall throughput

Stateless execution was deliberately avoided where it harmed efficiency.
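
In sketch form, the session-per-worker pattern can look like this; `createSession` and the expiry signal are placeholders:

```javascript
// Sketch: one long-lived session per worker, invalidated and rebuilt on error signals.
// createSession and the SESSION_EXPIRED signal are illustrative placeholders.
class WorkerSession {
  constructor(createSession) {
    this.createSession = createSession;
    this.session = null;
  }

  async get() {
    if (!this.session) this.session = await this.createSession();  // lazy creation
    return this.session;
  }

  invalidate() {
    this.session = null;  // the next get() recreates it
  }

  async request(fn) {
    const session = await this.get();
    try {
      return await fn(session);                      // reuse across deep navigation paths
    } catch (err) {
      if (err && err.code === 'SESSION_EXPIRED') {   // illustrative error signal
        this.invalidate();
        return fn(await this.get());                 // one controlled recreation, then retry
      }
      throw err;
    }
  }
}
```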


Data Modeling & Schema Normalization

The dataset was deeply relational and required strict normalization.

Core entities:

  • Business listings
  • Menus
  • Menu categories
  • Menu items
  • Add-ons
  • Variants
  • Pricing rules
  • Metadata
  • Geospatial attributes

Each entity was:

  • Modeled independently
  • Assigned stable identifiers
  • Linked via foreign keys
  • Designed for incremental processing

This enabled:

  • Deduplication
  • Partial reprocessing
  • Safe retries
  • Long-term maintainability
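
As an illustration of the normalization step, a nested payload can be flattened into stably-keyed rows roughly like this; the field names are invented for the example:

```javascript
// Sketch: flatten a nested listing payload into independent, stably-keyed rows.
// Stable source identifiers are what make dedup, partial reruns, and retries safe.
function normalizeListing(payload) {
  const rows = { businesses: [], menus: [], categories: [], items: [] };

  rows.businesses.push({
    business_id: payload.id,            // stable identifier from the source
    name: payload.name,
    lat: payload.location.lat,
    lng: payload.location.lng,
  });

  for (const menu of payload.menus) {
    rows.menus.push({ menu_id: menu.id, business_id: payload.id, name: menu.name });

    for (const category of menu.categories) {
      rows.categories.push({ category_id: category.id, menu_id: menu.id, name: category.name });

      for (const item of category.items) {
        rows.items.push({
          item_id: item.id,
          category_id: category.id,     // foreign key back to the category row
          name: item.name,
          price: item.price,
        });
      }
    }
  }

  return rows;  // each entity can now be upserted and reprocessed independently
}
```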

Geolocation & Spatial Indexing

Location accuracy was a first-class requirement.

Geospatial handling:

  • Latitude & longitude normalized per listing
  • Precision validation during ingestion
  • Spatial indexes applied at the database level
  • Optimized queries for radius-based search

This allowed:

  • Accurate geographic filtering
  • Fast proximity queries
  • Scalable location-based operations

Geolocation was engineered as infrastructure, not metadata.
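
A sketch of a radius query, assuming a MySQL POINT column named `location` (stored as longitude/latitude, SRID 0) with a SPATIAL index, queried through a mysql2/promise pool; table, column, and parameter names are illustrative:

```javascript
// Sketch: radius search. MBRContains over a bounding envelope lets the spatial
// index narrow the scan; ST_Distance_Sphere applies the exact radius in meters.
const mysql = require('mysql2/promise');
const pool = mysql.createPool({ host: 'localhost', user: 'app', database: 'listings' });  // illustrative

async function listingsNear(lat, lng, radiusMeters) {
  // Rough degree deltas for the bounding box (~111 km per degree of latitude).
  const dLat = radiusMeters / 111000;
  const dLng = radiusMeters / (111000 * Math.cos((lat * Math.PI) / 180));
  const box =
    `POLYGON((${lng - dLng} ${lat - dLat}, ${lng + dLng} ${lat - dLat}, ` +
    `${lng + dLng} ${lat + dLat}, ${lng - dLng} ${lat + dLat}, ${lng - dLng} ${lat - dLat}))`;

  const [rows] = await pool.execute(
    `SELECT business_id, name,
            ST_Distance_Sphere(location, POINT(?, ?)) AS distance_m
       FROM business_listings
      WHERE MBRContains(ST_GeomFromText(?), location)
        AND ST_Distance_Sphere(location, POINT(?, ?)) <= ?
      ORDER BY distance_m`,
    [lng, lat, box, lng, lat, radiusMeters]
  );
  return rows;
}

// Usage sketch: await listingsNear(40.7128, -74.0060, 2000);
```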


High-Throughput Persistence Layer (MySQL + Raw PHP)

At this scale, the write path determines success or failure.

Persistence strategy:

  • Raw PHP write layer
  • No heavy ORM abstractions
  • Batch inserts where possible
  • Deferred index creation
  • Controlled transaction scopes
  • Streaming data writes to limit memory usage

Why raw PHP:

  • Predictable performance
  • Minimal overhead
  • Full control over execution
  • Better handling of massive write volumes

This layer handled millions of writes reliably, where abstraction-heavy approaches would have collapsed.
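
The production write path was raw PHP; purely to illustrate the same batching and idempotency idea, here is a JavaScript/mysql2 sketch of multi-row upserts inside small transactions (table and column names are invented):

```javascript
// Sketch: batched, idempotent writes. Multi-row INSERT ... ON DUPLICATE KEY UPDATE
// inside a bounded transaction keeps retries safe and transaction scopes small.
const mysql = require('mysql2/promise');
const pool = mysql.createPool({ host: 'localhost', user: 'app', database: 'menus', connectionLimit: 12 });  // illustrative

async function writeItems(items, batchSize = 1000) {
  for (let i = 0; i < items.length; i += batchSize) {
    const batch = items.slice(i, i + batchSize);
    const values = batch.map(it => [it.item_id, it.category_id, it.name, it.price]);

    const conn = await pool.getConnection();
    try {
      await conn.beginTransaction();            // keep the transaction scope to one batch
      await conn.query(
        `INSERT INTO menu_items (item_id, category_id, name, price)
         VALUES ?
         ON DUPLICATE KEY UPDATE name = VALUES(name), price = VALUES(price)`,
        [values]                                // one multi-row statement per batch
      );
      await conn.commit();
    } catch (err) {
      await conn.rollback();
      throw err;                                // the caller routes the batch to the retry pipeline
    } finally {
      conn.release();
    }
  }
}
```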


Failure Detection, Retry & Recovery

Failures were treated as normal operating conditions.

Recovery strategy:

  • Every request outcome logged
  • Failed tasks persisted separately
  • Independent retry pipelines
  • Multiple retry passes supported
  • Idempotent writes ensured safe reprocessing

Results:

  • Initial failure rate: ~2%
  • Post-retry unresolved failures: near zero

The system could be stopped, restarted, or partially rerun at any point without data corruption or duplication.
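
In sketch form, a retry pass over persisted failures might look like this, assuming a mysql2/promise pool, an illustrative `failed_tasks` table, and the same idempotent `processTask` used in the first pass:

```javascript
// Sketch: an independent retry pass. Because writes are idempotent, re-running
// a task that actually succeeded earlier is harmless.
async function retryPass(pool, processTask, maxAttempts = 3) {
  const [failures] = await pool.query(
    'SELECT task_id, attempts FROM failed_tasks WHERE attempts < ?',
    [maxAttempts]
  );

  for (const row of failures) {
    try {
      await processTask(row.task_id);           // same pipeline as the first pass
      await pool.query('DELETE FROM failed_tasks WHERE task_id = ?', [row.task_id]);
    } catch (err) {
      await pool.query(
        'UPDATE failed_tasks SET attempts = attempts + 1, last_error = ? WHERE task_id = ?',
        [String(err.message).slice(0, 255), row.task_id]
      );
    }
  }
}
```

Multiple passes simply run the same function again; anything still left in `failed_tasks` after the final pass is what remained unresolved.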

Observability & Control

Visibility was essential for multi-day execution.

Observability features:

  • Structured logging
  • Per-thread metrics
  • Progress checkpoints
  • Failure categorization
  • Throughput monitoring

This allowed proactive intervention before issues escalated.
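
A minimal sketch of structured per-thread progress logging; the fields and interval are illustrative:

```javascript
// Sketch: emit one structured progress line per thread on a fixed interval.
function startProgressLogger(threadId, counters, intervalMs = 30000) {
  return setInterval(() => {
    console.log(JSON.stringify({
      ts: new Date().toISOString(),
      thread: threadId,
      done: counters.done,
      failed: counters.failed,
      pending: counters.pending,
    }));
  }, intervalMs);
}

// Usage sketch: const timer = startProgressLogger(3, sharedCounters); ... clearInterval(timer);
```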


Why This Architecture Worked

This system succeeded because it prioritized:

  • Engineering discipline
  • Controlled complexity
  • Explicit state management
  • Reliable persistence
  • Recovery-first thinking

No single component was extraordinary — the composition was.


Final Results

  • 70GB+ of structured data
  • Millions of normalized records
  • Accurate geospatial indexing
  • Multi-day uninterrupted execution
  • Recoverable, restart-safe pipeline

This was not about scraping faster. It was about building systems that don't break when they matter most.


Why This Matters for Real Businesses

Businesses face similar challenges when:

  • Aggregating external data
  • Processing large datasets
  • Running long batch jobs
  • Scaling ingestion pipelines
  • Maintaining data integrity under load

This project demonstrates how engineering-first thinking enables scale — without unnecessary infrastructure or fragile shortcuts.


Want to Build Systems Like This?

If you're dealing with:

  • Large-scale data acquisition
  • Distributed execution
  • High-concurrency write workloads
  • Geospatial data
  • Long-running batch pipelines

I help design systems that stay reliable under pressure.

👉 Reach out for a technical discussion.
