Data Collection Service — Design Document

Overview

This service fetches and stores market data. It syncs OHLCV (open, high, low, close, volume) data from exchanges via CCXT into ArcticDB and exposes both a REST API and a CLI.

Architecture

3-Layer Pattern

src/tradai/data_collection/
├── api/                    # Presentation layer
│   ├── routes.py           # REST endpoints (sync, freshness, symbols)
│   ├── streaming_routes.py # WebSocket streaming endpoints
│   ├── schemas.py          # Request/response Pydantic models
│   └── dependencies.py     # FastAPI dependency injection
├── core/                   # Business logic
│   ├── service.py          # DataCollectionService (orchestrates sync)
│   ├── entities.py         # Domain entities (SyncResult, FreshnessCheck)
│   ├── factories.py        # Repository/adapter factories
│   ├── settings.py         # Service configuration (Pydantic Settings)
│   └── streaming/          # Real-time data streaming logic
└── infrastructure/         # External adapters
    └── health_checkers.py  # Service health check implementations

Module Responsibilities

Module                             Purpose
api/routes.py                      REST endpoints: /sync, /sync/incremental, /freshness, /symbols
api/streaming_routes.py            WebSocket endpoints for real-time data
core/service.py                    Orchestrates data fetching, validation, and storage
core/entities.py                   SyncResult, FreshnessStatus domain entities
core/factories.py                  Creates exchange clients and storage adapters
core/settings.py                   DataCollectionSettings from environment variables
infrastructure/health_checkers.py  ArcticDB and exchange connectivity checks

Dependencies

Libraries Used

  • tradai-common: LoggerMixin, health check framework, FastAPI utilities
  • tradai-data: CCXT exchange adapters, ArcticDB storage adapters

External Services

  • ArcticDB (S3-backed): Time-series storage for OHLCV data
  • Exchange APIs: Binance Futures/Spot via CCXT

Consumed By

  • Backend service: Proxies data collection requests
  • CLI: tradai data sync, tradai data check-freshness
  • Lambdas: data-collection-proxy Lambda invokes this service

Key Design Decisions

  1. Incremental sync — Only fetches data newer than the latest stored timestamp, minimizing API calls and storage writes.
  2. Exchange abstraction via CCXT — All exchange interactions go through tradai-data's CCXT adapter, making it easy to add new exchanges.
  3. ArcticDB for time-series — Chose ArcticDB over PostgreSQL for OHLCV data due to columnar storage efficiency and native S3 backend.
  4. Streaming support — WebSocket endpoints allow real-time data consumption for live trading scenarios.
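
The incremental sync decision can be illustrated with a minimal sketch. It is not the service's actual implementation: the names incremental_sync, Fetcher, and Candle are hypothetical, storage is modeled as a plain list rather than ArcticDB, and the fetcher is any callable that returns candles at or after a given millisecond timestamp. The candle layout follows the row format CCXT returns from fetch_ohlcv.

```python
from typing import Callable, Optional, Sequence

# One OHLCV candle: (timestamp_ms, open, high, low, close, volume).
# This mirrors the row layout CCXT returns from fetch_ohlcv.
Candle = tuple[int, float, float, float, float, float]

# Hypothetical fetcher: given `since` in ms (or None for "from the start"),
# return candles whose timestamp is >= since, oldest first.
Fetcher = Callable[[Optional[int]], Sequence[Candle]]


def incremental_sync(stored: list[Candle], fetch: Fetcher, timeframe_ms: int) -> int:
    """Append only candles newer than the latest stored one.

    `stored` stands in for the storage layer (assumed sorted by timestamp).
    Returns the number of candles added.
    """
    latest = stored[-1][0] if stored else None
    # Start one timeframe after the last stored candle so it is not re-fetched.
    since = None if latest is None else latest + timeframe_ms
    # Filter defensively in case the exchange returns an overlapping candle.
    fresh = [c for c in fetch(since) if latest is None or c[0] > latest]
    stored.extend(fresh)
    return len(fresh)
```

With three 1-minute candles already stored and five available on the exchange, a call fetches and appends only the two missing candles. A second call adds nothing.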

Configuration

Variable                          Description              Default
DATA_COLLECTION_HOST              Server host              0.0.0.0
DATA_COLLECTION_PORT              Server port              8002
DATA_COLLECTION_EXCHANGES         Exchange configs (JSON)  Required
DATA_COLLECTION_ARCTIC_S3_BUCKET  S3 bucket for ArcticDB   Required
DATA_COLLECTION_ARCTIC_LIBRARY    ArcticDB library name    futures
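
As a rough illustration of how these variables resolve, the sketch below loads them using only the standard library. The service itself uses Pydantic Settings (core/settings.py), so SettingsSketch and load_settings are hypothetical stand-ins, and the shape of the exchange JSON is an assumption since it is not specified here.

```python
import json
import os
from dataclasses import dataclass
from typing import Any, Mapping

PREFIX = "DATA_COLLECTION_"
REQUIRED = ("EXCHANGES", "ARCTIC_S3_BUCKET")


@dataclass(frozen=True)
class SettingsSketch:
    """Hypothetical stand-in for DataCollectionSettings."""
    exchanges: Any            # parsed JSON; exact shape is not documented here
    arctic_s3_bucket: str
    host: str = "0.0.0.0"
    port: int = 8002
    arctic_library: str = "futures"


def load_settings(env: Mapping[str, str] = os.environ) -> SettingsSketch:
    """Read settings from `env`, applying the defaults from the table above."""
    missing = [PREFIX + name for name in REQUIRED if PREFIX + name not in env]
    if missing:
        raise ValueError("missing required variables: " + ", ".join(missing))
    return SettingsSketch(
        exchanges=json.loads(env[PREFIX + "EXCHANGES"]),
        arctic_s3_bucket=env[PREFIX + "ARCTIC_S3_BUCKET"],
        host=env.get(PREFIX + "HOST", "0.0.0.0"),
        port=int(env.get(PREFIX + "PORT", "8002")),
        arctic_library=env.get(PREFIX + "ARCTIC_LIBRARY", "futures"),
    )
```

Passing a plain dict instead of os.environ makes the loader easy to exercise in tests. Leaving out either required variable raises ValueError naming the missing variable.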

API Reference

See Data Collection README for complete endpoint documentation.