ArcticDB Data Model

Overview

TradAI stores and retrieves time-series market data using ArcticDB, an open-source DataFrame database backed by S3. ArcticDB provides versioned, columnar storage optimized for financial time-series workloads, with native support for date-range queries and batch operations.

All OHLCV (Open, High, Low, Close, Volume) data flows through the DataAdapter protocol, with ArcticAdapter as the production implementation.

Storage Architecture

S3 Bucket

Each environment has a dedicated S3 bucket following the naming convention:

tradai-arcticdb-{env}

For example: tradai-arcticdb-dev, tradai-arcticdb-staging, tradai-arcticdb-prod.

This is defined in the infrastructure config (infra/shared/tradai_infra_shared/config.py):

S3_BUCKETS = {
    "arcticdb": f"tradai-arcticdb-{ENVIRONMENT}",
    ...
}

The bucket has versioning enabled and no lifecycle policy (data is retained indefinitely).

Connection String

ArcticDB connects to S3 using a URI of the form:

s3s://<host>:<bucket>?aws_auth=true

For LocalStack or MinIO development environments, the adapter switches to plain s3:// with explicit credentials and path-style addressing:

s3://<host>:<bucket>?port=4566&access=test&secret=test&region=eu-central-1&use_virtual_addressing=false

Library

Within a bucket, data is organized into libraries. The default library name is ohlcv, configured via the ARCTIC_LIBRARY setting. The library is created automatically on first access (create_if_missing=True).

Library names are normalized to lowercase by the settings validator.
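Together, the two behaviors above amount to something like the following sketch. The `arctic` argument is duck-typed (an `arcticdb.Arctic` instance in production, a fake in tests); the function name is illustrative:

```python
def get_ohlcv_library(arctic, name: str = "ohlcv"):
    """Fetch (or lazily create) the OHLCV library.

    `arctic` is any object with a compatible get_library method,
    e.g. an arcticdb.Arctic instance.
    """
    # Settings validator lowercases library names before use
    return arctic.get_library(name.lower(), create_if_missing=True)
```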

Data Schema

Each symbol is stored as a separate ArcticDB symbol entry. The DataFrame for each symbol has the following structure:

Column  Dtype     Description
date    datetime  Candle timestamp (UTC, used as DataFrame index)
open    float64   Opening price
high    float64   Highest price in the period
low     float64   Lowest price in the period
close   float64   Closing price
volume  float64   Trading volume

Index column

When stored in ArcticDB, the date column becomes the DataFrame index. On read, it is reset back to a regular column before being wrapped in OHLCVData.

The in-memory representation (OHLCVData) adds a symbol column to identify which trading pair each row belongs to. This column is stripped before writing to ArcticDB and re-added on read.

Required Columns

The OHLCVData entity enforces these seven required columns:

REQUIRED_COLUMNS = frozenset({"symbol", "date", "open", "high", "low", "close", "volume"})
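An illustrative version of that check, assuming only the column set shown above (the function name is hypothetical, not the entity's actual method):

```python
REQUIRED_COLUMNS = frozenset(
    {"symbol", "date", "open", "high", "low", "close", "volume"}
)

def validate_ohlcv_columns(columns) -> None:
    """Raise when any required OHLCV column is missing."""
    missing = REQUIRED_COLUMNS - set(columns)
    if missing:
        raise ValueError(f"missing required columns: {sorted(missing)}")
```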

Symbol Naming

ArcticDB symbol names cannot contain / or : characters. The adapter normalizes trading symbols using a double-underscore separator (__).

Normalization Rules

Trading Symbol  ArcticDB Symbol  Type
BTC/USDT:USDT   BTC__USDT__USDT  Futures
ETH/USDT:USDT   ETH__USDT__USDT  Futures
BTC/USDT        BTC__USDT        Spot

Normalization (write path): replaces / and : with __.

# "BTC/USDT:USDT" -> "BTC__USDT__USDT"
symbol.replace("/", "__").replace(":", "__")

Denormalization (read path): splits on __ and reconstructs the trading format.

  • 3+ parts: futures format BASE/QUOTE:SETTLE (e.g., BTC__USDT__USDT becomes BTC/USDT:USDT)
  • 2 parts: spot format BASE/QUOTE (e.g., BTC__USDT becomes BTC/USDT)
  • 1 part: returned as-is
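The read-path rules above can be sketched as a small function (the name is illustrative; the adapter's actual helper may differ):

```python
def denormalize_symbol(arctic_symbol: str) -> str:
    """Reconstruct a trading symbol from its ArcticDB form."""
    parts = arctic_symbol.split("__")
    if len(parts) >= 3:
        # Futures: BASE/QUOTE:SETTLE
        return f"{parts[0]}/{parts[1]}:{parts[2]}"
    if len(parts) == 2:
        # Spot: BASE/QUOTE
        return f"{parts[0]}/{parts[1]}"
    # Single part: already a plain symbol
    return arctic_symbol
```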

Read/Write Patterns

Write: Single Symbol (save)

The save() method uses upsert semantics:

  1. Groups the OHLCVData DataFrame by symbol.
  2. For each symbol, drops the symbol column and sets date as the index.
  3. Checks if the symbol already exists in the library:
    • Existing symbol: calls library.update() with upsert=True to merge new rows by index.
    • New symbol: calls library.write() to create the entry.
  4. Attaches metadata (see Metadata below).
  5. Prunes previous versions (prune_previous_versions=True) to avoid unbounded storage growth.
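The per-symbol branch (steps 2-5) can be sketched as follows. The `library` argument is duck-typed against the ArcticLibraryProtocol interface; `has_symbol`, `update`, and `write` are real arcticdb `Library` methods, but the function itself is illustrative:

```python
import pandas as pd

def save_symbol(library, arctic_symbol: str,
                df: pd.DataFrame, metadata: dict) -> None:
    """Upsert one symbol's OHLCV frame (sketch of the save() write path)."""
    # Drop the in-memory symbol column and index by date for storage
    frame = df.drop(columns=["symbol"]).set_index("date")
    if library.has_symbol(arctic_symbol):
        # Merge new rows into the existing series by index
        library.update(arctic_symbol, frame, upsert=True,
                       metadata=metadata, prune_previous_versions=True)
    else:
        library.write(arctic_symbol, frame,
                      metadata=metadata, prune_previous_versions=True)
```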

Write: Batch (save_batch)

The save_batch() method uses library.write_batch() for 2-3x faster writes when saving multiple symbols. Unlike save(), batch write replaces existing data rather than upserting. Use this for initial data loads.
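A sketch of the replace-style batch write. In production the payloads would be arcticdb `WritePayload` objects; here the constructor is injected so the sketch stays library-agnostic, and the function name is illustrative:

```python
import pandas as pd

def save_batch(library, frames: dict[str, pd.DataFrame], make_payload) -> None:
    """Batch write sketch: one payload per normalized symbol.

    `make_payload` stands in for arcticdb's WritePayload constructor.
    Unlike the upserting save(), this replaces any existing data.
    """
    payloads = [
        make_payload(symbol, df.drop(columns=["symbol"]).set_index("date"))
        for symbol, df in frames.items()
    ]
    library.write_batch(payloads, prune_previous_versions=True)
```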

Read: Batch (load)

The load() method uses library.read_batch() for efficient multi-symbol loading:

  1. Builds a ReadRequest per symbol with the requested (start, end) date range.
  2. Executes a batch read.
  3. For each successful result, resets the index (moving date back to a column) and re-inserts the denormalized symbol column.
  4. Concatenates all DataFrames and wraps in OHLCVData.

Symbols that fail to read (e.g., not found) are silently skipped. If no symbols return data, a DataNotFoundError is raised.
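The four steps plus the error handling can be sketched as below. `make_request` stands in for arcticdb's `ReadRequest` constructor, the missing-`data`-attribute check is an assumed way to distinguish `DataError` results from successful reads, and `LookupError` stands in for the project's DataNotFoundError:

```python
import pandas as pd

def _denormalize(arctic_symbol: str) -> str:
    """Rebuild the trading symbol from the double-underscore form."""
    parts = arctic_symbol.split("__")
    if len(parts) >= 3:
        return f"{parts[0]}/{parts[1]}:{parts[2]}"
    if len(parts) == 2:
        return f"{parts[0]}/{parts[1]}"
    return arctic_symbol

def load_batch(library, arctic_symbols, start, end, make_request) -> pd.DataFrame:
    """Sketch of the batch read path in load()."""
    requests = [make_request(s, date_range=(start, end)) for s in arctic_symbols]
    frames = []
    for result in library.read_batch(requests):
        if not hasattr(result, "data"):
            # DataError entry (e.g. symbol not found): silently skip
            continue
        df = result.data.reset_index()          # date index -> column
        df.insert(0, "symbol", _denormalize(result.symbol))
        frames.append(df)
    if not frames:
        raise LookupError("no symbols returned data")
    return pd.concat(frames, ignore_index=True)
```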

Incremental Sync

The data-collection service supports incremental sync to avoid re-fetching historical data:

  1. get_latest_date() reads metadata for each symbol to find the last stored candle date.
  2. The CoverageChecker compares this against the requested date range.
  3. Only symbols with incomplete coverage are fetched from the exchange API.
  4. New data is upserted via save(), extending the existing time series.

Metadata

Each symbol write includes a metadata dictionary attached to the ArcticDB entry. The current schema is version 2:

Field             Type    Description
metadata_version  int     Schema version (currently 2)
last_query_date   string  ISO 8601 timestamp of when the exchange API was queried
last_candle_date  string  ISO 8601 timestamp of the latest candle in the data
timeframe         string  CCXT timeframe string (e.g., "1h", "1d"); optional, absent in legacy data

When reading the latest date for incremental sync, the adapter prefers last_candle_date and falls back to last_query_date for backwards compatibility with pre-version-2 data.

Versioning

ArcticDB supports automatic versioning of symbol data. Each write or update creates a new version. TradAI uses prune_previous_versions=True on all write operations, which means only the latest version is retained. This prevents unbounded growth of version history in S3.

ArcticDB's internal versioning is separate from the metadata_version field, which tracks the schema of the metadata dictionary itself.

Concurrent Access

ArcticDB relies on S3's strong read-after-write consistency (available since December 2020). Key behaviors:

  • Multiple readers: fully supported with no coordination needed.
  • Single writer per symbol: the adapter does not implement locking. In practice, each symbol is owned by one data-collection service instance at a time.
  • Batch operations: write_batch and read_batch are atomic per-symbol but not across symbols. Individual symbol failures in a batch are reported as DataError entries in the result list without failing the entire batch.

Platform Support

ArcticDB is only available on Linux x86_64 and Windows. For macOS ARM development:

  • The create_data_adapter() factory automatically returns InMemoryAdapter on macOS.
  • Tests inject a mock library via the arctic_library constructor parameter.
  • The ArcticLibraryProtocol in libs/tradai-data/src/tradai/data/infrastructure/adapters/protocols.py enables type-safe mocking.
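The platform switch in the factory can be sketched as follows; the constructor arguments are hypothetical stand-ins for the real adapter classes and their settings wiring:

```python
import platform

def create_data_adapter(make_arctic, make_in_memory):
    """Sketch of the factory's platform detection."""
    if platform.system() == "Darwin":
        # ArcticDB wheels are unavailable on macOS: fall back to in-memory
        return make_in_memory()
    return make_arctic()
```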

Configuration

Environment Variables

All ArcticDB settings are configured via environment variables with the service prefix (e.g., DATA_COLLECTION_ for the data-collection service). The ArcticSettingsMixin provides the base fields.

Variable                                Default                    Description
{PREFIX}_ARCTIC_S3_BUCKET               (required)                 S3 bucket name (e.g., tradai-arcticdb-dev)
{PREFIX}_ARCTIC_LIBRARY                 ohlcv                      ArcticDB library name
{PREFIX}_ARCTIC_S3_ENDPOINT             s3.{region}.amazonaws.com  S3 endpoint (use localstack:4566 for local dev)
{PREFIX}_ARCTIC_REGION                  eu-central-1               AWS region
{PREFIX}_ARCTIC_USE_SSL                 true                       Use TLS (set false for LocalStack)
{PREFIX}_ARCTIC_ACCESS_KEY              (none)                     Explicit S3 access key (LocalStack/MinIO only)
{PREFIX}_ARCTIC_SECRET_KEY              (none)                     Explicit S3 secret key (LocalStack/MinIO only)
{PREFIX}_ARCTIC_USE_VIRTUAL_ADDRESSING  true                       Virtual-hosted style URLs (set false for LocalStack)

Where {PREFIX} is the service-specific env var prefix: DATA_COLLECTION, STRATEGY_SERVICE, etc.

Safety Checks

  • LocalStack in production: If use_ssl=True (indicating production) and the endpoint contains localstack or localhost:4566, the adapter raises a ConfigurationError.
  • Non-dev environments: The DataCollectionSettings validator rejects LocalStack endpoints when ENVIRONMENT is not local or dev.
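A sketch of the first guard, assuming only the rule as stated (the exception class here is a local stand-in for the project's ConfigurationError):

```python
class ConfigurationError(ValueError):
    """Stand-in for the project's configuration error type."""

def check_endpoint_safety(endpoint: str, use_ssl: bool) -> None:
    """Reject LocalStack endpoints when SSL (i.e. production) is enabled."""
    if use_ssl and ("localstack" in endpoint or "localhost:4566" in endpoint):
        raise ConfigurationError(
            f"LocalStack endpoint {endpoint!r} not allowed with SSL enabled"
        )
```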

Key Source Files

  • libs/tradai-data/src/tradai/data/infrastructure/adapters/arctic_adapter.py: ArcticAdapter implementation (read, write, batch, symbol normalization)
  • libs/tradai-data/src/tradai/data/infrastructure/adapters/protocols.py: ArcticLibraryProtocol and related protocols for DI/mocking
  • libs/tradai-data/src/tradai/data/infrastructure/adapters/__init__.py: create_data_adapter() factory with platform detection
  • libs/tradai-data/src/tradai/data/core/entities.py: OHLCVData, DateRange, SymbolList, Timeframe value objects
  • libs/tradai-data/src/tradai/data/core/repositories.py: DataAdapter protocol (storage interface)
  • libs/tradai-data/src/tradai/data/core/coverage.py: CoverageChecker for incremental sync decisions
  • libs/tradai-common/src/tradai/common/settings_mixins.py: ArcticSettingsMixin with shared config fields
  • services/data-collection/src/tradai/data_collection/core/factories.py: create_arctic_adapter() factory wiring settings to adapter
  • services/data-collection/src/tradai/data_collection/core/settings.py: DataCollectionSettings with ArcticDB validation
  • services/data-collection/src/tradai/data_collection/core/service.py: DataCollectionService orchestrating sync flows
  • infra/shared/tradai_infra_shared/config.py: S3 bucket naming (tradai-arcticdb-{env}) and bucket config