Industrial-Strength Database Engineering for Temporal Data
Data modeling is the foundation of a successful time-series database deployment. Poor modeling decisions compound over time, leading to explosive cardinality growth, query performance degradation, and operational nightmares. This guide covers industrial-grade data modeling patterns, tagging strategies, retention design, and optimization techniques that separate robust TSDB systems from struggling ones.
A time-series data point fundamentally consists of four components: a timestamp, a metric name, one or more tags (labels), and a value. This structure is deceptively simple but has profound implications for system design.
timestamp: 2026-04-23T14:30:45.123Z
metric: cpu.usage
tags: {host: "prod-01", region: "us-east", datacenter: "dc1"}
value: 67.8
The metric name identifies what is being measured. Tags are key-value pairs that add context and enable filtering. The combination of metric, tags, and timestamp creates a unique series. Understanding this model is critical because every unique combination of metric and tag values creates a separate time-series stream that the database must manage independently.
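To make the series-identity rule concrete, here is a minimal Python sketch (the class names are illustrative, not any particular TSDB's API) showing how a point decomposes into a series key plus a timestamped value:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SeriesKey:
    """Identity of a time series: metric name plus sorted tag pairs."""
    metric: str
    tags: tuple  # sorted (key, value) pairs so equal tag sets hash equally

    @classmethod
    def of(cls, metric: str, tags: dict) -> "SeriesKey":
        return cls(metric, tuple(sorted(tags.items())))

@dataclass
class DataPoint:
    key: SeriesKey
    timestamp_ms: int
    value: float

# Two points with identical metric and tags belong to the same series;
# changing any tag value creates a brand-new series.
a = SeriesKey.of("cpu.usage", {"host": "prod-01", "region": "us-east"})
b = SeriesKey.of("cpu.usage", {"region": "us-east", "host": "prod-01"})
assert a == b  # tag order does not matter; the sorted tuple normalizes it
```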
Cardinality is the number of unique time series. If you have a metric with 10 possible values for one tag and 100 possible values for another tag, that's 1,000 unique combinations. Cardinality explosion—where unexpected combinations cause millions of series—is the #1 cause of TSDB system failure in production environments.
The first decision in modeling is determining what becomes a metric name and what becomes a tag. This choice has massive performance implications.
Use tags for dimensions that have a small, bounded set of values and that queries filter or group by: host, region, status code, and the like. Use separate metrics for dimensions that represent fundamentally different measurements (CPU usage versus memory usage, for example), rather than encoding the measurement type as a tag value.
A financial trading platform instruments every trade with a unique order ID. A naive approach might create:
metric: trade.profit
tags: {order_id: "ORD-12345678", trader_id: "T001", exchange: "NYSE"}
With millions of orders per day, order_id creates unbounded cardinality. The TSDB must track millions of distinct series, consuming massive memory and disk space. Queries become slow. This is a catastrophic design.
The better approach: the order_id belongs in transaction logs or a dedicated lookup system, not in TSDB tags.
metric: trades.executed
tags: {trader_id: "T001", exchange: "NYSE", product: "equities"}
value: 1
Or for aggregated metrics:
metric: trade.volume
tags: {trader_id: "T001", exchange: "NYSE"}
value: 5000000 (notional)
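A sketch of what the fixed instrumentation might look like, assuming a hypothetical generic tsdb.write() client and a structured order log (both placeholders for whatever your stack provides):

```python
import json
import logging
import time

order_log = logging.getLogger("orders")  # per-order detail belongs in logs

def record_trade(tsdb, order_id: str, trader_id: str, exchange: str, notional: float):
    """Emit bounded-cardinality metrics; route the unbounded order_id to logs."""
    now = time.time()
    # Counter of executed trades: every tag here has a known, finite value set.
    tsdb.write(metric="trades.executed",
               tags={"trader_id": trader_id, "exchange": exchange},
               value=1, timestamp=now)
    # Aggregated notional volume, same bounded tags.
    tsdb.write(metric="trade.volume",
               tags={"trader_id": trader_id, "exchange": exchange},
               value=notional, timestamp=now)
    # The unique order_id never becomes a tag; it goes to the transaction log.
    order_log.info(json.dumps({"order_id": order_id, "trader_id": trader_id,
                               "exchange": exchange, "notional": notional}))
```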
A microservices-based infrastructure monitors application performance. A well-designed schema:
metric: http.request.duration_ms
tags: {service: "api-gateway", endpoint: "/v1/users", method: "GET", status_code: "200"}
value: 145
metric: database.query.duration_ms
tags: {service: "user-service", database: "postgres", table: "users", operation: "SELECT"}
value: 23
metric: cache.hit_ratio
tags: {service: "api-gateway", cache_type: "redis", region: "us-east"}
value: 0.92
Each tag has low cardinality. Queries can efficiently answer questions like "Which endpoints are slowest?" or "How do cache hit ratios compare across regions?" Memory usage is predictable.
Tags are not free. Each tag adds indexing overhead and increases query complexity. Follow these patterns:
Standardize tag naming across all metrics. If one metric uses "hostname" and another uses "host", your queries become fragmented.
GOOD:
tags: {host: "web-01", region: "us-east"}
BAD:
tags: {hostname: "web-01", aws_region: "us-east"}
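One low-effort way to enforce this is a canonicalization step at ingestion. A minimal sketch, assuming you maintain an alias map of the non-standard keys seen in the wild:

```python
# Canonical tag keys, plus known aliases. The alias map is illustrative;
# populate it from your own metric registry.
CANONICAL_TAG_KEYS = {
    "hostname": "host",
    "aws_region": "region",
    "dc": "datacenter",
}

def normalize_tags(tags: dict) -> dict:
    """Rewrite known alias keys to their canonical names at ingestion."""
    return {CANONICAL_TAG_KEYS.get(k, k): v for k, v in tags.items()}

assert normalize_tags({"hostname": "web-01", "aws_region": "us-east"}) == \
       {"host": "web-01", "region": "us-east"}
```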
If a host moves from region A to region B, don't change the region tag on existing data. Instead, start a new series with the new region tag. Changing tag values invalidates historical queries and confuses analysis.
Plan for the product of all tag values. If a metric has four tags with 50, 200, 5, and 10 possible values respectively, that's 50 * 200 * 5 * 10 = 500,000 unique series for that single metric. This may be acceptable, but you need to know about it and plan for it.
Maintain a central registry of metrics, their tags, and expected cardinality. This prevents surprise explosions when new code starts emitting unexpected tag combinations.
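The registry can start as nothing more than a checked-in data structure. An illustrative sketch, with made-up cardinality budgets:

```python
# A minimal in-code registry; in practice this might live in a shared
# config repo or service. Tag sets and budgets here are illustrative.
METRIC_REGISTRY = {
    "http.request.duration_ms": {
        "allowed_tags": {"service", "endpoint", "method", "status_code"},
        "expected_cardinality": 20_000,
    },
    "cache.hit_ratio": {
        "allowed_tags": {"service", "cache_type", "region"},
        "expected_cardinality": 500,
    },
}

def check_against_registry(metric: str, tags: dict) -> None:
    entry = METRIC_REGISTRY.get(metric)
    if entry is None:
        raise ValueError(f"unregistered metric: {metric}")
    unexpected = set(tags) - entry["allowed_tags"]
    if unexpected:
        raise ValueError(f"{metric}: unexpected tags {unexpected}")
```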
The value field can hold different types of data depending on your measurement pattern.
Different measurement types require different modeling:
| Type | Description | Example | TSDB Handling |
|---|---|---|---|
| Gauge | Point-in-time value that can increase or decrease | CPU usage, memory, active connections | Store as-is; queries use latest or average |
| Counter | Monotonically increasing value (only goes up) | Total requests, errors, bytes processed | Store cumulative; calculate rate of change with rate() |
| Histogram | Distribution of values in buckets | Request latencies, payload sizes | Store bucket counts; aggregate for percentiles |
Understanding these distinctions affects query logic and alerting thresholds: a CPU gauge is typically queried with averages or latest values, while a request counter requires rate-of-change functions such as rate().
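Counter semantics trip people up most often, so here is a simplified sketch of rate-style logic with counter-reset handling (real implementations such as Prometheus's rate() also extrapolate at window boundaries, which this omits):

```python
def counter_rate(samples):
    """Per-second rate from cumulative counter samples.

    `samples` is a list of (timestamp_seconds, value) pairs in time order.
    A value drop signals a counter reset (e.g., a process restart), handled
    by treating the new value as the increase since the reset.
    """
    total_increase = 0.0
    for (t0, v0), (t1, v1) in zip(samples, samples[1:]):
        total_increase += (v1 - v0) if v1 >= v0 else v1  # reset handling
    elapsed = samples[-1][0] - samples[0][0]
    return total_increase / elapsed if elapsed > 0 else 0.0

# 100 -> 160 -> reset -> 30: increase is 60 + 30 = 90 over 30s, i.e. 3/s
assert counter_rate([(0, 100), (15, 160), (30, 30)]) == 3.0
```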
Not all data has equal value or query patterns. Design retention policies that balance query capability, storage costs, and access patterns.
Implement a tiered approach:
Hot tier: data in local fast storage (NVMe SSD). Full resolution, immediate query access, high ingestion rate. Used for real-time dashboards, alerts, and short-term troubleshooting.
Warm tier: data in slower storage (HDD, cloud blob storage). Downsampled to 5-minute or 1-hour resolution. Queries are slower but still acceptable. Used for trend analysis and capacity planning.
Cold tier: data in long-term archival storage (e.g., Amazon S3 Glacier or another object store). Heavily aggregated or deleted, depending on requirements. Rarely queried; kept for compliance or historical reference.
Reducing data resolution over time is essential for long-term storage. Common strategies:
Original resolution: 10 seconds
After 7 days: downsample to 1 minute (average, max, min, count)
After 30 days: downsample to 5 minutes
After 90 days: downsample to 1 hour
After 1 year: delete or archive
Downsampling decisions require understanding query needs. If alerts depend on detecting anomalies within 5-minute windows, downsampling below that resolution loses detection capability.
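A minimal sketch of the rollup step itself, keeping the avg/min/max/count aggregates the schedule above calls for (in practice this runs as a background job or continuous query inside the TSDB):

```python
from collections import defaultdict

def downsample(points, bucket_seconds=60):
    """Roll raw (timestamp_seconds, value) points into per-bucket aggregates.

    Keeping avg, min, max, and count (not just avg) preserves the
    information that alerting and percentile-style queries rely on.
    """
    buckets = defaultdict(list)
    for ts, value in points:
        buckets[ts - ts % bucket_seconds].append(value)
    return {
        start: {
            "avg": sum(vs) / len(vs),
            "min": min(vs),
            "max": max(vs),
            "count": len(vs),
        }
        for start, vs in sorted(buckets.items())
    }
```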
Establish clear, hierarchical naming for metrics. Use dot notation to reflect logical organization:
system.cpu.usage
system.memory.available
system.disk.io.read_bytes
system.disk.io.write_bytes
application.request.count
application.request.duration_ms
application.error.rate
database.connection.count
database.query.duration_ms
database.replication.lag_ms
This hierarchy enables intuitive navigation, wildcard queries, and logical grouping in dashboards. Tools like Graphite and StatsD use dot notation (Prometheus uses the equivalent underscore-separated convention), and consistent hierarchical naming is an industry standard.
Time-series data quality directly impacts analysis validity. Implement validation at ingestion:
Reject data with unexpected tags or values. Prevent new tag combinations from being auto-created. Many cardinality explosions result from typos: a host labeled "web_01" instead of "web-01" creates duplicate series.
Implement bounds checking. A CPU usage value of 500% is clearly invalid. Percentages should be 0-100. Reject obvious anomalies at ingestion time.
Define how gaps in data are interpreted. If a series stops sending data, is it offline or just not updating? Some systems auto-fill with null; others leave gaps. Choose a strategy and document it.
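Bringing these checks together, a sketch of an ingestion-time validator; the bounds table and hostname pattern are illustrative and would come from your metric registry:

```python
import re

# Value bounds per metric; percentages are constrained to 0-100 as
# described above. Both tables are illustrative.
VALUE_BOUNDS = {
    "cpu.usage": (0.0, 100.0),
    "cache.hit_ratio": (0.0, 1.0),
}
HOST_PATTERN = re.compile(r"^[a-z]+-\d{2}$")  # rejects typos like "web_01"

def validate_point(metric: str, tags: dict, value: float) -> None:
    """Reject out-of-bounds values and malformed tags before they create series."""
    low, high = VALUE_BOUNDS.get(metric, (float("-inf"), float("inf")))
    if not (low <= value <= high):
        raise ValueError(f"{metric}: value {value} outside [{low}, {high}]")
    host = tags.get("host")
    if host is not None and not HOST_PATTERN.match(host):
        raise ValueError(f"{metric}: malformed host tag {host!r}")
```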
When transitioning from one schema to another, plan carefully: dual-write to both schemas during an overlap window, validate the new data against the old, migrate dashboards and alerts, and only then retire the old schema, as sketched below. Never attempt a sudden cutover. Gradual migration prevents alerting breakage and gives you time to discover missing data or logic errors.
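A sketch of the dual-write shim for the overlap window, again assuming a generic tsdb.write() client and hypothetical old and new metric names:

```python
import time

# Hypothetical migration: an old flat metric name to the hierarchical one.
OLD_METRIC = "request_duration"
NEW_METRIC = "application.request.duration_ms"

def write_dual(tsdb, tags: dict, value: float) -> None:
    """Emit both schemas during the overlap window.

    Dashboards and alerts are moved over to NEW_METRIC while OLD_METRIC
    keeps them alive; once everything reads the new name, delete this shim.
    """
    now = time.time()
    tsdb.write(metric=OLD_METRIC, tags=tags, value=value, timestamp=now)
    tsdb.write(metric=NEW_METRIC, tags=tags, value=value, timestamp=now)
```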
Before deploying any new metric, estimate cardinality: multiply together the number of unique values for each tag. If the result exceeds 100,000, review the design. If it exceeds 1 million, redesign. Most production TSDB outages trace back to cardinality surprises.
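That estimate is mechanical enough to automate in CI. A sketch applying the thresholds above to the 500,000-series example from the tag-planning section:

```python
import math

def review_cardinality(tag_value_counts: dict) -> str:
    """Apply the review thresholds described above to a proposed schema."""
    total = math.prod(tag_value_counts.values())
    if total > 1_000_000:
        return f"{total:,} series: redesign required"
    if total > 100_000:
        return f"{total:,} series: design review required"
    return f"{total:,} series: acceptable"

# The 500,000-series example from the tag-planning section:
print(review_cardinality({"host": 50, "service": 200, "region": 5, "status": 10}))
# -> "500,000 series: design review required"
```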
A cloud platform monitors thousands of customer containers. Initial design:
metric: container.cpu_usage
tags: {customer_id, region, container_id, image_name}
With 1,000 customers, 5 regions, 1 million containers (1,000 per customer), and 500 unique images, the naive product is 1,000 * 5 * 1,000,000 * 500 = 2.5 trillion potential combinations, and even the realistic floor of one series per container is a million series. The system crashed within days.
Revised design:
metric: container.cpu_usage
tags: {region, container_pool}
Where container_pool is a high-level grouping (e.g., "web", "worker", "cache"). Container-specific data goes to application logs or a time-series event system. Cardinality is now 5 regions * 20 pools = 100 unique series, and the system scales.
After modeling is fixed, optimize query patterns:
Store pre-aggregated metrics alongside raw data. If you frequently query "average response time by service", compute and store that once instead of recalculating on every query.
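A sketch of such a rollup job, writing per-service averages back under a distinct metric name (the name and the generic tsdb.write() client are illustrative):

```python
import time
from collections import defaultdict

def precompute_service_averages(tsdb, raw_points):
    """Periodic job: store per-service averages as their own metric.

    raw_points: iterable of (service_tag, duration_ms) from the raw stream.
    Writing the rollup back under its own metric name keeps dashboard
    queries to a single cheap series scan.
    """
    sums, counts = defaultdict(float), defaultdict(int)
    for service, duration_ms in raw_points:
        sums[service] += duration_ms
        counts[service] += 1
    now = time.time()
    for service in sums:
        tsdb.write(
            metric="application.request.duration_ms.avg_by_service",
            tags={"service": service},
            value=sums[service] / counts[service],
            timestamp=now,
        )
```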
Some TSDBs optimize for tag query order. If most queries filter by host first, then service, order tags that way in your schema.
Use naming that enables wildcard matching. "system.cpu.usage" allows "system.cpu.*" queries that return all CPU metrics efficiently.
Time-series data modeling is not a one-time activity. As systems evolve, schemas must adapt. The difference between a thriving TSDB deployment and a failing one often comes down to thoughtful initial design decisions, cardinality discipline, and proactive migration planning. Invest time upfront in understanding your data patterns, document your schema, and iterate based on production experience. The TSDB systems that scale painlessly are those where the data model reflects the actual query patterns and operational needs.