Skip to main content

Observability

Three pillars, three purposes:

PillarAnswersCost per event
Logs"What happened?" — exact context for one eventHighest
Metrics"How much, how often?" — aggregates over many eventsLowest
Traces"How did the work flow?" — relationship between eventsMedium

Each pillar answers a question the others can't. You need all three.

Logs

For every event you might need to investigate one day. Structured, contextual, level-appropriate.

this.logger.info('order created', {
orderId: order.id,
userId: ctx.auth.userId,
amount: order.total,
currency: order.currency,
});

Don't:

  • Log in tight loops without sampling.
  • Log secrets (use a RedactionProcessor).
  • Log objects without bounded size — they end up megabytes long.

→ See Logging.

Metrics

For every quantity you want to chart, alert on, or query in aggregate.

metrics.histogram('order.processing.ms', { tier: order.tier }).observe(durationMs);
metrics.counter('order.created.total', { source: 'web' }).inc();
metrics.gauge('cache.size').set(this.cache.size);

Three metric types:

  • Counter — monotonically increasing. Total count of events.
  • Gauge — point-in-time value. Current size, current connections.
  • Histogram — distribution of values. Latencies, sizes.

Cardinality matters: every unique combination of label values creates a separate time series. tier=basic|pro|enterprise is fine (3 series); userId=… is not (millions of series).

→ See titan-metrics.

Traces

For finding which call to which service caused the slowness or the failure. One trace per request, spans for sub-operations.

const span = startSpan('upstream.fetch', { attributes: { url: this.url } });
try {
return await this.fetch();
} finally {
span.end();
}

Don't span every function call. Span:

  • Inbound RPC calls (Netron does this automatically).
  • Outbound HTTP / DB calls (your responsibility).
  • Long units of work that can fail.

→ See Tracing.

What to instrument

A canonical service exposes:

SignalTypeWhy
rpc.call.total{service,method,outcome}CounterThroughput, error rate
rpc.duration_ms{service,method,outcome}HistogramLatency distribution
db.query.duration_ms{table}HistogramDatabase tail latency
outbound.duration_ms{provider}HistogramThird-party performance
cache.hit_rate{cache}GaugeCache effectiveness
app.active_connections{transport}GaugeResource utilisation
lifecycle.boot_msHistogramBoot SLA

Most of these are auto-emitted by titan-metrics and the relevant modules. Custom metrics are for domain-specific quantities.

Alerting policy

Alerts on:

  • Error raterpc.call.total{outcome=error} divided by rpc.call.total exceeds threshold.
  • Latencyrpc.duration_ms p99 > SLO.
  • Resource exhaustion — connection pool full, queue depth high.
  • Health — readiness flipping unhealthy.

Don't alert on:

  • Individual log lines (use logs for investigation, not alerting).
  • Counter rate changes that aren't errors (info-level changes don't page anyone).
  • Cosmetic deviations (a 5% latency increase is normal noise).

Correlation

The three pillars correlate via shared identifiers:

  • traceId — same across every log line and every span in one trace.
  • service / method — labels on metrics, fields on logs.
  • userId / requestId — fields on logs and span attributes.

In practice: a slow request in your dashboard (metric) has a traceId; pull the trace (spans) to see where time was spent; pull the logs by traceId to see what happened in each span.

Anti-patterns

  • One pillar. Logs alone: can't aggregate. Metrics alone: can't investigate. Traces alone: can't quantify. Use all three.
  • Logs as metrics. Counting grep error /var/log/app.log is slow, brittle, and lossy. Emit a counter; alert on the counter.
  • Metrics as logs. A metric with a userId label is a high-cardinality timeseries that explodes the metrics backend. Use logs for per-user data.
  • Tracing without sampling. Every span ends up in the collector; the cost is real. Sample at scale (the relay handles this).

→ Next: Performance.