Skip to main content

titan-discovery

Official@omnitron-dev/titan-discovery

Maintained by the Omnitron team. Independent npm package.

Redis-backed service discovery with TTL-based heartbeats, pub/sub event broadcasting, automatic Netron integration so clients can resolve services without hard-coded URLs, atomic Lua-script registration to avoid races, and exponential-backoff retry on heartbeat failures.

pnpm add @omnitron-dev/titan-discovery @omnitron-dev/titan-redis

When you need it

  • Dynamic service topology. Pods come and go (autoscaling, rolling deploys); discovery keeps the routing table current.
  • Service-to-service resolution. Caller queries users@1.0.0 without knowing which pods host it.
  • Health-aware routing. Unhealthy nodes drop out of the registry automatically when heartbeats stop and the TTL expires.

Quickstart

import { DiscoveryModule } from '@omnitron-dev/titan-discovery';
import { TitanRedisModule } from '@omnitron-dev/titan-redis';

@Module({
imports: [
TitanRedisModule.forRoot({ config: { url: env.REDIS_URL } }),
DiscoveryModule.forRoot({
heartbeatInterval: 5_000,
heartbeatTTL: 15_000,
pubSubEnabled: true,
pubSubChannel: 'titan:discovery:events',
clientMode: false,
redisPrefix: 'titan:discovery',
maxRetries: 3,
retryDelay: 1_000,
enableNetronIntegration: true,
}),
],
})
class AppModule {}

DiscoveryOptions

OptionTypeDefault
heartbeatIntervalnumber (ms)5_000
heartbeatTTLnumber (ms)15_000
pubSubEnabledbooleanfalse
pubSubChannelstring'titan:discovery:events'
clientModeboolean (no register, only consume)false
redisPrefixstring'titan:discovery'
maxRetriesnumber3
retryDelaynumber (ms)1_000
redisUrlstring
redisOptionsany
enableNetronIntegrationbooleantrue

heartbeatTTL must be ≥ 3 × heartbeatInterval. A missed- heartbeat ratio of 3 strikes balance between premature ejection (TTL < 2×) and slow recovery (TTL > 5×).

DiscoveryService — the API

import { DISCOVERY_SERVICE_TOKEN, type IDiscoveryService }
from '@omnitron-dev/titan-discovery';

@Service({ name: 'admin@1.0.0' })
class AdminService {
constructor(@Inject(DISCOVERY_SERVICE_TOKEN) private readonly disco: IDiscoveryService) {}

@Public()
async listInstances() {
return this.disco.findNodesByService('users', '1.0.0');
}
}

Registration

MethodPurpose
registerNode(nodeId, address, services)Manually register this node
deregisterNode(nodeId)Manually deregister
registerService(service)Announce a new service this node hosts
unregisterService(serviceName)Drop a service from this node
updateNodeAddress(nodeId, address)Change the reachable URL
updateNodeServices(nodeId, services)Replace the service list
updateAddress(address)Self-update (uses this.nodeId)
updateServices(services)Self-update services

Discovery

MethodReturns
getActiveNodes()Promise<NodeInfo[]>
findNodesByService(serviceName, version?)Promise<NodeInfo[]>
findNodes()Promise<NodeInfo[]> (filtered)
getAllNodes()Promise<NodeInfo[]> (incl. expired)
isNodeActive(nodeId)Promise<boolean>
nodeExists(nodeId)Promise<boolean>
getNodeInfo(nodeId)Promise<NodeInfo | null>

Events

this.disco.onEvent((event) => {
// { type: 'NODE_REGISTERED' | 'NODE_UPDATED' | 'NODE_DEREGISTERED',
// nodeId, address, services, timestamp }
});
this.disco.offEvent(handler);

When pubSubEnabled: true, registration / update events propagate immediately via Redis pub/sub. With pub/sub disabled, peers see changes only on their next getActiveNodes / findNodesByService call — fine for periodic sync, too slow for fast failover.

Redis key layout

The module owns four namespaces under redisPrefix (default titan:discovery):

KeyTypePurpose
titan:discovery:nodes:{nodeId}HashNode metadata: address, services, timestamp
titan:discovery:heartbeat:{nodeId}StringTTL-keyed marker — its existence means "node alive"
titan:discovery:index:nodesSetIndex of all known nodeIds (cleaned on expiry)
titan:discovery:events (pub/sub)Notification channel for pubSubEnabled: true

A node is considered active iff its heartbeat:{nodeId} key still exists. The Lua script that publishes a heartbeat refreshes both the node hash and the heartbeat marker atomically with the same TTL — no half-state.

How it works

The Lua script makes registration atomic: HSET nodes:{id} + SETEX heartbeat:{id} + SADD index:nodes succeed together or not at all. No partial state in the registry.

Heartbeat retry behaviour

When a heartbeat fails (Redis hiccup, network blip), the service retries with exponential backoff:

AttemptDelay before retry
1retryDelay × 1 (1 s by default)
2retryDelay × 2 (2 s)
3retryDelay × 4 (4 s)
...retryDelay × 2^(attempt-1)

After maxRetries exhaustions, the heartbeat is dropped and a warning is logged. The next periodic tick (5 s later by default) attempts again — so a sustained Redis outage causes node ejection once the TTL elapses without a successful refresh.

TTL tuning guide

ScenarioheartbeatIntervalheartbeatTTLWhy
Default (general use)5 s15 sDetect a dead pod within 15 s
Fast failover (e.g., trading)1 s3 s3 s ejection; ~3× Redis ops/s/pod
Quiet IoT fleet30 s120 sTolerates slow networks; node loss takes 2 min
Aggressive autoscaling2 s6 sScale-down decisions reflect topology in 6 s
Sluggish managed Redis (high p99)10 s60 sSurvives Redis latency spikes without ejection

Rules of thumb:

  • heartbeatTTL ≥ 3 × heartbeatInterval — anything tighter ejects healthy nodes during a single Redis hiccup.
  • heartbeatTTL ≤ 10 × heartbeatInterval — anything looser is barely better than no heartbeat at all.
  • Pods × 1 / heartbeatInterval = baseline Redis ops/s. With 100 pods at 5 s interval, that's 20 ops/s — trivial. With 100 pods at 1 s interval, it's 100 ops/s — still cheap but plan Redis sizing.

Multi-region topology

For multi-region, two patterns work:

  1. One Redis per region, no cross-replication. Each region has its own discovery registry. Clients within a region see only local services. Cross-region calls go through an explicit ingress, not discovery. Simpler, lower latency, smaller blast radius.
  2. Cross-region replicated Redis (or managed Redis Global). Every region sees every node. Watch for clock skew — timestamp on NodeInfo becomes lossy across regions; use findNodesByService not "newest first" sorting.

Pattern (1) is almost always the right answer unless you have a strong reason to want a single global view.

Sentinel / cluster mode

titan-discovery consumes whatever Redis client titan-redis provides — sentinel and cluster modes work transparently. The Lua script and pub/sub channel route correctly in cluster mode because all four keys/channels share the same prefix (titan:discovery:*) and therefore the same hash slot when hashtags are used.

If you operate in cluster mode, configure redisPrefix with a hash-tag wrapper:

DiscoveryModule.forRoot({
redisPrefix: '{titan:discovery}', // Force all keys to one slot
});

This guarantees the Lua script's multi-key atomicity holds in cluster mode.

Netron integration

When enableNetronIntegration: true (default), the module registers a NetronDiscoveryIntegration service that subscribes to two Netron events:

Netron eventDiscovery action
NETRON_EVENT_SERVICE_EXPOSEdiscovery.registerService({ name, version })
NETRON_EVENT_SERVICE_UNEXPOSEdiscovery.unregisterService(name)

This means you don't call registerService manually in normal apps — exposing a @Service via Netron auto-announces it.

@Service('users@1.0.0')
class UsersService {
@Public() async findById(id: string) { /* … */ }
}

// In your app:
await app.expose(usersService);
// → Netron emits NETRON_EVENT_SERVICE_EXPOSE
// → titan-discovery registers `users@1.0.0` on this node
// → other peers' findNodesByService('users','1.0.0') sees this node

The integration also re-registers existing services when itself starts up (registerExisting: true, default).

Client mode

For pods that only consume the registry (don't host services — gateways, admin UIs, schedulers):

DiscoveryModule.forRoot({
clientMode: true,
enableNetronIntegration: true,
})

In client mode:

  • No heartbeat publishes; the pod never appears in the registry.
  • The pub/sub subscriber still runs — events are received as normal.
  • Lookups (findNodesByService, getActiveNodes) work as normal.
  • registerNode / registerService raise no error but do nothing observable to peers.

Failure scenarios — what happens, what to alert on

ScenarioDiscovery behaviourWhat to alert on
Pod crashesHeartbeat stops; TTL expires after heartbeatTTL; eventually evictedEviction in cleanupInactiveNodes log
Pod hangs (no heartbeat work loop)Same as crashProcess-level liveness probe
Redis briefly unavailableHeartbeat retries with backoff; usually recoversSustained heartbeat failed warnings
Redis sustained outage > TTLAll pods get evicted from each other's viewsCritical pager event
Cluster split (network partition)Each side ejects the other; recovers when partition healsSudden registry-size drop
Pub/sub channel mis-configuredEvents don't propagate; discovery still works via pollingpubSubEnabled mismatch in config
Many pods, slow heartbeatIntervalStale registry; new pods take long to be seenSet heartbeatInterval < 5_000

Tokens

Token
DISCOVERY_SERVICE_TOKEN
DISCOVERY_OPTIONS_TOKEN
REDIS_TOKEN (re-exported)
NETRON_DISCOVERY_INTEGRATION_TOKEN

LOGGER_TOKEN re-export from discovery/types.ts is internal- only to avoid circular deps. Always import LOGGER_TOKEN from @omnitron-dev/titan/module/logger.

Lifecycle

DiscoveryModule implements:

  • async onStart(app)DiscoveryService.onStart() opens pub/sub if enabled, registers this node, starts heartbeat; NetronDiscoveryIntegration binds to Netron events.
  • async onStop(app)DiscoveryService.onStop() stops heartbeat, deregisters this node (best-effort), closes pub/sub.

Service-level hooks (run in dependency order from the module hooks):

  • onInit() — resolve dependencies, prepare scripts.
  • onStart() — open pub/sub, register node, start heartbeat timer.
  • onStop() — graceful shutdown: stop timer, attempt deregister, unsubscribe.
  • onDestroy() — final cleanup.

In clientMode: true, onStart still opens pub/sub (so events arrive) but skips registration and heartbeat.

Performance notes

  • Heartbeat cost. One Lua script EVAL per pod per heartbeatInterval. At default 5 s, 100 pods → 20 EVAL/s cluster-wide. Each script touches three keys.
  • Lookup cost. findNodesByService reads index:nodes (SMEMBERS) then getNodeInfo for each (HGETALL). Cache the result client- side if you call it on the hot path — the registry doesn't change every request.
  • Pub/sub cost. Each registration/update event is one PUBLISH
    • N SUBSCRIBE callbacks where N is the number of subscribed pods. Negligible at hundreds of pods.

Anti-patterns

  • Long heartbeatInterval. Slow node-loss detection. Pick intervals such that heartbeatTTL = 3 × heartbeatInterval.
  • Putting business state in services. The ServiceInfo list is identity / version metadata. Don't pack request counts, feature flags, or per-request metadata into it.
  • Treating discovery as authoritative for routing. Discovery tells you who exists; the load balancer / multi-backend client decides which to call. Pair with Netron multi-backend.
  • Manual registerNode from app code. The Netron integration handles it. Manual registration is for unusual topologies (sidecar registrants, gateway-mediated nodes).
  • Sharing Redis with high-cardinality unrelated keyspaces. Discovery touches a small fixed set of keys. If your app's main Redis is hot, dedicate a separate DB index for discovery to keep the keyspace pristine.
  • heartbeatTTL set to less than 2 × interval. Single Redis hiccup ejects healthy nodes — exactly the time your app needs the registry most.

See also