DMS Observability
Overview
The Lightbits DMS service provides several ways to monitor its services during operation and to detect issues, exposing a rich suite of metrics and logs.
Logging
The DMS solution will emit logs to the following places:
- Docker logs for running containers
- journald
Docker logs for running containers are not persistent. They can be viewed using the docker logs
command, but they will be deleted each time the container is restarted.
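For example, the recent log output of a running container (assuming the DMS container is named mydms, as in the examples below) can be inspected with:
docker logs --tail 100 mydms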
During deployment, Ansible configures Docker to use the journald logging driver to stream all logs to journald.
This provides several benefits over the plain log-file driver:
- Log persistence across reboots and container recreation.
- Advanced filtering capabilities using parameters such as --since, --until, and CONTAINER_NAME=mydms.
- Unified log stream view for better issue investigation.
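To confirm which logging driver the Docker daemon is configured with, the default driver can be printed, for example, with:
docker info --format '{{.LoggingDriver}}'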
Structured Logs
The DMS service uses structured logs, adding key-value labels on every log message it emits.
Every log message generated by a workflow carries wid (Workflow ID) and rid (Run ID) values.
A single workflow can be debugged easily by applying grep to filter for the logs emitted by a specific Temporal workflow-id/run-id.
These IDs are consistent across the UI, CLI, and logs, which makes it easy to correlate information between them.
Log Analysis Command Examples
Show all logs of the mydms service:
journalctl CONTAINER_NAME=mydms
Show logs for the mydms service from the past hour, filtering for entries with wid==f723c726-c5db-11ef-8737-5254020627b6:
journalctl CONTAINER_NAME=mydms --since="1 hour ago" | grep f723c726-c5db-11ef-8737-5254020627b6
Examine dms logs emitted between one hour ago and 30 minutes ago (in UTC):
journalctl CONTAINER_NAME=mydms --since="1 hour ago" --until="30 minutes ago" --utc
Inspect logs from dms and discovery-client emitted between two hours ago and 30 minutes ago:
journalctl CONTAINER_NAME=mydms CONTAINER_NAME=discovery-client --since="2 hours ago" --until="30 minutes ago"
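journalctl can also emit entries as JSON for machine parsing or closer inspection of the attached journal fields; for example (the time window here is arbitrary):
journalctl CONTAINER_NAME=mydms -o json-pretty --since="10 minutes ago"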
For more filtering capabilities, see the journalctl docs.
Metrics
The DMS service optionally allows Prometheus and Grafana services to be installed and run locally:
- prometheus: Collects and stores time-series metrics from node-exporter, the DMS service, and workflow metrics. Accessible via port 9090.
- grafana: Visualizes metrics recorded by Prometheus via dedicated and customizable dashboards. Accessible via port 3000.
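As a quick sanity check (a sketch that assumes both services run locally on the default ports noted above), their standard health endpoints can be probed with curl:
curl -s http://localhost:9090/-/healthy
curl -s http://localhost:3000/api/health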
DMS Service Metrics
The DMS monitoring suite provides a large set of metrics for monitoring the performance and health of the system. The following are a few metrics that could be of interest:
- up - The DMS service is up and running.
- gRPC API - Multiple standard gRPC API metrics. These can be monitored to view the rate of ThickCloneVolume and ThickCloneSnapshot requests (see the example query after this list).
- dms_service health_state - The health of the DMS services. If DMS or any supporting service such as temporal or postgres is down, the value will become 0, indicating a NotServing state.
- cluster connectivity (access/data) - The state of DMS connectivity to the Lightbits clusters.
- Panic - grpc_req_panics_recovered_total, the total number of gRPC requests recovered from internal panic.
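For example, assuming the service exposes the standard go-grpc-prometheus server metrics (grpc_server_handled_total is an assumption here, not confirmed by this document), the rate of thick-clone requests could be sketched as:
sum by (grpc_method) (rate(grpc_server_handled_total{grpc_method=~"ThickCloneVolume|ThickCloneSnapshot"}[5m]))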
DMS Service Alert Rules
Lightbits suggests configuring the following alert rule in Prometheus:
- Service is "not up" - dms, discovery-client, and temporal.
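As a sketch, such a rule could be based on a PromQL expression like the following (the job label values are assumptions and depend on the local scrape configuration):
up{job=~"dms|discovery-client|temporal"} == 0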
Node-Exporter Metrics
The standard installation of the DMS service runs a node-exporter service on the machine. It exposes many standard node usage metrics for monitoring load, network, and other system metrics. For a complete list of metrics, see this link.
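For example, a common node-exporter query (a sketch using the standard node_cpu_seconds_total metric) for average CPU utilization per instance is:
100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)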
Node-Exporter Alert Rules
The standard installation comes with a pre-configured set of node-exporter alert rules. These can be imported into or used with any other Prometheus instance that monitors the DMS service. These alerts notify of anomalies in CPU, networking, and storage use of the DMS server.
Alerts are defined at: node-exporter.alert.rules.yml
Temporal Services Metrics
Temporal is one of the internal services used by the DMS service to orchestrate and schedule the thick-clone workflows. Temporal exposes a large number of metrics that enable tracking and debugging the execution of the thick-clone/attach-cluster workflows.
Metrics about workflows can be found here:
https://docs.temporal.io/references/cluster-metrics#workflow-metrics
Specifically, it has a breakdown of these workflow states:
- https://docs.temporal.io/references/cluster-metrics#workflow_success
- https://docs.temporal.io/references/cluster-metrics#workflow_failed
- https://docs.temporal.io/references/cluster-metrics#workflow_timeout
- https://docs.temporal.io/references/cluster-metrics#workflow_cancel
Metrics about Temporal Database (postgres) can be found here: https://docs.temporal.io/references/cluster-metrics#persistence-metrics
Temporal Metrics of Interest
There are many metrics emitted by the Temporal Services (history, frontend, persistence). The following are some example metrics that you may want to monitor, as they can indicate an issue on the Temporal side (for more information, see this link).
- no_poller_tasks - A counter metric emitted whenever a task is added to a task queue that has no poller. This is usually an indicator that either the worker or the starter programs are using the wrong Task Queue, or that no worker is currently listening for tasks. For DMS, it indicates that a task was enqueued while the DMS was not connected to the correct task_queue. Since the only worker is currently embedded inside the DMS service, this may indicate an issue with the DMS service.
- persistence_errors - Shows all persistence errors. This metric is a good indicator for connection issues between the Temporal Service and the persistence store. For example, a Prometheus query for getting all persistence errors by service (history):
sum (rate(persistence_errors{service="$service",service_name="history"}[1m]))
- workflow_failed - The number of workflows that failed before completion (see the example query below).
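As a sketch, the workflow failure rate could be watched with a query such as the following (the namespace label and its value are assumptions and depend on the Temporal namespace used by the DMS deployment):
sum(rate(workflow_failed{namespace="default"}[5m]))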