DMS Observability
Overview
The Lightbits DMS service provides several ways to monitor its services during operation and to detect issues, exposing a rich suite of metrics and logs.
Logging
The DMS solution will emit logs to the following places:
- Docker logs for running containers
- journald
Docker logs for running containers are not persistent. They can be viewed using the docker logs
command, but they will be deleted each time the container is restarted.
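For example, the recent log output of a running container (assuming the DMS container is named mydms, as in the examples below) can be inspected with:
docker logs --tail 100 mydms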
During deployment, Ansible configures Docker to use the journald logging driver to stream all logs to journald.
This provides several benefits over the plain log-file driver:
- Log persistence across reboots and container recreation.
- Advanced filtering capabilities using parameters such as --since, --until, and CONTAINER_NAME=mydms.
- Unified log stream view for better issue investigation.
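To confirm which logging driver the Docker daemon is configured with, the default driver can be printed, for example, with:
docker info --format '{{.LoggingDriver}}'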
Structured Logs
The DMS service uses structured logs, adding key-value labels on every log message it emits.
Every log message generated by a workflow carries wid (Workflow ID) and rid (Run ID) values.
A single workflow can be debugged easily by applying grep to filter for the logs emitted by a specific Temporal workflow-id/run-id.
These IDs are consistent across the UI, CLI, and logs, which makes it easy to correlate information between them.
Log Analysis Command Examples
Show all logs of the mydms service:
journalctl CONTAINER_NAME=mydms
Show logs for the mydms service from the past hour, filtering for entries with wid==f723c726-c5db-11ef-8737-5254020627b6:
journalctl CONTAINER_NAME=mydms --since="1 hour ago" | grep f723c726-c5db-11ef-8737-5254020627b6
Examine dms logs emitted between one hour ago and 30 minutes ago (in UTC):
journalctl CONTAINER_NAME=mydms --since="1 hour ago" --until="30 minutes ago" --utc
Inspect logs from dms and discovery-client emitted between two hours ago and 30 minutes ago:
journalctl CONTAINER_NAME=mydms CONTAINER_NAME=discovery-client --since="2 hours ago" --until="30 minutes ago"
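journalctl can also emit entries as JSON for machine parsing or closer inspection of the attached journal fields; for example (the time window here is arbitrary):
journalctl CONTAINER_NAME=mydms -o json-pretty --since="10 minutes ago"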
For more filtering capabilities, see the journalctl docs.
Metrics
The DMS service optionally allows Prometheus and Grafana services to be installed and run locally:
- prometheus: Collects and stores time-series metrics from node-exporter, the DMS service, and workflow metrics. Accessible via port 9090.
- grafana: Visualizes metrics recorded by Prometheus via dedicated and customizable dashboards. Accessible via port 3000.
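As a quick sanity check (a sketch that assumes both services run locally on the default ports noted above), their standard health endpoints can be probed with curl:
curl -s http://localhost:9090/-/healthy
curl -s http://localhost:3000/api/health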
DMS Service Metrics
The DMS monitoring suite provides a large set of metrics for monitoring the performance and health of the system. The following are a few metrics that could be of interest:
- up - The DMS service is up and running.
- gRPC API - Multiple standard gRPC API metrics. These can be monitored to view the rate of ThickCloneVolume and ThickCloneSnapshot requests (see the example query after this list).
- dms_service health_state - The health of the DMS services. If DMS or any supporting service such as temporal or postgres is down, the value will become 0, indicating a NotServing state.
- cluster connectivity (access/data) - The state of DMS connectivity to the Lightbits clusters.
- Panic - grpc_req_panics_recovered_total, the total number of gRPC requests recovered from internal panic.
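For example, assuming the service exposes the standard go-grpc-prometheus server metrics (grpc_server_handled_total is an assumption here, not confirmed by this document), the rate of thick-clone requests could be sketched as:
sum by (grpc_method) (rate(grpc_server_handled_total{grpc_method=~"ThickCloneVolume|ThickCloneSnapshot"}[5m]))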
DMS Service Alert Rules
Lightbits suggests configuring the following alert rule in Prometheus:
- Service is "not up" - dms, discovery-client, and temporal.
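As a sketch, such a rule could be based on a PromQL expression like the following (the job label values are assumptions and depend on the local scrape configuration):
up{job=~"dms|discovery-client|temporal"} == 0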
Node-Exporter Metrics
The standard installation of the DMS service runs a node-exporter service on the machine. It exposes many standard node usage metrics for monitoring load, network, and other system metrics. For a complete list of metrics, see this link.
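For example, a common node-exporter query (a sketch using the standard node_cpu_seconds_total metric) for average CPU utilization per instance is:
100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)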
Node-Exporter Alert Rules
The standard installation comes with a pre-configured set of node-exporter alert rules. These can be imported into or used with any other Prometheus instance that monitors the DMS service. These alerts notify of anomalies in CPU, networking, and storage use of the DMS server.
Alerts are defined at: node-exporter.alert.rules.yml
Temporal Services Metrics
Temporal is one of the internal services used by the DMS service to orchestrate and schedule the thick-clone workflows. Temporal exposes a large number of metrics that enable tracking and debugging the execution of the thick-clone/attach-cluster workflows.
Metrics about workflows can be found here:
https://docs.temporal.io/references/cluster-metrics#workflow-metrics
Specifically, it has a breakdown of these workflow states:
- https://docs.temporal.io/references/cluster-metrics#workflow_success
- https://docs.temporal.io/references/cluster-metrics#workflow_failed
- https://docs.temporal.io/references/cluster-metrics#workflow_timeout
- https://docs.temporal.io/references/cluster-metrics#workflow_cancel
Metrics about Temporal Database (postgres) can be found here: https://docs.temporal.io/references/cluster-metrics#persistence-metrics
Temporal Metrics of Interest
There are many metrics emitted by the Temporal Services (history, frontend, persistence). The following are some example metrics that you may want to monitor, as they can indicate an issue on the Temporal side (for more information, see this link).
- no_poller_tasks - A counter metric emitted whenever a task is added to a task queue that has no poller. This is usually an indicator that either the worker or the starter programs are using the wrong Task Queue, or that no worker is currently listening for tasks. For DMS, it indicates that a task was enqueued while the DMS was not connected to the correct task_queue. Since the only worker is currently embedded inside the DMS service, this may indicate an issue with the DMS service.
- persistence_errors - Shows all persistence errors. This metric is a good indicator for connection issues between the Temporal Service and the persistence store. For example, a Prometheus query for getting all persistence errors by service (history):
sum (rate(persistence_errors{service="$service",service_name="history"}[1m]))
- workflow_failed - The number of workflows that failed before completion (see the example query below).
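As a sketch, the workflow failure rate could be watched with a query such as the following (the namespace label and its value are assumptions and depend on the Temporal namespace used by the DMS deployment):
sum(rate(workflow_failed{namespace="default"}[5m]))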