Monitoring using Prometheus

Table of contents

  1. Example prometheus.yml
  2. Metrics exposed by lakeFS
  3. Example queries
    1. 99th percentile of API request latencies
    2. 50th percentile of S3-compatible API latencies
    3. Number of errors in outgoing S3 requests
    4. Number of open connections to the database
    5. Example Grafana dashboard

Example prometheus.yml

lakeFS exposes metrics through the same port used by the lakeFS service, using the standard /metrics path. An example prometheus.yml could look like this:

scrape_configs:
- job_name: lakeFS
  scrape_interval: 10s
  metrics_path: /metrics
  static_configs:
  - targets:
    - lakefs.example.com:8000
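
Before pointing Prometheus at lakeFS, it can help to confirm that the endpoint returns samples in the Prometheus text format. The Python sketch below parses such text. The payload is hand-written for illustration (it is not captured lakeFS output), and `parse_samples` is a hypothetical helper name, not part of lakeFS or Prometheus.

```python
import re

# One sample line: metric name, optional {labels}, then the value.
_SAMPLE_LINE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(\{.*\})?\s+(\S+)')

def parse_samples(text):
    """Return {(name, labels_string): value} for every sample line in text."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        match = _SAMPLE_LINE.match(line)
        if match:
            name, labels, value = match.groups()
            samples[(name, labels or "")] = float(value)
    return samples

# Hand-written sample payload (an assumption, for illustration only).
SAMPLE_TEXT = """\
# HELP api_requests_total lakeFS API requests
# TYPE api_requests_total counter
api_requests_total{code="200",method="GET"} 42
api_requests_total{code="404",method="GET"} 3
go_sql_stats_connections_open 5
"""

samples = parse_samples(SAMPLE_TEXT)
for (name, labels), value in samples.items():
    print(f"{name}{labels} = {value}")
```

For a live check, the same function could be given the response body of the /metrics path on the target from the configuration above, fetched for instance with `urllib.request.urlopen`.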

Metrics exposed by lakeFS

By default, the Prometheus client library exports OS process metrics such as memory and CPU usage. It also includes Go-specific metrics such as garbage collection (GC) details and the number of goroutines. You can learn about these default metrics in this post.

In addition, lakeFS exposes the following metrics to help monitor your deployment:

| Name in Prometheus | Description | Labels |
|---|---|---|
| api_requests_total | lakeFS API requests (counter) | code: http status; method: http method |
| api_request_duration_seconds | Durations of lakeFS API requests (histogram) | operation: name of API operation; code: http status |
| gateway_request_duration_seconds | lakeFS S3-compatible endpoint request (histogram) | operation: name of gateway operation; code: http status |
| s3_operation_duration_seconds | Outgoing S3 operations (histogram) | operation: name of S3 operation; error: “true” if error, “false” otherwise |
| go_sql_stats_* | Go DB stats metrics have this prefix. dlmiddlecote/sqlstats is used to expose them. | |


Example queries

Note: when using Prometheus functions like rate or increase, results are extrapolated and may not be exact.
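
To illustrate why extrapolated results may not be exact, here is a deliberately simplified Python sketch of the idea: the delta between the first and last sample in a window is scaled up to cover the whole window. It ignores counter resets and the boundary heuristics that Prometheus applies, and the sample timestamps and values are made up.

```python
def simplified_increase(samples, window_seconds):
    """Estimate a counter's increase over a window from (timestamp, value) pairs.

    Simplified: scales the observed delta to the full window length.
    Prometheus also handles counter resets and window-boundary heuristics,
    which this sketch leaves out.
    """
    (t_first, v_first), (t_last, v_last) = samples[0], samples[-1]
    observed_delta = v_last - v_first
    covered_seconds = t_last - t_first
    return observed_delta * window_seconds / covered_seconds

# Six scrapes, 10s apart, inside a 60s window; the counter grows by 1 per scrape.
samples = [(5, 100), (15, 101), (25, 102), (35, 103), (45, 104), (55, 105)]
estimate = simplified_increase(samples, 60)
print(estimate)
```

The samples only show an increase of 5, but they cover 50 of the 60 seconds, so the scaled estimate is higher than the observed delta. This is why an increase over a counter of whole events can come back as a non-integer value.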

99th percentile of API request latencies

histogram_quantile(0.99, sum by (operation, le) (rate(api_request_duration_seconds_bucket[1m])))

50th percentile of S3-compatible API latencies

histogram_quantile(0.5, sum by (operation, le) (rate(gateway_request_duration_seconds_bucket[1m])))
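
For intuition about what these percentile queries compute: histogram_quantile estimates a quantile from a histogram's cumulative bucket counts (the series distinguished by the le label), interpolating linearly inside the bucket that contains the target rank. The Python sketch below mirrors that idea in simplified form. The bucket bounds and counts are made-up values, and the function is an illustration rather than Prometheus's actual implementation.

```python
import math

def histogram_quantile(q, buckets):
    """Estimate the q-quantile from (upper_bound, cumulative_count) buckets.

    The buckets must include the +Inf bucket, whose count is the total.
    """
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # The rank falls in the +Inf bucket: report the highest finite bound.
                return prev_bound
            if count == prev_count:
                return bound
            # Linear interpolation inside the bucket that contains the rank.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound

# Made-up latency buckets in seconds: (le, cumulative request count).
latency_buckets = [(0.1, 50), (0.5, 90), (1.0, 99), (math.inf, 100)]
median = histogram_quantile(0.5, latency_buckets)
p75 = histogram_quantile(0.75, latency_buckets)
print(median, p75)
```

Because the estimate depends on the bucket boundaries, its precision is limited by how finely the buckets are spaced around the quantile of interest.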

Number of errors in outgoing S3 requests

sum by (operation) (increase(s3_operation_duration_seconds_count{error="true"}[1m]))

Number of open connections to the database

go_sql_stats_connections_open
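
The queries in this section can also be run outside the Prometheus UI through the Prometheus HTTP API, which accepts instant queries at /api/v1/query with a `query` parameter. The Python sketch below only builds the request URL. The server address http://localhost:9090 is an assumption, and `instant_query_url` is a hypothetical helper name.

```python
from urllib.parse import urlencode

# Assumed address of the Prometheus server (not of lakeFS itself).
PROMETHEUS_URL = "http://localhost:9090"

def instant_query_url(base_url, promql):
    """Build the URL for an instant query against the Prometheus HTTP API."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"

connections_url = instant_query_url(PROMETHEUS_URL, "go_sql_stats_connections_open")
errors_url = instant_query_url(
    PROMETHEUS_URL,
    'sum by (operation) (increase(s3_operation_duration_seconds_count{error="true"}[1m]))',
)
print(connections_url)
print(errors_url)
```

Fetching either URL with an HTTP client returns a JSON document. URL-encoding the expression matters because PromQL uses characters such as braces, quotes, and brackets.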

Example Grafana dashboard

Grafana dashboard example