# How to monitor Synapse metrics using Prometheus

1.  Install Prometheus:

    Follow instructions at
    <http://prometheus.io/docs/introduction/install/>

1.  Enable Synapse metrics:

    In `homeserver.yaml`, make sure `enable_metrics` is
    set to `True`.

1.  Enable the `/_synapse/metrics` Synapse endpoint that Prometheus uses to
    collect data:

    There are two methods of enabling the metrics endpoint in Synapse.

    The first serves the metrics as part of the usual web server and
    can be enabled by adding the `metrics` resource to an existing
    listener:

    ```yaml
      resources:
        - names:
          - client
          - metrics
    ```

    This is a simple way of adding metrics to your Synapse
    installation, and serves them under `/_synapse/metrics`. If you do
    not want your metrics to be publicly exposed, you will need to either
    filter the path out at your load balancer or use the second method.

    The second method runs the metrics server on a different port, in a
    different thread to Synapse. This can make it more resilient to
    heavy load that would otherwise prevent metrics from being retrieved,
    and makes it easier to expose the metrics to internal networks only.
    The metrics are served over HTTP only, at `/_synapse/metrics`.

    Add a new listener to `homeserver.yaml`:

    ```yaml
      listeners:
        - type: metrics
          port: 9000
          bind_addresses:
            - '0.0.0.0'
    ```

1.  Restart Synapse.

1.  Add a Prometheus target for Synapse.

    The target must set `metrics_path` to a non-default value (under
    `scrape_configs`):

    ```yaml
      - job_name: "synapse"
        scrape_interval: 15s
        metrics_path: "/_synapse/metrics"
        static_configs:
          - targets: ["my.server.here:port"]
    ```

    where `my.server.here` is the IP address or hostname of the Synapse
    server, and `port` is the port of the listener configured with the
    `metrics` resource.

    If your Prometheus is older than 1.5.2, you will need to replace
    `static_configs` in the above with `target_groups`.

1.  Restart Prometheus.

1.  Consider using the [Grafana dashboard](https://github.com/matrix-org/synapse/tree/master/contrib/grafana/)
    and the required [recording rules](https://github.com/matrix-org/synapse/tree/master/contrib/prometheus/).
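
Once Synapse and Prometheus have been restarted, it can help to confirm that the endpoint responds before debugging the scrape configuration. The following is a minimal sketch using only the Python standard library. The function names `fetch_metrics` and `metric_names` are illustrative and not part of Synapse, and the parser only handles the plain `name{labels} value` sample lines of the Prometheus text format.

```python
from urllib.request import urlopen


def fetch_metrics(url: str, timeout: float = 5.0) -> str:
    """Download the raw metrics text from a Synapse metrics endpoint."""
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")


def metric_names(exposition_text: str) -> set:
    """Return the set of metric names found in Prometheus text-format output."""
    names = set()
    for line in exposition_text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        # A sample line looks like `name{labels} value` or `name value`.
        name = line.split("{", 1)[0].split(" ", 1)[0]
        names.add(name)
    return names
```

For instance, `metric_names(fetch_metrics("http://127.0.0.1:9000/_synapse/metrics"))` should return a non-empty set if the example listener on port 9000 is reachable from the machine running the script. The address is a placeholder for your own listener.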

## Monitoring workers

To monitor a Synapse installation using [workers](workers.md),
every worker needs to be monitored independently, in addition to
the main homeserver process. This is because workers don't send
their metrics to the main homeserver process, but expose them
directly (if they are configured to do so).

To allow collecting metrics from a worker, you need to add a
`metrics` listener to its configuration, by adding the following
under `worker_listeners`:

```yaml
  - type: metrics
    bind_address: ''
    port: 9101
```

The `bind_address` and `port` parameters should be set so that
the resulting listener can be reached by Prometheus, and so that they
do not clash with the listener of another worker.
With this example, the worker's metrics would then be available
on `http://127.0.0.1:9101`.
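
When several workers run on one host, a quick check for clashing listeners can catch mistakes before a worker fails to start. The sketch below is a simplified illustration: the helper `find_port_clashes` is hypothetical, it assumes all listeners are on the same machine, and it treats an empty address, `0.0.0.0`, and `::` as binding every interface.

```python
from collections import defaultdict

# Addresses that bind every interface and therefore clash with any
# other listener on the same port (a simplifying assumption).
WILDCARDS = {"", "0.0.0.0", "::"}


def find_port_clashes(listeners):
    """Return the sorted ports used by listeners with overlapping addresses.

    `listeners` is an iterable of (worker_name, bind_address, port) tuples.
    """
    by_port = defaultdict(list)
    for _name, address, port in listeners:
        by_port[port].append(address)

    clashes = []
    for port, addresses in by_port.items():
        if len(addresses) < 2:
            continue
        has_wildcard = any(a in WILDCARDS for a in addresses)
        has_duplicate = len(set(addresses)) < len(addresses)
        if has_wildcard or has_duplicate:
            clashes.append(port)
    return sorted(clashes)
```

Giving each worker its own port, as in the example listener above, avoids the problem entirely.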

Example Prometheus target for Synapse with workers:

```yaml
  - job_name: "synapse"
    scrape_interval: 15s
    metrics_path: "/_synapse/metrics"
    static_configs:
      - targets: ["my.server.here:port"]
        labels:
          instance: "my.server"
          job: "master"
          index: 1
      - targets: ["my.workerserver.here:port"]
        labels:
          instance: "my.server"
          job: "generic_worker"
          index: 1
      - targets: ["my.workerserver.here:port"]
        labels:
          instance: "my.server"
          job: "generic_worker"
          index: 2
      - targets: ["my.workerserver.here:port"]
        labels:
          instance: "my.server"
          job: "media_repository"
          index: 1
```

Labels (`instance`, `job`, `index`) can be set to any value.
They are used to group graphs in Grafana.
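
For deployments with many workers, writing each target by hand becomes repetitive. As a rough sketch, the target list could be generated instead. The helper `build_static_configs`, the host names, and the ports below are all hypothetical. The script prints JSON, which Prometheus should accept in its YAML configuration, although that is worth verifying against your Prometheus version.

```python
import json


def build_static_configs(instance, processes):
    """Build Prometheus `static_configs` entries for a Synapse deployment.

    `processes` is a list of (job, target) pairs, for example
    ("generic_worker", "my.workerserver.here:9101").
    """
    next_index = {}
    configs = []
    for job, target in processes:
        # Number each process of the same job type starting from 1.
        next_index[job] = next_index.get(job, 0) + 1
        configs.append(
            {
                "targets": [target],
                "labels": {
                    "instance": instance,
                    "job": job,
                    "index": str(next_index[job]),
                },
            }
        )
    return configs


scrape_job = {
    "job_name": "synapse",
    "scrape_interval": "15s",
    "metrics_path": "/_synapse/metrics",
    "static_configs": build_static_configs(
        "my.server",
        [
            # Hypothetical hosts and ports; substitute your own listeners.
            ("master", "my.server.here:9000"),
            ("generic_worker", "my.workerserver.here:9101"),
            ("generic_worker", "my.workerserver.here:9102"),
            ("media_repository", "my.workerserver.here:9103"),
        ],
    ),
}

# Print the job so it can be placed under `scrape_configs`.
print(json.dumps([scrape_job], indent=2))
```

The `index` label is emitted as a string because Prometheus label values are strings.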

## Renaming of metrics & deprecation of old names in 1.2

Synapse 1.2 updates the Prometheus metrics to match the naming
convention of the upstream `prometheus_client`. The old names are
considered deprecated and will be removed in a future version of
Synapse.

| New Name                                                                     | Old Name                                                               |
| ---------------------------------------------------------------------------- | ---------------------------------------------------------------------- |
| python_gc_objects_collected_total                                            | python_gc_objects_collected                                            |
| python_gc_objects_uncollectable_total                                        | python_gc_objects_uncollectable                                        |
| python_gc_collections_total                                                  | python_gc_collections                                                  |
| process_cpu_seconds_total                                                    | process_cpu_seconds                                                    |
| synapse_federation_client_sent_transactions_total                            | synapse_federation_client_sent_transactions                            |
| synapse_federation_client_events_processed_total                             | synapse_federation_client_events_processed                             |
| synapse_event_processing_loop_count_total                                    | synapse_event_processing_loop_count                                    |
| synapse_event_processing_loop_room_count_total                               | synapse_event_processing_loop_room_count                               |
| synapse_util_metrics_block_count_total                                       | synapse_util_metrics_block_count                                       |
| synapse_util_metrics_block_time_seconds_total                                | synapse_util_metrics_block_time_seconds                                |
| synapse_util_metrics_block_ru_utime_seconds_total                            | synapse_util_metrics_block_ru_utime_seconds                            |
| synapse_util_metrics_block_ru_stime_seconds_total                            | synapse_util_metrics_block_ru_stime_seconds                            |
| synapse_util_metrics_block_db_txn_count_total                                | synapse_util_metrics_block_db_txn_count                                |
| synapse_util_metrics_block_db_txn_duration_seconds_total                     | synapse_util_metrics_block_db_txn_duration_seconds                     |
| synapse_util_metrics_block_db_sched_duration_seconds_total                   | synapse_util_metrics_block_db_sched_duration_seconds                   |
| synapse_background_process_start_count_total                                 | synapse_background_process_start_count                                 |
| synapse_background_process_ru_utime_seconds_total                            | synapse_background_process_ru_utime_seconds                            |
| synapse_background_process_ru_stime_seconds_total                            | synapse_background_process_ru_stime_seconds                            |
| synapse_background_process_db_txn_count_total                                | synapse_background_process_db_txn_count                                |
| synapse_background_process_db_txn_duration_seconds_total                     | synapse_background_process_db_txn_duration_seconds                     |
| synapse_background_process_db_sched_duration_seconds_total                   | synapse_background_process_db_sched_duration_seconds                   |
| synapse_storage_events_persisted_events_total                                | synapse_storage_events_persisted_events                                |
| synapse_storage_events_persisted_events_sep_total                            | synapse_storage_events_persisted_events_sep                            |
| synapse_storage_events_state_delta_total                                     | synapse_storage_events_state_delta                                     |
| synapse_storage_events_state_delta_single_event_total                        | synapse_storage_events_state_delta_single_event                        |
| synapse_storage_events_state_delta_reuse_delta_total                         | synapse_storage_events_state_delta_reuse_delta                         |
| synapse_federation_server_received_pdus_total                                | synapse_federation_server_received_pdus                                |
| synapse_federation_server_received_edus_total                                | synapse_federation_server_received_edus                                |
| synapse_handler_presence_notified_presence_total                             | synapse_handler_presence_notified_presence                             |
| synapse_handler_presence_federation_presence_out_total                       | synapse_handler_presence_federation_presence_out                       |
| synapse_handler_presence_presence_updates_total                              | synapse_handler_presence_presence_updates                              |
| synapse_handler_presence_timers_fired_total                                  | synapse_handler_presence_timers_fired                                  |
| synapse_handler_presence_federation_presence_total                           | synapse_handler_presence_federation_presence                           |
| synapse_handler_presence_bump_active_time_total                              | synapse_handler_presence_bump_active_time                              |
| synapse_federation_client_sent_edus_total                                    | synapse_federation_client_sent_edus                                    |
| synapse_federation_client_sent_pdu_destinations_count_total                  | synapse_federation_client_sent_pdu_destinations:count                  |
| synapse_federation_client_sent_pdu_destinations_total                        | synapse_federation_client_sent_pdu_destinations:total                  |
| synapse_handlers_appservice_events_processed_total                           | synapse_handlers_appservice_events_processed                           |
| synapse_notifier_notified_events_total                                       | synapse_notifier_notified_events                                       |
| synapse_push_bulk_push_rule_evaluator_push_rules_invalidation_counter_total  | synapse_push_bulk_push_rule_evaluator_push_rules_invalidation_counter  |
| synapse_push_bulk_push_rule_evaluator_push_rules_state_size_counter_total    | synapse_push_bulk_push_rule_evaluator_push_rules_state_size_counter    |
| synapse_http_httppusher_http_pushes_processed_total                          | synapse_http_httppusher_http_pushes_processed                          |
| synapse_http_httppusher_http_pushes_failed_total                             | synapse_http_httppusher_http_pushes_failed                             |
| synapse_http_httppusher_badge_updates_processed_total                        | synapse_http_httppusher_badge_updates_processed                        |
| synapse_http_httppusher_badge_updates_failed_total                           | synapse_http_httppusher_badge_updates_failed                           |
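
Existing dashboards and alert rules that still use the old names need updating. The following sketch shows one way to rewrite PromQL expressions mechanically. It includes only a handful of entries from the table above, the helper `rename_metrics` is hypothetical, and it replaces any matching token, so an old metric name that appears inside a label value would also be rewritten.

```python
import re

# A small subset of the old -> new names from the table above.
RENAMED = {
    "process_cpu_seconds": "process_cpu_seconds_total",
    "synapse_notifier_notified_events": "synapse_notifier_notified_events_total",
    "synapse_federation_client_sent_pdu_destinations:count":
        "synapse_federation_client_sent_pdu_destinations_count_total",
    "synapse_federation_client_sent_pdu_destinations:total":
        "synapse_federation_client_sent_pdu_destinations_total",
}

# Prometheus metric names consist of letters, digits, underscores and colons.
METRIC_TOKEN = re.compile(r"[A-Za-z_:][A-Za-z0-9_:]*")


def rename_metrics(expression: str) -> str:
    """Replace deprecated metric names in a PromQL expression with the new names."""
    return METRIC_TOKEN.sub(
        lambda match: RENAMED.get(match.group(0), match.group(0)), expression
    )
```

Because whole tokens are matched, an expression that already uses a `_total` name is left unchanged.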

Removal of deprecated metrics & time based counters becoming histograms in 0.31.0
---------------------------------------------------------------------------------

The duplicated metrics deprecated in Synapse 0.27.0 have been removed.

All duration-based metrics are now reported in seconds rather than
milliseconds. This affects:

| msec -> sec metrics                    |
| -------------------------------------- |
| python_gc_time                         |
| python_twisted_reactor_tick_time       |
| synapse_storage_query_time             |
| synapse_storage_schedule_time          |
| synapse_storage_transaction_time       |

Several metrics have been changed to be histograms, which sort entries
into buckets and allow better analysis. The following metrics are now
histograms:

| Altered metrics                                  |
| ------------------------------------------------ |
| python_gc_time                                   |
| python_twisted_reactor_pending_calls             |
| python_twisted_reactor_tick_time                 |
| synapse_http_server_response_time_seconds        |
| synapse_storage_query_time                       |
| synapse_storage_schedule_time                    |
| synapse_storage_transaction_time                 |
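
To illustrate what the bucketed form makes possible, the sketch below estimates a quantile from cumulative bucket counts using linear interpolation inside the matching bucket. This is similar in spirit to PromQL's `histogram_quantile`, but it is a simplified illustration rather than a reimplementation. The function name and the sample bucket values are made up.

```python
import math


def estimate_quantile(q, buckets):
    """Estimate the q-quantile (0 <= q <= 1) from cumulative histogram buckets.

    `buckets` is a list of (upper_bound, cumulative_count) pairs sorted by
    upper bound and ending with the +Inf bucket.
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in buckets:
        if count >= rank:
            if math.isinf(upper_bound) or count == lower_count:
                # Open-ended or empty bucket: report the bound below it.
                return lower_bound
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return lower_bound


# Made-up buckets: 10 observations <= 0.1s, 30 <= 0.5s, 40 <= 1.0s.
example = [(0.1, 10), (0.5, 30), (1.0, 40), (math.inf, 40)]
median = estimate_quantile(0.5, example)
```

A plain sum and count can only yield a mean, whereas buckets allow estimates such as the median or the 95th percentile.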

Block and response metrics renamed for 0.27.0
---------------------------------------------

Synapse 0.27.0 begins the process of rationalising the duplicate
`*:count` metrics reported for the resource tracking for code blocks and
HTTP requests.

At the same time, the corresponding `*:total` metrics are being renamed,
as the `:total` suffix no longer makes sense in the absence of a
corresponding `:count` metric.

To enable a graceful migration path, this release just adds new names
for the metrics being renamed. A future release will remove the old
ones.

The following table shows the new metrics, and the old metrics which
they are replacing.

| New name                                                      | Old name                                                   |
| ------------------------------------------------------------- | ---------------------------------------------------------- |
| synapse_util_metrics_block_count                              | synapse_util_metrics_block_timer:count                     |
| synapse_util_metrics_block_count                              | synapse_util_metrics_block_ru_utime:count                  |
| synapse_util_metrics_block_count                              | synapse_util_metrics_block_ru_stime:count                  |
| synapse_util_metrics_block_count                              | synapse_util_metrics_block_db_txn_count:count              |
| synapse_util_metrics_block_count                              | synapse_util_metrics_block_db_txn_duration:count           |
| synapse_util_metrics_block_time_seconds                       | synapse_util_metrics_block_timer:total                     |
| synapse_util_metrics_block_ru_utime_seconds                   | synapse_util_metrics_block_ru_utime:total                  |
| synapse_util_metrics_block_ru_stime_seconds                   | synapse_util_metrics_block_ru_stime:total                  |
| synapse_util_metrics_block_db_txn_count                       | synapse_util_metrics_block_db_txn_count:total              |
| synapse_util_metrics_block_db_txn_duration_seconds            | synapse_util_metrics_block_db_txn_duration:total           |
| synapse_http_server_response_count                            | synapse_http_server_requests                               |
| synapse_http_server_response_count                            | synapse_http_server_response_time:count                    |
| synapse_http_server_response_count                            | synapse_http_server_response_ru_utime:count                |
| synapse_http_server_response_count                            | synapse_http_server_response_ru_stime:count                |
| synapse_http_server_response_count                            | synapse_http_server_response_db_txn_count:count            |
| synapse_http_server_response_count                            | synapse_http_server_response_db_txn_duration:count         |
| synapse_http_server_response_time_seconds                     | synapse_http_server_response_time:total                    |
| synapse_http_server_response_ru_utime_seconds                 | synapse_http_server_response_ru_utime:total                |
| synapse_http_server_response_ru_stime_seconds                 | synapse_http_server_response_ru_stime:total                |
| synapse_http_server_response_db_txn_count                     | synapse_http_server_response_db_txn_count:total            |
| synapse_http_server_response_db_txn_duration_seconds          | synapse_http_server_response_db_txn_duration:total         |
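
One count per group is enough because each renamed total can be divided by the same shared count. As a hedged sketch, the mean value per event between two scrapes could be computed as follows. The helper `mean_per_event` and the sample numbers are hypothetical, and the metric names are those introduced in 0.27.0.

```python
def mean_per_event(earlier, later, total_name, count_name):
    """Average value per event between two scrapes of a total/count pair.

    `earlier` and `later` map metric names to counter values.
    """
    events = later[count_name] - earlier[count_name]
    if events <= 0:
        # No events in the interval, or the counters were reset.
        return None
    return (later[total_name] - earlier[total_name]) / events


# Made-up counter values from two consecutive scrapes.
first = {
    "synapse_http_server_response_time_seconds": 10.0,
    "synapse_http_server_response_count": 100,
}
second = {
    "synapse_http_server_response_time_seconds": 16.0,
    "synapse_http_server_response_count": 130,
}
mean_response_time = mean_per_event(
    first,
    second,
    "synapse_http_server_response_time_seconds",
    "synapse_http_server_response_count",
)
```

The same `synapse_http_server_response_count` value serves as the divisor for the `ru_utime`, `ru_stime`, and database totals.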

Standard Metric Names
---------------------

As of Synapse version 0.18.2, the format of the process-wide metrics has
been changed to fit Prometheus standard naming conventions. Additionally,
the units have been changed from milliseconds to seconds.

| New name                                 | Old name                          |
| ---------------------------------------- | --------------------------------- |
| process_cpu_user_seconds_total           | process_resource_utime / 1000     |
| process_cpu_system_seconds_total         | process_resource_stime / 1000     |
| process_open_fds (no 'type' label)       | process_fds                       |

The Python-specific garbage collector metrics have been renamed.

| New name                         | Old name                   |
| -------------------------------- | -------------------------- |
| python_gc_time                   | reactor_gc_time            |
| python_gc_unreachable_total      | reactor_gc_unreachable     |
| python_gc_counts                 | reactor_gc_counts          |

The Twisted-specific reactor metrics have been renamed.

| New name                               | Old name                |
| -------------------------------------- | ----------------------- |
| python_twisted_reactor_pending_calls   | reactor_pending_calls   |
| python_twisted_reactor_tick_time       | reactor_tick_time       |