diff options
Diffstat (limited to 'docs/usage/administration')
-rw-r--r-- | docs/usage/administration/admin_api/federation.md | 114 | ||||
-rw-r--r-- | docs/usage/administration/admin_faq.md | 103 | ||||
-rw-r--r-- | docs/usage/administration/database_maintenance_tools.md | 18 | ||||
-rw-r--r-- | docs/usage/administration/state_groups.md | 25 | ||||
-rw-r--r-- | docs/usage/administration/understanding_synapse_through_grafana_graphs.md | 84 | ||||
-rw-r--r-- | docs/usage/administration/useful_sql_for_admins.md | 156 |
6 files changed, 500 insertions, 0 deletions
diff --git a/docs/usage/administration/admin_api/federation.md b/docs/usage/administration/admin_api/federation.md new file mode 100644 index 0000000000..8f9535f57b --- /dev/null +++ b/docs/usage/administration/admin_api/federation.md @@ -0,0 +1,114 @@ +# Federation API + +This API allows a server administrator to manage Synapse's federation with other homeservers. + +Note: This API is new, experimental and "subject to change". + +## List of destinations + +This API gets the current destination retry timing info for all remote servers. + +The list contains all the servers with which the server federates, +regardless of whether an error occurred or not. +If an error occurs, it may take up to 20 minutes for the error to be displayed here, +as a complete retry must have failed. + +The API is: + +A standard request with no filtering: + +``` +GET /_synapse/admin/v1/federation/destinations +``` + +A response body like the following is returned: + +```json +{ + "destinations":[ + { + "destination": "matrix.org", + "retry_last_ts": 1557332397936, + "retry_interval": 3000000, + "failure_ts": 1557329397936, + "last_successful_stream_ordering": null + } + ], + "total": 1 +} +``` + +To paginate, check for `next_token` and if present, call the endpoint again +with `from` set to the value of `next_token`. This will return a new page. + +If the endpoint does not return a `next_token` then there are no more destinations +to paginate through. + +**Parameters** + +The following query parameters are available: + +- `from` - Offset in the returned list. Defaults to `0`. +- `limit` - Maximum amount of destinations to return. Defaults to `100`. +- `order_by` - The method in which to sort the returned list of destinations. + Valid values are: + - `destination` - Destinations are ordered alphabetically by remote server name. + This is the default. + - `retry_last_ts` - Destinations are ordered by time of last retry attempt in ms. + - `retry_interval` - Destinations are ordered by how long until next retry in ms. + - `failure_ts` - Destinations are ordered by when the server started failing in ms. + - `last_successful_stream_ordering` - Destinations are ordered by the stream ordering + of the most recent successfully-sent PDU. +- `dir` - Direction of room order. Either `f` for forwards or `b` for backwards. Setting + this value to `b` will reverse the above sort order. Defaults to `f`. + +*Caution:* The database only has an index on the column `destination`. +This means that if a different sort order is used, +this can cause a large load on the database, especially for large environments. + +**Response** + +The following fields are returned in the JSON response body: + +- `destinations` - An array of objects, each containing information about a destination. + Destination objects contain the following fields: + - `destination` - string - Name of the remote server to federate. + - `retry_last_ts` - integer - The last time Synapse tried and failed to reach the + remote server, in ms. This is `0` if the last attempt to communicate with the + remote server was successful. + - `retry_interval` - integer - How long since the last time Synapse tried to reach + the remote server before trying again, in ms. This is `0` if no further retrying occuring. + - `failure_ts` - nullable integer - The first time Synapse tried and failed to reach the + remote server, in ms. This is `null` if communication with the remote server has never failed. + - `last_successful_stream_ordering` - nullable integer - The stream ordering of the most + recent successfully-sent [PDU](understanding_synapse_through_grafana_graphs.md#federation) + to this destination, or `null` if this information has not been tracked yet. +- `next_token`: string representing a positive integer - Indication for pagination. See above. +- `total` - integer - Total number of destinations. + +# Destination Details API + +This API gets the retry timing info for a specific remote server. + +The API is: + +``` +GET /_synapse/admin/v1/federation/destinations/<destination> +``` + +A response body like the following is returned: + +```json +{ + "destination": "matrix.org", + "retry_last_ts": 1557332397936, + "retry_interval": 3000000, + "failure_ts": 1557329397936, + "last_successful_stream_ordering": null +} +``` + +**Response** + +The response fields are the same like in the `destinations` array in +[List of destinations](#list-of-destinations) response. diff --git a/docs/usage/administration/admin_faq.md b/docs/usage/administration/admin_faq.md new file mode 100644 index 0000000000..3dcad4bbef --- /dev/null +++ b/docs/usage/administration/admin_faq.md @@ -0,0 +1,103 @@ +## Admin FAQ + +How do I become a server admin? +--- +If your server already has an admin account you should use the user admin API to promote other accounts to become admins. See [User Admin API](../../admin_api/user_admin_api.md#Change-whether-a-user-is-a-server-administrator-or-not) + +If you don't have any admin accounts yet you won't be able to use the admin API so you'll have to edit the database manually. Manually editing the database is generally not recommended so once you have an admin account, use the admin APIs to make further changes. + +```sql +UPDATE users SET admin = 1 WHERE name = '@foo:bar.com'; +``` +What servers are my server talking to? +--- +Run this sql query on your db: +```sql +SELECT * FROM destinations; +``` + +What servers are currently participating in this room? +--- +Run this sql query on your db: +```sql +SELECT DISTINCT split_part(state_key, ':', 2) + FROM current_state_events AS c + INNER JOIN room_memberships AS m USING (room_id, event_id) + WHERE room_id = '!cURbafjkfsMDVwdRDQ:matrix.org' AND membership = 'join'; +``` + +What users are registered on my server? +--- +```sql +SELECT NAME from users; +``` + +Manually resetting passwords: +--- +See https://github.com/matrix-org/synapse/blob/master/README.rst#password-reset + +I have a problem with my server. Can I just delete my database and start again? +--- +Deleting your database is unlikely to make anything better. + +It's easy to make the mistake of thinking that you can start again from a clean slate by dropping your database, but things don't work like that in a federated network: lots of other servers have information about your server. + +For example: other servers might think that you are in a room, your server will think that you are not, and you'll probably be unable to interact with that room in a sensible way ever again. + +In general, there are better solutions to any problem than dropping the database. Come and seek help in https://matrix.to/#/#synapse:matrix.org. + +There are two exceptions when it might be sensible to delete your database and start again: +* You have *never* joined any rooms which are federated with other servers. For instance, a local deployment which the outside world can't talk to. +* You are changing the `server_name` in the homeserver configuration. In effect this makes your server a completely new one from the point of view of the network, so in this case it makes sense to start with a clean database. +(In both cases you probably also want to clear out the media_store.) + +I've stuffed up access to my room, how can I delete it to free up the alias? +--- +Using the following curl command: +``` +curl -H 'Authorization: Bearer <access-token>' -X DELETE https://matrix.org/_matrix/client/r0/directory/room/<room-alias> +``` +`<access-token>` - can be obtained in riot by looking in the riot settings, down the bottom is: +Access Token:\<click to reveal\> + +`<room-alias>` - the room alias, eg. #my_room:matrix.org this possibly needs to be URL encoded also, for example %23my_room%3Amatrix.org + +How can I find the lines corresponding to a given HTTP request in my homeserver log? +--- + +Synapse tags each log line according to the HTTP request it is processing. When it finishes processing each request, it logs a line containing the words `Processed request: `. For example: + +``` +2019-02-14 22:35:08,196 - synapse.access.http.8008 - 302 - INFO - GET-37 - ::1 - 8008 - {@richvdh:localhost} Processed request: 0.173sec/0.001sec (0.002sec, 0.000sec) (0.027sec/0.026sec/2) 687B 200 "GET /_matrix/client/r0/sync HTTP/1.1" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36" [0 dbevts]" +``` + +Here we can see that the request has been tagged with `GET-37`. (The tag depends on the method of the HTTP request, so might start with `GET-`, `PUT-`, `POST-`, `OPTIONS-` or `DELETE-`.) So to find all lines corresponding to this request, we can do: + +``` +grep 'GET-37' homeserver.log +``` + +If you want to paste that output into a github issue or matrix room, please remember to surround it with triple-backticks (```) to make it legible (see https://help.github.com/en/articles/basic-writing-and-formatting-syntax#quoting-code). + + +What do all those fields in the 'Processed' line mean? +--- +See [Request log format](request_log.md). + + +What are the biggest rooms on my server? +--- + +```sql +SELECT s.canonical_alias, g.room_id, count(*) AS num_rows +FROM + state_groups_state AS g, + room_stats_state AS s +WHERE g.room_id = s.room_id +GROUP BY s.canonical_alias, g.room_id +ORDER BY num_rows desc +LIMIT 10; +``` + +You can also use the [List Room API](../../admin_api/rooms.md#list-room-api) +and `order_by` `state_events`. diff --git a/docs/usage/administration/database_maintenance_tools.md b/docs/usage/administration/database_maintenance_tools.md new file mode 100644 index 0000000000..92b805d413 --- /dev/null +++ b/docs/usage/administration/database_maintenance_tools.md @@ -0,0 +1,18 @@ +This blog post by Victor Berger explains how to use many of the tools listed on this page: https://levans.fr/shrink-synapse-database.html + +# List of useful tools and scripts for maintenance Synapse database: + +## [Purge Remote Media API](../../admin_api/media_admin_api.md#purge-remote-media-api) +The purge remote media API allows server admins to purge old cached remote media. + +## [Purge Local Media API](../../admin_api/media_admin_api.md#delete-local-media) +This API deletes the *local* media from the disk of your own server. + +## [Purge History API](../../admin_api/purge_history_api.md) +The purge history API allows server admins to purge historic events from their database, reclaiming disk space. + +## [synapse-compress-state](https://github.com/matrix-org/rust-synapse-compress-state) +Tool for compressing (deduplicating) `state_groups_state` table. + +## [SQL for analyzing Synapse PostgreSQL database stats](useful_sql_for_admins.md) +Some easy SQL that reports useful stats about your Synapse database. \ No newline at end of file diff --git a/docs/usage/administration/state_groups.md b/docs/usage/administration/state_groups.md new file mode 100644 index 0000000000..f1dee7accf --- /dev/null +++ b/docs/usage/administration/state_groups.md @@ -0,0 +1,25 @@ +# How do State Groups work? + +As a general rule, I encourage people who want to understand the deepest darkest secrets of the database schema to drop by #synapse-dev:matrix.org and ask questions. + +However, one question that comes up frequently is that of how "state groups" work, and why the `state_groups_state` table gets so big, so here's an attempt to answer that question. + +We need to be able to relatively quickly calculate the state of a room at any point in that room's history. In other words, we need to know the state of the room at each event in that room. This is done as follows: + +A sequence of events where the state is the same are grouped together into a `state_group`; the mapping is recorded in `event_to_state_groups`. (Technically speaking, since a state event usually changes the state in the room, we are recording the state of the room *after* the given event id: which is to say, to a handwavey simplification, the first event in a state group is normally a state event, and others in the same state group are normally non-state-events.) + +`state_groups` records, for each state group, the id of the room that we're looking at, and also the id of the first event in that group. (I'm not sure if that event id is used much in practice.) + +Now, if we stored all the room state for each `state_group`, that would be a huge amount of data. Instead, for each state group, we normally store the difference between the state in that group and some other state group, and only occasionally (every 100 state changes or so) record the full state. + +So, most state groups have an entry in `state_group_edges` (don't ask me why it's not a column in `state_groups`) which records the previous state group in the room, and `state_groups_state` records the differences in state since that previous state group. + +A full state group just records the event id for each piece of state in the room at that point. + +## Known bugs with state groups + +There are various reasons that we can end up creating many more state groups than we need: see https://github.com/matrix-org/synapse/issues/3364 for more details. + +## Compression tool + +There is a tool at https://github.com/matrix-org/rust-synapse-compress-state which can compress the `state_groups_state` on a room by-room basis (essentially, it reduces the number of "full" state groups). This can result in dramatic reductions of the storage used. \ No newline at end of file diff --git a/docs/usage/administration/understanding_synapse_through_grafana_graphs.md b/docs/usage/administration/understanding_synapse_through_grafana_graphs.md new file mode 100644 index 0000000000..c365cc3923 --- /dev/null +++ b/docs/usage/administration/understanding_synapse_through_grafana_graphs.md @@ -0,0 +1,84 @@ +## Understanding Synapse through Grafana graphs + +It is possible to monitor much of the internal state of Synapse using [Prometheus](https://prometheus.io) +metrics and [Grafana](https://grafana.com/). +A guide for configuring Synapse to provide metrics is available [here](../../metrics-howto.md) +and information on setting up Grafana is [here](https://github.com/matrix-org/synapse/tree/master/contrib/grafana). +In this setup, Prometheus will periodically scrape the information Synapse provides and +store a record of it over time. Grafana is then used as an interface to query and +present this information through a series of pretty graphs. + +Once you have grafana set up, and assuming you're using [our grafana dashboard template](https://github.com/matrix-org/synapse/blob/master/contrib/grafana/synapse.json), look for the following graphs when debugging a slow/overloaded Synapse: + +## Message Event Send Time + +![image](https://user-images.githubusercontent.com/1342360/82239409-a1c8e900-9930-11ea-8081-e4614e0c63f4.png) + +This, along with the CPU and Memory graphs, is a good way to check the general health of your Synapse instance. It represents how long it takes for a user on your homeserver to send a message. + +## Transaction Count and Transaction Duration + +![image](https://user-images.githubusercontent.com/1342360/82239985-8d392080-9931-11ea-80d0-843ab2f22e1e.png) + +![image](https://user-images.githubusercontent.com/1342360/82240050-ab068580-9931-11ea-98f1-f94671cbac9a.png) + +These graphs show the database transactions that are occurring the most frequently, as well as those are that are taking the most amount of time to execute. + +![image](https://user-images.githubusercontent.com/1342360/82240192-e86b1300-9931-11ea-9aac-3e2c9bfa6fdc.png) + +In the first graph, we can see obvious spikes corresponding to lots of `get_user_by_id` transactions. This would be useful information to figure out which part of the Synapse codebase is potentially creating a heavy load on the system. However, be sure to cross-reference this with Transaction Duration, which states that `get_users_by_id` is actually a very quick database transaction and isn't causing as much load as others, like `persist_events`: + +![image](https://user-images.githubusercontent.com/1342360/82240467-62030100-9932-11ea-8db9-917f2d977fe1.png) + +Still, it's probably worth investigating why we're getting users from the database that often, and whether it's possible to reduce the amount of queries we make by adjusting our cache factor(s). + +The `persist_events` transaction is responsible for saving new room events to the Synapse database, so can often show a high transaction duration. + +## Federation + +The charts in the "Federation" section show information about incoming and outgoing federation requests. Federation data can be divided into two basic types: + +- PDU (Persistent Data Unit) - room events: messages, state events (join/leave), etc. These are permanently stored in the database. +- EDU (Ephemeral Data Unit) - other data, which need not be stored permanently, such as read receipts, typing notifications. + +The "Outgoing EDUs by type" chart shows the EDUs within outgoing federation requests by type: `m.device_list_update`, `m.direct_to_device`, `m.presence`, `m.receipt`, `m.typing`. + +If you see a large number of `m.presence` EDUs and are having trouble with too much CPU load, you can disable `presence` in the Synapse config. See also [#3971](https://github.com/matrix-org/synapse/issues/3971). + +## Caches + +![image](https://user-images.githubusercontent.com/1342360/82240572-8b239180-9932-11ea-96ff-6b5f0e57ebe5.png) + +![image](https://user-images.githubusercontent.com/1342360/82240666-b8703f80-9932-11ea-86af-9f663988d8da.png) + +This is quite a useful graph. It shows how many times Synapse attempts to retrieve a piece of data from a cache which the cache did not contain, thus resulting in a call to the database. We can see here that the `_get_joined_profile_from_event_id` cache is being requested a lot, and often the data we're after is not cached. + +Cross-referencing this with the Eviction Rate graph, which shows that entries are being evicted from `_get_joined_profile_from_event_id` quite often: + +![image](https://user-images.githubusercontent.com/1342360/82240766-de95df80-9932-11ea-8c15-5acfc57c48da.png) + +we should probably consider raising the size of that cache by raising its cache factor (a multiplier value for the size of an individual cache). Information on doing so is available [here](https://github.com/matrix-org/synapse/blob/ee421e524478c1ad8d43741c27379499c2f6135c/docs/sample_config.yaml#L608-L642) (note that the configuration of individual cache factors through the configuration file is available in Synapse v1.14.0+, whereas doing so through environment variables has been supported for a very long time). Note that this will increase Synapse's overall memory usage. + +## Forward Extremities + +![image](https://user-images.githubusercontent.com/1342360/82241440-13566680-9934-11ea-8b88-ba468db937ed.png) + +Forward extremities are the leaf events at the end of a DAG in a room, aka events that have no children. The more that exist in a room, the more [state resolution](https://spec.matrix.org/v1.1/server-server-api/#room-state-resolution) that Synapse needs to perform (hint: it's an expensive operation). While Synapse has code to prevent too many of these existing at one time in a room, bugs can sometimes make them crop up again. + +If a room has >10 forward extremities, it's worth checking which room is the culprit and potentially removing them using the SQL queries mentioned in [#1760](https://github.com/matrix-org/synapse/issues/1760). + +## Garbage Collection + +![image](https://user-images.githubusercontent.com/1342360/82241911-da6ac180-9934-11ea-9a0d-a311fe22acd0.png) + +Large spikes in garbage collection times (bigger than shown here, I'm talking in the +multiple seconds range), can cause lots of problems in Synapse performance. It's more an +indicator of problems, and a symptom of other problems though, so check other graphs for what might be causing it. + +## Final Thoughts + +If you're still having performance problems with your Synapse instance and you've +tried everything you can, it may just be a lack of system resources. Consider adding +more CPU and RAM, and make use of [worker mode](../../workers.md) +to make use of multiple CPU cores / multiple machines for your homeserver. + diff --git a/docs/usage/administration/useful_sql_for_admins.md b/docs/usage/administration/useful_sql_for_admins.md new file mode 100644 index 0000000000..d4aada3272 --- /dev/null +++ b/docs/usage/administration/useful_sql_for_admins.md @@ -0,0 +1,156 @@ +## Some useful SQL queries for Synapse Admins + +## Size of full matrix db +`SELECT pg_size_pretty( pg_database_size( 'matrix' ) );` +### Result example: +``` +pg_size_pretty +---------------- + 6420 MB +(1 row) +``` +## Show top 20 larger rooms by state events count +```sql +SELECT r.name, s.room_id, s.current_state_events + FROM room_stats_current s + LEFT JOIN room_stats_state r USING (room_id) + ORDER BY current_state_events DESC + LIMIT 20; +``` + +and by state_group_events count: +```sql +SELECT rss.name, s.room_id, count(s.room_id) FROM state_groups_state s +LEFT JOIN room_stats_state rss USING (room_id) +GROUP BY s.room_id, rss.name +ORDER BY count(s.room_id) DESC +LIMIT 20; +``` +plus same, but with join removed for performance reasons: +```sql +SELECT s.room_id, count(s.room_id) FROM state_groups_state s +GROUP BY s.room_id +ORDER BY count(s.room_id) DESC +LIMIT 20; +``` + +## Show top 20 larger tables by row count +```sql +SELECT relname, n_live_tup as rows + FROM pg_stat_user_tables + ORDER BY n_live_tup DESC + LIMIT 20; +``` +This query is quick, but may be very approximate, for exact number of rows use `SELECT COUNT(*) FROM <table_name>`. +### Result example: +``` +state_groups_state - 161687170 +event_auth - 8584785 +event_edges - 6995633 +event_json - 6585916 +event_reference_hashes - 6580990 +events - 6578879 +received_transactions - 5713989 +event_to_state_groups - 4873377 +stream_ordering_to_exterm - 4136285 +current_state_delta_stream - 3770972 +event_search - 3670521 +state_events - 2845082 +room_memberships - 2785854 +cache_invalidation_stream - 2448218 +state_groups - 1255467 +state_group_edges - 1229849 +current_state_events - 1222905 +users_in_public_rooms - 364059 +device_lists_stream - 326903 +user_directory_search - 316433 +``` + +## Show top 20 rooms by new events count in last 1 day: +```sql +SELECT e.room_id, r.name, COUNT(e.event_id) cnt FROM events e +LEFT JOIN room_stats_state r USING (room_id) +WHERE e.origin_server_ts >= DATE_PART('epoch', NOW() - INTERVAL '1 day') * 1000 GROUP BY e.room_id, r.name ORDER BY cnt DESC LIMIT 20; +``` + +## Show top 20 users on homeserver by sent events (messages) at last month: +```sql +SELECT user_id, SUM(total_events) + FROM user_stats_historical + WHERE TO_TIMESTAMP(end_ts/1000) AT TIME ZONE 'UTC' > date_trunc('day', now() - interval '1 month') + GROUP BY user_id + ORDER BY SUM(total_events) DESC + LIMIT 20; +``` + +## Show last 100 messages from needed user, with room names: +```sql +SELECT e.room_id, r.name, e.event_id, e.type, e.content, j.json FROM events e + LEFT JOIN event_json j USING (room_id) + LEFT JOIN room_stats_state r USING (room_id) + WHERE sender = '@LOGIN:example.com' + AND e.type = 'm.room.message' + ORDER BY stream_ordering DESC + LIMIT 100; +``` + +## Show top 20 larger tables by storage size +```sql +SELECT nspname || '.' || relname AS "relation", + pg_size_pretty(pg_total_relation_size(C.oid)) AS "total_size" + FROM pg_class C + LEFT JOIN pg_namespace N ON (N.oid = C.relnamespace) + WHERE nspname NOT IN ('pg_catalog', 'information_schema') + AND C.relkind <> 'i' + AND nspname !~ '^pg_toast' + ORDER BY pg_total_relation_size(C.oid) DESC + LIMIT 20; +``` +### Result example: +``` +public.state_groups_state - 27 GB +public.event_json - 9855 MB +public.events - 3675 MB +public.event_edges - 3404 MB +public.received_transactions - 2745 MB +public.event_reference_hashes - 1864 MB +public.event_auth - 1775 MB +public.stream_ordering_to_exterm - 1663 MB +public.event_search - 1370 MB +public.room_memberships - 1050 MB +public.event_to_state_groups - 948 MB +public.current_state_delta_stream - 711 MB +public.state_events - 611 MB +public.presence_stream - 530 MB +public.current_state_events - 525 MB +public.cache_invalidation_stream - 466 MB +public.receipts_linearized - 279 MB +public.state_groups - 160 MB +public.device_lists_remote_cache - 124 MB +public.state_group_edges - 122 MB +``` + +## Show rooms with names, sorted by events in this rooms +`echo "select event_json.room_id,room_stats_state.name from event_json,room_stats_state where room_stats_state.room_id=event_json.room_id" | psql synapse | sort | uniq -c | sort -n` +### Result example: +``` + 9459 !FPUfgzXYWTKgIrwKxW:matrix.org | This Week in Matrix + 9459 !FPUfgzXYWTKgIrwKxW:matrix.org | This Week in Matrix (TWIM) + 17799 !iDIOImbmXxwNngznsa:matrix.org | Linux in Russian + 18739 !GnEEPYXUhoaHbkFBNX:matrix.org | Riot Android + 23373 !QtykxKocfZaZOUrTwp:matrix.org | Matrix HQ + 39504 !gTQfWzbYncrtNrvEkB:matrix.org | ru.[matrix] + 43601 !iNmaIQExDMeqdITdHH:matrix.org | Riot + 43601 !iNmaIQExDMeqdITdHH:matrix.org | Riot Web/Desktop +``` + +## Lookup room state info by list of room_id +```sql +SELECT rss.room_id, rss.name, rss.canonical_alias, rss.topic, rss.encryption, rsc.joined_members, rsc.local_users_in_room, rss.join_rules +FROM room_stats_state rss +LEFT JOIN room_stats_current rsc USING (room_id) +WHERE room_id IN (WHERE room_id IN ( + '!OGEhHVWSdvArJzumhm:matrix.org', + '!YTvKGNlinIzlkMTVRl:matrix.org' +) +``` \ No newline at end of file |