From ec1c709440d99b0d8042b669e17c899cd6fd8b84 Mon Sep 17 00:00:00 2001 From: Patrick Cloke Date: Tue, 26 Sep 2023 09:44:38 -0400 Subject: Add documentation about the user directory search algorithm (#16320) --- changelog.d/16320.doc | 1 + docs/user_directory.md | 136 +++++++++++++++++++++++++++++++++++++++---------- 2 files changed, 111 insertions(+), 26 deletions(-) create mode 100644 changelog.d/16320.doc diff --git a/changelog.d/16320.doc b/changelog.d/16320.doc new file mode 100644 index 0000000000..53e42df56f --- /dev/null +++ b/changelog.d/16320.doc @@ -0,0 +1 @@ +Improve documentation of the user directory search algorithm. diff --git a/docs/user_directory.md b/docs/user_directory.md index c4794b04cf..b33fd2bc2a 100644 --- a/docs/user_directory.md +++ b/docs/user_directory.md @@ -1,49 +1,133 @@ -User Directory API Implementation -================================= +# User Directory API Implementation -The user directory is currently maintained based on the 'visible' users -on this particular server - i.e. ones which your account shares a room with, or -who are present in a publicly viewable room present on the server. +The user directory is maintained based on users that are 'visible' to the homeserver - +i.e. ones which are local to the server and ones which any local user shares a +room with. -The directory info is stored in various tables, which can (typically after -DB corruption) get stale or out of sync. If this happens, for now the +The directory info is stored in various tables, which can sometimes get out of +sync (although this is considered a bug). If this happens, for now the solution to fix it is to use the [admin API](usage/administration/admin_api/background_updates.md#run) and execute the job `regenerate_directory`. This should then start a background task to -flush the current tables and regenerate the directory. +flush the current tables and regenerate the directory. Depending on the size +of your homeserver (number of users and rooms) this can take a while. -Data model ----------- +## Data model There are five relevant tables that collectively form the "user directory". -Three of them track a master list of all the users we could search for. -The last two (collectively called the "search tables") track who can -see who. +Three of them track a list of all known users. The last two (collectively called +the "search tables") track which users are visible to each other. From all of these tables we exclude three types of local user: - - support users - - appservice users - - deactivated users - -* `user_directory`. This contains the user_id, display name and avatar we'll - return when you search the directory. - - Because there's only one directory entry per user, it's important that we only - ever put publicly visible names here. Otherwise we might leak a private + +- support users +- appservice users +- deactivated users + +A description of each table follows: + +* `user_directory`. This contains the user ID, display name and avatar of each user. + - Because there is only one directory entry per user, it is important that it + only contain publicly visible information. Otherwise, this will leak the nickname or avatar used in a private room. - Indexed on rooms. Indexed on users. * `user_directory_search`. To be joined to `user_directory`. It contains an extra - column that enables full text search based on user ids and display names. - Different schemas for SQLite and Postgres with different code paths to match. + column that enables full text search based on user IDs and display names. + Different schemas for SQLite and Postgres are used. - Indexed on the full text search data. Indexed on users. * `user_directory_stream_pos`. When the initial background update to populate the directory is complete, we record a stream position here. This indicates that synapse should now listen for room changes and incrementally update - the directory where necessary. + the directory where necessary. (See [stream positions](development/synapse_architecture/streams.html).) -* `users_in_public_rooms`. Contains associations between users and the public rooms they're in. - Used to determine which users are in public rooms and should be publicly visible in the directory. +* `users_in_public_rooms`. Contains associations between users and the public + rooms they're in. Used to determine which users are in public rooms and should + be publicly visible in the directory. Both local and remote users are tracked. * `users_who_share_private_rooms`. Rows are triples `(L, M, room id)` where `L` is a local user and `M` is a local or remote user. `L` and `M` should be different, but this isn't enforced by a constraint. + + Note that if two local users share a room then there will be two entries: + `(user1, user2, !room_id)` and `(user2, user1, !room_id)`. + +## Configuration options + +The exact way user search works can be tweaked via some server-level +[configuration options](usage/configuration/config_documentation.md#user_directory). + +The information is not repeated here, but the options are mentioned below. + +## Search algorithm + +If `search_all_users` is `false`, then results are limited to users who: + +1. Are found in the `users_in_public_rooms` table, or +2. Are found in the `users_who_share_private_rooms` where `L` is the requesting + user and `M` is the search result. + +Otherwise, if `search_all_users` is `true`, no such limits are placed and all +users known to the server (matching the search query) will be returned. + +By default, locked users are not returned. If `show_locked_users` is `true` then +no filtering on the locked status of a user is done. + +The user provided search term is lowercased and normalized using [NFKC](https://en.wikipedia.org/wiki/Unicode_equivalence#Normalization), +this treats the string as case-insensitive, canonicalizes different forms of the +same text, and maps some "roughly equivalent" characters together. + +The search term is then split into words: + +* If [ICU](https://en.wikipedia.org/wiki/International_Components_for_Unicode) is + available, then the system's [default locale](https://unicode-org.github.io/icu/userguide/locale/#default-locales) + will be used to break the search term into words. (See the + [installation instructions](setup/installation.md) for how to install ICU.) +* If unavailable, then runs of ASCII characters, numbers, underscores, and hypens + are considered words. + +The queries for PostgreSQL and SQLite are detailed below, by their overall goal +is to find matching users, preferring users who are "real" (e.g. not bots, +not deactivated). It is assumed that real users will have an display name and +avatar set. + +### PostgreSQL + +The above words are then transformed into two queries: + +1. "exact" which matches the parsed words exactly (using [`to_tsquery`](https://www.postgresql.org/docs/current/textsearch-controls.html#TEXTSEARCH-PARSING-QUERIES)); +2. "prefix" which matches the parsed words as prefixes (using `to_tsquery`). + +Results are composed of all rows in the `user_directory_search` table whose information +matches one (or both) of these queries. Results are ordered by calculating a weighted +score for each result, higher scores are returned first: + +* 4x if a user ID exists. +* 1.2x if the user has a display name set. +* 1.2x if the user has an avatar set. +* 0x-3x by the full text search results using the [`ts_rank_cd` function](https://www.postgresql.org/docs/current/textsearch-controls.html#TEXTSEARCH-RANKING) + against the "exact" search query; this has four variables with the following weightings: + * `D`: 0.1 for the user ID's domain + * `C`: 0.1 for unused + * `B`: 0.9 for the user's display name (or an empty string if it is not set) + * `A`: 0.1 for the user ID's localpart +* 0x-1x by the full text search results using the `ts_rank_cd` function against the + "prefix" search query. (Using the same weightings as above.) +* If `prefer_local_users` is `true`, then 2x if the user is local to the homeserver. + +Note that `ts_rank_cd` returns a weight between 0 and 1. The initial weighting of +all results is 1. + +### SQLite + +Results are composed of all rows in the `user_directory_search` whose information +matches the query. Results are ordered by the following information, with each +subsequent column used as a tiebreaker, for each result: + +1. By the [`rank`](https://www.sqlite.org/windowfunctions.html#built_in_window_functions) + of the full text search results using the [`matchinfo` function](https://www.sqlite.org/fts3.html#matchinfo). Higher + ranks are returned first. +2. If `prefer_local_users` is `true`, then users local to the homeserver are + returned first. +3. Users with a display name set are returned first. +4. Users with an avatar set are returned first. -- cgit 1.4.1