summary refs log tree commit diff
path: root/docs/development/internal_documentation
diff options
context:
space:
mode:
Diffstat (limited to 'docs/development/internal_documentation')
-rw-r--r--docs/development/internal_documentation/auth_chain_diff.dot32
-rw-r--r--docs/development/internal_documentation/auth_chain_diff.dot.pngbin0 -> 42427 bytes
-rw-r--r--docs/development/internal_documentation/auth_chain_difference_algorithm.md141
-rw-r--r--docs/development/internal_documentation/cas.md64
-rw-r--r--docs/development/internal_documentation/media_repository.md78
-rw-r--r--docs/development/internal_documentation/room-dag-concepts.md113
-rw-r--r--docs/development/internal_documentation/room_and_user_statistics.md22
-rw-r--r--docs/development/internal_documentation/saml.md40
8 files changed, 490 insertions, 0 deletions
diff --git a/docs/development/internal_documentation/auth_chain_diff.dot b/docs/development/internal_documentation/auth_chain_diff.dot
new file mode 100644

index 0000000000..978d579ada --- /dev/null +++ b/docs/development/internal_documentation/auth_chain_diff.dot
@@ -0,0 +1,32 @@ +digraph auth { + nodesep=0.5; + rankdir="RL"; + + C [label="Create (1,1)"]; + + BJ [label="Bob's Join (2,1)", color=red]; + BJ2 [label="Bob's Join (2,2)", color=red]; + BJ2 -> BJ [color=red, dir=none]; + + subgraph cluster_foo { + A1 [label="Alice's invite (4,1)", color=blue]; + A2 [label="Alice's Join (4,2)", color=blue]; + A3 [label="Alice's Join (4,3)", color=blue]; + A3 -> A2 -> A1 [color=blue, dir=none]; + color=none; + } + + PL1 [label="Power Level (3,1)", color=darkgreen]; + PL2 [label="Power Level (3,2)", color=darkgreen]; + PL2 -> PL1 [color=darkgreen, dir=none]; + + {rank = same; C; BJ; PL1; A1;} + + A1 -> C [color=grey]; + A1 -> BJ [color=grey]; + PL1 -> C [color=grey]; + BJ2 -> PL1 [penwidth=2]; + + A3 -> PL2 [penwidth=2]; + A1 -> PL1 -> BJ -> C [penwidth=2]; +} diff --git a/docs/development/internal_documentation/auth_chain_diff.dot.png b/docs/development/internal_documentation/auth_chain_diff.dot.png new file mode 100644
index 0000000000..771c07308f --- /dev/null +++ b/docs/development/internal_documentation/auth_chain_diff.dot.png
Binary files differdiff --git a/docs/development/internal_documentation/auth_chain_difference_algorithm.md b/docs/development/internal_documentation/auth_chain_difference_algorithm.md new file mode 100644
index 0000000000..ebc9de25b8 --- /dev/null +++ b/docs/development/internal_documentation/auth_chain_difference_algorithm.md
@@ -0,0 +1,141 @@ +# Auth Chain Difference Algorithm + +The auth chain difference algorithm is used by V2 state resolution, where a +naive implementation can be a significant source of CPU and DB usage. + +### Definitions + +A *state set* is a set of state events; e.g. the input of a state resolution +algorithm is a collection of state sets. + +The *auth chain* of a set of events are all the events' auth events and *their* +auth events, recursively (i.e. the events reachable by walking the graph induced +by an event's auth events links). + +The *auth chain difference* of a collection of state sets is the union minus the +intersection of the sets of auth chains corresponding to the state sets, i.e an +event is in the auth chain difference if it is reachable by walking the auth +event graph from at least one of the state sets but not from *all* of the state +sets. + +## Breadth First Walk Algorithm + +A way of calculating the auth chain difference without calculating the full auth +chains for each state set is to do a parallel breadth first walk (ordered by +depth) of each state set's auth chain. By tracking which events are reachable +from each state set we can finish early if every pending event is reachable from +every state set. + +This can work well for state sets that have a small auth chain difference, but +can be very inefficient for larger differences. However, this algorithm is still +used if we don't have a chain cover index for the room (e.g. because we're in +the process of indexing it). + +## Chain Cover Index + +Synapse computes auth chain differences by pre-computing a "chain cover" index +for the auth chain in a room, allowing us to efficiently make reachability queries +like "is event `A` in the auth chain of event `B`?". We could do this with an index +that tracks all pairs `(A, B)` such that `A` is in the auth chain of `B`. However, this +would be prohibitively large, scaling poorly as the room accumulates more state +events. + +Instead, we break down the graph into *chains*. A chain is a subset of a DAG +with the following property: for any pair of events `E` and `F` in the chain, +the chain contains a path `E -> F` or a path `F -> E`. This forces a chain to be +linear (without forks), e.g. `E -> F -> G -> ... -> H`. Each event in the chain +is given a *sequence number* local to that chain. The oldest event `E` in the +chain has sequence number 1. If `E` has a child `F` in the chain, then `F` has +sequence number 2. If `E` has a grandchild `G` in the chain, then `G` has +sequence number 3; and so on. + +Synapse ensures that each persisted event belongs to exactly one chain, and +tracks how the chains are connected to one another. This allows us to +efficiently answer reachability queries. Doing so uses less storage than +tracking reachability on an event-by-event basis, particularly when we have +fewer and longer chains. See + +> Jagadish, H. (1990). [A compression technique to materialize transitive closure](https://doi.org/10.1145/99935.99944). +> *ACM Transactions on Database Systems (TODS)*, 15*(4)*, 558-598. + +for the original idea or + +> Y. Chen, Y. Chen, [An efficient algorithm for answering graph +> reachability queries](https://doi.org/10.1109/ICDE.2008.4497498), +> in: 2008 IEEE 24th International Conference on Data Engineering, April 2008, +> pp. 893–902. (PDF available via [Google Scholar](https://scholar.google.com/scholar?q=Y.%20Chen,%20Y.%20Chen,%20An%20efficient%20algorithm%20for%20answering%20graph%20reachability%20queries,%20in:%202008%20IEEE%2024th%20International%20Conference%20on%20Data%20Engineering,%20April%202008,%20pp.%20893902.).) + +for a more modern take. + +In practical terms, the chain cover assigns every event a +*chain ID* and *sequence number* (e.g. `(5,3)`), and maintains a map of *links* +between events in chains (e.g. `(5,3) -> (2,4)`) such that `A` is reachable by `B` +(i.e. `A` is in the auth chain of `B`) if and only if either: + +1. `A` and `B` have the same chain ID and `A`'s sequence number is less than `B`'s + sequence number; or +2. there is a link `L` between `B`'s chain ID and `A`'s chain ID such that + `L.start_seq_no` <= `B.seq_no` and `A.seq_no` <= `L.end_seq_no`. + +There are actually two potential implementations, one where we store links from +each chain to every other reachable chain (the transitive closure of the links +graph), and one where we remove redundant links (the transitive reduction of the +links graph) e.g. if we have chains `C3 -> C2 -> C1` then the link `C3 -> C1` +would not be stored. Synapse uses the former implementation so that it doesn't +need to recurse to test reachability between chains. This trades-off extra storage +in order to save CPU cycles and DB queries. + +### Example + +An example auth graph would look like the following, where chains have been +formed based on type/state_key and are denoted by colour and are labelled with +`(chain ID, sequence number)`. Links are denoted by the arrows (links in grey +are those that would be remove in the second implementation described above). + +![Example](auth_chain_diff.dot.png) + +Note that we don't include all links between events and their auth events, as +most of those links would be redundant. For example, all events point to the +create event, but each chain only needs the one link from it's base to the +create event. + +## Using the Index + +This index can be used to calculate the auth chain difference of the state sets +by looking at the chain ID and sequence numbers reachable from each state set: + +1. For every state set lookup the chain ID/sequence numbers of each state event +2. Use the index to find all chains and the maximum sequence number reachable + from each state set. +3. The auth chain difference is then all events in each chain that have sequence + numbers between the maximum sequence number reachable from *any* state set and + the minimum reachable by *all* state sets (if any). + +Note that steps 2 is effectively calculating the auth chain for each state set +(in terms of chain IDs and sequence numbers), and step 3 is calculating the +difference between the union and intersection of the auth chains. + +### Worked Example + +For example, given the above graph, we can calculate the difference between +state sets consisting of: + +1. `S1`: Alice's invite `(4,1)` and Bob's second join `(2,2)`; and +2. `S2`: Alice's second join `(4,3)` and Bob's first join `(2,1)`. + +Using the index we see that the following auth chains are reachable from each +state set: + +1. `S1`: `(1,1)`, `(2,2)`, `(3,1)` & `(4,1)` +2. `S2`: `(1,1)`, `(2,1)`, `(3,2)` & `(4,3)` + +And so, for each the ranges that are in the auth chain difference: +1. Chain 1: None, (since everything can reach the create event). +2. Chain 2: The range `(1, 2]` (i.e. just `2`), as `1` is reachable by all state + sets and the maximum reachable is `2` (corresponding to Bob's second join). +3. Chain 3: Similarly the range `(1, 2]` (corresponding to the second power + level). +4. Chain 4: The range `(1, 3]` (corresponding to both of Alice's joins). + +So the final result is: Bob's second join `(2,2)`, the second power level +`(3,2)` and both of Alice's joins `(4,2)` & `(4,3)`. diff --git a/docs/development/internal_documentation/cas.md b/docs/development/internal_documentation/cas.md new file mode 100644
index 0000000000..7c0668e034 --- /dev/null +++ b/docs/development/internal_documentation/cas.md
@@ -0,0 +1,64 @@ +# How to test CAS as a developer without a server + +The [django-mama-cas](https://github.com/jbittel/django-mama-cas) project is an +easy to run CAS implementation built on top of Django. + +## Prerequisites + +1. Create a new virtualenv: `python3 -m venv <your virtualenv>` +2. Activate your virtualenv: `source /path/to/your/virtualenv/bin/activate` +3. Install Django and django-mama-cas: + ```sh + python -m pip install "django<3" "django-mama-cas==2.4.0" + ``` +4. Create a Django project in the current directory: + ```sh + django-admin startproject cas_test . + ``` +5. Follow the [install directions](https://django-mama-cas.readthedocs.io/en/latest/installation.html#configuring) for django-mama-cas +6. Setup the SQLite database: `python manage.py migrate` +7. Create a user: + ```sh + python manage.py createsuperuser + ``` + 1. Use whatever you want as the username and password. + 2. Leave the other fields blank. +8. Use the built-in Django test server to serve the CAS endpoints on port 8000: + ```sh + python manage.py runserver + ``` + +You should now have a Django project configured to serve CAS authentication with +a single user created. + +## Configure Synapse (and Element) to use CAS + +1. Modify your `homeserver.yaml` to enable CAS and point it to your locally + running Django test server: + ```yaml + cas_config: + enabled: true + server_url: "http://localhost:8000" + service_url: "http://localhost:8081" + #displayname_attribute: name + #required_attributes: + # name: value + ``` +2. Restart Synapse. + +Note that the above configuration assumes the homeserver is running on port 8081 +and that the CAS server is on port 8000, both on localhost. + +## Testing the configuration + +Then in Element: + +1. Visit the login page with a Element pointing at your homeserver. +2. Click the Single Sign-On button. +3. Login using the credentials created with `createsuperuser`. +4. You should be logged in. + +If you want to repeat this process you'll need to manually logout first: + +1. http://localhost:8000/admin/ +2. Click "logout" in the top right. diff --git a/docs/development/internal_documentation/media_repository.md b/docs/development/internal_documentation/media_repository.md new file mode 100644
index 0000000000..23e6da7f31 --- /dev/null +++ b/docs/development/internal_documentation/media_repository.md
@@ -0,0 +1,78 @@ +# Media Repository + +*Synapse implementation-specific details for the media repository* + +The media repository + * stores avatars, attachments and their thumbnails for media uploaded by local + users. + * caches avatars, attachments and their thumbnails for media uploaded by remote + users. + * caches resources and thumbnails used for URL previews. + +All media in Matrix can be identified by a unique +[MXC URI](https://spec.matrix.org/latest/client-server-api/#matrix-content-mxc-uris), +consisting of a server name and media ID: +``` +mxc://<server-name>/<media-id> +``` + +## Local Media +Synapse generates 24 character media IDs for content uploaded by local users. +These media IDs consist of upper and lowercase letters and are case-sensitive. +Other homeserver implementations may generate media IDs differently. + +Local media is recorded in the `local_media_repository` table, which includes +metadata such as MIME types, upload times and file sizes. +Note that this table is shared by the URL cache, which has a different media ID +scheme. + +### Paths +A file with media ID `aabbcccccccccccccccccccc` and its `128x96` `image/jpeg` +thumbnail, created by scaling, would be stored at: +``` +local_content/aa/bb/cccccccccccccccccccc +local_thumbnails/aa/bb/cccccccccccccccccccc/128-96-image-jpeg-scale +``` + +## Remote Media +When media from a remote homeserver is requested from Synapse, it is assigned +a local `filesystem_id`, with the same format as locally-generated media IDs, +as described above. + +A record of remote media is stored in the `remote_media_cache` table, which +can be used to map remote MXC URIs (server names and media IDs) to local +`filesystem_id`s. + +### Paths +A file from `matrix.org` with `filesystem_id` `aabbcccccccccccccccccccc` and its +`128x96` `image/jpeg` thumbnail, created by scaling, would be stored at: +``` +remote_content/matrix.org/aa/bb/cccccccccccccccccccc +remote_thumbnail/matrix.org/aa/bb/cccccccccccccccccccc/128-96-image-jpeg-scale +``` +Older thumbnails may omit the thumbnailing method: +``` +remote_thumbnail/matrix.org/aa/bb/cccccccccccccccccccc/128-96-image-jpeg +``` + +Note that `remote_thumbnail/` does not have an `s`. + +## URL Previews + +When generating previews for URLs, Synapse may download and cache various +resources, including images. These resources are assigned temporary media IDs +of the form `yyyy-mm-dd_aaaaaaaaaaaaaaaa`, where `yyyy-mm-dd` is the current +date and `aaaaaaaaaaaaaaaa` is a random sequence of 16 case-sensitive letters. + +The metadata for these cached resources is stored in the +`local_media_repository` and `local_media_repository_url_cache` tables. + +Resources for URL previews are deleted after a few days. + +### Paths +The file with media ID `yyyy-mm-dd_aaaaaaaaaaaaaaaa` and its `128x96` +`image/jpeg` thumbnail, created by scaling, would be stored at: +``` +url_cache/yyyy-mm-dd/aaaaaaaaaaaaaaaa +url_cache_thumbnails/yyyy-mm-dd/aaaaaaaaaaaaaaaa/128-96-image-jpeg-scale +``` diff --git a/docs/development/internal_documentation/room-dag-concepts.md b/docs/development/internal_documentation/room-dag-concepts.md new file mode 100644
index 0000000000..76709487f8 --- /dev/null +++ b/docs/development/internal_documentation/room-dag-concepts.md
@@ -0,0 +1,113 @@ +# Room DAG concepts + +## Edges + +The word "edge" comes from graph theory lingo. An edge is just a connection +between two events. In Synapse, we connect events by specifying their +`prev_events`. A subsequent event points back at a previous event. + +``` +A (oldest) <---- B <---- C (most recent) +``` + + +## Depth and stream ordering + +Events are normally sorted by `(topological_ordering, stream_ordering)` where +`topological_ordering` is just `depth`. In other words, we first sort by `depth` +and then tie-break based on `stream_ordering`. `depth` is incremented as new +messages are added to the DAG. Normally, `stream_ordering` is an auto +incrementing integer, but backfilled events start with `stream_ordering=-1` and decrement. + +--- + + - `/sync` returns things in the order they arrive at the server (`stream_ordering`). + - `/messages` (and `/backfill` in the federation API) return them in the order determined by the event graph `(topological_ordering, stream_ordering)`. + +The general idea is that, if you're following a room in real-time (i.e. +`/sync`), you probably want to see the messages as they arrive at your server, +rather than skipping any that arrived late; whereas if you're looking at a +historical section of timeline (i.e. `/messages`), you want to see the best +representation of the state of the room as others were seeing it at the time. + +## Outliers + +We mark an event as an `outlier` when we haven't figured out the state for the +room at that point in the DAG yet. They are "floating" events that we haven't +yet correlated to the DAG. + +Outliers typically arise when we fetch the auth chain or state for a given +event. When that happens, we just grab the events in the state/auth chain, +without calculating the state at those events, or backfilling their +`prev_events`. Since we don't have the state at any events fetched in that +way, we mark them as outliers. + +So, typically, we won't have the `prev_events` of an `outlier` in the database, +(though it's entirely possible that we *might* have them for some other +reason). Other things that make outliers different from regular events: + + * We don't have state for them, so there should be no entry in + `event_to_state_groups` for an outlier. (In practice this isn't always + the case, though I'm not sure why: see https://github.com/matrix-org/synapse/issues/12201). + + * We don't record entries for them in the `event_edges`, + `event_forward_extremeties` or `event_backward_extremities` tables. + +Since outliers are not tied into the DAG, they do not normally form part of the +timeline sent down to clients via `/sync` or `/messages`; however there is an +exception: + +### Out-of-band membership events + +A special case of outlier events are some membership events for federated rooms +that we aren't full members of. For example: + + * invites received over federation, before we join the room + * *rejections* for said invites + * knock events for rooms that we would like to join but have not yet joined. + +In all the above cases, we don't have the state for the room, which is why they +are treated as outliers. They are a bit special though, in that they are +proactively sent to clients via `/sync`. + +## Forward extremity + +Most-recent-in-time events in the DAG which are not referenced by any other +events' `prev_events` yet. (In this definition, outliers, rejected events, and +soft-failed events don't count.) + +The forward extremities of a room (or at least, a subset of them, if there are +more than ten) are used as the `prev_events` when the next event is sent. + +The "current state" of a room (ie: the state which would be used if we +generated a new event) is, therefore, the resolution of the room states +at each of the forward extremities. + +## Backward extremity + +The current marker of where we have backfilled up to and will generally be the +`prev_events` of the oldest-in-time events we have in the DAG. This gives a starting point when +backfilling history. + +Note that, unlike forward extremities, we typically don't have any backward +extremity events themselves in the database - or, if we do, they will be "outliers" (see +above). Either way, we don't expect to have the room state at a backward extremity. + +When we persist a non-outlier event, if it was previously a backward extremity, +we clear it as a backward extremity and set all of its `prev_events` as the new +backward extremities if they aren't already persisted as non-outliers. This +therefore keeps the backward extremities up-to-date. + +## State groups + +For every non-outlier event we need to know the state at that event. Instead of +storing the full state for each event in the DB (i.e. a `event_id -> state` +mapping), which is *very* space inefficient when state doesn't change, we +instead assign each different set of state a "state group" and then have +mappings of `event_id -> state_group` and `state_group -> state`. + + +### Stage group edges + +TODO: `state_group_edges` is a further optimization... + notes from @Azrenbeth, https://pastebin.com/seUGVGeT diff --git a/docs/development/internal_documentation/room_and_user_statistics.md b/docs/development/internal_documentation/room_and_user_statistics.md new file mode 100644
index 0000000000..cc38c890bb --- /dev/null +++ b/docs/development/internal_documentation/room_and_user_statistics.md
@@ -0,0 +1,22 @@ +Room and User Statistics +======================== + +Synapse maintains room and user statistics in various tables. These can be used +for administrative purposes but are also used when generating the public room +directory. + + +# Synapse Developer Documentation + +## High-Level Concepts + +### Definitions + +* **subject**: Something we are tracking stats about – currently a room or user. +* **current row**: An entry for a subject in the appropriate current statistics + table. Each subject can have only one. + +### Overview + +Stats correspond to the present values. Current rows contain the most up-to-date +statistics for a room. Each subject can only have one entry. diff --git a/docs/development/internal_documentation/saml.md b/docs/development/internal_documentation/saml.md new file mode 100644
index 0000000000..b08bcb7419 --- /dev/null +++ b/docs/development/internal_documentation/saml.md
@@ -0,0 +1,40 @@ +# How to test SAML as a developer without a server + +https://fujifish.github.io/samling/samling.html (https://github.com/fujifish/samling) is a great resource for being able to tinker with the +SAML options within Synapse without needing to deploy and configure a complicated software stack. + +To make Synapse (and therefore Element) use it: + +1. Use the samling.html URL above or deploy your own and visit the IdP Metadata tab. +2. Copy the XML to your clipboard. +3. On your Synapse server, create a new file `samling.xml` next to your `homeserver.yaml` with + the XML from step 2 as the contents. +4. Edit your `homeserver.yaml` to include: + ```yaml + saml2_config: + sp_config: + allow_unknown_attributes: true # Works around a bug with AVA Hashes: https://github.com/IdentityPython/pysaml2/issues/388 + metadata: + local: ["samling.xml"] + ``` +5. Ensure that your `homeserver.yaml` has a setting for `public_baseurl`: + ```yaml + public_baseurl: http://localhost:8080/ + ``` +6. Run `apt-get install xmlsec1` and `pip install --upgrade --force 'pysaml2>=4.5.0'` to ensure + the dependencies are installed and ready to go. +7. Restart Synapse. + +Then in Element: + +1. Visit the login page and point Element towards your homeserver using the `public_baseurl` above. +2. Click the Single Sign-On button. +3. On the samling page, enter a Name Identifier and add a SAML Attribute for `uid=your_localpart`. + The response must also be signed. +4. Click "Next". +5. Click "Post Response" (change nothing). +6. You should be logged in. + +If you try and repeat this process, you may be automatically logged in using the information you +gave previously. To fix this, open your developer console (`F12` or `Ctrl+Shift+I`) while on the +samling page and clear the site data. In Chrome, this will be a button on the Application tab.