8 files changed, 490 insertions, 0 deletions
diff --git a/docs/development/internal_documentation/auth_chain_diff.dot b/docs/development/internal_documentation/auth_chain_diff.dot
new file mode 100644
index 0000000000..978d579ada
--- /dev/null
+++ b/docs/development/internal_documentation/auth_chain_diff.dot
@@ -0,0 +1,32 @@
+digraph auth {
+    nodesep=0.5;
+    rankdir="RL";
+
+    C [label="Create (1,1)"];
+
+    BJ [label="Bob's Join (2,1)", color=red];
+    BJ2 [label="Bob's Join (2,2)", color=red];
+    BJ2 -> BJ [color=red, dir=none];
+
+    subgraph cluster_foo {
+        A1 [label="Alice's invite (4,1)", color=blue];
+        A2 [label="Alice's Join (4,2)", color=blue];
+        A3 [label="Alice's Join (4,3)", color=blue];
+        A3 -> A2 -> A1 [color=blue, dir=none];
+        color=none;
+    }
+
+    PL1 [label="Power Level (3,1)", color=darkgreen];
+    PL2 [label="Power Level (3,2)", color=darkgreen];
+    PL2 -> PL1 [color=darkgreen, dir=none];
+
+    {rank = same; C; BJ; PL1; A1;}
+
+    A1 -> C [color=grey];
+    A1 -> BJ [color=grey];
+    PL1 -> C [color=grey];
+    BJ2 -> PL1 [penwidth=2];
+
+    A3 -> PL2 [penwidth=2];
+    A1 -> PL1 -> BJ -> C [penwidth=2];
+}
diff --git a/docs/development/internal_documentation/auth_chain_diff.dot.png b/docs/development/internal_documentation/auth_chain_diff.dot.png
new file mode 100644
index 0000000000..771c07308f
--- /dev/null
+++ b/docs/development/internal_documentation/auth_chain_diff.dot.png
Binary files differdiff --git a/docs/development/internal_documentation/auth_chain_difference_algorithm.md b/docs/development/internal_documentation/auth_chain_difference_algorithm.md
new file mode 100644
index 0000000000..ebc9de25b8
--- /dev/null
+++ b/docs/development/internal_documentation/auth_chain_difference_algorithm.md
@@ -0,0 +1,141 @@
+# Auth Chain Difference Algorithm
+
+The auth chain difference algorithm is used by V2 state resolution, where a
+naive implementation can be a significant source of CPU and DB usage.
+
+### Definitions
+
+A *state set* is a set of state events; e.g. the input of a state resolution
+algorithm is a collection of state sets.
+
+The *auth chain* of a set of events are all the events' auth events and *their*
+auth events, recursively (i.e. the events reachable by walking the graph induced
+by an event's auth events links).
+
+The *auth chain difference* of a collection of state sets is the union minus the
+intersection of the sets of auth chains corresponding to the state sets, i.e an
+event is in the auth chain difference if it is reachable by walking the auth
+event graph from at least one of the state sets but not from *all* of the state
+sets.
+
+## Breadth First Walk Algorithm
+
+A way of calculating the auth chain difference without calculating the full auth
+chains for each state set is to do a parallel breadth first walk (ordered by
+depth) of each state set's auth chain. By tracking which events are reachable
+from each state set we can finish early if every pending event is reachable from
+every state set.
+
+This can work well for state sets that have a small auth chain difference, but
+can be very inefficient for larger differences. However, this algorithm is still
+used if we don't have a chain cover index for the room (e.g. because we're in
+the process of indexing it).
+
+## Chain Cover Index
+
+Synapse computes auth chain differences by pre-computing a "chain cover" index
+for the auth chain in a room, allowing us to efficiently make reachability queries
+like "is event `A` in the auth chain of event `B`?". We could do this with an index
+that tracks all pairs `(A, B)` such that `A` is in the auth chain of `B`. However, this
+would be prohibitively large, scaling poorly as the room accumulates more state
+events.
+
+Instead, we break down the graph into *chains*. A chain is a subset of a DAG
+with the following property: for any pair of events `E` and `F` in the chain,
+the chain contains a path `E -> F` or a path `F -> E`. This forces a chain to be
+linear (without forks), e.g. `E -> F -> G -> ... -> H`. Each event in the chain
+is given a *sequence number* local to that chain. The oldest event `E` in the
+chain has sequence number 1. If `E` has a child `F` in the chain, then `F` has
+sequence number 2. If `E` has a grandchild `G` in the chain, then `G` has
+sequence number 3; and so on.
+
+Synapse ensures that each persisted event belongs to exactly one chain, and
+tracks how the chains are connected to one another. This allows us to
+efficiently answer reachability queries. Doing so uses less storage than
+tracking reachability on an event-by-event basis, particularly when we have
+fewer and longer chains. See
+
+> Jagadish, H. (1990). [A compression technique to materialize transitive closure](https://doi.org/10.1145/99935.99944).
+> *ACM Transactions on Database Systems (TODS)*, 15*(4)*, 558-598.
+
+for the original idea or
+
+> Y. Chen, Y. Chen, [An efficient algorithm for answering graph
+> reachability queries](https://doi.org/10.1109/ICDE.2008.4497498),
+> in: 2008 IEEE 24th International Conference on Data Engineering, April 2008,
+> pp. 893–902. (PDF available via [Google Scholar](https://scholar.google.com/scholar?q=Y.%20Chen,%20Y.%20Chen,%20An%20efficient%20algorithm%20for%20answering%20graph%20reachability%20queries,%20in:%202008%20IEEE%2024th%20International%20Conference%20on%20Data%20Engineering,%20April%202008,%20pp.%20893902.).)
+
+for a more modern take.
+
+In practical terms, the chain cover assigns every event a
+*chain ID* and *sequence number* (e.g. `(5,3)`), and maintains a map of *links*
+between events in chains (e.g. `(5,3) -> (2,4)`) such that `A` is reachable by `B`
+(i.e. `A` is in the auth chain of `B`) if and only if either:
+
+1. `A` and `B` have the same chain ID and `A`'s sequence number is less than `B`'s
+   sequence number; or
+2. there is a link `L` between `B`'s chain ID and `A`'s chain ID such that
+   `L.start_seq_no` <= `B.seq_no` and `A.seq_no` <= `L.end_seq_no`.
+
+There are actually two potential implementations, one where we store links from
+each chain to every other reachable chain (the transitive closure of the links
+graph), and one where we remove redundant links (the transitive reduction of the
+links graph) e.g. if we have chains `C3 -> C2 -> C1` then the link `C3 -> C1`
+would not be stored. Synapse uses the former implementation so that it doesn't
+need to recurse to test reachability between chains. This trades-off extra storage
+in order to save CPU cycles and DB queries.
+
+### Example
+
+An example auth graph would look like the following, where chains have been
+formed based on type/state_key and are denoted by colour and are labelled with
+`(chain ID, sequence number)`. Links are denoted by the arrows (links in grey
+are those that would be remove in the second implementation described above).
+
+![Example](auth_chain_diff.dot.png)
+
+Note that we don't include all links between events and their auth events, as
+most of those links would be redundant. For example, all events point to the
+create event, but each chain only needs the one link from it's base to the
+create event.
+
+## Using the Index
+
+This index can be used to calculate the auth chain difference of the state sets
+by looking at the chain ID and sequence numbers reachable from each state set:
+
+1. For every state set lookup the chain ID/sequence numbers of each state event
+2. Use the index to find all chains and the maximum sequence number reachable
+   from each state set.
+3. The auth chain difference is then all events in each chain that have sequence
+   numbers between the maximum sequence number reachable from *any* state set and
+   the minimum reachable by *all* state sets (if any).
+
+Note that steps 2 is effectively calculating the auth chain for each state set
+(in terms of chain IDs and sequence numbers), and step 3 is calculating the
+difference between the union and intersection of the auth chains.
+
+### Worked Example
+
+For example, given the above graph, we can calculate the difference between
+state sets consisting of:
+
+1. `S1`: Alice's invite `(4,1)` and Bob's second join `(2,2)`; and
+2. `S2`: Alice's second join `(4,3)` and Bob's first join `(2,1)`.
+
+Using the index we see that the following auth chains are reachable from each
+state set:
+
+1. `S1`: `(1,1)`, `(2,2)`, `(3,1)` & `(4,1)`
+2. `S2`: `(1,1)`, `(2,1)`, `(3,2)` & `(4,3)`
+
+And so, for each the ranges that are in the auth chain difference:
+1. Chain 1: None, (since everything can reach the create event).
+2. Chain 2: The range `(1, 2]` (i.e. just `2`), as `1` is reachable by all state
+   sets and the maximum reachable is `2` (corresponding to Bob's second join).
+3. Chain 3: Similarly the range `(1, 2]` (corresponding to the second power
+   level).
+4. Chain 4: The range `(1, 3]` (corresponding to both of Alice's joins).
+
+So the final result is: Bob's second join `(2,2)`, the second power level
+`(3,2)` and both of Alice's joins `(4,2)` & `(4,3)`.
diff --git a/docs/development/internal_documentation/cas.md b/docs/development/internal_documentation/cas.md
new file mode 100644
index 0000000000..7c0668e034
--- /dev/null
+++ b/docs/development/internal_documentation/cas.md
@@ -0,0 +1,64 @@
+# How to test CAS as a developer without a server
+
+The [django-mama-cas](https://github.com/jbittel/django-mama-cas) project is an
+easy to run CAS implementation built on top of Django.
+
+## Prerequisites
+
+1. Create a new virtualenv: `python3 -m venv <your virtualenv>`
+2. Activate your virtualenv: `source /path/to/your/virtualenv/bin/activate`
+3. Install Django and django-mama-cas:
+   ```sh
+   python -m pip install "django<3" "django-mama-cas==2.4.0"
+   ```
+4. Create a Django project in the current directory:
+   ```sh
+   django-admin startproject cas_test .
+   ```
+5. Follow the [install directions](https://django-mama-cas.readthedocs.io/en/latest/installation.html#configuring) for django-mama-cas
+6. Setup the SQLite database: `python manage.py migrate`
+7. Create a user:
+   ```sh
+   python manage.py createsuperuser
+   ```
+   1. Use whatever you want as the username and password.
+   2. Leave the other fields blank.
+8. Use the built-in Django test server to serve the CAS endpoints on port 8000:
+   ```sh
+   python manage.py runserver
+   ```
+
+You should now have a Django project configured to serve CAS authentication with
+a single user created.
+
+## Configure Synapse (and Element) to use CAS
+
+1. Modify your `homeserver.yaml` to enable CAS and point it to your locally
+   running Django test server:
+   ```yaml
+   cas_config:
+     enabled: true
+     server_url: "http://localhost:8000"
+     service_url: "http://localhost:8081"
+     #displayname_attribute: name
+     #required_attributes:
+     #    name: value
+   ```
+2. Restart Synapse.
+
+Note that the above configuration assumes the homeserver is running on port 8081
+and that the CAS server is on port 8000, both on localhost.
+
+## Testing the configuration
+
+Then in Element:
+
+1. Visit the login page with a Element pointing at your homeserver.
+2. Click the Single Sign-On button.
+3. Login using the credentials created with `createsuperuser`.
+4. You should be logged in.
+
+If you want to repeat this process you'll need to manually logout first:
+
+1. http://localhost:8000/admin/
+2. Click "logout" in the top right.
diff --git a/docs/development/internal_documentation/media_repository.md b/docs/development/internal_documentation/media_repository.md
new file mode 100644
index 0000000000..23e6da7f31
--- /dev/null
+++ b/docs/development/internal_documentation/media_repository.md
@@ -0,0 +1,78 @@
+# Media Repository 
+
+*Synapse implementation-specific details for the media repository*
+
+The media repository
+ * stores avatars, attachments and their thumbnails for media uploaded by local
+   users.
+ * caches avatars, attachments and their thumbnails for media uploaded by remote
+   users.
+ * caches resources and thumbnails used for URL previews.
+
+All media in Matrix can be identified by a unique
+[MXC URI](https://spec.matrix.org/latest/client-server-api/#matrix-content-mxc-uris),
+consisting of a server name and media ID:
+```
+mxc://<server-name>/<media-id>
+```
+
+## Local Media
+Synapse generates 24 character media IDs for content uploaded by local users.
+These media IDs consist of upper and lowercase letters and are case-sensitive.
+Other homeserver implementations may generate media IDs differently.
+
+Local media is recorded in the `local_media_repository` table, which includes
+metadata such as MIME types, upload times and file sizes.
+Note that this table is shared by the URL cache, which has a different media ID
+scheme.
+
+### Paths
+A file with media ID `aabbcccccccccccccccccccc` and its `128x96` `image/jpeg`
+thumbnail, created by scaling, would be stored at:
+```
+local_content/aa/bb/cccccccccccccccccccc
+local_thumbnails/aa/bb/cccccccccccccccccccc/128-96-image-jpeg-scale
+```
+
+## Remote Media
+When media from a remote homeserver is requested from Synapse, it is assigned
+a local `filesystem_id`, with the same format as locally-generated media IDs,
+as described above.
+
+A record of remote media is stored in the `remote_media_cache` table, which
+can be used to map remote MXC URIs (server names and media IDs) to local
+`filesystem_id`s.
+
+### Paths
+A file from `matrix.org` with `filesystem_id` `aabbcccccccccccccccccccc` and its
+`128x96` `image/jpeg` thumbnail, created by scaling, would be stored at:
+```
+remote_content/matrix.org/aa/bb/cccccccccccccccccccc
+remote_thumbnail/matrix.org/aa/bb/cccccccccccccccccccc/128-96-image-jpeg-scale
+```
+Older thumbnails may omit the thumbnailing method:
+```
+remote_thumbnail/matrix.org/aa/bb/cccccccccccccccccccc/128-96-image-jpeg
+```
+
+Note that `remote_thumbnail/` does not have an `s`.
+
+## URL Previews
+
+When generating previews for URLs, Synapse may download and cache various
+resources, including images. These resources are assigned temporary media IDs
+of the form `yyyy-mm-dd_aaaaaaaaaaaaaaaa`, where `yyyy-mm-dd` is the current
+date and `aaaaaaaaaaaaaaaa` is a random sequence of 16 case-sensitive letters.
+
+The metadata for these cached resources is stored in the
+`local_media_repository` and `local_media_repository_url_cache` tables.
+
+Resources for URL previews are deleted after a few days.
+
+### Paths
+The file with media ID `yyyy-mm-dd_aaaaaaaaaaaaaaaa` and its `128x96`
+`image/jpeg` thumbnail, created by scaling, would be stored at:
+```
+url_cache/yyyy-mm-dd/aaaaaaaaaaaaaaaa
+url_cache_thumbnails/yyyy-mm-dd/aaaaaaaaaaaaaaaa/128-96-image-jpeg-scale
+```
diff --git a/docs/development/internal_documentation/room-dag-concepts.md b/docs/development/internal_documentation/room-dag-concepts.md
new file mode 100644
index 0000000000..76709487f8
--- /dev/null
+++ b/docs/development/internal_documentation/room-dag-concepts.md
@@ -0,0 +1,113 @@
+# Room DAG concepts
+
+## Edges
+
+The word "edge" comes from graph theory lingo. An edge is just a connection
+between two events. In Synapse, we connect events by specifying their
+`prev_events`. A subsequent event points back at a previous event.
+
+```
+A (oldest) <---- B <---- C (most recent)
+```
+
+
+## Depth and stream ordering
+
+Events are normally sorted by `(topological_ordering, stream_ordering)` where
+`topological_ordering` is just `depth`. In other words, we first sort by `depth`
+and then tie-break based on `stream_ordering`. `depth` is incremented as new
+messages are added to the DAG. Normally, `stream_ordering` is an auto
+incrementing integer, but backfilled events start with `stream_ordering=-1` and decrement.
+
+---
+
+ - `/sync` returns things in the order they arrive at the server (`stream_ordering`).
+ - `/messages` (and `/backfill` in the federation API) return them in the order determined by the event graph `(topological_ordering, stream_ordering)`.
+
+The general idea is that, if you're following a room in real-time (i.e.
+`/sync`), you probably want to see the messages as they arrive at your server,
+rather than skipping any that arrived late; whereas if you're looking at a
+historical section of timeline (i.e. `/messages`), you want to see the best
+representation of the state of the room as others were seeing it at the time.
+
+## Outliers
+
+We mark an event as an `outlier` when we haven't figured out the state for the
+room at that point in the DAG yet. They are "floating" events that we haven't
+yet correlated to the DAG.
+
+Outliers typically arise when we fetch the auth chain or state for a given
+event. When that happens, we just grab the events in the state/auth chain,
+without calculating the state at those events, or backfilling their
+`prev_events`. Since we don't have the state at any events fetched in that
+way, we mark them as outliers.
+
+So, typically, we won't have the `prev_events` of an `outlier` in the database,
+(though it's entirely possible that we *might* have them for some other
+reason). Other things that make outliers different from regular events:
+
+ * We don't have state for them, so there should be no entry in
+   `event_to_state_groups` for an outlier. (In practice this isn't always
+   the case, though I'm not sure why: see https://github.com/matrix-org/synapse/issues/12201).
+
+ * We don't record entries for them in the `event_edges`,
+   `event_forward_extremeties` or `event_backward_extremities` tables.
+
+Since outliers are not tied into the DAG, they do not normally form part of the
+timeline sent down to clients via `/sync` or `/messages`; however there is an
+exception:
+
+### Out-of-band membership events
+
+A special case of outlier events are some membership events for federated rooms
+that we aren't full members of. For example:
+
+ * invites received over federation, before we join the room
+ * *rejections* for said invites
+ * knock events for rooms that we would like to join but have not yet joined.
+
+In all the above cases, we don't have the state for the room, which is why they
+are treated as outliers. They are a bit special though, in that they are
+proactively sent to clients via `/sync`.
+
+## Forward extremity
+
+Most-recent-in-time events in the DAG which are not referenced by any other
+events' `prev_events` yet. (In this definition, outliers, rejected events, and
+soft-failed events don't count.)
+
+The forward extremities of a room (or at least, a subset of them, if there are
+more than ten) are used as the `prev_events` when the next event is sent.
+
+The "current state" of a room (ie: the state which would be used if we
+generated a new event) is, therefore, the resolution of the room states
+at each of the forward extremities.
+
+## Backward extremity
+
+The current marker of where we have backfilled up to and will generally be the
+`prev_events` of the oldest-in-time events we have in the DAG. This gives a starting point when
+backfilling history.
+
+Note that, unlike forward extremities, we typically don't have any backward
+extremity events themselves in the database - or, if we do, they will be "outliers" (see
+above). Either way, we don't expect to have the room state at a backward extremity.
+
+When we persist a non-outlier event, if it was previously a backward extremity,
+we clear it as a backward extremity and set all of its `prev_events` as the new
+backward extremities if they aren't already persisted as non-outliers. This
+therefore keeps the backward extremities up-to-date.
+
+## State groups
+
+For every non-outlier event we need to know the state at that event. Instead of
+storing the full state for each event in the DB (i.e. a `event_id -> state`
+mapping), which is *very* space inefficient when state doesn't change, we
+instead assign each different set of state a "state group" and then have
+mappings of `event_id -> state_group` and `state_group -> state`.
+
+
+### Stage group edges
+
+TODO: `state_group_edges` is a further optimization...
+      notes from @Azrenbeth, https://pastebin.com/seUGVGeT
diff --git a/docs/development/internal_documentation/room_and_user_statistics.md b/docs/development/internal_documentation/room_and_user_statistics.md
new file mode 100644
index 0000000000..cc38c890bb
--- /dev/null
+++ b/docs/development/internal_documentation/room_and_user_statistics.md
@@ -0,0 +1,22 @@
+Room and User Statistics
+========================
+
+Synapse maintains room and user statistics in various tables. These can be used
+for administrative purposes but are also used when generating the public room
+directory.
+
+
+# Synapse Developer Documentation
+
+## High-Level Concepts
+
+### Definitions
+
+* **subject**: Something we are tracking stats about – currently a room or user.
+* **current row**: An entry for a subject in the appropriate current statistics
+    table. Each subject can have only one.
+
+### Overview
+
+Stats correspond to the present values. Current rows contain the most up-to-date
+statistics for a room. Each subject can only have one entry.
diff --git a/docs/development/internal_documentation/saml.md b/docs/development/internal_documentation/saml.md
new file mode 100644
index 0000000000..b08bcb7419
--- /dev/null
+++ b/docs/development/internal_documentation/saml.md
@@ -0,0 +1,40 @@
+# How to test SAML as a developer without a server
+
+https://fujifish.github.io/samling/samling.html (https://github.com/fujifish/samling) is a great resource for being able to tinker with the 
+SAML options within Synapse without needing to deploy and configure a complicated software stack.
+
+To make Synapse (and therefore Element) use it:
+
+1. Use the samling.html URL above or deploy your own and visit the IdP Metadata tab.
+2. Copy the XML to your clipboard.
+3. On your Synapse server, create a new file `samling.xml` next to your `homeserver.yaml` with
+   the XML from step 2 as the contents.
+4. Edit your `homeserver.yaml` to include:
+   ```yaml
+   saml2_config:
+     sp_config:
+       allow_unknown_attributes: true  # Works around a bug with AVA Hashes: https://github.com/IdentityPython/pysaml2/issues/388
+       metadata:
+         local: ["samling.xml"]
+   ```
+5. Ensure that your `homeserver.yaml` has a setting for `public_baseurl`:
+   ```yaml
+   public_baseurl: http://localhost:8080/
+   ```
+6. Run `apt-get install xmlsec1` and `pip install --upgrade --force 'pysaml2>=4.5.0'` to ensure
+   the dependencies are installed and ready to go.
+7. Restart Synapse.
+
+Then in Element:
+
+1. Visit the login page and point Element towards your homeserver using the `public_baseurl` above.
+2. Click the Single Sign-On button.
+3. On the samling page, enter a Name Identifier and add a SAML Attribute for `uid=your_localpart`.
+   The response must also be signed.
+4. Click "Next".
+5. Click "Post Response" (change nothing).
+6. You should be logged in.
+
+If you try and repeat this process, you may be automatically logged in using the information you
+gave previously. To fix this, open your developer console (`F12` or `Ctrl+Shift+I`) while on the
+samling page and clear the site data. In Chrome, this will be a button on the Application tab.