From 57e5fd481df1b7eb04e3afff286da66877fb8fa0 Mon Sep 17 00:00:00 2001
From: H-Shay

When joining a room for the first time, Synapse 1.76.0rc1 will request a partial join from the other server by default. Previously, server admins had to opt in to this using an experimental config flag.

Upgrading to v1.76.0
Faster joins are enabled by default

When joining a room for the first time, Synapse 1.76.0 will request a partial join from the other server by default. Previously, server admins had to opt in to this using an experimental config flag.

Server admins can opt out of this feature for the time being by setting

    experimental:
        faster_joins: false
@@ -17325,6 +17325,347 @@ workers understand to mean to expand to invalidate the correct caches.
cs_cache_fake
─ invalidates caches that depend on the current
state
+This is a work-in-progress set of notes with two goals:
+See also MSC3902.
+The key idea is described by MSC3706. This allows servers to
+request a lightweight response to the federation /send_join endpoint.
+This is called a faster join, also known as a partial join. In these
+notes we'll usually use the word "partial" as it matches the database schema.
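The response to a partial join consists of
+- the requested join event J,
+- a list of the servers in the room (according to the state before J),
+- a subset of the state of the room before J,
+- the full auth chain of that state subset.
+
+Synapse marks the room as partially joined by adding a row to the database table
+partial_state_rooms. It also marks the join event J as "partially stated",
+meaning that we have neither received nor computed the full state before/after
+J. This is done by adding a row to partial_state_events.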
matrix=> \d partial_state_events
+Table "matrix.partial_state_events"
+ Column │ Type │ Collation │ Nullable │ Default
+══════════╪══════╪═══════════╪══════════╪═════════
+ room_id │ text │ │ not null │
+ event_id │ text │ │ not null │
+
+matrix=> \d partial_state_rooms
+ Table "matrix.partial_state_rooms"
+ Column │ Type │ Collation │ Nullable │ Default
+════════════════════════╪════════╪═══════════╪══════════╪═════════
+ room_id │ text │ │ not null │
+ device_lists_stream_id │ bigint │ │ not null │ 0
+ join_event_id │ text │ │ │
+ joined_via │ text │ │ │
+
+matrix=> \d partial_state_rooms_servers
+ Table "matrix.partial_state_rooms_servers"
+ Column │ Type │ Collation │ Nullable │ Default
+═════════════╪══════╪═══════════╪══════════╪═════════
+ room_id │ text │ │ not null │
+ server_name │ text │ │ not null │
+
+Indices, foreign-keys and check constraints are omitted for brevity.
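To make the bookkeeping concrete, here is a minimal sketch of the three tables and the rows written at partial-join time. It uses an in-memory SQLite database purely for illustration. The helper names `record_partial_join` and `is_partial_state_room` are hypothetical, and populating partial_state_rooms_servers from the join response is an assumption; none of this is Synapse's actual storage code.

```python
import sqlite3

# In-memory stand-in for the homeserver database. Indices, foreign keys and
# check constraints are omitted, as in the schema listing above.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE partial_state_rooms (
        room_id TEXT NOT NULL,
        device_lists_stream_id BIGINT NOT NULL DEFAULT 0,
        join_event_id TEXT,
        joined_via TEXT
    );
    CREATE TABLE partial_state_rooms_servers (
        room_id TEXT NOT NULL,
        server_name TEXT NOT NULL
    );
    CREATE TABLE partial_state_events (
        room_id TEXT NOT NULL,
        event_id TEXT NOT NULL
    );
    """
)


def record_partial_join(conn, room_id, join_event_id, joined_via, servers):
    """Hypothetical helper: mark the room and its join event J as partial."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT INTO partial_state_rooms (room_id, join_event_id, joined_via)"
            " VALUES (?, ?, ?)",
            (room_id, join_event_id, joined_via),
        )
        # Assumption: the server list from the join response is stored here.
        conn.executemany(
            "INSERT INTO partial_state_rooms_servers (room_id, server_name)"
            " VALUES (?, ?)",
            [(room_id, server) for server in servers],
        )
        conn.execute(
            "INSERT INTO partial_state_events (room_id, event_id) VALUES (?, ?)",
            (room_id, join_event_id),
        )


def is_partial_state_room(conn, room_id):
    """Hypothetical helper: a room is partial while it has a row in the table."""
    row = conn.execute(
        "SELECT 1 FROM partial_state_rooms WHERE room_id = ?", (room_id,)
    ).fetchone()
    return row is not None


record_partial_join(
    conn,
    "!room:example.org",
    "$J",
    "remote.example.org",
    ["remote.example.org", "other.example.org"],
)
```

The point of the sketch is only that "partially joined" and "partially stated" are both represented by the presence of a row, so undoing them later is a matter of deleting rows.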
+While partially joined to a room, Synapse receives events E from remote
+homeservers as normal, and can create events at the request of its local users.
+However, we run into trouble when we enforce the checks on an event.
+
+> 1. Is a valid event, otherwise it is dropped. For an event to be valid, it
+>    must contain a room_id, and it must comply with the event format of that
+>    room version.
+> 2. Passes signature checks, otherwise it is dropped.
+> 3. Passes hash checks, otherwise it is redacted before being processed further.
+> 4. Passes authorization rules based on the event’s auth events, otherwise it
+>    is rejected.
+> 5. Passes authorization rules based on the state before the event, otherwise
+>    it is rejected.
+> 6. Passes authorization rules based on the current state of the room,
+>    otherwise it is “soft failed”.
+
We can enforce checks 1--4 without any problems.
+But we cannot enforce checks 5 or 6 with complete certainty, since Synapse does
+not know the full state before E, nor that of the room.
Instead, we make a best-effort approximation.
+While the room is considered partially joined, Synapse tracks the "partial
+state" before events.
+This works in a similar way to regular state:
+
+- The partial state before J is that given to us by the partial join response.
+- The partial state before an event E is the resolution of the partial states
+  after each of E's prev_events.
+- If E is rejected or a message event, the partial state after E is the
+  partial state before E.
+- Otherwise, the partial state after E is the partial state before E, plus
+  E itself.
+
+More concisely, partial state propagates just like full state; the only
+difference is that we "seed" it with an incomplete initial state.
+Synapse records that we have only calculated partial state for this event with
+a row in partial_state_events.
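The propagation rules can be sketched in a few lines. This is a toy model under loud assumptions: the `Event` class, `partial_state_after` and `propagate` are hypothetical names, and `toy_resolve` is a placeholder that merely merges maps; it is not the Matrix state resolution algorithm.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

# (event type, state key) -> event ID
StateMap = Dict[Tuple[str, str], str]


@dataclass
class Event:
    event_id: str
    type: str
    state_key: Optional[str]  # None for message (non-state) events
    prev_events: List[str]
    rejected: bool = False


def toy_resolve(states: List[StateMap]) -> StateMap:
    """Placeholder for state resolution: later maps win on conflict.

    Assumption for illustration only; real state resolution is far subtler.
    """
    merged: StateMap = {}
    for state in states:
        merged.update(state)
    return merged


def partial_state_after(event: Event, before: StateMap) -> StateMap:
    """Rejected events and message events leave the state unchanged."""
    if event.rejected or event.state_key is None:
        return dict(before)
    after = dict(before)
    after[(event.type, event.state_key)] = event.event_id
    return after


def propagate(
    join: Event, state_before_join: StateMap, later_events: List[Event]
) -> Dict[str, StateMap]:
    """Partial state after each event; later_events must be topologically sorted."""
    state_after = {join.event_id: partial_state_after(join, state_before_join)}
    for event in later_events:
        before = toy_resolve([state_after[p] for p in event.prev_events])
        state_after[event.event_id] = partial_state_after(event, before)
    return state_after


# The "seed": an incomplete state before J, as given by the join response.
seed = {("m.room.create", ""): "$create", ("m.room.power_levels", ""): "$pl"}
join = Event("$J", "m.room.member", "@alice:local.example", ["$prev"])
events = [
    Event("$M", "m.room.message", None, ["$J"]),
    Event("$T", "m.room.topic", "", ["$M"]),
    Event("$R", "m.room.name", "", ["$T"], rejected=True),
]
after = propagate(join, seed, events)
```

Nothing in `propagate` knows that the seed is incomplete, which is the sense in which partial state propagates just like full state.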
While the room remains partially stated, check 5 on incoming events to that
+room becomes:
+
+> 5. Passes authorization rules based on the resolution between the partial
+>    state before E and E's auth events. If the event fails to pass
+>    authorization rules, it is rejected.
+
Additionally, check 6 is deleted: no soft-failures are enforced.
+While partially joined, the current partial state of the room is defined as the
+resolution across the partial states after all forward extremities in the room.
+
+Remark. Events with partial state are not considered outliers.
+
+Using partial state means the auth checks can fail in a few different ways.[1]
+
+[1] Is this exhaustive?
+
+(Note that the discrepancies described in the last two bullets are user-visible.)
+This means that we have to be very careful when we want to look up pieces of room
+state in a partially-joined room. Our approximation of the state may be
+incorrect or missing. But we can make some educated guesses. If
+then we proceed as normal, and let the resync process fix up any mistakes (see
+below).
+
+When is our partial state likely to be correct?
+
+In short, we deem it acceptable to trust the partial state for non-membership
+and local membership events. For remote membership events, we wait for the
+resync to complete, at which point we have the full state of the room and can
+proceed as normal.
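The rule of thumb in the last paragraph can be written as a small predicate. This is a sketch only: `can_trust_partial_state` is a hypothetical name, and reading "local membership event" as "membership event whose state key is a user on our own server" is an assumption.

```python
def can_trust_partial_state(
    event_type: str, state_key: str, our_server_name: str
) -> bool:
    """Return True if a state lookup may use partial state.

    Returns False for remote membership, where we should wait for the resync.
    """
    if event_type != "m.room.member":
        # Non-membership state is expected to be present in the partial state.
        return True
    # A membership event's state key is a user ID of the form
    # "@localpart:server_name"; the server name may itself contain a port.
    server_name = state_key.split(":", 1)[1]
    return server_name == our_server_name


topic_ok = can_trust_partial_state("m.room.topic", "", "local.example")
local_ok = can_trust_partial_state(
    "m.room.member", "@alice:local.example", "local.example"
)
remote_ok = can_trust_partial_state(
    "m.room.member", "@bob:remote.example:8448", "local.example"
)
```

A caller that gets False back would block until the room is no longer partially stated, rather than guess.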
+The partial-state approximation is only a temporary affair. In the background,
+Synapse begins a "resync" process. This is a continuous loop, starting at the
+partial join event and proceeding downwards through the event graph. For each
+E seen in the room since partial join, Synapse will fetch
+
+- the event IDs in the state of the room before E, via /state_ids;
+- the event IDs in the full auth chain of E, included in the /state_ids
+  response; and
+- any events from the previous two bullets that Synapse hasn't already seen.
+
This means Synapse has (or can compute) the full state before E, which allows
+Synapse to properly authorise or reject E. At this point, the event
+is considered to have "full state" rather than "partial state". We record this
+by removing E from the partial_state_events table.
[TODO: Does Synapse persist a new state group for the full state
+before E, or do we alter the (partial-)state group in-place? Are state groups
+ever marked as partially-stated?]
This scheme means it is possible for us to have accepted and sent an event to
+clients, only to reject it during the resync. From a client's perspective, the
+effect is similar to a retroactive
+state change due to state resolution---i.e. a "state reset".[2]
+
+Clients should refresh caches to detect such a change. Rumour has it that
+sliding sync will fix this.
+When all events since the join J have been fully-stated, the room resync
+process is complete. We record this by removing the room from
+partial_state_rooms.
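The overall shape of the resync can be sketched as a loop over the room's partially-stated events. Everything here is a simplification: `resync_room` is a hypothetical name, the tables are modelled as plain Python containers, and "fetching" a missing event is reduced to adding its ID to a set. The `pdu_ids` and `auth_chain_ids` keys are the fields of the federation /state_ids response.

```python
from typing import Callable, Dict, List, Set


def resync_room(
    room_id: str,
    partial_state_events: Dict[str, List[str]],
    partial_state_rooms: Set[str],
    known_events: Set[str],
    fetch_state_ids: Callable[[str, str], Dict[str, List[str]]],
) -> List[str]:
    """Un-partial-state every event in the room, then the room itself.

    Returns the IDs of the events that had to be fetched along the way.
    """
    fetched: List[str] = []
    pending = partial_state_events.setdefault(room_id, [])
    # Keep looping until nothing is left: new partial-state events may be
    # appended to `pending` while the resync is in progress.
    while pending:
        event_id = pending[0]
        response = fetch_state_ids(room_id, event_id)
        wanted = set(response["pdu_ids"]) | set(response["auth_chain_ids"])
        for missing in sorted(wanted - known_events):
            known_events.add(missing)  # stand-in for fetching and persisting
            fetched.append(missing)
        # The event now has full state: drop its partial_state_events row.
        pending.pop(0)
    # Every event is fully stated, so the room is too.
    partial_state_rooms.discard(room_id)
    return fetched


def fake_state_ids(room_id: str, event_id: str) -> Dict[str, List[str]]:
    """Canned response standing in for a remote homeserver."""
    return {"pdu_ids": ["$create", "$pl"], "auth_chain_ids": ["$create"]}


events_table = {"!r:example.org": ["$J", "$E1"]}
rooms_table = {"!r:example.org"}
known = {"$create"}
fetched = resync_room(
    "!r:example.org", events_table, rooms_table, known, fake_state_ids
)
```

The sketch leaves out the re-authorisation of each event against its full state, which is where a previously accepted event may end up rejected.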
For the time being, the resync process happens on the master worker.
+A new replication stream un_partial_stated_room is added. Whenever a resync
+completes and a partial-state room becomes fully stated, a new message is sent
+into that stream containing the room ID.
+NB. The notes below are rough. Some of them are hidden under <details>
+disclosures because they have yet to be implemented in mainline Synapse.
When sending out messages during a partial join, we assume our partial state is
+accurate and proceed as normal. For this to have any hope of succeeding at all,
+our partial state must contain an entry for each of the (type, state key) pairs
+specified by the auth rules:
+
+- m.room.create
+- m.room.join_rules
+- m.room.power_levels
+- m.room.third_party_invite
+- m.room.member
+
The first four of these should be present in the state before J that is given
+to us in the partial join response; only membership events are omitted. In order
+for us to consider the user joined, we must have their membership event. That
+means the only possible omission is the target's membership in an invite, kick
+or ban.
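As a rough illustration of that requirement, the following sketch reports which entries a local sender would need that are absent from a partial state map. `missing_auth_entries` is a hypothetical helper. It deliberately ignores m.room.third_party_invite (whose state key is an invite token) and the target's membership, and a real room may legitimately lack some of these events.

```python
from typing import Dict, List, Tuple

StateKey = Tuple[str, str]  # (event type, state key)


def missing_auth_entries(
    partial_state: Dict[StateKey, str], sender: str
) -> List[StateKey]:
    """List the auth-relevant entries for `sender` absent from partial_state."""
    wanted: List[StateKey] = [
        ("m.room.create", ""),
        ("m.room.join_rules", ""),
        ("m.room.power_levels", ""),
        # The sender's own membership: needed to consider them joined at all.
        ("m.room.member", sender),
    ]
    return [key for key in wanted if key not in partial_state]


state = {
    ("m.room.create", ""): "$create",
    ("m.room.join_rules", ""): "$jr",
    ("m.room.power_levels", ""): "$pl",
    ("m.room.member", "@alice:local.example"): "$J",
}
alice_missing = missing_auth_entries(state, "@alice:local.example")
bob_missing = missing_auth_entries(state, "@bob:local.example")
```

An empty result for a local sender matches the claim above: the non-membership entries come from the join response, and the sender's membership must be known for them to be joined.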
The worst possibility is that we locally invite someone who is banned according to
+the full state, because we lack their ban in our current partial state. The rest
+of the federation---at least, those who are fully joined---should correctly
+enforce the membership transition constraints. So any erroneous invite should be ignored by fully-joined
+homeservers and resolved by the resync for partially-joined homeservers.
+More generally, there are two problems we're worrying about here:
+
+However, we expect such problems to be unlikely in practice, because
+
+TODO: needs prose fleshing out.
+Normally: send out in a fed txn to all HSes in the room.
+We only know that some HSes were in the room at some point. Wat do.
+Send it out to the list of servers from the first join.
+TODO what do we do here if we have full state?
+If the prev event was created by us, we can risk sending it to the wrong HS. (Motivation: privacy concern of the content. Not such a big deal for a public room or an encrypted room. But non-encrypted invite-only...)
+But don't want to send out sensitive data in other HS's events in this way.
+
+Suppose we discover after resync that we shouldn't have sent out one of our events (not a prev_event) to a target HS. Not much we can do.
+What about if we didn't send them an event but should've?
+E.g. what if someone joined from a new HS shortly after you did? We wouldn't talk to them.
+Could imagine sending out the "Missed" events after the resync but... painful to work out what they should have seen if they joined/left.
+Instead, just send them the latest event (if they're still in the room after resync) and let them backfill.(?)
+NB. Not yet implemented.
+TODO: needs prose fleshing out. Liaise with Matthieu. Explain why /send_join
+(Rich was surprised we didn't just create it locally. Answer: to try and avoid
+a join which then gets rejected after resync.)
+
+We don't know for sure that any join we create would be accepted.
+E.g. the joined user might have been banned; the join rules might have changed in a way that we didn't realise... some way in which the partial state was mistaken.
+Instead, do another partial make-join/send-join handshake to confirm that the join works.
+NB. Not yet implemented.
+When you're fully joined to a room, to have U leave a room their homeserver
+needs to
+
+- create a new leave event for U which will be accepted by other homeservers,
+  and
+- send that event for U out to the homeservers in the federation.
+
When is a leave event accepted? See
+v10 auth rules:
+
+> 4. If type is m.room.member: [...]
+>
+>    5. If membership is leave:
+>
+>       1. If the sender matches state_key, allow if and only if that user’s current membership state is invite, join, or knock.
+>       2. [...]
+
I think this means that (well-formed!) self-leaves are governed entirely by
+4.5.1. This means that if we correctly calculate state which says that U is
+invited, joined or knocked and include it in the leave's auth events, our event
+is accepted by checks 4 and 5 on incoming events.
+
+> 4. Passes authorization rules based on the event’s auth events, otherwise
+>    it is rejected.
+> 5. Passes authorization rules based on the state before the event, otherwise
+>    it is rejected.
+
The only way to fail check 6 is if the receiving server's current state of the
+room says that U is banned, has left, or has no membership event. But this is
+fine: the receiving server already thinks that U isn't in the room.
+
+> 6. Passes authorization rules based on the current state of the room,
+>    otherwise it is “soft failed”.
+
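Rule 4.5.1 quoted above is small enough to state as code. This is a sketch of that single sub-rule only; `self_leave_allowed` is a hypothetical name, and leaves sent on behalf of another user (kicks, unbans) fall under the elided sub-rules and are not modelled.

```python
from typing import Optional

# Memberships from which a user may leave of their own accord (rule 4.5.1).
LEAVABLE_MEMBERSHIPS = ("invite", "join", "knock")


def self_leave_allowed(
    sender: str, state_key: str, current_membership: Optional[str]
) -> bool:
    """Apply rule 4.5.1 to a leave event where the sender is the target.

    current_membership is None when the user has no membership event.
    """
    if sender != state_key:
        raise ValueError("not a self-leave; other sub-rules apply")
    return current_membership in LEAVABLE_MEMBERSHIPS


user = "@u:local.example"
from_join = self_leave_allowed(user, user, "join")
from_ban = self_leave_allowed(user, user, "ban")
from_nothing = self_leave_allowed(user, user, None)
```

The same predicate explains the check-6 argument: a receiving server whose current state gives False here already regards U as outside the room.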
For the second point (publishing the leave event), the best thing we can do is
+to publish to all HSes we know to be currently in the room. If they miss that
+event, they might send us traffic in the room that we don't care about. This is
+a problem with leaving after a "full" join; we don't seek to fix this with
+partial joins.
+
+(With that said: there's nothing machine-readable in the /send response. I don't
+think we can deduce "destination has left the room" from a failure to /send an
+event into that room?)
+We can create leave events and can choose what gets included in our auth events,
+so we can be sure that we pass check 4 on incoming events. For check 5, we might
+have an incorrect view of the state before an event.
+The only way we might erroneously think a leave is valid is if
+
+- the partial state before the leave has U joined, invited or knocked, but
+- the full state before the leave has U banned, left or not present,
+
in which case the leave doesn't make anything worse: other HSes already consider
+us as not in the room, and will continue to do so after seeing the leave.
+The remaining obstacle is then: can we safely broadcast the leave event? We may
+miss servers or incorrectly think that a server is in the room. Or the
+destination server may be offline and miss the transaction containing our leave
+event. This should self-heal when they see an event whose prev_events descend
+from our leave.
Another option we considered was to use federation /send_leave to ask a
+fully-joined server to send out the event on our behalf. But that introduces
+complexity without much benefit. Besides, as Rich put it,
+
+> sending out leaves is pretty best-effort currently
+
so this is probably good enough as-is.
+TODO: what cleanup is necessary? Is it all just nice-to-have to save unused
+work?
+This section covers implementation documentation for various parts of Synapse.
If a developer is planning to make a change to a feature of Synapse, it can be useful for
--
cgit 1.5.1