From 57e5fd481df1b7eb04e3afff286da66877fb8fa0 Mon Sep 17 00:00:00 2001 From: H-Shay Date: Tue, 31 Jan 2023 18:37:46 +0000 Subject: deploy: 41d177ca4a3d617eb82044048211c0d6b276f3aa --- develop/print.html | 345 ++++++++++++++++++++++++++++++++++++++++++++++++++++- 1 file changed, 343 insertions(+), 2 deletions(-) (limited to 'develop/print.html') diff --git a/develop/print.html b/develop/print.html index fd9de924d8..1a466291fb 100644 --- a/develop/print.html +++ b/develop/print.html @@ -77,7 +77,7 @@ @@ -1773,7 +1773,7 @@ dpkg -i matrix-synapse-py3_1.3.0+stretch1_amd64.deb

Upgrading to v1.76.0

Faster joins are enabled by default

-

When joining a room for the first time, Synapse 1.76.0rc1 will request a partial join from the other server by default. Previously, server admins had to opt-in to this using an experimental config flag.

+

When joining a room for the first time, Synapse 1.76.0 will request a partial join from the other server by default. Previously, server admins had to opt-in to this using an experimental config flag.

Server admins can opt out of this feature for the time being by setting

experimental:
     faster_joins: false
@@ -17325,6 +17325,347 @@ workers understand to mean to expand to invalidate the correct caches.

  • cs_cache_fake ─ invalidates caches that depend on the current state
  • +

    How do faster joins work?

    +

    This is a work-in-progress set of notes with two goals:

    +
      +
    • act as a reference, explaining how Synapse implements faster joins; and
    • +
    • record the rationale behind our choices.
    • +
    +

    See also MSC3902.

    +

    The key idea is described by MSC706. This allows servers to +request a lightweight response to the federation /send_join endpoint. +This is called a faster join, also known as a partial join. In these +notes we'll usually use the word "partial" as it matches the database schema.

    +

    Overview: processing events in a partially-joined room

    +

    The response to a partial join consists of

    +
      +
    • the requested join event J,
    • +
    • a list of the servers in the room (according to the state before J),
    • +
    • a subset of the state of the room before J,
    • +
    • the full auth chain of that state subset.
    • +
    +

    Synapse marks the room as partially joined by adding a row to the database table +partial_state_rooms. It also marks the join event J as "partially stated", +meaning that we have neither received nor computed the full state before/after +J. This is done by adding a row to partial_state_events.

    +
    DB schema +
    matrix=> \d partial_state_events
    +Table "matrix.partial_state_events"
    +  Column  │ Type │ Collation │ Nullable │ Default
    +══════════╪══════╪═══════════╪══════════╪═════════
    + room_id  │ text │           │ not null │
    + event_id │ text │           │ not null │
    + 
    +matrix=> \d partial_state_rooms
    +                Table "matrix.partial_state_rooms"
    +         Column         │  Type  │ Collation │ Nullable │ Default 
    +════════════════════════╪════════╪═══════════╪══════════╪═════════
    + room_id                │ text   │           │ not null │ 
    + device_lists_stream_id │ bigint │           │ not null │ 0
    + join_event_id          │ text   │           │          │ 
    + joined_via             │ text   │           │          │ 
    +
    +matrix=> \d partial_state_rooms_servers
    +     Table "matrix.partial_state_rooms_servers"
    +   Column    │ Type │ Collation │ Nullable │ Default 
    +═════════════╪══════╪═══════════╪══════════╪═════════
    + room_id     │ text │           │ not null │ 
    + server_name │ text │           │ not null │ 
    +
    +

    Indices, foreign-keys and check constraints are omitted for brevity.

    +
    +

    While partially joined to a room, Synapse receives events E from remote +homeservers as normal, and can create events at the request of its local users. +However, we run into trouble when we enforce the checks on an event.

    +
    +
      +
    1. Is a valid event, otherwise it is dropped. For an event to be valid, it +must contain a room_id, and it must comply with the event format of that +room version.
    2. +
    3. Passes signature checks, otherwise it is dropped.
    4. +
    5. Passes hash checks, otherwise it is redacted before being processed further.
    6. +
    7. Passes authorization rules based on the event’s auth events, otherwise it +is rejected.
    8. +
    9. Passes authorization rules based on the state before the event, otherwise +it is rejected.
    10. +
    11. Passes authorization rules based on the current state of the room, +otherwise it is “soft failed”.
    12. +
    +
    +

    We can enforce checks 1--4 without any problems. +But we cannot enforce checks 5 or 6 with complete certainty, since Synapse does +not know the full state before E, nor that of the room.

    +

    Partial state

    +

    Instead, we make a best-effort approximation. +While the room is considered partially joined, Synapse tracks the "partial +state" before events. +This works in a similar way as regular state:

    +
      +
    • The partial state before J is that given to us by the partial join response.
    • +
    • The partial state before an event E is the resolution of the partial states +after each of E's prev_events.
    • +
    • If E is rejected or a message event, the partial state after E is the +partial state before E.
    • +
    • Otherwise, the partial state after E is the partial state before E, plus +E itself.
    • +
    +

    More concisely, partial state propagates just like full state; the only +difference is that we "seed" it with an incomplete initial state. +Synapse records that we have only calculated partial state for this event with +a row in partial_state_events.

    +

    While the room remains partially stated, check 5 on incoming events to that +room becomes:

    +
    +
      +
    1. Passes authorization rules based on the resolution between the partial +state before E and E's auth events. If the event fails to pass +authorization rules, it is rejected.
    2. +
    +
    +

    Additionally, check 6 is deleted: no soft-failures are enforced.

    +

    While partially joined, the current partial state of the room is defined as the +resolution across the partial states after all forward extremities in the room.

    +

    Remark. Events with partial state are not considered +outliers.

    +

    Approximation error

    +

    Using partial state means the auth checks can fail in a few different ways1.

    +
    1 +

    Is this exhaustive?

    +
    +
      +
    • We may erroneously accept an incoming event in check 5 based on partial state +when it would have been rejected based on full state, or vice versa.
    • +
    • This means that an event could erroneously be added to the current partial +state of the room when it would not be present in the full state of the room, +or vice versa.
    • +
    • Additionally, we may have skipped soft-failing an event that would have been +soft-failed based on full state.
    • +
    +

    (Note that the discrepancies described in the last two bullets are user-visible.)

    +

    This means that we have to be very careful when we want to lookup pieces of room +state in a partially-joined room. Our approximation of the state may be +incorrect or missing. But we can make some educated guesses. If

    +
      +
    • our partial state is likely to be correct, or
    • +
    • the consequences of our partial state being incorrect are minor,
    • +
    +

    then we proceed as normal, and let the resync process fix up any mistakes (see +below).

    +

    When is our partial state likely to be correct?

    +
      +
    • It's more accurate the closer we are to the partial join event. (So we should +ideally complete the resync as soon as possible.)
    • +
    • Non-member events: we will have received them as part of the partial join +response, if they were part of the room state at that point. We may +incorrectly accept or reject updates to that state (at first because we lack +remote membership information; later because of compounding errors), so these +can become incorrect over time.
    • +
    • Local members' memberships: we are the only ones who can create join and +knock events for our users. We can't be completely confident in the +correctness of bans, invites and kicks from other homeservers, but the resync +process should correct any mistakes.
    • +
    • Remote members' memberships: we did not receive these in the /send_join +response, so we have essentially no idea if these are correct or not.
    • +
    +

    In short, we deem it acceptable to trust the partial state for non-membership +and local membership events. For remote membership events, we wait for the +resync to complete, at which point we have the full state of the room and can +proceed as normal.

    +

    Fixing the approximation with a resync

    +

    The partial-state approximation is only a temporary affair. In the background, +synapse beings a "resync" process. This is a continuous loop, starting at the +partial join event and proceeding downwards through the event graph. For each +E seen in the room since partial join, Synapse will fetch

    +
      +
    • the event ids in the state of the room before E, via +/state_ids;
    • +
    • the event ids in the full auth chain of E, included in the /state_ids +response; and
    • +
    • any events from the previous two bullets that Synapse hasn't persisted, via +`/state.
    • +
    +

    This means Synapse has (or can compute) the full state before E, which allows +Synapse to properly authorise or reject E. At this point ,the event +is considered to have "full state" rather than "partial state". We record this +by removing E from the partial_state_events table.

    +

    [TODO: Does Synapse persist a new state group for the full state +before E, or do we alter the (partial-)state group in-place? Are state groups +ever marked as partially-stated? ]

    +

    This scheme means it is possible for us to have accepted and sent an event to +clients, only to reject it during the resync. From a client's perspective, the +effect is similar to a retroactive +state change due to state resolution---i.e. a "state reset".2

    +
    2 +

    Clients should refresh caches to detect such a change. Rumour has it that +sliding sync will fix this.

    +
    +

    When all events since the join J have been fully-stated, the room resync +process is complete. We record this by removing the room from +partial_state_rooms.

    +

    Faster joins on workers

    +

    For the time being, the resync process happens on the master worker. +A new replication stream un_partial_stated_room is added. Whenever a resync +completes and a partial-state room becomes fully stated, a new message is sent +into that stream containing the room ID.

    +

    Notes on specific cases

    +
    +

    NB. The notes below are rough. Some of them are hidden under <details> +disclosures because they have yet to be implemented in mainline Synapse.

    +
    +

    Creating events during a partial join

    +

    When sending out messages during a partial join, we assume our partial state is +accurate and proceed as normal. For this to have any hope of succeeding at all, +our partial state must contain an entry for each of the (type, state key) pairs +specified by the auth rules:

    +
      +
    • m.room.create
    • +
    • m.room.join_rules
    • +
    • m.room.power_levels
    • +
    • m.room.third_party_invite
    • +
    • m.room.member
    • +
    +

    The first four of these should be present in the state before J that is given +to us in the partial join response; only membership events are omitted. In order +for us to consider the user joined, we must have their membership event. That +means the only possible omission is the target's membership in an invite, kick +or ban.

    +

    The worst possibility is that we locally invite someone who is banned according to +the full state, because we lack their ban in our current partial state. The rest +of the federation---at least, those who are fully joined---should correctly +enforce the membership transition constraints. So any the erroneous invite should be ignored by fully-joined +homeservers and resolved by the resync for partially-joined homeservers.

    +

    In more generality, there are two problems we're worrying about here:

    +
      +
    • We might create an event that is valid under our partial state, only to later +find out that is actually invalid according to the full state.
    • +
    • Or: we might refuse to create an event that is invalid under our partial +state, even though it would be perfectly valid under the full state.
    • +
    +

    However we expect such problems to be unlikely in practise, because

    +
      +
    • We trust that the room has sensible power levels, e.g. that bad actors with +high power levels are demoted before their ban.
    • +
    • We trust that the resident server provides us up-to-date power levels, join +rules, etc.
    • +
    • State changes in rooms are relatively infrequent, and the resync period is +relatively quick.
    • +
    +

    Sending out the event over federation

    +

    TODO: needs prose fleshing out.

    +

    Normally: send out in a fed txn to all HSes in the room. +We only know that some HSes were in the room at some point. Wat do. +Send it out to the list of servers from the first join. +TODO what do we do here if we have full state? +If the prev event was created by us, we can risk sending it to the wrong HS. (Motivation: privacy concern of the content. Not such a big deal for a public room or an encrypted room. But non-encrypted invite-only...) +But don't want to send out sensitive data in other HS's events in this way.

    +

    Suppose we discover after resync that we shouldn't have sent out one our events (not a prev_event) to a target HS. Not much we can do. +What about if we didn't send them an event but shouldn't've? +E.g. what if someone joined from a new HS shortly after you did? We wouldn't talk to them. +Could imagine sending out the "Missed" events after the resync but... painful to work out what they shuld have seen if they joined/left. +Instead, just send them the latest event (if they're still in the room after resync) and let them backfill.(?)

    +
      +
    • Don't do this currently.
    • +
    • If anyone who has received our messages sends a message to a HS we missed, they can backfill our messages
    • +
    • Gap: rooms which are infrequently used and take a long time to resync.
    • +
    +

    Joining after a partial join

    +

    NB. Not yet implemented.

    +
    +

    TODO: needs prose fleshing out. Liase with Matthieu. Explain why /send_join +(Rich was surprised we didn't just create it locally. Answer: to try and avoid +a join which then gets rejected after resync.)

    +

    We don't know for sure that any join we create would be accepted. +E.g. the joined user might have been banned; the join rules might have changed in a way that we didn't realise... some way in which the partial state was mistaken. +Instead, do another partial make-join/send-join handshake to confirm that the join works.

    +
      +
    • Probably going to get a bunch of duplicate state events and auth events.... but the point of partial joins is that these should be small. Many are already persisted = good.
    • +
    • What if the second send_join response includes a different list of reisdent HSes? Could ignore it. +
        +
      • Could even have a special flag that says "just make me a join", i.e. don't bother giving me state or servers in room. Deffo want the auth chain tho.
      • +
      +
    • +
    • SQ: wrt device lists it's a lot safer to ignore it!!!!!
    • +
    • What if the state at the second join is inconsistent with what we have? Ignore it?
    • +
    +
    +

    Leaving (and kicks and bans) after a partial join

    +

    NB. Not yet implemented.

    +
    +

    When you're fully joined to a room, to have U leave a room their homeserver +needs to

    +
      +
    • create a new leave event for U which will be accepted by other homeservers, +and
    • +
    • send that event U out to the homeservers in the federation.
    • +
    +

    When is a leave event accepted? See +v10 auth rules:

    +
    +
      +
    1. If type is m.room.member: [...] +> +> 5. If membership is leave: +> +> 1. If the sender matches state_key, allow if and only if that user’s current membership state is invite, join, or knock. +2. [...]
    2. +
    +
    +

    I think this means that (well-formed!) self-leaves are governed entirely by +4.5.1. This means that if we correctly calculate state which says that U is +invited, joined or knocked and include it in the leave's auth events, our event +is accepted by checks 4 and 5 on incoming events.

    +
    +
      +
    1. Passes authorization rules based on the event’s auth events, otherwise +> it is rejected.
    2. +
    3. Passes authorization rules based on the state before the event, otherwise +> it is rejected.
    4. +
    +
    +

    The only way to fail check 6 is if the receiving server's current state of the +room says that U is banned, has left, or has no membership event. But this is +fine: the receiving server already thinks that U isn't in the room.

    +
    +
      +
    1. Passes authorization rules based on the current state of the room, +> otherwise it is “soft failed”.
    2. +
    +
    +

    For the second point (publishing the leave event), the best thing we can do is +to is publish to all HSes we know to be currently in the room. If they miss that +event, they might send us traffic in the room that we don't care about. This is +a problem with leaving after a "full" join; we don't seek to fix this with +partial joins.

    +

    (With that said: there's nothing machine-readable in the /send response. I don't +think we can deduce "destination has left the room" from a failure to /send an +event into that room?)

    +

    Can we still do this during a partial join?

    +

    We can create leave events and can choose what gets included in our auth events, +so we can be sure that we pass check 4 on incoming events. For check 5, we might +have an incorrect view of the state before an event. +The only way we might erroneously think a leave is valid is if

    +
      +
    • the partial state before the leave has U joined, invited or knocked, but
    • +
    • the full state before the leave has U banned, left or not present,
    • +
    +

    in which case the leave doesn't make anything worse: other HSes already consider +us as not in the room, and will continue to do so after seeing the leave.

    +

    The remaining obstacle is then: can we safely broadcast the leave event? We may +miss servers or incorrectly think that a server is in the room. Or the +destination server may be offline and miss the transaction containing our leave +event.This should self-heal when they see an event whose prev_events descends +from our leave.

    +

    Another option we considered was to use federation /send_leave to ask a +fully-joined server to send out the event on our behalf. But that introduces +complexity without much benefit. Besides, as Rich put it,

    +
    +

    sending out leaves is pretty best-effort currently

    +
    +

    so this is probably good enough as-is.

    +

    Cleanup after the last leave

    +

    TODO: what cleanup is necessary? Is it all just nice-to-have to save unused +work?

    +

    Internal Documentation

    This section covers implementation documentation for various parts of Synapse.

    If a developer is planning to make a change to a feature of Synapse, it can be useful for -- cgit 1.5.1