From e543fddc2aac8cc8fa530a09f666a2672739e44c Mon Sep 17 00:00:00 2001 From: sandhose Date: Tue, 9 Jul 2024 09:53:39 +0000 Subject: deploy: abb1384502f66ddde3fd0db844c4e719b01023ff --- .../synapse_architecture/cancellation.html | 551 +++++++++++++++++++++ .../synapse_architecture/faster_joins.html | 545 ++++++++++++++++++++ .../development/synapse_architecture/streams.html | 337 +++++++++++++ 3 files changed, 1433 insertions(+) create mode 100644 v1.111/development/synapse_architecture/cancellation.html create mode 100644 v1.111/development/synapse_architecture/faster_joins.html create mode 100644 v1.111/development/synapse_architecture/streams.html (limited to 'v1.111/development/synapse_architecture') diff --git a/v1.111/development/synapse_architecture/cancellation.html b/v1.111/development/synapse_architecture/cancellation.html new file mode 100644 index 0000000000..5102faf456 --- /dev/null +++ b/v1.111/development/synapse_architecture/cancellation.html @@ -0,0 +1,551 @@ + + + + + + Cancellation - Synapse + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+ +
+ + + + + + + +
+
+ +
+ +
+ +

Cancellation

+

Sometimes, requests take a long time to service and clients disconnect +before Synapse produces a response. To avoid wasting resources, Synapse +can cancel request processing for select endpoints marked with the +@cancellable decorator.

+

Synapse makes use of Twisted's Deferred.cancel() feature to make +cancellation work. The @cancellable decorator does nothing by itself +and merely acts as a flag, signalling to developers and other code alike +that a method can be cancelled.
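For context, such a flag decorator can be as simple as setting an attribute on the method; the snippet below is a minimal illustrative sketch rather than a copy of Synapse's actual implementation:

from typing import Any, Callable, TypeVar

F = TypeVar("F", bound=Callable[..., Any])

def cancellable(method: F) -> F:
    # Mark the method so that the request-handling code can later check
    # `getattr(handler_method, "cancellable", False)` before deciding
    # whether to call `Deferred.cancel()` on client disconnect.
    method.cancellable = True  # type: ignore[attr-defined]
    return method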

+

Enabling cancellation for an endpoint

+
    +
  1. Check that the endpoint method, and any async functions in its call tree, handle cancellation correctly. See Handling cancellation correctly for a list of things to look out for.
  2. Add the @cancellable decorator to the on_GET/POST/PUT/DELETE method. It's not recommended to make non-GET methods cancellable, since cancellation midway through some database updates is less likely to be handled correctly.
+

Mechanics

+

There are two stages to cancellation: downward propagation of a +cancel() call, followed by upwards propagation of a CancelledError +out of a blocked await. +Both Twisted and asyncio have a cancellation mechanism.

+ + + +
         | Method            | Exception                              | Exception inherits from
Twisted  | Deferred.cancel() | twisted.internet.defer.CancelledError  | Exception (!)
asyncio  | Task.cancel()     | asyncio.CancelledError                 | BaseException
+

Deferred.cancel()

+

When Synapse starts handling a request, it runs the async method +responsible for handling it using defer.ensureDeferred, which returns +a Deferred. For example:

+
def do_something() -> Deferred[None]:
+    ...
+
+@cancellable
+async def on_GET() -> Tuple[int, JsonDict]:
+    d = make_deferred_yieldable(do_something())
+    await d
+    return 200, {}
+
+request = defer.ensureDeferred(on_GET())
+
+

When a client disconnects early, Synapse checks for the presence of the +@cancellable decorator on on_GET. Since on_GET is cancellable, +Deferred.cancel() is called on the Deferred from +defer.ensureDeferred, ie. request. Twisted knows which Deferred +request is waiting on and passes the cancel() call on to d.

+

The Deferred being waited on, d, may have its own handling for +cancel() and pass the call on to other Deferreds.

+

Eventually, a Deferred handles the cancel() call by resolving itself +with a CancelledError.

+

CancelledError

+

The CancelledError gets raised out of the await and bubbles up, as +per normal Python exception handling.
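As a small self-contained illustration (not Synapse code), cancelling the outer Deferred surfaces the CancelledError at the await inside the coroutine:

from twisted.internet.defer import CancelledError, Deferred, ensureDeferred

async def handler() -> None:
    d: Deferred = Deferred()  # never fires on its own
    try:
        await d
    except CancelledError:
        print("cancelled while awaiting")
        raise  # let it bubble up, as per normal exception handling

request = ensureDeferred(handler())
request.cancel()  # propagates down to `d`, which errbacks with CancelledError
request.addErrback(lambda failure: failure.trap(CancelledError))  # expected failure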

+

Handling cancellation correctly

+

In general, when writing code that might be subject to cancellation, two +things must be considered:

+
    +
  • The effect of CancelledErrors raised out of awaits.
  • +
  • The effect of Deferreds being cancel()ed.
  • +
+

Examples of code that handles cancellation incorrectly include:

+
    +
  • try-except blocks which swallow CancelledErrors.
  • +
  • Code that shares the same Deferred, which may be cancelled, between +multiple requests.
  • +
  • Code that starts some processing that's exempt from cancellation, but +uses a logging context from cancellable code. The logging context +will be finished upon cancellation, while the uncancelled processing +is still using it.
  • +
+

Some common patterns are listed below in more detail.

+

async function calls

+

Most functions in Synapse are relatively straightforward from a +cancellation standpoint: they don't do anything with Deferreds and +purely call and await other async functions.

+

An async function handles cancellation correctly if its own code handles cancellation correctly and all the async functions it calls handle cancellation correctly. For example:

+
async def do_two_things() -> None:
+    check_something()
+    await do_something()
+    await do_something_else()
+
+

do_two_things handles cancellation correctly if do_something and +do_something_else handle cancellation correctly.

+

That is, when checking whether a function handles cancellation +correctly, its implementation and all its async function calls need to +be checked, recursively.

+

As check_something is not async, it does not need to be checked.

+

CancelledErrors

+

Because Twisted's CancelledErrors are Exceptions, it's easy to +accidentally catch and suppress them. Care must be taken to ensure that +CancelledErrors are allowed to propagate upwards.

+ + + + + + + + + +
+

Bad:

+
try:
+    await do_something()
+except Exception:
+    # `CancelledError` gets swallowed here.
+    logger.info(...)
+
+
+

Good:

+
try:
+    await do_something()
+except CancelledError:
+    raise
+except Exception:
+    logger.info(...)
+
+
+

OK:

+
try:
+    check_something()
+    # A `CancelledError` won't ever be raised here.
+except Exception:
+    logger.info(...)
+
+
+

Good:

+
try:
+    await do_something()
+except ValueError:
+    logger.info(...)
+
+
+

defer.gatherResults

+

defer.gatherResults produces a Deferred which:

+
    +
  • broadcasts cancel() calls to every Deferred being waited on.
  • +
  • wraps the first exception it sees in a FirstError.
  • +
+

Together, this means that CancelledErrors will be wrapped in +a FirstError unless unwrapped. Such FirstErrors are liable to be +swallowed, so they must be unwrapped.

+ + + + + +
+

Bad:

+
async def do_something() -> None:
+    await make_deferred_yieldable(
+        defer.gatherResults([...], consumeErrors=True)
+    )
+
+try:
+    await do_something()
+except CancelledError:
+    raise
+except Exception:
+    # `FirstError(CancelledError)` gets swallowed here.
+    logger.info(...)
+
+
+

Good:

+
async def do_something() -> None:
+    await make_deferred_yieldable(
+        defer.gatherResults([...], consumeErrors=True)
+    ).addErrback(unwrapFirstError)
+
+try:
+    await do_something()
+except CancelledError:
+    raise
+except Exception:
+    logger.info(...)
+
+
+

Creation of Deferreds

+

If a function creates a Deferred, the effect of cancelling it must be considered. Deferreds that get shared are likely to have unintended behaviour when cancelled.

+ + + + + + + + +
+

Bad:

+
cache: Dict[str, Deferred[None]] = {}
+
+def wait_for_room(room_id: str) -> Deferred[None]:
+    deferred = cache.get(room_id)
+    if deferred is None:
+        deferred = Deferred()
+        cache[room_id] = deferred
+    # `deferred` can have multiple waiters.
+    # All of them will observe a `CancelledError`
+    # if any one of them is cancelled.
+    return make_deferred_yieldable(deferred)
+
+# Request 1
+await wait_for_room("!aAAaaAaaaAAAaAaAA:matrix.org")
+# Request 2
+await wait_for_room("!aAAaaAaaaAAAaAaAA:matrix.org")
+
+
+

Good:

+
cache: Dict[str, Deferred[None]] = {}
+
+def wait_for_room(room_id: str) -> Deferred[None]:
+    deferred = cache.get(room_id)
+    if deferred is None:
+        deferred = Deferred()
+        cache[room_id] = deferred
+    # `deferred` will never be cancelled now.
+    # A `CancelledError` will still come out of
+    # the `await`.
+    # `delay_cancellation` may also be used.
+    return make_deferred_yieldable(stop_cancellation(deferred))
+
+# Request 1
+await wait_for_room("!aAAaaAaaaAAAaAaAA:matrix.org")
+# Request 2
+await wait_for_room("!aAAaaAaaaAAAaAaAA:matrix.org")
+
+
+ +

Good:

+
cache: Dict[str, List[Deferred[None]]] = {}
+
+def wait_for_room(room_id: str) -> Deferred[None]:
+    if room_id not in cache:
+        cache[room_id] = []
+    # Each request gets its own `Deferred` to wait on.
+    deferred = Deferred()
+    cache[room_id].append(deferred)
+    return make_deferred_yieldable(deferred)
+
+# Request 1
+await wait_for_room("!aAAaaAaaaAAAaAaAA:matrix.org")
+# Request 2
+await wait_for_room("!aAAaaAaaaAAAaAaAA:matrix.org")
+
+
+

Uncancelled processing

+

Some async functions may kick off some async processing which is +intentionally protected from cancellation, by stop_cancellation or +other means. If the async processing inherits the logcontext of the +request which initiated it, care must be taken to ensure that the +logcontext is not finished before the async processing completes.

+ + + + + + + + + +
+

Bad:

+
cache: Optional[ObservableDeferred[None]] = None
+
+async def do_something_else(
+    to_resolve: Deferred[None]
+) -> None:
+    await ...
+    logger.info("done!")
+    to_resolve.callback(None)
+
+async def do_something() -> None:
+    if not cache:
+        to_resolve = Deferred()
+        cache = ObservableDeferred(to_resolve)
+        # `do_something_else` will never be cancelled and
+        # can outlive the `request-1` logging context.
+        run_in_background(do_something_else, to_resolve)
+
+    await make_deferred_yieldable(cache.observe())
+
+with LoggingContext("request-1"):
+    await do_something()
+
+
+

Good:

+
cache: Optional[ObservableDeferred[None]] = None
+
+async def do_something_else(
+    to_resolve: Deferred[None]
+) -> None:
+    await ...
+    logger.info("done!")
+    to_resolve.callback(None)
+
+async def do_something() -> None:
+    if not cache:
+        to_resolve = Deferred()
+        cache = ObservableDeferred(to_resolve)
+        run_in_background(do_something_else, to_resolve)
+        # We'll wait until `do_something_else` is
+        # done before raising a `CancelledError`.
+        await make_deferred_yieldable(
+            delay_cancellation(cache.observe())
+        )
+    else:
+        await make_deferred_yieldable(cache.observe())
+
+with LoggingContext("request-1"):
+    await do_something()
+
+
+

OK:

+
cache: Optional[ObservableDeferred[None]] = None
+
+async def do_something_else(
+    to_resolve: Deferred[None]
+) -> None:
+    await ...
+    logger.info("done!")
+    to_resolve.callback(None)
+
+async def do_something() -> None:
+    if not cache:
+        to_resolve = Deferred()
+        cache = ObservableDeferred(to_resolve)
+        # `do_something_else` will get its own independent
+        # logging context. `request-1` will not count any
+        # metrics from `do_something_else`.
+        run_as_background_process(
+            "do_something_else",
+            do_something_else,
+            to_resolve,
+        )
+
+    await make_deferred_yieldable(cache.observe())
+
+with LoggingContext("request-1"):
+    await do_something()
+
+
+
+ +
+ + +
+
+ + + +
+ + + + + + + + + + + + + + + diff --git a/v1.111/development/synapse_architecture/faster_joins.html b/v1.111/development/synapse_architecture/faster_joins.html new file mode 100644 index 0000000000..ce7c0cb5af --- /dev/null +++ b/v1.111/development/synapse_architecture/faster_joins.html @@ -0,0 +1,545 @@ + + + + + + Faster remote joins - Synapse + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+ +
+ + + + + + + +
+
+ +
+ +
+ +

How do faster joins work?

+

This is a work-in-progress set of notes with two goals:

+
    +
  • act as a reference, explaining how Synapse implements faster joins; and
  • +
  • record the rationale behind our choices.
  • +
+

See also MSC3902.

+

The key idea is described by MSC3706. This allows servers to +request a lightweight response to the federation /send_join endpoint. +This is called a faster join, also known as a partial join. In these +notes we'll usually use the word "partial" as it matches the database schema.

+

Overview: processing events in a partially-joined room

+

The response to a partial join consists of

+
    +
  • the requested join event J,
  • +
  • a list of the servers in the room (according to the state before J),
  • +
  • a subset of the state of the room before J,
  • +
  • the full auth chain of that state subset.
  • +
+

Synapse marks the room as partially joined by adding a row to the database table +partial_state_rooms. It also marks the join event J as "partially stated", +meaning that we have neither received nor computed the full state before/after +J. This is done by adding a row to partial_state_events.

+
DB schema +
matrix=> \d partial_state_events
+Table "matrix.partial_state_events"
+  Column  │ Type │ Collation │ Nullable │ Default
+══════════╪══════╪═══════════╪══════════╪═════════
+ room_id  │ text │           │ not null │
+ event_id │ text │           │ not null │
+ 
+matrix=> \d partial_state_rooms
+                Table "matrix.partial_state_rooms"
+         Column         │  Type  │ Collation │ Nullable │ Default 
+════════════════════════╪════════╪═══════════╪══════════╪═════════
+ room_id                │ text   │           │ not null │ 
+ device_lists_stream_id │ bigint │           │ not null │ 0
+ join_event_id          │ text   │           │          │ 
+ joined_via             │ text   │           │          │ 
+
+matrix=> \d partial_state_rooms_servers
+     Table "matrix.partial_state_rooms_servers"
+   Column    │ Type │ Collation │ Nullable │ Default 
+═════════════╪══════╪═══════════╪══════════╪═════════
+ room_id     │ text │           │ not null │ 
+ server_name │ text │           │ not null │ 
+
+

Indices, foreign-keys and check constraints are omitted for brevity.

+
+

While partially joined to a room, Synapse receives events E from remote +homeservers as normal, and can create events at the request of its local users. +However, we run into trouble when we enforce the checks on an event.

+
+
    +
  1. Is a valid event, otherwise it is dropped. For an event to be valid, it must contain a room_id, and it must comply with the event format of that room version.
  2. Passes signature checks, otherwise it is dropped.
  3. Passes hash checks, otherwise it is redacted before being processed further.
  4. Passes authorization rules based on the event’s auth events, otherwise it is rejected.
  5. Passes authorization rules based on the state before the event, otherwise it is rejected.
  6. Passes authorization rules based on the current state of the room, otherwise it is “soft failed”.
+
+

We can enforce checks 1--4 without any problems. +But we cannot enforce checks 5 or 6 with complete certainty, since Synapse does +not know the full state before E, nor that of the room.

+

Partial state

+

Instead, we make a best-effort approximation. +While the room is considered partially joined, Synapse tracks the "partial +state" before events. +This works in a similar way as regular state:

+
    +
  • The partial state before J is that given to us by the partial join response.
  • +
  • The partial state before an event E is the resolution of the partial states +after each of E's prev_events.
  • +
  • If E is rejected or a message event, the partial state after E is the +partial state before E.
  • +
  • Otherwise, the partial state after E is the partial state before E, plus +E itself.
  • +
+

More concisely, partial state propagates just like full state; the only +difference is that we "seed" it with an incomplete initial state. +Synapse records that we have only calculated partial state for this event with +a row in partial_state_events.
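The following is a purely illustrative, self-contained sketch of that propagation rule; resolve_states here is a trivial stand-in for real state resolution, and the Event class is hypothetical rather than Synapse's actual model:

from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

StateMap = Dict[Tuple[str, str], str]  # (event type, state key) -> event ID

@dataclass
class Event:
    event_id: str
    type: str
    state_key: Optional[str]  # None for message events
    prev_events: List["Event"] = field(default_factory=list)
    rejected: bool = False

def resolve_states(states: List[StateMap]) -> StateMap:
    # Stand-in for real state resolution; here, last writer wins.
    resolved: StateMap = {}
    for s in states:
        resolved.update(s)
    return resolved

def partial_state_after(event: Event, seed: StateMap) -> StateMap:
    # The partial state before E resolves the partial states after its
    # prev_events; the join event J itself is seeded from the join response.
    if not event.prev_events:
        before = seed
    else:
        before = resolve_states(
            [partial_state_after(p, seed) for p in event.prev_events]
        )
    if event.rejected or event.state_key is None:
        # Rejected events and message events don't change the state.
        return before
    # Otherwise: the partial state before E, plus E itself.
    return {**before, (event.type, event.state_key): event.event_id}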

+

While the room remains partially stated, check 5 on incoming events to that +room becomes:

+
+
    +
  5'. Passes authorization rules based on the resolution between the partial state before E and E's auth events. If the event fails to pass authorization rules, it is rejected.
+
+

Additionally, check 6 is deleted: no soft-failures are enforced.

+

While partially joined, the current partial state of the room is defined as the +resolution across the partial states after all forward extremities in the room.

+

Remark. Events with partial state are not considered +outliers.

+

Approximation error

+

Using partial state means the auth checks can fail in a few different ways1.

+
1 +

Is this exhaustive?

+
+
    +
  • We may erroneously accept an incoming event in check 5 based on partial state +when it would have been rejected based on full state, or vice versa.
  • +
  • This means that an event could erroneously be added to the current partial +state of the room when it would not be present in the full state of the room, +or vice versa.
  • +
  • Additionally, we may have skipped soft-failing an event that would have been +soft-failed based on full state.
  • +
+

(Note that the discrepancies described in the last two bullets are user-visible.)

+

This means that we have to be very careful when we want to look up pieces of room state in a partially-joined room. Our approximation of the state may be incorrect or missing. But we can make some educated guesses. If

+
    +
  • our partial state is likely to be correct, or
  • +
  • the consequences of our partial state being incorrect are minor,
  • +
+

then we proceed as normal, and let the resync process fix up any mistakes (see +below).

+

When is our partial state likely to be correct?

+
    +
  • It's more accurate the closer we are to the partial join event. (So we should +ideally complete the resync as soon as possible.)
  • +
  • Non-member events: we will have received them as part of the partial join +response, if they were part of the room state at that point. We may +incorrectly accept or reject updates to that state (at first because we lack +remote membership information; later because of compounding errors), so these +can become incorrect over time.
  • +
  • Local members' memberships: we are the only ones who can create join and +knock events for our users. We can't be completely confident in the +correctness of bans, invites and kicks from other homeservers, but the resync +process should correct any mistakes.
  • +
  • Remote members' memberships: we did not receive these in the /send_join +response, so we have essentially no idea if these are correct or not.
  • +
+

In short, we deem it acceptable to trust the partial state for non-membership +and local membership events. For remote membership events, we wait for the +resync to complete, at which point we have the full state of the room and can +proceed as normal.

+

Fixing the approximation with a resync

+

The partial-state approximation is only a temporary affair. In the background, Synapse begins a "resync" process. This is a continuous loop, starting at the partial join event and proceeding downwards through the event graph. For each E seen in the room since partial join, Synapse will fetch

+
    +
  • the event ids in the state of the room before E, via +/state_ids;
  • +
  • the event ids in the full auth chain of E, included in the /state_ids +response; and
  • +
  • any events from the previous two bullets that Synapse hasn't persisted, via /state.
  • +
+

This means Synapse has (or can compute) the full state before E, which allows Synapse to properly authorise or reject E. At this point, the event is considered to have "full state" rather than "partial state". We record this by removing E from the partial_state_events table.
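Putting that together, the resync loop looks roughly like the sketch below. The callables passed in are hypothetical stand-ins for Synapse's federation client and storage layers, used here for illustration only:

from typing import Awaitable, Callable, Optional, Set

async def resync_partial_state_room(
    room_id: str,
    destination: str,
    next_partial_state_event: Callable[[str], Awaitable[Optional[str]]],
    fetch_state_and_auth_ids: Callable[[str, str, str], Awaitable[Set[str]]],
    fetch_missing_events: Callable[[str, str, Set[str]], Awaitable[None]],
    recheck_event_with_full_state: Callable[[str, str], Awaitable[None]],
    clear_partial_state_room: Callable[[str], Awaitable[None]],
) -> None:
    while True:
        event_id = await next_partial_state_event(room_id)
        if event_id is None:
            # Every event since the partial join is now fully stated.
            await clear_partial_state_room(room_id)
            return
        # /state_ids gives us the IDs of the state before the event and of
        # its full auth chain...
        ids = await fetch_state_and_auth_ids(destination, room_id, event_id)
        # ...we pull down any of those events we haven't persisted yet...
        await fetch_missing_events(destination, room_id, ids)
        # ...then re-authorise the event against the full state and remove it
        # from partial_state_events.
        await recheck_event_with_full_state(room_id, event_id)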

+

[TODO: Does Synapse persist a new state group for the full state +before E, or do we alter the (partial-)state group in-place? Are state groups +ever marked as partially-stated? ]

+

This scheme means it is possible for us to have accepted and sent an event to +clients, only to reject it during the resync. From a client's perspective, the +effect is similar to a retroactive +state change due to state resolution---i.e. a "state reset".2

+
2 +

Clients should refresh caches to detect such a change. Rumour has it that +sliding sync will fix this.

+
+

When all events since the join J have been fully-stated, the room resync +process is complete. We record this by removing the room from +partial_state_rooms.

+

Faster joins on workers

+

For the time being, the resync process happens on the master worker. +A new replication stream un_partial_stated_room is added. Whenever a resync +completes and a partial-state room becomes fully stated, a new message is sent +into that stream containing the room ID.

+

Notes on specific cases

+
+

NB. The notes below are rough. Some of them are hidden under <details> +disclosures because they have yet to be implemented in mainline Synapse.

+
+

Creating events during a partial join

+

When sending out messages during a partial join, we assume our partial state is +accurate and proceed as normal. For this to have any hope of succeeding at all, +our partial state must contain an entry for each of the (type, state key) pairs +specified by the auth rules:

+
    +
  • m.room.create
  • +
  • m.room.join_rules
  • +
  • m.room.power_levels
  • +
  • m.room.third_party_invite
  • +
  • m.room.member
  • +
+

The first four of these should be present in the state before J that is given +to us in the partial join response; only membership events are omitted. In order +for us to consider the user joined, we must have their membership event. That +means the only possible omission is the target's membership in an invite, kick +or ban.

+

The worst possibility is that we locally invite someone who is banned according to the full state, because we lack their ban in our current partial state. The rest of the federation---at least, those who are fully joined---should correctly enforce the membership transition constraints. So the erroneous invite should be ignored by fully-joined homeservers and resolved by the resync for partially-joined homeservers.

+

In more generality, there are two problems we're worrying about here:

+
    +
  • We might create an event that is valid under our partial state, only to later +find out that is actually invalid according to the full state.
  • +
  • Or: we might refuse to create an event that is invalid under our partial +state, even though it would be perfectly valid under the full state.
  • +
+

However, we expect such problems to be unlikely in practice, because

+
    +
  • We trust that the room has sensible power levels, e.g. that bad actors with +high power levels are demoted before their ban.
  • +
  • We trust that the resident server provides us up-to-date power levels, join +rules, etc.
  • +
  • State changes in rooms are relatively infrequent, and the resync period is +relatively quick.
  • +
+

Sending out the event over federation

+

TODO: needs prose fleshing out.

+

Normally: send out in a fed txn to all HSes in the room. +We only know that some HSes were in the room at some point. Wat do. +Send it out to the list of servers from the first join. +TODO what do we do here if we have full state? +If the prev event was created by us, we can risk sending it to the wrong HS. (Motivation: privacy concern of the content. Not such a big deal for a public room or an encrypted room. But non-encrypted invite-only...) +But don't want to send out sensitive data in other HS's events in this way.

+

Suppose we discover after resync that we shouldn't have sent out one of our events (not a prev_event) to a target HS. Not much we can do. What about if we didn't send them an event but should've? E.g. what if someone joined from a new HS shortly after you did? We wouldn't talk to them. Could imagine sending out the "missed" events after the resync but... painful to work out what they should have seen if they joined/left. Instead, just send them the latest event (if they're still in the room after resync) and let them backfill.(?)

+
    +
  • Don't do this currently.
  • +
  • If anyone who has received our messages sends a message to a HS we missed, they can backfill our messages
  • +
  • Gap: rooms which are infrequently used and take a long time to resync.
  • +
+

Joining after a partial join

+

NB. Not yet implemented.

+
+

TODO: needs prose fleshing out. Liaise with Matthieu. Explain why /send_join (Rich was surprised we didn't just create it locally. Answer: to try and avoid a join which then gets rejected after resync.)

+

We don't know for sure that any join we create would be accepted. +E.g. the joined user might have been banned; the join rules might have changed in a way that we didn't realise... some way in which the partial state was mistaken. +Instead, do another partial make-join/send-join handshake to confirm that the join works.

+
    +
  • Probably going to get a bunch of duplicate state events and auth events.... but the point of partial joins is that these should be small. Many are already persisted = good.
  • +
  • What if the second send_join response includes a different list of resident HSes? Could ignore it.
      +
    • Could even have a special flag that says "just make me a join", i.e. don't bother giving me state or servers in room. Deffo want the auth chain tho.
    • +
    +
  • +
  • SQ: wrt device lists it's a lot safer to ignore it!!!!!
  • +
  • What if the state at the second join is inconsistent with what we have? Ignore it?
  • +
+
+

Leaving (and kicks and bans) after a partial join

+

NB. Not yet implemented.

+
+

When you're fully joined to a room, to have U leave a room their homeserver +needs to

+
    +
  • create a new leave event for U which will be accepted by other homeservers, +and
  • +
  • send that event out to the homeservers in the federation.
  • +
+

When is a leave event accepted? See +v10 auth rules:

+
+
    +
  4. If type is m.room.member: [...]
     5. If membership is leave:
        1. If the sender matches state_key, allow if and only if that user’s current membership state is invite, join, or knock.
        2. [...]
+
+

I think this means that (well-formed!) self-leaves are governed entirely by +4.5.1. This means that if we correctly calculate state which says that U is +invited, joined or knocked and include it in the leave's auth events, our event +is accepted by checks 4 and 5 on incoming events.

+
+
    +
  4. Passes authorization rules based on the event’s auth events, otherwise it is rejected.
  5. Passes authorization rules based on the state before the event, otherwise it is rejected.
+
+

The only way to fail check 6 is if the receiving server's current state of the +room says that U is banned, has left, or has no membership event. But this is +fine: the receiving server already thinks that U isn't in the room.

+
+
    +
  6. Passes authorization rules based on the current state of the room, otherwise it is “soft failed”.
+
+

For the second point (publishing the leave event), the best thing we can do is to publish to all HSes we know to be currently in the room. If they miss that event, they might send us traffic in the room that we don't care about. This is a problem with leaving after a "full" join; we don't seek to fix this with partial joins.

+

(With that said: there's nothing machine-readable in the /send response. I don't +think we can deduce "destination has left the room" from a failure to /send an +event into that room?)

+

Can we still do this during a partial join?

+

We can create leave events and can choose what gets included in our auth events, +so we can be sure that we pass check 4 on incoming events. For check 5, we might +have an incorrect view of the state before an event. +The only way we might erroneously think a leave is valid is if

+
    +
  • the partial state before the leave has U joined, invited or knocked, but
  • +
  • the full state before the leave has U banned, left or not present,
  • +
+

in which case the leave doesn't make anything worse: other HSes already consider +us as not in the room, and will continue to do so after seeing the leave.

+

The remaining obstacle is then: can we safely broadcast the leave event? We may miss servers or incorrectly think that a server is in the room. Or the destination server may be offline and miss the transaction containing our leave event. This should self-heal when they see an event whose prev_events descend from our leave.

+

Another option we considered was to use federation /send_leave to ask a +fully-joined server to send out the event on our behalf. But that introduces +complexity without much benefit. Besides, as Rich put it,

+
+

sending out leaves is pretty best-effort currently

+
+

so this is probably good enough as-is.

+

Cleanup after the last leave

+

TODO: what cleanup is necessary? Is it all just nice-to-have to save unused +work?

+
+ +
+ + +
+
+ + + +
+ + + + + + + + + + + + + + + diff --git a/v1.111/development/synapse_architecture/streams.html b/v1.111/development/synapse_architecture/streams.html new file mode 100644 index 0000000000..9b8fb5f8b7 --- /dev/null +++ b/v1.111/development/synapse_architecture/streams.html @@ -0,0 +1,337 @@ + + + + + + Streams - Synapse + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+ +
+ + + + + + + +
+
+ +
+ +
+ +

Streams

+

Synapse has a concept of "streams", which are roughly described in id_generators.py. +Generally speaking, streams are a series of notifications that something in Synapse's database has changed that the application might need to respond to. +For example:

+
    +
  • The events stream reports new events (PDUs) that Synapse creates, or that Synapse accepts from another homeserver.
  • +
  • The account data stream reports changes to users' account data.
  • +
  • The to-device stream reports when a device has a new to-device message.
  • +
+

See synapse.replication.tcp.streams for the full list of streams.

+

It is very helpful to understand the streams mechanism when working on any part of Synapse that needs to respond to changes—especially if those changes are made by different workers. +To that end, let's describe streams formally, paraphrasing from the docstring of AbstractStreamIdGenerator.

+

Definition

+

A stream is an append-only log T1, T2, ..., Tn, ... of facts1 which grows over time. +Only "writers" can add facts to a stream, and there may be multiple writers.

+

Each fact has an ID, called its "stream ID". +Readers should only process facts in ascending stream ID order.

+

Roughly speaking, each stream is backed by a database table. +It should have a stream_id (or similar) bigint column holding stream IDs, plus additional columns as necessary to describe the fact. +Typically, a fact is expressed with a single row in its backing table.2 +Within a stream, no two facts may have the same stream_id.
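As a purely illustrative example (using an in-memory SQLite table rather than Synapse's real Postgres schema), a hypothetical foo stream might be backed like this:

import sqlite3

conn = sqlite3.connect(":memory:")
# One row per fact: a stream_id column plus whatever columns describe the fact.
conn.execute(
    """
    CREATE TABLE foo_stream (
        stream_id INTEGER NOT NULL,  -- bigint in Postgres; never shared by two facts
        user_id   TEXT NOT NULL,
        payload   TEXT
    )
    """
)
conn.execute(
    "INSERT INTO foo_stream (stream_id, user_id, payload) VALUES (?, ?, ?)",
    (1, "@alice:example.org", "some fact about alice"),
)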

+
+

Aside. Some additional notes on streams' backing tables.

+
    +
  1. Rich would like to ditch the backing tables.
  2. The backing tables may have other uses. For example, the events table backs the events stream, and is read when processing new events. But old rows are read from the table all the time, whenever Synapse needs to look up some facts about an event.
  3. Rich suspects that sometimes the stream is backed by multiple tables, so the stream proper is the union of those tables.
+
+

Stream writers can "reserve" a stream ID, and then later mark it as having been completed. Stream writers need to track the completion of each stream fact. In the happy case, completion means a fact has been written to the stream table. But unhappy cases (e.g. transaction rollback due to an error) also count as completion. Once completed, the rows written with that stream ID are fixed, and no new rows will be inserted with that ID.

+

Current stream ID

+

For any given stream reader (including writers themselves), we may define a per-writer current stream ID:

+
+

A current stream ID for a writer W is the largest stream ID such that +all transactions added by W with equal or smaller ID have completed.

+
+

Similarly, there is a "linear" notion of current stream ID:

+
+

A "linear" current stream ID is the largest stream ID such that +all facts (added by any writer) with equal or smaller ID have completed.

+
+

Because different stream readers A and B learn about new facts at different times, A and B may disagree about current stream IDs. +Put differently: we should think of stream readers as being independent of each other, proceeding through a stream of facts at different rates.

+

The above definition does not give a unique current stream ID; in fact, there can be a range of current stream IDs. Synapse uses both the minimum and maximum IDs for different purposes. Most often the maximum is used, as it's generally beneficial for workers to advance their IDs as soon as possible. However, the minimum is used in situations where e.g. another worker is going to wait until the stream advances past a position.

+

NB. For both senses of "current", note that if a writer opens a transaction that never completes, the current stream ID will never advance beyond that writer's last written stream ID.

+

For single-writer streams, the per-writer current ID and the linear current ID are the same. +Both senses of current ID are monotonic, but they may "skip" or jump over IDs because facts complete out of order.
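Concretely, a writer's current position can be derived from the IDs it has reserved but not yet completed; the function below is a sketch of that rule, not Synapse's implementation:

from typing import Set

def current_stream_id(pending: Set[int], max_completed: int) -> int:
    # The largest ID such that every fact with an equal or smaller ID completed.
    if not pending:
        return max_completed
    # We cannot advance past the smallest ID still awaiting completion.
    return min(pending) - 1

# Matches the worked example below: with 2 and 3 reserved and only 3 completed,
# the current ID stays at 1; once 2 also completes, it jumps straight to 3.
assert current_stream_id(pending={2}, max_completed=3) == 1
assert current_stream_id(pending=set(), max_completed=3) == 3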

+

Example. +Consider a single-writer stream which is initially at ID 1.

+ + + + + + + + + + + + +
Action      | Current stream ID | Notes
            | 1                 |
Reserve 2   | 1                 |
Reserve 3   | 1                 |
Complete 3  | 1                 | current ID unchanged, waiting for 2 to complete
Complete 2  | 3                 | current ID jumps from 1 -> 3
Reserve 4   | 3                 |
Reserve 5   | 3                 |
Reserve 6   | 3                 |
Complete 5  | 3                 |
Complete 4  | 5                 | current ID jumps 3 -> 5, even though 6 is pending
Complete 6  | 6                 |
+

Multi-writer streams

+

There are two ways to view a multi-writer stream.

+
    +
  1. Treat it as a collection of distinct single-writer streams, one for each writer.
  2. Treat it as a single stream.
+

The single stream (option 2) is conceptually simpler, and easier to represent (a single stream id). However, it requires each reader to know about the entire set of writers, to ensure that readers don't erroneously advance their current stream position too early and miss a fact from an unknown writer. In contrast, multiple parallel streams (option 1) are more complex, requiring more state to represent (map from writer to stream id). The payoff for doing so is that readers can "peek" ahead to facts that completed on one writer no matter the state of the others, reducing latency.
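A toy illustration of the trade-off (writer names and positions are made up):

# Option 1: track a position per writer.
positions = {"persister-1": 124, "persister-2": 119, "persister-3": 121}

# A reader using option 1 can already act on persister-1's facts up to 124.
# Option 2 collapses this to a single ID: the largest position that every
# known writer has reached, so the reader lags behind the slowest writer.
linear_position = min(positions.values())
assert linear_position == 119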

+

Note that a multi-writer stream can be viewed in both ways. +For example, the events stream is treated as multiple single-writer streams (option 1) by the sync handler, so that events are sent to clients as soon as possible. +But the background process that works through events treats them as a single linear stream.

+

Another useful example is the cache invalidation stream. The facts this stream holds are instructions to "you should now invalidate these cache entries". We only ever treat this as multiple single-writer streams as there is no important ordering between cache invalidations. (Invalidations are self-contained facts; and the invalidations commute/are idempotent).

+

Writing to streams

+

Writers need to track:

+
    +
  • their current position (i.e. their own per-writer stream ID).
  • +
  • their facts currently awaiting completion.
  • +
+

At startup,

+
    +
  • the current position of that writer can be found by querying the database (which suggests that facts need to be written to the database atomically, in a transaction); and
  • +
  • there are no facts awaiting completion.
  • +
+

To reserve a stream ID, call nextval on the appropriate postgres sequence.

+

To write a fact to the stream: insert the appropriate rows to the appropriate backing table.

+

To complete a fact, first remove it from your map of facts currently awaiting completion. +Then, if no earlier fact is awaiting completion, the writer can advance its current position in that stream. +Upon doing so it should emit an RDATA message3, once for every fact between the old and the new stream ID.
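The toy writer below is a highly simplified, purely illustrative sketch of that flow (it is not Synapse's actual ID generators): it reserves IDs, tracks pending facts, and only advances its position, and hence emits RDATA, once no earlier fact is outstanding.

import itertools
from typing import Set

class ToyStreamWriter:
    def __init__(self) -> None:
        self._ids = itertools.count(1)      # stands in for a Postgres sequence
        self._pending: Set[int] = set()     # facts awaiting completion
        self._max_reserved = 0
        self.current_position = 0           # this writer's current stream ID

    def reserve(self) -> int:
        stream_id = next(self._ids)         # cf. nextval(...) on the sequence
        self._pending.add(stream_id)
        self._max_reserved = stream_id
        # (the fact's rows would be inserted into the backing table here)
        return stream_id

    def complete(self, stream_id: int) -> None:
        self._pending.discard(stream_id)
        old = self.current_position
        # Advance only as far as the earliest still-pending fact allows.
        new = min(self._pending) - 1 if self._pending else self._max_reserved
        if new > old:
            self.current_position = new
            # Synapse would emit an RDATA message for each fact in (old, new].
            print(f"advanced {old} -> {new}")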

+

Subscribing to streams

+

Readers need to track the current position of every writer.

+

At startup, they can find this by contacting each writer with a REPLICATE message, +requesting that all writers reply describing their current position in their streams. +Writers reply with a POSITION message.

+

To learn about new facts, readers should listen for RDATA messages and process them to respond to the new fact. +The RDATA itself is not a self-contained representation of the fact; +readers will have to query the stream tables for the full details. +Readers must also advance their record of the writer's current position for that stream.
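A reader-side sketch (again illustrative only; these names are not Synapse's real replication API): process the rows for a fact, then advance the recorded position for that writer.

from typing import Dict, List, Tuple

# Our record of each (stream, writer)'s current position, seeded by the
# POSITION messages received at startup.
current_positions: Dict[Tuple[str, str], int] = {}

def on_rdata(stream_name: str, writer: str, token: int, rows: List[dict]) -> None:
    for row in rows:
        # React to the fact, e.g. invalidate caches or wake up /sync requests.
        # The row is not self-contained: full details live in the stream's table.
        print(f"new fact on {stream_name} from {writer}: {row}")
    # Only after processing do we advance our view of this writer's position.
    current_positions[(stream_name, writer)] = token

on_rdata("events", "persister-1", 125, [{"event_id": "$abc"}])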

+

Summary

+

In a nutshell: we have an append-only log with a "buffer/scratchpad" at the end where we have to wait for the sequence to be linear and contiguous.

+
+
1 +

we use the word fact here for two reasons. +Firstly, the word "event" is already heavily overloaded (PDUs, EDUs, account data, ...) and we don't need to make that worse. +Secondly, "fact" emphasises that the things we append to a stream cannot change after the fact.

+
+
2 +

A fact might be expressed with 0 rows, e.g. if we opened a transaction to persist an event, but failed and rolled the transaction back before marking the fact as completed. +In principle a fact might be expressed with 2 or more rows; if so, each of those rows should share the fact's stream ID.

+
+
3 +

This communication used to happen directly with the writers over TCP; +nowadays it's done via Redis's Pubsub.

+
+ +
+ + +
+
+ + + +
+ + + + + + + + + + + + + + + -- cgit 1.5.1