From 6ec5e13ec94c484a573bad11e3d80fc6b4b2b943 Mon Sep 17 00:00:00 2001 From: Eric Eastwood Date: Mon, 27 Jan 2025 11:21:10 -0600 Subject: Fix join being denied after being invited over federation (#18075) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit This also happens for rejecting an invite. Basically, any out-of-band membership transition where we first get the membership as an `outlier` and then rely on federation filling us in to de-outlier it. This PR mainly addresses automated test flakiness, bots/scripts, and options within Synapse like [`auto_accept_invites`](https://element-hq.github.io/synapse/v1.122/usage/configuration/config_documentation.html#auto_accept_invites) that are able to react quickly (before federation is able to push us events), but also helps in generic scenarios where federation is lagging. I initially thought this might be a Synapse consistency issue (see issues labeled with [`Z-Read-After-Write`](https://github.com/matrix-org/synapse/labels/Z-Read-After-Write)) but it seems to be an event auth logic problem. Workers probably do increase the number of possible race condition scenarios that make this visible though (replication and cache invalidation lag). Fix https://github.com/element-hq/synapse/issues/15012 (probably fixes https://github.com/matrix-org/synapse/issues/15012 (https://github.com/element-hq/synapse/issues/15012)) Related to https://github.com/matrix-org/matrix-spec/issues/2062 Problems: 1. We don't consider [out-of-band membership](https://github.com/element-hq/synapse/blob/develop/docs/development/room-dag-concepts.md#out-of-band-membership-events) (outliers) in our `event_auth` logic even though we expose them in `/sync`. 1. (This PR doesn't address this point) Perhaps we should consider authing events in the persistence queue as events already in the queue could allow subsequent events to be allowed (events come through many channels: federation transaction, remote invite, remote join, local send). But this doesn't save us in the case where the event is more delayed over federation. ### What happened before? I wrote some Complement test that stresses this exact scenario and reproduces the problem: https://github.com/matrix-org/complement/pull/757 ``` COMPLEMENT_ALWAYS_PRINT_SERVER_LOGS=1 COMPLEMENT_DIR=../complement ./scripts-dev/complement.sh -run TestSynapseConsistency ``` We have `hs1` and `hs2` running in monolith mode (no workers): 1. `@charlie1:hs2` is invited and joins the room: 1. `hs1` invites `@charlie1:hs2` to a room which we receive on `hs2` as `PUT /_matrix/federation/v1/invite/{roomId}/{eventId}` (`on_invite_request(...)`) and the invite membership is persisted as an outlier. The `room_memberships` and `local_current_membership` database tables are also updated which means they are visible down `/sync` at this point. 1. `@charlie1:hs2` decides to join because it saw the invite down `/sync`. Because `hs2` is not yet in the room, this happens as a remote join `make_join`/`send_join` which comes back with all of the auth events needed to auth successfully and now `@charlie1:hs2` is successfully joined to the room. 1. `@charlie2:hs2` is invited and and tries to join the room: 1. `hs1` invites `@charlie2:hs2` to the room which we receive on `hs2` as `PUT /_matrix/federation/v1/invite/{roomId}/{eventId}` (`on_invite_request(...)`) and the invite membership is persisted as an outlier. The `room_memberships` and `local_current_membership` database tables are also updated which means they are visible down `/sync` at this point. 1. Because `hs2` is already participating in the room, we also see the invite come over federation in a transaction and we start processing it (not done yet, see below) 1. `@charlie2:hs2` decides to join because it saw the invite down `/sync`. Because `hs2`, is already in the room, this happens as a local join but we deny the event because our `event_auth` logic thinks that we have no membership in the room :x: (expected to be able to join because we saw the invite down `/sync`) 1. We finally finish processing the `@charlie2:hs2` invite event from and de-outlier it. - If this finished before we tried to join we would have been fine but this is the race condition that makes this situation visible. Logs for `hs2`: ``` 🗳️ on_invite_request: handling event 🔦 _store_room_members_txn update room_memberships: 🔦 _store_room_members_txn update local_current_membership: 📨 Notifying about new event ✅ on_invite_request: handled event 🧲 do_invite_join for @user-2-charlie1:hs2 in !sfZVBdLUezpPWetrol:hs1 🔦 _store_room_members_txn update room_memberships: 🔦 _store_room_members_txn update room_memberships: 📨 Notifying about new event ... 🗳️ on_invite_request: handling event 🔦 _store_room_members_txn update room_memberships: 🔦 _store_room_members_txn update local_current_membership: 📨 Notifying about new event ✅ on_invite_request: handled event 📬 handling received PDU in room !sfZVBdLUezpPWetrol:hs1: 📮 handle_new_client_event: handling ❌ Denying new event because 403: You are not invited to this room. synapse.http.server - 130 - INFO - POST-16 - SynapseError: 403 - You are not invited to this room. 📨 Notifying about new event ✅ handled received PDU in room !sfZVBdLUezpPWetrol:hs1: ``` --- tests/utils.py | 19 ++++++++++++++++--- 1 file changed, 16 insertions(+), 3 deletions(-) (limited to 'tests/utils.py') diff --git a/tests/utils.py b/tests/utils.py index 3f4a7bb560..d4aebc3069 100644 --- a/tests/utils.py +++ b/tests/utils.py @@ -400,11 +400,24 @@ class TestTimeout(Exception): class test_timeout: + """ + FIXME: This implementation is not robust against other code tight-looping and + preventing the signals propagating and timing out the test. You may need to add + `time.sleep(0.1)` to your code in order to allow this timeout to work correctly. + + ```py + with test_timeout(3): + while True: + my_checking_func() + time.sleep(0.1) + ``` + """ + def __init__(self, seconds: int, error_message: Optional[str] = None) -> None: - if error_message is None: - error_message = "test timed out after {}s.".format(seconds) + self.error_message = f"Test timed out after {seconds}s" + if error_message is not None: + self.error_message += f": {error_message}" self.seconds = seconds - self.error_message = error_message def handle_timeout(self, signum: int, frame: Optional[FrameType]) -> None: raise TestTimeout(self.error_message) -- cgit 1.5.1