author Gerrit Gogel <gerrit@gogel.me> 2024-03-12 16:07:36 +0100
committer GitHub <noreply@github.com> 2024-03-12 15:07:36 +0000
commit 1f887907649c068993d63d32cde5e18b391dc15e
tree f8ab0363bacfc52d53e1ea73f21ad11d4de48f12
parent deactivated flag refactored to filter deactivated users. (#16874)
Prevent locking up while processing batched_auth_events (#16968)
This PR aims to fix #16895, caused by a regression in #7 and not fixed
by #16903. The PR #16903 only fixes a starvation issue, where the CPU
isn't released. There is a second issue, where the execution is blocked.
This theory is supported by the flame graphs provided in #16895 and the
fact that I see CPU usage dropping and staying far below the limit.

Since the changes in #7, the method `check_state_independent_auth_rules`
is called with the additional parameter `batched_auth_events`:


https://github.com/element-hq/synapse/blob/6fa13b4f927c10b5f4e9495be746ec28849f5cb6/synapse/handlers/federation_event.py#L1741-L1743


This makes execution enter the following `if` clause, introduced with #15195:


https://github.com/element-hq/synapse/blob/6fa13b4f927c10b5f4e9495be746ec28849f5cb6/synapse/event_auth.py#L178-L189

There are two issues in the above code snippet.

First, there is the blocking issue. I'm not entirely sure if this is a
deadlock, starvation, or something different. In the beginning, I
thought the copy operation was responsible. It wasn't. Then I
investigated the nested `store.get_events` inside the function `update`.
This was also not causing the blocking issue. Only when I replaced the
set difference operation (`-`) with a list comprehension was the blocking
resolved. Creating and comparing sets with a very large number of
events seems to be problematic.
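
As a rough illustration of the change described above, the following sketch compares the two ways of computing the missing auth event IDs. The dictionary contents and event IDs are hypothetical stand-ins (plain strings, not Synapse's `EventBase` objects); only the shape of the two expressions is taken from the diff.

```python
from typing import Dict, List

# Hypothetical stand-ins: plain strings instead of real events.
batched_auth_events: Dict[str, str] = {
    f"$event{i}": f"event-{i}" for i in range(5)
}
auth_event_ids: List[str] = ["$event1", "$event3", "$missing"]

# Before: build a set from the IDs and subtract the dict's key view.
needed_via_set = set(auth_event_ids) - batched_auth_events.keys()

# After: one membership test per auth event ID, no intermediate set.
needed_via_list = [
    event_id
    for event_id in auth_event_ids
    if event_id not in batched_auth_events
]

print(needed_via_set)
print(needed_via_list)
```

Both expressions select the same IDs. The list comprehension performs one dictionary lookup per auth event ID, so its cost depends on how many auth events the event references. Whether the set subtraction instead scales with the size of `batched_auth_events` is an assumption here, not something this commit message establishes.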

This is how the flamegraph looks now while persisting outliers. As you
can see, the execution no longer locks up in the above function.

![output_2024-02-28_13-59-40](https://github.com/element-hq/synapse/assets/13143850/6db9c9ac-484f-47d0-bdde-70abfbd773ec)

Second, the copying here doesn't serve any purpose, because only a
shallow copy is created. This means the same objects from the original
dict are referenced. This defeats the intention of protecting these
objects from mutation. The review of the original PR
https://github.com/matrix-org/synapse/pull/15195 had an extensive
discussion about this matter.
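
A minimal sketch of why a shallow copy offers no protection for the contained objects. `FakeEvent` is a hypothetical stand-in invented for this example and is not Synapse's event class.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class FakeEvent:
    """Hypothetical stand-in for an event object."""

    event_id: str
    rejected: bool = False


original: Dict[str, FakeEvent] = {"$a": FakeEvent("$a")}

# dict(...) creates a new mapping, but the values are the same objects.
shallow = dict(original)

# Adding a key to the copy leaves the original mapping untouched...
shallow["$b"] = FakeEvent("$b")

# ...but mutating a value is visible through both mappings.
shallow["$a"].rejected = True
```

After this runs, `original` still has only the key `$a`, yet `original["$a"].rejected` is `True`, because both dicts reference the same `FakeEvent` instance.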

Various approaches to copying the auth_events were attempted:
1) Implementing a deepcopy caused issues due to
builtins.EventInternalMetadata not being pickleable.
2) Creating a dict with new objects akin to a deepcopy.
3) Creating a dict with new objects containing only necessary
attributes.

In conclusion, there is no easy way to create an actual copy of the
objects. Opting for a deepcopy can significantly strain memory and CPU
resources, making it an inefficient choice. I don't see why the copy is
necessary in the first place. Therefore I'm proposing to remove it
altogether.
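
The diff in this commit replaces the copy with a `ChainMap`. The sketch below shows, with hypothetical string values in place of real events, how layering mappings gives a combined view without copying either of them.

```python
from collections import ChainMap
from typing import Dict

# Hypothetical stand-ins for the batched events and the events fetched
# from the store; the values would be event objects in Synapse.
batched: Dict[str, str] = {"$a": "event-a", "$b": "event-b"}
fetched: Dict[str, str] = {"$c": "event-c"}

auth_events: ChainMap = ChainMap()
# Each new_child() call puts a mapping in front of the existing ones
# without copying its contents.
auth_events = auth_events.new_child(batched)
auth_events = auth_events.new_child(fetched)

# Lookups search the layers from the newest to the oldest.
print(auth_events["$a"], auth_events["$c"])
print(len(auth_events))
```

Note that writes to a `ChainMap` go to its first mapping only, so in this sketch an assignment such as `auth_events["$d"] = ...` would modify `fetched` and leave `batched` unchanged.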

After these changes, I was able to successfully join these rooms,
without the main worker locking up:
- #synapse:matrix.org
- #element-android:matrix.org
- #element-web:matrix.org
- #ecips:matrix.org
- #ipfs-chatter:ipfs.io
- #python:matrix.org
- #matrix:matrix.org
-rw-r--r-- changelog.d/16968.bugfix 1
-rw-r--r-- synapse/event_auth.py 43
2 files changed, 35 insertions, 9 deletions
diff --git a/changelog.d/16968.bugfix b/changelog.d/16968.bugfix
new file mode 100644
index 0000000000..57ed851178
--- /dev/null
+++ b/changelog.d/16968.bugfix
@@ -0,0 +1 @@
+Prevent locking up when checking auth rules that are independent of room state for batched auth events. Contributed by @ggogel.
\ No newline at end of file
diff --git a/synapse/event_auth.py b/synapse/event_auth.py
index d922c8dc35..c8b06f760e 100644
--- a/synapse/event_auth.py
+++ b/synapse/event_auth.py
@@ -23,7 +23,20 @@
 import collections.abc
 import logging
 import typing
-from typing import Any, Dict, Iterable, List, Mapping, Optional, Set, Tuple, Union
+from typing import (
+    Any,
+    ChainMap,
+    Dict,
+    Iterable,
+    List,
+    Mapping,
+    MutableMapping,
+    Optional,
+    Set,
+    Tuple,
+    Union,
+    cast,
+)
 
 from canonicaljson import encode_canonical_json
 from signedjson.key import decode_verify_key_bytes
@@ -175,12 +188,22 @@ async def check_state_independent_auth_rules(
         return
 
     # 2. Reject if event has auth_events that: ...
+    auth_events: ChainMap[str, EventBase] = ChainMap()
     if batched_auth_events:
-        # Copy the batched auth events to avoid mutating them.
-        auth_events = dict(batched_auth_events)
-        needed_auth_event_ids = set(event.auth_event_ids()) - batched_auth_events.keys()
+        # batched_auth_events can become very large. To avoid repeatedly copying it, which
+        # would significantly impact performance, we use a ChainMap.
+        # batched_auth_events must be cast to MutableMapping because .new_child() requires
+        # this type. This casting is safe as the mapping is never mutated.
+        auth_events = auth_events.new_child(
+            cast(MutableMapping[str, "EventBase"], batched_auth_events)
+        )
+        needed_auth_event_ids = [
+            event_id
+            for event_id in event.auth_event_ids()
+            if event_id not in batched_auth_events
+        ]
         if needed_auth_event_ids:
-            auth_events.update(
+            auth_events = auth_events.new_child(
                 await store.get_events(
                     needed_auth_event_ids,
                     redact_behaviour=EventRedactBehaviour.as_is,
@@ -188,10 +211,12 @@ async def check_state_independent_auth_rules(
                 )
             )
     else:
-        auth_events = await store.get_events(
-            event.auth_event_ids(),
-            redact_behaviour=EventRedactBehaviour.as_is,
-            allow_rejected=True,
+        auth_events = auth_events.new_child(
+            await store.get_events(
+                event.auth_event_ids(),
+                redact_behaviour=EventRedactBehaviour.as_is,
+                allow_rejected=True,
+            )
         )
 
     room_id = event.room_id