summary refs log tree commit diff
path: root/synapse/replication (follow)
Commit message (Collapse)AuthorAgeFilesLines
* Merge remote-tracking branch 'origin/release-v1.91' into release-v1.92Patrick Cloke2023-09-061-12/+0
|\
| * Revert MSC3861 introspection cache, admin impersonation and account lock ↵Quentin Gliech2023-09-061-12/+0
| | | | | | | | (#16258)
* | Don't wake up destination transaction queue if they're not due for retry. ↵Erik Johnston2023-09-041-5/+3
| | | | | | | | (#16223)
* | Cache device resync requests over replication (#16241)David Robertson2023-09-041-1/+1
| |
* | Track currently syncing users by device for presence (#16172)Patrick Cloke2023-08-292-8/+28
| | | | | | | | | | | | | | Refactoring to use both the user ID & the device ID when tracking the currently syncing users in the presence handler. This is done both locally and over replication. Note that the device ID is discarded but will be used in a future change.
* | Pass the device ID around in the presence handler (#16171)Patrick Cloke2023-08-281-4/+7
| | | | | | | | | | | | Refactoring to pass the device ID (in addition to the user ID) through the presence handler (specifically the `user_syncing`, `set_state`, and `bump_presence_active_time` methods and their replication versions).
* | Combine logic about not overriding BUSY presence. (#16170)Patrick Cloke2023-08-281-5/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | Simplify some of the presence code by reducing duplicated code between worker & non-worker modes. The main change is to push some of the logic from `user_syncing` into `set_state`. This is done by passing whether the user is setting the presence via a `/sync` with a new `is_sync` flag to `set_state`. If this is `true` some additional logic is performed: * Don't override `busy` presence. * Update the `last_user_sync_ts`. * Never update the status message.
* | Task scheduler: add replication notify for new task to launch ASAP (#16184)Mathieu Velten2023-08-282-0/+30
|/
* Fix perf of `wait_for_stream_positions` (#16148)Erik Johnston2023-08-221-7/+12
|
* Add an admin endpoint to allow authorizing server to signal token ↵Shay2023-08-221-0/+12
| | | | revocations (#16125)
* Run pyupgrade for python 3.7 & 3.8. (#16110)Patrick Cloke2023-08-151-1/+1
|
* Add ability to wait for locks and add locks to purge history / room deletion ↵Erik Johnston2023-07-312-0/+55
| | | | | (#15791) c.f. #13476
* Clarify comment on key uploads over replication (#16016)Shay2023-07-271-2/+2
|
* Add Unix socket support for Redis connections (#15644)Jason Little2023-05-262-9/+63
| | | | Adds a new configuration setting to connect to Redis via a Unix socket instead of over TCP. Disabled by default.
* Use a custom scheme & the worker name for replication requests. (#15578)Jason Little2023-05-231-12/+6
| | | | | | | | All the information needed is already in the `instance_map`, so use that instead of passing the hostname / IP & port manually for each replication request. This consolidates logic for future improvements of using e.g. UNIX sockets for workers.
* Update code to refer to "workers". (#15606)Patrick Cloke2023-05-161-2/+2
| | | | A bunch of comments and variables are out of date and use obsolete terms.
* Add redis SSL configuration options (#15312)Roel ter Maat2023-05-113-14/+76
| | | | | | | | | | | | | | | | | * Add SSL options to redis config * fix lint issues * Add documentation and changelog file * add missing . at the end of the changelog * Move client context factory to new file * Rename ssl to tls and fix typo * fix lint issues * Added when redis attributes were added
* Remove `worker_replication_*` settings (#15491)Jason Little2023-05-111-11/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | * Add master to the instance_map as part of Complement, have ReplicationEndpoint look at instance_map for master. * Fix typo in drive by. * Remove unnecessary worker_replication_* bits from unit tests and add master to instance_map(hopefully in the right place) * Several updates: 1. Switch from master to main for naming the main process in the instance_map. Add useful constants for easier adjustment of names in the future. 2. Add backwards compatibility for worker_replication_* to allow time to transition to new style. Make sure to prioritize declaring main directly on the instance_map. 3. Clean up old comments/commented out code. 4. Adjust unit tests to match with new code. 5. Adjust Complement setup infrastructure to only add main to the instance_map if workers are used and remove now unused options from the worker.yaml template. * Initial Docs upload * Changelog * Missed some commented out code that can go now * Remove TODO comment that no longer holds true. * Fix links in docs * More docs * Remove debug logging * Apply suggestions from code review Co-authored-by: reivilibre <olivier@librepush.net> * Apply suggestions from code review Co-authored-by: reivilibre <olivier@librepush.net> * Update version to latest, include completeish before/after examples in upgrade notes. * Fix up and docs too --------- Co-authored-by: reivilibre <olivier@librepush.net>
* HTTP Replication Client (#15470)Jason Little2023-05-091-1/+1
| | | | | | Separate out a HTTP client for replication in preparation for also supporting using UNIX sockets. The major difference from the base class is that this does not use treq to handle HTTP requests.
* Remove legacy code of single user device resync api (#15418)Alok Kumar Singh2023-04-211-57/+0
| | | | | * Removed single-user resync usage and updated it to use multi-user counterpart Signed-off-by: Alok Kumar Singh alokaks601@gmail.com
* Add some clarification to the doc/comments regarding TCP replication (#15354)Mathieu Velten2023-03-302-32/+3
|
* Have replication clients remove _INT_STREAM_POS (#15309)David Robertson2023-03-221-1/+1
| | | | | | | | | | | | | | | | | | | * Have replication clients remove _INT_STREAM_POS Suppose worker A makes an internal http request from worker B. B may make changes that A later learns about over replication. We want A's request to block until it has seen those changes—mainly to ensure A's caches are invalidated promptly. This helps provide read-after-write consistency, eliminating entire categories of races and test flakes. To implement this, B includes a top-level field `_INT_STREAM_POS` in its response JSON. Roughly speaking, the field's value tells A what to wait for. But we weren't removing that internal field before A's request completed! Introduced in https://github.com/matrix-org/synapse/pull/14820. Fixes #15308. * Changelog
* Remove no-op send_command for Redis replication. (#15274)Patrick Cloke2023-03-161-25/+1
| | | | | With Redis commands do not need to be re-issued by the main process (they fan-out to all processes at once) and thus it is no longer necessary to worry about them reflecting recursively forever.
* Remove unused class: DirectTcpReplicationClientFactory. (#15272)Patrick Cloke2023-03-151-51/+0
|
* Add support for knocking to workers. (#15133)Dirk Klimpel2023-03-021-11/+4
|
* Merge branch 'master' into developH. Shay2023-02-281-0/+18
|\
| * Fix bug where 5s delays would occasionally happen. (#15150)Erik Johnston2023-02-241-0/+18
| | | | | | This only affects deployments using workers.
* | Bump black from 22.12.0 to 23.1.0 (#15103)dependabot[bot]2023-02-224-4/+0
|/
* Tweak logging for when a worker waits for its view of a replication stream ↵reivilibre2023-02-211-2/+10
| | | | | | | | | | | | | | | | | | to catch up. (#15120)Co-authored-by: Sean Quah <8349537+squahtx@users.noreply.github.com> * Improve logging messages for the 'wait for repl stream' read-after-write consistency feature * Newsfile Signed-off-by: Olivier Wilkinson (reivilibre) <oliverw@matrix.org> * Update synapse/replication/tcp/client.py Co-authored-by: Sean Quah <8349537+squahtx@users.noreply.github.com> --------- Signed-off-by: Olivier Wilkinson (reivilibre) <oliverw@matrix.org> Co-authored-by: Sean Quah <8349537+squahtx@users.noreply.github.com>
* Fix bug in replication where response is cached (#15024)Erik Johnston2023-02-081-0/+2
|
* Faster joins: omit partial rooms from eager syncs until the resync completes ↵David Robertson2023-01-231-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | (#14870) * Allow `AbstractSet` in `StrCollection` Or else frozensets are excluded. This will be useful in an upcoming commit where I plan to change a function that accepts `List[str]` to accept `StrCollection` instead. * `rooms_to_exclude` -> `rooms_to_exclude_globally` I am about to make use of this exclusion mechanism to exclude rooms for a specific user and a specific sync. This rename helps to clarify the distinction between the global config and the rooms to exclude for a specific sync. * Better function names for internal sync methods * Track a list of excluded rooms on SyncResultBuilder I plan to feed a list of partially stated rooms for this sync to ignore * Exclude partial state rooms during eager sync using the mechanism established in the previous commit * Track un-partial-state stream in sync tokens So that we can work out which rooms have become fully-stated during a given sync period. * Fix mutation of `@cached` return value This was fouling up a complement test added alongside this PR. Excluding a room would mean the set of forgotten rooms in the cache would be extended. This means that room could be erroneously considered forgotten in the future. Introduced in #12310, Synapse 1.57.0. I don't think this had any user-visible side effects (until now). * SyncResultBuilder: track rooms to force as newly joined Similar plan as before. We've omitted rooms from certain sync responses; now we establish the mechanism to reintroduce them into future syncs. * Read new field, to present rooms as newly joined * Force un-partial-stated rooms to be newly-joined for eager incremental syncs only, provided they're still fully stated * Notify user stream listeners to wake up long polling syncs * Changelog * Typo fix Co-authored-by: Sean Quah <8349537+squahtx@users.noreply.github.com> * Unnecessary list cast Co-authored-by: Sean Quah <8349537+squahtx@users.noreply.github.com> * Rephrase comment Co-authored-by: Sean Quah <8349537+squahtx@users.noreply.github.com> * Another comment Co-authored-by: Sean Quah <8349537+squahtx@users.noreply.github.com> * Fixup merge(?) * Poke notifier when receiving un-partial-stated msg over replication * Fixup merge whoops Thanks MV :) Co-authored-by: Mathieu Velen <mathieuv@matrix.org> Co-authored-by: Mathieu Velten <mathieuv@matrix.org> Co-authored-by: Sean Quah <8349537+squahtx@users.noreply.github.com>
* Faster joins: Update room stats and the user directory on workers when ↵Sean Quah2023-01-231-0/+6
| | | | | | | | | | | | | | | | | | | | | | | finishing join (#14874) * Faster joins: Update room stats and user directory on workers when done When finishing a partial state join to a room, we update the current state of the room without persisting additional events. Workers receive notice of the current state update over replication, but neglect to wake the room stats and user directory updaters, which then get incidentally triggered the next time an event is persisted or an unrelated event persister sends out a stream position update. We wake the room stats and user directory updaters at the appropriate time in this commit. Part of #12814 and #12815. Signed-off-by: Sean Quah <seanq@matrix.org> * fixup comment Signed-off-by: Sean Quah <seanq@matrix.org>
* Enable Faster Remote Room Joins against worker-mode Synapse. (#14752)reivilibre2023-01-221-5/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | * Enable Complement tests for Faster Remote Room Joins on worker-mode * (dangerous) Add an override to allow Complement to use FRRJ under workers * Newsfile Signed-off-by: Olivier Wilkinson (reivilibre) <oliverw@matrix.org> * Fix race where we didn't send out replication notification * MORE HACKS * Fix get_un_partial_stated_rooms_token to take instance_name * Fix bad merge * Remove warning * Correctly advance un_partial_stated_room_stream * Fix merge * Add another notify_replication * Fixups * Create a separate ReplicationNotifier * Fix test * Fix portdb * Create a separate ReplicationNotifier * Fix test * Fix portdb * Fix presence test * Newsfile * Apply suggestions from code review * Update changelog.d/14752.misc Co-authored-by: Erik Johnston <erik@matrix.org> * lint Signed-off-by: Olivier Wilkinson (reivilibre) <oliverw@matrix.org> Co-authored-by: Erik Johnston <erik@matrix.org>
* Reduce max time we wait for stream positions (#14881)Erik Johnston2023-01-202-12/+11
| | | | | | Now that we wait for stream positions whenever we do a HTTP replication hit, we need to be less brutal in the case where we do timeout (as we have bugs around this).
* Fix bug in wait for stream position (#14872)Erik Johnston2023-01-191-10/+19
| | | | | | | This caused some requests to fail. This caused some requests to fail. This really only started causing issues due to #14856
* Wait for streams to catch up when processing HTTP replication. (#14820)Erik Johnston2023-01-1815-115/+182
| | | | This should hopefully mitigate a class of races where data gets out of sync due a HTTP replication request racing with the replication streams.
* Fix bug in `wait_for_stream_position` (#14856)Erik Johnston2023-01-171-1/+1
| | | | | We were incorrectly checking if the *local* token had been advanced, rather than the token for the remote instance. In practice, I don't think this has caused any bugs due to where we use `wait_for_stream_position`, as critically we don't use it on instances that also write to the given streams (and so the local token will lag behind all remote tokens).
* Merge device list replication streams (#14833)Erik Johnston2023-01-173-27/+58
|
* Merge account data streams (#14826)Erik Johnston2023-01-134-32/+26
|
* Batch up replication requests to request the resyncing of remote users's ↵reivilibre2023-01-101-1/+73
| | | | devices. (#14716)
* Update all stream IDs after processing replication rows (#14723)Nick Mills-Barrett2023-01-041-0/+3
| | | | | | | | | | | | | | This creates a new store method, `process_replication_position` that is called after `process_replication_rows`. By moving stream ID advances here this guarantees any relevant cache invalidations will have been applied before the stream is advanced. This avoids race conditions where Python switches between threads mid way through processing the `process_replication_rows` method where stream IDs may be advanced before caches are invalidated due to class resolution ordering. See this comment/issue for further discussion: https://github.com/matrix-org/synapse/issues/14158#issuecomment-1344048703
* Add experimental support for MSC3391: deleting account data (#14714)Andrew Morgan2023-01-011-8/+84
|
* Faster remote room joins: invalidate caches and unblock requests when ↵reivilibre2022-12-191-1/+13
| | | | receiving un-partial-stated event notifications over replication. [rei:frrj/streams/unpsr] (#14546)
* Faster remote room joins: stream the un-partial-stating of events over ↵reivilibre2022-12-142-1/+34
| | | | replication. [rei:frrj/streams/unpsr] (#14545)
* Faster remote room joins: unblock tasks waiting for full room state when the ↵reivilibre2022-12-061-0/+11
| | | | un-partial-stating of that room is received over the replication stream. [rei:frrj/streams/unpsr] (#14474)
* Faster remote room joins: stream the un-partial-stating of rooms over ↵reivilibre2022-12-052-0/+51
| | | | replication. [rei:frrj/streams/unpsr] (#14473)
* Add a type hint for `get_device_handler()` and fix incorrect types. (#14055)Patrick Cloke2022-11-221-3/+8
| | | | | This was the last untyped handler from the HomeServer object. Since it was being treated as Any (and thus unchecked) it was being used incorrectly in a few places.
* Fix check to ignore blank lines in incoming TCP replication (#14449)Andrew Morgan2022-11-171-1/+1
|
* Reintroduce #14376, with bugfix for monoliths (#14468)David Robertson2022-11-163-76/+0
| | | | | | | | | | | | | | | | | | | | | | * Add tests for StreamIdGenerator * Drive-by: annotate all defs * Revert "Revert "Remove slaved id tracker (#14376)" (#14463)" This reverts commit d63814fd736fed5d3d45ff3af5e6d3bfae50c439, which in turn reverted 36097e88c4da51fce6556a58c49bd675f4cf20ab. This restores the latter. * Fix StreamIdGenerator not handling unpersisted IDs Spotted by @erikjohnston. Closes #14456. * Changelog Co-authored-by: Nick Mills-Barrett <nick@fizzadar.com> Co-authored-by: Erik Johnston <erik@matrix.org>
* Remove need for `worker_main_http_uri` setting to use /keys/upload. (#14400)realtyem2022-11-161-0/+67
|
* Remove redundant types from comments. (#14412)Patrick Cloke2022-11-161-1/+1
| | | | | | | Remove type hints from comments which have been added as Python type hints. This helps avoid drift between comments and reality, as well as removing redundant information. Also adds some missing type hints which were simple to fill in.
* Revert "Remove slaved id tracker (#14376)" (#14463)Erik Johnston2022-11-163-0/+76
| | | This reverts commit 36097e88c4da51fce6556a58c49bd675f4cf20ab.
* Support using SSL on worker endpoints. (#14128)Tuomas Ojamies2022-11-151-1/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | * Fix missing SSL support in worker endpoints. * Add changelog * SSL for Replication endpoint * Remove unit test change * Refactor listener creation to reduce duplicated code * Fix the logger message * Update synapse/app/_base.py Co-authored-by: Patrick Cloke <clokep@users.noreply.github.com> * Update synapse/app/_base.py Co-authored-by: Patrick Cloke <clokep@users.noreply.github.com> * Update synapse/app/_base.py Co-authored-by: Patrick Cloke <clokep@users.noreply.github.com> * Add config documentation for new TLS option Co-authored-by: Tuomas Ojamies <tojamies@palantir.com> Co-authored-by: Patrick Cloke <clokep@users.noreply.github.com> Co-authored-by: Olivier Wilkinson (reivilibre) <oliverw@matrix.org>
* Remove slaved id tracker (#14376)Nick Mills-Barrett2022-11-143-76/+0
| | | | | This matches the multi instance writer ID generator class which can both handle advancing the current token over replication and by calling the database.
* Merge/remove `Slaved*` stores into `WorkerStores` (#14375)Nick Mills-Barrett2022-11-116-295/+0
|
* Merge remote-tracking branch 'origin/release-v1.69' into developPatrick Cloke2022-10-141-1/+17
|\
| * Fallback if 'approved' isn't included in a registration replication request ↵Brendan Abolivier2022-10-111-1/+17
| | | | | | | | (#14135)
* | Batch up notifications after event persistence (#14033)Shay2022-10-051-9/+10
|/
* Allow admins to require a manual approval process before new accounts can be ↵Brendan Abolivier2022-09-291-0/+5
| | | | used (using MSC3866) (#13556)
* Persist CreateRoom events to DB in a batch (#13800)Shay2022-09-283-2/+175
|
* Accept & store thread IDs for receipts (implement MSC3771). (#13782)Patrick Cloke2022-09-232-1/+3
| | | | Updates the `/receipts` endpoint and receipt EDU handler to parse a `thread_id` from the body and insert it in the database.
* Support enabling/disabling pushers (from MSC3881) (#13799)Brendan Abolivier2022-09-211-3/+7
| | | Partial implementation of MSC3881
* Remove configuration options for direct TCP replication. (#13647)Patrick Cloke2022-09-061-37/+21
| | | Removes the ability to configure legacy direct TCP replication. Workers now require Redis to run.
* Remove support for unstable private read receipts (#13653)Šimon Brandner2022-09-011-4/+1
| | | Signed-off-by: Šimon Brandner <simon.bra.ag@gmail.com>
* Generalise the `@cancellable` annotation so it can be used on functions ↵reivilibre2022-08-311-3/+4
| | | | other than just servlet methods. (#13662)
* Speed up fetching large numbers of push rules (#13592)Erik Johnston2022-08-231-1/+0
|
* Support stable identifiers for MSC2285: private read receipts. (#13273)Šimon Brandner2022-08-051-1/+4
| | | | | This adds support for the stable identifiers of MSC2285 while continuing to support the unstable identifiers behind the configuration flag. These will be removed in a future version.
* Remove old empty/redundant slaved stores. (#13349)Nick Mills-Barrett2022-07-217-142/+0
|
* Use cache store remove base slaved (#13329)Nick Mills-Barrett2022-07-2111-83/+10
| | | This comes from two identical definitions in each of the base stores, and means the base slaved store is now empty and can be removed.
* Add type annotations to `trace` decorator. (#13328)Patrick Cloke2022-07-191-2/+2
| | | | Functions that are decorated with `trace` are now properly typed and the type hints for them are fixed.
* Rate limit joins per-room (#13276)David Robertson2022-07-192-1/+17
|
* Revert "Make all `process_replication_rows` methods async (#13304)" (#13312)Erik Johnston2022-07-184-16/+8
| | | This reverts commit 5d4028f217f178fcd384d5bfddd92225b4e78c51.
* Make all `process_replication_rows` methods async (#13304)Nick Mills-Barrett2022-07-174-8/+16
| | | | | More prep work for asyncronous caching, also makes all process_replication_rows methods consistent (presence handler already is so). Signed off by Nick @ Beeper (@Fizzadar)
* Faster room joins: fix race in recalculation of current room state (#13151)Sean Quah2022-07-072-0/+77
| | | | | | | | | | | Bounce recalculation of current state to the correct event persister and move recalculation of current state into the event persistence queue, to avoid concurrent updates to a room's current state. Also give recalculation of a room's current state a real stream ordering. Signed-off-by: Sean Quah <seanq@matrix.org>
* Handle race between persisting an event and un-partial stating a room (#13100)Sean Quah2022-07-052-0/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Whenever we want to persist an event, we first compute an event context, which includes the state at the event and a flag indicating whether the state is partial. After a lot of processing, we finally try to store the event in the database, which can fail for partial state events when the containing room has been un-partial stated in the meantime. We detect the race as a foreign key constraint failure in the data store layer and turn it into a special `PartialStateConflictError` exception, which makes its way up to the method in which we computed the event context. To make things difficult, the exception needs to cross a replication request: `/fed_send_events` for events coming over federation and `/send_event` for events from clients. We transport the `PartialStateConflictError` as a `409 Conflict` over replication and turn `409`s back into `PartialStateConflictError`s on the worker making the request. All client events go through `EventCreationHandler.handle_new_client_event`, which is called in *a lot* of places. Instead of trying to update all the code which creates client events, we turn the `PartialStateConflictError` into a `429 Too Many Requests` in `EventCreationHandler.handle_new_client_event` and hope that clients take it as a hint to retry their request. On the federation event side, there are 7 places which compute event contexts. 4 of them use outlier event contexts: `FederationEventHandler._auth_and_persist_outliers_inner`, `FederationHandler.do_knock`, `FederationHandler.on_invite_request` and `FederationHandler.do_remotely_reject_invite`. These events won't have the partial state flag, so we do not need to do anything for then. The remaining 3 paths which create events are `FederationEventHandler.process_remote_join`, `FederationEventHandler.on_send_membership_event` and `FederationEventHandler._process_received_pdu`. We can't experience the race in `process_remote_join`, unless we're handling an additional join into a partial state room, which currently blocks, so we make no attempt to handle it correctly. `on_send_membership_event` is only called by `FederationServer._on_send_membership_event`, so we catch the `PartialStateConflictError` there and retry just once. `_process_received_pdu` is called by `on_receive_pdu` for incoming events and `_process_pulled_event` for backfill. The latter should never try to persist partial state events, so we ignore it. We catch the `PartialStateConflictError` in `on_receive_pdu` and retry just once. Refering to the graph of code paths in https://github.com/matrix-org/synapse/issues/12988#issuecomment-1156857648 may make the above make more sense. Signed-off-by: Sean Quah <seanq@matrix.org>
* Type annotations in `synapse.databases.main.devices` (#13025)David Robertson2022-06-151-2/+1
| | | Co-authored-by: Patrick Cloke <clokep@users.noreply.github.com>
* Remove groups replication code. (#12900)Patrick Cloke2022-05-314-86/+0
| | | | The replication logic for groups is no longer used, so the message passing infrastructure can be removed.
* Rename storage classes (#12913)Erik Johnston2022-05-312-4/+6
|
* Send `USER_IP` commands on a different Redis channel, in order to reduce ↵reivilibre2022-05-202-3/+15
| | | | traffic to workers that do not process these commands. (#12809)
* Lay some foundation work to allow workers to only subscribe to some kinds of ↵reivilibre2022-05-192-12/+57
| | | | messages, reducing replication traffic. (#12672)
* Add `StreamKeyType` class and replace string literals with constants (#12567)Andrew Morgan2022-05-161-7/+11
|
* Respect the `@cancellable` flag for `ReplicationEndpoint`s (#12700)Sean Quah2022-05-111-2/+19
| | | | | | | | | While `ReplicationEndpoint`s register themselves via `JsonResource`, they pass a method that calls the handler, instead of the handler itself, to `register_paths`. As a result, `JsonResource` will not correctly pick up the `@cancellable` flag and we have to apply it ourselves. Signed-off-by: Sean Quah <seanq@element.io>
* Update `replication.md` with info on TCP module structure (#12621)Shay2022-05-091-1/+1
|
* Update `_on_new_receipts()` to work with MSC2285 changes. (#12636)Šimon Brandner2022-05-051-5/+3
|
* Reduce log spam when running multiple event persisters (#12610)Erik Johnston2022-05-052-2/+16
|
* Add opentracing spans to calls to external cache (#12380)Erik Johnston2022-04-071-11/+20
|
* Refactor and convert `Linearizer` to async (#12357)Sean Quah2022-04-051-1/+1
| | | | | | | | | | | Refactor and convert `Linearizer` to async. This makes a `Linearizer` cancellation bug easier to fix. Also refactor to use an async context manager, which eliminates an unlikely footgun where code that doesn't immediately use the context manager could forget to release the lock. Signed-off-by: Sean Quah <seanq@element.io>
* Prefill more stream change caches. (#12372)Erik Johnston2022-04-051-23/+2
|
* Prefill the device_list_stream_cache (#12367)Erik Johnston2022-04-041-1/+11
| | | | | | | * Prefill the device_list_stream_cache * Newsfile * Newsfile
* Track device list updates per room. (#12321)Erik Johnston2022-04-041-0/+1
| | | | | | | | | | | | | | This is a first step in dealing with #7721. The idea is basically that rather than calculating the full set of users a device list update needs to be sent to up front, we instead simply record the rooms the user was in at the time of the change. This will allow a few things: 1. we can defer calculating the set of remote servers that need to be poked about the change; and 2. during `/sync` and `/keys/changes` we can avoid also avoid calculating users who share rooms with other users, and instead just look at the rooms that have changed. However, care needs to be taken to correctly handle server downgrades. As such this PR writes to both `device_lists_changes_in_room` and the `device_lists_outbound_pokes` table synchronously. In a future release we can then bump the database schema compat version to `69` and then we can assume that the new `device_lists_changes_in_room` exists and is handled. There is a temporary option to disable writing to `device_lists_outbound_pokes` synchronously, allowing us to test the new code path does work (and by implication upgrading to a future release and downgrading to this one will work correctly). Note: Ideally we'd do the calculation of room to servers on a worker (e.g. the background worker), but currently only master can write to the `device_list_outbound_pokes` table.
* Move `update_client_ip` background job from the main process to the ↵reivilibre2022-04-013-73/+42
| | | | background worker. (#12251)
* Bump `black` and `click` versions (#12320)David Robertson2022-03-291-1/+1
|
* Improve code documentation for the typing stream over replication. (#12211)reivilibre2022-03-113-4/+16
|
* Rename get_tcp_replication to get_replication_command_handler. (#12192)Patrick Cloke2022-03-105-8/+8
| | | | | | Since the object it returns is a ReplicationCommandHandler. This is clean-up from adding support to Redis where the command handler was added as an additional layer of abstraction from the TCP protocol.
* Retry some http replication failures (#12182)Nick Mills-Barrett2022-03-091-11/+36
| | | | | | | | This allows for the target process to be down for around a minute which provides time for restarts during synapse upgrades/config updates. Closes: #12178 Signed off by Nick Mills-Barrett nick@beeper.com
* Fix incorrect type hints for txredis. (#12042)Patrick Cloke2022-03-082-5/+5
| | | | Some properties were marked as RedisProtocol instead of ConnectionHandler, which wraps RedisProtocol instance(s).
* Spread out sending device lists to remote hosts (#12132)Erik Johnston2022-03-041-1/+1
|
* Remove `HomeServer.get_datastore()` (#12031)Richard van der Hoff2022-02-2310-31/+31
| | | | | | | The presence of this method was confusing, and mostly present for backwards compatibility. Let's get rid of it. Part of #11733
* Better error message when failing to request from another process (#12060)Erik Johnston2022-02-221-1/+3
|
* Add missing type hints to synapse.replication. (#11938)Patrick Cloke2022-02-0814-141/+196
|
* Remove unnecessary ignores due to Twisted upgrade. (#11939)Patrick Cloke2022-02-082-3/+3
| | | | Twisted 22.1.0 fixed some internal type hints, allowing Synapse to remove ignore calls for parameters to connectTCP.
* Add missing type hints to synapse.replication.http. (#11856)Patrick Cloke2022-02-0812-162/+257
|
* Stop reading from `event_reference_hashes` (#11794)Richard van der Hoff2022-01-211-1/+1
| | | | Preparation for dropping this table altogether. Part of #6574.
* Use auto_attribs/native type hints for attrs classes. (#11692)Patrick Cloke2022-01-131-17/+17
|
* Remove redundant `get_current_events_token` (#11643)Richard van der Hoff2022-01-041-9/+0
| | | | | | | | | | | | | | | | | * Push `get_room_{min,max_stream_ordering}` into StreamStore Both implementations of this are identical, so we may as well push it down and get rid of the abstract base class nonsense. * Remove redundant `StreamStore` class This is empty now * Remove redundant `get_current_events_token` This was an exact duplicate of `get_room_max_stream_ordering`, so let's get rid of it. * newsfile
* Convert all namedtuples to attrs. (#11665)Patrick Cloke2021-12-302-70/+74
| | | To improve type hints throughout the code.
* Type hint the constructors of the data store classes (#11555)Sean Quah2021-12-136-12/+42
|
* Save the OIDC session ID (sid) with the device on login (#11482)Quentin Gliech2021-12-061-0/+8
| | | As a step towards allowing back-channel logout for OIDC.
* Add type hints to `synapse/storage/databases/main/events_worker.py` (#11411)Sean Quah2021-11-263-19/+13
| | | | Also refactor the stream ID trackers/generators a bit and try to document them better.
* Add missing type hints to `synapse.app`. (#11287)Patrick Cloke2021-11-101-2/+2
|
* Enable passing typing stream writers as a list. (#11237)Nick Barrett2021-11-032-3/+2
| | | | This makes the typing stream writer config match the other stream writers that only currently support a single worker.
* Implement an `on_new_event` callback (#11126)Brendan Abolivier2021-10-261-1/+2
| | | Co-authored-by: Andrew Morgan <1342360+anoadragon453@users.noreply.github.com>
* Add type hints for most `HomeServer` parameters (#11095)Sean Quah2021-10-2222-53/+130
|
* Fix logging context warnings when losing replication connection (#10984)Sean Quah2021-10-152-10/+26
| | | | | | Instead of triggering `__exit__` manually on the replication handler's logging context, use it as a context manager so that there is an `__enter__` call to balance the `__exit__`.
* Fix opentracing and Prometheus metrics for replication requests (#10996)Sean Quah2021-10-121-76/+78
| | | | | | | | | | | | | | | | | | | | | | | | This commit fixes two bugs to do with decorators not instrumenting `ReplicationEndpoint`'s `send_request` correctly. There are two decorators on `send_request`: Prometheus' `Gauge.track_inprogress()` and Synapse's `opentracing.trace`. `Gauge.track_inprogress()` does not have any support for async functions when used as a decorator. Since async functions behave like regular functions that return coroutines, only the creation of the coroutine was covered by the metric and none of the actual body of `send_request`. `Gauge.track_inprogress()` returns a regular, non-async function wrapping `send_request`, which is the source of the next bug. The `opentracing.trace` decorator would normally handle async functions correctly, but since the wrapped `send_request` is a non-async function, the decorator ends up suffering from the same issue as `Gauge.track_inprogress()`: the opentracing span only measures the creation of the coroutine and none of the actual function body. Using `Gauge.track_inprogress()` as a context manager instead of a decorator resolves both bugs.
* Annotate synapse.storage.util (#10892)David Robertson2021-10-082-5/+9
| | | | | Also mark `synapse.streams` as having has no untyped defs Co-authored-by: Sean Quah <8349537+squahtx@users.noreply.github.com>
* Require direct references to configuration variables. (#10985)Patrick Cloke2021-10-062-3/+6
| | | | | | This removes the magic allowing accessing configurable variables directly from the config object. It is now required that a specific configuration class is used (e.g. `config.foo` must be replaced with `config.server.foo`).
* Pass str to twisted's IReactorTCP (#10895)David Robertson2021-09-302-3/+13
| | | | | | | This follows a correction made in twisted/twisted#1664 and should fix our Twisted Trial CI job. Until that change is in a twisted release, we'll have to ignore the type of the `host` argument. I've raised #10899 to remind us to review the issue in a few months' time.
* Use direct references for configuration variables (part 6). (#10916)Patrick Cloke2021-09-291-1/+1
|
* Use direct references for configuration variables (part 5). (#10897)Patrick Cloke2021-09-242-4/+4
|
* Use direct references for some configuration variables (#10798)Patrick Cloke2021-09-134-5/+5
| | | | Instead of proxying through the magic getter of the RootConfig object. This should be more performant (and is more explicit).
* Split `FederationHandler` in half (#10692)Richard van der Hoff2021-08-261-2/+2
| | | The idea here is to take anything to do with incoming events and move it out to a separate handler, as a way of making FederationHandler smaller.
* Remove the unused public_room_list_stream (#10565)Andrew Morgan2021-08-173-65/+0
| | | Co-authored-by: Patrick Cloke <clokep@users.noreply.github.com>
* Fix up type hints for Twisted 21.7 (#10490)Richard van der Hoff2021-07-281-1/+1
| | | Mostly this involves decorating a few Deferred declarations with extra type hints. We wrap the types in quotes to avoid runtime errors when running against older versions of Twisted that don't have generics on Deferred.
* Support for MSC2285 (hidden read receipts) (#10413)Šimon Brandner2021-07-281-0/+5
| | | Implementation of matrix-org/matrix-doc#2285
* Use inline type hints in various other places (in `synapse/`) (#10380)Jonathan de Jong2021-07-1511-61/+61
|
* MSC2918 Refresh tokens implementation (#9450)Quentin Gliech2021-06-241-1/+12
| | | | | | | | | | This implements refresh tokens, as defined by MSC2918 This MSC has been implemented client side in Hydrogen Web: vector-im/hydrogen-web#235 The basics of the MSC works: requesting refresh tokens on login, having the access tokens expire, and using the refresh token to get a new one. Signed-off-by: Quentin Gliech <quentingliech@gmail.com>
* update black to 21.6b0 (#10197)Marcus2021-06-171-1/+1
| | | | | Reformat all files with the new version. Signed-off-by: Marcus Hoffmann <bubu@bubu1.eu>
* Extend `ResponseCache` to pass a context object into the callback (#10157)Richard van der Hoff2021-06-142-4/+4
| | | | | This is the first of two PRs which seek to address #8518. This first PR lays the groundwork by extending ResponseCache; a second PR (#10158) will update the SyncHandler to actually use it, and fix the bug. The idea here is that we allow the callback given to ResponseCache.wrap to decide whether its result should be cached or not. We do that by (optionally) passing a ResponseCacheContext into it, which it can modify.
* Implement knock feature (#6739)Sorunome2021-06-091-0/+139
| | | | | | This PR aims to implement the knock feature as proposed in https://github.com/matrix-org/matrix-doc/pull/2403 Signed-off-by: Sorunome mail@sorunome.de Signed-off-by: Andrew Morgan andrewm@element.io
* Clean up the interface for injecting opentracing over HTTP (#10143)Richard van der Hoff2021-06-091-2/+3
| | | | | | | * Remove unused helper functions * Clean up the interface for injecting opentracing over HTTP * changelog
* Combine `LruCache.invalidate` and `invalidate_many` (#9973)Richard van der Hoff2021-05-271-1/+1
| | | | | | | | | | * Make `invalidate` and `invalidate_many` do the same thing ... so that we can do either over the invalidation replication stream, and also because they always confused me a bit. * Kill off `invalidate_many` * changelog
* Remove `keylen` from `LruCache`. (#9993)Richard van der Hoff2021-05-241-1/+1
| | | | | | | `keylen` seems to be a thing that is frequently incorrectly set, and we don't really need it. The only time it was used was to figure out if we had removed a subtree in `del_multi`, which we can do better by changing `TreeCache.pop` to return a different type (`TreeCacheNode`). Commits should be independently reviewable.
* Don't hammer the database for destination retry timings every ~5mins (#10036)Erik Johnston2021-05-211-21/+0
|
* Use a database table to hold the users that should have full presence sent ↵Andrew Morgan2021-05-181-2/+9
| | | | to them, instead of something in-memory (#9823)
* Add debug logging for issue #9533 (#9959)Richard van der Hoff2021-05-111-1/+0
| | | | | Hopefully this will help us track down where to-device messages are getting lost/delayed.
* Time external cache response time (#9904)Erik Johnston2021-05-041-10/+26
|
* Split presence out of master (#9820)Erik Johnston2021-04-234-58/+32
|
* Remove `synapse.types.Collection` (#9856)Richard van der Hoff2021-04-221-2/+1
| | | This is no longer required, since we have dropped support for Python 3.5.
* Merge branch 'master' into developAndrew Morgan2021-04-211-1/+1
|\
| * Stop BackgroundProcessLoggingContext making new prometheus timeseries (#9854)Richard van der Hoff2021-04-211-1/+1
| | | | | | | | This undoes part of b076bc276e881b262048307b6a226061d96c4a8d.
* | Merge branch 'master' into developAndrew Morgan2021-04-201-1/+1
|\|
| * Always use the name as the log ID. (#9829)Patrick Cloke2021-04-201-1/+1
| | | | | | | | | | As far as I can tell our logging contexts are meant to log the request ID, or sometimes the request ID followed by a suffix (this is generally stored in the name field of LoggingContext). There's also code to log the name@memory location, but I'm not sure this is ever used. This simplifies the code paths to require every logging context to have a name and use that in logging. For sub-contexts (created via nested_logging_contexts, defer_to_threadpool, Measure) we use the current context's str (which becomes their name or the string "sentinel") and then potentially modify that (e.g. add a suffix).
* | Add presence federation stream (#9819)Erik Johnston2021-04-203-3/+31
| |
* | Move some replication processing out of generic_worker (#9796)Erik Johnston2021-04-141-7/+224
| | | | | | Co-authored-by: Richard van der Hoff <1389908+richvdh@users.noreply.github.com>
* | Remove redundant "coding: utf-8" lines (#9786)Jonathan de Jong2021-04-1447-47/+0
|/ | | | | | | Part of #9744 Removes all redundant `# -*- coding: utf-8 -*-` lines from files, as python 3 automatically reads source code as utf-8 now. `Signed-off-by: Jonathan de Jong <jonathan@automatia.nl>`
* Record more information into structured logs. (#9654)Patrick Cloke2021-04-081-2/+3
| | | | Records additional request information into the structured logs, e.g. the requester, IP address, etc.
* Update mypy configuration: `no_implicit_optional = True` (#9742)Jonathan de Jong2021-04-051-1/+1
|
* Make RateLimiter class check for ratelimit overrides (#9711)Erik Johnston2021-03-301-1/+1
| | | | | | | This should fix a class of bug where we forget to check if e.g. the appservice shouldn't be ratelimited. We also check the `ratelimit_override` table to check if the user has ratelimiting disabled. That table is really only meant to override the event sender ratelimiting, so we don't use any values from it (as they might not make sense for different rate limits), but we do infer that if ratelimiting is disabled for the user we should disabled all ratelimits. Fixes #9663
* Add type hints for the federation sender. (#9681)Patrick Cloke2021-03-292-6/+14
| | | | Includes an abstract base class which both the FederationSender and the FederationRemoteSendQueue must implement.
* Make it possible to use dmypy (#9692)Erik Johnston2021-03-261-1/+1
| | | | | | | | | Running `dmypy run` will do a `mypy` check while spinning up a daemon that makes rerunning `dmypy run` a lot faster. `dmypy` doesn't support `follow_imports = silent` and has `local_partial_types` enabled, so this PR enables those options and fixes the issues that were newly raised. Note that `local_partial_types` will be enabled by default in upcoming mypy releases.
* Import HomeServer from the proper module. (#9665)Patrick Cloke2021-03-232-2/+2
|
* Fix up types for the typing handler. (#9638)Patrick Cloke2021-03-171-7/+10
| | | | By splitting this to two separate methods the callers know what methods they can expect on the handler.
* Prep work for removing `outlier` from `internal_metadata` (#9411)Richard van der Hoff2021-03-172-1/+6
| | | | | | | | | | | | * Populate `internal_metadata.outlier` based on `events` table Rather than relying on `outlier` being in the `internal_metadata` column, populate it based on the `events.outlier` column. * Move `outlier` out of InternalMetadata._dict Ultimately, this will allow us to stop writing it to the database. For now, we have to grandfather it back in so as to maintain compatibility with older versions of Synapse.
* Fix remaining mypy issues due to Twisted upgrade. (#9608)Patrick Cloke2021-03-153-3/+12
|
* Fix additional type hints from Twisted 21.2.0. (#9591)Patrick Cloke2021-03-123-38/+38
|
* Add logging for redis connection setup (#9590)Richard van der Hoff2021-03-111-0/+35
|
* Fix the auth provider on the logins metric (#9573)Richard van der Hoff2021-03-101-2/+2
| | | | | We either need to pass the auth provider over the replication api, or make sure we report the auth provider on the worker that received the request. I've gone with the latter.
* Add ResponseCache tests. (#9458)Jonathan de Jong2021-03-081-3/+6
|
* Create a SynapseReactor type which incorporates the necessary reactor ↵Patrick Cloke2021-03-081-1/+1
| | | | | interfaces. (#9528) This helps fix some type hints when running with Twisted 21.2.0.
* Fix additional type hints from Twisted upgrade. (#9518)Patrick Cloke2021-03-031-3/+1
|
* Bump the mypy and mypy-zope versions. (#9529)Patrick Cloke2021-03-031-1/+1
|
* Use the proper Request in type hints. (#9515)Patrick Cloke2021-03-011-5/+4
| | | | This also pins the Twisted version in the mypy job for CI until proper type hints are fixed throughout Synapse.
* Fix deleting pushers when using sharded pushers. (#9465)Erik Johnston2021-02-224-50/+74
|
* Add configs to make profile data more private (#9203)AndrewFerr2021-02-191-1/+2
| | | | | | | Add off-by-default configuration settings to: - disable putting an invitee's profile info in invite events - disable profile lookup via federation Signed-off-by: Andrew Ferrazzutti <fair@miscworks.net>
* Update black, and run auto formatting over the codebase (#9381)Eric Eastwood2021-02-1612-71/+62
| | | | | | | - Update black version to the latest - Run black auto formatting over the codebase - Run autoformatting according to [`docs/code_style.md `](https://github.com/matrix-org/synapse/blob/80d6dc9783aa80886a133756028984dbf8920168/docs/code_style.md) - Update `code_style.md` docs around installing black to use the correct version
* Ensure that we never stop reconnecting to redis (#9391)Erik Johnston2021-02-111-2/+24
|
* Precompute joined hosts and store in Redis (#9198)Erik Johnston2021-01-262-14/+106
|
* Periodically send pings to detect dead Redis connections (#9218)Erik Johnston2021-01-262-53/+98
| | | | | | | | This is done by creating a custom `RedisFactory` subclass that periodically pings all connections in its pool. We also ensure that the `replyTimeout` param is non-null, so that we timeout waiting for the reply to those pings (and thus triggering a reconnect).
* Allow moving account data and receipts streams off master (#9104)Erik Johnston2021-01-186-76/+217
|
* Enforce all replication HTTP clients calls use kwargs (#9144)Erik Johnston2021-01-181-1/+1
|
* Allow running sendToDevice on workers (#9044)Erik Johnston2021-01-072-31/+10
|
* Some cleanups to device inbox store. (#9041)Erik Johnston2021-01-071-8/+0
|
* Merge remote-tracking branch 'origin/erikj/as_mau_block' into developErik Johnston2020-12-181-2/+10
|\
| * Correctly handle AS registerations and add testErik Johnston2020-12-171-2/+10
| |
* | Convert internal pusher dicts to attrs classes. (#8940)Patrick Cloke2020-12-162-10/+27
| | | | | | This improves type hinting and should use less memory.
* | Various clean-ups to the logging context code (#8935)Patrick Cloke2020-12-141-2/+1
| |
* | Add authentication to replication endpoints. (#8853)Patrick Cloke2020-12-041-6/+41
|/ | | | Authentication is done by checking a shared secret provided in the Synapse configuration file.
* Add typing to membership Replication class methods (#8809)Andrew Morgan2020-11-271-22/+44
| | | | | This PR grew out of #6739, and adds typing to some method arguments You'll notice that there are a lot of `# type: ignores` in here. This is due to the base methods not matching the overloads here. This is necessary to stop mypy complaining, but a better solution is #8828.
* Generalise _maybe_store_room_on_invite (#8754)Andrew Morgan2020-11-131-5/+5
| | | | | | | | | There's a handy function called maybe_store_room_on_invite which allows us to create an entry in the rooms table for a room and its version for which we aren't joined to yet, but we can reference when ingesting events about. This is currently used for invites where we receive some stripped state about the room and pass it down via /sync to the client, without us being in the room yet. There is a similar requirement for knocking, where we will eventually do the same thing, and need an entry in the rooms table as well. Thus, reusing this function works, however its name needs to be generalised a bit. Separated out from #6739.
* Add ability for access tokens to belong to one user but grant access to ↵Erik Johnston2020-10-292-6/+3
| | | | | | | | | | another user. (#8616) We do it this way round so that only the "owner" can delete the access token (i.e. `/logout/all` by the "owner" also deletes that token, but `/logout/all` by the "target user" doesn't). A future PR will add an API for creating such a token. When the target user and authenticated entity are different the `Processed request` log line will be logged with a: `{@admin:server as @bob:server} ...`. I'm not convinced by that format (especially since it adds spaces in there, making it harder to use `cut -d ' '` to chop off the start of log lines). Suggestions welcome.
* Don't pull event from DB when handling replication traffic. (#8669)Erik Johnston2020-10-282-16/+25
| | | | | I was trying to make it so that we didn't have to start a background task when handling RDATA, but that is a bigger job (due to all the code in `generic_worker`). However I still think not pulling the event from the DB may help reduce some DB usage due to replication, even if most workers will simply go and pull that event from the DB later anyway. Co-authored-by: Patrick Cloke <clokep@users.noreply.github.com>
* Don't unnecessarily start bg process in replication sending loop. (#8670)Erik Johnston2020-10-271-0/+10
|
* Start fewer opentracing spans (#8640)Erik Johnston2020-10-261-1/+3
| | | | | | | #8567 started a span for every background process. This is good as it means all Synapse code that gets run should be in a span (unless in the sentinel logging context), but it means we generate about 15x the number of spans as we did previously. This PR attempts to reduce that number by a) not starting one for send commands to Redis, and b) deferring starting background processes until after we're sure they're necessary. I don't really know how much this will help.
* Replace DeferredCache with LruCache where possible (#8563)Richard van der Hoff2020-10-191-5/+5
| | | Most of these uses don't need a full-blown DeferredCache; LruCache is lighter and more appropriate.
* move DeferredCache into its own moduleRichard van der Hoff2020-10-141-1/+1
|
* Rename Cache->DeferredCacheRichard van der Hoff2020-10-141-3/+3
|
* Add some more type annotations to CacheRichard van der Hoff2020-10-141-1/+1
|
* Fix message duplication if something goes wrong after persisting the event ↵Erik Johnston2020-10-131-2/+14
| | | | | (#8476) Should fix #3365.
* Make event persisters periodically announce position over replication. (#8499)Erik Johnston2020-10-124-21/+90
| | | | | Currently background proccesses stream the events stream use the "minimum persisted position" (i.e. `get_current_token()`) rather than the vector clock style tokens. This is broadly fine as it doesn't matter if the background processes lag a small amount. However, in extreme cases (i.e. SyTests) where we only write to one event persister the background processes will never make progress. This PR changes it so that the `MultiWriterIDGenerator` keeps the current position of a given instance as up to date as possible (i.e using the latest token it sees if its not in the process of persisting anything), and then periodically announces that over replication. This then allows the "minimum persisted position" to advance, albeit with a small lag.
* Add type hints to response cache. (#8507)Patrick Cloke2020-10-091-1/+1
|
* Only send RDATA for instance local events. (#8496)Erik Johnston2020-10-092-6/+11
| | | | | When pulling events out of the DB to send over replication we were not filtering by instance name, and so we were sending events for other instances.
* Remove the deprecated Handlers object (#8494)Patrick Cloke2020-10-092-2/+2
| | | All handlers now available via get_*_handler() methods on the HomeServer.
* Add unit test for event persister sharding (#8433)Erik Johnston2020-10-022-4/+42
|
* Enable mypy checking for unreachable code and fix instances. (#8432)Patrick Cloke2020-10-011-4/+6
|
* Various clean ups to room stream tokens. (#8423)Erik Johnston2020-09-291-4/+2
|
* Add metrics to track success/otherwise of replication requests (#8406)Richard van der Hoff2020-09-291-12/+28
| | | One hope is that this might provide some insights into #3365.
* Fix MultiWriteIdGenerator's handling of restarts. (#8374)Erik Johnston2020-09-241-0/+2
| | | | | | | | | | | | | | | | | | | On startup `MultiWriteIdGenerator` fetches the maximum stream ID for each instance from the table and uses that as its initial "current position" for each writer. This is problematic as a) it involves either a scan of events table or an index (neither of which is ideal), and b) if rows are being persisted out of order elsewhere while the process restarts then using the maximum stream ID is not correct. This could theoretically lead to race conditions where e.g. events that are persisted out of order are not sent down sync streams. We fix this by creating a new table that tracks the current positions of each writer to the stream, and update it each time we finish persisting a new entry. This is a relatively small overhead when persisting events. However for the cache invalidation stream this is a much bigger relative overhead, so instead we note that for invalidation we don't actually care about reliability over restarts (as there's no caches to invalidate) and simply don't bother reading and writing to the new table in that particular case.
* Add EventStreamPosition type (#8388)Erik Johnston2020-09-241-3/+9
| | | | | | | | | | | | | | The idea is to remove some of the places we pass around `int`, where it can represent one of two things: 1. the position of an event in the stream; or 2. a token that partitions the stream, used as part of the stream tokens. The valid operations are then: 1. did a position happen before or after a token; 2. get all events that happened before or after a token; and 3. get all events between two tokens. (Note that we don't want to allow other operations as we want to change the tokens to be vector clocks rather than simple ints)
* Simplify super() calls to Python 3 syntax. (#8344)Patrick Cloke2020-09-1819-25/+25
| | | | | | | This converts calls like super(Foo, self) -> super(). Generated with: sed -i "" -Ee 's/super\([^\(]+\)/super()/g' **/*.py
* Switch metaclass initialization to python 3-compatible syntax (#8326)Jonathan de Jong2020-09-161-3/+1
|
* Use slots in attrs classes where possible (#8296)Patrick Cloke2020-09-141-2/+2
| | | | | slots use less memory (and attribute access is faster) while slightly limiting the flexibility of the class attributes. This focuses on objects which are instantiated "often" and for short periods of time.
* Fix typos in comments.Patrick Cloke2020-09-141-1/+1
|
* Add experimental support for sharding event persister. Again. (#8294)Erik Johnston2020-09-143-6/+12
| | | | | | This is *not* ready for production yet. Caveats: 1. We should write some tests... 2. The stream token that we use for events can get stalled at the minimum position of all writers. This means that new events may not be processed and e.g. sent down sync streams if a writer isn't writing or is slow.
* Clean up `Notifier.on_new_room_event` code path (#8288)Erik Johnston2020-09-101-6/+3
| | | | | | | | | | | | | The idea here is that we pass the `max_stream_id` to everything, and only use the stream ID of the particular event to figure out *when* the max stream position has caught up to the event and we can notify people about it. This is to maintain the distinction between the position of an item in the stream (i.e. event A has stream ID 513) and a token that can be used to partition the stream (i.e. give me all events after stream ID 352). This distinction becomes important when the tokens are more complicated than a single number, which they will be once we start tracking the position of multiple writers in the tokens. The valid operations here are: 1. Is a position before or after a token 2. Fetching all events between two tokens 3. Merging multiple tokens to get the "max", i.e. `C = max(A, B)` means that for all positions P where P is before A *or* before B, then P is before C. Future PR will change the token type to a dedicated type.
* Remove some unused distributor signals (#8216)Patrick Cloke2020-09-091-6/+4
| | | | | Removes the `user_joined_room` and stops calling it since there are no observers. Also cleans-up some other unused signals and related code.
* Fixup pusher pool notifications (#8287)Erik Johnston2020-09-091-1/+2
| | | | | `pusher_pool.on_new_notifications` expected a min and max stream ID, however that was not what we were passing in. Instead, let's just pass it the current max stream ID and have it track the last stream ID it got passed. I believe that it mostly worked as we called the function for every event. However, it would break for events that got persisted out of order, i.e, that were persisted but the max stream ID wasn't incremented as not all preceding events had finished persisting, and push for that event would be delayed until another event got pushed to the effected users.
* Revert "Fixup pusher pool notifications"Erik Johnston2020-09-091-2/+1
| | | | This reverts commit e7fd336a53a4ca489cdafc389b494d5477019dc0.
* Fixup pusher pool notificationsErik Johnston2020-09-091-1/+2
|
* Stop sub-classing object (#8249)Patrick Cloke2020-09-046-7/+7
|
* Revert "Add experimental support for sharding event persister. (#8170)" (#8242)Brendan Abolivier2020-09-043-12/+6
| | | | | | | * Revert "Add experimental support for sharding event persister. (#8170)" This reverts commit 82c1ee1c22a87b9e6e3179947014b0f11c0a1ac3. * Changelog
* Add experimental support for sharding event persister. (#8170)Erik Johnston2020-09-023-6/+12
| | | | | | This is *not* ready for production yet. Caveats: 1. We should write some tests... 2. The stream token that we use for events can get stalled at the minimum position of all writers. This means that new events may not be processed and e.g. sent down sync streams if a writer isn't writing or is slow.
* Move and rename `get_devices_with_keys_by_user` (#8204)Richard van der Hoff2020-09-011-0/+3
| | | | | | | | | | | | | | | | | | * Move `get_devices_with_keys_by_user` to `EndToEndKeyWorkerStore` this seems a better fit for it. This commit simply moves the existing code: no other changes at all. * Rename `get_devices_with_keys_by_user` to better reflect what it does. * get_device_stream_token abstract method To avoid referencing fields which are declared in the derived classes, make `get_device_stream_token` abstract, and define that in the classes which define `_device_list_id_gen`.
* Fix `wait_for_stream_position` for multiple waiters. (#8196)Erik Johnston2020-08-281-4/+2
| | | | | | This fixes a bug where having multiple callers waiting on the same stream and position will cause it to try and compare two deferreds, which fails (due to the sorted list having an entry of `Tuple[int, Deferred]`).
* Make SlavedIdTracker.advance have same interface as MultiWriterIDGenerator ↵Erik Johnston2020-08-2610-13/+13
| | | | (#8171)
* Remove `ChainedIdGenerator`. (#8123)Erik Johnston2020-08-192-7/+5
| | | | | It's just a thin wrapper around two ID gens to make `get_current_token` and `get_next` return tuples. This can easily be replaced by calling the appropriate methods on the underlying ID gens directly.
* Be stricter about JSON that is accepted by Synapse (#8106)Patrick Cloke2020-08-191-7/+5
|
* Separate `get_current_token` into two. (#8113)Erik Johnston2020-08-192-1/+9
| | | | | | | | | | | | The function is used for two purposes: 1) for subscribers of streams to get a token they can use to get further updates with, and 2) for replication to track position of the writers of the stream. For streams with a single writer the two scenarios produce the same result, however the situation becomes complicated for streams with multiple writers. The current `MultiWriterIdGenerator` does not correctly handle the first case (which is not an issue as its only used for the `caches` stream which nothing subscribes to outside of replication).
* Add a shadow-banned flag to users. (#8092)Patrick Cloke2020-08-141-0/+4
|
* Reduce unnecessary whitespace in JSON. (#7372)David Vo2020-08-071-2/+3
|
* Convert synapse.api to async/await (#8031)Patrick Cloke2020-08-061-1/+1
|
* Rename database classes to make some sense (#8033)Erik Johnston2020-08-0519-54/+54
|
* Convert replication code to async/await. (#7987)Patrick Cloke2020-08-039-37/+27
|
* Merge tag 'v1.18.0rc2' into developRichard van der Hoff2020-07-284-87/+112
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Synapse 1.18.0rc2 (2020-07-28) ============================== Bugfixes -------- - Fix an `AssertionError` exception introduced in v1.18.0rc1. ([\#7876](https://github.com/matrix-org/synapse/issues/7876)) - Fix experimental support for moving typing off master when worker is restarted, which is broken in v1.18.0rc1. ([\#7967](https://github.com/matrix-org/synapse/issues/7967)) Internal Changes ---------------- - Further optimise queueing of inbound replication commands. ([\#7876](https://github.com/matrix-org/synapse/issues/7876))
| * Typing worker needs to handle stream update requests (#7967)Erik Johnston2020-07-281-1/+1
| | | | | | | | | | IIRC this doesn't break tests because its only hit on reconnection, or something. Basically, when a process needs to fetch missing updates for the `typing` stream it needs to query the writer instance via HTTP (as we don't write typing notifications to the DB), the problem was that the endpoint (`streams`) was only registered on master and specifically not on the typing writer worker.
| * Handle replication commands synchronously where possible (#7876)Richard van der Hoff2020-07-273-86/+111
| | | | | | Most of the stuff we do for replication commands can be done synchronously. There's no point spinning up background processes if we're not going to need them.
* | Convert a synapse.events to async/await. (#7949)Patrick Cloke2020-07-272-2/+4
|/
* Fix typing replication not being handled on master (#7959)Erik Johnston2020-07-271-0/+8
| | | | | | | | | | | | | | | | Handling of incoming typing stream updates from replication was not hooked up on master, effecting set ups where typing was handled on a different worker. This is really only a problem if the master process is also handling sync requests, which is unlikely for those that are at the stage of moving typing off. The other observable effect is that if a worker restarts or a replication connect drops then the typing worker will issue a `POSITION typing`, triggering master process to try and stream *all* typing updates from position 0. Fixes #7907
* Remove an unused prometheus metric (#7878)Richard van der Hoff2020-07-221-3/+1
|
* Track command processing as a background process (#7879)Richard van der Hoff2020-07-222-3/+38
| | | | I'm going to be doing more stuff synchronously, and I don't want to lose the CPU metrics down the sofa.
* Fix deprecation warning: import ABC from collections.abc (#7892)Karthikeyan Singaravelan2020-07-201-1/+1
|
* Stop using 'device_max_stream_id' (#7882)Erik Johnston2020-07-171-1/+1
| | | | | It serves no purpose and updating everytime we write to the device inbox stream means all such transactions will conflict, causing lots of transaction failures and retries.
* Optimise queueing of inbound replication commands (#7861)Richard van der Hoff2020-07-161-116/+215
| | | | | | | | | | | When we get behind on replication, we tend to stack up background processes behind a linearizer. Bg processes are heavy (particularly with respect to prometheus metrics) and linearizers aren't terribly efficient once the queue gets long either. A better approach is to maintain a queue of requests to be processed, and nominate a single process to work its way through the queue. Fixes: #7444
* Allow moving typing off master (#7869)Erik Johnston2020-07-162-3/+13
|
* Add ability to shard the federation sender (#7798)Erik Johnston2020-07-102-6/+8
|
* Fix some spelling mistakes / typos. (#7811)Patrick Cloke2020-07-096-7/+7
|
* Generate real events when we reject invites (#7804)Richard van der Hoff2020-07-091-67/+25
| | | | | | | | Fixes #2181. The basic premise is that, when we fail to reject an invite via the remote server, we can generate our own out-of-band leave event and persist it as an outlier, so that we have something to send to the client.
* Do not use simplejson in Synapse. (#7800)Patrick Cloke2020-07-081-9/+2
|
* Refactor getting replication updates from database v2. (#7740)Erik Johnston2020-07-071-46/+10
|
* isort 5 compatibility (#7786)Will Hunt2020-07-053-5/+3
| | | The CI appears to use the latest version of isort, which is a problem when isort gets a major version bump. Rather than try to pin the version, I've done the necessary to make isort5 happy with synapse.
* Merge different Resource implementation classes (#7732)Erik Johnston2020-07-032-10/+4
|
* Use symbolic names for replication stream names (#7768)Richard van der Hoff2020-07-018-17/+17
| | | This makes it much easier to find where streams are referenced.
* Refactor getting replication updates from database. (#7636)Erik Johnston2020-06-161-21/+8
| | | The aim here is to make it easier to reason about when streams are limited and when they're not, by moving the logic into the database functions themselves. This should mean we can kill of `db_query_to_update_function` function.
* Replace all remaining six usage with native Python 3 equivalents (#7704)Dagfinn Ilmari Mannsåker2020-06-161-4/+2
|
* Discard RDATA from already seen positions. (#7648)Patrick Cloke2020-06-152-6/+28
|
* Fix bug in account data replication stream. (#7656)Erik Johnston2020-06-092-3/+15
| | | | | | | | | | | | | | | | | | | | | | | | | | * Ensure account data stream IDs are unique. The account data stream is shared between three tables, and the maximum allocated ID was tracked in a dedicated table. Updating the max ID happened outside the transaction that allocated the ID, leading to a race where if the server was restarted then the same ID could be allocated but the max ID failed to be updated, leading it to be reused. The ID generators have support for tracking across multiple tables, so we may as well use that instead of a dedicated table. * Fix bug in account data replication stream. If the same stream ID was used in both global and room account data then the getting updates for the replication stream would fail due to `heapq.merge(..)` trying to compare a `str` with a `None`. (This is because you'd have two rows like `(534, '!room')` and `(534, None)` from the room and global account data tables). Fix is just to order by stream ID, since we don't rely on the ordering beyond that. The bug where stream IDs can be reused should be fixed now, so this case shouldn't happen going forward. Fixes #7617
* Typo fixes.Patrick Cloke2020-06-051-1/+1
|
* Ensure ReplicationStreamer is always started when replication enabled. (#7579)Erik Johnston2020-05-271-0/+3
| | | Fixes #7566.
* Add option to move event persistence off master (#7517)Erik Johnston2020-05-225-2/+171
|
* Add ability to wait for replication streams (#7542)Erik Johnston2020-05-225-18/+108
| | | | | | | The idea here is that if an instance persists an event via the replication HTTP API it can return before we receive that event over replication, which can lead to races where code assumes that persisting an event immediately updates various caches (e.g. current state of the room). Most of Synapse doesn't hit such races, so we don't do the waiting automagically, instead we do so where necessary to avoid unnecessary delays. We may decide to change our minds here if it turns out there are a lot of subtle races going on. People probably want to look at this commit by commit.