summary refs log tree commit diff
path: root/synapse/replication (follow)
Commit message (Collapse)AuthorAgeFilesLines
* Remove old empty/redundant slaved stores. (#13349)Nick Mills-Barrett2022-07-217-142/+0
|
* Use cache store remove base slaved (#13329)Nick Mills-Barrett2022-07-2111-83/+10
| | | This comes from two identical definitions in each of the base stores, and means the base slaved store is now empty and can be removed.
* Add type annotations to `trace` decorator. (#13328)Patrick Cloke2022-07-191-2/+2
| | | | Functions that are decorated with `trace` are now properly typed and the type hints for them are fixed.
* Rate limit joins per-room (#13276)David Robertson2022-07-192-1/+17
|
* Revert "Make all `process_replication_rows` methods async (#13304)" (#13312)Erik Johnston2022-07-184-16/+8
| | | This reverts commit 5d4028f217f178fcd384d5bfddd92225b4e78c51.
* Make all `process_replication_rows` methods async (#13304)Nick Mills-Barrett2022-07-174-8/+16
| | | | | More prep work for asyncronous caching, also makes all process_replication_rows methods consistent (presence handler already is so). Signed off by Nick @ Beeper (@Fizzadar)
* Faster room joins: fix race in recalculation of current room state (#13151)Sean Quah2022-07-072-0/+77
| | | | | | | | | | | Bounce recalculation of current state to the correct event persister and move recalculation of current state into the event persistence queue, to avoid concurrent updates to a room's current state. Also give recalculation of a room's current state a real stream ordering. Signed-off-by: Sean Quah <seanq@matrix.org>
* Handle race between persisting an event and un-partial stating a room (#13100)Sean Quah2022-07-052-0/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Whenever we want to persist an event, we first compute an event context, which includes the state at the event and a flag indicating whether the state is partial. After a lot of processing, we finally try to store the event in the database, which can fail for partial state events when the containing room has been un-partial stated in the meantime. We detect the race as a foreign key constraint failure in the data store layer and turn it into a special `PartialStateConflictError` exception, which makes its way up to the method in which we computed the event context. To make things difficult, the exception needs to cross a replication request: `/fed_send_events` for events coming over federation and `/send_event` for events from clients. We transport the `PartialStateConflictError` as a `409 Conflict` over replication and turn `409`s back into `PartialStateConflictError`s on the worker making the request. All client events go through `EventCreationHandler.handle_new_client_event`, which is called in *a lot* of places. Instead of trying to update all the code which creates client events, we turn the `PartialStateConflictError` into a `429 Too Many Requests` in `EventCreationHandler.handle_new_client_event` and hope that clients take it as a hint to retry their request. On the federation event side, there are 7 places which compute event contexts. 4 of them use outlier event contexts: `FederationEventHandler._auth_and_persist_outliers_inner`, `FederationHandler.do_knock`, `FederationHandler.on_invite_request` and `FederationHandler.do_remotely_reject_invite`. These events won't have the partial state flag, so we do not need to do anything for then. The remaining 3 paths which create events are `FederationEventHandler.process_remote_join`, `FederationEventHandler.on_send_membership_event` and `FederationEventHandler._process_received_pdu`. We can't experience the race in `process_remote_join`, unless we're handling an additional join into a partial state room, which currently blocks, so we make no attempt to handle it correctly. `on_send_membership_event` is only called by `FederationServer._on_send_membership_event`, so we catch the `PartialStateConflictError` there and retry just once. `_process_received_pdu` is called by `on_receive_pdu` for incoming events and `_process_pulled_event` for backfill. The latter should never try to persist partial state events, so we ignore it. We catch the `PartialStateConflictError` in `on_receive_pdu` and retry just once. Refering to the graph of code paths in https://github.com/matrix-org/synapse/issues/12988#issuecomment-1156857648 may make the above make more sense. Signed-off-by: Sean Quah <seanq@matrix.org>
* Type annotations in `synapse.databases.main.devices` (#13025)David Robertson2022-06-151-2/+1
| | | Co-authored-by: Patrick Cloke <clokep@users.noreply.github.com>
* Remove groups replication code. (#12900)Patrick Cloke2022-05-314-86/+0
| | | | The replication logic for groups is no longer used, so the message passing infrastructure can be removed.
* Rename storage classes (#12913)Erik Johnston2022-05-312-4/+6
|
* Send `USER_IP` commands on a different Redis channel, in order to reduce ↵reivilibre2022-05-202-3/+15
| | | | traffic to workers that do not process these commands. (#12809)
* Lay some foundation work to allow workers to only subscribe to some kinds of ↵reivilibre2022-05-192-12/+57
| | | | messages, reducing replication traffic. (#12672)
* Add `StreamKeyType` class and replace string literals with constants (#12567)Andrew Morgan2022-05-161-7/+11
|
* Respect the `@cancellable` flag for `ReplicationEndpoint`s (#12700)Sean Quah2022-05-111-2/+19
| | | | | | | | | While `ReplicationEndpoint`s register themselves via `JsonResource`, they pass a method that calls the handler, instead of the handler itself, to `register_paths`. As a result, `JsonResource` will not correctly pick up the `@cancellable` flag and we have to apply it ourselves. Signed-off-by: Sean Quah <seanq@element.io>
* Update `replication.md` with info on TCP module structure (#12621)Shay2022-05-091-1/+1
|
* Update `_on_new_receipts()` to work with MSC2285 changes. (#12636)Šimon Brandner2022-05-051-5/+3
|
* Reduce log spam when running multiple event persisters (#12610)Erik Johnston2022-05-052-2/+16
|
* Add opentracing spans to calls to external cache (#12380)Erik Johnston2022-04-071-11/+20
|
* Refactor and convert `Linearizer` to async (#12357)Sean Quah2022-04-051-1/+1
| | | | | | | | | | | Refactor and convert `Linearizer` to async. This makes a `Linearizer` cancellation bug easier to fix. Also refactor to use an async context manager, which eliminates an unlikely footgun where code that doesn't immediately use the context manager could forget to release the lock. Signed-off-by: Sean Quah <seanq@element.io>
* Prefill more stream change caches. (#12372)Erik Johnston2022-04-051-23/+2
|
* Prefill the device_list_stream_cache (#12367)Erik Johnston2022-04-041-1/+11
| | | | | | | * Prefill the device_list_stream_cache * Newsfile * Newsfile
* Track device list updates per room. (#12321)Erik Johnston2022-04-041-0/+1
| | | | | | | | | | | | | | This is a first step in dealing with #7721. The idea is basically that rather than calculating the full set of users a device list update needs to be sent to up front, we instead simply record the rooms the user was in at the time of the change. This will allow a few things: 1. we can defer calculating the set of remote servers that need to be poked about the change; and 2. during `/sync` and `/keys/changes` we can avoid also avoid calculating users who share rooms with other users, and instead just look at the rooms that have changed. However, care needs to be taken to correctly handle server downgrades. As such this PR writes to both `device_lists_changes_in_room` and the `device_lists_outbound_pokes` table synchronously. In a future release we can then bump the database schema compat version to `69` and then we can assume that the new `device_lists_changes_in_room` exists and is handled. There is a temporary option to disable writing to `device_lists_outbound_pokes` synchronously, allowing us to test the new code path does work (and by implication upgrading to a future release and downgrading to this one will work correctly). Note: Ideally we'd do the calculation of room to servers on a worker (e.g. the background worker), but currently only master can write to the `device_list_outbound_pokes` table.
* Move `update_client_ip` background job from the main process to the ↵reivilibre2022-04-013-73/+42
| | | | background worker. (#12251)
* Bump `black` and `click` versions (#12320)David Robertson2022-03-291-1/+1
|
* Improve code documentation for the typing stream over replication. (#12211)reivilibre2022-03-113-4/+16
|
* Rename get_tcp_replication to get_replication_command_handler. (#12192)Patrick Cloke2022-03-105-8/+8
| | | | | | Since the object it returns is a ReplicationCommandHandler. This is clean-up from adding support to Redis where the command handler was added as an additional layer of abstraction from the TCP protocol.
* Retry some http replication failures (#12182)Nick Mills-Barrett2022-03-091-11/+36
| | | | | | | | This allows for the target process to be down for around a minute which provides time for restarts during synapse upgrades/config updates. Closes: #12178 Signed off by Nick Mills-Barrett nick@beeper.com
* Fix incorrect type hints for txredis. (#12042)Patrick Cloke2022-03-082-5/+5
| | | | Some properties were marked as RedisProtocol instead of ConnectionHandler, which wraps RedisProtocol instance(s).
* Spread out sending device lists to remote hosts (#12132)Erik Johnston2022-03-041-1/+1
|
* Remove `HomeServer.get_datastore()` (#12031)Richard van der Hoff2022-02-2310-31/+31
| | | | | | | The presence of this method was confusing, and mostly present for backwards compatibility. Let's get rid of it. Part of #11733
* Better error message when failing to request from another process (#12060)Erik Johnston2022-02-221-1/+3
|
* Add missing type hints to synapse.replication. (#11938)Patrick Cloke2022-02-0814-141/+196
|
* Remove unnecessary ignores due to Twisted upgrade. (#11939)Patrick Cloke2022-02-082-3/+3
| | | | Twisted 22.1.0 fixed some internal type hints, allowing Synapse to remove ignore calls for parameters to connectTCP.
* Add missing type hints to synapse.replication.http. (#11856)Patrick Cloke2022-02-0812-162/+257
|
* Stop reading from `event_reference_hashes` (#11794)Richard van der Hoff2022-01-211-1/+1
| | | | Preparation for dropping this table altogether. Part of #6574.
* Use auto_attribs/native type hints for attrs classes. (#11692)Patrick Cloke2022-01-131-17/+17
|
* Remove redundant `get_current_events_token` (#11643)Richard van der Hoff2022-01-041-9/+0
| | | | | | | | | | | | | | | | | * Push `get_room_{min,max_stream_ordering}` into StreamStore Both implementations of this are identical, so we may as well push it down and get rid of the abstract base class nonsense. * Remove redundant `StreamStore` class This is empty now * Remove redundant `get_current_events_token` This was an exact duplicate of `get_room_max_stream_ordering`, so let's get rid of it. * newsfile
* Convert all namedtuples to attrs. (#11665)Patrick Cloke2021-12-302-70/+74
| | | To improve type hints throughout the code.
* Type hint the constructors of the data store classes (#11555)Sean Quah2021-12-136-12/+42
|
* Save the OIDC session ID (sid) with the device on login (#11482)Quentin Gliech2021-12-061-0/+8
| | | As a step towards allowing back-channel logout for OIDC.
* Add type hints to `synapse/storage/databases/main/events_worker.py` (#11411)Sean Quah2021-11-263-19/+13
| | | | Also refactor the stream ID trackers/generators a bit and try to document them better.
* Add missing type hints to `synapse.app`. (#11287)Patrick Cloke2021-11-101-2/+2
|
* Enable passing typing stream writers as a list. (#11237)Nick Barrett2021-11-032-3/+2
| | | | This makes the typing stream writer config match the other stream writers that only currently support a single worker.
* Implement an `on_new_event` callback (#11126)Brendan Abolivier2021-10-261-1/+2
| | | Co-authored-by: Andrew Morgan <1342360+anoadragon453@users.noreply.github.com>
* Add type hints for most `HomeServer` parameters (#11095)Sean Quah2021-10-2222-53/+130
|
* Fix logging context warnings when losing replication connection (#10984)Sean Quah2021-10-152-10/+26
| | | | | | Instead of triggering `__exit__` manually on the replication handler's logging context, use it as a context manager so that there is an `__enter__` call to balance the `__exit__`.
* Fix opentracing and Prometheus metrics for replication requests (#10996)Sean Quah2021-10-121-76/+78
| | | | | | | | | | | | | | | | | | | | | | | | This commit fixes two bugs to do with decorators not instrumenting `ReplicationEndpoint`'s `send_request` correctly. There are two decorators on `send_request`: Prometheus' `Gauge.track_inprogress()` and Synapse's `opentracing.trace`. `Gauge.track_inprogress()` does not have any support for async functions when used as a decorator. Since async functions behave like regular functions that return coroutines, only the creation of the coroutine was covered by the metric and none of the actual body of `send_request`. `Gauge.track_inprogress()` returns a regular, non-async function wrapping `send_request`, which is the source of the next bug. The `opentracing.trace` decorator would normally handle async functions correctly, but since the wrapped `send_request` is a non-async function, the decorator ends up suffering from the same issue as `Gauge.track_inprogress()`: the opentracing span only measures the creation of the coroutine and none of the actual function body. Using `Gauge.track_inprogress()` as a context manager instead of a decorator resolves both bugs.
* Annotate synapse.storage.util (#10892)David Robertson2021-10-082-5/+9
| | | | | Also mark `synapse.streams` as having has no untyped defs Co-authored-by: Sean Quah <8349537+squahtx@users.noreply.github.com>
* Require direct references to configuration variables. (#10985)Patrick Cloke2021-10-062-3/+6
| | | | | | This removes the magic allowing accessing configurable variables directly from the config object. It is now required that a specific configuration class is used (e.g. `config.foo` must be replaced with `config.server.foo`).
* Pass str to twisted's IReactorTCP (#10895)David Robertson2021-09-302-3/+13
| | | | | | | This follows a correction made in twisted/twisted#1664 and should fix our Twisted Trial CI job. Until that change is in a twisted release, we'll have to ignore the type of the `host` argument. I've raised #10899 to remind us to review the issue in a few months' time.
* Use direct references for configuration variables (part 6). (#10916)Patrick Cloke2021-09-291-1/+1
|
* Use direct references for configuration variables (part 5). (#10897)Patrick Cloke2021-09-242-4/+4
|
* Use direct references for some configuration variables (#10798)Patrick Cloke2021-09-134-5/+5
| | | | Instead of proxying through the magic getter of the RootConfig object. This should be more performant (and is more explicit).
* Split `FederationHandler` in half (#10692)Richard van der Hoff2021-08-261-2/+2
| | | The idea here is to take anything to do with incoming events and move it out to a separate handler, as a way of making FederationHandler smaller.
* Remove the unused public_room_list_stream (#10565)Andrew Morgan2021-08-173-65/+0
| | | Co-authored-by: Patrick Cloke <clokep@users.noreply.github.com>
* Fix up type hints for Twisted 21.7 (#10490)Richard van der Hoff2021-07-281-1/+1
| | | Mostly this involves decorating a few Deferred declarations with extra type hints. We wrap the types in quotes to avoid runtime errors when running against older versions of Twisted that don't have generics on Deferred.
* Support for MSC2285 (hidden read receipts) (#10413)Šimon Brandner2021-07-281-0/+5
| | | Implementation of matrix-org/matrix-doc#2285
* Use inline type hints in various other places (in `synapse/`) (#10380)Jonathan de Jong2021-07-1511-61/+61
|
* MSC2918 Refresh tokens implementation (#9450)Quentin Gliech2021-06-241-1/+12
| | | | | | | | | | This implements refresh tokens, as defined by MSC2918 This MSC has been implemented client side in Hydrogen Web: vector-im/hydrogen-web#235 The basics of the MSC works: requesting refresh tokens on login, having the access tokens expire, and using the refresh token to get a new one. Signed-off-by: Quentin Gliech <quentingliech@gmail.com>
* update black to 21.6b0 (#10197)Marcus2021-06-171-1/+1
| | | | | Reformat all files with the new version. Signed-off-by: Marcus Hoffmann <bubu@bubu1.eu>
* Extend `ResponseCache` to pass a context object into the callback (#10157)Richard van der Hoff2021-06-142-4/+4
| | | | | This is the first of two PRs which seek to address #8518. This first PR lays the groundwork by extending ResponseCache; a second PR (#10158) will update the SyncHandler to actually use it, and fix the bug. The idea here is that we allow the callback given to ResponseCache.wrap to decide whether its result should be cached or not. We do that by (optionally) passing a ResponseCacheContext into it, which it can modify.
* Implement knock feature (#6739)Sorunome2021-06-091-0/+139
| | | | | | This PR aims to implement the knock feature as proposed in https://github.com/matrix-org/matrix-doc/pull/2403 Signed-off-by: Sorunome mail@sorunome.de Signed-off-by: Andrew Morgan andrewm@element.io
* Clean up the interface for injecting opentracing over HTTP (#10143)Richard van der Hoff2021-06-091-2/+3
| | | | | | | * Remove unused helper functions * Clean up the interface for injecting opentracing over HTTP * changelog
* Combine `LruCache.invalidate` and `invalidate_many` (#9973)Richard van der Hoff2021-05-271-1/+1
| | | | | | | | | | * Make `invalidate` and `invalidate_many` do the same thing ... so that we can do either over the invalidation replication stream, and also because they always confused me a bit. * Kill off `invalidate_many` * changelog
* Remove `keylen` from `LruCache`. (#9993)Richard van der Hoff2021-05-241-1/+1
| | | | | | | `keylen` seems to be a thing that is frequently incorrectly set, and we don't really need it. The only time it was used was to figure out if we had removed a subtree in `del_multi`, which we can do better by changing `TreeCache.pop` to return a different type (`TreeCacheNode`). Commits should be independently reviewable.
* Don't hammer the database for destination retry timings every ~5mins (#10036)Erik Johnston2021-05-211-21/+0
|
* Use a database table to hold the users that should have full presence sent ↵Andrew Morgan2021-05-181-2/+9
| | | | to them, instead of something in-memory (#9823)
* Add debug logging for issue #9533 (#9959)Richard van der Hoff2021-05-111-1/+0
| | | | | Hopefully this will help us track down where to-device messages are getting lost/delayed.
* Time external cache response time (#9904)Erik Johnston2021-05-041-10/+26
|
* Split presence out of master (#9820)Erik Johnston2021-04-234-58/+32
|
* Remove `synapse.types.Collection` (#9856)Richard van der Hoff2021-04-221-2/+1
| | | This is no longer required, since we have dropped support for Python 3.5.
* Merge branch 'master' into developAndrew Morgan2021-04-211-1/+1
|\
| * Stop BackgroundProcessLoggingContext making new prometheus timeseries (#9854)Richard van der Hoff2021-04-211-1/+1
| | | | | | | | This undoes part of b076bc276e881b262048307b6a226061d96c4a8d.
* | Merge branch 'master' into developAndrew Morgan2021-04-201-1/+1
|\|
| * Always use the name as the log ID. (#9829)Patrick Cloke2021-04-201-1/+1
| | | | | | | | | | As far as I can tell our logging contexts are meant to log the request ID, or sometimes the request ID followed by a suffix (this is generally stored in the name field of LoggingContext). There's also code to log the name@memory location, but I'm not sure this is ever used. This simplifies the code paths to require every logging context to have a name and use that in logging. For sub-contexts (created via nested_logging_contexts, defer_to_threadpool, Measure) we use the current context's str (which becomes their name or the string "sentinel") and then potentially modify that (e.g. add a suffix).
* | Add presence federation stream (#9819)Erik Johnston2021-04-203-3/+31
| |
* | Move some replication processing out of generic_worker (#9796)Erik Johnston2021-04-141-7/+224
| | | | | | Co-authored-by: Richard van der Hoff <1389908+richvdh@users.noreply.github.com>
* | Remove redundant "coding: utf-8" lines (#9786)Jonathan de Jong2021-04-1447-47/+0
|/ | | | | | | Part of #9744 Removes all redundant `# -*- coding: utf-8 -*-` lines from files, as python 3 automatically reads source code as utf-8 now. `Signed-off-by: Jonathan de Jong <jonathan@automatia.nl>`
* Record more information into structured logs. (#9654)Patrick Cloke2021-04-081-2/+3
| | | | Records additional request information into the structured logs, e.g. the requester, IP address, etc.
* Update mypy configuration: `no_implicit_optional = True` (#9742)Jonathan de Jong2021-04-051-1/+1
|
* Make RateLimiter class check for ratelimit overrides (#9711)Erik Johnston2021-03-301-1/+1
| | | | | | | This should fix a class of bug where we forget to check if e.g. the appservice shouldn't be ratelimited. We also check the `ratelimit_override` table to check if the user has ratelimiting disabled. That table is really only meant to override the event sender ratelimiting, so we don't use any values from it (as they might not make sense for different rate limits), but we do infer that if ratelimiting is disabled for the user we should disabled all ratelimits. Fixes #9663
* Add type hints for the federation sender. (#9681)Patrick Cloke2021-03-292-6/+14
| | | | Includes an abstract base class which both the FederationSender and the FederationRemoteSendQueue must implement.
* Make it possible to use dmypy (#9692)Erik Johnston2021-03-261-1/+1
| | | | | | | | | Running `dmypy run` will do a `mypy` check while spinning up a daemon that makes rerunning `dmypy run` a lot faster. `dmypy` doesn't support `follow_imports = silent` and has `local_partial_types` enabled, so this PR enables those options and fixes the issues that were newly raised. Note that `local_partial_types` will be enabled by default in upcoming mypy releases.
* Import HomeServer from the proper module. (#9665)Patrick Cloke2021-03-232-2/+2
|
* Fix up types for the typing handler. (#9638)Patrick Cloke2021-03-171-7/+10
| | | | By splitting this to two separate methods the callers know what methods they can expect on the handler.
* Prep work for removing `outlier` from `internal_metadata` (#9411)Richard van der Hoff2021-03-172-1/+6
| | | | | | | | | | | | * Populate `internal_metadata.outlier` based on `events` table Rather than relying on `outlier` being in the `internal_metadata` column, populate it based on the `events.outlier` column. * Move `outlier` out of InternalMetadata._dict Ultimately, this will allow us to stop writing it to the database. For now, we have to grandfather it back in so as to maintain compatibility with older versions of Synapse.
* Fix remaining mypy issues due to Twisted upgrade. (#9608)Patrick Cloke2021-03-153-3/+12
|
* Fix additional type hints from Twisted 21.2.0. (#9591)Patrick Cloke2021-03-123-38/+38
|
* Add logging for redis connection setup (#9590)Richard van der Hoff2021-03-111-0/+35
|
* Fix the auth provider on the logins metric (#9573)Richard van der Hoff2021-03-101-2/+2
| | | | | We either need to pass the auth provider over the replication api, or make sure we report the auth provider on the worker that received the request. I've gone with the latter.
* Add ResponseCache tests. (#9458)Jonathan de Jong2021-03-081-3/+6
|
* Create a SynapseReactor type which incorporates the necessary reactor ↵Patrick Cloke2021-03-081-1/+1
| | | | | interfaces. (#9528) This helps fix some type hints when running with Twisted 21.2.0.
* Fix additional type hints from Twisted upgrade. (#9518)Patrick Cloke2021-03-031-3/+1
|
* Bump the mypy and mypy-zope versions. (#9529)Patrick Cloke2021-03-031-1/+1
|
* Use the proper Request in type hints. (#9515)Patrick Cloke2021-03-011-5/+4
| | | | This also pins the Twisted version in the mypy job for CI until proper type hints are fixed throughout Synapse.
* Fix deleting pushers when using sharded pushers. (#9465)Erik Johnston2021-02-224-50/+74
|
* Add configs to make profile data more private (#9203)AndrewFerr2021-02-191-1/+2
| | | | | | | Add off-by-default configuration settings to: - disable putting an invitee's profile info in invite events - disable profile lookup via federation Signed-off-by: Andrew Ferrazzutti <fair@miscworks.net>
* Update black, and run auto formatting over the codebase (#9381)Eric Eastwood2021-02-1612-71/+62
| | | | | | | - Update black version to the latest - Run black auto formatting over the codebase - Run autoformatting according to [`docs/code_style.md `](https://github.com/matrix-org/synapse/blob/80d6dc9783aa80886a133756028984dbf8920168/docs/code_style.md) - Update `code_style.md` docs around installing black to use the correct version
* Ensure that we never stop reconnecting to redis (#9391)Erik Johnston2021-02-111-2/+24
|
* Precompute joined hosts and store in Redis (#9198)Erik Johnston2021-01-262-14/+106
|
* Periodically send pings to detect dead Redis connections (#9218)Erik Johnston2021-01-262-53/+98
| | | | | | | | This is done by creating a custom `RedisFactory` subclass that periodically pings all connections in its pool. We also ensure that the `replyTimeout` param is non-null, so that we timeout waiting for the reply to those pings (and thus triggering a reconnect).
* Allow moving account data and receipts streams off master (#9104)Erik Johnston2021-01-186-76/+217
|
* Enforce all replication HTTP clients calls use kwargs (#9144)Erik Johnston2021-01-181-1/+1
|
* Allow running sendToDevice on workers (#9044)Erik Johnston2021-01-072-31/+10
|
* Some cleanups to device inbox store. (#9041)Erik Johnston2021-01-071-8/+0
|
* Merge remote-tracking branch 'origin/erikj/as_mau_block' into developErik Johnston2020-12-181-2/+10
|\
| * Correctly handle AS registerations and add testErik Johnston2020-12-171-2/+10
| |
* | Convert internal pusher dicts to attrs classes. (#8940)Patrick Cloke2020-12-162-10/+27
| | | | | | This improves type hinting and should use less memory.
* | Various clean-ups to the logging context code (#8935)Patrick Cloke2020-12-141-2/+1
| |
* | Add authentication to replication endpoints. (#8853)Patrick Cloke2020-12-041-6/+41
|/ | | | Authentication is done by checking a shared secret provided in the Synapse configuration file.
* Add typing to membership Replication class methods (#8809)Andrew Morgan2020-11-271-22/+44
| | | | | This PR grew out of #6739, and adds typing to some method arguments You'll notice that there are a lot of `# type: ignores` in here. This is due to the base methods not matching the overloads here. This is necessary to stop mypy complaining, but a better solution is #8828.
* Generalise _maybe_store_room_on_invite (#8754)Andrew Morgan2020-11-131-5/+5
| | | | | | | | | There's a handy function called maybe_store_room_on_invite which allows us to create an entry in the rooms table for a room and its version for which we aren't joined to yet, but we can reference when ingesting events about. This is currently used for invites where we receive some stripped state about the room and pass it down via /sync to the client, without us being in the room yet. There is a similar requirement for knocking, where we will eventually do the same thing, and need an entry in the rooms table as well. Thus, reusing this function works, however its name needs to be generalised a bit. Separated out from #6739.
* Add ability for access tokens to belong to one user but grant access to ↵Erik Johnston2020-10-292-6/+3
| | | | | | | | | | another user. (#8616) We do it this way round so that only the "owner" can delete the access token (i.e. `/logout/all` by the "owner" also deletes that token, but `/logout/all` by the "target user" doesn't). A future PR will add an API for creating such a token. When the target user and authenticated entity are different the `Processed request` log line will be logged with a: `{@admin:server as @bob:server} ...`. I'm not convinced by that format (especially since it adds spaces in there, making it harder to use `cut -d ' '` to chop off the start of log lines). Suggestions welcome.
* Don't pull event from DB when handling replication traffic. (#8669)Erik Johnston2020-10-282-16/+25
| | | | | I was trying to make it so that we didn't have to start a background task when handling RDATA, but that is a bigger job (due to all the code in `generic_worker`). However I still think not pulling the event from the DB may help reduce some DB usage due to replication, even if most workers will simply go and pull that event from the DB later anyway. Co-authored-by: Patrick Cloke <clokep@users.noreply.github.com>
* Don't unnecessarily start bg process in replication sending loop. (#8670)Erik Johnston2020-10-271-0/+10
|
* Start fewer opentracing spans (#8640)Erik Johnston2020-10-261-1/+3
| | | | | | | #8567 started a span for every background process. This is good as it means all Synapse code that gets run should be in a span (unless in the sentinel logging context), but it means we generate about 15x the number of spans as we did previously. This PR attempts to reduce that number by a) not starting one for send commands to Redis, and b) deferring starting background processes until after we're sure they're necessary. I don't really know how much this will help.
* Replace DeferredCache with LruCache where possible (#8563)Richard van der Hoff2020-10-191-5/+5
| | | Most of these uses don't need a full-blown DeferredCache; LruCache is lighter and more appropriate.
* move DeferredCache into its own moduleRichard van der Hoff2020-10-141-1/+1
|
* Rename Cache->DeferredCacheRichard van der Hoff2020-10-141-3/+3
|
* Add some more type annotations to CacheRichard van der Hoff2020-10-141-1/+1
|
* Fix message duplication if something goes wrong after persisting the event ↵Erik Johnston2020-10-131-2/+14
| | | | | (#8476) Should fix #3365.
* Make event persisters periodically announce position over replication. (#8499)Erik Johnston2020-10-124-21/+90
| | | | | Currently background proccesses stream the events stream use the "minimum persisted position" (i.e. `get_current_token()`) rather than the vector clock style tokens. This is broadly fine as it doesn't matter if the background processes lag a small amount. However, in extreme cases (i.e. SyTests) where we only write to one event persister the background processes will never make progress. This PR changes it so that the `MultiWriterIDGenerator` keeps the current position of a given instance as up to date as possible (i.e using the latest token it sees if its not in the process of persisting anything), and then periodically announces that over replication. This then allows the "minimum persisted position" to advance, albeit with a small lag.
* Add type hints to response cache. (#8507)Patrick Cloke2020-10-091-1/+1
|
* Only send RDATA for instance local events. (#8496)Erik Johnston2020-10-092-6/+11
| | | | | When pulling events out of the DB to send over replication we were not filtering by instance name, and so we were sending events for other instances.
* Remove the deprecated Handlers object (#8494)Patrick Cloke2020-10-092-2/+2
| | | All handlers now available via get_*_handler() methods on the HomeServer.
* Add unit test for event persister sharding (#8433)Erik Johnston2020-10-022-4/+42
|
* Enable mypy checking for unreachable code and fix instances. (#8432)Patrick Cloke2020-10-011-4/+6
|
* Various clean ups to room stream tokens. (#8423)Erik Johnston2020-09-291-4/+2
|
* Add metrics to track success/otherwise of replication requests (#8406)Richard van der Hoff2020-09-291-12/+28
| | | One hope is that this might provide some insights into #3365.
* Fix MultiWriteIdGenerator's handling of restarts. (#8374)Erik Johnston2020-09-241-0/+2
| | | | | | | | | | | | | | | | | | | On startup `MultiWriteIdGenerator` fetches the maximum stream ID for each instance from the table and uses that as its initial "current position" for each writer. This is problematic as a) it involves either a scan of events table or an index (neither of which is ideal), and b) if rows are being persisted out of order elsewhere while the process restarts then using the maximum stream ID is not correct. This could theoretically lead to race conditions where e.g. events that are persisted out of order are not sent down sync streams. We fix this by creating a new table that tracks the current positions of each writer to the stream, and update it each time we finish persisting a new entry. This is a relatively small overhead when persisting events. However for the cache invalidation stream this is a much bigger relative overhead, so instead we note that for invalidation we don't actually care about reliability over restarts (as there's no caches to invalidate) and simply don't bother reading and writing to the new table in that particular case.
* Add EventStreamPosition type (#8388)Erik Johnston2020-09-241-3/+9
| | | | | | | | | | | | | | The idea is to remove some of the places we pass around `int`, where it can represent one of two things: 1. the position of an event in the stream; or 2. a token that partitions the stream, used as part of the stream tokens. The valid operations are then: 1. did a position happen before or after a token; 2. get all events that happened before or after a token; and 3. get all events between two tokens. (Note that we don't want to allow other operations as we want to change the tokens to be vector clocks rather than simple ints)
* Simplify super() calls to Python 3 syntax. (#8344)Patrick Cloke2020-09-1819-25/+25
| | | | | | | This converts calls like super(Foo, self) -> super(). Generated with: sed -i "" -Ee 's/super\([^\(]+\)/super()/g' **/*.py
* Switch metaclass initialization to python 3-compatible syntax (#8326)Jonathan de Jong2020-09-161-3/+1
|
* Use slots in attrs classes where possible (#8296)Patrick Cloke2020-09-141-2/+2
| | | | | slots use less memory (and attribute access is faster) while slightly limiting the flexibility of the class attributes. This focuses on objects which are instantiated "often" and for short periods of time.
* Fix typos in comments.Patrick Cloke2020-09-141-1/+1
|
* Add experimental support for sharding event persister. Again. (#8294)Erik Johnston2020-09-143-6/+12
| | | | | | This is *not* ready for production yet. Caveats: 1. We should write some tests... 2. The stream token that we use for events can get stalled at the minimum position of all writers. This means that new events may not be processed and e.g. sent down sync streams if a writer isn't writing or is slow.
* Clean up `Notifier.on_new_room_event` code path (#8288)Erik Johnston2020-09-101-6/+3
| | | | | | | | | | | | | The idea here is that we pass the `max_stream_id` to everything, and only use the stream ID of the particular event to figure out *when* the max stream position has caught up to the event and we can notify people about it. This is to maintain the distinction between the position of an item in the stream (i.e. event A has stream ID 513) and a token that can be used to partition the stream (i.e. give me all events after stream ID 352). This distinction becomes important when the tokens are more complicated than a single number, which they will be once we start tracking the position of multiple writers in the tokens. The valid operations here are: 1. Is a position before or after a token 2. Fetching all events between two tokens 3. Merging multiple tokens to get the "max", i.e. `C = max(A, B)` means that for all positions P where P is before A *or* before B, then P is before C. Future PR will change the token type to a dedicated type.
* Remove some unused distributor signals (#8216)Patrick Cloke2020-09-091-6/+4
| | | | | Removes the `user_joined_room` and stops calling it since there are no observers. Also cleans-up some other unused signals and related code.
* Fixup pusher pool notifications (#8287)Erik Johnston2020-09-091-1/+2
| | | | | `pusher_pool.on_new_notifications` expected a min and max stream ID, however that was not what we were passing in. Instead, let's just pass it the current max stream ID and have it track the last stream ID it got passed. I believe that it mostly worked as we called the function for every event. However, it would break for events that got persisted out of order, i.e, that were persisted but the max stream ID wasn't incremented as not all preceding events had finished persisting, and push for that event would be delayed until another event got pushed to the effected users.
* Revert "Fixup pusher pool notifications"Erik Johnston2020-09-091-2/+1
| | | | This reverts commit e7fd336a53a4ca489cdafc389b494d5477019dc0.
* Fixup pusher pool notificationsErik Johnston2020-09-091-1/+2
|
* Stop sub-classing object (#8249)Patrick Cloke2020-09-046-7/+7
|
* Revert "Add experimental support for sharding event persister. (#8170)" (#8242)Brendan Abolivier2020-09-043-12/+6
| | | | | | | * Revert "Add experimental support for sharding event persister. (#8170)" This reverts commit 82c1ee1c22a87b9e6e3179947014b0f11c0a1ac3. * Changelog
* Add experimental support for sharding event persister. (#8170)Erik Johnston2020-09-023-6/+12
| | | | | | This is *not* ready for production yet. Caveats: 1. We should write some tests... 2. The stream token that we use for events can get stalled at the minimum position of all writers. This means that new events may not be processed and e.g. sent down sync streams if a writer isn't writing or is slow.
* Move and rename `get_devices_with_keys_by_user` (#8204)Richard van der Hoff2020-09-011-0/+3
| | | | | | | | | | | | | | | | | | * Move `get_devices_with_keys_by_user` to `EndToEndKeyWorkerStore` this seems a better fit for it. This commit simply moves the existing code: no other changes at all. * Rename `get_devices_with_keys_by_user` to better reflect what it does. * get_device_stream_token abstract method To avoid referencing fields which are declared in the derived classes, make `get_device_stream_token` abstract, and define that in the classes which define `_device_list_id_gen`.
* Fix `wait_for_stream_position` for multiple waiters. (#8196)Erik Johnston2020-08-281-4/+2
| | | | | | This fixes a bug where having multiple callers waiting on the same stream and position will cause it to try and compare two deferreds, which fails (due to the sorted list having an entry of `Tuple[int, Deferred]`).
* Make SlavedIdTracker.advance have same interface as MultiWriterIDGenerator ↵Erik Johnston2020-08-2610-13/+13
| | | | (#8171)
* Remove `ChainedIdGenerator`. (#8123)Erik Johnston2020-08-192-7/+5
| | | | | It's just a thin wrapper around two ID gens to make `get_current_token` and `get_next` return tuples. This can easily be replaced by calling the appropriate methods on the underlying ID gens directly.
* Be stricter about JSON that is accepted by Synapse (#8106)Patrick Cloke2020-08-191-7/+5
|
* Separate `get_current_token` into two. (#8113)Erik Johnston2020-08-192-1/+9
| | | | | | | | | | | | The function is used for two purposes: 1) for subscribers of streams to get a token they can use to get further updates with, and 2) for replication to track position of the writers of the stream. For streams with a single writer the two scenarios produce the same result, however the situation becomes complicated for streams with multiple writers. The current `MultiWriterIdGenerator` does not correctly handle the first case (which is not an issue as its only used for the `caches` stream which nothing subscribes to outside of replication).
* Add a shadow-banned flag to users. (#8092)Patrick Cloke2020-08-141-0/+4
|
* Reduce unnecessary whitespace in JSON. (#7372)David Vo2020-08-071-2/+3
|
* Convert synapse.api to async/await (#8031)Patrick Cloke2020-08-061-1/+1
|
* Rename database classes to make some sense (#8033)Erik Johnston2020-08-0519-54/+54
|
* Convert replication code to async/await. (#7987)Patrick Cloke2020-08-039-37/+27
|
* Merge tag 'v1.18.0rc2' into developRichard van der Hoff2020-07-284-87/+112
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Synapse 1.18.0rc2 (2020-07-28) ============================== Bugfixes -------- - Fix an `AssertionError` exception introduced in v1.18.0rc1. ([\#7876](https://github.com/matrix-org/synapse/issues/7876)) - Fix experimental support for moving typing off master when worker is restarted, which is broken in v1.18.0rc1. ([\#7967](https://github.com/matrix-org/synapse/issues/7967)) Internal Changes ---------------- - Further optimise queueing of inbound replication commands. ([\#7876](https://github.com/matrix-org/synapse/issues/7876))
| * Typing worker needs to handle stream update requests (#7967)Erik Johnston2020-07-281-1/+1
| | | | | | | | | | IIRC this doesn't break tests because its only hit on reconnection, or something. Basically, when a process needs to fetch missing updates for the `typing` stream it needs to query the writer instance via HTTP (as we don't write typing notifications to the DB), the problem was that the endpoint (`streams`) was only registered on master and specifically not on the typing writer worker.
| * Handle replication commands synchronously where possible (#7876)Richard van der Hoff2020-07-273-86/+111
| | | | | | Most of the stuff we do for replication commands can be done synchronously. There's no point spinning up background processes if we're not going to need them.
* | Convert a synapse.events to async/await. (#7949)Patrick Cloke2020-07-272-2/+4
|/
* Fix typing replication not being handled on master (#7959)Erik Johnston2020-07-271-0/+8
| | | | | | | | | | | | | | | | Handling of incoming typing stream updates from replication was not hooked up on master, effecting set ups where typing was handled on a different worker. This is really only a problem if the master process is also handling sync requests, which is unlikely for those that are at the stage of moving typing off. The other observable effect is that if a worker restarts or a replication connect drops then the typing worker will issue a `POSITION typing`, triggering master process to try and stream *all* typing updates from position 0. Fixes #7907
* Remove an unused prometheus metric (#7878)Richard van der Hoff2020-07-221-3/+1
|
* Track command processing as a background process (#7879)Richard van der Hoff2020-07-222-3/+38
| | | | I'm going to be doing more stuff synchronously, and I don't want to lose the CPU metrics down the sofa.
* Fix deprecation warning: import ABC from collections.abc (#7892)Karthikeyan Singaravelan2020-07-201-1/+1
|
* Stop using 'device_max_stream_id' (#7882)Erik Johnston2020-07-171-1/+1
| | | | | It serves no purpose and updating everytime we write to the device inbox stream means all such transactions will conflict, causing lots of transaction failures and retries.
* Optimise queueing of inbound replication commands (#7861)Richard van der Hoff2020-07-161-116/+215
| | | | | | | | | | | When we get behind on replication, we tend to stack up background processes behind a linearizer. Bg processes are heavy (particularly with respect to prometheus metrics) and linearizers aren't terribly efficient once the queue gets long either. A better approach is to maintain a queue of requests to be processed, and nominate a single process to work its way through the queue. Fixes: #7444
* Allow moving typing off master (#7869)Erik Johnston2020-07-162-3/+13
|
* Add ability to shard the federation sender (#7798)Erik Johnston2020-07-102-6/+8
|
* Fix some spelling mistakes / typos. (#7811)Patrick Cloke2020-07-096-7/+7
|
* Generate real events when we reject invites (#7804)Richard van der Hoff2020-07-091-67/+25
| | | | | | | | Fixes #2181. The basic premise is that, when we fail to reject an invite via the remote server, we can generate our own out-of-band leave event and persist it as an outlier, so that we have something to send to the client.
* Do not use simplejson in Synapse. (#7800)Patrick Cloke2020-07-081-9/+2
|
* Refactor getting replication updates from database v2. (#7740)Erik Johnston2020-07-071-46/+10
|
* isort 5 compatibility (#7786)Will Hunt2020-07-053-5/+3
| | | The CI appears to use the latest version of isort, which is a problem when isort gets a major version bump. Rather than try to pin the version, I've done the necessary to make isort5 happy with synapse.
* Merge different Resource implementation classes (#7732)Erik Johnston2020-07-032-10/+4
|
* Use symbolic names for replication stream names (#7768)Richard van der Hoff2020-07-018-17/+17
| | | This makes it much easier to find where streams are referenced.
* Refactor getting replication updates from database. (#7636)Erik Johnston2020-06-161-21/+8
| | | The aim here is to make it easier to reason about when streams are limited and when they're not, by moving the logic into the database functions themselves. This should mean we can kill of `db_query_to_update_function` function.
* Replace all remaining six usage with native Python 3 equivalents (#7704)Dagfinn Ilmari Mannsåker2020-06-161-4/+2
|
* Discard RDATA from already seen positions. (#7648)Patrick Cloke2020-06-152-6/+28
|
* Fix bug in account data replication stream. (#7656)Erik Johnston2020-06-092-3/+15
| | | | | | | | | | | | | | | | | | | | | | | | | | * Ensure account data stream IDs are unique. The account data stream is shared between three tables, and the maximum allocated ID was tracked in a dedicated table. Updating the max ID happened outside the transaction that allocated the ID, leading to a race where if the server was restarted then the same ID could be allocated but the max ID failed to be updated, leading it to be reused. The ID generators have support for tracking across multiple tables, so we may as well use that instead of a dedicated table. * Fix bug in account data replication stream. If the same stream ID was used in both global and room account data then the getting updates for the replication stream would fail due to `heapq.merge(..)` trying to compare a `str` with a `None`. (This is because you'd have two rows like `(534, '!room')` and `(534, None)` from the room and global account data tables). Fix is just to order by stream ID, since we don't rely on the ordering beyond that. The bug where stream IDs can be reused should be fixed now, so this case shouldn't happen going forward. Fixes #7617
* Typo fixes.Patrick Cloke2020-06-051-1/+1
|
* Ensure ReplicationStreamer is always started when replication enabled. (#7579)Erik Johnston2020-05-271-0/+3
| | | Fixes #7566.
* Add option to move event persistence off master (#7517)Erik Johnston2020-05-225-2/+171
|
* Add ability to wait for replication streams (#7542)Erik Johnston2020-05-225-18/+108
| | | | | | | The idea here is that if an instance persists an event via the replication HTTP API it can return before we receive that event over replication, which can lead to races where code assumes that persisting an event immediately updates various caches (e.g. current state of the room). Most of Synapse doesn't hit such races, so we don't do the waiting automagically, instead we do so where necessary to avoid unnecessary delays. We may decide to change our minds here if it turns out there are a lot of subtle races going on. People probably want to look at this commit by commit.
* Allow ReplicationRestResource to be added to workers (#7515)Erik Johnston2020-05-181-5/+8
| | | This allows workers to talk to each other over HTTP replication.
* Merge pull request #7519 from matrix-org/rav/kill_py2_codeRichard van der Hoff2020-05-182-13/+4
|\ | | | | Kill off some old python 2 code
| * remove redundant `__func__`Richard van der Hoff2020-05-152-13/+4
| | | | | | | | this is a no-op under python 3
* | Fix limit logic for AccountDataStream (#7384)Richard van der Hoff2020-05-151-12/+56
| | | | | | | | | | | | Make sure that the AccountDataStream presents complete updates, in the right order. This is much the same fix as #7337 and #7358, but applied to a different stream.
* | Move event stream handling out of slave store. (#7491)Erik Johnston2020-05-152-97/+0
|/ | | | | This allows us to have the logic on both master and workers, which is necessary to move event persistence off master. We also combine the instantiation of ID generators from DataStore and slave stores to the base worker stores. This allows us to select which process writes events independently of the master/worker splits.
* Move EventStream handling into default ReplicationDataHandler (#7493)Erik Johnston2020-05-141-4/+33
| | | This is so that the logic can happen on both master and workers when we move event persistence out.
* Add `instance_map` config and route replication calls (#7495)Erik Johnston2020-05-141-6/+15
|
* Have all instances correctly respond to REPLICATE command. (#7475)Erik Johnston2020-05-133-48/+50
| | | | | Before all streams were only written to from master, so only master needed to respond to `REPLICATE` commands. Before all instances wrote to the cache invalidation stream, but didn't respond to `REPLICATE`. This was a bug, which could lead to missed rows from cache invalidation stream if an instance is restarted, however all the caches would be empty in that case so it wasn't a problem.
* Fix Redis reconnection logic (#7482)Erik Johnston2020-05-132-2/+14
| | | Proactively send out `POSITION` commands (as if we had just received a `REPLICATE`) when we connect to Redis. This is important as other instances won't notice we've connected to issue a `REPLICATE` command (unlike for direct TCP connections). This is only currently an issue if master process reconnects without restarting (if it restarts then it won't have written anything and so other instances probably won't have missed anything).
* Allow configuration of Synapse's cache without using synctl or environment ↵Amber Brown2020-05-111-2/+1
| | | | variables (#6391)
* Merge branch 'release-v1.13.0' into developAndrew Morgan2020-05-112-4/+2
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | * release-v1.13.0: Don't UPGRADE database rows RST indenting Put rollback instructions in upgrade notes Fix changelog typo Oh yeah, RST Absolute URL it is then Fix upgrade notes link Provide summary of upgrade issues in changelog. Fix ) Move next version notes from changelog to upgrade notes Changelog fixes 1.13.0rc1 Documentation on setting up redis (#7446) Rework UI Auth session validation for registration (#7455) Fix errors from malformed log line (#7454) Drop support for redis.dbid (#7450)
| * Fix errors from malformed log line (#7454)Richard van der Hoff2020-05-071-1/+1
| |
| * Drop support for redis.dbid (#7450)Richard van der Hoff2020-05-071-3/+1
| | | | | | Since we only use pubsub, the dbid is irrelevant.
* | Support any process writing to cache invalidation stream. (#7436)Erik Johnston2020-05-0718-183/+131
| |
* | Merge branch 'release-v1.13.0' into developRichard van der Hoff2020-05-062-34/+69
|\|
| * Merge branch 'release-v1.13.0' into rav/fix_dropped_messagesRichard van der Hoff2020-05-051-1/+1
| |\
| * \ Merge branch 'release-v1.13.0' into rav/fix_dropped_messagesRichard van der Hoff2020-05-0519-132/+96
| |\ \
| * | | Wait for a POSITION on the right connection before accepting RDATARichard van der Hoff2020-05-052-19/+38
| | | | | | | | | | | | | | | | ... otherwise we can believe we're up to date when we're not.
| * | | Wait to subscribe before sending REPLICATERichard van der Hoff2020-05-052-20/+35
| | | |
* | | | Merge branch 'release-v1.13.0' into developRichard van der Hoff2020-05-061-1/+1
|\ \ \ \ | | |_|/ | |/| |
| * | | Move logs about discarded RDATA to debug (#7421)Brendan Abolivier2020-05-051-1/+1
| | |/ | |/|
* / | Fix catchup-on-reconnect for the Federation Stream (#7374)Richard van der Hoff2020-05-053-11/+24
|/ / | | | | | | looks like we managed to break this during the refactorathon.
* | Fix redis password support. (#7401)Erik Johnston2020-05-041-0/+3
| | | | | | | | | | We forgot to set the password on the subscriber connection, as well as not calling super methods for overridden connectionMade/connectionLost functions.
* | Thread through instance name to replication client. (#7369)Erik Johnston2020-05-017-29/+90
| | | | | | For in memory streams when fetching updates on workers we need to query the source of the stream, which currently is hard coded to be master. This PR threads through the source instance we received via `POSITION` through to the update function in each stream, which can then be passed to the replication client for in memory streams.
* | Use `stream.current_token()` and remove `stream_positions()` (#7172)Erik Johnston2020-05-0113-104/+3
|/ | | | We move the processing of typing and federation replication traffic into their handlers so that `Stream.current_token()` points to a valid token. This allows us to remove `get_streams_to_replicate()` and `stream_positions()`.
* Workaround for assertion errors from db_query_to_update_function (#7378)Richard van der Hoff2020-05-011-2/+1
| | | Hopefully this is no worse than what we have on master...
* Add instance name to RDATA/POSITION commands (#7364)Erik Johnston2020-04-292-14/+40
| | | | | This is primarily for allowing us to send those commands from workers, but for now simply allows us to ignore echoed RDATA/POSITION commands that we sent (we get echoes of sent commands when using redis). Currently we log a WARNING on the master process every time we receive an echoed RDATA.
* Don't relay REMOTE_SERVER_UP cmds to same conn. (#7352)Erik Johnston2020-04-293-16/+51
| | | | | | | | | | | | | | For direct TCP connections we need the master to relay REMOTE_SERVER_UP commands to the other connections so that all instances get notified about it. The old implementation just relayed to all connections, assuming that sending back to the original sender of the command was safe. This is not true for redis, where commands sent get echoed back to the sender, which was causing master to effectively infinite loop sending and then re-receiving REMOTE_SERVER_UP commands that it sent. The fix is to ensure that we only relay to *other* connections and not to the connection we received the notification from. Fixes #7334.
* Fix limit logic for EventsStream (#7358)Richard van der Hoff2020-04-292-15/+11
| | | | | | | | | | | | | | | | | | | * Factor out functions for injecting events into database I want to add some more flexibility to the tools for injecting events into the database, and I don't want to clutter up HomeserverTestCase with them, so let's factor them out to a new file. * Rework TestReplicationDataHandler This wasn't very easy to work with: the mock wrapping was largely superfluous, and it's useful to be able to inspect the received rows, and clear out the received list. * Fix AssertionErrors being thrown by EventsStream Part of the problem was that there was an off-by-one error in the assertion, but also the limit logic was too simple. Fix it all up and add some tests.
* Run replication streamers on workers (#7146)Erik Johnston2020-04-281-18/+15
| | | Currently we never write to streams from workers, but that will change soon
* Fix EventsStream raising assertions when it falls behindRichard van der Hoff2020-04-241-18/+95
| | | | | | | | | | Figuring out how to correctly limit updates from this stream without dropping entries is far more complicated than just counting the number of rows being returned. We need to consider each query separately and, if any one query hits the limit, truncate the results from the others. I think this also fixes some potentially long-standing bugs where events or state changes could get missed if we hit the limit on either query.
* Make it clear that the limit for an update_function is a targetRichard van der Hoff2020-04-231-5/+9
|
* Remove 'limit' param from `get_repl_stream_updates` APIRichard van der Hoff2020-04-232-9/+8
| | | | | there doesn't seem to be much point in passing this limit all around, since both sides agree it's meant to be 100.
* Stop the master relaying USER_SYNC for other workers (#7318)Richard van der Hoff2020-04-222-12/+10
| | | | | | | Long story short: if we're handling presence on the current worker, we shouldn't be sending USER_SYNC commands over replication. In an attempt to figure out what is going on here, I ended up refactoring some bits of the presencehandler code, so the first 4 commits here are non-functional refactors to move this code slightly closer to sanity. (There's still plenty to do here :/). Suggest reviewing individual commits. Fixes (I hope) #7257.
* Fix replication metrics when using redis (#7325)Erik Johnston2020-04-222-37/+29
|
* Another go at fixing one-word commands (#7326)Richard van der Hoff2020-04-221-1/+1
| | | I messed this up last time I tried (#7239 / e13c6c7).
* Add ability to run replication protocol over redis. (#7040)Erik Johnston2020-04-225-34/+255
| | | This is configured via the `redis` config options.
* On catchup, process each row with its own stream id (#7286)Richard van der Hoff2020-04-201-5/+68
| | | | | | Other parts of the code (such as the StreamChangeCache) assume that there will not be multiple changes with the same stream id. This code was introduced in #7024, and I hope this fixes #7206.
* Improve type checking in `replication.tcp.Stream` (#7291)Richard van der Hoff2020-04-174-122/+142
| | | | | | | The general idea here is to get rid of the type: ignore annotations on all of the current_token and update_function assignments, which would have caught #7290. After a bit of experimentation, it seems like the least-awful way to do this is to pass the offending functions in as parameters to the Stream constructor. Unfortunately that means that the concrete implementations no longer have the same constructor signature as Stream itself, which means that it gets hard to correctly annotate STREAMS_MAP. I've also introduced a couple of new types, to take out some duplication.
* Fix 'generator object is not subscriptable' error (#7290)Richard van der Hoff2020-04-161-1/+2
| | | | | | Some of the query functions return generators rather than lists, so we can't index into the result. Happily we already have a copy of the results. (think this was introduced in #7024)
* Handle one-word replication commands correctlyRichard van der Hoff2020-04-071-3/+11
| | | | | `REPLICATE` is now a valid command, and it's nice if you can issue it from the console without remembering to call it `REPLICATE ` with a trailing space.
* Fix warnings about not calling superclass constructorRichard van der Hoff2020-04-071-15/+24
| | | | | | Separate `SimpleCommand` from `Command`, so that things which don't want to use the `data` property don't have to, and thus fix the warnings PyCharm was giving me about not calling `__init__` in the base class.
* Remove vestigal references to SYNC replication commandRichard van der Hoff2020-04-072-14/+0
| | | | We've ripped pretty much all of this out: let's remove the remains.
* Fix race in replication (#7226)Erik Johnston2020-04-072-29/+47
| | | | Fixes a race between handling `POSITION` and `RDATA` commands. We do this by simply linearizing handling of them.
* Move server command handling out of TCP protocol (#7187)Erik Johnston2020-04-073-269/+236
| | | This completes the merging of server and client command processing.
* Move client command handling out of TCP protocol (#7185)Erik Johnston2020-04-064-322/+336
| | | The aim here is to move the command handling out of the TCP protocol classes and to also merge the client and server command handling (so that we can reuse them for redis protocol). This PR simply moves the client paths to the new `ReplicationCommandHandler`, a future PR will move the server paths too.
* Remove connections per replication stream metric. (#7195)Erik Johnston2020-04-011-16/+0
| | | | | This broke in a recent PR (#7024) and is no longer useful due to all replication clients implicitly subscribing to all streams, so let's just remove it.
* Remove usage of "conn_id" for presence. (#7128)Erik Johnston2020-03-304-18/+50
| | | | | | | | | | | | | | | | * Remove `conn_id` usage for UserSyncCommand. Each tcp replication connection is assigned a "conn_id", which is used to give an ID to a remotely connected worker. In a redis world, there will no longer be a one to one mapping between connection and instance, so instead we need to replace such usages with an ID generated by the remote instances and included in the replicaiton commands. This really only effects UserSyncCommand. * Add CLEAR_USER_SYNCS command that is sent on shutdown. This should help with the case where a synchrotron gets restarted gracefully, rather than rely on 5 minute timeout.
* Move catchup of replication streams to worker. (#7024)Erik Johnston2020-03-2512-232/+319
| | | This changes the replication protocol so that the server does not send down `RDATA` for rows that happened before the client connected. Instead, the server will send a `POSITION` and clients then query the database (or master out of band) to get up to date.
* Convert `*StreamRow` classes to inner classes (#7116)Richard van der Hoff2020-03-232-96/+101
| | | | | This just helps keep the rows closer to their streams, so that it's easier to see what the format of each stream is.
* Fix processing of `groups` stream, and use symbolic names for streams (#7117)Richard van der Hoff2020-03-231-18/+52
| | | | | | `groups` != `receipts` Introduced in #6964
* Remove concept of a non-limited stream. (#7011)Erik Johnston2020-03-202-47/+28
|
* Change device list streams to have one row per ID (#7010)Erik Johnston2020-03-192-17/+32
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | * Add 'device_lists_outbound_pokes' as extra table. This makes sure we check all the relevant tables to get the current max stream ID. Currently not doing so isn't problematic as the max stream ID in `device_lists_outbound_pokes` is the same as in `device_lists_stream`, however that will change. * Change device lists stream to have one row per id. This will make it possible to process the streams more incrementally, avoiding having to process large chunks at once. * Change device list replication to match new semantics. Instead of sending down batches of user ID/host tuples, send down a row per entity (user ID or host). * Newsfile * Remove handling of multiple rows per ID * Fix worker handling * Comments from review
| * Comments from reviewErik Johnston2020-03-181-0/+3
| |
| * Change device list replication to match new semantics.Erik Johnston2020-02-282-16/+22
| | | | | | | | | | Instead of sending down batches of user ID/host tuples, send down a row per entity (user ID or host).
| * Add 'device_lists_outbound_pokes' as extra table.Erik Johnston2020-02-281-1/+7
| | | | | | | | | | | | | | | | | | This makes sure we check all the relevant tables to get the current max stream ID. Currently not doing so isn't problematic as the max stream ID in `device_lists_outbound_pokes` is the same as in `device_lists_stream`, however that will change.
* | Store room_versions in EventBase objects (#6875)Richard van der Hoff2020-03-052-8/+19
|/ | | | | | | This is a bit fiddly because it all has to be done on one fell swoop: * Wherever we create a new event, pass in the room version (and check it matches the format version) * When we prune an event, use the room version of the unpruned event to create the pruned version. * When we pass an event over the replication protocol, pass the room version over alongside it, and use it when deserialising the event again.
* Store room version on invite (#6983)Richard van der Hoff2020-02-262-2/+36
| | | | | When we get an invite over federation, store the room version in the rooms table. The general idea here is that, when we pull the invite out again, we'll want to know what room_version it belongs to (so that we can later redact it if need be). So we need to store it somewhere...
* Port PresenceHandler to async/await (#6991)Erik Johnston2020-02-261-1/+5
|
* Merge worker apps into one. (#6964)Erik Johnston2020-02-251-0/+20
|
* Increase MAX_EVENTS_BEHIND for replication clientsErik Johnston2020-02-211-1/+1
|
* Allow moving group read APIs to workers (#6866)Erik Johnston2020-02-071-8/+6
|
* Fix sending server up commands from workers (#6811)Erik Johnston2020-01-301-0/+4
| | | | Co-authored-by: Andrew Morgan <1342360+anoadragon453@users.noreply.github.com>
* Detect unknown remote devices and mark cache as stale (#6776)Erik Johnston2020-01-281-1/+1
| | | | We just mark the fact that the cache may be stale in the database for now.
* Propagate cache invalidates from workers to other workers. (#6748)Erik Johnston2020-01-272-4/+7
| | | Currently if a worker invalidates a cache it will be streamed to master, which then didn't forward those to other workers.
* Allow streaming cache invalidate all to workers. (#6749)Erik Johnston2020-01-222-6/+27
|
* Wake up transaction queue when remote server comes back online (#6706)Erik Johnston2020-01-174-0/+44
| | | | | This will be used to retry outbound transactions to a remote server if we think it might have come back up.