summary refs log tree commit diff
path: root/synapse/replication/tcp (follow)
Commit message (Collapse)AuthorAgeFilesLines
* Add debug logging for issue #9533 (#9959)Richard van der Hoff2021-05-111-1/+0
| | | | | Hopefully this will help us track down where to-device messages are getting lost/delayed.
* Time external cache response time (#9904)Erik Johnston2021-05-041-10/+26
|
* Split presence out of master (#9820)Erik Johnston2021-04-232-7/+28
|
* Remove `synapse.types.Collection` (#9856)Richard van der Hoff2021-04-221-2/+1
| | | This is no longer required, since we have dropped support for Python 3.5.
* Merge branch 'master' into developAndrew Morgan2021-04-211-1/+1
|\
| * Stop BackgroundProcessLoggingContext making new prometheus timeseries (#9854)Richard van der Hoff2021-04-211-1/+1
| | | | | | | | This undoes part of b076bc276e881b262048307b6a226061d96c4a8d.
* | Merge branch 'master' into developAndrew Morgan2021-04-201-1/+1
|\|
| * Always use the name as the log ID. (#9829)Patrick Cloke2021-04-201-1/+1
| | | | | | | | | | As far as I can tell our logging contexts are meant to log the request ID, or sometimes the request ID followed by a suffix (this is generally stored in the name field of LoggingContext). There's also code to log the name@memory location, but I'm not sure this is ever used. This simplifies the code paths to require every logging context to have a name and use that in logging. For sub-contexts (created via nested_logging_contexts, defer_to_threadpool, Measure) we use the current context's str (which becomes their name or the string "sentinel") and then potentially modify that (e.g. add a suffix).
* | Add presence federation stream (#9819)Erik Johnston2021-04-203-3/+31
| |
* | Move some replication processing out of generic_worker (#9796)Erik Johnston2021-04-141-7/+224
| | | | | | Co-authored-by: Richard van der Hoff <1389908+richvdh@users.noreply.github.com>
* | Remove redundant "coding: utf-8" lines (#9786)Jonathan de Jong2021-04-1412-12/+0
|/ | | | | | | Part of #9744 Removes all redundant `# -*- coding: utf-8 -*-` lines from files, as python 3 automatically reads source code as utf-8 now. `Signed-off-by: Jonathan de Jong <jonathan@automatia.nl>`
* Record more information into structured logs. (#9654)Patrick Cloke2021-04-081-2/+3
| | | | Records additional request information into the structured logs, e.g. the requester, IP address, etc.
* Update mypy configuration: `no_implicit_optional = True` (#9742)Jonathan de Jong2021-04-051-1/+1
|
* Add type hints for the federation sender. (#9681)Patrick Cloke2021-03-292-6/+14
| | | | Includes an abstract base class which both the FederationSender and the FederationRemoteSendQueue must implement.
* Make it possible to use dmypy (#9692)Erik Johnston2021-03-261-1/+1
| | | | | | | | | Running `dmypy run` will do a `mypy` check while spinning up a daemon that makes rerunning `dmypy run` a lot faster. `dmypy` doesn't support `follow_imports = silent` and has `local_partial_types` enabled, so this PR enables those options and fixes the issues that were newly raised. Note that `local_partial_types` will be enabled by default in upcoming mypy releases.
* Import HomeServer from the proper module. (#9665)Patrick Cloke2021-03-231-1/+1
|
* Fix up types for the typing handler. (#9638)Patrick Cloke2021-03-171-7/+10
| | | | By splitting this to two separate methods the callers know what methods they can expect on the handler.
* Fix remaining mypy issues due to Twisted upgrade. (#9608)Patrick Cloke2021-03-153-3/+12
|
* Fix additional type hints from Twisted 21.2.0. (#9591)Patrick Cloke2021-03-123-38/+38
|
* Add logging for redis connection setup (#9590)Richard van der Hoff2021-03-111-0/+35
|
* Create a SynapseReactor type which incorporates the necessary reactor ↵Patrick Cloke2021-03-081-1/+1
| | | | | interfaces. (#9528) This helps fix some type hints when running with Twisted 21.2.0.
* Fix additional type hints from Twisted upgrade. (#9518)Patrick Cloke2021-03-031-3/+1
|
* Bump the mypy and mypy-zope versions. (#9529)Patrick Cloke2021-03-031-1/+1
|
* Fix deleting pushers when using sharded pushers. (#9465)Erik Johnston2021-02-222-50/+0
|
* Update black, and run auto formatting over the codebase (#9381)Eric Eastwood2021-02-168-64/+47
| | | | | | | - Update black version to the latest - Run black auto formatting over the codebase - Run autoformatting according to [`docs/code_style.md `](https://github.com/matrix-org/synapse/blob/80d6dc9783aa80886a133756028984dbf8920168/docs/code_style.md) - Update `code_style.md` docs around installing black to use the correct version
* Ensure that we never stop reconnecting to redis (#9391)Erik Johnston2021-02-111-2/+24
|
* Precompute joined hosts and store in Redis (#9198)Erik Johnston2021-01-262-14/+106
|
* Periodically send pings to detect dead Redis connections (#9218)Erik Johnston2021-01-262-53/+98
| | | | | | | | This is done by creating a custom `RedisFactory` subclass that periodically pings all connections in its pool. We also ensure that the `replyTimeout` param is non-null, so that we timeout waiting for the reply to those pings (and thus triggering a reconnect).
* Allow moving account data and receipts streams off master (#9104)Erik Johnston2021-01-181-0/+19
|
* Allow running sendToDevice on workers (#9044)Erik Johnston2021-01-071-0/+9
|
* Various clean-ups to the logging context code (#8935)Patrick Cloke2020-12-141-2/+1
|
* Don't pull event from DB when handling replication traffic. (#8669)Erik Johnston2020-10-282-16/+25
| | | | | I was trying to make it so that we didn't have to start a background task when handling RDATA, but that is a bigger job (due to all the code in `generic_worker`). However I still think not pulling the event from the DB may help reduce some DB usage due to replication, even if most workers will simply go and pull that event from the DB later anyway. Co-authored-by: Patrick Cloke <clokep@users.noreply.github.com>
* Don't unnecessarily start bg process in replication sending loop. (#8670)Erik Johnston2020-10-271-0/+10
|
* Start fewer opentracing spans (#8640)Erik Johnston2020-10-261-1/+3
| | | | | | | #8567 started a span for every background process. This is good as it means all Synapse code that gets run should be in a span (unless in the sentinel logging context), but it means we generate about 15x the number of spans as we did previously. This PR attempts to reduce that number by a) not starting one for send commands to Redis, and b) deferring starting background processes until after we're sure they're necessary. I don't really know how much this will help.
* Make event persisters periodically announce position over replication. (#8499)Erik Johnston2020-10-124-21/+90
| | | | | Currently background proccesses stream the events stream use the "minimum persisted position" (i.e. `get_current_token()`) rather than the vector clock style tokens. This is broadly fine as it doesn't matter if the background processes lag a small amount. However, in extreme cases (i.e. SyTests) where we only write to one event persister the background processes will never make progress. This PR changes it so that the `MultiWriterIDGenerator` keeps the current position of a given instance as up to date as possible (i.e using the latest token it sees if its not in the process of persisting anything), and then periodically announces that over replication. This then allows the "minimum persisted position" to advance, albeit with a small lag.
* Only send RDATA for instance local events. (#8496)Erik Johnston2020-10-092-6/+11
| | | | | When pulling events out of the DB to send over replication we were not filtering by instance name, and so we were sending events for other instances.
* Add unit test for event persister sharding (#8433)Erik Johnston2020-10-022-4/+42
|
* Enable mypy checking for unreachable code and fix instances. (#8432)Patrick Cloke2020-10-011-4/+6
|
* Various clean ups to room stream tokens. (#8423)Erik Johnston2020-09-291-4/+2
|
* Add EventStreamPosition type (#8388)Erik Johnston2020-09-241-3/+9
| | | | | | | | | | | | | | The idea is to remove some of the places we pass around `int`, where it can represent one of two things: 1. the position of an event in the stream; or 2. a token that partitions the stream, used as part of the stream tokens. The valid operations are then: 1. did a position happen before or after a token; 2. get all events that happened before or after a token; and 3. get all events between two tokens. (Note that we don't want to allow other operations as we want to change the tokens to be vector clocks rather than simple ints)
* Simplify super() calls to Python 3 syntax. (#8344)Patrick Cloke2020-09-181-1/+1
| | | | | | | This converts calls like super(Foo, self) -> super(). Generated with: sed -i "" -Ee 's/super\([^\(]+\)/super()/g' **/*.py
* Use slots in attrs classes where possible (#8296)Patrick Cloke2020-09-141-2/+2
| | | | | slots use less memory (and attribute access is faster) while slightly limiting the flexibility of the class attributes. This focuses on objects which are instantiated "often" and for short periods of time.
* Fix typos in comments.Patrick Cloke2020-09-141-1/+1
|
* Add experimental support for sharding event persister. Again. (#8294)Erik Johnston2020-09-142-3/+3
| | | | | | This is *not* ready for production yet. Caveats: 1. We should write some tests... 2. The stream token that we use for events can get stalled at the minimum position of all writers. This means that new events may not be processed and e.g. sent down sync streams if a writer isn't writing or is slow.
* Clean up `Notifier.on_new_room_event` code path (#8288)Erik Johnston2020-09-101-6/+3
| | | | | | | | | | | | | The idea here is that we pass the `max_stream_id` to everything, and only use the stream ID of the particular event to figure out *when* the max stream position has caught up to the event and we can notify people about it. This is to maintain the distinction between the position of an item in the stream (i.e. event A has stream ID 513) and a token that can be used to partition the stream (i.e. give me all events after stream ID 352). This distinction becomes important when the tokens are more complicated than a single number, which they will be once we start tracking the position of multiple writers in the tokens. The valid operations here are: 1. Is a position before or after a token 2. Fetching all events between two tokens 3. Merging multiple tokens to get the "max", i.e. `C = max(A, B)` means that for all positions P where P is before A *or* before B, then P is before C. Future PR will change the token type to a dedicated type.
* Fixup pusher pool notifications (#8287)Erik Johnston2020-09-091-1/+2
| | | | | `pusher_pool.on_new_notifications` expected a min and max stream ID, however that was not what we were passing in. Instead, let's just pass it the current max stream ID and have it track the last stream ID it got passed. I believe that it mostly worked as we called the function for every event. However, it would break for events that got persisted out of order, i.e, that were persisted but the max stream ID wasn't incremented as not all preceding events had finished persisting, and push for that event would be delayed until another event got pushed to the effected users.
* Revert "Fixup pusher pool notifications"Erik Johnston2020-09-091-2/+1
| | | | This reverts commit e7fd336a53a4ca489cdafc389b494d5477019dc0.
* Fixup pusher pool notificationsErik Johnston2020-09-091-1/+2
|
* Stop sub-classing object (#8249)Patrick Cloke2020-09-044-5/+5
|
* Revert "Add experimental support for sharding event persister. (#8170)" (#8242)Brendan Abolivier2020-09-042-3/+3
| | | | | | | * Revert "Add experimental support for sharding event persister. (#8170)" This reverts commit 82c1ee1c22a87b9e6e3179947014b0f11c0a1ac3. * Changelog
* Add experimental support for sharding event persister. (#8170)Erik Johnston2020-09-022-3/+3
| | | | | | This is *not* ready for production yet. Caveats: 1. We should write some tests... 2. The stream token that we use for events can get stalled at the minimum position of all writers. This means that new events may not be processed and e.g. sent down sync streams if a writer isn't writing or is slow.
* Fix `wait_for_stream_position` for multiple waiters. (#8196)Erik Johnston2020-08-281-4/+2
| | | | | | This fixes a bug where having multiple callers waiting on the same stream and position will cause it to try and compare two deferreds, which fails (due to the sorted list having an entry of `Tuple[int, Deferred]`).
* Remove `ChainedIdGenerator`. (#8123)Erik Johnston2020-08-191-1/+1
| | | | | It's just a thin wrapper around two ID gens to make `get_current_token` and `get_next` return tuples. This can easily be replaced by calling the appropriate methods on the underlying ID gens directly.
* Be stricter about JSON that is accepted by Synapse (#8106)Patrick Cloke2020-08-191-7/+5
|
* Separate `get_current_token` into two. (#8113)Erik Johnston2020-08-191-1/+1
| | | | | | | | | | | | The function is used for two purposes: 1) for subscribers of streams to get a token they can use to get further updates with, and 2) for replication to track position of the writers of the stream. For streams with a single writer the two scenarios produce the same result, however the situation becomes complicated for streams with multiple writers. The current `MultiWriterIdGenerator` does not correctly handle the first case (which is not an issue as its only used for the `caches` stream which nothing subscribes to outside of replication).
* Reduce unnecessary whitespace in JSON. (#7372)David Vo2020-08-071-2/+3
|
* Handle replication commands synchronously where possible (#7876)Richard van der Hoff2020-07-273-86/+111
| | | Most of the stuff we do for replication commands can be done synchronously. There's no point spinning up background processes if we're not going to need them.
* Fix typing replication not being handled on master (#7959)Erik Johnston2020-07-271-0/+8
| | | | | | | | | | | | | | | | Handling of incoming typing stream updates from replication was not hooked up on master, effecting set ups where typing was handled on a different worker. This is really only a problem if the master process is also handling sync requests, which is unlikely for those that are at the stage of moving typing off. The other observable effect is that if a worker restarts or a replication connect drops then the typing worker will issue a `POSITION typing`, triggering master process to try and stream *all* typing updates from position 0. Fixes #7907
* Remove an unused prometheus metric (#7878)Richard van der Hoff2020-07-221-3/+1
|
* Track command processing as a background process (#7879)Richard van der Hoff2020-07-222-3/+38
| | | | I'm going to be doing more stuff synchronously, and I don't want to lose the CPU metrics down the sofa.
* Fix deprecation warning: import ABC from collections.abc (#7892)Karthikeyan Singaravelan2020-07-201-1/+1
|
* Optimise queueing of inbound replication commands (#7861)Richard van der Hoff2020-07-161-116/+215
| | | | | | | | | | | When we get behind on replication, we tend to stack up background processes behind a linearizer. Bg processes are heavy (particularly with respect to prometheus metrics) and linearizers aren't terribly efficient once the queue gets long either. A better approach is to maintain a queue of requests to be processed, and nominate a single process to work its way through the queue. Fixes: #7444
* Allow moving typing off master (#7869)Erik Johnston2020-07-162-3/+13
|
* Add ability to shard the federation sender (#7798)Erik Johnston2020-07-102-6/+8
|
* Fix some spelling mistakes / typos. (#7811)Patrick Cloke2020-07-095-5/+5
|
* Do not use simplejson in Synapse. (#7800)Patrick Cloke2020-07-081-9/+2
|
* Refactor getting replication updates from database v2. (#7740)Erik Johnston2020-07-071-46/+10
|
* isort 5 compatibility (#7786)Will Hunt2020-07-053-5/+3
| | | The CI appears to use the latest version of isort, which is a problem when isort gets a major version bump. Rather than try to pin the version, I've done the necessary to make isort5 happy with synapse.
* Refactor getting replication updates from database. (#7636)Erik Johnston2020-06-161-21/+8
| | | The aim here is to make it easier to reason about when streams are limited and when they're not, by moving the logic into the database functions themselves. This should mean we can kill of `db_query_to_update_function` function.
* Discard RDATA from already seen positions. (#7648)Patrick Cloke2020-06-152-6/+28
|
* Fix bug in account data replication stream. (#7656)Erik Johnston2020-06-091-2/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | * Ensure account data stream IDs are unique. The account data stream is shared between three tables, and the maximum allocated ID was tracked in a dedicated table. Updating the max ID happened outside the transaction that allocated the ID, leading to a race where if the server was restarted then the same ID could be allocated but the max ID failed to be updated, leading it to be reused. The ID generators have support for tracking across multiple tables, so we may as well use that instead of a dedicated table. * Fix bug in account data replication stream. If the same stream ID was used in both global and room account data then the getting updates for the replication stream would fail due to `heapq.merge(..)` trying to compare a `str` with a `None`. (This is because you'd have two rows like `(534, '!room')` and `(534, None)` from the room and global account data tables). Fix is just to order by stream ID, since we don't rely on the ordering beyond that. The bug where stream IDs can be reused should be fixed now, so this case shouldn't happen going forward. Fixes #7617
* Typo fixes.Patrick Cloke2020-06-051-1/+1
|
* Ensure ReplicationStreamer is always started when replication enabled. (#7579)Erik Johnston2020-05-271-0/+3
| | | Fixes #7566.
* Add option to move event persistence off master (#7517)Erik Johnston2020-05-221-0/+10
|
* Add ability to wait for replication streams (#7542)Erik Johnston2020-05-221-2/+88
| | | | | | | The idea here is that if an instance persists an event via the replication HTTP API it can return before we receive that event over replication, which can lead to races where code assumes that persisting an event immediately updates various caches (e.g. current state of the room). Most of Synapse doesn't hit such races, so we don't do the waiting automagically, instead we do so where necessary to avoid unnecessary delays. We may decide to change our minds here if it turns out there are a lot of subtle races going on. People probably want to look at this commit by commit.
* Fix limit logic for AccountDataStream (#7384)Richard van der Hoff2020-05-151-12/+56
| | | | | | Make sure that the AccountDataStream presents complete updates, in the right order. This is much the same fix as #7337 and #7358, but applied to a different stream.
* Move EventStream handling into default ReplicationDataHandler (#7493)Erik Johnston2020-05-141-4/+33
| | | This is so that the logic can happen on both master and workers when we move event persistence out.
* Have all instances correctly respond to REPLICATE command. (#7475)Erik Johnston2020-05-132-46/+48
| | | | | Before all streams were only written to from master, so only master needed to respond to `REPLICATE` commands. Before all instances wrote to the cache invalidation stream, but didn't respond to `REPLICATE`. This was a bug, which could lead to missed rows from cache invalidation stream if an instance is restarted, however all the caches would be empty in that case so it wasn't a problem.
* Fix Redis reconnection logic (#7482)Erik Johnston2020-05-132-2/+14
| | | Proactively send out `POSITION` commands (as if we had just received a `REPLICATE`) when we connect to Redis. This is important as other instances won't notice we've connected to issue a `REPLICATE` command (unlike for direct TCP connections). This is only currently an issue if master process reconnects without restarting (if it restarts then it won't have written anything and so other instances probably won't have missed anything).
* Merge branch 'release-v1.13.0' into developAndrew Morgan2020-05-112-4/+2
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | * release-v1.13.0: Don't UPGRADE database rows RST indenting Put rollback instructions in upgrade notes Fix changelog typo Oh yeah, RST Absolute URL it is then Fix upgrade notes link Provide summary of upgrade issues in changelog. Fix ) Move next version notes from changelog to upgrade notes Changelog fixes 1.13.0rc1 Documentation on setting up redis (#7446) Rework UI Auth session validation for registration (#7455) Fix errors from malformed log line (#7454) Drop support for redis.dbid (#7450)
| * Fix errors from malformed log line (#7454)Richard van der Hoff2020-05-071-1/+1
| |
| * Drop support for redis.dbid (#7450)Richard van der Hoff2020-05-071-3/+1
| | | | | | Since we only use pubsub, the dbid is irrelevant.
* | Support any process writing to cache invalidation stream. (#7436)Erik Johnston2020-05-077-106/+100
| |
* | Merge branch 'release-v1.13.0' into developRichard van der Hoff2020-05-062-34/+69
|\|
| * Merge branch 'release-v1.13.0' into rav/fix_dropped_messagesRichard van der Hoff2020-05-051-1/+1
| |\
| * \ Merge branch 'release-v1.13.0' into rav/fix_dropped_messagesRichard van der Hoff2020-05-056-53/+74
| |\ \
| * | | Wait for a POSITION on the right connection before accepting RDATARichard van der Hoff2020-05-052-19/+38
| | | | | | | | | | | | | | | | ... otherwise we can believe we're up to date when we're not.
| * | | Wait to subscribe before sending REPLICATERichard van der Hoff2020-05-052-20/+35
| | | |
* | | | Merge branch 'release-v1.13.0' into developRichard van der Hoff2020-05-061-1/+1
|\ \ \ \ | | |_|/ | |/| |
| * | | Move logs about discarded RDATA to debug (#7421)Brendan Abolivier2020-05-051-1/+1
| | |/ | |/|
* / | Fix catchup-on-reconnect for the Federation Stream (#7374)Richard van der Hoff2020-05-053-11/+24
|/ / | | | | | | looks like we managed to break this during the refactorathon.
* | Fix redis password support. (#7401)Erik Johnston2020-05-041-0/+3
| | | | | | | | | | We forgot to set the password on the subscriber connection, as well as not calling super methods for overridden connectionMade/connectionLost functions.
* | Thread through instance name to replication client. (#7369)Erik Johnston2020-05-015-27/+69
| | | | | | For in memory streams when fetching updates on workers we need to query the source of the stream, which currently is hard coded to be master. This PR threads through the source instance we received via `POSITION` through to the update function in each stream, which can then be passed to the replication client for in memory streams.
* | Use `stream.current_token()` and remove `stream_positions()` (#7172)Erik Johnston2020-05-012-27/+2
|/ | | | We move the processing of typing and federation replication traffic into their handlers so that `Stream.current_token()` points to a valid token. This allows us to remove `get_streams_to_replicate()` and `stream_positions()`.
* Workaround for assertion errors from db_query_to_update_function (#7378)Richard van der Hoff2020-05-011-2/+1
| | | Hopefully this is no worse than what we have on master...
* Add instance name to RDATA/POSITION commands (#7364)Erik Johnston2020-04-292-14/+40
| | | | | This is primarily for allowing us to send those commands from workers, but for now simply allows us to ignore echoed RDATA/POSITION commands that we sent (we get echoes of sent commands when using redis). Currently we log a WARNING on the master process every time we receive an echoed RDATA.
* Don't relay REMOTE_SERVER_UP cmds to same conn. (#7352)Erik Johnston2020-04-293-16/+51
| | | | | | | | | | | | | | For direct TCP connections we need the master to relay REMOTE_SERVER_UP commands to the other connections so that all instances get notified about it. The old implementation just relayed to all connections, assuming that sending back to the original sender of the command was safe. This is not true for redis, where commands sent get echoed back to the sender, which was causing master to effectively infinite loop sending and then re-receiving REMOTE_SERVER_UP commands that it sent. The fix is to ensure that we only relay to *other* connections and not to the connection we received the notification from. Fixes #7334.
* Fix limit logic for EventsStream (#7358)Richard van der Hoff2020-04-292-15/+11
| | | | | | | | | | | | | | | | | | | * Factor out functions for injecting events into database I want to add some more flexibility to the tools for injecting events into the database, and I don't want to clutter up HomeserverTestCase with them, so let's factor them out to a new file. * Rework TestReplicationDataHandler This wasn't very easy to work with: the mock wrapping was largely superfluous, and it's useful to be able to inspect the received rows, and clear out the received list. * Fix AssertionErrors being thrown by EventsStream Part of the problem was that there was an off-by-one error in the assertion, but also the limit logic was too simple. Fix it all up and add some tests.
* Run replication streamers on workers (#7146)Erik Johnston2020-04-281-18/+15
| | | Currently we never write to streams from workers, but that will change soon
* Fix EventsStream raising assertions when it falls behindRichard van der Hoff2020-04-241-18/+95
| | | | | | | | | | Figuring out how to correctly limit updates from this stream without dropping entries is far more complicated than just counting the number of rows being returned. We need to consider each query separately and, if any one query hits the limit, truncate the results from the others. I think this also fixes some potentially long-standing bugs where events or state changes could get missed if we hit the limit on either query.
* Make it clear that the limit for an update_function is a targetRichard van der Hoff2020-04-231-5/+9
|
* Remove 'limit' param from `get_repl_stream_updates` APIRichard van der Hoff2020-04-231-4/+1
| | | | | there doesn't seem to be much point in passing this limit all around, since both sides agree it's meant to be 100.
* Stop the master relaying USER_SYNC for other workers (#7318)Richard van der Hoff2020-04-222-12/+10
| | | | | | | Long story short: if we're handling presence on the current worker, we shouldn't be sending USER_SYNC commands over replication. In an attempt to figure out what is going on here, I ended up refactoring some bits of the presencehandler code, so the first 4 commits here are non-functional refactors to move this code slightly closer to sanity. (There's still plenty to do here :/). Suggest reviewing individual commits. Fixes (I hope) #7257.
* Fix replication metrics when using redis (#7325)Erik Johnston2020-04-222-37/+29
|
* Another go at fixing one-word commands (#7326)Richard van der Hoff2020-04-221-1/+1
| | | I messed this up last time I tried (#7239 / e13c6c7).
* Add ability to run replication protocol over redis. (#7040)Erik Johnston2020-04-225-34/+255
| | | This is configured via the `redis` config options.
* On catchup, process each row with its own stream id (#7286)Richard van der Hoff2020-04-201-5/+68
| | | | | | Other parts of the code (such as the StreamChangeCache) assume that there will not be multiple changes with the same stream id. This code was introduced in #7024, and I hope this fixes #7206.
* Improve type checking in `replication.tcp.Stream` (#7291)Richard van der Hoff2020-04-174-122/+142
| | | | | | | The general idea here is to get rid of the type: ignore annotations on all of the current_token and update_function assignments, which would have caught #7290. After a bit of experimentation, it seems like the least-awful way to do this is to pass the offending functions in as parameters to the Stream constructor. Unfortunately that means that the concrete implementations no longer have the same constructor signature as Stream itself, which means that it gets hard to correctly annotate STREAMS_MAP. I've also introduced a couple of new types, to take out some duplication.
* Fix 'generator object is not subscriptable' error (#7290)Richard van der Hoff2020-04-161-1/+2
| | | | | | Some of the query functions return generators rather than lists, so we can't index into the result. Happily we already have a copy of the results. (think this was introduced in #7024)
* Handle one-word replication commands correctlyRichard van der Hoff2020-04-071-3/+11
| | | | | `REPLICATE` is now a valid command, and it's nice if you can issue it from the console without remembering to call it `REPLICATE ` with a trailing space.
* Fix warnings about not calling superclass constructorRichard van der Hoff2020-04-071-15/+24
| | | | | | Separate `SimpleCommand` from `Command`, so that things which don't want to use the `data` property don't have to, and thus fix the warnings PyCharm was giving me about not calling `__init__` in the base class.
* Remove vestigal references to SYNC replication commandRichard van der Hoff2020-04-072-14/+0
| | | | We've ripped pretty much all of this out: let's remove the remains.
* Fix race in replication (#7226)Erik Johnston2020-04-072-29/+47
| | | | Fixes a race between handling `POSITION` and `RDATA` commands. We do this by simply linearizing handling of them.
* Move server command handling out of TCP protocol (#7187)Erik Johnston2020-04-073-269/+236
| | | This completes the merging of server and client command processing.
* Move client command handling out of TCP protocol (#7185)Erik Johnston2020-04-064-322/+336
| | | The aim here is to move the command handling out of the TCP protocol classes and to also merge the client and server command handling (so that we can reuse them for redis protocol). This PR simply moves the client paths to the new `ReplicationCommandHandler`, a future PR will move the server paths too.
* Remove connections per replication stream metric. (#7195)Erik Johnston2020-04-011-16/+0
| | | | | This broke in a recent PR (#7024) and is no longer useful due to all replication clients implicitly subscribing to all streams, so let's just remove it.
* Remove usage of "conn_id" for presence. (#7128)Erik Johnston2020-03-304-18/+50
| | | | | | | | | | | | | | | | * Remove `conn_id` usage for UserSyncCommand. Each tcp replication connection is assigned a "conn_id", which is used to give an ID to a remotely connected worker. In a redis world, there will no longer be a one to one mapping between connection and instance, so instead we need to replace such usages with an ID generated by the remote instances and included in the replicaiton commands. This really only effects UserSyncCommand. * Add CLEAR_USER_SYNCS command that is sent on shutdown. This should help with the case where a synchrotron gets restarted gracefully, rather than rely on 5 minute timeout.
* Move catchup of replication streams to worker. (#7024)Erik Johnston2020-03-258-229/+225
| | | This changes the replication protocol so that the server does not send down `RDATA` for rows that happened before the client connected. Instead, the server will send a `POSITION` and clients then query the database (or master out of band) to get up to date.
* Convert `*StreamRow` classes to inner classes (#7116)Richard van der Hoff2020-03-232-96/+101
| | | | | This just helps keep the rows closer to their streams, so that it's easier to see what the format of each stream is.
* Fix processing of `groups` stream, and use symbolic names for streams (#7117)Richard van der Hoff2020-03-231-18/+52
| | | | | | `groups` != `receipts` Introduced in #6964
* Remove concept of a non-limited stream. (#7011)Erik Johnston2020-03-202-47/+28
|
* Change device list replication to match new semantics.Erik Johnston2020-02-281-4/+9
| | | | | Instead of sending down batches of user ID/host tuples, send down a row per entity (user ID or host).
* Port PresenceHandler to async/await (#6991)Erik Johnston2020-02-261-1/+5
|
* Increase MAX_EVENTS_BEHIND for replication clientsErik Johnston2020-02-211-1/+1
|
* Fix sending server up commands from workers (#6811)Erik Johnston2020-01-301-0/+4
| | | | Co-authored-by: Andrew Morgan <1342360+anoadragon453@users.noreply.github.com>
* Propagate cache invalidates from workers to other workers. (#6748)Erik Johnston2020-01-272-4/+7
| | | Currently if a worker invalidates a cache it will be streamed to master, which then didn't forward those to other workers.
* Allow streaming cache invalidate all to workers. (#6749)Erik Johnston2020-01-221-5/+21
|
* Wake up transaction queue when remote server comes back online (#6706)Erik Johnston2020-01-174-0/+44
| | | | | This will be used to retry outbound transactions to a remote server if we think it might have come back up.
* Port synapse.replication.tcp to async/await (#6666)Erik Johnston2020-01-165-85/+63
| | | | | | | | | | * Port synapse.replication.tcp to async/await * Newsfile * Correctly document type of on_<FOO> functions as async * Don't be overenthusiastic with the asyncing....
* Fixup synapse.replication to pass mypy checks (#6667)Erik Johnston2020-01-147-77/+93
|
* Reduce the reconnect time when replication fails. (#6617)Richard van der Hoff2020-01-031-1/+2
|
* lintAndrew Morgan2019-11-081-2/+1
|
* Remove content from being sent for account data rdata streamAndrew Morgan2019-11-081-3/+3
|
* document the REPLICATE command a bit better (#6305)Richard van der Hoff2019-11-042-8/+86
| | | | since I found myself wonder how it works
* Merge branch 'develop' into uhoreg/cross_signing_fix_workers_notifyHubert Chathi2019-10-312-2/+2
|\
| * Remove usage of deprecated logger.warn method from codebase (#6271)Andrew Morgan2019-10-312-2/+2
| | | | | | Replace every instance of `logger.warn` with `logger.warning` as the former is deprecated.
* | make user signatures a separate streamHubert Chathi2019-10-302-0/+19
|/
* Remove unnecessary parentheses around return statements (#5931)Andrew Morgan2019-08-301-4/+4
| | | | | Python will return a tuple whether there are parentheses around the returned values or not. I'm just sick of my editor complaining about this all over the place :)
* Replace returnValue with return (#5736)Amber Brown2019-07-232-7/+7
|
* Move logging utilities out of the side drawer of util/ and into logging/ (#5606)Amber Brown2019-07-041-1/+1
|
* Run Black. (#5482)Amber Brown2019-06-207-183/+228
|
* Fix relations in worker modeErik Johnston2019-05-162-5/+7
|
* Combine the CurrentStateDeltaStream into the EventStreamRichard van der Hoff2019-03-273-23/+33
|
* Make EventStream rows have a typeRichard van der Hoff2019-03-272-14/+88
| | | | ... as a precursor to combining it with the CurrentStateDelta stream.
* Skip building a ROW_TYPE when building updatesRichard van der Hoff2019-03-271-2/+2
| | | | | We're about to turn it straight into a JSON object anyway so building a ROW_TYPE is a bit pointless, and reduces flexibility in the update_function.
* Add parse_row method to replication stream classRichard van der Hoff2019-03-273-3/+19
| | | | This will allow individual stream classes to override how a row is parsed.
* move FederationStream out to its own fileRichard van der Hoff2019-03-274-23/+43
|
* move EventsStream out to its own fileRichard van der Hoff2019-03-273-23/+42
|
* Move replication.tcp.streams into a packageRichard van der Hoff2019-03-272-33/+51
|
* Fix/improve some docstrings in the replication code. (#4949)Richard van der Hoff2019-03-272-7/+19
|
* Fix ClientReplicationStreamProtocol.__str__ (#4929)Richard van der Hoff2019-03-252-4/+5
| | | | | | | | `__str__` depended on `self.addr`, which was absent from ClientReplicationStreamProtocol, so attempting to call str on such an object would raise an exception. We can calculate the peer addr from the transport, so there is no need for addr anyway.
* Fix bug where read-receipts lost their timestamps (#4927)Richard van der Hoff2019-03-252-11/+27
| | | | | Make sure that they are sent correctly over the replication stream. Fixes: #4898
* Add a config option for torture-testing worker replication. (#4902)Richard van der Hoff2019-03-201-1/+17
| | | Setting this to 50 or so makes a bunch of sytests fail in worker mode.
* Simplify token replication logicAndrew Morgan2019-03-051-23/+14
|
* Clean up logic and add commentsAndrew Morgan2019-03-041-11/+18
|
* Clearer branching, fix missing list clearAndrew Morgan2019-03-041-4/+11
|
* Prevent replication wedgingAndrew Morgan2019-03-041-4/+24
|
* Merge pull request #4749 from matrix-org/erikj/replication_connection_backoffErik Johnston2019-02-273-5/+39
|\ | | | | Fix tightloop over connecting to replication server
| * Move connecting logic into ClientReplicationStreamProtocolErik Johnston2019-02-272-18/+17
| |
| * Increase the max delay between retry attemptsErik Johnston2019-02-261-1/+1
| | | | | | | | | | Otherwise if you have many workers they can easily take out master with their connection attempts
| * Fix tightloop over connecting to replication serverErik Johnston2019-02-262-4/+39
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If the client failed to process incoming commands during the initial set up of the replication connection it would immediately disconnect and reconnect, resulting in a tightloop. This can happen, for example, when subscribing to a stream that has a row that is too long in the backlog. The fix here is to not consider the connection successfully set up until the client has succesfully subscribed and caught up with the streams. This ensures that the retry logic timers aren't reset until then, meaning that if an error does happen during start up the client will continue backing off before retrying again.
* | Limit cache invalidation replication line length (#4748)Erik Johnston2019-02-271-1/+16
|/
* Don't truncate command name in metricsErik Johnston2018-10-291-2/+2
|
* Make the replication logger quieter (#4108)Amber Brown2018-10-291-1/+1
|
* Fix minor typo in exceptionTravis Ralston2018-09-131-1/+1
|
* Remove conn_idErik Johnston2018-09-041-2/+2
|
* Remove conn_id from repl prometheus metricsErik Johnston2018-09-031-10/+10
| | | | | `conn_id` gets set to a random string, and so we end up filling up prometheus with tonnes of data series, which is bad.
* Logcontexts for replication command handlersRichard van der Hoff2018-08-173-15/+43
| | | | | | | | | | Run the handlers for replication commands as background processes. This should improve the visibility in our metrics, and reduce the number of "running db transaction from sentinel context" warnings. Ideally it means converting the things that fire off deferreds into the night into things that actually return a Deferred when they are done. I've made a bit of a stab at this, but it will probably be leaky.
* Fix unit testsRichard van der Hoff2018-07-251-1/+1
| | | | | | on_notifier_poke no longer runs synchonously, so we have to do a different hack to make sure that the replication data has been sent. Let's actually listen for its arrival.
* Wrap a number of things that run in the backgroundRichard van der Hoff2018-07-251-6/+8
| | | | | This will reduce the number of "Starting db connection from sentinel context" warnings, and will help with our metrics.
* run isortAmber Brown2018-07-094-30/+41
|
* Attempt to be more performant on PyPy (#3462)Amber Brown2018-06-281-6/+10
|
* Remove all global reactor imports & pass it around explicitly (#3424)Amber Brown2018-06-252-5/+5
|
* Fix tcp protocol metrics naming (#3410)Amber Brown2018-06-211-18/+35
|
* Fix replication metricsRichard van der Hoff2018-06-041-2/+2
| | | | fix bug introduced in #3256
* Merge remote-tracking branch 'origin/develop' into 3218-official-promAmber Brown2018-05-282-8/+9
|\
| * Merge pull request #3244 from NotAFile/py3-six-4Amber Brown2018-05-242-5/+7
| |\ | | | | | | replace some iteritems with six
| | * replace some iteritems with sixAdrian Tschira2018-05-192-5/+7
| | | | | | | | | | | | Signed-off-by: Adrian Tschira <nota@notafile.com>
* | | more cleanupAmber Brown2018-05-222-6/+10
| | |
* | | fix the test failuresAmber Brown2018-05-221-1/+1
| | |
* | | cleanups, self-registrationAmber Brown2018-05-221-4/+5
| | |
* | | Merge remote-tracking branch 'origin/develop' into 3218-official-promAmber Brown2018-05-221-0/+2
|\| |
| * | Send users a server notice about consentRichard van der Hoff2018-05-221-0/+2
| |/ | | | | | | | | When a user first syncs, we will send them a server notice asking them to consent to the privacy policy if they have not already done so.
* | rest of the changesAmber Brown2018-05-211-16/+14
| |
* | replacing portionsAmber Brown2018-05-211-54/+34
|/
* make imports localAdrian Tschira2018-04-282-4/+4
| | | | Signed-off-by: Adrian Tschira <nota@notafile.com>
* Fix json encoding bug in replicationRichard van der Hoff2018-04-031-1/+1
| | | | json encoders have an encode method, not a dumps method.
* Use static JSONEncodersRichard van der Hoff2018-03-291-3/+5
| | | | | using json.dumps with custom options requires us to create a new JSONEncoder on each call. It's more efficient to create one upfront and reuse it.
* Explicitly use simplejsonErik Johnston2018-03-201-7/+7
|
* Fix replication after switch to simplejsonErik Johnston2018-03-191-2/+4
| | | | | Turns out that simplejson serialises namedtuple's as dictionaries rather than tuples by default.
* Merge branch 'master' of github.com:matrix-org/synapse into developErik Johnston2018-03-191-1/+1
|\
| * Replace ujson with simplejsonErik Johnston2018-03-151-1/+1
| |
* | Metrics for number of RDATA commands receivedRichard van der Hoff2018-01-151-5/+14
|/ | | | I found myself wishing we had this.
* Fix some logcontext leaks in replication resourceRichard van der Hoff2017-11-231-2/+4
| | | | | The @measure_func annotations rely on the wrapped function respecting the logcontext rules. Add the necessary yields to make this work.
* replace 'except:' with 'except Exception:'Richard van der Hoff2017-10-231-1/+1
| | | | what could possibly go wrong
* log when we get an exception handling replication updateshera2017-10-121-1/+5
|
* Fix replication. And notifyErik Johnston2017-07-201-0/+20
|
* Reduce log levels in tcp replicationErik Johnston2017-07-111-2/+2
|
* Serialize user ip command as jsonErik Johnston2017-06-271-5/+9
|
* Make workers report to master for user ip updatesErik Johnston2017-06-274-0/+55
|
* Initial worker implErik Johnston2017-06-161-0/+22
|
* Add missing notifierErik Johnston2017-06-091-1/+2
|
* TypoErik Johnston2017-04-101-1/+1
|
* Merge pull request #2109 from matrix-org/erikj/send_queue_fixErik Johnston2017-04-101-2/+2
|\ | | | | Fix up federation SendQueue and document types
| * CommentsErik Johnston2017-04-101-2/+2
| |
* | Up replication ping timeoutErik Johnston2017-04-101-2/+4
|/
* Merge pull request #2103 from matrix-org/erikj/no-double-encodeErik Johnston2017-04-071-28/+76
|\ | | | | Don't double encode replication data
| * Document types of the replication streamsErik Johnston2017-04-061-28/+76
| |
* | Fix incorrect type when using InvalidateCacheCommandErik Johnston2017-04-061-1/+1
| |
* | Add log linesErik Johnston2017-04-051-1/+2
| |
* | Rearrange metricsErik Johnston2017-04-051-16/+31
| |
* | Fix typoErik Johnston2017-04-051-2/+2
| |
* | Fixup some metrics for tcp replErik Johnston2017-04-051-0/+16
|/
* Merge pull request #2097 from matrix-org/erikj/repl_tcp_clientErik Johnston2017-04-051-0/+196
|\ | | | | Move to using TCP replication
| * Add basic replication client handler and factoryErik Johnston2017-04-031-0/+196
| |
* | Merge pull request #2098 from matrix-org/erikj/repl_tcp_fixErik Johnston2017-04-043-6/+15
|\ \ | | | | | | Advance replication streams even if nothing is listening
| * | Advance replication streams even if nothing is listeningErik Johnston2017-04-043-6/+15
| |/ | | | | | | | | | | Otherwise the streams don't advance and steadily fall behind, so when a worker does connect either a) they'll be streamed lots of old updates or b) the connection will fail as the streams are too far behind.
* / Fiddle tcp replication loggingErik Johnston2017-04-041-2/+2
|/
* Always advance stream tokensErik Johnston2017-04-031-1/+4
|
* Use callbacks to notify tcp replication rather than deferredsErik Johnston2017-03-311-14/+1
|
* Add a timestamp to USER_SYNC commandErik Johnston2017-03-313-9/+17
| | | | This timestamp is used to indicate when the user last sync'd
* Fix up docsErik Johnston2017-03-312-19/+3
|
* Add server side resource for tcp replicationErik Johnston2017-03-301-0/+300
|
* Initial TCP protocol implementationErik Johnston2017-03-303-0/+974
| | | | This defines the low level TCP replication protocol
* Define the various streams we will replicateErik Johnston2017-03-302-0/+423