summary refs log tree commit diff
path: root/synapse/replication (follow)
Commit message (Collapse)AuthorAgeFilesLines
* Use slots in attrs classes where possible (#8296)Patrick Cloke2020-09-141-2/+2
| | | | | slots use less memory (and attribute access is faster) while slightly limiting the flexibility of the class attributes. This focuses on objects which are instantiated "often" and for short periods of time.
* Fix typos in comments.Patrick Cloke2020-09-141-1/+1
|
* Add experimental support for sharding event persister. Again. (#8294)Erik Johnston2020-09-143-6/+12
| | | | | | This is *not* ready for production yet. Caveats: 1. We should write some tests... 2. The stream token that we use for events can get stalled at the minimum position of all writers. This means that new events may not be processed and e.g. sent down sync streams if a writer isn't writing or is slow.
* Clean up `Notifier.on_new_room_event` code path (#8288)Erik Johnston2020-09-101-6/+3
| | | | | | | | | | | | | The idea here is that we pass the `max_stream_id` to everything, and only use the stream ID of the particular event to figure out *when* the max stream position has caught up to the event and we can notify people about it. This is to maintain the distinction between the position of an item in the stream (i.e. event A has stream ID 513) and a token that can be used to partition the stream (i.e. give me all events after stream ID 352). This distinction becomes important when the tokens are more complicated than a single number, which they will be once we start tracking the position of multiple writers in the tokens. The valid operations here are: 1. Is a position before or after a token 2. Fetching all events between two tokens 3. Merging multiple tokens to get the "max", i.e. `C = max(A, B)` means that for all positions P where P is before A *or* before B, then P is before C. Future PR will change the token type to a dedicated type.
* Remove some unused distributor signals (#8216)Patrick Cloke2020-09-091-6/+4
| | | | | Removes the `user_joined_room` and stops calling it since there are no observers. Also cleans-up some other unused signals and related code.
* Fixup pusher pool notifications (#8287)Erik Johnston2020-09-091-1/+2
| | | | | `pusher_pool.on_new_notifications` expected a min and max stream ID, however that was not what we were passing in. Instead, let's just pass it the current max stream ID and have it track the last stream ID it got passed. I believe that it mostly worked as we called the function for every event. However, it would break for events that got persisted out of order, i.e, that were persisted but the max stream ID wasn't incremented as not all preceding events had finished persisting, and push for that event would be delayed until another event got pushed to the effected users.
* Revert "Fixup pusher pool notifications"Erik Johnston2020-09-091-2/+1
| | | | This reverts commit e7fd336a53a4ca489cdafc389b494d5477019dc0.
* Fixup pusher pool notificationsErik Johnston2020-09-091-1/+2
|
* Stop sub-classing object (#8249)Patrick Cloke2020-09-046-7/+7
|
* Revert "Add experimental support for sharding event persister. (#8170)" (#8242)Brendan Abolivier2020-09-043-12/+6
| | | | | | | * Revert "Add experimental support for sharding event persister. (#8170)" This reverts commit 82c1ee1c22a87b9e6e3179947014b0f11c0a1ac3. * Changelog
* Add experimental support for sharding event persister. (#8170)Erik Johnston2020-09-023-6/+12
| | | | | | This is *not* ready for production yet. Caveats: 1. We should write some tests... 2. The stream token that we use for events can get stalled at the minimum position of all writers. This means that new events may not be processed and e.g. sent down sync streams if a writer isn't writing or is slow.
* Move and rename `get_devices_with_keys_by_user` (#8204)Richard van der Hoff2020-09-011-0/+3
| | | | | | | | | | | | | | | | | | * Move `get_devices_with_keys_by_user` to `EndToEndKeyWorkerStore` this seems a better fit for it. This commit simply moves the existing code: no other changes at all. * Rename `get_devices_with_keys_by_user` to better reflect what it does. * get_device_stream_token abstract method To avoid referencing fields which are declared in the derived classes, make `get_device_stream_token` abstract, and define that in the classes which define `_device_list_id_gen`.
* Fix `wait_for_stream_position` for multiple waiters. (#8196)Erik Johnston2020-08-281-4/+2
| | | | | | This fixes a bug where having multiple callers waiting on the same stream and position will cause it to try and compare two deferreds, which fails (due to the sorted list having an entry of `Tuple[int, Deferred]`).
* Make SlavedIdTracker.advance have same interface as MultiWriterIDGenerator ↵Erik Johnston2020-08-2610-13/+13
| | | | (#8171)
* Remove `ChainedIdGenerator`. (#8123)Erik Johnston2020-08-192-7/+5
| | | | | It's just a thin wrapper around two ID gens to make `get_current_token` and `get_next` return tuples. This can easily be replaced by calling the appropriate methods on the underlying ID gens directly.
* Be stricter about JSON that is accepted by Synapse (#8106)Patrick Cloke2020-08-191-7/+5
|
* Separate `get_current_token` into two. (#8113)Erik Johnston2020-08-192-1/+9
| | | | | | | | | | | | The function is used for two purposes: 1) for subscribers of streams to get a token they can use to get further updates with, and 2) for replication to track position of the writers of the stream. For streams with a single writer the two scenarios produce the same result, however the situation becomes complicated for streams with multiple writers. The current `MultiWriterIdGenerator` does not correctly handle the first case (which is not an issue as its only used for the `caches` stream which nothing subscribes to outside of replication).
* Add a shadow-banned flag to users. (#8092)Patrick Cloke2020-08-141-0/+4
|
* Reduce unnecessary whitespace in JSON. (#7372)David Vo2020-08-071-2/+3
|
* Convert synapse.api to async/await (#8031)Patrick Cloke2020-08-061-1/+1
|
* Rename database classes to make some sense (#8033)Erik Johnston2020-08-0519-54/+54
|
* Convert replication code to async/await. (#7987)Patrick Cloke2020-08-039-37/+27
|
* Merge tag 'v1.18.0rc2' into developRichard van der Hoff2020-07-284-87/+112
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Synapse 1.18.0rc2 (2020-07-28) ============================== Bugfixes -------- - Fix an `AssertionError` exception introduced in v1.18.0rc1. ([\#7876](https://github.com/matrix-org/synapse/issues/7876)) - Fix experimental support for moving typing off master when worker is restarted, which is broken in v1.18.0rc1. ([\#7967](https://github.com/matrix-org/synapse/issues/7967)) Internal Changes ---------------- - Further optimise queueing of inbound replication commands. ([\#7876](https://github.com/matrix-org/synapse/issues/7876))
| * Typing worker needs to handle stream update requests (#7967)Erik Johnston2020-07-281-1/+1
| | | | | | | | | | IIRC this doesn't break tests because its only hit on reconnection, or something. Basically, when a process needs to fetch missing updates for the `typing` stream it needs to query the writer instance via HTTP (as we don't write typing notifications to the DB), the problem was that the endpoint (`streams`) was only registered on master and specifically not on the typing writer worker.
| * Handle replication commands synchronously where possible (#7876)Richard van der Hoff2020-07-273-86/+111
| | | | | | Most of the stuff we do for replication commands can be done synchronously. There's no point spinning up background processes if we're not going to need them.
* | Convert a synapse.events to async/await. (#7949)Patrick Cloke2020-07-272-2/+4
|/
* Fix typing replication not being handled on master (#7959)Erik Johnston2020-07-271-0/+8
| | | | | | | | | | | | | | | | Handling of incoming typing stream updates from replication was not hooked up on master, effecting set ups where typing was handled on a different worker. This is really only a problem if the master process is also handling sync requests, which is unlikely for those that are at the stage of moving typing off. The other observable effect is that if a worker restarts or a replication connect drops then the typing worker will issue a `POSITION typing`, triggering master process to try and stream *all* typing updates from position 0. Fixes #7907
* Remove an unused prometheus metric (#7878)Richard van der Hoff2020-07-221-3/+1
|
* Track command processing as a background process (#7879)Richard van der Hoff2020-07-222-3/+38
| | | | I'm going to be doing more stuff synchronously, and I don't want to lose the CPU metrics down the sofa.
* Fix deprecation warning: import ABC from collections.abc (#7892)Karthikeyan Singaravelan2020-07-201-1/+1
|
* Stop using 'device_max_stream_id' (#7882)Erik Johnston2020-07-171-1/+1
| | | | | It serves no purpose and updating everytime we write to the device inbox stream means all such transactions will conflict, causing lots of transaction failures and retries.
* Optimise queueing of inbound replication commands (#7861)Richard van der Hoff2020-07-161-116/+215
| | | | | | | | | | | When we get behind on replication, we tend to stack up background processes behind a linearizer. Bg processes are heavy (particularly with respect to prometheus metrics) and linearizers aren't terribly efficient once the queue gets long either. A better approach is to maintain a queue of requests to be processed, and nominate a single process to work its way through the queue. Fixes: #7444
* Allow moving typing off master (#7869)Erik Johnston2020-07-162-3/+13
|
* Add ability to shard the federation sender (#7798)Erik Johnston2020-07-102-6/+8
|
* Fix some spelling mistakes / typos. (#7811)Patrick Cloke2020-07-096-7/+7
|
* Generate real events when we reject invites (#7804)Richard van der Hoff2020-07-091-67/+25
| | | | | | | | Fixes #2181. The basic premise is that, when we fail to reject an invite via the remote server, we can generate our own out-of-band leave event and persist it as an outlier, so that we have something to send to the client.
* Do not use simplejson in Synapse. (#7800)Patrick Cloke2020-07-081-9/+2
|
* Refactor getting replication updates from database v2. (#7740)Erik Johnston2020-07-071-46/+10
|
* isort 5 compatibility (#7786)Will Hunt2020-07-053-5/+3
| | | The CI appears to use the latest version of isort, which is a problem when isort gets a major version bump. Rather than try to pin the version, I've done the necessary to make isort5 happy with synapse.
* Merge different Resource implementation classes (#7732)Erik Johnston2020-07-032-10/+4
|
* Use symbolic names for replication stream names (#7768)Richard van der Hoff2020-07-018-17/+17
| | | This makes it much easier to find where streams are referenced.
* Refactor getting replication updates from database. (#7636)Erik Johnston2020-06-161-21/+8
| | | The aim here is to make it easier to reason about when streams are limited and when they're not, by moving the logic into the database functions themselves. This should mean we can kill of `db_query_to_update_function` function.
* Replace all remaining six usage with native Python 3 equivalents (#7704)Dagfinn Ilmari Mannsåker2020-06-161-4/+2
|
* Discard RDATA from already seen positions. (#7648)Patrick Cloke2020-06-152-6/+28
|
* Fix bug in account data replication stream. (#7656)Erik Johnston2020-06-092-3/+15
| | | | | | | | | | | | | | | | | | | | | | | | | | * Ensure account data stream IDs are unique. The account data stream is shared between three tables, and the maximum allocated ID was tracked in a dedicated table. Updating the max ID happened outside the transaction that allocated the ID, leading to a race where if the server was restarted then the same ID could be allocated but the max ID failed to be updated, leading it to be reused. The ID generators have support for tracking across multiple tables, so we may as well use that instead of a dedicated table. * Fix bug in account data replication stream. If the same stream ID was used in both global and room account data then the getting updates for the replication stream would fail due to `heapq.merge(..)` trying to compare a `str` with a `None`. (This is because you'd have two rows like `(534, '!room')` and `(534, None)` from the room and global account data tables). Fix is just to order by stream ID, since we don't rely on the ordering beyond that. The bug where stream IDs can be reused should be fixed now, so this case shouldn't happen going forward. Fixes #7617
* Typo fixes.Patrick Cloke2020-06-051-1/+1
|
* Ensure ReplicationStreamer is always started when replication enabled. (#7579)Erik Johnston2020-05-271-0/+3
| | | Fixes #7566.
* Add option to move event persistence off master (#7517)Erik Johnston2020-05-225-2/+171
|
* Add ability to wait for replication streams (#7542)Erik Johnston2020-05-225-18/+108
| | | | | | | The idea here is that if an instance persists an event via the replication HTTP API it can return before we receive that event over replication, which can lead to races where code assumes that persisting an event immediately updates various caches (e.g. current state of the room). Most of Synapse doesn't hit such races, so we don't do the waiting automagically, instead we do so where necessary to avoid unnecessary delays. We may decide to change our minds here if it turns out there are a lot of subtle races going on. People probably want to look at this commit by commit.
* Allow ReplicationRestResource to be added to workers (#7515)Erik Johnston2020-05-181-5/+8
| | | This allows workers to talk to each other over HTTP replication.
* Merge pull request #7519 from matrix-org/rav/kill_py2_codeRichard van der Hoff2020-05-182-13/+4
|\ | | | | Kill off some old python 2 code
| * remove redundant `__func__`Richard van der Hoff2020-05-152-13/+4
| | | | | | | | this is a no-op under python 3
* | Fix limit logic for AccountDataStream (#7384)Richard van der Hoff2020-05-151-12/+56
| | | | | | | | | | | | Make sure that the AccountDataStream presents complete updates, in the right order. This is much the same fix as #7337 and #7358, but applied to a different stream.
* | Move event stream handling out of slave store. (#7491)Erik Johnston2020-05-152-97/+0
|/ | | | | This allows us to have the logic on both master and workers, which is necessary to move event persistence off master. We also combine the instantiation of ID generators from DataStore and slave stores to the base worker stores. This allows us to select which process writes events independently of the master/worker splits.
* Move EventStream handling into default ReplicationDataHandler (#7493)Erik Johnston2020-05-141-4/+33
| | | This is so that the logic can happen on both master and workers when we move event persistence out.
* Add `instance_map` config and route replication calls (#7495)Erik Johnston2020-05-141-6/+15
|
* Have all instances correctly respond to REPLICATE command. (#7475)Erik Johnston2020-05-133-48/+50
| | | | | Before all streams were only written to from master, so only master needed to respond to `REPLICATE` commands. Before all instances wrote to the cache invalidation stream, but didn't respond to `REPLICATE`. This was a bug, which could lead to missed rows from cache invalidation stream if an instance is restarted, however all the caches would be empty in that case so it wasn't a problem.
* Fix Redis reconnection logic (#7482)Erik Johnston2020-05-132-2/+14
| | | Proactively send out `POSITION` commands (as if we had just received a `REPLICATE`) when we connect to Redis. This is important as other instances won't notice we've connected to issue a `REPLICATE` command (unlike for direct TCP connections). This is only currently an issue if master process reconnects without restarting (if it restarts then it won't have written anything and so other instances probably won't have missed anything).
* Allow configuration of Synapse's cache without using synctl or environment ↵Amber Brown2020-05-111-2/+1
| | | | variables (#6391)
* Merge branch 'release-v1.13.0' into developAndrew Morgan2020-05-112-4/+2
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | * release-v1.13.0: Don't UPGRADE database rows RST indenting Put rollback instructions in upgrade notes Fix changelog typo Oh yeah, RST Absolute URL it is then Fix upgrade notes link Provide summary of upgrade issues in changelog. Fix ) Move next version notes from changelog to upgrade notes Changelog fixes 1.13.0rc1 Documentation on setting up redis (#7446) Rework UI Auth session validation for registration (#7455) Fix errors from malformed log line (#7454) Drop support for redis.dbid (#7450)
| * Fix errors from malformed log line (#7454)Richard van der Hoff2020-05-071-1/+1
| |
| * Drop support for redis.dbid (#7450)Richard van der Hoff2020-05-071-3/+1
| | | | | | Since we only use pubsub, the dbid is irrelevant.
* | Support any process writing to cache invalidation stream. (#7436)Erik Johnston2020-05-0718-183/+131
| |
* | Merge branch 'release-v1.13.0' into developRichard van der Hoff2020-05-062-34/+69
|\|
| * Merge branch 'release-v1.13.0' into rav/fix_dropped_messagesRichard van der Hoff2020-05-051-1/+1
| |\
| * \ Merge branch 'release-v1.13.0' into rav/fix_dropped_messagesRichard van der Hoff2020-05-0519-132/+96
| |\ \
| * | | Wait for a POSITION on the right connection before accepting RDATARichard van der Hoff2020-05-052-19/+38
| | | | | | | | | | | | | | | | ... otherwise we can believe we're up to date when we're not.
| * | | Wait to subscribe before sending REPLICATERichard van der Hoff2020-05-052-20/+35
| | | |
* | | | Merge branch 'release-v1.13.0' into developRichard van der Hoff2020-05-061-1/+1
|\ \ \ \ | | |_|/ | |/| |
| * | | Move logs about discarded RDATA to debug (#7421)Brendan Abolivier2020-05-051-1/+1
| | |/ | |/|
* / | Fix catchup-on-reconnect for the Federation Stream (#7374)Richard van der Hoff2020-05-053-11/+24
|/ / | | | | | | looks like we managed to break this during the refactorathon.
* | Fix redis password support. (#7401)Erik Johnston2020-05-041-0/+3
| | | | | | | | | | We forgot to set the password on the subscriber connection, as well as not calling super methods for overridden connectionMade/connectionLost functions.
* | Thread through instance name to replication client. (#7369)Erik Johnston2020-05-017-29/+90
| | | | | | For in memory streams when fetching updates on workers we need to query the source of the stream, which currently is hard coded to be master. This PR threads through the source instance we received via `POSITION` through to the update function in each stream, which can then be passed to the replication client for in memory streams.
* | Use `stream.current_token()` and remove `stream_positions()` (#7172)Erik Johnston2020-05-0113-104/+3
|/ | | | We move the processing of typing and federation replication traffic into their handlers so that `Stream.current_token()` points to a valid token. This allows us to remove `get_streams_to_replicate()` and `stream_positions()`.
* Workaround for assertion errors from db_query_to_update_function (#7378)Richard van der Hoff2020-05-011-2/+1
| | | Hopefully this is no worse than what we have on master...
* Add instance name to RDATA/POSITION commands (#7364)Erik Johnston2020-04-292-14/+40
| | | | | This is primarily for allowing us to send those commands from workers, but for now simply allows us to ignore echoed RDATA/POSITION commands that we sent (we get echoes of sent commands when using redis). Currently we log a WARNING on the master process every time we receive an echoed RDATA.
* Don't relay REMOTE_SERVER_UP cmds to same conn. (#7352)Erik Johnston2020-04-293-16/+51
| | | | | | | | | | | | | | For direct TCP connections we need the master to relay REMOTE_SERVER_UP commands to the other connections so that all instances get notified about it. The old implementation just relayed to all connections, assuming that sending back to the original sender of the command was safe. This is not true for redis, where commands sent get echoed back to the sender, which was causing master to effectively infinite loop sending and then re-receiving REMOTE_SERVER_UP commands that it sent. The fix is to ensure that we only relay to *other* connections and not to the connection we received the notification from. Fixes #7334.
* Fix limit logic for EventsStream (#7358)Richard van der Hoff2020-04-292-15/+11
| | | | | | | | | | | | | | | | | | | * Factor out functions for injecting events into database I want to add some more flexibility to the tools for injecting events into the database, and I don't want to clutter up HomeserverTestCase with them, so let's factor them out to a new file. * Rework TestReplicationDataHandler This wasn't very easy to work with: the mock wrapping was largely superfluous, and it's useful to be able to inspect the received rows, and clear out the received list. * Fix AssertionErrors being thrown by EventsStream Part of the problem was that there was an off-by-one error in the assertion, but also the limit logic was too simple. Fix it all up and add some tests.
* Run replication streamers on workers (#7146)Erik Johnston2020-04-281-18/+15
| | | Currently we never write to streams from workers, but that will change soon
* Fix EventsStream raising assertions when it falls behindRichard van der Hoff2020-04-241-18/+95
| | | | | | | | | | Figuring out how to correctly limit updates from this stream without dropping entries is far more complicated than just counting the number of rows being returned. We need to consider each query separately and, if any one query hits the limit, truncate the results from the others. I think this also fixes some potentially long-standing bugs where events or state changes could get missed if we hit the limit on either query.
* Make it clear that the limit for an update_function is a targetRichard van der Hoff2020-04-231-5/+9
|
* Remove 'limit' param from `get_repl_stream_updates` APIRichard van der Hoff2020-04-232-9/+8
| | | | | there doesn't seem to be much point in passing this limit all around, since both sides agree it's meant to be 100.
* Stop the master relaying USER_SYNC for other workers (#7318)Richard van der Hoff2020-04-222-12/+10
| | | | | | | Long story short: if we're handling presence on the current worker, we shouldn't be sending USER_SYNC commands over replication. In an attempt to figure out what is going on here, I ended up refactoring some bits of the presencehandler code, so the first 4 commits here are non-functional refactors to move this code slightly closer to sanity. (There's still plenty to do here :/). Suggest reviewing individual commits. Fixes (I hope) #7257.
* Fix replication metrics when using redis (#7325)Erik Johnston2020-04-222-37/+29
|
* Another go at fixing one-word commands (#7326)Richard van der Hoff2020-04-221-1/+1
| | | I messed this up last time I tried (#7239 / e13c6c7).
* Add ability to run replication protocol over redis. (#7040)Erik Johnston2020-04-225-34/+255
| | | This is configured via the `redis` config options.
* On catchup, process each row with its own stream id (#7286)Richard van der Hoff2020-04-201-5/+68
| | | | | | Other parts of the code (such as the StreamChangeCache) assume that there will not be multiple changes with the same stream id. This code was introduced in #7024, and I hope this fixes #7206.
* Improve type checking in `replication.tcp.Stream` (#7291)Richard van der Hoff2020-04-174-122/+142
| | | | | | | The general idea here is to get rid of the type: ignore annotations on all of the current_token and update_function assignments, which would have caught #7290. After a bit of experimentation, it seems like the least-awful way to do this is to pass the offending functions in as parameters to the Stream constructor. Unfortunately that means that the concrete implementations no longer have the same constructor signature as Stream itself, which means that it gets hard to correctly annotate STREAMS_MAP. I've also introduced a couple of new types, to take out some duplication.
* Fix 'generator object is not subscriptable' error (#7290)Richard van der Hoff2020-04-161-1/+2
| | | | | | Some of the query functions return generators rather than lists, so we can't index into the result. Happily we already have a copy of the results. (think this was introduced in #7024)
* Handle one-word replication commands correctlyRichard van der Hoff2020-04-071-3/+11
| | | | | `REPLICATE` is now a valid command, and it's nice if you can issue it from the console without remembering to call it `REPLICATE ` with a trailing space.
* Fix warnings about not calling superclass constructorRichard van der Hoff2020-04-071-15/+24
| | | | | | Separate `SimpleCommand` from `Command`, so that things which don't want to use the `data` property don't have to, and thus fix the warnings PyCharm was giving me about not calling `__init__` in the base class.
* Remove vestigal references to SYNC replication commandRichard van der Hoff2020-04-072-14/+0
| | | | We've ripped pretty much all of this out: let's remove the remains.
* Fix race in replication (#7226)Erik Johnston2020-04-072-29/+47
| | | | Fixes a race between handling `POSITION` and `RDATA` commands. We do this by simply linearizing handling of them.
* Move server command handling out of TCP protocol (#7187)Erik Johnston2020-04-073-269/+236
| | | This completes the merging of server and client command processing.
* Move client command handling out of TCP protocol (#7185)Erik Johnston2020-04-064-322/+336
| | | The aim here is to move the command handling out of the TCP protocol classes and to also merge the client and server command handling (so that we can reuse them for redis protocol). This PR simply moves the client paths to the new `ReplicationCommandHandler`, a future PR will move the server paths too.
* Remove connections per replication stream metric. (#7195)Erik Johnston2020-04-011-16/+0
| | | | | This broke in a recent PR (#7024) and is no longer useful due to all replication clients implicitly subscribing to all streams, so let's just remove it.
* Remove usage of "conn_id" for presence. (#7128)Erik Johnston2020-03-304-18/+50
| | | | | | | | | | | | | | | | * Remove `conn_id` usage for UserSyncCommand. Each tcp replication connection is assigned a "conn_id", which is used to give an ID to a remotely connected worker. In a redis world, there will no longer be a one to one mapping between connection and instance, so instead we need to replace such usages with an ID generated by the remote instances and included in the replicaiton commands. This really only effects UserSyncCommand. * Add CLEAR_USER_SYNCS command that is sent on shutdown. This should help with the case where a synchrotron gets restarted gracefully, rather than rely on 5 minute timeout.
* Move catchup of replication streams to worker. (#7024)Erik Johnston2020-03-2512-232/+319
| | | This changes the replication protocol so that the server does not send down `RDATA` for rows that happened before the client connected. Instead, the server will send a `POSITION` and clients then query the database (or master out of band) to get up to date.
* Convert `*StreamRow` classes to inner classes (#7116)Richard van der Hoff2020-03-232-96/+101
| | | | | This just helps keep the rows closer to their streams, so that it's easier to see what the format of each stream is.
* Fix processing of `groups` stream, and use symbolic names for streams (#7117)Richard van der Hoff2020-03-231-18/+52
| | | | | | `groups` != `receipts` Introduced in #6964
* Remove concept of a non-limited stream. (#7011)Erik Johnston2020-03-202-47/+28
|
* Change device list streams to have one row per ID (#7010)Erik Johnston2020-03-192-17/+32
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | * Add 'device_lists_outbound_pokes' as extra table. This makes sure we check all the relevant tables to get the current max stream ID. Currently not doing so isn't problematic as the max stream ID in `device_lists_outbound_pokes` is the same as in `device_lists_stream`, however that will change. * Change device lists stream to have one row per id. This will make it possible to process the streams more incrementally, avoiding having to process large chunks at once. * Change device list replication to match new semantics. Instead of sending down batches of user ID/host tuples, send down a row per entity (user ID or host). * Newsfile * Remove handling of multiple rows per ID * Fix worker handling * Comments from review
| * Comments from reviewErik Johnston2020-03-181-0/+3
| |
| * Change device list replication to match new semantics.Erik Johnston2020-02-282-16/+22
| | | | | | | | | | Instead of sending down batches of user ID/host tuples, send down a row per entity (user ID or host).
| * Add 'device_lists_outbound_pokes' as extra table.Erik Johnston2020-02-281-1/+7
| | | | | | | | | | | | | | | | | | This makes sure we check all the relevant tables to get the current max stream ID. Currently not doing so isn't problematic as the max stream ID in `device_lists_outbound_pokes` is the same as in `device_lists_stream`, however that will change.
* | Store room_versions in EventBase objects (#6875)Richard van der Hoff2020-03-052-8/+19
|/ | | | | | | This is a bit fiddly because it all has to be done on one fell swoop: * Wherever we create a new event, pass in the room version (and check it matches the format version) * When we prune an event, use the room version of the unpruned event to create the pruned version. * When we pass an event over the replication protocol, pass the room version over alongside it, and use it when deserialising the event again.
* Store room version on invite (#6983)Richard van der Hoff2020-02-262-2/+36
| | | | | When we get an invite over federation, store the room version in the rooms table. The general idea here is that, when we pull the invite out again, we'll want to know what room_version it belongs to (so that we can later redact it if need be). So we need to store it somewhere...
* Port PresenceHandler to async/await (#6991)Erik Johnston2020-02-261-1/+5
|
* Merge worker apps into one. (#6964)Erik Johnston2020-02-251-0/+20
|
* Increase MAX_EVENTS_BEHIND for replication clientsErik Johnston2020-02-211-1/+1
|
* Allow moving group read APIs to workers (#6866)Erik Johnston2020-02-071-8/+6
|
* Fix sending server up commands from workers (#6811)Erik Johnston2020-01-301-0/+4
| | | | Co-authored-by: Andrew Morgan <1342360+anoadragon453@users.noreply.github.com>
* Detect unknown remote devices and mark cache as stale (#6776)Erik Johnston2020-01-281-1/+1
| | | | We just mark the fact that the cache may be stale in the database for now.
* Propagate cache invalidates from workers to other workers. (#6748)Erik Johnston2020-01-272-4/+7
| | | Currently if a worker invalidates a cache it will be streamed to master, which then didn't forward those to other workers.
* Allow streaming cache invalidate all to workers. (#6749)Erik Johnston2020-01-222-6/+27
|
* Wake up transaction queue when remote server comes back online (#6706)Erik Johnston2020-01-174-0/+44
| | | | | This will be used to retry outbound transactions to a remote server if we think it might have come back up.
* Port synapse.replication.tcp to async/await (#6666)Erik Johnston2020-01-165-85/+63
| | | | | | | | | | * Port synapse.replication.tcp to async/await * Newsfile * Correctly document type of on_<FOO> functions as async * Don't be overenthusiastic with the asyncing....
* Add `local_current_membership` table (#6655)Erik Johnston2020-01-151-1/+1
| | | | | | | Currently we rely on `current_state_events` to figure out what rooms a user was in and their last membership event in there. However, if the server leaves the room then the table may be cleaned up and that information is lost. So lets add a table that separately holds that information.
* Fixup synapse.replication to pass mypy checks (#6667)Erik Johnston2020-01-1410-86/+103
|
* Reduce the reconnect time when replication fails. (#6617)Richard van der Hoff2020-01-031-1/+2
|
* Change EventContext to use the Storage class (#6564)Erik Johnston2019-12-202-2/+6
|
* Change DataStores to accept 'database' param.Erik Johnston2019-12-0613-26/+39
|
* _CURRENT_STATE_CACHE_NAME is publicErik Johnston2019-12-041-2/+2
|
* Move cache invalidation to main data storeErik Johnston2019-12-041-1/+2
|
* Propagate reason in remotely rejected invitesErik Johnston2019-11-281-2/+5
|
* Prevent account_data content from being sent over TCP replication (#6333)Andrew Morgan2019-11-261-4/+3
|\
| * lintAndrew Morgan2019-11-081-2/+1
| |
| * Remove content from being sent for account data rdata streamAndrew Morgan2019-11-081-3/+3
| |
* | Merge pull request #6332 from matrix-org/erikj/query_devices_fixErik Johnston2019-11-262-1/+82
|\ \ | |/ |/| Fix caching devices for remote servers in worker.
| * Fixup docsErik Johnston2019-11-261-1/+5
| |
| * Fix caching devices for remote servers in worker.Erik Johnston2019-11-052-1/+78
| | | | | | | | | | | | | | | | When the `/keys/query` API is hit on client_reader worker Synapse may decide that it needs to resync some remote deivces. Usually this happens on master, and then gets cached. However, that fails on workers and so it falls back to fetching devices from remotes directly, which may in turn fail if the remote is down.
* | Address review commentsAndrew Morgan2019-11-061-1/+1
| |
* | Don't forget to ratelimit calls outside of RegistrationHandlerAndrew Morgan2019-11-061-0/+2
|/
* document the REPLICATE command a bit better (#6305)Richard van der Hoff2019-11-043-9/+95
| | | | since I found myself wonder how it works
* Merge branch 'develop' into uhoreg/cross_signing_fix_workers_notifyHubert Chathi2019-10-314-4/+4
|\
| * Remove usage of deprecated logger.warn method from codebase (#6271)Andrew Morgan2019-10-314-4/+4
| | | | | | Replace every instance of `logger.warn` with `logger.warning` as the former is deprecated.
* | clean up code a bitHubert Chathi2019-10-311-5/+9
| |
* | make user signatures a separate streamHubert Chathi2019-10-303-2/+25
| |
* | Merge branch 'develop' into uhoreg/cross_signing_fix_workers_notifyHubert Chathi2019-10-306-44/+26
|\|
| * Port replication http server endpoints to async/awaitErik Johnston2019-10-296-44/+26
| |
* | make notification of signatures work with workersHubert Chathi2019-10-241-0/+1
|/
* Merge branch 'develop' of github.com:matrix-org/synapse into ↵Erik Johnston2019-10-221-0/+3
|\ | | | | | | erikj/refactor_stores
| * Merge branch 'develop' into uhoreg/e2e_cross-signing_mergedHubert Chathi2019-09-071-10/+11
| |\
| * | add user signature stream change cache to slaved device storeHubert Chathi2019-09-041-0/+3
| | |
* | | Move storage classes into a main "data store".Erik Johnston2019-10-2117-27/+29
| |/ |/| | | | | | | This is in preparation for having multiple data stores that offer different functionality, e.g. splitting out state or event storage.
* | Trace how long it takes for the send trasaction to complete, including ↵Jorik Schellekens2019-09-051-1/+6
| | | | | | | | retrys (#5986)
* | Add opentracing to all client servlets (#5983)Jorik Schellekens2019-09-051-10/+6
|/
* Remove bind_email and bind_msisdn (#5964)Andrew Morgan2019-09-041-18/+3
| | | Removes the `bind_email` and `bind_msisdn` parameters from the `/register` C/S API endpoint as per [MSC2140: Terms of Service for ISes and IMs](https://github.com/matrix-org/matrix-doc/pull/2140/files#diff-c03a26de5ac40fb532de19cb7fc2aaf7R107).
* Remove unnecessary parentheses around return statements (#5931)Andrew Morgan2019-08-306-15/+15
| | | | | Python will return a tuple whether there are parentheses around the returned values or not. I'm just sick of my editor complaining about this all over the place :)
* Opentracing across workers (#5771)Jorik Schellekens2019-08-221-2/+14
| | | | | | | | | | | | | | Propagate opentracing contexts across workers Also includes some Convenience modifications to opentracing for servlets, notably: - Add boolean to skip the whitelisting check on inject extract methods. - useful when injecting into carriers locally. Otherwise we'd always have to include our own servername and whitelist our servername - start_active_span_from_request instead of header - Add boolean to decide whether to extract context from a request to a servlet
* Revert "Add "require_consent" parameter for registration"Brendan Abolivier2019-08-221-2/+0
| | | | This reverts commit 3320aaab3a9bba3f5872371aba7053b41af9d0a0.
* Add "require_consent" parameter for registrationHalf-Shot2019-08-221-0/+2
|
* Merge tag 'v1.2.0rc2' into developAndrew Morgan2019-07-241-1/+1
|\ | | | | | | | | | | | | Bugfixes -------- - Fix a regression introduced in v1.2.0rc1 which led to incorrect labels on some prometheus metrics. ([\#5734](https://github.com/matrix-org/synapse/issues/5734))
| * Fix servlet metric names (#5734)Jorik Schellekens2019-07-241-1/+1
| | | | | | | | | | | | | | | | | | | | * Fix servlet metric names Co-Authored-By: Richard van der Hoff <1389908+richvdh@users.noreply.github.com> * Remove redundant check * Cover all return paths
* | Replace returnValue with return (#5736)Amber Brown2019-07-238-20/+20
|/
* Remove access-token support from RegistrationHandler.register (#5641)Richard van der Hoff2019-07-081-6/+0
| | | | | | | | Nothing uses this now, so we can remove the dead code, and clean up the API. Since we're changing the shape of the return value anyway, we take the opportunity to give the method a better name.
* Remove support for invite_3pid_guest. (#5625)Richard van der Hoff2019-07-051-65/+0
| | | | | | | | | This has never been documented, and I'm not sure it's ever been used outside sytest. It's quite a lot of poorly-maintained code, so I'd like to get rid of it. For now I haven't removed the database table; I suggest we leave that for a future clearout.
* Move logging utilities out of the side drawer of util/ and into logging/ (#5606)Amber Brown2019-07-041-1/+1
|
* Run Black. (#5482)Amber Brown2019-06-2026-355/+357
|
* Handle failing to talk to master over replicationErik Johnston2019-06-071-1/+9
|
* Fixup bsaed on review commentsErik Johnston2019-05-171-1/+1
|
* Add basic editing supportErik Johnston2019-05-161-0/+1
|
* Fix relations in worker modeErik Johnston2019-05-163-8/+17
|
* Replace SlavedKeyStore with a shimRichard van der Hoff2019-04-081-14/+4
| | | | | since we're pulling everything out of KeyStore anyway, we may as well simplify it.
* Remove unused server_tls_certificates functions (#5028)Richard van der Hoff2019-04-081-3/+0
| | | | These have been unused since #4120, and with the demise of perspectives, it is unlikely that they will ever be used again.
* Remove presence lists (#4989)Neil Johnson2019-04-031-10/+0
| | | Remove presence list support as per MSC 1819
* Fix sync bug when accepting invites (#4956)Richard van der Hoff2019-04-021-9/+22
| | | | | | | | | | Hopefully this time we really will fix #4422. We need to make sure that the cache on `get_rooms_for_user_with_stream_ordering` is invalidated *before* the SyncHandler is notified for the new events, and we can now do so reliably via the `events` stream.
* Combine the CurrentStateDeltaStream into the EventStreamRichard van der Hoff2019-03-273-23/+33
|
* Make EventStream rows have a typeRichard van der Hoff2019-03-273-16/+94
| | | | ... as a precursor to combining it with the CurrentStateDelta stream.
* Skip building a ROW_TYPE when building updatesRichard van der Hoff2019-03-271-2/+2
| | | | | We're about to turn it straight into a JSON object anyway so building a ROW_TYPE is a bit pointless, and reduces flexibility in the update_function.
* Add parse_row method to replication stream classRichard van der Hoff2019-03-273-3/+19
| | | | This will allow individual stream classes to override how a row is parsed.
* move FederationStream out to its own fileRichard van der Hoff2019-03-274-23/+43
|
* move EventsStream out to its own fileRichard van der Hoff2019-03-273-23/+42
|
* Move replication.tcp.streams into a packageRichard van der Hoff2019-03-272-33/+51
|
* Fix/improve some docstrings in the replication code. (#4949)Richard van der Hoff2019-03-272-7/+19
|
* Fix ClientReplicationStreamProtocol.__str__ (#4929)Richard van der Hoff2019-03-252-4/+5
| | | | | | | | `__str__` depended on `self.addr`, which was absent from ClientReplicationStreamProtocol, so attempting to call str on such an object would raise an exception. We can calculate the peer addr from the transport, so there is no need for addr anyway.
* Fix bug where read-receipts lost their timestamps (#4927)Richard van der Hoff2019-03-252-11/+27
| | | | | Make sure that they are sent correctly over the replication stream. Fixes: #4898
* Add a config option for torture-testing worker replication. (#4902)Richard van der Hoff2019-03-201-1/+17
| | | Setting this to 50 or so makes a bunch of sytests fail in worker mode.
* Prefill client IPs cache on workersErik Johnston2019-03-061-0/+2
|
* Merge pull request #4792 from matrix-org/anoa/replication_tokensAndrew Morgan2019-03-061-3/+28
|\ | | | | Support batch updates in the worker sender
| * Simplify token replication logicAndrew Morgan2019-03-051-23/+14
| |
| * Clean up logic and add commentsAndrew Morgan2019-03-041-11/+18
| |
| * Clearer branching, fix missing list clearAndrew Morgan2019-03-041-4/+11
| |
| * Prevent replication wedgingAndrew Morgan2019-03-041-4/+24
| |
* | Add rate-limiting on registration (#4735)Brendan Abolivier2019-03-051-2/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | * Rate-limiting for registration * Add unit test for registration rate limiting * Add config parameters for rate limiting on auth endpoints * Doc * Fix doc of rate limiting function Co-Authored-By: babolivier <contact@brendanabolivier.com> * Incorporate review * Fix config parsing * Fix linting errors * Set default config for auth rate limiting * Fix tests * Add changelog * Advance reactor instead of mocked clock * Move parameters to registration specific config and give them more sensible default values * Remove unused config options * Don't mock the rate limiter un MAU tests * Rename _register_with_store into register_with_store * Make CI happy * Remove unused import * Update sample config * Fix ratelimiting test for py2 * Add non-guest test
* | Fixup slave storesErik Johnston2019-03-043-36/+26
|/
* When presence is enabled don't send over replicationErik Johnston2019-02-271-2/+5
|
* Merge pull request #4749 from matrix-org/erikj/replication_connection_backoffErik Johnston2019-02-273-5/+39
|\ | | | | Fix tightloop over connecting to replication server
| * Move connecting logic into ClientReplicationStreamProtocolErik Johnston2019-02-272-18/+17
| |
| * Increase the max delay between retry attemptsErik Johnston2019-02-261-1/+1
| | | | | | | | | | Otherwise if you have many workers they can easily take out master with their connection attempts
| * Fix tightloop over connecting to replication serverErik Johnston2019-02-262-4/+39
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If the client failed to process incoming commands during the initial set up of the replication connection it would immediately disconnect and reconnect, resulting in a tightloop. This can happen, for example, when subscribing to a stream that has a row that is too long in the backlog. The fix here is to not consider the connection successfully set up until the client has succesfully subscribed and caught up with the streams. This ensures that the retry logic timers aren't reset until then, meaning that if an error does happen during start up the client will continue backing off before retrying again.
* | Limit cache invalidation replication line length (#4748)Erik Johnston2019-02-271-1/+16
|/
* Fix state cache invalidation on workersErik Johnston2019-02-221-6/+1
|
* Fix registration on workers (#4682)Erik Johnston2019-02-203-3/+58
| | | | | | | | | | * Move RegistrationHandler init to HomeServer * Move post registration actions to RegistrationHandler * Add post regisration replication endpoint * Newsfile
* Batch cache invalidation over replicationErik Johnston2019-02-181-7/+12
| | | | | | | | | | Currently whenever the current state changes in a room invalidate a lot of caches, which cause *a lot* of traffic over replication. Instead, lets batch up all those invalidations and send a single poke down the replication streams. Hopefully this will reduce load on the master process by substantially reducing traffic.
* Move register_device into handlerErik Johnston2019-02-181-14/+3
|
* Split out registration to workerErik Johnston2019-02-183-1/+179
| | | | | | | | This allows registration to be handled by a worker, though the actual write to the database still happens on master. Note: due to the in-memory session map all registration requests must be handled by the same worker.
* Fix replication for room v3 (#4523)Erik Johnston2019-01-301-1/+4
| | | | | | | | | * Fix replication for room v3 We were not correctly quoting the path fragments over http replication, which meant that it exploded when the event IDs had a slash in them * Newsfile
* Fix receiving events from federation via a workerErik Johnston2019-01-291-1/+1
| | | | This bug was introduced in PR #4470, commit 678a92cb56d547dcadffa723e29b4855a27d0901
* Replace missed usages of FrozenEventErik Johnston2019-01-252-4/+12
|
* Revert "Require event format version to parse or create events"Erik Johnston2019-01-252-12/+4
|
* Replace missed usages of FrozenEventErik Johnston2019-01-242-4/+12
|
* Don't truncate command name in metricsErik Johnston2018-10-291-2/+2
|
* Make the replication logger quieter (#4108)Amber Brown2018-10-291-1/+1
|
* Make workers work on Py3 (#4027)Amber Brown2018-10-136-30/+30
|
* Fix minor typo in exceptionTravis Ralston2018-09-131-1/+1
|
* merge (#3576)Amber Brown2018-09-141-7/+16
|
* Remove conn_idErik Johnston2018-09-041-2/+2
|
* Remove conn_id from repl prometheus metricsErik Johnston2018-09-031-10/+10
| | | | | `conn_id` gets set to a random string, and so we end up filling up prometheus with tonnes of data series, which is bad.
* Merge pull request #3713 from matrix-org/erikj/fixup_fed_loggingErik Johnston2018-08-201-1/+1
|\ | | | | Fix logging bug in EDU handling over replication
| * Fix logging bug in EDU handling over replicationErik Johnston2018-08-171-1/+1
| |
* | Logcontexts for replication command handlersRichard van der Hoff2018-08-173-15/+43
|/ | | | | | | | | | Run the handlers for replication commands as background processes. This should improve the visibility in our metrics, and reduce the number of "running db transaction from sentinel context" warnings. Ideally it means converting the things that fire off deferreds into the night into things that actually return a Deferred when they are done. I've made a bit of a stab at this, but it will probably be leaky.
* Use federation handler function rather than duplicateErik Johnston2018-08-151-41/+3
| | | | This involves renaming _persist_events to be a public function.
* Rename slave TransactionStore to SlaveTransactionStoreErik Johnston2018-08-151-1/+1
|
* Move clean_room_for_join to masterErik Johnston2018-08-091-0/+35
|
* Fixup doc commentsErik Johnston2018-08-091-0/+17
|
* Merge branch 'develop' of github.com:matrix-org/synapse into ↵Erik Johnston2018-08-094-17/+63
|\ | | | | | | erikj/split_federation
| * Merge pull request #3632 from matrix-org/erikj/refactor_repl_servletErik Johnston2018-08-093-243/+374
| |\ | | | | | | Add helper base class for generating new replication endpoints
| | * Fixup wording and remove dead codeErik Johnston2018-08-091-2/+1
| | |
| | * Rename POST param to METHODErik Johnston2018-08-082-13/+22
| | |
| | * Fixup logging and docstringsErik Johnston2018-08-082-2/+40
| | |
| * | Basic support for room versioningRichard van der Hoff2018-08-031-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | This is the first tranche of support for room versioning. It includes: * setting the default room version in the config file * new room_version param on the createRoom API * storing the version of newly-created rooms in the m.room.create event * fishing the version of existing rooms out of the m.room.create event
* | | Import all functions from TransactionStoreErik Johnston2018-08-061-11/+2
| | |
* | | Add EDU/query handling over replicationErik Johnston2018-08-061-1/+1
| | |
* | | Add replication APIs for persisting federation eventsErik Johnston2018-08-062-1/+247
| |/ |/|
* | Fix isortErik Johnston2018-08-061-4/+1
| |
* | Merge branch 'develop' of github.com:matrix-org/synapse into ↵Erik Johnston2018-08-031-4/+3
|\| | | | | | | erikj/refactor_repl_servlet
| * Kill off MatrixCodeMessageExceptionRichard van der Hoff2018-08-012-16/+12
| | | | | | | | | | | | | | | | | | | | | | This code brings the SimpleHttpClient into line with the MatrixFederationHttpClient by having it raise HttpResponseExceptions when a request fails (rather than trying to parse for matrix errors and maybe raising MatrixCodeMessageException). Then, whenever we were checking for MatrixCodeMessageException and turning them into SynapseErrors, we now need to check for HttpResponseExceptions and call to_synapse_error.
* | Use new helper base class for membership requestsErik Johnston2018-07-311-171/+91
| |
* | Use new helper base class for ReplicationSendEventRestServletErik Johnston2018-07-311-79/+36
| |
* | Add helper base class for generating new replication endpointsErik Johnston2018-07-311-0/+208
|/ | | | | This will hopefully reduce the boiler plate required to implement new internal HTTP requests.
* Fix unit testsRichard van der Hoff2018-07-251-1/+1
| | | | | | on_notifier_poke no longer runs synchonously, so we have to do a different hack to make sure that the replication data has been sent. Let's actually listen for its arrival.
* Wrap a number of things that run in the backgroundRichard van der Hoff2018-07-251-6/+8
| | | | | This will reduce the number of "Starting db connection from sentinel context" warnings, and will help with our metrics.
* Fix missing attributes on workers.Erik Johnston2018-07-231-2/+5
| | | | | This was missed during the transition from attribute to getter for getting state from context.
* Use stream cache in get_linearized_receipts_for_roomErik Johnston2018-07-101-1/+1
| | | | | This avoids us from uncessarily hitting the database when there has been no change for the room
* run isortAmber Brown2018-07-0924-66/+91
|
* Attempt to be more performant on PyPy (#3462)Amber Brown2018-06-281-6/+10
|
* Merge pull request #3441 from matrix-org/erikj/redo_erasureErik Johnston2018-06-251-0/+2
|\ | | | | Fix user erasure and re-enable
| * Add UserErasureWorkerStore to workersErik Johnston2018-06-251-0/+2
| |
* | Remove all global reactor imports & pass it around explicitly (#3424)Amber Brown2018-06-252-5/+5
|/
* Pass around the reactor explicitly (#3385)Amber Brown2018-06-221-3/+3
|
* Fix tcp protocol metrics naming (#3410)Amber Brown2018-06-211-18/+35
|
* Fix replication metricsRichard van der Hoff2018-06-041-2/+2
| | | | fix bug introduced in #3256
* Merge remote-tracking branch 'origin/develop' into 3218-official-promAmber Brown2018-05-282-8/+9
|\
| * Merge pull request #3244 from NotAFile/py3-six-4Amber Brown2018-05-242-5/+7
| |\ | | | | | | replace some iteritems with six
| | * replace some iteritems with sixAdrian Tschira2018-05-192-5/+7
| | | | | | | | | | | | Signed-off-by: Adrian Tschira <nota@notafile.com>
* | | more cleanupAmber Brown2018-05-222-6/+10
| | |
* | | fix the test failuresAmber Brown2018-05-221-1/+1
| | |
* | | cleanups, self-registrationAmber Brown2018-05-221-4/+5
| | |
* | | Merge remote-tracking branch 'origin/develop' into 3218-official-promAmber Brown2018-05-221-0/+2
|\| |