| Commit message (Collapse) | Author | Age | Files | Lines |
| |
|
|
|
| |
Most of the stuff we do for replication commands can be done synchronously. There's no point spinning up background processes if we're not going to need them.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Handling of incoming typing stream updates from replication was not
hooked up on master, effecting set ups where typing was handled on a
different worker.
This is really only a problem if the master process is also handling
sync requests, which is unlikely for those that are at the stage of
moving typing off.
The other observable effect is that if a worker restarts or a
replication connect drops then the typing worker will issue a
`POSITION typing`, triggering master process to try and stream *all*
typing updates from position 0.
Fixes #7907
|
| |
|
|
|
|
| |
I'm going to be doing more stuff synchronously, and I don't want to lose the
CPU metrics down the sofa.
|
| |
|
|
|
|
|
|
|
|
|
|
|
| |
When we get behind on replication, we tend to stack up background processes
behind a linearizer. Bg processes are heavy (particularly with respect to
prometheus metrics) and linearizers aren't terribly efficient once the queue
gets long either.
A better approach is to maintain a queue of requests to be processed, and
nominate a single process to work its way through the queue.
Fixes: #7444
|
| |
|
| |
|
| |
|
| |
|
| |
|
|
|
| |
The CI appears to use the latest version of isort, which is a problem when isort gets a major version bump. Rather than try to pin the version, I've done the necessary to make isort5 happy with synapse.
|
|
|
| |
The aim here is to make it easier to reason about when streams are limited and when they're not, by moving the logic into the database functions themselves. This should mean we can kill of `db_query_to_update_function` function.
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
* Ensure account data stream IDs are unique.
The account data stream is shared between three tables, and the maximum
allocated ID was tracked in a dedicated table. Updating the max ID
happened outside the transaction that allocated the ID, leading to a
race where if the server was restarted then the same ID could be
allocated but the max ID failed to be updated, leading it to be reused.
The ID generators have support for tracking across multiple tables, so
we may as well use that instead of a dedicated table.
* Fix bug in account data replication stream.
If the same stream ID was used in both global and room account data then
the getting updates for the replication stream would fail due to
`heapq.merge(..)` trying to compare a `str` with a `None`. (This is
because you'd have two rows like `(534, '!room')` and `(534, None)` from
the room and global account data tables).
Fix is just to order by stream ID, since we don't rely on the ordering
beyond that. The bug where stream IDs can be reused should be fixed now,
so this case shouldn't happen going forward.
Fixes #7617
|
| |
|
|
|
| |
Fixes #7566.
|
| |
|
|
|
|
|
|
|
| |
The idea here is that if an instance persists an event via the replication HTTP API it can return before we receive that event over replication, which can lead to races where code assumes that persisting an event immediately updates various caches (e.g. current state of the room).
Most of Synapse doesn't hit such races, so we don't do the waiting automagically, instead we do so where necessary to avoid unnecessary delays. We may decide to change our minds here if it turns out there are a lot of subtle races going on.
People probably want to look at this commit by commit.
|
|
|
|
|
|
| |
Make sure that the AccountDataStream presents complete updates, in the right
order.
This is much the same fix as #7337 and #7358, but applied to a different stream.
|
|
|
| |
This is so that the logic can happen on both master and workers when we move event persistence out.
|
|
|
|
|
| |
Before all streams were only written to from master, so only master needed to respond to `REPLICATE` commands.
Before all instances wrote to the cache invalidation stream, but didn't respond to `REPLICATE`. This was a bug, which could lead to missed rows from cache invalidation stream if an instance is restarted, however all the caches would be empty in that case so it wasn't a problem.
|
|
|
| |
Proactively send out `POSITION` commands (as if we had just received a `REPLICATE`) when we connect to Redis. This is important as other instances won't notice we've connected to issue a `REPLICATE` command (unlike for direct TCP connections). This is only currently an issue if master process reconnects without restarting (if it restarts then it won't have written anything and so other instances probably won't have missed anything).
|
|\
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
* release-v1.13.0:
Don't UPGRADE database rows
RST indenting
Put rollback instructions in upgrade notes
Fix changelog typo
Oh yeah, RST
Absolute URL it is then
Fix upgrade notes link
Provide summary of upgrade issues in changelog. Fix )
Move next version notes from changelog to upgrade notes
Changelog fixes
1.13.0rc1
Documentation on setting up redis (#7446)
Rework UI Auth session validation for registration (#7455)
Fix errors from malformed log line (#7454)
Drop support for redis.dbid (#7450)
|
| | |
|
| |
| |
| | |
Since we only use pubsub, the dbid is irrelevant.
|
| | |
|
|\| |
|
| |\ |
|
| |\ \ |
|
| | | |
| | | |
| | | |
| | | | |
... otherwise we can believe we're up to date when we're not.
|
| | | | |
|
|\ \ \ \
| | |_|/
| |/| | |
|
| | |/
| |/| |
|
|/ /
| |
| |
| | |
looks like we managed to break this during the refactorathon.
|
| |
| |
| |
| |
| | |
We forgot to set the password on the subscriber connection, as well as
not calling super methods for overridden connectionMade/connectionLost
functions.
|
| |
| |
| | |
For in memory streams when fetching updates on workers we need to query the source of the stream, which currently is hard coded to be master. This PR threads through the source instance we received via `POSITION` through to the update function in each stream, which can then be passed to the replication client for in memory streams.
|
|/
|
|
| |
We move the processing of typing and federation replication traffic into their handlers so that `Stream.current_token()` points to a valid token. This allows us to remove `get_streams_to_replicate()` and `stream_positions()`.
|
|
|
| |
Hopefully this is no worse than what we have on master...
|
|
|
|
|
| |
This is primarily for allowing us to send those commands from workers, but for now simply allows us to ignore echoed RDATA/POSITION commands that we sent (we get echoes of sent commands when using redis). Currently we log a WARNING on the master process every time we receive an echoed RDATA.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
For direct TCP connections we need the master to relay REMOTE_SERVER_UP
commands to the other connections so that all instances get notified
about it. The old implementation just relayed to all connections,
assuming that sending back to the original sender of the command was
safe. This is not true for redis, where commands sent get echoed back to
the sender, which was causing master to effectively infinite loop
sending and then re-receiving REMOTE_SERVER_UP commands that it sent.
The fix is to ensure that we only relay to *other* connections and not
to the connection we received the notification from.
Fixes #7334.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
* Factor out functions for injecting events into database
I want to add some more flexibility to the tools for injecting events into the
database, and I don't want to clutter up HomeserverTestCase with them, so let's
factor them out to a new file.
* Rework TestReplicationDataHandler
This wasn't very easy to work with: the mock wrapping was largely superfluous,
and it's useful to be able to inspect the received rows, and clear out the
received list.
* Fix AssertionErrors being thrown by EventsStream
Part of the problem was that there was an off-by-one error in the assertion,
but also the limit logic was too simple. Fix it all up and add some tests.
|
|
|
| |
Currently we never write to streams from workers, but that will change soon
|
|
|
|
|
|
|
|
|
|
| |
Figuring out how to correctly limit updates from this stream without dropping
entries is far more complicated than just counting the number of rows being
returned. We need to consider each query separately and, if any one query hits
the limit, truncate the results from the others.
I think this also fixes some potentially long-standing bugs where events or
state changes could get missed if we hit the limit on either query.
|
| |
|
|
|
|
|
| |
there doesn't seem to be much point in passing this limit all around, since
both sides agree it's meant to be 100.
|
|
|
|
|
|
|
| |
Long story short: if we're handling presence on the current worker, we shouldn't be sending USER_SYNC commands over replication.
In an attempt to figure out what is going on here, I ended up refactoring some bits of the presencehandler code, so the first 4 commits here are non-functional refactors to move this code slightly closer to sanity. (There's still plenty to do here :/). Suggest reviewing individual commits.
Fixes (I hope) #7257.
|
| |
|
|
|
| |
I messed this up last time I tried (#7239 / e13c6c7).
|
|
|
| |
This is configured via the `redis` config options.
|
|
|
|
|
|
| |
Other parts of the code (such as the StreamChangeCache) assume that there will
not be multiple changes with the same stream id.
This code was introduced in #7024, and I hope this fixes #7206.
|
|
|
|
|
|
|
| |
The general idea here is to get rid of the type: ignore annotations on all of the current_token and update_function assignments, which would have caught #7290.
After a bit of experimentation, it seems like the least-awful way to do this is to pass the offending functions in as parameters to the Stream constructor. Unfortunately that means that the concrete implementations no longer have the same constructor signature as Stream itself, which means that it gets hard to correctly annotate STREAMS_MAP.
I've also introduced a couple of new types, to take out some duplication.
|
|
|
|
|
|
| |
Some of the query functions return generators rather than lists, so we can't
index into the result. Happily we already have a copy of the results.
(think this was introduced in #7024)
|
|
|
|
|
| |
`REPLICATE` is now a valid command, and it's nice if you can issue it from the
console without remembering to call it `REPLICATE ` with a trailing space.
|
|
|
|
|
|
| |
Separate `SimpleCommand` from `Command`, so that things which don't want to use
the `data` property don't have to, and thus fix the warnings PyCharm was giving
me about not calling `__init__` in the base class.
|
|
|
|
| |
We've ripped pretty much all of this out: let's remove the remains.
|
|
|
|
| |
Fixes a race between handling `POSITION` and `RDATA` commands. We do this by simply linearizing handling of them.
|
|
|
| |
This completes the merging of server and client command processing.
|
|
|
| |
The aim here is to move the command handling out of the TCP protocol classes and to also merge the client and server command handling (so that we can reuse them for redis protocol). This PR simply moves the client paths to the new `ReplicationCommandHandler`, a future PR will move the server paths too.
|
|
|
|
|
| |
This broke in a recent PR (#7024) and is no longer useful due to all
replication clients implicitly subscribing to all streams, so let's
just remove it.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
* Remove `conn_id` usage for UserSyncCommand.
Each tcp replication connection is assigned a "conn_id", which is used
to give an ID to a remotely connected worker. In a redis world, there
will no longer be a one to one mapping between connection and instance,
so instead we need to replace such usages with an ID generated by the
remote instances and included in the replicaiton commands.
This really only effects UserSyncCommand.
* Add CLEAR_USER_SYNCS command that is sent on shutdown.
This should help with the case where a synchrotron gets restarted
gracefully, rather than rely on 5 minute timeout.
|
|
|
| |
This changes the replication protocol so that the server does not send down `RDATA` for rows that happened before the client connected. Instead, the server will send a `POSITION` and clients then query the database (or master out of band) to get up to date.
|
|
|
|
|
| |
This just helps keep the rows closer to their streams, so that it's easier to
see what the format of each stream is.
|
|
|
|
|
|
| |
`groups` != `receipts`
Introduced in #6964
|
| |
|
|
|
|
|
| |
Instead of sending down batches of user ID/host tuples, send down a row
per entity (user ID or host).
|
| |
|
| |
|
|
|
|
| |
Co-authored-by: Andrew Morgan <1342360+anoadragon453@users.noreply.github.com>
|
|
|
| |
Currently if a worker invalidates a cache it will be streamed to master, which then didn't forward those to other workers.
|
| |
|
|
|
|
|
| |
This will be used to retry outbound transactions to a remote server if
we think it might have come back up.
|
|
|
|
|
|
|
|
|
|
| |
* Port synapse.replication.tcp to async/await
* Newsfile
* Correctly document type of on_<FOO> functions as async
* Don't be overenthusiastic with the asyncing....
|
| |
|
| |
|
| |
|
| |
|
|
|
|
| |
since I found myself wonder how it works
|
|\ |
|
| |
| |
| | |
Replace every instance of `logger.warn` with `logger.warning` as the former is deprecated.
|
|/ |
|
|
|
|
|
| |
Python will return a tuple whether there are parentheses around the returned values or not.
I'm just sick of my editor complaining about this all over the place :)
|
| |
|
| |
|
| |
|
| |
|
| |
|
|
|
|
| |
... as a precursor to combining it with the CurrentStateDelta stream.
|
|
|
|
|
| |
We're about to turn it straight into a JSON object anyway so building a
ROW_TYPE is a bit pointless, and reduces flexibility in the update_function.
|
|
|
|
| |
This will allow individual stream classes to override how a row is parsed.
|
| |
|
| |
|
| |
|
| |
|
|
|
|
|
|
|
|
| |
`__str__` depended on `self.addr`, which was absent from
ClientReplicationStreamProtocol, so attempting to call str on such an object
would raise an exception.
We can calculate the peer addr from the transport, so there is no need for addr
anyway.
|
|
|
|
|
| |
Make sure that they are sent correctly over the replication stream.
Fixes: #4898
|
|
|
| |
Setting this to 50 or so makes a bunch of sytests fail in worker mode.
|
| |
|
| |
|
| |
|
| |
|
|\
| |
| | |
Fix tightloop over connecting to replication server
|
| | |
|
| |
| |
| |
| |
| | |
Otherwise if you have many workers they can easily take out master with
their connection attempts
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
If the client failed to process incoming commands during the initial set
up of the replication connection it would immediately disconnect and
reconnect, resulting in a tightloop.
This can happen, for example, when subscribing to a stream that has a
row that is too long in the backlog.
The fix here is to not consider the connection successfully set up until
the client has succesfully subscribed and caught up with the streams.
This ensures that the retry logic timers aren't reset until then,
meaning that if an error does happen during start up the client will
continue backing off before retrying again.
|
|/ |
|
| |
|
| |
|
| |
|
| |
|
|
|
|
|
| |
`conn_id` gets set to a random string, and so we end up filling up
prometheus with tonnes of data series, which is bad.
|
|
|
|
|
|
|
|
|
|
| |
Run the handlers for replication commands as background processes. This should
improve the visibility in our metrics, and reduce the number of "running db
transaction from sentinel context" warnings.
Ideally it means converting the things that fire off deferreds into the night
into things that actually return a Deferred when they are done. I've made a bit
of a stab at this, but it will probably be leaky.
|
|
|
|
|
|
| |
on_notifier_poke no longer runs synchonously, so we have to do a different hack
to make sure that the replication data has been sent. Let's actually listen for
its arrival.
|
|
|
|
|
| |
This will reduce the number of "Starting db connection from sentinel context"
warnings, and will help with our metrics.
|
| |
|
| |
|
| |
|
| |
|
|
|
|
| |
fix bug introduced in #3256
|
|\ |
|
| |\
| | |
| | | |
replace some iteritems with six
|
| | |
| | |
| | |
| | | |
Signed-off-by: Adrian Tschira <nota@notafile.com>
|
| | | |
|
| | | |
|
| | | |
|
|\| | |
|
| |/
| |
| |
| |
| | |
When a user first syncs, we will send them a server notice asking them to
consent to the privacy policy if they have not already done so.
|
| | |
|
|/ |
|
|
|
|
| |
Signed-off-by: Adrian Tschira <nota@notafile.com>
|
|
|
|
| |
json encoders have an encode method, not a dumps method.
|
|
|
|
|
| |
using json.dumps with custom options requires us to create a new JSONEncoder on
each call. It's more efficient to create one upfront and reuse it.
|
| |
|
|
|
|
|
| |
Turns out that simplejson serialises namedtuple's as dictionaries rather
than tuples by default.
|
|\ |
|
| | |
|
|/
|
|
| |
I found myself wishing we had this.
|
|
|
|
|
| |
The @measure_func annotations rely on the wrapped function respecting the
logcontext rules. Add the necessary yields to make this work.
|
|
|
|
| |
what could possibly go wrong
|
| |
|
| |
|
| |
|
| |
|
| |
|
| |
|
| |
|
| |
|
|\
| |
| | |
Fix up federation SendQueue and document types
|
| | |
|
|/ |
|
|\
| |
| | |
Don't double encode replication data
|
| | |
|
| | |
|
| | |
|
| | |
|
| | |
|
|/ |
|
|\
| |
| | |
Move to using TCP replication
|
| | |
|
|\ \
| | |
| | | |
Advance replication streams even if nothing is listening
|
| |/
| |
| |
| |
| |
| | |
Otherwise the streams don't advance and steadily fall behind, so when a
worker does connect either a) they'll be streamed lots of old updates or
b) the connection will fail as the streams are too far behind.
|
|/ |
|
| |
|
| |
|
|
|
|
| |
This timestamp is used to indicate when the user last sync'd
|
| |
|
| |
|
|
|
|
| |
This defines the low level TCP replication protocol
|
|
|