| Commit message (Collapse) | Author | Age | Files | Lines |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
to catch up. (#15120)Co-authored-by: Sean Quah <8349537+squahtx@users.noreply.github.com>
* Improve logging messages for the 'wait for repl stream' read-after-write consistency feature
* Newsfile
Signed-off-by: Olivier Wilkinson (reivilibre) <oliverw@matrix.org>
* Update synapse/replication/tcp/client.py
Co-authored-by: Sean Quah <8349537+squahtx@users.noreply.github.com>
---------
Signed-off-by: Olivier Wilkinson (reivilibre) <oliverw@matrix.org>
Co-authored-by: Sean Quah <8349537+squahtx@users.noreply.github.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
(#14870)
* Allow `AbstractSet` in `StrCollection`
Or else frozensets are excluded. This will be useful in an upcoming
commit where I plan to change a function that accepts `List[str]` to
accept `StrCollection` instead.
* `rooms_to_exclude` -> `rooms_to_exclude_globally`
I am about to make use of this exclusion mechanism to exclude rooms for
a specific user and a specific sync. This rename helps to clarify the
distinction between the global config and the rooms to exclude for a
specific sync.
* Better function names for internal sync methods
* Track a list of excluded rooms on SyncResultBuilder
I plan to feed a list of partially stated rooms for this sync to ignore
* Exclude partial state rooms during eager sync
using the mechanism established in the previous commit
* Track un-partial-state stream in sync tokens
So that we can work out which rooms have become fully-stated during a
given sync period.
* Fix mutation of `@cached` return value
This was fouling up a complement test added alongside this PR.
Excluding a room would mean the set of forgotten rooms in the cache
would be extended. This means that room could be erroneously considered
forgotten in the future.
Introduced in #12310, Synapse 1.57.0. I don't think this had any
user-visible side effects (until now).
* SyncResultBuilder: track rooms to force as newly joined
Similar plan as before. We've omitted rooms from certain sync responses;
now we establish the mechanism to reintroduce them into future syncs.
* Read new field, to present rooms as newly joined
* Force un-partial-stated rooms to be newly-joined
for eager incremental syncs only, provided they're still fully stated
* Notify user stream listeners to wake up long polling syncs
* Changelog
* Typo fix
Co-authored-by: Sean Quah <8349537+squahtx@users.noreply.github.com>
* Unnecessary list cast
Co-authored-by: Sean Quah <8349537+squahtx@users.noreply.github.com>
* Rephrase comment
Co-authored-by: Sean Quah <8349537+squahtx@users.noreply.github.com>
* Another comment
Co-authored-by: Sean Quah <8349537+squahtx@users.noreply.github.com>
* Fixup merge(?)
* Poke notifier when receiving un-partial-stated msg over replication
* Fixup merge whoops
Thanks MV :)
Co-authored-by: Mathieu Velen <mathieuv@matrix.org>
Co-authored-by: Mathieu Velten <mathieuv@matrix.org>
Co-authored-by: Sean Quah <8349537+squahtx@users.noreply.github.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
finishing join (#14874)
* Faster joins: Update room stats and user directory on workers when done
When finishing a partial state join to a room, we update the current
state of the room without persisting additional events. Workers receive
notice of the current state update over replication, but neglect to wake
the room stats and user directory updaters, which then get incidentally
triggered the next time an event is persisted or an unrelated event
persister sends out a stream position update.
We wake the room stats and user directory updaters at the appropriate
time in this commit.
Part of #12814 and #12815.
Signed-off-by: Sean Quah <seanq@matrix.org>
* fixup comment
Signed-off-by: Sean Quah <seanq@matrix.org>
|
|
|
|
|
|
| |
Now that we wait for stream positions whenever we do a HTTP replication
hit, we need to be less brutal in the case where we do timeout (as we
have bugs around this).
|
|
|
|
|
|
|
| |
This caused some requests to fail.
This caused some requests to fail.
This really only started causing issues due to #14856
|
|
|
|
| |
This should hopefully mitigate a class of races where data gets out of
sync due a HTTP replication request racing with the replication streams.
|
|
|
|
|
| |
We were incorrectly checking if the *local* token had been advanced, rather than the token for the remote instance.
In practice, I don't think this has caused any bugs due to where we use `wait_for_stream_position`, as critically we don't use it on instances that also write to the given streams (and so the local token will lag behind all remote tokens).
|
| |
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This creates a new store method, `process_replication_position` that
is called after `process_replication_rows`. By moving stream ID advances
here this guarantees any relevant cache invalidations will have been
applied before the stream is advanced.
This avoids race conditions where Python switches between threads mid
way through processing the `process_replication_rows` method where stream
IDs may be advanced before caches are invalidated due to class resolution
ordering.
See this comment/issue for further discussion:
https://github.com/matrix-org/synapse/issues/14158#issuecomment-1344048703
|
|
|
|
| |
receiving un-partial-stated event notifications over replication. [rei:frrj/streams/unpsr] (#14546)
|
|
|
|
| |
un-partial-stating of that room is received over the replication stream. [rei:frrj/streams/unpsr] (#14474)
|
| |
|
|
|
|
| |
Updates the `/receipts` endpoint and receipt EDU handler to parse a
`thread_id` from the body and insert it in the database.
|
|
|
| |
Partial implementation of MSC3881
|
|
|
| |
Signed-off-by: Šimon Brandner <simon.bra.ag@gmail.com>
|
|
|
|
|
| |
This adds support for the stable identifiers of MSC2285 while
continuing to support the unstable identifiers behind the configuration
flag. These will be removed in a future version.
|
| |
|
|
|
| |
This reverts commit 5d4028f217f178fcd384d5bfddd92225b4e78c51.
|
|
|
|
|
| |
More prep work for asyncronous caching, also makes all process_replication_rows methods consistent (presence handler already is so).
Signed off by Nick @ Beeper (@Fizzadar)
|
|
|
|
| |
The replication logic for groups is no longer used, so the message
passing infrastructure can be removed.
|
| |
|
| |
|
|
|
|
|
|
|
|
|
|
|
| |
Refactor and convert `Linearizer` to async. This makes a `Linearizer`
cancellation bug easier to fix.
Also refactor to use an async context manager, which eliminates an
unlikely footgun where code that doesn't immediately use the context
manager could forget to release the lock.
Signed-off-by: Sean Quah <seanq@element.io>
|
|
|
|
|
|
| |
Since the object it returns is a ReplicationCommandHandler.
This is clean-up from adding support to Redis where the command handler
was added as an additional layer of abstraction from the TCP protocol.
|
| |
|
|
|
|
|
|
|
| |
The presence of this method was confusing, and mostly present for backwards
compatibility. Let's get rid of it.
Part of #11733
|
| |
|
|
|
| |
Co-authored-by: Andrew Morgan <1342360+anoadragon453@users.noreply.github.com>
|
|
|
|
|
|
| |
This removes the magic allowing accessing configurable
variables directly from the config object. It is now required
that a specific configuration class is used (e.g. `config.foo`
must be replaced with `config.server.foo`).
|
|
|
|
| |
Instead of proxying through the magic getter of the RootConfig
object. This should be more performant (and is more explicit).
|
|
|
| |
Mostly this involves decorating a few Deferred declarations with extra type hints. We wrap the types in quotes to avoid runtime errors when running against older versions of Twisted that don't have generics on Deferred.
|
|
|
| |
Implementation of matrix-org/matrix-doc#2285
|
| |
|
|
|
|
|
| |
Hopefully this will help us track down where to-device messages are getting
lost/delayed.
|
| |
|
|
|
| |
Co-authored-by: Richard van der Hoff <1389908+richvdh@users.noreply.github.com>
|
|
|
|
|
|
|
| |
Part of #9744
Removes all redundant `# -*- coding: utf-8 -*-` lines from files, as python 3 automatically reads source code as utf-8 now.
`Signed-off-by: Jonathan de Jong <jonathan@automatia.nl>`
|
| |
|
|
|
|
|
| |
I was trying to make it so that we didn't have to start a background task when handling RDATA, but that is a bigger job (due to all the code in `generic_worker`). However I still think not pulling the event from the DB may help reduce some DB usage due to replication, even if most workers will simply go and pull that event from the DB later anyway.
Co-authored-by: Patrick Cloke <clokep@users.noreply.github.com>
|
|
|
|
|
| |
Currently background proccesses stream the events stream use the "minimum persisted position" (i.e. `get_current_token()`) rather than the vector clock style tokens. This is broadly fine as it doesn't matter if the background processes lag a small amount. However, in extreme cases (i.e. SyTests) where we only write to one event persister the background processes will never make progress.
This PR changes it so that the `MultiWriterIDGenerator` keeps the current position of a given instance as up to date as possible (i.e using the latest token it sees if its not in the process of persisting anything), and then periodically announces that over replication. This then allows the "minimum persisted position" to advance, albeit with a small lag.
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The idea is to remove some of the places we pass around `int`, where it can represent one of two things:
1. the position of an event in the stream; or
2. a token that partitions the stream, used as part of the stream tokens.
The valid operations are then:
1. did a position happen before or after a token;
2. get all events that happened before or after a token; and
3. get all events between two tokens.
(Note that we don't want to allow other operations as we want to change the tokens to be vector clocks rather than simple ints)
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The idea here is that we pass the `max_stream_id` to everything, and only use the stream ID of the particular event to figure out *when* the max stream position has caught up to the event and we can notify people about it.
This is to maintain the distinction between the position of an item in the stream (i.e. event A has stream ID 513) and a token that can be used to partition the stream (i.e. give me all events after stream ID 352). This distinction becomes important when the tokens are more complicated than a single number, which they will be once we start tracking the position of multiple writers in the tokens.
The valid operations here are:
1. Is a position before or after a token
2. Fetching all events between two tokens
3. Merging multiple tokens to get the "max", i.e. `C = max(A, B)` means that for all positions P where P is before A *or* before B, then P is before C.
Future PR will change the token type to a dedicated type.
|
|
|
|
|
| |
`pusher_pool.on_new_notifications` expected a min and max stream ID, however that was not what we were passing in. Instead, let's just pass it the current max stream ID and have it track the last stream ID it got passed.
I believe that it mostly worked as we called the function for every event. However, it would break for events that got persisted out of order, i.e, that were persisted but the max stream ID wasn't incremented as not all preceding events had finished persisting, and push for that event would be delayed until another event got pushed to the effected users.
|
|
|
|
| |
This reverts commit e7fd336a53a4ca489cdafc389b494d5477019dc0.
|
| |
|
|
|
|
|
|
| |
This fixes a bug where having multiple callers waiting on the same
stream and position will cause it to try and compare two deferreds,
which fails (due to the sorted list having an entry of `Tuple[int,
Deferred]`).
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Handling of incoming typing stream updates from replication was not
hooked up on master, effecting set ups where typing was handled on a
different worker.
This is really only a problem if the master process is also handling
sync requests, which is unlikely for those that are at the stage of
moving typing off.
The other observable effect is that if a worker restarts or a
replication connect drops then the typing worker will issue a
`POSITION typing`, triggering master process to try and stream *all*
typing updates from position 0.
Fixes #7907
|
|
|
| |
The CI appears to use the latest version of isort, which is a problem when isort gets a major version bump. Rather than try to pin the version, I've done the necessary to make isort5 happy with synapse.
|
| |
|
|
|
|
|
|
|
| |
The idea here is that if an instance persists an event via the replication HTTP API it can return before we receive that event over replication, which can lead to races where code assumes that persisting an event immediately updates various caches (e.g. current state of the room).
Most of Synapse doesn't hit such races, so we don't do the waiting automagically, instead we do so where necessary to avoid unnecessary delays. We may decide to change our minds here if it turns out there are a lot of subtle races going on.
People probably want to look at this commit by commit.
|
|
|
| |
This is so that the logic can happen on both master and workers when we move event persistence out.
|
| |
|
|
|
| |
For in memory streams when fetching updates on workers we need to query the source of the stream, which currently is hard coded to be master. This PR threads through the source instance we received via `POSITION` through to the update function in each stream, which can then be passed to the replication client for in memory streams.
|
|
|
|
| |
We move the processing of typing and federation replication traffic into their handlers so that `Stream.current_token()` points to a valid token. This allows us to remove `get_streams_to_replicate()` and `stream_positions()`.
|
|
|
| |
This is configured via the `redis` config options.
|
|
|
| |
The aim here is to move the command handling out of the TCP protocol classes and to also merge the client and server command handling (so that we can reuse them for redis protocol). This PR simply moves the client paths to the new `ReplicationCommandHandler`, a future PR will move the server paths too.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
* Remove `conn_id` usage for UserSyncCommand.
Each tcp replication connection is assigned a "conn_id", which is used
to give an ID to a remotely connected worker. In a redis world, there
will no longer be a one to one mapping between connection and instance,
so instead we need to replace such usages with an ID generated by the
remote instances and included in the replicaiton commands.
This really only effects UserSyncCommand.
* Add CLEAR_USER_SYNCS command that is sent on shutdown.
This should help with the case where a synchrotron gets restarted
gracefully, rather than rely on 5 minute timeout.
|
|
|
| |
This changes the replication protocol so that the server does not send down `RDATA` for rows that happened before the client connected. Instead, the server will send a `POSITION` and clients then query the database (or master out of band) to get up to date.
|
|
|
|
| |
Co-authored-by: Andrew Morgan <1342360+anoadragon453@users.noreply.github.com>
|
|
|
|
|
| |
This will be used to retry outbound transactions to a remote server if
we think it might have come back up.
|
|
|
|
|
|
|
|
|
|
| |
* Port synapse.replication.tcp to async/await
* Newsfile
* Correctly document type of on_<FOO> functions as async
* Don't be overenthusiastic with the asyncing....
|
| |
|
| |
|
|
|
|
| |
since I found myself wonder how it works
|
|
|
| |
Replace every instance of `logger.warn` with `logger.warning` as the former is deprecated.
|
| |
|
|
|
|
| |
This will allow individual stream classes to override how a row is parsed.
|
| |
|
| |
|
|
|
|
|
| |
Otherwise if you have many workers they can easily take out master with
their connection attempts
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
If the client failed to process incoming commands during the initial set
up of the replication connection it would immediately disconnect and
reconnect, resulting in a tightloop.
This can happen, for example, when subscribing to a stream that has a
row that is too long in the backlog.
The fix here is to not consider the connection successfully set up until
the client has succesfully subscribed and caught up with the streams.
This ensures that the retry logic timers aren't reset until then,
meaning that if an error does happen during start up the client will
continue backing off before retrying again.
|
| |
|
|
|
|
|
|
|
|
|
|
| |
Run the handlers for replication commands as background processes. This should
improve the visibility in our metrics, and reduce the number of "running db
transaction from sentinel context" warnings.
Ideally it means converting the things that fire off deferreds into the night
into things that actually return a Deferred when they are done. I've made a bit
of a stab at this, but it will probably be leaky.
|
|
|
|
|
|
| |
on_notifier_poke no longer runs synchonously, so we have to do a different hack
to make sure that the replication data has been sent. Let's actually listen for
its arrival.
|
| |
|
| |
|
| |
|
| |
|
|
|