Tidy up documentation a bit

author: Azrenbeth <7782548+Azrenbeth@users.noreply.github.com> 2021-09-28 13:50:57 +0100
committer: Azrenbeth <7782548+Azrenbeth@users.noreply.github.com> 2021-09-28 13:50:57 +0100
commit: d6b511e669eeefeb1dfa634c20bb6f1e552a9931 (patch)
tree: 27aa217bb202066c75c08e03c1cd07e085796a82
parent: Better search for state database (diff)
download: synapse-d6b511e669eeefeb1dfa634c20bb6f1e552a9931.tar.xz
2 files changed, 12 insertions, 120 deletions
diff --git a/docs/state_compressor.md b/docs/state_compressor.md
index 97265bfa93..56f21a03cd 100644
--- a/docs/state_compressor.md
+++ b/docs/state_compressor.md
@@ -1,135 +1,27 @@
-TODO: Update with final contents of README after PR #70 merged in rust-synapse-compress-state repo
-
 # State compressor
 
 The state compressor is an **experimental** tool that attempts to reduce the number of rows 
-in the `state_groups_state` table inside of a postgres database.
-
-## Introduction to the state tables and compression
-### What is state?
-State is things like who is in a room, what the room topic/name is, who has
-what privilege levels etc. Synapse keeps track of it so that it can spot invalid
-events (e.g. ones sent by banned users, or by people with insufficient privilege).
-
-### What is a state group?
-
-Synapse needs to keep track of the state at the moment of each event. A state group
-corresponds to a unique state. The database table `event_to_state_groups` keeps track
-of the mapping from event ids to state group ids.
-
-Consider the following simplified example:
-```
-State group id   |          State
-_____________________________________________
-       1         |      Alice in room
-       2         | Alice in room, Bob in room
-       3         |        Bob in room
-
-
-Event id |     What the event was
-______________________________________
-    1    |    Alice sends a message
-    3    |     Bob joins the room
-    4    |     Bob sends a message
-    5    |    Alice leaves the room
-    6    |     Bob sends a message
-
-
-Event id | State group id
-_________________________
-    1    |       1
-    2    |       1
-    3    |       2
-    4    |       2
-    5    |       3
-    6    |       3
-```
-### What are deltas and predecessors?
-When a new state event happens (e.g. Bob joins the room) a new state group is created.
-BUT instead of copying all of the state from the previous state group, we just store
-the change from the previous group (saving on lots of storage space!). The difference
-from the previous state group is called the "delta"
-
-So for the previous example we would have the following (Note only rows 1 and 2 will
-make sense at this point):
-
-```
-State group id | Previous state group id |      Delta
-____________________________________________________________
-       1       |          NONE           |   Alice in room
-       2       |           1             |    Bob in room
-       3       |          NONE           |    Bob in room
-```
-So why is state group 3's previous state group NONE and not 2? Well the way that deltas 
-work in synapse is that they can only add in new state or overwrite old state, but they
-cannot remove it. (So if the room topic is changed then that is just overwriting state,
-but removing alice from the room is neither an addition or an overwriting). If it is 
-impossible to find a delta, then you just start from scratch again with a "snapshot" of
-the entire state. 
-
-(NOTE this is not documentation on how synapse handles leaving rooms but is purely for illustrative
-purposes)
-
-The state of a state group is worked out by following the previous state group's and adding
-together all of the deltas (with the most recent taking precedence).
-
-The mapping from state group to previous state group takes place in `state_group_edges` 
-and the deltas are stored in `state_groups_state`
-
-### What are we compressing then?
-In order to speed up the converstion from state group id to state, there is a limit of 100 
-hops set by synapse (that is: we will only ever have to lookup the deltas for a maximum of 
-100 state groups). It does this by taking another "snapshot" every 100 state groups.
-
-However, it is these snapshots that take up the bulk of the storage in a synapse database,
-so we want to find a way to reduce the number of them without dramatically increasing the 
-maximum number of hops needed to do lookups.
-
-
-## Compression Algorithm
-
-The algorithm works by attempting to create a *tree* of deltas, produced by
-appending state groups to different "levels". Each level has a maximum size, where
-each state group is appended to the lowest level that is not full. This tool calls a 
-state group "compressed" once it has been added to
-one of these levels.
-
-This produces a graph that looks approximately like the following, in the case
-of having two levels with the bottom level (L1) having a maximum size of 3:
-
-```
-L2 <-------------------- L2 <---------- ...
-^--- L1 <--- L1 <--- L1  ^--- L1 <--- L1 <--- L1
-
-NOTE: A <--- B means that state group B's predecessor is A
-```
-The structure that synapse creates by default would be equivalent to having one level with
-a maximum length of 100. 
-
-**Note**: Increasing the sum of the sizes of levels will increase the time it
-takes to query the full state of a given state group.
+in the `state_groups_state` table inside of a postgres database. Documentation on how it works
+can be found on [its github repository](https://github.com/matrix-org/rust-synapse-compress-state).
 
 ## Enabling the state compressor
 
 The state compressor requires the python library for the `auto_compressor` tool to be 
-installed. Instructions for this can be found in the `README.md` file
-in the <a href=https://github.com/matrix-org/rust-synapse-compress-state>source repo</a> . 
+installed. Instructions for this can be found in [the `python.md` file in the source
+repo](https://github.com/matrix-org/rust-synapse-compress-state/blob/main/docs/python.md).
 
 The following configuration options are provided:
 
 - `chunk_size`  
-The rough number of state groups to work on at once. All of the entries from 
+The number of state groups to work on at once. All of the entries from 
 `state_groups_state` are requested from the database for state groups that are 
 worked on. Therefore small chunk sizes may be needed on machines with low memory. 
 Note: if the compressor fails to find space savings on the chunk as a whole 
 (which may well happen in rooms with lots of backfill in) then the entire chunk 
-is skipped. This defaults to 500  
+is skipped. This defaults to 500 
   
-
-- `number_of_rooms`  
-The compressor will identify the rooms with the most uncompressed state and run on
-this many of them. This defaults to 5
-
+- `number_of_chunks`  
+The compressor will stop once it has finished compressing this many chunks. Defaults to 100
 
 - `default_levels`  
 Sizes of each new level in the compression algorithm, as a comma separated list.
@@ -140,7 +32,6 @@ the levels effect the performance of fetching the state from the database, as th
 sum of the sizes is the upper bound on number of iterations needed to fetch a
 given set of state. This defaults to "100,50,25"
 
-
 - `time_between_runs`
 This controls how often the state compressor is run. This defaults to once every
 day.
@@ -150,7 +41,7 @@ An example configuration:
 state_compressor:
     enabled: true
     chunk_size: 500
-    number_of_rooms: 5
+    number_of_chunks: 5
     default_levels: 100,50,25
     time_between_runs: 1d
 ```
\ No newline at end of file
diff --git a/synapse/config/state_compressor.py b/synapse/config/state_compressor.py
index 40390fbf52..92a0b7e533 100644
--- a/synapse/config/state_compressor.py
+++ b/synapse/config/state_compressor.py
@@ -36,7 +36,7 @@ class StateCompressorConfig(Config):
             raise ConfigError from e
 
         self.compressor_chunk_size = compressor_config.get("chunk_size") or 500
-        self.compressor_number_of_chunks = compressor_config.get("number_of_chunks") or 50
+        self.compressor_number_of_chunks = compressor_config.get("number_of_chunks") or 100
         self.compressor_default_levels = (
             compressor_config.get("default_levels") or "100,50,25"
         )
@@ -67,7 +67,7 @@ class StateCompressorConfig(Config):
           #
           #chunk_size: 1000
 
-          # The number of chunks to compress on each run. Defaults to 50.
+          # The number of chunks to compress on each run. Defaults to 100.
           #
           #number_of_chunks: 1
 
@@ -87,6 +87,7 @@ _STATE_COMPRESSOR_SCHEMA = {
     "properties": {
         "enabled": {"type": "boolean"},
         "chunk_size": {"type": "number"},
+        "number_of_chunks": {"type": "number"},
         "default_levels": {"type": "string"},
         "time_between_runs": {"type": "string"},
     },
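
Reviewer sketch (not part of the patch): the snippet below mirrors the `or`-style fallbacks visible in the `StateCompressorConfig` hunk above, to show which values take effect after this commit. The function names `resolve_compressor_settings` and `max_lookup_hops` are hypothetical helpers invented for illustration; they do not exist in Synapse. The hop bound follows the statement in the docs that the sum of the level sizes is the upper bound on the number of iterations needed to fetch a given set of state.

```python
from typing import Any, Dict


def resolve_compressor_settings(compressor_config: Dict[str, Any]) -> Dict[str, Any]:
    """Apply the same `value or default` fallbacks as the patched config class.

    Hypothetical helper for illustration only. Because `or` is used, any
    falsy value (missing key, None, 0, "") falls back to the default.
    """
    return {
        "chunk_size": compressor_config.get("chunk_size") or 500,
        # Default raised from 50 to 100 by this commit.
        "number_of_chunks": compressor_config.get("number_of_chunks") or 100,
        "default_levels": compressor_config.get("default_levels") or "100,50,25",
    }


def max_lookup_hops(default_levels: str) -> int:
    """Upper bound on lookups: the sum of the comma separated level sizes."""
    return sum(int(size) for size in default_levels.split(","))


# The example configuration from docs/state_compressor.md.
settings = resolve_compressor_settings(
    {"enabled": True, "chunk_size": 500, "number_of_chunks": 5}
)
hops = max_lookup_hops(settings["default_levels"])
```

One consequence of the `or` idiom worth noting in review: an explicit `number_of_chunks: 0` is treated the same as an unset option and resolves to 100.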