summary refs log tree commit diff
diff options
context:
space:
mode:
authorOlivier Wilkinson (reivilibre) <olivier@librepush.net>2019-08-20 15:35:07 +0100
committerOlivier Wilkinson (reivilibre) <olivier@librepush.net>2019-08-20 15:35:20 +0100
commit6a19f7e1010e409752507fd29b23af5b795a0440 (patch)
tree2db23133206375c9d5e202fd78c1111c2c178e52
parentUpdate synapse/storage/stats.py (diff)
downloadsynapse-6a19f7e1010e409752507fd29b23af5b795a0440.tar.xz
Add room and user statistics documentation.
Signed-off-by: Olivier Wilkinson (reivilibre) <olivier@librepush.net>
-rw-r--r--docs/room_and_user_statistics.md146
1 files changed, 146 insertions, 0 deletions
diff --git a/docs/room_and_user_statistics.md b/docs/room_and_user_statistics.md
new file mode 100644
index 0000000000..eb003f771a
--- /dev/null
+++ b/docs/room_and_user_statistics.md
@@ -0,0 +1,146 @@
+Room and User Statistics
+========================
+
+Synapse maintains room and user statistics (as well as a cache of room state),
+in various tables.
+
+These can be used for administrative purposes but are also used when generating
+the public room directory. If these tables get stale or out of sync (possibly
+after database corruption), you may wish to regenerate them.
+
+
+# Synapse Administrator Documentation
+
+## Various SQL scripts that you may find useful
+
+### Delete stats, including historical stats
+
+```sql
+DELETE FROM room_stats_current;
+DELETE FROM room_stats_historical;
+DELETE FROM user_stats_current;
+DELETE FROM user_stats_historical;
+```
+
+### Regenerate stats (all subjects)
+
+```sql
+BEGIN;
+    DELETE FROM stats_incremental_position;
+    INSERT INTO stats_incremental_position (
+        state_delta_stream_id,
+        total_events_min_stream_ordering,
+        total_events_max_stream_ordering,
+        is_background_contract
+    ) VALUES (NULL, NULL, NULL, FALSE), (NULL, NULL, NULL, TRUE);
+COMMIT;
+
+DELETE FROM room_stats_current;
+DELETE FROM user_stats_current;
+```
+
+then follow the steps below for **'Regenerate stats (missing subjects only)'**
+
+### Regenerate stats (missing subjects only)
+
+```sql
+-- Set up staging tables
+-- we depend on current_state_events_membership because this is used
+-- in our counting.
+INSERT INTO background_updates (update_name, progress_json) VALUES
+    ('populate_stats_prepare', '{}', 'current_state_events_membership');
+
+-- Run through each room and update stats
+INSERT INTO background_updates (update_name, progress_json, depends_on) VALUES
+    ('populate_stats_process_rooms', '{}', 'populate_stats_prepare');
+
+-- Run through each user and update stats.
+INSERT INTO background_updates (update_name, progress_json, depends_on) VALUES
+    ('populate_stats_process_users', '{}', 'populate_stats_process_rooms');
+
+-- Clean up staging tables
+INSERT INTO background_updates (update_name, progress_json, depends_on) VALUES
+    ('populate_stats_cleanup', '{}', 'populate_stats_process_users');
+```
+
+then **restart Synapse**.
+
+
+# Synapse Developer Documentation
+
+## High-Level Concepts
+
+### Definitions
+
+* **subject**: Something we are tracking stats about – currently a room or user.
+* **current row**: An entry for a subject in the appropriate current statistics
+    table. Each subject can have only one.
+* **historical row**: An entry for a subject in the appropriate historical
+    statistics table. Each subject can have any number of these.
+
+### Overview
+
+Stats are maintained as time series. There are two kinds of column:
+
+* absolute columns – where the value is correct for the time given by `end_ts`
+    in the stats row. (Imagine a line graph for these values)
+* per-slice columns – where the value corresponds to how many of the occurrences
+    occurred within the time slice given by `(end_ts − bucket_size)…end_ts`
+    or `start_ts…end_ts`. (Imagine a histogram for these values)
+
+Currently, only absolute columns are in use.
+
+Stats are maintained in two tables (for each type): current and historical.
+
+Current stats correspond to the present values. Each subject can only have one
+entry.
+
+Historical stats correspond to values in the past. Subjects may have multiple
+entries.
+
+## Concepts around the management of stats
+
+### current rows
+
+#### dirty current rows
+
+Current rows can be **dirty**, which means that they have changed since the
+latest historical row for the same subject.
+**Dirty** current rows possess an end timestamp, `end_ts`.
+
+#### old current rows and old collection
+
+When a (necessarily dirty) current row has an `end_ts` in the past, it is said
+to be **old**.
+Old current rows must be copied into a historical row, and cleared of their dirty
+status, before further statistics can be tracked for that subject.
+The process which does this is referred to as **old collection**.
+
+#### incomplete current rows
+
+There are also **incomplete** current rows, which are current rows that do not
+contain a full count yet – this is because they are waiting for the stats
+regenerator to give them an initial count. Incomplete current rows DO NOT contain
+correct and up-to-date values. As such, *incomplete rows are not old-collected*.
+Instead, old incomplete rows will be extended so they are no longer old.
+
+### historical rows
+
+Historical rows can always be considered to be valid for the time slice and
+end time specified. (This, of course, assumes a lack of defects in the code
+to track the statistics, and assumes integrity of the database).
+
+Even still, there are two considerations that we may need to bear in mind:
+
+* historical rows will not exist for every time slice – they will be omitted
+    if there were no changes. In this case, the following assumptions can be
+    made to interpolate/recreate missing rows:
+    - absolute fields have the same values as in the preceding row
+    - per-slice fields are zero (`0`)
+* historical rows will not be retained forever – rows older than a configurable
+    time will be purged.
+
+#### purge
+
+The purging of historical rows is not yet implemented.
+