From d5704cf2a3c6e8d27a6f70bca0db499e04ce6eb9 Mon Sep 17 00:00:00 2001
From: Kegan Dougal <kegan@matrix.org>
Date: Tue, 9 Sep 2014 14:53:35 -0700
Subject: Added initial draft for human-readable ID rules.

---
 docs/human-id-rules.rst | 71 +++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 71 insertions(+)
 create mode 100644 docs/human-id-rules.rst

diff --git a/docs/human-id-rules.rst b/docs/human-id-rules.rst
new file mode 100644
index 0000000000..36987ddd0d
--- /dev/null
+++ b/docs/human-id-rules.rst
@@ -0,0 +1,71 @@
+This document outlines the format for human-readable IDs within matrix.
+
+Overview
+--------
+UTF-8 is quickly becoming the standard character encoding set on the web. As
+such, Matrix requires that all strings MUST be encoded as UTF-8. However,
+using Unicode as the character set for human-readable IDs is troublesome. There
+are many different characters which appear identical to each other, but would
+identify different users. In addition, there are non-printable characters which
+cannot be rendered the the end-user. This opens up a security vulnerability with
+phishing/spoofing of IDs, commonly known as a homograph attack.
+
+Web browsers encountered this problem when International Domain Names were
+introduced. A variety of checks were put in place in order to protect users. If
+an address failed the check, the raw punycode would be displayed to disambiguate
+the address. Similar checks are performed by home servers in Matrix, which will
+then warn the client about the potentially misleading ID. However, Matrix does
+not use punycode, and so does not show raw punycode on a failed check. Instead,
+home servers must outright reject these misleading IDs.
+
+Types of human-readable IDs
+---------------------------
+There are two main human-readable IDs in question:
+
+ - Room aliases
+ - User IDs
+ 
+Room aliases look like ``#localpart:domain``. These aliases point to opaque
+non human-readable room IDs. These pointers can change, so there is already an
+issue present with the same ID pointing to a different destination at a later
+date.
+
+User IDs look like ``@localpart:domain``. These represent actual end-users, and
+unlike room aliases, there is no layer of indirection. This presents a much
+greater concern with homograph attacks. 
+
+Checks
+------
+- Similar to web browsers.
+- blacklisted chars (e.g. non-printable characters)
+- mix of language sets from 'preferred' language not allowed. 
+- Language sets from CLDR dataset.
+- Treated in segments (localpart, domain)
+
+Rejecting
+---------
+- Home servers MUST reject room aliases which do not pass the check, both on 
+  GETs and PUTs.
+- Home servers MUST reject user ID localparts which do not pass the check, both
+  on creation and on events.
+- Any home server whose domain does not pass this check, MUST use their punycode
+  domain name instead of the IDN, to prevent other home servers rejecting you.
+- Error code is M_FAILED_HOMOGRAPH_CHECK.
+- Error message MAY go into further information about which characters were
+  rejected and why.
+  
+Other considerations
+--------------------
+- Basic security: Informational key on the event attached by HS to say "unsafe 
+  ID". Problem: clients can just ignore it, and since it will appear only very
+  rarely, easy to forget when implementing clients.
+- Moderate security: Requires client handshake. Forces clients to implement
+  a check, else they cannot communicate with the misleading ID. However, this is
+  extra overhead in both client implementations and round-trips.
+- High security: Outright rejection of the ID at the point of creation / 
+  receiving event. Point of creation rejection is preferable to avoid the ID
+  entering the system in the first place. However, malicious HSes can just allow
+  the ID. Hence, other home servers must reject them if they see them in events.
+  Client never sees the problem ID, provided the HS is correctly implemented.
+- High security decided; client doesn't need to worry about it, no additional
+  protocol complexity aside from rejection of an event.
\ No newline at end of file
-- 
cgit 1.4.1
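
Note on the "Checks" section of this draft: the per-segment check could be sketched roughly as below. This is only an illustration under stated assumptions, not the check the draft specifies. The draft calls for language sets taken from the CLDR dataset; the sketch instead approximates a character's script by the first word of its Unicode name (a crude heuristic available in the Python standard library), treats every character in a "C" general category as blacklisted, and simplifies the 'preferred' language rule to "at most one script per segment". The function names ``rough_script`` and ``segment_passes`` are hypothetical.

```python
import unicodedata


def rough_script(ch: str) -> str:
    """Approximate a script by the first word of the Unicode character name.

    Assumption: this is a stand-in for real CLDR/script data, which the
    Python standard library does not provide.
    """
    name = unicodedata.name(ch, "")
    return name.split(" ")[0] if name else "UNKNOWN"


def segment_passes(segment: str) -> bool:
    """Check one segment (a localpart or a domain) on its own."""
    scripts = set()
    for ch in segment:
        category = unicodedata.category(ch)
        # Categories starting with "C" cover control, format, surrogate,
        # private-use and unassigned code points, none of which render
        # reliably for the end-user.
        if category.startswith("C"):
            return False
        # Only letters contribute to the mixed-script test; digits and
        # punctuation are shared between scripts.
        if category.startswith("L"):
            scripts.add(rough_script(ch))
    # Simplification: allow at most one script per segment.
    return len(scripts) <= 1
```

Under these assumptions, an all-Latin or all-Cyrillic localpart passes, while a localpart that swaps a single Latin letter for its Cyrillic look-alike fails, as does one containing a zero-width space.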


From 56a358481e928d6e70ff8afd48756c67860965c9 Mon Sep 17 00:00:00 2001
From: Kegan Dougal <kegan@matrix.org>
Date: Tue, 9 Sep 2014 15:00:48 -0700
Subject: Tyops

---
 docs/human-id-rules.rst | 14 +++++++-------
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/docs/human-id-rules.rst b/docs/human-id-rules.rst
index 36987ddd0d..999651991c 100644
--- a/docs/human-id-rules.rst
+++ b/docs/human-id-rules.rst
@@ -7,23 +7,23 @@ such, Matrix requires that all strings MUST be encoded as UTF-8. However,
 using Unicode as the character set for human-readable IDs is troublesome. There
 are many different characters which appear identical to each other, but would
 identify different users. In addition, there are non-printable characters which
-cannot be rendered the the end-user. This opens up a security vulnerability with
+cannot be rendered to the end-user. This opens up a security vulnerability with
 phishing/spoofing of IDs, commonly known as a homograph attack.
 
 Web browsers encountered this problem when International Domain Names were
 introduced. A variety of checks were put in place in order to protect users. If
 an address failed the check, the raw punycode would be displayed to disambiguate
-the address. Similar checks are performed by home servers in Matrix, which will
-then warn the client about the potentially misleading ID. However, Matrix does
-not use punycode, and so does not show raw punycode on a failed check. Instead,
-home servers must outright reject these misleading IDs.
+the address. Similar checks are performed by home servers in Matrix. However, 
+Matrix does not use punycode representations, and so does not show raw punycode 
+on a failed check. Instead, home servers must outright reject these misleading 
+IDs.
 
 Types of human-readable IDs
 ---------------------------
 There are two main human-readable IDs in question:
 
- - Room aliases
- - User IDs
+- Room aliases
+- User IDs
  
 Room aliases look like ``#localpart:domain``. These aliases point to opaque
 non human-readable room IDs. These pointers can change, so there is already an
-- 
cgit 1.4.1


From f23e5b17b66db0fabb8c53d3f046936268e5e031 Mon Sep 17 00:00:00 2001
From: Kegan Dougal <kegan@matrix.org>
Date: Tue, 9 Sep 2014 15:11:06 -0700
Subject: Extra restrictions to make parsing easier.

---
 docs/human-id-rules.rst | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/docs/human-id-rules.rst b/docs/human-id-rules.rst
index 999651991c..6e63bc43a2 100644
--- a/docs/human-id-rules.rst
+++ b/docs/human-id-rules.rst
@@ -41,6 +41,9 @@ Checks
 - mix of language sets from 'preferred' language not allowed. 
 - Language sets from CLDR dataset.
 - Treated in segments (localpart, domain)
+- Additional restrictions for ease of processing IDs.
+  - Room alias localparts MUST NOT have ``#`` or ``:``.
+  - User ID localparts MUST NOT have ``@`` or ``:``.
 
 Rejecting
 ---------
@@ -50,9 +53,13 @@ Rejecting
   on creation and on events.
 - Any home server whose domain does not pass this check, MUST use their punycode
   domain name instead of the IDN, to prevent other home servers rejecting you.
-- Error code is M_FAILED_HOMOGRAPH_CHECK.
+- Error code is ``M_FAILED_HUMAN_ID_CHECK``. (generic enough for both failing 
+  due to homograph attacks, and failing due to including ``:``s, etc)
 - Error message MAY go into further information about which characters were
   rejected and why.
+- Error message SHOULD contain a ``failed_keys`` key which contains an array
+  of strings which represent the keys which failed the check e.g:
+   - ``failed_keys: [ user_id, room_alias ]``
   
 Other considerations
 --------------------
-- 
cgit 1.4.1


From 2bd4346075b119d48afa676dcc883a51199119f2 Mon Sep 17 00:00:00 2001
From: Kegan Dougal <kegan@matrix.org>
Date: Tue, 9 Sep 2014 15:13:50 -0700
Subject: More rst formatting.

---
 docs/human-id-rules.rst | 11 ++++++-----
 1 file changed, 6 insertions(+), 5 deletions(-)

diff --git a/docs/human-id-rules.rst b/docs/human-id-rules.rst
index 6e63bc43a2..3a1ff39892 100644
--- a/docs/human-id-rules.rst
+++ b/docs/human-id-rules.rst
@@ -42,8 +42,8 @@ Checks
 - Language sets from CLDR dataset.
 - Treated in segments (localpart, domain)
 - Additional restrictions for ease of processing IDs.
-  - Room alias localparts MUST NOT have ``#`` or ``:``.
-  - User ID localparts MUST NOT have ``@`` or ``:``.
+   - Room alias localparts MUST NOT have ``#`` or ``:``.
+   - User ID localparts MUST NOT have ``@`` or ``:``.
 
 Rejecting
 ---------
@@ -54,12 +54,13 @@ Rejecting
 - Any home server whose domain does not pass this check, MUST use their punycode
   domain name instead of the IDN, to prevent other home servers rejecting you.
 - Error code is ``M_FAILED_HUMAN_ID_CHECK``. (generic enough for both failing 
-  due to homograph attacks, and failing due to including ``:``s, etc)
+  due to homograph attacks, and failing due to including ``:`` s, etc)
 - Error message MAY go into further information about which characters were
   rejected and why.
 - Error message SHOULD contain a ``failed_keys`` key which contains an array
-  of strings which represent the keys which failed the check e.g:
-   - ``failed_keys: [ user_id, room_alias ]``
+  of strings which represent the keys which failed the check e.g::
+  
+    failed_keys: [ user_id, room_alias ]
   
 Other considerations
 --------------------
-- 
cgit 1.4.1
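
Note on the parsing restrictions and the error shape introduced in the last two patches: a home server might apply them roughly as sketched below. This is a minimal illustration, not the draft's implementation. The error code ``M_FAILED_HUMAN_ID_CHECK`` and the ``failed_keys`` array come from the draft; the ``errcode`` and ``error`` field names, the placement of ``failed_keys`` at the top level of the error body, and the helper names ``split_id`` and ``check_ids`` are assumptions made for the example.

```python
ERRCODE = "M_FAILED_HUMAN_ID_CHECK"  # error code named in the draft


def split_id(human_id, sigil):
    """Split '<sigil>localpart:domain' into (localpart, domain).

    Returns None if the ID is malformed. Splitting on the first ':' means
    the localpart can never contain ':'; the sigil is checked separately.
    """
    if not human_id.startswith(sigil):
        return None
    localpart, sep, domain = human_id[1:].partition(":")
    if not sep or not localpart or not domain:
        return None
    if sigil in localpart:
        return None
    return localpart, domain


def check_ids(fields):
    """Validate several IDs at once.

    `fields` maps a key name (e.g. 'user_id') to an (id, sigil) pair.
    Returns None when everything parses, otherwise an error body listing
    the keys that failed. The body layout is an assumption.
    """
    failed = [
        key for key, (value, sigil) in fields.items()
        if split_id(value, sigil) is None
    ]
    if not failed:
        return None
    return {
        "errcode": ERRCODE,
        "error": "One or more human-readable IDs failed the check.",
        "failed_keys": failed,
    }
```

For example, passing a well-formed ``user_id`` together with a ``room_alias`` whose localpart contains ``#`` would yield an error body whose ``failed_keys`` lists only ``room_alias``. A fuller version would also run the homograph checks on each segment before accepting the ID.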