feat(text): Switch to using TextDecoder for UTF8

Previously, to decode UTF8 content, we used the browser's decodeURIComponent method. This worked in most situations, but it would throw an error the moment it found an invalid UTF8 sequence. This meant that a single poorly-encoded character inside a text stream would cause the entire closed caption track to fail to display.

In this CL, we switch to using the newer TextDecoder API, which instead replaces invalid sequences with the "unknown character" code point (U+FFFD) and continues parsing. This should make our text parsers more robust when faced with bad encoding.

Closes #2816

Change-Id: Ibf2887e143d24d15a127bbcf2961539669580eea
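
The difference in behavior described above can be sketched roughly as follows. This is only an illustration, not the code changed in this commit: the helper names `decodeWithUriComponent` and `decodeWithTextDecoder` are hypothetical, and percent-encoding each byte is just one common way of feeding raw bytes to decodeURIComponent.

```javascript
// Hypothetical helpers for illustration; not the project's implementation.

// Older style: percent-encode every byte, then let decodeURIComponent
// interpret the result as UTF-8. Malformed input throws a URIError.
function decodeWithUriComponent(bytes) {
  let escaped = '';
  for (const b of bytes) {
    escaped += '%' + b.toString(16).padStart(2, '0');
  }
  return decodeURIComponent(escaped);
}

// Newer style: TextDecoder in its default (non-fatal) mode substitutes
// U+FFFD for each malformed sequence and keeps decoding.
function decodeWithTextDecoder(bytes) {
  return new TextDecoder('utf-8').decode(bytes);
}

// "San Jos" followed by a stray 0x81 byte, as in the new test case.
const bad = new Uint8Array([0x53, 0x61, 0x6e, 0x20, 0x4a, 0x6f, 0x73, 0x81]);

let threw = false;
try {
  decodeWithUriComponent(bad);
} catch (e) {
  threw = e instanceof URIError;
}

console.log(threw);
console.log(decodeWithTextDecoder(bad) === 'San Jos\uFFFD');
```

With the older approach, a caller has to catch the exception and loses the whole string; with TextDecoder, only the bad bytes are lost.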
2026-06-16 16:16:40 +03:00 · 2020-08-28 15:14:11 -07:00
parent de478295ca
commit a72a1e9102
2 changed files with 27 additions and 35 deletions
@@ -15,6 +15,20 @@ describe('StringUtils', () => {
        .toBe('F\u20ac \ud800\udf48');
  });

+  it('won\'t break if given cut-off UTF8 character', () => {
+    // This array ends with a lone UTF8 continuation byte (0x81) that has no
+    // lead byte before it, stranded at the very end of the string.
+    const arr1 = [0x53, 0x61, 0x6e, 0x20, 0x4a, 0x6f, 0x73, 0x81];
+    expect(StringUtils.fromUTF8(new Uint8Array(arr1)))
+        .toBe('San Jos\uFFFD');
+
+    // 0xE9 is the lead byte of a 3-byte UTF8 sequence, but the next byte (0x33)
+    // is not a continuation byte, so 0xE9 alone is replaced with U+FFFD.
+    const arr2 = [0x4a, 0x6f, 0x73, 0xE9, 0x33, 0x33, 0x20, 0x53, 0x61, 0x6e];
+    expect(StringUtils.fromUTF8(new Uint8Array(arr2)))
+        .toBe('Jos\uFFFD33 San');
+  });
+
  it('strips the BOM in fromUTF8', () => {
    // This is 4 Unicode characters, the last will be split into a surrogate
    // pair.