Robert Knight authored
When anchoring a quote in a PDF, the quote is first searched for in text extracted using PDF.js's `PDFPage.getTextContent` API, and the resulting positions are used to create a range within the hidden text layer of a page.

An issue we've seen several times when doing PDF.js upgrades is minor changes to which spaces are included in the text layer. In the past we've adapted our text extraction to match the text layer each time. This slows down the process of upgrading PDF.js and makes maintaining compatibility with a range of PDF.js releases more difficult. In the most recent update, an `includeMarkedContent` option was added to the `getTextContent` API, and the presence of that option could affect whether certain whitespace is included in the output or not [1].

Try to address this issue generally by mapping offsets from the page text into offsets in the text layer in a way that ignores whitespace differences.

- Add `translateOffsets` utility, which maps a (start, end) pair of offsets in an input string into corresponding offsets in an output string, where the output is a version of the input that has been "corrupted" by the addition or removal of certain characters (e.g. whitespace)
- Use `translateOffsets` utility in PDF anchoring to map quote offsets in the page text returned by `PDFPage.getTextContent` into offsets in the `textContent` of the text layer element.

[1] https://github.com/hypothesis/browser-extension/pull/799#issuecomment-1079864595
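The offset mapping described above can be illustrated with a minimal sketch. This is not the code from this commit: the signature, the `isNotSpace` helper and the default filter are assumptions made for illustration. The idea is to count the non-ignored characters that precede `start` and `end` in the input, then walk the output until the same number of non-ignored characters has been seen.

```typescript
// Hypothetical filter: characters that count when aligning the two strings.
function isNotSpace(ch: string): boolean {
  return !/\s/.test(ch);
}

// Sketch of an offset translation that ignores characters rejected by `keep`.
// Assumes `output` has the same kept characters as `input`, in the same order.
function translateOffsets(
  input: string,
  output: string,
  start: number,
  end: number,
  keep: (ch: string) => boolean = isNotSpace
): [number, number] {
  // Count kept characters in the input before `start` and before `end`.
  let keptBeforeStart = 0;
  let keptBeforeEnd = 0;
  const limit = Math.min(end, input.length);
  for (let i = 0; i < limit; i++) {
    if (keep(input.charAt(i))) {
      if (i < start) {
        keptBeforeStart++;
      }
      keptBeforeEnd++;
    }
  }

  // Walk the output until the same number of kept characters has been seen.
  let outStart = 0;
  let outEnd = 0;
  let seen = 0;
  for (let i = 0; i < output.length && seen < keptBeforeEnd; i++) {
    if (!keep(output.charAt(i))) {
      continue;
    }
    if (seen === keptBeforeStart) {
      outStart = i; // first kept character inside the range
    }
    seen++;
    outEnd = i + 1; // just past the last kept character seen so far
  }

  // A range with no kept characters collapses to a single position.
  if (keptBeforeStart === keptBeforeEnd) {
    return [outEnd, outEnd];
  }
  return [outStart, outEnd];
}

// Usage: the "text layer" string drops one space and doubles another.
const pageText = "A quick brown fox";
const textLayer = "A quickbrown  fox";
const [s, e] = translateOffsets(pageText, textLayer, 8, 13);
console.log(textLayer.slice(s, e));
```

In this sketch the returned range starts at the first kept character and ends just after the last one, so whitespace that appears only on one side at the range boundaries is excluded from the result.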
4e124a6f