    Allow for whitespace differences between page text and text layer · 4e124a6f
    Robert Knight authored
    When anchoring a quote in a PDF, the quote is first searched in text
    extracted using PDF.js's `PDFPage.getTextContent` API, and the resulting
    positions are used to create a range within the hidden text layer of a
    page.
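
    As a rough illustration of the first step, the page text can be built
    by joining the `str` field of the items returned by `getTextContent`.
    The helper names `joinTextItems` and `getPageText` are hypothetical and
    are not taken from this commit; the join without separators is also an
    assumption about how the page text is assembled.

```javascript
// Minimal sketch, not the project's actual extraction code.
// Items without a string `str` field (for example marked-content
// markers, which can appear when `includeMarkedContent` is enabled)
// are skipped.
function joinTextItems(items) {
  return items
    .filter(item => typeof item.str === 'string')
    .map(item => item.str)
    .join('');
}

// `page` is assumed to be a PDF.js page object.
async function getPageText(page) {
  const textContent = await page.getTextContent();
  return joinTextItems(textContent.items);
}
```

    Offsets of a quote found in this string then have to be mapped onto
    the text layer, whose whitespace may differ.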
    
    An issue we've seen several times when doing PDF.js upgrades is minor
    changes to which spaces are included in the text layer. In the past
    we've adapted our text extraction to match the text layer each time.
    This slows down the process of upgrading PDF.js and makes maintaining
    compatibility with a range of PDF.js releases more difficult. In the
    most recent update, an `includeMarkedContent` option was added to the
    `getTextContent` API, and the presence of that option could affect
    whether certain whitespace is included in the output or not [1].
    
    Try to address this issue generally by mapping offsets from the page
    text into offsets in the text layer in a way that ignores whitespace
    differences.
    
     - Add `translateOffsets` utility, which maps a (start, end) pair
       of offsets in an input string into corresponding offsets in an output
       string, where the output is a version of the input that has been
       "corrupted" by the addition or removal of certain characters (e.g.
       whitespace).
    
     - Use `translateOffsets` utility in PDF anchoring to map quote offsets
       in the page text returned by `PDFPage.getTextContent` into offsets in
       the `textContent` of the text layer element.
    
    [1] https://github.com/hypothesis/browser-extension/pull/799#issuecomment-1079864595
Name
html-baselines
fake-pdf-viewer-application.js
html-anchoring-fixture.html
html-test.js
match-quote-test.js
pdf-test.js
placeholder-test.js
text-range-test.js
types-test.js
xpath-test.js