    Allow for whitespace differences between page text and text layer · 4e124a6f
    Robert Knight authored
    When anchoring a quote in a PDF, we first search for the quote in text
    extracted using PDF.js's `PDFPage.getTextContent` API, then use the
    resulting positions to create a range within the hidden text layer of
    the page.
    
    An issue we've seen several times when doing PDF.js upgrades is minor
    changes to which spaces are included in the text layer. In the past
    we've adapted our text extraction to match the text layer each time.
    This slows down the process of upgrading PDF.js and makes maintaining
    compatibility with a range of PDF.js releases more difficult. In the
    most recent update, an `includeMarkedContent` option was added to the
    `getTextContent` API, and the presence of that option could affect
    whether certain whitespace is included in the output or not [1].
    
    Try to address this issue generally by mapping offsets from the page
    text into offsets in the text layer in a way that ignores whitespace
    differences.
    
     - Add `translateOffsets` utility, which maps a (start, end) pair
       of offsets in an input string into corresponding offsets in an output
       string, where the output is a version of the input that has been
       "corrupted" by the addition or removal of certain characters (e.g.
       whitespace).
    
     - Use `translateOffsets` utility in PDF anchoring to map quote offsets
       in the page text returned by `PDFPage.getTextContent` into offsets in
       the `textContent` of the text layer element.
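
The offset mapping described above can be sketched as follows. This is a minimal illustration, not the implementation added by this commit: the signature, the `isIgnored` parameter and the helper names are assumptions. It counts the significant (non-ignored) characters that precede each offset in the input, then finds the positions of the same characters in the output.

```typescript
// Hypothetical sketch: map [start, end) in `input` to the matching span in
// `output`, where `output` differs from `input` only by added or removed
// ignorable characters (whitespace by default).
function translateOffsets(
  input: string,
  output: string,
  start: number,
  end: number,
  isIgnored: (ch: string) => boolean = ch => /\s/.test(ch),
): [number, number] {
  // Number of significant characters in text[0..pos).
  const countSignificant = (text: string, pos: number): number => {
    let count = 0;
    for (let i = 0; i < pos && i < text.length; i++) {
      if (!isIgnored(text.charAt(i))) {
        count++;
      }
    }
    return count;
  };

  // Offset in `text` of the n-th (0-based) significant character, or
  // text.length if there are fewer than n + 1 such characters.
  const offsetOfSignificant = (text: string, n: number): number => {
    let seen = 0;
    for (let i = 0; i < text.length; i++) {
      if (isIgnored(text.charAt(i))) {
        continue;
      }
      if (seen === n) {
        return i;
      }
      seen++;
    }
    return text.length;
  };

  const beforeStart = countSignificant(input, start);
  const beforeEnd = countSignificant(input, end);

  const outStart = offsetOfSignificant(output, beforeStart);
  // `end` is exclusive, so place it just after the last significant
  // character inside the range. An empty range collapses to `outStart`.
  const outEnd =
    beforeEnd > beforeStart
      ? offsetOfSignificant(output, beforeEnd - 1) + 1
      : outStart;

  return [outStart, outEnd];
}

// Usage: "bar" spans [4, 7) in the page text. The text layer has dropped
// one space and doubled another.
const pageText = 'foo bar baz';
const layerText = 'foobar  baz';
const [layerStart, layerEnd] = translateOffsets(pageText, layerText, 4, 7);
console.log(layerText.slice(layerStart, layerEnd)); // "bar"
```

This version rescans both strings from the beginning on every call, which keeps the logic easy to follow. A production version would likely scan each string once.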
    
    [1] https://github.com/hypothesis/browser-extension/pull/799#issuecomment-1079864595
Name
.github
bin
dev-server
docs
embedding-examples
images
requirements
scripts
src
.babelrc
.dockerignore
.eslintignore
.eslintrc
.gitignore
.npmignore
.npmrc
.prettierignore
.python-version
CODE_OF_CONDUCT
Dockerfile
Jenkinsfile
LICENSE
Makefile
README.md
codecov.yml
gulpfile.mjs
package.json
pyproject.toml
requirements-dev.in
rollup-boot.config.mjs
rollup-tests.config.mjs
rollup.config.mjs
tailwind.config.mjs
tox.ini
tsconfig.json
yarn.lock