Merge branch 'release/0.27.0'

felliott · felliott · commit 637d8d7a3ff2 · 2018-07-19T11:42:09.000-04:00
diff --git a/CHANGELOG b/CHANGELOG
@@ -2,6 +2,16 @@
 ChangeLog
 *********
 
+0.27.0 (2018-07-19)
+===================
+- Feature: Support the Hypothes.is annotator toolbar on pdfs and files converted to pdf.  This is
+not enabled by default; see `docs/integrations.rst` for instructions on enabling it.  NB: the
+default urls, page titles, and document ids that hypothes.is gets from the rendered document when
+running in an MFR context are not very useful, so MFR will try to provide more appropriate values to
+the annotator.  These may not be valid for all use cases; please see the document mentioned
+above for details.  (h/t @jamescdavis for helping to debug a race condition in the loader!)
+- Code: Don't let pytest descend into node_modules/.  (thanks, @birdbrained!)
+
 0.26.0 (2018-06-22)
 ===================
 - Feature: Teach MFR to identify itself when requesting metadata from WaterButler. This will allow
diff --git a/docs/index.rst b/docs/index.rst
@@ -24,6 +24,7 @@ Guide
    install
    quickstart
    overview
+   integrations
    code
 
 Project info
diff --git a/docs/integrations.rst b/docs/integrations.rst
@@ -0,0 +1,40 @@
+.. _integrations:
+
+Integrations
+============
+
+
+Hypothes.is annotator
+---------------------
+
+MFR supports loading the `Hypothes.is <https://hypothes.is/>`_ annotation sidebar on pdfs and files converted to pdf.  Hypothes.is allows users to publicly comment and converse on internet-accessible files.  The annotator is not automatically loaded; the parent frame must signal it to turn on.  MFR also overrides some of the properties used by the sidebar to identify the annotation.
+
+
+Enabling
+^^^^^^^^
+
+The annotator is not loaded automatically for every MFR pdf render. The parent frame will need to send the ``startHypothesis`` event to the MFR iframe to start loading the annotator.  If the iframe is created via ``mfr.js``, then this signal can be sent by calling ``.startHypothesis()`` on the Render object.  If ``mfr.js`` is not used, then the signal can be sent by calling ``.postMessage()`` on the iframe:
+
+.. code-block:: javascript
+
+    $('iframe')[0].contentWindow.postMessage('startHypothesis', mfrUrl);
+
+When the iframe receives this event, it will override the pdf.js metadata that the annotator extracts, then inject the hypothes.is loader script into the iframe.
+
+Hypothes.is support can be completely disabled by setting the ``ENABLE_HYPOTHESIS`` flag to ``False`` in the pdf extension settings (``mfr.extensions.pdf.settings``). If running via the OSF's docker-compose, add ``PDF_EXTENSION_CONFIG_ENABLE_HYPOTHESIS=0`` to ``.docker-compose.mfr.env`` in the osf.io repo and recreate the container. If this flag is turned off, sending the ``startHypothesis`` event to the iframe will do nothing.
+
+
+Annotator metadata
+^^^^^^^^^^^^^^^^^^
+
+The annotator client links annotations to both the url of the document and an identifier embedded in the pdf.  It also attaches the page title as metadata to the annotation. [#f1]_  In MFR, all three of these may be unsuitable for one reason or another, so MFR will override the properties that the client retrieves to provide more appropriate values.  These properties are:
+
+**URL**: The MFR url can be complex, especially since it takes another url as a query parameter. Hypothes.is can handle reordering of the top-level parameters, but any change to the internal url will be taken as a new url, causing annotations to be lost. In addition, the url is used by Hypothes.is to provide share links and "view-in-context" links.  Visiting an MFR render url will load the iframe, but without an external frame to send the ``startHypothesis`` signal, the annotations will never be loaded.  Visiting an MFR export url will start a download of the document, with no chance of showing annotations.  Instead, MFR sets the annotation url to that of the parent frame, which is expected to be simpler and provide more context.
+
+**Document ID**:  The document ID is an identifier embedded in the pdf.  pdf.js will extract this value, or if it is not present, return the md5 hash of the first 1024 bytes of data in the pdf.  User-provided pdfs will *usually* contain IDs, but may not. If the pdf is updated, there is no guarantee that the ID will be preserved across revisions. If the ID changes, the document could lose its annotations.  Pdfs exported by LibreOffice do not contain any identifiers, so the fallback hash applies and may change unpredictably.  For these reasons, MFR exports a stable identifier that should persist across revisions.  The stable ID is defined by the auth provider.  The OSF auth provider uses a hash of file metadata that is particular to that file and unlikely to change.  MFR does not modify the file; instead, it overwrites the identifier detected by pdf.js, which is then read by the annotator client.
+
+**Title**: The annotator will derive the annotation page title from the pdf title. As with Document IDs, user-provided pdfs may or may not have a title.  LibreOffice-exported pdfs do not have an embedded title.  If an embedded title isn't found, the annotator will fall back to the iframe document's title, which, if not set, will default to the path part of the iframe url.  This results in annotation titles of "render" or "export", with nothing to distinguish them from other MFR annotations.  MFR works around this by updating the pdf.js-detected title and page title with the source file's name.
+
+.. rubric:: Footnotes
+
+.. [#f1] If the page title changes between annotations, the client will send the new page title with new annotations, but the hypothesis aggregator will discard that and `use the first title received <https://github.com/hypothesis/h/blob/8410ff35150ea600c02458e4558a67db7c926816/h/activity/bucketing.py#L27>`_ for that identifier.
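
The fallback described in the Document ID paragraph above (an md5 hash of the first 1024 bytes) can be illustrated with a minimal Python sketch. This is not pdf.js code; `fallback_fingerprint` and the sample byte strings are hypothetical and only approximate the behavior the documentation describes. It shows why such a content-derived ID is fragile: a change near the start of the file yields a new ID, while a change past the first 1024 bytes goes unnoticed.

```python
import hashlib


def fallback_fingerprint(pdf_bytes: bytes) -> str:
    """Approximate the documented fallback: md5 of the first 1024 bytes.

    Hypothetical helper for illustration; it is not part of MFR or pdf.js.
    """
    return hashlib.md5(pdf_bytes[:1024]).hexdigest()


# Made-up stand-ins for three revisions of the same exported file.
header_v1 = b'%PDF-1.4\n% revision one\n'
header_v2 = b'%PDF-1.4\n% revision two\n'

rev_a = header_v1 + b'a' * 2000
rev_b = header_v2 + b'a' * 2000                  # differs within the first 1024 bytes
rev_c = header_v1 + b'a' * 1500 + b'b' * 500     # differs only after byte 1024

id_a = fallback_fingerprint(rev_a)
id_b = fallback_fingerprint(rev_b)
id_c = fallback_fingerprint(rev_c)

early_change_alters_id = id_a != id_b   # annotations keyed on id_a would be orphaned
late_change_keeps_id = id_a == id_c     # later edits are invisible to this hash
```

Either outcome is independent of which file the user thinks they are annotating, which is the motivation given above for overriding the fingerprint with a provider-defined stable ID.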
diff --git a/mfr/extensions/pdf/export.py b/mfr/extensions/pdf/export.py
@@ -1,5 +1,6 @@
 import os
 import imghdr
+import logging
 from http import HTTPStatus
 
 from PIL import Image, TiffImagePlugin
@@ -9,6 +10,8 @@
 from mfr.extensions.pdf import exceptions
 from mfr.extensions.pdf.settings import EXPORT_MAX_PAGES
 
+logger = logging.getLogger(__name__)
+
 
 class PdfExporter(extension.BaseExporter):
 
@@ -63,6 +66,7 @@ def tiff_to_pdf(self, tiff_img, max_size):
         c.save()
 
     def export(self):
+        logger.debug('pdf-export: format::{}'.format(self.format))
         parts = self.format.split('.')
         export_type = parts[-1].lower()
         max_size = [int(x) for x in parts[0].split('x')] if len(parts) == 2 else None
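
The `format` query argument that the renderer builds and the exporter parses can be sketched as a round trip. The function names `build_format` and `parse_format` are hypothetical and exist only for this illustration; the bodies mirror the expressions shown in `render.py` and `export.py` in this diff, assuming the default settings values of `'pdf'` and `'1200x1200'`.

```python
def build_format(export_type, maximum_size=None):
    """Mirror the renderer: '<size>.<type>' when a maximum size is set, else '<type>'."""
    if maximum_size:
        return '{}.{}'.format(maximum_size, export_type)
    return export_type


def parse_format(fmt):
    """Mirror the exporter: split on '.', take the type last and the size first."""
    parts = fmt.split('.')
    export_type = parts[-1].lower()
    max_size = [int(x) for x in parts[0].split('x')] if len(parts) == 2 else None
    return export_type, max_size


with_size = parse_format(build_format('pdf', '1200x1200'))
without_size = parse_format(build_format('pdf'))
```

With these inputs, `with_size` is `('pdf', [1200, 1200])` and `without_size` is `('pdf', None)`. An empty export type would produce a format string with no usable type, which is presumably why the settings module now asserts that `EXPORT_TYPE` is set.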
diff --git a/mfr/extensions/pdf/render.py b/mfr/extensions/pdf/render.py
@@ -21,36 +21,35 @@ class PdfRenderer(extension.BaseRenderer):
     def render(self):
 
         download_url = munge_url_for_localdev(self.metadata.download_url)
+        escaped_name = escape_url_for_template(
+            '{}{}'.format(self.metadata.name, self.metadata.ext)
+        )
         logger.debug('extension::{}  supported-list::{}'.format(self.metadata.ext,
                                                                 settings.EXPORT_SUPPORTED))
         if self.metadata.ext.lower() not in settings.EXPORT_SUPPORTED:
             logger.debug('Extension not found in supported list!')
             return self.TEMPLATE.render(
                 base=self.assets_url,
                 url=escape_url_for_template(download_url.geturl()),
+                stable_id=self.metadata.stable_id,
+                file_name=escaped_name,
                 enable_hypothesis=settings.ENABLE_HYPOTHESIS,
             )
 
         logger.debug('Extension found in supported list!')
         exported_url = furl.furl(self.export_url)
-        if settings.EXPORT_TYPE:
-            if settings.EXPORT_MAXIMUM_SIZE:
-                exported_url.args['format'] = '{}.{}'.format(settings.EXPORT_MAXIMUM_SIZE,
-                                                             settings.EXPORT_TYPE)
-            else:
-                exported_url.args['format'] = settings.EXPORT_TYPE
-
-            self.metrics.add('needs_export', True)
-            return self.TEMPLATE.render(
-                base=self.assets_url,
-                url=escape_url_for_template(exported_url.url),
-                enable_hypothesis=settings.ENABLE_HYPOTHESIS
-            )
+        if settings.EXPORT_MAXIMUM_SIZE:
+            exported_url.args['format'] = '{}.{}'.format(settings.EXPORT_MAXIMUM_SIZE,
+                                                         settings.EXPORT_TYPE)
+        else:
+            exported_url.args['format'] = settings.EXPORT_TYPE
 
-        # TODO: is this dead code? ``settings.EXPORT_TYPE`` is never None or empty
+        self.metrics.add('needs_export', True)
         return self.TEMPLATE.render(
             base=self.assets_url,
-            url=escape_url_for_template(download_url.geturl()),
+            url=escape_url_for_template(exported_url.url),
+            stable_id=self.metadata.stable_id,
+            file_name=escaped_name,
             enable_hypothesis=settings.ENABLE_HYPOTHESIS,
         )
 
diff --git a/mfr/extensions/pdf/settings.py b/mfr/extensions/pdf/settings.py
@@ -4,9 +4,10 @@
 config = settings.child('PDF_EXTENSION_CONFIG')
 
 EXPORT_TYPE = config.get('EXPORT_TYPE', 'pdf')
+assert EXPORT_TYPE  # mandatory config
 EXPORT_MAXIMUM_SIZE = config.get('EXPORT_MAXIMUM_SIZE', '1200x1200')
 
-ENABLE_HYPOTHESIS = config.get_bool('ENABLE_HYPOTHESIS', False)
+ENABLE_HYPOTHESIS = config.get_bool('ENABLE_HYPOTHESIS', True)
 
 # supports multiple files in the form of a space separated string
 EXPORT_SUPPORTED = config.get('EXPORT_SUPPORTED', '.tiff .tif').split(' ')
diff --git a/mfr/extensions/pdf/templates/viewer.mako b/mfr/extensions/pdf/templates/viewer.mako
@@ -424,8 +424,11 @@ http://sourceforge.net/adobe/cmap/wiki/License/
         window.pymChild.sendMessage('embed', 'embed-responsive-pdf');
     </script>
     % if enable_hypothesis:
+    <script>
+        window.MFR_STABLE_ID = '${stable_id}';
+        window.MFR_FILE_NAME = '${file_name}';
+    </script>
     <script src="/static/js/mfr.child.hypothesis.js"></script>
     % endif
   </body>
 </html>
-
diff --git a/mfr/extensions/unoconv/export.py b/mfr/extensions/unoconv/export.py
@@ -1,12 +1,6 @@
 import os
 import subprocess
 
-from pdfrw import (
-    PdfReader,
-    PdfWriter
-)
-
-
 from mfr.core import extension
 from mfr.core import exceptions
 
@@ -39,8 +33,3 @@ def export(self):
                 extension=extension or '',
                 exporter_class='unoconv',
             )
-
-        pdf = PdfReader(self.output_file_path)
-        pdf.ID[0] = self.metadata.stable_id
-        pdf.ID[1] = self.metadata.unique_key
-        PdfWriter(self.output_file_path, trailer=pdf).write()
diff --git a/mfr/providers/osf/provider.py b/mfr/providers/osf/provider.py
@@ -59,6 +59,7 @@ async def metadata(self):
         differently.
         """
         download_url = await self._fetch_download_url()
+        logger.debug('download_url::{}'.format(download_url))
         if '/file?' in download_url:
             # URL is for WaterButler v0 API
             # TODO Remove this when API v0 is officially deprecated
@@ -124,8 +125,10 @@ async def metadata(self):
         self.metrics.add('metadata.clean_url_args', str(cleaned_url))
         meta = metadata['data']
         unique_key = hashlib.sha256((meta['etag'] + cleaned_url.url).encode('utf-8')).hexdigest()
-        stable_id = hashlib.sha256('/{}/{}/{}'.format(meta['resource'], meta['provider'], meta['path'])
-                                   .encode('utf-8')).hexdigest()
+        stable_str = '/{}/{}{}'.format(meta['resource'], meta['provider'], meta['path'])
+        stable_id = hashlib.sha256(stable_str.encode('utf-8')).hexdigest()
+        logger.debug('stable_identifier: str({}) hash({})'.format(stable_str, stable_id))
+
         return provider.ProviderMetadata(name, ext, content_type, unique_key, download_url, stable_id)
 
     async def download(self):
@@ -177,6 +180,7 @@ async def _fetch_download_url(self):
                 )
                 await request.release()
 
+                logger.debug('osf-download-resolver: request.status::{}'.format(request.status))
                 if request.status != 302:
                     raise exceptions.MetadataError(
                         request.reason,
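
The change to the stable identifier's format string in `metadata()` above can be sketched as follows. This assumes, without confirmation from the diff, that `meta['path']` already begins with a slash; the resource, provider, and path values are made up, and `stable_id_for` is a hypothetical helper.

```python
import hashlib


def stable_id_for(template, resource, provider, path):
    """Return the hashed string and its sha256 hex digest for a given template."""
    stable_str = template.format(resource, provider, path)
    return stable_str, hashlib.sha256(stable_str.encode('utf-8')).hexdigest()


# Assumed example values; the path is assumed to carry its own leading slash.
resource, provider, path = 'abc12', 'osfstorage', '/5b50a1'

old_str, old_id = stable_id_for('/{}/{}/{}', resource, provider, path)
new_str, new_id = stable_id_for('/{}/{}{}', resource, provider, path)
```

Under that assumption the old template yields `/abc12/osfstorage//5b50a1` with a doubled slash, while the new one yields `/abc12/osfstorage/5b50a1`. The two digests differ, so identifiers computed before and after this change would not match for the same file.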
diff --git a/mfr/server/handlers/core.py b/mfr/server/handlers/core.py
@@ -2,6 +2,7 @@
 import abc
 import uuid
 import asyncio
+import logging
 import pkg_resources
 
 import tornado.web
@@ -31,6 +32,8 @@
     'Content-Encoding',
 ]
 
+logger = logging.getLogger(__name__)
+
 
 class CorsMixin:
 
@@ -110,6 +113,7 @@ async def prepare(self):
                 provider=settings.PROVIDER_NAME,
                 code=400,
             )
+        logger.debug('target_url::{}'.format(self.url))
 
         self.provider = utils.make_provider(
             settings.PROVIDER_NAME,
@@ -120,6 +124,7 @@ async def prepare(self):
 
         self.metadata = await self.provider.metadata()
         self.extension_metrics.add('ext', self.metadata.ext)
+        logger.debug('extension::{}'.format(self.metadata.ext))
 
         self.cache_provider = waterbutler.core.utils.make_provider(
             settings.CACHE_PROVIDER_NAME,
diff --git a/mfr/server/static/js/mfr.child.hypothesis.js b/mfr/server/static/js/mfr.child.hypothesis.js
@@ -16,11 +16,53 @@
             return;
         }
 
-        var script = window.document.createElement('script');
-        script.type = 'text/javascript';
-        script.src = 'https://hypothes.is/embed.js';
-        window.document.head.appendChild(script);
-        window.document.body.classList.add('show-hypothesis');
-        hypothesisLoaded = true;
+        // 'pagerendered' is an event emitted by pdf.js after the file metadata has been loaded and
+        // the first page rendered.  We must delay setting the fake metadata until after the
+        // document has been loaded, or pdf.js will overwrite our fake metadata with the real
+        // metadata.
+        document.addEventListener('pagerendered', function(e) {
+
+            // Changes made here will not affect loading of the document, but will change how
+            // Hypothes.is indexes the annotations.
+
+            // If a pdf is being rendered and MFR has provided a stable identifier, override the
+            // documentFingerprint with it before loading the hypothes.is client.  The client
+            // will use this ID to identify the document when fetching/saving annotations.
+            if (window.MFR_STABLE_ID) {
+                // pdf.js uses the first property to set the second. Set both for now, just to be
+                // safe. The second will be going away in a future pdf.js release.
+                window.PDFViewerApplication.pdfDocument.pdfInfo.fingerprint = window.MFR_STABLE_ID;
+                window.PDFViewerApplication.documentFingerprint = window.MFR_STABLE_ID;
+            }
+
+            // Override the document title to the file name and the document url to the parent
+            // window. This will not affect loading of the document, but will change how Hypothes.is
+            // indexes the annotation.  Previously, the page title on h.is would be the final path
+            // part of the download url, which would be an opaque file identifier or just `export`.
+            if (window.MFR_FILE_NAME) {
+                if (window.PDFViewerApplication.documentInfo) {
+                    window.PDFViewerApplication.documentInfo.Title = window.MFR_FILE_NAME;
+                }
+                else {
+                    window.PDFViewerApplication.documentInfo = {"Title": window.MFR_FILE_NAME};
+                }
+                document.title = window.MFR_FILE_NAME;
+            }
+
+            // Override the document url to point to the parent window. Before, the linked url would
+            // point to the export/download url, meaning the annotations could never be viewed in
+            // context.  By linking to the referrer, the annotations can be viewed in the context of
+            // the preprint.
+            window.PDFViewerApplication.url = document.referrer;
+
+            // Load the hypothes.is client
+            var script = window.document.createElement('script');
+            script.type = 'text/javascript';
+            script.src = 'https://hypothes.is/embed.js';
+            window.document.head.appendChild(script);
+            window.document.body.classList.add('show-hypothesis');
+            hypothesisLoaded = true;
+        });
+
     };
 })();
diff --git a/mfr/version.py b/mfr/version.py
@@ -1 +1 @@
-__version__ = '0.26.0'
+__version__ = '0.27.0'
diff --git a/requirements.txt b/requirements.txt
@@ -37,7 +37,6 @@ mistune==0.7
 
 # Pdf
 reportlab==3.4.0
-pdfrw==0.4.0
 
 # Pptx
 # python-pptx==0.5.7
diff --git a/setup.cfg b/setup.cfg
@@ -9,3 +9,6 @@
 ignore = E501,E127,E128,E265,E301,E302,F403,E731
 max-line-length = 100
 exclude = .ropeproject,tests/*,src/*,env,venv,node_modules/*
+
+[pytest]
+norecursedirs = .* build CVS _darcs {arch} *.egg venv node_modules/*
diff --git a/tests/extensions/pdf/test_renderer.py b/tests/extensions/pdf/test_renderer.py
@@ -86,7 +86,7 @@ def test_render_pdf_with_single_quote_in_name(self, assets_url):
     def test_render_tif(self, tif_renderer, assets_url):
         exported_url = furl.furl(tif_renderer.export_url)
         exported_url.args['format'] = '{}.{}'.format(settings.EXPORT_MAXIMUM_SIZE,
-                                                            settings.EXPORT_TYPE)
+                                                     settings.EXPORT_TYPE)
 
         body = tif_renderer.render()
         assert '<base href="{}/{}/web/" target="_blank">'.format(assets_url, 'pdf') in body