Commit 64ea411

Merge pull request #357 from clamsproject/347-slow-remote-file-access
caching for remote URI
2 parents 53cc9df + 0a26924 commit 64ea411

3 files changed

Lines changed: 59 additions & 8 deletions

documentation/plugins.rst

Lines changed: 35 additions & 5 deletions
@@ -3,7 +3,6 @@
 Developing plugins for MMIF Python SDK
 ======================================
 
-
 Overview
 --------
 
@@ -80,10 +79,41 @@ And the plugin code.
     def help():
         return "location format: `<DOCUMENT_ID>.video`"
 
-
-
-Bulit-in Document Location Scheme Plugins
+Built-in Document Location Scheme Plugins
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
-At the moment, ``mmif-python`` PyPI distribution ships a built-in *docloc* plugin that support both ``http`` and ``https`` schemes.
+At the moment, ``mmif-python`` PyPI distribution ships a built-in *docloc* plugin that support both ``http`` and ``https`` schemes. This plugin implements caching as described above, so repeated access to the same URL will not trigger multiple downloads.
 Take a look at :mod:`mmif_docloc_http` module for details.
+
+Caching for Remote File Access
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+When developing plugins that resolve remote document locations (e.g., ``http``, ``s3``, or custom schemes), it is highly recommended to implement caching to avoid repeated network requests or file downloads. Since ``mmif-python`` may call the ``resolve`` function multiple times for the same document location during processing, caching can significantly improve performance.
+
+A simple and effective approach is to use a module-level dictionary as a cache. Because Python modules are singletons (loaded once and cached in ``sys.modules``), this cache persists for the entire lifetime of the Python process, across multiple MMIF files and Document objects.
+
+Here's an example of how to implement caching in a plugin:
+
+.. code-block:: python
+
+    # mmif_docloc_myscheme/__init__.py
+
+    _cache = {}
+
+    def resolve(docloc):
+        if docloc in _cache:
+            return _cache[docloc]
+
+        # ... your resolution logic here ...
+        resolved_path = do_actual_resolution(docloc)
+
+        _cache[docloc] = resolved_path
+        return resolved_path
+
+This pattern ensures that:
+
+* The first call to ``resolve`` performs the actual resolution (download, API call, etc.)
+* Subsequent calls for the same location return the cached result immediately
+* The cache is shared across all MMIF objects processed within the same Python process
+
+See :mod:`mmif_docloc_http` for a concrete example of this caching strategy in action.
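
The documentation added here rests on the claim that a module is loaded once and then served from ``sys.modules``, so a module-level dictionary is shared by every importer. The following is a minimal sketch of that behaviour and is not part of the commit. The module name ``mmif_docloc_demo``, the ``demo://`` scheme, the ``calls`` counter, and the fake ``/tmp/...`` path are all hypothetical, and the module is built in memory only so the sketch runs without files on disk.

```python
import importlib
import sys
import types

# Hypothetical stand-in for a docloc plugin module (not a real plugin).
plugin = types.ModuleType("mmif_docloc_demo")
plugin._cache = {}
plugin.calls = 0  # counts how often the "expensive" resolution really runs


def _resolve(docloc):
    # Same shape as the documented pattern: check the cache first.
    if docloc in plugin._cache:
        return plugin._cache[docloc]
    plugin.calls += 1
    # Placeholder for real work such as a download; the path is made up.
    resolved_path = "/tmp/" + docloc.split("://", 1)[1]
    plugin._cache[docloc] = resolved_path
    return resolved_path


plugin.resolve = _resolve
sys.modules["mmif_docloc_demo"] = plugin

# Two independent imports return the very same module object ...
first = importlib.import_module("mmif_docloc_demo")
second = importlib.import_module("mmif_docloc_demo")

# ... so the second caller hits the cache filled by the first one.
path_a = first.resolve("demo://clip")
path_b = second.resolve("demo://clip")
```

In this sketch ``first is second`` holds and ``plugin.calls`` stays at 1 after both calls, which is the sharing behaviour the three bullet points describe.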

mmif_docloc_http/__init__.py

Lines changed: 9 additions & 3 deletions
@@ -1,16 +1,22 @@
 import urllib.request
 import urllib.error
 
+_cache = {}
+
 
 def resolve(docloc):
+    if docloc in _cache:
+        return _cache[docloc]
     try:
         if docloc.startswith('http://') or docloc.startswith('https://'):
-            return urllib.request.urlretrieve(docloc)[0]
+            path = urllib.request.urlretrieve(docloc)[0]
+            _cache[docloc] = path
+            return path
         else:
             raise ValueError(f'cannot handle document location scheme: {docloc}')
     except urllib.error.URLError as e:
         raise e
-
-
+
+
 def help():
     return "location must be a URL string."

tests/test_serialize.py

Lines changed: 15 additions & 0 deletions
@@ -269,6 +269,21 @@ def test_document_location_helpers_http(self):
         # round_trip = Document(new_doc.serialize())
         self.assertEqual(Document(new_doc.serialize()).serialize(), new_doc.serialize())
 
+    def test_document_location_http_caching(self):
+        import mmif_docloc_http
+        mmif_docloc_http._cache.clear()
+        test_url = "https://example.com/"
+        self.assertNotIn(test_url, mmif_docloc_http._cache)
+        new_doc = Document()
+        new_doc.id = "d1"
+        new_doc.location = test_url
+        new_doc.location_path()
+        self.assertIn(test_url, mmif_docloc_http._cache)
+        # second call should use cache (same path returned)
+        cached_path = mmif_docloc_http._cache[test_url]
+        second_path = new_doc.location_path()
+        self.assertEqual(cached_path, second_path)
+
     def test_get_documents_locations(self):
         mmif_obj = Mmif(MMIF_EXAMPLES['everything'])
         self.assertEqual(1, len(mmif_obj.get_documents_locations(DocumentTypes.VideoDocument)))
