diff --git a/README.md b/README.md index 3e6e4f1..1d9674f 100644 --- a/README.md +++ b/README.md @@ -1,25 +1,163 @@ matscholar logo -`matscholar` (Materials Scholar) is a Python library for materials-focused natural language processing (NLP). It is maintained by a team of researchers at UC Berkeley and Lawrence Berkeley National Laboratory as part of a project funded by the Toyota Research Institute. +`matscholar` (Materials Scholar) is a Python library for materials-focused natural language +processing (NLP). It is maintained by a team of researchers at UC Berkeley and Lawrence Berkeley +National Laboratory as part of a project funded by the Toyota Research Institute. -This library provides a Python interface for interacting with the Materials Scholar API, performing basic NLP tasks on scientific text, and example notebooks on using these tools for materials discovery and design. +This library provides a Python interface for interacting with the Materials Scholar API, performing +basic NLP tasks on scientific text, and example notebooks on using these tools for materials +discovery and design. ## Setup -We *highly* recommend using a [conda environment](https://conda.io/docs/user-guide/tasks/manage-environments.html) when working with materials scholar tools. +We *highly* recommend using a [conda environment](https://conda.io/docs/user-guide/tasks/manage-environments.html) +when working with materials scholar tools. 1. Clone or download this repo 2. Navigate to the root directory (matscholar) 3. `pip install -r requirements.txt` -4. `pip install .` [or](https://stackoverflow.com/questions/15724093/difference-between-python-setup-py-install-and-pip-install) `python setup.py install` +4. 
`pip install .` [or](https://stackoverflow.com/questions/15724093/difference-between-python-setup-py-install-and-pip-install) +`python setup.py install` ## Configuring Your API Key The Materials Scholar API can only be accessed by providing an API key in `x-api-key` request header field. To receive an API key to access the Materials Scholar API, please contact John Dagdelen at jdagdelen@lbl.gov. -Once you have an API key, you can add it as an environment variable `MATSCHOLAR_API_KEY` for ease of use. +## API Usage + +For convenience, the Materials Scholar API can be accessed via a Python wrapper. + +### Instantiating the Rester + +If an API key has already been obtained, the rester is instantiated as follows: + +```python +from matscholar.rest import Rester + +rester = Rester(api_key="your-api-key", endpoint="api.matscholar.com") +``` + +To avoid passing the API key and endpoint as arguments, set the following environment variables +for ease of use: `MATSCHOLAR_API_KEY`, `MATERIALS_SCHOLAR_ENDPOINT`. + +### Resources + +The methods of the Rester class can be used to access resources of the Materials Scholar API. + +**Searching documents** + +Our corpus of materials science abstracts can be searched based on text matching +(Elasticsearch) or by filtering based on the named entities extracted from each document. +Entity-based searches support the following entity types: material, property, application, +descriptor, characterization, synthesis, phase. 
+ +To get the raw text of abstracts matching a given query: + +```python +# text match for "solid oxide fuel cells" +example_text = "solid oxide fuel cells" + +# entity filters: include documents mentioning BaZrO3 and nanoparticles; +# exclude documents mentioning thin films +example_entities = {"material": ["BaZrO3"], "descriptor": ["nanoparticle", "-thin film"]} + +docs = rester.search_documents(text=example_text, filters=example_entities) +``` + +This will return a list of dictionaries containing the raw text of each abstract along with +associated metadata. + +**Searching entities** + +We have extracted materials science named entities from nearly 3.5 million materials science +abstracts. Details on how this was performed can be found in Ref. [1]. + +The extracted named entities for each document associated with a query are returned by the +`search_entities` method. This method takes as input a dictionary with entity types as keys and lists of entities +as values. For example, to find all of the entities that co-occur with the material +"GaN": + +```python +docs = rester.search_entities(query={"material": ["GaN"]}) +``` + +This will return a list of dictionaries representing documents matching the query; each dict will contain +the DOI as well as each unique entity found in the corresponding abstract. + +A summary of the entities associated with a query can be generated using the `search_entities_summary` method. To get +statistics for entities co-occurring with GaN: + +```python +summary = rester.search_entities_summary(query={"material": ["GaN"]}) +``` +This will return a dictionary with entity types as keys; each value will be a list of the top entities +that occur in documents matching the query, and each item in that list will be [entity, document count, fraction]. + +To perform a fast literature review, the `search_materials_by_entities` method may be used. 
For a chosen application, +this will return a list of all materials that co-occur with that application in our corpus. For example, +to see which materials co-occur with the word thermoelectric in a document: + +```python +mat_list = rester.search_materials_by_entities(["thermoelectric"], elements=["-Pb"], cutoff=None) +``` + +The above search will find all materials co-occurring with thermoelectric that do not contain lead. +The result will be a list in which each element is a list of [material, co-occurrence count, co-occurrence DOIs]. + +**Word embeddings** + +The API serves materials science word embeddings trained using word2vec; details on how the embeddings were trained +and on their application to materials discovery can be found in Ref. [2]. + +To get the word embedding for a given word: +```python +embedding = rester.get_embedding("photovoltaics") +``` + +This will return a dict containing the embedding. The word embedding will be a 200-dimensional array. + +The rester also has a `get_close_words` method (based on cosine similarity of embeddings) which can be used to +explore the semantic similarity of materials science terms; this approach can be used to discover materials +for a new application (as outlined in Ref. [2]). + +To find words with an embedding similar to that of photovoltaics: + +```python +close_words = rester.get_close_words("photovoltaics", top_k=1000) +``` + +This will return the 1000 closest words to photovoltaics. The result will be a dictionary containing +the close words and their cosine similarity to the input word. + +**Named Entity Recognition** + +In addition to the pre-processed entities present in our corpus, users can perform Named Entity +Recognition on any raw materials science text. The details of the model can be found in Ref. [1]. + +The input should be a list of documents with the text represented as a string: + +```python +doc_1 = "The band gap of TiO2 is 3.2 eV. 
This was measured via photoluminescence" +doc_2 = "We deposit GaN thin films using MOCVD" +docs = [doc_1, doc_2] +tagged_docs = rester.perform_ner(docs, return_type="concatenated") +``` + +The `return_type` argument may be set to `iob`, `concatenated`, or `normalized`. The `normalized` option will replace +entities with their most frequently occurring synonym. A list of tagged documents will be returned. +Each doc is a list of sentences; each sentence is a list of (word, tag) pairs. + +## Citation + +If you use any of the API functionality in your research, please consider citing the following papers +where relevant: + +[1] Weston et al., coming soon + +[2] Tshitoyan et al., Nature (accepted) + ## Contributors @jdagdelen, @vtshitoyan, @lweston diff --git a/matscholar/rest.py b/matscholar/rest.py index 11dc1bf..be9e5d8 100644 --- a/matscholar/rest.py +++ b/matscholar/rest.py @@ -12,7 +12,7 @@ """ __author__ = "John Dagdelen" -__credits__ = "Shyue Ping Ong, Shreyas Cholia, Anubhav Jain" +__credits__ = "Leigh Weston, Amalie Trewartha, Vahe Tshitoyan" __copyright__ = "Copyright 2018, Materials Intelligence" __version__ = "0.1" __maintainer__ = "John Dagdelen" @@ -66,6 +66,7 @@ def __exit__(self, exc_type, exc_val, exc_tb): def _make_request(self, sub_url, payload=None, method="GET"): response = None url = self.preamble + sub_url + print(url) try: if method == "POST": response = self.session.post(url, json=payload, verify=True) @@ -88,7 +89,7 @@ def _make_request(self, sub_url, payload=None, method="GET"): if hasattr(response, "content") else str(ex) raise MatScholarRestError(msg) - def materials_search(self, positive, negative=None, ignore_missing=True, top_k=10): + def search_materials(self, positive, negative=None, ignore_missing=True, top_k=10): """ Given input strings or lists of positive and negative words / phrases, returns a ranked list of materials with corresponding scores and numbers of mentions @@ -111,7 +112,7 @@ def materials_search(self, positive, negative=None, 
ignore_missing=True, top_k=1 return self._make_request(sub_url, payload=payload, method=method) - def close_words(self, positive, negative=None, ignore_missing=True, top_k=10): + def get_close_words(self, positive, negative=None, ignore_missing=True, top_k=10): """ Given input strings or lists of positive and negative words / phrases, returns a list of most similar words / phrases according to cosine similarity @@ -187,32 +188,120 @@ def materials_map(self, highlight, limit=None, ignore_missing=True, number_to_su return self._make_request(sub_url, payload=payload, method=method) - def search_ents(self, query): - ''' + def search_entities(self, query): + """ Get the entities in each document associated with a given query :param query: dict; e.g., {'material': ['GaN', '-InN']), 'application': ['LED']} :return: list of dicts; each dict represents a document and contains the extracted entities - ''' - method = 'POST' - sub_url = '/ent_search' + """ + + method = "POST" + sub_url = "/ent_search" payload = query return self._make_request(sub_url, payload=payload, method=method) - def get_summary(self, query): + def get_close_journals(self, query): ''' + + :param query: string: a paragraph + :return: list: [['journal name', 'cosine similarity'], ...] + ''' + + method = 'POST' + sub_url = '/journal_suggestion' + payload = {'abstract': query} + + return self._make_request(sub_url, payload=payload, method=method) + + + def search_entities_summary(self, query): + """ Get a summary of the entities associated with a given query :param query: dict; e.g., {'material': ['GaN', '-InN']), 'application': ['LED']} :return: dict; a summary dict with keys for each entity type - ''' - method = 'POST' - sub_url = '/ent_search/summary' + """ + + method = "POST" + sub_url = "/ent_search/summary" payload = query return self._make_request(sub_url, payload=payload, method=method) + def get_close_materials(self, material): + """ + Finds the most similar compositions in the corpus. 
+ + :param material: string; a chemical composition + :return: list; the most similar compositions + """ + method = "GET" + sub_url = '/materials/similar/{}'.format(material) + return self._make_request(sub_url, method=method) + + def perform_ner(self, docs, return_type="concatenated"): + """ + Performs Named Entity Recognition. + + :param docs: list; a list of documents; each document is represented as a single string + :param return_type: string; output format, can be "iob", "concatenated", or "normalized" + :return: list; tagged documents + """ + + method = "POST" + sub_url = "/ner" + payload = { + "docs": docs, + "return_type": return_type + } + return self._make_request(sub_url, payload=payload, method=method) + + def search_materials_by_entities(self, entities, elements, cutoff=None): + """ + Finds materials that co-occur with specified entities. The returned materials can be screened + by specifying elements that must be included/excluded from the stoichiometry. + + :param entities: list of strings; each string is a property or application + :param elements: list of strings; each string is a chemical element. 
Materials + will only be returned if they contain these elements; the opposite can also be + achieved: materials can be removed from the returned list by placing a negative + sign in front of the element, e.g., "-Ti" + :param cutoff: int or None; if int, specifies the number of materials to + return; if None, returns all materials + :return: list; each item is [material, co-occurrence count, co-occurrence DOIs] + """ + + method = "POST" + sub_url = "/search/material_search" + payload = { + "entities": entities, + "elements": elements, + "cutoff": cutoff + } + return self._make_request(sub_url, payload=payload, method=method) + + def search_documents(self, text, filters, cutoff=None): + """ + Search abstracts by text with filters for entities + :param text: string; text to search + :param filters: dict; e.g., {'material': ['GaN', '-InN'], 'application': ['LED']} + :param cutoff: int or None; if int, specifies the number of matches to + return; if None, returns all matches + :return: list; a list of dicts, each with the raw text of an abstract and its metadata + """ + + method = "POST" + sub_url = "/search" + filters['text'] = text + payload = { + "query": filters, + "limit": cutoff + } + + return self._make_request(sub_url, payload=payload, method=method) + class MatScholarRestError(Exception): """ @@ -220,13 +309,3 @@ class MatScholarRestError(Exception): Raised when the query has problems, e.g., bad query format. 
""" pass - - -if __name__ == '__main__': - query = { - 'material' : ['GaN', '-InN'], - 'application' : ['LED'] - } - query = json.dumps(query) - rest = Rester() - print(rest.get_summary(query)) diff --git a/matscholar/tests/test_rest.py b/matscholar/tests/test_rest.py index a89b119..350b43f 100644 --- a/matscholar/tests/test_rest.py +++ b/matscholar/tests/test_rest.py @@ -7,9 +7,9 @@ class EmbeddingEngineTest(unittest.TestCase): r = Rester() - def test_materials_search(self): + def test_search_materials(self): - top_thermoelectrics = self.r.materials_search("thermoelectric", top_k=10) + top_thermoelectrics = self.r.search_materials("thermoelectric", top_k=10) self.assertListEqual(top_thermoelectrics["counts"], [2452, 9, 2598, 13, 5, 9, 831, 167, 8, 390]) self.assertListEqual(top_thermoelectrics["materials"], ['Bi2Te3', 'MgAgSb', 'PbTe', 'PbSe0.5Te0.5', @@ -63,7 +63,7 @@ def test_close_words(self): negatives, top_ks, ignore_missing, - close_words, + get_close_words, scores, processed_positives, processed_negatives): @@ -174,12 +174,71 @@ class EntSearchTest(unittest.TestCase): def test_ent_search(self): - result = self.rester.search_ents(self.test_query) - self.assertEqual(len(result), 738) + result = self.rester.search_entities(self.test_query) + self.assertEqual(len(result), 1126) self.assertTrue(all(key in result[0].keys() for key in self.KEYS)) def test_summary(self): - result = self.rester.get_summary(self.test_query) - self.assertEqual(result['MAT'][0][1], 738) + result = self.rester.search_entities_summary(self.test_query) + self.assertEqual(result['MAT'][0][1], 1126) subkeys = [key for key in self.KEYS if key != 'doi'] - self.assertTrue(all(key in result for key in subkeys)) \ No newline at end of file + self.assertTrue(all(key in result for key in subkeys)) + +class SimilarMaterialsTest(unittest.TestCase): + + rester = Rester() + + def test_similar_materials(self): + material = 'LiCoO2' + result = self.rester.get_close_materials(material) + 
self.assertEqual(len(result), 10) + similar_mats = ['CoLi2NiO4', 'Co3Li10Ni7O20', 'CoLi4Ni3O8', 'CoLi3MnO5', 'CoLi2O4Si', + 'FeLiO2', 'CoLi3MnNiO6', 'CoLi10Ni9O20', 'CoLiMnO4', 'Fe2Li3O4P'] + self.assertEqual(result, similar_mats) + +class NERTest(unittest.TestCase): + + rester = Rester() + TEST_DOCS = ["We synthesized AO2 (A = Sr, Ba) thin films. The band gap was 2.5 eV.", + "The lattice constant of ZnO is 3.8 A. This was measured using XRD."] + + def test_iob(self): + tagged_docs = self.rester.perform_ner(self.TEST_DOCS, return_type="iob") + print(tagged_docs) + self.assertEqual(len(tagged_docs), 2) + self.assertEqual(len(tagged_docs[0]), 2) + self.assertEqual(tagged_docs[0][0][2][1], "B-MAT") + + def test_concatenated(self): + tagged_docs = self.rester.perform_ner(self.TEST_DOCS, return_type="concatenated") + self.assertEqual(len(tagged_docs), 2) + self.assertEqual(len(tagged_docs[0]), 2) + self.assertEqual(tagged_docs[0][0][2][1], "MAT") + self.assertEqual(tagged_docs[0][0][2][0], "AO2 ( A = Sr , Ba )") + self.assertFalse(any("-" in tag for token, tag in tagged_docs[0][0])) + + def test_normalized(self): + tagged_docs = self.rester.perform_ner(self.TEST_DOCS, return_type="normalized") + self.assertEqual(len(tagged_docs), 2) + self.assertEqual(len(tagged_docs[0]), 2) + self.assertEqual(tagged_docs[0][0][2][1], "MAT") + self.assertTrue(isinstance(tagged_docs[0][0][2][0], list)) + +class MaterialSearchEntsTest(unittest.TestCase): + + rester = Rester() + TEST_QUERY = { + "entities": ["ferroelectric"], + "elements": ["O", "-Pb"], + "cutoff": None + } + + def test_search_materials(self): + result = self.rester.search_materials_by_entities(**self.TEST_QUERY) + self.assertEqual(result[0][0], "BaO3Ti") + self.assertTrue(not any("Pb" in mat for mat, _, _ in result)) + self.assertTrue(all("O" in mat for mat, _, _ in result)) + + + + diff --git a/requirements.txt b/requirements.txt index 878569c..eead8be 100644 --- a/requirements.txt +++ b/requirements.txt @@ -2,9 +2,10 
@@ pyyaml requests pytest numpy -unidecode +unidecode==1.0.23 regex monty pymatgen gensim chemdataextractor +pandas
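
As written in this diff, `search_documents` stores `text` inside the caller's `filters` dict, so a dict that is reused across calls keeps a stale `text` key. The following is a minimal sketch of building the same request body from a copy instead. `build_search_payload` is a hypothetical helper and is not part of `matscholar`; the payload keys (`query`, `limit`) are taken from the `search_documents` body in the diff, and nothing here contacts the API.

```python
import copy


def build_search_payload(text, filters, cutoff=None):
    """Hypothetical helper (not in matscholar): assemble the "/search" request body."""
    # Deep-copy so the caller's dict and the lists inside it are left untouched.
    query = copy.deepcopy(filters) if filters else {}
    query["text"] = text
    # Same shape as the payload built in search_documents in the diff.
    return {"query": query, "limit": cutoff}


# Filters from the README example: a leading "-" marks an entity to exclude.
example_entities = {"material": ["BaZrO3"], "descriptor": ["nanoparticle", "-thin film"]}
payload = build_search_payload("solid oxide fuel cells", example_entities)
print(payload)
```

With this approach `example_entities` can be passed to `search_entities` afterwards without carrying a `text` key that the entity routes may not expect; whether the server tolerates that extra key is not stated in the diff.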