Open
28 commits
3b3b273
add similar_materials to rester
lwestn Jan 4, 2019
21e8fb5
Merge pull request #17 from LeighWeston86/master
lwestn Jan 4, 2019
473a0a7
changed rest.py to add journal resource
Apr 8, 2019
76ba8c9
added get_ner_tags to Rester
lwestn Apr 8, 2019
33d2b4c
Merge pull request #18 from LeighWeston86/master
lwestn Apr 8, 2019
c70ea89
added materials_search_ents to Rester
lwestn Apr 12, 2019
5817440
Merge pull request #19 from LeighWeston86/master
lwestn Apr 12, 2019
6cf8ef1
Added search_text_with_ents API hook
amalietrewartha Apr 15, 2019
22f56f8
Merge pull request #20 from AmalieT/master
AmalieT Apr 16, 2019
9206e1a
Update requirements.txt
jdagdelen May 20, 2019
75d3c23
Merge branch 'journal_suggestion'
jdagdelen May 21, 2019
4c8bd01
adding journal suggestion
jdagdelen May 21, 2019
ae52859
Fixed text_search_with_ents method in rest.py
amalietrewartha May 22, 2019
e8b7859
Merge pull request #21 from AmalieT/master
AmalieT May 22, 2019
8e25800
add API documentation to README.md
lwestn May 22, 2019
dfc373d
Update README.md
lwestn May 22, 2019
618ae08
formatting for README.md
lwestn May 22, 2019
82286e5
endpoint
lwestn May 23, 2019
b007bfd
Merge branch 'master' of https://github.com/LeighWeston86/matscholar
lwestn May 23, 2019
937ccc0
Merge pull request #22 from LeighWeston86/master
lwestn May 23, 2019
3c5f064
Small changes to README.md
lwestn May 23, 2019
88e6a41
fix typo in README
computron Jun 1, 2019
cc518c2
fixing unidecode version error.
jdagdelen Jun 3, 2019
0e34c55
fix credits in rest.py
lwestn Jun 4, 2019
70dba7e
Merge pull request #27 from LeighWeston86/master
lwestn Jun 4, 2019
a0de448
minor renaming
cs464osu Jun 5, 2019
40771e1
minor renaming
cs464osu Jun 5, 2019
58259dd
minor renaming
cs464osu Jun 5, 2019
148 changes: 143 additions & 5 deletions README.md
@@ -1,25 +1,163 @@
<img src="docs/matscholar_logo.png" alt="matscholar logo" width="300px">

`matscholar` (Materials Scholar) is a Python library for materials-focused natural language processing (NLP). It is maintained by a team of researchers at UC Berkeley and Lawrence Berkeley National Laboratory as part of a project funded by the Toyota Research Institute.
`matscholar` (Materials Scholar) is a Python library for materials-focused natural language
processing (NLP). It is maintained by a team of researchers at UC Berkeley and Lawrence Berkeley
National Laboratory as part of a project funded by the Toyota Research Institute.

This library provides a Python interface for interacting with the Materials Scholar API, performing basic NLP tasks on scientific text, and example notebooks on using these tools for materials discovery and design.
This library provides a Python interface for interacting with the Materials Scholar API, performing
basic NLP tasks on scientific text, and example notebooks on using these tools for materials
discovery and design.


## Setup

We *highly* recommend using a [conda environment](https://conda.io/docs/user-guide/tasks/manage-environments.html) when working with materials scholar tools.
We *highly* recommend using a [conda environment](https://conda.io/docs/user-guide/tasks/manage-environments.html)
when working with materials scholar tools.

1. Clone or download this repo
2. Navigate to the root directory (matscholar)
3. `pip install -r requirements.txt`
4. `pip install .` [or](https://stackoverflow.com/questions/15724093/difference-between-python-setup-py-install-and-pip-install) `python setup.py install`
4. `pip install .` [or](https://stackoverflow.com/questions/15724093/difference-between-python-setup-py-install-and-pip-install)
`python setup.py install`


## Configuring Your API Key
The Materials Scholar API can only be accessed by providing an API key in the `x-api-key` request header field.
To receive an API key to access the Materials Scholar API, please contact John Dagdelen at jdagdelen@lbl.gov.

Once you have an API key, you can add it as an environment variable `MATSCHOLAR_API_KEY` for ease of use.
## API Usage

For convenience, the Materials Scholar API can be accessed via a Python wrapper.

### Instantiating the Rester

If an API key has already been obtained, the rester is instantiated as follows:

```python
from matscholar.rest import Rester

rester = Rester(api_key="your-api-key", endpoint="api.matscholar.com")
```

To avoid passing the API key and endpoint as arguments, set the following environment variables
for ease of use: `MATSCHOLAR_API_KEY`, `MATERIALS_SCHOLAR_ENDPOINT`.
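
As a rough illustration of that fallback, the sketch below resolves the key and endpoint from explicit arguments first and from the environment second. The helper name `resolve_credentials` is hypothetical and is not part of `matscholar`; only the two environment variable names come from this README.

```python
import os


def resolve_credentials(api_key=None, endpoint=None, environ=None):
    """Hypothetical helper: prefer explicit arguments, else use the environment."""
    env = os.environ if environ is None else environ
    key = api_key or env.get("MATSCHOLAR_API_KEY")
    url = endpoint or env.get("MATERIALS_SCHOLAR_ENDPOINT")
    if not key:
        raise ValueError("no API key given and MATSCHOLAR_API_KEY is not set")
    if not url:
        raise ValueError("no endpoint given and MATERIALS_SCHOLAR_ENDPOINT is not set")
    return key, url
```

Failing early with a clear message when either value is missing tends to be easier to debug than a rejected request later on.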

### Resources

The methods of the Rester class can be used to access resources of the Materials Scholar API.

**Searching documents**

Our corpus of materials science abstracts can be searched based on text matching
(ElasticSearch) or by filtering based on the Named Entities extracted from each document.
Entity based searches support the following entity types: material, property, application,
descriptor, characterization, synthesis, phase.

To get the raw text of abstracts matching a given query:

```python
# text match for "solid oxide fuel cells"
example_text = "solid oxide fuel cells"

# entity filters: include documents mentioning BaZrO3 and nanoparticles;
# exclude documents mentioning thin films
example_entities = {"material": ["BaZrO3"], "descriptor": ["nanoparticle", "-thin film"]}

docs = rester.search_documents(text=example_text, filters=example_entities)
```

This will return a list of dictionaries containing the raw text of each abstract along with
associated metadata.
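
The leading `-` in a filter value marks an exclusion. The sketch below shows one way to read that convention on the client side by splitting a filter dict into included and excluded terms. The function `split_filters` is hypothetical; how the API itself interprets the filters is not shown in this README.

```python
def split_filters(filters):
    """Hypothetical helper: separate included terms from '-'-prefixed exclusions."""
    include, exclude = {}, {}
    for entity_type, terms in filters.items():
        for term in terms:
            if term.startswith("-"):
                # Drop the marker and record the term as an exclusion.
                exclude.setdefault(entity_type, []).append(term[1:])
            else:
                include.setdefault(entity_type, []).append(term)
    return include, exclude


example_entities = {"material": ["BaZrO3"], "descriptor": ["nanoparticle", "-thin film"]}
include, exclude = split_filters(example_entities)
```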

**Searching entities**

We have extracted materials-science named entities from nearly 3.5 million materials science
abstracts. Details on how this was performed can be found in Ref. [1].

The extracted named entities for each document associated with a query are returned by the
search_entities method. This method takes as input a dictionary with entity types as keys and a list of entities
as values. For example, to find all of the entities that co-occur with the material
"GaN":

```python
docs = rester.search_entities(query={"material": ["GaN"]})
```

This will return a list of dictionaries representing documents matching the query; each dict will contain
the DOI as well as each unique entity found in the corresponding abstract.

A summary of the entities associated with a query can be generated using the search_entities_summary method. To get
statistics for entities co-occurring with GaN,

```python
summary = rester.search_entities_summary(query={"material": ["GaN"]})
```
This will return a dictionary with entity types as keys; the values will be a list of the top entities
that occur in documents matching the query. Each item in the list will be [entity, document count, fraction].
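
To make the [entity, document count, fraction] format concrete, here is a minimal sketch that builds such a list from per-document entity results. The function `summarize`, the dict keys, and the sample documents are all made up for illustration; the actual summary is computed by the API and may differ in detail.

```python
from collections import Counter


def summarize(docs, entity_type, top_k=3):
    """Hypothetical helper: rank entities of one type by how many documents mention them."""
    counts = Counter()
    for doc in docs:
        # Use a set so each entity counts at most once per document.
        counts.update(set(doc.get(entity_type, [])))
    total = len(docs)
    return [[entity, n, n / total] for entity, n in counts.most_common(top_k)]


# Made-up documents, only to show the shape of the data.
docs = [
    {"doi": "doc-1", "application": ["LED", "laser"]},
    {"doi": "doc-2", "application": ["LED"]},
    {"doi": "doc-3", "application": ["transistor"]},
    {"doi": "doc-4", "application": ["LED"]},
]
top = summarize(docs, "application", top_k=1)
```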

To perform a fast literature review, the search_materials_by_entities method may be used. For a chosen application,
this will return a list of all materials that co-occur with that application in our corpus. For example,
to see which materials co-occur with the word thermoelectric in a document,

```python
mat_list = rester.search_materials_by_entities(["thermoelectric"], elements=["-Pb"], cutoff=None)
```

The above search will find all materials co-occurring with thermoelectric that do not contain lead.
The result will be a list, with each element containing a list of [material, co-occurrence counts, co-occurrence dois].
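
The element screen can be pictured as a simple check on each formula. The sketch below is an assumption about the semantics described above (plain entries are required, `-`-prefixed entries are forbidden) and uses a deliberately naive regular expression to read element symbols. `passes_element_filter` is a hypothetical name, not an API method.

```python
import re

# Naive element-symbol pattern: one capital letter, optionally one lowercase letter.
ELEMENT = re.compile(r"[A-Z][a-z]?")


def passes_element_filter(formula, elements):
    """Hypothetical helper: apply required / excluded ("-X") element rules to a formula."""
    present = set(ELEMENT.findall(formula))
    for element in elements:
        if element.startswith("-"):
            if element[1:] in present:
                return False  # contains an excluded element
        elif element not in present:
            return False  # lacks a required element
    return True
```

For example, `passes_element_filter("PbTe", ["-Pb"])` is rejected while `passes_element_filter("Bi2Te3", ["-Pb"])` passes.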

**Word embeddings**

Materials science word embeddings were trained using word2vec; details on how the embeddings were trained,
and their application in materials science discovery, can be found in Ref. [2].

To get the word embedding for a given word,
```python
embedding = rester.get_embedding("photovoltaics")
```

This will return a dict containing the embedding. The word embedding will be a 200-dimensional array.

The rester also has a get_close_words method (based on cosine similarity of embeddings) which can be used to
explore the semantic similarity of materials science terms; this approach can be used to discover materials
for a new application (as outlined in the reference above).

To find words with a similar embedding to photovoltaics:

```python
close_words = rester.get_close_words("photovoltaics", top_k=1000)
```

This will return the 1000 closest words to photovoltaics. The result will be a dictionary containing
the close words and their cosine similarity to the input word.
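
For readers unfamiliar with the ranking, the sketch below computes cosine similarity and picks the nearest words from a tiny in-memory table. It is a pure-Python illustration with made-up 2-dimensional vectors and placeholder words, not the server-side implementation; the function names are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def closest_words(target, embeddings, top_k=2):
    """Rank every other word by cosine similarity to the target word."""
    scores = [
        (word, cosine_similarity(embeddings[target], vector))
        for word, vector in embeddings.items()
        if word != target
    ]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:top_k]


# Toy vectors; real embeddings from the API are 200-dimensional.
toy = {"word_a": [1.0, 0.0], "word_b": [1.0, 0.1], "word_c": [0.0, 1.0]}
nearest = closest_words("word_a", toy, top_k=1)
```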

**Named Entity Recognition**

In addition to the pre-processed entities present in our corpus, users can perform Named Entity
Recognition on any raw materials science text. The details of the model can be found in Ref. [1].

The input should be a list of documents with the text represented as a string:

```python
doc_1 = "The band gap of TiO2 is 3.2 eV. This was measured via photoluminescence"
doc_2 = "We deposit GaN thin films using MOCVD"
docs = [doc_1, doc_2]
tagged_docs = rester.perform_ner(docs, return_type="concatenated")
```

The argument return_type may be set to iob, concatenated, or normalized. The latter will replace
entities with their most frequently occurring synonym. A list of tagged documents will be returned.
Each doc is a list of sentences; each sentence is a list of (word, tag) pairs.
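
Given that nesting (documents, then sentences, then (word, tag) pairs), the tagged output can be flattened into a per-tag list of entities. The sketch below assumes that untagged words carry an `"O"` tag and uses made-up tag labels (`MAT`, `DSC`, `SMT`); neither the outside tag nor the label names are confirmed by this README, and `collect_entities` is a hypothetical helper.

```python
from collections import defaultdict


def collect_entities(tagged_docs, outside_tag="O"):
    """Hypothetical helper: group tagged words by tag, skipping the assumed outside tag."""
    found = defaultdict(list)
    for doc in tagged_docs:
        for sentence in doc:
            for word, tag in sentence:
                if tag != outside_tag:
                    found[tag].append(word)
    return dict(found)


# Hand-written sample in the documented shape; the tag labels are illustrative only.
sample = [[[("We", "O"), ("deposit", "O"), ("GaN", "MAT"),
            ("thin films", "DSC"), ("using", "O"), ("MOCVD", "SMT")]]]
entities = collect_entities(sample)
```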

## Citation

If you use any of the API functionality in your research, please consider citing the following papers
where relevant:

[1] Weston et al., coming soon

[2] Tshitoyan et al., Nature (accepted)


## Contributors
@jdagdelen, @vtshitoyan, @lweston
123 changes: 101 additions & 22 deletions matscholar/rest.py
@@ -12,7 +12,7 @@
"""

__author__ = "John Dagdelen"
__credits__ = "Shyue Ping Ong, Shreyas Cholia, Anubhav Jain"
__credits__ = "Leigh Weston, Amalie Trewartha, Vahe Tshitoyan"
__copyright__ = "Copyright 2018, Materials Intelligence"
__version__ = "0.1"
__maintainer__ = "John Dagdelen"
@@ -66,6 +66,7 @@ def __exit__(self, exc_type, exc_val, exc_tb):
def _make_request(self, sub_url, payload=None, method="GET"):
response = None
url = self.preamble + sub_url
print(url)
try:
if method == "POST":
response = self.session.post(url, json=payload, verify=True)
@@ -88,7 +89,7 @@ def _make_request(self, sub_url, payload=None, method="GET"):
if hasattr(response, "content") else str(ex)
raise MatScholarRestError(msg)

def materials_search(self, positive, negative=None, ignore_missing=True, top_k=10):
def search_materials(self, positive, negative=None, ignore_missing=True, top_k=10):
"""
Given input strings or lists of positive and negative words / phrases, returns a ranked list of materials with
corresponding scores and numbers of mentions
@@ -111,7 +112,7 @@ def materials_search(self, positive, negative=None, ignore_missing=True, top_k=1

return self._make_request(sub_url, payload=payload, method=method)

def close_words(self, positive, negative=None, ignore_missing=True, top_k=10):
def get_close_words(self, positive, negative=None, ignore_missing=True, top_k=10):
"""
Given input strings or lists of positive and negative words / phrases, returns a list of most similar words /
phrases according to cosine similarity
@@ -187,46 +188,124 @@ def materials_map(self, highlight, limit=None, ignore_missing=True, number_to_su

return self._make_request(sub_url, payload=payload, method=method)

def search_ents(self, query):
'''
def search_entities(self, query):
"""
Get the entities in each document associated with a given query

:param query: dict; e.g., {'material': ['GaN', '-InN'], 'application': ['LED']}
:return: list of dicts; each dict represents a document and contains the extracted entities
'''
method = 'POST'
sub_url = '/ent_search'
"""

method = "POST"
sub_url = "/ent_search"
payload = query

return self._make_request(sub_url, payload=payload, method=method)

def get_summary(self, query):
def get_close_journals(self, query):
'''

:param query: string: a paragraph
:return: list: [['journal name', 'cosine similarity'], ...]
'''

method = 'POST'
sub_url = '/journal_suggestion'
payload = {'abstract': query}

return self._make_request(sub_url, payload=payload, method=method)


def search_entities_summary(self, query):
"""
Get a summary of the entities associated with a given query

:param query: dict; e.g., {'material': ['GaN', '-InN'], 'application': ['LED']}
:return: dict; a summary dict with keys for each entity type
'''
method = 'POST'
sub_url = '/ent_search/summary'
"""

method = "POST"
sub_url = "/ent_search/summary"
payload = query

return self._make_request(sub_url, payload=payload, method=method)

def get_close_materials(self, material):
"""
Finds the most similar compositions in the corpus.

:param material: string; a chemical composition
:return: list; the most similar compositions
"""
method = "GET"
sub_url = '/materials/similar/{}'.format(material)
return self._make_request(sub_url, method=method)

def perform_ner(self, docs, return_type="concatenated"):
"""
Performs Named Entity Recognition.

:param docs: list; a list of documents; each document is represented as a single string
:param return_type: string; output format, can be "iob", "concatenated", or "normalized"
:return: list; tagged documents
"""

method = "POST"
sub_url = "/ner"
payload = {
"docs": docs,
"return_type": return_type
}
return self._make_request(sub_url, payload=payload, method=method)

def search_materials_by_entities(self, entities, elements, cutoff=None):
"""
Finds materials that co-occur with specified entities. The returned materials can be screened
by specifying elements that must be included/excluded from the stoichiometry.

:param entities: list of strings; each string is a property or application
:param elements: list of strings; each string is a chemical element. Materials
will only be returned if they contain these elements; the opposite can also be
achieved - materials can be removed from the returned list by placing a negative
sign in front of the element, e.g., "-Ti"
:param cutoff: int or None; if int, specifies the number of materials to
return; if None, returns all materials
:return: list; a list of chemical compositions
"""

method = "POST"
sub_url = "/search/material_search"
payload = {
"entities": entities,
"elements": elements,
"cutoff": cutoff
}
return self._make_request(sub_url, payload=payload, method=method)

def search_documents(self, text, filters, cutoff=None):
"""
Search abstracts by text with filters for entities
:param text: string; text to search
:param filters: dict; e.g., {'material': ['GaN', '-InN'], 'application': ['LED']}
:param cutoff: int or None; if int, specifies the number of matches to
return; if None, returns all matches
:return: list; a list of dicts, each containing the raw text of a matching abstract and its metadata
"""

method = "POST"
sub_url = "/search"
filters['text'] = text
payload = {
"query": filters,
"limit": cutoff
}

return self._make_request(sub_url, payload=payload, method=method)


class MatScholarRestError(Exception):
"""
Exception class for MatstractRester.
Raised when the query has problems, e.g., bad query format.
"""
pass


if __name__ == '__main__':
query = {
'material' : ['GaN', '-InN'],
'application' : ['LED']
}
query = json.dumps(query)
rest = Rester()
print(rest.get_summary(query))