paperscraper.pubmed
dump_papers(papers: pd.DataFrame, filepath: str) -> None
Receives a pd.DataFrame with one paper per row and dumps it into a .jsonl file with one paper per line.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
papers | DataFrame | A dataframe of paper metadata, one paper per row. | required |
filepath | str | Path to the output file; per the description above, a .jsonl file. | required |
Source code in paperscraper/utils.py
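The one-paper-per-line layout can be illustrated with a short sketch that uses only the standard library. It is not the library's implementation: plain dicts stand in for the DataFrame rows, and the helper name `write_jsonl` and the sample records are made up for illustration.

```python
import json
import os
import tempfile


def write_jsonl(records, filepath):
    """Write one JSON object per line (hypothetical helper, not part of paperscraper)."""
    with open(filepath, "w", encoding="utf-8") as handle:
        for record in records:
            handle.write(json.dumps(record) + "\n")


# Made-up records standing in for the rows of the papers DataFrame.
records = [
    {"title": "Paper A", "doi": "10.0000/a"},
    {"title": "Paper B", "doi": "10.0000/b"},
]

path = os.path.join(tempfile.mkdtemp(), "papers.jsonl")
write_jsonl(records, path)

# Reading the file back: each line parses independently as one paper.
with open(path, encoding="utf-8") as handle:
    loaded = [json.loads(line) for line in handle]
```

Because every line is a complete JSON object, such a file can be read or appended to line by line without loading the whole dump.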
get_emails(paper: PubMedArticle) -> List
Extracts author email addresses from a PubMedArticle.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
paper | PubMedArticle | An object of type PubMedArticle. Must have an 'author' field. | required |
Returns:
Name | Type | Description |
---|---|---|
List | List | A possibly empty list of emails associated with the authors of the paper. |
Source code in paperscraper/pubmed/utils.py
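As a rough illustration of what such an extraction might involve, the sketch below pulls email addresses out of free-text affiliation strings. It rests on assumptions not stated above: that authors are available as dicts with an `affiliation` string, and that emails appear inside that string. The helper name `extract_emails`, the regular expression, and the sample data are hypothetical and do not reflect the library's actual code.

```python
import re

# Simple pattern for illustration; it does not cover every valid address.
EMAIL_PATTERN = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def extract_emails(authors):
    """Collect unique email addresses from affiliation strings (hypothetical helper)."""
    emails = []
    for author in authors:
        # Assumed layout: each author is a dict with an optional 'affiliation' string.
        affiliation = author.get("affiliation") or ""
        for email in EMAIL_PATTERN.findall(affiliation):
            if email not in emails:
                emails.append(email)
    return emails


# Made-up author entries for illustration.
authors = [
    {"lastname": "Doe", "affiliation": "Example University. jane.doe@example.org."},
    {"lastname": "Roe", "affiliation": None},
]
found = extract_emails(authors)
none_found = extract_emails([{"lastname": "Poe", "affiliation": "Example Institute"}])
```

The second call shows the "possibly empty list" case: with no address in any affiliation, the result is an empty list rather than an error.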
get_query_from_keywords_and_date(keywords: List[Union[str, List]], start_date: str = 'None', end_date: str = 'None') -> str
Receives a list of keywords and returns the query for the PubMed API.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
keywords | List[Union[str, List[str]]] | Items will be AND separated. If items are lists themselves, they will be OR separated. | required |
start_date | str | Start date for the search. Needs to be in format: YYYY/MM/DD, e.g. '2020/07/20'. Defaults to 'None', i.e. no specific dates are used. | 'None' |
end_date | str | End date for the search. Same notation as start_date. | 'None' |
If start_date and end_date are left as default, the function is identical to get_query_from_keywords.
Returns:
Name | Type | Description |
---|---|---|
str | str | Query to enter into the PubMed API. |
Source code in paperscraper/pubmed/utils.py
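The date handling described above could look roughly like the following sketch. The exact clause the library appends is not shown here, so the helper name `add_date_range`, the `[Date - Publication]` clause layout, and the open-ended bounds used when only one date is given are all assumptions made for illustration.

```python
from datetime import datetime

DATE_FORMAT = "%Y/%m/%d"


def add_date_range(query, start_date="None", end_date="None"):
    """Append a date-range clause to a query (hypothetical helper, assumed clause syntax)."""
    if start_date == "None" and end_date == "None":
        # No dates given: the keyword query is returned unchanged.
        return query
    # Assumed open-ended bounds when only one side is given.
    start = "1900/01/01" if start_date == "None" else start_date
    end = "3000/12/31" if end_date == "None" else end_date
    for value in (start, end):
        # Raises ValueError if a date cannot be parsed as year/month/day.
        datetime.strptime(value, DATE_FORMAT)
    return f'{query} AND ("{start}"[Date - Publication] : "{end}"[Date - Publication])'


unchanged = add_date_range("(covid)")
ranged = add_date_range("(covid)", "2020/07/20", "2020/12/31")
```

With both dates left at 'None' the query comes back untouched, which mirrors the note that the function then behaves like get_query_from_keywords.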
get_pubmed_papers(query: str, fields: List = ['title', 'authors', 'date', 'abstract', 'journal', 'doi'], max_results: int = 9998, *args, **kwargs) -> pd.DataFrame
Performs a PubMed API request for a query and returns the papers with the desired fields.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
query | str | Query to PubMed API. Needs to match PubMed API notation. | required |
fields | List | List of strings with fields to keep in output. NOTE: If 'emails' is passed, an attempt is made to extract author email addresses. | ['title', 'authors', 'date', 'abstract', 'journal', 'doi'] |
max_results | int | Maximum number of results retrieved from the database. Defaults to 9998; higher values are likely to cause problems due to the PubMed API, see: https://stackoverflow.com/questions/75353091/biopython-entrez-article-limit | 9998 |
args | | Additional arguments for pubmed.query. | () |
kwargs | | Additional arguments for pubmed.query. | {} |
Returns:
Type | Description |
---|---|
DataFrame | pd.DataFrame. One paper per row. |
Source code in paperscraper/pubmed/pubmed.py
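The effect of fields and max_results on the output can be sketched on plain Python data. This is only an illustration under stated assumptions: the real function queries the PubMed API and returns a pd.DataFrame, whereas the hypothetical helper `select_fields` below works on made-up dicts and makes no network request.

```python
DEFAULT_FIELDS = ["title", "authors", "date", "abstract", "journal", "doi"]


def select_fields(papers, fields=None, max_results=9998):
    """Keep only the requested fields of the first max_results papers (hypothetical helper)."""
    if fields is None:
        fields = DEFAULT_FIELDS
    return [
        # Fields missing from a record are filled with None.
        {field: paper.get(field) for field in fields}
        for paper in papers[:max_results]
    ]


# Made-up records for illustration.
papers = [
    {"title": "Paper A", "doi": "10.0000/a", "journal": "Journal X", "keywords": ["k1"]},
    {"title": "Paper B", "doi": "10.0000/b", "journal": "Journal Y", "keywords": ["k2"]},
    {"title": "Paper C", "doi": "10.0000/c", "journal": "Journal Z", "keywords": ["k3"]},
]
subset = select_fields(papers, fields=["title", "doi"], max_results=2)
```

Here `subset` holds two records with only the `title` and `doi` keys, analogous to a DataFrame with two rows and two columns.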
get_and_dump_pubmed_papers(keywords: List[Union[str, List[str]]], output_filepath: str, fields: List = ['title', 'authors', 'date', 'abstract', 'journal', 'doi'], start_date: str = 'None', end_date: str = 'None', *args, **kwargs) -> None
Combines get_pubmed_papers and dump_papers.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
keywords | List[Union[str, List[str]]] | List of keywords for the PubMed API request. Items on the outer list level are treated as AND separated, items on the inner level as OR separated. | required |
output_filepath | str | Path where the dump will be saved. | required |
fields | List | List of strings with fields to keep in output. Defaults to ['title', 'authors', 'date', 'abstract', 'journal', 'doi']. NOTE: If 'emails' is passed, an attempt is made to extract author email addresses. | ['title', 'authors', 'date', 'abstract', 'journal', 'doi'] |
start_date | str | Start date for the search. Needs to be in format: YYYY/MM/DD, e.g. '2020/07/20'. Defaults to 'None', i.e. no specific dates are used. | 'None' |
end_date | str | End date for the search. Same notation as start_date. | 'None' |
Source code in paperscraper/pubmed/pubmed.py
utils
get_query_from_keywords(keywords: List[Union[str, List]]) -> str
Receives a list of keywords and returns the query for the PubMed API.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
keywords | List[Union[str, List[str]]] | Items will be AND separated. If items are lists themselves, they will be OR separated. | required |
Returns:
Name | Type | Description |
---|---|---|
str | str | Query to enter into the PubMed API. |
Source code in paperscraper/pubmed/utils.py
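The AND/OR convention for nested keyword lists can be sketched as follows. The helper name `build_query` and the exact parenthesization are assumptions made for illustration; the string the library actually produces may be formatted differently.

```python
def build_query(keywords):
    """Join keywords with AND; join nested lists with OR (hypothetical helper)."""
    clauses = []
    for item in keywords:
        if isinstance(item, str):
            clauses.append(f"({item})")
        else:
            # A nested list holds alternatives for a single concept.
            alternatives = " OR ".join(f"({synonym})" for synonym in item)
            clauses.append(f"({alternatives})")
    return " AND ".join(clauses)


query = build_query([["covid-19", "sars-cov-2"], "vaccine"])
# query == "((covid-19) OR (sars-cov-2)) AND (vaccine)"
```

Grouping synonyms in an inner list widens the match for one concept, while each additional outer item narrows the overall result set.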