API Reference¶
paperscraper
¶
Initialize the module.
dump_queries(keywords: List[List[Union[str, List[str]]]], dump_root: str) -> None
¶
Performs a keyword search on all available servers and dumps the results.
Parameters:

Name | Type | Description | Default
---|---|---|---
keywords | List[List[Union[str, List[str]]]] | List of lists of keywords. Each second-level list is considered a separate query. Within each query, each item (whether str or List[str]) is AND separated. If an item is itself a list, its strings are considered synonyms (OR separated). | required
dump_root | str | Path to root for dumping. | required
Source code in paperscraper/__init__.py
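A minimal usage sketch based on the signature above; the keyword nesting and output folder are illustrative:

```python
from paperscraper import dump_queries

# One query: (covid OR sars-cov-2) AND (deep learning OR neural network)
queries = [
    [["covid", "sars-cov-2"], ["deep learning", "neural network"]],
]

# Results from every available server are dumped as .jsonl files under this root.
dump_queries(queries, dump_root="dumps")
```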
arxiv
¶
XRXivQuery
¶
Query class.
Source code in paperscraper/xrxiv/xrxiv_query.py
__init__(dump_filepath: str, fields: List[str] = ['title', 'doi', 'authors', 'abstract', 'date', 'journal'])
¶
Initialize the query class.
Parameters:

Name | Type | Description | Default
---|---|---|---
dump_filepath | str | Filepath to the dump to be queried. | required
fields | List[str] | Fields contained in the dump per paper. Defaults to ['title', 'doi', 'authors', 'abstract', 'date', 'journal']. | ['title', 'doi', 'authors', 'abstract', 'date', 'journal']
Source code in paperscraper/xrxiv/xrxiv_query.py
search_keywords(keywords: List[Union[str, List[str]]], fields: List[str] = None, output_filepath: str = None) -> pd.DataFrame
¶
Search for papers in the dump using keywords.
Parameters:

Name | Type | Description | Default
---|---|---|---
keywords | List[Union[str, List[str]]] | Items will be AND separated. If items are lists themselves, they will be OR separated. | required
fields | List[str] | Fields to be used in the query search. Defaults to None, i.e., search in all fields excluding date. | None
output_filepath | str | Optional output filepath where to store the hits in JSONL format. Defaults to None, i.e., no export to a file. | None

Returns:

Type | Description
---|---
DataFrame | pd.DataFrame: A dataframe with one paper per row.
Source code in paperscraper/xrxiv/xrxiv_query.py
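A short sketch of querying a local dump; the dump filepath is a placeholder:

```python
from paperscraper.xrxiv.xrxiv_query import XRXivQuery

querier = XRXivQuery("server_dumps/biorxiv_2024-01-01.jsonl")  # hypothetical dump file

# (mrna OR "messenger rna") AND vaccine
hits = querier.search_keywords([["mrna", "messenger rna"], "vaccine"])
print(len(hits), "matching papers")
```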
dump_papers(papers: pd.DataFrame, filepath: str) -> None
¶
Receives a pd.DataFrame, one paper per row and dumps it into a .jsonl file with one paper per line.
Parameters:

Name | Type | Description | Default
---|---|---|---
papers | DataFrame | A dataframe of paper metadata, one paper per row. | required
filepath | str | Path to dump the papers; has to end with .jsonl. | required
Source code in paperscraper/utils.py
get_query_from_keywords(keywords: List[Union[str, List[str]]], start_date: str = 'None', end_date: str = 'None') -> str
¶
Receives a list of keywords and returns the query for the arxiv API.
Parameters:

Name | Type | Description | Default
---|---|---|---
keywords | List[Union[str, List[str]]] | Items will be AND separated. If items are lists themselves, they will be OR separated. | required
start_date | str | Start date for the search. Needs to be in format YYYY-MM-DD, e.g. '2020-07-20'. Defaults to 'None', i.e. no specific dates are used. | 'None'
end_date | str | End date for the search. Same notation as start_date. | 'None'

Returns:

Name | Type | Description
---|---|---
str | str | Query to enter to the arxiv API.
Source code in paperscraper/arxiv/utils.py
get_arxiv_papers_local(keywords: List[Union[str, List[str]]], fields: List[str] = None, output_filepath: str = None) -> pd.DataFrame
¶
Search for papers in the dump using keywords.
Parameters:

Name | Type | Description | Default
---|---|---|---
keywords | List[Union[str, List[str]]] | Items will be AND separated. If items are lists themselves, they will be OR separated. | required
fields | List[str] | Fields to be used in the query search. Defaults to None, i.e., search in all fields excluding date. | None
output_filepath | str | Optional output filepath where to store the hits in JSONL format. Defaults to None, i.e., no export to a file. | None

Returns:

Type | Description
---|---
DataFrame | pd.DataFrame: A dataframe with one paper per row.
Source code in paperscraper/arxiv/arxiv.py
get_arxiv_papers_api(query: str, fields: List = ['title', 'authors', 'date', 'abstract', 'journal', 'doi'], max_results: int = 99999, client_options: Dict = {'num_retries': 10}, search_options: Dict = dict(), verbose: bool = True) -> pd.DataFrame
¶
Performs an arXiv API request for a given query and returns a list of papers with the desired fields.
Parameters:

Name | Type | Description | Default
---|---|---|---
query | str | Query to arxiv API. Needs to match the arxiv API notation. | required
fields | List | List of strings with fields to keep in output. | ['title', 'authors', 'date', 'abstract', 'journal', 'doi']
max_results | int | Maximal number of results, defaults to 99999. | 99999
client_options | Dict | Optional arguments for the underlying arxiv API client. | {'num_retries': 10}
search_options | Dict | Optional arguments for the underlying arxiv API search. | dict()

Returns:

Type | Description
---|---
DataFrame | pd.DataFrame: One row per paper.
Source code in paperscraper/arxiv/arxiv.py
get_and_dump_arxiv_papers(keywords: List[Union[str, List[str]]], output_filepath: str, fields: List = ['title', 'authors', 'date', 'abstract', 'journal', 'doi'], start_date: str = 'None', end_date: str = 'None', backend: Literal['api', 'local', 'infer'] = 'api', *args, **kwargs)
¶
Combines get_arxiv_papers and dump_papers.
Parameters:

Name | Type | Description | Default
---|---|---|---
keywords | List[Union[str, List[str]]] | List of keywords for arxiv search. The outer list level will be considered as AND separated keys, the inner level as OR separated. | required
output_filepath | str | Path where the dump will be saved. | required
fields | List | List of strings with fields to keep in output. Defaults to ['title', 'authors', 'date', 'abstract', 'journal', 'doi']. | ['title', 'authors', 'date', 'abstract', 'journal', 'doi']
start_date | str | Start date for the search. Needs to be in format YYYY/MM/DD, e.g. '2020/07/20'. Defaults to 'None', i.e. no specific dates are used. | 'None'
end_date | str | End date for the search. Same notation as start_date. | 'None'
backend | Literal['api', 'local', 'infer'] | Which backend to use: 'api' queries the arXiv API, 'local' searches a local dump, and 'infer' picks automatically. Defaults to 'api'. | 'api'
Source code in paperscraper/arxiv/arxiv.py
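A usage sketch; the keyword structure and output path are illustrative:

```python
from paperscraper.arxiv import get_and_dump_arxiv_papers

covid = ["covid", "sars-cov-2"]
ai = ["deep learning", "neural network"]

# (covid OR sars-cov-2) AND (deep learning OR neural network)
get_and_dump_arxiv_papers([covid, ai], output_filepath="arxiv_covid_ai.jsonl")
```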
arxiv
¶
get_arxiv_papers_local(keywords: List[Union[str, List[str]]], fields: List[str] = None, output_filepath: str = None) -> pd.DataFrame
¶
Search for papers in the dump using keywords.
Parameters:

Name | Type | Description | Default
---|---|---|---
keywords | List[Union[str, List[str]]] | Items will be AND separated. If items are lists themselves, they will be OR separated. | required
fields | List[str] | Fields to be used in the query search. Defaults to None, i.e., search in all fields excluding date. | None
output_filepath | str | Optional output filepath where to store the hits in JSONL format. Defaults to None, i.e., no export to a file. | None

Returns:

Type | Description
---|---
DataFrame | pd.DataFrame: A dataframe with one paper per row.
Source code in paperscraper/arxiv/arxiv.py
get_arxiv_papers_api(query: str, fields: List = ['title', 'authors', 'date', 'abstract', 'journal', 'doi'], max_results: int = 99999, client_options: Dict = {'num_retries': 10}, search_options: Dict = dict(), verbose: bool = True) -> pd.DataFrame
¶
Performs an arXiv API request for a given query and returns a list of papers with the desired fields.
Parameters:

Name | Type | Description | Default
---|---|---|---
query | str | Query to arxiv API. Needs to match the arxiv API notation. | required
fields | List | List of strings with fields to keep in output. | ['title', 'authors', 'date', 'abstract', 'journal', 'doi']
max_results | int | Maximal number of results, defaults to 99999. | 99999
client_options | Dict | Optional arguments for the underlying arxiv API client. | {'num_retries': 10}
search_options | Dict | Optional arguments for the underlying arxiv API search. | dict()

Returns:

Type | Description
---|---
DataFrame | pd.DataFrame: One row per paper.
Source code in paperscraper/arxiv/arxiv.py
get_and_dump_arxiv_papers(keywords: List[Union[str, List[str]]], output_filepath: str, fields: List = ['title', 'authors', 'date', 'abstract', 'journal', 'doi'], start_date: str = 'None', end_date: str = 'None', backend: Literal['api', 'local', 'infer'] = 'api', *args, **kwargs)
¶
Combines get_arxiv_papers and dump_papers.
Parameters:

Name | Type | Description | Default
---|---|---|---
keywords | List[Union[str, List[str]]] | List of keywords for arxiv search. The outer list level will be considered as AND separated keys, the inner level as OR separated. | required
output_filepath | str | Path where the dump will be saved. | required
fields | List | List of strings with fields to keep in output. Defaults to ['title', 'authors', 'date', 'abstract', 'journal', 'doi']. | ['title', 'authors', 'date', 'abstract', 'journal', 'doi']
start_date | str | Start date for the search. Needs to be in format YYYY/MM/DD, e.g. '2020/07/20'. Defaults to 'None', i.e. no specific dates are used. | 'None'
end_date | str | End date for the search. Same notation as start_date. | 'None'
backend | Literal['api', 'local', 'infer'] | Which backend to use: 'api' queries the arXiv API, 'local' searches a local dump, and 'infer' picks automatically. Defaults to 'api'. | 'api'
Source code in paperscraper/arxiv/arxiv.py
utils
¶
format_date(date_str: str) -> str
¶
Converts a date in YYYY-MM-DD format to arXiv's YYYYMMDDTTTT format.
get_query_from_keywords(keywords: List[Union[str, List[str]]], start_date: str = 'None', end_date: str = 'None') -> str
¶
Receives a list of keywords and returns the query for the arxiv API.
Parameters:

Name | Type | Description | Default
---|---|---|---
keywords | List[Union[str, List[str]]] | Items will be AND separated. If items are lists themselves, they will be OR separated. | required
start_date | str | Start date for the search. Needs to be in format YYYY-MM-DD, e.g. '2020-07-20'. Defaults to 'None', i.e. no specific dates are used. | 'None'
end_date | str | End date for the search. Same notation as start_date. | 'None'

Returns:

Name | Type | Description
---|---|---
str | str | Query to enter to the arxiv API.
Source code in paperscraper/arxiv/utils.py
async_utils
¶
optional_async(func: Callable[..., Awaitable[T]]) -> Callable[..., Union[T, Awaitable[T]]]
¶
Allows an async function to be called from sync code (blocks until done) or from within an async context (returns a coroutine to await).
Source code in paperscraper/async_utils.py
retry_with_exponential_backoff(*, max_retries: int = 5, base_delay: float = 1.0) -> Callable[[F], F]
¶
Decorator factory that retries an async function on HTTP 429 responses, with exponential backoff.
Parameters:

Name | Type | Description | Default
---|---|---|---
max_retries | int | How many times to retry before giving up. | 5
base_delay | float | Initial delay in seconds; each subsequent delay doubles the previous one. | 1.0
Example:

```python
@retry_with_exponential_backoff(max_retries=3, base_delay=0.5)
async def fetch_data(...):
    ...
```
Source code in paperscraper/async_utils.py
citations
¶
citations
¶
get_citations_by_doi(doi: str) -> int
¶
Get the number of citations of a paper according to semantic scholar.
Parameters:

Name | Type | Description | Default
---|---|---|---
doi | str | The DOI of the paper. | required

Returns:

Type | Description
---|---
int | The number of citations.
Source code in paperscraper/citations/citations.py
get_citations_from_title(title: str) -> int
¶
Parameters:

Name | Type | Description | Default
---|---|---|---
title | str | Title of paper to be searched on Scholar. | required

Raises:

Type | Description
---|---
TypeError | If something other than str is passed.

Returns:

Name | Type | Description
---|---|---
int | int | Number of citations of paper.
Source code in paperscraper/citations/citations.py
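A brief sketch combining both lookups; the DOI and title below are placeholders:

```python
from paperscraper.citations.citations import get_citations_by_doi, get_citations_from_title

n_by_doi = get_citations_by_doi("10.1101/798496")  # placeholder DOI
n_by_title = get_citations_from_title("Attention is all you need")
print(n_by_doi, n_by_title)
```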
entity
¶
core
¶
Entity
¶
An abstract entity class with a set of utilities shared by the objects that perform self-linking analyses, such as Paper and Researcher.
Source code in paperscraper/citations/entity/core.py
paper
¶
Paper
¶
Bases: Entity
Source code in paperscraper/citations/entity/paper.py
__init__(input: str, mode: ModeType = 'infer')
¶Set up a Paper object for analysis.
Parameters:

Name | Type | Description | Default
---|---|---|---
input | str | Paper identifier. This can be the title, DOI or semantic scholar ID of the paper. | required
mode | ModeType | The format in which the ID was provided. Defaults to "infer". | 'infer'

Raises:

Type | Description
---|---
ValueError | If unknown mode is given.
Source code in paperscraper/citations/entity/paper.py
self_references()
¶Extracts the self references of a paper, for each author.
self_citations()
¶Extracts the self citations of a paper, for each author.
get_result() -> Optional[PaperResult]
¶Provides the result of the analysis.
Returns: PaperResult if available.
Source code in paperscraper/citations/entity/paper.py
researcher
¶
Researcher
¶
Bases: Entity
Source code in paperscraper/citations/entity/researcher.py
__init__(input: str, mode: ModeType = 'infer')
¶Construct researcher object for self citation/reference analysis.
Parameters:

Name | Type | Description | Default
---|---|---|---
input | str | A researcher to search for. | required
mode | ModeType | The format in which the researcher identifier was provided (e.g., name or ID). Defaults to 'infer'. | 'infer'

Raises:

Type | Description
---|---
ValueError | Unknown mode.
Source code in paperscraper/citations/entity/researcher.py
self_references()
¶Sifts through all papers of a researcher and extracts the self references.
self_citations()
¶
orcid
¶
orcid_to_author_name(orcid_id: str) -> Optional[str]
¶
Given an ORCID ID (as a string, e.g. '0000-0002-1825-0097'), returns the full name of the author from the ORCID public API.
Source code in paperscraper/citations/orcid.py
self_citations
¶
self_citations_paper(inputs: Union[str, List[str]], verbose: bool = False) -> Union[CitationResult, List[CitationResult]]
async
¶
Analyze self-citations for one or more papers by DOI or Semantic Scholar ID.
Parameters:

Name | Type | Description | Default
---|---|---|---
inputs | Union[str, List[str]] | A single DOI/SSID string or a list of them. | required
verbose | bool | If True, logs detailed information for each paper. | False

Returns:

Type | Description
---|---
Union[CitationResult, List[CitationResult]] | A single CitationResult if a string was passed, else a list of CitationResults.
Source code in paperscraper/citations/self_citations.py
self_references
¶
self_references_paper(inputs: Union[str, List[str]], verbose: bool = False) -> Union[ReferenceResult, List[ReferenceResult]]
async
¶
Analyze self-references for one or more papers by DOI or Semantic Scholar ID.
Parameters:

Name | Type | Description | Default
---|---|---|---
inputs | Union[str, List[str]] | A single DOI/SSID string or a list of them. | required
verbose | bool | If True, logs detailed information for each paper. | False

Returns:

Type | Description
---|---
Union[ReferenceResult, List[ReferenceResult]] | A single ReferenceResult if a string was passed, else a list of ReferenceResults.

Raises:

Type | Description
---|---
ValueError | If no references are found for a given identifier.
Source code in paperscraper/citations/self_references.py
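A minimal sketch, assuming these async helpers are wrapped with optional_async (see async_utils above) so they can be called directly from synchronous code; the DOIs are placeholders:

```python
from paperscraper.citations.self_references import self_references_paper

# Single paper: returns one ReferenceResult when called from sync code.
result = self_references_paper("10.1101/798496", verbose=True)  # placeholder DOI
print(result)

# From inside an async context the same call would be awaited instead:
# results = await self_references_paper(["10.1101/798496", "10.48550/arXiv.2207.03928"])
```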
tests
¶
test_self_references
¶
TestSelfReferences
¶
Source code in paperscraper/citations/tests/test_self_references.py
test_compare_async_and_sync_performance(dois)
Compares the execution time of asynchronous and synchronous self_references for a list of DOIs.
Source code in paperscraper/citations/tests/test_self_references.py
utils
¶
get_doi_from_title(title: str) -> Optional[str]
¶
Searches the DOI of a paper based on the paper title
Parameters:

Name | Type | Description | Default
---|---|---|---
title | str | Paper title. | required

Returns:

Type | Description
---|---
Optional[str] | DOI according to semantic scholar API.
Source code in paperscraper/citations/utils.py
get_doi_from_ssid(ssid: str, max_retries: int = 10) -> Optional[str]
¶
Given a Semantic Scholar paper ID, returns the corresponding DOI if available.
Parameters:

Name | Type | Description | Default
---|---|---|---
ssid | str | The paper ID on Semantic Scholar. | required

Returns:

Type | Description
---|---
Optional[str] | str or None: The DOI of the paper, or None if not found or in case of an error.
Source code in paperscraper/citations/utils.py
get_title_and_id_from_doi(doi: str) -> Dict[str, Any]
¶
Given a DOI, retrieves the paper's title and semantic scholar paper ID.
Parameters:

Name | Type | Description | Default
---|---|---|---
doi | str | The DOI of the paper (e.g., "10.18653/v1/N18-3011"). | required

Returns:

Type | Description
---|---
Dict[str, Any] | dict or None: A dictionary with keys 'title' and 'ssid'.
Source code in paperscraper/citations/utils.py
author_name_to_ssaid(author_name: str) -> str
¶
Given an author name, returns the Semantic Scholar author ID.
Parameters:

Name | Type | Description | Default
---|---|---|---
author_name | str | The full name of the author. | required

Returns:

Type | Description
---|---
str | str or None: The Semantic Scholar author ID or None if no author is found.
Source code in paperscraper/citations/utils.py
determine_paper_input_type(input: str) -> Literal['ssid', 'doi', 'title']
¶
Determines the intended input type if it was not explicitly given by the user (mode 'infer').
Parameters:

Name | Type | Description | Default
---|---|---|---
input | str | Either a DOI, a semantic scholar paper ID or an author name. | required

Returns:

Type | Description
---|---
Literal['ssid', 'doi', 'title'] | The input type.
Source code in paperscraper/citations/utils.py
get_papers_for_author(ss_author_id: str) -> List[str]
async
¶
Given a Semantic Scholar author ID, returns a list of all Semantic Scholar paper IDs for that author.
Parameters:

Name | Type | Description | Default
---|---|---|---
ss_author_id | str | The Semantic Scholar author ID (e.g., "1741101"). | required

Returns:

Type | Description
---|---
List[str] | A list of paper IDs (as strings) authored by the given author.
Source code in paperscraper/citations/utils.py
find_matching(first: List[Dict[str, str]], second: List[Dict[str, str]]) -> List[str]
¶
Ingests two sets of authors and returns a list of those that match (either based on name or on author ID).
Parameters:

Name | Type | Description | Default
---|---|---|---
first | List[Dict[str, str]] | First set of authors, given as a list of dicts with two keys (author name and author ID). | required
second | List[Dict[str, str]] | Second set of authors, given as a list of dicts with the same two keys. | required

Returns:

Type | Description
---|---
List[str] | List of names of authors in the first list where a match was found.
Source code in paperscraper/citations/utils.py
check_overlap(n1: str, n2: str) -> bool
¶
Check whether two author names are identical. TODO: This can be made more robust
Parameters:

Name | Type | Description | Default
---|---|---|---
n1 | str | First name. | required
n2 | str | Second name. | required

Returns:

Name | Type | Description
---|---|---
bool | bool | Whether names are identical.
Source code in paperscraper/citations/utils.py
clean_name(s: str) -> str
¶
Clean up a str by removing special characters.
Parameters:

Name | Type | Description | Default
---|---|---|---
s | str | Input possibly containing special symbols. | required

Returns:

Type | Description
---|---
str | Homogenized string.
Source code in paperscraper/citations/utils.py
get_dumps
¶
arxiv
¶
Dump arxiv data in JSONL format.
arxiv(start_date: Optional[str] = None, end_date: Optional[str] = None, save_path: str = save_path)
¶
Fetches papers from arXiv based on time range, i.e., start_date and end_date. If the start_date and end_date are not provided, fetches papers from the earliest possible date to the current date. The fetched papers are stored in JSONL format.
Parameters:

Name | Type | Description | Default
---|---|---|---
start_date | str | Start date in format YYYY-MM-DD. Defaults to None. | None
end_date | str | End date in format YYYY-MM-DD. Defaults to None. | None
save_path | str | Path to save the JSONL dump. Defaults to save_path. | save_path
Source code in paperscraper/get_dumps/arxiv.py
biorxiv
¶
Dump bioRxiv data in JSONL format.
biorxiv(start_date: Optional[str] = None, end_date: Optional[str] = None, save_path: str = save_path, max_retries: int = 10)
¶
Fetches papers from biorxiv based on time range, i.e., start_date and end_date. If the start_date and end_date are not provided, papers will be fetched from biorxiv from the launch date of biorxiv until the current date. The fetched papers will be stored in jsonl format in save_path.
Parameters:

Name | Type | Description | Default
---|---|---|---
start_date | str | Begin date expressed as YYYY-MM-DD. Defaults to None, i.e., earliest possible. | None
end_date | str | End date expressed as YYYY-MM-DD. Defaults to None, i.e., today. | None
save_path | str | Path where the dump is stored. Defaults to save_path. | save_path
max_retries | int | Number of retries when API shows connection issues. Defaults to 10. | 10
Source code in paperscraper/get_dumps/biorxiv.py
chemrxiv
¶
Dump chemRxiv data in JSONL format.
chemrxiv(start_date: Optional[str] = None, end_date: Optional[str] = None, save_path: str = save_path) -> None
¶
Fetches papers from chemrxiv based on time range, i.e., start_date and end_date. If the start_date and end_date are not provided, papers will be fetched from chemrxiv from the launch date of chemrxiv until the current date. The fetched papers will be stored in jsonl format in save_path.
Parameters:

Name | Type | Description | Default
---|---|---|---
start_date | str | Begin date expressed as YYYY-MM-DD. Defaults to None, i.e., earliest possible. | None
end_date | str | End date expressed as YYYY-MM-DD. Defaults to None, i.e., today. | None
save_path | str | Path where the dump is stored. Defaults to save_path. | save_path
Source code in paperscraper/get_dumps/chemrxiv.py
medrxiv
¶
Dump medrxiv data in JSONL format.
medrxiv(start_date: Optional[str] = None, end_date: Optional[str] = None, save_path: str = save_path, max_retries: int = 10)
¶
Fetches papers from medrxiv based on time range, i.e., start_date and end_date. If the start_date and end_date are not provided, then papers will be fetched from medrxiv starting from the launch date of medrxiv until current date. The fetched papers will be stored in jsonl format in save_path.
Parameters:

Name | Type | Description | Default
---|---|---|---
start_date | str | Begin date expressed as YYYY-MM-DD. Defaults to None, i.e., earliest possible. | None
end_date | str | End date expressed as YYYY-MM-DD. Defaults to None, i.e., today. | None
save_path | str | Path where the dump is stored. Defaults to save_path. | save_path
max_retries | int | Number of retries when API shows connection issues. Defaults to 10. | 10
Source code in paperscraper/get_dumps/medrxiv.py
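A short sketch of building local dumps for two servers; the date range and save paths are illustrative (omit save_path to use the default):

```python
from paperscraper.get_dumps import biorxiv, medrxiv

# Restrict the harvest to one month to keep the example small.
biorxiv(start_date="2024-01-01", end_date="2024-01-31",
        save_path="server_dumps/biorxiv_jan24.jsonl")
medrxiv(start_date="2024-01-01", end_date="2024-01-31",
        save_path="server_dumps/medrxiv_jan24.jsonl")
```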
utils
¶
chemrxiv
¶
get_author(author_list: List[Dict]) -> str
¶
Parse ChemRxiv dump entry to extract author list
Parameters:

Name | Type | Description | Default
---|---|---|---
author_list | list | List of dicts, one per author. | required

Returns:

Name | Type | Description
---|---|---
str | str | ;-concatenated author list.
Source code in paperscraper/get_dumps/utils/chemrxiv/utils.py
get_categories(category_list: List[Dict]) -> str
¶
Parse ChemRxiv dump entry to extract the categories of the paper
Parameters:

Name | Type | Description | Default
---|---|---|---
category_list | list | List of dicts, one per category. | required

Returns:

Name | Type | Description
---|---|---
str | str | ;-concatenated category list.
Source code in paperscraper/get_dumps/utils/chemrxiv/utils.py
get_date(datestring: str) -> str
¶
Get the date of a chemrxiv dump entry.
Parameters:

Name | Type | Description | Default
---|---|---|---
datestring | str | String in the format: 2021-10-15T05:12:32.356Z | required

Returns:

Name | Type | Description
---|---|---
str | str | Date in the format: YYYY-MM-DD.
Source code in paperscraper/get_dumps/utils/chemrxiv/utils.py
get_metrics(metrics_list: List[Dict]) -> Dict
¶
Parse ChemRxiv dump entry to extract the access metrics of the paper.
Parameters:

Name | Type | Description | Default
---|---|---|---
metrics_list | List[Dict] | A list of single-keyed dictionaries, each containing the key and value for exactly one metric. | required

Returns:

Name | Type | Description
---|---|---
Dict | Dict | A flattened dictionary with all metrics and a timestamp.
Source code in paperscraper/get_dumps/utils/chemrxiv/utils.py
parse_dump(source_path: str, target_path: str) -> None
¶
Parses the dump as generated by the chemrXiv API and this repo: https://github.com/cthoyt/chemrxiv-summarize into a format that is equal to that of biorXiv and medRxiv.
NOTE: This is a lazy parser trying to store all data in memory.
Parameters:

Name | Type | Description | Default
---|---|---|---
source_path | str | Path to the source dump. | required
Source code in paperscraper/get_dumps/utils/chemrxiv/utils.py
chemrxiv_api
¶
ChemrxivAPI
¶Handle OpenEngage API requests, using access. Adapted from https://github.com/fxcoudert/tools/blob/master/chemRxiv/chemRxiv.py.
Source code in paperscraper/get_dumps/utils/chemrxiv/chemrxiv_api.py
__init__(start_date: Optional[str] = None, end_date: Optional[str] = None, page_size: Optional[int] = None, max_retries: int = 10)
¶Initialize API class.
Parameters:

Name | Type | Description | Default
---|---|---|---
start_date | Optional[str] | Begin date expressed as YYYY-MM-DD. Defaults to None. | None
end_date | Optional[str] | End date expressed as YYYY-MM-DD. Defaults to None. | None
page_size | int | The batch size used to fetch the records from chemrxiv. | None
max_retries | int | Number of retries in case of error. | 10
Source code in paperscraper/get_dumps/utils/chemrxiv/chemrxiv_api.py
request(url, method, params=None)
¶Send an API request to open Engage.
Source code in paperscraper/get_dumps/utils/chemrxiv/chemrxiv_api.py
query(query, method='get', params=None)
¶
query_generator(query, method: str = 'get', params: Dict = {})
¶Query for a list of items, with paging. Returns a generator.
Source code in paperscraper/get_dumps/utils/chemrxiv/chemrxiv_api.py
all_preprints()
¶
preprint(article_id)
¶Information on a given preprint. See also: https://docs.figshare.com/#public_article
utils
¶
Misc utils to download chemRxiv dump
get_author(author_list: List[Dict]) -> str
¶Parse ChemRxiv dump entry to extract author list
Parameters:

Name | Type | Description | Default
---|---|---|---
author_list | list | List of dicts, one per author. | required

Returns:

Name | Type | Description
---|---|---
str | str | ;-concatenated author list.
Source code in paperscraper/get_dumps/utils/chemrxiv/utils.py
get_categories(category_list: List[Dict]) -> str
¶Parse ChemRxiv dump entry to extract the categories of the paper
Parameters:

Name | Type | Description | Default
---|---|---|---
category_list | list | List of dicts, one per category. | required

Returns:

Name | Type | Description
---|---|---
str | str | ;-concatenated category list.
Source code in paperscraper/get_dumps/utils/chemrxiv/utils.py
get_date(datestring: str) -> str
¶Get the date of a chemrxiv dump entry.
Parameters:

Name | Type | Description | Default
---|---|---|---
datestring | str | String in the format: 2021-10-15T05:12:32.356Z | required

Returns:

Name | Type | Description
---|---|---
str | str | Date in the format: YYYY-MM-DD.
Source code in paperscraper/get_dumps/utils/chemrxiv/utils.py
get_metrics(metrics_list: List[Dict]) -> Dict
¶Parse ChemRxiv dump entry to extract the access metrics of the paper.
Parameters:

Name | Type | Description | Default
---|---|---|---
metrics_list | List[Dict] | A list of single-keyed dictionaries, each containing the key and value for exactly one metric. | required

Returns:

Name | Type | Description
---|---|---
Dict | Dict | A flattened dictionary with all metrics and a timestamp.
Source code in paperscraper/get_dumps/utils/chemrxiv/utils.py
parse_dump(source_path: str, target_path: str) -> None
¶Parses the dump as generated by the chemrXiv API and this repo: https://github.com/cthoyt/chemrxiv-summarize into a format that is equal to that of biorXiv and medRxiv.
NOTE: This is a lazy parser trying to store all data in memory.
Parameters:

Name | Type | Description | Default
---|---|---|---
source_path | str | Path to the source dump. | required
Source code in paperscraper/get_dumps/utils/chemrxiv/utils.py
impact
¶
Impactor
¶
Source code in paperscraper/impact.py
__init__()
¶
Initialize the Impactor class with an instance of the Factor class. This allows access to the database of journal impact factors.
Source code in paperscraper/impact.py
search(query: str, threshold: int = 100, sort_by: Optional[str] = None, min_impact: float = 0.0, max_impact: float = float('inf'), return_all: bool = False) -> List[Dict[str, Any]]
¶
Search for journals matching the given query with an optional fuzziness level and sorting.
Parameters:

Name | Type | Description | Default
---|---|---|---
query | str | The journal name or abbreviation to search for. | required
threshold | int | The threshold for fuzzy matching. If set to 100, exact matching is performed. If set below 100, fuzzy matching is used. Defaults to 100. | 100
sort_by | Optional[str] | Criterion for sorting results, one of 'impact', 'journal' and 'score'. | None
min_impact | float | Minimum impact factor for journals to be considered, defaults to 0. | 0.0
max_impact | float | Maximum impact factor for journals to be considered, defaults to infinity. | float('inf')
return_all | bool | If True, returns all columns of the DataFrame for each match. | False

Returns:

Type | Description
---|---
List[Dict[str, Any]] | List[dict]: A list of dictionaries containing the journal information.
Source code in paperscraper/impact.py
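A brief sketch; the query string is illustrative:

```python
from paperscraper.impact import Impactor

impactor = Impactor()

# Fuzzy search for journals whose name resembles the query, sorted by impact factor.
matches = impactor.search("Nature Comm", threshold=85, sort_by="impact")
for match in matches:
    print(match)
```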
pdf
¶
fallbacks
¶
Functionalities to scrape PDF files of publications.
fallback_wiley_api(paper_metadata: Dict[str, Any], output_path: Path, api_keys: Dict[str, str], max_attempts: int = 2) -> bool
¶
Attempt to download the PDF via the Wiley TDM API (popular publisher which blocks standard scraping attempts; API access free for academic users).
This function uses the WILEY_TDM_API_TOKEN environment variable to authenticate with the Wiley TDM API and attempts to download the PDF for the given paper. See https://onlinelibrary.wiley.com/library-info/resources/text-and-datamining for a description on how to get your WILEY_TDM_API_TOKEN.
Parameters:

Name | Type | Description | Default
---|---|---|---
paper_metadata | dict | Dictionary containing paper metadata. Must include the 'doi' key. | required
output_path | Path | A pathlib.Path object representing the path where the PDF will be saved. | required
api_keys | dict | Preloaded API keys. | required
max_attempts | int | The maximum number of attempts to retry the API call. | 2

Returns:

Name | Type | Description
---|---|---
bool | bool | True if the PDF file was successfully downloaded, False otherwise.
Source code in paperscraper/pdf/fallbacks.py
fallback_bioc_pmc(doi: str, output_path: Path) -> bool
¶
Attempt to download the XML via the BioC-PMC fallback.
This function first converts a given DOI to a PMCID using the NCBI ID Converter API. If a PMCID is found, it constructs the corresponding PMC XML URL and attempts to download the full-text XML.
PubMed Central® (PMC) is a free full-text archive of biomedical and life sciences journal literature at the U.S. National Institutes of Health's National Library of Medicine (NIH/NLM).
Parameters:

Name | Type | Description | Default
---|---|---|---
doi | str | The DOI of the paper to retrieve. | required
output_path | Path | A pathlib.Path object representing the path where the XML file will be saved. | required

Returns:

Name | Type | Description
---|---|---
bool | bool | True if the XML file was successfully downloaded, False otherwise.
Source code in paperscraper/pdf/fallbacks.py
fallback_elsevier_api(paper_metadata: Dict[str, Any], output_path: Path, api_keys: Dict[str, str]) -> bool
¶
Attempt to download the full text via the Elsevier TDM API. For more information, see: https://www.elsevier.com/about/policies-and-standards/text-and-data-mining (Requires an institutional subscription and an API key provided in the api_keys dictionary under the key "ELSEVIER_TDM_API_KEY".)
Parameters:

Name | Type | Description | Default
---|---|---|---
paper_metadata | Dict[str, Any] | Dictionary containing paper metadata. Must include the 'doi' key. | required
output_path | Path | A pathlib.Path object representing the path where the XML file will be saved. | required
api_keys | Dict[str, str] | A dictionary containing API keys. Must include the key "ELSEVIER_TDM_API_KEY". | required

Returns:

Name | Type | Description
---|---|---
bool | bool | True if the XML file was successfully downloaded, False otherwise.
Source code in paperscraper/pdf/fallbacks.py
fallback_elife_xml(doi: str, output_path: Path) -> bool
¶
Attempt to download the XML via the eLife XML repository on GitHub.
eLife provides open access to their XML files on GitHub, which can be used as a fallback. When multiple versions exist (revised papers), it takes the latest version (e.g., v3 instead of v1).
Parameters:

Name | Type | Description | Default
---|---|---|---
doi | str | The DOI of the eLife paper to download. | required
output_path | Path | A pathlib.Path object representing the path where the XML file will be saved. | required

Returns:

Name | Type | Description
---|---|---
bool | bool | True if the XML file was successfully downloaded, False otherwise.
Source code in paperscraper/pdf/fallbacks.py
get_elife_xml_index() -> dict
¶
Fetch the eLife XML index from GitHub and return it as a dictionary.
This function retrieves and caches the list of available eLife articles in XML format from the eLife GitHub repository. It ensures that the latest version of each article is accessible for downloading. The index is cached in memory to avoid repeated network requests when processing multiple eLife papers.
Returns:

Name | Type | Description
---|---|---
dict | dict | A dictionary where keys are article numbers (as strings) and values are lists of tuples (version, download_url). Each list is sorted by version number.
Source code in paperscraper/pdf/fallbacks.py
month_folder(doi: str) -> str
¶
Query bioRxiv API to get the posting date of a given DOI. Convert a date to the BioRxiv S3 folder name, rolling over if it's the month's last day. E.g., if date is the last day of April, treat as May_YYYY.
Parameters:

Name | Type | Description | Default
---|---|---|---
doi | str | The DOI for which to retrieve the date. | required

Returns:

Type | Description
---|---
str | Month and year in the format Month_YYYY (e.g., 'May_2020').
Source code in paperscraper/pdf/fallbacks.py
list_meca_keys(s3_client: BaseClient, bucket: str, prefix: str) -> list
¶
List all .meca object keys under a given prefix in a requester-pays bucket.
Parameters:

Name | Type | Description | Default
---|---|---|---
s3_client | BaseClient | S3 client to get the data from. | required
bucket | str | Bucket to get data from. | required
prefix | str | Prefix to get data from. | required

Returns:

Type | Description
---|---
list | List of keys, one per existing .meca in the bucket.
Source code in paperscraper/pdf/fallbacks.py
find_meca_for_doi(s3_client: BaseClient, bucket: str, key: str, doi_token: str) -> bool
¶
Efficiently inspect manifest.xml within a .meca zip by fetching only necessary bytes. Parse via ZipFile to read manifest.xml and match DOI token.
Parameters:

Name | Type | Description | Default
---|---|---|---
s3_client | BaseClient | S3 client to get the data from. | required
bucket | str | Bucket to get data from. | required
key | str | Key (prefix) to get data from. | required
doi_token | str | The DOI that should be matched. | required

Returns:

Type | Description
---|---
bool | Whether or not the DOI could be matched.
Source code in paperscraper/pdf/fallbacks.py
fallback_s3(doi: str, output_path: Union[str, Path], api_keys: dict, workers: int = 32) -> bool
¶
Download a BioRxiv PDF via the requester-pays S3 bucket using range requests.
Parameters:

Name | Type | Description | Default
---|---|---|---
doi | str | The DOI for which to retrieve the PDF (e.g. '10.1101/798496'). | required
output_path | Union[str, Path] | Path where the PDF will be saved (with .pdf suffix added). | required
api_keys | dict | Dict containing 'AWS_ACCESS_KEY_ID' and 'AWS_SECRET_ACCESS_KEY'. | required

Returns:

Type | Description
---|---
bool | True if download succeeded, False otherwise.
Source code in paperscraper/pdf/fallbacks.py
pdf
¶
Functionalities to scrape PDF files of publications.
save_pdf(paper_metadata: Dict[str, Any], filepath: Union[str, Path], save_metadata: bool = False, api_keys: Optional[Union[str, Dict[str, str]]] = None) -> None
¶
Save a PDF file of a paper.
Parameters:

Name | Type | Description | Default
---|---|---|---
paper_metadata | Dict[str, Any] | A dictionary with the paper metadata. Must contain the 'doi' key. | required
filepath | Union[str, Path] | Path to the PDF file to be saved (with or without suffix). | required
save_metadata | bool | A boolean indicating whether to save paper metadata as a separate json. | False
api_keys | Optional[Union[str, Dict[str, str]]] | Either a dictionary containing API keys (if already loaded) or a string (path to API keys file). If None, an attempt is made to load keys from a default location. | None
Source code in paperscraper/pdf/pdf.py
save_pdf_from_dump(dump_path: str, pdf_path: str, key_to_save: str = 'doi', save_metadata: bool = False, api_keys: Optional[str] = None) -> None
¶
Receives a path to a .jsonl dump with paper metadata and saves the PDF files of each paper.
Parameters:

Name | Type | Description | Default
---|---|---|---
dump_path | str | Path to a .jsonl dump with paper metadata. | required
pdf_path | str | Path to a folder where the files will be stored. | required
key_to_save | str | Key in the paper metadata to use as filename (e.g., 'doi'). Defaults to 'doi'. | 'doi'
save_metadata | bool | A boolean indicating whether to save paper metadata as a separate json. | False
api_keys | Optional[str] | Path to a file with API keys. If None, API-based fallbacks will be skipped. | None
Source code in paperscraper/pdf/pdf.py
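A short sketch pairing a metadata dump with PDF downloads; the file paths are placeholders and the dump name reuses the arXiv example above:

```python
from paperscraper.pdf import save_pdf, save_pdf_from_dump

# Single paper: the metadata must contain the DOI.
save_pdf({"doi": "10.48550/arXiv.2207.03928"}, filepath="paper.pdf")

# Whole dump: one PDF per line of the .jsonl file, named by DOI.
save_pdf_from_dump("arxiv_covid_ai.jsonl", pdf_path="pdfs/", key_to_save="doi")
```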
utils
¶
load_api_keys(filepath: Optional[str] = None) -> Dict[str, str]
¶
Reads API keys from a file and returns them as a dictionary. The file should have each API key on a separate line in the format: KEY_NAME=API_KEY_VALUE
Example:

WILEY_TDM_API_TOKEN=your_wiley_token_here
ELSEVIER_TDM_API_KEY=your_elsevier_key_here
Parameters:

Name | Type | Description | Default
---|---|---|---
filepath | Optional[str] | Optional path to the file containing API keys. | None

Returns:

Type | Description
---|---
Dict[str, str] | Dict[str, str]: A dictionary where keys are API key names and values are their respective API keys.
Source code in paperscraper/pdf/utils.py
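A minimal sketch of wiring the keys into the PDF downloader; the key-file path is an assumption and the DOI is the placeholder used above:

```python
from paperscraper.pdf import save_pdf
from paperscraper.pdf.utils import load_api_keys

# Load keys once, then reuse the dictionary across many save_pdf calls.
keys = load_api_keys("api_keys.txt")  # hypothetical file in KEY_NAME=API_KEY_VALUE format
save_pdf({"doi": "10.48550/arXiv.2207.03928"}, filepath="paper.pdf", api_keys=keys)
```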
plotting
¶
plot_comparison(data_dict: dict, keys: List[str], x_ticks: List[str] = ['2015', '2016', '2017', '2018', '2019', '2020'], show_preprint: bool = False, title_text: str = '', keyword_text: Optional[List[str]] = None, figpath: str = 'comparison_plot.pdf') -> None
¶
Plot temporal evolution of number of papers per keyword
Parameters:

Name | Type | Description | Default
---|---|---|---
data_dict | dict | A dictionary with keywords as keys. Each value should be a dictionary itself, with keys for the different APIs, e.g. data_dict = {'covid_19.jsonl': {'pubmed': [0, 0, 0, 12345], 'arxiv': [0, 0, 0, 1234], ...}, 'coronavirus.jsonl': {'pubmed': [234, 345, 456, 12345], 'arxiv': [123, 234, 345, 1234], ...}} | required
keys | List[str] | List of keys which should be plotted. This has to be a subset of data_dict.keys(). | required
x_ticks | List[str] | List of strings to be used for the x-ticks. Should have same length as data_dict[key][database]. Defaults to ['2015', '2016', '2017', '2018', '2019', '2020'], meaning that papers are aggregated per year. | ['2015', '2016', '2017', '2018', '2019', '2020']
show_preprint | bool | Whether preprint servers are aggregated or not. Defaults to False. | False
title_text | str | Title for the produced figure. Defaults to ''. | ''
keyword_text | Optional[List[str]] | Figure caption per keyword. Defaults to None, i.e. empty strings will be used. | None
figpath | str | Name under which figure is saved. Relative or absolute paths can be given. Defaults to 'comparison_plot.pdf'. | 'comparison_plot.pdf'

Raises:

Type | Description
---|---
KeyError | If a database is missing in data_dict.
Source code in paperscraper/plotting.py
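A toy sketch with hand-made counts. It assumes each keyword entry carries one count vector per server (as in the example inside the table above); server names and counts are illustrative:

```python
from paperscraper.plotting import plot_comparison

servers = ["pubmed", "arxiv", "biorxiv", "medrxiv", "chemrxiv"]
data_dict = {
    "covid_19.jsonl": {s: [10, 20, 40, 80] for s in servers},
    "coronavirus.jsonl": {s: [5, 10, 20, 40] for s in servers},
}
plot_comparison(
    data_dict,
    keys=["covid_19.jsonl", "coronavirus.jsonl"],
    x_ticks=["2017", "2018", "2019", "2020"],
    figpath="comparison_plot.pdf",
)
```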
plot_single(data_dict: dict, keys: str, x_ticks: List[str] = ['2015', '2016', '2017', '2018', '2019', '2020'], show_preprint: bool = False, title_text: str = '', figpath: str = 'comparison_plot.pdf', logscale: bool = False) -> None
¶
Plot temporal evolution of number of papers per keyword
Parameters:

Name | Type | Description | Default
---|---|---|---
data_dict | dict | A dictionary with keywords as keys. Each value should be a dictionary itself, with keys for the different APIs, e.g. data_dict = {'covid_19.jsonl': {'pubmed': [0, 0, 0, 12345], 'arxiv': [0, 0, 0, 1234], ...}, 'coronavirus.jsonl': {'pubmed': [234, 345, 456, 12345], 'arxiv': [123, 234, 345, 1234], ...}} | required
keys | str | A key which should be plotted. This has to be a subset of data_dict.keys(). | required
x_ticks | List[str] | List of strings to be used for the x-ticks. Should have same length as data_dict[key][database]. Defaults to ['2015', '2016', '2017', '2018', '2019', '2020'], meaning that papers are aggregated per year. | ['2015', '2016', '2017', '2018', '2019', '2020']
show_preprint | bool | Whether preprint servers are aggregated or not. Defaults to False. | False
title_text | str | Title for the produced figure. Defaults to ''. | ''
figpath | str | Name under which figure is saved. Relative or absolute paths can be given. Defaults to 'comparison_plot.pdf'. | 'comparison_plot.pdf'
logscale | bool | Whether y-axis is plotted on logscale. Defaults to False. | False

Raises:

Type | Description
---|---
KeyError | If a database is missing in data_dict.
Source code in paperscraper/plotting.py
plot_venn_two(sizes: List[int], labels: List[str], figpath: str = 'venn_two.pdf', title: str = '', **kwargs) -> None
¶
Plot a single Venn Diagram with two terms.
Parameters:

Name | Type | Description | Default
---|---|---|---
sizes | List[int] | List of ints of length 3. First two elements correspond to the labels, third one to the intersection. | required
labels | List[str] | List of str of length 2, containing names of circles. | required
figpath | str | Name under which figure is saved. Defaults to 'venn_two.pdf'. | 'venn_two.pdf'
title | str | Title of the plot. Defaults to '', i.e. it is inferred from labels. | ''
**kwargs |  | Additional keyword arguments for venn2. | {}
Source code in paperscraper/plotting.py
plot_venn_three(sizes: List[int], labels: List[str], figpath: str = '', title: str = '', **kwargs) -> None
¶
Plot a single Venn Diagram with three terms.
Parameters:

Name | Type | Description | Default
---|---|---|---
sizes | List[int] | List of ints of length 7, one per region of the three-set diagram. | required
labels | List[str] | List of str of length 3, containing names of circles. | required
figpath | str | Name under which figure is saved. Defaults to '', i.e. it is inferred from labels. | ''
title | str | Title of the plot. Defaults to '', i.e. it is inferred from labels. | ''
**kwargs |  | Additional keyword arguments for venn3. | {}
Source code in paperscraper/plotting.py
plot_multiple_venn(sizes: List[List[int]], labels: List[List[str]], figname: str, titles: List[str], suptitle: str = '', gridspec_kw: dict = {}, figsize: Iterable = (8, 4.5), **kwargs) -> None
¶
Plots multiple Venn Diagrams next to each other
Parameters:

Name | Type | Description | Default
---|---|---|---
sizes | List[List[int]] | List of lists with sizes, one per Venn Diagram. Lengths of lists should be either 3 (plot_venn_two) or 7 (plot_venn_three). | required
labels | List[List[str]] | List of lists of str containing names of circles. Lengths of lists should be either 2 or 3. | required
figname | str | Name under which figure is saved. | required
titles | List[str] | Titles of subplots. Should have the same length as labels and sizes. | required
suptitle | str | Title of entire plot. Defaults to '', i.e. no title. | ''
gridspec_kw | dict | Additional keyword args for plt.subplots. Useful to adjust width of plots. E.g. gridspec_kw={'width_ratios': [1, 2]} will make the second Venn Diagram twice as wide as the first one. | {}
**kwargs |  | Additional keyword arguments for venn3. | {}
Source code in paperscraper/plotting.py
postprocessing
¶
aggregate_paper(data: List[Dict[str, str]], start_year: int = 2016, bins_per_year: int = 4, filtering: bool = False, filter_keys: List = list(), unwanted_keys: List = list(), return_filtered: bool = False, filter_abstract: bool = True, last_year: int = 2021)
¶
Consumes a list of unstructured keyword results from a .jsonl and aggregates papers into several bins per year.
Parameters:

Name | Type | Description | Default
---|---|---|---
data | List[Dict[str, str]] | Content of a .jsonl file, i.e., a list of dictionaries, one per paper. | required
start_year | int | First year of interest. Defaults to 2016. | 2016
bins_per_year | int | Defaults to 4 (quarterly aggregation). | 4
filtering | bool | Whether all papers in the .jsonl are treated as matches, or whether an additional sanity check for the keywords is performed in abstract/title. Defaults to False. | False
filter_keys | list | List of str used for filtering. Only applies if filtering is True. Defaults to an empty list. | list()
unwanted_keys | list | List of str that must not occur in either title or abstract. Only applies if filtering is True. | list()
return_filtered | bool | Whether the filtered matches are also returned. Only applies if filtering is True. Defaults to False. | False
filter_abstract | bool | Whether the keyword is searched in the abstract or not. Defaults to True. | True
last_year | int | Most recent year for the aggregation. Defaults to the current year. All newer entries are discarded. | 2021

Returns:

Name | Type | Description
---|---|---
bins | array | Vector of length number of years (last_year - start_year) x bins_per_year.
Source code in paperscraper/postprocessing.py
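A hedged end-to-end sketch: read a dump produced earlier, bin it per quarter, then hand the counts to the plotting helpers. The file name reuses the arXiv example above:

```python
import json

from paperscraper.postprocessing import aggregate_paper

with open("arxiv_covid_ai.jsonl") as f:
    papers = [json.loads(line) for line in f]

# Quarterly counts from 2016 up to and including 2021.
bins = aggregate_paper(papers, start_year=2016, bins_per_year=4, last_year=2021)
print(bins)
```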
pubmed
¶
dump_papers(papers: pd.DataFrame, filepath: str) -> None
¶
Receives a pd.DataFrame, one paper per row and dumps it into a .jsonl file with one paper per line.
Parameters:

Name | Type | Description | Default
---|---|---|---
papers | DataFrame | A dataframe of paper metadata, one paper per row. | required
filepath | str | Path to dump the papers; has to end with .jsonl. | required
Source code in paperscraper/utils.py
get_emails(paper: PubMedArticle) -> List
¶
Extracts author email addresses from PubMedArticle.
Parameters:

Name | Type | Description | Default
---|---|---|---
paper | PubMedArticle | An object of type PubMedArticle. Required to have an 'author' field. | required

Returns:

Name | Type | Description
---|---|---
List | List | A possibly empty list of emails associated to authors of the paper.
Source code in paperscraper/pubmed/utils.py
get_query_from_keywords_and_date(keywords: List[Union[str, List]], start_date: str = 'None', end_date: str = 'None') -> str
¶
Receives a list of keywords and returns the query for the pubmed API.
Parameters:

Name | Type | Description | Default
---|---|---|---
keywords | List[Union[str, List[str]]] | Items will be AND separated. If items are lists themselves, they will be OR separated. | required
start_date | str | Start date for the search. Needs to be in format YYYY/MM/DD, e.g. '2020/07/20'. Defaults to 'None', i.e. no specific dates are used. | 'None'
end_date | str | End date for the search. Same notation as start_date. | 'None'

If start_date and end_date are left as default, the function is identical to get_query_from_keywords.

Returns:

Name | Type | Description
---|---|---
str | str | Query to enter to the pubmed API.
Source code in paperscraper/pubmed/utils.py
get_pubmed_papers(query: str, fields: List = ['title', 'authors', 'date', 'abstract', 'journal', 'doi'], max_results: int = 9998, *args, **kwargs) -> pd.DataFrame
¶
Performs a PubMed API request for a query and returns a list of papers with the desired fields.
Parameters:

Name | Type | Description | Default
---|---|---|---
query | str | Query to PubMed API. Needs to match PubMed API notation. | required
fields | List | List of strings with fields to keep in output. NOTE: If 'emails' is passed, an attempt is made to extract author mail addresses. | ['title', 'authors', 'date', 'abstract', 'journal', 'doi']
max_results | int | Maximal number of results retrieved from DB. Defaults to 9998; higher values likely raise problems due to the PubMed API, see: https://stackoverflow.com/questions/75353091/biopython-entrez-article-limit | 9998
args |  | Additional arguments for pubmed.query. | ()
kwargs |  | Additional arguments for pubmed.query. | {}

Returns:

Type | Description
---|---
DataFrame | pd.DataFrame: One paper per row.
Source code in paperscraper/pubmed/pubmed.py
get_and_dump_pubmed_papers(keywords: List[Union[str, List[str]]], output_filepath: str, fields: List = ['title', 'authors', 'date', 'abstract', 'journal', 'doi'], start_date: str = 'None', end_date: str = 'None', *args, **kwargs) -> None
¶
Combines get_pubmed_papers and dump_papers.
Parameters:

Name | Type | Description | Default
---|---|---|---
keywords | List[Union[str, List[str]]] | List of keywords to request pubmed API. The outer list level will be considered as AND separated keys, the inner level as OR separated. | required
output_filepath | str | Path where the dump will be saved. | required
fields | List | List of strings with fields to keep in output. Defaults to ['title', 'authors', 'date', 'abstract', 'journal', 'doi']. NOTE: If 'emails' is passed, an attempt is made to extract author mail addresses. | ['title', 'authors', 'date', 'abstract', 'journal', 'doi']
start_date | str | Start date for the search. Needs to be in format YYYY/MM/DD, e.g. '2020/07/20'. Defaults to 'None', i.e. no specific dates are used. | 'None'
end_date | str | End date for the search. Same notation as start_date. | 'None'
Source code in paperscraper/pubmed/pubmed.py
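A usage sketch mirroring the arXiv example above; the keyword nesting and output path are illustrative:

```python
from paperscraper.pubmed import get_and_dump_pubmed_papers

covid = ["covid", "sars-cov-2"]
ai = ["deep learning", "neural network"]

# (covid OR sars-cov-2) AND (deep learning OR neural network), restricted by date.
get_and_dump_pubmed_papers(
    [covid, ai],
    output_filepath="pubmed_covid_ai.jsonl",
    start_date="2020/01/01",
    end_date="2020/12/31",
)
```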
pubmed
¶
get_pubmed_papers(query: str, fields: List = ['title', 'authors', 'date', 'abstract', 'journal', 'doi'], max_results: int = 9998, *args, **kwargs) -> pd.DataFrame
¶
Performs a PubMed API request for a query and returns a list of papers with the desired fields.
Parameters:

Name | Type | Description | Default
---|---|---|---
query | str | Query to PubMed API. Needs to match PubMed API notation. | required
fields | List | List of strings with fields to keep in output. NOTE: If 'emails' is passed, an attempt is made to extract author mail addresses. | ['title', 'authors', 'date', 'abstract', 'journal', 'doi']
max_results | int | Maximal number of results retrieved from DB. Defaults to 9998; higher values likely raise problems due to the PubMed API, see: https://stackoverflow.com/questions/75353091/biopython-entrez-article-limit | 9998
args |  | Additional arguments for pubmed.query. | ()
kwargs |  | Additional arguments for pubmed.query. | {}

Returns:

Type | Description
---|---
DataFrame | pd.DataFrame: One paper per row.
Source code in paperscraper/pubmed/pubmed.py
get_and_dump_pubmed_papers(keywords: List[Union[str, List[str]]], output_filepath: str, fields: List = ['title', 'authors', 'date', 'abstract', 'journal', 'doi'], start_date: str = 'None', end_date: str = 'None', *args, **kwargs) -> None
¶
Combines get_pubmed_papers and dump_papers.
Parameters:

Name | Type | Description | Default
---|---|---|---
keywords | List[Union[str, List[str]]] | List of keywords to request pubmed API. The outer list level will be considered as AND separated keys, the inner level as OR separated. | required
output_filepath | str | Path where the dump will be saved. | required
fields | List | List of strings with fields to keep in output. Defaults to ['title', 'authors', 'date', 'abstract', 'journal', 'doi']. NOTE: If 'emails' is passed, an attempt is made to extract author mail addresses. | ['title', 'authors', 'date', 'abstract', 'journal', 'doi']
start_date | str | Start date for the search. Needs to be in format YYYY/MM/DD, e.g. '2020/07/20'. Defaults to 'None', i.e. no specific dates are used. | 'None'
end_date | str | End date for the search. Same notation as start_date. | 'None'
Source code in paperscraper/pubmed/pubmed.py
utils
¶
get_query_from_keywords(keywords: List[Union[str, List]]) -> str
¶
Receives a list of keywords and returns the query for the pubmed API.
Parameters:

Name | Type | Description | Default
---|---|---|---
keywords | List[Union[str, List[str]]] | Items will be AND separated. If items are lists themselves, they will be OR separated. | required

Returns:

Name | Type | Description
---|---|---
str | str | Query to enter to the pubmed API.
Source code in paperscraper/pubmed/utils.py
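A sketch of how a nested keyword list is turned into a query string (keywords are hypothetical; the exact output format is defined by the implementation):

```python
from paperscraper.pubmed.utils import get_query_from_keywords

# AND across the outer list, OR within the inner list.
query = get_query_from_keywords([["COVID-19", "SARS-CoV-2"], "remdesivir"])
print(query)  # a PubMed-style boolean query combining the terms
```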
get_query_from_keywords_and_date(keywords: List[Union[str, List]], start_date: str = 'None', end_date: str = 'None') -> str
¶
Receives a list of keywords and returns the query for the PubMed API.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
keywords | List[Union[str, List[str]]] | Items will be AND-separated. If items are lists themselves, they will be OR-separated. | required |
start_date | str | Start date for the search. Needs to be in format YYYY/MM/DD, e.g. '2020/07/20'. Defaults to 'None', i.e. no specific dates are used. | 'None' |
end_date | str | End date for the search. Same notation as start_date. | 'None' |
If start_date and end_date are left as default, the function is identical to get_query_from_keywords.
Returns:
Name | Type | Description |
---|---|---|
str | str | Query to enter to the PubMed API. |
Source code in paperscraper/pubmed/utils.py
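The date-aware variant can be sketched the same way; the dates below are illustrative and must use the YYYY/MM/DD notation:

```python
from paperscraper.pubmed.utils import get_query_from_keywords_and_date

query = get_query_from_keywords_and_date(
    [["COVID-19", "SARS-CoV-2"], "vaccine"],
    start_date="2021/01/01",
    end_date="2021/06/30",
)
print(query)  # keyword query with a PubMed date-range filter appended
```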
get_emails(paper: PubMedArticle) -> List
¶
Extracts author email addresses from PubMedArticle.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
paper | PubMedArticle | An object of type PubMedArticle. Must have an 'author' field. | required |
Returns:
Name | Type | Description |
---|---|---|
List | List | A possibly empty list of email addresses associated with the authors of the paper. |
Source code in paperscraper/pubmed/utils.py
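In practice, email extraction is most easily triggered through get_pubmed_papers by adding 'emails' to the requested fields (the query below is hypothetical; many papers will yield an empty list):

```python
from paperscraper.pubmed import get_pubmed_papers

papers = get_pubmed_papers(
    '"antibiotic resistance"[Title/Abstract]',
    fields=["title", "authors", "emails", "doi"],
)
print(papers[["title", "emails"]].head())
```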
scholar
¶
get_citations_from_title(title: str) -> int
¶
Parameters:
Name | Type | Description | Default |
---|---|---|---|
title | str | Title of the paper to be searched on Google Scholar. | required |
Raises:
Type | Description |
---|---|
TypeError | If something other than a str is passed. |
Returns:
Name | Type | Description |
---|---|---|
int | int | Number of citations of the paper. |
Source code in paperscraper/citations/citations.py
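A minimal sketch, assuming the package-level import from paperscraper.citations; the title is an arbitrary example and the call requires network access to Google Scholar:

```python
from paperscraper.citations import get_citations_from_title

# Raises TypeError if anything other than a str is passed.
n = get_citations_from_title("Attention is all you need")
print(f"Citations found on Google Scholar: {n}")
```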
dump_papers(papers: pd.DataFrame, filepath: str) -> None
¶
Receives a pd.DataFrame, one paper per row and dumps it into a .jsonl file with one paper per line.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
papers | DataFrame | A dataframe of paper metadata, one paper per row. | required |
filepath | str | Path to dump the papers; has to end with .jsonl. | required |
Source code in paperscraper/utils.py
get_scholar_papers(title: str, fields: List = ['title', 'authors', 'year', 'abstract', 'journal', 'citations'], *args, **kwargs) -> pd.DataFrame
¶
Performs a Google Scholar API request for a given title and returns the matching papers with the desired fields.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
title | str | Title of the paper to search for on Google Scholar. | required |
fields | List | List of strings with fields to keep in the output. | ['title', 'authors', 'year', 'abstract', 'journal', 'citations'] |
Returns:
Type | Description |
---|---|
pd.DataFrame | One paper per row. |
Source code in paperscraper/scholar/scholar.py
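A sketch of a title lookup (the title is an arbitrary example; results depend on Google Scholar availability):

```python
from paperscraper.scholar import get_scholar_papers

papers = get_scholar_papers("Deep learning for molecular design")
print(papers[["title", "year", "citations"]].head())
```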
get_and_dump_scholar_papers(title: str, output_filepath: str, fields: List = ['title', 'authors', 'year', 'abstract', 'journal', 'citations']) -> None
¶
Combines get_scholar_papers and dump_papers.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
title | str | Paper to search for on Google Scholar. | required |
output_filepath | str | Path where the dump will be saved. | required |
fields | List | List of strings with fields to keep in the output. | ['title', 'authors', 'year', 'abstract', 'journal', 'citations'] |
Source code in paperscraper/scholar/scholar.py
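And the combined search-and-dump variant, with an illustrative title and output path:

```python
from paperscraper.scholar import get_and_dump_scholar_papers

get_and_dump_scholar_papers(
    "Deep learning for molecular design",
    output_filepath="scholar_dump.jsonl",
)
```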
scholar
¶
get_scholar_papers(title: str, fields: List = ['title', 'authors', 'year', 'abstract', 'journal', 'citations'], *args, **kwargs) -> pd.DataFrame
¶
Performs a Google Scholar API request for a given title and returns the matching papers with the desired fields.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
title | str | Title of the paper to search for on Google Scholar. | required |
fields | List | List of strings with fields to keep in the output. | ['title', 'authors', 'year', 'abstract', 'journal', 'citations'] |
Returns:
Type | Description |
---|---|
pd.DataFrame | One paper per row. |
Source code in paperscraper/scholar/scholar.py
get_and_dump_scholar_papers(title: str, output_filepath: str, fields: List = ['title', 'authors', 'year', 'abstract', 'journal', 'citations']) -> None
¶
Combines get_scholar_papers and dump_papers.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
title | str | Paper to search for on Google Scholar. | required |
output_filepath | str | Path where the dump will be saved. | required |
fields | List | List of strings with fields to keep in the output. | ['title', 'authors', 'year', 'abstract', 'journal', 'citations'] |
Source code in paperscraper/scholar/scholar.py
server_dumps
¶
Folder for the metadata dumps from the bioRxiv, medRxiv, and chemRxiv APIs.
No code here, but the folder will be populated with your local .jsonl files.
tests
¶
test_pdf
¶
TestPDF
¶
Source code in paperscraper/tests/test_pdf.py
test_api_keys_none_pmc()
¶
Test that save_pdf works properly even when no API keys are provided. Paper in PMC.
Source code in paperscraper/tests/test_pdf.py
test_api_keys_none_oa()
¶
Test that save_pdf works properly even when no API keys are provided. Paper available open-access.
Source code in paperscraper/tests/test_pdf.py
test_fallback_bioc_pmc_real_api()
¶
Test the BioC-PMC fallback with a real API call.
Source code in paperscraper/tests/test_pdf.py
test_fallback_bioc_pmc_no_pmcid()
¶
Test BioC-PMC fallback when no PMCID is available.
Source code in paperscraper/tests/test_pdf.py
test_fallback_elife_xml_real_api()
¶
Test the eLife XML fallback with a real API call.
Source code in paperscraper/tests/test_pdf.py
test_fallback_elife_nonexistent_article()
¶
Test eLife XML fallback with a DOI that looks like eLife but doesn't exist.
Source code in paperscraper/tests/test_pdf.py
test_fallback_wiley_api_mock(mock_get)
¶
Test Wiley API fallback with mocked response.
Source code in paperscraper/tests/test_pdf.py
test_fallback_wiley_api_returns_boolean()
¶
Test that fallback_wiley_api properly returns a boolean value.
Source code in paperscraper/tests/test_pdf.py
test_fallback_elsevier_api_mock(mock_get)
¶
Test Elsevier API fallback with mocked response.
Source code in paperscraper/tests/test_pdf.py
test_fallback_elsevier_api_invalid_key(caplog)
¶
Test real Elsevier API connectivity by verifying invalid key response pattern.
Source code in paperscraper/tests/test_pdf.py
utils
¶
dump_papers(papers: pd.DataFrame, filepath: str) -> None
¶
Receives a pd.DataFrame, one paper per row and dumps it into a .jsonl file with one paper per line.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
papers | DataFrame | A dataframe of paper metadata, one paper per row. | required |
filepath | str | Path to dump the papers; has to end with .jsonl. | required |
Source code in paperscraper/utils.py
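A minimal sketch with a toy DataFrame (the metadata values are placeholders):

```python
import pandas as pd
from paperscraper.utils import dump_papers

papers = pd.DataFrame([{"title": "A hypothetical paper", "doi": "10.1000/xyz123"}])
dump_papers(papers, "papers.jsonl")  # filepath has to end with .jsonl
```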
get_filename_from_query(query: List[str]) -> str
¶
Convert a keyword query into a filename under which to dump the papers.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
query | list | List of strings with keywords. | required |
Returns:
Name | Type | Description |
---|---|---|
str | str | Filename. |
Source code in paperscraper/utils.py
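A sketch of the filename helper (the keywords are illustrative; the exact filename formatting is implementation-defined):

```python
from paperscraper.utils import get_filename_from_query

filename = get_filename_from_query(["COVID-19", "vaccine"])
print(filename)
```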
load_jsonl(filepath: str) -> List[Dict[str, str]]
¶
Load data from a .jsonl file, i.e., a file with one dictionary per line.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
filepath | str | Path to the .jsonl file to load. | required |
Returns:
Type | Description |
---|---|
List[Dict[str, str]] | A list of dictionaries, one per paper. |
Source code in paperscraper/utils.py
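A sketch of reading a dump back in (the path refers to a previously created dump and is an assumption of the example):

```python
from paperscraper.utils import load_jsonl

papers = load_jsonl("covid19_ai.jsonl")  # list of dicts, one per paper
print(len(papers))
```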
xrxiv
¶
bioRxiv and medRxiv utilities.
xrxiv_api
¶
API for bioRxiv and medRxiv.
XRXivApi
¶
API class.
Source code in paperscraper/xrxiv/xrxiv_api.py
__init__(server: str, launch_date: str, api_base_url: str = 'https://api.biorxiv.org', max_retries: int = 10)
¶
Initialize API class.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
server | str | Name of the preprint server to access. | required |
launch_date | str | Launch date expressed as YYYY-MM-DD. | required |
api_base_url | str | Base URL for the API. Defaults to 'https://api.biorxiv.org'. | 'https://api.biorxiv.org' |
max_retries | int | Maximum number of retries for a request before an error is raised. Defaults to 10. | 10 |
Source code in paperscraper/xrxiv/xrxiv_api.py
get_papers(start_date: Optional[str] = None, end_date: Optional[str] = None, fields: List[str] = ['title', 'doi', 'authors', 'abstract', 'date', 'journal'], max_retries: int = 10) -> Generator
¶
Get paper metadata.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
start_date | Optional[str] | Begin date. Defaults to None, a.k.a. launch date. | None |
end_date | Optional[str] | End date. Defaults to None, a.k.a. today. | None |
fields | List[str] | Fields to return per paper. Defaults to ['title', 'doi', 'authors', 'abstract', 'date', 'journal']. | ['title', 'doi', 'authors', 'abstract', 'date', 'journal'] |
max_retries | int | Number of retries on connection failure. Defaults to 10. | 10 |
Yields:
Name | Type | Description |
---|---|---|
Generator | Generator | A generator of paper metadata (dict) with the desired fields. |
Source code in paperscraper/xrxiv/xrxiv_api.py
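A sketch of streaming metadata from a preprint server; the server name and launch date below are assumptions for illustration, and dedicated subclasses (if available) may be more convenient than the base class:

```python
from paperscraper.xrxiv.xrxiv_api import XRXivApi

api = XRXivApi(server="biorxiv", launch_date="2013-11-01")

# get_papers yields one metadata dict per paper with the requested fields.
for i, paper in enumerate(api.get_papers(start_date="2024-01-01", end_date="2024-01-07")):
    print(paper["title"])
    if i >= 4:  # only show the first few hits
        break
```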
retry_multi()
¶
Retry a function several times.
Source code in paperscraper/xrxiv/xrxiv_api.py
xrxiv_query
¶
Query dumps from bioRxiv and medRxiv.
XRXivQuery
¶
Query class.
Source code in paperscraper/xrxiv/xrxiv_query.py
__init__(dump_filepath: str, fields: List[str] = ['title', 'doi', 'authors', 'abstract', 'date', 'journal'])
¶
Initialize the query class.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
dump_filepath | str | Filepath to the dump to be queried. | required |
fields | List[str] | Fields to be contained in the dump per paper. Defaults to ['title', 'doi', 'authors', 'abstract', 'date', 'journal']. | ['title', 'doi', 'authors', 'abstract', 'date', 'journal'] |
Source code in paperscraper/xrxiv/xrxiv_query.py
search_keywords(keywords: List[Union[str, List[str]]], fields: List[str] = None, output_filepath: str = None) -> pd.DataFrame
¶
Search for papers in the dump using keywords.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
keywords | List[Union[str, List[str]]] | Items will be AND-separated. If items are lists themselves, they will be OR-separated. | required |
fields | List[str] | Fields to be used in the query search. Defaults to None, a.k.a. search in all fields excluding date. | None |
output_filepath | str | Optional output filepath where to store the hits in JSONL format. Defaults to None, a.k.a. no export to a file. | None |
Returns:
Type | Description |
---|---|
pd.DataFrame | A dataframe with one paper per row. |
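A sketch of querying a local dump (the dump path is an assumption; output_filepath is optional):

```python
from paperscraper.xrxiv.xrxiv_query import XRXivQuery

querier = XRXivQuery("server_dumps/biorxiv_2024-01-01.jsonl")
hits = querier.search_keywords(
    [["COVID-19", "SARS-CoV-2"], "neural network"],
    output_filepath="biorxiv_covid_ml.jsonl",  # optional JSONL export of the hits
)
print(hits.shape)  # one paper per row
```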