paperscraper.get_dumps
arxiv
Dump arxiv data in JSONL format.
arxiv(start_date: Optional[str] = None, end_date: Optional[str] = None, save_path: str = save_path)
Fetches papers from arXiv based on time range, i.e., start_date and end_date. If the start_date and end_date are not provided, fetches papers from the earliest possible date to the current date. The fetched papers are stored in JSONL format.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
start_date | str | Start date in format YYYY-MM-DD. Defaults to None. | None |
end_date | str | End date in format YYYY-MM-DD. Defaults to None. | None |
save_path | str | Path to save the JSONL dump. Defaults to save_path. | save_path |
Source code in paperscraper/get_dumps/arxiv.py
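Since the dump is stored as JSONL (one JSON object per line), it can be read back record by record without loading the whole file. The following is a minimal sketch using only the standard library; `write_jsonl`, `read_jsonl`, and the `title` field are illustrative assumptions, not part of paperscraper or its dump schema.

```python
import json
import os
import tempfile
from typing import Dict, Iterable, Iterator


def write_jsonl(records: Iterable[Dict], path: str) -> None:
    """Write each record as one JSON object per line."""
    with open(path, "w", encoding="utf-8") as handle:
        for record in records:
            handle.write(json.dumps(record) + "\n")


def read_jsonl(path: str) -> Iterator[Dict]:
    """Yield one record per non-empty line of a JSONL file."""
    with open(path, "r", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines
                yield json.loads(line)


# Demo with made-up records; the field names are hypothetical.
demo_path = os.path.join(tempfile.mkdtemp(), "demo_dump.jsonl")
write_jsonl([{"title": "Paper A"}, {"title": "Paper B"}], demo_path)
titles = [record["title"] for record in read_jsonl(demo_path)]
print(titles)
```

Reading through a generator keeps memory use flat, which may matter for dumps that cover long date ranges.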
biorxiv
Dump bioRxiv data in JSONL format.
biorxiv(start_date: Optional[str] = None, end_date: Optional[str] = None, save_path: str = save_path, max_retries: int = 10)
Fetches papers from bioRxiv within a time range given by start_date and end_date. If start_date and end_date are not provided, papers are fetched from the launch date of bioRxiv until the current date. The fetched papers are stored in JSONL format in save_path.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
start_date | str | begin date expressed as YYYY-MM-DD. Defaults to None, i.e., earliest possible. | None |
end_date | str | end date expressed as YYYY-MM-DD. Defaults to None, i.e., today. | None |
save_path | str | Path where the dump is stored. Defaults to save_path. | save_path |
max_retries | int | Number of retries when API shows connection issues. Defaults to 10. | 10 |
Source code in paperscraper/get_dumps/biorxiv.py
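The max_retries parameter bounds how often a failing request is repeated. How paperscraper implements this internally is not shown here; the sketch below is one generic way to express the idea, assuming max_retries means the total number of attempts. `call_with_retries`, `wait_seconds`, and `flaky` are hypothetical names introduced for illustration.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def call_with_retries(
    fetch: Callable[[], T], max_retries: int = 10, wait_seconds: float = 0.0
) -> T:
    """Call `fetch` until it succeeds or `max_retries` attempts have failed."""
    last_error: Optional[Exception] = None
    for _ in range(max_retries):
        try:
            return fetch()
        except ConnectionError as error:
            last_error = error
            time.sleep(wait_seconds)  # pause before the next attempt
    raise RuntimeError(f"Giving up after {max_retries} attempts") from last_error


# Demo: a stand-in request that fails twice, then succeeds.
attempts = {"count": 0}


def flaky() -> str:
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ConnectionError("temporary failure")
    return "ok"


result = call_with_retries(flaky, max_retries=10)
print(result, attempts["count"])
```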
chemrxiv
Dump chemRxiv data in JSONL format.
chemrxiv(start_date: Optional[str] = None, end_date: Optional[str] = None, save_path: str = save_path) -> None
Fetches papers from chemRxiv within a time range given by start_date and end_date. If start_date and end_date are not provided, papers are fetched from the launch date of chemRxiv until the current date. The fetched papers are stored in JSONL format in save_path.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
start_date | str | begin date expressed as YYYY-MM-DD. Defaults to None, i.e., earliest possible. | None |
end_date | str | end date expressed as YYYY-MM-DD. Defaults to None, i.e., today. | None |
save_path | str | Path where the dump is stored. Defaults to save_path. | save_path |
Source code in paperscraper/get_dumps/chemrxiv.py
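The defaulting rule for start_date and end_date (None means "earliest possible" and "today", respectively) can be illustrated with a small helper. This is a sketch under stated assumptions, not the library's code: `resolve_date_range` is a hypothetical function, and the `earliest` argument is a placeholder that the caller supplies rather than any server's actual launch date.

```python
from datetime import date, datetime
from typing import Optional, Tuple


def resolve_date_range(
    start_date: Optional[str],
    end_date: Optional[str],
    earliest: date,
    today: Optional[date] = None,
) -> Tuple[date, date]:
    """Parse YYYY-MM-DD strings, falling back to `earliest` and `today`."""
    today = today or date.today()
    start = datetime.strptime(start_date, "%Y-%m-%d").date() if start_date else earliest
    end = datetime.strptime(end_date, "%Y-%m-%d").date() if end_date else today
    if start > end:
        raise ValueError(f"start_date {start} is after end_date {end}")
    return start, end


# Demo with placeholder dates (not real launch dates).
full_range = resolve_date_range(None, None, earliest=date(2020, 1, 1), today=date(2021, 1, 1))
june = resolve_date_range("2020-06-01", "2020-06-30", earliest=date(2020, 1, 1))
print(full_range, june)
```

A string that does not match YYYY-MM-DD makes `datetime.strptime` raise `ValueError`, so malformed input fails early.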
medrxiv
Dump medRxiv data in JSONL format.
medrxiv(start_date: Optional[str] = None, end_date: Optional[str] = None, save_path: str = save_path, max_retries: int = 10)
Fetches papers from medRxiv within a time range given by start_date and end_date. If start_date and end_date are not provided, papers are fetched from the launch date of medRxiv until the current date. The fetched papers are stored in JSONL format in save_path.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
start_date | str | begin date expressed as YYYY-MM-DD. Defaults to None, i.e., earliest possible. | None |
end_date | str | end date expressed as YYYY-MM-DD. Defaults to None, i.e., today. | None |
save_path | str | Path where the dump is stored. Defaults to save_path. | save_path |
max_retries | int | Number of retries when API shows connection issues. Defaults to 10. | 10 |
Source code in paperscraper/get_dumps/medrxiv.py
utils
chemrxiv
get_author(author_list: List[Dict]) -> str
Parse a ChemRxiv dump entry to extract the author list.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
author_list | list | List of dicts, one per author. | required |
Returns:
Name | Type | Description |
---|---|---|
str | str | ;-concatenated author list. |
Source code in paperscraper/get_dumps/utils/chemrxiv/utils.py
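To make the ";-concatenated" return value concrete, here is a minimal sketch of joining a list of author dicts into one string. The key names `firstName` and `lastName`, the `"; "` separator, and the function name `join_authors` are assumptions for illustration; the actual dump entries and the real `get_author` may differ.

```python
from typing import Dict, List


def join_authors(author_list: List[Dict]) -> str:
    """Join author dicts into a single semicolon-separated string."""
    # "firstName" and "lastName" are hypothetical keys for this sketch.
    names = [f'{author["firstName"]} {author["lastName"]}' for author in author_list]
    return "; ".join(names)


demo_authors = [
    {"firstName": "Ada", "lastName": "Lovelace"},
    {"firstName": "Alan", "lastName": "Turing"},
]
joined = join_authors(demo_authors)
print(joined)
```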
get_categories(category_list: List[Dict]) -> str
Parse a ChemRxiv dump entry to extract the categories of the paper.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
category_list | list | List of dicts, one per category. | required |
Returns:
Name | Type | Description |
---|---|---|
str | str | ;-concatenated category list. |
Source code in paperscraper/get_dumps/utils/chemrxiv/utils.py
get_date(datestring: str) -> str
Get the date of a ChemRxiv dump entry.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
datestring | str | String in the format: 2021-10-15T05:12:32.356Z | required |
Returns:
Name | Type | Description |
---|---|---|
str | str | Date in the format: YYYY-MM-DD. |
Source code in paperscraper/get_dumps/utils/chemrxiv/utils.py
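The conversion from a timestamp such as 2021-10-15T05:12:32.356Z to YYYY-MM-DD can be sketched with the standard library alone. `to_iso_date` is a hypothetical stand-in written for this example and is not claimed to match how `get_date` is implemented.

```python
from datetime import datetime


def to_iso_date(datestring: str) -> str:
    """Convert e.g. '2021-10-15T05:12:32.356Z' to '2021-10-15'."""
    # %f accepts one to six fractional digits; the trailing "Z" is matched literally.
    parsed = datetime.strptime(datestring, "%Y-%m-%dT%H:%M:%S.%fZ")
    return parsed.strftime("%Y-%m-%d")


converted = to_iso_date("2021-10-15T05:12:32.356Z")
print(converted)  # 2021-10-15
```

Parsing the full timestamp, rather than slicing the first ten characters, has the side benefit of rejecting malformed input with a `ValueError`.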
get_metrics(metrics_list: List[Dict]) -> Dict
Parse ChemRxiv dump entry to extract the access metrics of the paper.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
metrics_list | List[Dict] | A list of single-keyed dictionaries, each containing the key and value for exactly one metric. | required |
Returns:
Name | Type | Description |
---|---|---|
Dict | Dict | A flattened dictionary with all metrics and a timestamp. |
Source code in paperscraper/get_dumps/utils/chemrxiv/utils.py
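Flattening a list of one-metric dictionaries into a single dictionary with a timestamp might look like the sketch below. It assumes each entry is a one-item mapping such as `{"views": 10}`; the real entry layout, the `timestamp` key name, and its format are assumptions, and `flatten_metrics` is a hypothetical name rather than the library function.

```python
from datetime import datetime
from typing import Dict, List, Optional


def flatten_metrics(metrics_list: List[Dict], now: Optional[datetime] = None) -> Dict:
    """Merge single-keyed metric dicts and attach a timestamp."""
    flat: Dict = {}
    for metric in metrics_list:
        flat.update(metric)  # each dict contributes exactly one key
    # `now` is injectable so the result is reproducible in tests.
    flat["timestamp"] = (now or datetime.now()).strftime("%Y-%m-%d %H:%M:%S")
    return flat


demo_metrics = flatten_metrics(
    [{"views": 10}, {"downloads": 3}],
    now=datetime(2021, 10, 15, 5, 12, 32),
)
print(demo_metrics)
```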
parse_dump(source_path: str, target_path: str) -> None
Parses the dump as generated by the chemRxiv API and this repo: https://github.com/cthoyt/chemrxiv-summarize into a format that is equal to that of bioRxiv and medRxiv.
NOTE: This is a lazy parser trying to store all data in memory.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
source_path | str | Path to the source dump | required |
Source code in paperscraper/get_dumps/utils/chemrxiv/utils.py
chemrxiv_api
ChemrxivAPI
Handle OpenEngage API requests, using access. Adapted from https://github.com/fxcoudert/tools/blob/master/chemRxiv/chemRxiv.py.
Source code in paperscraper/get_dumps/utils/chemrxiv/chemrxiv_api.py
__init__(start_date: Optional[str] = None, end_date: Optional[str] = None, page_size: Optional[int] = None, max_retries: int = 10)
Initialize API class.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
start_date | Optional[str] | begin date expressed as YYYY-MM-DD. Defaults to None. | None |
end_date | Optional[str] | end date expressed as YYYY-MM-DD. Defaults to None. | None |
page_size | int | The batch size used to fetch the records from chemrxiv. | None |
max_retries | int | Number of retries in case of error | 10 |
Source code in paperscraper/get_dumps/utils/chemrxiv/chemrxiv_api.py
request(url, method, params=None)
Send an API request to Open Engage.
Source code in paperscraper/get_dumps/utils/chemrxiv/chemrxiv_api.py
query(query, method='get', params=None)
query_generator(query, method: str = 'get', params: Dict = {})
Query for a list of items, with paging. Returns a generator.
Source code in paperscraper/get_dumps/utils/chemrxiv/chemrxiv_api.py
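A paging generator of this kind typically requests one page at a time and yields items until the server returns an empty page. The sketch below illustrates that pattern only; `paged_items`, `fetch_page`, and the `limit`/`skip` parameter names are assumptions made for the example and are not taken from the ChemrxivAPI source.

```python
from typing import Callable, Dict, Iterator, List, Optional


def paged_items(
    fetch_page: Callable[[Dict], List[Dict]],
    params: Optional[Dict] = None,
    page_size: int = 50,
) -> Iterator[Dict]:
    """Yield items page by page until an empty page is returned."""
    base = dict(params or {})  # copy so the caller's dict is never mutated
    skip = 0
    while True:
        page = fetch_page({**base, "limit": page_size, "skip": skip})
        if not page:
            return
        yield from page
        skip += len(page)


# Demo: an in-memory stand-in for the remote endpoint.
DATA = [{"id": i} for i in range(5)]


def fake_fetch(request_params: Dict) -> List[Dict]:
    start = request_params["skip"]
    return DATA[start : start + request_params["limit"]]


ids = [item["id"] for item in paged_items(fake_fetch, page_size=2)]
print(ids)
```

The sketch defaults `params` to `None` instead of `{}`, because a mutable default argument in Python is shared across calls.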
all_preprints()
preprint(article_id)
Information on a given preprint. See also: https://docs.figshare.com/#public_article
utils
Misc utils to download chemRxiv dump