paperscraper.get_dumps
arxiv
Dump arxiv data in JSONL format.
arxiv(start_date: Optional[str] = None, end_date: Optional[str] = None, save_path: str = save_path)
Fetches papers from arXiv based on time range, i.e., start_date and end_date. If the start_date and end_date are not provided, fetches papers from the earliest possible date to the current date. The fetched papers are stored in JSONL format.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| start_date | str | Start date in format YYYY-MM-DD. Defaults to None. | None |
| end_date | str | End date in format YYYY-MM-DD. Defaults to None. | None |
| save_path | str | Path to save the JSONL dump. Defaults to save_path. | save_path |
Source code in paperscraper/get_dumps/arxiv.py
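Because the dump is stored as JSONL (one JSON object per line), it can be read back line by line with the standard library. The following is a minimal sketch and not part of paperscraper: `read_jsonl` is a hypothetical helper, and the field names in the demo records (`title`, `date`) are assumptions chosen only for illustration.

```python
import json
import os
import tempfile
from typing import Dict, Iterator


def read_jsonl(path: str) -> Iterator[Dict]:
    """Yield one dict per non-empty line of a JSONL file (hypothetical helper)."""
    with open(path, "r", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines
                yield json.loads(line)


# Demo on a small hand-written file; the field names are illustrative assumptions.
demo_records = [
    {"title": "Paper A", "date": "2024-01-02"},
    {"title": "Paper B", "date": "2024-01-03"},
]
with tempfile.NamedTemporaryFile(
    "w", suffix=".jsonl", delete=False, encoding="utf-8"
) as tmp:
    for record in demo_records:
        tmp.write(json.dumps(record) + "\n")
    demo_path = tmp.name

loaded = list(read_jsonl(demo_path))
os.remove(demo_path)  # clean up the temporary demo file
print(len(loaded))
```

Reading lazily with a generator keeps memory use flat, which may matter for a dump that spans the full history of a server.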
biorxiv
Dump bioRxiv data in JSONL format.
biorxiv(start_date: Optional[str] = None, end_date: Optional[str] = None, save_path: str = save_path, max_retries: int = 10)
Fetches papers from bioRxiv within a time range given by start_date and end_date. If start_date and end_date are not provided, papers are fetched from the launch date of bioRxiv until the current date. The fetched papers are stored in JSONL format in save_path.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| start_date | str | begin date expressed as YYYY-MM-DD. Defaults to None, i.e., earliest possible. | None |
| end_date | str | end date expressed as YYYY-MM-DD. Defaults to None, i.e., today. | None |
| save_path | str | Path where the dump is stored. Defaults to save_path. | save_path |
| max_retries | int | Number of retries when API shows connection issues. Defaults to 10. | 10 |
Source code in paperscraper/get_dumps/biorxiv.py
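Since a dump over a long range can take a while, it may help to check the date strings before starting. The sketch below is an illustration, not library code: `parse_day`, `validated_range`, and `dump_biorxiv_range` are hypothetical helpers, and the import path `paperscraper.get_dumps` is assumed from the module name at the top of this page. The import is deferred so that the validation part runs without paperscraper installed.

```python
from datetime import datetime
from typing import Tuple

DATE_FORMAT = "%Y-%m-%d"


def parse_day(value: str) -> datetime:
    """Parse a zero-padded YYYY-MM-DD string, raising ValueError otherwise."""
    parsed = datetime.strptime(value, DATE_FORMAT)  # ValueError if malformed
    if parsed.strftime(DATE_FORMAT) != value:
        # strptime tolerates e.g. "2024-1-5"; reject anything not zero-padded
        raise ValueError(f"{value!r} is not zero-padded YYYY-MM-DD")
    return parsed


def validated_range(start_date: str, end_date: str) -> Tuple[str, str]:
    """Return the two dates unchanged if they form a valid, ordered range."""
    if parse_day(start_date) > parse_day(end_date):
        raise ValueError("start_date must not be after end_date")
    return start_date, end_date


def dump_biorxiv_range(start_date: str, end_date: str, save_path: str) -> None:
    """Validate the range, then call biorxiv with the documented parameters."""
    start, end = validated_range(start_date, end_date)
    # Assumed import path; imported lazily so this module loads without it.
    from paperscraper.get_dumps import biorxiv

    biorxiv(start_date=start, end_date=end, save_path=save_path)


print(validated_range("2024-01-01", "2024-01-31"))
```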
chemrxiv
Dump chemRxiv data in JSONL format.
chemrxiv(start_date: Optional[str] = None, end_date: Optional[str] = None, save_path: str = save_path) -> None
Fetches papers from chemRxiv within a time range given by start_date and end_date. If start_date and end_date are not provided, papers are fetched from the launch date of chemRxiv until the current date. The fetched papers are stored in JSONL format in save_path.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| start_date | str | begin date expressed as YYYY-MM-DD. Defaults to None, i.e., earliest possible. | None |
| end_date | str | end date expressed as YYYY-MM-DD. Defaults to None, i.e., today. | None |
| save_path | str | Path where the dump is stored. Defaults to save_path. | save_path |
Source code in paperscraper/get_dumps/chemrxiv.py
medrxiv
Dump medRxiv data in JSONL format.
medrxiv(start_date: Optional[str] = None, end_date: Optional[str] = None, save_path: str = save_path, max_retries: int = 10)
Fetches papers from medRxiv within a time range given by start_date and end_date. If start_date and end_date are not provided, papers are fetched from the launch date of medRxiv until the current date. The fetched papers are stored in JSONL format in save_path.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| start_date | str | begin date expressed as YYYY-MM-DD. Defaults to None, i.e., earliest possible. | None |
| end_date | str | end date expressed as YYYY-MM-DD. Defaults to None, i.e., today. | None |
| save_path | str | Path where the dump is stored. Defaults to save_path. | save_path |
| max_retries | int | Number of retries when API shows connection issues. Defaults to 10. | 10 |
Source code in paperscraper/get_dumps/medrxiv.py
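The max_retries parameter bounds how often a request is retried when the API shows connection issues. The sketch below illustrates that general pattern only; it is not the library's implementation. `with_retries` is a hypothetical helper, and treating `ConnectionError` as the retryable failure is an assumption.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def with_retries(call: Callable[[], T], max_retries: int = 10, delay: float = 0.0) -> T:
    """Run `call`; on ConnectionError retry up to `max_retries` more times."""
    attempts = 0
    while True:
        try:
            return call()
        except ConnectionError:
            attempts += 1
            if attempts > max_retries:
                raise  # retries exhausted: surface the last error
            time.sleep(delay)  # optional pause before the next attempt


# Demo: a stand-in for an API call that fails twice before succeeding.
calls = {"count": 0}


def flaky() -> str:
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("temporary failure")
    return "ok"


result = with_retries(flaky, max_retries=5)
print(result, calls["count"])
```

With this convention, max_retries counts the additional attempts after the first call, so a value of 10 allows up to 11 calls in total.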
utils
chemrxiv
get_author(author_list: List[Dict]) -> str
Parse ChemRxiv dump entry to extract author list
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| author_list | list | List of dicts, one per author. | required |

Returns:

| Name | Type | Description |
|---|---|---|
| str | str | ;-concatenated author list. |
Source code in paperscraper/get_dumps/utils/chemrxiv/utils.py
get_categories(category_list: List[Dict]) -> str
Parse ChemRxiv dump entry to extract the categories of the paper
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| category_list | list | List of dicts, one per category. | required |

Returns:

| Name | Type | Description |
|---|---|---|
| str | str | ;-concatenated category list. |
Source code in paperscraper/get_dumps/utils/chemrxiv/utils.py
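Both get_author and get_categories reduce a list of dicts to a single semicolon-separated string. The sketch below shows that reduction under stated assumptions: the dict keys (`firstName`, `lastName`, `name`) and the `"; "` separator are guesses for illustration, and `join_authors` and `join_categories` are hypothetical stand-ins rather than the library functions.

```python
from typing import Dict, List

SEPARATOR = "; "  # assumed separator; the docs only say ";-concatenated"


def join_authors(author_list: List[Dict]) -> str:
    """Join authors as 'First Last' pairs (key names are assumptions)."""
    return SEPARATOR.join(
        f'{author["firstName"]} {author["lastName"]}' for author in author_list
    )


def join_categories(category_list: List[Dict]) -> str:
    """Join category names (the 'name' key is an assumption)."""
    return SEPARATOR.join(category["name"] for category in category_list)


authors = [
    {"firstName": "Ada", "lastName": "Lovelace"},
    {"firstName": "Alan", "lastName": "Turing"},
]
categories = [{"name": "Catalysis"}, {"name": "Polymers"}]

print(join_authors(authors))        # Ada Lovelace; Alan Turing
print(join_categories(categories))  # Catalysis; Polymers
```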
get_date(datestring: str) -> str
Get the date of a chemRxiv dump entry.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| datestring | str | String in the format: 2021-10-15T05:12:32.356Z | required |

Returns:

| Name | Type | Description |
|---|---|---|
| str | str | Date in the format: YYYY-MM-DD. |
Source code in paperscraper/get_dumps/utils/chemrxiv/utils.py
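The conversion described for get_date can be sketched with the standard library. This is an illustration under the assumption that every input matches the documented format exactly; `to_day` is a hypothetical name, not the library function.

```python
from datetime import datetime


def to_day(datestring: str) -> str:
    """Convert e.g. '2021-10-15T05:12:32.356Z' to 'YYYY-MM-DD'."""
    # %f accepts one to six fractional-second digits; the trailing Z is literal.
    parsed = datetime.strptime(datestring, "%Y-%m-%dT%H:%M:%S.%fZ")
    return parsed.strftime("%Y-%m-%d")


print(to_day("2021-10-15T05:12:32.356Z"))  # 2021-10-15
```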
get_metrics(metrics_list: List[Dict]) -> Dict
Parse ChemRxiv dump entry to extract the access metrics of the paper.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| metrics_list | List[Dict] | A list of single-key dictionaries, each containing the key and value for exactly one metric. | required |

Returns:

| Name | Type | Description |
|---|---|---|
| Dict | Dict | A flattened dictionary with all metrics and a timestamp. |
Source code in paperscraper/get_dumps/utils/chemrxiv/utils.py
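The flattening described for get_metrics can be sketched as follows. This reads the description literally and assumes each entry looks like `{"metric name": value}`; the actual ChemRxiv payload may use a different layout. `flatten_metrics`, the `timestamp` key name, and the timestamp format are assumptions for illustration.

```python
from datetime import datetime
from typing import Dict, List, Optional


def flatten_metrics(metrics_list: List[Dict], timestamp: Optional[str] = None) -> Dict:
    """Merge single-key metric dicts into one dict and add a timestamp."""
    flat: Dict = {}
    for entry in metrics_list:
        flat.update(entry)  # each entry contributes exactly one metric
    # Assumed key name and date format; defaults to the current day.
    flat["timestamp"] = timestamp or datetime.now().strftime("%Y-%m-%d")
    return flat


metrics = [{"views": 10}, {"downloads": 3}]
print(flatten_metrics(metrics, timestamp="2024-01-01"))
```

Passing the timestamp explicitly keeps the function deterministic in tests, while the default records when the metrics were collected.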
parse_dump(source_path: str, target_path: str) -> None
Parses the dump generated by the chemRxiv API and the repository https://github.com/cthoyt/chemrxiv-summarize into the same format as the bioRxiv and medRxiv dumps.
NOTE: This is a lazy parser that tries to store all data in memory.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| source_path | str | Path to the source dump | required |
Source code in paperscraper/get_dumps/utils/chemrxiv/utils.py
chemrxiv_api
ChemrxivAPI
Handle OpenEngage API requests, using access. Adapted from https://github.com/fxcoudert/tools/blob/master/chemRxiv/chemRxiv.py.
Source code in paperscraper/get_dumps/utils/chemrxiv/chemrxiv_api.py
__init__(start_date: Optional[str] = None, end_date: Optional[str] = None, page_size: Optional[int] = None, max_retries: int = 10)
Initialize API class.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| start_date | Optional[str] | begin date expressed as YYYY-MM-DD. Defaults to None. | None |
| end_date | Optional[str] | end date expressed as YYYY-MM-DD. Defaults to None. | None |
| page_size | int | The batch size used to fetch the records from chemrxiv. | None |
| max_retries | int | Number of retries in case of error | 10 |
Source code in paperscraper/get_dumps/utils/chemrxiv/chemrxiv_api.py
request(url, method, params=None, parse_json: bool = False)
Send an API request to Open Engage.
Source code in paperscraper/get_dumps/utils/chemrxiv/chemrxiv_api.py
query(query, method='get', params=None)
query_generator(query, method: str = 'get', params: Optional[Dict] = None)
Query for a list of items, with paging. Returns a generator.
Source code in paperscraper/get_dumps/utils/chemrxiv/chemrxiv_api.py
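Paging through results with a generator, as described for query_generator, follows a common pattern that can be sketched independently of the API. The code below is not the ChemrxivAPI implementation: `paged_items` is a hypothetical helper, `fetch_page` stands in for the HTTP request, and the skip/limit style of paging is an assumption.

```python
from typing import Callable, Dict, Iterator, List


def paged_items(
    fetch_page: Callable[[int, int], List[Dict]], page_size: int = 50
) -> Iterator[Dict]:
    """Yield items page by page until the backend returns a short or empty page."""
    skip = 0
    while True:
        page = fetch_page(skip, page_size)
        if not page:
            return  # nothing left to fetch
        yield from page
        if len(page) < page_size:
            return  # a short page means this was the last one
        skip += page_size


# Demo with an in-memory backend instead of a real API.
backend = [{"id": i} for i in range(5)]
requests_made = []


def fake_fetch(skip: int, limit: int) -> List[Dict]:
    requests_made.append((skip, limit))
    return backend[skip : skip + limit]


items = list(paged_items(fake_fetch, page_size=2))
print(len(items), requests_made)
```

Because the result is a generator, a caller can stop early without requesting the remaining pages.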
all_preprints()
preprint(article_id)
Information on a given preprint.
See also: https://docs.figshare.com/#public_article
utils
Misc utils to download chemRxiv dump