paperscraper.get_dumps
arxiv
Dump arxiv data in JSONL format.
arxiv(start_date: Optional[str] = None, end_date: Optional[str] = None, save_path: str = save_path)
Fetches papers from arXiv based on time range, i.e., start_date and end_date. If the start_date and end_date are not provided, fetches papers from the earliest possible date to the current date. The fetched papers are stored in JSONL format.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| start_date | str | Start date in format YYYY-MM-DD. Defaults to None. | None |
| end_date | str | End date in format YYYY-MM-DD. Defaults to None. | None |
| save_path | str | Path to save the JSONL dump. Defaults to save_path. | save_path |
Source code in paperscraper/get_dumps/arxiv.py
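A call for a bounded date range might look like the following sketch. The helper `date_window` and the wrapper `dump_recent_arxiv` are hypothetical names introduced here; only `arxiv` and its `start_date`, `end_date`, and `save_path` parameters come from the signature above, and the import path is assumed from the module name on this page. The import is deferred into the wrapper so the date helper works without the package installed.

```python
from datetime import date, timedelta
from typing import Optional, Tuple


def date_window(days: int, today: Optional[date] = None) -> Tuple[str, str]:
    """Return (start_date, end_date) as YYYY-MM-DD strings spanning the last `days` days."""
    end = today or date.today()
    start = end - timedelta(days=days)
    return start.isoformat(), end.isoformat()


def dump_recent_arxiv(days: int, save_path: str) -> None:
    """Hypothetical wrapper: dump the last `days` days of arXiv to `save_path`."""
    # Deferred import: needs paperscraper installed and network access.
    from paperscraper.get_dumps import arxiv

    start_date, end_date = date_window(days)
    arxiv(start_date=start_date, end_date=end_date, save_path=save_path)
```

Passing explicit dates keeps the dump small; omitting both falls back to the full range described above.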
biorxiv
Dump bioRxiv data in JSONL format.
biorxiv(start_date: Optional[str] = None, end_date: Optional[str] = None, save_path: str = save_path, max_retries: int = 10)
Fetches papers from bioRxiv based on a time range, i.e., start_date and end_date. If start_date and end_date are not provided, papers are fetched from the launch date of bioRxiv until the current date. The fetched papers are stored in JSONL format in save_path.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| start_date | str | Begin date expressed as YYYY-MM-DD. Defaults to None, i.e., earliest possible. | None |
| end_date | str | End date expressed as YYYY-MM-DD. Defaults to None, i.e., today. | None |
| save_path | str | Path where the dump is stored. Defaults to save_path. | save_path |
| max_retries | int | Number of retries when the API shows connection issues. Defaults to 10. | 10 |
Source code in paperscraper/get_dumps/biorxiv.py
chemrxiv
Dump ChemRxiv data in JSONL format.
chemrxiv(start_date: Optional[str] = None, end_date: Optional[str] = None, save_path: str = SAVE_PATH) -> None
Fetches papers from ChemRxiv based on a time range, i.e., start_date and end_date. If start_date and end_date are not provided, papers are fetched from the launch date of ChemRxiv until the current date. The fetched papers are stored in JSONL format in save_path.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| start_date | str | Begin date expressed as YYYY-MM-DD. Defaults to None, i.e., earliest possible. | None |
| end_date | str | End date expressed as YYYY-MM-DD. Defaults to None, i.e., today. | None |
| save_path | str | Path where the dump is stored. Defaults to SAVE_PATH. | SAVE_PATH |
Source code in paperscraper/get_dumps/chemrxiv.py
medrxiv
Dump medRxiv data in JSONL format.
medrxiv(start_date: Optional[str] = None, end_date: Optional[str] = None, save_path: str = save_path, max_retries: int = 10)
Fetches papers from medRxiv based on a time range, i.e., start_date and end_date. If start_date and end_date are not provided, papers are fetched from the launch date of medRxiv until the current date. The fetched papers are stored in JSONL format in save_path.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| start_date | str | Begin date expressed as YYYY-MM-DD. Defaults to None, i.e., earliest possible. | None |
| end_date | str | End date expressed as YYYY-MM-DD. Defaults to None, i.e., today. | None |
| save_path | str | Path where the dump is stored. Defaults to save_path. | save_path |
| max_retries | int | Number of retries when the API shows connection issues. Defaults to 10. | 10 |
Source code in paperscraper/get_dumps/medrxiv.py
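All four dump functions write JSONL, i.e., one JSON object per line. A minimal sketch of reading such a file back with the standard library follows; `read_dump` is a hypothetical helper, and no particular field names are assumed for the records.

```python
import json
from typing import Dict, Iterator


def read_dump(path: str) -> Iterator[Dict]:
    """Yield one dict per non-empty line of a JSONL file."""
    with open(path, "r", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines
                yield json.loads(line)
```

Because the function is a generator, a large dump can be filtered record by record without loading the whole file.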
utils
chemrxiv
chemrxiv_api
ChemrxivAPI
Handle OpenEngage API requests, using access. Adapted from https://github.com/fxcoudert/tools/blob/master/chemRxiv/chemRxiv.py.
Source code in paperscraper/get_dumps/utils/chemrxiv/chemrxiv_api.py
__init__(start_date: Optional[str] = None, end_date: Optional[str] = None, page_size: Optional[int] = None, max_retries: int = 10)
Initialize API class.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| start_date | Optional[str] | Begin date expressed as YYYY-MM-DD. Defaults to None. | None |
| end_date | Optional[str] | End date expressed as YYYY-MM-DD. Defaults to None. | None |
| page_size | int | The batch size used to fetch the records from ChemRxiv. | None |
| max_retries | int | Number of retries in case of error. | 10 |
Source code in paperscraper/get_dumps/utils/chemrxiv/chemrxiv_api.py
request(url, method, params=None, parse_json: bool = False)
Send an API request to OpenEngage.
Source code in paperscraper/get_dumps/utils/chemrxiv/chemrxiv_api.py
query(query, method='get', params=None)
query_generator(query, method: str = 'get', params: Optional[Dict] = None)
Query for a list of items, with paging. Returns a generator.
Source code in paperscraper/get_dumps/utils/chemrxiv/chemrxiv_api.py
all_preprints()
preprint(article_id)
Information on a given preprint. See also: https://docs.figshare.com/#public_article
crossref_api
Crossref-based fallback for ChemRxiv dumps.
ChemRxiv's primary OpenEngage API can be blocked by Cloudflare (HTTP 403) in some environments. This module provides a fallback based on Crossref's public API using the ChemRxiv DOI prefix (10.26434).
NOTE: Crossref does not expose ChemRxiv abstracts, categories, or usage metrics. Those fields are therefore left empty in the converted dump format.
CrossrefChemrxivAPI
Fetch ChemRxiv metadata from Crossref.
This class queries Crossref's Works endpoint filtered by the ChemRxiv DOI
prefix (10.26434) and date range. Results are fetched using cursor-based
pagination.
Source code in paperscraper/get_dumps/utils/chemrxiv/crossref_api.py
__init__(start_date: str, end_date: str, page_size: int = 1000, max_retries: int = 10, mailto: Optional[str] = None, request_delay_seconds: float = 0.35)
Initialize the Crossref fallback client.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| start_date | str | Start of the posted-date range (YYYY-MM-DD). | required |
| end_date | str | End of the posted-date range (YYYY-MM-DD). | required |
| page_size | int | Number of results per page (Crossref max is 1000). | 1000 |
| max_retries | int | Max retries for transient HTTP status codes. | 10 |
| mailto | Optional[str] | Optional contact email to include in the request (Crossref recommends this for polite usage). | None |
| request_delay_seconds | float | Delay between page requests. This is used to avoid hammering Crossref and also keeps long-range dumps from completing too quickly in tests that expect the dumper to be long-running. | 0.35 |
Source code in paperscraper/get_dumps/utils/chemrxiv/crossref_api.py
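How `max_retries` and `request_delay_seconds` interact can be illustrated with a small retry loop. This is a sketch under stated assumptions, not the class's actual code: the set of status codes treated as transient, the `send` callable, and the use of `RuntimeError` are all choices made for the example.

```python
import time
from typing import Callable, Tuple

# Assumed set of retryable status codes; the real client may use a different one.
TRANSIENT = {429, 500, 502, 503, 504}


def call_with_retries(
    send: Callable[[], Tuple[int, dict]],
    max_retries: int = 10,
    delay_seconds: float = 0.35,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Call send() until it succeeds, fails permanently, or retries run out."""
    for attempt in range(max_retries + 1):
        status, body = send()
        if status == 200:
            return body
        # Give up on non-transient codes or once the retry budget is spent.
        if status not in TRANSIENT or attempt == max_retries:
            raise RuntimeError(f"request failed with HTTP {status}")
        sleep(delay_seconds)
    raise RuntimeError("max_retries must be non-negative")
```

Injecting `send` and `sleep` keeps the loop testable without network access or real waiting.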
iter_items() -> Generator[Dict, None, None]
Iterate over raw Crossref work items for the configured date range.
Yields:
| Type | Description |
|---|---|
| Dict | A dict for each work item as returned by Crossref's Works API. |
Raises:
| Type | Description |
|---|---|
| HTTPError | If the request fails with a non-retryable status code, or if retries are exhausted. |
Source code in paperscraper/get_dumps/utils/chemrxiv/crossref_api.py
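Cursor-based pagination of the kind described for this class can be sketched as a generator. Crossref's Works API starts deep paging with the cursor value `*` and returns a `next-cursor` field alongside `items` in the response `message`. The `fetch_page` callable is a hypothetical stand-in for the HTTP request, so the sketch runs without network access; it is not the method's actual implementation.

```python
from typing import Callable, Dict, Iterator, List


def iter_cursor_pages(fetch_page: Callable[[str], Dict]) -> Iterator[Dict]:
    """Yield work items page by page until an empty page or missing cursor."""
    cursor = "*"  # Crossref's starting cursor value
    while True:
        message = fetch_page(cursor).get("message", {})
        items: List[Dict] = message.get("items", [])
        if not items:
            break  # an empty page marks the end of the result set
        yield from items
        next_cursor = message.get("next-cursor")
        if not next_cursor:
            break
        cursor = next_cursor
```

In a real client, `fetch_page` would issue the request with the date and DOI-prefix filters and return the decoded JSON body.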
crossref_item_to_paper(item: Dict) -> Dict
Convert a Crossref work item into the ChemRxiv dump schema.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| item | Dict | A single work item dict from Crossref's Works API. | required |
Returns:
| Type | Description |
|---|---|
| Dict | A dict compatible with the JSONL dump schema used for ChemRxiv in this package. |
Source code in paperscraper/get_dumps/utils/chemrxiv/crossref_api.py
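The shape of such a conversion can be sketched as follows. The input keys (`title`, `DOI`, `author` with `given`/`family`, and `posted` with `date-parts`) follow Crossref's work-item format; the output keys are assumptions for illustration, since the package's dump schema is not listed on this page. `item_to_paper_sketch` is a hypothetical name.

```python
from typing import Dict, List


def item_to_paper_sketch(item: Dict) -> Dict:
    """Map a Crossref work item to a flat record (output keys are assumptions)."""
    titles: List[str] = item.get("title") or []
    authors = [
        " ".join(part for part in (a.get("given"), a.get("family")) if part)
        for a in item.get("author", [])
    ]
    date_parts = (item.get("posted", {}).get("date-parts") or [[]])[0]
    date = ""
    if date_parts:
        # Pad a missing month or day with 1 so partial dates still format.
        year, month, day = (list(date_parts) + [1, 1])[:3]
        date = f"{year:04d}-{month:02d}-{day:02d}"
    return {
        "title": titles[0] if titles else "",
        "doi": item.get("DOI", ""),
        "authors": "; ".join(authors),
        "date": date,
        "abstract": "",  # Crossref does not expose ChemRxiv abstracts
    }
```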
utils
Miscellaneous utilities for downloading the ChemRxiv dump.
get_author(author_list: List[Dict]) -> str
Parse a ChemRxiv dump entry to extract the author list.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| author_list | list | List of dicts, one per author. | required |
Returns:
| Type | Description |
|---|---|
| str | Semicolon-concatenated author list. |
Source code in paperscraper/get_dumps/utils/chemrxiv/utils.py
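A minimal sketch of this kind of joining is shown below. The per-author keys `firstName` and `lastName` and the `"; "` separator are assumptions made for the example; the page only states that the input has one dict per author and that the result is semicolon-concatenated. `join_authors` is a hypothetical name.

```python
from typing import Dict, List


def join_authors(author_list: List[Dict]) -> str:
    """Join 'first last' names with '; ' (key names are assumptions)."""
    names = [
        f"{a.get('firstName', '')} {a.get('lastName', '')}".strip()
        for a in author_list
    ]
    # Drop authors for whom neither name part was present.
    return "; ".join(name for name in names if name)
```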
get_categories(category_list: List[Dict]) -> str
Parse a ChemRxiv dump entry to extract the categories of the paper.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| category_list | list | List of dicts, one per category. | required |
Returns:
| Type | Description |
|---|---|
| str | Semicolon-concatenated category list. |
Source code in paperscraper/get_dumps/utils/chemrxiv/utils.py
get_date(datestring: str) -> str
Get the date of a ChemRxiv dump entry.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| datestring | str | String in the format: 2021-10-15T05:12:32.356Z | required |
Returns:
| Type | Description |
|---|---|
| str | Date in the format: YYYY-MM-DD. |
Source code in paperscraper/get_dumps/utils/chemrxiv/utils.py
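The conversion from the documented input format to YYYY-MM-DD can be sketched with the standard library. This is one possible way to do it, not necessarily how the function is implemented; `to_iso_date` is a hypothetical name.

```python
from datetime import datetime


def to_iso_date(datestring: str) -> str:
    """Convert a timestamp like '2021-10-15T05:12:32.356Z' to 'YYYY-MM-DD'."""
    # %f accepts one to six fractional-second digits; the trailing Z is literal.
    parsed = datetime.strptime(datestring, "%Y-%m-%dT%H:%M:%S.%fZ")
    return parsed.strftime("%Y-%m-%d")
```

Parsing the full timestamp, rather than slicing the first ten characters, makes malformed input fail loudly with a `ValueError`.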
get_metrics(metrics_list: List[Dict]) -> Dict
Parse a ChemRxiv dump entry to extract the access metrics of the paper.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| metrics_list | List[Dict] | A list of single-key dictionaries, each containing the key and value for exactly one metric. | required |
Returns:
| Type | Description |
|---|---|
| Dict | A flattened dictionary with all metrics and a timestamp. |
Source code in paperscraper/get_dumps/utils/chemrxiv/utils.py
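Flattening a list of one-metric dictionaries can be sketched as below. The input shape is read literally from the parameter description; the `timestamp` key name and its ISO 8601 UTC format are assumptions for this example, and `flatten_metrics` is a hypothetical name.

```python
from datetime import datetime, timezone
from typing import Dict, List


def flatten_metrics(metrics_list: List[Dict]) -> Dict:
    """Merge one-metric dicts into a single dict and add a timestamp."""
    flat: Dict = {}
    for metric in metrics_list:
        flat.update(metric)  # later entries overwrite earlier duplicates
    # Key name and format of the timestamp are assumptions of this sketch.
    flat["timestamp"] = datetime.now(timezone.utc).isoformat()
    return flat
```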
parse_dump(source_path: str, target_path: str) -> None
Parses the dump generated by the ChemRxiv API and by this repository: https://github.com/cthoyt/chemrxiv-summarize into a format identical to that of bioRxiv and medRxiv.
NOTE: This is a lazy parser that tries to store all data in memory.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| source_path | str | Path to the source dump. | required |
Source code in paperscraper/get_dumps/utils/chemrxiv/utils.py
download_full_crossref(save_dir: str, api: Optional[CrossrefChemrxivAPI] = None) -> None
Download ChemRxiv records via Crossref into per-item JSON payloads.
This mirrors the behavior of the OpenEngage backend by storing one JSON payload per record in save_dir. The payloads are raw Crossref work items.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| save_dir | str | Directory where per-item payloads are stored. | required |
| api | Optional[CrossrefChemrxivAPI] | Crossref API client. If None, uses the widest possible date range. | None |
Source code in paperscraper/get_dumps/utils/chemrxiv/utils.py
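Storing one JSON payload per record might look like the sketch below. Naming files by a running index is an assumption made for the example, as the actual naming scheme is not documented here; `save_payloads` is a hypothetical helper that accepts any iterable of work items.

```python
import json
import os
from typing import Dict, Iterable


def save_payloads(items: Iterable[Dict], save_dir: str) -> int:
    """Write each item to its own JSON file in save_dir; return the count."""
    os.makedirs(save_dir, exist_ok=True)
    count = 0
    for index, item in enumerate(items):
        # Index-based file names are an assumption of this sketch.
        path = os.path.join(save_dir, f"{index:08d}.json")
        with open(path, "w", encoding="utf-8") as handle:
            json.dump(item, handle)
        count += 1
    return count
```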
parse_dump_crossref(source_path: str, target_path: str) -> None
Parse Crossref payloads into the ChemRxiv JSONL dump format.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| source_path | str | Directory containing per-item Crossref JSON payloads. | required |
| target_path | str | JSONL output path. | required |
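The step from a directory of per-item payloads to a single JSONL file can be sketched as follows. `payloads_to_jsonl` and its `convert` parameter are hypothetical; `convert` stands in for whatever per-item conversion is applied and defaults to passing items through unchanged.

```python
import json
import os
from typing import Callable, Dict


def payloads_to_jsonl(
    source_path: str,
    target_path: str,
    convert: Callable[[Dict], Dict] = lambda item: item,
) -> int:
    """Write one JSON line per *.json file in source_path; return the count."""
    written = 0
    with open(target_path, "w", encoding="utf-8") as out:
        # Sorting makes the output order deterministic across runs.
        for name in sorted(os.listdir(source_path)):
            if not name.endswith(".json"):
                continue
            with open(os.path.join(source_path, name), "r", encoding="utf-8") as handle:
                item = json.load(handle)
            out.write(json.dumps(convert(item)) + "\n")
            written += 1
    return written
```

Processing one file at a time keeps memory use flat regardless of how many payloads the directory holds.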