paperscraper.pdf

fallbacks
Functionalities to scrape PDF files of publications.
fallback_wiley_api(paper_metadata: Dict[str, Any], output_path: Path, api_keys: Dict[str, str], max_attempts: int = 2) -> bool
Attempt to download the PDF via the Wiley TDM API (a popular publisher that blocks standard scraping attempts; API access is free for academic users).
This function uses the WILEY_TDM_API_TOKEN environment variable to authenticate with the Wiley TDM API and attempts to download the PDF for the given paper. See https://onlinelibrary.wiley.com/library-info/resources/text-and-datamining for a description of how to obtain your WILEY_TDM_API_TOKEN.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
paper_metadata | dict | Dictionary containing paper metadata. Must include the 'doi' key. | required |
output_path | Path | A pathlib.Path object representing the path where the PDF will be saved. | required |
api_keys | dict | Preloaded API keys. | required |
max_attempts | int | Maximum number of attempts for the API call. | 2 |
Returns:
Name | Type | Description |
---|---|---|
bool | bool | True if the PDF file was successfully downloaded, False otherwise. |
Source code in paperscraper/pdf/fallbacks.py
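
For orientation, a minimal call might look like the following sketch (import path inferred from the source note above; the DOI, token, and output path are placeholders):

```python
from pathlib import Path

from paperscraper.pdf.fallbacks import fallback_wiley_api

# Placeholder token; in practice this is loaded from the
# WILEY_TDM_API_TOKEN environment variable (see load_api_keys in utils).
api_keys = {"WILEY_TDM_API_TOKEN": "your_wiley_token_here"}

success = fallback_wiley_api(
    paper_metadata={"doi": "10.1002/anie.202100000"},  # hypothetical Wiley DOI
    output_path=Path("downloads/paper.pdf"),
    api_keys=api_keys,
    max_attempts=2,
)
print("PDF saved" if success else "Wiley fallback failed")
```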
fallback_bioc_pmc(doi: str, output_path: Path) -> bool
Attempt to download the XML via the BioC-PMC fallback.
This function first converts a given DOI to a PMCID using the NCBI ID Converter API. If a PMCID is found, it constructs the corresponding PMC XML URL and attempts to download the full-text XML.
PubMed Central® (PMC) is a free full-text archive of biomedical and life sciences journal literature at the U.S. National Institutes of Health's National Library of Medicine (NIH/NLM).
Parameters:
Name | Type | Description | Default |
---|---|---|---|
doi | str | The DOI of the paper to retrieve. | required |
output_path | Path | A pathlib.Path object representing the path where the XML file will be saved. | required |
Returns:
Name | Type | Description |
---|---|---|
bool | bool | True if the XML file was successfully downloaded, False otherwise. |
Source code in paperscraper/pdf/fallbacks.py
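
The DOI-to-PMCID conversion described above can be reproduced with a plain request to the NCBI ID Converter API; this is a minimal sketch of that first step, not the library's exact code:

```python
import requests

# Public NCBI ID Converter endpoint (no API key required)
ID_CONVERTER_URL = "https://www.ncbi.nlm.nih.gov/pmc/utils/idconv/v1.0/"

def doi_to_pmcid(doi: str) -> str | None:
    """Resolve a DOI to a PMCID, or None if the paper is not in PMC."""
    response = requests.get(
        ID_CONVERTER_URL, params={"ids": doi, "format": "json"}, timeout=30
    )
    response.raise_for_status()
    records = response.json().get("records", [])
    # The 'pmcid' field is absent when no full text is deposited in PMC.
    return records[0].get("pmcid") if records else None
```

With a PMCID in hand, the function constructs the corresponding PMC XML URL and downloads the full text as described above.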
fallback_elsevier_api(paper_metadata: Dict[str, Any], output_path: Path, api_keys: Dict[str, str]) -> bool
Attempt to download the full text via the Elsevier TDM API. For more information, see: https://www.elsevier.com/about/policies-and-standards/text-and-data-mining (Requires an institutional subscription and an API key provided in the api_keys dictionary under the key "ELSEVIER_TDM_API_KEY".)
Parameters:
Name | Type | Description | Default |
---|---|---|---|
paper_metadata | Dict[str, Any] | Dictionary containing paper metadata. Must include the 'doi' key. | required |
output_path | Path | A pathlib.Path object representing the path where the XML file will be saved. | required |
api_keys | Dict[str, str] | A dictionary containing API keys. Must include the key "ELSEVIER_TDM_API_KEY". | required |
Returns:
Name | Type | Description |
---|---|---|
bool | bool | True if the XML file was successfully downloaded, False otherwise. |
Source code in paperscraper/pdf/fallbacks.py
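
As a sketch of the underlying request (assuming Elsevier's Article Retrieval endpoint; the library's actual request logic may differ), the API key travels in the X-ELS-APIKey header:

```python
import requests

def fetch_elsevier_fulltext(doi: str, api_key: str) -> bytes | None:
    """Sketch: request full-text XML for a DOI from the Elsevier API."""
    response = requests.get(
        f"https://api.elsevier.com/content/article/doi/{doi}",
        headers={"X-ELS-APIKey": api_key, "Accept": "text/xml"},
        timeout=60,
    )
    # Non-200 responses typically mean no TDM entitlement for this DOI.
    return response.content if response.ok else None
```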
fallback_elife_xml(doi: str, output_path: Path) -> bool
Attempt to download the XML via the eLife XML repository on GitHub.
eLife provides open access to their XML files on GitHub, which can be used as a fallback. When multiple versions exist (revised papers), it takes the latest version (e.g., v3 instead of v1).
Parameters:
Name | Type | Description | Default |
---|---|---|---|
doi | str | The DOI of the eLife paper to download. | required |
output_path | Path | A pathlib.Path object representing the path where the XML file will be saved. | required |
Returns:
Name | Type | Description |
---|---|---|
bool | bool | True if the XML file was successfully downloaded, False otherwise. |
Source code in paperscraper/pdf/fallbacks.py
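
A minimal call, with the import path inferred from the source note above and a well-known eLife DOI as the example:

```python
from pathlib import Path

from paperscraper.pdf.fallbacks import fallback_elife_xml

# eLife DOIs take the form 10.7554/eLife.<article-number>
if fallback_elife_xml("10.7554/eLife.09560", Path("downloads/elife_09560")):
    print("latest-version XML saved")
```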
get_elife_xml_index() -> dict
Fetch the eLife XML index from GitHub and return it as a dictionary.
This function retrieves and caches the list of available eLife articles in XML format from the eLife GitHub repository. It ensures that the latest version of each article is accessible for downloading. The index is cached in memory to avoid repeated network requests when processing multiple eLife papers.
Returns:
Name | Type | Description |
---|---|---|
dict | dict | A dictionary where keys are article numbers (as strings) and values are lists of tuples (version, download_url). Each list is sorted by version number. |
Source code in paperscraper/pdf/fallbacks.py
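
Given the documented index shape (article numbers mapping to version-sorted lists of (version, download_url) tuples), selecting the latest revision of a paper reduces to taking the last list entry:

```python
from paperscraper.pdf.fallbacks import get_elife_xml_index

index = get_elife_xml_index()          # cached after the first call
versions = index.get("09560", [])      # hypothetical article number
if versions:
    latest_version, download_url = versions[-1]  # sorted by version number
```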
month_folder(doi: str) -> str
Query the bioRxiv API for the posting date of a given DOI and convert that date to the bioRxiv S3 folder name, rolling over to the next month when the date falls on the month's last day. E.g., a paper posted on the last day of April is treated as May_YYYY.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
doi | str | The DOI for which to retrieve the date. | required |
Returns:
Type | Description |
---|---|
str | Month and year in the format Month_YYYY (e.g., May_2024). |
Source code in paperscraper/pdf/fallbacks.py
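
The rollover rule is easy to illustrate with the standard library; this sketch reimplements only the date logic (the bioRxiv API query is omitted), using the Month_YYYY format from the return description:

```python
import calendar
from datetime import date, timedelta

def rollover_folder(posted: date) -> str:
    """Last day of a month rolls over to the next month's folder."""
    last_day = calendar.monthrange(posted.year, posted.month)[1]
    if posted.day == last_day:
        posted += timedelta(days=1)  # e.g. 2024-04-30 -> 2024-05-01
    return posted.strftime("%B_%Y")

assert rollover_folder(date(2024, 4, 30)) == "May_2024"
assert rollover_folder(date(2024, 4, 15)) == "April_2024"
```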
list_meca_keys(s3_client: BaseClient, bucket: str, prefix: str) -> list
List all .meca object keys under a given prefix in a requester-pays bucket.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
s3_client | BaseClient | S3 client to get the data from. | required |
bucket | str | Bucket to get data from. | required |
prefix | str | Prefix to get data from. | required |
Returns:
Type | Description |
---|---|
list | List of keys, one per existing .meca in the bucket. |
Source code in paperscraper/pdf/fallbacks.py
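
Listing a requester-pays bucket requires passing RequestPayer on every request; a paginated sketch with boto3 (bucket and prefix are placeholders, and this mirrors rather than reproduces the library's code):

```python
import boto3

def list_meca_keys_sketch(bucket: str, prefix: str) -> list[str]:
    """Collect all .meca object keys under a prefix, page by page."""
    s3 = boto3.client("s3")
    keys: list[str] = []
    for page in s3.get_paginator("list_objects_v2").paginate(
        Bucket=bucket, Prefix=prefix, RequestPayer="requester"
    ):
        keys.extend(
            obj["Key"]
            for obj in page.get("Contents", [])
            if obj["Key"].endswith(".meca")
        )
    return keys
```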
find_meca_for_doi(s3_client: BaseClient, bucket: str, key: str, doi_token: str) -> bool
Efficiently inspect manifest.xml within a .meca zip by fetching only the necessary bytes: parse via ZipFile to read manifest.xml and match the DOI token.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
s3_client | BaseClient | S3 client to get the data from. | required |
bucket | str | Bucket to get data from. | required |
key | str | Key of the .meca object to inspect. | required |
doi_token | str | The DOI token that should be matched. | required |
Returns:
Type | Description |
---|---|
bool | Whether or not the DOI could be matched. |
Source code in paperscraper/pdf/fallbacks.py
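
Reading just manifest.xml out of a remote zip works because ZipFile only needs a seekable file object; the sketch below serves S3 range requests through such an object, so only the central directory and the manifest bytes are ever transferred. This illustrates the technique, not the library's exact implementation:

```python
import io
from zipfile import ZipFile

class RangedS3File(io.RawIOBase):
    """Read-only, seekable view of an S3 object backed by ranged GETs."""

    def __init__(self, client, bucket: str, key: str):
        self.client, self.bucket, self.key = client, bucket, key
        head = client.head_object(Bucket=bucket, Key=key, RequestPayer="requester")
        self.size, self.pos = head["ContentLength"], 0

    def seekable(self) -> bool:
        return True

    def readable(self) -> bool:
        return True

    def tell(self) -> int:
        return self.pos

    def seek(self, offset: int, whence: int = io.SEEK_SET) -> int:
        base = {io.SEEK_SET: 0, io.SEEK_CUR: self.pos, io.SEEK_END: self.size}[whence]
        self.pos = base + offset
        return self.pos

    def read(self, n: int = -1) -> bytes:
        if n < 0:
            n = self.size - self.pos
        if n <= 0 or self.pos >= self.size:
            return b""
        end = min(self.pos + n, self.size) - 1
        resp = self.client.get_object(
            Bucket=self.bucket,
            Key=self.key,
            Range=f"bytes={self.pos}-{end}",
            RequestPayer="requester",
        )
        data = resp["Body"].read()
        self.pos += len(data)
        return data

def doi_in_manifest(client, bucket: str, key: str, doi_token: str) -> bool:
    """Open the remote .meca as a zip and check manifest.xml for the token."""
    with ZipFile(io.BufferedReader(RangedS3File(client, bucket, key))) as zf:
        manifest = zf.read("manifest.xml").decode("utf-8", errors="ignore")
    return doi_token in manifest
```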
fallback_s3(doi: str, output_path: Union[str, Path], api_keys: dict, workers: int = 32) -> bool
Download a bioRxiv PDF via the requester-pays S3 bucket using range requests.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
doi | str | The DOI for which to retrieve the PDF (e.g. '10.1101/798496'). | required |
output_path | Union[str, Path] | Path where the PDF will be saved (with .pdf suffix added). | required |
api_keys | dict | Dict containing 'AWS_ACCESS_KEY_ID' and 'AWS_SECRET_ACCESS_KEY'. | required |
workers | int | Number of parallel workers used to inspect candidate .meca archives. | 32 |
Returns:
Type | Description |
---|---|
bool | True if download succeeded, False otherwise. |
Source code in paperscraper/pdf/fallbacks.py
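
A minimal invocation, with the AWS credentials under the documented dictionary keys (values are placeholders; since the bucket is requester-pays, the transfer is billed to this AWS account):

```python
from pathlib import Path

from paperscraper.pdf.fallbacks import fallback_s3

api_keys = {
    "AWS_ACCESS_KEY_ID": "AKIA...",        # placeholder
    "AWS_SECRET_ACCESS_KEY": "wJalr...",   # placeholder
}
ok = fallback_s3("10.1101/798496", Path("downloads/798496"), api_keys=api_keys)
```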
pdf
Functionalities to scrape PDF files of publications.
save_pdf(paper_metadata: Dict[str, Any], filepath: Union[str, Path], save_metadata: bool = False, api_keys: Optional[Union[str, Dict[str, str]]] = None) -> None
Save a PDF file of a paper.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
paper_metadata | Dict[str, Any] | A dictionary with the paper metadata. Must contain the 'doi' key. | required |
filepath | Union[str, Path] | Path to the PDF file to be saved (with or without suffix). | required |
save_metadata | bool | A boolean indicating whether to save paper metadata as a separate json. | False |
api_keys | Optional[Union[str, Dict[str, str]]] | Either a dictionary containing API keys (if already loaded) or a string (path to API keys file). If None, will try to load keys from the environment. | None |
Source code in paperscraper/pdf/pdf.py
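
A typical call might look as follows (import path inferred from the module layout; the DOI is an arXiv example, and any metadata dictionary with a 'doi' key works):

```python
from paperscraper.pdf import save_pdf

paper_data = {"doi": "10.48550/arXiv.2207.03928"}
# Saves paper.pdf; with save_metadata=True, a JSON metadata file
# is written alongside it.
save_pdf(paper_data, filepath="paper.pdf", save_metadata=True)
```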
save_pdf_from_dump(dump_path: str, pdf_path: str, key_to_save: str = 'doi', save_metadata: bool = False, api_keys: Optional[str] = None) -> None
Receives a path to a .jsonl dump with paper metadata and saves the PDF files of each paper.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
dump_path | str | Path to a .jsonl file with paper metadata, one paper per line. | required |
pdf_path | str | Path to a folder where the files will be stored. | required |
key_to_save | str | Key in the paper metadata to use as filename. Has to be 'doi' or 'title'. | 'doi' |
save_metadata | bool | A boolean indicating whether to save paper metadata as a separate json. | False |
api_keys | Optional[str] | Path to a file with API keys. If None, API-based fallbacks will be skipped. | None |
Source code in paperscraper/pdf/pdf.py
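
A sketch of batch usage (file names are placeholders; the dump is assumed to come from one of paperscraper's metadata scrapers):

```python
from paperscraper.pdf import save_pdf_from_dump

save_pdf_from_dump(
    "medrxiv_dump.jsonl",      # hypothetical .jsonl dump, one paper per line
    pdf_path="pdfs/",
    key_to_save="doi",
    api_keys="api_keys.txt",   # optional; API fallbacks are skipped if None
)
```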
utils
load_api_keys(filepath: Optional[str] = None) -> Dict[str, str]
Reads API keys from a file and returns them as a dictionary. The file should have each API key on a separate line in the format: KEY_NAME=API_KEY_VALUE
Example:
WILEY_TDM_API_TOKEN=your_wiley_token_here
ELSEVIER_TDM_API_KEY=your_elsevier_key_here
Parameters:
Name | Type | Description | Default |
---|---|---|---|
filepath | Optional[str] | Optional path to the file containing API keys. | None |
Returns:
Type | Description |
---|---|
Dict[str, str] | A dictionary where keys are API key names and values are their respective API keys. |
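
A usage sketch, with the import path inferred from the module layout and a placeholder keys file in the KEY_NAME=API_KEY_VALUE format shown above:

```python
from paperscraper.pdf.utils import load_api_keys

api_keys = load_api_keys("api_keys.txt")   # placeholder path
wiley_token = api_keys.get("WILEY_TDM_API_TOKEN")
elsevier_key = api_keys.get("ELSEVIER_TDM_API_KEY")
```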