paperscraper.xrxiv
bioRxiv and medRxiv utilities.
xrxiv_api
API for bioRxiv and medRxiv.
XRXivApi
API class.
Source code in paperscraper/xrxiv/xrxiv_api.py
__init__(server: str, launch_date: str, api_base_url: str = 'https://api.biorxiv.org', max_retries: int = 10)
Initialize API class.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
server | str | Name of the preprint server to access. | required |
launch_date | str | Launch date expressed as YYYY-MM-DD. | required |
api_base_url | str | Base URL for the API. Defaults to 'https://api.biorxiv.org'. | 'https://api.biorxiv.org' |
max_retries | int | Maximum number of retries for a request before an error is raised. Defaults to 10. | 10 |
get_papers(start_date: Optional[str] = None, end_date: Optional[str] = None, fields: List[str] = ['title', 'doi', 'authors', 'abstract', 'date', 'journal'], max_retries: int = 10) -> Generator
Get paper metadata.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
start_date | Optional[str] | Begin date. Defaults to None, i.e. the launch date. | None |
end_date | Optional[str] | End date. Defaults to None, i.e. today. | None |
fields | List[str] | Fields to return per paper. Defaults to ['title', 'doi', 'authors', 'abstract', 'date', 'journal']. | ['title', 'doi', 'authors', 'abstract', 'date', 'journal'] |
max_retries | int | Number of retries on connection failure. Defaults to 10. | 10 |
Yields:
Name | Type | Description |
---|---|---|
Generator | Generator | A generator of paper metadata (dict) with the desired fields. |
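
The defaults described above (a missing start date falls back to the launch date, a missing end date falls back to today, and each yielded dict carries only the requested fields) can be sketched as follows. This is a minimal illustration, not the code in `xrxiv_api.py`: `resolve_date_range` and `select_fields` are hypothetical helper names, the dates are placeholders, and no request is sent to any server.

```python
from datetime import date
from typing import Dict, Iterable, Iterator, List, Optional, Tuple

DEFAULT_FIELDS = ["title", "doi", "authors", "abstract", "date", "journal"]


def resolve_date_range(
    launch_date: str,
    start_date: Optional[str] = None,
    end_date: Optional[str] = None,
    today: Optional[date] = None,
) -> Tuple[str, str]:
    """Hypothetical helper: fill in missing dates as the parameter table describes."""
    today = today or date.today()
    # None for start_date means "from the launch date"; None for end_date means "until today".
    return (start_date or launch_date, end_date or today.isoformat())


def select_fields(
    records: Iterable[Dict], fields: Optional[List[str]] = None
) -> Iterator[Dict]:
    """Hypothetical helper: lazily yield one dict per paper, restricted to `fields`."""
    wanted = DEFAULT_FIELDS if fields is None else fields
    for record in records:
        # Fields missing from a record are returned as None (an assumption of this sketch).
        yield {name: record.get(name) for name in wanted}


# Placeholder record standing in for one entry of an API response.
raw = [{"title": "Example preprint", "doi": "10.1101/000000", "server": "biorxiv"}]
begin, end = resolve_date_range("2020-01-01")
papers = list(select_fields(raw, fields=["title", "doi"]))
```

Because `select_fields` is a generator, nothing is processed until the caller iterates, which mirrors the `Generator` return type documented for `get_papers`.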
retry_multi()
Retry a function several times.
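
The page gives no signature or body for `retry_multi()`, so the following is only a generic sketch of the retry-decorator pattern that the description suggests. The name `retry` and its parameters `max_retries`, `delay`, and `exceptions` are assumptions made for illustration and are not taken from paperscraper.

```python
import functools
import time
from typing import Callable, Tuple, Type


def retry(
    max_retries: int = 10,
    delay: float = 0.0,
    exceptions: Tuple[Type[BaseException], ...] = (Exception,),
) -> Callable:
    """Hypothetical decorator: call the wrapped function up to `max_retries` times."""
    if max_retries < 1:
        raise ValueError("max_retries must be at least 1")

    def decorator(func: Callable) -> Callable:
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(1, max_retries + 1):
                try:
                    return func(*args, **kwargs)
                except exceptions:
                    if attempt == max_retries:
                        # Out of attempts: let the last error propagate to the caller.
                        raise
                    time.sleep(delay)

        return wrapper

    return decorator


attempts = []


@retry(max_retries=3)
def flaky_request() -> str:
    """Fails twice with a connection error, then succeeds."""
    attempts.append(1)
    if len(attempts) < 3:
        raise ConnectionError("temporary failure")
    return "ok"


result = flaky_request()
```

Re-raising the final exception, rather than swallowing it, matches the `max_retries` description above: an error is raised once the retries are exhausted.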
xrxiv_query
Query dumps from bioRxiv and medRxiv.
XRXivQuery
Query class.
Source code in paperscraper/xrxiv/xrxiv_query.py
__init__(dump_filepath: str, fields: List[str] = ['title', 'doi', 'authors', 'abstract', 'date', 'journal'])
Initialize the query class.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
dump_filepath | str | Filepath to the dump to be queried. | required |
fields | List[str] | Fields contained in the dump per paper. Defaults to ['title', 'doi', 'authors', 'abstract', 'date', 'journal']. | ['title', 'doi', 'authors', 'abstract', 'date', 'journal'] |
search_keywords(keywords: List[Union[str, List[str]]], fields: List[str] = None, output_filepath: str = None) -> pd.DataFrame
Search for papers in the dump using keywords.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
keywords | List[Union[str, List[str]]] | Top-level items are combined with AND. Items that are themselves lists are combined with OR. | required |
fields | List[str] | Fields to be used in the query search. Defaults to None, i.e. search in all fields excluding date. | None |
output_filepath | str | Optional output filepath where the hits are stored in JSONL format. Defaults to None, i.e. no export to a file. | None |
Returns:
Type | Description |
---|---|
pd.DataFrame | A dataframe with one paper per row. |
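
The AND/OR semantics of `keywords` can be illustrated with a small pure-Python sketch. It is not the implementation in `xrxiv_query.py`: the helpers `matches` and `search` are hypothetical, they operate on a list of dicts instead of a dataframe, and case-insensitive substring matching is an assumption of this sketch.

```python
from typing import Dict, List, Optional, Union

Keyword = Union[str, List[str]]


def matches(paper: Dict, keywords: List[Keyword], fields: Optional[List[str]] = None) -> bool:
    """Hypothetical helper: True if the paper satisfies every top-level keyword item."""
    if fields is None:
        # Mirror the documented default: search all fields except the date.
        fields = [name for name in paper if name != "date"]
    text = " ".join(str(paper.get(name, "")) for name in fields).lower()
    for item in keywords:
        # A plain string is a single required term; a list is a group of alternatives.
        alternatives = [item] if isinstance(item, str) else item
        if not any(term.lower() in text for term in alternatives):
            return False  # one AND-ed item failed
    return True


def search(papers: List[Dict], keywords: List[Keyword], fields: Optional[List[str]] = None) -> List[Dict]:
    """Hypothetical helper: return the papers that match the query."""
    return [paper for paper in papers if matches(paper, keywords, fields)]


# Placeholder records, invented for this example.
papers = [
    {"title": "CRISPR screening in yeast", "abstract": "genome editing", "date": "2021-01-01"},
    {"title": "Deep learning for proteins", "abstract": "a transformer model", "date": "2021-02-01"},
    {"title": "Neural network for CRISPR design", "abstract": "off-target prediction", "date": "2022-03-01"},
]

# Reads as: "crispr" AND ("deep learning" OR "neural network").
hits = search(papers, ["crispr", ["deep learning", "neural network"]])
```

With these placeholder records, only the third paper satisfies both the `crispr` term and one of the two alternatives in the inner list.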