paperscraper.arxiv
XRXivQuery
Query class.
Source code in paperscraper/xrxiv/xrxiv_query.py
__init__(dump_filepath: str, fields: List[str] = ['title', 'doi', 'authors', 'abstract', 'date', 'journal'])
Initialize the query class.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
dump_filepath | str | Filepath to the dump to be queried. | required |
fields | List[str] | Fields contained in the dump per paper. Defaults to ['title', 'doi', 'authors', 'abstract', 'date', 'journal']. | ['title', 'doi', 'authors', 'abstract', 'date', 'journal'] |
Source code in paperscraper/xrxiv/xrxiv_query.py
search_keywords(keywords: List[Union[str, List[str]]], fields: List[str] = None, output_filepath: str = None) -> pd.DataFrame
Search for papers in the dump using keywords.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
keywords | List[Union[str, List[str]]] | Items will be AND separated. If items are lists themselves, they will be OR separated. | required |
fields | List[str] | Fields to be used in the query search. Defaults to None, i.e. search in all fields excluding date. | None |
output_filepath | str | Optional output filepath where to store the hits in JSONL format. Defaults to None, i.e. no export to a file. | None |
Returns:
Type | Description |
---|---|
DataFrame | pd.DataFrame: A dataframe with one paper per row. |
Source code in paperscraper/xrxiv/xrxiv_query.py
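The AND/OR semantics of `keywords` can be illustrated with a small standalone sketch. It does not use paperscraper: the `matches` helper, the sample papers, and the case-insensitive substring matching are assumptions made for illustration, and the actual matching in `XRXivQuery.search_keywords` may differ.

```python
from typing import Dict, List, Union

Keywords = List[Union[str, List[str]]]


def matches(paper: Dict[str, str], keywords: Keywords, fields: List[str]) -> bool:
    """Hypothetical helper: True if every outer item matches the paper."""
    # Join the selected fields into one lowercase text blob to search in.
    text = " ".join(str(paper.get(field, "")) for field in fields).lower()
    for item in keywords:
        synonyms = [item] if isinstance(item, str) else item
        # OR within one item: at least one synonym must occur in the text.
        if not any(s.lower() in text for s in synonyms):
            return False  # AND across items: one failed item rejects the paper.
    return True


# Made-up records, not real dump entries.
papers = [
    {"title": "Graph neural networks for molecules", "abstract": "We study GNNs."},
    {"title": "Transformers for proteins", "abstract": "Attention models."},
    {"title": "Graph kernels", "abstract": "Classical methods for text."},
]
keywords = [["graph", "transformer"], ["molecule", "protein"]]
hits = [p["title"] for p in papers if matches(p, keywords, ["title", "abstract"])]
print(hits)
```

The third paper is rejected because it satisfies the first item ("graph") but neither alternative of the second item.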
dump_papers(papers: pd.DataFrame, filepath: str) -> None
Receives a pd.DataFrame, one paper per row and dumps it into a .jsonl file with one paper per line.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
papers | DataFrame | A dataframe of paper metadata, one paper per row. | required |
filepath | str | Path to dump the papers, has to end with | required |
Source code in paperscraper/utils.py
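The one-paper-per-line layout described above can be sketched with the standard library alone. `dump_records` is a hypothetical stand-in that takes a list of dicts instead of a DataFrame, and the records are placeholders; it only shows the file layout, not how `dump_papers` is implemented.

```python
import json
import os
import tempfile
from typing import Dict, List


def dump_records(records: List[Dict], filepath: str) -> None:
    """Hypothetical stand-in: write one JSON object per line."""
    with open(filepath, "w", encoding="utf-8") as handle:
        for record in records:
            handle.write(json.dumps(record) + "\n")


# Placeholder metadata, not real papers.
records = [
    {"title": "Paper A", "doi": "10.0000/a"},
    {"title": "Paper B", "doi": "10.0000/b"},
]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "papers.jsonl")
    dump_records(records, path)
    # Reading back line by line recovers one record per line.
    with open(path, encoding="utf-8") as handle:
        loaded = [json.loads(line) for line in handle]

print(len(loaded))
```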
get_query_from_keywords(keywords: List[Union[str, List[str]]], start_date: str = 'None', end_date: str = 'None') -> str
Receives a list of keywords and returns the query for the arxiv API.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
keywords | List[Union[str, List[str]]] | Items will be AND separated. If items are lists themselves, they will be OR separated. | required |
start_date | str | Start date for the search. Needs to be in format: YYYY-MM-DD, e.g. '2020-07-20'. Defaults to 'None', i.e. no specific dates are used. | 'None' |
end_date | str | End date for the search. Same notation as start_date. | 'None' |
Returns:
Name | Type | Description |
---|---|---|
str | str | Query to pass to the arXiv API. |
Source code in paperscraper/arxiv/utils.py
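As a rough illustration of how a nested keyword list can map onto a boolean query string, consider the sketch below. `build_query` is a hypothetical helper, and the use of the `all:` field prefix with quoted terms is an assumption about the query shape; the string actually returned by `get_query_from_keywords` may be formatted differently, and date handling is left out.

```python
from typing import List, Union


def build_query(keywords: List[Union[str, List[str]]]) -> str:
    """Hypothetical helper: AND across outer items, OR across inner synonyms."""
    clauses = []
    for item in keywords:
        synonyms = [item] if isinstance(item, str) else item
        # Quote each term so multi-word keywords stay together.
        terms = " OR ".join(f'all:"{s}"' for s in synonyms)
        # Parenthesize OR groups so AND does not bind across them.
        clauses.append(f"({terms})" if len(synonyms) > 1 else terms)
    return " AND ".join(clauses)


query = build_query([["covid", "sars-cov-2"], "vaccine"])
print(query)
```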
get_arxiv_papers_local(keywords: List[Union[str, List[str]]], fields: List[str] = None, output_filepath: str = None) -> pd.DataFrame
Search for papers in the dump using keywords.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
keywords | List[Union[str, List[str]]] | Items will be AND separated. If items are lists themselves, they will be OR separated. | required |
fields | List[str] | Fields to be used in the query search. Defaults to None, i.e. search in all fields excluding date. | None |
output_filepath | str | Optional output filepath where to store the hits in JSONL format. Defaults to None, i.e. no export to a file. | None |
Returns:
Type | Description |
---|---|
DataFrame | pd.DataFrame: A dataframe with one paper per row. |
Source code in paperscraper/arxiv/arxiv.py
get_arxiv_papers_api(query: str, fields: List = ['title', 'authors', 'date', 'abstract', 'journal', 'doi'], max_results: int = 99999, client_options: Dict = {'num_retries': 10}, search_options: Dict = dict(), verbose: bool = True) -> pd.DataFrame
Performs arxiv API request of a given query and returns list of papers with fields as desired.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
query | str | Query to arxiv API. Needs to match the arxiv API notation. | required |
fields | List | List of strings with fields to keep in output. | ['title', 'authors', 'date', 'abstract', 'journal', 'doi'] |
max_results | int | Maximal number of results, defaults to 99999. | 99999 |
client_options | Dict | Optional arguments for | {'num_retries': 10} |
search_options | Dict | Optional arguments for | dict() |
Returns:
Type | Description |
---|---|
DataFrame | pd.DataFrame: One row per paper. |
Source code in paperscraper/arxiv/arxiv.py
get_and_dump_arxiv_papers(keywords: List[Union[str, List[str]]], output_filepath: str, fields: List = ['title', 'authors', 'date', 'abstract', 'journal', 'doi'], start_date: str = 'None', end_date: str = 'None', backend: Literal['api', 'local', 'infer'] = 'api', *args, **kwargs)
Combines get_arxiv_papers and dump_papers.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
keywords | List[Union[str, List[str]]] | List of keywords for arxiv search. The outer list level will be considered as AND separated keys, the inner level as OR separated. | required |
output_filepath | str | Path where the dump will be saved. | required |
fields | List | List of strings with fields to keep in output. Defaults to ['title', 'authors', 'date', 'abstract', 'journal', 'doi']. | ['title', 'authors', 'date', 'abstract', 'journal', 'doi'] |
start_date | str | Start date for the search. Needs to be in format: YYYY-MM-DD, e.g. '2020-07-20'. Defaults to 'None', i.e. no specific dates are used. | 'None' |
end_date | str | End date for the search. Same notation as start_date. | 'None' |
backend | Literal['api', 'local', 'infer'] | If | 'api' |
Source code in paperscraper/arxiv/arxiv.py
utils
format_date(date_str: str) -> str
Converts a date in YYYY-MM-DD format to arXiv's YYYYMMDDTTTT format.
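A minimal sketch of such a conversion follows. `to_arxiv_date` is a hypothetical name, and it assumes that TTTT stands for a 24-hour HHMM time that is fixed to 0000 (midnight); the library's own `format_date` may choose a different time component.

```python
from datetime import datetime


def to_arxiv_date(date_str: str) -> str:
    """Hypothetical helper: 'YYYY-MM-DD' -> 'YYYYMMDDTTTT' with TTTT assumed to be 0000."""
    # strptime raises ValueError if the input does not match YYYY-MM-DD.
    parsed = datetime.strptime(date_str, "%Y-%m-%d")
    return parsed.strftime("%Y%m%d") + "0000"


stamp = to_arxiv_date("2020-07-20")
print(stamp)
```

Because parsing is strict, a slash-separated date such as '2020/07/20' raises a `ValueError` in this sketch rather than being silently misread.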