paperscraper.arxiv
XRXivQuery
Query class.
Source code in paperscraper/xrxiv/xrxiv_query.py
__init__(dump_filepath: str, fields: List[str] = ['title', 'doi', 'authors', 'abstract', 'date', 'journal'])
Initialize the query class.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `dump_filepath` | `str` | Filepath to the dump to be queried. | required |
| `fields` | `List[str]` | Fields contained in the dump per paper. | `['title', 'doi', 'authors', 'abstract', 'date', 'journal']` |
Source code in paperscraper/xrxiv/xrxiv_query.py
search_keywords(keywords: List[Union[str, List[str]]], fields: List[str] = None, output_filepath: str = None) -> pd.DataFrame
Search for papers in the dump using keywords.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `keywords` | `List[Union[str, List[str]]]` | Items are combined with AND. Items that are themselves lists are combined internally with OR. | required |
| `fields` | `List[str]` | Fields to be used in the query search. Defaults to None, i.e., search in all fields excluding date. | `None` |
| `output_filepath` | `str` | Optional filepath for storing the hits in JSONL format. Defaults to None, i.e., no export to a file. | `None` |

Returns:
| Type | Description |
|---|---|
| `pd.DataFrame` | A dataframe with one paper per row. |
Source code in paperscraper/xrxiv/xrxiv_query.py
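The AND/OR semantics of `keywords` can be illustrated with a small standalone sketch. It does not reproduce the library's implementation: the helper `matches` is hypothetical, and case-insensitive substring matching is an assumption made only for this example.

```python
from typing import List, Union

Keywords = List[Union[str, List[str]]]


def matches(text: str, keywords: Keywords) -> bool:
    """Return True if `text` satisfies every item in `keywords`.

    Hypothetical helper; substring matching is an assumption of this sketch.
    """
    haystack = text.lower()
    for item in keywords:
        # A plain string behaves like an OR group with a single member.
        synonyms = [item] if isinstance(item, str) else item
        # OR inside a group: at least one synonym has to occur.
        if not any(synonym.lower() in haystack for synonym in synonyms):
            return False
    # AND across groups: every group found a match.
    return True


covid = ["COVID-19", "SARS-CoV-2"]
ai = ["Artificial intelligence", "Deep learning", "Machine learning"]

print(matches("Deep learning for SARS-CoV-2 drug discovery", [covid, ai]))  # True
print(matches("Deep learning for protein folding", [covid, ai]))  # False
```

Here `[covid, ai]` reads as "(COVID-19 OR SARS-CoV-2) AND (Artificial intelligence OR Deep learning OR Machine learning)".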
dump_papers(papers: pd.DataFrame, filepath: str) -> None
Receives a pd.DataFrame with one paper per row and dumps it into a .jsonl file with one paper per line.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `papers` | `pd.DataFrame` | A dataframe of paper metadata, one paper per row. | required |
| `filepath` | `str` | Path to dump the papers, has to end with | required |
Source code in paperscraper/utils.py
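The one-paper-per-line layout can be sketched with the standard library alone. This is not the code behind `dump_papers`: the helpers `dump_records` and `load_records` are hypothetical, and plain dictionaries stand in for dataframe rows.

```python
import json
import os
import tempfile
from typing import Dict, List


def dump_records(records: List[Dict], filepath: str) -> None:
    """Write each record as one JSON object per line (hypothetical helper)."""
    with open(filepath, "w", encoding="utf-8") as handle:
        for record in records:
            handle.write(json.dumps(record) + "\n")


def load_records(filepath: str) -> List[Dict]:
    """Read the file back, one JSON object per non-empty line."""
    with open(filepath, encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]


# Made-up example data, not real papers.
papers = [
    {"title": "Paper A", "doi": "10.0000/a"},
    {"title": "Paper B", "doi": "10.0000/b"},
]
path = os.path.join(tempfile.mkdtemp(), "papers.jsonl")
dump_records(papers, path)
print(load_records(path) == papers)
```

Because each line is an independent JSON document, such a file can be appended to or streamed without loading it in full.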
get_server_dumps_dir() -> str
get_query_from_keywords(keywords: List[Union[str, List[str]]], start_date: str = 'None', end_date: str = 'None') -> str
Receives a list of keywords and returns the query for the arXiv API.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `keywords` | `List[Union[str, List[str]]]` | Items are combined with AND. Items that are themselves lists are combined internally with OR. | required |
| `start_date` | `str` | Start date for the search. Needs to be in format YYYY-MM-DD, e.g. '2020-07-20'. Defaults to 'None', i.e. no specific dates are used. | `'None'` |
| `end_date` | `str` | End date for the search. Same notation as start_date. | `'None'` |

Returns:
| Type | Description |
|---|---|
| `str` | Query to pass to the arXiv API. |
Source code in paperscraper/arxiv/utils.py
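As a rough sketch of how such a query string could be assembled, the example below uses the arXiv API's `all:` field prefix, its `AND`/`OR` operators, and a `submittedDate:[... TO ...]` range filter. The functions `build_query` and `to_arxiv_date` are hypothetical and not taken from the library. The sketch also assumes a midnight time suffix (`0000`), uses Python `None` rather than the string `'None'` for missing dates, and leaves out quoting of multi-word phrases.

```python
from typing import List, Optional, Union

Keywords = List[Union[str, List[str]]]


def to_arxiv_date(date_str: str) -> str:
    """'2020-07-20' -> '202007200000'; the '0000' time part is an assumption."""
    return date_str.replace("-", "") + "0000"


def build_query(
    keywords: Keywords,
    start_date: Optional[str] = None,
    end_date: Optional[str] = None,
) -> str:
    """Join outer items with AND and inner lists with OR (hypothetical helper)."""
    groups = []
    for item in keywords:
        if isinstance(item, str):
            groups.append(f"all:{item}")
        else:
            # Parenthesize OR groups so they bind tighter than the outer AND.
            groups.append("(" + " OR ".join(f"all:{s}" for s in item) + ")")
    query = " AND ".join(groups)
    if start_date and end_date:
        start, end = to_arxiv_date(start_date), to_arxiv_date(end_date)
        query += f" AND submittedDate:[{start} TO {end}]"
    return query


print(build_query(["graphene", ["battery", "supercapacitor"]]))
# all:graphene AND (all:battery OR all:supercapacitor)
print(build_query(["graphene"], "2020-07-20", "2020-12-31"))
# all:graphene AND submittedDate:[202007200000 TO 202012310000]
```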
get_arxiv_papers_local(keywords: List[Union[str, List[str]]], fields: List[str] = None, output_filepath: str = None) -> pd.DataFrame
Search for papers in the dump using keywords.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `keywords` | `List[Union[str, List[str]]]` | Items are combined with AND. Items that are themselves lists are combined internally with OR. | required |
| `fields` | `List[str]` | Fields to be used in the query search. Defaults to None, i.e., search in all fields excluding date. | `None` |
| `output_filepath` | `str` | Optional filepath for storing the hits in JSONL format. Defaults to None, i.e., no export to a file. | `None` |

Returns:
| Type | Description |
|---|---|
| `pd.DataFrame` | A dataframe with one paper per row. |
Source code in paperscraper/arxiv/arxiv.py
get_arxiv_papers_api(query: str, fields: List = ['title', 'authors', 'date', 'abstract', 'journal', 'doi'], max_results: int = 99999, client_options: Dict = {'num_retries': 10}, search_options: Dict = dict(), verbose: bool = True) -> pd.DataFrame
Performs an arXiv API request for a given query and returns the papers with the desired fields.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `query` | `str` | Query to the arXiv API. Needs to match the arXiv API notation. | required |
| `fields` | `List` | List of strings with fields to keep in output. | `['title', 'authors', 'date', 'abstract', 'journal', 'doi']` |
| `max_results` | `int` | Maximum number of results. | `99999` |
| `client_options` | `Dict` | Optional arguments for | `{'num_retries': 10}` |
| `search_options` | `Dict` | Optional arguments for | `dict()` |

Returns:
| Type | Description |
|---|---|
| `pd.DataFrame` | One row per paper. |
Source code in paperscraper/arxiv/arxiv.py
get_and_dump_arxiv_papers(keywords: List[Union[str, List[str]]], output_filepath: str, fields: List = ['title', 'authors', 'date', 'abstract', 'journal', 'doi'], start_date: str = 'None', end_date: str = 'None', backend: Literal['api', 'local', 'infer'] = 'api', *args, **kwargs)
Combines get_arxiv_papers and dump_papers.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `keywords` | `List[Union[str, List[str]]]` | List of keywords for the arXiv search. The outer list level is treated as AND-separated keys, the inner level as OR-separated. | required |
| `output_filepath` | `str` | Path where the dump will be saved. | required |
| `fields` | `List` | List of strings with fields to keep in output. | `['title', 'authors', 'date', 'abstract', 'journal', 'doi']` |
| `start_date` | `str` | Start date for the search. Needs to be in format YYYY/MM/DD, e.g. '2020/07/20'. Defaults to 'None', i.e. no specific dates are used. | `'None'` |
| `end_date` | `str` | End date for the search. Same notation as start_date. | `'None'` |
| `backend` | `Literal['api', 'local', 'infer']` | If | `'api'` |
Source code in paperscraper/arxiv/arxiv.py
kaggle
Kaggle-backed arXiv metadata dumping utilities.
arxiv_kaggle(start_date: datetime, end_date: datetime, save_path: str, kaggle_filepath: Optional[str] = None) -> int
Convert a Kaggle arXiv metadata snapshot to paperscraper JSONL format.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `start_date` | `datetime` | Earliest paper submission date to include. | required |
| `end_date` | `datetime` | Latest paper submission date to include. | required |
| `save_path` | `str` | Destination JSONL path for converted papers. | required |
| `kaggle_filepath` | `Optional[str]` | Existing Kaggle snapshot file. If provided, no Kaggle download is attempted. | `None` |

Returns:
| Type | Description |
|---|---|
| `int` | Number of papers written to |
Source code in paperscraper/arxiv/kaggle.py
download_kaggle_snapshot() -> str
Download the Kaggle arXiv metadata snapshot if needed.
Returns:
| Type | Description |
|---|---|
| `str` | Path to the local Kaggle snapshot JSON file. |

Raises:
| Type | Description |
|---|---|
| `ImportError` | If the |
| `RuntimeError` | If Kaggle authentication is missing or invalid. |
| `FileNotFoundError` | If the download succeeds but no snapshot JSON is found. |
Source code in paperscraper/arxiv/kaggle.py
default_kaggle_dir() -> str
Return the default temporary directory for Kaggle arXiv downloads.
Returns:
| Type | Description |
|---|---|
| `str` | Path to the default Kaggle download directory. |
find_kaggle_snapshot(kaggle_dir: str) -> Optional[str]
Find the arXiv metadata snapshot JSON in a Kaggle download directory.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `kaggle_dir` | `str` | Directory to search. | required |

Returns:
| Type | Description |
|---|---|
| `Optional[str]` | Path to the largest candidate JSON file, or None if no candidate exists. |
Source code in paperscraper/arxiv/kaggle.py
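The "largest candidate JSON file" rule can be sketched as follows. How the library actually selects candidates is not stated here, so the hypothetical helper `largest_json` simply treats every file ending in `.json` below the directory as a candidate; that choice is an assumption.

```python
import os
import tempfile
from typing import Optional


def largest_json(directory: str) -> Optional[str]:
    """Return the path of the largest .json file below `directory`, or None."""
    candidates = []
    for root, _dirs, files in os.walk(directory):
        for name in files:
            # Assumption: any file with a .json extension counts as a candidate.
            if name.lower().endswith(".json"):
                candidates.append(os.path.join(root, name))
    if not candidates:
        return None
    return max(candidates, key=os.path.getsize)


# Build a throwaway directory with two JSON files and one unrelated file.
demo_dir = tempfile.mkdtemp()
for name, size in [("small.json", 10), ("big.json", 1000), ("notes.txt", 5000)]:
    with open(os.path.join(demo_dir, name), "w", encoding="utf-8") as handle:
        handle.write("x" * size)

print(os.path.basename(largest_json(demo_dir)))  # big.json
```

Picking by size is a cheap heuristic for telling a multi-gigabyte metadata snapshot apart from small auxiliary JSON files in the same download.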
get_kaggle_paper_date(record: dict) -> Optional[datetime]
Extract the first submission date from a Kaggle arXiv record.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `record` | `dict` | Raw Kaggle arXiv metadata record. | required |

Returns:
| Type | Description |
|---|---|
| `Optional[datetime]` | Naive UTC-normalized submission date at midnight, or None if no usable date is available. |
Source code in paperscraper/arxiv/kaggle.py
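A minimal sketch of this kind of date extraction is shown below. It assumes that a record carries a `versions` list whose first entry has a `created` timestamp in RFC 2822 style (for example `Mon, 2 Apr 2007 19:18:42 GMT`). That layout is an assumption about the Kaggle snapshot, and the helper `first_submission_date` is hypothetical rather than the library's code.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Optional


def first_submission_date(record: dict) -> Optional[datetime]:
    """Return the first version's date as a naive UTC datetime at midnight."""
    versions = record.get("versions") or []
    if not versions:
        return None
    created = versions[0].get("created")  # assumed field name
    if not created:
        return None
    try:
        parsed = parsedate_to_datetime(created)
    except (TypeError, ValueError):
        # Unparseable timestamps count as "no usable date".
        return None
    if parsed.tzinfo is not None:
        # Normalize to UTC first, then drop the timezone to get a naive value.
        parsed = parsed.astimezone(timezone.utc).replace(tzinfo=None)
    return parsed.replace(hour=0, minute=0, second=0, microsecond=0)


record = {"versions": [{"version": "v1", "created": "Mon, 2 Apr 2007 19:18:42 GMT"}]}
print(first_submission_date(record))  # 2007-04-02 00:00:00
print(first_submission_date({}))  # None
```

Truncating to midnight keeps comparisons against day-granular `start_date` and `end_date` bounds simple.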
normalize_kaggle_record(record: dict, paper_date: datetime) -> dict
Normalize a Kaggle arXiv record to paperscraper dump fields.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `record` | `dict` | Raw Kaggle arXiv metadata record. | required |
| `paper_date` | `datetime` | Submission date returned by | required |

Returns:
| Type | Description |
|---|---|
| `dict` | Dictionary with paperscraper's standard |
Source code in paperscraper/arxiv/kaggle.py
utils
format_date(date_str: str) -> str
Converts a date in YYYY-MM-DD format to arXiv's YYYYMMDDTTTT format.