paperscraper
paperscraper is a Python package for reproducible searches over scholarly
metadata, accessible full-text retrieval, citation lookup, and small
bibliometric workflows across PubMed, arXiv, bioRxiv, medRxiv, and ChemRxiv.
What It Does
- Search scholarly metadata: Query PubMed and arXiv directly, or search local JSONL dumps from arXiv, bioRxiv, medRxiv, and ChemRxiv with one nested keyword convention.
- Build local preprint dumps: Download local xRxiv snapshots once, then run reproducible repeated searches without depending on live search results for every query.
- Retrieve accessible full text: Save PDFs or XML from DOI metadata using direct links and supported fallback paths when access is available.
- Inspect citation behavior: Query citation counts, author metrics, journal impact factors, and paper-level or researcher-level self-citation and self-reference rates.
Quick Example
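from paperscraper import dump_queries

# Each list is one group of synonyms; terms within a group are combined with OR.
ai = ["Artificial intelligence", "Machine learning"]
qc = ["Quantum computing", "Quantum information", "Quantum algorithm"]
chemistry = ["Chemistry", "Chemical", "Molecule", "Materials science"]

# One query requiring a match from every group (AND), with "." as the output root.
dump_queries([[ai, qc, chemistry]], ".")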
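Nested lists encode Boolean logic: within a query, the keyword groups are
combined with AND, while the terms inside each group are synonyms combined
with OR. In the example, [ai, qc, chemistry] is a single query, and the
outermost list passed to dump_queries holds the queries to run.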
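
To make the AND/OR convention concrete, here is a minimal, standalone sketch of how one nested query could be evaluated against a piece of text. It is not paperscraper's implementation: the `matches` helper and the sample abstracts are hypothetical, and case-insensitive substring matching is an assumption made only for illustration.

```python
from typing import List


def matches(text: str, query: List[List[str]]) -> bool:
    """Hypothetical helper: True if every group (AND) has a synonym (OR) in text.

    Assumes case-insensitive substring matching, which may differ from
    what paperscraper does internally.
    """
    lowered = text.lower()
    return all(
        any(term.lower() in lowered for term in group)
        for group in query
    )


ai = ["Artificial intelligence", "Machine learning"]
qc = ["Quantum computing", "Quantum information", "Quantum algorithm"]
query = [ai, qc]  # one query: (ai synonyms) AND (qc synonyms)

# Made-up sample texts, used only to exercise the helper.
abstracts = [
    "Machine learning for quantum computing hardware calibration",
    "A survey of machine learning in radiology",
]

hits = [a for a in abstracts if matches(a, query)]
print(hits)
```

Only the first sample text is kept, because it contains a term from both groups; the second matches the ai group but nothing from the qc group.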
Where To Go Next
- Start with Getting Started for installation and local dump setup.
- Use Paper Keyword Analysis for multi-source literature trend workflows.
- Use PDF Retrieval for full-text download options and supported fallbacks.
- Use Scholar Metrics Analysis for citation counts, author metrics, and journal metrics.
- Use Self-Citation Analysis for self-citation and self-reference workflows.
API details are available under API Documentation.