paperscraper

paperscraper is a Python package for reproducible searches over scholarly metadata, accessible full-text retrieval, citation lookup, and small bibliometric workflows across PubMed, arXiv, bioRxiv, medRxiv, and ChemRxiv.

pip install paperscraper

or:

uv add paperscraper

What It Does

Search scholarly metadata

Query PubMed and arXiv directly, or search local JSONL dumps from arXiv, bioRxiv, medRxiv, and ChemRxiv with one nested keyword convention.

Paper keyword analysis

Build local preprint dumps

Download local xRxiv snapshots once, then run reproducible repeated searches without depending on live search results for every query.

Getting started

Retrieve accessible full text

Save PDFs or XML from DOI metadata using direct links and supported fallback paths when access is available.

PDF retrieval

Inspect citation behavior

Query citation counts, author metrics, journal impact factors, and paper-level or researcher-level self-citation and self-reference rates.

Scholar metrics

Self-citation analysis

Quick Example

from paperscraper import dump_queries

# Each list below is one group of synonyms.
ai = ["Artificial intelligence", "Machine learning"]
qc = ["Quantum computing", "Quantum information", "Quantum algorithm"]
chemistry = ["Chemistry", "Chemical", "Molecule", "Materials science"]

# Run one query over all three groups; "." is the output root directory.
dump_queries([[ai, qc, chemistry]], ".")

Nested lists encode Boolean logic: the outermost list holds one or more queries, the keyword groups inside a query are combined with AND, and the terms inside each group are synonyms combined with OR.
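The AND/OR convention can be illustrated without any network access. The sketch below is not part of paperscraper: `matches_query` is a hypothetical helper, and case-insensitive substring matching is an assumption made for illustration, so the package's own matching rules may differ.

```python
from typing import List, Union

# A keyword group is either a single term or a list of synonyms.
KeywordGroup = Union[str, List[str]]


def matches_query(text: str, query: List[KeywordGroup]) -> bool:
    """Hypothetical helper: True if `text` satisfies every group in `query`."""
    lowered = text.lower()
    for group in query:
        # A bare string is treated as a one-element synonym group.
        synonyms = [group] if isinstance(group, str) else group
        # OR: at least one synonym in the group must appear in the text.
        if not any(term.lower() in lowered for term in synonyms):
            return False  # AND: a single unmatched group rejects the text.
    return True


ai = ["Artificial intelligence", "Machine learning"]
qc = ["Quantum computing", "Quantum information", "Quantum algorithm"]
chemistry = ["Chemistry", "Chemical", "Molecule", "Materials science"]
query = [ai, qc, chemistry]

hit = "Machine learning for quantum computing simulations of a small molecule"
miss = "Machine learning for molecule generation"

print(matches_query(hit, query))   # True: all three groups are matched
print(matches_query(miss, query))  # False: no quantum term is present
```

Read this way, the query `[ai, qc, chemistry]` means (any AI term) AND (any quantum term) AND (any chemistry term).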

Where To Go Next

Start with Getting Started for installation and local dump setup.
Use Paper Keyword Analysis for multi-source literature trend workflows.
Use PDF Retrieval for full-text download options and supported fallbacks.
Use Scholar Metrics Analysis for citation counts, author metrics, and journal metrics.
Use Self-Citation Analysis for self-citation and self-reference workflows.

API details are available under API Documentation.