License: MIT

paperscraper

paperscraper is a Python package for reproducible searches over scholarly metadata, accessible full-text retrieval, citation lookup, and small bibliometric workflows across PubMed, arXiv, bioRxiv, medRxiv, and ChemRxiv.

pip install paperscraper

or:

uv add paperscraper

What It Does

  • Search scholarly metadata

    Query PubMed and arXiv directly, or search local JSONL dumps from arXiv, bioRxiv, medRxiv, and ChemRxiv, all using the same nested keyword convention.

    Paper keyword analysis

  • Build local preprint dumps

    Download local xRxiv snapshots once, then run reproducible repeated searches without depending on live search results for every query.

    Getting started

  • Retrieve accessible full text

    Save PDFs or XML from DOI metadata using direct links and supported fallback paths when access is available.

    PDF retrieval

  • Inspect citation behavior

    Query citation counts, author metrics, journal impact factors, and paper-level or researcher-level self-citation and self-reference rates.

    Scholar metrics

    Self-citation analysis
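To make the idea of a paper-level self-citation rate concrete, here is a minimal sketch in plain Python. It is not paperscraper's implementation or API: the function name `self_citation_rate`, the name-matching rule, and the author names are all hypothetical. It assumes a citation counts as a self-citation when the citing paper shares at least one author with the cited paper.

```python
def self_citation_rate(paper_authors, citing_author_lists):
    """Fraction of citing papers that share at least one author with the paper.

    Hypothetical helper for illustration only. Names are compared after
    lowercasing and trimming, which is far cruder than real author matching.
    """
    if not citing_author_lists:
        return 0.0
    own = {name.strip().lower() for name in paper_authors}
    self_cites = sum(
        1
        for authors in citing_author_lists
        if own & {name.strip().lower() for name in authors}
    )
    return self_cites / len(citing_author_lists)


# Made-up example: four citing papers, two of which share an author.
paper_authors = ["A. Smith", "B. Lee"]
citing_papers = [["A. Smith", "C. Wu"], ["D. Kim"], ["E. Roy"], ["b. lee"]]
rate = self_citation_rate(paper_authors, citing_papers)
print(rate)
```

A self-reference rate could be sketched the same way by passing the author lists of the papers in the reference list instead of the citing papers.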

Quick Example

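from paperscraper import dump_queries

# Each inner list is a group of synonyms (combined with OR).
ai = ["Artificial intelligence", "Machine learning"]
qc = ["Quantum computing", "Quantum information", "Quantum algorithm"]
chemistry = ["Chemistry", "Chemical", "Molecule", "Materials science"]

# One query that combines the three groups with AND; results are written under the current directory.
dump_queries([[ai, qc, chemistry]], ".")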

Nested lists encode Boolean logic: the groups in the outer list are combined with AND, while the terms inside each inner list are synonyms combined with OR.
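The AND/OR convention can be illustrated with a small, library-independent sketch. The `matches` helper below is hypothetical and is not how paperscraper evaluates queries internally. It assumes a case-insensitive substring match against a single text field, and the titles are made up.

```python
def matches(text, query):
    """Return True if every OR-group in `query` has at least one term in `text`.

    Hypothetical helper: `query` is a list of groups (AND), and each group
    is a list of synonyms (OR). Matching is a case-insensitive substring test.
    """
    lowered = text.lower()
    return all(
        any(term.lower() in lowered for term in group)
        for group in query
    )


ai = ["Artificial intelligence", "Machine learning"]
qc = ["Quantum computing", "Quantum information", "Quantum algorithm"]
query = [ai, qc]  # (ai terms OR'ed) AND (qc terms OR'ed)

titles = [
    "Machine learning for quantum computing hardware",
    "Machine learning for protein folding",
]
hits = [title for title in titles if matches(title, query)]
print(hits)
```

Only the first title is kept, because the second contains a term from the `ai` group but none from the `qc` group.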

Where To Go Next

API details are available under API Documentation.