scrapemed package

Submodules

scrapemed.paper module

ScrapeMed’s Paper Module

The scrapemed paper module is intended as the primary point of contact for scrapemed end users.

Paper objects are defined here, as well as end-user functionality for scraping data from PubMed Central without stressing about the details.

Warnings:
  • emptyTextWarning – Warned when trying to perform a text operation on a Paper which has no text.

  • pubmedHTTPError – Warned when unable to retrieve a PMC XML repeatedly. Can occasionally happen with PMC due to high traffic. Also may be caused by broken XML formatting.

class scrapemed.paper.Paper(paper_dict: dict)

Bases: object

Class for storing paper data downloaded from PMC.

This class provides methods for initializing papers via PMCID and directly from XML, paper chunking and vectorization, conversion to relational format (pandas Series), printing methods, and equality checking.

Class data members include all of the data defined via the method info().

Raises:
  • pubmedHTTPError – Raised if there are HTTP errors when retrieving data from PMC.

  • emptyTextWarning – Raised if an attempt is made to vectorize a paper with no text.

Example:

To initialize a Paper object with paper_dict:

>>> paper = Paper(paper_dict)

abstract_as_str() -> str

Return a string representation of the abstract of a paper.

This method retrieves the abstract text without MHTML data references.

Returns:

A string containing the abstract text.

Return type:

str

body_as_str() -> str

Return a string representation of the body of a paper.

Returns:

A string containing the body text.

Return type:

str

expanded_query(query: str, n_results: int = 1, n_before: int = 2, n_after: int = 2) -> Dict[str, str]

Query the paper with an expanded natural language question/query.

This method matches a natural language query with the vectorized Paper. It retrieves and expands the text sections around the most semantically similar result chunk(s).

Parameters:
  • query (str) – The natural language query.

  • n_results (int) – The number of most semantically similar paper sections to retrieve.

  • n_before (int) – The number of chunks before the match to include in the combined output.

  • n_after (int) – The number of chunks after the match to include in the combined output.

Returns:

A dictionary with keys representing the most semantically similar result chunk(s) and values representing the expanded paper text(s) around the result chunk(s).

Return type:

dict[str, str]
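
The effect of n_before and n_after can be illustrated with a minimal sketch. This is not ScrapeMed's implementation: expand_match and the sample chunk list are hypothetical, and the real method works against a vector collection rather than a plain list.

```python
from typing import List


def expand_match(chunks: List[str], match_index: int,
                 n_before: int = 2, n_after: int = 2) -> List[str]:
    """Return the matched chunk plus its neighbours, clamped to the list bounds."""
    start = max(0, match_index - n_before)
    end = min(len(chunks), match_index + n_after + 1)
    return chunks[start:end]


# Hypothetical chunks standing in for a chunked paper.
chunks = ["c0", "c1", "c2", "c3", "c4", "c5"]
window = expand_match(chunks, match_index=3, n_before=2, n_after=1)
edge_window = expand_match(chunks, match_index=0)
```

When the match sits near the start or end of the paper, the window simply shrinks instead of raising an error.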

classmethod from_pmc(pmcid: int, email: str, download: bool = False, validate: bool = True, verbose: bool = False, suppress_warnings: bool = False, suppress_errors: bool = False)

Generate a Paper from a PMCID with optional parameters.

Parameters:
  • pmcid (int) – Unique PMCID for the article to parse.

  • email (str) – Provide your email address for authentication with PMC.

  • download (bool) – Whether or not to download the XML retrieved from PMC.

  • validate (bool) – Whether or not to validate the XML from PMC against NLM articleset 2.0 DTD (HIGHLY RECOMMENDED).

  • verbose (bool) – Whether or not to have verbose output for testing.

  • suppress_warnings (bool) – Whether to suppress warnings while parsing XML. Note: Warnings are frequent, because of the variable nature of PMC XML data. Recommended to suppress when parsing many XMLs at once.

  • suppress_errors (bool) – Return None on failed XML parsing, instead of raising an error.

Returns:

A Paper object initialized via the passed PMCID and optional parameters.

Return type:

Paper

classmethod from_xml(pmcid: int, root: Element, verbose: bool = False, suppress_warnings: bool = False, suppress_errors: bool = False)

Generate a Paper straight from PMC XML.

Parameters:
  • pmcid (int) – PMCID for the XML. This is required intentionally, to ensure trustworthy unique indexing of PMC XMLs.

  • root (ET.Element) – Root element of the PMC XML tree.

  • verbose (bool) – Report verbose output or not. Intended for testing.

  • suppress_warnings (bool) – Suppress warnings while parsing XML or not. Note: Warnings are frequent, because of the variable nature of PMC XML data. Recommended to suppress when parsing many XMLs at once.

  • suppress_errors (bool) – Return None on failed XML parsing, instead of raising an error. Recommended to suppress when parsing many XMLs at once, unless failure is not an option.

Returns:

A Paper object initialized via the passed XML.

Return type:

Paper

full_text(print_text: bool = False)

Return the full abstract and/or body text of this Paper as a string.

Optionally, you can choose to print the text.

Parameters:

print_text (bool) – If True, print the text; if False, return it as a string.

Returns:

A string containing the full text of the abstract and/or body.

Return type:

str

info() -> Dict[str, str]

Return the data definition dictionary.

Returns:

A dictionary containing paper information.

Return type:

dict[str, str]

print_abstract() -> str

Print and return a string representation of the abstract.

Returns:

A string containing the abstract text.

Return type:

str

print_body() -> str

Print and return a string representation of the body of a paper.

This method retrieves the body text without MHTML data references.

Returns:

A string containing the body text.

Return type:

str

query(query: str, n_results: int = 1, n_before: int = 2, n_after: int = 2) -> Dict[str, str]

Query the paper with natural language questions.

Parameters:
  • query (str) – The natural language question/query.

  • n_results (int) – The number of most semantically similar paper sections to retrieve.

  • n_before (int) – The number of chunks before the match to include in the combined output.

  • n_after (int) – The number of chunks after the match to include in the combined output.

Returns:

A dictionary with keys representing the most semantically similar result chunk(s) and values representing the paper text(s) around the most semantically similar result chunk(s). The text length is determined by the chunk size used in self.vectorize() and the params n_before and n_after.

Return type:

dict[str, str]

to_relational() -> Series

Generate a pandas Series representation of the paper.

This method creates a pandas Series containing a relational representation of the paper’s data. Some data may be lost in this process, but most useful text data and metadata will be retained in a structured form.

Returns:

A pandas Series representing the paper’s data.

Return type:

pd.Series

vectorize(chunk_size: int = 100, chunk_overlap: int = 20, refresh: bool = False)

Generate an in-memory vector database representation of the paper.

This method generates an in-memory vector database representation of the paper, stored in paper.vector_collection. It focuses on vectorizing the abstract and body text.

Parameters:
  • chunk_size (int) – An approximate chunk size to split the paper into (measured in characters).

  • chunk_overlap (int) – An approximate desired chunk overlap (measured in characters).

  • refresh (bool) – Whether or not to clear and re-vectorize the paper with new settings.

Returns:

None
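
To make chunk_size and chunk_overlap concrete, here is a minimal sketch of character-based chunking with overlap. It is an illustration only: chunk_text is a hypothetical helper, and because the parameters above are described as approximate, the library's own splitter may cut at different positions than the exact offsets used here.

```python
from typing import List


def chunk_text(text: str, chunk_size: int = 100, chunk_overlap: int = 20) -> List[str]:
    """Split text into character windows; consecutive windows share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]


# 250 characters of sample text (repeating alphabet).
sample = "".join(chr(97 + i % 26) for i in range(250))
chunks = chunk_text(sample)
```

With the default settings each new chunk starts 80 characters after the previous one, so the last 20 characters of one chunk reappear at the start of the next.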

exception scrapemed.paper.emptyTextWarning

Bases: Warning

Warned when trying to perform a text operation on a Paper which has no text.

exception scrapemed.paper.pubmedHTTPError

Bases: Warning

Warned when unable to retrieve a PMC XML repeatedly. Can occasionally happen with PMC due to high traffic. Also may be caused by broken XML formatting.
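
Because both classes derive from Warning, they can be recorded or silenced with Python's standard warnings module. The sketch below uses a hypothetical stand-in class, demoTextWarning, so that it runs without ScrapeMed installed; with the library available, the real warning classes could be passed as the category instead.

```python
import warnings


class demoTextWarning(Warning):
    """Hypothetical stand-in for a library warning class such as emptyTextWarning."""


def risky_operation() -> str:
    """Emit a warning, then carry on, as a Warning subclass allows."""
    warnings.warn("no text available", demoTextWarning)
    return "done"


# Record warnings instead of printing them.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    result = risky_operation()

# Silence only this category; other warnings would still be recorded.
with warnings.catch_warnings(record=True) as silenced:
    warnings.simplefilter("always")
    warnings.simplefilter("ignore", category=demoTextWarning)
    risky_operation()
```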

scrapemed.paperSet module

ScrapeMed’s PaperSet Module

Module for building PMC paper datasets, known as paperSets. This is the main endpoint for ScrapeMed users like data scientists and data engineers.

Provides functions for structured data generation and for scraping PMC via both PMCID lists and advanced PMC term searches.

class scrapemed.paperSet.paperSet(papers: List[Paper])

Bases: object

A collection of Paper objects with various operations for managing and analyzing them.

Parameters:

papers (List[Paper]) – A list of Paper objects to initialize the paperSet.

This class represents a collection of Paper objects and provides methods for creating, adding, and visualizing papers.

Methods:
  • from_search(email, term, retmax=10, verbose=False, suppress_warnings=True, suppress_errors=True): Generate a paperSet via a PMC search.

  • from_pmcid_list(pmcids, email, download=False, validate=True, strip_text_styling=True, verbose=False, suppress_warnings=True, suppress_errors=True): Generate a paperSet via a list of PMCIDs.

  • to_df(): Return a pandas DataFrame representation of the paperSet.

  • add_paper(paper): Add a Paper to the paperSet.

  • add_papers(papers): Add multiple Papers to the paperSet.

  • add_pmcid(pmcid, email, download=False, validate=True, strip_text_styling=True, verbose=False, suppress_warnings=True, suppress_errors=True): Add a Paper to the paperSet via a PMCID.

  • add_pmcids(pmcids, email, download=False, validate=True, strip_text_styling=True, verbose=False, suppress_warnings=True, suppress_errors=True): Add Papers to the paperSet via a list of PMCIDs.

  • visualize(): Generate a general visualization of the paperSet.

  • visualize_unique_values(columns_to_visualize=["Last_Updated", "Journal_Title"]): Visualize unique values in specified columns.

  • visualize_title_wordcloud(): Visualize a word cloud of all the Paper titles in the paperSet.

add_paper(paper: Paper)

Add a Paper to the paperSet directly. Returns True if the paper was added, False if the paper was already in the paperSet.

Parameters:

paper (Paper) – The Paper to add to the paperSet.

Returns:

True if the paper was added, False if it was already in the paperSet.

Return type:

bool

add_papers(papers: List[Paper])

Add Papers to the paperSet directly. Returns the number of papers added.

Parameters:

papers (List[Paper]) – List of Papers to add to the paperSet.

Returns:

The number of papers added.

Return type:

int
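
The return-value contract of add_paper and add_papers (a bool for a single addition, a count for a batch, with duplicates skipped) can be sketched as follows. TinySet and the string identifiers are hypothetical stand-ins; how paperSet actually detects duplicates is not specified here.

```python
from typing import Hashable, Iterable, List


class TinySet:
    """Hypothetical stand-in for a collection that rejects duplicates."""

    def __init__(self) -> None:
        self.items: List[Hashable] = []

    def add_item(self, item: Hashable) -> bool:
        """Return True if the item was added, False if it was already present."""
        if item in self.items:
            return False
        self.items.append(item)
        return True

    def add_items(self, items: Iterable[Hashable]) -> int:
        """Return how many items were actually added."""
        return sum(self.add_item(item) for item in items)


collection = TinySet()
first = collection.add_item("PMC1")
added = collection.add_items(["PMC1", "PMC2", "PMC3", "PMC2"])
```

Here the batch call reports 2, because one identifier was already present and another appears twice in the input.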

add_pmcid(pmcid: int | str, email: str, download: bool = False, validate: bool = True, strip_text_styling: bool = True, verbose: bool = False, suppress_warnings: bool = True, suppress_errors: bool = True)

Add a Paper to the paperSet via PMCID. Returns True if the paper was added, False if it was already in the paperSet.

Parameters:
  • pmcid (Union[int, str]) – The PMCID to generate the Paper to add.

  • email (str) – Use your email to authenticate with PMC.

  • download (bool) – Whether or not to download the XMLs corresponding to pmcid (default is False).

  • validate (bool) – Whether or not to validate the XMLs corresponding to pmcid (default is True).

  • strip_text_styling (bool) – Whether or not to clean common HTML and other text styling out of the XML (default is True).

  • verbose (bool) – Whether to display verbose output (default is False).

  • suppress_warnings (bool) – Whether to suppress warnings while parsing XML (default is True).

  • suppress_errors (bool) – Whether to return None on failed XML parsing, instead of raising an error (default is True).

Returns:

True if the paper was added, False if it was already in the paperSet.

Return type:

bool

add_pmcids(pmcids: List[int | str], email: str, download: bool = False, validate: bool = True, strip_text_styling: bool = True, verbose: bool = False, suppress_warnings: bool = True, suppress_errors: bool = True)

Add Papers to the paperSet via a list of PMCIDs. Returns the number of papers added.

Parameters:
  • pmcids (List[Union[int, str]]) – List of PMCIDs to populate the paperSet.

  • email (str) – Use your email to authenticate with PMC.

  • download (bool) – Whether or not to download the XMLs corresponding to pmcids (default is False).

  • validate (bool) – Whether or not to validate the XMLs corresponding to pmcids (default is True).

  • strip_text_styling (bool) – Whether or not to clean common HTML and other text styling out of the XMLs (default is True).

  • verbose (bool) – Whether to display verbose output (default is False).

  • suppress_warnings (bool) – Whether to suppress warnings while parsing XML (default is True).

  • suppress_errors (bool) – Whether to return None on failed XML parsing, instead of raising an error (default is True).

Returns:

The number of papers added.

Return type:

int

classmethod from_pmcid_list(pmcids: List[int], email: str, download: bool = False, validate: bool = True, strip_text_styling: bool = True, verbose: bool = False, suppress_warnings: bool = True, suppress_errors: bool = True)

Generate a paperSet via a list of PMCIDs.

Parameters:
  • pmcids (List[int]) – List of PMCIDs to populate the paperSet.

  • email (str) – Use your email to authenticate with PMC.

  • download (bool) – Whether or not to download the XMLs corresponding to PMCIDs (default is False).

  • validate (bool) – Whether or not to validate the XMLs corresponding to PMCIDs (default is True).

  • strip_text_styling (bool) – Whether or not to clean common HTML and other text styling out of the XMLs (default is True).

  • verbose (bool) – Whether to display verbose output (default is False).

  • suppress_warnings (bool) – Whether to suppress warnings while parsing XML (default is True).

  • suppress_errors (bool) – Whether to return None on failed XML parsing, instead of raising an error (default is True).

Returns:

A paperSet generated from the list of PMCIDs.

Return type:

paperSet

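classmethod from_search(email: str, term: str, retmax: int = 10, verbose: bool = False, suppress_warnings: bool = True, suppress_errors: bool = True)

Generate a paperSet via a PMC search.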

Parameters:
  • email (str) – Use your email to authenticate with PMC.

  • term (str) – Search term.

  • retmax (int) – Maximum number of PMCIDs to return (default is 10).

  • verbose (bool) – Whether to display verbose output (default is False).

  • suppress_warnings (bool) – Whether to suppress warnings while parsing XML (default is True).

  • suppress_errors (bool) – Whether to return None on failed XML parsing, instead of raising an error (default is True).

Returns:

A paperSet generated from the PMC search results.

Return type:

paperSet

to_df()

Return a pandas DataFrame representation of the paperSet.

Returns:

DataFrame of the paperSet.

Return type:

pd.DataFrame

visualize()

Generates a general visualization of the paperSet, including unique value visualization and a wordcloud for all of the Paper titles.

visualize_title_wordcloud() -> None

Visualize a wordcloud of all the Paper titles in the paperSet.

visualize_unique_values(columns_to_visualize=['Last_Updated', 'Journal_Title']) -> None

Visualize unique values in specified columns of the paperSet DataFrame.

Parameters:

columns_to_visualize (List[str]) – A list of column names to visualize (default is ["Last_Updated", "Journal_Title"]).

scrapemed._parse module

ScrapeMed’s _parse Module

Parse module for grabbing metadata, text, tables, figures, etc. from XML trees representing PMC articles.

DTD for the XML should be NLM articleset 2.0. Otherwise the behavior here may not be as expected.

Middleman between the scrape module and the paper module for ScrapeMed.

exception scrapemed._parse.badTextFormattingWarning

Bases: Warning

scrapemed._parse.define_data_dict() -> dict

Returns a static definition of each of the elements returned in a Paper dictionary.

Returns:

A dictionary where keys are the elements in the Paper dictionary, and values are descriptions of those elements.

Return type:

dict

scrapemed._parse.gather_abstract(root: Element, ref_map: basicBiMap) -> List[TextSection | TextParagraph]

Extract all abstract text sections from an XML document and return them as a list of TextSections and/or TextParagraphs.

Parameters:
  • root (ET.Element) – The root element of the PMC paper XML tree.

  • ref_map (basicBiMap) – A reference map used for decoding data references within the text.

Returns:

A list of TextSections and/or TextParagraphs representing the abstract text sections in the XML.

Return type:

List[Union[TextSection, TextParagraph]]

scrapemed._parse.gather_acknowledgements(root: Element) -> List[str] | str

Extract acknowledgements information from a PMC XML document.

This function retrieves a list of acknowledgements found in the article’s XML tree.

Parameters:

root (ET.Element) – The root element of the PMC paper XML tree.

Returns:

A list of strings representing acknowledgements, or a string indicating that no acknowledgements were found.

Return type:

Union[List[str], str]

scrapemed._parse.gather_article_categories(root: Element) -> List[str]

Extract Other Article Categories from a PMC XML document.

This function retrieves article categories that are not marked as ‘heading’ in the subj-group-type attribute.

Parameters:

root (ET.Element) – The root element of the PMC paper XML tree.

Returns:

A list of dictionaries containing other article categories with the subj-group-type as keys and corresponding category values as values.

Return type:

List[Dict[str, str]]

scrapemed._parse.gather_article_id(root: Element) -> Dict[str, str]

Gather Article IDs from PMC XML.

scrapemed._parse.gather_article_types(root: Element) -> List[str]

Extract Article Types from a PMC XML document.

Article Types are article-categories marked by the subj-group-type ‘heading’.

Parameters:

root (ET.Element) – The root element of the PMC paper XML tree.

Returns:

A list of strings representing the Article Types found in the XML.

Return type:

List[str]

scrapemed._parse.gather_authors(root: Element) -> DataFrame

Extract authors, their emails, and affiliations from a PMC XML.

Parameters:

root (ET.Element) – The root element of the PMC paper XML tree.

Returns:

A DataFrame containing author information with columns:
  • Contributor_Type: Type of contributor (e.g., ‘author’).

  • First_Name: First name of the author.

  • Last_Name: Last name of the author.

  • Email_Address: Email address of the author.

  • Affiliations: Affiliations of the author.

Return type:

pd.DataFrame

scrapemed._parse.gather_body(root: Element, ref_map: basicBiMap) -> List[TextSection | TextParagraph]

Extract all body text sections from an XML document and return them as a list of TextSections and/or TextParagraphs.

Parameters:
  • root (ET.Element) – The root element of the PMC paper XML tree.

  • ref_map (basicBiMap) – A reference map used for decoding data references within the text.

Returns:

A list of TextSections and/or TextParagraphs representing the body text sections in the XML.

Return type:

List[Union[TextSection, TextParagraph]]

scrapemed._parse.gather_custom_metadata(root: Element) -> Dict[str, str]

Extract custom metadata key-value pairs from a PMC XML document.

This function retrieves custom metadata key-value pairs found in the article’s XML tree. Custom metadata consists of user-defined key-value pairs.

Parameters:

root (ET.Element) – The root element of the PMC paper XML tree.

Returns:

A dictionary containing custom metadata key-value pairs, or None if no custom metadata is found.

Return type:

Dict[str, str] or None

scrapemed._parse.gather_footnote(root: Element) -> str

Extract footnote information from a PMC XML document.

This function retrieves and concatenates footnotes found in the article’s back matter.

Parameters:

root (ET.Element) – The root element of the PMC paper XML tree.

Returns:

A string containing the concatenated footnotes, or None if no footnotes are found.

Return type:

str or None

scrapemed._parse.gather_fpage(root: Element) -> str

Extract the First Page Number of this article in its parent publication from a PMC XML document.

Parameters:

root (ET.Element) – The root element of the PMC paper XML tree.

Returns:

A string representing the First Page Number of the article in its parent publication, or None if no First Page # is found.

Return type:

str or None

scrapemed._parse.gather_funding(root: Element) -> List[str]

Extract funding information from a PMC XML document.

This function retrieves a list of funding institutions mentioned in the article metadata.

Parameters:

root (ET.Element) – The root element of the PMC paper XML tree.

Returns:

A list of strings representing funding institutions, or None if no funding information is found.

Return type:

List[str] or None

scrapemed._parse.gather_issn(root: Element) -> dict

Extract ISSN values from a PMC XML document.

Parameters:

root (ET.Element) – The root element of the PMC paper XML tree.

Returns:

A dictionary containing ISSN values with the publication type as keys and corresponding values as the ISSN numbers.

Return type:

dict
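
As a rough illustration of the returned shape, the sketch below maps each issn element's pub-type attribute to its text using the standard library's ElementTree. It is not the library's code: issn_by_pub_type, the sample XML, and the ISSN values are made up for the example, and the "unknown" fallback key is an assumption.

```python
import xml.etree.ElementTree as ET
from typing import Dict

# Made-up journal metadata in the style of PMC XML.
SAMPLE = """
<article>
  <front>
    <journal-meta>
      <issn pub-type="ppub">1234-5678</issn>
      <issn pub-type="epub">8765-4321</issn>
    </journal-meta>
  </front>
</article>
"""


def issn_by_pub_type(root: ET.Element) -> Dict[str, str]:
    """Map each <issn> element's pub-type attribute to its text."""
    return {
        issn.get("pub-type", "unknown"): (issn.text or "").strip()
        for issn in root.iter("issn")
    }


issns = issn_by_pub_type(ET.fromstring(SAMPLE))
```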

scrapemed._parse.gather_issue(root: Element) -> str

Extract the Issue # of the Parent Publication from a PMC XML document.

Parameters:

root (ET.Element) – The root element of the PMC paper XML tree.

Returns:

A string representing the Issue # of the parent publication, or None if no Issue # is found.

Return type:

str or None

scrapemed._parse.gather_journal_id(root: Element) -> dict

Extract Journal IDs from a PMC XML document.

Parameters:

root (ET.Element) – The root element of the PMC paper XML tree.

Returns:

A dictionary containing Journal IDs with the ID type as keys and corresponding values as the ID values.

Return type:

dict

scrapemed._parse.gather_journal_title(root: Element) -> List[str] | str

Extract Journal Title(s) from a PMC XML document.

Parameters:

root (ET.Element) – The root element of the PMC paper XML tree.

Returns:

Either a string representing the Journal Title if there’s only one, a list of strings representing multiple Journal Titles if there are multiple, or None if no journal title is found.

Return type:

Union[List[str], str, None]

scrapemed._parse.gather_lpage(root: Element) -> str

Extract the Last Page Number of this article in its parent publication from a PMC XML document.

Parameters:

root (ET.Element) – The root element of the PMC paper XML tree.

Returns:

A string representing the Last Page Number of the article in its parent publication, or None if no Last Page # is found.

Return type:

str or None

scrapemed._parse.gather_non_author_contributors(root: Element) -> str | DataFrame

Extract non-author contributors from a PMC XML.

Parameters:

root (ET.Element) – The root element of the PMC paper XML tree.

Returns:

Either a string indicating that no non-author contributors were found, or a DataFrame containing contributor information with columns:
  • Contributor_Type: Type of contributor.

  • First_Name: First name of the contributor.

  • Last_Name: Last name of the contributor.

  • Email_Address: Email address of the contributor.

  • Affiliations: Affiliations of the contributor.

Return type:

Union[str, pd.DataFrame]

scrapemed._parse.gather_notes(root: Element) -> str

Extract notes information from a PMC XML document.

This function retrieves a list of notes found in the article’s XML tree.

Parameters:

root (ET.Element) – The root element of the PMC paper XML tree.

Returns:

A list of strings representing notes, or an empty list if no notes are found.

Return type:

List[str]

scrapemed._parse.gather_permissions(root: Element) -> Dict[str, str]

Extract permissions information from a PMC XML document.

This function retrieves the copyright statement, license type, and license text from the article metadata.

Parameters:

root (ET.Element) – The root element of the PMC paper XML tree.

Returns:

A dictionary containing the following keys:
  • “Copyright Statement”: A string representing the copyright statement.

  • “License Type”: A string representing the license type.

  • “License Text”: A string containing the license text.

Return type:

Dict[str, str]

scrapemed._parse.gather_published_date(root: Element) -> Dict[str, datetime]

Extract Publishing Dates from a PMC XML document.

This function gathers electronic publishing, print publishing, and other dates from the article metadata.

Parameters:

root (ET.Element) – The root element of the PMC paper XML tree.

Returns:

A dictionary containing publishing dates with the publication type as keys and corresponding datetime values as values.

Return type:

Dict[str, datetime]
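
A minimal sketch of building such a dictionary from pub-date elements follows. It is illustrative only: dates_by_pub_type and the sample XML are hypothetical, and defaulting a missing day or month to 1 is an assumption of this sketch, not documented library behavior.

```python
import xml.etree.ElementTree as ET
from datetime import datetime
from typing import Dict

# Made-up article metadata in the style of PMC XML.
SAMPLE = """
<article-meta>
  <pub-date pub-type="epub"><day>5</day><month>3</month><year>2020</year></pub-date>
  <pub-date pub-type="ppub"><month>4</month><year>2020</year></pub-date>
</article-meta>
"""


def dates_by_pub_type(root: ET.Element) -> Dict[str, datetime]:
    """Build a datetime per <pub-date>, defaulting a missing day or month to 1 (assumption)."""
    dates = {}
    for pub_date in root.iter("pub-date"):
        year = int(pub_date.findtext("year"))
        month = int(pub_date.findtext("month", default="1"))
        day = int(pub_date.findtext("day", default="1"))
        dates[pub_date.get("pub-type", "unknown")] = datetime(year, month, day)
    return dates


dates = dates_by_pub_type(ET.fromstring(SAMPLE))
```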

scrapemed._parse.gather_publisher_location(root: Element) -> str | List[str]

Extract Publisher Location(s) from a PMC XML document.

Parameters:

root (ET.Element) – The root element of the PMC paper XML tree.

Returns:

Either a string representing the Publisher Location if there’s only one, or a list of strings representing multiple Publisher Locations if there are multiple.

Return type:

Union[str, List[str]]

scrapemed._parse.gather_publisher_name(root: Element) -> str | List[str]

Extract Publisher Name(s) from a PMC XML document.

Parameters:

root (ET.Element) – The root element of the PMC paper XML tree.

Returns:

Either a string representing the Publisher Name if there’s only one, or a list of strings representing multiple Publisher Names if there are multiple.

Return type:

Union[str, List[str]]

scrapemed._parse.gather_title(root: Element) -> str

Extract the title of a PMC paper from its XML root.

Parameters:

root (ET.Element) – The root element of the PMC paper XML tree.

Returns:

The title of the PMC paper.

Return type:

str

scrapemed._parse.gather_volume(root: Element) -> str

Extract the Volume # of the Parent Publication from a PMC XML document.

Parameters:

root (ET.Element) – The root element of the PMC paper XML tree.

Returns:

A string representing the Volume # of the parent publication, or None if no Volume # is found.

Return type:

str or None

scrapemed._parse.generate_paper_dict(pmcid: int, paper_root: Element, verbose: bool = False, suppress_warnings: bool = False, suppress_errors: bool = False) -> dict

Given the root of an XML tree, parse through it and generate a flattened dictionary of relevant PMC paper XML information.

This function expects the XML to be in NLM articleset 2.0 DTD format.

Optionally, you can suppress warnings and/or errors. If errors are suppressed, None will be returned upon failed parsing.

Parameters:
  • pmcid (int) – Unique PMCID for the article being parsed.

  • paper_root (ET.Element) – The root element of the PMC paper XML tree.

  • verbose (bool) – Whether or not to have verbose output for debugging.

  • suppress_warnings (bool) – Whether to suppress warnings while parsing XML. Note: Warnings are frequent due to the variable nature of PMC XML data. Recommended to suppress when parsing many XMLs at once.

  • suppress_errors (bool) – Whether to suppress errors during parsing. If suppressed, None will be returned upon a failed parsing attempt.

Returns:

A flattened dictionary containing relevant PMC paper XML information.

Return type:

dict or None if errors are suppressed and parsing fails.
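
The suppress_errors behavior described above (return None on failure instead of raising) follows a common pattern, sketched here with the standard library. parse_title is a hypothetical helper, not part of ScrapeMed, and it only catches XML syntax errors; which errors the library suppresses is not stated here.

```python
import xml.etree.ElementTree as ET
from typing import Optional


def parse_title(xml_string: str, suppress_errors: bool = False) -> Optional[str]:
    """Return the <title> text, or None on a parse failure when errors are suppressed."""
    try:
        root = ET.fromstring(xml_string)
        return root.findtext("title")
    except ET.ParseError:
        if suppress_errors:
            return None
        raise


ok_title = parse_title("<article><title>Example</title></article>")
failed = parse_title("<article>", suppress_errors=True)  # malformed XML
```

Callers that suppress errors need to check for None before using the result, which matters when parsing many documents in a loop.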

scrapemed._parse.paper_dict_from_pmc(pmcid: int, email: str, download: bool = False, validate: bool = True, verbose: bool = False, suppress_warnings: bool = False, suppress_errors: bool = False) -> dict

Wrapper that scrapes a PMC article specified by PMCID from the web, then parses the retrieved XML into a dictionary of useful values.

This function serves as a middleman between the scrape.py module and Paper.from_pmc method in paper.py, facilitating the conversion of PMC XML data to a dictionary.

Parameters:
  • pmcid (int) – Unique PMCID for the article to scrape and parse.

  • email (str) – Provide your email address for authentication with PMC.

  • download (bool) – Whether or not to download the XML retrieved from PMC.

  • validate (bool) – Whether or not to validate the XML from PMC against NLM articleset 2.0 DTD (HIGHLY RECOMMENDED).

  • verbose (bool) – Whether or not to have verbose output for debugging.

  • suppress_warnings (bool) – Whether to suppress warnings while parsing XML. Note: Warnings are frequent, given the variable nature of PMC XML data. Recommended to suppress when parsing many XMLs at once.

  • suppress_errors (bool) – Return None on failed XML parsing, instead of raising an error.

Returns:

A dictionary containing useful values parsed from the PMC article.

Return type:

dict

scrapemed._parse.stringify_note(root: Element) -> str

Recursively convert a notes section into a string.

This function traverses the XML tree of a notes section and recursively converts it into a string. It includes the <title>, <p>, and child <notes> content.

Parameters:

root (ET.Element) – The root element of the notes section in the PMC paper XML tree.

Returns:

A string representation of the notes section.

Return type:

str
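
The recursive traversal described above can be sketched with the standard library as follows. notes_to_text and the sample XML are hypothetical, and the newline separator is an assumption; the library's actual output format may differ.

```python
import xml.etree.ElementTree as ET

# Made-up notes section with one nested <notes> element.
SAMPLE = """
<notes>
  <title>Author notes</title>
  <p>First remark.</p>
  <notes>
    <title>Nested</title>
    <p>Second remark.</p>
  </notes>
</notes>
"""


def notes_to_text(node: ET.Element) -> str:
    """Concatenate <title> and <p> text in document order, recursing into child <notes>."""
    parts = []
    for child in node:
        if child.tag in ("title", "p"):
            parts.append("".join(child.itertext()).strip())
        elif child.tag == "notes":
            parts.append(notes_to_text(child))
    return "\n".join(part for part in parts if part)


text = notes_to_text(ET.fromstring(SAMPLE))
```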

exception scrapemed._parse.unexpectedMultipleMatchWarning

Bases: Warning

Warned when one match expected, but multiple found.

exception scrapemed._parse.unexpectedZeroMatchWarning

Bases: Warning

Warned when one or more matches expected, and none are found.

exception scrapemed._parse.unmatchedCitationWarning

Bases: Warning

Warned when a citation reference is made but not matched to an actual <ref> tag.

exception scrapemed._parse.unmatchedFigureWarning

Bases: Warning

Warned when a figure reference is made but not matched to an actual <fig> tag.

exception scrapemed._parse.unmatchedTableWarning

Bases: Warning

Warned when a table reference is made but not matched to an actual <table-wrap> tag.

scrapemed.scrape module

ScrapeMed’s Scrape Module

ScrapeMed’s scrape module handles PubMed Central data searching and downloads.

This module also handles conversion of raw XML data to lxml.etree.ElementTree objects.

scrapemed.scrape.get_xml(pmcid: int, email: str, download=False, validate=True, strip_text_styling=True, verbose=False) -> ElementTree

Retrieve XML of a research paper from PMC, given a PMCID. Also validates and cleans the XML by default.

Parameters:
  • pmcid (int) – PMCID of the article to retrieve.

  • email (str) – Use your email to authenticate with PMC.

  • download (bool) – Whether or not to download the XML. Default is False.

  • validate (bool) – Whether or not to validate the retrieved XML (HIGHLY RECOMMENDED). Default is True.

  • strip_text_styling (bool) – Whether or not to clean common HTML text styling from the text (HIGHLY RECOMMENDED). Default is True.

  • verbose (bool) – Whether to display verbose output. Default is False.

Returns:

ElementTree of the validated XML record.

Return type:

ET.ElementTree

scrapemed.scrape.get_xmls(pmcids: List[int], email: str, download=False, validate=True, strip_text_styling=True, verbose=False) -> List[ElementTree]

Retrieve XMLs of research papers from PMC, given a list of PMCIDs. Also validates and cleans the XMLs by default.

Parameters:
  • pmcids (List[int]) – List of PMCIDs of articles to retrieve.

  • email (str) – Use your email to authenticate with PMC.

  • download (bool) – Whether or not to download the XMLs. Default is False.

  • validate (bool) – Whether or not to validate the retrieved XMLs (HIGHLY RECOMMENDED). Default is True.

  • strip_text_styling (bool) – Whether or not to clean common HTML text styling from the text (HIGHLY RECOMMENDED). Default is True.

  • verbose (bool) – Whether to display verbose output. Default is False.

Returns:

List of ElementTrees of the XMLs corresponding to the provided PMCIDs.

Return type:

List[ET.ElementTree]

scrapemed.scrape.search_pmc(email: str, term: str, retmax: int = 10, verbose: bool = False) -> dict

Wrapper for Bio.Entrez’s esearch function to retrieve PMC search results.

Parameters:
  • email (str) – Use your email to authenticate with PMC.

  • term (str) – The search term.

  • retmax (int) – The maximum number of PMCIDs to return. Default is 10.

  • verbose (bool) – Whether to display verbose output. Default is False.

Returns:

A dictionary containing search results, including PMCIDs.

Return type:

dict

exception scrapemed.scrape.validationWarning[source]

Bases: Warning

Warned when downloading PMC XML without validating.

scrapemed.scrape.xml_tree_from_string(xml_string: str, strip_text_styling, verbose=False) ElementTree[source]

Converts a string representing XML to an lxml ElementTree.

Parameters:
  • xml_string (str) – A string or bytestream representing XML.

  • strip_text_styling (bool) – Whether to remove HTML text styling tags or not.

  • verbose (bool) – Whether to display verbose output. Default is False.

Returns:

An lxml.etree.ElementTree of the passed string.

Return type:

ET.ElementTree
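
As a rough illustration of the string-to-tree conversion described above, the sketch below uses the standard library's xml.etree.ElementTree as a stand-in for lxml. The helper name tree_from_string_sketch is hypothetical and is not part of scrapemed; styling removal is left out.

```python
import xml.etree.ElementTree as ET


def tree_from_string_sketch(xml_string: str) -> ET.ElementTree:
    """Parse an XML string and wrap its root element in an ElementTree."""
    root = ET.fromstring(xml_string)
    return ET.ElementTree(root)


sample = "<article><front><title>Example</title></front></article>"
tree = tree_from_string_sketch(sample)
root_tag = tree.getroot().tag
title_text = tree.getroot().findtext("front/title")
```

Wrapping the root in an ElementTree, rather than returning the bare element, keeps document-level operations such as write() available to the caller.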

scrapemed._clean module

ScrapeMed’s Markup Language Cleaning Utilities

Scrapemed module for markup language cleaning utilities.

scrapemed._clean.clean_xml_string(xml_string: str, strip_text_styling=True, verbose=False)[source]

Clean an XML string.

Parameters:
  • xml_string (str) – The XML string to be cleaned.

  • strip_text_styling (bool) – Whether to remove or replace HTML text styling tags.

  • verbose (bool) – Whether to print verbose output.

Returns:

The cleaned XML string.

Return type:

str
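
The idea of stripping text styling can be sketched with a regular expression that drops the styling tags but keeps their inner text. This is only an illustration: the tag list and the helper name strip_styling_sketch are assumptions, and the actual set of tags handled by scrapemed may differ.

```python
import re

# Assumed set of styling tags; the real list used by scrapemed may differ.
STYLING_TAGS = ("italic", "bold", "underline", "i", "b", "u", "sup", "sub")


def strip_styling_sketch(xml_string: str) -> str:
    """Remove opening and closing styling tags, keeping the text inside them."""
    names = "|".join(STYLING_TAGS)
    # Match <tag>, </tag>, or <tag attr="..."> for the listed names only.
    pattern = re.compile(rf"</?(?:{names})(?:\s[^>]*)?>")
    return pattern.sub("", xml_string)


raw = "<p>The <italic>E. coli</italic> strain was <bold>resistant</bold>.</p>"
cleaned = strip_styling_sketch(raw)
```

Structural tags such as <p> and <body> are left untouched because the pattern requires the tag name to be followed directly by whitespace or the closing bracket.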

scrapemed._clean.split_text_and_refs(tree_text: str, ref_map: basicBiMap, id=None, on_unknown='keep')[source]

Split HTML tags out of text.

  • HTML text styling tags will be removed if they aren’t already.

  • <xref>, <table-wrap>, and <fig> tags will be converted to MHTML tags containing the key to use when searching for these references, tables, and figures.

Returns the cleaned text and updates the passed BiMap for any new key-tag pairs found.

Parameters:
  • tree_text (str) – A string representing a markup language tree containing HTML tags.

  • ref_map (basicBiMap) – A BiMap containing keys connected to reference tag values. BiMap forward keys should be reference keys to place into the text in lieu of the tag for later BiMap table lookup. BiMap forward values should be the actual tags. The provided BiMap will be modified to reflect any new tag values found, and keys will be appended as necessary.

  • id (Any, optional) – Optionally provide an id for traceback of any issues.

  • on_unknown (str) – Behavior when encountering an unknown tag. Determines what happens to the tag contents. Default is ‘keep’. Options: [‘drop’, ‘keep’]

Returns:

A tuple containing the cleaned text and the updated BiMap.

Return type:

Tuple[str, basicBiMap]
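
The placeholder mechanism described above can be sketched as follows. The sketch handles only <xref> tags, uses a plain dict in place of basicBiMap, and assumes that new keys are consecutive integers; the helper name split_refs_sketch and these choices are illustrative assumptions, not scrapemed's implementation.

```python
import re
from typing import Dict, Tuple

XREF_PATTERN = re.compile(r"<xref\b[^>]*>.*?</xref>", re.DOTALL)


def split_refs_sketch(text: str, ref_map: Dict[int, str]) -> Tuple[str, Dict[int, str]]:
    """Swap each <xref> tag for an MHTML placeholder and record it in ref_map."""
    # Reverse lookup so a repeated tag reuses its existing key.
    reverse = {tag: key for key, tag in ref_map.items()}

    def replace(match: re.Match) -> str:
        tag = match.group(0)
        if tag not in reverse:
            # Assumption: new keys are consecutive integers.
            reverse[tag] = len(ref_map)
            ref_map[reverse[tag]] = tag
        return f"[MHTML::{reverse[tag]}]"

    return XREF_PATTERN.sub(replace, text), ref_map


text = 'Shown previously <xref rid="B1">[1]</xref> and again <xref rid="B1">[1]</xref>.'
clean_text, refs = split_refs_sketch(text, {})
```

Because the same tag appears twice in the sample, both occurrences map to a single key, which is the property a two-way map makes cheap to check.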

exception scrapemed._clean.unexpectedTagWarning[source]

Bases: Warning

Warned when an unexpected tag enclosed in angle brackets is found.

scrapemed._validate module

ScrapeMed’s _validate Module

Validation module for determining whether XML conforms to a format supported by the scrapemed package (NLM Articleset 2.0 DTD).

Custom Exception:
  • noDTDFoundError: Raised when no DTD specification can be found in the downloaded XML.

exception scrapemed._validate.noDTDFoundError[source]

Bases: Exception

Raised when no DTD specification can be found in a downloaded XML, preventing validation.

scrapemed._validate.validate_xml(xml: ElementTree) bool[source]

Validate an XML ElementTree against a supported Document Type Definition (DTD).

This function validates the provided XML ElementTree against a supported DTD. The supported DTDs are defined by the files in the ‘scrapemed/data/DTDs’ directory. Currently only NLM Articleset 2.0 (the DTD used by PubMed Central) is supported.

Parameters:

xml (ET.ElementTree) – An XML ElementTree to be validated.

Returns:

True if the XML is validated successfully against a supported DTD, False otherwise.

Return type:

bool

Raises:

noDTDFoundError – If no DTD is specified for validation in the XML doctype.

scrapemed._text module

ScrapeMed’s _text Module

The _text module of ScrapeMed is designed for organizing text found in markup languages.

In the context of the ScrapeMed package, this module is used to facilitate the organization of text found within paragraph (<p>) and section (<sec>) tags in downloaded XML from PubMed Central (PMC).

class scrapemed._text.TextElement(root: Element, parent: TextElement = None, ref_map: basicBiMap = {})[source]

Bases: object

Base class for elements parsed from XML/HTML markup language.

This class is initialized with the root element of the XML text element and a reference map. The reference map is populated with references found in the text and is used to generate the replacement MHTML tags.

Parameters:
  • root (ET.Element) – The root element of the XML text element.

  • parent (TextElement) – The parent TextElement if applicable.

  • ref_map (basicBiMap) – The reference map for storing references found in the text.

Attributes:
  • root (ET.Element): The root element of the XML text element.

  • parent (TextElement, optional): The parent TextElement if applicable.

  • ref_map (basicBiMap): The reference map for storing references found in the text.

Methods:
  • get_ref_map(): Return the shared BiMap for reference data.

  • set_ref_map(ref_map: basicBiMap): Set the shared BiMap for reference data.

This class serves as the base class for more complex text classes.

get_ref_map() basicBiMap[source]

Return the shared BiMap for reference data.

Returns:

The shared BiMap containing reference data.

Return type:

basicBiMap

set_ref_map(ref_map: basicBiMap)[source]

Set the shared BiMap for reference data.

Parameters:

ref_map (basicBiMap) – The BiMap containing reference data to be set.

class scrapemed._text.TextFigure(fig_root: Element, parent=None, ref_map: basicBiMap = {})[source]

Bases: TextElement

Initialize and parse a figure found in a text element of a PMC XML.

Parses figures into a dictionary with their information (label, caption, and link). Unfortunately, the links are relative and cannot be reliably traced to a public URI. This means I have not found a way to download the actual figures to store via Pillow, etc.

Parameters:
  • fig_root (ET.Element) – The root element of the figure found in PMC XML.

  • parent (TextElement) – The parent TextElement if applicable.

  • ref_map (basicBiMap) – The reference map for storing references found in the text.

Attributes:
  • fig_dict (dict): A dictionary containing figure information with keys:
    • ‘Label’: The label of the figure.

    • ‘Caption’: The caption of the figure.

    • ‘Link’: The link (relative) to the figure.

class scrapemed._text.TextParagraph(p_root: Element, parent=None, ref_map: basicBiMap = {})[source]

Bases: TextElement

Class representation of the data found in an XML <p> tag.

Parameters:
  • p_root (ET.Element) – The root element of the XML <p> tag.

  • parent (TextElement) – The parent TextElement if applicable.

  • ref_map (basicBiMap) – The reference map for storing references found in the text.

Attributes:
  • id (str): The identifier for the <p> tag.

  • text_with_refs (str): The text content of the <p> tag with references.

  • text (str): The clean text content of the <p> tag without references.

Methods:
  • __str__(): Return the clean text content as a string.

  • __eq__(other): Check if two TextParagraph objects are equal based on their text content.

class scrapemed._text.TextSection(sec_root: Element, parent=None, ref_map: basicBiMap = {})[source]

Bases: TextElement

Class representation of the data found in an XML <sec> tag.

Parameters:
  • sec_root (ET.Element) – The root element of the XML <sec> tag.

  • parent (TextElement) – The parent TextElement if applicable.

  • ref_map (basicBiMap) – The reference map for storing references found in the text.

Attributes:
  • title (str): The title of the section if available.

  • children (list): A list of TextSection, TextParagraph, TextTable, or TextFigure objects representing subsections, paragraphs, tables, or figures within the section.

  • text (str): The clean text content of the section without references.

  • text_with_refs (str): The text content of the section with references.

Methods:
  • __str__(): Return a string representation of the section with proper indentation.

  • get_section_text(): Get a text representation of the entire section without references.

  • get_section_text_with_refs(): Get a text representation of the entire section with references.

  • __eq__(other): Check if two TextSection objects are equal based on their title and children.

get_section_text()[source]

Get a text representation of the entire text section, without references.

Returns:

The text content of the section.

Return type:

str

get_section_text_with_refs()[source]

Get a text representation of the entire text section, with references.

Returns:

The text content of the section with references.

Return type:

str
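
To illustrate how nested sections can be flattened into indented text, here is a minimal recursive sketch using the standard library. It handles only <title>, <p>, and nested <sec> children; the helper name section_text_sketch and the two-space indentation are assumptions and do not reflect how TextSection is actually implemented.

```python
import xml.etree.ElementTree as ET


def section_text_sketch(sec: ET.Element, depth: int = 0) -> str:
    """Recursively render a <sec> element as indented plain text."""
    indent = "  " * depth
    lines = []
    for child in sec:
        if child.tag in ("title", "p"):
            # itertext() drops inline tags but keeps their text content.
            text = "".join(child.itertext()).strip()
            lines.append(indent + text)
        elif child.tag == "sec":
            # Subsections are rendered one indentation level deeper.
            lines.append(section_text_sketch(child, depth + 1))
    return "\n".join(lines)


xml = (
    "<sec><title>Methods</title>"
    "<p>We sampled <italic>n</italic> = 10 mice.</p>"
    "<sec><title>Analysis</title><p>Data were pooled.</p></sec></sec>"
)
rendered = section_text_sketch(ET.fromstring(xml))
```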

class scrapemed._text.TextTable(table_root: Element, parent=None, ref_map: basicBiMap = {})[source]

Bases: TextElement

Initialize and process a table-wrap found in a text element of PMC XML.

Uses pandas’ read_html function (which relies on lxml and falls back to html5lib) to process the HTML tables into dataframes.

Adds labels and captions if notated in the XML under //table-wrap/label and //table-wrap/caption/p tags.

Parameters:
  • table_root (ET.Element) – The root element of the table-wrap found in PMC XML.

  • parent (TextElement) – The parent TextElement if applicable.

  • ref_map (basicBiMap) – The reference map for storing references found in the text.

Attributes:
  • df (pandas.DataFrame): The dataframe representation of the table.

exception scrapemed._text.multipleTitleWarning[source]

Bases: Warning

Warned when one title expected, but multiple found.

exception scrapemed._text.readHTMLFailure[source]

Bases: Warning

Warned when the pandas read_html function fails (rare). May happen with tables void of readable data (i.e., tables that contain only graphics).

scrapemed._text.stringify_children(node, encoding='utf-8')[source]

Returns a string representation of a node and all its children (recursively), including markup language tags.

Turns any byte strings in the subtree representation to regular strings, following the provided encoding.

Parameters:
  • node (ET.Element) – The XML node to stringify.

  • encoding (str) – The encoding to use for decoding byte strings (default is ‘utf-8’).

Returns:

A string representation of the node and its children, including markup language tags.

Return type:

str
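
A common way to serialize a node's contents with markup intact is sketched below using the standard library instead of lxml. The sketch returns only the inner content of the node (its leading text plus each serialized child), which is an assumption about the intended output; the helper name stringify_children_sketch is hypothetical.

```python
import xml.etree.ElementTree as ET


def stringify_children_sketch(node: ET.Element) -> str:
    """Return the node's inner content, including child tags, as one string."""
    parts = [node.text or ""]
    for child in node:
        # In the standard library, tostring() also emits the child's tail text.
        parts.append(ET.tostring(child, encoding="unicode"))
    return "".join(parts)


node = ET.fromstring('<p>Hello <bold>world</bold>, see <xref rid="B1">[1]</xref>.</p>')
inner = stringify_children_sketch(node)
```

Passing encoding="unicode" makes tostring() return str directly, so no separate decoding step is needed in this variant.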

exception scrapemed._text.unhandledTextTagWarning[source]

Bases: Warning

Warned when a tag is encountered in a text section of the XML, but is not explicitly handled by ScrapeMed.

The tag contents will be ignored if not handled manually.

Feel free to submit a PR if you add a non-breaking code addition to handle these types of tags.

scrapemed._morehtml module

ScrapeMed’s Custom Markup Language - MoreHTML (MHTML)

Wrapper on basic functions for HTML manipulation.

Adds the following on top of core html functionality: an unescape function that can leave specified character references escaped, and custom MHTML tag encoding and removal.

scrapemed._morehtml.generate_mhtml_tag(string: str) str[source]

Generates an MHTML tag from the provided string.

Parameters:

string (str) – The text to be tagged in MHTML format.

Returns:

An MHTML tag containing the input string, in format [MHTML::string].

Return type:

str

scrapemed._morehtml.generate_typed_mhtml_tag(tag_type: str, string: str) str[source]

Generates a typed MHTML tag from the provided string.

Parameters:
  • tag_type (str) – The type of the MHTML tag.

  • string (str) – The text to be tagged in MHTML format.

Returns:

A typed MHTML tag containing the input string, in format [MHTML::type::string].

Return type:

str

scrapemed._morehtml.remove_mhtml_tags(text: str) str[source]

Removes all MHTML tags and typed MHTML tags found in the provided text.

Parameters:

text (str) – The text from which to remove MHTML tags.

Returns:

The text with MHTML tags removed.

Return type:

str
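
The tag formats documented above, [MHTML::string] and [MHTML::type::string], can be illustrated with a few small helpers. The function names and the removal pattern below are assumptions made for this sketch; in particular, the regular expression assumes tag contents never contain a closing square bracket.

```python
import re

# Matches both plain and typed MHTML tags, assuming no "]" inside the tag.
MHTML_PATTERN = re.compile(r"\[MHTML::[^\]]*\]")


def mhtml_tag_sketch(string: str) -> str:
    """Build a plain MHTML tag."""
    return f"[MHTML::{string}]"


def typed_mhtml_tag_sketch(tag_type: str, string: str) -> str:
    """Build a typed MHTML tag."""
    return f"[MHTML::{tag_type}::{string}]"


def remove_mhtml_tags_sketch(text: str) -> str:
    """Delete every MHTML tag from the text."""
    return MHTML_PATTERN.sub("", text)


tagged = (
    "Growth doubled" + mhtml_tag_sketch("3")
    + " in group A" + typed_mhtml_tag_sketch("table", "2") + "."
)
plain = remove_mhtml_tags_sketch(tagged)
```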

scrapemed._morehtml.unescape_except(s, **kwargs)[source]

Convert all named and numeric character references in the provided string to the corresponding Unicode characters, excluding any provided encodings to be ignored.

Parameters:
  • s (str) – The input string containing character references.

  • kwargs (dict) –

    Keyword arguments of the form key=encoding. These encodings will be ignored when unescaping.

    For keys with multiple encodings, use unique keynames.

    Encodings must be single code strings.

Returns:

A string with character references unescaped, except for the specified encodings to be ignored.

Return type:

str

This function uses the rules defined by the HTML 5 standard for both valid and invalid character references, and the list of HTML 5 named character references defined in html.entities.html5.
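
One way to get this behavior on top of the standard library's html.unescape is to shield the ignored encodings with temporary placeholders. The sketch below takes that approach; it is not necessarily how scrapemed implements unescape_except, and the helper name and the private-use placeholder characters are assumptions.

```python
import html


def unescape_except_sketch(s: str, **kwargs: str) -> str:
    """Unescape character references, leaving the given encodings untouched."""
    placeholders = {}
    for i, encoding in enumerate(kwargs.values()):
        # Private-use code points are assumed not to occur in the input.
        token = f"\ue000{i}\ue001"
        placeholders[token] = encoding
        s = s.replace(encoding, token)
    s = html.unescape(s)
    # Restore the shielded encodings exactly as they were written.
    for token, encoding in placeholders.items():
        s = s.replace(token, encoding)
    return s


result = unescape_except_sketch("&lt;b&gt; caf&eacute; &amp; tea", lt="&lt;", gt="&gt;")
```

Here the angle-bracket references stay escaped, so the output cannot be mistaken for markup, while the remaining references are converted to their Unicode characters.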

scrapemed.utils module

ScrapeMed’s Utility Module

The utils module contains various utilities necessary for PMC data parsing and cleaning, where the utilities are not directly related to text cleaning (_clean), recursive text processing (_text), or parsing (_parse).

At the moment, the module contains a helper function for cleaning up docstrings, and a class, basicBiMap, which is a two-way map similar to Python’s dict class, used for efficient storage of data reference maps used throughout ScrapeMed.

Note: Data reference maps are used to pull citations, tables, and figures out of text for parsing elsewhere while retaining placeholders in the original text.

class scrapemed.utils.basicBiMap(*args, **kwargs)[source]

Bases: dict

BiMap class which extends Python’s dict class to also store a reverse of itself for efficiency.

This class allows for bidirectional mapping between keys and values. When a key-value pair is added, it can be accessed both by key and by value.

Methods are inherited from the dict class, and additional methods specific to bidirectional mapping are provided.

Parameters:
  • *args

    Positional arguments passed to the dict constructor.

  • **kwargs

    Keyword arguments passed to the dict constructor.

Example:

    bi_map = basicBiMap({'apple': 'red', 'banana': 'yellow'})
    print(bi_map['apple'])  # Output: 'red'
    print(bi_map.reverse['red'])  # Output: 'apple'

Variables:

reverse (dict) – A reverse mapping from values to keys.
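
The idea of a dict that also maintains its own reverse can be sketched in a few lines. The class below, BiMapSketch, is a hypothetical illustration rather than the basicBiMap source; it assumes values are hashable and unique, and it only keeps the reverse in sync for construction and item assignment.

```python
class BiMapSketch(dict):
    """A dict that also keeps a value-to-key mapping in .reverse."""

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # Build the reverse mapping once from the initial contents.
        self.reverse = {value: key for key, value in self.items()}

    def __setitem__(self, key, value):
        if key in self:
            # Drop the stale reverse entry before overwriting the key.
            self.reverse.pop(self[key], None)
        super().__setitem__(key, value)
        self.reverse[value] = key


bi_map = BiMapSketch({"apple": "red"})
bi_map["banana"] = "yellow"
bi_map["apple"] = "green"
```

Storing the reverse explicitly trades extra memory for constant-time lookups by value, which would otherwise require scanning every item.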

scrapemed.utils.cleanerdoc(s)[source]

Wrapper for inspect.cleandoc which also removes newlines.

Parameters:

s (str) – The string to clean and remove newlines from.

Returns:

The cleaned and newline-removed string.

Return type:

str
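
A wrapper of this kind might look like the sketch below. Whether newlines are replaced by a space or removed outright is not stated above, so joining lines with a single space is an assumption, as is the helper name cleanerdoc_sketch.

```python
import inspect


def cleanerdoc_sketch(s: str) -> str:
    """Normalize docstring indentation, then join the lines into one."""
    # Assumption: newlines become single spaces so words do not run together.
    return inspect.cleandoc(s).replace("\n", " ")


message = cleanerdoc_sketch(
    """
    First line
    second line
    """
)
```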

exception scrapemed.utils.reversedBiMapComparisonWarning[source]

Bases: Warning

Warned when comparing basicBiMaps which are exactly the same but reversed.

scrapemed.trees module

ScrapeMed’s Trees Module

Scrapemed’s trees module handles PMC article tree visualizations, statistics, and descriptions.

scrapemed.trees.investigate_xml_tree(root: Element) None[source]

Print basic statistics and information about an XML tree provided its root.

Parameters:

root (ET.Element) – The root of an ElementTree of your XML.

This function prints the following information to stdout:
  • Number of elements in the XML tree.

  • Unique element types in the XML tree.

  • A dictionary with tag frequencies in the XML tree.
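
The statistics listed above can be computed with a single pass over the tree. The sketch below returns them as a dictionary instead of printing them, and uses the standard library in place of lxml; the helper name tree_stats_sketch and the result keys are assumptions.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def tree_stats_sketch(root: ET.Element) -> dict:
    """Collect the element count, unique tags, and tag frequencies."""
    # iter() walks the root and all of its descendants.
    tags = [element.tag for element in root.iter()]
    frequencies = Counter(tags)
    return {
        "n_elements": len(tags),
        "unique_tags": sorted(frequencies),
        "tag_frequencies": dict(frequencies),
    }


root = ET.fromstring("<article><body><sec><p>a</p><p>b</p></sec></body></article>")
stats = tree_stats_sketch(root)
```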

scrapemed.trees.visualize_element_tree(root: Element, title='data/element_tree.gv', test_mode=False) None[source]

Visualize an XML element tree using Graphviz.

Parameters:
  • root (ET.Element) – The root of the XML element tree to visualize.

  • title (str) – The title or filename for the output visualization. Default is “data/element_tree.gv”.

  • test_mode (bool) – Whether to render the visualization in test mode or not.

This function creates a visualization of the XML element tree using Graphviz and optionally renders it.
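
To show what such a visualization encodes, the sketch below emits Graphviz DOT source by hand instead of calling the graphviz package, so it has no third-party dependencies. The helper name element_tree_to_dot and the node naming scheme (n0, n1, ...) are assumptions; rendering the result would still require Graphviz itself.

```python
import xml.etree.ElementTree as ET
from itertools import count


def element_tree_to_dot(root: ET.Element) -> str:
    """Return DOT source with one node per element and parent-to-child edges."""
    ids = count()
    lines = ["digraph element_tree {"]

    def visit(element: ET.Element, parent_id=None) -> None:
        node_id = next(ids)
        lines.append(f'  n{node_id} [label="{element.tag}"];')
        if parent_id is not None:
            lines.append(f"  n{parent_id} -> n{node_id};")
        for child in element:
            visit(child, node_id)

    visit(root)
    lines.append("}")
    return "\n".join(lines)


dot_source = element_tree_to_dot(ET.fromstring("<a><b/><c/></a>"))
```

Giving every element its own numbered node, rather than using the tag as the node name, keeps repeated tags such as multiple <p> elements distinct in the graph.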

Module contents