Retrieving Information from the PDB using the Web API

Overview

Questions:

  • What is a REST API?

  • How can I use Python to retrieve data from a REST API?

Objectives:

  • Manipulate URLs to use the PDB REST API.

  • Use Python requests to retrieve information from the PDB.

In this lesson, we will explore how to retrieve information from the PDB Web API.

Many databases can be accessed programmatically through something called a REST API. REST stands for Representational State Transfer. API stands for Application Programming Interface. A REST API is a type of web API that is used to allow different software systems to communicate with each other over the internet.

Usually a REST API is accessed by varying parameters in a web address.

We will work with two types of API from the PDB in this lesson.

File Download using Biopython

Biopython has functions for retrieving structure files. To download a structure file from the PDB, you create a PDBList object and then call its retrieve_pdb_file method, which takes the PDB ID as a parameter. You specify the folder where you want the file saved using pdir=FOLDER. To download the structure in mmCIF format, add file_format="mmCif".

For example, the following cell downloads the PDB entry 4HHB, hemoglobin.

from Bio.PDB import PDBList

# Create an instance of the PDBList class
pdb_list = PDBList()

# Specify the PDB ID of the structure you want to download
pdb_id = "4hhb"  # hemoglobin

# Download the MMCIF file using the retrieve_pdb_file method
pdb_filename = pdb_list.retrieve_pdb_file(pdb_id, pdir="data/PDB_files", file_format="mmCif")

# Print the name of the downloaded file
print(pdb_filename)
Downloading PDB structure '4hhb'...
data/PDB_files/4hhb.cif

If you check your data/PDB_files folder, you will see that you now have a file called 4hhb.cif.

PDB Data API

The PDB Data API lets you retrieve information about a molecule over the web using its PDB ID. It gives you information about the PDB entry, rather than the structure file you get from the Biopython file download.

As mentioned above, web APIs often work by varying text in a web address (also called a URL).

The PDB Data API has the following format:

https://data.rcsb.org/rest/v1/core/<TYPE_OF_RESOURCE>/<IDENTIFIER>

For example, to get the full entry for 4hhb, you would do:

https://data.rcsb.org/rest/v1/core/entry/4hhb

Try clicking this link! It will display text in your browser with info about 4HHB.

There are many things you can do with the REST API that are beyond this workshop. However, one interesting thing you can do is to retrieve PubMed annotations for an entry’s primary citation.

https://data.rcsb.org/rest/v1/core/pubmed/4hhb
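Because these addresses all follow the same pattern, they can be assembled in Python rather than typed by hand. The following is a minimal sketch, not part of the PDB's own tooling: the constant DATA_API_ROOT and the helper build_data_api_url are names made up for this illustration.

```python
# Hypothetical helper for illustration; the name is not part of any library.
DATA_API_ROOT = "https://data.rcsb.org/rest/v1/core"


def build_data_api_url(resource_type, *identifiers):
    """Join a resource type and one or more identifiers into a Data API URL."""
    parts = [DATA_API_ROOT, resource_type] + [str(i) for i in identifiers]
    return "/".join(parts)


# Build the two URLs shown above for entry 4hhb.
entry_url = build_data_api_url("entry", "4hhb")
pubmed_url = build_data_api_url("pubmed", "4hhb")

print(entry_url)
print(pubmed_url)
```

Only the last two pieces of the address change between requests, which is what makes this kind of API easy to drive from a program.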

The data in these results is in a format called JSON. This is a commonly used return format for web APIs because it is easy to process programmatically. Its structure is similar to a Python dictionary, with keys and values.
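To see how closely JSON resembles a Python dictionary, here is a small sketch using Python's built-in json module. The text in json_text is a short, hand-abbreviated piece shaped like part of the 4HHB entry response, not the full result.

```python
import json

# A short piece of JSON text, abbreviated by hand from the kind of
# response the entry URL returns.
json_text = (
    '{"struct_keywords": {"pdbx_keywords": "OXYGEN TRANSPORT"}, '
    '"exptl": [{"method": "X-RAY DIFFRACTION"}]}'
)

# json.loads converts JSON text into Python dictionaries and lists.
parsed = json.loads(json_text)

print(type(parsed))
print(parsed["exptl"][0]["method"])
```

JSON objects become dictionaries and JSON arrays become lists, so ordinary Python indexing works on the parsed result.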

Programmatic Access of APIs

REST APIs start being more useful when you access them programmatically. We are going to use Python to retrieve the data at the URL and convert it to a format we can work with.

We will use a Python library called requests. Requests is used to retrieve information from websites and URLs.

import requests

To get information from a URL, we use the requests.get function. The argument to this function is the URL we’d like to retrieve information from.

data = requests.get("https://data.rcsb.org/rest/v1/core/entry/4hhb")

Our data variable now contains the results and other information about the request we made.

If our request was successful, it will have a status code of 200. This is a feature of the web in general: when a URL is retrieved successfully, the server sends back a status code of 200. This is true for any website you visit. One status code you may be familiar with is 404, which occurs when a resource is not found.

Note that status_code is an attribute of data rather than a method, so we do not need to use ().

data.status_code
200
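Status codes are standardized, and Python's standard library knows their names. The sketch below uses http.HTTPStatus to turn a numeric code into its standard phrase. The helper names describe_status and is_success are made up for this example.

```python
from http import HTTPStatus


def describe_status(code):
    """Return the numeric code followed by its standard phrase."""
    return f"{code} {HTTPStatus(code).phrase}"


def is_success(code):
    """Codes in the 200 range mean the request succeeded."""
    return 200 <= code < 300


for code in (200, 404):
    print(describe_status(code), "| success:", is_success(code))
```

With a requests response, the equivalent check would be data.status_code == 200. Requests also offers data.ok and data.raise_for_status() for the same purpose.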

We can get the JSON associated with our request by calling the .json() method, and we will save the result in a variable called info_4hhb. This variable is a Python dictionary, a data type that stores key-value pairs.

info_4hhb = data.json()

Once we have loaded the returned values from JSON, we can see all of the keys associated with an entry using variable.keys().

info_4hhb.keys()
dict_keys(['audit_author', 'cell', 'citation', 'diffrn', 'entry', 'exptl', 'exptl_crystal', 'pdbx_audit_revision_category', 'pdbx_audit_revision_details', 'pdbx_audit_revision_group', 'pdbx_audit_revision_history', 'pdbx_audit_revision_item', 'pdbx_database_pdbobs_spr', 'pdbx_database_related', 'pdbx_database_status', 'rcsb_accession_info', 'rcsb_entry_container_identifiers', 'rcsb_entry_info', 'rcsb_primary_citation', 'refine', 'refine_hist', 'struct', 'struct_keywords', 'symmetry', 'rcsb_id'])

The variable info_4hhb is a Python dictionary, meaning that we can access information in the dictionary using the syntax:

dictionary_name["key_name"]
info_4hhb["exptl"]
[{'method': 'X-RAY DIFFRACTION'}]
info_4hhb["struct_keywords"]
{'pdbx_keywords': 'OXYGEN TRANSPORT', 'text': 'OXYGEN TRANSPORT'}
info_4hhb["struct"]
{'title': 'THE CRYSTAL STRUCTURE OF HUMAN DEOXYHAEMOGLOBIN AT 1.74 ANGSTROMS RESOLUTION'}
info_4hhb["rcsb_entry_info"]
{'assembly_count': 1,
 'branched_entity_count': 0,
 'cis_peptide_count': 0,
 'deposited_atom_count': 4779,
 'deposited_deuterated_water_count': 0,
 'deposited_hydrogen_atom_count': 0,
 'deposited_model_count': 1,
 'deposited_modeled_polymer_monomer_count': 574,
 'deposited_nonpolymer_entity_instance_count': 6,
 'deposited_polymer_entity_instance_count': 4,
 'deposited_polymer_monomer_count': 574,
 'deposited_solvent_atom_count': 221,
 'deposited_unmodeled_polymer_monomer_count': 0,
 'disulfide_bond_count': 0,
 'entity_count': 5,
 'experimental_method': 'X-ray',
 'experimental_method_count': 1,
 'inter_mol_covalent_bond_count': 0,
 'inter_mol_metalic_bond_count': 4,
 'molecular_weight': 64.74,
 'na_polymer_entity_types': 'Other',
 'nonpolymer_bound_components': ['HEM'],
 'nonpolymer_entity_count': 2,
 'nonpolymer_molecular_weight_maximum': 0.62,
 'nonpolymer_molecular_weight_minimum': 0.09,
 'polymer_composition': 'heteromeric protein',
 'polymer_entity_count': 2,
 'polymer_entity_count_dna': 0,
 'polymer_entity_count_rna': 0,
 'polymer_entity_count_nucleic_acid': 0,
 'polymer_entity_count_nucleic_acid_hybrid': 0,
 'polymer_entity_count_protein': 2,
 'polymer_entity_taxonomy_count': 2,
 'polymer_molecular_weight_maximum': 15.89,
 'polymer_molecular_weight_minimum': 15.15,
 'polymer_monomer_count_maximum': 146,
 'polymer_monomer_count_minimum': 141,
 'resolution_combined': [1.74],
 'selected_polymer_entity_types': 'Protein (only)',
 'solvent_entity_count': 1,
 'structure_determination_methodology': 'experimental',
 'structure_determination_methodology_priority': 10,
 'diffrn_resolution_high': {'provenance_source': 'From refinement resolution cutoff',
  'value': 1.74}}
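Many of these values are nested: a key may lead to another dictionary, or to a list that contains dictionaries. The sketch below shows how to drill down to a single value. To keep it runnable without a network connection, it uses entry_excerpt, a hand-copied subset of the output shown above, rather than the full info_4hhb dictionary.

```python
# A hand-copied subset of the 4hhb entry data shown above.
entry_excerpt = {
    "exptl": [{"method": "X-RAY DIFFRACTION"}],
    "rcsb_entry_info": {
        "molecular_weight": 64.74,
        "nonpolymer_bound_components": ["HEM"],
        "resolution_combined": [1.74],
    },
}

# "exptl" is a list of dictionaries, so we index the list first.
method = entry_excerpt["exptl"][0]["method"]

# "rcsb_entry_info" is a dictionary whose values can themselves be lists.
resolution = entry_excerpt["rcsb_entry_info"]["resolution_combined"][0]

# .get() returns a default instead of raising KeyError when a key is absent.
ligands = entry_excerpt["rcsb_entry_info"].get("nonpolymer_bound_components", [])

print(method, resolution, ligands)
```

Using .get() with a default can be helpful when looping over many entries, since not every entry is guaranteed to have every key.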

You can also use the PDB Data API to retrieve information about interfaces between polymeric entities (proteins or nucleic acids) by changing the URL to use a different API endpoint.

The format for querying about interfaces is

https://data.rcsb.org/rest/v1/core/interface/<pdb_id>/<assembly_id>/<interface_id>

In the cell below, we get the first interface for assembly 1 (there is only one assembly in this PDB entry).

interface = requests.get("https://data.rcsb.org/rest/v1/core/interface/4hhb/1/1")
interface.status_code
200
interface_info = interface.json()
interface_info["rcsb_interface_info"]
{'polymer_composition': 'Protein (only)',
 'interface_character': 'hetero',
 'interface_area': 847.773205021308,
 'num_interface_residues': 44,
 'num_core_interface_residues': 11}

Try changing the numbers in the URL. How does it change your results?

PDB Search API

The API we were just working with is called the PDB “Data” API.

However, the PDB has another API called the “Search” API that lets you search based on keywords, host species, sequence similarity, and many other things. The Search API can be complicated, but it is well documented.

The format for this URL is:

https://search.rcsb.org/rcsbsearch/v2/query?json={search-request}

Here, search-request is a JSON object (similar to a Python dictionary) containing your search parameters. We will create a Python dictionary, then use the json library to convert it to a JSON string (the data type required by the Search API).
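Before building a real query, it may help to see what this conversion does to a very small dictionary. The sketch below is an illustration only: small_query is a deliberately incomplete query used to show the mechanics. It also uses urllib.parse.quote to show how the JSON text looks once it is percent-encoded for use in a URL.

```python
import json
from urllib.parse import quote, unquote

# A deliberately tiny dictionary, used only to show the conversion.
small_query = {"return_type": "entry"}

# json.dumps turns the dictionary into a string of JSON text.
query_text = json.dumps(small_query)
print(type(query_text), query_text)

# Characters such as quotes, braces, and spaces are not URL-safe,
# so they are percent-encoded when placed in a web address.
encoded = quote(query_text)
print(encoded)

# Decoding and parsing recovers the original dictionary.
round_trip = json.loads(unquote(encoded))
print(round_trip == small_query)
```

The requests library will usually take care of this encoding for you when you pass it a URL, but seeing the encoded form can help when debugging a query that does not behave as expected.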

import json

my_query = {
  "query": {
    "type": "terminal",
    "service": "full_text",
    "parameters": {
        "value": '"oxygen storage"'
    }
  },
  
  "return_type": "entry"
}

my_query = json.dumps(my_query)

Now, we use requests.get with our search query and the Search API URL format.

data = requests.get(f"https://search.rcsb.org/rcsbsearch/v2/query?json={my_query}")
results = data.json()
results
{'query_id': 'b9a88423-cdbe-414b-b39c-7fd14309c303',
 'result_type': 'entry',
 'total_count': 668,
 'result_set': [{'identifier': '2BMM', 'score': 1.0},
  {'identifier': '1UVY', 'score': 0.9861872248797519},
  {'identifier': '1UVX', 'score': 0.9727507745902415},
  {'identifier': '1UX8', 'score': 0.9596755540123255},
  {'identifier': '2AWC', 'score': 0.9443018967660793},
  {'identifier': '7DDS', 'score': 0.9350267846701926},
  {'identifier': '2EB8', 'score': 0.9345519710130025},
  {'identifier': '2EF2', 'score': 0.9345519710130025},
  {'identifier': '1D8U', 'score': 0.922477162878988},
  {'identifier': '1DUK', 'score': 0.922477162878988}],
 'facets': []}

Our results tell us that our query returned 668 results (total_count). The result_set key holds a list of results. The identifier key in each item of result_set tells us the PDB ID of a search result.

You will notice that even though the search has told us there are 668 results, we only have 10 values in our results variable. This is because the API always counts the total number of results, but will only return 10 unless we ask for more.
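Since result_set is a list of dictionaries, a list comprehension is a convenient way to pull out only the PDB IDs. The sketch below uses results_excerpt, a hand-copied portion of the output above, so that it runs without a network connection. With the real data you would use the results variable instead.

```python
# A hand-copied portion of the search output shown above.
results_excerpt = {
    "total_count": 668,
    "result_set": [
        {"identifier": "2BMM", "score": 1.0},
        {"identifier": "1UVY", "score": 0.9861872248797519},
        {"identifier": "1UVX", "score": 0.9727507745902415},
    ],
}

# Collect the PDB ID from each item in the result set.
pdb_ids = [hit["identifier"] for hit in results_excerpt["result_set"]]

print(pdb_ids)
print(f"Retrieved {len(pdb_ids)} of {results_excerpt['total_count']} results")
```

Comparing the length of the list with total_count is a quick way to check whether you have all of the results or only the first page.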

We could now combine this with Biopython or the Data API to get information about the structures. For example, to retrieve the title of the first structure in our results, we could do the following:

first_result = results["result_set"][0]["identifier"]
print(first_result)
2BMM
data = requests.get(f"https://data.rcsb.org/rest/v1/core/entry/{first_result}")
result = data.json()
result["struct"]
{'title': 'X-ray structure of a novel thermostable hemoglobin from the actinobacterium Thermobifida fusca'}

You can also control the format of the returned results, including the number of results and how they are sorted. The query below searches for the phrase “oxygen storage” in the keywords of the structure. We’ve specified that we want 50 values returned (instead of the default 10), and for the results to be sorted by initial release date in ascending order (oldest result first). If you wanted to see the most recently released results first, you would change “asc” to “desc” in the query below.

my_query = {
  "query": {
    "type": "terminal",
    "service": "text",
    "parameters": {
        "attribute": "struct_keywords.pdbx_keywords",
        "operator": "contains_phrase",
        "value": '"oxygen storage"'
    }
  },

  "request_options": {
    "paginate": {
      "start": 0,
      "rows": 50,
    },
    "sort": [
      {
      "sort_by": "rcsb_accession_info.initial_release_date",
      "direction": "asc" 
    }
    ]
  },
  
  "return_type": "entry"
}

my_query = json.dumps(my_query)

As before, we use requests.get with our search query and the Search API URL format.

data = requests.get(f"https://search.rcsb.org/rcsbsearch/v2/query?json={my_query}")
data.status_code
200
results = data.json()
results
{'query_id': '631ced89-1d3c-4b87-8d13-18d8865399b8',
 'result_type': 'entry',
 'total_count': 571,
 'result_set': [{'identifier': '1MBN', 'score': 1.0},
  {'identifier': '1MBD', 'score': 1.0},
  {'identifier': '1MBO', 'score': 1.0},
  {'identifier': '1MBC', 'score': 1.0},
  {'identifier': '4MBN', 'score': 1.0},
  {'identifier': '5MBN', 'score': 1.0},
  {'identifier': '1MBA', 'score': 1.0},
  {'identifier': '1PMB', 'score': 1.0},
  {'identifier': '3MBA', 'score': 1.0},
  {'identifier': '4MBA', 'score': 1.0},
  {'identifier': '2MB5', 'score': 1.0},
  {'identifier': '1MBI', 'score': 1.0},
  {'identifier': '5MBA', 'score': 1.0},
  {'identifier': '2FAL', 'score': 1.0},
  {'identifier': '2FAM', 'score': 1.0},
  {'identifier': '1MYG', 'score': 1.0},
  {'identifier': '1MYH', 'score': 1.0},
  {'identifier': '1MYI', 'score': 1.0},
  {'identifier': '1MYJ', 'score': 1.0},
  {'identifier': '1YCA', 'score': 1.0},
  {'identifier': '1YCB', 'score': 1.0},
  {'identifier': '2MGA', 'score': 1.0},
  {'identifier': '2MGB', 'score': 1.0},
  {'identifier': '2MGC', 'score': 1.0},
  {'identifier': '2MGD', 'score': 1.0},
  {'identifier': '2MGE', 'score': 1.0},
  {'identifier': '2MGF', 'score': 1.0},
  {'identifier': '2MGG', 'score': 1.0},
  {'identifier': '2MGH', 'score': 1.0},
  {'identifier': '2MGI', 'score': 1.0},
  {'identifier': '2MGJ', 'score': 1.0},
  {'identifier': '2MGK', 'score': 1.0},
  {'identifier': '2MGL', 'score': 1.0},
  {'identifier': '2MGM', 'score': 1.0},
  {'identifier': '2MYA', 'score': 1.0},
  {'identifier': '2MYB', 'score': 1.0},
  {'identifier': '2MYC', 'score': 1.0},
  {'identifier': '2MYD', 'score': 1.0},
  {'identifier': '2MYE', 'score': 1.0},
  {'identifier': '2SPL', 'score': 1.0},
  {'identifier': '2SPM', 'score': 1.0},
  {'identifier': '2SPN', 'score': 1.0},
  {'identifier': '2SPO', 'score': 1.0},
  {'identifier': '1MLF', 'score': 1.0},
  {'identifier': '1MLG', 'score': 1.0},
  {'identifier': '1MLH', 'score': 1.0},
  {'identifier': '1MLJ', 'score': 1.0},
  {'identifier': '1MLK', 'score': 1.0},
  {'identifier': '1MLL', 'score': 1.0},
  {'identifier': '1MLM', 'score': 1.0}],
 'facets': []}
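The query returned 50 of the 571 matching entries. To collect the rest, one approach is to repeat the query while increasing the "start" value in "paginate". The sketch below only computes the paginate settings and makes no requests. The helper page_starts is a made-up name, and the assumption that advancing "start" by "rows" steps through consecutive pages should be checked against the Search API documentation.

```python
def page_starts(total_count, rows):
    """Return the 'start' offset for each page needed to cover total_count."""
    return list(range(0, total_count, rows))


# Values taken from the output above: 571 matches, 50 rows per request.
starts = page_starts(571, 50)

# One "paginate" block per request, ready to place in "request_options".
paginate_blocks = [{"start": start, "rows": 50} for start in starts]

print(len(paginate_blocks))
print(paginate_blocks[0])
print(paginate_blocks[-1])
```

Each block would replace the "paginate" section of the query before it is converted with json.dumps and sent. When making many requests in a loop, it is good practice to pause briefly between them.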

Using Biopython to Analyze Search Results

The API becomes really interesting when we use the results for analysis. Keeping our example of oxygen storage, we could use Biopython to analyze the structures we have retrieved to see if there are heme groups in the structures and if there are common motifs in the binding of the heme groups to proteins.

The following cells show a fairly complicated Biopython analysis of the structures we retrieved with the search just above this section. If you are new to programming, the next few cells will be hard to understand. However, they demonstrate the type of analysis that can be done by retrieving search results programmatically and analyzing the data using Python. The end of the analysis prints the combinations of residues most commonly found neighboring iron in the structures retrieved by the search.

The program has multiple steps:

  1. Biopython is used to download the mmcif files for all of the structures.

  2. Biopython is used to create structure objects, sort through atoms, and find residues neighboring iron atoms.

  3. Python tools (in particular, Counter) are used to count residue combinations.

## Step 1
from Bio.PDB import PDBList

# Create an instance of the PDBList class
pdb_list = PDBList()

# Download all of the structure files
for result in results["result_set"]:
    pdb_id = result["identifier"].lower()

    # Download the MMCIF file using the retrieve_pdb_file method
    pdb_filename = pdb_list.retrieve_pdb_file(pdb_id, pdir="pdb_files", file_format="mmCif")


## Step 2

from Bio.PDB.MMCIFParser import MMCIFParser
from Bio.PDB import NeighborSearch
from collections import Counter

# Create an MMCIFParser object to parse mmCIF files.
parser = MMCIFParser(QUIET=True)

# Define the maximum distance (in Ångströms) for identifying neighboring residues.
cutoff_distance = 5

# Initialize a dictionary to store the neighboring residues for each protein structure.
residue_neighbors = {}

# The 'results' variable is a dictionary containing search results.
# Each result in 'results["result_set"]' represents a protein structure with a PDB ID.
for result in results["result_set"]:
    # Extract the PDB ID and convert it to lowercase.
    pdb_id = result["identifier"].lower()

    # Parse the mmCIF file for the protein structure using the PDB ID.
    # The file is expected to be located in the 'pdb_files' directory.
    structure = parser.get_structure(pdb_id, f"pdb_files/{pdb_id}.cif")
    
    # Extract all atoms from the protein structure.
    atoms = list(structure.get_atoms())
    
    # Create a NeighborSearch object to perform neighbor searches.
    neighbor_search = NeighborSearch(atoms)
    
    # Initialize a list to store the neighboring residues for this protein structure.
    neighbor_list = []

    # Loop through the atoms in the protein structure.
    for atom in atoms:
        # Check if the atom is an iron (Fe) atom.
        if atom.element == "FE":
            # Get the parent residue of the iron atom.
            iron_residue = atom.get_parent()

            # Find atoms within the cutoff distance from the iron atom.
            neighbors = neighbor_search.search(atom.get_coord(), cutoff_distance)
            
            # Loop through the neighboring atoms.
            for neighbor in neighbors:
                # Get the parent residue of the neighboring atom.
                residue = neighbor.get_parent()
    
                # Check if the neighboring residue is different from the iron-containing residue.
                if residue != iron_residue:
                    # Add the neighboring residue to the list.
                    neighbor_list.append(residue)
                    
    # Store the unique neighboring residues in the dictionary using the PDB ID as the key.
    residue_neighbors[pdb_id] = set(neighbor_list)

# The 'residue_neighbors' dictionary contains the neighboring residues for each protein structure.


## Step 3
## Now we will want to count the residue neighbor types.
# Initialize an empty Counter object to store the counts of residue combinations.
combination_counts = Counter()

# Iterate over the items in the 'residue_neighbors' dictionary.
# Each item consists of a PDB ID ('pdb_id') and a set of neighboring residues ('neighbors') to iron atoms.
for pdb_id, neighbors in residue_neighbors.items():
    # Extract the residue names ('resname') for each neighboring residue using the 'get_resname' method.
    resname = [x.get_resname() for x in neighbors if x.get_resname()]
    
    # Count the occurrences of each residue name in the current combination.
    res_counts = Counter(resname)
    
    # Convert the residue counts to a tuple of (residue, count) pairs, sorted by residue name.
    # This standardizes the representation of each combination.
    combination = tuple(sorted(res_counts.items()))
    
    # Update the combination_counts with the current combination.
    combination_counts.update([combination])

# Use the 'most_common' method to get the most common residue combinations.
# The result is a list of tuples, where each tuple contains a combination and its count.
most_common_combinations = combination_counts.most_common()

# For example, to get the top 5 most common combinations, use 'most_common(5)'.
top_5_combinations = combination_counts.most_common(5)
print("\nTop 5 most common residue combinations for iron neighbors:")
for combination, count in top_5_combinations:
    combination_str = ', '.join([f"{n} {residue}" for residue, n in combination])
    print(f"Combination: {combination_str}, Count: {count}")
Top 5 most common residue combinations for iron neighbors:
Combination: 2 HIS, 1 HOH, 1 VAL, Count: 4
Combination: 1 CMO, 2 HIS, 1 VAL, Count: 3
Combination: 2 HIS, 1 HOH, 1 OXY, 1 VAL, Count: 2
Combination: 5 HIS, 2 HOH, 2 VAL, Count: 2
Combination: 4 HIS, 2 HOH, 2 VAL, Count: 2
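The counting in Step 3 can be easier to follow on a small example. The sketch below applies the same idea to toy_neighbors, a made-up dictionary of residue names (the IDs and lists are invented and are not real search results). Each structure's residue names are reduced to a sorted tuple of (residue, count) pairs, so that structures with the same neighbors produce identical tuples that can then be counted.

```python
from collections import Counter

# Invented data for illustration: residue names near iron in three structures.
toy_neighbors = {
    "aaaa": ["HIS", "HIS", "VAL", "HOH"],
    "bbbb": ["HOH", "VAL", "HIS", "HIS"],  # same residues, different order
    "cccc": ["HIS", "HIS", "VAL", "CMO"],
}

toy_counts = Counter()
for pdb_id, resnames in toy_neighbors.items():
    # Count each residue name, then sort so that order no longer matters.
    combination = tuple(sorted(Counter(resnames).items()))
    toy_counts[combination] += 1

# The most common combination and the number of structures that share it.
top_combination, n_structures = toy_counts.most_common(1)[0]
print(top_combination, n_structures)
```

Sorting before converting to a tuple is the important step: without it, "aaaa" and "bbbb" could be treated as different combinations even though they contain the same residues.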