Introduction to RDKit#

Overview

Questions:

  • What is RDKit?

  • How can I get molecule information in Python?

  • What is a molecular descriptor?

Objectives:

  • Use RDKit to create molecules in Python

There are also libraries in Python that are made for working just with chemical data. One commonly used library in Python for data science (or cheminformatics) is called RDKit.

RDKit lets us create variables which represent molecules and retrieve information about the molecules.

We are going to use a part of RDKit called Chem. To use Chem, we first have to import it.

from rdkit import Chem

Creating Molecules with RDKit#

To get information about molecules in RDKit, we have to first create variables representing molecules. RDKit has molecule object that can be used to retrieve information or calculate properties. First, the molecule name has to be communicated to RDKit in a way that computers understand.

Creating molecules using strings - SMILES#

We are going to use a format called SMILES. SMILES stands for “Simplified Molecular-Input Line-entry System” and is a string representation of a molecule. You usually will be looking up SMILES or having a program generate them for you. We could also open info from molecular file formats like mol.

In SMILES, if we want methane, we specify that with “C”. We can create a representation of methane using RDKit by using the MolFromSmiles function in rdkit.Chem.

methane = Chem.MolFromSmiles("C")
methane
../_images/rdkit_4_0.png

You can see more information on this Guide to SMILES. Usually, you will be able to look up the SMILES strings for your molecule of interest using something like PubChem, although you can also write SMILES by hand.

Check Your Understanding

Create RDKit molecules for the following molecules:

  1. Propane

  2. Ethene

  3. Cyclohexane

  4. A molecule of your choice.

Creating More Complicated Molecules#

We can look up SMILES of our molecules of choice on websites like PubChem. See the examples below for phenylalanine and benzene.

# this phenylalanine uses canonical smiles (no stereochemistry information)
phenylalanine = Chem.MolFromSmiles("C1=CC=C(C=C1)CC(C(=O)O)N")
phenylalanine
../_images/rdkit_7_0.png
# phenyalanine using isomeric smiles
l_phenylalanine = Chem.MolFromSmiles("C1=CC=C(C=C1)C[C@@H](C(=O)O)N")
l_phenylalanine
../_images/rdkit_8_0.png
benzene = Chem.MolFromSmiles("c1ccccc1")
benzene
../_images/rdkit_9_0.png

Working with Molecules#

RDKit will allow us to access information about our molecule. For example, we can have RDKit tell us the number of atoms in our molecule.

num_methane = methane.GetNumAtoms()
print(f"The number of atoms in methane is {num_methane}")
The number of atoms in methane is 1

This isn’t exactly what we expect. By default, RDKit only counts “heavy atoms”. This means that hydrogen isn’t included. We can tell RDKit to count hydrogens by adding onlyExplicit=False to our GetNumAtoms function.

num_methane_h = methane.GetNumAtoms(onlyExplicit=False)
print(f"The number of atoms in methane including hydrogens is {num_methane_h}.")
The number of atoms in methane including hydrogens is 5.

We can also get information about atoms or bonds from RDKit molecules. We can use the GetAtoms() function, then use a loop to look at the atoms.

ethanol = Chem.MolFromSmiles("CCO")

for atom in ethanol.GetAtoms():
    print(atom.GetSymbol(), atom.GetMass())
C 12.011
C 12.011
O 15.999

Like the previous example, the molecule does not have hydrogens by default. If you need to work with the hydrogens, you can use a function to add them. Now, when we look through the atoms, we get the hydrogens as well.

ethanol_h = Chem.AddHs(ethanol)

ethanol_h
../_images/rdkit_17_0.png
for atom in ethanol_h.GetAtoms():
    print(atom.GetSymbol())
C
C
O
H
H
H
H
H
H

Similarly, you can use the GetBonds method to get the bonds in a molecule.

for bond in ethanol.GetBonds():
    print(bond.GetBondType())
SINGLE
SINGLE

You can see the methods available on atoms and bonds by using the dir method in Python. The dir method is built-in and will tell you what attributes and methods are available on an object.

Note that the atom and bond variables used in the code blocks below are from our previous for loops.

dir(atom)
['ClearProp',
 'DescribeQuery',
 'GetAtomMapNum',
 'GetAtomicNum',
 'GetBonds',
 'GetBoolProp',
 'GetChiralTag',
 'GetDegree',
 'GetDoubleProp',
 'GetExplicitBitVectProp',
 'GetExplicitValence',
 'GetFormalCharge',
 'GetHybridization',
 'GetIdx',
 'GetImplicitValence',
 'GetIntProp',
 'GetIsAromatic',
 'GetIsotope',
 'GetMass',
 'GetMonomerInfo',
 'GetNeighbors',
 'GetNoImplicit',
 'GetNumExplicitHs',
 'GetNumImplicitHs',
 'GetNumRadicalElectrons',
 'GetOwningMol',
 'GetPDBResidueInfo',
 'GetProp',
 'GetPropNames',
 'GetPropsAsDict',
 'GetQueryType',
 'GetSmarts',
 'GetSymbol',
 'GetTotalDegree',
 'GetTotalNumHs',
 'GetTotalValence',
 'GetUnsignedProp',
 'HasOwningMol',
 'HasProp',
 'HasQuery',
 'InvertChirality',
 'IsInRing',
 'IsInRingSize',
 'Match',
 'NeedsUpdatePropertyCache',
 'SetAtomMapNum',
 'SetAtomicNum',
 'SetBoolProp',
 'SetChiralTag',
 'SetDoubleProp',
 'SetExplicitBitVectProp',
 'SetFormalCharge',
 'SetHybridization',
 'SetIntProp',
 'SetIsAromatic',
 'SetIsotope',
 'SetMonomerInfo',
 'SetNoImplicit',
 'SetNumExplicitHs',
 'SetNumRadicalElectrons',
 'SetPDBResidueInfo',
 'SetProp',
 'SetUnsignedProp',
 'UpdatePropertyCache',
 '__class__',
 '__copy__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__getstate__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__instance_size__',
 '__le__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__weakref__']
dir(bond)
['ClearProp',
 'DescribeQuery',
 'GetBeginAtom',
 'GetBeginAtomIdx',
 'GetBondDir',
 'GetBondType',
 'GetBondTypeAsDouble',
 'GetBoolProp',
 'GetDoubleProp',
 'GetEndAtom',
 'GetEndAtomIdx',
 'GetIdx',
 'GetIntProp',
 'GetIsAromatic',
 'GetIsConjugated',
 'GetOtherAtom',
 'GetOtherAtomIdx',
 'GetOwningMol',
 'GetProp',
 'GetPropNames',
 'GetPropsAsDict',
 'GetSmarts',
 'GetStereo',
 'GetStereoAtoms',
 'GetUnsignedProp',
 'GetValenceContrib',
 'HasOwningMol',
 'HasProp',
 'HasQuery',
 'IsInRing',
 'IsInRingSize',
 'Match',
 'SetBondDir',
 'SetBondType',
 'SetBoolProp',
 'SetDoubleProp',
 'SetIntProp',
 'SetIsAromatic',
 'SetIsConjugated',
 'SetProp',
 'SetStereo',
 'SetStereoAtoms',
 'SetUnsignedProp',
 '__class__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__getstate__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__le__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__weakref__']

Molecular Descriptors#

Molecular descriptors are properties of molecules which can be analyzed using statistical methods to make predictions about molecular properties. You can see a full list of RDKit descriptors or see the module documentation. RDKit has a module which can be used to retrieve and calculate molecular descriptors.

To get molecular descriptors from RDKit, we import the Descriptors module.

from rdkit.Chem import Descriptors

Some molecular descriptors we might be interested in are:

  1. Heavy Atom Count (HeavyAtomCount)

  2. Number of hydrogen bond acceptrs (NumHAcceptors)

  3. Number of hydrogen bond donors (NumHDonors)

  4. Number of valence electrons (NumValenceElectrons)

  5. Molecular weight (MolWt)

  6. Number of aromatic rings (NumAromaticRings) And many others!

To get the descritpor you’re interested in, the syntax is

Descriptors.DescriptorName(molecule_variable)
print("Printing info for benzene:")
print(f"The molecular weight is {Descriptors.MolWt(benzene)}")
print(f"The number of aromatic rings is {Descriptors.NumAromaticRings(benzene)}")
Printing info for benzene:
The molecular weight is 78.11399999999999
The number of aromatic rings is 1

Stereochemistry#

# Can tell us how many stereocenters
Chem.rdMolDescriptors.CalcNumAtomStereoCenters(phenylalanine)
1
# Can tell us the atom number and the orientation
print(Chem.FindMolChiralCenters(phenylalanine, includeUnassigned=True))
print(Chem.FindMolChiralCenters(l_phenylalanine, includeUnassigned=True))
[(7, '?')]
[(7, 'S')]
# Can generate stereoisomers
from rdkit.Chem.EnumerateStereoisomers import EnumerateStereoisomers, StereoEnumerationOptions

stereo = list(EnumerateStereoisomers(phenylalanine))

print(f"Generated {len(stereo)} stereoisomers.")
Generated 2 stereoisomers.
stereo[0]
../_images/rdkit_32_0.png
stereo[1]
../_images/rdkit_33_0.png