Introduction to Chemoinformatics: The Digital World of Molecules
Welcome to the first tutorial in our chemoinformatics series!
We'll explore the fundamental question: What is chemoinformatics? We'll also dive into how we represent complex chemical structures in a way a computer can understand.
Chemoinformatics, also known as cheminformatics, is an interdisciplinary field that combines chemistry, computer science, and data analysis to process and analyse chemical data. This is a working definition; other scientists define the field somewhat differently.
Chemoinformatics plays a central role in drug discovery, materials design, and molecular modeling, allowing researchers to predict molecular properties, design new compounds, and identify potential drug candidates computationally.
In simple terms, chemoinformatics helps scientists turn chemical structures into data that computers can understand and use for predictions.
Core Concepts and Fundamental Principles
1. Chemical Structure Representation
In chemoinformatics, understanding how to represent chemical structures in ways that computers can process is fundamental. Unlike humans, who can visualize molecular structures, computers require specialized notations:
SMILES (Simplified Molecular Input Line Entry System): A string-based notation that encodes molecular structures using ASCII characters. It is compact and human-readable, with atomic symbols for atoms and specific symbols for bonds. For instance, ethanol is written as "CCO" in SMILES notation.
InChI (International Chemical Identifier): A standardised, machine-readable identifier developed by IUPAC to give each molecular structure a unique representation. InChI is hierarchical, built from several layers that store different levels of structural information. Because a standard InChI is generated from the structure itself, the same molecule should receive the same InChI no matter where it comes from.
Molecular Graph Representations: These treat atoms as nodes and bonds as edges in a graph, enabling complex computational operations like substructure searching and molecular similarity calculations.
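To make the link between a SMILES string and a molecular graph concrete, here is a minimal sketch in plain Python. The function names smiles_to_graph and adjacency are hypothetical, and the parser deliberately handles only a small SMILES subset (single-letter atoms, branches, and explicit bond symbols). It is an illustration of the atoms-as-nodes, bonds-as-edges idea, not a substitute for a real toolkit.

```python
# Bond symbols supported by this toy parser and their bond orders.
BOND_ORDERS = {"-": 1, "=": 2, "#": 3}


def smiles_to_graph(smiles):
    """Parse a restricted SMILES subset into (atoms, bonds).

    Assumption: only the single-letter atoms B, C, N, O, P, S, F, I,
    parenthesised branches, and the bond symbols -, =, # are supported.
    Rings, aromatic atoms, charges, brackets, and two-letter elements
    such as Cl are NOT handled and raise ValueError.
    """
    atoms = []          # atoms[i] is the element symbol of node i
    bonds = {}          # (i, j) with i < j maps to the bond order
    branch_stack = []   # attachment points of currently open branches
    previous = None     # index of the atom the next atom bonds to
    pending_order = 1   # bond order for the next bond (single by default)
    for ch in smiles:
        if ch in "BCNOPSFI":
            atoms.append(ch)
            current = len(atoms) - 1
            if previous is not None:
                bonds[(previous, current)] = pending_order
            previous = current
            pending_order = 1
        elif ch in BOND_ORDERS:
            pending_order = BOND_ORDERS[ch]
        elif ch == "(":
            branch_stack.append(previous)
        elif ch == ")":
            previous = branch_stack.pop()
        else:
            raise ValueError(f"unsupported SMILES character: {ch!r}")
    return atoms, bonds


def adjacency(atoms, bonds):
    """Return a node -> list of neighbouring nodes mapping."""
    neighbours = {i: [] for i in range(len(atoms))}
    for i, j in bonds:
        neighbours[i].append(j)
        neighbours[j].append(i)
    return neighbours


atoms, bonds = smiles_to_graph("CCO")  # ethanol
print(atoms)                           # ['C', 'C', 'O']
print(bonds)                           # {(0, 1): 1, (1, 2): 1}
print(adjacency(atoms, bonds))         # {0: [1], 1: [0, 2], 2: [1]}
```

Hydrogens are left implicit here, as they usually are in SMILES. Once a molecule is stored this way, operations such as substructure searching become graph problems over the node and edge lists.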
2. Molecular Descriptors and Properties
Molecular descriptors are numerical values that capture specific characteristics of molecules, enabling quantitative analysis and comparison:
0D Descriptors: Basic properties including atom counts, molecular weight, and molar refractivity.
1D Descriptors: Fragment counts, hybridization states, hydrogen bond donors/acceptors, and polar surface area.
2D and 3D Descriptors: More complex descriptors derived from molecular topology or three-dimensional geometry that capture structural complexity and spatial arrangements.
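The simplest descriptors need nothing more than a molecular formula. The sketch below computes atom counts, a heavy-atom count, and molecular weight for ethanol. The helper names (atom_counts, molecular_weight, heavy_atom_count) and the small ATOMIC_MASS table are assumptions made for illustration, and the parser only accepts flat formulas without parentheses or charges.

```python
import re

# Rounded standard atomic weights for a handful of elements (illustrative
# subset; extend as needed).
ATOMIC_MASS = {
    "H": 1.008, "C": 12.011, "N": 14.007,
    "O": 15.999, "S": 32.06, "Cl": 35.45,
}


def atom_counts(formula):
    """Count atoms per element in a flat formula such as 'C2H6O'."""
    counts = {}
    for symbol, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[symbol] = counts.get(symbol, 0) + (int(number) if number else 1)
    return counts


def molecular_weight(formula):
    """Sum atomic masses weighted by atom counts (g/mol)."""
    return sum(ATOMIC_MASS[s] * n for s, n in atom_counts(formula).items())


def heavy_atom_count(formula):
    """Count all atoms other than hydrogen."""
    return sum(n for s, n in atom_counts(formula).items() if s != "H")


ethanol = "C2H6O"
print(atom_counts(ethanol))                 # {'C': 2, 'H': 6, 'O': 1}
print(f"{molecular_weight(ethanol):.3f}")   # 46.069
print(heavy_atom_count(ethanol))            # 3
```

Descriptors that depend on connectivity or 3D geometry cannot be derived from the formula alone; they need a full structure representation and are normally computed with a dedicated chemoinformatics toolkit.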
3. Chemical Databases and Data Mining
Chemical databases serve as essential repositories of chemical information, enabling researchers to access and analyze vast arrays of chemical structures, properties, and biological activities. Key databases include:
PubChem: A comprehensive database of chemical compounds maintained by the National Center for Biotechnology Information (NCBI).
ChEMBL: A curated database of bioactive molecules with drug-like properties, containing information on compound activities and target interactions.
ChemSpider: A large database of chemical structures, properties, and related information, run by the Royal Society of Chemistry.
These databases support key chemoinformatics tasks such as compound searching, structure-activity relationship analysis, virtual screening, and data mining for new knowledge.
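As an illustration of programmatic compound searching, the sketch below builds a lookup URL for PubChem's PUG REST interface and, optionally, fetches a few properties by compound name. The helper names property_url and fetch_properties are hypothetical. The URL pattern and the response layout ("PropertyTable" containing a "Properties" list) reflect the PUG REST service as commonly documented, but treat them as assumptions and check the current PubChem documentation before relying on them.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

# Base address of the PubChem PUG REST service.
PUG_REST = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"

# Property names requested by default (assumed to be valid PUG REST names).
DEFAULT_PROPERTIES = ("MolecularFormula", "MolecularWeight", "InChIKey")


def property_url(name, properties=DEFAULT_PROPERTIES):
    """Build a PUG REST URL that looks up properties by compound name."""
    return (
        f"{PUG_REST}/compound/name/{quote(name, safe='')}"
        f"/property/{','.join(properties)}/JSON"
    )


def fetch_properties(name, properties=DEFAULT_PROPERTIES):
    """Fetch the first property record for a compound (needs network access)."""
    with urlopen(property_url(name, properties), timeout=30) as response:
        payload = json.load(response)
    # Assumed response layout: {"PropertyTable": {"Properties": [ {...}, ... ]}}
    return payload["PropertyTable"]["Properties"][0]


# Building the URL needs no network access.
print(property_url("ethanol"))
```

Calling fetch_properties("ethanol") on a machine with internet access should return a dictionary of the requested properties. For larger jobs, mind the service's usage limits and cache results locally.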