Supplier of Chemicals Worldwide
The distributed Ambinter compound collection contains about 3 million molecules.
In house, we hold a collection of 6 million compounds in total (the 3 million
distributed freely plus 3 million extra).
These collections are presently available electronically as 2D SDF files,
but we would like to offer other formats that could be valuable for some
research projects.
The main pre-processing steps that we aim to provide are:

a) The collection (3 million compounds) in 3D (single conformers)
b) A filtered collection, derived from the initial one, from which potentially
undesirable compounds have been removed via several ADME/Tox filtering steps
c) A diversity set
d) A random collection
a) Collection in 3D
It will be possible to download the 3-million-compound collection in 3D (single
conformation), in Mol2 format, free of charge at: http://bioserv.rpbs.jussieu.fr/
For several projects involving virtual screening or docking, it is beneficial
to have the molecules in 3D. To this end, we will use our optimized version of
Frog (manuscript in preparation, and Leite et al., NAR 2007, 35:W568-72), a
program that takes a 2D SDF file as input and generates the 3D structure of
each compound. Counter ions and salts are removed and a standard protonation
state is assigned. Gasteiger partial charges are added in the final 3D Mol2
file.
b) ADME/Tox filtering

In several situations, including drug design projects, it is important to filter
out compounds that do not possess a “drug-like” profile.
Druglikeness is a qualitative concept used in drug design to describe how
“drug-like” a substance is. With in silico methods, it is estimated from the
molecular structure in many different ways.
A druglike molecule has properties such as: (a) optimal solubility in both water
and fat, because an orally administered drug has to pass through the intestinal
mucosa… One model compound for the cellular membrane is octanol, so the
logarithm of the octanol/water partition coefficient can be used to estimate
solubility. Solubility in water can be estimated from the number of hydrogen
bond donors vs alkyl side chains in the molecule… Too many hydrogen bond donors,
on the other hand, lead to low fat solubility…
Other criteria include molecular weight (about 80% of traded drugs have a MW
under 450 Da) and a search for some reactive substructures…
Several well-known filtering rules have been reported in the past. One of the
traditional rules of thumb for estimating bioavailability is Lipinski’s Rule of
Five. Yet other important criteria can be considered, such as removing the
so-called frequent hitters, reactive groups, toxic groups, etc.
We will use an updated version of FAF-Drugs (Miteva et al., NAR 2006,
34:W738-44) to perform this step.
In this new version, about 200 rules have been implemented, including molecular
weight, topological polar surface area (TPSA) and logP, absolute and relative
content of heteroatoms, as well as limits on the number of a very wide variety
of functional groups, the number and size of ring systems, and the flexibility
of the molecule.
We will provide the ADME/Tox filtered collection in 2D SDF and in 3D Mol2
format. For each molecule,
information about the filtering will be stored in a tabulated file. In addition,
some of these data
will be attached as fields in the SDF format for convenience.
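
As an illustration only, the kind of filtering and reporting described above can
be sketched in a few lines of Python. This is not FAF-Drugs and does not
reproduce its rules: it applies only the four classic Lipinski thresholds
(MW ≤ 500 Da, logP ≤ 5, at most 5 hydrogen bond donors, at most 10 hydrogen
bond acceptors) to descriptors that are assumed to have been computed already.
The record layout, the function names, the field name in the SDF example, the
choice of a tab-separated report and the "at most one violation" acceptance
threshold are all assumptions made for this sketch.

```python
import csv
import io
from dataclasses import dataclass


@dataclass
class Descriptors:
    """Precomputed properties for one compound (hypothetical record layout)."""
    compound_id: str
    mol_weight: float   # molecular weight in Da
    logp: float         # log of the octanol/water partition coefficient
    h_donors: int       # number of hydrogen bond donors
    h_acceptors: int    # number of hydrogen bond acceptors


def rule_of_five_violations(d):
    """Return labels for the Lipinski criteria that the compound violates."""
    violations = []
    if d.mol_weight > 500:
        violations.append("MW>500")
    if d.logp > 5:
        violations.append("logP>5")
    if d.h_donors > 5:
        violations.append("HBD>5")
    if d.h_acceptors > 10:
        violations.append("HBA>10")
    return violations


def filter_report(compounds, max_violations=1):
    """Build a tab-separated report with one row per compound.

    max_violations=1 is an assumed acceptance threshold, not a value
    taken from the text above.
    """
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(["id", "n_violations", "violations", "status"])
    for d in compounds:
        v = rule_of_five_violations(d)
        status = "accepted" if len(v) <= max_violations else "rejected"
        writer.writerow([d.compound_id, len(v), ",".join(v) or "-", status])
    return out.getvalue()


def sdf_data_field(name, value):
    """Format one SDF data item: a '> <NAME>' header, the value, a blank line."""
    return f"> <{name}>\n{value}\n\n"
```

A report for a list of Descriptors records is obtained with
filter_report(records); the text returned by sdf_data_field can be inserted
into an SDF record just before its closing "$$$$" line.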
c) Diversity set

For some projects, it is beneficial to have a diversity subset representing the
entire collection.
We will first compute fingerprints (over 400) for each compound; then, to
estimate the diversity, a clustering approach will be applied. A dissimilarity
selection method will then be used to extract a diversity set. Depending on the
cutoff selected to define similarity, the size of the extracted diversity subset
will vary. We will thus provide several diverse collections of 50,000, 100,000
and 1,000,000 compounds, in 2D SDF and in 3D Mol2.
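
To make the idea of dissimilarity selection concrete, here is a minimal sketch
of one common approach, greedy MaxMin picking with Tanimoto distance. It is
shown as an example of the general technique and is not necessarily the method
that will be used to build the sets above. Fingerprints are assumed to be
stored as Python sets of "on" bit indices; the function names and the choice of
the first compound as the seed are assumptions of this sketch. A pure-Python
loop like this would be far too slow for millions of compounds; it only shows
the logic.

```python
def tanimoto(a, b):
    """Tanimoto similarity between two fingerprints given as sets of on bits."""
    if not a and not b:
        return 1.0
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)


def maxmin_pick(fingerprints, n_pick, seed_index=0):
    """Greedy MaxMin selection.

    Starting from a seed compound, repeatedly add the compound whose
    closest already-picked neighbour is the most dissimilar. Returns the
    indices of the picked compounds in picking order.
    """
    n = len(fingerprints)
    if n == 0 or n_pick <= 0:
        return []
    picked = [seed_index]
    # min_dist[i]: distance from compound i to its nearest picked compound
    min_dist = [
        1.0 - tanimoto(fingerprints[i], fingerprints[seed_index])
        for i in range(n)
    ]
    min_dist[seed_index] = -1.0  # negative value marks "already picked"
    while len(picked) < min(n_pick, n):
        best = max(range(n), key=lambda i: min_dist[i])
        picked.append(best)
        for i in range(n):
            if min_dist[i] >= 0.0:
                d = 1.0 - tanimoto(fingerprints[i], fingerprints[best])
                if d < min_dist[i]:
                    min_dist[i] = d
        min_dist[best] = -1.0
    return picked
```

A cutoff-based variant would stop as soon as the largest remaining distance
falls below a chosen dissimilarity threshold, which is why the size of the
resulting subset depends on the cutoff.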
d) Random collections
It can be useful to have random collections, for example to generate statistics
about a collection, for some molecular modeling projects, or to compare
databases. We will provide several random collections of different sizes, in 2D
and in 3D.
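
For readers who want to draw their own random subset from a large SDF file, one
possible approach is reservoir sampling, which selects a fixed number of records
uniformly at random in a single pass without loading the whole file into memory.
The sketch below is an illustration under stated assumptions, not a description
of how the collections above are produced: the function names are hypothetical,
and records are assumed to end with the standard "$$$$" delimiter line.

```python
import random


def iter_sdf_records(lines):
    """Yield each SDF record as one string; a record ends with a '$$$$' line."""
    record = []
    for line in lines:
        record.append(line)
        if line.strip() == "$$$$":
            yield "".join(record)
            record = []


def reservoir_sample(items, k, seed=None):
    """Pick k items uniformly at random from an iterable in a single pass.

    If the iterable holds fewer than k items, all of them are returned.
    """
    rng = random.Random(seed)
    reservoir = []
    for index, item in enumerate(items):
        if index < k:
            reservoir.append(item)
        else:
            j = rng.randint(0, index)  # both bounds are inclusive
            if j < k:
                reservoir[j] = item
    return reservoir
```

With an open file handle f, reservoir_sample(iter_sdf_records(f), 50000) would
return 50,000 records; passing a fixed seed makes the selection reproducible.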
Alternative processing is possible upon request. For instance, each compound can
be generated in 3D multiconformer states, the ADME/Tox filtering can be very
soft or very stringent, etc.