Diversity and similarity analysis
- Diversity sorting - sorts a data set by diversity with a very fast algorithm (full sorting of 200,000 compounds in 12 hours)
- Compound selection - selects compounds from an external database so as to maximize diversity
- Similarity calculations - for each compound in an initial data set, calculates its similarities with a selected data set
Similarity
Similarity is a quantitative measure of the resemblance between two molecules. To
calculate similarity, ChemDBsoft divides a molecule into structural fragments. The
more structural fragments two molecules have in common, the higher their
similarity. A spherical environment is calculated for each atom in a molecule;
ChemDBsoft calculates spheres up to a depth of two bonds. For example, for the
molecule CH3-CH2-CH(OH)-CH3 the following screens are generated:
CH3-; -CH2-; -CH-; -OH; CH3CH2-; CH3CH2CH-; -CH2CH(OH)CH3; CH3CH-; CH(OH)
Bond types and ring sizes are taken into account when screens are calculated:
a CH3-CH2 fragment is not the same as CH2=CH, and CH2 groups in 5-membered and
6-membered rings are different.
Screens are stored in an internal database. If a new screen occurs during a
calculation, it is added to the database automatically. An unlimited number of
screens can be used for similarity calculations.
Let the internal database contain N screens, which completely describe the data
set of interest. Each compound in the data set can then be represented as a
vector of dimension N whose elements are 1 if the screen is present in the
compound and 0 if it is absent. The similarity R(K,M) of two compounds K and M
in the data set is calculated as the cosine of the angle between their vectors:
$$R(K,M) = \frac{\sum_i V_i(K)\,V_i(M)}{\sqrt{\sum_i V_i(K) \cdot \sum_i V_i(M)}}$$
The similarity between compounds K and M equals zero if the two structures have
no screens in common, and one if all their screens are common.
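The cosine formula above can be sketched in a few lines. This is only an illustration, not ChemDBsoft code: each compound is represented as a Python set of screen labels (equivalent to the 0/1 vector), the labels themselves are hypothetical, and returning 0 for a compound with no screens is an assumption.

```python
import math

def cosine_similarity(screens_k, screens_m):
    """Cosine similarity of two compounds given as sets of screen labels."""
    k, m = set(screens_k), set(screens_m)
    if not k or not m:
        return 0.0  # assumption: a compound with no screens matches nothing
    # For 0/1 vectors the dot product is the number of common screens,
    # and the squared length of each vector is its number of screens.
    return len(k & m) / math.sqrt(len(k) * len(m))

# Hypothetical screen sets, for illustration only.
compound_k = {"CH3-", "-CH2-", "-CH-", "-OH"}
compound_m = {"CH3-", "-CH-", "-OH"}
print(cosine_similarity(compound_k, compound_m))  # 3 / sqrt(4 * 3)
```

Working with sets rather than explicit N-dimensional vectors avoids having to know N in advance, which fits a screen database that grows during calculation.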
The Tanimoto similarity coefficient:
$$R(K,M) = \frac{\sum_i V_i(K)\,V_i(M)}{\sum_i V_i(K) + \sum_i V_i(M) - \sum_i V_i(K)\,V_i(M)}$$
is frequently used instead of the cosine measure. It has been shown (J. D. Holliday,
S. S. Ranade, P. Willett; Quant. Struct.-Act. Relat. 14, 501-506 (1995)) that the
cosine and Tanimoto measures give very highly correlated similarities.
ChemDBsoft uses the cosine measure throughout, because very fast algorithms
exist for similarity and diversity calculations based on it.
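To compare the two measures on the same pair of compounds, the following sketch computes both from sets of screen labels. The function names and the example labels are hypothetical and do not come from ChemDBsoft. For 0/1 vectors the Tanimoto value never exceeds the cosine value, and the two coincide at 0 and 1.

```python
import math

def cosine(k, m):
    """Cosine measure: common screens / sqrt(|K| * |M|)."""
    if not k or not m:
        return 0.0
    return len(k & m) / math.sqrt(len(k) * len(m))

def tanimoto(k, m):
    """Tanimoto coefficient: common screens / screens present in either compound."""
    union = len(k | m)
    if union == 0:
        return 0.0  # assumption: two empty screen sets are treated as dissimilar
    return len(k & m) / union

# Hypothetical screen sets, for illustration only.
k = {"s1", "s2", "s3", "s4"}
m = {"s1", "s3", "s4"}
print(tanimoto(k, m), cosine(k, m))
```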
The similarity of a structure to a dataset is defined as:
$$R = \frac{1}{K}\sum_{M=1}^{K} R(C,M)$$
i.e. the average similarity between the current structure C and the K
structures in the data set.
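The averaging can be written out directly. This is a minimal sketch under the same assumptions as the formulas above: compounds are sets of hypothetical screen labels, the pairwise measure is the cosine, and an empty dataset is assumed to give 0.

```python
import math

def cosine(k, m):
    """Cosine similarity of two compounds given as sets of screen labels."""
    if not k or not m:
        return 0.0
    return len(k & m) / math.sqrt(len(k) * len(m))

def similarity_to_dataset(compound, dataset):
    """Average similarity between one compound and every compound in a dataset."""
    if not dataset:
        return 0.0  # assumption: similarity to an empty dataset is 0
    return sum(cosine(compound, other) for other in dataset) / len(dataset)

# One identical and one unrelated compound give an average of 0.5.
print(similarity_to_dataset({"a", "b"}, [{"a", "b"}, {"x", "y"}]))
```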
Diversity
Diversity is a property of a dataset that characterizes the similarity (8.3.2)
(or dissimilarity) of the molecules it contains.
The dissimilarity of a pair of structures I and J (or the diversity of the two
compounds) is defined as:
$$D(I,J) = 1 - R(I,J)$$
where R is the similarity. A value of 1 for D(I,J) means totally different
compounds; a value of 0 means identical compounds.
A Diversity matrix can be created for a data set of N compounds:
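$$|D| = \begin{pmatrix}
1 & D_{21} & D_{31} & \cdots & D_{N1} \\
D_{12} & 1 & D_{32} & \cdots & D_{N2} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
D_{1N} & D_{2N} & D_{3N} & \cdots & 1
\end{pmatrix}$$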
The matrix is symmetric (Dij=Dji) and contains 1 on the diagonal.
The Diversity of the dataset (or simply Diversity) is defined as:
$$D = \frac{\sum_{i \neq j} D_{ij}}{N(N-1)}$$
i.e. as the sum of the off-diagonal elements divided by the number of such
elements. D ranges from zero (all compounds in the data set are identical) to
one (no pair of compounds in the data set has a screen in common).
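As an illustration of this definition, the sketch below computes the dataset diversity from pairwise cosine similarities. It is a naive calculation over all pairs, not the fast algorithm used by ChemDBsoft. Compounds are sets of hypothetical screen labels, and returning 0 for fewer than two compounds is an assumption.

```python
import math
from itertools import combinations

def cosine(k, m):
    """Cosine similarity of two compounds given as sets of screen labels."""
    if not k or not m:
        return 0.0
    return len(k & m) / math.sqrt(len(k) * len(m))

def dataset_diversity(compounds):
    """Mean off-diagonal dissimilarity D(I,J) = 1 - R(I,J) of a dataset."""
    n = len(compounds)
    if n < 2:
        return 0.0  # assumption: diversity is undefined below two compounds
    # The matrix is symmetric, so each unordered pair is visited once
    # and counted twice to cover both Dij and Dji.
    half_sum = sum(1.0 - cosine(a, b) for a, b in combinations(compounds, 2))
    return 2.0 * half_sum / (n * (n - 1))

print(dataset_diversity([{"a", "b"}, {"a", "b"}, {"a", "b"}]))  # identical compounds
print(dataset_diversity([{"a"}, {"b"}, {"c"}]))                # no common screens
```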
Diversity is calculated in ChemDBsoft with the Diversity... command in the Tools
item of the main menu.
Sorting according to similarity to structure
This command calculates the similarities (8.4.2) of the records in a dataset to a
given structure selected in another dataset or in the same one, and then sorts
the records.
It is executed when the icon is dragged from the source dataset window and
dropped onto the blinking region of the destination dataset (the same or
another one), and the radio button Sort structures in <dataset 1> according to
similarity of selected compound in <dataset 2> is selected in the Comparison
type selection dialog (fig. 8-1) that appears. The similarities of all
compounds in the target database to the current structure in the source
database are calculated. When the command finishes, the records in the
destination database are sorted in descending order of similarity.
The similarity values are shown in the second column (called Match) of the
table in the destination database window.
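The effect of the command can be illustrated with a short sketch. The function name `rank_by_similarity`, the screen labels, and the use of record indices are hypothetical. The returned value plays the role of the Match column.

```python
import math

def cosine(k, m):
    """Cosine similarity of two compounds given as sets of screen labels."""
    if not k or not m:
        return 0.0
    return len(k & m) / math.sqrt(len(k) * len(m))

def rank_by_similarity(query, dataset):
    """Return (record index, match) pairs in descending order of similarity."""
    scored = [(i, cosine(query, compound)) for i, compound in enumerate(dataset)]
    # sorted() is stable, so records with equal match keep their original order.
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

target = [{"x"}, {"a", "b"}, {"a", "c"}]
for index, match in rank_by_similarity({"a", "b"}, target):
    print(index, round(match, 3))
```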
Sorting according to similarity to dataset
This command is executed when the icon is dragged from the source database
window and dropped onto the blinking region of the destination database (the
same or another one), and the radio button Sort structures in <Dataset 1>
according to similarity of <dataset 2> is selected in the Comparison type
selection dialog (fig. 8-1) that appears. The similarities to dataset (8.4.2)
of all compounds in the target database to the source database are calculated.
When the command finishes, the records in the destination database are sorted
in descending order of similarity (fig. 8-9).
The procedure is useful for selecting compounds that might have a particular
activity. Suppose one has a database of compounds with the activity of
interest. Executing the command picks out from the target database the
compounds whose structures are most similar to the source database of active
compounds.
Diversity sorting
The goal of diversity calculation and sorting is to select a maximally
diverse subset of a given size from a large pool of candidate molecules.
The diverse data set can be used for screening, to reduce the cost of
compound testing or compound selection.
The diversity sorting algorithm is:
1. From the initial data set, select the compound that is most dissimilar
(8.4.2) to all other compounds.
2. Calculate the diversity (8.4.3) of the data set of remaining compounds.
3. From the remaining compounds, select the structure that is most dissimilar
to the selected data set (8.4.2).
4. Repeat steps 2-3 until all compounds are exhausted or the diversity
sorting is aborted by the user.
During this procedure, the diversity calculated from the diversity matrix
remains maximal for the compounds selected so far.
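The selection steps of diversity sorting can be sketched as a greedy loop. This is a simplified illustration, not the fast algorithm that ChemDBsoft implements. It assumes compounds are sets of hypothetical screen labels, that "most dissimilar" means lowest summed cosine similarity, and that ties go to the lowest record index. The `limit` parameter is a hypothetical stand-in for aborting the job. The diversity calculation used for the plot is omitted.

```python
import math

def cosine(k, m):
    """Cosine similarity of two compounds given as sets of screen labels."""
    if not k or not m:
        return 0.0
    return len(k & m) / math.sqrt(len(k) * len(m))

def diversity_sort(compounds, limit=None):
    """Return record indices in greedy most-dissimilar-first order."""
    n = len(compounds)
    limit = n if limit is None else min(limit, n)
    if limit <= 0:
        return []
    # Start with the compound least similar to all the others.
    total = [sum(cosine(compounds[i], compounds[j]) for j in range(n) if j != i)
             for i in range(n)]
    first = min(range(n), key=lambda i: total[i])
    order = [first]
    remaining = [i for i in range(n) if i != first]
    # Running sum of similarities to the already selected compounds.
    sim_to_selected = {i: cosine(compounds[i], compounds[first]) for i in remaining}
    while remaining and len(order) < limit:
        # Pick the remaining compound most dissimilar to the selected set.
        chosen = min(remaining, key=lambda i: sim_to_selected[i])
        order.append(chosen)
        remaining.remove(chosen)
        for i in remaining:
            sim_to_selected[i] += cosine(compounds[i], compounds[chosen])
    return order

pool = [{"a", "b"}, {"a", "b", "c"}, {"x", "y"}, {"a", "c"}]
print(diversity_sort(pool))
```

Keeping a running sum per remaining compound means each selection step costs one pass over the pool, so the whole sort grows with the square of the pool size.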
Note that if a data set contains duplicate structures whose screens occur
rarely, the duplicates can be selected at the beginning of diversity sorting as
"good" compounds. To avoid this problem, remove duplicates from the data set
before diversity sorting.
The command is executed by dragging the icon from the database window and
dropping it onto the blinking area of the SAME database. Then select the
command Diversity sorting of... in the Comparison type selection dialog
(fig. 8-1) that appears.
Diversity sorting is a time-consuming procedure. The time required for the
calculation grows as the square of the size of the data set being sorted.
The calculation time for 100,000 compounds is approximately 4 hours (Pentium II,
500 MHz). The size of the subset is determined by pressing the Cancel button in
the Job control window while the calculation is running. The subset of most
diverse compounds is then created if the Yes button is pressed in the Cancel
confirmation dialog. If it is not aborted, the calculation runs over the full
dataset, which is then sorted by diversity.

The result of the calculation is the database sorted by diversity value and its
graphical representation, the Diversity Plot (fig. 8-12).
It shows the diversity profile versus the number of selected compounds.
The usual XY-diagram operations are available: Zoom, Scroll, the Cursor
position box, and the right-button popup menu. The profile can, for example, be
copied to the clipboard and inserted into a Microsoft Word document.
Double-clicking the plot sets the database pointer to the record
corresponding to the X value of the mouse cursor.
The number of screens displayed is the total number of structural screens
(8.4.2) in the data set from which compounds are selected. It can be used as a
diversity measure of the dataset.
Selection of most diverse compounds
Suppose you want to expand a large stock database with the most dissimilar
compounds from a tested dataset. The problem to be solved is to calculate the
diversity of the stock dataset as new compounds are added one by one, in order
of dissimilarity to the stock. The result of the procedure is the sorted test
data set, whose first records are the most dissimilar to the stock database.
The compound selection algorithm is:
1. Calculate the diversity of the stock data set with each compound added.
2. From the tested compounds, select the structure that is most dissimilar to
the stock data set.
3. Repeat steps 1-2 until all compounds of the test dataset are exhausted or
the user cancels the diversity sorting.
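The selection against a stock can be sketched in the same greedy style. This is an illustration only. It assumes that each selected compound is treated as part of the stock for later steps, that "most dissimilar" means lowest summed cosine similarity, and that ties go to the lowest record index. The function name and screen labels are hypothetical.

```python
import math

def cosine(k, m):
    """Cosine similarity of two compounds given as sets of screen labels."""
    if not k or not m:
        return 0.0
    return len(k & m) / math.sqrt(len(k) * len(m))

def select_for_stock(stock, candidates):
    """Order candidate indices by dissimilarity to the stock, most dissimilar first."""
    # Summed similarity of every candidate to the current stock.
    sim = [sum(cosine(c, s) for s in stock) for c in candidates]
    remaining = list(range(len(candidates)))
    order = []
    while remaining:
        chosen = min(remaining, key=lambda i: sim[i])
        order.append(chosen)
        remaining.remove(chosen)
        # Assumption: the chosen compound now counts as part of the stock.
        for i in remaining:
            sim[i] += cosine(candidates[i], candidates[chosen])
    return order

stock = [{"a", "b"}, {"a", "c"}]
proposed = [{"a", "b"}, {"x", "y"}, {"x", "z"}]
print(select_for_stock(stock, proposed))
```

In this example the candidate that duplicates a stock compound comes last, and the second novel candidate is ranked below the first because it overlaps with a compound that was already added.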
When the calculation and sorting are finished, the Diversity plot is shown. It
presents the change in the stock's diversity as a function of the number of
compounds added from the tested dataset. There are three kinds of diversity
curves:
Type 1: the new data set has low diversity relative to the stock and is
recommended to be rejected.
Type 2: two cases are possible:
a) The new data set is highly diverse: the diversities are large both between
the compounds within the new data set and with each compound in the existing
stock. This case is very rare; the whole new data set may be used.
b) The initial stock is not diverse. For example, it contains 15,000 compounds,
10,000 of which are benzhydrazone derivatives. Adding any compound that is not
a benzhydrazone derivative will increase diversity. In this case the full stock
cannot be used for selecting new compounds. One needs to create a
representative selection of the stock by diversity sorting and take the initial
compounds with diversity 0.8-0.85.
Type 3: the usual curve for a diverse stock. Depending on the task, one can
select only those compounds that give the maximum of the diversity curve, or as
many compounds as keep the diversity unchanged. For large stocks, only a very
small fraction (on the order of 1%) of the proposed new compounds will increase
diversity.
Note that if the proposed data set contains structures that duplicate those in
the stock, and the screens of these structures occur rarely, the duplicates
can be selected at the beginning of diversity sorting as "good" compounds.
To avoid this problem, remove duplicates from the proposed data set before
compound selection.
To select the compounds most diverse relative to the stock database:
- Find and remove duplicates in the stock database.
- Make a representative dataset for the stock by diversity sorting.
- Create a new database and import the proposed SDF file.
- Find and remove duplicates in the tested database.
- Drag the icon from the tested database and drop it onto the blinking area of
the stock.
- Select the radio button Diversity sorting of <proposed database>, the
dataset in <stock> being used as already selected one in the
Comparison type selection dialog (fig. 8-1) that appears.
- The calculation can be aborted by pressing the Cancel button in the Job
control window. In this case a subset of the most diverse compounds is created
if the Yes button is pressed in the Cancel confirmation dialog (fig. 8-11).