Vizualization of Almost-Delaunay Tetrahedra in 2ACY from ε = 0.0 (red) to ε = 0.35 (yellow)

Motivation

Delaunay tessellations and Voronoi diagrams capture proximity relationships among sets of points. When applied to points representing protein atoms or residue positions, they are used to compute molecular surfaces and protein volumes, to define cavities and pockets, to analyze and score packing interactions, and to find structural motifs. Since atom and residue coordinates are known imprecisely, we explore the effect of coordinate perturbation on Delaunay-based scoring and motif finding. We have developed two concepts to help us do this: almost-Delaunay tetrahedra and Delaunay probability.

Definitions

We define 4 points in 3D to be an almost-Delaunay tetrahedron with threshold ε, denoted AD(ε), if they can be made to satisfy the Delaunay empty sphere property when all points in the set are perturbed by at most ε. Almost-Delaunay simplices are defined in exactly the same way using n-tuples of points in arbitrary dimensional space. This is illustrated in the diagrams below, for almost-Delaunay triangles in 2D.

(a) A non-Delaunay triangle (vertices blue) with points within its circumcircle
(b) Radial perturbation of triangle vertices and other points.
(c) The smallest perturbation can be found from a minimum-width annulus.
Delaunay Probability: An AD(ε) tetrahedron in some sense captures the results of a worst-case perturbation. The average case can be modeled by assuming that perturbations come from a radial probability density function (pdf) such as a Gaussian, and performing Monte Carlo integration over a large number of instances, of the probability of a perturbed tetrahedron occurring times the probability that its circumsphere is empty. This computation takes a long time for all O(n^4) possible tetrahedra, but we can restrict it to those with a low AD threshold, since AD threshold is inversely correlated to Delaunay probability as shown below.

Tetrahedron probability falling off with AD threshold for a protein with perturbations from a Gaussian with average radius a=0.1 and a=0.5 Angstrom.
Implementation
The 3D coordinates most often used to represent a protein are of the C-α atoms or the side-chain centroids. Our implementation in MATLAB computes the indices of Almost-Delaunay edges, triangles or tetrahedra along with their corresponding minimum thresholds. We set an edge-length prune to restrict the longest edge of an AD tetrahedron, typically to 10 Angstrom. Since we are only interested in specific ranges of perturbation, we keep only tetrahedra with small thresholds by setting an ε cutoff, typically 2.0 Angstrom.
The entire set of MATLAB source files is here, sample pdb and DSSP files used in the experiments are here and documentation on all the individual files can be found here.

Results
Robustness of the Delaunay Tessellation in proteins:  We show below how the number of AD tetrahedra seems to decrease as we go from random points to chains to decoys to proteins with similar sequence length and packing density. Also, plotting the number of tetrahedra added at low thresholds in (e) indicates that the DT of proteins changes less than that of random points for a given perturbation.
 


Comparing the AD tetrahedral threshold distributions, with error bars for standard deviation, of (a) totally random points (b) random lattice chains folded using MJ potential (c) decoys of 2cro (d) proteins represented by C-αs. (e) shows the stability of the DT in proteins. 



Stability of SNAPP scores: We developed variants of the SNAPP scores using almost-Delaunay tetrahedra with positive threshold (AD-SNAPP), Delaunay and AD tetrahedra (D+AD-SNAPP) and using Delaunay probabilities to weight each tetrahedron's score (Delaunay-probability SNAPP). The average scores turn out to be well correlated and equally good as averaged SNAPP scores at distinguishing decoys. The total SNAPP scores are better at making this distinction, and Delaunay-probability SNAPP scores are equally good as SNAPP scores, as seen below in the second figure.
 
Comparison of average SNAPP scores using Delaunay and almost-Delaunay tetrahedra (left, upper figure), and total SNAPP scores against Delaunay-probability SNAPP scores (left, lower), for the protein 2CRO (left, upper) and its decoys, all represented by sidechain-centroids.
   
Detection of structural motifs: We are able to use the extra information available from AD thresholds to improve the discrimination of secondary structural elements (α-helices and β-sheets) over previous methods based on the Delaunay tessellation.
 
The histogram distribution of AD thresholds in a synthetic α-helix is striking; apart from the Delaunay tetrahedra that follow the patterns i+(1,2,3,4) and i+(1,2,4,5) as known in the literature, there are three characteristic α-helical peaks at ε=0.3, 0.7 and 1.2. The table at left shows the patterns and their corresponding threshold values. These patterns and peaks are seen in the histograms of all α-helical proteins represented by C-αs, and can be used for quantitative estimation of α-helix content as well as to estimate packing (by counting non-pattern tetrahedra). Decoys of α-helical proteins show sharp peaks, but they can be distinguished by the relative lack of non-pattern tetrahedra. Proteins represented by sidechain centroids do not show peaks, since the centroids don't have the regular structure seen in backbone coordinates.
 
 
synthetic α-helix
α-helical protein (PXR)
β-sheet protein (chymotrypsin)
C-αs
sidechain centroids
decoy of 2cro
 
 
Patterns based on sequence interval (and in fact all the patterns that we have tried) do not show a clear signature of values of the AD threshold, in proteins with β-sheets. The pattern we currently use for detecting β-sheets is that almost-Delaunay tetrahedra that straddle two neighboring β-strands tend have similar values of the maximum gap in sequence (for parallel strands) or values distributed in an interval (for antiparallel strands). Thus we look for peaks and plateaus in the histogram of the maximum sequence gap, and among the tetrahedra that fall in these areas, we determine the neighbors in adjacent β-strands by a mutual maximum frequency of occurrence search, and cluster parallel and antiparallel sequences of residues to complete the β-sheet.
 
 
Above: Here we compare the results of α-helix and β-sheet determination using our methods with those from DSSP for a set of proteins from different structural families in CATH. The individual α-helices are accurate to +/- 2 residues at each end, and the total to within 10%. The β-sheet prediction is less accurate than α-helix, since our method currently does not use the value of the AD threshold as a part of the pattern, so the results are not always specific to the geometry of the β-sheet. We are working on methods for detecting β-turns and more complex motifs.

Project Members
Deepak
Bandyopadhyay

Graduate Student
UNC Chapel Hill
debug@cs.unc.edu

Alexander Tropsha
Professor and Director,
Laboratory of Molecular Modeling,
School of Pharmacy, UNC Chapel Hill
alex_tropsha@unc.edu
Jack Snoeyink
Professor of Computer Science
UNC Chapel Hill
snoeyink@cs.unc.edu