Principal component analysis (PCA)

Principal component analysis (PCA) is a statistical technique for finding patterns in high-dimensional data, and it has found application in fields such as face recognition, image compression and the detection of correlated motion in molecular dynamics data. Here our focus will be on its application in mining information from a molecular dynamics (MD) simulation trajectory, especially MD of proteins. However, some basic concepts of the PCA technique and its mathematical background, namely standard deviation, covariance, eigenvectors, eigenvalues and matrix algebra, are essential for understanding the application of PCA to an MD trajectory.

Need for PCA in MD

Protein dynamics is expressed in terms of changes in molecular structure, or conformation, as a function of time. PCA can be applied to MD simulation trajectories to detect the global, correlated motions of the system, which are otherwise difficult to detect because local and global motions are recorded together in a classical MD simulation.
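To make this concrete, below is a minimal NumPy sketch of PCA on a trajectory, assuming the frames have already been superposed (least-squares fitted) onto a reference structure to remove overall translation and rotation; the array name, its shape and the random placeholder coordinates are assumptions made purely for illustration.

import numpy as np

# Hypothetical input: an aligned trajectory as an array of shape
# (n_frames, 3 * n_atoms), one row of Cartesian coordinates per frame.
# Random placeholder data stand in for real coordinates here.
rng = np.random.default_rng(0)
trajectory = rng.normal(size=(1000, 3 * 50))

mean_structure = trajectory.mean(axis=0)         # average conformation
fluctuations = trajectory - mean_structure       # deviations from the mean
covariance = np.cov(fluctuations, rowvar=False)  # 3N x 3N covariance matrix

# Eigenvectors give the principal modes of collective motion;
# eigenvalues give the variance (amplitude) along each mode.
eigenvalues, eigenvectors = np.linalg.eigh(covariance)
order = np.argsort(eigenvalues)[::-1]            # largest variance first
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

# Project each frame onto the first two principal components.
projection = fluctuations @ eigenvectors[:, :2]

In such an analysis the first few eigenvectors typically describe the large-scale, collective motions, while the remaining ones describe small, local fluctuations.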

Standard Deviation (SD)

SD is a measure of how much the data points differ from the mean value of the group: a low standard deviation indicates that the data points tend to be close to the mean. Variance is another measure of the spread of data in a data set; in fact, it is the square of the SD. Variance and SD are purely one-dimensional measures, involving just one series of data points, such as the heights of all the learners in a classroom or the average monthly rainfall in a region for a given year. However, data sets often have more than one dimension, and the aim of the statistical analysis of these data sets is usually to see if there is any relationship between the dimensions. For example, we might have as our data set both the average monthly rainfall and the yield of paddy in a region. We then use statistical analysis to see if the amount of rainfall has any effect on the yield of the paddy crop. Covariance is such a measure, and it is always measured between two dimensions. If you calculate the covariance between one dimension and itself, you simply get the variance.
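To make these measures concrete, here is a short Python/NumPy sketch; the rainfall and yield figures are invented purely for illustration.

import numpy as np

# Invented data: average monthly rainfall (mm) and paddy yield (tonnes).
rainfall = np.array([60.0, 85.0, 110.0, 150.0, 210.0, 290.0])
paddy_yield = np.array([1.2, 1.5, 1.9, 2.4, 3.1, 3.9])

print(np.std(rainfall, ddof=1))   # sample standard deviation (SD)
print(np.var(rainfall, ddof=1))   # variance, the square of the SD

# Covariance between the two dimensions; a positive value suggests
# that yield tends to increase as rainfall increases.
print(np.cov(rainfall, paddy_yield)[0, 1])

# Covariance of a dimension with itself is simply its variance.
print(np.cov(rainfall, rainfall)[0, 1])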

Covariance matrix

Covariance is always measured between two dimensions of data. If we have a data set with more than two dimensions, there is more than one covariance value that can be calculated. For example, for a data set with three dimensions (dimensions x, y, z) one could calculate cov(x, y), cov(x, z) and cov(y, z). In fact, for an n-dimensional data set, you can calculate $\frac{n!}{(n-2)! \times 2} = \frac{n(n-1)}{2}$ different covariance values.
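As a sketch, with NumPy the full covariance matrix of a small three-dimensional data set (the numbers are invented for illustration) can be obtained in a single call:

import numpy as np

# Five observations of three dimensions x, y, z (invented values).
data = np.array([[2.5, 2.4, 0.9],
                 [0.5, 0.7, 1.2],
                 [2.2, 2.9, 0.4],
                 [1.9, 2.2, 1.1],
                 [3.1, 3.0, 0.3]])

# rowvar=False tells np.cov that each column is one dimension.
cov_matrix = np.cov(data, rowvar=False)  # 3 x 3 symmetric matrix
print(cov_matrix)
# cov_matrix[0, 1] is cov(x, y), cov_matrix[0, 2] is cov(x, z) and
# cov_matrix[1, 2] is cov(y, z); the diagonal holds the variances.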

Eigenvectors and eigenvalues

Consider the following matrix-vector multiplication:

$\begin{pmatrix} 2 & 3 \\ 2 & 1 \end{pmatrix} \begin{pmatrix} 3 \\ 2 \end{pmatrix} = \begin{pmatrix} 12 \\ 8 \end{pmatrix} = 4 \times \begin{pmatrix} 3 \\ 2 \end{pmatrix}$

In this multiplication, the resulting vector is an integer multiple of the original vector. Such a vector is called an eigenvector, i.e. $\begin{pmatrix} 3 \\ 2 \end{pmatrix}$ is an eigenvector of the matrix $\begin{pmatrix} 2 & 3 \\ 2 & 1 \end{pmatrix}$, and 4 is the eigenvalue associated with that eigenvector. Eigenvectors can only be found for square matrices, and not every square matrix has eigenvectors. For a symmetric n × n matrix, such as a covariance matrix, there are n of them, and they are all perpendicular (orthogonal), i.e. at right angles to each other, no matter how many dimensions the data have.
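The example above is easy to verify numerically; the following sketch checks it with NumPy (np.linalg.eig also reveals the second eigenvalue of this matrix, -1):

import numpy as np

A = np.array([[2.0, 3.0],
              [2.0, 1.0]])
v = np.array([3.0, 2.0])

print(A @ v)  # [12. 8.], which is 4 times v

# eig returns the eigenvalues and unit-length eigenvectors of A.
eigenvalues, eigenvectors = np.linalg.eig(A)
print(eigenvalues)  # contains 4 and -1 (order may vary)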

Steps in a PCA

Step 1: Calculate the mean of the given data and subtract it from each data point.

Step 2: Calculate the covariance matrix of the mean-centred data.

Step 3: Calculate the eigenvectors and eigenvalues of the covariance matrix.

Step 4: Find the principal component, i.e. the eigenvector with the highest eigenvalue. It captures the most significant relationship between the dimensions of the data.

Step 5: Derive the feature vector from the components (eigenvectors) you want to keep, i.e. those explaining the highest variance, and project the data onto it, as shown in the sketch after this list.
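Putting the five steps together, here is a minimal end-to-end sketch in Python with NumPy, run on a small invented two-dimensional data set:

import numpy as np

# Invented 2-D data set: one (x, y) observation per row.
data = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2],
                 [3.1, 3.0], [2.3, 2.7], [2.0, 1.6], [1.0, 1.1],
                 [1.5, 1.6], [1.1, 0.9]])

# Step 1: subtract the mean from each data point.
centred = data - data.mean(axis=0)

# Step 2: covariance matrix of the mean-centred data.
cov = np.cov(centred, rowvar=False)

# Step 3: eigenvectors and eigenvalues (eigh suits symmetric matrices).
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# Step 4: the principal component is the eigenvector with the highest
# eigenvalue, so sort the components by decreasing eigenvalue.
order = np.argsort(eigenvalues)[::-1]
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

# Step 5: keep the leading eigenvector(s) as the feature vector and
# project the centred data onto it for the reduced representation.
feature_vector = eigenvectors[:, :1]
projected = centred @ feature_vector
print(projected)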
