Analysis Report

CAGED Report (October 9, 2001)

Materials

The database under consideration comprises 13 expression measurements of 517 genes. The recorded expression values range between 0.1 and 86.92, with an average expression value of 1.526.

Methods

The analytical task is to partition the genes into clusters that behave similarly across the 13 conditions. The cluster analysis was performed using CAGED (1), which implements a Bayesian method called Bayesian Clustering by Dynamics (2).

A Bayesian approach assumes that the observed data were generated by a set of unknown processes. We do not know how many processes are responsible for the observed data, and the clustering procedure aims at identifying them. We define two gene profiles as similar if they were generated by the same process. In a Bayesian framework, we can express this notion of similarity as the posterior probability of a clustering model (i.e., a way of partitioning individuals into groups) given the observed data. The clustering method adopted here starts by assuming that all clustering models are equally probable (i.e., it assumes a uniform prior distribution over these models), computes the posterior probability of each model given the data, and selects the most probable one. This approach spares us the effort of defining an arbitrary similarity threshold to decide whether two individuals are similar, and has the further advantage of taking all the data into account simultaneously, rather than looking at single pairwise measures of similarity.
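As an illustration of the model selection rule described above, the posterior probability of a clustering model can be written with Bayes' theorem. The symbols M_c (a candidate clustering model) and y (the observed expression data) are notation introduced here for illustration and do not come from the CAGED documentation.

```latex
% Posterior probability of clustering model M_c given the data y
p(M_c \mid y) = \frac{p(y \mid M_c)\, p(M_c)}{\sum_{j} p(y \mid M_j)\, p(M_j)}

% Under a uniform prior, p(M_c) is the same constant for every model, so
p(M_c \mid y) \propto p(y \mid M_c)
```

Under the uniform prior, selecting the most probable model is therefore equivalent to selecting the model with the largest marginal likelihood p(y | M_c).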

We assume the observations to be generated by an autoregressive model of order 1, and set the prior precision and the gamma value to 1 and 0, respectively. Autoregressive models capture the dependency of the data over time. The order of such a model encodes the number of past time points relevant to the present: each time point is assumed to be conditionally independent of the past given the n immediately preceding points, where n is the autoregressive order. The prior precision is the size of the sample upon which the prior distribution is built, while the gamma value is the rate at which the prior precision goes to zero, with 0 representing the case of perfect ignorance. The method requires a similarity measure to guide the search procedure, and the Euclidean distance between gene profiles was adopted. Note that this measure is used only for search purposes and is not involved in the actual decision to merge a set of gene profiles. Goodness of fit of the resulting model is assessed by checking the normality of the standardized residuals of each cluster. Details about this clustering method can be found in (1). Data are transformed on a natural logarithmic scale.
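To make the autoregressive assumption concrete, the following minimal sketch fits an order-1 autoregressive model to a single log-transformed profile by ordinary least squares. It is an illustration only: CAGED uses a Bayesian estimate, and the function name fit_ar1, the intercept-plus-lag parameterization, and the sample values are assumptions made here.

```python
import math

def fit_ar1(series):
    """Least-squares fit of y[t] = intercept + slope * y[t-1] (illustrative, not CAGED's estimator)."""
    x = series[:-1]          # values at time t-1
    y = series[1:]           # values at time t
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Residual sum of squares of the one-step-ahead predictions
    rss = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))
    return intercept, slope, rss

# Hypothetical expression values, transformed on a natural logarithmic scale
profile = [math.log(v) for v in [1.0, 1.5, 2.0, 2.6, 3.1, 3.9]]
intercept, slope, rss = fit_ar1(profile)
```

With order 1, each value is predicted from the single preceding value, so a profile with 13 measurements contributes 12 prediction residuals.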

Results

The cluster analysis described in the Methods section yields 4 clusters. Figure 1 shows the expression profiles of the members of each cluster.

Figure 1. Plots of individual gene expression profiles partitioned by cluster membership. Data are transformed on a natural logarithmic scale.

Figure 2 shows the dendrogram of the hierarchical structure of the clustering model together with the gene expression profiles (3). The left-hand side of the picture shows 4 parallel hierarchies, one for each identified cluster. The order of the clusters follows, from top to bottom, the indexing reported in Table 1. Integers at the branching points of the trees report, on a logarithmic scale, the Bayes factor of each merging. The Bayes factor is the ratio between the posterior probabilities of two models and, in this case, indicates how many times more probable the clustering model in which the two branches are merged is than the clustering model in which they are kept separate. The central part of the drawing displays the expression values of each gene expression profile. Expression values higher or lower than 0 are reported in two different colors. The intensity of each color is proportional to the distance of each value from the cutpoint 0. The last row reports a brief description of each gene in the database.

Figure 2. Display of the hierarchical structure of the clustering model. The left-hand side of the picture shows the tree structure of each discovered cluster. Numbers at the branching points of the trees report, on a logarithmic scale, the Bayes factor of each merging. The Bayes factor is the ratio between the posterior probabilities of two models and, in this case, indicates how many times more probable the clustering model in which the two branches are merged is than the clustering model in which they are kept separate. The central part of the drawing displays the expression values of each gene expression profile. Expression values higher or lower than 0 are reported in two different colors. The intensity of each color is proportional to the distance of each value from the cutpoint. Intensities are transformed on a natural logarithmic scale. The last row reports a brief description of each gene in the database.
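The numbers at the branching points can be read as in the following minimal sketch. It assumes a natural logarithm (the report does not state the base) and uses made-up log marginal likelihoods; with a uniform prior over models, the ratio of posterior probabilities equals the ratio of marginal likelihoods, which is what the sketch relies on.

```python
import math

def log_bayes_factor(log_ml_merged, log_ml_separate):
    """Log Bayes factor in favor of merging two branches (natural log assumed)."""
    return log_ml_merged - log_ml_separate

# Hypothetical log marginal likelihoods of the two competing clustering models
log_bf = log_bayes_factor(-120.0, -123.0)

# Back-transform to see how many times more probable the merged model is
times_more_probable = math.exp(log_bf)
```

A positive value on the logarithmic scale favors the merged model, and a value of 0 means the two models are equally probable.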

Figure 3 shows the expression profiles of each of the 4 identified clusters. A cluster profile is the prototypical behavior of its members, computed as the pointwise average of the gene profiles comprised in the cluster.

Figure 3. Plots of the profiles of each discovered cluster. Each profile may be regarded as the prototypical behavior of its members across the different conditions and it is computed as pointwise average of its cluster members. Data are transformed on a natural logarithmic scale.
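The pointwise average behind a cluster profile can be sketched as follows. The function name cluster_profile and the three short member profiles are hypothetical, and averaging the already log-transformed values is an assumption about the order of operations.

```python
def cluster_profile(profiles):
    """Average the member profiles condition by condition."""
    n = len(profiles)
    # zip(*profiles) groups the values of all members at the same condition
    return [sum(values) / n for values in zip(*profiles)]

# Three hypothetical member profiles measured under three conditions
members = [
    [0.0, 1.0, 2.0],
    [2.0, 3.0, 4.0],
    [1.0, 2.0, 3.0],
]
profile = cluster_profile(members)
```

The resulting profile has one value per condition, so for the data in this report each cluster profile would have 13 points.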

The basic terms of the statistical model of each cluster are reported in the first part of Table 1. Each cluster is indexed by a number. The second column shows the number of individual gene expression profiles in each cluster. The third and fourth columns report the basic terms of the statistical model of each cluster: residual sum of squares and linear regression coefficients.

Index	Elements	RSS	Coefficients	Residual mean	Residual SD	Residual skewness	Residual kurtosis
1	216	430.491	0.136 0.776	-0.000	1.000	0.305	4.666
2	293	320.890	-0.132 0.722	0.000	1.000	-0.040	3.705
3	5	11.995	-0.661 0.328	0.000	0.991	-0.573	3.841
4	3	20.597	0.518 0.708	-0.000	0.986	0.635	3.045

Table 1. Summary and diagnostic statistics for each discovered cluster. For each cluster, indexed by the first column, column 2 reports the number of members, while columns 3 and 4 report the parameters of the statistical model: residual sum of squares and regression coefficients. The last four columns report the statistics of the standardized residuals for diagnostic purposes. For standardized residuals that approximately follow a normal distribution, the mean should be close to 0, the standard deviation close to 1, the skewness close to 0 and the kurtosis close to 3.

A critical aspect of statistical inference is to assess the goodness of fit of the learned model. In our case, diagnostic analysis of the statistical model is performed by checking the normality of the standardized residuals of each cluster. Cluster residuals are computed by rescaling the differences between the observed values and the values predicted by the model fitted for each cluster. The rescaling factor is the inter-cluster variability. A good fit produces residuals that approximately follow a normal distribution with mean 0 and standard deviation 1. The last four columns of Table 1 report the descriptive statistics of the residuals of each cluster. For a sample from a normal distribution, skewness, a measure of asymmetry, should be close to 0 and kurtosis close to 3. For small clusters, such as clusters 3 and 4, which contain fewer than 9 series, deviations from these values may be due to the limited number of available residuals rather than to a departure from normality. Figure 4 shows the distribution of the residuals for each cluster.

Figure 4. Standardized residuals for each discovered cluster, used for diagnostic purposes. Standardized residuals are computed by rescaling the differences between the observed values and the values predicted by the model fitted for each cluster. A good fit should yield normally distributed residuals.
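The four diagnostic statistics in Table 1 can be computed along the lines of the following minimal sketch. It uses population moments (dividing by n) and non-excess kurtosis, so a normal sample gives a value near 3; whether CAGED uses exactly these estimators is an assumption, and the function name and input values are hypothetical.

```python
import math

def residual_diagnostics(residuals):
    """Return mean, standard deviation, skewness and kurtosis of a list of residuals."""
    n = len(residuals)
    mean = sum(residuals) / n
    centered = [r - mean for r in residuals]
    m2 = sum(c ** 2 for c in centered) / n   # second central moment (variance)
    m3 = sum(c ** 3 for c in centered) / n   # third central moment
    m4 = sum(c ** 4 for c in centered) / n   # fourth central moment
    sd = math.sqrt(m2)
    skewness = m3 / m2 ** 1.5
    kurtosis = m4 / m2 ** 2                  # non-excess: about 3 for normal data
    return mean, sd, skewness, kurtosis

# Hypothetical standardized residuals of one small cluster
mean, sd, skewness, kurtosis = residual_diagnostics([-2.0, -1.0, 0.0, 1.0, 2.0])
```

With so few values the kurtosis is far from 3 even though the sample is perfectly symmetric, which illustrates why the statistics of very small clusters should be read with caution.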

References

1.	Ramoni, MF, Sebastiani, P and Kohane, IS (2002). Cluster Analysis of Gene Expression Dynamics. Under review.
2.	Sebastiani, P and Ramoni, MF (2002). Bayesian clustering of continuous time series. Under review.
3.	Eisen, M, Spellman, P, Brown, P and Botstein, D (1998). Cluster analysis and display of genome-wide expression patterns, Proc. Nat. Acad. Sci. USA, 95:14863-14868.

This report was generated by CAGED v1.0.