Clustering microarray data: an approach based on copula functions

Marta di Lascio

Free University of Bozen-Bolzano, Italy

In this talk we focus on clustering microarray data. A microarray data matrix contains the expression levels of genes measured under a set of experimental conditions or in a set of biological samples. In this context, cluster analysis can be applied to the genes, in order to formulate hypotheses about their possible co-regulation and functional relations, and to the tissues, in order to identify biologically and clinically relevant groups. The potential of clustering to reveal biologically meaningful patterns in microarray data was demonstrated in 1998. Since then, many clustering algorithms have been proposed and applied to microarray data, but they either ignore the dependence relationship between genes or are limited to bivariate or linear dependence.

We introduce CoClust, a copula-based algorithm that clusters genes or tissues while taking into account the complex multivariate dependence relationship among them. CoClust assumes that the clustering is generated by a multivariate probability model. In particular, each cluster is represented by a (marginal) univariate density function, while the whole clustering is modeled through a joint density function defined via a copula. The dimension of the model is therefore the number of clusters. As a consequence, the focus is on the inter-cluster rather than the intra-cluster dependence relationship: observations in different clusters are dependent, while observations in the same cluster are independent. We show the performance of our proposal on both simulated and real data.
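
As a rough illustration of the modeling idea described above, the following Python sketch treats two clusters as the two margins of a bivariate copula and compares candidate allocations by their copula log-likelihood. It is not the CoClust implementation: the Gaussian copula, the rank-based margins, the simulated expression profiles, and all function and variable names are assumptions made for this example only.

```python
import math
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal, used for the normal-score transform


def pseudo_observations(values):
    """Map a sample to rescaled ranks rank / (n + 1) in (0, 1). Assumes no ties."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    u = [0.0] * n
    for rank, i in enumerate(order, start=1):
        u[i] = rank / (n + 1)
    return u


def gaussian_copula_loglik(x, y):
    """Fit a bivariate Gaussian copula to paired samples x, y (assumed model).

    Margins are handled non-parametrically through ranks; rho is estimated
    as the correlation of the normal scores. Returns (rho, log-likelihood).
    """
    zx = [STD_NORMAL.inv_cdf(u) for u in pseudo_observations(x)]
    zy = [STD_NORMAL.inv_cdf(u) for u in pseudo_observations(y)]
    sxx = sum(a * a for a in zx)
    syy = sum(b * b for b in zy)
    sxy = sum(a * b for a, b in zip(zx, zy))
    rho = sxy / math.sqrt(sxx * syy)
    rho = max(-0.999, min(0.999, rho))  # keep the density finite
    one_minus = 1.0 - rho * rho
    loglik = 0.0
    for a, b in zip(zx, zy):
        # log of the bivariate Gaussian copula density at (a, b)
        loglik += -0.5 * math.log(one_minus) - (
            rho * rho * (a * a + b * b) - 2.0 * rho * a * b
        ) / (2.0 * one_minus)
    return rho, loglik


# Simulated (hypothetical) expression profiles over 200 samples.
rng = random.Random(0)
n_samples = 200
profile_a = [rng.gauss(0.0, 1.0) for _ in range(n_samples)]
profile_b = [a + 0.3 * rng.gauss(0.0, 1.0) for a in profile_a]  # depends on a
profile_c = [rng.gauss(0.0, 1.0) for _ in range(n_samples)]     # independent of a

# profile_a sits in cluster 1; which profile is the better fit for cluster 2?
rho_dep, ll_dependent = gaussian_copula_loglik(profile_a, profile_b)
rho_ind, ll_independent = gaussian_copula_loglik(profile_a, profile_c)

print("a with b: rho =", round(rho_dep, 3), "loglik =", round(ll_dependent, 2))
print("a with c: rho =", round(rho_ind, 3), "loglik =", round(ll_independent, 2))
```

Under these assumptions, the profile that depends on profile_a yields the larger copula log-likelihood, so it would be the more plausible candidate for a different cluster from profile_a. This reflects the principle stated above that observations in different clusters are dependent.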