Classification double

La Classification double ou « Biclustering » est une technique d'exploration de données non-supervisée permettant de segmenter simultanément les lignes et les colonnes d'une matrice. Plus formellement^[1], la définition de la classification double peut s'exprimer de la manière suivante (pour le type de classification par colonne) :

soit

\mathrm {E}

une matrice

\mathrm {M} \times \mathrm {N}

, soient

\mathrm {I} \subseteq \mathrm {M} {\text{ , }}J\subseteq \mathrm {N}

, alors

\mathrm {E} _{IJ}

est appelé « bicluster » de

\mathrm {E}

lorsque

\mathrm {E} _{i_{1},j}=\mathrm {E} _{i_{2},j}=..=\mathrm {E} _{i_{m},j}

pour tout

j\in J{\text{ et }}(i_{1},i_{2},...i_{m})\in \mathrm {M}

Application

Le « biclustering » a été utilisé massivement en biologie^[2] - par exemple dans l'analyse de l'expression génétique par Yizong Cheng et George M. Church^[3] ^, ^[4] -, mais aussi dans d'autres domaines tels que la compression d'image de synthèse^[5], l'analyse médicale - par exemple pour l'étude des traitements de l'épilepsie^[6] par stimulation vagale, la caractérisation d'émetteurs de pourriels (« spam »)^[7], l'analyse du mouvement^[8], l'analyse des termes publicitaires sur internet^[9], ...

Types

Dans les différents algorithmes qui utilisent la classification double, on trouve différents types de bicluster :

« Bi-cluster » à valeurs constantes (a),
« Bi-cluster » à valeurs constantes en lignes (b) ou en colonnes (c),
« Bi-cluster » à valeurs cohérentes (d, e).

a) « Bi-cluster » à valeurs constantes
7,6	7,6	7,6	7,6	7,6
7,6	7,6	7,6	7,6	7,6
7,6	7,6	7,6	7,6	7,6
7,6	7,6	7,6	7,6	7,6
7,6	7,6	7,6	7,6	7,6

b)« Bi-cluster » à valeurs constantes en lignes
1,2	1,2	1,2	1,2	1,2
2,1	2,1	2,1	2,1	2,1
3,2	3,2	3,2	3,2	3,2
4,1	4,1	4,1	4,1	4,1
4,2	4,2	4,2	4,2	4,2

c)« Bi-cluster » à valeurs constantes en colonnes
1,0	2,0	3,0	4,0	5,0
1,0	2,0	3,0	4,0	5,0
1,0	2,0	3,0	4,0	5,0
1,0	2,0	3,0	4,0	5,0
1,0	2,0	3,0	4,0	5,0

d) « Bi-cluster » à valeurs cohérentes (additives)
1.0	4.0	5.0	0.0	1.5
4.0	7.0	8.0	3.0	4.5
3.0	6.0	7.0	2.0	3.5
5.0	8.0	9.0	4.0	5.5
2.0	5.0	6.0	1.0	2.5

e)« Bi-cluster » à valeurs cohérentes (multiplicative)
1.0	0.5	2.0	0.2	0.8
2.0	1.0	4.0	0.4	1.6
3.0	1.5	6.0	0.6	2.4
4.0	2.0	8.0	0.8	3.2
5.0	2.5	10.0	1.0	4.0

En d) la notion d'additivité se comprend comme ceci : $+3,-1,+2,-3$ en colonnes, $+3,+1,-5,+1,5$ en lignes; en e) le motif est ${\frac {1}{2}},*4,{\frac {1}{10}},*4$ en colonnes et $*2,*1.5,{\frac {4}{3}},{\frac {5}{4}}$ .

Algorithmes

Le but des algorithmes de classification double est de trouver, s'il existe, le plus grand « bi-cluster » contenu dans une matrice, en maximisant une fonction objectif. On peut prendre comme fonction, avec les notations adoptées ci-dessus :

f_{1}=\left|\mathrm {I} \right|+\left|J\right|

ou

f_{2}=\left|\mathrm {I} \right|*\left|J\right|

^[10]

De nombreux algorithmes ont été développés notamment par la bio-informatique, dont : « Block clustering », CTWC (« Coupled Two-Way Clustering ») , ITWC (« Interrelated Two-Way Clustering »), δ-bicluster, δ-pCluster, δ-pattern, FLOC, OPC, « Plaid Model », OPSMs (« Order-preserving submatrixes »), Gibbs, SAMBA (« Statistical-Algorithmic Method for Bicluster Analysis »)^[11], RoBA (« Robust Biclustering Algorithm »), « Crossing Minimization »^[12] , cMonkey^[13], PRMs, DCC, LEB (« Localize and Extract Biclusters »), QUBIC (« QUalitative BIClustering »), BCCA (« Bi-Correlation Clustering Algorithm »), FABIA (« Factor Analysis for Bicluster Acquisition »)^[14]. Certains de ces algorithmes ont été comparés par Doruk Bozda, Ashwin S. Kumar et Umit V. Catalyurek^[15] en termes de type de motifs recherchés.
Le package « biclust »^[16] propose un ensemble d'outils pour la classification double dans le logiciel R.

Articles connexes

Notes et références

(en) Cet article est partiellement ou en totalité issu de l’article de Wikipédia en anglais intitulé « Biclustering » (voir la liste des auteurs).

↑ Tran Trang, Nguyen Cam Chi, Hoang Ngoc Minh,Bi-clustering des données de biopuces par les arbres pondérés de plus long préﬁxe - Chapitre 1 Introduction
↑ Sara C. Madeira, Arlindo L. Oliveira,Biclustering Biological Data Analysis
↑ (en) Cheng Y, Church GM, « Biclustering of expression data », Proceedings of the 8th International Conference on Intelligent Systems for Molecular Biology,‎ 2000, p. 93–103
↑ Yizong Cheng, George M. Church Biclustering of Expression Data
↑ Xin Sun, Qiming Hou,Zhong Ren, Kun Zhou, Baining Guo,Radiance Transfer Biclustering for Real-time All-frequency Bi-scale Rendering
↑ Stanislav Busygin,Nikita Boyko, Panos M. Pardalos,Michael Bewernitz, Georges Ghacibeh,Biclustering EEG data from epileptic patients treated with vagus nerve stimulation
↑ Kevin S. Xu, Mark Kliger, Alfred O. Hero III, Identifying Spammers by Their Resource Usage Patterns
↑ Keren Erez, Jacob Goldberger, Ronen Sosnik, Moshe Shemesh, Susan Rothstein,Moshe Abeles, Analyzing Movement Trajectories Using a Markov Bi-Clustering Method
↑ Dmitry I. Ignatov, Concept-based Biclustering for Internet Advertisement
↑ Stefano Lonardi, Qiaofeng Yang, Wojciech Szpankowski,Finding biclusters by random projections
↑ (en) Tanay A, Sharan R, Kupiec M and Sahmir R, « Revealing modularity and organization in the yeast molecular network by integrated analysis of highly heterogeneous genomewide data », Proc Natl Acad Sci USA, vol. 101, n^o 9,‎ 2004, p. 2981-2986 (PMID 16749936, PMCID 14973197, DOI 10.1073/pnas.0308661100)
↑ Ahsan Abdullah, Data Mining Using the Crossing Minimization Paradigm
↑ (en) Reiss DJ, Baliga NS, Bonneau R, « Integrated biclustering of heterogeneous genome-wide datasets for the inference of global regulatory networks », BMC Bioinformatics, vol. 2, n^o 7,‎ 2006, p. 280–302 (PMID 16749936, PMCID 1502140, DOI 10.1186/1471-2105-7-280)
↑ (en) Hochreiter S, Bodenhofer U, Heusel M, Mayr A, Mitterecker A, Kasim A, Khamiakova T, Van Sanden S, Lin D, Talloen W, Bijnens L, Gohlmann HWH, Shkedy Z, Clevert DA, « FABIA: factor analysis for bicluster acquisition », Bioinformatics, vol. 26, n^o 12,‎ 2010, p. 1520–1527 (PMID 20418340, PMCID 2881408, DOI 10.1093/bioinformatics/btq227)
↑ Doruk Bozda, Ashwin S. Kumar et Umit V. Catalyurek, Comparative Analysis of Biclustering Algorithms
↑ Sebastian Kaiser, Friedrich Leisch, A Toolbox for Bicluster Analysis in R

[1] Tran Trang, Nguyen Cam Chi, Hoang Ngoc Minh,Bi-clustering des données de biopuces par les arbres pondérés de plus long préﬁxe - Chapitre 1 Introduction

[2] Sara C. Madeira, Arlindo L. Oliveira,Biclustering Biological Data Analysis

[3] (en) Cheng Y, Church GM, « Biclustering of expression data », Proceedings of the 8th International Conference on Intelligent Systems for Molecular Biology,‎ 2000, p. 93–103

[4] Yizong Cheng, George M. Church Biclustering of Expression Data

[5] Xin Sun, Qiming Hou,Zhong Ren, Kun Zhou, Baining Guo,Radiance Transfer Biclustering for Real-time All-frequency Bi-scale Rendering

[6] Stanislav Busygin,Nikita Boyko, Panos M. Pardalos,Michael Bewernitz, Georges Ghacibeh,Biclustering EEG data from epileptic patients treated with vagus nerve stimulation

[7] Kevin S. Xu, Mark Kliger, Alfred O. Hero III, Identifying Spammers by Their Resource Usage Patterns

[8] Keren Erez, Jacob Goldberger, Ronen Sosnik, Moshe Shemesh, Susan Rothstein,Moshe Abeles, Analyzing Movement Trajectories Using a Markov Bi-Clustering Method

[9] Dmitry I. Ignatov, Concept-based Biclustering for Internet Advertisement

[10] Stefano Lonardi, Qiaofeng Yang, Wojciech Szpankowski,Finding biclusters by random projections

[11] (en) Tanay A, Sharan R, Kupiec M and Sahmir R, « Revealing modularity and organization in the yeast molecular network by integrated analysis of highly heterogeneous genomewide data », Proc Natl Acad Sci USA, vol. 101, n^o 9,‎ 2004, p. 2981-2986 (PMID 16749936, PMCID 14973197, DOI 10.1073/pnas.0308661100)

[ahsan-12] Ahsan Abdullah, Data Mining Using the Crossing Minimization Paradigm

[13] (en) Reiss DJ, Baliga NS, Bonneau R, « Integrated biclustering of heterogeneous genome-wide datasets for the inference of global regulatory networks », BMC Bioinformatics, vol. 2, n^o 7,‎ 2006, p. 280–302 (PMID 16749936, PMCID 1502140, DOI 10.1186/1471-2105-7-280)

[14] (en) Hochreiter S, Bodenhofer U, Heusel M, Mayr A, Mitterecker A, Kasim A, Khamiakova T, Van Sanden S, Lin D, Talloen W, Bijnens L, Gohlmann HWH, Shkedy Z, Clevert DA, « FABIA: factor analysis for bicluster acquisition », Bioinformatics, vol. 26, n^o 12,‎ 2010, p. 1520–1527 (PMID 20418340, PMCID 2881408, DOI 10.1093/bioinformatics/btq227)

[15] Doruk Bozda, Ashwin S. Kumar et Umit V. Catalyurek, Comparative Analysis of Biclustering Algorithms

[16] Sebastian Kaiser, Friedrich Leisch, A Toolbox for Bicluster Analysis in R

[1]

[2]

[3]

[4]

[5]

[6]

[7]

[8]

[9]

[10]

[11]

[12]

[13]

[14]

[15]

[16]