## SHOGoiN CELLBLAST

Guide to CELLBLAST

What is CELLBLAST?
CELLBLAST is a system for searching gene expression databases for cells whose expression profiles are similar to a query gene expression profile. The similarity of two profiles is computed by comparing the order of genes ranked by expression. Although this is a simple measure, we have observed that it is sufficient to characterize cell types across different next-generation sequencer platforms (see also Example Results).

What characterizes cells?
Expression value ranges differ between platforms, making direct comparison impossible. Given this situation, we use "gene expression ranks" as a way to compare expression data across platforms.
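As a minimal sketch of why ranks help here, the Python snippet below converts raw expression values into ranks and shows that two measurements on very different scales can yield identical rank lists. The function name `expression_ranks`, the tie handling (average ranks), and the numeric values are illustrative assumptions, not part of CELLBLAST itself.

```python
def expression_ranks(values):
    """Return 1-based ranks (1 = highest expression); tied values share the average rank.

    Hypothetical helper for illustration; CELLBLAST's own ranking rules may differ.
    """
    # Gene indices sorted from highest to lowest expression.
    order = sorted(range(len(values)), key=lambda i: -values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        # Extend the block [pos, end] over all genes tied with order[pos].
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        average_rank = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            ranks[order[k]] = average_rank
        pos = end + 1
    return ranks


# Made-up values for the same five genes measured on two platforms
# with different value ranges (e.g. raw counts vs. a log-like scale).
platform_a = [1200.0, 35.0, 410.0, 8.0, 95.0]
platform_b = [9.1, 4.0, 7.7, 2.2, 5.3]

ranks_a = expression_ranks(platform_a)
ranks_b = expression_ranks(platform_b)
```

Although the two value ranges are not directly comparable, `ranks_a` and `ranks_b` are the same list, so a rank-based similarity treats the two profiles as identical.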

Spearman's rank correlation coefficient
Spearman's method computes the correlation coefficient $$r$$ between two rank lists as $$r = 1 - \frac{6\sum_{i=1}^{n} D_i^2}{n(n^2-1)}$$, where $$D_i$$ is the difference between the two ranks of gene $$i$$ and $$n$$ is the number of genes used in the calculation.
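A short Python sketch of the rank-difference form of Spearman's coefficient, $$r = 1 - 6\sum_i D_i^2 / (n(n^2-1))$$, is given here for illustration. It assumes both inputs are already rank lists over the same genes and that there are no ties; the function name `spearman_r` and the example ranks are hypothetical.

```python
def spearman_r(ranks_x, ranks_y):
    """Spearman's rank correlation from two rank lists (assumes no tied ranks)."""
    n = len(ranks_x)
    if n != len(ranks_y) or n < 2:
        raise ValueError("rank lists must have the same length, at least 2")
    # D_i is the rank difference of gene i between the two profiles.
    sum_d_squared = sum((a - b) ** 2 for a, b in zip(ranks_x, ranks_y))
    return 1 - 6 * sum_d_squared / (n * (n * n - 1))


# Made-up ranks of five genes in a query and in two database profiles.
query_ranks = [1, 2, 3, 4, 5]
similar_ranks = [2, 1, 3, 5, 4]    # two neighbouring pairs swapped
reversed_ranks = [5, 4, 3, 2, 1]   # completely opposite ordering

r_similar = spearman_r(query_ranks, similar_ranks)
r_reversed = spearman_r(query_ranks, reversed_ranks)
```

Here `r_similar` is 0.8 and `r_reversed` is -1.0, matching the intuition that small rank swaps keep the correlation high while a fully reversed ordering gives the minimum value.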

Fisher's Z-transformation
Database profiles with the highest similarity to the query expression profile are ranked by their statistical significance, which is based on a Z-test via Fisher's Z-transformation of the rank correlation coefficient, $$z_r = \frac{1}{2}\ln\frac{1+r}{1-r}$$. The distribution of the Fisher's Z-transformed sample correlation coefficient $$z_r$$ approximately follows a normal distribution with mean $$z_\rho$$ and standard deviation $$1/\sqrt{n-3}$$ regardless of the size of $$n$$. In CELLBLAST, the $$z_\rho$$ that appears in the standardization is approximated by the average of the Z-transformed sample correlation coefficients $$z_r$$. Thus, the CELLBLAST profile search is statistically robust even when the population correlation coefficient between query and database profiles is non-zero.
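The procedure described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the standardized score is taken to be $$z = (z_r - z_\rho)\sqrt{n-3}$$, with $$z_\rho$$ estimated as the mean of all $$z_r$$ values, and significance is taken as a one-sided upper-tail probability of the standard normal distribution. The function names, the hit names, and the correlation values are made up.

```python
import math


def fisher_z(r):
    """Fisher's Z-transformation; r must lie strictly between -1 and 1."""
    return math.atanh(r)  # equals 0.5 * ln((1 + r) / (1 - r))


def standardized_z(correlations, n):
    """Standardize Z-transformed correlations, estimating z_rho by their mean."""
    z_r = [fisher_z(r) for r in correlations]
    z_rho = sum(z_r) / len(z_r)   # assumed estimate of the population value
    scale = math.sqrt(n - 3)      # reciprocal of the standard deviation
    return [(z - z_rho) * scale for z in z_r]


def upper_tail_p(z):
    """One-sided p-value under the standard normal distribution (an assumption)."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))


# Made-up rank correlations between one query and four database profiles.
hits = {"cell_A": 0.62, "cell_B": 0.41, "cell_C": 0.38, "cell_D": 0.35}
n_genes = 500  # hypothetical number of genes used in the comparison

scores = dict(zip(hits, standardized_z(list(hits.values()), n_genes)))
ranked = sorted(scores, key=scores.get, reverse=True)
p_values = {name: upper_tail_p(scores[name]) for name in ranked}
```

Because the scores are centered on the estimated $$z_\rho$$ rather than on zero, a profile is called significant only when it is more similar to the query than a typical database profile, not merely when its correlation is non-zero.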

Distributions of $$r$$, $$z_r$$, and $$z$$
The following figure shows a graphical abstract of the statistical evaluation in CELLBLAST and CellMontage (the previous version of CELLBLAST). The histograms show the distributions of the sample correlation coefficient $$r$$, the t-statistic $$t_r = r\sqrt{(n-2)/(1-r^2)}$$, the Z-transformed sample correlation coefficient $$z_r$$, and the standardized Z-transformed sample correlation coefficient $$z$$, respectively, for the case in which the query is a mouse lung cell sample (GSM1271921) and "SINGLECELL: all" and "MF:transcription factor activity, protein binding" are selected as the database settings. In CellMontage, the distribution of $$t_r$$ does not follow the t-distribution when the population correlation coefficient between query and database profiles is non-zero. In CELLBLAST, however, the distribution of $$z_r$$ approximately follows a normal distribution whose mean is the estimated Z-transformed population correlation coefficient, $$z_\rho=0.42$$. The standardized Z-transformed sample correlation coefficient $$z$$ approximately follows the standard normal distribution.
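To make the contrast between the two statistics concrete, the Python sketch below computes $$t_r$$ and the standardized $$z$$ for a handful of made-up correlations scattered around a non-zero value. The numbers, the gene count, and the function names are illustrative assumptions and are not taken from the figure.

```python
import math


def t_statistic(r, n):
    """t-statistic t_r = r * sqrt((n - 2) / (1 - r^2)) for a correlation r."""
    return r * math.sqrt((n - 2) / (1.0 - r * r))


def standardized_z(correlations, n):
    """z = (z_r - mean(z_r)) * sqrt(n - 3), with z_r = atanh(r)."""
    z_r = [math.atanh(r) for r in correlations]
    z_rho = sum(z_r) / len(z_r)
    return [(z - z_rho) * math.sqrt(n - 3) for z in z_r]


n_genes = 100                               # hypothetical gene count
sample_r = [0.30, 0.35, 0.40, 0.45, 0.50]   # made-up correlations around 0.4

t_values = [t_statistic(r, n_genes) for r in sample_r]
z_values = standardized_z(sample_r, n_genes)

mean_t = sum(t_values) / len(t_values)
mean_z = sum(z_values) / len(z_values)
```

In this toy example every $$t_r$$ is well above zero, so testing against a zero-centered t-distribution would flag all profiles as significant. The standardized $$z$$ values are centered at zero by construction, so only profiles above the typical correlation stand out.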

References
Fujibuchi W, Kiseleva L, Taniguchi T, Harada H, Horton P. "CellMontage: similar expression profile search server." Bioinformatics. 2007 Nov 15;23(22):3103-4.
Natalia Polouliakh, Tohru Natsume, Hajime Harada, Wataru Fujibuchi, & Paul Horton, "Comparative Genomic Analysis of Transcription Regulation Elements Involved In Human Map Kinase G-Protein Coupling Pathway", Journal of Bioinformatics and Computational Biology, 2006 Apr;4(2):469-82.
Wataru Fujibuchi, Larisa Kiseleva, Takeaki Taniguchi & Paul Horton, "Development of Cell Knowledge Base and Prediction of Cell Types and Characteristics by Gene Expression Profiles" (in Japanese), IPSJ SIG Technical Report 2005-BIO-2, pp. 33-37. 2005.
"GENE EXPRESSION PROFILE RETRIEVING APPARATUS, GENE EXPRESSION PROFILE RETRIEVING METHOD, AND PROGRAM" US patent [US_11/235150] 2005/09/27
Reality for finding homologous gene expression profiles, Fujibuchi, W. and Horton, P., poster presentation at BITS 2004 Oct. 30 in Kazusa DNA research institute, Chiba. http://www.kap.co.jp/bits2004/
CellMontage - Cell type retrieval system by gene expression profiles, Fujibuchi, W., oral presentation at AIST bioinformatics educational course symposium, 2004 Oct. 1.
Microarray analysis on many genes determine a cell type., Fujibuchi, W., poster presentation at ISMB 2004 Aug. in Glasgow. http://www.iscb.org/ismb2004/cgi-bin/posterabstracts.cgi
Development of similar cell search system, "Cell Montage" from gene expression profiles., Fujibuchi, W. and Horton, P., poster presentation at life science field research workshop, 2004 Feb. 3 (in Japanese).

NCBI GEO: mining millions of expression profiles--database and tools.: Barrett T, Suzek TO, Troup DB, Wilhite SE, Ngau WC, Ledoux P, Rudnev D, Lash AE, Fujibuchi W, Edgar R., Nucleic Acids Res. 2005 Jan. 1;33 Database Issue:D562-6.