SHOGoiN CELLBLAST

Installation
Requirements
A working installation of the Boost C++ libraries is required.

Standard Installation
On Linux, run the following commands:
$ ./configure
$ make
$ make install

Individual Installation
By default, "make install" installs all files under "/usr/local" ("/usr/local/bin", "/usr/local/lib", etc.). To install elsewhere, pass "--prefix" to "configure", for instance "--prefix=$HOME":
$ ./configure --prefix=$HOME

Running
Database file generation
Prepare a database file by concatenating gene expression files in CM format, as follows:

db.CM
>GSM1269135 |GPL16791 (HiSeq 2500)|H.sapiens|8000301-738 (Muscle, Myoblast_T24 (Skeletal muscle))
ENSG00000000003:469 ENSG00000000005:0 ENSG00000000419:0 ...
...
ENSG00000283122:0 ENSG00000283123:8 ENSG00000283125:0
>GSM1269137 |GPL16791 (HiSeq 2500)|H.sapiens|8000301-738 (Muscle, Myoblast_T24 (Skeletal muscle))
ENSG00000000003:237 ENSG00000000005:0 ENSG00000000419:45 ...
...
ENSG00000283122:0 ENSG00000283123:0 ENSG00000283125:0
>GSM1269130 |GPL16791 (HiSeq 2500)|H.sapiens|8000301-738 (Muscle, Myoblast_T24 (Skeletal muscle))
...
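
The layout shown above can be read with a few lines of Python. This is only a sketch inferred from the example: it assumes every record starts with a ">" header line and is followed by whitespace-separated "geneID:value" tokens. The function name parse_cm is hypothetical and is not part of CELLBLAST.

```python
from typing import Dict, Iterable, Iterator, Tuple


def parse_cm(lines: Iterable[str]) -> Iterator[Tuple[str, Dict[str, float]]]:
    """Yield (header, {gene_id: value}) for each record of concatenated CM text.

    Assumed layout (inferred from the example, not from a format spec):
    a header line starting with ">" followed by lines of "geneID:value" tokens.
    """
    header = None
    profile: Dict[str, float] = {}
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, profile
            header, profile = line[1:].strip(), {}
        elif header is not None:
            for token in line.split():
                gene, sep, value = token.rpartition(":")
                if sep:  # ignore tokens that are not "geneID:value"
                    profile[gene] = float(value)
    if header is not None:
        yield header, profile


example = """\
>GSM1269135 |GPL16791 (HiSeq 2500)|H.sapiens|8000301-738 (Muscle, Myoblast_T24 (Skeletal muscle))
ENSG00000000003:469 ENSG00000000005:0 ENSG00000000419:0
ENSG00000283122:0 ENSG00000283123:8 ENSG00000283125:0
>GSM1269137 |GPL16791 (HiSeq 2500)|H.sapiens|8000301-738 (Muscle, Myoblast_T24 (Skeletal muscle))
ENSG00000000003:237 ENSG00000000005:0 ENSG00000000419:45
"""
records = list(parse_cm(example.splitlines()))
```

Here records holds one (header, profile) pair per sample, which is convenient for checking a database file before indexing it.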

Generate the index file, "db_geneIds.txt".
$ ./genIndex.pl db.CM | sort | uniq > db_geneIds.txt
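
The "sort | uniq" pipeline suggests that the index is a deduplicated, sorted list of gene IDs. The sketch below builds such a list in Python under that assumption. The actual output format of genIndex.pl is not documented here, so the function gene_ids is purely illustrative and not a replacement for the script. Note that Python sorts by code point, whereas "sort" follows the current locale.

```python
from typing import Iterable, List


def gene_ids(lines: Iterable[str]) -> List[str]:
    """Return the sorted, unique gene IDs found in CM-format lines.

    Assumption: data lines hold whitespace-separated "geneID:value" tokens
    and header lines start with ">".
    """
    ids = set()
    for line in lines:
        if line.startswith(">"):
            continue  # header lines carry no gene IDs
        for token in line.split():
            gene, sep, _value = token.rpartition(":")
            if sep and gene:
                ids.add(gene)
    return sorted(ids)


index = gene_ids([
    ">GSM1269135 |GPL16791 (HiSeq 2500)|H.sapiens|...",
    "ENSG00000000005:0 ENSG00000000003:469",
    ">GSM1269137 |GPL16791 (HiSeq 2500)|H.sapiens|...",
    "ENSG00000000003:237 ENSG00000000419:45",
])
```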

Generate the binary database file, "db.bin".
$ ./runGerIndexer db_geneIds.txt < db.CM > db.bin

Run the CELLBLAST profile matcher.
Prepare a query file in CM format and run the CELLBLAST profile matcher as follows.
$ ./runGerMatcher db.bin query.CM > result.txt

Example Usage
The following is an example of using the CELLBLAST profile matcher. The database file "HiSeqHsapiens.bin" and the query file "query.GSM1901473.TF_activity_protein_binding.CM" can be downloaded from CELLBLAST_Database. The query file contains the genes annotated with "MF: transcription factor activity, protein binding". Profile matching uses only the genes present in both the database and the query.
$ ./runGerMatcher HiSeqHsapiens.bin query.GSM1901473.TF_activity_protein_binding.CM > result.txt
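
To illustrate matching on shared genes only, the sketch below computes Spearman's rank correlation between two profiles restricted to the gene IDs they have in common. It is a textbook implementation (average ranks for ties, then Pearson correlation of the ranks) and is not taken from the runGerMatcher source. All function names are hypothetical.

```python
from math import sqrt
from typing import Dict, List, Tuple


def average_ranks(values: List[float]) -> List[float]:
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x: List[float], y: List[float]) -> float:
    """Pearson correlation; returns NaN if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return float("nan")
    return sxy / sqrt(sxx * syy)


def spearman_shared(query: Dict[str, float],
                    sample: Dict[str, float]) -> Tuple[float, int]:
    """Spearman's rho over genes present in both profiles, plus the gene count."""
    shared = sorted(query.keys() & sample.keys())
    if not shared:
        return float("nan"), 0
    x = average_ranks([query[g] for g in shared])
    y = average_ranks([sample[g] for g in shared])
    return pearson(x, y), len(shared)


# Toy profiles: g4 and g5 are ignored because they are not shared.
query = {"g1": 10.0, "g2": 0.0, "g3": 5.0, "g4": 7.0}
sample = {"g1": 100.0, "g2": 1.0, "g3": 50.0, "g5": 9.0}
rho, n_genes = spearman_shared(query, sample)
```

The second return value corresponds to the number of genes used for matching, which the matcher reports alongside each hit.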

Example Result
The search result (using CELLBLAST database version 1.0.1 in August 2018) is shown in the following table consisting of five columns:

· 1st column: Sample ID. Sample accession numbers (GSM) from the NCBI Gene Expression Omnibus (GEO) are used in the CELLBLAST database file.
· 2nd column: P-value of Fisher's Z-transformed rank correlation coefficient. The details of the derivation of the p-values are described in Document manuals.
· 3rd column: Spearman's rank correlation coefficient. The details of the derivation of the correlation coefficients are also described in Document manuals.
· 4th column: The number of genes used for the profile matching.
· 5th column: Header information of CM format in database file. In the CELLBLAST database files, the header gives the GEO sample accession number (GSM), the GEO platform accession number (GPL), the organism, and the SHOGoiN Cell ID of the sample, delimited by "|".

Sample ID   P-value      Spearman's rho  # genes  Header information of CM format in database file
GSM1901473  0            1.00            588      GSM1901473 |GPL11154 (HiSeq 2000)|H.sapiens|3110001010000000000000-020 (Pancreas, Alpha cell (Pancreatic islet))
GSM1901487  3.62266e-13  0.556442        588      GSM1901487 |GPL11154 (HiSeq 2000)|H.sapiens|3110001010000000000000-020 (Pancreas, Alpha cell (Pancreatic islet))
GSM1901493  7.22888e-12  0.544295        588      GSM1901493 |GPL11154 (HiSeq 2000)|H.sapiens|3110001010000000000000-020 (Pancreas, Alpha cell (Pancreatic islet))
GSM1901488  1.94848e-10  0.529727        588      GSM1901488 |GPL11154 (HiSeq 2000)|H.sapiens|3110001010000000000000-020 (Pancreas, Alpha cell (Pancreatic islet))
GSM1901458  3.72947e-10  0.526685        588      GSM1901458 |GPL11154 (HiSeq 2000)|H.sapiens|3110001010000000000000-212 (Pancreas, PP cell (Pancreatic islet))
GSM1901497  1.09923e-09  0.521478        588      GSM1901497 |GPL11154 (HiSeq 2000)|H.sapiens|3110001010000000000000-020 (Pancreas, Alpha cell (Pancreatic islet))
GSM1901464  1.62965e-09  0.519536        588      GSM1901464 |GPL11154 (HiSeq 2000)|H.sapiens|3110002050000000000000-090 (Pancreas, Duct cell (Pancreatic islet))
GSM1901519  3.53105e-09  0.515645        588      GSM1901519 |GPL11154 (HiSeq 2000)|H.sapiens|3110001010000000000000-026 (Pancreas, Beta cell (Pancreatic islet))
...         ...          ...             ...      ...
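
For downstream filtering, the five columns can be loaded into Python. The sketch below assumes one hit per line, with the first four fields separated by whitespace (spaces or tabs) and the remainder of the line being the CM header. The exact delimiter of "result.txt" is not stated here, so treat this as an assumption. The names Hit and parse_hits are hypothetical.

```python
from typing import Iterable, List, NamedTuple


class Hit(NamedTuple):
    sample_id: str
    p_value: float
    rho: float
    n_genes: int
    header: str


def parse_hits(lines: Iterable[str]) -> List[Hit]:
    """Parse result lines; rows whose numeric fields do not parse are skipped."""
    hits = []
    for line in lines:
        fields = line.split(None, 4)  # at most 5 fields; the header keeps its spaces
        if len(fields) < 5:
            continue
        try:
            hits.append(Hit(fields[0], float(fields[1]), float(fields[2]),
                            int(fields[3]), fields[4].strip()))
        except ValueError:
            continue  # e.g. a column-title row or a "..." filler row
    return hits


rows = [
    "GSM1901487 3.62266e-13 0.556442 588 GSM1901487 |GPL11154 (HiSeq 2000)|H.sapiens|3110001010000000000000-020 (Pancreas, Alpha cell (Pancreatic islet))",
    "GSM1901458 3.72947e-10 0.526685 588 GSM1901458 |GPL11154 (HiSeq 2000)|H.sapiens|3110001010000000000000-212 (Pancreas, PP cell (Pancreatic islet))",
    "... ... ... ... ...",
]
hits = parse_hits(rows)
# Keep only hits annotated as alpha cells in the header column.
alpha_hits = [h for h in hits if "Alpha cell" in h.header]
```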