How to use



    The training data comprise gene expression profiles for three cancer types from The Cancer Genome Atlas (TCGA, https://portal.gdc.cancer.gov/): cholangiocarcinoma (CHOL), head and neck squamous cell carcinoma (HNSCC), and pancreatic adenocarcinoma (PAAD). The CHOL dataset includes 45 samples (9 normal, 36 cancer), the HNSCC dataset 418 samples (20 normal, 398 cancer), and the PAAD dataset 180 samples (4 normal, 176 cancer); each dataset contains 20502 genes. Because the number of genes is the same for all three cancer types, we concatenate the three datasets along the sample direction to form a single multi-source gene expression dataset with four categories (normal samples and three types of cancer samples). These data are used for characteristic gene selection and tumor classification.
  • (1) Gene expression data (.csv): Download
  • (2) Label information data (.csv): Download
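
The integration along the sample direction can be illustrated with a minimal Python sketch. The sample counts are taken from the description above; the helper names `stack_samples` and `merged_class_counts` are hypothetical and are not part of this tool.

```python
# Sample counts per dataset as stated above: (normal, cancer).
DATASETS = {"CHOL": (9, 36), "HNSCC": (20, 398), "PAAD": (4, 176)}


def stack_samples(matrices):
    """Concatenate row-major matrices along the sample (row) axis.

    Every row in every matrix must have the same number of genes.
    """
    widths = {len(row) for matrix in matrices for row in matrix}
    if len(widths) > 1:
        raise ValueError(f"gene counts differ between rows: {sorted(widths)}")
    return [row for matrix in matrices for row in matrix]


def merged_class_counts(datasets):
    """Pool the normal samples and keep one class per cancer type."""
    counts = {"Normal": 0}
    for name, (normal, cancer) in datasets.items():
        counts["Normal"] += normal
        counts[name] = cancer
    return counts


counts = merged_class_counts(DATASETS)
print(counts)                # {'Normal': 33, 'CHOL': 36, 'HNSCC': 398, 'PAAD': 176}
print(sum(counts.values()))  # 643
```

Under these counts the merged dataset holds 643 samples in four categories, with the 33 normal samples pooled across the three sources.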


  • For ease of processing, the uploaded gene expression data must be formatted as a data matrix.
  • Each row of the matrix holds the expression values of one sample across all genes, and each column holds the expression values of one gene across all samples.
  • The training data set has a feature dimension of 20502, so the uploaded gene expression data must also have 20502 features (columns).
  • The uploaded file should be a .csv file.
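
The requirements above can be checked locally before uploading. The following is a minimal sketch, assuming the file has no header row and no sample-ID column (as in the example data on this page); the function name `check_expression_csv` is hypothetical and is not part of this tool.

```python
import csv
import io

EXPECTED_GENES = 20502  # feature dimension required by the requirements above


def check_expression_csv(handle, expected_genes=EXPECTED_GENES):
    """Return the number of sample rows, or raise ValueError if the matrix is malformed."""
    n_samples = 0
    for row_number, row in enumerate(csv.reader(handle), start=1):
        if not row:
            continue  # ignore blank lines
        if len(row) != expected_genes:
            raise ValueError(
                f"row {row_number}: expected {expected_genes} values, found {len(row)}"
            )
        for value in row:
            try:
                float(value)
            except ValueError:
                raise ValueError(
                    f"row {row_number}: non-numeric value {value!r}"
                ) from None
        n_samples += 1
    if n_samples == 0:
        raise ValueError("file contains no samples")
    return n_samples


# Small demonstration with 5 genes instead of 20502.
demo = io.StringIO(
    "183.4417,38.9582,1.1561,1567.7058,237.0097\n"
    "77.304,84.8399,0.536,993.1288,44.4861\n"
)
print(check_expression_csv(demo, expected_genes=5))  # 2
```

For a real file, the same function could be called as `check_expression_csv(open("expression.csv", newline=""))`, where `expression.csv` is a placeholder file name.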




Gene expression data matrix (number of samples × number of genes). The number of genes must be 20502; to save space, only 5 gene features are shown here.

183.4417,38.9582,1.1561,1567.7058,237.0097

77.304,84.8399,0.536,993.1288,44.4861

95.9412,77.1507,0.0,1194.8052,6187.5902

73.6285,67.2859,0.0,1963.4216,2346.0058

116.0062,121.9477,0.5173,4564.8809,924.9871

53.7303,81.3758,0.0,2890.1403,182.7515



The label information is one-hot encoded over four categories: 1000 -> 0, 0100 -> 1, 0010 -> 2, 0001 -> 3, where 0 represents HNSCC, 1 represents normal, 2 represents CHOL, and 3 represents PAAD.

1,0,0,0 -> 0 : HNSCC

1,0,0,0 -> 0 : HNSCC

0,0,1,0 -> 2 : CHOL

0,0,1,0 -> 2 : CHOL

0,0,0,1 -> 3 : PAAD

0,1,0,0 -> 1 : Normal
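
The mapping from one-hot rows to class indices and names can be sketched in Python as follows. The class order comes from the description above; the function name `decode_one_hot` is hypothetical and is not part of this tool.

```python
# Class order as described above: index 0 = HNSCC, 1 = Normal, 2 = CHOL, 3 = PAAD.
CLASS_NAMES = ["HNSCC", "Normal", "CHOL", "PAAD"]


def decode_one_hot(row):
    """Convert one one-hot label row into (class index, class name)."""
    values = [int(v) for v in row]
    # A valid row has exactly four entries: a single 1 and three 0s.
    if len(values) != len(CLASS_NAMES) or sorted(values) != [0, 0, 0, 1]:
        raise ValueError(f"not a valid one-hot label: {row!r}")
    index = values.index(1)
    return index, CLASS_NAMES[index]


for line in ["1,0,0,0", "0,0,1,0", "0,0,0,1", "0,1,0,0"]:
    print(line, "->", decode_one_hot(line.split(",")))
```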