Identification of four novel hub genes as monitoring biomarkers for colorectal cancer

Luo, Danqing; Yang, Jing; Liu, Junji; Yong, Xia; Wang, Zhimin

doi:10.1186/s41065-021-00216-7

Research
Open access
Published: 29 January 2022

Hereditas volume 159, Article number: 11 (2022)

Abstract

Background

The incidence of colorectal cancer (CRC) is rising worldwide, but treatment has not kept pace. Further research into the underlying pathogenesis of CRC could help improve the survival of CRC patients.

Methods

Differentially expressed genes (DEGs) were screened with the “limma” and “RobustRankAggreg” packages of R software. Weighted gene co-expression network analysis (WGCNA) was performed on the integrated DEGs using data from The Cancer Genome Atlas (TCGA), and all validation samples were from Gene Expression Omnibus (GEO) datasets.

Results

Functional annotation of the primary DEGs returned terms associated with CRC. The yellow module eigengene (MEyellow) stood out for its significant correlation with the clinical feature (disease), and four hub genes, ABCC13, AMPD1, SCNN1B and TMIGD1, were identified in the yellow module. Nine datasets from the Gene Expression Omnibus database confirmed that these four genes were significantly down-regulated, and Kaplan-Meier survival analysis showed lower survival estimates for the low-expression group of these genes than for the high-expression group. MEXPRESS suggested that down-regulation of some top hub genes may be caused by hypermethylation. Receiver operating characteristic curves indicated that these genes had some diagnostic efficacy. Moreover, analyses of tumor-infiltrating immune cells and gene set enrichment analysis for the hub genes suggested associations between these genes and the pathogenesis of CRC.

Conclusion

This study identified modules significantly associated with CRC and four novel hub genes, and analyzed these genes further. The findings may provide new insights into, and directions for studying, the potential pathogenesis of CRC.

Introduction

Colorectal cancer (CRC) has the second highest cancer-related mortality rate, is the third most common cancer in men and the second most common in women worldwide, and caused approximately 881,000 deaths worldwide in 2018 [1]. Gene mutations, epigenetic changes including the cytosine guanine (CpG) island methylator phenotype (CIMP) [2], and environmental factors including diet, sedentary lifestyle and gut microbiota play a vital role in the progression of CRC [3]. Obesity is an important factor: a meta-analysis showed a 33% increase in risk for an obese person compared with a person of normal weight, which may be mediated by insulin resistance [3]. Of all cases, 75–80% are sporadic, with hereditary factors contributing to 20–25% of CRC [4]. The complex interaction between susceptibility genes and environmental factors can therefore be considered the main reason for the occurrence and development of CRC. Early CRC symptoms are not obvious, and most patients with significant symptoms already have advanced disease. Metastases are common in advanced CRC, and the survival rate for patients with metastases is only about 14% [5]. Efforts are therefore needed to study the potential pathogenesis of CRC and to find new screening and early detection methods. Numerous studies have focused on biomarkers associated with CRC, including susceptibility genes and the deoxyribonucleic acid (DNA) methylome. Bagheri et al. [6] nominated tissue factor pathway inhibitor 2 (TFPI2) and N-myc downregulated gene family member 4 (NDRG4), which showed sufficiently high sensitivity and specificity, as new CRC screening genes in peripheral blood mononuclear cells. In addition, Manoochehri’s study [7] revealed a three-gene signature, comprising NGFR, FGF2 and PROM1, as a potential therapeutic target and candidate molecular marker in CRC chemoradioresistance.
In a prospective targeted sequencing study involving 1134 CRC samples, mutations in the adenomatous polyposis coli (APC) and CTNNB1 genes were identified, which raised the proportion of CRCs with oncogenic WNT pathway alterations to 96% [8]. As a key driver of tissue stem cells, the WNT signaling pathway is involved in both the development and homeostasis of tissues [9]. Mutations in WNT signaling pathway components lead to a variety of growth diseases and cancers, including CRC, and many studies of therapeutic approaches targeting the WNT pathway (especially the Wnt/β-catenin signaling pathway) and their clinical applications have been reported. Loss of gene expression in CRC may be associated with hypermethylation of CpG islands in some susceptibility genes [10], and CIMP has been reported in a subtype of CRC [2].

Current diagnostic methods for CRC are based on histopathologic examination, while treatment plans and prognostic predictions usually refer to the tumour, node and metastasis (TNM) stage, which was first mentioned in 1968 [11]. It is well known that the heterogeneous genomes of CRC patients can provide important prognostic information. Individualized treatment for patients with heterogeneous genes is a future trend, so it is necessary to further understand the underlying pathogenesis of CRC and to establish a CRC-specific gene blueprint. Sun’s study [12] identified seven key genes, PPBP, CCL28, CXCL12, INSL5, CXCL3, CXCL10 and CXCL11, as important molecular markers that contribute to the screening, diagnosis and prognosis of CRC and to new therapeutic targets. In addition, the TIMP1, LZTS3, AXIN2, CXCL1, ITLN1, CPT2 and CLDN23 genes have also been confirmed to be related to the pathogenesis of CRC [13]. In terms of drug sensitivity screening, one study [14] reported that combining gefitinib and regorafenib in the HCT116, CT26 and SW948 colorectal cancer cell lines may be a promising strategy for the treatment of colorectal cancer. A large number of key genes related to the pathogenesis of CRC have now been identified, which brings more possibilities for gene identification and diagnosis of CRC. Because the mechanisms of CRC remain uncertain, this study set out to identify genes closely related to CRC, which might provide new insights for future individualized and comprehensive therapy.

Materials and methods

Microarray detection and differential expression analysis

Nine eligible microarray datasets from the GEO database, GSE4183 [15], GSE44076 [16, 17], GSE23878 [18], GSE32323 [19], GSE110223 [20], GSE110224 [20], GSE33113 [21, 22], GSE37364 [23], and GSE9348 [24], were enrolled in the study. Information on the nine datasets is listed in Table 1. Screening for DEGs identifies differences in gene expression levels between tumor tissues and matched normal tissues and highlights genes correlated with the biological characteristics of tumors. We employed the edgeR package [25] of R software (Version 3.6.3) to analyze the differences between non-malignant samples and colon adenocarcinoma (COAD) tissues in the TCGA-COAD dataset. The statistical significance of genes between datasets was assessed by a linear model implemented in the “limma” package [26]. DEGs, including significantly downregulated and upregulated genes, were selected for further study with cut-off criteria of false discovery rate (FDR) < 0.05 and |log₂ fold change (FC)| > 1. A P-value < 0.05 was considered statistically significant for DEGs. The DEGs were then ranked by P-value using the robust rank aggregation (RRA) method [27], and the integrated DEGs were used for subsequent analysis.
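The cut-off step described above can be illustrated with a minimal Python sketch. It is not the authors' R workflow (which used edgeR and limma); it assumes the FDR is a Benjamini-Hochberg adjustment, which the text does not state, and the gene names and statistics in `toy_stats` are invented for illustration only.

```python
from typing import Dict, List, Tuple


def benjamini_hochberg(p_values: List[float]) -> List[float]:
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest so that the
    # adjusted values stay monotone in the rank.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


def select_degs(
    stats: Dict[str, Tuple[float, float]],
    fdr_cutoff: float = 0.05,
    lfc_cutoff: float = 1.0,
) -> Dict[str, str]:
    """Keep genes with FDR < fdr_cutoff and |log2 FC| > lfc_cutoff.

    `stats` maps a gene name to (log2 fold change, raw p-value).
    """
    genes = list(stats)
    fdr = benjamini_hochberg([stats[g][1] for g in genes])
    degs = {}
    for gene, q in zip(genes, fdr):
        lfc = stats[gene][0]
        if q < fdr_cutoff and abs(lfc) > lfc_cutoff:
            degs[gene] = "up" if lfc > 0 else "down"
    return degs


# Hypothetical statistics, not taken from the study.
toy_stats = {
    "GENE_A": (2.3, 0.0001),
    "GENE_B": (-1.8, 0.004),
    "GENE_C": (0.4, 0.0002),   # significant but small fold change
    "GENE_D": (1.5, 0.60),     # large fold change but not significant
}
print(select_degs(toy_stats))
```

Only genes that pass both criteria are kept, so a small but highly significant change (GENE_C) and a large but noisy change (GENE_D) are both excluded.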

Table 1 Characteristics of the data sets enrolled in the study

Visualization of gene expression patterns and chromosome locations

In this section, the top 100 DEGs (the top 50 up-regulated and top 50 down-regulated genes) were looked up in the National Center for Biotechnology Information (NCBI) Gene database to obtain their chromosomal locations. Expression patterns and chromosomal locations were then visualized with the “OmicCircos” package [28].

Functional annotation and visualization

Gene Ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway enrichment analyses were performed for the top 300 integrated DEGs with the Database for Annotation, Visualization and Integrated Discovery (DAVID) [29]. Functional annotations and enriched pathways were visualized with the “GOplot” package [30].

Weighted gene co-expression network analysis

Weighted gene co-expression network analysis (WGCNA) was performed on the integrated DEGs using data from The Cancer Genome Atlas (TCGA). Nine samples were identified as outliers, with the outlier threshold set at −0.25 (Supplemental Fig. 1). A gene expression matrix, in which each row represents a gene and each column represents a sample, was formed, and pairwise gene correlations were calculated from it. The correlation matrix was transformed into an adjacency matrix using a suitable soft-thresholding parameter β, which enhances high correlations and weakens low correlations. In the current study, β = 9 was chosen to ensure a scale-free network (Supplemental Fig. 2). Modules, with the minimum module size set to 30 genes, were then obtained by average linkage hierarchical clustering based on the topological overlap matrix (TOM) dissimilarity measure [31]. The module eigengene (ME) was the first principal component of the expression matrix of the genes in each module, and modules of interest were identified by calculating the correlation between MEs and clinical features. Furthermore, two parameters were calculated: the Pearson’s correlation between the ME of each module and clinical information was defined as module significance (MS), and the log10 transformation of the P-value was defined as gene significance (GS). Functional annotation and enrichment analysis were restricted to KEGG and GO in DAVID [29].
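To make the soft-thresholding and TOM steps concrete, the following Python sketch builds an adjacency matrix with β = 9 and derives the TOM dissimilarity for three toy genes. It assumes an unsigned network (absolute Pearson correlation raised to β) and the standard topological overlap formula; the text does not say which network type was used, and the expression values are invented. The WGCNA R package remains the reference implementation.

```python
import math
from typing import List

Matrix = List[List[float]]


def pearson(x: List[float], y: List[float]) -> float:
    """Pearson correlation between two equally long expression profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def adjacency(expr: Matrix, beta: int = 9) -> Matrix:
    """Unsigned adjacency a_ij = |cor(i, j)| ** beta (rows of expr are genes)."""
    n = len(expr)
    adj = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            a = abs(pearson(expr[i], expr[j])) ** beta
            adj[i][j] = adj[j][i] = a
    return adj


def tom_dissimilarity(adj: Matrix) -> Matrix:
    """Return 1 - TOM, where
    TOM_ij = (sum_u a_iu * a_uj + a_ij) / (min(k_i, k_j) + 1 - a_ij)
    and k_i is the connectivity (row sum) of gene i.
    """
    n = len(adj)
    k = [sum(row) for row in adj]
    diss = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            shared = sum(adj[i][u] * adj[u][j]
                         for u in range(n) if u not in (i, j))
            tom = (shared + adj[i][j]) / (min(k[i], k[j]) + 1 - adj[i][j])
            diss[i][j] = 1 - tom
    return diss


# Hypothetical profiles: genes 0 and 1 are perfectly correlated,
# gene 2 is only weakly correlated (r = -0.3) with gene 0.
toy_expr = [
    [1.0, 2.0, 3.0, 4.0, 5.0],
    [2.0, 4.0, 6.0, 8.0, 10.0],
    [5.0, 1.0, 4.0, 2.0, 3.0],
]
toy_adj = adjacency(toy_expr, beta=9)
toy_diss = tom_dissimilarity(toy_adj)
print(toy_adj[0][1], toy_adj[0][2])
print(toy_diss[0][1], toy_diss[0][2])
```

Raising the correlation to the ninth power leaves the strong link between genes 0 and 1 close to 1 while shrinking the weak link (|r| = 0.3) to roughly 2 × 10⁻⁵, which is the "enhance high, weaken low" behaviour described above. Hierarchical clustering would then be run on the dissimilarity matrix.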

Identification, validation and efficacy evaluation for hub genes

Hub genes were defined as genes with module membership > 0.80 and gene significance > 0.20. All validation samples were from GEO datasets, including GSE4183 [15], GSE44076 [16, 17], GSE23878 [18], GSE32323 [19], GSE110223 [20], GSE110224 [20], GSE33113 [21, 22], GSE37364 [23], and GSE9348 [24]. Diagnostic efficacy is typically evaluated with the receiver operating characteristic (ROC) curve and its summary index, the area under the curve (AUC). Therefore, ROC curves, an effective method of evaluating the performance of diagnostic tests, were plotted and AUCs were calculated with the “pROC” package [32] to evaluate the diagnostic performance of the hub genes.
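The two steps in this section, filtering hub genes by the stated thresholds and summarizing diagnostic performance with an AUC, can be sketched in Python as follows. This is an illustration rather than the pROC workflow used in the study: the AUC is computed through its rank interpretation (the probability that a randomly chosen tumor sample scores higher than a randomly chosen normal sample), and all gene names and expression values are hypothetical. Because the hub genes are reported as down-regulated in tumors, the sketch uses the negated expression as the score.

```python
from typing import Dict, List, Tuple


def select_hub_genes(
    genes: Dict[str, Tuple[float, float]],
    mm_cutoff: float = 0.80,
    gs_cutoff: float = 0.20,
) -> List[str]:
    """Keep genes with module membership > mm_cutoff and
    gene significance > gs_cutoff. `genes` maps name -> (MM, GS)."""
    return [name for name, (mm, gs) in genes.items()
            if mm > mm_cutoff and gs > gs_cutoff]


def auc(scores_pos: List[float], scores_neg: List[float]) -> float:
    """AUC as the fraction of (positive, negative) pairs in which the
    positive sample has the higher score; ties count as one half."""
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))


# Hypothetical module membership and gene significance values.
toy_genes = {
    "GENE_A": (0.91, 0.35),
    "GENE_B": (0.85, 0.10),   # fails the gene significance cut-off
    "GENE_C": (0.60, 0.40),   # fails the module membership cut-off
}
print(select_hub_genes(toy_genes))

# Hypothetical expression of one down-regulated gene.
tumor_expr = [2.1, 3.0, 2.5, 4.2]
normal_expr = [5.0, 4.8, 3.9, 6.1]
# Lower expression should indicate tumor, so negate to obtain a score.
toy_auc = auc([-x for x in tumor_expr], [-x for x in normal_expr])
print(toy_auc)
```

In the toy data, 15 of the 16 tumor-normal pairs are ordered correctly, giving an AUC of 0.9375.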

Kaplan-Meier survival analysis

In this section, patients were divided into two groups according to the median expression value of each hub gene to plot survival curves. Survival curves constructed for a single hub gene at a time were compared using the log-rank test, which indicates whether the difference between two groups is statistically significant. Survival analysis was conducted in Gene Expression Profiling Interactive Analysis (GEPIA) [33].
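The median split, Kaplan-Meier estimate and log-rank comparison described above can be sketched in plain Python. The study itself used the GEPIA web server, so this is only an illustration of the underlying calculations; the follow-up times, event flags and expression values are invented, and assigning samples equal to the median to the low group is an assumption.

```python
import math
from statistics import median
from typing import List, Tuple


def split_by_median(expression: List[float]) -> Tuple[List[int], List[int]]:
    """Return (high, low) sample indices; values equal to the median go low."""
    cut = median(expression)
    high = [i for i, x in enumerate(expression) if x > cut]
    low = [i for i, x in enumerate(expression) if x <= cut]
    return high, low


def kaplan_meier(times: List[float], events: List[int]) -> List[Tuple[float, float]]:
    """Kaplan-Meier survival estimate at each distinct event time.
    events: 1 = death observed, 0 = censored."""
    surv = 1.0
    curve = []
    for t in sorted({t for t, e in zip(times, events) if e}):
        at_risk = sum(1 for u in times if u >= t)
        deaths = sum(1 for u, e in zip(times, events) if u == t and e)
        surv *= 1 - deaths / at_risk
        curve.append((t, surv))
    return curve


def logrank(times_a, events_a, times_b, events_b) -> Tuple[float, float]:
    """Two-group log-rank test: returns (chi-square statistic, p-value, 1 df)."""
    event_times = sorted({t for t, e in zip(times_a + times_b,
                                            events_a + events_b) if e})
    obs_minus_exp = 0.0
    variance = 0.0
    for t in event_times:
        n_a = sum(1 for u in times_a if u >= t)
        n_b = sum(1 for u in times_b if u >= t)
        d_a = sum(1 for u, e in zip(times_a, events_a) if u == t and e)
        d_b = sum(1 for u, e in zip(times_b, events_b) if u == t and e)
        n, d = n_a + n_b, d_a + d_b
        obs_minus_exp += d_a - d * n_a / n
        if n > 1:
            variance += d * (n_a / n) * (n_b / n) * (n - d) / (n - 1)
    chi2 = obs_minus_exp ** 2 / variance
    # Survival function of a chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Hypothetical cohort of six patients.
toy_expression = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
toy_times = [5, 8, 12, 20, 25, 30]      # follow-up in months
toy_events = [1, 1, 1, 1, 0, 0]         # 1 = death, 0 = censored

high, low = split_by_median(toy_expression)
low_t, low_e = [toy_times[i] for i in low], [toy_events[i] for i in low]
high_t, high_e = [toy_times[i] for i in high], [toy_events[i] for i in high]

print(kaplan_meier(low_t, low_e))
toy_chi2, toy_p = logrank(low_t, low_e, high_t, high_e)
print(toy_chi2, toy_p)
```

In this toy cohort every low-expression patient dies before any high-expression patient, so the log-rank statistic is about 5.05, which corresponds to p < 0.05 on one degree of freedom.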

DNA methylation analysis of hub genes

DNA methylation, the addition of a methyl group to the carbon-5 position of cytosine within a CpG dinucleotide, is a common and early event in cancer [34]. DNA methylation is increasingly being incorporated in biomarker studies because of its potential prognostic value. In the current study, DNA methylation data for the hub genes were obtained from the human disease methylation database version 2.0 [35]. Furthermore, MEXPRESS, a tool for visualizing DNA methylation, expression and clinical data [36], was used to examine the relationship between hub gene expression and DNA methylation status.

Hub genes, tumor-infiltrating immune cells and gene set enrichment analysis (GSEA)

Hub genes were uploaded to the Tumor IMmune Estimation Resource (TIMER) [37], a web server for comprehensive analysis of tumor-infiltrating immune cells, to study their associations with tumor-infiltrating immune cells (B cells, CD4⁺ T cells, CD8⁺ T cells, neutrophils, macrophages and dendritic cells). Moreover, samples were divided into high-risk and low-risk groups according to the median risk score for GSEA, a method used to analyse and interpret coordinated pathway-level changes in transcriptomics experiments [38].

Results

Differential expression analysis

DEGs were identified at a threshold of P-value < 0.05 using “limma” and RRA. The top 20 down-regulated and top 20 up-regulated genes are shown in Fig. 1, which displays the dataset IDs and the 40 genes. Each gene is colored according to the color key: the redder the gene, the more it was up-regulated, and the bluer the gene, the more it was down-regulated.

Visualization of gene expression patterns and chromosome locations

The expression patterns and chromosomal locations of the top 100 DEGs (the top 50 up-regulated and top 50 down-regulated genes) were visualized. Figure 2 shows that two DEGs were located on chromosome X and that DEGs were mainly located on chromosomes 1, 3, 5 and 16. The top 5 up-regulated genes ABCA8, CLDN1, CDH3, TSPAN7 and GRAMD3 were located on chromosomes 17, 3, 16, X and 5, respectively, while the top 5 down-regulated genes GRINA, PPM1H, SCD, GEMIN6 and SHMT2 were located on chromosomes 8, 12, 10, 2 and 12, respectively.