Co-expression Analysis Revealed Key Biomarkers Associated With the Diagnosis and Progression of Hypertrophic Cardiomyopathy

Objective: To reveal the molecular mechanism underlying the pathogenesis of HCM and nd new effective therapeutic strategies using a systematic biological approach. Methods: The WGCNA algorithm was applied to building the co-expression network of HCM samples. A sample cluster analysis was performed using the hclust tool and a co-expression module was constructed. The WGCNA algorithm was used to study the interactive connection between co-expression modules and draw a heat map to show the strength of interactions between modules. The genetic information of the respective modules was mapped to the associated GO terms and KEGG pathways, and the Hub Genes with the highest connectivity in each module were identied. The Wilcoxon test was used to verify the expression level of hub genes between HCM and normal samples, and the "pROC" R package was used to verify the possibility of hub genes as biomarkers. Finally, the potential functions of hub genes were analyzed by GSEA software. Results: Seven co-expression modules were constructed using sample clustering analysis. GO and KEGG enrichment analysis judged that the turquoise module is an important module. The hub genes of each module are RPL35A for module Black, FH for module Blue, PREI3 for module Brown, CREB1 for module Green, LOC641848 for module Pink, MYL7 for module Turquoise and MYL6 for module Yellow. The results of the differential expression analysis indicate that MYL7 and FH are considered true hub genes. In addition, the ROC curves revealed their high diagnostic value as biomarkers for HCM. Finally, in the results of the GSEA analysis, MYL7 and FH highly expressed genes were enriched with the "proteasome" and a "PPAR signaling pathway," respectively. Conclusions: The MYL7 and FH genes may be the true hub genes of HCM. Their respective enriched pathways, namely the "proteasome" and the "PPAR signaling pathway," may play an important role in the development of HCM.


Introduction
Hypertrophic cardiomyopathy (HCM) is the most common genetic heart disease with a prevalence of approximately 1:500 (Maron, 2002). HCM is characterized by ventricular hypertrophy, with clinical features in patients ranging from asymptomatic to heart failure and sudden cardiac death. At present, symptomatic treatment is still used to delay the progress of the disease, while traditional treatment is still unable to reverse the disease. Since the rst chromosomal location was mapped in 1989 (Jarcho et al., 1989), variants in numerous genes have been reported to cause HCM. The clinical diagnostic for genetic testing of HCM is increasingly becoming a part of the mainstream clinical management of patients and playing a key role in the cascade detection of family members (Ciriono et al., 2017;Hershberger et al., 2018). Previous studies have shown that in more than 50% of HCM samples, the disease is caused by mutations in genes encoding cardiac myosin, such as MYH7, TNNI3, MYBPC3 and MYL3. However, the molecular mechanism underlying the pathogenesis of HCM remains unclear, and new advanced analytic strategies and genomic technologies provide opportunities to explore the genetic structure of HCM. Therefore, this is an opportunity to reveal the molecular mechanism of HCM and nd new effective therapeutic strategies.
In a number of computational studies, disease risk modules have been developed to provide important measurement for the diagnosis of genetic mutations and the development of new treatment strategies (Hu et al., 2010;. WGCNA (weighted gene co-expression network analysis) is a powerful genetic analysis strategy. It is a systematic biological analysis method based on "guilt-by-association." This method is used to identify gene modules that can be used as candidate biomarkers or therapeutic targets (Langfelder and Horvath, 2008;Dileo et al., 2011). It is currently widely used for research and analysis of schizophrenia (Ren et  In this study, we used this method to analyze a large amount of HCM genetic data to nd genes and pathways that play an important role in the occurrence and development of HCM. Provide guidance for disease research and identify potential effective treatment options. It is hoped that the results of this study will provide guidance for the study of the disease and the search for potential effective treatments.

Microarray data analysis
Analysis was carried out of the gene expressions of the HCM datasets acquired from the GEO database (http://www.ncbi.nlm.nih.gov/geo) (Barrett et al., 2007). GSE36961, a much larger and newer microarray dataset of HCM, includes 106 cases and 39 controls (https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi? acc=GSE36961). Gene IDs were mapped to the microarray probes using the annotated information offered by the record. Probes corresponding to more than one gene were excluded from the dataset. The average expression values of the genes were obtained using measurements from a number of probes. A suitable threshold value was selected based on the number of probes with different thresholds of expression. The WGCNA algorithm (Langfelder & Horvath, 2008) was applied to building the coexpression network of HCM samples. Sample cluster analysis was performed using the hclust tool (R package, https://www.rdocumentation.org/packages/stats/versions/3.6.1/topics/hclust) with a threshold value of 35.

Co-expression module construction
The power value was evaluated during the construction of the modules using the WGCNA package in R (https://cran.r-project.org/web/packages/WGCNA/). The mean connectivity and scale independence of network modules were analyzed using the gradient test under different power values, which ranged from 1 to 20. The soft threshold power of 9 was chosen based on the scale-free topology criterion. The WGCNA algorithm further identi ed co-expression modules under these conditions. The minimum size of the gene group was set at 50 to ensure the reliability of the results for this module.

Interaction analysis of co-expression modules
The interactive connection among the co-expression modules was studied using the WGCNA algorithm.
The WGCNA R software package (Langfelder and Horvath, 2008) can be used to determine network construction, the calculation of topological properties, gene selection, module detection, differential network analysis, and network statistics. Furthermore, a heat map was plotted to exhibit the strength of interaction among the modules.

Functional enrichment analysis
Functional enrichment analysis was carried out in co-expression modules. The genetic information of the respective modules was mapped to the associated gene ontology (GO) terms and KEGG pathways using the DAVID tool (version 6.8; http://david.abcc.ncifcrf.gov/) (Huang, Sherman & Lempicki, 2009). The top ve records with p-value < 0.05 were retained for analysis.

Hub Gene identi cation and validation
Hub Genes with the highest connectivity in each module were identi ed using the function "chooseTopHubInEachModule" in WGCNA (Sezin et al., 2017). Bee swarm plots were created using the R package (https://cran.r-project.org/web/packages/beeswarm/index.html). Wilcoxon tests were performed to validate the hub genes' expression levels between HCM and normal tissue samples. P<0.05 was considered statistically signi cant. To validate the possibility of a hub gene as a biomarker, we outlined receiver operating characteristic (ROC) curves and calculated the area under the ROC curve (AUC) with the "pROC" R package (Sing et al., 2005).

Gene set enrichment analysis
To detect potential function of the hub genes, gene set enrichment analysis (GSEA) was conducted to explore the high-risk score associated KEGG pathways on GSEA software downloaded from the Broad Institute (http://www.broadinstitute. org/gsea). By running GSEA, normalized enrichment scores and pvalue were generated.

HCM dataset pre-processing
A total amount of 37846 gene expression values were derived from the raw le. 4,000 genes with the greatest average expression values were chosen for cluster evaluation ( Figure 1). 105 HCM samples remained for subsequent WGCNA analysis after one outlier sample was removed (GSM907253).

Identi cation of co-expression modules of HCM genes
The expression values of 4,000 genes in 105 HCM samples were analyzed to identify the modules of highly correlated genes. The soft threshold power was set at 9 (scale-free R2=0.88) to guarantee a scalefree network ( Figure 2). A total of 8 modules including green (255 genes), turquoise (992 genes), grey (932 genes), blue (473 genes), brown (424 genes), yellow (387 genes), pink (149 genes), and black (388 genes) were identi ed ( Figure 3). The genes in grey were not included in any module, so no further analysis was conducted for these genes.

Correlation analysis of co-expression modules
The WGCNA package analyzed the interactive relationships underlying the seven co-expression modules ( Figure 4). Gene expression levels were relatively independent as illustrated by the topological overlap matrix (TOM) plot of 4,000 genes, suggesting that each module was independently validated. The connectivity degree of eigengenes was assessed to further quantify the similarity of co-expression. These seven modules yielded two main clusters followed by cluster analysis ( Figure 5A), including two modules (modules Green and Yellow) and ve modules (modules Brown, Blue, Pink, Turquoise and Black), respectively. Based on the heatmap plot of the adjacencies ( Figure 5B), we found two pairs (modules Blue and Pink; modules Black and Turquoise) had the higher adjacency value.

Functional enrichment analysis in critical co-expression modules
Enrichment analysis of GO and KEGG were conducted to assess the functions of genes in the seven identi ed modules. The leading ve enriched GO and KEGG terms with p value < 0.05 were selected for further analysis. Based on the heatmap plot for GO ( Figure 6A) and KEGG ( Figure 6B) evaluations, signi cant differences in the enriched terms and enriched degree was detected among the co-expression modules. Through the analysis of the GO biological process, each module was very different from the others ( Table 1)   Abbreviations: GO, Gene Ontology; SRP, signal recognition particle.  Figure 7A) and FH ( Figure 7B) were regarded as true hub genes. Moreover, ROC curves revealed their high diagnostic value as biomarkers for HCM ( Figure 7C; MYL7 AUC: 0.762, FH AUC: 0.612).

Gene set enrichment analysis
We performed GSEA to further explore the potential functions of MYL7 and FH in HCM. As shown in Figure 8A and 8B, genes in high expression groups of MYL7 and FH were enriched (p<0.05) in "Proteasome" and "PPAR signaling" pathways, respectively.

Discussion
HCM is a recognized genetic form of heart disease. Many patients have poor long-term outcomes, including heart failure, malignant arrhythmias and sudden cardiac death. Currently there are no treatments that can effectively reverse the disease, including drug therapy, interventional therapy, and surgical treatment. Therefore, there is an urgent need to explore new effective therapeutic strategies and etiological explanations for HCM. Genetic testing is an indispensable part of labor practice, which has diagnostic and predictive value. It also offers hope that scientists will be able to decipher the mechanisms of disease occurrence and identify targets for effective treatments. As well, new sequencing techniques have led to a large number of candidate genes. In this study, we used a global approach to construct a gene co-expression network to predict candidate gene clusters in the pathogenesis of HCM. Furthermore, we hope to predict candidate gene sets as the basis for a given biological process through closed co-expressed gene modules with common functional annotations.
WGCNA is a new statistical method based on gene correlation which can be used to construct gene networks, detect modules, identify central genes and screen candidate genes as biomarkers (Langfelder and Horvath, 2008). In the statistical process, WGCNA focuses on processing a set of gene modules rather than individual genes and their interactions. This avoids the disadvantages of only processing genes and therefore ignoring molecular transcription networks. In this study, we obtained 4000 genes of 105 HCM samples from a NCMI dataset, which yielded 7 modules through the use of the WGCNA method. According to a correlation study by the topological overlap matrix (TOM) plot (Figure 4), each module was shown to be independent of the others. In addition, functional enrichment analysis was performed on the genes in these modules to identify important modules and the genes they contained. By analyzing the functional richness of these seven modules, there are signi cant differences in their enrichment degrees and terms. By analyzing these data, we found that in both GO enrichment and KEGG pathway analysis, the turquoise module has the highest enrichment. The greatest number of genes (1202 genes) were enriched in the turquoise module. It accounts for 30% of the total number of genes.
Therefore, the turquoise module is the most relevant module from the 7 previously identi ed. Through GO analysis, the genes were mainly concentrated in protein binding, poly(A) RNA binding and translation.
Through KEGG analysis, we can nd that differentially expressed genes in the turquoise module are highly rich in Ribosome, Proteasome and Oxidative phosphorylation.
The WGCNA package function "chooseTopHubInEachModule" was used to identify the modular hub genes with the highest connectivity. The hub genes of each module are RPL35A for module Black, FH for module Blue, PREI3 for module Brown, CREB1 for module Green, LOC641848 for module Pink, MYL7 for module Turquoise and MYL6 for module Yellow. Interestingly, the hub gene of the most important module Turquoise obtained by GO and KEGG analysis was MYL7. It has been repeatedly reported as one of the prevalent pathogenic mutation genes for HCM (Richard et al., 2003;Andersen et al., 2009). Compared to normal people, the above 7 hub genes were further differentially expressed. We found that FH and MYL7 were highly expressed in the HCM group compared to the normal group. These results suggested that these two genes may play an important role in the occurrence and development of HCM. Therefore MYL7 ( Figure 7A) and FH ( Figure 7B The fumarate dehydrogenase (FH) gene is located on chromosome 1. In normal cells, FH is located in both mitochondria and cytosol and catalyzes fumarate to malate (Jardim-Messeder et al., 2017).
Fumarate is a covalent oncometabolite. Its accumulation is characteristic of hereditary leiomyomatosis of genetic cancer syndrome. The mutation of the fumarate hydratase (FH) gene may cause the affected cells to transition to aerobic glycolysis (Warburg effect) (King et al., 2006). It has been found that mutations in fumaric acid can cause several fumarase-related diseases in humans, such as benign mesenchymal tumors of the uterus, leiomyomatosis and so on. In the results of this study, the FH gene may also be a key gene in the occurrence and development of HCM. At present there is little research on their correlation. In the future, exploration of the FH gene may be a salient new direction for further research on HCM.
Therefore, in order to determine the potential molecular function of these two important hub genes, we continued to use GSEA to search for KEGG pathways that are rich in high-expression samples. As shown in Figure 8A and 8B, genes in high expression groups of MYL7 and FH were enriched (p<0.05) in "Proteasome" and "PPAR signaling" pathways, respectively.
The proteasome is the main multicatalytic protease complex. It is involved in the degradation of most intracellular proteins. In addition, muscle bers and sarcoma proteins have been shown to be primarily degraded by proteasome. Some scholars have pointed out that proteasome dysfunction is related to human HCM. As well, a previous study suggested that the protease inhibitor ps-519 may cause a signi cant regression of cardiac hypertrophy.
Peroxisome proliferator-activated receptors (PPAR) are expressed in many tissues, such as skeletal and cardiac muscles, fat cells, liver cells, etc. Different subtypes of PPAR have different tissue distribution and expression pro les, leading to different clinical outcomes. In particular, the subtype of PPARα is highly expressed in tissues with high fatty acid oxidation capacity, such as liver, heart, and skeletal muscle. In addition, activation of the receptor for another subtype, PPARβ/δ has been shown to protect the myocardium from ischemia-reperfusion injury typical of diabetic cardiomyopathy. Although there is no research on the direct relationship between the PPAR pathway and HCM, this pathway may also be a direction for further researching HCM and nding therapeutic approaches.

Conclusions
In this study, two key genes of HCM, FH and MYH7, were identi ed from extensive genetic data through co-expression network analysis. In addition, the most enriched pathways for two key genes were discovered. They are the PPAR signaling pathway and proteasome. They may have played a very important role in the occurrence and development of HCM. The key genes and pathways identi ed in this study may provide guidance for further study of the mechanism and treatment of HCM in the future.
These ndings still need to be tested in a large number of clinical practices in the future.

Availability of data and materials
The datasets generated during the current study are available in the GEO database (http://www.ncbi.nlm.nih.gov/geo)

Competing interests
No competing nancial interests exist.

Funding Sources
No funding was obtained for this study.
Author Contributions XL, XZ, DG, and CL conceived this study; XL, XZ, DG and CL performed the analysis; XL, XZ, DG, CL, JL, YW and CW prepared the manuscript. All authors have read and approved the submitted manuscript.  The constructed coexpression modules of HCM genes by WGCNA software.  The heatmap for GO (A) and KEGG (B) enrichment analysis of HCM genes in coexpression modules.
Rows and columns represent the terms and modules, respectively.