Showing 1 - 10 of 720 results for '"data normalization"', query time: 0.93s
  1.
    Academic Journal

    Source: Briefings in Bioinformatics, 24(5)

    Description: The advent of single-cell RNA sequencing (scRNA-seq) technologies has enabled gene expression profiling at single-cell resolution, making it possible to quantify and compare transcriptional variability among individual cells. Although alterations in transcriptional variability have been observed in various biological states, statistical methods for quantifying and testing differential variability between groups of cells are still lacking. To identify best practices in differential variability analysis of single-cell gene expression data, we propose and compare 12 statistical pipelines using different combinations of methods for normalization, feature selection, dimensionality reduction and variability calculation. Using high-quality synthetic scRNA-seq datasets, we benchmarked the proposed pipelines and found that the most powerful and accurate pipeline performs simple library size normalization, retains all genes in the analysis and uses denSNE-based distances to cluster medoids as the variability measure. By applying this pipeline to scRNA-seq datasets of COVID-19 and autism patients, we identified cellular variability changes between patients with different severity status or between patients and healthy controls. (A sketch of library-size normalization follows this record.)

    File description: application/pdf
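The winning pipeline in this record performs simple library size normalization. A minimal sketch of that step, assuming a toy cell-by-gene count matrix and a conventional scale factor of 10,000 (both illustrative, not the authors' code):

```python
import numpy as np

# Toy counts matrix: rows = cells, columns = genes (illustrative values).
counts = np.array([
    [10, 0, 5, 20],
    [ 2, 1, 0,  7],
    [30, 4, 9, 57],
], dtype=float)

def library_size_normalize(counts, scale=1e4):
    """Divide each cell's counts by its total (library size),
    then multiply by a common scale factor."""
    lib_sizes = counts.sum(axis=1, keepdims=True)  # total counts per cell
    return counts / lib_sizes * scale

norm = library_size_normalize(counts)
print(norm.sum(axis=1))  # every cell now sums to the scale factor (10000)
```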

  2.
    Academic Journal

    Authors: Bui Cong Giao, Duong Tuan Anh

    Source: Vietnam Journal of Computer Science, Vol 11, Iss 02, Pp 241-274 (2024)

    Description: Subsequence join over time series searches for pairs of similar subsequences across multiple time series. The task is useful in time-series data mining; nevertheless, it is extremely difficult because of its enormous computational cost. The search should operate on normalized time series and use an efficient distance measure to obtain accurate matching pairs. The task is even more challenging in a streaming environment, where time-series data may arrive very quickly. To address this problem, we propose an efficient method of subsequence join in streaming time series under Dynamic Time Warping (DTW), supporting z-score normalization. The proposed method uses a technique of subsequence extraction based on major extrema of streaming time series to search for pairs of similar subsequences from coevolving time series, and it can identify pairs of subsequences of the same or different lengths. The experimental results show that the proposed method has high performance and surfaces interesting pairs of similar subsequences. In addition, the method acts as an approximation algorithm suitable for a streaming scenario in which users expect fast responses from subsequence join over high-rate time-series streams. (A z-normalization plus DTW sketch follows this record.)

    File description: electronic resource
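The record above relies on z-score normalization and DTW. A minimal sketch of both, using a textbook quadratic-time DTW rather than the authors' optimized streaming method:

```python
import numpy as np

def z_normalize(x):
    """Z-score normalization: zero mean, unit variance."""
    x = np.asarray(x, dtype=float)
    std = x.std()
    if std == 0:                      # constant subsequence: return zeros
        return np.zeros_like(x)
    return (x - x.mean()) / std

def dtw_distance(a, b):
    """Textbook O(len(a)*len(b)) dynamic time warping distance."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

s1 = z_normalize([1, 3, 4, 9, 8, 2])
s2 = z_normalize([2, 6, 8, 18, 16, 4])   # same shape, different scale
print(dtw_distance(s1, s2))              # ~0: z-normalization removes scale
```

Because z-normalization removes offset and scale, the two toy subsequences compare as identical shapes.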

  3.
    Academic Journal

    Authors: Aleksandr Y. Dubinin

    Source: Безопасность информационных технологий, Vol 31, Iss 1, Pp 120-134 (2024)

    Description: At the current stage of information technology development, a key trend is the use of machine learning algorithms in the information security field. The relevance of this research is determined by the importance of identifying vulnerabilities and by the potential of machine learning to enhance modern cybersecurity management software. Moreover, there is evident demand for security mechanisms for critical infrastructure organizations, including in the context of import substitution needs. Our study examines a dataset built from firewall records of remote connections between users and an organization's VPN server; it therefore seems reasonable to explore extending firewall functionality with a User and Entity Behavior Analytics (UEBA) module. The research goal is to analyze methods of detecting atypical user behavior, with the prospect of using the developed model in a behavioral analysis module integrated into the firewall. The primary material for the study is information collected from the firewall about the start and end events of users' remote sessions to the organization's VPN server. The main research methods include analysis of existing theoretical sources and practical recommendations, data preprocessing, dimensionality reduction, the use of the Isolation Forest model as an anomaly detection method, and tuning of its hyperparameters. Particular attention is given to promising features for use in the proposed model. The theoretical significance of the research lies in the potential development of employee profiling for building an organizational information security management system. From a practical standpoint, the article is relevant for professionals in the information security and machine learning fields. (An Isolation Forest sketch follows this record.)

    File description: electronic resource
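The anomaly detector named in this record is Isolation Forest. A minimal scikit-learn sketch on hypothetical session features (duration and start hour are assumptions for illustration, not the paper's feature set):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)

# Hypothetical session features: [duration_hours, start_hour_of_day].
# Real features would come from firewall VPN session start/end events.
normal = np.column_stack([rng.normal(8, 1, 200), rng.normal(9, 1, 200)])
odd    = np.array([[23.0, 3.0]])          # long session starting at 3 a.m.
X = np.vstack([normal, odd])

# contamination is a tunable hyperparameter (expected anomaly fraction).
model = IsolationForest(n_estimators=100, contamination=0.01, random_state=0)
labels = model.fit_predict(X)             # -1 = anomaly, 1 = normal
print(np.where(labels == -1)[0])          # the odd session should appear
```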

  4.
    Academic Journal

    Source: Healthcare Analytics, Vol 5, 100324 (2024)

    Description: Cervical cancer is a significant public health concern among females worldwide. Despite being preventable, it remains a leading cause of mortality, and early detection is crucial for successful treatment and improved survival rates. This study proposes an ensemble Machine Learning (ML) classifier for efficient and accurate identification of cervical cancer from medical data. The methodology involves preparing two datasets with effective preprocessing techniques, extracting essential features using the scikit-learn package, and building an ensemble classifier that combines Random Forest, Support Vector Machine, Gaussian Naïve Bayes, and Decision Tree classifiers. Compared against several state-of-the-art ML techniques, including support vector machine, decision tree, random forest, Naïve Bayes, logistic regression, CatBoost, and AdaBoost, the proposed ensemble classifier performs significantly better, achieving accuracies of 98.06% and 95.45% on Dataset 1 and Dataset 2, respectively, improvements of 1.50% and 6.67% over the best existing methods. The study uses five-fold cross-validation to analyze the benefits and drawbacks of the methodology for predicting cervical cancer from medical data. Receiver Operating Characteristic (ROC) curves give Area Under the Curve (AUC) values of 0.95 for Dataset 1 and 0.97 for Dataset 2, indicating strong discrimination between classes. Additionally, SHapley Additive exPlanations (SHAP) is employed as an Explainable Artificial Intelligence (XAI) technique to visualize classifier behavior and highlight the features that contribute most to cervical cancer identification. The results demonstrate that the proposed ensemble classifier can efficiently and accurately identify cervical cancer and can potentially improve diagnosis and treatment. (A soft-voting ensemble sketch follows this record.)

    File description: electronic resource
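A minimal sketch of a soft-voting ensemble combining the four classifier types named above, evaluated with five-fold cross-validation; scikit-learn's built-in breast cancer dataset stands in for the study's cervical cancer data:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Stand-in dataset; the study used two cervical cancer datasets.
X, y = load_breast_cancer(return_X_y=True)

ensemble = VotingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(random_state=0)),
        ("svm", make_pipeline(StandardScaler(), SVC(probability=True))),
        ("gnb", GaussianNB()),
        ("dt", DecisionTreeClassifier(random_state=0)),
    ],
    voting="soft",  # average predicted class probabilities
)

# Five-fold cross-validation, mirroring the evaluation described above.
scores = cross_val_score(ensemble, X, y, cv=5, scoring="accuracy")
print(f"mean accuracy: {scores.mean():.3f}")
```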

  5.
    Academic Journal

    Authors: Jun Sun, Yinglin Xia

    Source: Genes and Diseases, Vol 11, Iss 3, 100979 (2024)

    Description: Metabolomics, as a research field and a set of techniques, studies the complete set of small molecules in biological samples. It is emerging as a powerful tool for precision medicine; in particular, integration of the microbiome and metabolome has revealed mechanisms and functions of the microbiome in human health and disease. However, metabolomics data are very complicated, and preprocessing, pretreatment and normalization are usually required before statistical analysis. In this review article, we comprehensively review the methods used to preprocess and pretreat metabolomics data, including MS-based and NMR-based data preprocessing, handling of zero and/or missing values, outlier detection, data normalization, data centering and scaling, and data transformation, and we discuss the advantages and limitations of each method. The choice of a suitable preprocessing method is determined by the biological hypothesis, the characteristics of the dataset, and the selected statistical analysis method. We then provide perspectives on their applications in microbiome and metabolome research. (A centering-and-scaling sketch follows this record.)

    File description: electronic resource
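A minimal sketch of three preprocessing steps reviewed in this record: zero/missing-value handling, log transformation, and centering with autoscaling. The half-minimum imputation rule is one common convention, not a recommendation taken from the article:

```python
import numpy as np

# Toy feature matrix: rows = samples, columns = metabolites.
X = np.array([
    [120.0,  3.1, 0.0],
    [ 95.0,  2.8, 4.2],
    [140.0,  0.0, 3.9],
])

# Treat zeros as missing and impute with half the feature's nonzero minimum
# (one common convention; many alternatives exist).
Xi = X.copy()
for j in range(Xi.shape[1]):
    col = Xi[:, j]
    col[col == 0] = col[col > 0].min() / 2

X_log = np.log2(Xi)                                    # transformation
X_centered = X_log - X_log.mean(axis=0)                # mean centering
X_auto = X_centered / X_centered.std(axis=0, ddof=1)   # autoscaling
print(X_auto.round(2))
```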

  6.
    Academic Journal

    Source: Metabolites, 13(8)

    Description: Large-scale metabolomics assays are widely used in epidemiology for biomarker discovery and risk assessment. However, systematic errors introduced by instrumental signal drift pose a major challenge in large-scale assays, especially for derivatization-based gas chromatography-mass spectrometry (GC-MS). Here, we compare different normalization methods for a study of more than 4,000 human plasma samples from a type 2 diabetes cohort, in addition to 413 pooled quality control (QC) samples, 413 commercial pooled plasma (BioIVT) samples, and a set of 25 stable isotope-labeled internal standards used for every sample. Data acquisition spanned 1.2 years and included seven column changes. The 413 pooled QC samples were used for training and the 413 BioIVT samples for validating the normalization comparisons. Surprisingly, neither internal-standard nor sum-based normalization achieved a median precision below 30% across all 563 metabolite annotations. While the machine-learning-based SERRF algorithm gave 19% median precision based on the pooled QC samples, external cross-validation with the BioIVT plasma pools yielded a median relative standard deviation (RSD) of 34%. We developed a new method, systematic error reduction by denoising autoencoder (SERDA), which lowered the median RSD of the training QC samples to 16% and yielded an overall error of 19% RSD when applied to the independent BioIVT validation samples. This is the largest GC-MS metabolomics study reported to date, demonstrating that technical errors in this assay can be normalized and handled effectively. SERDA was further validated on two additional large-scale GC-MS-based human plasma metabolomics studies, confirming its superior performance over SERRF and sum normalization. (A QC-based RSD sketch follows this record.)

    File description: application/pdf
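SERDA itself is a denoising autoencoder and is not reproduced here; the sketch below only shows the evaluation metric used throughout this record, the median relative standard deviation of repeated QC injections, on simulated drifting data with a stand-in correction:

```python
import numpy as np

def median_rsd(qc_matrix):
    """Median relative standard deviation (%) across metabolites,
    computed on repeated injections of a pooled QC sample.
    Rows = QC injections, columns = metabolites."""
    means = qc_matrix.mean(axis=0)
    sds = qc_matrix.std(axis=0, ddof=1)
    return float(np.median(100.0 * sds / means))

rng = np.random.default_rng(1)
# Simulated QC intensities with slow signal drift plus noise (illustrative).
drift = np.linspace(1.0, 1.5, 50)[:, None]
qc_raw = drift * rng.normal(1000, 50, size=(50, 20))
qc_corrected = qc_raw / drift          # stand-in for a real drift correction

print(f"raw:       {median_rsd(qc_raw):.1f}% RSD")
print(f"corrected: {median_rsd(qc_corrected):.1f}% RSD")
```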

  7.
    Academic Journal

    Authors: Ma Dongliang, Li Yi, Zhou Tao, Huang Yanping

    Source: Nuclear Engineering and Technology, Vol 55, Iss 11, Pp 4102-4111 (2023)

    Description: To better support thermal-hydraulic calculation and analysis of supercritical water reactors, a support vector machine (SVM) model was trained on experimental supercritical water data to predict the heat transfer coefficient of supercritical water. Changes in prediction accuracy were analyzed with respect to the regularization penalty parameter C, the slack variable epsilon, and the Gaussian kernel function parameter gamma. Predictions of the SVM model obtained after parameter optimization were verified against experimental test data. The results show that data normalization has a great influence on the prediction results; the slack variable has a relatively small influence on the accuracy of the predicted heat transfer coefficient, while changes in gamma have the greatest impact on accuracy. Compared with traditional empirical formula methods, the trained SVM model has smaller average error and standard deviation. Using the trained SVM model, the heat transfer coefficient of supercritical water can be effectively predicted and analyzed. (An SVR tuning sketch follows this record.)

    File description: electronic resource
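A minimal sketch of the tuning loop described above: normalized inputs feeding a Gaussian-kernel support vector regressor with C, epsilon, and gamma selected by grid search. The synthetic data and the parameter grid are illustrative assumptions, not the paper's values:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

rng = np.random.default_rng(2)
# Synthetic stand-ins for inputs such as pressure, mass flux, and heat flux;
# y stands in for the measured heat transfer coefficient.
X = rng.uniform(0, 1, size=(200, 3))
y = 5.0 * X[:, 0] + np.sin(4 * X[:, 1]) + 0.1 * rng.normal(size=200)

# Normalization (StandardScaler) feeds a Gaussian-kernel SVR; C, epsilon,
# and gamma are the three hyperparameters discussed in the abstract.
pipe = make_pipeline(StandardScaler(), SVR(kernel="rbf"))
grid = GridSearchCV(pipe, {
    "svr__C": [1, 10, 100],
    "svr__epsilon": [0.01, 0.1],
    "svr__gamma": ["scale", 0.1, 1.0],
}, cv=5)
grid.fit(X, y)
print(grid.best_params_, f"R^2 = {grid.best_score_:.3f}")
```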

  8.
    Academic Journal

    Source: mSystems, Vol 9, Iss 4 (2024)

    Description: Functional genomics techniques, such as transposon insertion sequencing and RNA-sequencing, are key to studying relative differences in bacterial mutant fitness or gene expression under selective conditions. However, certain stress conditions, mutations, or antibiotics can directly interfere with DNA synthesis, resulting in systematic changes in local DNA copy number along the chromosome. This can lead to artifacts in sequencing-based functional genomics data when comparing antibiotic treatment to an unstressed control. Further, relative differences in gene-wise read counts may result from alterations in chromosomal replication dynamics rather than from selection or direct gene regulation. We term this artifact "chromosomal location bias" and implement a principled statistical approach to correct it by calculating local normalization factors along the chromosome. These normalization factors are incorporated directly into statistical analyses using standard RNA-sequencing analysis methods without modifying the read counts themselves, preserving important information about the mean-variance relationship in the data. We illustrate the utility of this approach by generating and analyzing a ciprofloxacin-treated transposon insertion sequencing dataset in Escherichia coli as a case study. We show that ciprofloxacin treatment generates chromosomal location bias in the resulting data, and we further demonstrate that failing to correct for this bias leads to false predictions of mutant drug sensitivity as measured by minimum inhibitory concentrations. We have developed an R package and user-friendly graphical Shiny application, ChromoCorrect, that detects and corrects for chromosomal bias in read count data, enabling the application of functional genomics technologies to the study of antibiotic stress. IMPORTANCE: Altered gene dosage due to changes in DNA replication has been observed under a variety of stresses with a variety of experimental techniques. However, the implications of changes in gene dosage for sequencing-based functional genomics assays are rarely considered. We present a statistically principled approach to correcting for the effect of changes in gene dosage, enabling testing for differences in the fitness effects or regulation of individual genes in the presence of confounding differences in DNA copy number. We show that failing to correct for these effects can lead to incorrect predictions of resistance phenotype when applying functional genomics assays to investigate antibiotic stress, and we provide a user-friendly application to detect and correct for changes in DNA copy number. (A local-normalization sketch follows this record.)

    File description: electronic resource
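The authors' implementation is the R package ChromoCorrect; the Python sketch below only illustrates the underlying idea of local normalization factors along the chromosome, here as a sliding-window median (a conceptual assumption, not the package's algorithm):

```python
import numpy as np

def local_normalization_factors(positions, counts, window=500_000):
    """For each gene, derive a normalization factor from the median count
    of genes within +/- window bp on the chromosome, so that systematic
    copy-number trends along the replication axis are absorbed."""
    factors = np.empty(len(counts), dtype=float)
    genome_median = np.median(counts)
    for i, pos in enumerate(positions):
        nearby = counts[np.abs(positions - pos) <= window]
        factors[i] = np.median(nearby) / genome_median
    return factors  # fed to a count model as offsets, not applied to counts

rng = np.random.default_rng(3)
positions = np.sort(rng.integers(0, 4_600_000, size=300))
# Simulated copy-number gradient declining away from position 0 (origin).
dosage = 2.0 - positions / 4_600_000
counts = rng.poisson(100 * dosage)
print(local_normalization_factors(positions, counts)[:5].round(2))
```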

  9.
    Academic Journal

    Source: Frontiers in Ecology and Evolution, Vol 12 (2024)

    Description: Introduction: High-throughput sequencing (HTS) provides an efficient and cost-effective way to generate large amounts of sequence data, providing a very powerful tool to analyze the biodiversity of soil organisms. However, marker-based methods and the resulting datasets come with a range of challenges and disputes, including incomplete reference databases, controversial sequence similarity thresholds for delimiting taxa, and downstream compositional data analysis. Methods: Here, we use HTS data from a soil nematode biodiversity experiment to explore standardized HTS data processing procedures. We compared the taxonomic assignment performance of two main rDNA reference databases (SILVA and PR2). We tested whether the same ecological patterns are detected with Amplicon Sequence Variants (ASVs; 100% similarity) versus classical Operational Taxonomic Units (OTUs; 97% similarity). Further, we tested how different HTS data normalization methods affect the recovery of beta diversity patterns and the identification of differentially abundant taxa. Results: At this time, the SILVA 138 eukaryotic database performed better than the PR2 4.12 database, assigning more reads to family level and providing higher phylogenetic resolution. ASV- and OTU-based alpha and beta diversity of nematodes correlated closely, indicating that OTU-based studies represent useful reference points. For downstream analyses, our results indicate that the loss of data during subsampling in rarefaction-based methods may reduce sensitivity, e.g. underestimate the differences between nematode communities under different treatments, while clr-transformation-based methods may overestimate effects. The Analysis of Compositions of Microbiomes with Bias Correction (ANCOM-BC) approach retains all data and accounts for uneven sampling fractions for each sample, suggesting that it is currently the optimal method for analyzing compositional data. Discussion: Overall, our study highlights the importance of comparing and selecting taxonomic reference databases before data analysis, and provides solid evidence for the similarity and comparability of OTU- and ASV-based nematode studies. Further, the results highlight potential weaknesses of rarefaction-based and clr-transformation-based methods. We recommend that future studies use ASVs and that both taxonomic reference databases and normalization strategies be carefully tested and selected before analyzing the data. (A clr-transform sketch follows this record.)

    File description: electronic resource
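A minimal sketch of the centered log-ratio (clr) transformation discussed in this record; the pseudocount policy for zeros is a common convention, not the study's exact choice:

```python
import numpy as np

def clr(counts, pseudocount=1.0):
    """Centered log-ratio transform: log of each value divided by the
    geometric mean of its row. A pseudocount handles zeros, which are
    ubiquitous in amplicon count tables."""
    x = np.asarray(counts, dtype=float) + pseudocount
    log_x = np.log(x)
    return log_x - log_x.mean(axis=1, keepdims=True)

# Toy OTU/ASV table: rows = samples, columns = taxa.
table = np.array([
    [120,  0, 30, 5],
    [ 80, 10, 60, 0],
])
print(clr(table).round(2))   # each row now sums to ~0
```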

  10.
    Academic Journal

    Source: Biochemistry and Biophysics Reports, Vol 37, 101618 (2024)

    Description: Data normalization is a critical step in RNA-seq data analysis, and several techniques have been proposed for normalizing transcript reads across samples. In this study, differentially expressed genes (DEGs) were generated from TCGA laryngeal cancer data normalized with the TPM, FPKM, and DESeq2 techniques. The results showed that the reported DEGs differed depending on the normalization technique. We suggest intersecting the DEG lists to obtain the top transcripts from the normalized data in support of pathway enrichment analysis. (A TPM/FPKM sketch follows this record.)

    File description: electronic resource
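A minimal sketch of the TPM and FPKM normalizations named in this record, plus the suggested DEG-list intersection; the count matrix, gene lengths, and gene names are illustrative assumptions (DESeq2's median-of-ratios method is omitted):

```python
import numpy as np

def fpkm(counts, lengths_bp):
    """Fragments per kilobase per million mapped fragments:
    scale by library size first, then by gene length."""
    per_million = counts.sum(axis=0) / 1e6          # per-sample scaling
    return counts / per_million / (lengths_bp[:, None] / 1e3)

def tpm(counts, lengths_bp):
    """Transcripts per million: length-normalize first, then scale
    so each sample sums to one million."""
    rpk = counts / (lengths_bp[:, None] / 1e3)
    return rpk / rpk.sum(axis=0) * 1e6

# Toy matrix: rows = genes, columns = samples.
counts = np.array([[500, 400], [100, 250], [80, 60]], dtype=float)
lengths = np.array([2000, 1000, 500], dtype=float)
print(tpm(counts, lengths).round(1))
print(fpkm(counts, lengths).round(1))

# The suggested consensus step: intersect DEG lists from each pipeline
# (gene names here are purely illustrative).
degs_tpm, degs_fpkm, degs_deseq2 = {"TP53", "EGFR"}, {"EGFR", "MYC"}, {"EGFR"}
print(degs_tpm & degs_fpkm & degs_deseq2)   # {'EGFR'}
```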