Showing 1 - 10 of 2,116 results for "data cleaning" (query time: 1.59s)
  1.
    Academic journal

    Source: Global Energy Interconnection, Vol 7, Iss 3, Pp 293-312 (2024)

    Description: Current methodologies for cleaning wind power anomaly data exhibit limited capabilities in identifying abnormal data within extensive datasets and struggle to accommodate the considerable variability and intricacy of wind farm data. Consequently, a method for cleaning wind power anomaly data by combining image processing with community detection algorithms (CWPAD-IPCDA) is proposed. To precisely identify and initially clean anomalous data, wind power curve (WPC) images are converted into graph structures, to which the Louvain community recognition algorithm and graph-theoretic methods are applied for community detection and segmentation. Furthermore, a mathematical morphology operation (MMO) determines the main part of the initially cleaned wind power curve images and maps it back to the normal wind power points to complete the final cleaning. The CWPAD-IPCDA method was applied to clean datasets from 25 wind turbines (WTs) in two wind farms in northwest China to validate its feasibility. A comparison was conducted against the density-based spatial clustering of applications with noise (DBSCAN) algorithm, an improved isolation forest algorithm, and an image-based (IB) algorithm. The experimental results demonstrate that the CWPAD-IPCDA method surpasses the other three algorithms, achieving an approximately 7.23% higher average data cleaning rate. The mean sum of squared errors (SSE) of the dataset after cleaning is approximately 6.887 lower than that of the other algorithms. Moreover, the mean overall accuracy, as measured by the F1-score, exceeds that of the other methods by approximately 10.49%, indicating that the CWPAD-IPCDA method is more conducive to improving the accuracy and reliability of wind power curve modeling and wind farm power forecasting.

    File description: electronic resource
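The abstract above is about flagging anomalous points around a wind power curve. As a rough illustration of that general idea (not the CWPAD-IPCDA method, which uses image processing and Louvain community detection), the sketch below bins (speed, power) points by wind speed and flags power values far from each bin's median; the function name, bin width, and threshold are invented:

```python
from collections import defaultdict
from statistics import median

def flag_wpc_outliers(points, bin_width=1.0, k=3.0):
    """Flag (speed, power) points far from the bin-wise median power.

    A crude stand-in for power-curve cleaning: bin by wind speed, then
    mark points whose power deviates from the bin median by more than
    k times the bin's median absolute deviation (MAD).
    """
    bins = defaultdict(list)
    for i, (speed, power) in enumerate(points):
        bins[int(speed // bin_width)].append((i, power))
    outliers = set()
    for members in bins.values():
        powers = [p for _, p in members]
        med = median(powers)
        mad = median(abs(p - med) for p in powers) or 1e-9
        for i, p in members:
            if abs(p - med) > k * mad:
                outliers.add(i)
    return outliers
```

A real cleaning pipeline would follow this initial pass with the curve-shape analysis the paper describes.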

  2.
    Academic journal

    Source: BMC Ecology and Evolution, Vol 24, Iss 1, Pp 1-13 (2024)

    Description: Background: Understanding biodiversity patterns is a central topic in biogeography and ecology, and it is essential for conservation planning and policy development. Diversity estimates that consider the evolutionary relationships among species, such as phylogenetic diversity and phylogenetic endemicity indices, provide valuable insights into the functional diversity and evolutionary uniqueness of biological communities. These estimates are crucial for informed decision-making and effective global biodiversity management. However, the current methodologies used to generate these metrics encounter challenges in terms of efficiency, accuracy, and data integration. Results: We introduce PhyloNext, a flexible and data-intensive computational pipeline designed for phylogenetic diversity and endemicity analysis. The pipeline integrates GBIF occurrence data and OpenTree phylogenies with the Biodiverse software. PhyloNext is free, open-source, and provided as Docker and Singularity containers for effortless setup. To enhance accessibility, a user-friendly, web-based graphical user interface has been developed, facilitating easy and efficient navigation for exploring and executing the pipeline. PhyloNext streamlines the process of conducting phylogenetic diversity analyses, improving efficiency, accuracy, and reproducibility. The automated workflow allows for periodic reanalysis using updated input data, ensuring that conservation strategies remain relevant and informed by the latest available data. Conclusions: PhyloNext provides researchers, conservationists, and policymakers with a powerful tool to facilitate a broader understanding of biodiversity patterns, supporting more effective conservation planning and policy development. This new pipeline simplifies the creation of reproducible and easily updatable phylogenetic diversity analyses. Additionally, it promotes increased interoperability and integration with other biodiversity databases and analytical tools.

    File description: electronic resource

  3.
    Academic journal

    Source: Standards, Vol 4, Iss 2, Pp 52-65 (2024)

    Description: The intricate process of planning production, involving product life cycle management and the synthesis of manufacturing information, is crucial for coherence in manufacturing. Manufacturing companies operating in a high-mix, low-volume production environment integrate production planning with management to focus on production processes, emphasizing high-quality, rapid product delivery. This includes material item planning to anticipate future demands and ensure sufficient raw material and finished product quantities, considering purchasing, production, and sales capacities. This study explores the electrotechnical sector, specifically a manufacturing entity specializing in low-voltage plastic cable distribution boxes. It scrutinizes the vital role of seasonal data cleaning in optimizing production planning, with a targeted focus on three products. The implementation of a chase demand strategy is tied to capacity planning, taking into account changes in production capacity linked to demand over time. The main difficulty in implementing this strategy is the fluctuating level of quality caused by changes in demand for the specified products.

    File description: electronic resource

  4.
    Academic journal

    Source: Methods in Ecology and Evolution, Vol 15, Iss 5, Pp 816-823 (2024)

    Description: Ecology increasingly relies on a massive volume of biodiversity occurrence records to draw insights into large-scale biogeographical, ecological and evolutionary phenomena. This often involves defining a set of criteria that guides the collection, filtering and standardising of available records. These curation processes are often neither described in detail nor well documented, which hampers the comparability and reproducibility of studies and thus undermines the robustness of any ecological result. Yet, to date, there is no guide providing a friendly way to reason about and document a complete data curation process. We reviewed the available literature on data curation, including tools such as R packages and workflows. From this assessment, we created a complete guide, organised into five modules, that allows users to consciously select and validate occurrence records based on these curation criteria. This is presented in the user-friendly Shiny R application OCCUR, available at https://ecoinformatic.shinyapps.io/OCCUR/. The OCCUR application guides researchers through the trade-off between data certainty and coverage generated at each data curation step. An interactive graph of these changes is provided within each module. OCCUR also produces a custom-made final report including all steps used for data filtering. This report helps to streamline the writing of the methods section of manuscripts and technical reports, thus promoting the reproducibility of data curation processes. This Shiny application provides an interactive overview of data curation methods and their use for handling occurrence records from public repositories. It brings together the taxonomic, temporal and spatial dimensions of the data, as well as the identification of duplicate records. OCCUR can be applied for multiple purposes, such as teaching R to ecologists or enhancing the reproducibility of macroecology and biogeography studies.

    File description: electronic resource
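The stepwise curation this abstract describes (taxonomic, temporal and spatial checks, then deduplication, with the records-kept count tracked at each step) can be sketched in a few lines. This is a hypothetical Python stand-in, not OCCUR's implementation; the field names, year range, and step names are invented:

```python
def curate_records(records):
    """Toy occurrence-record curation: apply filters in sequence and
    log (step name, records before, records after) for each step,
    mirroring the certainty-vs-coverage trade-off report OCCUR produces."""
    report = []

    def step(name, keep, recs):
        kept = [r for r in recs if keep(r)]
        report.append((name, len(recs), len(kept)))
        return kept

    recs = step("has scientific name", lambda r: bool(r.get("species")), records)
    recs = step("valid year",
                lambda r: isinstance(r.get("year"), int) and 1700 <= r["year"] <= 2024,
                recs)
    recs = step("valid coordinates",
                lambda r: -90 <= r.get("lat", 999) <= 90 and -180 <= r.get("lon", 999) <= 180,
                recs)

    # Deduplicate on (taxon, year, coordinates).
    seen, unique = set(), []
    for r in recs:
        key = (r["species"], r["year"], r["lat"], r["lon"])
        if key not in seen:
            seen.add(key)
            unique.append(r)
    report.append(("deduplicate", len(recs), len(unique)))
    return unique, report
```

The returned report is the raw material for the kind of per-step methods summary the abstract mentions.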

  5.
    Thesis

    Contributors: Sutton, Charles; Williams, Chris

    Description: A typical data science or machine learning pipeline starts with data exploration; then data engineering (wrangling, cleaning); then moves towards modelling (model selection, learning, validation); and finally model visualization or deployment. Most of the datasets used in industry are either structured or text based. Two relevant instances of structured datasets are graph data (e.g. knowledge graphs) and tabular data (e.g. spreadsheets, databases). However, image datasets are increasingly used in industry and involve similar pipeline steps. This thesis explores the data cleaning problem, where two of its main steps are outlier detection and subsequent data repair. This work focuses on outliers that result from corruption processes applied to a subset of instances belonging to an originally clean dataset. The remaining instances, unaffected by corruption, are called inliers. The outlier detection step finds which data instances have been corrupted. The repair step either replaces the entire instance with a clean version, or imputes the values of the specific features in that instance that are deemed corrupted. In both cases, an ideal repair process restores the underlying inlier instance as it was before being corrupted. The main goal is to devise machine learning (ML) models that automate both outlier detection and data repair with minimal supervision by the end-user. In particular, we focus on solutions based on variational autoencoders (VAEs), because these are flexible generative models capable of providing repairs as samples or reconstructions. Moreover, the reconstructions provided by VAEs also allow for the detection of corrupted feature values, unlike classic outlier detection methods. Since the training dataset is corrupted by outliers, the key to good performance in detection and repair is model robustness to data corruption, which prevents overfitting to errors. If the model overfits to errors, it becomes difficult to distinguish inliers from outliers, degrading performance. In this thesis, two novel generative models are proposed for this task, to be used in different contexts. The two most common types of errors are of either random or systematic nature. Random errors corrupt each instance independently according to an unknown distribution, exhibiting no clear anomalous pattern across outlier instances. Systematic errors result from nearly deterministic transformations that occur repeatedly in the data, exhibiting a clear pattern across outliers. Overall, this means that high-capacity models like VAEs more easily overfit to systematic errors, which compromises outlier detection and repair performance. This thesis focuses on point outliers, as they are the type most commonly encountered by practitioners. Point outliers are those that can be identified by evaluating each instance individually, without the context of other instances (e.g. space, time, graphs). The first proposed model is a novel unsupervised VAE that is robust to random errors in mixed-type (e.g. categorical, continuous) tabular data, called the Robust Variational Autoencoder (RVAE). Robustness is introduced through a decoder architecture that down-weights the contribution of corrupted feature values (cells) during training. Unlike traditional methods, besides identifying which instances are outliers, the model also indicates which cells have been corrupted, improving interpretability. It is shown experimentally that the model performs better than baselines in cell outlier detection and repair, and is robust to the initial hyper-parameter selection. The second proposed model focuses on detection and repair in datasets corrupted by systematic errors, and is called the Clean Subspace Variational Autoencoder (CLSVAE). The nature of systematic errors makes them easy to learn, and thus easy to overfit to. This means that if they are numerous in a dataset, unsupervised methods will have difficulty distinguishing between inliers and outliers. A novel semi-supervised VAE is therefore proposed that requires only a small labelled set of inliers and outliers, minimizing end-user intervention. The main idea is to learn separate latent representations for inliers and systematic errors, and to use only the inlier representation for data repair. The model is shown to be robust to systematic errors, and it achieves state-of-the-art repair on image datasets. Compared to the baselines, it does better in challenging scenarios where the corruption level is higher or the labelled set is very small.
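The detect-then-repair loop described in this thesis can be illustrated with a deliberately trivial "model": below, each column of a numeric table is reconstructed by its median, a cell is flagged when its reconstruction error is large, and a flagged cell is repaired with the reconstruction. The actual RVAE replaces the median with a learned VAE reconstruction; the function, threshold, and MAD-based scale here are invented for illustration:

```python
from statistics import median

def detect_and_repair_cells(table, k=3.0):
    """Cell-level outlier detection and repair on a numeric table.

    'Reconstruction' is simply the column median; a cell whose value
    deviates from it by more than k times the column's median absolute
    deviation (MAD) is flagged as corrupted and imputed with the
    reconstruction. Returns (repaired table, flagged (row, col) list).
    """
    cols = list(zip(*table))
    meds = [median(c) for c in cols]
    mads = [median(abs(v - m) for v in c) or 1e-9 for c, m in zip(cols, meds)]
    repaired, flagged = [], []
    for i, row in enumerate(table):
        new_row = list(row)
        for j, v in enumerate(row):
            if abs(v - meds[j]) > k * mads[j]:   # detection: large reconstruction error
                flagged.append((i, j))
                new_row[j] = meds[j]             # repair: impute the reconstruction
        repaired.append(new_row)
    return repaired, flagged
```

The cell-level flags mirror the interpretability point made above: the output says not just which rows are outliers, but which cells.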

  6.
    Academic journal

    Source: Applications of Modelling and Simulation, Vol 8, Pp 78-92 (2024)

    Description: Accurate and non-invasive measurement of material thickness plays an important role across several industry sectors, such as aerospace, oil and gas, and rail. This paper uses neural networks as a predictive tool to enhance the accuracy of thickness measurements of immersed steel samples. In this study, a set of training data is obtained by conducting experiments on an immersed wedge sample of varying thickness using the A-scan method. This dataset is used to train a single-layer neural network. To evaluate the performance of the trained network, a set of test data is obtained from different samples of various thicknesses. The study demonstrates a promising methodology for accurate and effective thickness prediction using neural networks. The outcomes showed good agreement when a neural network with the same architecture was employed to predict void locations in another sample of similar material. Furthermore, the results revealed that the method achieved an error of less than 3% for thickness prediction and less than 7% for void detection.

    File description: electronic resource
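A single-layer network of the kind this abstract describes amounts to a linear regressor fitted by gradient descent. The toy below sketches that idea on one input feature (the real study maps A-scan measurements to thickness, with data and hyper-parameters not given here); the function name, learning rate, and epoch count are invented:

```python
def train_single_layer(xs, ys, lr=0.1, epochs=5000):
    """Fit y ≈ w*x + b by batch gradient descent on squared error.

    A single-layer network with one input and a linear output unit;
    returns the learned weight and bias.
    """
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # Gradients of mean squared error w.r.t. w and b.
        grad_w = sum((w * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum((w * x + b - y) for x, y in zip(xs, ys)) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b
```

On noiseless linear data the fit recovers the underlying slope and intercept; real A-scan data would need feature scaling and validation as the paper describes.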

  7.
    Academic journal

    Source: Ecology and Evolution, Vol 14, Iss 5, Pp n/a-n/a (2024)

    Description: Plant trait data are used to quantify how plants respond to environmental factors and can act as indicators of ecosystem function. Measured trait values are influenced by genetics, trade-offs, competition, environmental conditions, and phenology. These interacting effects on traits are poorly characterized across taxa, and for many traits, measurement protocols are not standardized. As a result, ancillary information about growth and measurement conditions can be highly variable, requiring a flexible data structure. In 2007, the TRY initiative was founded as an integrated database of plant trait data, including ancillary attributes relevant to understanding and interpreting the trait values. The TRY database now integrates around 700 original and collective datasets and has become a central resource of plant trait data. These data are provided in a generic long-table format, where a unique identifier links different trait records and ancillary data measured on the same entity. Due to the high number of trait records, plant taxa, and types of traits and ancillary data released from the TRY database, data preprocessing is necessary but not straightforward. Here, we present the ‘rtry’ R package, specifically designed to support plant trait data exploration and filtering. By integrating a subset of existing R functions essential for preprocessing, ‘rtry’ avoids the need for users to navigate the extensive R ecosystem and provides the functions under a consistent syntax. ‘rtry’ is therefore easy to use, even for beginners in R. Notably, ‘rtry’ does not support data retrieval or analysis; rather, it focuses on the preprocessing tasks needed to optimize data quality. While ‘rtry’ primarily targets TRY data, its utility extends to data from other sources, such as the National Ecological Observatory Network (NEON). The ‘rtry’ package is available on the Comprehensive R Archive Network (CRAN; https://cran.r-project.org/package=rtry) and the GitHub Wiki (https://github.com/MPI-BGC-Functional-Biogeography/rtry/wiki), along with comprehensive documentation and vignettes describing detailed data preprocessing workflows.

    File description: electronic resource
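The long-table preprocessing this abstract describes (selecting the records for one trait, then removing duplicate records) might look like the outline below. This is a hypothetical Python analogue, not rtry's API (rtry is an R package), and the column names are invented:

```python
def preprocess_long_table(rows, trait, drop_duplicates=True):
    """Select records of one trait from a long-format table and
    optionally drop exact duplicate measurements.

    Each row is a dict; 'ObservationID', 'TraitName' and 'Value' are
    illustrative stand-ins for the identifier columns a long-table
    trait database uses to link records on the same entity.
    """
    selected = [r for r in rows if r["TraitName"] == trait]
    if drop_duplicates:
        seen, out = set(), []
        for r in selected:
            key = (r["ObservationID"], r["TraitName"], r["Value"])
            if key not in seen:
                seen.add(key)
                out.append(r)
        selected = out
    return selected
```

Keeping each step a separate, named operation is what makes such a workflow easy to document and rerun, which is the point the abstract makes about preprocessing under a consistent syntax.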

  8.
    Book

    Authors: Andersen, Torben Juul

    Source: A Study of Risky Business Outcomes: Adapting to Strategic Disruption

  9.
    Academic journal

    Source: Zhihui kongzhi yu fangzhen, Vol 45, Iss 6, Pp 102-111 (2023)

    Description: To address several "pain points" in data access for wargame systems, a bus-based real-time data collection and management platform for wargaming is designed, drawing on the experience and lessons of domestic and foreign wargame systems. A distributed storage platform is introduced as the underlying storage foundation, and an in-memory database serves as the external service interface. Bus-based data collection is used as the collection source, and data cleaning and branch management based on data segments are performed. The system as a whole is divided into three modules: a collection module, a management module, and a service module. It realizes real-time collection, cleaning, storage, and management of wargame data. Practical application shows that the platform provides high-speed and reliable data access support for the other modules in the wargame system.

    File description: electronic resource

  10.
    Academic journal

    Authors: Yuting Liu

    Source: Frontiers in Energy Research, Vol 12 (2024)

    Description: The lean management and reliable interaction of massive data resources are important components of a low-carbon-oriented new grid. However, with a high proportion of distributed low-carbon resources connected to the grid, issues such as data anomalies, data redundancy, and missing data lead to inefficient resource management and unreliable interaction, affecting the accuracy of power grid decision-making as well as the effectiveness of emission and carbon reduction. This paper therefore proposes a lean resource management and reliable interaction framework for a middle platform based on distributed data governance. On this basis, a distributed data governance approach for lean resource management of the middle platform in the low-carbon new grid is proposed, which realizes anomalous data cleaning and missing data filling. Then, a data storage and traceability method for reliable interaction is proposed, which prevents important data from being illegally tampered with during the interaction process. Simulation results demonstrate that the proposed algorithm significantly enhances efficiency, reliability, and accuracy in anomalous data cleaning and filling, as well as in data traceability.

    File description: electronic resource
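One of the governance steps this abstract mentions is missing data filling. A minimal version of that step, filling gaps in an evenly spaced series by linear interpolation between the nearest known neighbours (not the paper's governance method, which operates on distributed grid data), could look like:

```python
def fill_missing(series):
    """Fill None gaps in an evenly spaced numeric series.

    Interior gaps are linearly interpolated between the nearest known
    values; gaps at either end are filled by repeating the nearest
    known value.
    """
    out = list(series)
    n = len(out)
    i = 0
    while i < n:
        if out[i] is None:
            j = i
            while j < n and out[j] is None:
                j += 1                      # j is the first known index after the gap
            left = out[i - 1] if i > 0 else None
            right = out[j] if j < n else None
            if left is None:
                fill = [right] * (j - i)    # leading gap: repeat first known value
            elif right is None:
                fill = [left] * (j - i)     # trailing gap: repeat last known value
            else:
                step = (right - left) / (j - i + 1)
                fill = [left + step * (k + 1) for k in range(j - i)]
            out[i:j] = fill
            i = j
        else:
            i += 1
    return out
```

Anomalous-value cleaning would run before this step, so that interpolation is anchored only on trusted points.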