Showing 1 - 10 of 489 results for search '"PIPELINE PROCESSING"', query time: 0.85s
  1. 1
    Academic journal

    Source: IEEE Access, Vol 12, Pp 23808-23826 (2024)

    Description: Polar codes have garnered substantial research attention due to their impressive performance characteristics and have found applications in recent technologies, including 5G New Radio (NR) systems, Internet of Things (IoT) communications, and cyber-physical systems that utilize sensor and actuator networks. However, the existing SC decoders suffer from lengthy processing latencies due to their sequential processing steps, thereby restricting the practical applicability of polar codes. To address this latency issue, this paper introduces a Compound Pipeline Processing Unit (CPPU) and its simplified counterpart, a crucial step in realizing tree-level compound pipelining. In contrast to sequential circuitry, the previously described combinational architecture lacks internal storage elements, with the clock period defined by the longest path delay. This strategy conserves hardware resources by avoiding memory usage, but it inevitably decelerates the decoder’s performance. Notably, implementation results underline the efficiency of the proposed CPPU-based SC polar decoder using a fully unrolled encoder and decoder on the targeted platform of a Virtex UltraScale - XCVU190 Field Programmable Gate Array (FPGA), using a parametric approach in the Very High-Speed Integrated Circuit Hardware Description Language (VHDL). The assessment of error-correction performance involves examining various combinations of integral and fractional bits in LLR quantized representations. This approach achieves a throughput of about 2672 Mbps, accompanied by a substantial reduction of 17% in Lookup Table (LUT) usage. Furthermore, the decoder’s speed is enhanced by approximately 17.34% for a code length of 128 bits and LLR quantization of 5 bits.

    File description: electronic resource

  2. 2
    Academic journal

    Authors: Bongwon Jang, In-Chul Yoo, Dongsuk Yook

    Source: IEEE Access, Vol 12, Pp 5477-5489 (2024)

    Description: To accelerate the training speed of massive DNN models on large-scale datasets, distributed training techniques, including data parallelism and model parallelism, have been extensively studied. In particular, pipeline parallelism, which is derived from model parallelism, has been attracting attention. It splits the model parameters across multiple computing nodes and executes multiple mini-batches simultaneously. However, naive pipeline parallelism suffers from the issues of weight inconsistency and delayed gradients, as the model parameters used in the forward and backward passes do not match, causing unstable training and low performance. In this study, we propose a novel pipeline parallelism technique called EA-Pipe to address the weight inconsistency and delayed gradient problems. EA-Pipe applies an elastic averaging method, which has been studied in the context of data parallelism, to pipeline parallelism. The proposed method maintains multiple model replicas to solve the weight inconsistency problem, and synchronizes the model replicas using an elasticity-based moving average method to mitigate the delayed gradient problem. To verify the efficacy of the proposed method, we conducted three image classification experiments on the CIFAR-10/100 and ImageNet datasets. The experimental results show that EA-Pipe not only accelerates training speed but also demonstrates more stable learning behavior compared to existing pipeline parallelism techniques. In particular, in the experiments using the CIFAR-100 and ImageNet datasets, EA-Pipe recorded error rates that were 2.58% and 2.19% lower, respectively, than the baseline pipeline parallelization method.

    File description: electronic resource
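
The abstract above mentions synchronizing model replicas with an elasticity-based moving average. As a rough illustration of that general idea only, the sketch below shows a generic elastic-averaging step in the style used for data parallelism: each replica is pulled toward a shared center, and the center is pulled toward the replicas. It is not EA-Pipe's actual update rule; the function name `elastic_average_step`, the parameter `alpha`, and the use of plain Python lists as weight vectors are all assumptions made for this example.

```python
def elastic_average_step(replicas, center, alpha):
    """Apply one generic elastic-averaging synchronization step.

    replicas: list of weight vectors (lists of floats), one per model replica
    center:   shared center weight vector (list of floats)
    alpha:    elasticity coefficient (hypothetical name), typically small
    """
    # Per-replica difference from the center, computed before any update.
    diffs = [[w - c for w, c in zip(rep, center)] for rep in replicas]

    # Each replica moves a fraction alpha toward the center.
    new_replicas = [
        [w - alpha * d for w, d in zip(rep, diff)]
        for rep, diff in zip(replicas, diffs)
    ]

    # The center moves toward the replicas by the summed differences.
    new_center = [
        c + alpha * sum(diff[j] for diff in diffs)
        for j, c in enumerate(center)
    ]
    return new_replicas, new_center


# Usage: two replicas placed symmetrically around the center.
replicas, center = elastic_average_step(
    [[1.0, 0.0], [3.0, 4.0]], [2.0, 2.0], alpha=0.25
)
```

With symmetric replicas the center stays put while both replicas move a quarter of the way toward it, which shows how the elastic term limits drift between replicas without forcing them to be identical.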

  3. 3
    Academic journal

    Authors: Fedir Smilianets, Oleksii Finogenov

    Source: Informatyka, Automatyka, Pomiary w Gospodarce i Ochronie Środowiska, Vol 14, Iss 1 (2024)

    Description: This paper introduces a novel algorithm for dynamically constructing and traversing Directed Acyclic Graphs (DAGs) in workflow systems, particularly targeting distributed computation and data processing domains. Traditional workflow management systems rely on explicitly defined, rigid DAGs, which can be cumbersome to maintain, especially in response to frequent changes or updates in the system. Our proposed algorithm circumvents the need for explicit DAG construction, instead opting for a dynamic approach that iteratively builds and executes the workflow based on available data and operations. Through a combination of entities like Data Kinds, Operators, and Data Units, the algorithm implicitly forms a DAG, thereby simplifying the process of workflow management. We demonstrate the algorithm’s functionality and assess its performance through a series of tests in a simulated environment. The paper discusses the implications of this approach, especially focusing on cycle avoidance and computational complexity, and suggests future enhancements and potential applications.

    File description: electronic resource
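
To make the idea of an implicitly formed DAG more concrete, here is a minimal sketch, assuming a much simplified model: every operator declares the data kinds it consumes and the single kind it produces, and the scheduler repeatedly fires any operator whose inputs are already available. This is an illustration of the general approach described in the abstract, not the authors' algorithm; `run_workflow`, the tuple layout of `operators`, and the rule that each kind is produced at most once (a crude form of cycle avoidance) are assumptions of this example.

```python
def run_workflow(operators, initial_units):
    """Execute operators as soon as their input data kinds are available.

    operators:     dict mapping operator name -> (input_kinds, output_kind, fn)
    initial_units: dict mapping data kind -> value available at the start
    Returns the final data units and the order in which operators fired.
    """
    units = dict(initial_units)
    order = []
    progressed = True
    while progressed:
        progressed = False
        for name, (inputs, output, fn) in operators.items():
            # Producing each kind at most once keeps the implicit graph acyclic.
            if output in units:
                continue
            if all(kind in units for kind in inputs):
                units[output] = fn(*(units[kind] for kind in inputs))
                order.append(name)
                progressed = True
    return units, order


# Usage: operators are listed out of order; no explicit DAG is declared.
operators = {
    "total": (("clean",), "sum", sum),
    "clean": (("raw",), "clean", lambda xs: [x for x in xs if x is not None]),
}
units, order = run_workflow(operators, {"raw": [1, None, 2]})
```

The execution order falls out of data availability alone: `clean` runs first because only `raw` exists at the start, and `total` runs once `clean` has been produced.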

  4. 4
    Academic journal

    Source: Informatyka, Automatyka, Pomiary w Gospodarce i Ochronie Środowiska; Vol. 14 No. 1 (2024); 115-118 ; Informatyka, Automatyka, Pomiary w Gospodarce i Ochronie Środowiska; Tom 14 Nr 1 (2024); 115-118 ; 2391-6761 ; 2083-0157

    Description: This paper introduces a novel algorithm for dynamically constructing and traversing Directed Acyclic Graphs (DAGs) in workflow systems, particularly targeting distributed computation and data processing domains. Traditional workflow management systems rely on explicitly defined, rigid DAGs, which can be cumbersome to maintain, especially in response to frequent changes or updates in the system. Our proposed algorithm circumvents the need for explicit DAG construction, instead opting for a dynamic approach that iteratively builds and executes the workflow based on available data and operations. Through a combination of entities like Data Kinds, Operators, and Data Units, the algorithm implicitly forms a DAG, thereby simplifying the process of workflow management. We demonstrate the algorithm’s functionality and assess its performance through a series of tests in a simulated environment. The paper discusses the implications of this approach, especially focusing on cycle avoidance and computational complexity, and suggests future enhancements and potential applications. ; W artykule przedstawiono nowy algorytm dynamicznego konstruowania i przejść skierowanych grafów acyklicznych (DAG) w systemach zarządzania przepływem pracy, w szczególności tych ukierunkowanych na domeny obliczeń rozproszonych i przetwarzania danych. Tradycyjne systemy zarządzania przepływem pracy opierają się na jawnie zdefiniowanych, sztywnych grafach DAG, które mogą być uciążliwe w utrzymaniu, zwłaszcza w odpowiedzi na częste zmiany lub aktualizacje systemu. Proponowany algorytm pozwala uniknąć konieczności jawnego konstruowania DAG, zamiast tego wybierając dynamiczne podejście, które iteracyjnie buduje i wykonuje przepływy pracy w oparciu o dostępne dane i operacje. Korzystając z kombinacji jednostek, takich jak typ danych, operator i element danych, algorytm niejawnie buduje DAG, upraszczając w ten sposób proces zarządzania przepływami pracy.
Demonstrujemy funkcjonalność algorytmu i oceniamy jego wydajność za pomocą serii testów w symulowanym ...

    File description: application/pdf

  5. 5
    Academic journal

    Description: Quality data is critically important for research and policy-making. The availability of device location data carrying rich, detailed information on travel patterns has increased significantly in recent years with the proliferation of personal GPS-enabled mobile devices and fleet transponders. However, in its raw form, location data can be inaccurate and contain embedded biases that can skew analyses. This report describes the development of a method to process, clean, and enrich location data. Researchers developed a computational framework for processing large-scale location datasets. Using this framework, several hundred days of location data from the San Francisco Bay Area was (a) cleaned, to identify and discard inaccurate or problematic data, (b) enriched, by filtering and annotating the data, and (c) matched to links on the road network. This framework provides researchers with the capability to build link-level metrics across large-scale geographic regions. Various applications for this enriched data are also discussed in this report (including applications related to corridor planning, freight planning, and disaster and emergency management) along with suggestions for further work.

    File description: application/pdf

  6. 6
    Dissertation/Thesis

    Authors: López Álvarez, David

    Contributors: University/Department: Universitat Politècnica de Catalunya. Departament d'Arquitectura de Computadors

    Thesis advisors: Valero Cortés, Mateo, Llosa, Josep (Llosa Espuny)

    Source: TDX (Tesis Doctorals en Xarxa)

    Description: Els bucles son la part que més temps consumeix en les aplicacions numèriques. El rendiment dels bucles està limitat tant pels recursos oferts per l'arquitectura com per les recurrències del bucle en la computació. Per executar més operacions per cicle, els processadors actuals es dissenyen amb graus creixents de replicació de recursos (tècnica de replicació) para ports de memòria i unitats funcionals. En canvi, el gran cost en termes d'àrea i temps de cicle d'aquesta tècnica limita tenir alts graus de replicació: alts valors en temps de cicle contraresten els guanys deguts al decrement en el nombre de cicles, mentre que alts valors en l'àrea requerida poden portar a configuracions impossibles d'implementar. Una alternativa a la replicació de recursos, és fer los més amples (tècnica que anomenem "widening"), i que ha estat usada en alguns dissenys recents. Amb aquesta tècnica, l'amplitud dels recursos s'amplia, fent una mateixa operació sobre múltiples dades. Per altra banda, alguns microprocessadors escalars de propòsit general han estat implementats amb unitats de coma flotants que implementen la instrucció sumar i multiplicar unificada (tècnica de fusió), el que redueix la latència de la operació combinada, tanmateix com el nombre de recursos utilitzats. A aquest treball s'avaluen un ampli conjunt d'alternatives de disseny de processadors VLIW que combinen les tres tècniques. S'efectua una projecció tecnològica de les noves generacions de processadors per predir les possibles alternatives implementables. Com a conclusió, demostrem que tenint en compte el cost, combinar certs graus de replicació i "widening" als recursos hardware és més efectiu que aplicar únicament replicació. Així mateix, confirmem que fer servir unitats que fusionen multiplicació i suma pot tenir un impacte molt significatiu en l'increment de rendiment en futures arquitectures de processadors a un cost molt raonable.

    Description (translated): Loops are the main time-consuming part of numerical applications. The performance of the loops is limited either by the resources offered by the architecture or by recurrences in the computation. To execute more operations per cycle, current processors are designed with growing degrees of resource replication (replication technique) for memory ports and functional units. However, the high cost in terms of area and cycle time of this technique precludes the use of high degrees of replication. High values for the cycle time may clearly offset any gain in terms of number of execution cycles. High values for the area may lead to an unimplementable configuration. An alternative to resource replication is resource widening (widening technique), which has also been used in some recent designs in which the width of the resources is increased (i.e., a single operation is performed over multiple data). Moreover, several general-purpose superscalar microprocessors have been implemented with multiply-add fused floating point units (fusion technique), which reduces the latency of the combined operation and the number of resources used. In this thesis, we evaluate a broad set of VLIW processor design alternatives that combine the three techniques. We perform a technological projection for the next processor generations in order to foresee the possible implementable alternatives. From this study, we conclude that if the cost is taken into account, combining certain degrees of replication and widening in the hardware resources is more effective than applying only replication. Also, we confirm that multiply-add fused units will have a significant impact in raising the performance of future processor architectures with a reasonable increase in cost.

    File description: application/pdf

  7. 7
    Academic journal

    Authors: Bongwon Jang, Inchul Yoo, Dongsuk Yook

    Source: Applied Sciences, Vol 13, Iss 21, p 11730 (2023)

    Description: Stochastic gradient descent (SGD) is an optimization method typically used in deep learning to train deep neural network (DNN) models. In recent studies for DNN training, pipeline parallelism, a type of model parallelism, is proposed to accelerate SGD training. However, since SGD is inherently sequential, naively implemented pipeline parallelism introduces the weight inconsistency and the delayed gradient problems, resulting in reduced training efficiency. In this study, we propose a novel method called TaylorPipe to alleviate these problems. The proposed method generates multiple model replicas to solve the weight inconsistency problem, and adopts a Taylor expansion-based gradient prediction algorithm to mitigate the delayed gradient problem. We verified the efficiency of the proposed method using the VGG-16 and the ResNet-34 on the CIFAR-10 and CIFAR-100 datasets. The experimental results show that not only the training time is reduced by up to 2.7 times but also the accuracy of TaylorPipe is comparable with that of SGD.

    File description: electronic resource

  8. 8
    Academic journal

    Source: Remote Sensing; Volume 15; Issue 11; Pages: 2885

    Subject geographic: agris

    Description: Advancements in remote sensing technology and very-large-scale integrated circuit (VLSI) have significantly augmented the real-time processing capabilities of spaceborne synthetic aperture radar (SAR), thereby enhancing terrestrial observational capacities. However, the inefficiency of voluminous data storage and transfer inherent in conventional methods has emerged as a technical hindrance, curtailing real-time processing within SAR imaging systems. To address the constraints of a limited storage bandwidth and inefficient data transfer, this study introduces a three-dimensional cross-mapping approach premised on the equal subdivision of sub-matrices utilizing dual-channel DDR3. This method considerably augments storage access bandwidth and achieves equilibrium in two-dimensional data access. Concurrently, an on-chip data transfer approach predicated on a superscalar pipeline buffer is proposed, mitigating pipeline resource wastage, augmenting spatial parallelism, and enhancing data transfer efficiency. Building upon these concepts, a hardware architecture is designed for the efficient storage and transfer of SAR imaging system data, based on the superscalar pipeline. Ultimately, a data storage and transfer engine featuring register addressing access, configurable granularity, and state monitoring functionalities is realized. A comprehensive imaging processing experiment is conducted via a “CPU + FPGA” heterogeneous SAR imaging system. The empirical results reveal that the storage access bandwidth of the proposed superscalar pipeline-based SAR imaging system’s data efficient storage and transfer engine can attain up to 16.6 GB/s in the range direction and 20.0 GB/s in the azimuth direction. These findings underscore that the storage exchange engine boasts superior storage access bandwidth and heightened data storage transfer efficiency.
This considerable enhancement in the processing performance of the entire “CPU + FPGA” heterogeneous SAR imaging system renders it suitable for application within spaceborne SAR ...

    File description: application/pdf

  9. 9
    Academic journal

    Source: IEEE Access, Vol 10, Pp 110444-110458 (2022)

    Description: Fractal compression is a well-known technique that encodes an image by mapping the image into itself, which requires performing a massive and repetitive search. Thus, the encoding time is too long, which is the main problem of the fractal algorithm. To reduce the encoding time, several hardware implementations have been developed. However, they are generally developed for grayscale images, and using them to encode colour images increases the encoding time by at least $3\times$. Therefore, in this paper, a new high-speed hardware architecture is proposed for encoding RGB images in a short time. Unlike the conventional approach of encoding the colour components similarly and individually as a grayscale image, the proposed method encodes two of the colour components by mapping them directly to the most correlated component with a searchless encoding scheme, while the third component is encoded with a search-based scheme. This results in reducing the encoding time and also in increasing the compression rate. The parallel and deep-pipelining approaches have been utilized to improve the processing time significantly. Furthermore, to halve the memory accesses, the image is partitioned in such a way that half of the matching operations utilize the same data fetched for processing the other half of the matching operations. Consequently, the proposed architecture can encode a $1024\times 1024$ RGB image within a minimal time of 12.2 ms, and a compression ratio of 46.5. Accordingly, the proposed architecture is superior to the state-of-the-art architectures.

    File description: electronic resource

  10. 10

    Authors: Garrido, Mario, 1981, Möller, K., Kumm, M.

    Source: IEEE Transactions on Circuits and Systems Part 1. 66(4):1507-1516

    Description: This paper presents the fastest fast Fourier transform (FFT) hardware architectures so far. The architectures are based on a fully parallel implementation of the FFT algorithm. In order to obtain the highest throughput while keeping the resource utilization low, we base our design on making use of advanced shift-and-add techniques to implement the rotators and on selecting the most suitable FFT algorithms for these architectures. Apart from high throughput and resource efficiency, we also guarantee high accuracy in the proposed architectures. For the implementation, we have developed an automatic tool that generates the architectures as a function of the FFT size, input word length and accuracy of the rotations. We provide experimental results covering various FFT sizes, FFT algorithms, and field-programmable gate array boards. These results show that it is possible to break the barrier of 100 GS/s for FFT calculation.

    File description: electronic