Academic Journal

Workload-Dependent Relative Fault Sensitivity and Error Contribution Factor of GPU Onchip Memory Structures

Bibliographic Details
Title: Workload-Dependent Relative Fault Sensitivity and Error Contribution Factor of GPU Onchip Memory Structures
Authors: Shah, Ronak; Choi, Minsu; Jang, Byunghyun
Source: Electrical and Computer Engineering Faculty Research & Creative Works
Publisher Information: Scholars' Mine
Publication Year: 2013
Collection: Missouri University of Science and Technology (Missouri S&T): Scholars' Mine
Subject Terms: Computer Graphics, Computer Simulation, Monte Carlo Methods, Personal Computers, Program Processors, Architectural Vulnerability Factor, Error Correction Codes, Fault Sensitivity, General Purpose Computing on GPU, General-Purpose Computing, Graphics Processing Unit, High Performance Computing, Monte-Carlo Simulations, Computer Graphics Equipment, Electrical and Computer Engineering
Description: GPU (Graphics Processing Unit) is emerging as an efficient and scalable accelerator for data-parallel workloads in various applications ranging from tablet PCs to HPC (High Performance Computing) mainframes. Unlike traditional 3D graphics rendering, general-purpose compute applications demand stringent assurance of reliability. Therefore, single error tolerance schemes such as SECDED (Single Error Correcting Double Error Detecting) code are being rapidly introduced to high-end GPUs targeting high-performance general-purpose computing. However, relative fault sensitivity and error contribution of critical on-chip memory structures such as active mask stack (AMS), register file (REG) and local memory (MEM) are yet to be studied. Also, implications of single error tolerance on various GPGPU (General Purpose computing on GPU) workloads have not been quantitatively analyzed to reveal its relative cost/fault-tolerance efficiency. To address this issue, a novel Monte Carlo simulation framework has been explored in this work to enumerate and analyze well-converged fault injection data. Instead of estimating AVF (Architectural Vulnerability Factor) of each structure individually, we have injected faults to a whole memory (AMS, REG and MEM combined) in a structure-oblivious fashion. Then, we further categorized and analyzed each structure's relative fault sensitivity and error contribution factor. Finally, we have studied implications of single error tolerance on the memory structures by further considering eight different possible ECC profiles. Results show that relative fault sensitivity and error contribution of REG is highest among the considered memory structures; therefore, ECC (Error Correction Code) protection of REG is most critical and cost-effective.
Document Type: text
Language: unknown
Relation: https://scholarsmine.mst.edu/ele_comeng_facwork/3200; https://doi.org/10.1109/SAMOS.2013.6621134
DOI: 10.1109/SAMOS.2013.6621134
Availability: https://doi.org/10.1109/SAMOS.2013.6621134
https://scholarsmine.mst.edu/ele_comeng_facwork/3200
Rights: © 2013 Institute of Electrical and Electronics Engineers (IEEE), All rights reserved.
Accession Number: edsbas.809D9159
Database: BASE
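
Illustrative note: the structure-oblivious fault injection summarized in the description can be sketched roughly as follows. This is a toy model, not the authors' framework. The structure sizes, the per-structure error probabilities, and the function names are all hypothetical. The two metrics are computed under assumed definitions (sensitivity as errors per fault landing in a structure, contribution as a structure's share of all errors), which may differ from the paper's. The eight ECC profiles are assumed to be the 2^3 subsets of {AMS, REG, MEM} that could be protected.

```python
import itertools
import random

# Hypothetical structure sizes in bits (assumptions, not taken from the paper).
SIZES = {"AMS": 1_000, "REG": 60_000, "MEM": 39_000}
# Hypothetical chance that a single-bit fault in a structure corrupts the output.
# A real framework would obtain the outcome by running the workload in a simulator.
P_ERROR = {"AMS": 0.30, "REG": 0.20, "MEM": 0.05}


def locate(bit):
    """Map a bit index in the combined memory to the structure that holds it."""
    for name, size in SIZES.items():
        if bit < size:
            return name
        bit -= size
    raise ValueError("bit index outside the combined memory")


def inject(trials, protected=frozenset(), seed=0):
    """Inject `trials` single-bit faults uniformly over the combined memory."""
    rng = random.Random(seed)
    total_bits = sum(SIZES.values())
    faults = dict.fromkeys(SIZES, 0)
    errors = dict.fromkeys(SIZES, 0)
    for _ in range(trials):
        hit = locate(rng.randrange(total_bits))  # structure-oblivious choice
        faults[hit] += 1
        if hit in protected:
            continue  # assumed: ECC corrects every single-bit fault
        if rng.random() < P_ERROR[hit]:
            errors[hit] += 1
    return faults, errors


def summarize(faults, errors):
    """Return (sensitivity, contribution) per structure under the assumed definitions."""
    total_errors = sum(errors.values())
    sensitivity = {n: errors[n] / faults[n] if faults[n] else 0.0 for n in faults}
    contribution = {n: errors[n] / total_errors if total_errors else 0.0 for n in errors}
    return sensitivity, contribution


def ecc_profiles():
    """Enumerate every subset of structures that could be ECC-protected."""
    names = list(SIZES)
    return [frozenset(c) for r in range(len(names) + 1)
            for c in itertools.combinations(names, r)]
```

Calling `inject(10_000, protected=profile)` for each `profile` in `ecc_profiles()` and passing the result to `summarize` gives one row per profile for comparison. With these made-up inputs the numbers only illustrate the bookkeeping and say nothing about real GPU behavior.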