SCALPEL3: A scalable open-source library for healthcare claims databases

Bibliographic details
Title: SCALPEL3: A scalable open-source library for healthcare claims databases
Authors: Maryan Morel, Fanny Leroy, Emmanuel Bacry, Dinh Phong Nguyen, Stéphane Gaïffas, Dian Sun, Youcef Sebiat
Contributors: CEntre de REcherches en MAthématiques de la DEcision (CEREMADE), Centre National de la Recherche Scientifique (CNRS)-Université Paris Dauphine-PSL, Université Paris sciences et lettres (PSL)-Université Paris sciences et lettres (PSL), Laboratoire de Probabilités, Statistiques et Modélisations (LPSM (UMR_8001)), Université Paris Diderot - Paris 7 (UPD7)-Sorbonne Université (SU)-Centre National de la Recherche Scientifique (CNRS), Centre de Mathématiques Appliquées - Ecole Polytechnique (CMAP), École polytechnique (X)-Centre National de la Recherche Scientifique (CNRS), ANR-19-P3IA-0001,PRAIRIE,PaRis Artificial Intelligence Research InstitutE(2019), Université Paris Dauphine-PSL-Centre National de la Recherche Scientifique (CNRS), Laboratoire de Probabilités, Statistique et Modélisation (LPSM (UMR_8001))
Source: International Journal of Medical Informatics, Elsevier, 2020
Publication year: 2019
Subject terms: FOS: Computer and information sciences, Web analytics, 020205 medical informatics, Databases, Factual, Computer science, [INFO.INFO-OH]Computer Science [cs]/Other [cs.OH], Maintainability, Health Informatics, 02 engineering and technology, computer.software_genre, Denormalization, Computer Science - Computers and Society, 03 medical and health sciences, 0302 clinical medicine, Computers and Society (cs.CY), 0202 electrical engineering, electronic engineering, information engineering, Humans, 030212 general & internal medicine, Data hub, ComputingMilieux_MISCELLANEOUS, Data-flow analysis, Data processing, Database, business.industry, Reproducibility of Results, Pipeline (software), Computer Science - Distributed, Parallel, and Cluster Computing, Scalability, Distributed, Parallel, and Cluster Computing (cs.DC), France, business, computer, Delivery of Health Care
Description: This article introduces SCALPEL3, a scalable open-source framework for studies involving Large Observational Databases (LODs). Its design eases medical observational studies through abstractions that allow concept extraction, high-level cohort manipulation, and production of data formats compatible with machine learning libraries. SCALPEL3 has been used successfully on the SNDS database (see Tuppin et al. (2017)), a very large healthcare claims database that handles the reimbursement of almost all French citizens. SCALPEL3 focuses on scalability, easy interactive analysis, and helpers for data flow analysis to accelerate studies performed on LODs. It consists of three open-source libraries based on Apache Spark. SCALPEL-Flattening denormalizes the LOD (only SNDS for now) by joining tables sequentially into one big table. SCALPEL-Extraction provides fast concept extraction from a big table such as the one produced by SCALPEL-Flattening. Finally, SCALPEL-Analysis allows interactive cohort manipulation, monitoring of cohort flow statistics, and building of datasets to be used with machine learning libraries. The first two provide a Scala API, while the last one provides a Python API that can be used in an interactive environment. Our code is available on GitHub. SCALPEL3 made it possible to successfully extract complex concepts for studies such as Morel et al. (2017), or for studies with 14.5 million patients observed over three years (corresponding to more than 15 billion healthcare events and roughly 15 terabytes of data), in less than 49 minutes on a small 15-node HDFS cluster. SCALPEL3 provides precise interactive control of data processing through legible code, which helps to build fully reproducible studies and leads to improved maintainability and auditability of studies performed on LODs.
ISSN: 1872-8243
1386-5056
Open access: https://explore.openaire.eu/search/publication?articleId=doi_dedup___::5d87f72855ed412de4cf14a27331f8e1
https://pubmed.ncbi.nlm.nih.gov/32485553
Rights: OPEN
Accession number: edsair.doi.dedup.....5d87f72855ed412de4cf14a27331f8e1
Database: OpenAIRE
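
Illustrative note: the description says the flattening step denormalizes the database by joining tables sequentially into one big table, from which concepts are then extracted. The sketch below shows that general idea in plain Python on toy data. It is not SCALPEL3's API or implementation (the actual libraries are built on Apache Spark); the function names `left_join` and `flatten`, the table names, and the column names are all hypothetical and chosen only for illustration.

```python
from functools import reduce


def left_join(left, right, key):
    """Left-join two lists of dict rows on `key`; unmatched left rows are kept."""
    index = {}
    for row in right:
        index.setdefault(row[key], []).append(row)
    joined = []
    for row in left:
        # A left row with no match is emitted once, unchanged.
        for match in index.get(row[key], [None]):
            merged = dict(row)
            if match is not None:
                merged.update(match)
            joined.append(merged)
    return joined


def flatten(central, others):
    """Join each (rows, key) pair in `others` onto `central`, one after another."""
    return reduce(lambda acc, pair: left_join(acc, pair[0], pair[1]), others, central)


# Hypothetical toy tables standing in for a normalized claims schema.
claims = [
    {"claim_id": 1, "patient_id": "A"},
    {"claim_id": 2, "patient_id": "B"},
]
drugs = [
    {"claim_id": 1, "drug_code": "X1"},
    {"claim_id": 1, "drug_code": "X2"},
]
patients = [
    {"patient_id": "A", "birth_year": 1950},
    {"patient_id": "B", "birth_year": 1962},
]

# Sequential joins produce one wide, denormalized table.
flat = flatten(claims, [(drugs, "claim_id"), (patients, "patient_id")])

# A toy "concept extraction": patients with at least one drug dispensation.
exposed = sorted({row["patient_id"] for row in flat if "drug_code" in row})
```

On this toy input the flat table has one row per (claim, drug) pair, plus one row for the claim with no drug, and the extraction step becomes a simple filter over that single table. This is the property that makes a denormalized table convenient for repeated concept extraction.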