Academic Journal

Automatic de-identification of French electronic health records: a cost-effective approach exploiting distant supervision and deep learning models

Bibliographic Details
Title: Automatic de-identification of French electronic health records: a cost-effective approach exploiting distant supervision and deep learning models
Authors: Azzouzi, Mohamed El, Coatrieux, Gouenou, Bellafqira, Reda, Delamarre, Denis, Riou, Christine, Oubenali, Naima, Cabon, Sandie, Cuggia, Marc, Bouzillé, Guillaume
Contributors: Laboratoire Traitement du Signal et de l'Image (LTSI), Université de Rennes (UR)-Institut National de la Santé et de la Recherche Médicale (INSERM), Laboratoire de Traitement de l'Information Medicale (LaTIM), Université de Brest (UBO)-Institut National de la Santé et de la Recherche Médicale (INSERM)-Centre Hospitalier Régional Universitaire de Brest (CHRU Brest)-IMT Atlantique (IMT Atlantique), Institut Mines-Télécom Paris (IMT)-Institut Mines-Télécom Paris (IMT)-Institut Brestois Santé Agro Matière (IBSAM), Université de Brest (UBO), Département Image et Traitement Information (IMT Atlantique - ITI), IMT Atlantique (IMT Atlantique), Institut Mines-Télécom Paris (IMT)-Institut Mines-Télécom Paris (IMT), Centre Hospitalier Universitaire Rennes, The research reported in this study is funded by INSERM (Institut national de la santé et de la recherche médicale). The funding body played no specific role in the conceptualization, design, data collection, analysis, decision to publish, or preparation of the manuscript. Publication costs are funded by DOMASIA Research team.
Source: ISSN: 1472-6947.
Publisher Information: HAL CCSD
BioMed Central
Publication Year: 2024
Collection: Université de Rennes 1: Publications scientifiques (HAL)
Subject Terms: Automatic annotation, Clinical de-identification, Deep learning, Distant supervision, French language, Named entity recognition, Word representations, [SDV.IB]Life Sciences [q-bio]/Bioengineering
Description: International audience; Background: Electronic health records (EHRs) contain valuable information for clinical research; however, the sensitive nature of healthcare data presents security and confidentiality challenges. De-identification is therefore essential to protect personal data in EHRs and comply with government regulations. Named entity recognition (NER) methods have been proposed to remove personal identifiers, with deep learning-based models achieving better performance. However, manual annotation of training data is time-consuming and expensive. The aim of this study was to develop an automatic de-identification pipeline for all kinds of clinical documents based on a distant supervised method to significantly reduce the cost of manual annotations and to facilitate the transfer of the de-identification pipeline to other clinical centers. Methods: We proposed an automated annotation process for French clinical de-identification, exploiting data from the eHOP clinical data warehouse (CDW) of the CHU de Rennes and national knowledge bases, as well as other features. In addition, this paper proposes an assisted data annotation solution using the Prodigy annotation tool. This approach aims to reduce the cost required to create a reference corpus for the evaluation of state-of-the-art NER models. Finally, we evaluated and compared the effectiveness of different NER methods. Results: A French de-identification dataset was developed in this work, based on EHRs provided by the eHOP CDW at Rennes University Hospital, France. The dataset was rich in terms of personal information, and the distribution of entities was quite similar in the training and test datasets. We evaluated a Bi-LSTM + CRF sequence labeling architecture, combined with Flair + FastText word embeddings, on a test set of manually annotated clinical reports. The model outperformed the other tested models with a significant F1 score of 96.96%, demonstrating the effectiveness of our automatic approach for de-identifying sensitive ...
Document Type: article in journal/newspaper
Language: English
Relation: info:eu-repo/semantics/altIdentifier/pmid/38365677; hal-04477654; https://hal.science/hal-04477654; https://hal.science/hal-04477654/document; https://hal.science/hal-04477654/file/s12911-024-02422-5.pdf; PUBMED: 38365677
DOI: 10.1186/s12911-024-02422-5
Availability: https://doi.org/10.1186/s12911-024-02422-5
https://hal.science/hal-04477654
https://hal.science/hal-04477654/document
https://hal.science/hal-04477654/file/s12911-024-02422-5.pdf
Rights: info:eu-repo/semantics/OpenAccess
Accession Number: edsbas.F93B4B9D
Database: BASE