
dc.contributor.author: Gameiro, Rodrigo R.
dc.contributor.author: Woite, Naira L.
dc.contributor.author: Sauer, Christopher M.
dc.contributor.author: Hao, Sicheng
dc.contributor.author: Fernandes, Chrystinne O.
dc.contributor.author: Premo, Anna E.
dc.contributor.author: Teixeira, Alice R.
dc.contributor.author: Resli, Isabelle
dc.contributor.author: Wong, An-Kwok I.
dc.contributor.author: Celi, Leo A.
dc.date.accessioned: 2025-03-03T17:50:57Z
dc.date.available: 2025-03-03T17:50:57Z
dc.date.issued: 2025-02-04
dc.identifier.uri: https://hdl.handle.net/1721.1/158286
dc.description.abstract (en_US):
Background: The deployment of Artificial Intelligence (AI) in healthcare has the potential to transform patient care through improved diagnostics, personalized treatment plans, and more efficient resource management. However, the effectiveness and fairness of AI are critically dependent on the data it learns from. Biased datasets can lead to AI outputs that perpetuate disparities, particularly affecting social minorities and marginalized groups.
Objective: This paper introduces the "Data Artifacts Glossary", a dynamic, open-source framework designed to systematically document and update potential biases in healthcare datasets. The aim is to provide a comprehensive tool that enhances the transparency and accuracy of AI applications in healthcare and contributes to understanding and addressing health inequities.
Methods: Using a methodology inspired by the Delphi method, a diverse team of experts conducted iterative rounds of discussions and literature reviews. The team synthesized insights to develop a comprehensive list of bias categories and designed the glossary's structure. The Data Artifacts Glossary was piloted on the MIMIC-IV dataset to validate its utility and structure.
Results: The Data Artifacts Glossary adopts a collaborative approach modeled on successful open-source projects such as Linux and Python. Hosted on GitHub, it uses robust version control and collaborative features, allowing stakeholders from diverse backgrounds to contribute. Through a rigorous peer review process managed by community members, the glossary ensures the continual refinement and accuracy of its contents. The implementation of the Data Artifacts Glossary with the MIMIC-IV dataset illustrates its utility: it categorizes biases and facilitates their identification and understanding.
Conclusion: The Data Artifacts Glossary serves as a vital resource for enhancing the integrity of AI applications in healthcare by providing a mechanism to recognize and mitigate dataset biases before they impact AI outputs. It not only aids in avoiding bias in model development but also contributes to understanding and addressing the root causes of health disparities.
dc.publisher: BioMed Central (en_US)
dc.relation.isversionof: https://doi.org/10.1186/s12929-024-01106-6 (en_US)
dc.rights: Creative Commons Attribution (en_US)
dc.rights.uri: https://creativecommons.org/licenses/by/4.0/ (en_US)
dc.source: BioMed Central (en_US)
dc.title: The Data Artifacts Glossary: a community-based repository for bias on health datasets (en_US)
dc.type: Article (en_US)
dc.identifier.citation: Gameiro, R.R., Woite, N.L., Sauer, C.M. et al. The Data Artifacts Glossary: a community-based repository for bias on health datasets. J Biomed Sci 32, 14 (2025). (en_US)
dc.contributor.department: Harvard--MIT Program in Health Sciences and Technology. Laboratory for Computational Physiology (en_US)
dc.contributor.department: Massachusetts Institute of Technology. Department of Urban Studies and Planning (en_US)
dc.relation.journal: Journal of Biomedical Science (en_US)
dc.identifier.mitlicense: PUBLISHER_CC
dc.eprint.version: Final published version (en_US)
dc.type.uri: http://purl.org/eprint/type/JournalArticle (en_US)
eprint.status: http://purl.org/eprint/status/PeerReviewed (en_US)
dc.date.updated: 2025-02-13T10:17:41Z
dc.language.rfc3066: en
dc.rights.holder: The Author(s)
dspace.date.submission: 2025-02-13T10:17:41Z
mit.journal.volume: 32 (en_US)
mit.license: PUBLISHER_CC
mit.metadata.status: Authority Work and Publication Information Needed (en_US)

