Impact of Dataset Size on Classification Performance: An Empirical Evaluation in the Medical Domain

Althnian, Alhanoof; AlSaeed, Duaa; Al-Baity, Heyam; Samha, Amani; Dris, Alanoud Bin; Alzakari, Najla; Abou Elwafa, Afnan; Kurdi, Heba A.

dc.contributor.author	Althnian, Alhanoof
dc.contributor.author	AlSaeed, Duaa
dc.contributor.author	Al-Baity, Heyam
dc.contributor.author	Samha, Amani
dc.contributor.author	Dris, Alanoud Bin
dc.contributor.author	Alzakari, Najla
dc.contributor.author	Abou Elwafa, Afnan
dc.contributor.author	Kurdi, Heba A.
dc.date.accessioned	2022-07-15T19:17:10Z
dc.date.available	2021-09-20T14:16:14Z
dc.date.available	2022-07-15T19:17:10Z
dc.date.issued	2021-01-15
dc.identifier.uri	https://hdl.handle.net/1721.1/131330.2
dc.description.abstract	Dataset size is considered a major concern in the medical domain, where lack of data is a common occurrence. This study aims to investigate the impact of dataset size on the overall performance of supervised classification models. We examined the performance of six models widely used in the medical field, namely support vector machine (SVM), neural networks (NN), C4.5 decision tree (DT), random forest (RF), AdaBoost (AB), and naïve Bayes (NB), on eighteen small medical UCI datasets. We further implemented three dataset size reduction scenarios on two large datasets and analyzed the performance of the models when trained on each resulting dataset with respect to accuracy, precision, recall, F-score, specificity, and area under the ROC curve (AUC). Our results indicated that the overall performance of classifiers depends on how well a dataset represents the original distribution rather than on its size. Moreover, we found that the most robust models for limited medical data are AB and NB, followed by SVM, and then RF and NN, while the least robust model is DT. Furthermore, an interesting observation is that a model's robustness to limited data does not necessarily imply that it provides the best performance compared to other models.	en_US
dc.publisher	Multidisciplinary Digital Publishing Institute	en_US
dc.relation.isversionof	http://dx.doi.org/10.3390/app11020796	en_US
dc.rights	Creative Commons Attribution	en_US
dc.rights.uri	https://creativecommons.org/licenses/by/4.0/	en_US
dc.source	Multidisciplinary Digital Publishing Institute	en_US
dc.title	Impact of Dataset Size on Classification Performance: An Empirical Evaluation in the Medical Domain	en_US
dc.type	Article	en_US
dc.identifier.citation	Applied Sciences 11 (2): 796 (2021)	en_US
dc.contributor.department	Massachusetts Institute of Technology. Department of Mechanical Engineering	en_US
dc.identifier.mitlicense	PUBLISHER_CC
dc.eprint.version	Final published version	en_US
dc.type.uri	http://purl.org/eprint/type/JournalArticle	en_US
eprint.status	http://purl.org/eprint/status/PeerReviewed	en_US
dc.date.updated	2021-01-22T15:59:33Z
dspace.date.submission	2021-01-22T15:59:33Z
mit.metadata.status	Publication Information Needed	en_US

Files in this item

Name: applsci-11-00796.pdf
Size: 1.717Mb
Format: Unknown

This item appears in the following Collection(s)

MIT Open Access Articles

Version	Item	Date	Summary
2	1721.1/131330.2*	2022-07-15T19:15:28Z	Metadata changed: Verified or entered author name and department authority metadata.
1	1721.1/131330	2021-09-20T14:16:14Z

*Selected version

DSpace@MIT