Show simple item record

dc.contributor.author | Althnian, Alhanoof
dc.contributor.author | AlSaeed, Duaa
dc.contributor.author | Al-Baity, Heyam
dc.contributor.author | Samha, Amani
dc.contributor.author | Dris, Alanoud Bin
dc.contributor.author | Alzakari, Najla
dc.contributor.author | Abou Elwafa, Afnan
dc.contributor.author | Kurdi, Heba A.
dc.date.accessioned | 2022-07-15T19:17:10Z
dc.date.available | 2021-09-20T14:16:14Z
dc.date.available | 2022-07-15T19:17:10Z
dc.date.issued | 2021-01-15
dc.identifier.uri | https://hdl.handle.net/1721.1/131330.2
dc.description.abstract | Dataset size is considered a major concern in the medical domain, where lack of data is a common occurrence. This study aims to investigate the impact of dataset size on the overall performance of supervised classification models. We examined the performance of six models widely used in the medical field, namely support vector machine (SVM), neural networks (NN), C4.5 decision tree (DT), random forest (RF), AdaBoost (AB), and naïve Bayes (NB), on eighteen small medical UCI datasets. We further implemented three dataset size reduction scenarios on two large datasets and analyzed the performance of the models when trained on each resulting dataset with respect to accuracy, precision, recall, f-score, specificity, and area under the ROC curve (AUC). Our results indicated that the overall performance of classifiers depends on how well a dataset represents the original distribution rather than on its size. Moreover, we found that the most robust models for limited medical data are AB and NB, followed by SVM, and then RF and NN, while the least robust model is DT. Furthermore, an interesting observation is that a machine learning model's robustness to limited data does not necessarily imply that it provides the best performance compared to other models. | en_US
dc.publisher | Multidisciplinary Digital Publishing Institute | en_US
dc.relation.isversionof | http://dx.doi.org/10.3390/app11020796 | en_US
dc.rights | Creative Commons Attribution | en_US
dc.rights.uri | https://creativecommons.org/licenses/by/4.0/ | en_US
dc.source | Multidisciplinary Digital Publishing Institute | en_US
dc.title | Impact of Dataset Size on Classification Performance: An Empirical Evaluation in the Medical Domain | en_US
dc.type | Article | en_US
dc.identifier.citation | Applied Sciences 11 (2): 796 (2021) | en_US
dc.contributor.department | Massachusetts Institute of Technology. Department of Mechanical Engineering | en_US
dc.identifier.mitlicense | PUBLISHER_CC
dc.eprint.version | Final published version | en_US
dc.type.uri | http://purl.org/eprint/type/JournalArticle | en_US
eprint.status | http://purl.org/eprint/status/PeerReviewed | en_US
dc.date.updated | 2021-01-22T15:59:33Z
dspace.date.submission | 2021-01-22T15:59:33Z
mit.metadata.status | Publication Information Needed | en_US

