dc.contributor.author	Tang, Nan
dc.contributor.author	Fan, Ju
dc.contributor.author	Li, Fangyi
dc.contributor.author	Tu, Jianhong
dc.contributor.author	Du, Xiaoyong
dc.contributor.author	Li, Guoliang
dc.contributor.author	Madden, Sam
dc.contributor.author	Ouzzani, Mourad
dc.date.accessioned	2022-07-15T16:13:16Z
dc.date.available	2022-07-15T16:13:16Z
dc.date.issued	2021
dc.identifier.uri	https://hdl.handle.net/1721.1/143770
dc.description.abstract	Can AI help automate human-easy but computer-hard data preparation tasks that burden data scientists, practitioners, and crowd workers? We answer this question by presenting RPT, a denoising autoencoder for tuple-to-X models ("X" could be a tuple, token, label, JSON, and so on). RPT is pre-trained as a tuple-to-tuple model by corrupting the input tuple and then learning to reconstruct the original tuple. It adopts a Transformer-based neural translation architecture consisting of a bidirectional encoder (similar to BERT) and a left-to-right autoregressive decoder (similar to GPT), yielding a generalization of both BERT and GPT. The pre-trained RPT can already support several common data preparation tasks such as data cleaning, auto-completion, and schema matching. Better still, RPT can be fine-tuned on a wide range of data preparation tasks, such as value normalization, data transformation, and data annotation. To complement RPT, we also discuss several appealing techniques, such as collaborative training and few-shot learning for entity resolution, and few-shot learning and NLP question answering for information extraction. In addition, we identify a series of research opportunities to advance the field of data preparation.	en_US
dc.language.iso	en
dc.publisher	VLDB Endowment	en_US
dc.relation.isversionof	10.14778/3457390.3457391	en_US
dc.rights	Creative Commons Attribution-NonCommercial-NoDerivs License	en_US
dc.rights.uri	http://creativecommons.org/licenses/by-nc-nd/4.0/	en_US
dc.source	VLDB Endowment	en_US
dc.title	RPT: relational pre-trained transformer is almost all you need towards democratizing data preparation	en_US
dc.type	Article	en_US
dc.identifier.citation	Tang, Nan, Fan, Ju, Li, Fangyi, Tu, Jianhong, Du, Xiaoyong et al. 2021. "RPT: relational pre-trained transformer is almost all you need towards democratizing data preparation." Proceedings of the VLDB Endowment, 14 (8).
dc.contributor.department	Massachusetts Institute of Technology. Computer Science and Artificial Intelligence Laboratory
dc.relation.journal	Proceedings of the VLDB Endowment	en_US
dc.eprint.version	Final published version	en_US
dc.type.uri	http://purl.org/eprint/type/ConferencePaper	en_US
eprint.status	http://purl.org/eprint/status/NonPeerReviewed	en_US
dc.date.updated	2022-07-15T15:53:25Z
dspace.orderedauthors	Tang, N; Fan, J; Li, F; Tu, J; Du, X; Li, G; Madden, S; Ouzzani, M	en_US
dspace.date.submission	2022-07-15T15:53:26Z
mit.journal.volume	14	en_US
mit.journal.issue	8	en_US
mit.license	PUBLISHER_CC
mit.metadata.status	Authority Work and Publication Information Needed	en_US
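The abstract above describes RPT's tuple-to-tuple denoising objective: a tuple is serialized, corrupted, and fed to a bidirectional encoder, while an autoregressive decoder learns to reconstruct the original tuple. Below is a minimal, hypothetical Python sketch of how such corrupted/original training pairs could be produced; the serialization scheme and the special tokens [A], [V], and [MASK] are illustrative assumptions, not RPT's actual vocabulary, and the encoder-decoder model itself is omitted.

```python
import random

def serialize(row: dict) -> str:
    """Flatten a relational tuple into a token sequence of attribute/value pairs.
    The [A]/[V] markers are assumed delimiters, not RPT's real tokens."""
    return " ".join(f"[A] {attr} [V] {val}" for attr, val in row.items())

def corrupt(row: dict, mask_prob: float = 0.3) -> str:
    """Corrupt a tuple by randomly masking attribute values, then serialize it."""
    noisy = {attr: ("[MASK]" if random.random() < mask_prob else val)
             for attr, val in row.items()}
    return serialize(noisy)

# Illustrative tuple only; in pre-training these pairs would be drawn from large tables.
row = {"name": "Sam Madden", "affiliation": "MIT CSAIL", "field": "databases"}
source = corrupt(row)    # corrupted sequence fed to the bidirectional encoder
target = serialize(row)  # original sequence the autoregressive decoder reconstructs
print("encoder input :", source)
print("decoder target:", target)
```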

