dc.contributor.author | Tang, Nan | |
dc.contributor.author | Fan, Ju | |
dc.contributor.author | Li, Fangyi | |
dc.contributor.author | Tu, Jianhong | |
dc.contributor.author | Du, Xiaoyong | |
dc.contributor.author | Li, Guoliang | |
dc.contributor.author | Madden, Sam | |
dc.contributor.author | Ouzzani, Mourad | |
dc.date.accessioned | 2022-07-15T16:13:16Z | |
dc.date.available | 2022-07-15T16:13:16Z | |
dc.date.issued | 2021 | |
dc.identifier.uri | https://hdl.handle.net/1721.1/143770 | |
dc.description.abstract | Can AI help automate human-easy but computer-hard data preparation tasks that burden data scientists, practitioners, and crowd workers? We answer this question by presenting RPT, a denoising autoencoder for tuple-to-X models ("X" could be tuple, token, label, JSON, and so on). RPT is pre-trained for a tuple-to-tuple model by corrupting the input tuple and then learning a model to reconstruct the original tuple. It adopts a Transformer-based neural translation architecture that consists of a bidirectional encoder (similar to BERT) and a left-to-right autoregressive decoder (similar to GPT), leading to a generalization of both BERT and GPT. The pre-trained RPT can already support several common data preparation tasks such as data cleaning, auto-completion and schema matching. Better still, RPT can be fine-tuned on a wide range of data preparation tasks, such as value normalization, data transformation, data annotation, etc. To complement RPT, we also discuss several appealing techniques such as collaborative training and few-shot learning for entity resolution, and few-shot learning and NLP question-answering for information extraction. In addition, we identify a series of research opportunities to advance the field of data preparation. | en_US |
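For context, the pre-training objective summarized in the abstract (corrupt an input tuple, then train an encoder-decoder Transformer to reconstruct the original) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the [A]/[V] serialization tokens, the cell-level masking rate, and all function names are assumptions made for the example.

```python
# Minimal sketch (not the RPT authors' code) of tuple-to-tuple denoising
# pre-training data: serialize a tuple, corrupt it by masking cell values,
# and pair it with the original serialization as the reconstruction target.
import random

MASK = "[MASK]"

def serialize_tuple(tup):
    """Flatten an {attribute: value} tuple into a token sequence, e.g.
    [A] name [V] Bob Dylan [A] birthplace [V] Duluth ..."""
    tokens = []
    for attr, value in tup.items():
        tokens += ["[A]", attr, "[V]"] + str(value).split()
    return tokens

def corrupt(tup, mask_prob=0.3, rng=random):
    """Randomly replace whole cell values with a single [MASK] token
    (cell-level masking is an illustrative corruption choice)."""
    return {attr: (MASK if rng.random() < mask_prob else value)
            for attr, value in tup.items()}

def make_training_pair(tup, rng=random):
    """Source = corrupted serialization, target = original serialization.
    A Transformer with a bidirectional encoder and a left-to-right
    autoregressive decoder would be trained to map source -> target."""
    source = serialize_tuple(corrupt(tup, rng=rng))
    target = serialize_tuple(tup)
    return source, target

if __name__ == "__main__":
    random.seed(0)
    row = {"name": "Bob Dylan", "birthplace": "Duluth", "occupation": "singer"}
    src, tgt = make_training_pair(row)
    print("source:", " ".join(src))
    print("target:", " ".join(tgt))
```

Fine-tuning for downstream tasks (data cleaning, value normalization, data annotation, and so on) would reuse the same tuple serialization while changing the target sequence to the task output.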
dc.language.iso | en | |
dc.publisher | VLDB Endowment | en_US |
dc.relation.isversionof | 10.14778/3457390.3457391 | en_US |
dc.rights | Creative Commons Attribution-NonCommercial-NoDerivs License | en_US |
dc.rights.uri | http://creativecommons.org/licenses/by-nc-nd/4.0/ | en_US |
dc.source | VLDB Endowment | en_US |
dc.title | RPT: relational pre-trained transformer is almost all you need towards democratizing data preparation | en_US |
dc.type | Article | en_US |
dc.identifier.citation | Tang, Nan, Fan, Ju, Li, Fangyi, Tu, Jianhong, Du, Xiaoyong et al. 2021. "RPT: relational pre-trained transformer is almost all you need towards democratizing data preparation." Proceedings of the VLDB Endowment, 14 (8). | |
dc.contributor.department | Massachusetts Institute of Technology. Computer Science and Artificial Intelligence Laboratory | |
dc.relation.journal | Proceedings of the VLDB Endowment | en_US |
dc.eprint.version | Final published version | en_US |
dc.type.uri | http://purl.org/eprint/type/ConferencePaper | en_US |
eprint.status | http://purl.org/eprint/status/NonPeerReviewed | en_US |
dc.date.updated | 2022-07-15T15:53:25Z | |
dspace.orderedauthors | Tang, N; Fan, J; Li, F; Tu, J; Du, X; Li, G; Madden, S; Ouzzani, M | en_US |
dspace.date.submission | 2022-07-15T15:53:26Z | |
mit.journal.volume | 14 | en_US |
mit.journal.issue | 8 | en_US |
mit.license | PUBLISHER_CC | |
mit.metadata.status | Authority Work and Publication Information Needed | en_US |