RPT: Relational Pre-trained Transformer Is Almost All You Need towards Democratizing Data Preparation
Author(s)
Tang, Nan; Fan, Ju; Li, Fangyi; Tu, Jianhong; Du, Xiaoyong; Li, Guoliang; Madden, Sam; Ouzzani, Mourad
Published version (856.5 KB)
Terms of use
Creative Commons Attribution
Abstract
Can AI help automate human-easy but computer-hard data preparation tasks that burden data scientists, practitioners, and crowd workers? We answer this question by presenting RPT, a denoising autoencoder for tuple-to-X models ("X" could be a tuple, token, label, JSON, and so on). RPT is pre-trained as a tuple-to-tuple model by corrupting the input tuple and then learning to reconstruct the original tuple. It adopts a Transformer-based neural translation architecture consisting of a bidirectional encoder (similar to BERT) and a left-to-right autoregressive decoder (similar to GPT), leading to a generalization of both BERT and GPT. The pre-trained RPT can already support several common data preparation tasks such as data cleaning, auto-completion, and schema matching. Better still, RPT can be fine-tuned on a wide range of data preparation tasks, such as value normalization, data transformation, and data annotation. To complement RPT, we also discuss several appealing techniques, such as collaborative training and few-shot learning for entity resolution, and few-shot learning and NLP question answering for information extraction. In addition, we identify a series of research opportunities to advance the field of data preparation.
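
The sketch below (not the authors' code) illustrates the pre-training recipe described in the abstract: a relational tuple is linearized into a token sequence, some attribute values are corrupted, and a BART-style encoder-decoder (bidirectional encoder, autoregressive decoder) is trained to reconstruct the original tuple. The model checkpoint, the "[ATT]/[VAL]" linearization format, the masking probability, and the example row are illustrative assumptions, not details taken from the paper.

# Minimal sketch of tuple-to-tuple denoising pre-training, under the assumptions stated above.
import random
from transformers import BartForConditionalGeneration, BartTokenizer

tokenizer = BartTokenizer.from_pretrained("facebook/bart-base")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-base")

def linearize(row):
    # Serialize a tuple as "[ATT] attribute [VAL] value ..." (assumed format).
    return " ".join(f"[ATT] {a} [VAL] {v}" for a, v in row.items())

def corrupt(row, p=0.3):
    # Randomly replace attribute values with the mask token; the model must
    # recover them from the remaining attributes.
    return {a: (tokenizer.mask_token if random.random() < p else v) for a, v in row.items()}

row = {"name": "Michael Jordan", "affiliation": "UC Berkeley", "topic": "machine learning"}
inputs = tokenizer(linearize(corrupt(row)), return_tensors="pt")
labels = tokenizer(linearize(row), return_tensors="pt").input_ids

loss = model(**inputs, labels=labels).loss  # reconstruction (denoising) loss
loss.backward()                             # one pre-training step; optimizer omitted

In this setup the corrupted tuple plays the role of the noisy input and the original tuple is the target, so the same loop covers cleaning-style tasks (recover a wrong value) and auto-completion (recover a missing value) once the model is trained at scale.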
Date issued
2021
Department
Massachusetts Institute of Technology. Computer Science and Artificial Intelligence Laboratory
Journal
Proceedings of the VLDB Endowment
Publisher
VLDB Endowment
Citation
Tang, Nan, Fan, Ju, Li, Fangyi, Tu, Jianhong, Du, Xiaoyong et al. 2021. "RPT: relational pre-trained transformer is almost all you need towards democratizing data preparation." Proceedings of the VLDB Endowment, 14 (8).
Version: Final published version