RPT: Relational Pre-trained Transformer Is Almost All You Need towards Democratizing Data Preparation
Author(s)
Tang, Nan; Fan, Ju; Li, Fangyi; Tu, Jianhong; Du, Xiaoyong; Li, Guoliang; Madden, Sam; Ouzzani, Mourad
Published version (856.5 KB)
Terms of use
Creative Commons Attribution
Abstract
Can AI help automate human-easy but computer-hard data preparation tasks that burden data scientists, practitioners, and crowd workers? We answer this question by presenting RPT, a denoising autoencoder for tuple-to-X models ("X" could be a tuple, token, label, JSON, and so on). RPT is pre-trained as a tuple-to-tuple model by corrupting the input tuple and then learning to reconstruct the original tuple. It adopts a Transformer-based neural translation architecture consisting of a bidirectional encoder (similar to BERT) and a left-to-right autoregressive decoder (similar to GPT), leading to a generalization of both BERT and GPT. The pre-trained RPT can already support several common data preparation tasks such as data cleaning, auto-completion, and schema matching. Better still, RPT can be fine-tuned on a wide range of data preparation tasks, such as value normalization, data transformation, and data annotation. To complement RPT, we also discuss several appealing techniques, such as collaborative training and few-shot learning for entity resolution, and few-shot learning and NLP question answering for information extraction. In addition, we identify a series of research opportunities to advance the field of data preparation.
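
The sketch below (not the authors' code) illustrates the pre-training recipe described in the abstract: a relational tuple is linearized into a token sequence, some attribute values are corrupted, and a BART-style encoder-decoder (bidirectional encoder, autoregressive decoder) is trained to reconstruct the original tuple. The model checkpoint, the "[ATT]/[VAL]" linearization format, the masking probability, and the example row are illustrative assumptions, not details taken from the paper.

# Minimal sketch of tuple-to-tuple denoising pre-training, under the assumptions stated above.
import random
from transformers import BartForConditionalGeneration, BartTokenizer

tokenizer = BartTokenizer.from_pretrained("facebook/bart-base")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-base")

def linearize(row):
    # Serialize a tuple as "[ATT] attribute [VAL] value ..." (assumed format).
    return " ".join(f"[ATT] {a} [VAL] {v}" for a, v in row.items())

def corrupt(row, p=0.3):
    # Randomly replace attribute values with the mask token; the model must
    # recover them from the remaining attributes.
    return {a: (tokenizer.mask_token if random.random() < p else v) for a, v in row.items()}

row = {"name": "Michael Jordan", "affiliation": "UC Berkeley", "topic": "machine learning"}
inputs = tokenizer(linearize(corrupt(row)), return_tensors="pt")
labels = tokenizer(linearize(row), return_tensors="pt").input_ids

loss = model(**inputs, labels=labels).loss  # reconstruction (denoising) loss
loss.backward()                             # one pre-training step; optimizer omitted

In this setup the corrupted tuple plays the role of the noisy input and the original tuple is the target, so the same loop covers cleaning-style tasks (recover a wrong value) and auto-completion (recover a missing value) once the model is trained at scale.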
Date issued
2021
Department
Massachusetts Institute of Technology. Computer Science and Artificial Intelligence Laboratory
Journal
Proceedings of the VLDB Endowment
Publisher
VLDB Endowment
Citation
Tang, Nan, Fan, Ju, Li, Fangyi, Tu, Jianhong, Du, Xiaoyong et al. 2021. "RPT: relational pre-trained transformer is almost all you need towards democratizing data preparation." Proceedings of the VLDB Endowment, 14 (8).
Version: Final published version