RPT: relational pre-trained transformer is almost all you need towards democratizing data preparation

Author(s)
Tang, Nan; Fan, Ju; Li, Fangyi; Tu, Jianhong; Du, Xiaoyong; Li, Guoliang; Madden, Sam; Ouzzani, Mourad; ...
Terms of use
Creative Commons Attribution-NonCommercial-NoDerivs License http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract
Can AI help automate human-easy but computer-hard data preparation tasks that burden data scientists, practitioners, and crowd workers? We answer this question by presenting RPT, a denoising autoencoder for tuple-to-X models ("X" could be tuple, token, label, JSON, and so on). RPT is pre-trained for a tuple-to-tuple model by corrupting the input tuple and then learning a model to reconstruct the original tuple. It adopts a Transformer-based neural translation architecture that consists of a bidirectional encoder (similar to BERT) and a left-to-right autoregressive decoder (similar to GPT), leading to a generalization of both BERT and GPT. The pre-trained RPT can already support several common data preparation tasks such as data cleaning, auto-completion, and schema matching. Better still, RPT can be fine-tuned on a wide range of data preparation tasks, such as value normalization, data transformation, data annotation, etc. To complement RPT, we also discuss several appealing techniques such as collaborative training and few-shot learning for entity resolution, and few-shot learning and NLP question answering for information extraction. In addition, we identify a series of research opportunities to advance the field of data preparation.
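The abstract describes a corrupt-and-reconstruct pre-training recipe over serialized tuples, with a bidirectional encoder and autoregressive decoder. Below is a minimal, illustrative sketch of that tuple-to-tuple denoising objective, using a BART-style encoder-decoder from the Hugging Face transformers library as a stand-in for RPT; the tuple serialization format, corruption scheme, and example record here are assumptions for illustration, not the authors' released model or code.

```python
from transformers import BartForConditionalGeneration, BartTokenizer

# BART is used here only as an architectural stand-in: like RPT, it pairs a
# bidirectional encoder with a left-to-right autoregressive decoder.
tokenizer = BartTokenizer.from_pretrained("facebook/bart-base")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-base")

# A relational tuple serialized attribute-by-attribute (hypothetical format).
original = "name: Michael Stonebraker ; affiliation: MIT ; field: databases"
# Corruption step: mask one attribute value (one of many possible schemes).
corrupted = "name: Michael Stonebraker ; affiliation: <mask> ; field: databases"

# Denoising objective: reconstruct the original tuple from the corrupted one.
inputs = tokenizer(corrupted, return_tensors="pt")
labels = tokenizer(original, return_tensors="pt").input_ids
loss = model(**inputs, labels=labels).loss  # training signal
loss.backward()  # an optimizer step would follow in a real training loop

# At inference time, the decoder fills the corrupted cell autoregressively,
# which is how tasks like data cleaning and auto-completion can be framed.
filled = model.generate(inputs.input_ids, max_length=40)
print(tokenizer.decode(filled[0], skip_special_tokens=True))
```

Framing data preparation as sequence denoising is what lets one pre-trained model serve several tasks: error correction masks a suspect value, auto-completion masks a missing one, and fine-tuning adapts the same encoder-decoder to tuple-to-label or tuple-to-JSON outputs.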
Date issued
2021
URI
https://hdl.handle.net/1721.1/143770
Department
Massachusetts Institute of Technology. Computer Science and Artificial Intelligence Laboratory
Journal
Proceedings of the VLDB Endowment
Publisher
VLDB Endowment
Citation
Tang, Nan, Fan, Ju, Li, Fangyi, Tu, Jianhong, Du, Xiaoyong et al. 2021. "RPT: relational pre-trained transformer is almost all you need towards democratizing data preparation." Proceedings of the VLDB Endowment, 14 (8).
Version: Final published version

Collections
  • MIT Open Access Articles
