dc.contributor.author: Chen, Qiaochu
dc.contributor.author: Banerjee, Arko
dc.contributor.author: Demiralp, Çağatay
dc.contributor.author: Durrett, Greg
dc.contributor.author: Dillig, Işıl
dc.date.accessioned: 2023-11-03T20:26:18Z
dc.date.available: 2023-11-03T20:26:18Z
dc.date.issued: 2023-10-16
dc.identifier.issn: 2475-1421
dc.identifier.uri: https://hdl.handle.net/1721.1/152906
dc.description.abstract: Many data extraction tasks of practical relevance require not only syntactic pattern matching but also semantic reasoning about the content of the underlying text. While regular expressions are very well suited for tasks that require only syntactic pattern matching, they fall short for data extraction tasks that involve both a syntactic and semantic component. To address this issue, we introduce semantic regexes, a generalization of regular expressions that facilitates combined syntactic and semantic reasoning about textual data. We also propose a novel learning algorithm that can synthesize semantic regexes from a small number of positive and negative examples. Our proposed learning algorithm uses a combination of neural sketch generation and compositional type-directed synthesis for fast and effective generalization from a small number of examples. We have implemented these ideas in a new tool called Smore and evaluated it on representative data extraction tasks involving several textual datasets. Our evaluation shows that semantic regexes can better support complex data extraction tasks than standard regular expressions and that our learning algorithm significantly outperforms existing tools, including state-of-the-art neural networks and program synthesis tools. [en_US]
dc.publisher: ACM [en_US]
dc.relation.isversionof: https://doi.org/10.1145/3622863 [en_US]
dc.rights: Creative Commons Attribution [en_US]
dc.rights.uri: https://creativecommons.org/licenses/by/4.0/ [en_US]
dc.source: Association for Computing Machinery [en_US]
dc.title: Data Extraction via Semantic Regular Expression Synthesis [en_US]
dc.type: Article [en_US]
dc.identifier.citation: Chen, Qiaochu, Banerjee, Arko, Demiralp, Çağatay, Durrett, Greg and Dillig, Işıl. 2023. "Data Extraction via Semantic Regular Expression Synthesis." Proceedings of the ACM on Programming Languages, 7 (OOPSLA2).
dc.contributor.department: Massachusetts Institute of Technology. Computer Science and Artificial Intelligence Laboratory
dc.relation.journal: Proceedings of the ACM on Programming Languages [en_US]
dc.identifier.mitlicense: PUBLISHER_CC
dc.eprint.version: Final published version [en_US]
dc.type.uri: http://purl.org/eprint/type/JournalArticle [en_US]
eprint.status: http://purl.org/eprint/status/PeerReviewed [en_US]
dc.date.updated: 2023-11-01T07:57:57Z
dc.language.rfc3066: en
dc.rights.holder: The author(s)
dspace.date.submission: 2023-11-01T07:57:57Z
mit.journal.volume: 7 [en_US]
mit.journal.issue: OOPSLA2 [en_US]
mit.license: PUBLISHER_CC
mit.metadata.status: Authority Work and Publication Information Needed [en_US]