| dc.contributor.author | Chen, Qiaochu | |
| dc.contributor.author | Banerjee, Arko | |
| dc.contributor.author | Demiralp, Çağatay | |
| dc.contributor.author | Durrett, Greg | |
| dc.contributor.author | Dillig, Işıl | |
| dc.date.accessioned | 2023-11-03T20:26:18Z | |
| dc.date.available | 2023-11-03T20:26:18Z | |
| dc.date.issued | 2023-10-16 | |
| dc.identifier.issn | 2475-1421 | |
| dc.identifier.uri | https://hdl.handle.net/1721.1/152906 | |
| dc.description.abstract | Many data extraction tasks of practical relevance require not only syntactic pattern matching but also semantic reasoning about the content of the underlying text. While regular expressions are very well suited for tasks that require only syntactic pattern matching, they fall short for data extraction tasks that involve both a syntactic and semantic component. To address this issue, we introduce semantic regexes, a generalization of regular expressions that facilitates combined syntactic and semantic reasoning about textual data. We also propose a novel learning algorithm that can synthesize semantic regexes from a small number of positive and negative examples. Our proposed learning algorithm uses a combination of neural sketch generation and compositional type-directed synthesis for fast and effective generalization from a small number of examples. We have implemented these ideas in a new tool called Smore and evaluated it on representative data extraction tasks involving several textual datasets. Our evaluation shows that semantic regexes can better support complex data extraction tasks than standard regular expressions and that our learning algorithm significantly outperforms existing tools, including state-of-the-art neural networks and program synthesis tools. | en_US |
| dc.publisher | ACM | en_US |
| dc.relation.isversionof | https://doi.org/10.1145/3622863 | en_US |
| dc.rights | Creative Commons Attribution | en_US |
| dc.rights.uri | https://creativecommons.org/licenses/by/4.0/ | en_US |
| dc.source | Association for Computing Machinery | en_US |
| dc.title | Data Extraction via Semantic Regular Expression Synthesis | en_US |
| dc.type | Article | en_US |
| dc.identifier.citation | Chen, Qiaochu, Banerjee, Arko, Demiralp, Çağatay, Durrett, Greg and Dillig, Işıl. 2023. "Data Extraction via Semantic Regular Expression Synthesis." Proceedings of the ACM on Programming Languages, 7 (OOPSLA2). | |
| dc.contributor.department | Massachusetts Institute of Technology. Computer Science and Artificial Intelligence Laboratory | |
| dc.relation.journal | Proceedings of the ACM on Programming Languages | en_US |
| dc.identifier.mitlicense | PUBLISHER_CC | |
| dc.eprint.version | Final published version | en_US |
| dc.type.uri | http://purl.org/eprint/type/JournalArticle | en_US |
| eprint.status | http://purl.org/eprint/status/PeerReviewed | en_US |
| dc.date.updated | 2023-11-01T07:57:57Z | |
| dc.language.rfc3066 | en | |
| dc.rights.holder | The author(s) | |
| dspace.date.submission | 2023-11-01T07:57:57Z | |
| mit.journal.volume | 7 | en_US |
| mit.journal.issue | OOPSLA2 | en_US |
| mit.license | PUBLISHER_CC | |
| mit.metadata.status | Authority Work and Publication Information Needed | en_US |