Show simple item record

dc.contributor.advisorFarias, Vivek F.
dc.contributor.authorCheng, Emily
dc.date.accessioned2025-09-18T14:27:45Z
dc.date.available2025-09-18T14:27:45Z
dc.date.issued2025-05
dc.date.submitted2025-06-23T14:01:36.364Z
dc.identifier.urihttps://hdl.handle.net/1721.1/162695
dc.description.abstractSynthesizing realistic tabular data is crucial for any analytical application, including policy evaluation related to household energy use. However, detailed household-level consumption data, necessary for such evaluation, are scare at fine geographic scales, as public surveys like the U.S. Residential Energy Consumption Survey (RECS) provide too few observations. We address this gap by developing a topology-guided diffusion-based generative model that produces realistic synthetic household data, and our approach handles two key challenges in this setting: (1) mixed continuous and discrete features and (2) strong hierarchical dependencies among variables. To handle categorical features, we build upon recent advancements in discrete diffusion, particularly TabDDPM [1] and TabDiff [2], which discretize the diffusion process through noise transition matrices, effectively extending diffusion methods to discrete tabular domains. To address hierarchical dependence, we include (1) a structure-aware noise schedule that injects noise from the leaves to the root along an approximate Chow–Liu tree constructed from the variables and (ii) a masked self-attention denoiser that aligns with the same graphical structure. Extensive experiments show that our structured diffusion model outperforms the baseline TabDiff on data with tree-like dependencies, due to the inductive bias from our structure-aware noise schedule. On data that only approximately follows a tree, such as the RECS dataset, our model maintains competitive performance, only slightly outperforming standard diffusion methods. These results highlight the potential for future work to further optimize the tradeoff between structural approximation and estimation accuracy and for future work beyond the energy domain.
dc.publisherMassachusetts Institute of Technology
dc.rightsIn Copyright - Educational Use Permitted
dc.rightsCopyright retained by author(s)
dc.rights.urihttps://rightsstatements.org/page/InC-EDU/1.0/
dc.titleA Topology-Guided Diffusion Process for Synthetic Tabular Data Generation
dc.typeThesis
dc.description.degreeM.Eng.
dc.contributor.departmentMassachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
mit.thesis.degreeMaster
thesis.degree.nameMaster of Engineering in Electrical Engineering and Computer Science


Files in this item

Thumbnail

This item appears in the following Collection(s)

Show simple item record