DSpace@MIT, MIT Libraries

A Topology-Guided Diffusion Process for Synthetic Tabular Data Generation

Author(s)
Cheng, Emily
Advisor
Farias, Vivek F.
Terms of use
In Copyright - Educational Use Permitted. Copyright retained by author(s). https://rightsstatements.org/page/InC-EDU/1.0/
Abstract
Synthesizing realistic tabular data is crucial for many analytical applications, including policy evaluation related to household energy use. However, the detailed household-level consumption data needed for such evaluation are scarce at fine geographic scales, because public surveys like the U.S. Residential Energy Consumption Survey (RECS) provide too few observations. We address this gap by developing a topology-guided, diffusion-based generative model that produces realistic synthetic household data. Our approach handles two key challenges in this setting: (1) mixed continuous and discrete features and (2) strong hierarchical dependencies among variables. To handle categorical features, we build upon recent advancements in discrete diffusion, particularly TabDDPM [1] and TabDiff [2], which discretize the diffusion process through noise transition matrices, effectively extending diffusion methods to discrete tabular domains. To address hierarchical dependence, we include (i) a structure-aware noise schedule that injects noise from the leaves to the root along an approximate Chow–Liu tree constructed from the variables and (ii) a masked self-attention denoiser that aligns with the same graphical structure. Extensive experiments show that our structured diffusion model outperforms the baseline TabDiff on data with tree-like dependencies, owing to the inductive bias from our structure-aware noise schedule. On data that only approximately follows a tree, such as the RECS dataset, our model remains competitive, outperforming standard diffusion methods only slightly. These results highlight the potential for future work to further optimize the tradeoff between structural approximation and estimation accuracy and to extend the approach beyond the energy domain.
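The structural ingredients named in the abstract (a Chow–Liu tree over the variables, a leaf-to-root ordering, and an attention mask aligned with the tree) can be illustrated with a small sketch. This is not the thesis's implementation: it builds the classical Chow–Liu tree (a maximum spanning tree over pairwise empirical mutual information) on toy discrete columns rather than the approximate variant the abstract mentions, and the depth-based ordering and the parent/child mask are assumptions about what "leaves to root" and "aligns with the same graphical structure" could mean. All function names and the toy data are hypothetical.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Empirical mutual information (in nats) between two discrete columns."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(
        (c / n) * math.log(c * n / (px[x] * py[y]))
        for (x, y), c in pxy.items()
    )


def chow_liu_tree(columns, root=0):
    """Classical Chow-Liu tree: maximum spanning tree over pairwise MI.

    Grown with Prim's algorithm from `root`. Returns a dict mapping each
    column index to its parent index (the root maps to None).
    """
    d = len(columns)
    mi = [[mutual_information(columns[i], columns[j]) for j in range(d)]
          for i in range(d)]
    parent = {root: None}
    while len(parent) < d:
        # Pick the highest-MI edge from the current tree to a new node.
        _, i, j = max(
            (mi[i][j], i, j)
            for i in parent for j in range(d) if j not in parent
        )
        parent[j] = i
    return parent


def leaf_to_root_order(parent):
    """Order nodes deepest-first (assumption: a proxy for 'leaves to root')."""
    def depth(i):
        return 0 if parent[i] is None else 1 + depth(parent[i])
    return sorted(parent, key=lambda i: (-depth(i), i))


def tree_attention_mask(parent):
    """mask[i][j] is True when i may attend to j: itself, parent, or child.

    Assumption: one simple way to align attention with the tree.
    """
    d = len(parent)
    return [[i == j or parent[i] == j or parent[j] == i for j in range(d)]
            for i in range(d)]


# Toy data (hypothetical): b is a noisy copy of a, c is a noisy copy of b.
a = [0, 0, 0, 0, 1, 1, 1, 1]
b = [0, 0, 0, 1, 1, 1, 1, 1]
c = [0, 0, 1, 1, 1, 1, 1, 1]

parent = chow_liu_tree([a, b, c])      # {0: None, 1: 0, 2: 1}, the chain a - b - c
order = leaf_to_root_order(parent)     # [2, 1, 0]
mask = tree_attention_mask(parent)     # a and c cannot attend to each other
```

In a full model, an ordering like `order` would determine which variables receive noise earliest in the forward process, and a mask like `mask` would be passed to the denoiser's self-attention layers. How the thesis actually parameterizes either step is not stated in the abstract.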
Date issued
2025-05
URI
https://hdl.handle.net/1721.1/162695
Department
Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
Publisher
Massachusetts Institute of Technology

Collections
  • Graduate Theses
