Structuring Heterogeneous Real Estate Market Evidence Using LLMs: A Provenance-Aware Analytical Framework
Author(s)
Hahmann, Luca; Xie, Richard
Advisor
Torous, Walter
Abstract
Real estate market analysis at early analytical stages relies on evidence drawn from heterogeneous public and semi-public sources, including brokerage research, administrative datasets, planning documents, and narrative commentary. These inputs differ systematically in scope, definition, temporal framing, and institutional construction. Market indicators are frequently consumed through static reports and summary tables that obscure provenance, comparability constraints, and evidentiary gaps. As a result, analytical conclusions often depend on implicit assumptions about how market information is constructed and aligned before formal underwriting or quantitative modeling begins. This thesis develops an evidence-centric framework for structuring and inspecting real estate market information prior to inference, using large language models (LLMs) to translate unstructured report artifacts into structured evidence objects. The framework treats market indicators, narrative claims, and source dependencies as constructed analytical objects. Each observation is represented together with explicit metadata describing geographic scope, temporal reference, definitional disclosure, and upstream data dependencies. This representation supports disciplined comparison across sources and makes uncertainty, non-equivalence, and missing context explicit. The methodological contribution consists of a KPI taxonomy, a context-aware data model, and a layered processing architecture. Observations are preserved with their original construction context and aligned only when comparability conditions are satisfied. Uncertainty is encoded through coverage, recency, and dispersion indicators. Visualization functions as an interface for evidentiary inspection, enabling users to navigate indicators, examine parallel representations, and trace reported values to their sources. The framework is demonstrated through an application to U.S. multifamily market reports. 
A single-source case study illustrates how brokerage reports combine headline KPIs, narrative claims, submarket tables, time-series elements, and transaction summaries within a single artifact. A multi-source case study examines contemporaneous reports for the same market and shows how differences in segmentation logic, measurement conventions, and temporal aggregation shape apparent agreement and disagreement. In both cases, structural non-equivalence remains visible as an analytical feature. The results show that explicit representation of evidentiary structure supports clearer interpretation of market claims independent of predictive modeling. The thesis positions LLM-enabled structuring as foundational infrastructure for real estate market analysis and as a prerequisite for downstream quantitative and causal research. Future work may extend the framework to additional data sources, longitudinal analysis of reporting behavior, and institutional deployment across investment workflows.
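As an illustration of the evidence objects described in the abstract, the following is a minimal sketch, not the thesis's actual data model. The class name `EvidenceObject`, its field names, the strict-equality comparability rule, and the sample values in the usage note are all assumptions made for this example; the abstract states only that observations carry geographic scope, temporal reference, definitional disclosure, and upstream dependencies, and that they are aligned only when comparability conditions are satisfied.

```python
from dataclasses import dataclass
from statistics import pstdev
from typing import Optional, Sequence, Tuple


@dataclass(frozen=True)
class EvidenceObject:
    """One reported observation kept together with its construction context.

    Hypothetical structure: field names are illustrative assumptions.
    """
    kpi: str                    # indicator name, e.g. "vacancy_rate"
    value: float                # reported value, in the source's own units
    source: str                 # publishing report or dataset
    geographic_scope: str       # market or submarket as the source defines it
    period: str                 # temporal reference, e.g. "2025-Q3"
    definition: Optional[str]   # disclosed definition; None if undisclosed
    upstream: Tuple[str, ...] = ()  # upstream data dependencies, if stated


def comparable(a: EvidenceObject, b: EvidenceObject) -> bool:
    """Assumed rule: same KPI, scope, and period, with a disclosed,
    identical definition. An undisclosed definition blocks alignment."""
    return (
        a.kpi == b.kpi
        and a.geographic_scope == b.geographic_scope
        and a.period == b.period
        and a.definition is not None
        and a.definition == b.definition
    )


def dispersion(observations: Sequence[EvidenceObject]) -> Optional[float]:
    """Population standard deviation over observations comparable to the
    first one; None when fewer than two observations can be aligned."""
    if not observations:
        return None
    anchor = observations[0]
    values = [o.value for o in observations if comparable(anchor, o)]
    return pstdev(values) if len(values) >= 2 else None
```

Under these assumptions, two reports that give a vacancy rate for the same market and quarter but disclose different definitions (or none) would stay side by side as parallel representations rather than being merged, and `dispersion` would return `None` instead of a misleading spread.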
Date issued
2026-02
Department
Massachusetts Institute of Technology. Center for Real Estate. Program in Real Estate Development.
Publisher
Massachusetts Institute of Technology