Integrating Optimization and Modern Machine Learning:
Theory, Computation, and Healthcare Applications
by
Kimberly M. Villalobos Carballo
B.S. Mathematics, MIT, 2019
B.S. Computer Science, MIT, 2019
Submitted to the Sloan School of Management
in partial fulfillment of the requirements for the degree of
DOCTOR OF PHILOSOPHY IN OPERATIONS RESEARCH
at the
MASSACHUSETTS INSTITUTE OF TECHNOLOGY
May 2024
© 2024 Kimberly M. Villalobos Carballo. This work is licensed under a CC BY-NC-ND 4.0
license.
The author hereby grants to MIT a nonexclusive, worldwide, irrevocable, royalty-free license
to exercise any and all rights under copyright, including to reproduce, preserve, distribute
and publicly display copies of the thesis, or release the thesis under an open-access license.
Authored by: Kimberly M. Villalobos Carballo
Sloan School of Management
May 3, 2024
Certified by: Dimitris Bertsimas
Boeing Leaders for Global Operations Professor of Management
Associate Dean for Business Analytics
Thesis Supervisor
Accepted by: Georgia Perakis
John C Head III Dean (Interim), MIT Sloan School of Management
Professor, Operations Management, Operations Research & Statistics
Co-director, Operations Research Center
2
Integrating Optimization and Modern Machine Learning: Theory,
Computation, and Healthcare Applications
by
Kimberly M. Villalobos Carballo
Submitted to the Sloan School of Management
on May 3, 2024 in partial fulfillment of the requirements for the degree of
DOCTOR OF PHILOSOPHY IN OPERATIONS RESEARCH
ABSTRACT
Optimization and machine learning are two predominant fields for decision-making today.
The increasing availability of data over the past years has facilitated advancements in the
intersection of these two domains, which in turn has led to better decision support tools.
Optimization has significantly enhanced traditional machine learning models by refining
their training methods, and machine learning has improved many optimization algorithms by
enabling better decision-making through accurate predictions.
However, integrating optimization theory with modern machine learning methods, like
neural networks and kernel functions, faces two primary challenges. Firstly, these models
don’t meet the fundamental convexity assumptions of optimization theory. Secondly, these
models are primarily used in tasks with numerous parameters and high-dimensional data,
requiring highly efficient and scalable algorithms. This focus on efficiency limits consideration
for discrete variables and general constraints that are typical in optimization. This thesis
introduces novel algorithms to address these challenges.
The work is divided into four chapters, encompassing rigorous theory, computational
tools, and diverse applications. In Chapter 1, we extend state-of-the-art tools from robust
optimization to non-convex and non-concave settings, allowing us to generate neural networks
that are robust against input perturbations. In Chapter 2, we develop a holistic deep learning
framework that jointly optimizes for neural network robustness, stability and sparsity by
appropriately modifying the loss function. In Chapter 3 we introduce TabText, a flexible
methodology that leverages the power of Large Language Models for patient flow predictions
from tabular data. Lastly, in Chapter 4 we present a data-driven approach for solving
multistage stochastic optimization problems via sparsified kernel methods.
Thesis supervisor: Dimitris Bertsimas
Title: Boeing Leaders for Global Operations Professor of Management
Associate Dean for Business Analytics
3
4
Acknowledgments
I want to sincerely thank Dimitris Bertsimas, my academic advisor, for his unwavering
guidance, support, and encouragement during my five-year PhD journey. Dimitris has shown
me how impactful research can be and how important it is to focus on problems that can
help others. Beyond academics, Dimitris has taught me about leadership, communication,
teamwork, and more – I never leave his office without some invaluable piece of life advice.
As a mentor, he has consistently acknowledged my strengths while also helping me identify
and address my weaknesses to foster improvement. Dimitris, I am extremely grateful to you
for persuading and supporting me in pursuing a career as a professor. If I achieve even a
fraction of what you have accomplished, I will consider it a success. Thank you for caring for
me and inspiring me to become a better researcher, teacher, and person.
I also want to express my gratitude to Dick den Hertog, Vivek Farias, and Swati Gupta
for being part of my thesis committee and supporting me in my faculty job search. I truly
appreciate all your time and advice. Dick, I’m also deeply grateful for your mentorship
as both my professor and collaborator. Your guidance has been incredibly impactful, and
I’ve learned so much from you. I also thank Georgia Perakis for her support and kindness
throughout my PhD; your academic and emotional advice has been exceptional.
I am grateful for all the collaborators I have been lucky to work with throughout my
doctoral studies. Special thanks to Jean Pauphillet, Xavier Boix, Ignacio Fuentes, and Barry
Stein, whom I admire deeply and thank for all their mentorship and support.
At MIT I have had the privilege of meeting people that aren’t just outstanding academics,
5
but also genuine and loving friends. I am profoundly grateful to Irra, for being the best
teammate and friend I could have asked for. The panic attacks when our code broke, our
late-night girl calls, as well as our joint academic recognitions are memories that I will keep
very close to my heart. I’m also immensely grateful to Adrian, Giancarlo, and Shalom for
making the pandemic times bearable and for being such unconditional friends. Thank you Yu,
Cynthia, Leonard, Amine, Moise, for so much love and support over the past years. Thanks
to Manu, Shuvo, El Ghali, Leann, Patricio, Ted, Ryan, Michael, Vassili, and Holy, for gifting
me so many great memories of my PhD.
Thanks to all my friends from home who helped me get here, including Santiago, Juleana,
Colleen, Tomas, Daniel, Marco and Jonathan. Thanks to OLCOMA, UCR, UNA, Education
USA, and all other institutions of Costa Rica that supported my education. Thanks to all
my teachers and mentors. Special thanks to Jairo Villegas, Mariechen Wust, Julio Salazar,
and Ronald Bustamante, who have always believed in my big dreams.
Gracias a todos mis amigos de Costa Rica que me ayudaron a llegar hasta aquí, incluyendo
a Santiago, Juleana, Colleen, Tomas, Daniel, Marco y Jonathan. Gracias a OLCOMA, la
UCR, UNA, Education USA, y todas las demás instituciones que apoyaron mi educación.
Gracias a todos mis profesores y mentores. Un agradecimiento especial a Jairo Villegas,
Mariechen Wust, Julio Salazar y Ronald Bustamante, quienes siempre han creído y apoyado
mis sueños.
To my boyfriend, Sean, thank you for standing by my side all these years. Thank you for
holding my hand on the most difficult days, for staying up late on my long nights, and for
celebrating with me every good moment. I look forward to many more years next to you.
Finally, I am infinitely thankful for my family; to whom I owe everything. Thanks to my
grandparents Fermina and Gregorio for their loving and encouraging words, to my grandma
Ana for always praying and lighting up candles on my stressful days, and to my grandpa Noe,
who left us too early but to this day motivates me to believe in myself. Thanks to my sister
Jessica, who is also my best teacher and my first role model. Thanks to my beautiful nephews
6
Jeremy and Daniel, who inspire me every day. Above all, thanks to my parents Maricruz
Carballo and Felipe Villalobos, who despite all the financial distress gave me the greatest
example of courage and hard work. Only God knows the extent of sacrifices you both made
for me to be here. It is with the most heartfelt gratitude that I dedicate this thesis to you.
Finalmente, estoy infinitamente agradecida por mi familia; a quienes les debo todo lo que
soy. Gracias a mis abuelos Fermina y Gregorio por sus palabras amorosas y alentadoras, a
mi abuela Ana por siempre rezar y encender velitas en días importantes, y a mi abuelo Noé,
quien nos dejó demasiado pronto pero hasta el día de hoy me sigue motivando a creer en
mí misma. Gracias a mi hermana Jéssica, quien también es mi mejor maestra y mi primer
modelo a seguir. Gracias a mis bellos sobrinos Jeremy y Daniel, quienes me inspiran cada
día. Sobre todo, gracias a mis padres Maricruz Carballo y Felipe Villalobos, quienes a pesar
de todas las dificultades financieras me dieron el mejor ejemplo de coraje y esfuerzo. Solo
Dios sabe el alcance de los sacrificios que ambos hicieron para que yo esté aquí. Con el más
sincero agradecimiento, esta tesis se la dedico a ustedes.
7
8
Contents
1 Introduction 19
1.1 Robust, Stable and Sparse Optimization for Neural Networks . . . . . . . . . 21
1.2 Large Language Models for Tabular Data Representations . . . . . . . . . . 23
1.3 Sparse Reproducing Kernels for Stochastic Multistage Optimization with
Covariates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
2 Robust Deep Learning 27
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
2.2 Previous Works on Robust Optimization . . . . . . . . . . . . . . . . . . . . 31
2.3 The Robust Optimization problem . . . . . . . . . . . . . . . . . . . . . . . 34
2.4 Approximate Robust Upper Bound for small ρ . . . . . . . . . . . . . . . . . 37
2.5 Robust Upper bound for the L1 norm and general ρ. . . . . . . . . . . . . . 41
2.6 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
2.6.1 Experimental details . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
2.6.2 UCI data sets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
2.6.3 Vision Data Sets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
2.7 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
3 Holistic Deep Learning 65
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
3.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
9
3.2.1 Robust Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . 67
3.2.2 Sparse Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . 68
3.2.3 Stable Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . 69
3.3 The Holistic Deep Learning Approach . . . . . . . . . . . . . . . . . . . . . . 70
3.3.1 The HDL Framework . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
3.3.2 Robustness . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
3.3.3 Sparsity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
3.3.4 Stability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
3.4 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
3.4.1 UCI Data sets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
3.4.2 Image Data Sets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
3.4.3 Computational Times . . . . . . . . . . . . . . . . . . . . . . . . . . 82
3.4.4 SHAP Values . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
3.4.5 Prescriptive Approach . . . . . . . . . . . . . . . . . . . . . . . . . . 84
3.4.6 Significance Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
3.5 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
4 Large Language Models for Patient Flow Predictions 89
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
4.2 Patient Flow Predictions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
4.2.1 Data and Feature Extraction . . . . . . . . . . . . . . . . . . . . . . . 93
4.2.2 Machine Learning Models . . . . . . . . . . . . . . . . . . . . . . . . 94
4.2.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
4.3 TabText . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
4.3.1 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
4.4 TabText Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
4.4.1 High Performance with Minimal Pre-Processing . . . . . . . . . . . . 106
4.4.2 Enhanced Performance with Contextual Representation . . . . . . . . 108
10
4.4.3 Larger Benefits for Harder Predictions . . . . . . . . . . . . . . . . . 108
4.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
5 Multistage Stochastic Optimization via Kernels 113
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
5.1.1 Related Literature . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
5.1.2 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116
5.2 Problem Setting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118
5.3 Reproducing Kernel Hilbert space formulation for Multistage Optimization . 119
5.4 Sparse Multistage Optimization with Kernels . . . . . . . . . . . . . . . . . . 121
5.4.1 Functional Stochastic Gradient Descent (FSGD) . . . . . . . . . . . . 122
5.4.2 Proximal Projection . . . . . . . . . . . . . . . . . . . . . . . . . . . 124
5.4.3 The Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126
5.5 Convergence Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126
5.6 Complexity Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129
5.7 Computational Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . 130
5.7.1 Inventory Control Problem . . . . . . . . . . . . . . . . . . . . . . . . 133
5.7.2 Shipment Planning . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139
5.8 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141
A Chapter 2 Appendix 143
A.1 Proofs of Lemmas . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143
A.1.1 Proof of Lemma 2.3.3 . . . . . . . . . . . . . . . . . . . . . . . . . . . 143
A.1.2 Proof of Lemma 2.5.1 . . . . . . . . . . . . . . . . . . . . . . . . . . . 144
A.1.3 Proof of Lemma 2.5.2 . . . . . . . . . . . . . . . . . . . . . . . . . . . 145
A.1.4 Proof of Lemma 2.5.3 . . . . . . . . . . . . . . . . . . . . . . . . . . . 145
A.2 Generalized Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147
A.3 Convolutional Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . 154
11
B Chapter 3 Appendix 157
B.1 Results Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 157
C Chapter 4 Appendix 163
C.1 HHC Data Summary Statistics . . . . . . . . . . . . . . . . . . . . . . . . . 163
C.2 Accuracy for each Hospital at HHC . . . . . . . . . . . . . . . . . . . . . . . 163
C.3 Empirical Treatment Effect for HHC . . . . . . . . . . . . . . . . . . . . . . 164
D Chapter 5 Appendix 169
D.1 Reproducing Kernel Hilbert Spaces Overview . . . . . . . . . . . . . . . . . . 169
D.2 Lemmas . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 172
D.3 Main Theorems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 178
D.4 Finding Lower Bounds . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 180
12
List of Figures
2.1 Adversarial loss (cross entropy loss evaluated at adversarial images bounded
in L∞ norm) vs the loss function being minimized. . . . . . . . . . . . . . . . 40
2.2 Average rank for each method across the UCI data sets for adversarial attacks
bounded in L2, L∞ and L1 norm. . . . . . . . . . . . . . . . . . . . . . . . . 52
2.3 Number of UCI data sets for which RUB-L1 improves adversarial accuracy
over PGD-L∞ by a specific percentage. . . . . . . . . . . . . . . . . . . . . 53
2.4 Number of UCI data sets for which aRUB-L∞ improves adversarial accuracy
over PGD-L∞ by a specific percentage. . . . . . . . . . . . . . . . . . . . . . 53
3.1 Evaluation of the different methods depending on the natural accuracy of the
nominal DL approach on the UCI data sets. . . . . . . . . . . . . . . . . . . 79
3.2 Average multi-objective rank. . . . . . . . . . . . . . . . . . . . . . . . . . . 80
3.3 SHAP values on various metrics across different UCI data set categories.
Blue/red indicates that the feature has a positive/negative SHAP value on a
specific category of UCI data set. . . . . . . . . . . . . . . . . . . . . . . . . 83
3.4 Optimal policy tree for maximizing natural accuracy. . . . . . . . . . . . . . 85
3.5 Optimal policy tree for maximizing robustness (ρ = 0.1). . . . . . . . . . . . 85
4.1 Empirical Analysis for Treatment Effect on Length of Stay. All the units in
the control and treatment groups are medicine or cardiology units offering
general level of care. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
13
4.2 End-to-end TabText framework. . . . . . . . . . . . . . . . . . . . . . . . . . 99
4.3 Overview of TabText methodology. . . . . . . . . . . . . . . . . . . . . . . . 104
4.4 Boxplots for the out-of-sample AUCs across 10 random train-validation splits
using Tabular vs. TabText models. . . . . . . . . . . . . . . . . . . . . . . . 109
4.5 TabText AUC improvement over the standard Tabular approach at varying
data sizes. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
5.1 Expected loss and computational time for varying number of periods. . . . . 136
5.2 Expected loss and computational time for varying data sizes. . . . . . . . . . 137
5.3 Expected loss and computational time for varying data dimensions. . . . . . 138
5.4 Expected loss and computational time for varying control dimensions. . . . . 139
14
List of Tables
2.1 Percentage of times when the aRUB approach yields an upper bound of the
adversarial loss with respect to PGD attacks. . . . . . . . . . . . . . . . . . . 39
2.2 Average number of batches processed per second across the 46 UCI data sets,
as well as the corresponding standard deviations. . . . . . . . . . . . . . . . 54
2.3 Adversarial Accuracy (%) for Fashion MNIST. . . . . . . . . . . . . . . . . . 56
2.4 Adversarial Accuracy (%) for MNIST. . . . . . . . . . . . . . . . . . . . . . . 57
2.5 Adversarial Accuracy (%) for CIFAR. . . . . . . . . . . . . . . . . . . . . . . 58
2.6 Average number of batches processed per second across the 3 vision data sets,
as well as the corresponding standard deviations. . . . . . . . . . . . . . . . 59
2.7 Fashion MNIST: Lower bound of adversarial accuracy with uncertainty bounded
in L1 norm by ρ. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
2.8 MNIST: Lower bound of adversarial accuracy with uncertainty bounded in L1
norm by ρ. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
2.9 CIFAR: Lower bound of adversarial accuracy with uncertainty bounded in L1
norm by ρ. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
3.1 Loss functions used for DL and all methods in the HDL framework. . . . . . 76
3.2 Results for the Fashion-MNIST data set. For each method, the parameters
with the highest average rank in the validation set were chosen. . . . . . . . 81
3.3 Results for the MNIST data set. For each method, the parameters with the
highest average rank in the validation set were chosen. . . . . . . . . . . . . 81
15
3.4 Results for the CIFAR10 data set. For each method, the parameters with the
highest average rank in the validation set were chosen. . . . . . . . . . . . . 81
3.5 Average slowdown factors of computational time with respect to the nominal
DL method. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
3.6 Performance of prescription trees on the testing set. . . . . . . . . . . . . . . 86
3.7 Significance results for HDL improvement over standard DL approach. . . . 87
4.1 Summary of tabular data, which contains different aspects of a patient’s ad-
mission stay from patient’s high-level demographics to precise lab measurements.100
4.2 Data sizes (number of patient days) for training and testing sets across the
nine healthcare classification tasks. . . . . . . . . . . . . . . . . . . . . . . . 105
4.3 Out-of-sample average AUCs achieved by baseline TabText models with mini-
mally processed data and across 10 random train-validation splits. . . . . . . 107
5.1 Average out-of-sample (OOS) loss and total computation time for inventory
problem with T = q = r = 10. . . . . . . . . . . . . . . . . . . . . . . . . . . 134
5.2 Out-of-sample profit for the shipment planning problem. . . . . . . . . . . . 140
5.3 Total computation time (seconds) for solving one instance of the shipment
planning problem. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141
A.1 Adversarial Accuracy for CIFAR with CNN architecture and PGD-L2 attacks. 155
A.2 Adversarial Accuracy for Fashion MNIST with CNN architecture and PGD-L2
attacks. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 155
A.3 Adversarial Accuracy for MNIST with CNN architecture and PGD-L2 attacks. 156
B.1 Natural accuracy results for all UCI and vision data sets, where n denotes the
data size, p denotes the data dimension, and k denotes the number of classes. 158
B.2 Adversarial accuracy results for all UCI and vision data sets, where n denotes
the data size, p denotes the data dimension, and k denotes the number of classes.159
16
B.3 Stability results for all UCI and vision data sets, where n denotes the data
size, p denotes the data dimension, and k denotes the number of classes. . . 160
B.4 Sparsity results for all UCI and vision data sets, where n denotes the data
size, p denotes the data dimension, and k denotes the number of classes. . . 161
C.1 Summary of Data Size. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 163
C.2 Precision and Recall under Selected Thresholds for Alerts. . . . . . . . . . . 164
C.3 Unit Deployment Progress Information. . . . . . . . . . . . . . . . . . . . . . 165
C.4 Regression Results of our Difference-in-Difference Model for Estimating the
Impact of our Tool after Deployment. . . . . . . . . . . . . . . . . . . . . . . 166
17
18
Chapter 1
Introduction
Optimization and machine learning stand as two highly successful disciplines in decision-
making, with important applications across various sectors such as healthcare, energy, finance,
and more. Historically, the two were studied independently – optimization as a fundamental
pillar of Operations Research and machine learning burgeoning in Computer Science. The
escalating availability of data in recent years has naturally promoted the joint study of machine
learning with other scientific domains, with optimization being no exception. Optimization
has contributed significantly to the improvement of traditional machine learning models by
enhancing their training methodologies. Conversely, machine learning has helped improving
optimization algorithms, as access to accurate predictions of future outcomes allows to make
better decisions.
Integrating optimization theory with modern machine learning methods, such as neural
networks and kernel functions, presents unique challenges compared to more conventional
models like logistic regression, trees, and support vector machines. Firstly, neural networks
do not satisfy the fundamental convexity assumptions that underlie optimization theory.
Secondly, these models are mostly applied in tasks involving a large number of parameters
and high dimensional observations, requiring the algorithms employed to be highly efficient
and scalable. Maintaining those properties leaves little or no room for discrete variables and
19
general constraints, which are typical in optimization.
This thesis presents novel algorithms that alleviate the challenges described above and
successfully integrate optimization with modern machine learning models. We focus on two
key problems:
• Machine Learning Classification: Deep learning is one of the most important scientific
developments of our time; however, deep learning models face notable challenges when
applied in practical domains. They struggle with small datasets, tabular data formats,
and they lack robustness and stability. Additionally, their computational demands
make them intractable in settings constrained by memory or time. We propose the use
of tools from robust optimization, non-convex optimization, and large language models
to address these limitations.
• Multistage Stochastic Optimization: This active research area has applications to various
problems like patient flow optimization and inventory control. Recent efforts have
utilized predictive analytics to inform decision-making using available side information
and historical data. However, dynamic methods like these are hindered by the curse of
dimensionality and struggle to scale in settings with large datasets or high-dimensional
decisions. We propose leveraging machine learning with kernel methods and functional
optimization to develop a scalable data-driven algorithm for this problem.
This thesis is divided into four chapters, encompassing rigorous theory, computational
tools, and diverse applications. In Chapter 1, we extend state-of-the-art tools from robust
optimization to non-convex and non-concave settings, allowing us to generate neural networks
that are robust against input perturbations. In Chapter 2, we develop a holistic deep learning
framework that jointly optimizes for neural network robustness, stability and sparsity by
appropriately modifying the objective function. In Chapter 3 we introduce TabText, a
flexible methodology that leverages the power of Large Language Models for patient flow
predictions from tabular data. Lastly, in Chapter 4 we present a data-driven approach for
20
solving multistage stochastic optimization problems via sparsified kernel methods.
The following sections offer a summary of the work as well as the contributions within
each chapter of the thesis.
1.1 Robust, Stable and Sparse Optimization for Neural
Networks
Deep learning has emerged as the predominant technology of our times, with important
applications in areas like computer vision, natural language processing, and structural biology.
However, in recent years it has been exposed that standard neural networks lack robustness –
they are susceptible to being misled by both natural and artificial noise in the input data. This
problem becomes particularly relevant when considering applications related to self-driving
cars or medicine, in which these perturbed inputs represent an important security threat.
In Chapter 1, we develop a novel algorithm to train robust neural networks by extending
state-of-the-art tools from robust optimization to non-convex and non-concave settings. In
particular, we find a closed-form expression that provides an upper bound on the worst-case
training loss with respect to bounded input perturbations. By minimizing this upper bound,
our method, Robust Upper Bounds (RUB), not only achieves state-of-the-art performance
empirically, but also provides performance guarantees against the perturbations considered.
We also propose a simple method (Approximated Robust Upper Bound or aRUB) which
uses the first order approximation of the network as well as basic tools from linear robust
optimization to obtain an empirical upper bound of the adversarial loss that can be easily
implemented. Across a variety of tabular and vision data sets we demonstrate the effectiveness
of our approach —RUB is substantially more robust than state-of-the-art methods for larger
perturbations, while aRUB matches the performance of state-of-the-art methods for small
perturbations (Bertsimas et al. 2021a).
Robustness is; however, not the only challenge that neural networks face. These models
21
frequently exhibit instability during the training process, where different train-validation
splits can result in models with significantly varied performance. Additionally, these networks
contain millions of non-zero parameters that need to be stored and accessed for evaluation.
Previous works on robustness, stability and sparsity of neural networks have only addressed
these challenges in isolation, and often at the expense of a substantial increase in computational
time. In Chapter 2, we focus on developing practical algorithms to optimize for these properties
jointly and analyze their trade-offs.
In order to enhance stability with respect to train-validation splits, we modify the loss
function to optimize for the empirical value-at-risk of the error (instead of the average error)
using a mixed-integer programming formulation. In addition, we incorporate a continuous
approximation of the L0 pseudo-norm as a penalty in the objective to enforce sparsity. We
combine these algorithms with the robustness methods from Chapter 1 to develop a holistic
deep learning framework (HDL) that jointly optimizes for the metrics of interest. HDL
demonstrates that it is often possible to simultaneously improve robustness, stability, and
sparsity without sacrificing performance on accuracy. In fact, we show that adding robustness
and stability can significantly improve the accuracy of the network, especially for tabular
data sets. Since our approach (and our code) supports a variety of loss functions, we also
provide guidance to practitioners to help them align the training objective with their specific
use case (Bertsimas et al. 2024).
Contributions
• We develop two new methods for training deep learning models that are robust against
input perturbations. The first method (Approximated Robust Upper Bound or aRUB),
minimizes an empirical upper bound of the adversarial loss for Lp norm bounded
uncertainty sets, for general p. It is simple to implement and performs similar to
state-of-the-art defenses on small uncertainty sets. The second method (Robust Upper
Bound or RUB), minimizes a provable upper bound of the adversarial loss specifically
22
for L1 norm bounded uncertainty sets. This method shows the best performance for
larger uncertainty sets, and more importantly, it provides security guarantees against
L1 norm bounded adversarial attacks.
• We design HDL, a novel framework that jointly optimizes for neural network robust-
ness, stability, and sparsity metrics by appropriately modifying the objective function.
Through extensive ablation experiments across tabular and image sets, we analyze the
individual performance of each metric as well as the interactions and trade-offs between
them.
• We propose a prescriptive approach to provide recommendations on selecting the
appropriate loss function for a classification task depending on the practitioner’s metric
of interest.
1.2 Large Language Models for Tabular Data Represen-
tations
Accurate predictions of patient outcomes can facilitate resource allocation and enhance
personalized care. In collaboration with a large hospital network, we developed machine
learning models that predict short-term and long-term outcomes for all inpatients across their
seven hospitals using electronic medical records. We implemented an automated pipeline
displaying our daily predictions with user-friendly software. Over 200 medical staff currently
use our tool, resulting in a significant reduction in length of stay and projected annual benefits
of millions of dollars for the healthcare system (Na et al. 2023).
Given this successful implementation, the question arises: how could we extend these tools
for the benefit of hospitals with limited resources, small patient populations, and/or non-
standardized healthcare records? Even though electronic medical records are widely available
for most digitized healthcare systems, tabular data in healthcare is generally disorganized, not
23
standardized across institutions and scarce in small healthcare systems. We then identified
two significant limitations in the existing approaches to handling tabular data: they require
labor-intensive data processing, and they ignore contextual information such as column
headers and meta content descriptions which could be used for data augmentation.
To address these limitations, in Chapter 3 we present TabText, a systematic framework that
leverages Large Language Models to extract contextual information from tabular structures,
resulting in more complete and flexible data representations. These new representations
can then be used to train any standard machine learning model for downstream prediction
tasks. Although deep learning models often perform poorly on classification tasks with
structured data, off-the-shelf and pre-trained neural networks can be remarkably helpful
for enhancing pre-processing pipelines. We demonstrated the flexibility of our approach
compared to traditional labor-consuming processing techniques, and we showed that TabText
can significantly improve performance across all patient outcome classification tasks considered,
especially those with small data sizes and high variability (Carballo et al. 2022).
Contributions
• We develop and implement machine learning models that predict several inpatient
outcomes at a healthcare system. We show that after utilizing our user-friendly software
the hospitals observe significant reduction in length of stay.
• We develop TabText, a systematic framework that leverages language to extract contex-
tual information from tabular structures. Our experiments demonstrate that augmenting
electronic medical records with our TabText representations can significantly improve
the AUC score, especially when trained with small-size datasets. We also show that Tab-
Text enables the generation of high-performing predictive models for patient outcomes
with minimal data processing.
24
1.3 Sparse Reproducing Kernels for Stochastic Multistage
Optimization with Covariates
Multistage stochastic optimization arises in numerous applications and remains an important
research area in the optimization community. Recent work has focused on using predictive
analytics to leverage available side information and historical data to make better decisions.
However, these dynamic methods are affected by the curse of dimensionality; they require
scenario tree enumeration and can require many hours for solving problems with only a
few stages. More recently, kernel methods have been used for data-driven, single-period
optimization problems with auxiliary information. This approach overcomes the curse of
dimensionality; however, the number of parameters per decision grows linearly with the
number of observations, resulting in function representations that are as complex as the size
of the data and that become potentially intractable especially in multistage settings.
In Chapter 4 we develop a non-parametric, data-driven, tractable approach for solving
multistage stochastic optimization problems in which decisions do not affect the uncertainty.
The proposed framework represents the decision variables as elements of a reproducing
kernel Hilbert space and performs functional stochastic gradient descent to minimize the
empirical regularized loss. By incorporating sparsification techniques based on function
subspace projections we are able to overcome the computational complexity that standard
kernel methods introduce as the data size increases. We prove that the proposed approach is
asymptotically optimal for multistage stochastic optimization with side information. Across
various computational experiments on stochastic inventory management problems, our method
performs well in multidimensional settings and remains tractable when the data size is large.
Lastly, by computing lower bounds for the optimal loss of the inventory control problem, we
show that the proposed method produces decision rules with near-optimal average performance
(Bertsimas and Carballo 2023).
25
Contributions
• We propose a novel data-driven approach for multistage stochastic optimization problems
with side information based on reproducing kernel Hilbert spaces and sparse projections.
To the best of our knowledge, this is the first tractable application of reproducing kernel
Hilbert spaces to multistage optimization problems with large data sizes.
• We prove that under standard convexity and smoothness conditions on the loss function,
the expected loss achieved with our algorithm achieves asymptotic optimality.
• We demonstrate across several instances of inventory management problems that the
proposed method finds near-optimal solutions using only a few parameters and with
very low computational times even for large instances of the problem.
26
Chapter 2
Robust Deep Learning
2.1 Introduction
Robustness of neural networks for classification problems has received increasing attention in
the past few years, since it was exposed that these models could be easily fooled by introducing
some small perturbation in the input data. These perturbed inputs, which are commonly
referred to as adversarial examples, are visually indistinguishable from the natural input,
and neural networks simply trained to maximize accuracy often assign them to an incorrect
class (Szegedy et al. 2013). This problem becomes particularly relevant when considering
applications related to self-driving cars or medicine, in which adversarial examples represent
an important security threat (Kurakin et al. 2016).
The machine learning community has recently developed multiple heuristics to make
neural networks robust. The most popular ones are perhaps those based on training with
adversarial examples, a method first proposed by Goodfellow et al. (2015) and which consists
in training the neural network using adversarial inputs instead of or in addition to the
standard data. The defense by Madry et al. (2019), which finds the adversarial examples
with bounded norm using iterative projected gradient descent (PGD) with random starting
points, has proved to be one of the most effective methods (Tjeng et al. 2019), although
27
it comes with a high computational cost. Another more efficient defense was proposed by
Wong et al. (2020), which uses instead fast gradient sign methods (FGSM) to find the attacks.
Other heuristic defenses rely on preprocessing or projecting the input space (Lamb et al.
2018, Kabilan et al. 2018, Ilyas et al. 2017), on randomizing the neurons (Prakash et al.
2018, Xie et al. 2017) or on adding a regularization term to the objective function (Ross
and Doshi-Velez 2017, Hein and Andriushchenko 2017, Yan et al. 2018). There is a plethora
of heuristics for adversarial robustness by now. Yet, these defenses are only effective to
adversarial attacks of small magnitude and are vulnerable to attacks of larger magnitude or
to new attacks (Athalye et al. 2018).
Given the lack of an exact and tractable reformulation of the adversarial loss with norm-
bounded perturbations, a recent strand of research has been to leverage upper bounds to
improve adversarial robustness. These upper bounds provide security guarantees against
adversarial attacks, even new ones, by finding a mathematical proof that a network is not
susceptible to any attack, e.g. (Dathathri et al. 2020, Raghunathan et al. 2018b, Katz et al.
2017, Tjeng et al. 2019, Bunel et al. 2017, Anderson et al. 2020, Singh et al. 2018, Zhang
et al. 2018, Weng et al. 2018, Gehr et al. 2018, Dvijotham et al. 2018b, Lecuyer et al. 2019,
Cohen et al. 2019). Replacing the standard loss with these upper bounds during training is a
common technique for obtaining adversarial defenses. Wong and Kolter (2018) for instance,
find an upper bound for the adversarial loss by applying linear relaxations in the network
and computing a convex polytope that contains all possible values for the last layer given
adversarial examples with bounded norm. An upper bound on the adversarial loss is also
computed in Raghunathan et al. (2018a) by solving instead a semidefinite program. Other
more scalable and effective methods based on minimizing an upper bound of the adversarial
loss have been introduced (Gowal et al. 2019, Balunovic and Vechev 2019, Mirman et al.
2018, Dvijotham et al. 2018a, Wong et al. 2018, Zhang et al. 2019).
While these methods can provide security guarantees against adversarial examples, most
of them rely on convex relaxations to recursively compute upper and lower bounds for each
28
layer, which introduces gaps that propagate and can affect the final bound for the last layer.
For instance, the approach proposed in Gowal et al. (2019) computes bounds for each layer by
assuming that the worst-case bounds for all previous layers can be achieved simultaneously.
This often yields a loose upper bound of the adversarial loss whose minimization can be
sensitive to hyperparameters (Zhang et al. 2019). Another example is the aforementioned
defense from Wong and Kolter (2018), where bounds at each layer are computed by solving a
linear program that uses the bounds from previous layers for the linear ReLU relaxations.
Unlike these approaches, the method proposed in Raghunathan et al. (2018a) does not require
computation of intermediate bounds, however, their proposed upper bound only works for
neural networks with two layers.
Upper bounds for the adversarial loss have also been explored in the context of Dis-
tributionally Robust Optimization, in which the data distribution is perturbed within a
Wasserstein ball (Sinha et al. 2017, Shafieezadeh-Abadeh et al. 2019). These works find upper
bounds for the worst-case expected loss and provide generalization guarantees under certain
assumptions. However, it is not evident how to apply these methods for norm-bounded input
perturbations given the different nature of the uncertainty.
A promising yet under-explored approach is the application of state-of-the-art Robust
Optimization (RO) tools (Bertsimas and den Hertog 2022). RO has proven to be effective
in handling uncertainty in parameters that may result from rounding or implementation
errors. Recently, it has also been applied to provide robustness against input perturbations in
some machine learning models, such as Support Vector Machines and Optimal Classification
Trees (Bertsimas et al. 2019), and it could be similarly leveraged for deep learning. While
previous works on robustness of neural networks generally formulate the problem in the
context of RO, they do not utilize the more advanced tools available in this field. Instead,
they mostly depend on linear or convex relaxations and heuristic methods to simplify the
original non-convex problem. In this paper, we use state-of-the-art RO tools to derive a
new closed-form solution of an upper bound of the adversarial loss. Our approach is based
29
on a holistic expansion of the network; it does not rely on convex relaxations or separate
computation of bounds for each layer of the network, and it can still be effectively trained
with backpropagation.
We develop two new methods for training deep learning models that are robust against
input perturbations. The first method (Approximated Robust Upper Bound or aRUB),
minimizes an empirical upper bound of the adversarial loss for Lp norm bounded uncertainty
sets, for general p. It is simple to implement and performs similar to state-of-the-art defenses
on small uncertainty sets. The second method (Robust Upper Bound or RUB), minimizes a
provable upper bound of the adversarial loss specifically for L1 norm bounded uncertainty sets.
This method shows the best performance for larger uncertainty sets, and more importantly, it
provides security guarantees against L1 norm bounded adversarial attacks. More concretely,
we introduce the following robustness methods:
• Approximated Robust Upper Bound or aRUB: We develop a simple method to ap-
proximate an upper bound of the adversarial loss by adding a regularization term for
each target class separately. As an alternative to standard adversarial training (which
relies on linear approximations to find good adversarial attacks), we use the first order
approximation of the network to estimate the worst case scenario for each individual
class. We then apply standard results from Linear Robust Optimization to obtain a
new objective that behaves like an upper bound of the adversarial loss and which can
be tractably minimized for robust training. This method can be easily implemented
and performs very well when the uncertainty set radius ρ is small.
• Robust Upper Bound or RUB: We extended state-of-the-art tools from RO to functions
that like neural networks are neither convex nor concave. By splitting each layer of
the network as the sum of a convex function and a concave function, we are able to
obtain an upper bound of the adversarial loss for the case in which the uncertainty
set is the L1 sphere. Since the dual function of the L1 norm is the L∞ norm, we
convert the maximum over the uncertainty set into a maximum over a finite set. In
30
the end, instead of minimizing the worst case loss over an infinite uncertainty set, the
new objective minimizes the worst case loss over a discrete set whose cardinality is
twice the dimension of the input data. While this represents a significant increase in
memory for high dimensional inputs, we show that this approach remains tractable for
multiple applications. The main advantage of this method is that it provides security
guarantees against adversarial examples bounded in the L1 norm. Additionally, we
also show experimentally that this method generally achieves the highest adversarial
accuracies for larger uncertainty sets.
Also, we show that these methods consistently achieve higher standard accuracy (i.e.,
non adversarial accuracy), than the nominal neural networks trained without robustness.
While this result is not true for a general choice of uncertainty set (see for example Ilyas et al.
(2019)), we observe that when the uncertainty set has the appropriate size it can significantly
improve the classification performance of the network, which is consistent with the results
obtained for other classification models like Support Vector Machines, Logistic Regression
and Classification Trees (Bertsimas et al. 2019).
The paper is organized as follows: Section 2.2 revisits previous works on RO, Section
2.3 defines the robust problem, Section 2.4 presents the first method (Approximate Robust
Upper Bound), and Section 2.5 contains the second method (Robust Upper Bound). Lastly,
Section 5.7 contains the results for the computational experiments.
2.2 Previous Works on Robust Optimization
Over the last two decades, RO has become a successful approach to solve optimization
problems under uncertainty. For an overview of the primary research in this field we refer
the reader to Bertsimas et al. (2011). Areas like mathematical programming and engineering
have long applied these tools to develop models that are robust against uncertainty in the
parameters, which may arise from rounding or implementation errors. For many applications,
31
the robust problem can be reformulated as a tractable optimization problem, which is referred
to as the robust counterpart. For instance, for several types of uncertainty sets, the robust
counterpart of a linear programming problem can be written as a linear or conic programming
problem (Ben-Tal et al. 2009), which can be solved with many of the current optimization
software. While there is not a systematic way to find robust counterparts for a general
nonlinear uncertain problem, multiple techniques have been developed to obtain tractable
formulations in some specific nonlinear cases.
As shown in Ben-Tal et al. (2009), the exact robust counterpart is known for Conic
Quadratic problems and Semidefinite problems in which the uncertainty is finite, an interval
or an unstructured norm-bounded set. More generally, it is shown in Ben-Tal et al. (2015)
that for problems in which the objective function or the constraints are concave in the
uncertainty parameters, Fenchel duality can be used to exactly derive the corresponding
robust counterpart. While the result does not necessarily have a closed-form, the authors show
that it yields a tractable formulation for the most common uncertainty sets (e.g. polyhedral
and ellipsoidal uncertainty sets).
The problem becomes significantly more complex when the functions in the objective
or in the constraints are instead convex in the uncertainty (Chassein and Goerigk 2019).
Since obtaining provable robust counterparts in these cases is generally infeasible, safe
approximations are considered instead (Bertsimas et al. 2023). For instance, Zhen et al.
(2017) develop safe approximations for the specific cases of second order cone and semidefinite
programming constraints with polyhedral uncertainty. These techniques are generalized
in Roos et al. (2020), where the authors convert the robust counterpart to an adjustable
RO problem that produces a safe approximation for any problem that is convex in the
optimization variables as well as in the the uncertain parameters.
Even though the approaches mentioned above consider uncertainty in the parameters of
the model as opposed to uncertainty in the input data, the same techniques can be utilized
for obtaining robust counterparts in the latter case. In fact, the robust optimization RO
32
methodologies have recently been applied to develop machine learning models that are robust
against perturbations in the input data. In Bertsimas et al. (2019), for example, the authors
consider uncertainty in the data features as well as in the data labels to obtain tractable
robust counterparts for some of the major classification methods: support vector machines,
logistic regression, and decision trees. However, due to the high complexity of neural networks
as well as the large dimensions of the problems in which they are often utilized, robust
counterparts or safe approximations for this type of models have not yet been developed.
There are two major challenges with applying RO tools for training robust neural networks:
(i) Neural networks are neither convex nor concave functions: As mentioned earlier, robust
counterparts are difficult to find for a general problem. Although plenty of work has
been done to find tractable reformulations as well as safe approximations, all of them
rely on the underlying function being a convex or a concave function of the uncertainty
parameters. Unfortunately, neural networks don’t satisfy either condition, which makes
it really difficult to apply any of the approaches discussed above.
(ii) The robust counterpart needs to preserve the scalability of the model: Neural networks
are most successful in problems involving vision data sets, which often imply large
input dimensions and enormous amount of data. For the most part, they can still
be successfully trained thanks to the fact that back propagation algorithms can be
applied to solve the corresponding unconstrained optimization problem. However, the
RO techniques for both convex and concave cases often require the addition of new
constraints and variables for each data sample, increasing significantly the number of
parameters of the network and making it very difficult to use standard machine learning
software for training.
A straightforward way to overcome both of these difficulties would be to replace the
loss function of the network with its first order approximation. However, this loss function
is usually highly nonlinear and therefore the linear approximation is very inaccurate. Our
33
method aRUB explores a slight modification of this approach that significantly improves
adversarial accuracy by considering only the linear approximation of the network’s output
layer.
Alternatively, a more rigorous approach to overcome problem (i) would be to piece-wise
analyze the convexity of the network and apply the RO techniques in each piece separately,
but this approach would introduce additional variables that are in conflict with requirement
(ii). For the proposed RUB method we then develop a general framework to split the network
by convexity type, and we show that in the specific case in which the uncertainty set is the
L1 norm bounded sphere, we can solve for the extra variables and obtain an unconstrained
problem that can be tractably solved using standard gradient descent techniques.
2.3 The Robust Optimization problem
We consider a classification problem over data points x ∈ RM labeled with one of K different
classes in [K], where we use the notation [n] to denote the set {1, . . . , n}. Given weight
matrices W ℓ ∈ Rrℓ−1×rℓ and bias vectors bℓ ∈ Rrℓ for ℓ ∈ [L], such that r0 =M, rL = K, the
corresponding feed forward neural network with L layers and ReLU activation function is
defined by the equations
z1(W ,x) = W 1x+ b1, (2.1)
zℓ(W ,x) = W ℓ[zℓ−1(W ,x)]+ + bℓ, ∀ 2 ≤ ℓ ≤ L, (2.2)
where W denotes the set of parameters (W ℓ, bℓ) for all ℓ ∈ [L] and [x]+ is the result of
applying the ReLU function ([x]+ = max{x, 0}) to each coordinate of x. For fixed parameters
W , the network assigns a sample x to the class ŷ = argmaxk zLk (W ,x). And given a data
set {(x , y )}Nn n n=1, where yn∈ [K] is the target class of xn, the optimal parameters W are
34
usually found by minimizing the empirical loss
∑N1
min L(yn, z
L(W ,xn)), (2.3)
W N
n=1
with respect to a specific loss function L : [K]× RK → R≥0.
In the RO framework, however, we want to find the parameters W by minimizing the
worst case loss achieved over an uncertainty set of the input. More specifically, instead of
optimizing the nominal loss in Eq. (2.3), we want to optimize the adversarial loss:
∑N1 ( )
min max L yn, z
L(W , xn + δ) , (2.4)
W N δ∈U
n=1
for some uncertainty set U ⊂ RM .
Unfortunately, a closed-form expression for the inner maximization problem above is
unknown and solving the min-max problem is notoriously difficult. RO provides multiple
tools for solving such problems when the loss function is either convex or concave in the input
variables. For example, in the case of concave loss functions, a common approach would be
to take the dual of the maximization problem so that the problem can be formulated as a
single minimization problem (Bertsimas and den Hertog 2022). If the loss function is instead
convex, Fenchel’s duality as well as conjugate functions can be used to find upper bounds
and lower bounds of the maximization problem. However, there is no general framework
developed for loss functions that do not fall into those categories, like in the case of neural
networks.
In this paper, we will focus on the specific case in which the uncertainty set is the ball of
radius ρ in the Lp space; i.e. U = {δ : ∥δ∥p ≤ ρ}, and we make the following assumptions
about the loss function:
35
Assumption 2.3.1. The loss function L is translationally invariant; i.e. for all y ∈ [K], z ∈
RK, it satisfies
L(y, z) = L(y, z − ce) ∀ c ∈ R, (2.5)
where e ∈ RK denotes the vector with value 1 in all the coordinates.
Assumption 2.3.2. The loss function L is monotonic; i.e. for all y ∈ [K], z, z′ ∈ RK it
satisfies
( ) ( )
∀k ∈ [K], z − z ≤ z′ ′k y k − zy =⇒ L(y, z) ≤ L(y, z
′) . (2.6)
Although some loss functions utilized in deep learning like the squared error or the absolute
error do not satisfy these assumptions, the most popular losses for classification problems
(softmax with cross-entropy loss, multiclass hinge loss, and hardmax with zero-one loss) do
satisfy both of them. Intuitively, Assumption 2.3.1 implies that the loss function L takes
into account the differences between the coordinates of z but not the exact value at each
coordinate; while Assumption 2.3.2 means that a larger difference between the coordinates of
the incorrect classes and the correct class results in a larger loss. These assumptions allow us
to obtain the following result:
Lemma 2.3.3. If the loss function L satisfies Assumptions 2.3.1 and 2.3.2, then for all
x ∈ RM , y ∈ [K] the adversarial loss can be upper bounded as
min maxL(y, zL(W ,x+ δ)) (2.7)
W δ∈U
( ( ))
≤min L y, max zL1 (W ,x+ δ)−z
L
y (W ,x+ δ), . . . ,maxz
L L
K(W ,x+ δ)−zy (W ,x+ δ) .
W δ∈U δ∈U
(2.8)
Proof. See Appendix A.1.1.
36
The robustness methods we propose in the next sections generate upper bounds for the
adversarial differences maxδ∈U zLk (W ,x+ δ)− zLy (W ,x+ δ) for all k ∈ [K], and then apply
the previous lemma to upper bound the adversarial loss.
2.4 Approximate Robust Upper Bound for small ρ
Perhaps the most intuitive approach to tackle problem (2.4) is to consider the first order
approximation of the loss function
( ) ( ) ( )
L y, zL(W ,x+ δ) ≈ L y,zL(W , x) + δ⊤∇xL y, z
L(W , x) ,
since the right hand side is a linear function of δ and the maximization problem can be
more easily solved for linear functions of the uncertainty. For example, it is not hard to
see that the first order approximation reaches its maximum value in U = {δ : ∥δ∥∞ ≤ ρ}
( ( ))
exactly at δ⋆ = ρ sign ∇xL y, zL(W x) . This approach is referred to as fast gradient
sign method and it was first explored in Goodfellow et al. (2015), where the networks are
trained with adversarial examples generated as x+ δ⋆. A similar approach was proposed in
Huang et al. (2016), where the authors considered the cross entropy loss and use the linear
approximation of the softmax layer instead of the approximation of the entire loss function.
In these methods, linear approximations are used to find near optimal perturbations that can
produce strong adversarial examples for training, but not to approximate the adversarial loss.
An alternative to these methods would then be to train the network with the natural data
and replace the loss function with its linear approximation, transforming the problem into
N
1∑ ( ) ( )
min max L y , zL(W ,x ) + δ⊤n n ∇xL yn, z
L(W ,xn)
W N δ∈U
n=1
1∑
N
( ) ( )
=min L yn, z
L(W ,xn) + ρ∥∇ L y , z
L
x n (W ,xn) ∥q, (2.9)
W N
n=1
37
where ∥ ∥q is the dual norm of ∥ ∥p, satisfying 1 + 1 = 1. However, since the loss function isp q
highly nonlinear, this approach (which we call Baseline-Lp) generally performs worse than
training with adversarial examples (see the Baseline method in Section 5.7).
A more promising approach can be derived by noting that each component of the network
zL(W ,x) is in fact a continuous piecewise linear function (see the network definitions in
Section 2.3), which suggests that the first order approximation of zL is more precise than
that of L(y,zL) for small neighborhoods. In fact, we expect the outputs zL(W ,x) and
zL(W ,x+ δ) to be in the same linear piece when x+ δ is close to x. In other words, the
linear approximation
zL(W ,x+ δ) ≈ zL(W ,x) +∇ zL(W ,x)⊤x δ (2.10)
is exact for small enough δ. We can then approximately solve the adversarial problem for
each class k as
max zLk (W ,x+ δ)−z
L
y (W ,x+ δ) ≈ max (e − e
⊤ L ⊤ L ⊤
k y) z (W ,x)+(ek − ey) ∇xz (W ,x) δ
δ∈U δ∈U
= (e − e )⊤k y z
L(W ,x)+ρ∥(ek − ey)
⊤∇ zLx (W ,x)∥q,
(2.11)
where k, y ∈ [K] and ek (respectively ey) is the one-hot vector with a 1 in the kth (respectively
in the yth) coordinate and 0, everywhere else. Applying the result from Lemma 2.3.3 and
defining y∆e :k = ek − ey we obtain the approximate robust upper bound:
min max L(y, zL(W ,x+ δ))
W δ∈U
(
(
⪅ y⊤ y⊤min L y, ∆e zL1 (W ,x) + ρ∥∆e ∇ z
L
1 x (W ,x)∥q, (2.12)
W
)
y ⊤ y ⊤
)
. . . ,∆e LK z (W ,x) + ρ∥∆eK ∇
L
xz (W ,x)∥q . (2.13)
38
Table 2.1: Percentage of times when the aRUB approach yields an upper bound of the
adversarial loss with respect to PGD attacks, i.e., percentage of times when Eq. (2.14) is
larger than Eq. (2.4) evaluated using PGD-Lp attacks. For each row, aRUB-Lp and the
PGD-Lp attacks use the Lp norm indicated on the first column. The loss function utilized
is the cross entropy with softmax. Percentages are computed across all networks trained
(ie., for all tested hyperparameters) in the 46 UCI data sets, every 500 training steps (see
subsections 2.6.1 and 2.6.2 for more details about the networks and data sets). In this way,
the approximate bound is evaluated in a large number of different conditions, including data
sets, training steps and hyperparameters.
ρ = 0.0 0.0008 0.001 0.0015 0.002 0.003 0.01 0.1 0.3 0.5 1.0
L∞ 94% 99% 99% 99% 99% 99% 99% 95% 86% 81% 79%
L1 93.3% 99% 99% 99% 99% 99% 99% 99% 99% 99% 98%
Therefore, we propose to train the network by minimizing Eq. (2.13) instead of the standard
average loss, and we refer to this defense as aRUB-Lp. For the particular case of the cross
entropy loss with softmax activation function in the output layer, the exact optimization
problem to be solved would be the following:
( )
1 ∑
N ∑ yn ⊤ L yn ⊤ L
min log e(∆e ) z (W,x)+ρ∥(∆e ) ∇xz (W,x)∥k k q , (2.14)
W N
n=1 k
This expression may not always be an upper bound of the adversarial loss (Eq. (2.4));
however, we observe across a variety of experiments that usually it is indeed an upper bound
(see Table 2.1). This suggests that the upper bound provided by Lemma 1 compensates for
the errors introduced by the first order approximation of zL. Additionally, in Figure 2.1 we
empirically show that Eq. (2.13) is much closer than Eq. (2.9) to the adversarial loss in Eq.
(2.4) for small values of ρ.
39
2.5 Adversarial Loss 2.5 Adversarial Loss
aRUB-L  Loss ( = 0.01) Baseline-L  Loss ( = 0.01)
2.0 2.0
1.5 1.5
1.0 1.0
0.5 0.5
0.0 0.0
0 2000 4000 6000 8000 10000 0 2000 4000 6000 8000 10000
Iteration Step Iteration Step
(a) aRUB-L∞, ρ = 0.01 (b) Baseline-L∞, ρ = 0.01
14 Adversarial Loss 14 Adversarial Loss
aRUB-L  Loss ( = 0.05) Baseline-L  Loss ( = 0.05)
12 12
10 10
8 8
6 6
4 4
2 2
0 0
0 2000 4000 6000 8000 10000 0 2000 4000 6000 8000 10000
Iteration Step Iteration Step
(c) aRUB-L∞, ρ = 0.05 (d) Baseline-L∞, ρ = 0.05
Adversarial Loss Adversarial Loss
20 aRUB-L  Loss ( = 0.1) 20 Baseline-L  Loss ( = 0.1)
15 15
10 10
5 5
0 0
0 2000 4000 6000 8000 10000 0 2000 4000 6000 8000 10000
Iteration Step Iteration Step
(e) aRUB-L∞, ρ = 0.1 (f) Baseline-L∞, ρ = 0.1
Figure 2.1: Adversarial loss (cross entropy loss evaluated at adversarial images bounded in
L∞ norm) vs the loss function being minimized across training iterations (aRUB-L∞ on the
left and Baseline-L∞ on the right); the value of ρ increases for lower rows. Experiments are in
the MNIST data set with the infinity norm (p = ∞). We use a feed forward neural network
with three hidden layers with the softmax-cross-entropy loss (see subsection 2.6.1 for details).
We used a learning rate of 10−3 and a batch size of 32, which we found to work best across
the experiments in this figure.
40
Loss Value Loss Value Loss Value
Loss Value Loss Value Loss Value
2.5 Robust Upper bound for the L1 norm and general ρ.
In this section we derive a provable upper bound for the robust counterpart of the inner
maximization problem in Eq. (2.4) for the specific case in which the uncertainty set is the
ball of radius ρ in the L1 space; i.e. U = {δ : ∥δ∥1 ≤ ρ}.
In the RO framework, finding a provable upper bound for a problem that is convex
(or concave) in the uncertainty parameters relies on using convex (or concave) conjugate
functions to make the problem linear in the uncertainty (Bertsimas and den Hertog 2022).
The following lemmas are a generalization of this approach for the case in which the objective
involves the composition of two functions, with the goal of making the problem linear not in
the uncertainty but in the inner most function. Lemma 2.5.1 makes this generalization when
the outer function is convex while Lemma 2.5.2 focuses on the concave case. The proofs
of these lemmas rely on the definition of conjugate functions as well as the Fenchel duality
theorem (Rockafellar 1970), and can be found in Appendix A.1. Together, these lemmas are
the core of the methodology presented in this section.
Lemma 2.5.1. If f : A→ B is a convex and closed function then for any function z : U → A,
and any function g : A→ B we have
sup f(z(δ)) + g(z(δ)) = sup sup z(δ)Tu− f ⋆(u) + g(z(δ)),
δ∈U u∈dom(f⋆) δ∈U
where the convex conjugate function f ⋆ is defined by f ⋆(z) = sup Tx∈dom(f) z x− f(x).
Proof. See Appendix A.1.2.
Lemma 2.5.2. Let g : A → B be a concave and closed function. If a function z : U → A
satisfies g(z(δ)) <∞ for all δ ∈ U , then
sup g(z(δ)) = inf sup z(δ)Tv − g⋆(v),
δ∈U v∈dom(g⋆) δ∈U
41
where the concave conjugate function is defined by g⋆(z) = inf
T
x∈dom(g) z x− g(x).
Proof. See Appendix A.1.3.
From the lemmas above we can observe that to apply them we will need to compute
convex and concave conjugate functions. The next lemma facilitates these computations for
neural networks with ReLU activation functions.
Lemma 2.5.3. If u,p, q ≥ 0, then the functions f(x) = p⊤[x]+ and g(x) = xTu− q⊤[x]+
satisfy
 
 
 0 if 0 ≤ z ≤ p, 0 if u− q ≤ z ≤ u,
a) f ⋆(z) = and b) g⋆(z) =
  ∞ otherwise, −∞ otherwise.
Proof. See Appendix A.1.4.
As observed in Lemma 2.3.3, we can obtain an upper bound of the min-max problem in
Eq. (2.4) by finding instead upper bounds for max L Lδ∈U zk (W ,x + δ) − zy (W ,x + δ) for
each class k. We will find these upper bounds in the following three steps:
• Step #1 - Linearize the uncertainty: We first make the maximization problem over the
uncertainty set U linear in the first layer of the network (and therefore also linear in the
uncertainty δ). Starting from the last layer we recursively split the objective function
as the sum of a convex function and a concave function, and we then apply Lemmas
2.5.1 and 2.5.2 to make the maximization problem over U linear in the previous layer.
• Step #2 - Optimize over the uncertainty set: Then, we solve the maximization problem
over U . This problem can be exactly solved because the first layer of the network is
linear in the uncertainty and the dual function of the L1 norm is the L∞ norm (a
maximum over a finite set).
• Step #3 - Backtrack: Finally, we backtrack to solve for the variables u,v that Lemmas
2.5.1 and 2.5.2 introduce.
42
For simplicity, we develop and prove the upper bound for the robust counterpart assuming
that the neural network has only two layers; however, the results can be extended to the
general case as shown in Appendix A.2. In addition, all the theorems can be generalized for
residual and convolutional neural networks, since convolutions are a special case of matrix
multiplication.
Step #1 - Linearize the uncertainty
The following theorem shows how the maximization problem over U can be transformed from
a linear problem in the second layer to a linear problem in the first layer of the network. The
proof relies on Lemma 2.5.1 and Lemma 2.5.2.
Theorem 2.5.4. The maximum difference between the output of the correct class and the
output of any other class k can be written as
y
sup z2k(W ,x+ δ)− z
2
y(W ,x+ δ) = sup (∆e )
⊤z2k (W ,x+ δ) (2.15)
δ∈U δ∈U
y
= sup inf sup (p− q)⊤ z1(W ,x+ δ) + (∆e ⊤ 2k) b
0≤s≤1 0≤t≤1 δ∈U
s.t. p = [(W 2)⊤
y
(∆ek)]
+ ⊙ s
y
q = [−(W 2)⊤(∆ek)]
+ ⊙ t,
(2.16)
where ⊙ corresponds to entry-wise multiplication.
Proof. By definition of the two layer neural network z2, we have
y
(∆e )⊤z2
y
(W ,x+ δ) = (∆e )⊤W 2[z1
y
k k (W ,x+ δ)]
+ + (∆ek)
⊤b2
y
= f (z1+ (W ,x+ δ))− f−(z
1(W ,x+ δ)) + (∆ek)
⊤b2,
43
where f+, f− are the convex functions defined by
y y
f+(x) = [(∆e )
⊤W 2]+[x]+k , and f−(x) = [−(∆e
⊤ 2
k) W ]
+[x]+.
Applying Lemma 2.5.1 to the function f+ we then have
y
sup (∆e )⊤k z
2(W ,x+ δ),
δ∈U
1 y=sup f+(z (W ,x+ δ))− f−(z
1(W ,x+ δ)) + (∆e ⊤ 2k) b ,
δ∈U
y
= sup sup u⊤z1(W ,x+ δ)− f ⋆+(u)− f
1
−(z (W ,x+ δ)) + (∆ek)
⊤b2, (2.17)
u∈dom(f⋆ ) δ∈U+
(By Lemma 2.5.1)
⊤ 1 y= sup sup u z (W ,x+ δ)− f 1−(z (W ,x+ δ)) + (∆ek)
⊤b2. (2.18)
u∈dom(f⋆ ) δ∈U+
(By Lemma 2.5.3a)
Defining the concave function g(x) = u⊤x−f−(x), and applying Lemma 2.5.2 to the function
g we obtain
y
sup (∆e )⊤ 2k z (W ,x+ δ)
δ∈U
y
= sup sup g(z1(W ,x+ δ)) + (∆ek)
⊤b2 (2.19)
u∈dom(f⋆ ) δ∈U+
y
= sup inf sup v⊤z1(W ,x+ δ)− g⋆(v) + (∆e )
⊤ 2
k b (By Lemma 2.5.2),
u∈dom(f⋆ ) v∈dom(g⋆) δ∈U+
(2.20)
= sup inf sup v⊤
y
z1(W ,x+ δ) + (∆ek)
⊤b2 (By Lemma 2.5.3b).
u∈dom(f⋆+)
v∈dom(g⋆) δ∈U
(2.21)
Lastly, by Lemma 2.5.3a and 2.5.3b, we know that the variables u and v can be parameterized
44
as
y
u = [(W 2)⊤(∆ek)]
+ ⊙ s,
y y
v = [(W 2)⊤(∆ek)]
+ ⊙ s− [−(W 2)⊤(∆e )]+k ⊙ t
with 0 ≤ s, t ≤ 1. Substituting these values in Eq. (2.21) we obtain Eq. (2.16), as desired.
Step #2 - Optimize over the uncertainty set
Notice that the objective in Eq. (2.16) is linear in z1 and therefore it is also linear in δ,
which facilitates the computation of the exact value of the supremum over U , as shown in
the next corollary.
Corollary 2.5.5. If U = {δ : ∥δ∥p ≤ ρ}, then:
y
sup (∆e )⊤z2k (W ,x+ δ) (2.22)
δ∈U
y
= sup inf ρ∥(p− q)⊤W 1∥q + (p− q)
⊤(W 1x+ b1) + (∆e )⊤b2k
0≤s≤1 0≤t≤1
y
s.t. p = [(W 2)⊤(∆e + (2.23)k)] ⊙ s
y
q = [−(W 2)⊤(∆ek)]
+ ⊙ t,
where ∥ · ∥q is the dual norm of ∥ · ∥p, with
1 + 1 = 1.
p q
Before proceeding to the proof of the corollary, notice that we can recover the approxima-
tion method developed in the previous section by setting
s = t = [sign(z1(W ,x))]+ (2.24)
45
in the objective of problem (2.23) to obtain
y y y
ρ∥((W 2)⊤(∆ek)⊙ [sign(z
1(W ,x))]+)⊤W 1∥ + (∆e )⊤W 2[z1(W ,x)]+q k + (∆e
⊤ 2
k) b ,
(2.25)
which is the same as the linear approximation of y(∆ek)
⊤z2(W + δ) obtained in Eq. (2.11).
Proof. The proof follows directly after applying Theorem 2.5.4 and using the fact that for all
vectors c we have
sup c⊤δ = ρ∥c∥q. (2.26)
δ:∥δ∥p≤ρ
Step #3 - Backtrack
Since neural networks are trained by minimizing the empirical loss over the parameters W , we
want to avoid the computation of supremums in the objective. While the previous corollary
shows how to solve the supremum over the uncertainty set, a new supremum was introduced
in Theorem 2.5.4 over the variables s. The next theorem tells us how we can remove this
new supremums for the specific case p = 1.
Theorem 2.5.6. The maximum difference between the output of the correct class and the
output of any other class k can be upper bounded by
y
sup (∆e ⊤ 2k) z (W ,x+ δ)
δ:∥δ∥1≤ρ
{ }
≤ inf max max g2k,m(W , t,x, ρ), g
2
k,m(W , t,x,−ρ) , (2.27)
0≤t≤1 m∈[M ]
46
where the new network g is defined by the equations
g1m(W , a) = aW
1e 1m +W x+ b
1,
y y y
g2k,m(W , t,x, a) = [(∆ek)
⊤W 2]+[g1m(W , a)]
+ − [−(∆e )⊤k W
2]+[g1m(W , a)]⊙ t+ (∆e )
⊤b2k ,
for a = ρ,−ρ.
Proof. Applying Corollary 2.5.5 with p = 1 and using the min-max inequality we obtain
y
sup (∆e )⊤z2k (W ,x+ δ) (2.28)
δ:∥δ∥1≤ρ
y
≤ inf sup ρ∥(p− q)⊤W 1∥ ⊤∞ + (p− q) (W
1x+ b1) + (∆e )⊤b2k
0≤t≤1 0≤s≤1
s.t. p = [(W 2)⊤ y(∆ek)]
+ ⊙ s (2.29)
y
q = [−(W 2)⊤(∆ek)]
+ ⊙ t.
Defining yp(s) = [(W 2)⊤(∆ek)]
+ ⊙ s and q(t) = [−(W 2)⊤ y(∆ek)]
+ ⊙ t, we have that for fixed
t it holds
y
sup ρ∥(p(s)− q(t))⊤W 1∥∞ + (p(s)− q(t))
⊤(W 1x+ b1) + (∆e )⊤b2k
0≤s≤1
{
= max max sup (p(s)− q(t))⊤(W 1(x+ ρe ) + b1
y
) + (∆e ⊤m k) b
2,
m∈[M ] 0≤s≤1
}
⊤ 1 ysup (p(s)− q(t)) (W (x− ρem) + b
1) + (∆e ⊤ 2k) b
0≤s≤1
{
y ⊤ 2 + 1 y= max max [(∆ek) W ] [W (x+ ρem) + b
1]+− q(t)⊤(W 1(x+ ρe 1 ⊤ 2m)+ b )+(∆ek) b ,
m∈[M ]
}
y y
[(∆ek)
⊤W 2]+[W 1(x− ρem) + b
1]+− q(t)⊤(W 1(x− ρem)+ b
1)+(∆e )⊤b2k
= max max{g2k,m(W , t,x, ρ), g
2
k,m(W , t,x,−ρ)}.
m∈[M ]
47
The theorem then follows after applying the inf over 0 ≤ t ≤ 1.
Notice that in the previous proof it was important to use p = 1, since the dual of the
L1 norm is the L∞ norm, which can be written as a maximum over a finite set. With a
different p, the solution for the variables s would be more challenging to find. However, for
the chosen uncertainty set we obtain an upper bound of Eq. (2.4) by applying the result
from the previous Theorem to Lemma 2.3.3. While we could include the variables t in the
minimization problem over W , we instead use fixed values t = [sign(z1(W ,x))]+ based on
the linear approximation of y(∆e )⊤k z
2(W ,x + δ), as described in Eq. (2.24). Notice that
setting specific values for t does not affect the inequalities: since the upper bound includes
the infimum over t, any 0 ≤ t ≤ 1 yields an upper bound of the robust problem. For the
specific case of the cross entropy loss function, the proposed upper bound for the min-max
robust problem is
N ( ( ))
1 ∑ ∑ max max{g2m∈[M ] (W,[sign(z1(W,x))]+,x,ρ),g2 (W,[sign(z1(W,x))]+,x,−ρ)}min log e k,m k,m .
W N
n=1 k
(2.30)
2.6 Experiments
In this section, we demonstrate the effectiveness of the proposed methods in practice. We first
introduce the experimental setup and then we compare the robustness of several defenses.
2.6.1 Experimental details
Data sets. We use 46 data sets from the UCI collection (Dua and Graff 2017), which
correspond to classification tasks with a diverse number of features that are not categorical.
For each data set we do a 80%/20% split for training/testing sets, and we further reserve 25%
of each training set for validation. In addition, we use three popular computer vision data sets,
namely the MNIST (Deng 2012a), Fashion MNIST (Xiao et al. 2017a) and CIFAR (Krizhevsky
48
et al. 2009a) data sets.
Pre-processing. All input data has been previously scaled, which facilitates the comparison
of the adversarial attacks across data sets. For the UCI data sets, each feature is standarized
using the statistics of the training set, while for the vision data sets each image channel is
normalized to be between 0 and 1, or standarized, depending on what leads to best robustness.
Attacks. We use the implementation provided by the foolbox library (Rauber et al. 2017,
2020) using the default parameters. We evaluate attacks using projected gradient descent
and fast gradient methods. More specifically, we use the following adversarial attacks:
• PGD-Lp: Attack bounded in Lp norm and found using Projected Gradient Descent.
• FGM-Lp: Attack bounded in Lp norm and found using Fast Gradient Method for
p = 1, 2 and Fast Gradient Sign Method for p = ∞.
Defenses. Our comparisons include different defenses denoted as follows:
• aRUB-Lp: Approximate Robust Upper Bound method described in section 2.4 using
the L1 or L∞ sphere as the uncertainty set.
• RUB: Robust Upper Bound method described in section 2.5 using the L1 sphere as the
uncertainty set.
• PGD-L∞: Adversarial training method in which the network is trained using attacks
that are bounded in the L∞ norm and found using Projected Gradient Descent.
• Baseline-L∞: Simple approximation method resulting from minimizing Eq. (2.9) using
the Lp sphere as the uncertainty set.
• Nominal: Standard vanilla training with no robustness (ρ = 0).
49
Architecture. We evaluate a neural network with three dense hidden layers with 200 neu-
rons in each hidden layer. For the vision data sets, we also provide results with Convolutional
Neural Networks (CNNs) in Appendix A.3. The architecture has two convolutional layers
alternated with pooling operations, and two dense layers, as in Madry et al. (2019). The
parameters of the networks were initialized with the Glorot initialization (Glorot and Bengio
2010a).
Hyperparameter Tuning. Each network and defense is trained for different learning
rates ({1, 10−1, 10−2, 10−3, 10−4, 10−5, 10−6}). For the UCI data set we use a batch size of 256
and for the vision data sets we try a batch size of 32 and 256. For the L∞ based training
methods we try all values of ρ from the set ({10−4, 10−5, 10−3, 10−2, 0.1, 0.3, 0.5, 1, 3, 5, 10}).
√
For the methods based on the L1 norm, we scale those values of ρ by a factor of m, since
√
∥x∥∞ ≤ ∥x∥ and ∥x∥ ≤ m∥x∥ for any x ∈ Rm2 1 2 . In this way we ensure that the L1
spheres and the L∞ spheres contain the same L2 spheres, allowing for a fair comparison of
all methods in terms of adversarial attacks that are bounded in the L2 norm. All networks
trained using the UCI data sets are trained for 5000 iterations, and all vision data sets are
trained for 10000 iterations. For each network, data set, batch size, and defense radius ρ, we
have verified that for at least one of the learning rates the validation accuracy converges with
the aforementioned number of training iterations. Finally, for each attack type with radius ρ,
we select on the validation set the best hyperparameters for each defense, i.e., given a data
set, an attack type and its radius ρ, the hyperparameters of a defense (network, learning
rate, batch size, normalization and defense radius ρ) are the ones that lead to the highest
adversarial robustness in the validation set. In total we trained more than 40, 000 networks
across all tested data sets and defenses.
50
2.6.2 UCI data sets
We run experiments on the 46 UCI data sets using different methods for robust training and
compare the adversarial accuracies achieved with multiple types of adversarial attacks. For
each data set we rank every training method, where the method with rank 1 corresponds
to the one with highest adversarial accuracy. The average ranks and the corresponding
95% confidence intervals are shown in Figure 2.2, where we observe a similar pattern across
all types of attacks, namely, we see that the best ranks are achieved with aRUB-L∞ and
PGD-L∞ when ρ is smaller than 10−1; next there is a small range in which PGD-L∞ does
best and finally for larger values of ρ the best rank is that of RUB. In addition, for large
values of ρ we observe better results with Baseline-L∞ than with aRUB-L∞, suggesting that
the linear approximation of the network becomes inaccurate and leads to a large change in
the loss function. We also highlight that looking at ρ = 0, it is clear that robust training
methods achieve better natural accuracy than the Nominal training method.
51
(a) (b)
(c) (d)
(e) (f)
Figure 2.2: Average rank and the corresponding 95% confidence interval for each method
across the 46 UCI data sets for adversarial attacks bounded in L2, L∞ and L1 norm, respec-
tively from top to bottom. The figures on the left use attacks based on Fast Gradient methods,
and the figures on the right use instead attacks based on Projected Gradient Descent.
52
30 14
12
12
25
10
10
20 8
8
15
6 6
10
4 4
5 2 2
0 0 0
10 5 0 5 10 15 20 0 20 40 25 0 25 50 75 100 125
Improvement (%) Improvement (%) Improvement (%)
(a) ρ = 0.01 (b) ρ = 1 (c) ρ = 10
Figure 2.3: Number of UCI data sets for which RUB-L1 improves adversarial accuracy over
PGD-L∞ by a specific percentage. Figures a), b) and c) show the corresponding plots for
PGD-L2 adversarial examples with different values of ρ.
35 30 5
30 25 4
25
20
3
20
15
15 2
10
10
1
5 5
0 0 0
10 0 10 20 30 10 5 0 5 10 15 20 60 40 20 0
Improvement (%) Improvement (%) Improvement (%)
(a) ρ = 0.001 (b) ρ = 0.1 (c) ρ = 1
Figure 2.4: Number of UCI data sets for which aRUB-L∞ improves adversarial accuracy over
PGD-L∞ by a specific percentage. Figures a), b) and c) show the corresponding plots for
PGD-L∞ adversarial examples with different values of ρ.
To better compare the performances of RUB and aRUB against PGD-L∞, we also analyze
the percentage by which each method improves adversarial accuracy across data sets. In
Figure 2.3 we show the number of data sets for which RUB improves L2 adversarial accuracy
over PGD-L∞ by a specific percentage. We observe that the improvement becomes larger
as ρ increases, and in particular, for ρ = 10 we observe that RUB only lowers adversarial
accuracy for 3 data sets, while it shows more than 15% improvement over PGD-L∞ for 8 of
the data sets. Similarly, Figure 2.4 displays the number of data sets for which aRUB-L∞
improves adversarial accuracy over PGD-L∞ by some percentage; we observe that for small
perturbations (ρ = 0.001, ρ = 0.1) aRUB seems to slightly improve over PGD-L∞, while this
last defense has a clear advantage for larger perturbations (ρ = 1).
53
Number of UCI datasets Number of UCI datasets
Number of UCI datasets Number of UCI datasets
Number of UCI datasets Number of UCI datasets
Lastly, in Table 2.2 we display the average number of batches processed per second as well
as the corresponding standard deviation for each method across the 46 UCI data sets. As
expected, we see that Nominal is the method that processes the largest number of batches per
second, and all defense methods except Baseline-L∞ are much slower than Nominal. However,
RUB, RUB-L1 and aRUB-L∞ all process more batches per second compared to PGD-L∞.
Table 2.2: Average number of batches processed per second across the 46 UCI data sets, as
well as the corresponding standard deviations.
Avg no. batches per second Standard Deviation
RUB 65.1 12.6
aRUB -L1 17.5 0.5
aRUB -L∞ 18.8 0.4
Baseline-L∞ 465 56.4
PGD -L∞ 4.5 0.1
Nominal 712.6 87.8
2.6.3 Vision Data Sets
We next show experiment results for the three vision data sets. Specifically, we compare for
different training methods their performance against adversarial attacks as well as the security
guarantees obtained from applying the upper bound from Eq. (2.30). Since the proposed
RUB method significantly increases memory requirements for inputs with large dimensions,
we compare this method against other defenses using the feed forward architecture with
three hidden layers. We do include results using a CNN architecture for all other methods
in Appendix A.3, where we also explain how to extend the theory of the RUB method for
convolutional layers with ReLU and MaxPool activation functions.
54
Performance Against Adversarial Attacks. We evaluate adversarial accuracy for all the
aforementioned methods (Nominal, RUB, aRUB-L1, aRUB-L∞, Baseline-L∞, PGD-L∞), and
we add two other state-of-the-art defenses to make our evaluation even more comprehensive.
These two methods were proposed in Wong and Kolter (2018) and Wong et al. (2020); which
we call COAP-L∞ (Convex Outer Adversarial Polytope with L∞ norm) and FGSM-L∞ (Fast
Gradient Sign Method) respectively, and are representative of state-of-the-art defenses in
terms of robustness and training computational cost, respectively. For each Lp norm we
report the minimum adversarial accuracy achieved using both Projected Gradient Descent
attacks and Fast Gradient Method attacks.
We observe that for the Fashion MNIST data set (Table 2.3), aRUB-L1 and RUB achieve
the best accuracies for small values of ρ; we then observe a small range in which PGD-L∞
does best and lastly for larger values of ρ we see that RUB takes the lead, which is similar to
the average results observed for the UCI data sets. For the MNIST data set, all defenses
achieve similar results when the input perturbations are small, with RUB showing again better
performance with larger radius ρ. In the CIFAR data set we observe a different behavior;
PGD-L∞, FGSM-L∞ and aRUB methods achieve better accuracies at various radius regimes,
although we observe that all methods perform very poorly overall as this is a notoriously
more difficult data set.
Lastly, we again observe that in all three data sets Baseline-L∞ outperforms aRUB-L∞
when ρ is large, and the robust training methods achieve a higher natural accuracy than the
accuracy resulting from standard nominal training.
55
Table 2.3: Adversarial Accuracy (%) for Fashion MNIST. For each choice of Lp norm, defense
method and noise radius ρ, we report the minimum accuracy achieved with PGD and FGM
attacks. Colored cells correspond to accuracies that are within 0.5 percentage units of the
best result in each column, which has bold font.
ρ = 0.00 0.01 0.06 0.28 2.80 8.40 14.00 28.00 56.00
RUB 89.88 89.22 88.95 86.80 60.51 55.47 55.47 55.27 49.41
aRUB-L1 90.31 90.04 89.45 85.98 63.98 29.10 18.24 15.04 9.96
aRUB-L∞ 89.18 89.06 88.75 85.98 65.43 24.18 18.24 16.02 9.96
Baseline-L∞ 89.38 89.02 88.32 85.04 48.20 29.88 28.71 26.52 18.55
PGD-L∞ 89.61 89.38 88.01 85.23 68.20 31.60 25.90 22.85 19.10
FGSM-L∞ 89.41 87.81 87.19 84.06 67.81 35.23 27.11 25.39 19.06
COAP-L∞ 89.14 88.48 87.11 82.58 30.51 20.94 19.69 19.69 19.69
Nominal 87.70 88.40 88.01 85.35 46.60 19.92 16.84 15.39 15.39
ρ = 0.00 0.01 0.06 0.28 2.80 8.40 14.00 28.00 140.00
RUB 89.84 89.80 89.73 89.41 87.93 84.18 81.72 75.70 55.51
aRUB-L1 89.02 89.02 88.98 88.71 87.34 84.57 82.34 75.86 41.21
aRUB-L∞ 89.02 89.02 89.02 88.79 87.03 82.85 79.77 71.25 30.63
Baseline-L∞ 88.83 88.20 88.20 87.85 85.74 82.11 78.09 68.28 30.59
PGD-L∞ 89.02 89.02 88.95 88.95 87.11 84.57 81.37 73.16 39.69
FGSM-L∞ 89.80 89.77 89.73 89.26 85.00 82.19 78.44 73.36 35.62
COAP-L∞ 89.22 89.22 89.22 88.67 82.58 73.01 60.47 48.59 22.19
Nominal 86.99 86.95 86.91 86.91 85.00 82.54 78.83 68.98 21.88
ρ = 0.000 0.001 0.003 0.010 0.100 0.300 0.500 1.00 5.00
RUB 89.38 89.18 88.52 86.45 65.08 56.17 56.13 55.78 37.77
aRUB-L1 90.00 89.73 89.06 86.05 65.82 28.95 11.02 10.00 10.00
aRUB-L∞ 89.77 89.22 88.71 86.68 79.80 67.93 52.81 12.15 10.00
Baseline-L∞ 89.14 88.63 86.99 85.55 56.80 30.20 29.06 27.19 19.06
PGD-L∞ 89.02 89.02 88.24 87.77 80.43 73.09 65.74 31.21 18.52
FGSM-L∞ 89.61 89.10 87.97 86.91 80.55 62.93 53.16 25.74 18.12
COAP-L∞ 89.18 88.24 86.76 85.66 67.15 23.87 17.19 17.19 16.21
Nominal 88.12 87.93 87.03 85.27 58.32 21.80 18.79 16.37 10.94
56
L∞ Attacks L1 Attacks L2 Attacks
Table 2.4: Adversarial Accuracy (%) for MNIST. For each choice of Lp norm, defense method
and noise radius ρ, we report the minimum accuracy achieved with PGD and FGM attacks.
Colored cells correspond to accuracies that are within 0.5 percentage units of the best result
in each column, which has bold font.
ρ = 0.00 0.01 0.06 0.28 2.80 8.40 14.00 28.00 56.00
RUB 97.93 97.85 97.07 97.07 79.30 66.48 62.89 55.00 38.20
aRUB-L1 98.63 98.52 97.73 97.46 80.82 57.58 55.04 46.60 29.96
aRUB-L∞ 97.97 97.97 97.81 97.81 91.17 51.80 49.53 43.40 31.13
Baseline-L∞ 97.58 97.50 97.50 97.11 72.85 64.18 61.29 52.81 34.80
PGD-L∞ 98.24 98.24 98.24 98.09 92.70 66.05 62.46 54.26 35.94
FGSM-L∞ 97.81 97.81 97.50 97.50 93.01 64.96 63.20 53.40 36.52
COAP-L∞ 98.01 97.97 97.54 94.88 54.73 31.29 31.29 28.98 28.48
Nominal 97.62 97.46 97.34 96.41 69.61 19.57 15.90 15.35 11.25
ρ = 0.00 0.01 0.06 0.28 2.80 8.40 14.00 28.00 140.00
RUB 98.01 97.97 97.93 97.89 97.03 97.03 96.25 91.84 66.95
aRUB-L1 98.28 98.28 98.28 98.28 97.89 97.46 96.52 94.49 58.20
aRUB-L∞ 98.40 98.40 98.40 98.24 98.24 97.58 96.91 93.16 51.05
Baseline-L∞ 98.05 98.05 98.05 98.01 97.58 96.52 95.47 90.16 63.83
PGD-L∞ 98.20 98.20 98.20 98.20 98.20 97.50 96.68 93.87 66.84
FGSM-L∞ 98.44 98.44 98.44 98.44 97.19 97.19 95.00 94.53 64.02
COAP-L∞ 97.81 97.81 97.81 97.81 96.99 92.50 83.12 69.30 30.47
Nominal 97.58 97.58 97.58 97.54 97.30 95.94 94.49 89.96 19.38
ρ = 0.000 0.001 0.003 0.010 0.100 0.300 0.500 1.00 5.00
RUB 98.44 98.32 97.93 97.70 86.09 68.52 67.15 61.56 27.15
aRUB-L1 98.52 98.36 98.28 98.09 86.84 61.88 59.61 54.77 21.64
aRUB-L∞ 98.71 98.59 98.36 98.36 96.95 89.96 78.16 49.69 23.24
Baseline-L∞ 98.24 97.81 97.81 97.66 84.69 65.08 63.24 59.49 24.73
PGD-L∞ 98.20 98.20 98.20 98.20 96.99 93.16 86.80 64.57 25.35
FGSM-L∞ 98.59 98.59 98.59 98.59 97.50 91.17 74.10 63.12 26.36
COAP-L∞ 98.12 98.12 98.12 96.56 90.78 37.81 31.72 32.62 24.88
Nominal 97.62 97.62 97.46 96.80 81.91 23.28 16.48 14.61 9.69
57
L∞ Attacks L1 Attacks L2 Attacks
Table 2.5: Adversarial Accuracy (%) for CIFAR. For each choice of Lp norm, defense method
and noise radius ρ, we report the minimum accuracy achieved with PGD and FGM attacks.
Colored cells correspond to accuracies that are within 0.5 percentage units of the best result
in each column, which has bold font.
ρ = 0.00 0.01 0.03 0.06 0.08 0.11 0.17 0.55 5.54
RUB 49.69 48.67 46.33 48.59 48.36 48.12 47.58 44.14 17.73
aRUB-L1 52.42 51.41 51.37 51.13 51.02 50.74 50.20 46.56 21.88
aRUB-L∞ 53.83 52.89 51.84 47.77 47.11 46.48 44.45 41.37 20.66
Baseline-L∞ 53.12 52.07 50.66 45.82 42.50 42.97 42.42 35.86 9.30
PGD-L∞ 53.91 53.32 52.27 48.83 48.63 48.48 48.09 45.12 20.16
FGSM-L∞ 53.52 52.07 50.82 49.34 47.97 47.30 47.30 44.53 24.69
COAP-L∞ 51.45 50.86 49.45 48.59 47.07 45.35 41.76 35.86 11.76
Nominal 46.02 45.78 44.84 44.10 42.89 42.15 40.43 37.11 10.59
ρ = 0.00 0.01 0.03 0.06 0.11 0.55 5.54 16.63 27.71
RUB 50.86 50.86 50.86 50.70 50.55 48.98 47.03 44.84 42.58
aRUB-L1 51.56 51.56 51.56 51.56 51.56 51.37 50.35 47.85 45.98
aRUB-L∞ 53.55 53.48 53.44 53.40 53.24 52.30 44.02 40.59 40.62
Baseline-L∞ 53.32 53.28 53.28 53.24 52.03 50.94 41.72 38.79 34.53
PGD-L∞ 52.97 52.97 52.97 52.97 52.93 52.30 47.89 45.62 43.52
FGSM-L∞ 53.71 53.71 53.67 53.59 53.55 53.20 47.62 45.39 43.36
COAP-L∞ 50.86 50.86 50.78 50.74 50.66 49.84 43.59 29.96 29.96
Nominal 46.02 46.02 46.02 45.98 45.98 45.66 43.87 39.69 36.25
ρ = 0.000 0.001 0.003 0.010 0.100 0.300 0.500 1.00 5.00
RUB 50.86 46.02 47.58 45.08 21.64 9.38 9.30 9.30 8.12
aRUB-L1 53.28 50.94 50.23 47.07 26.91 10.66 10.12 10.12 10.62
aRUB-L∞ 53.52 46.76 45.23 42.54 29.06 10.43 10.78 10.78 9.69
Baseline-L∞ 51.48 47.54 40.90 39.22 9.34 9.34 9.34 9.34 10.00
PGD-L∞ 54.10 49.34 48.05 46.33 25.31 15.90 12.93 12.89 13.16
FGSM-L∞ 54.06 50.39 43.12 40.31 30.63 13.55 10.94 10.94 11.88
COAP-L∞ 51.76 48.52 44.92 41.76 11.56 11.56 11.56 11.56 10.00
Nominal 46.72 45.31 43.44 39.18 9.92 9.92 9.92 9.92 10.00
58
L∞ Attacks L1 Attacks L2 Attacks
Table 2.6: Average number of batches processed per second across the 3 vision data sets, as
well as the corresponding standard deviations.
Avg no. batches per second Standard Deviation
RUB 4.8 0.2
aRUB -L1 53.1 2.4
aRUB -L∞ 56.4 0.2
Baseline -L∞ 343.5 33.4
PGD -L∞ 3.7 0.2
FGSM -L∞ 86 4.1
COAP -L∞ 12.2 0.3
Nominal 473 45.2
In Table 2.6 we present the average number of batches processed per second as well as the
corresponding standard deviation for each method across the 3 vision data sets. We observe
that FGSM-L∞ is the fastest defense after Baseline-L∞, followed by the aRUB methods. We
highlight that contrary to the results obtained with the UCI data sets, the RUB method is
slower than the aRUB defenses. This is attributed to the increased memory requirements for
RUB with high dimensional inputs, which prevented full parallelization during training.
Security Guarantees against L1 Norm Bounded Attacks. Finally, we use the upper
bound of the adversarial loss derived in section 2.5 to find lower bounds for the adversarial
accuracy with respect to attacks bounded in the L1 norm by ρ. Specifically, the RUB-L1
defense finds an upper bound for sup Lδ:∥δ∥ ≤ρ zk (W ,x+ δ)− zLy (W ,x+ δ), and therefore1
when this upper bound is nonpositive for all k ∈ [K] we know that the network zL(W , ·)
correctly classifies all adversarial attacks x + δ for which ∥δ∥1 ≤ ρ. In other words, the
nonpositivity of this upper bound gives the network a security guarantee against the attacks
considered. The percentage of images in the testing set for which this guarantee exists is
therefore a lower bound of the adversarial accuracy achieved by network. In Tables 2.7, 2.8
and 2.9 we report the lower bounds for each method by selecting the hyperparameters that
lead to the best lower bound in the validation set. In particular, notice that for a given
choice of radius ρ and defense method, the selected network might not be the same as the
59
one selected in the previous results for adversarial accuracy.
We observe that, as expected, for all three data sets, CIFAR, Fashion MNIST and MNIST,
the best security guarantees are the ones for the RUB method. While these results are only
lower bounds for the adversarial accuracy and we cannot claim a better accuracy for RUB
than for the rest of the methods, the lower bound for the RUB method shows that this method
indeed performs very well against L1 attacks bounded by large values of ρ. For instance,
in Table 2.7 we can see that for the Fashion MNIST data set the RUB method guarantees
86.02% adversarial accuracy (less than 5% decrease from the best natural accuracy) against
attacks with L1 norm smaller or equal to ρ = 2.8. Similarly, in Table 2.8 we observe that
for this same attacks RUB has at least 97.11% adversarial accuracy (less than 1% decrease
over natural accuracy) for the MNIST data set. And finally, for the CIFAR data set, we can
see in Table 2.9 that RUB achieves 45.66% adversarial accuracy (less than 5% decrease over
natural accuracy) against attacks whose L1 norm is upper bounded by ρ = 5.54.
60
Table 2.7: Fashion MNIST: Lower bound of adversarial accuracy with uncertainty bounded
in L1 norm by ρ.
ρ = 0.00 0.01 0.06 0.28 2.80 8.40 14.00 28.00
RUB 90.27 90.16 90.16 89.49 86.02 80.55 76.21 66.95
aRUB-L1 90.51 90.51 90.39 89.73 85.59 73.98 69.61 47.54
aRUB-L∞ 89.96 89.92 89.84 88.75 76.60 21.37 10.16 9.84
Baseline-L∞ 89.49 89.38 89.38 87.50 77.42 46.41 20.04 15.23
PGD-L∞ 89.92 89.92 89.92 87.85 78.75 40.35 19.22 15.55
Nominal 88.59 88.55 88.48 87.93 77.50 40.98 15.04 9.88
Table 2.8: MNIST: Lower bound of adversarial accuracy with uncertainty bounded in L1
norm by ρ.
ρ = 0.00 0.01 0.06 0.28 2.80 8.40 14.00 28.00
RUB 98.01 97.93 97.93 97.46 97.11 93.67 89.77 74.96
aRUB-L1 98.52 98.48 98.48 98.20 96.48 89.96 80.86 26.05
aRUB-L∞ 98.40 98.40 98.28 97.54 94.69 68.52 16.56 12.19
Baseline-L∞ 98.05 98.05 98.05 97.81 95.23 57.23 20.74 10.35
PGD-L∞ 98.48 98.48 98.44 98.36 95.55 63.59 15.43 11.60
Nominal 97.73 97.73 97.58 97.54 93.87 44.80 10.31 10.23
61
Table 2.9: CIFAR: Lower bound of adversarial accuracy with uncertainty bounded in L1
norm by ρ.
ρ = 0.00 0.01 0.06 0.55 5.54 16.63 27.71 55.43
RUB 50.62 50.62 49.96 48.91 45.66 37.81 32.85 23.28
aRUB-L1 53.40 53.16 51.99 51.33 43.36 33.83 27.19 15.16
aRUB-L∞ 53.67 53.52 53.20 47.07 37.30 14.26 9.65 9.65
Baseline-L∞ 53.32 53.24 52.66 46.09 36.99 13.71 9.61 9.61
PGD-L∞ 54.88 54.80 53.98 49.26 37.42 27.30 14.02 12.46
Nominal 46.88 46.88 46.76 44.69 37.19 14.88 9.02 8.95
2.7 Conclusions
We developed two new methods for adversarial training of neural networks, both of which
provide an upper bound of the adversarial loss by considering the whole network at once
instead of applying convex relaxations and propagating bounds for each layer separately
as in previous works. First, we found an empirical upper bound by incorporating the first
order approximation of the network’s output layer. This method does not provide security
guarantees against adversarial attacks but it performs very well across a variety of data sets
when the uncertainty set is small and it stands out for its simplicity. Second, by extending
state-of-the-art tools from RO to non-convex and non-concave functions, we were able to
construct a provable upper bound of the adversarial loss. Experimental results show that this
method has a performance edge for larger uncertainty sets, and importantly, this method can
certify the non-existence of adversarial attacks bounded in L1 norm. The two proposed upper
bounds are in closed-form and can be effectively minimized with backpropagation. Lastly, we
provide evidence that adding robustness can improve the natural accuracy of neural networks
for classification problems with tabular or vision data.
62
For future work we are interested in extending the RUB approach for other types of norms
as well as understanding how the tightness of the proposed upper bounds change across
layers in order to facilitate further improvements. Adversarial robustness is crucial in the
development of more secure machine learning systems, and we hope that our work will inspire
further research in this important area.
63
64
Chapter 3
Holistic Deep Learning
3.1 Introduction
Neural networks have become increasingly popular due to their remarkable achievements
in computer vision and natural language processing. Their generalization power has been
demonstrated in wide-ranging applications, from classifying photos to recommending products.
However, neural networks face challenges in real-world applications for high-stakes decision-
making, including healthcare, policy-making, and autonomous driving.
First, many standard neural networks are not robust – they can be easily fooled by
natural or artificial noise in the input data (Szegedy et al. 2014), making them vulnerable to
perturbations that may arise in real-world applications. Moreover, neural networks, similar
to other machine learning models, often suffer from instability during the training process –
different train-validation splits could generate models with very different performance (May
et al. 2010, Xu and Goodacre 2018). This reduces the policymakers’ trust in these models
and hinders post-hoc interpretations. Another critical difficulty is that neural networks are
not sparse – the high number of parameters utilized for neural networks prevents efficient
computation and storage (Thompson et al. 2020). Most neural networks have millions of
non-zero parameters to be stored and accessed for evaluation. This is problematic in many
65
decision-making settings with limitations or restrictions on hardware capabilities. Reducing
the number of parameters could make them more applicable in a broader range of scenarios
(Changpinyo et al. 2017, Narang et al. 2017).
The questions around improving robustness, stability, and sparsity metrics have all been
previously studied in the neural network literature. However, they have been almost exclusively
studied in isolation, with a limited understanding of the tradeoffs between these desired
qualities and their effect on natural accuracy (accuracy with respect to the unperturbed
data samples). This paper aims to simultaneously address all these objectives through a
novel comprehensive methodology named Holistic Deep Learning (HDL). In particular, HDL
carefully combines state-of-the-art techniques that address these individual challenges and
demonstrates their collective efficacy through extensive experiments on diverse data sets.
Our findings provide a promising pathway toward developing efficient and reliable machine
learning models across many dimensions for real-world applications.
Specifically, our contributions are as follows:
1. We design HDL, a novel framework that jointly optimizes for neural network robustness
(adversarial accuracy), stability (worst accuracy across train-val splits), and sparsity
(parameters with value zero) metrics by appropriately modifying the objective function.
2. Through extensive ablation experiments and SHAP value analysis (Lundberg and Lee
2017) across 45 UCI data sets (Dua and Graff 2017) and 3 image data sets (MNIST
(Deng 2012b), Fashion MNIST (Xiao et al. 2017b) and CIFAR10 (Krizhevsky et al.
2009b)), we analyze the individual performance of each metric as well as the interactions
and trade-offs between them. We corroborate that imposing robustness, stability, and
sparsity improves the corresponding metrics across all data sets. In addition, we show
that:
• imposing stability and sparsity further improves robustness,
• imposing stability and robustness further improves sparsity,
66
• imposing robustness further improves stability,
• imposing stability and robustness further improves natural accuracy.
The effect of sparsity on natural accuracy is more complex and highly varies across data
sets. However, we show that it is often possible to simultaneously improve robustness,
stability, and sparsity without sacrificing performance on natural accuracy.
3. We propose a prescriptive approach to provide recommendations on selecting the
appropriate loss function depending on the practitioner’s objective. In particular,
simultaneously imposing robustness, stability and sparsity in the loss function leads to
the best results when jointly optimizing for all the metrics.
The paper is organized as follows: Section 3.2 outlines the current literature of robust,
sparse, and stable methods; Section 3.3 describes the Holistic Deep Learning framework, and
Section 3.4 shows the results of the computational experiments.
3.2 Related Work
3.2.1 Robust Neural Networks
Many state-of-the-art deep neural networks are highly vulnerable to small perturbations in the
input data (Szegedy et al. 2014). Adversarial robustness evaluates a neural network’s resistance
against these altered inputs intentionally designed to worsen the network’s performance
(Goodfellow et al. 2014, Carlini and Wagner 2017, Madry et al. 2017).
Multiple methods have been developed in recent years to enhance the adversarial robustness
of neural networks. One of the most popular heuristics is augmenting the data set during
training with adversarial examples (Madry et al. 2017, Goodfellow et al. 2014). Others include
neuron randomization (Prakash et al. 2018, Xie et al. 2017), input space projections (Lamb
et al. 2018, Kabilan et al. 2018, Ilyas et al. 2017) and regularization (Bertsimas et al. 2021a,
67
Ross and Doshi-Velez 2017, Hein and Andriushchenko 2017, Yan et al. 2018). A less common
but more theoretically rigorous approach is to minimize a provable upper bound of the loss
achieved with adversarial examples (Raghunathan et al. 2018b, Singh et al. 2018, Zhang
et al. 2018, Weng et al. 2018, Dvijotham et al. 2018b, Lecuyer et al. 2019, Cohen et al. 2019,
Anderson et al. 2020, Bertsimas et al. 2021a).
While these methods successfully improve the network’s robustness, the extent to which
they do so often depends on the data set, the network size, and the magnitude of the input
perturbations. In particular, heuristic methods generally work well for small perturbations,
while the upper bound methods yield better results when the input noise is larger (Bertsimas
et al. 2021a, Athalye et al. 2018). However, there is a trade-off between effectiveness and
efficiency. The methods providing the strongest adversarial robustness are often computa-
tionally demanding, making it challenging to implement them for large data sets or complex
network architectures.
3.2.2 Sparse Neural Networks
In machine learning, sparse models make predictions based on a limited number of parameters.
Sparsity is often desirable, as it may save memory, enhance model interpretability, and reduce
overfitting (Bertsimas et al. 2020).
There are two typical approaches to sparsity in deep learning. The first one, train-then-
sparsify, consists of removing unnecessary neurons or connections after training the network,
sometimes followed by retraining (Janowsky 1989, LeCun et al. 1989). This approach has been
widely investigated, and several schemes exist to choose which connections to prune (Hoefler
et al. 2021). Han et al. (2015), for example, propose to prune the connections with the
smallest weights. Other methods include formulating a convex optimization problem (Aghasi
et al. 2020), removing filters for which the total absolute sum is low (Li et al. 2016), and
eliminating channels that have limited impact on the network’s discriminatory ability (Zhuang
et al. 2018). The second approach, sparsify-during-training, is achieved by learning a sparse
68
architecture while training the network. Multiple methodologies exist (Bellec et al. 2017,
Mocanu et al. 2017, Mostafa and Wang 2019), including the method to approximate the ℓ0
norm with continuous functions and add a regularization term to the loss function (Louizos
et al. 2017, Savarese et al. 2020). We refer the reader to Gale et al. (2019) and Hoefler et al.
(2021) for more comprehensive surveys on sparsity.
3.2.3 Stable Neural Networks
The stochastic nature of data samples can lead to instability and high dependence of
machine learning models on the specific train-validation split. This can negatively impact the
interpretability of the resulting model and its ability to make reliable predictions (Bertsimas
and Paskov 2020), a key factor to establishing trust in any algorithm.
The sensitivity of machine learning models to the choice of training split has mostly been
studied through the lens of cross-validation and distributionally robust optimization. Cross-
validation can be used to measure the variability from the selection of training split but at a
significant increase in computational cost (Krogh and Vedelsby 1994, Hastie et al. 2001) that
is often intractable for deep learning settings. Distributionally robust optimization has been
used to quantify the worst-case generalization error in the presence of shifts in distribution
or regime (Staib and Jegelka 2019, Goldwasser et al. 2020, Sagawa et al. 2019), but it often
requires pre-defined groups over the training data and expensive group annotations for each
data sample to avoid overly pessimistic uncertain distributions (Sagawa et al. 2019, Liu et al.
2021). A different approach has been studied by Bertsimas and Paskov (2020) and Bertsimas
et al. (2022a), who instead optimize over the worst training set of fixed size without making
any probabilistic assumptions. Although their method was presented in the context of linear
and tree-based models, their framework also applies to neural networks.
69
3.3 The Holistic Deep Learning Approach
3.3.1 The HDL Framework
We introduce the HDL framework for a classification problem with cross-entropy loss using
the same notation as in the previous chapter. We illustrate our approach over a fully-
connected neural network for simplicity of notation, but the framework remains the same for
convolutional neural networks.
The nominal DL approach is to minimize the loss of the network zL described in Eq.
(2.2), which can be written as:
( )
∑N1 ∑
K
yn ⊤
min log e(∆e ) z
L(W,x)
k , (3.1)
W N
n=1 k=1
In our HDL framework, we propose instead to minimize the following optimization problem:
|W|
∑
min λ σ (βsj)+ θ +
s,θ,W ︸︷︷︸
j=1 Stability
︸ ︷︷ ︸
Sparsity (3.2)
 
Robustness +
N ︷ ︸︸ ︷
1∑ ∑
K
 yn ⊤ L yn ⊤ L
log e(∆e ) z (W⊙σ(βs),x)+ρ∥∇x(∆ek ) z (W ⊙ σ(βs),x)∥

k 1 − θ ,
a ︸︷︷︸
n=1 k=1 Stability
where ⊙ corresponds to the element-wise product, σ is the standard sigmoid function, zL(·,x)
was defined in (2.2), λ (resp. ρ) is the regularization coefficient corresponding to the sparsity
(resp. robustness) loss component, and a is the size of the data subsets used for the stability
requirement (see Section 3.2.3). We observe that robustness adds a term to the output, while
stability and sparsity add new parameters (θ and s respectively) to be optimized. This loss
function allows us to simultaneously train robust, sparse, and stable feed-forward neural
networks at scale. In the next section, we provide more details about each metric.
70
3.3.2 Robustness
This section describes our method to introduce the robust component into neural network
training. Since our ultimate goal is to incorporate the sparsity, robustness, and stability of
neural networks together in a tractable way, we avoid algorithms that improve robustness
at the expense of a significant increase in the training time or the algorithm’s complexity
(for instance, the algorithms that perform training with adversarial examples usually require
significantly longer times to optimally find such examples at each gradient descent iteration
(Madry et al. 2017, Bertsimas et al. 2021a)). We follow the approach from Section 2.4 of
using a linear approximation of the neural network to estimate the robust objective. This
approach is simple to implement, produces good adversarial accuracy, and does not require
the extensive training time of other algorithms.
We then minimize the loss function in Eq. (2.14). As shown in Section 5.7, for small
ρ this approach achieves competitive results with state-of-the-art methods while requiring
significantly less computational time across various tabular and image data sets. However,
we emphasize that the methodology developed in this paper could also be performed with
other methods for robust training, like adversarial training or upper bound minimization,
which might be more appropriate for large uncertainty sets.
3.3.3 Sparsity
In this work, we use the specific retraining procedure proposed by Savarese et al. (2020), which
deterministically approximates the ℓ0 regularization utilizing a sequence of sigmoid functions
and adding them as a penalty term in the loss function. Notably, the implementation is
easily compatible with our robustness and stability requirements, since this methodology
relies on a penalty term added in the loss function. Therefore, we can use gradient descent
to simultaneously optimize the objective function comprising the robustness, stability, and
sparsity penalties.
71
Adding ℓ0 regularization explicitly penalizes the number of non-zero weights in the model
to induce sparsity. However, the ℓ0-norm induces a priori a non-convex and non-differentiable
loss function R(W), as follows:
( )
N |W|
1 ∑ ( ) ∑
R(W) = L y , zLn (W ,xn) + λ∥W∥0, ∥W∥0 = I [wj ≠ 0] , (3.3)
N
n=1 j=1
where |W | is the number of parameters, wj is the jth coordinate of W , λ is the regularization
weight and L a loss function (e.g., cross-entropy loss).
The goal is to relax the discrete nature of the ℓ0 penalty to preserve an efficient continuous
optimization while allowing for exact zeros in the neural network weights. To do this, Savarese
et al. (2020) propose to first parameterize the weights wj = H(sj) where H(·) is the Heaviside
step function, and then approximate the non-differentiable step function with the sigmoid
function: σ(βsj) → H(sj) when β → ∞. Therefore, β is the hardness parameter that controls
how close the approximation is to the ℓ0 regularization, and the final loss function can be
written as:
( )
N |W|
1 ∑ ( ) ∑
R(W) ≈ L y Ln, z (W ⊙ σ(βs),xn) + λ σ(βsj). (3.4)
N
n=1 j=1
To achieve a sparse network, we use this loss function (3.4) over multiple training rounds
to gradually reach a sparse initialization before training the final sparse neural network. To
obtain each initialization before a new training round, we start with our initialized auxiliary
sparsity s0 and hardness β = 1 parameters. Over the T training iterations, we gradually
increase β until it reaches a maximum value β̄ when the training procedure is completed
with sparsity sT . Then, we take s′0 = min(β̄sT , s0) to generate the new initialization for
the next round of training. This minimization function essentially keeps the information
of the suppressed weights, i.e., σ(βsj) ≈ 0, while reverting those not suppressed to their
starting position. This process is completed over multiple rounds to find better and sparser
72
initializations for the neural network.
We implement the methodology as suggested by Savarese et al. (2020). In the results
section, we measure sparsity in terms of the percentage of neuron connections (weights) set
to 0.
3.3.4 Stability
Using the measure of stability defined in Section 3.2.3, we apply the methodology developed in
Bertsimas et al. (2022a) for building stable neural networks. At a high level, this corresponds
to constructing a model that is robust to the specific subset of data used to train it. One
way to think about this is to view the training data set as a sample from the true data
distribution and then require the model to be robust to the specific sample. Considering the
partition of the data into training/validation sets as a sampling mechanism from this true
data distribution (each split choice gives a different training set), we desire to build models
that are robust to every partition.
To achieve this, we first associate each observation (xn, yn) with a binary variable zn,
n ∈ [N ] that indicates whether or not (xn, yn) is part of the training set. We then choose the
network’s parameters as to minimize the worst-case loss over all possible allocations of these
zn’s, resulting in a model that is explicitly built to do well not just over one training set, but
over all possible training sets. We start from the same minimization problem introduced in
Section 3.3.1, i.e.,
∑N1
min L(y Ln, z (W ,xn)).
W N
n=1
To obtain network stability we require the model to be robust to every training set of fixed
73
size a, which results in the following optimization problem:
1∑
N
minmax znL(yn, z
L(W ,xn)),
W z∈Z a
n=1
{ } (3.5)
∑N
where Z = z : zn = a, zn ∈ {0, 1}, n ∈ [N ] .
n=1
The value of a indicates the desired proportion between the size of the training and validation
sets. For example, by setting a = 0.7N we recover the typical 70/30 training/validation
split. Since the inner maximization problem is linear in z, the problem is equivalent to
optimizing over the convex hull of Z. This implies that the binary constraints on zn can be
relaxed to 0 ≤ zn ≤ 1, and the inner maximization problem becomes linear in the variables
zn. Computing its dual problem we obtain that the value of the inner maximization problem
is equivalent to:
1∑
N
min θ + un subject to θ + un ≥ L(y Ln, z (W ,xn)), un ≥ 0, n ∈ [N ].
θ,un a
n=1
Therefore, the stability problem becomes
∑N1
min θ + un subject to θ + un ≥ L(yn, zL(W ,xn)), un ≥ 0, n ∈ [N ].
W, a
θ,u ∈R n=1n
Note that the variables un can be solved in closed form as un = [L(yn, zL(W ,xn)) − θ]+.
The final minimization problem with stability then becomes:
∑N1 [ ]+
min θ + L(y , zLn (W ,xn))− θ ,
W,θ a
n=1
which is now an unconstrained problem that can be solved with standard gradient descent
optimization algorithms.
74
3.4 Experiments
This section presents extensive computational experiments comparing the nominal DL ap-
proach (abbreviated DL) with 7 other models resulting from our holistic methodology. We
showcase the merit of our HDL framework and investigate the influence of each studied com-
ponent – robustness, sparsity, and stability – on the overall performance across 4 evaluation
metrics:
• Natural accuracy : Average accuracy on the testing set across the 10 different train-
validation splits with respect to the original input data.
• Adversarial robustness: Average adversarial accuracy on the testing set across the
10 different train-validation splits with respect to adversarial attacks resulting from
perturbations of the original input data. We consider only attacks bounded in the L∞
norm by some radius ρ using Projected Gradient Descent as in Madry et al. (2017).
• Stability : Worst accuracy on the testing set across the 10 different train-validation
splits with respect to the original input data.
• Sparsity : Percentage of network parameters with value 0.
The exact optimization problem solved for each model results from combinations of the loss
functions described in the previous section, and the specific formulations can be found in
Table 3.1 above.
Data We computed experiments on classification tasks with 45 UCI data sets from the UCI
Machine Learning Repository (Dua and Graff 2017). These data sets give various problem
sizes and difficulties to form a representative sample of real-world tabular problems, with the
largest data set having 245,056 observations and the highest number of features being 856.
We also benchmarked our methodologies on three image data sets: MNIST, Fashion-MNIST,
and CIFAR10.
75
Method Optimization Problem
( )
∑N ∑ yDL min 1 log e(∆e n )⊤zL(W,xk n)W N n=1 k
( )
∑ ∑ y y
Robust N n ⊤ L n ⊤ Lmin 1W n=1 log k e
(∆e ) z (W,x)+ρ∥∇x(∆e ) z (W,x)∥k k 1
N
( )
∑ ∑ y
Stable N n ⊤ LminW,θ θ + 1 [log e(∆e ) z (W,x )k nn=1 k − θ]
+
a
( )
∑
Sparse |W|
∑N ∑ yn ⊤ L
minW,s λ j=1 σ(βsj) +
1
n=1 log k e
(∆e ) z (W⊙σ(βs),x
k n
)
N
Robust
∑|W|
min 1
( W,s
λ j=1 σ(βsj) + ×N
+ )∑N ∑ y(∆e n ylog e )
⊤zL(W⊙σ(βs),x)+ρ∥∇x(∆e
n )⊤zL(W,x)∥
k k 1
n=1 k
Sparse
Stable
( )
∑
+ |W|
∑ ∑ y
min λ σ(βs ) + θ+ 1
N n ⊤ L
[log e(∆e ) z (W⊙σ(βs),xn) +W,θ,s j kj=1 n=1 k − θ]a
Sparse
Stable
( )
∑N ∑ y+ 1 (∆e n )⊤zL
y
min θ + [log e (W,x)+ρ∥∇x(∆e
n )⊤zL(W,x)∥1 +
W,θ k ka n=1 k
− θ]
Robust
∑|W|
HDL minW,θ,s λ j=1 σ(βsj) + θ +
1×
∑ aN ∑ y y(∆e n )⊤zL(W⊙σ(βs),x)+ρ∥∇ n ⊤ L
k x
(∆e ) z (W,x)∥1 +k
n=1[log ke − θ]
Table 3.1: Loss functions used for DL and all methods in the HDL framework.
Implementation Our code is written in Python 3.8 (Van Rossum and Drake 2009a).
Neural networks are coded using Tensorflow v1 (Abadi et al. 2015). We trained each model
on a system equipped with an Intel Xeon Gold 6248 processor, which included 4 CPU cores
and one Nvidia Volta V100 GPU.
Training Methodology For each data set, we used 20% of the data to obtain a fixed
test set, and we randomly generated 10 different 80%-20% train-validation splits with the
remaining data points. The same 10 train-validation partitions were used across all methods
for a fair comparison. Given a choice of model and evaluation metric, we selected the
hyperparameters that led to the best average performance in the validation set for the metric
in question. We then reported the average performance of the chosen parameter configuration
on the test set with respect to the given metric. For all evaluation metrics, the average
76
performance is computed over the 10 training-validation splits initially generated.
Neural network architectures For our experiments on UCI data sets, we used a feed-
forward neural network architecture with 2 hidden layers, each with 128 neurons and ReLU
activations. For our experiments on the image data sets, we used a convolutional neural
network with the AlexNet architecture (Krizhevsky et al. 2012). We used the Glorot uniform
initialization (Glorot and Bengio 2010b) for the network weights W and 0 as initialization
for the sparsity variable s0. The choice of architecture and initialization was made to re-
flect typical settings utilized in the machine learning community (e.g. Madry et al. (2017),
Savarese et al. (2020), Bertsimas et al. (2021a)) while maintaining moderate size networks
that facilitate exhaustive experimentation across dozens of data sets. Importantly, the same
architecture is used across all methods been evaluated.
Hyperparameter search For each model, we cross-validated the values of the following
hyperparameters:
• Adam learning rate: {1e−2, 1e−3} for UCI data sets, {1e−3, 1e−4} for image data sets.
• Number of epochs: 150 for UCI data sets, 50 for vision data sets.
• Batch Size: 32 for UCI data sets, 64 for image data sets.
• Robustness radius ρ : {1e−1, 1e−2, 1e−3, 1e−4, 1e−5}.
• Sparsity regularization parameter λ: {1e−6, 1e−8, 1e−10}.
• Sparsity temperature parameter β̄ : {200, 1000}.
• Stability parameter a: {0.7, 0.8, 0.9, 1}.
77
3.4.1 UCI Data sets
We split the 45 UCI data sets into 6 roughly even-sized groups based on their difficulty level.
Specifically, we consider the ranges 0%-70%, 70%-80%, 80%-90%, 90%-95%, 95%-98% and
98%-100% of natural accuracy achieved by the nominal DL approach. We first investigate
the performance of the HDL framework with respect to a single evaluation metric. In Figure
3.1, we evaluate all methods in terms of natural accuracy, adversarial accuracy with ρ = 0.1,
stability, and sparsity.
Figures 3.1a and 3.1c show that those data sets for which the nominal approach achieves
accuracy in the 70%-90% range are the ones that benefit the most from the HDL framework
(especially the Robust, Stable, and Stable+Robust models) when the evaluation metric
corresponds to natural accuracy or stability. For the data sets with natural accuracy above
90%, none of the models significantly improve over the natural accuracy or stability achieved
by the nominal DL model. However, for data sets in the 98-100% range sparsity slightly
improves accuracy and robustness slightly helps for stability.
Figure 3.1b shows the adversarial robustness achieved with perturbation parameter ρ = 0.1.
We see a substantial adversarial robustness improvement in all methods that included the
robust component. Moreover, combining robustness with stability and/or sparsity leads to
higher adversarial accuracy than that achieved with robustness alone. In terms of parameter
sparsity, Figure 3.1d shows that all models with imposed sparsity (Sparse, Stable+Sparse,
Robust+Sparse, and HDL) have a much lower percentage of nonzero parameters compared to
the models without it. And importantly, both robustness and stability help achieve sparser
models when combined with sparsity.
78
40
8 DL
Robust 35
Stable
6
Sparse 30
Robust+Sparse
4 25Stable+Sparse
Stable+Robust 20
2 HDL
15
0 10
2 5
0
4
0-70 70-80 80-90 90-95 95-98 98-100 0-70 70-80 80-90 90-95 95-98 98-100
Natural Accuracy (%) Natural accuracy (%)
(a) Natural accuracy. (b) Adversarial accuracy with ρ = 0.1.
70
2
1 60
0 50
1 40
2 30
3 20
4
10
5
0
0-70 70-80 80-90 90-95 95-98 98-100 0-70 70-80 80-90 90-95 95-98 98-100
Natural accuracy (%) Natural accuracy (%)
(c) Stability. (d) Sparsity.
Figure 3.1: Evaluation of the different methods depending on the natural accuracy of the
nominal DL approach on the UCI data sets.
Since we are also interested in models that are simultaneously accurate, sparse, robust, and
stable, we consider a multi-objective metric using the rank of each method (ranks start at 1,
with lower ranks corresponding to better performance). For each method, we use the natural
accuracy, adversarial accuracy, stability, and sparsity achieved in the validation set respectively
to rank all its hyperparameter configurations 4 times. Then for each hyperparameter
configuration, we compute the average rank across the 4 metrics and select the configuration
that leads to the method’s highest average rank. Finally, we rank the 8 selected models (for
the 8 different methods) with respect to each evaluation metric on the testing set to obtain
their out-of-sample average rank.
79
Worst Acc Improvement (%) Natural Accuracy Improvement (%)
Parameter Sparsity (%)
Adv Acc Improvement (%)
As shown in Figure 3.2, all 7 models from the HDL framework outperform the nominal
DL approach with respect to this holistic metric. Moreover, the HDL model typically achieves
the best results across data set complexities.
5.5
DL 5.0
Robust
Stable
Sparse 4.5
Robust+Sparse
Stable+Sparse 4.0
Robust+Stable
HDL
3.5
3.0
0-70 70-80 80-90 90-95 95-98 98-100
Natural Accuracy (%)
Figure 3.2: Average multi-objective rank.
3.4.2 Image Data Sets
In this section, we evaluate all methods using the MNIST, Fashion-MNIST, and CIFAR10 data
sets. For each method, we select the parameters based on the multi-objective metric utilized
for the UCI data sets in the validation set and report the performance across metrics. In
Tables 3.2 and 3.3, we see that for MNIST and Fashion-MNIST, the HDL model outperforms
the DL model for all objectives. In particular, HDL achieves higher accuracy using only
around 70% of the parameters. The results for the CIFAR10 data set (Table 3.4) are a
bit different since adding sparsity slightly hurts natural accuracy. However, the accuracy
achieved by the HDL model is comparable to those achieved by the non-sparse models and
the number of parameters is reduced by 47%.
80
Average Multi-Objective Rank
Method DL Rob. Stab. Sparse Rob. + Stab. + Stab. + HDL
Sparse Sparse Rob.
Avg. Accuracy 91.8 92.0 91.9 91.4 92.1 91.4 92.0 92.1
Adv. Acc. ρ = 0.01 78.7 81.1 78.3 80.8 86.9 80.2 86.8 87.1
Stability 91.5 91.7 91.8 91.3 91.9 91.2 91.7 91.8
Sparsity 0 0 0 36.2 26.6 48.4 0 26.8
Table 3.2: Results for the Fashion-MNIST data set. For each method, the parameters with
the highest average rank in the validation set were chosen.
Method DL Rob. Stab. Sparse Rob. + Stab. + Stab. + HDL
Sparse Sparse Rob.
Avg. Accuracy 99.2 99.3 99.2 99.1 99.2 99.2 99.3 99.3
Adv. Acc. ρ = 0.1 49.6 78.4 51.5 42.6 74.7 27.7 79.5 76.0
Stability 99.1 99.2 99.2 99.1 99.1 99.0 99.3 99.2
Sparsity 0 0 0 16.1 27.9 31.7 0 28.0
Table 3.3: Results for the MNIST data set. For each method, the parameters with the highest
average rank in the validation set were chosen.
Method DL Rob. Stab. Sparse Rob. + Stab. + Stab. + HDL
Sparse Sparse Rob.
Avg. Accuracy 70.1 70.1 70.1 69.8 69.2 69.3 69.8 69.3
Adv. Acc. ρ = 0.01 26.7 27.1 26.6 27.3 27.4 27.7 29.1 30.6
Stability 69.7 69.7 69.8 69.3 68.9 68.5 69.4 68.8
Sparsity 0 0 0 28.7 47.2 47.8 0 47.8
Table 3.4: Results for the CIFAR10 data set. For each method, the parameters with the
highest average rank in the validation set were chosen.
81
3.4.3 Computational Times
Since modifying the loss function often affects the training computational time, we quantify
the slowdown effect for all the methods in the HDL framework. Specifically, for each of the
45 UCI data sets as well as the 3 image data sets introduced in the previous section, we
calculate how many times slower each method is when compared to the nominal DL approach
in terms of batches per second as well as number of iterations needed. The average slowdown
factors are shown in Table 3.5.
We observe that robustness and sparsity both decrease the number of batches per second
by approximately a factor of 3, while stability preserves the same speed as the DL approach.
In addition, since we used 5 training rounds for the methods incorporating sparsity, they
require 5 times as many training iterations as the other methods. On average, the HDL
method is only 16 times slower, and methods that don’t optimize for sparsity only increase
the computational time by less than 3 times.
Method Batches/sec No. Iterations Total Slowdown
Slowdown Factor Increase Factor factor
Robust 2.7 1 2.7
Stable 1.0 1 1.0
Sparse 1.2 5 5.9
Robust+Sparse 3.2 5 16.1
Stable+Sparse 1.1 5 5.5
Stable+Robust 2.7 1 2.7
HDL 3.2 5 16.2
Table 3.5: Average slowdown factors of computational time with respect to the nominal DL
method.
3.4.4 SHAP Values
To gain a deeper understanding of the interplay between individual loss components (robust-
ness, stability, sparsity) and the metrics we measure, we employ the SHAP values method
(Lundberg and Lee 2017). We compute the SHAP values for each UCI data set and average the
82
results over three data set categories: Low Accuracy (< 80%), Medium Accuracy (80%-95%),
and High Accuracy (> 95%), with 15 data sets each. The results are shown in Figure 3.3.
SHAP Value for Accuracy SHAP Value for Adversarial Accuracy
Low Acc. 0.4% -0.8% 1.1% Low Acc. 22.3% 4.9% 2.2%
Medium Acc. 0.1% -0.4% 0.3% Medium Acc. 22.8% 3.2% 0.1%
High Acc. -0.3% -0.7% 0.2% High Acc. 25.1% 5.3% 0.3%
robust sparse stable robust sparse stable
(a) Accuracy. (b) Adversarial accuracy with ρ = 0.1.
SHAP Value for Worst Case Accuracy SHAP Value for Reduction in % Nonzero Entries
Low Acc. 0.9% -1.1% 1.0% Low Acc. 2.9% 46.7% 3.7%
Medium Acc. -0.1% -0.4% 0.7% Medium Acc. 1.7% 43.4% 3.5%
High Acc. -1.1% -2.1% 0.4% High Acc. 2.5% 42.6% 4.1%
robust sparse stable robust sparse stable
(c) Stability. (d) Sparsity.
Figure 3.3: SHAP values on various metrics across different UCI data set categories. Blue/red
indicates that the feature has a positive/negative SHAP value on a specific category of UCI
data set.
Our findings confirm that robustness, stability, and sparsity techniques improve the
corresponding metrics across all data set categories. More intriguingly, these techniques also
positively impact metrics beyond their intended purposes. For example, sparsity and stability
enhance adversarial accuracy, while robustness and stability yield sparser networks. This
indicates that combining techniques does not necessarily result in any adverse effects and
that it is feasible to attain networks with good performance across all metrics. Additionally,
83
the benefits of these techniques are more pronounced in data sets with low initial accuracy,
particularly for the accuracy and stability metrics. Lastly, we observe that sparsity generally
hurts accuracy and stability, although this highly varies across data sets, as observed in
Section 3.4.1.
3.4.5 Prescriptive Approach
In this section, we develop a prescriptive approach that allows users to choose a training
loss function based on the specific objective they wish to maximize, which can be a single
evaluation metric or a weighted combination of several metrics. Depending on the data set
characteristics and the performance scores of the nominal DL model, we propose a tree-based
recommendation model to suggest the most suitable HDL loss function for optimal results
with respect to the desired objective.
We train our models using an Optimal Policy Tree (OPT) algorithm (Amram et al. 2022),
which uses observational data of the form (xi, yi, zi). While it is possible to include variability
and complexity indicators of the data set as part of xi (Lorena et al. 2019), given the extensive
and diverse range of data sets in consideration, we choose to capture complexity using the
performance metrics achieved by the nominal DL approach on the corresponding classification
tasks. In our case, each observation (i.e., data set) i encompasses:
• Data set features x 8i ∈ R : number of features, number of target classes, nominal DL
accuracy, nominal DL adversarial accuracy with ρ = 0.001, nominal DL adversarial
accuracy with ρ = 0.01, nominal DL adversarial accuracy with ρ = 0.1, nominal DL
stability, nominal DL worst case accuracy.
• Prescriptions zi ∈ 1, . . . , 8: DL, Robust, Stable, Sparse, Robust+Sparse, Stable+Sparse,
Stable+Robust, HDL.
• Outcomes yi ∈ R8, which represent the performance improvement of each method
compared to the nominal DL model with respect to the metric set by the user.
84
Figure 3.4: Optimal policy tree for max-
imizing natural accuracy.
Figure 3.5: Optimal policy tree for maximizing
robustness (ρ = 0.1).
Our prescriptive task is to find the optimal policy that, given the information x of a data set,
prescribes the method z leading to the best metric score y. We randomly split the 45 UCI
data sets into a training set (40 data sets) and a test set (5 data sets from different difficulty
levels). We cross-validated the optimal tree depth and complexity using the training set.
Figures 3.4 and 3.5 represent the OPTs obtained for maximizing two different objectives:
natural accuracy and adversarial accuracy. The tree in Figure 3.4 highlights that the Stable
and Stable+Robust methods are the best suited to obtain high natural accuracy, with the
former being preferred when the nominal DL approach has very low adversarial accuracy
(ρ = 0.1). To maximize robustness, the tree in Figure 3.5 prescribes HDL, Stable+Robust, or
Robust+Sparse depending on the adversarial accuracy achieved by the nominal DL method.
In addition, we obtained single-leaf trees when maximizing the stability and sparsity
objectives. The recommended methods are Stable+Robust for optimizing stability and
Stable+Sparse for maximizing sparsity. Lastly, HDL was always the prescribed method when
the desired objective was the equally weighted average of all 4 previous metrics.
Finally, Table 3.6 reports the out-of-sample performance of these prescription trees on the
5 UCI data sets from the test set (cnae-9, hill-valley, libras-movement, magic-gamma, and
thyroid-ann). We emphasize that the performance of the prescribed methods is higher than
that of the nominal DL approach across all objectives and data sets, and it often matches
the performance of the best method.
85
Test Data Sets Objective Value
Objective Method
cnae-9 hill- libras- magic- thyroid-
valley move gamma ann
DL 93.70 47.21 79.44 87.11 98.86
Nat. Acc. Prescribed 94.07 53.61 80.00 87.28 99.05
Optimal 94.07 53.61 82.50 87.28 99.05
DL 0.00 7.54 0.00 15.07 48.42
Robustness (ρ = 0.1) Prescribed 3.80 36.39 2.50 64.59 91.79
Optimal 3.80 40.16 4.72 64.59 91.79
DL 91.20 43.44 75.00 86.65 98.28
Stability Prescribed 93.06 45.08 81.94 87.01 98.54
Optimal 93.06 49.18 81.94 87.01 98.81
DL 0.00 0.00 0.00 0.00 0.00
Sparsity Prescribed 73.43 34.89 71.00 57.52 53.22
Optimal 73.43 41.94 71.00 57.52 53.22
(Nat. Acc. +Robustness DL 46.25 24.55 38.61 47.21 61.39
+Stability Prescribed 57.75 39.35 52.76 58.06 73.03
+Sparsity)/4 Optimal 62.07 40.66 52.82 59.62 73.03
Table 3.6: Performance of prescription trees on the testing set.
3.4.6 Significance Analysis
To further validate the improvements achieved by the HDL framework, we analyze the
significance of our results with one-sided Welch’s t-tests with different variance groups.
Specifically, for each evaluation metric and each leaf of the corresponding optimal prescriptive
tree, we consider all the UCI and image data sets that fall within that leaf. For those data
sets, we test the null hypothesis that the average performance achieved by the prescribed
86
method is equal to that one achieved by the nominal DL approach, with alternative hypothesis
corresponding to the average performance achieved by the prescribed method being higher.
As shown in Table 3.7, all p-values are below the 0.05 significance level, concluding that the
prescribed methods have statistically significant higher performance than the nominal DL
approach across all performance metrics.
Objective Leaf Prescription p-value
Stable 0.025
Nat. Acc.
Stable+Robust 0.0462
HDL 1.508e−6
Robustness (ρ = 0.1) Stable+Robust 0.001
Robust+Sparse 1.727e−5
Stability Stable+Robust 0.0161
Sparsity Stable+Sparse 1.188e−38
Nat. Acc.+Robustness+Stability+Sparsity HDL 4.472e−26
4
Table 3.7: Significance results for the null hypothesis that the average performance achieved
by the prescribed method is equal to that one achieved by the nominal DL approach, with
alternative hypothesis corresponding to the average performance achieved by the prescribed
method being higher.
3.5 Conclusions
This paper presents a unifying methodology to obtain deep learning models that are accurate,
robust, stable, and sparse by appropriately modifying the objective function to be minimized.
Across multiple computational experiments, we show how these 4 metrics interact and
demonstrate that we can often train models that simultaneously improve adversarial accuracy,
worst-case accuracy, and parameter sparsity without sacrificing natural accuracy. Finally, we
provide prescriptive trees that use general features of the data set (e.g. dimension, number
of target classes, nominal accuracy, etc.) to recommend which method is more appropriate
87
depending on the desired objective to be maximized, and we show that the improvements
achieved by the prescribed methods are statistically significant.
For future research we aim to explore how HDL performs with respect to other data set
indicators like variability and complexity, as this could offer further guidance on which method
to select. We would also like to test our framework in real world applications; for instance in
the area of healthcare, where trustworthy models are crucial and memory constraints are
often required for practical use. Consequently, improving the interpretability of the HDL
framework would be essential to make it more suitable for such applications. We deem
adversarial robustness, stability and sparsity as critical qualities in the development of more
reliable machine learning algorithms, and we hope this work will motivate further research in
this important field.
88
Chapter 4
Large Language Models for Patient Flow
Predictions
4.1 Introduction
Increasing data availability from Electronic Medical Records (EMR) combined with advances
in machine learning (ML) generates new opportunities for enhancing decision-making within
healthcare institutions. For instance, anticipating short-term discharges informs about
bed availability and can facilitate resource utilization and personalized delivery of care.
Furthermore, detecting patients with high mortality or ICU risk can alert the medical team
and call their attention to those who need it the most.
Despite a growing literature on data-driven approaches for healthcare, deploying these
models in practice remains a difficult task. Not only there is a need for close relationships
between academics and medical teams, but also there are several data challenges that can
make such collaborations difficult. For instance, unstructured healthcare data like images
and notes are often difficult to access for model development due to privacy concerns and
high computational costs. Structured healthcare data, or Electronic medical records (EMRs),
are often available for most digitalized healthcare systems, but the tables are generally
89
disorganized, not standardized across institutions and very scarce for small healthcare
systems.
These challenges highlight two significant limitations in the existing approaches to handling
EMRs and tabular data in general: 1) they require labor-intensive data processing that is
unique to each institution, and 2) they ignore contextual information such as column headers
and meta content descriptions which could be used for data augmentation.
In contrast to the standard tabular approaches, language is a very flexible data modality
that can easily represent information about different data points without imposing any
structural similarity between them. Furthermore, recent developments on off-the-shelf large
language models (LLMs) based on the Transformer architecture (Vaswani et al. 2017) offer
state-of-the-art performances on a wide range of language tasks, including translation,
sentence completion, and question answering. These pre-trained models are often developed
with very large and diverse data sets, allowing them to exploit prior knowledge and make
accurate predictions with very few new training samples. Some LLMs are trained to target
specific domain knowledge and technical challenges, making them particularly useful in the
corresponding applications. For example, LLMs fine-tuned on clinical notes and biomedical
corpora such as ClinicalBERT (Alsentzer et al. 2019), BioBERT(Lee et al. 2019), and
BioGPT(Luo et al. 2022) offer substantial advantages for medical learning tasks, and LLMs
that specifically target long token sequences unveil opportunities for dealing with data that
contains long texts (Beltagy et al. 2020, Li et al. 2022).
These successful language models offer a natural solution to represent and process con-
textual information from tabular structures. Standard machine learning models only utilize
the explicit table contents, disregarding all accompanying context like column headers and
table descriptions. Incorporating these metadata into the model via language could give
meaning to the data values within the broader context. For example, a numerical value might
be very relevant for disease prediction if it represents a person’s age but not so much if it
corresponds to the ward census. Moreover, LLMs could save significant manual labor for
90
selecting, encoding, and imputing data (Sweeney 2017, Geneviève et al. 2019, Nan et al. 2022).
Missing data, in particular, is a challenging and frequent problem that requires attentive
processing and expert knowledge. Current predictive models either exclude such attributes,
potentially ignoring rare-occurring but valuable data or impute missing values with very few
recorded instances. Additional processing challenges arise when units of measurement or data
types are inconsistent across tabular data systems. By leveraging language, these difficulties
could be addressed, for instance, by simply writing that particular values are missing and
converting inconsistent values into text.
Previous works using LLMs have shown the potential of using natural language processing
(NLP) models to systematically and efficiently process tabular data in the form of language
(Herzig et al. 2020, Yin et al. 2020, Padhi et al. 2021, Somepalli et al. 2021). However, they
have mainly relied on training fixed BERT-based models that are not flexible to changes in
tabular structures. These works have mostly assumed that encoding data using LLMs leads to
better performance than traditional data processing methods, but concrete evidence has not
been provided. Other works augment tabular data with external unstructured data (Harari
and Katz 2022) but do not leverage contextual data from the original tabular source. In
addition, language models are considered sensitive to their input representations (Miyajiwala
et al. 2022), and most previous works do not thoroughly investigate how the choice of language
affects their results. Hegselmann et al. (2023) do investigate different language variants, but
in the context of zero-shot and few-shot classification as opposed to feature representation.
Thus, guidance on the best way to construct the language data remains in need.
In this chapter we first present a successful real-world implementation of machine learning
models at a large healthcare system, and then we build and evaluate a new feature extraction
methodology that leverages LLMs to improve and generalize these models. The main
contributions of this work are as follows:
1. We develop and implement machine learning models that predict several inpatient
outcomes for a large hospital network. We show that after utilizing our user-friendly
91
software the hospitals observe significant reduction in length of stay and millions of
dollars in financial benefits.
2. We create TabText, a systematic framework that leverages language to extract contextual
information from tabular structures. Our experiments demonstrate that augmenting
electronic medical records with our TabText representations can significantly improve
the AUC score of patient flow predictions, especially when trained with small-size
datasets.
3. We investigate the impact of several language syntactic parsing schemes on the perfor-
mance of TabText representations and demonstrate that TabText enables the generation
of high-performing predictive models for patient outcomes with minimal data processing.
4.2 Patient Flow Predictions
Access to accurate predictions of patients’ outcomes can enhance decision-making within
healthcare institutions. In collaboration with a large hospital network, we develop machine
learning models that predict short-term and long-term patient outcomes such us discharge,
intensive care unit transfers and end-of-stay mortality. We implement an automated pipeline
that extracts data and updates predictions every morning, as well as user-friendly software
and a color-coded alert system to communicate these patient-level predictions to clinical
teams. Since its deployment, over 200 doctors, nurses, and case managers across seven
hospitals have been using the tool in their daily patient review process.
Collaboration HHC is the largest healthcare network in Connecticut; it contains 7 hospitals
ranging from Hartford Hospital (HH), one of the largest teaching hospitals in New England
(867 beds), to small and medium sized (130–520 beds) community / teaching hospitals. We
have been collaborating with HHC since 2020, starting with daily data extraction from their
EMR system. In 2021, we constituted physician pilot users who iteratively collected feedback
92
and helped improve the models and the software tool. We extended models for the other 6
hospitals at HHC and deployed them in production between May 2022 and January 2023.
4.2.1 Data and Feature Extraction
First, we build EMR data extracts containing medical records of all inpatients from HHC
over a four-year period. The data set includes tables for demographics, patient status (e.g.,
oxygen device), clinical measurements (e.g., blood pressure), laboratory results, diagnoses,
orders, procedures, notes, and others. We create a feature space where each row represents
each patient day (in total 1,375,215 patient days). Given these raw tabular data files, we
perform several data processing steps to obtain the final feature space, as described below.
String Parsing Some columns in string format require string parsing to extract numerical
features as continuous variables. For instance, the normal ranges of laboratory tests in forms
such as “50–70” are replaced with two columns: one with a value of 50 for the lower bound
and another with a value of 70 for the upper bound.
Categorical Encoding Categorical columns (e.g., department, mobility level, the reason
for visit) must be converted to ordered numerical levels (consecutive integers) using label-
encoding or binary categories using one-hot encoding. Due to the large number of categories,
we use label encoding for all categorical variables.
Feature Engineering To better capture the clinical information, we compute various
auxiliary variables:
1) Current conditions extracted from records (e.g., whether the patient is in ICU or IV)
2) Normal indicators (whether the clinical measurement is within the normal/critical
range) instead of the ranges themselves.
3) Counts (e.g., number of days in ICU, number of attending physicians).
93
4) Pending procedures/results (time until surgery, whether MRI is pending, etc.).
5) Historical record linked to the patient (e.g., number of days since the previous admission
and length of the previous stay).
6) Non-patient-specific operational variables (e.g., day of the week, ward census and
utilization, hospital admission volume on the previous day).
Missing Data Imputation Since the raw data comes from a hospital system, it contains
many missing values. We impute most missing entries with 0, except for a few cases. From
communications with the hospital, we impute certain variables with prior knowledge of the
meaning of missingness (e.g., missing Do Not Resuscitate (DNR) means the patient did not
sign a DNR form). For some auxiliary variables, we apply some rules, such as imputing
counts with 0 if no record exists and imputing the number of days since previous admission
with a large number (e.g., 9999) if no previous admission exists.
4.2.2 Machine Learning Models
Prediction Tasks We consider several binary classification tasks related to the length of
stay, ICU, and mortality for each inpatient and on each day in the hospital. Two discharge-
related outcomes are whether patients are discharged or not within the next 24 hours (resp.
48 hours). Four ICU-related outcomes include whether the patient will enter (resp. leave) the
intensive care unit (ICU) for patients currently not in the ICU (resp. in the ICU) within the
next 24 hours (resp. 48 hours). Two short-term expiration outcomes concern whether each
patient will die in the next 24 hours (resp. 48 hours). One end-of-stay mortality outcome
indicates whether patients die or not at the end of their stay.
Models We evaluate a variety of machine learning models to make predictions, including
Optimal Classification Trees (Bertsimas and Dunn 2017), sparse classification (Bertsimas
et al. 2021c), XGBoost, LightGBM (Ke et al. 2017), and Tabnet (Arik and Pfister 2021).
94
With the highest performance, XGBoost (multi-class and binary class) classification models
are trained for each prediction task for each hospital and tuned with hyper-parameters
using a validation set (chronologically splitter). Furthermore, we ensure that all our models
are well-calibrated, by using the isotonic regression method (Zadrozny and Elkan 2002) to
calibrate the models on the first half of the testing set and assessing the final calibration on
the second half.
Predictive Analytics for Decision-Making It can be difficult to grasp the implications
of raw probability scores and use them efficiently for decision-making. To turn these predictive
analytics into a decision support tool that is sustainably used by practitioners, we complement
the predictions with a color-coded alert system. We send green alerts for patients who are ready
for short-term discharge (probability of 24hr or 48hr discharge is above certain thresholds).
On the other hand, we send red alerts to warn about patients who have a high risk or are
exacerbating (mortality risk or the increase of mortality risk from the previous day reaches
certain thresholds).
Model Evaluation All models achieve high out-of-sample AUC (75.7%–92.5%) across all
prediction tasks and are well calibrated for all seven hospitals. After threshold tuning and
discussions with doctors, we select to send a green alert when the 24-hr discharge probability
exceeds 0.25 or the 48-hr probability exceeds 0.4, which gives 0.621 precision and 0.746 recall.
On the other hand, we raise a red alert when the mortality risk exceeds 0.2 or when its
absolute change compared with the previous day exceeds 0.1, which gives 0.477 precision and
0.705 recall.
4.2.3 Results
To evaluate the empirical impact of the tool, we consider a treatment group of 15 HHC units
that used our tool at varying degrees between 2022 and 2023, and a control group of 12 units
that had not yet fully incorporated the tool as of April 15, 2023. To estimate the effect of
95
Figure 4.1: Empirical Analysis for Treatment Effect on Length of Stay. All the units in the
control and treatment groups are medicine or cardiology units offering general level of care.
our tool, we use a Difference-in-Differences (DiD) technique (Abadie 2005, Bertrand et al.
2004) and compare the average change in LOS among patients discharged in the treatment
group to that in the control group. We control for similar population fixed effects and time
non-stationarity effects, as we cover units of the same level of care and specialty, and the
same January 16 - April 15 period over the past three years. We assume that the difference
in LOS over time would have been the same between the two groups if the tool had not
been used (parallel trend assumption). We use this assumption to impute the counterfactual
average LOS for the treatment group, had there been no treatment (light green dashes in
Figure 4.1).
The LOS of the control group showed a steady increase after 2020, rising from 4.97 in
2021 to 5.38 in 2022 and eventually reaching 5.83 in 2023. Between 2021 and 2022, the
treatment group’s LOS increased from 4.76 to 5.07, which was in line with the parallel
trend but slightly lower, potentially due to the pilot’s partial treatment effect. After full
deployment, the treatment group’s LOS dropped to 4.99 in 2023, while the control group’s
LOS continued to rise from 4.96 to 5.85. The difference between the parallel counterfactual
and actual treatment group’s LOS resulted in an estimated benefit of reducing the average
96
LOS by 0.63 days per patient.
By reducing the average LOS by 0.63 days among patients in the 15 treatment units
(49,424 patients annually), we can save 31,137.12 patient days. If all beds are backfilled,
this would make room for an additional 6,239.9 patients per year, leading to a projected
total annual contribution margin increase of $67,365,60.4. Alternatively, under a no backfill
scenario, at an average direct cost for a medical/surgical inpatient of $1,661 per patient
day, the LOS reduction would be translated into estimated annual savings of $51,718,76.
Therefore, in practice, HHC is projected to obtain annual financial benefits in the range of
$51–$67 million, with some beds backfilled and others not.
To support these observations, we conduct a DiD regression analysis with variations in
treatment times (Callaway and Sant’Anna 2021, Goodman-Bacon 2021) in Appendix Section
C.3, which confirms a significant reduction in average LOS (and its logarithm) of similar
magnitude (see Table C.4). In addition, we observe a significant reduction in the time between
the green alert and the discharge order placed by physicians, supporting the hypothesis that
the reduction in LOS in the treated units is partially due to physicians better anticipating
the administrative process associated with discharges.
The main strength of our empirical validation is its multi-center nature, unlike the simple
before-and-after evaluation of prior work from the literature (Bertsimas et al. 2021b). We also
control for time trends and seasonality (time fixed effects) and unit/hospital heterogeneity.
However, the main limitation comes from the staged roll-out design over which we had no
control and which can be lead to estimation biases (Baker et al. 2022). In addition, a small
number of physicians in the control units had access and used the tool. We considered these
units as control nonetheless, which could lead to a conservative estimation of the effect.
97
4.3 TabText
As observed in the previous section, traditional machine learning approaches using tabular
data typically require thorough data cleaning before data is input into the models. Standard
model development requires a series of data pre-processing steps, including merging raw data
tables, parsing string columns, encoding categorical variables, constructing features, and
imputing missing data, as described in Section 4.2.1. In fact, in our collaboration with HHC
it took approximately one year of joint effort by machine learning researchers and hospital
specialists to obtain, clean, and process the data.
TabText is a new feature extraction methodology to represent contextual information from
tabular sources, which can replace data cleaning techniques or serve as a method for data
augmentation. We process tabular data by creating a text representation for each data sample.
This text contains the column attribute with its corresponding value and potentially other
available contextual information. We then use this text as input for a finetuned pre-trained
model that generates TabText embeddings of a fixed dimension. Finally, we augment the
tabular features with these TabText embeddings to train any standard machine learning
model for downstream prediction tasks.
The overall TabText framework can be visualized in Figure 4.2.
98
Tabular Format Language Format
Demographics
Age Height … BMI The following is the demographics data for 
this patient: age is high, height is normal, … , We want to predict health risks.
65 5’8’’ 21 
… BMI is normal. The following is the demographics data (high) (normal) (normal) for this patient: age is high, height is 
normal, … , BMI is normal.
Patient status The following is the current status for 
Level of The following is the current status for this this patient: service is cardiology, 
Service O2 Device … Care patient: service is cardiology, oxygen device 
oxygen device is ventilator, ... , level of 
is ventilator, ... , level of care is ICU. care is ICU.
Cardiology Ventilator … ICU The following is the labs for this patient: 
glucose is low, ... .
Labs …
The following is the vitals for this 
Platelet Glucose … WBC patient: temperature is very high, ... , 
The following is the labs for this patient: heart rate is high, blood pressure is 
65 glucose is low, ... . high.
NA (low) … NA
Vitals
Blood 
Temperature Heart Rate …
Pressure The following is the vitals for this patient: Large Language Model 
104 105 130 temperature is very high, ... ,heart rate is (e.g., Clinical-Longformer)
…
(very high) (high) (high) high, blood pressure is high.
Tabular Tabular Tabular Tabular 
… TabTextDemographics Patient Status Labs Vitals 
Data Data Data Data Embeddings
Input
Linear Models SVM, Clustering,… Tree-based models Neural Networks
Clinical or Operational (e.g., XGBoost)
Task Modeling
(e.g., Classification or Regression)
Figure 4.2: End-to-end TabText framework.
4.3.1 Methodology
As part of our methodology, we need to answer three main questions: 1) which LLM we
are using, 2) how we are constructing the language data, and 3) if we are fine-tuning the
pretrained model or not.
To address each one of those questions we use a data set from a large teaching hospital
over a three-month period, where each data point represents a patient day. There are 160
columns of different patient attributes on demographics, patient status, vital signs, laboratory
results, diagnoses, treatments, and other information. The summary of the tables utilized
99
…
…
can be found in Table 4.1.
Table Table Meta Information Example Columns
#
1 Lab values Platelet, Sodium
2 Chart measurements Respiratory rate, oxygen concentration
3 Counting statistics Number of medications, number of
orders
4 Current condition Oxygen device, is in ICU
5 Historical patient record Previous admission, previous length of
stay
6 Non-patient-specific data Day of the week, ward census
Table 4.1: Summary of tabular data, which contains different aspects of a patient’s admission
stay from patient’s high-level demographics to precise lab measurements.
LLM Selection We consider two different Transformer models, BioGPT and Clinical-
Longformer (Li et al. 2022), both of which were pre-trained with MIMIC-III clinical notes
(Johnson et al. 2016). Following the TabText framework, we convert the tables into simple
text: for each row, the cell from column “attribute” with value X is transformed into “attribute:
X” and the texts from all columns are concatenated into a single sentence with the comma
character. We next create TabText embeddings and finally train gradient-boosted tree models.
We use 60, 000 data samples for training and validation, and 10, 000 for testing. Figure 4.3a
shows the boxplots for the out-of-sample AUC over 10 random 75%-25% train-validation
splits for each task and each model. Both NLP models achieve similar performance across
tasks, and we therefore choose the Clinical-Longformer model, as it allows for input text of
larger size.
100
Language Construction The versatility of language creates a challenge for consistency,
as multiple textual expressions can convey the same information. Moreover, tabular data in
healthcare is often split across multiple tabular sources (e.g., vitals table, medications table),
some of which include information only for a particular group of patients. This results in
even more possibilities for textual representation.
The TabText framework creates a single paragraph for each data sample (e.g., for each
patient day) as follows: we first create a sentence for each column in each table. Next, for
each table, we concatenate contextual information and the sentences of its columns using
the colon (“:”) and comma (“,”) characters, respectively. We then merge the text from all
tables into a single paragraph using the period (“.”) character. While the exact punctuation
doesn’t significantly impact BERT-based transformers (Ek et al. 2020), the exact text chosen
to build each sentence might have a larger impact on the final embedding. We therefore
investigate different ways to construct sentences for each column attribute.
Descriptiveness: We consider whether or not to use descriptive language to construct text
sentences. Specifically, consider a cell from column “attribute” that has value “X”. If the
column is non-binary, we consider the following options:
• Non-Descriptive Sentence: “attribute: X”;
• Descriptive Sentence: “attribute is X”.
For binary columns, we consider the verb associated with the specific attribute. For instance,
if the column attribute is associated with the verb “to have” we consider
• Non-Descriptive Sentence: “has X: yes” or “has X: no”;
• Descriptive Sentence: “has X” or “does not have X”.
Missing Values: When the value for a column “attribute” is missing, we consider two options,
to explicitly mention in the text that this information is not available (“attribute is missing”),
101
or to simply skip this column when building the text representation.
Numerical Data: Transformer models often struggle to represent language with numerical
data (Gorishniy et al. 2022). Therefore, we also consider whether or not to replace numerical
values with text. For replacement, we compute the average (AVG) and standard deviation
(SD) of the corresponding column with respect to the training data. We then replace a given
cell value X as follows:
• “very low” if X < AVG − 2SD;
• “low” if AVG − 2SD ≤ X < AVG − SD;
• “normal” if AVG − SD ≤ X < AVG + SD;
• “high” if AVG + SD ≤ X < AVG + 2SD;
• “very high” if AVG + 2SD < X.
Including Metadata: We investigate the added value of including metadata as part of the text
representation. This corresponds to descriptions of table content (e.g., “This table contains
information about the medications administered to this patient”) or the prediction task of
interest (e.g., “We want to predict mortality risk”).
For each possible sentence configuration, we use default values of the Clinical-Longformer
model to obtain TabText embeddings that are given as input to a gradient-boosted tree model.
For this small experiment, we utilize 63 data features corresponding to laboratory results.
We use 60, 000 data samples for training and validation, and 10, 000 for testing. In Figure
4.3b, the Language Construction results show the boxplot for the rank achieved with each
configuration across tasks, where lower numbers correspond to better ranking. We choose
the sentence configuration with the lowest median ranking; specifically, we use descriptive
102
language, omit missing values from the text, replace numerical values with text, and include
metadata.
Fine-Tuning Although Clinical-Longformer was pre-trained with large language data
sets, we can further improve its performance with a few more training iterations using
our training data. Specifically, we convert our training data into language following the
sentence configuration selected in Section 4.3.1, and we use it to fine-tune Clinical-Longformer
following the original BERT training methodology, which includes self-supervised masked
word prediction. We fine-tune for 3 epochs and with the default values for all hyperparameters.
We then generate embedings that are given as input to a gradient-boosted tree model, using
60, 000 data samples for training and validation, and 10, 000 for testing. We show in Figure
4.3 the boxplots for the out-of-sample AUC over 10 random 75%-25% train-validation splits
for each task. We see that fine-tuning the model with our local data slightly improves
performance for eight out of the nine classification tasks of interest.
103
a) Large Language 
LLM
Model Selection
Clinical BioGPT
Longformer
b) Language 
Descriptive
Construction
Yes No
Include Include 
Missing Missing
Yes No Yes No
Replace Replace Replace Replace 
Numbers Numbers Numbers Numbers
Yes No Yes No Yes No Yes No
Add Add Add Add Add Add Add Add 
Metadata Metadata Metadata Metadata Metadata Metadata Metadata Metadata
Yes No Yes No Yes No Yes No Yes No Yes No Yes No Yes No
c) Fine-Tuning Fine-Tune
Yes No
(a) LLM Selection. (b) Language Construction. (c) Fine-Tuning.
Figure 4.3: Overview of our overall methodology. We start with the selection of an LLM.
Figure (a) shows that BioGPT and Clinical Longformer achieve similar results, and we
therefore choose Clinical Longformer, as it allows for input texts of larger sizes. Notice that
the specific LLM can flexibly be replaced as novel models become available. Then, we look for
the best language representations of the original patient data. In Figure (b) we observe the
boxplots of the ranks for different sentence configurations, which were tested across different
prediction tasks (lower rank is better). We select the configuration with the lowest median
ranking; specifically, we use descriptive language, omit missing values, replace numerical
values with text, and include metadata. Lastly, we fine-tune the LLM using this sentence
configuration, as it leads to better performance as shown in Figure (c).
104
4.4 TabText Results
This section presents extensive computational experiments evaluating the performance of our
TabText framework. First, we show how our pipeline can quickly generate machine-learning
models with competitive performance without any data cleaning by leveraging the flexibility
of language. We then demonstrate with pre-processed data that augmenting standard tabular
representations with our TabText embeddings can increase out-of-sample AUC by up to 6%,
with the largest improvements observed for the most challenging predictions.
Data: For the final experiments we consider a large data set from the same teaching hospital
used in the previous section, with inpatient data for the four years following the three-month
period used in Section 4.3.1. we summarize the number of data points for each prediction
task in Table 4.2.
Prediction Task Training Testing
Discharge 24 hr 572,964 265,917
Discharge 48 hr 572,964 265,917
Enter ICU 24 hr 385,132 180,075
Leave ICU 24 hr 73,013 34,669
Enter ICU 48 hr 292,659 138,947
Leave ICU 48 hr 68,472 33,011
Expire 24 hr 572,964 265,917
Expire 48 hr 572,964 265,917
Mortality 572,964 265,917
Table 4.2: Data sizes (number of patient days) for training and testing sets across the nine
healthcare classification tasks.
Text Encoder: We first convert the input training data from tabular to textual format as
105
described previously in Section 4.3.1. We use the sentence configuration that led to the highest
average AUCs (i.e., skipping sentences for missing values, replacing numbers with text, using
descriptive language, and adding metadata). Then, we use the fine-tuned Clinical-Longformer
model to extract language embeddings of size 768.
Training Methodology: For each prediction task, we compare two approaches: our TabText
framework (see Figure 4.2) and the standard Tabular approach in which only the tabular
data is given as input to the machine learning model. We use gradient-boosted tree models
in all experiments performed. For all reported results, the average performance is computed
over 10 random 75%-25% train-validation splits (identical 10 splits across all experiments)
for a fair comparison. The optimal model is selected using a hyperparameter grid search (see
details in the appendix) based on its performance on the validation set. The hyperparameters
that we grid-searched to obtain the optimal XGBoost model are:
• Number of estimators: {100, 200, 300},
• Maximum depth: {3, 5, 7},
• Learning rate: {0.05, 0.1, 0.3},
• L2 regularization parameter: {1e−2, 1e−3, 1e−4, 1e−5, 0}.
Implementation: All our code is written in Python 3.8.2. We trained all models using one
Intel Xeon Platinum 8260 or Intel Xeon Gold 6248 CPU and GPU. We conducted all of our
predictive experiments using the XGBoost (Chen and Guestrin 2016) library from Python.
The Clinical-Longformer model is directly accessed from HuggingFace.
4.4.1 High Performance with Minimal Pre-Processing
The TabText framework can be leveraged to replace heavy data cleaning by simply creating a
text representation for each data sample using the information as it appears in the raw data
106
tables. In particular, columns that require data cleaning to be converted to appropriate data
types can be simply transformed into text. For example, the sentence corresponding to a
numerical column for a sedation score with the value “-4 → deep sedation” can be written as
“sedation score is -4 → deep sedation”, as opposed to parsing the original string into a numeric
value of -4 as part of the traditional pre-processing steps. Therefore, TabText representations
enable us to quickly build baseline machine learning models utilizing the tabular data in its
raw form.
We predict the same nine classification tasks described in Section 4.2.2 using the raw
tables without data cleaning. Only minimal data preprocessing was required, including
constructing the meta information of the tables and categorizing columns for different
language representations, which is estimated to have taken only a couple of hours of manual
work. We then followed our TabText pipeline to train a gradient-boosted tree model for
each classification task. As shown in Table 4.3, the baseline TabText models with minimally
processed data already achieve high out-of-sample AUC performance, where the average AUCs
across 10 random 75%-25% train-validation splits are close or above 0.8 for all prediction
tasks except for Enter ICU 48 hr, which is a notoriously difficult classification task (Na et al.
2023).
Prediction Task TabText Baseline AUC
Discharge 24 hr 0.803
Discharge 48 hr 0.790
Enter ICU 24 hr 0.801
Leave ICU 24 hr 0.839
Enter ICU 48 hr 0.757
Leave ICU 48 hr 0.835
Expire 24 hr 0.943
Expire 48 hr 0.933
Mortality 0.895
Table 4.3: Out-of-sample average AUCs achieved by baseline TabText models with minimally
processed data and across 10 random train-validation splits. All models are highly accurate,
reaching practically implementable benchmarks in hospital systems.
107
4.4.2 Enhanced Performance with Contextual Representation
We next process the data following the same cleaning steps as in Section 4.2.1 and feed into
the TabText Framework from Figure 4.2. We perform experiments on the same data and
classification tasks as in Section 4.4.1 but using the cleaned data this time. The results
obtained using the standard Tabular approach and our TabText framework are shown in
Table 4.4. The average AUCs across 10 random 75%-25% train-validation splits for the Enter
ICU 48 hr and Discharge 48 hr prediction task are improved by an additive increment of
1.2%–1.4%. We also see a substantial but smaller benefit for Mortality risk prediction. For
the remaining tasks, Tabular and TabText achieve similar performance with differences in
average AUC smaller than 0.25%. We also notice in Figure 4.4 that the largest TabText
benefits occur for the classification tasks with the lowest Tabular performance (high variability
and low AUCs), while practically no effect was observed for the tasks with stable Tabular
results (low variability and high AUCs).
4.4.3 Larger Benefits for Harder Predictions
To better understand the regimes in which TabText provides the largest improvements in
AUC performance, we repeat this experiment using smaller and larger training data sets. For
each prediction task, we consider the original training data size as well as smaller data sizes
(ranging from 2000, 3000, 5000, 10000, 25000, and 50000 patient days). We plot in Figure 4.5a
(resp. Figure 4.5b) the average (resp. worst-case) TabText AUC improvement percentage
across 10 random 75%-25% train-validation splits, where the x-axis corresponds to the average
(resp. worst-case) AUC of the standard Tabular approach and the y-axis quantifies the relative
percentage improvement on average (resp. worst-case) AUC achieved with TabText. Each
scatter point represents the result of a prediction task (denoted by legends) on one of the
7 different data subsets. As in Section 4.4.2, we observe larger improvements on the more
difficult prediction tasks with Tabular AUCs below 85%. On easier prediction tasks, where
108
Tabular TabText
0.873
0.850 0.883
0.870
0.848 0.880
0.868
0.845 0.878
0.865
0.843 0.875
0.863
0.840 0.873
0.860
0.838 0.870
0.858
0.835 0.868
0.855
0.833 0.865
Discharge 24 hr Discharge 48 hr Leave ICU 24 hr
0.880
0.875
0.877 0.812
0.872
0.875 0.810
0.870
0.872 0.807
0.867
0.870 0.805
0.865
0.867 0.802
0.862
0.865 0.800
0.860
0.862 0.797
0.857
0.795
Leave ICU 48 hr Enter ICU 24 hr Enter ICU 48 hr
0.983 0.925
0.973
0.980 0.922
0.970
0.978 0.920
0.968
0.975 0.917
0.965
0.973 0.915
0.963
0.970 0.912
0.960
0.968 0.910
0.958
0.965 0.907
0.955
Expire 24 hr Expire 48 hr Mortality
Figure 4.4: Boxplots for the out-of-sample AUCs across 10 random train-validation splits
using Tabular vs. TabText models. We see that the largest TabText benefits occur for
the classification tasks with high variability and low AUCs, while practically no effect was
observed for the tasks with low variability an10d9high AUCs.
Out-of-Sample AUC
Tabular models already achieve AUCs over 90%, the benefit of TabText is near or below
zero. When the Tabular AUCs are less than 78%, Tabtext brings a positive improvement
on all results, including several instances of improvement over 5–6%. This suggests more
potential benefits of augmenting tabular models with TabText representations for tasks with
low Tabular performance.
7 Discharge 24 hr
Discharge 48 hr 8
6 Leave ICU 24 hr
Leave ICU 48 hr
5
Enter ICU 24 hr 6
4 Enter ICU 48 hr
Expire 24 hr
3 Expire 48 hr 4
Mortality
2
2
1
0
0
1
0.65 0.70 0.75 0.80 0.85 0.90 0.95 0.65 0.70 0.75 0.80 0.85 0.90 0.95
Tabular AUC Tabular AUC
(a) Average out-of-sample AUC (b) Worst-case out-of-sample AUC im-
improvement at varying data sizes. provement at varying data sizes.
Figure 4.5: TabText AUC improvement over the standard Tabular approach at varying data
sizes. We observe that the improvement of TabText is most prominent when standard Tabular
models do not perform well. For high-performing tasks, the advantage is less pronounced. An
important implication of this observation is that challenging medical prediction tasks with a
lack of difficult-to-observe risk factors or a small sample size can benefit from our framework.
4.5 Conclusion
We first developed a system of machine learning models predicting short-term discharge,
ICU risk, expiration, as well as end-of-stay mortality. All models achieve state-of-the-art
predictive accuracy. These models are deployed in 7 hospitals and used by over 200 medical
staff at HHC, who experienced first-hand benefits to shorten the length of stay, decrease the
cost of care, enhance patient safety, and improve the overall patient experience. Empirically,
we observe a reduction in average patient length of stay by 0.63 days and project an annual
contribution margin increase of $67 million dollars.
Next, we introduce TabText, a novel framework for processing tabular data by converting
110
TabText AUC Improvement (%)
TabText AUC Improvement (%)
it into a text representation that captures important contextual information such as column
descriptions. Our experiments show that augmenting standard tabular data with our TabText
representations can improve the performance of standard machine learning models across
all healthcare predictions tasks considered, with larger improvements observed for the more
challenging tasks. In addition, we demonstrate the efficiency of TabText in simplifying
data pre-processing and cleaning, offering an alternative and flexible pipeline for generating
high-performing baseline models for hospitals that have small number of patients or non-
standardized, disorganized medical records.
Our experiments reveal the potential of TabText for improving the performance of
standard machine learning models, and there are several research directions for further
improving our framework. For instance, TabText relies on the use of NLP models that
could generate high-quality embeddings for the input text, which motivates the development
of more LLMs pre-trained with domain-specific data. Moreover, augmenting tabular data
with TabText embeddings adds a layer of complexity to the interpretability of the model
output, and developing tools for maintaining the interpretability of the tabular data would
be an interesting direction for future research. TabText is a general framework that can
be particularly useful for difficult classification tasks (for instance, those with limited data
availability), and we hope that this work motivates further research for leveraging language
in general machine learning applications.
111
112
Chapter 5
Multistage Stochastic Optimization via
Kernels
5.1 Introduction
Multistage stochastic optimization arises in numerous applications (e.g., supply chain man-
agement, energy planning, inventory management among others) and remains an important
research area in the optimization community (Birge and Louveaux 2011, Shapiro et al. 2014,
Bertsimas et al. 2011). In these problems, the decision variables are split across multiple
periods and decisions are made sequentially as more information becomes available. The goal
is to make high quality decisions that minimize the expectation of a given cost function by
accurately modeling future uncertainty. In practice, decision makers can use historical data
to get a sense of the future uncertainty. For instance, consider a retailer selling products with
short life cycles who needs to make frequent orders to restock inventory without knowing the
future demands. To minimize costs the retailer must use the remaining inventory quantities
as well as historical data to gain insight into future demands. Another example is energy
planning, in which operators decide daily production levels without knowing how weather
conditions will affect the output of the wind turbines. In this case historical wind patterns
113
are valuable for better planning.
Besides historical data, auxiliary covariates are often available and can help predict
uncertainty. For example, in the fashion industry, color and brand are useful factors to
predict demand of a new item. Accordingly, recent work has focused on using predictive
analytics to leverage available side information and historical data to make better decisions.
Ban et al. (2019) for instance, fit covariate and historical data to a regression model and
prove theoretical guarantees for the dynamic procurement problem. Another approach is
that of Bertsimas et al. (2022c), which considers an uncertainty set around each data sample
and applies robust optimization tools to find linear decision rules that are asymptotically
optimal under mild conditions. This framework was generalized in Bertsimas and McCord
(2019), where machine learning methods are incorporated to find weights that produce more
accurate approximations of the objective. However, these dynamic methods are affected by
the curse of dimensionality; they require scenario tree enumeration and can require many
hours for solving problems with only a few stages.
In this paper, we propose a non-parametric, data-driven and tractable approach to solving
multistage stochastic optimization problems. By restricting the decision variables to be in a
reproducing kernel Hilbert space (RKHS) generated by a universal kernel, we can approximate
a large class of functions using non-parametric functional representations. We incorporate
sparsification techniques based on function subspace projections that allow our proposed
algorithm to overcome the complexity growth that kernel methods introduce when directly
applying the Representer Theorem to large data sets. The input to our algorithm is historical
data and we make no assumptions on the correlation structure of the uncertainties across
stages. We perform computational experiments on real-world multistage stochastic problems,
and show how our method not only produces near optimal solutions but also remains tractable
in higher dimensions and with large data sizes.
114
5.1.1 Related Literature
Kernel methods have been used in recent work to solve stochastic multistage optimization
problems with side information. Hanasusanto and Kuhn (2013), for example, approximate
the objective using kernel regression, and Pflug and Pichler (2016) apply a kernel density
estimator to the historical data to develop a non-parametric predict-then-optimize approach
that comes with asymptotic optimality guarantees under strong conditions. However, these
are local methods in which the predictions are made based only on those data points that are
similar to the current observation. As noted in Bertsimas and Koduri (2022), such approaches
require more data and perform worse on high dimensions compared to global methods, which
instead optimize over functional variables that make the predictions.
The Machine Learning community has long applied kernel methods to solve online learning
problems (Wheeden 2015, Norkin and Keyzer 2009), but they have focused purely on predictive
and not on prescriptive tasks. More recently, Bertsimas and Koduri (2022) has aimed to
extend kernel methods to data-driven, single-period optimization problems with auxiliary
information by using the Representer Theorem to transform the optimization over functions
into an optimization over parameters. They show that this approach overcomes the curse of
dimensionality; however, its main disadvantage is that the number of parameters per decision
grows linearly with the number of observations, resulting in function representations that
are as complex as the size of the data and that become potentially intractable especially in
multistage settings.
Works on stochastic optimization in a RKHS have developed multiple heuristics to reduce
the number of parameters in the function representation. For instance, Zhang et al. (2013)
uses random dropping, Kivinen et al. (2004) introduces forgetting factors, and Honeine
(2011) as well as Engel et al. (2004) apply compressive sensing techniques. These approaches
sucessfully achieve sparser functional representations, but they usually produce suboptimal
approximations (Honeine 2011, Engel et al. 2004).
115
We instead follow the approach from Koppel et al. (2016) of applying Functional Stochastic
Gradient Descent (FSGD) and projecting the iterates onto sparse subspaces that are found
by removing parameters associated with data points that do not contribute much to the value
of the decisions (Pati et al. 1993). This approach maintains optimality while addressing the
complexity growth that kernel methods exhibit as the data size increases. Intuitively, since
stochastic gradient descent iterates are a noisy signal for the optimal solution, by projecting
the iterates to have small model order we can ignore some of the noise while preserving the
goal signal. The sparse subspaces of the RKHS onto which projections are made can be
effectively found using kernel orthogonal matching pursuit (Vincent and Bengio 2002), an
algorithm which given a function f and an error bound ϵ, generates a sparse approximation
of f that is in a neighborhood of f of radius ϵ in Hilbert norm. Koppel et al. (2016) show
that for a specific choice of ϵ and of step-size for the FSGD algorithm, the projected FSGD
iterates produce decisions that converge in mean to the optimal solution.
5.1.2 Contributions
In this paper, we propose a novel data-driven approach for solving multistage stochastic
optimization problems with side information using kernels. Specifically, we represent the
controls as elements of a reproducing kernel Hilbert space and use loss-minimizing machine
learning methods to predict them. In addition, we incorporate sparsification techniques
to reduce the total number of parameters per control. We prove that this approach is
asymptotically optimal, guaranteeing near optimal approximations for problems with large
amounts of data. We also show that our approach remains computationally tractable in high
dimensions and with large data sizes. In detail, our contributions are as follows.
1. We propose a novel data-driven approach for multistage stochastic optimization prob-
lems with side information based on reproducing kernel Hilbert spaces. The approach
takes as input historical data and minimizes the regularized empirical loss by applying
functional stochastic gradient descent to optimize the decision rules, i.e., the functions
116
which specify what decision to make in each stage. To the best of our knowledge,
this is the first tractable application of reproducing kernel Hilbert spaces to multi-
stage optimization problems with large data sizes. While a kernel based formulation
of the multistage stochastic optimization problem is briefly suggested (without any
computational experiments) in Bertsimas and Koduri (2022), their non-stochastic and
non-sparse approach is not tractable for large data sizes since both time and memory
requirements increase cubically with the amount of data.
2. We extend sparsification techniques used by Koppel et al. (2016) to multistage opti-
mization settings in order to reduce both, space and time complexities of our algorithm.
Specifically, we use Functional Stochastic Gradient Descent (FSGD) to minimize the
objective and project each iterate onto a sparse subspace that is found by removing
parameters corresponding to data points with small contributions. We show that
applying FSGD without any sparsification results in methods that do not scale to
larger number of periods or data sizes. If sparsity is not added, the computational
cost and the storage requirement increase quadratically with the data size. With the
proposed method, however, both space and time complexities present linear growth
with a constant factor that depends on the step size of the FSGD algorithm.
3. We prove that if the loss function is convex, Lipschitz and differentiable almost every-
where, then the expected loss achieved with our algorithm converges in probability to
the expected loss of the optimal decision rules in the space of continuous functions.
4. We demonstrate across several instances of inventory management problems that the
proposed method finds near-optimal solutions using only a few parameters and with
very low computational times. We show that increasing the number of periods, the
117
dimension of the data, the dimension of the controls or the data size does not affect the
tractability of our approach.
The paper is organized as follows: Section 5.2 introduces the exact framework for the problem
being solved, Section 5.3 contains the data-driven formulation of the multistage stochastic
optimization problem with side information, Section 5.4 presents the proposed algorithm,
Section 5.5 states the convergence theorems, Section 5.6 analyses the complexity of the
proposed method, and Section 5.7 shows the results for several computational experiments.
5.2 Problem Setting
We consider a discrete-time, convex, multistage stochastic problem over a finite horizon T .
Initially, we observe some auxiliary covariates x ∈ X ⊆ Rq0 . Then, random disturbances wt
that belong to a known set Wt ⊆ Rqt are sequentially observed over time. At every stage
t, after observing the covariates x and the previous disturbances (w1, . . . ,wt−1), a decision
u rtt ∈ R is made. The total cost for the observed sequence of covariates, disturbances and
decisions is c(u1, . . . ,uT ,x,w1, . . . ,wT ).
A standard decision rule ū(·) = (ū1(·), . . . , ūT (·)) consists of functions ūt : W1 ×
. . . × Wt−1 → Rrt that at each stage t take as input the disturbances up to that point
and output a decision for the given stage. Specifically, denoting w := (w1, . . . ,wT ) and
w1:t := (w1, . . . ,wt), we have that the standard decision rule ū(·) applied to w outputs
( )
ū(w) = ū1, ū2(w1:1), . . . , ūT (w1:T−1) . The multistage optimization problem over the space
of continuous decision rules F̂ conditioned on some observed covariates x0 can then be written
as
min Ew|x [c(ū(w),x,w) | x = x0] , (5.1)
ū∈F̂
where c(·) is a convex loss function. As noted in Bertsimas and Koduri (2022), the conditional
problem in Eq. (5.1) can be formulated as an unconditional optimization problem by
augmenting the domain of the decision rules to also take the covariates as input, and then
118
evaluating the observed covariates in the decision rules found. In this paper, we adopt the
( )
same approach and therefore we consider augmented decision rules u(·) = u1(·), . . . ,uT (·)
with augmented domains ut : X ×W1 × . . . ×W → Rrtt−1 . The augmented decision rule
applied to the data point w with covariates x outputs
( )
u(x,w) = u1(x),u2(x,w1:1), . . . ,uT (x,w1:T−1) .
From now on we will join the covariates and the disturbances into a single random variable
z := (x,w) to simplify notation, and we index z starting at time 0 instead of time 1, so
that z0:t := (x,w1, . . . ,wt). Defining F as the space of continuous augmented decision rules,
and Z := X ×W1 × . . .×WT , we obtain that solving Eq. (5.1) is equivalent to solving the
problem
[ ( )]
min Ez c u(z), z (5.2)
u∈F
and evaluating the optimal solution u∗(·) at x = x0 to obtain the standard decision rule
ū∗(w) = u∗(x0,w).
5.3 Reproducing Kernel Hilbert space formulation for
Multistage Optimization
We now propose a data-driven approach for multistage stochastic optimization problems
with side information based on a Reproducing Kernel Hilbert space (RKHS). We include
an overview of these spaces in Appendix D.1. We will assume that we have historical
observations S = {zn}N = {(xn,wn, . . . ,wnn=1 1 T )}Nn=1 that are independently and identically
distributed according to some unknown distribution. Let Kt : X ×W1 × . . .×Wt−1 → R
be a positive universal kernel and Ht the reproducing Kernel Hilbert space generated by Kt.
We consider the Cartesian product Hilbert space, H := Hr11 × . . .×H
rt
T with inner product
119
defined by
〈( ) ( )〉
(u1,1, ... , u1,r ), ... , (uT,1, ... , u1 T,r ) , (v1,1, ... , v1,r ), ... , (vT 1 T,1, ... , vT,r )T
H
∑T ∑rt
:= ⟨ut,i, vt,i⟩H ,t
t=1 i=1
where ⟨u, v⟩H corresponds to the inner-product between u and v with respect to the Hilbertt
space Ht. We can approximate the solution of problem (5.2) by applying its empirical
regularized version and restricting the decision rules to be in H:
∑N1 n λmin c(u(z ), zn) + ∥u∥2H. (5.3)
u∈H N 2
n=1
Even though problem (5.3) is not equivalent to problem (5.2), if λ vanishes with the data
size then the regularized empirical average becomes a closer estimate of the expectation as
N increases. We will then focus on solving problem (5.3), and later in Corollary 5.5.5 we
show that as the data size goes to infinity, the expected loss converges in probability to the
optimal solution of problem (5.2).
One way to solve the regularized empirical problem (5.3) is to use the multidimensional
version of the Representer Theorem (Wahba 1990, Soentpiet et al. 1999, Schölkopf et al. 2002,
Shawe-Taylor et al. 2004), which says that for each t = 1, . . . , T there exists a scalar matrix
At such that the optimal solution to (5.3) satisfies
ut(·) = AtKt(Zt, ·),
where Kt(Z, ·) := [Kt(z1, ·), . . . , K N Tt(z , ·)] , and the time subscript for a data matrix
D = [d1, . . . ,dN ] refers to Dt = [d1 N0:t−1, . . . ,d0:t−1]. However, with this approach each
decision ut,i has as many scalar parameters as data points, which generates both memory
and performance problems as the number of data points becomes large. We instead want an
algorithm for which more data yields better results overall, without increasing its complexity
120
or worsening performance. General sparisification techniques like those found in Kivinen
et al. (2004), Zhang et al. (2013) or Engel et al. (2004), successfully reduce the number of
parameters; however they do so at the cost of compromising optimality. We therefore take
the pruning approach developed in Koppel et al. (2016) to solve problem (5.3); we apply
functional gradient descent to minimize the objective and at each iteration we drop those
parameters that add near zero contribution to the value of the decisions, ensuring convergence
to an optimal solution.
5.4 Sparse Multistage Optimization with Kernels
In this section, we extend sparsification techniques used by Koppel et al. (2016) to the
multistage optimization setting described in the previous section in order to reduce both,
space and time complexities of our algorithm. Specifically, we describe an iterative algorithm
for solving (5.3) using Functional Stochastic Gradient Descent and sparse projections. In
order to ease notation, we first make the following definitions for an augmented decision rule
u:
E(u) := Ez [c(u(z), z)] , (5.4)
Eλ
λ
(u) := E(u) + ∥u∥2H, (5.5)2
∑N ( )
Eλ
1 λ
S(u) := c u(z
n), zn + ∥u∥2
N 2 H
, (5.6)
n=1
( )
λ λEn(u) := c u(z
n), zn + ∥u∥2H. (5.7)2
The algorithm relies on the fact that the expectation of Eλn(u) over data yields Eλ(u) to
make stochastic gradient updates that converge to the optimal solution, while at the same
time removing unnecessary parameters along the descent trajectory.
121
5.4.1 Functional Stochastic Gradient Descent (FSGD)
Thanks to the fact that a RKHS preserves distance and to the continuity properties of real
spaces, a derivative with respect to an element f of a RKHS (a function) can be well defined
and it satisfies the standard properties of derivatives of real functions. Following Kivinen et al.
(2004), we can then derive a generalization of the Stochastic Gradient Descent algorithm for
elements of H. This method is referenced as functional stochastic gradient descent.
We compute the gradient of Eλn(u) with respect to the functions u using the identity
ut,i(z0:t−1) = ⟨K(z0:t−1, ·), ut,i⟩H, which is known as the reproducing property of kernels.
Differentiating on both sides of this equation we obtain
〈 〉
∂ut,i(z0:t−1) ∂ ut,i, Kt(z0:t−1, ·)
= = Kt(z0:t−1, ·), ∀ i ∈ [rt], t ∈ [T ], (5.8)
∂ut,i ∂ut,i
where [K] = {1, . . . , K}. The stochastic functional gradient can then be computed using the
chain rule:
( )
∇ n n n n nu c u(z ), z = ∇ c(u(z ), z )K (z , ·), (5.9)t ut(z0:t−1) t 0:t−1
=⇒ ∇u E
λ
n(u) = ∇ c(u(z
n), zn)Kt(z
n
0:t−1, ·) + λut, (5.10)t ut(z0:t−1)
( ) ( )
where ∇ c u(zn), zn corresponds to the derivative of c u(z), z with respect to its
ut(z0:t−1)
scalar arguments u1t (z0:t−1), . . . , u
rt
t (z0:t−1) evaluated at zn:
[ ]
∂c(u(zn), zn) ∂c(u(zn), zn)
∇ c(u(zn), zn) = , . . . , .
ut(z0:t−1) ∂u1t (z0:t−1) ∂u
rt
t (z0:t−1)
Thus, the update rule for the standard functional stochastic gradient descent (FSGD)
122
algorithm becomes
un+1t =u
n
t − ηn∇
λ
u En(u
n)
t
= (1−η nnλ)ut − ηn∇ c(u(z
n), zn)K n
u (z ) t(z0:t−1, ·), (5.11)t 0:t−1
where ηn is the step-size of the algorithm and the sequence of controllers is initialized at some
fixed function u0 ∈ H.
Using the update rule in Eq. (5.11), we can easily show by induction on n that if the
initial decision is of the form u0t (·) = A0tKt(D0t , ·) for some initial data matrix D0 and
initial parameters A0t , then the solutions un produced at every iteration also have this form.
Specifically, for each n > 0 and for all t ∈ [T ], there exist a scalar matrix Ant and a data
matrix Dn such that unt (·) = An nt · Kt(Dt , ·). In fact, this parametrization allows us to
rewrite the functional update rule in Eq. (5.11) as a nonfunctional (scalar) update on the
data matrix Dn and the parameters An1 , . . . ,AnT as follows:
[ ( )]
Dn+1 = [Dn, zn], An+1 = (1− η λ)An, η ∇ c un(zn nn n u(z) ), z ,
where    
( )
An1 ∇u (z )c u(z), z   1 0 
 .  ( )  
An :=  ..  , and
.
∇u(z)c u(z), z :=    ..  .
   
( )
AnT ∇u (z c u(z), zT 0:T−1)
Notice that this update forces the data matrix to have one more column after every iteration,
which brings us back to the same problem we had when applying the Representer Theorem.
However, because this is an iterative algorithm, we will reduce the dimension of the data
matrix Dn after every iteration by measuring the contribution of each individual observation
zn and removing those observations that added almost no value to the decision.
123
5.4.2 Proximal Projection
We now describe how to reduce the number of observations in the data matrix Dn with the
goal of reducing the dimension of the parameters An. We observed that the Representer
Theorem as well as the FSGD algorithm generate decisions ut,i that belong to the subspace
of Ht spanned by the functions Kt(z1 N0:t−1, ·), . . . , Kt(z0:t−1, ·). What we want is to produce
decisions that belong to a smaller subspace, one generated using fewer observations.
Suppose that D̃n+1, and Ãn+1 are the values resulting from the FSGD iterative rule in
Eq. (5.11), i.e,
[ ]
D̃n+1 = [Dn, zn] and Ãn+1 = (1− ηnλ)An, ηn∇u(z)c(un(zn), zn) ,
which represent the decisions ũn+1(·) = Ãn+1t t Kt(D̃
n+1
t , ·), and assume that we want to
generate a decision that only uses observations from a smaller data matrix Dn+1. We
can approximate ũn+1 with a decision un+1 that only depends on observations in Dn+1
by projecting each decision ũn+1t,i onto the subspace of Ht that is spanned by the functions
K n+1t(Dt , ·). If we denote this projection by ΠDn+1(·) then we can define
( )
un+1 := Π (ũn+1 n n n nDn+1 ) = ΠDn+1 (1− ηnλ)u − ηn∇uc(u (z ), z ) . (5.12)
The projection operator can be computed by solving the least squares problem
∑T ∥ ∥2
∥ ∥
An+1 = argmin n+1 n+1∥Ãt Kt(D̃t , ·)− Â
n+1
t Kt(D
n+1
t , ·)∥ , (5.13)rt
Ân+1
H
t=1 t
which has a closed form solution given by
( )
An+1 = K [Dn+1,Dn+1
−1
t t t t ] K [D
n+1, D̃n+1t t t ]Ã
n+1
t , for all t ∈ [T ]. (5.14)
We then have a simple way to project the FSGD solution onto the Hilbert subspace generated
124
by a smaller data matrix Dn+1, but we are still left the question: how do we find the right
data matrix Dn+1? As in Koppel et al. (2016), we use a method called destructive kernel
orthogonal matching pursuit (KOMP) with pre-fitting, which was developed in Vincent and
Bengio (2002). The KOMP algorithm takes as input a function ũ ∈ H (represented by its
data matrix D̃ as well as the corresponding parameters Ã), and a maximum error bound ϵ.
For each element d in the data matrix D̃, the algorithm computes the approximation function
u = ΠD̃\{d}(ũ) obtained by removing observation d from D̃. Next, the algorithm removes the
observation that produced the lowest error, updates the current function accordingly and then
repeats this procedure to remove the next element. The algorithm stops removing elements
when the difference between the current function and the best approximation function is
larger than ϵ. The exact algorithm can be found in Algorithm 1.
Algorithm 1: Kernel Orthogonal Matching Pursuit (KOMP)
Input: Function ũ represented by data matrix D̃ with M̃ columns, parameters Ã,
and ϵ > 0.
Initialize D = D̃, M = M̃ , A = Ã, and u = ũ ;
while D is non-empty do
for j = 1, . . . , M̃ do
Find minimal approximation error with data matrix element dj removed:
∥ ∥
2 2γ = ∥ ∥j u− ΠD\{dj}(u) H
∑T ∥ ∥
j 2
= min ∥At ·Kt(Dt, ·)− Ât ·Kt(Dt\{dt}, ·)
∥ .
Ht
Â
t=1
end
Find the index with minimum approximation error: j∗ = argmin γj
if γj∗ > ϵ then
stop;
else
Prune data matrix: ∗D = D\{dj };
Update M =M − 1;
Update the parameters:
∑ ∥ ∥T j 2
A = argmin ∥Â t=1 At ·Kt(Dt, ·)− Ât ·Kt(D \{d }, ·)
∥
t t .Ht
end
end
Output: D, A, u.
125
5.4.3 The Algorithm
By combining Functional Stochastic Gradient Descent with the Kernel Orthogonal Matching
Pursuit we are able to develop an algorithm that approximates the minimizer of Eλ(u) with
decision rules that are represented using only a few parameters. The algorithm is initialized
with a decision rule u0 0 0t = A Kt(D , ·), which in practice is usually set to 0. Then, in each
iteration, it performs one FSGD step and then applies the KOMP algorithm in order to obtain
an approximated decision with fewer observations. Notice that if we define the projected
gradient ∇̃ by
un − Π n λ nλ n Dn+1 [u − ηn∇uE∇̃ E (u ) := n
(u )]
u n , (5.15)ηn
then we can write the iterative updates of this procedure in the same form as the standard
iterative updates of FSGD:
un+1 = un − η ∇̃ Eλ(unn u n ). (5.16)
Since stochasticity does not guarantee a strict objective descent, the algorithm keeps track of
the best decision rules observed and at the end it outputs the decision u∗S with the lowest
empirical error EλS with respect to the data set S. The exact formulation can be found in
Algorithm 2.
5.5 Convergence Analysis
In this section, we show that for a specific choice of step-size the objective value of the
decision output by the algorithm converges to the objective value of the true minimizer. We
first present the three main assumptions that we make on the problem settings in order to
126
Algorithm 2: Sparse Multistage Optimization via Kernels [SMOK]
Input: Data points S = {zn}n=1,...,N , error bounds ϵn, learning rate ηn, and initial
decision u0 represented with data matrix D0 and parameters A0.
for n = 1, . . . , N do
Take FSGD step using the nth sample zn to obtain
[ ]
D̃n+1 = [Dn, zn] and Ãn+1= (1− η λ)An, η ∇ c(un(zn), znn n u(z) ) .
Reduce the data matrix and number of parameters using
Dn+1,An+1,un+1 = KOMP(D̃n+1, Ãn+1, ϵn)
end
Output: u∗S = argmin
λ
u∈{u1,...,uN} ES(u).
guarantee convergence of the algorithm:
Assumption 5.5.1. The data space Z is compact, the kernels Kt are universal, and there
exists a constant κ such that
Kt(z1:t−1, z1:t−1) ≤ κ, ∀ z ∈ Z, ∀ t ∈ [T ].
Assumption 5.5.2. There exists a constant C such that for all z ∈ Z the loss function
satisfies
∣ ∣
∣c(u, z)− c(u′, z)∣ ≤ C∥u− u′∥ , ∀ u,u′ ∈ Rr1+...+rT2 .
Assumption 5.5.3. The loss function c(u(z), z) is convex and differentiable with respect to
the scalar arguments u(z) for all z ∈ Z.
Assumption 5.5.1 naturally holds for most data domains, and this is a necessary assumption
to ensure that the Hilbert norm of the optimizer of Eλ is bounded. Assumption 5.5.2 holds
whenever the cost function c as well as the constraint functions gq are Lipschitz. This
assumption implies that the gradient of c with respect to the scalars u(z) is bounded as
∥∇u(z)c(u(z), z)∥2 ≤ C, (5.17)
127
[ ]
which in turn allows us to upper bound the expected norm of the gradient E ∥∇uEλ 2n(u)∥H .t
Assumption 5.5.3 is a standard condition for convergence of descent methods, and it can be
relaxed to the case in which the loss function is almost everywhere differentiable by applying
subgradients instead of gradients.
Theorem 5.5.4. Let u∗ := argmin EλS u∈{u1,...,uN} S(u) be the decisions generated by Algorithm
2 when given the set S = {zn}Nn=1 as input, and let u
λ be the true minimizer of Eλ(u) over
H. If we use constant step-size η and constant error bounds ϵ = P2η
2 for some constant
P2 > 0, then under Assumptions 1-3, we have that
[ ] (η)
E Eλ(u∗ )− EλS (uλ) ≤ O .λ
Proof. See Appendix D.3.
Corollary 5.5.5. Let u∗ be the true minimizer of E(·) over F . If we use constant step-size
with η = √P1 < 1 , and P1 > 0, constant error bounds ϵ = P
2
2η for some constant P2 > 0, andN λ
√
regularization parameter λ such that λ −−−→ 0 and λ N −−−→ ∞, then under Assumptions
N→∞ N→∞
1-3 we have that
lim E[|E(u∗ ∗S)− E(u )|] = 0. (5.18)
N→∞
Proof. See Appendix D.3.
Since L1 convergence implies convergence in probability, the corollary also implies that
the expected loss achieved with Algorithm 2 converges in probability to the optimal solution.
In addition, from Theorem 5.5.4 we observe that setting η = √P1 makes the objective
N
value of the solution found by Algorithm 2 converge to the optimal solution of problem (5.3)
( )
with a rate of convergence of O √1 . Convergence can also be achieved under diminishing
λ N
( )
step size, although with a slower rate of O 1 . In practice, a diminishing step size or a
λ logN
very small constant step size might make our data matrix Dn grow arbitrarily large, since
128
little or no pruning would be done at each iteration. A constant step size is then what allows
us to control the trade-off between accuracy and memory required; we want to use a step size
η that is small enough to make the error in Theorem 5.5.4 small, but large enough for the
pruning to be done.
5.6 Complexity Analysis
Let Mn be the size of the data matrix Dn during the nth iteration of Algorithm 2. We analyze
both space and time complexities per iteration in terms of Mn.
Space: At each iteration we need to store the kernel matrix K [Dnt t ,Dn Mn×Mnt ] ∈ R and its
∑
inverse as well as the parameters Ant ∈ Rrt×Mn for each t. This results in
T
O(TM2n+Mn t=1 rt)
memory requirement.
∑
Time: For the FSGD step, computing the gradient takes TO(Mn t=1 rt) time. Computing
from scratch the kernel matrices Kt[Dnt ,Dnt ] ∈ RMn×Mn and its inverses (needed for the
∑
pruning step) takes O(M2 T 3n t=0 qt) and O(TMn) time respectively. However, by using a
recursive rule to compute these matrices in terms of the corresponding values in the previous
∑
iteration, the times become TO(Mn t=0 qt) and O(M
2
n) respectively. In addition, the matrix
multiplication in Eq. (5.14) takes O(M2n) time for each t. Since at most Mn elements can be
removed from the dictionary at the nth iteration, we obtain that in the worst case scenario
∑
the time per iteration becomes O(TM3 2 Tn +Mn t=0 qt).
Let us now discuss the size of Mn. In the worst-case, we know that for all iterations the
size of the data matrix is upper bounded by the covering number M of the data domain (Zhou
2002). More specifically, for fixed step size η and fixed error bound ϵ = P2η2, we have that if
the data space Z is compact (Assumption 5.5.1), then Mn is upper bounded by the minimum
number of balls of radius P2η needed to cover the compact set K1(Z0, ·)× . . .×KT (Z0:T−1, ·)C
129
of kernel transformations (see for example the proof of Theorem 3 in Koppel et al. (2016)).
While an exact expression for this cover number M is unknown, the number is finite (Anthony
and Bartlett 2009) and it decreases as η or P2 increases. In particular, the maximum number
of samples in the data matrix depends on the step size η and the constant P2, but not on the
data size N .
Denoting the cover number described above by M and considering fixed values of T and
of the dimensions r1, . . . , rT and q0, . . . , qT , we obtain that the worst case total time across
the N iterations of Algorithm 2 can be upper bounded by O(NM3) and worst case total
space required is O(NM2). While the worst case scenario cannot happen for all iterations
(for example, if M elements are pruned in one iteration, the next iteration is very fast), this
bound is enough to conclude that total time and total space are in the worst case linear in the
number of iterations. Notice that if we removed the pruning step, the entire algorithm would
require Ω(N2) space to store the kernel matrix and Ω(N2) time for computations, showing
that Algorithm 2 indeed reduces the overall complexity as the number of iterations becomes
much larger than M .
5.7 Computational Experiments
We perform computational experiments for the inventory control and the shipment planning
problems to analyze the average out-of-sample performance as well as the tractability of the
proposed algorithms. For both applications we compare the SMOK algorithm proposed in
Algorithm 2, to the MOK algorithm (Multistage Optimization with Kernels), which is the
result of applying the FSGD algorithm without the pruning step. Moreover, we compare the
SMOK and MOK algorithms against three other benchmarks:
1. SRO: Sample robust optimization approach from Bertsimas et al. (2022c), in which all
samples are assigned equal weight 1 . We use uncertainty sets bounded by ϵ in the ℓ
N 1
norm as well as multi-policy approximation with linear decision rules.
130
2. SRO-knn: Sample robust optimization with covariates approach developed in Bertsimas
and McCord (2019), using uncertainty sets bounded by ϵ in the ℓ1 norm as well as
multi-policy approximation with linear decision rules. The weights were obtained using
the kN -nearest neighbors approach.
3. SAA-knn: Sample average approximation method, which is equivalent to the SRO-knn
approach with (ϵ = 0).
We analyze the computational results for several instances of the inventory control problem.
First, we consider a high dimensional instance of the problem to show the tractability of the
SMOK algorithm as well as to compare its performance against other methods. Next, we
analyze how the performance of the proposed algorithms varies with the dimensions of the
problem (number of periods, data size, dimension of the data as well as dimension of the
controllers). For instances in which the number of periods is less than 5 we are also able to
compute lower bounds for the loss achieved by the optimal decision rules, which enables us
to quantify the optimality gap of the proposed methods.
For the shipment planning application we reproduce the results from Bertsimas and
McCord (2019) to compare the SMOK and MOK algorithms against sample robust opti-
mization (with and without covariates) and sample average approximation. For training all
these benchmarks we use the same parameter values reported in Bertsimas and McCord (2019).
Handling Constraints: Often the sequence of decisions u(z) must satisfy certain convex
constraints for all possible disturbances, transforming the problem of interest into
[ ( )]
min Ez c u(z), z
u∈F (5.19)
( )
s.t. gq u(z) ≤ 0, ∀ z ∈ Z, ∀ q ∈ [Q].
We address this problem by relaxing the constraints into the objective with a penalty function.
More specifically, in Algorithm 2 we replace the cost c(u(z), z) with a new loss function cψ
131
defined as
Q
( ) ( ) ∑ ( ( ))2
cψ u(z), z := c u(z), z + ψ max 0, gq u(z) , (5.20)
q=1
where ψ is the penalty parameter. Although feasibility is not guaranteed, the constraint
violation is expected to vanish for large enough ψ (see Lemma D.2.7). Convergence analy-
sis for the SMOK algorithm applied to this constrained problem can be found in Appendix D.3.
Parameter Settings: We train the SMOK and MOK algorithms using Gaussian kernels
and constant step size. The values for λ, ψ and θ were found using validation, and the
decisions were projected onto the space of feasible decisions before making any evaluations,
both at training and testing stages (this means that the decisions evaluated had 0 constraint
violation). For each instance of the problem the constant step size η was initially set to 10−5
and it was repeatedly increased by factors of 5 so long as the average training loss did not
worsen and the iterations were reaching convergence. The parameter P2 for the error bound
ϵ was initially set to 0.1 and was repeatedly increased by factors of 2; we stopped increasing
it when the average training loss significantly worsened.
Software Utilized: Experiments were implemented in Python 3 (Van Rossum and Drake
2009b) using the NumPy library (Harris et al. 2020). We clarify that Eq. (5.14) can often be
difficult to compute due to numerical instability in the calculations for the inverse matrix. To
address this issue we add a small value λ = 1e−7 to the diagonal of a matrix before computing
its inverse. In terms of hardware, all experiments where run on an Intel(R) Core(TM)
i7-8557U CPU @ 1.70GHz processor with 4 physical cores (hyper-threading enabled). The
machine has a 32KB L1 cache and 256KB L2 cache per core, and an 8MB L3 cache. There is
a total of 16GB DRAM.
132
5.7.1 Inventory Control Problem
We consider a multistage inventory control problem with linear constraints. At each stage t
with initial inventory st, a retailer places procurement orders u rt ∈ R at various suppliers,
and later observes the demands w qt ∈ R . At the end of each stage, the firm incurs a per-unit
holding cost of ht and a back-order cost of bt. The inventory is not backlogged, and therefore
the initial inventory for the next period is given by the linear equation s ⊤ ⊤t = st−1+1 ut−1 wt,
with zero initial inventory for the first period. In addition, the procurement orders are upper
bounded by a constant L and the sum of procurement orders for two consecutive stages
cannot exceed a constant ℓ. As in Ban et al. (2018), we consider the scenario in which
retailers can observe auxiliary covariates x that relate to the future demands (e.g. in the
fashion industry color and brand are useful factors for predicting demand of the products).
For a problem with T periods, we can formulate this optimization problem as
[ ∣ ]
∑T ∣
E + +min w|x ht [st] + bt [−st] ∣ x = x∣ 0u1:T
t=1
s.t. st = s ⊤t−1 + 1 ut − 1⊤wt, ∀t ∈ [T ],
ut ≥ 0, ∀t ∈ [T ],
ut ≤ L1, ∀t ∈ [T ],
ut + ut+1 ≤ ℓ1, ∀t ∈ [T − 1].
The parameters ht, bt were chosen to be 2 and 1, respectively. The data sets used in these
experiments were generated by sampling x from a Truncated Gaussian Distribution with
mean 2 and standard deviation 0.5, and with truncating bounds 0 and 6. The demands wt
were then obtained as a linear function of the covariates with some added noise; specifically,
wt = αtx+ ϵt, where ϵt was sampled from a standard distribution and the constants αt were
selected to be close to 50.
We first consider a large instance of the problem with T = q = r = 10, and we set the
133
control bounds as L = 150 and ℓ = 200. We use a training set with 2000 sample paths
and we approximate the expected loss achieved by each method by averaging the losses
across a common testing set with 104 sample paths. Since the SRO and SRO-knn methods
become intractable for problems of this magnitude, in this experiment we only compare the
SMOK and MOK methods to SAA-knn. We use validation to choose the best parameters
for all methods and we evaluate the results on the testing set. In table 5.1 we observe that
both SMOK and MOK outperform SAA-knn in terms of average out-of-sample loss and
computational time. Moreover, the number of parameters needed for the SMOK algorithm is
smaller by two orders of magnitude compared to the other methods. Even though we observe
an increase in computation time for SMOK with respect to MOK (due to the overhead
computation time for the pruning step), we also see that adding sparsity helped SMOK
achieve a better average loss.
Avg OOS Loss Total Time (hours) No. of Params
SMOK 491.30 0.3 1.5× 103
MOK 493.74 0.1 2× 105
SAA-knn 496.04 14.36 2.2× 105
Table 5.1: Average out-of-sample (OOS) loss and total computation time for inventory
problem with T = q = r = 10.
We next consider other instances of the inventory problem to analyze how the dimensions
of the problem affect the overall performance of the SMOK and MOK algorithms. We
compared these two methods to a third algorithm ADR (Affine Decision Rules), which refers
to the common approximation technique of restricting the space of decision rules to be affine
functions. We train all methods using the same training sets and the same validation sets
(with size equal to 30% of the training size), and we approximate the expected loss achieved
by averaging across a common testing set of 105 sample paths. In addition, we compute lower
bounds for the optimal expected loss when T ≤ 5 (see Appendix D.4), which allows us to
134
analyze the optimality gap for the different methods.
Multiple data sets were generated to analyze the performance of the algorithms as we
increase the number of periods, the training size, the dimension of the data and the dimension
of the controls. In each case we analyze the average out-of-sample loss and the size M
of the data matrix, which refers to number of parameters per control. We also analyze
the computational time for each iteration of Stochastic Gradient Descent (projected or not
projected), and the evaluation time (time it takes to evaluate the empirical loss function
EλS(u) given the parameters for the functional representation of u). Notice that since the
stochastic gradient descent algorithm does not strictly descend, the empirical loss of the
validation set needs to be evaluated every certain number of iterations, which makes the
evaluation time part of the total training time.
Varying the Number of Periods: (L = 150, ℓ = 200, q = r = 1, N = 2000, T =
2, 3, 4, 5)
In Figure 5.1a, we observe that the convergence trajectory is not significantly affected by the
pruning step, and the number of iterations needed until convergence does not change much
for T ≥ 3. In addition, we see in Figure 5.1b that ADR results in very poor performance,
while both the SMOK and MOK algorithms are quite close to the lower bounds found for
the optimal expected loss. In Figure 5.1c we observe that the time per iteration of stochastic
gradient descent grows linearly for both SMOK and MOK, but MOK takes longer times due
to the overhead introduced by the pruning step. The evaluation time (Figure 5.1d) also grows
linearly for both algorithms, although unlike the time per iteration, the slope is larger for
MOK than for SMOK because the number of parameters is significantly smaller for this last
method (SMOK algorithm reduced the size M of the data matrix from 2000 to values below
15).
135
(a) (b)
(c) (d)
Figure 5.1: Expected loss and computational time for varying number of periods.
Varying the Data Size: (L = 150, ℓ = 200, q = r = 1, T = 3,
N = 10, 100, 1000, 4000, 7000, 10000)
Figure 5.2b shows that, as anticipated, the expected loss achieved by both MOK and SMOK
algorithms decreases as the size of the training set becomes larger. The number of iterations
required to reach convergence (Figure 5.2a) does not change much with the data size and
the expected loss achieved remains relatively constant after a large enough training size,
which occurs around N = 1000. In Figures 5.2c,5.2d we can observe a significant memory
improvement of SMOK over MOK when N becomes very large. For N = 104, for example,
SMOK outputs decision rules with only 11 parameters, while SMOK requires 104 parameters
per control. The evaluation time in Figure 5.2d grows quadratically with the number of
parameters in each control (the quadratic factor comes from computing the kernel matrix
Kt[Dt,Dt]), which in the case of MOK corresponds to the size of the training set. Since the
SMOK algorithm has much fewer parameters, it takes under half a second to evaluate the
average loss of 1000 samples regardless of the training data size. Notice that the time per
136
iteration (Figure 5.2c) is higher for SMOK than for MOK when N is small due to the pruning
step. However, we observe that the time per iteration increases linearly for MOK while it
stabilizes for SMOK, implying that for bigger values of N the SMOK method actually takes
less time per iteration and per evaluation.
(a) (b)
(c) (d)
Figure 5.2: Expected loss and computational time for varying data sizes.
Varying Data Dimension: (L = 150, ℓ = 200, r = 1, T = 3, N = 2000, q =
1, 10, 20, 30, 40, 50)
∑
When generating data sets for this part we enforce that the value q (wt)q remains constant
for all t ∈ [T ], which guarantees that the optimal expected loss is the same across instances. In
Figure 5.3a, we observe that the trajectories of the expected loss across the FSGD iterations
are quite similar for all the different dimensions of the data. More importantly, the error
gap does not worsen as the dimension of the data increases (Figure 5.3b), showing that
the accuracy of our algorithms does not worsen for data sets in large dimensional spaces.
Additionally, in Figure 5.3d we observe that there is a slight linear increase in the evaluation
time for both SMOK and MOK algorithms, which is expected since the dimension of the
137
demand vector affects the computation of the exponent in the Gaussian kernel. In terms of
the iteration time (Figure 5.3c), we can see that SMOK remains quite stable around 4 seconds
per 1000 iterations, while MOK shows linear increase. As in the previous examples, the
number of parameters of the SMOK algorithm is quite similar across the different experiments
and remains under 15.
(a) (b)
(c) (d)
Figure 5.3: Expected loss and computational time for varying data dimensions.
Varying Control Dimension: (L = 150, ℓ = 200 , q = 1, T = 3, N = 2000, r =
r
1, 3, 5, 10)
In order to make a fair comparison, we set L = 150 and ℓ = 200 , which guarantees that the
r r
optimal expected loss is the same across instances. We observe in Figure 5.4b that the SMOK
and MOK algorithms achieve very similar average out-of-sample loss across the different
dimensions, and there are a couple of scenarios in which the pruning step helped to improve
the expected loss. In addition, the number of iterations required for convergence (Figure
5.4a) does not seem to depend on the dimension of the control. Lastly, in Figure 5.4c we
observe a slight linear increase in iteration time for both SMOK and MOK algorithms, with
138
MOK having an advantage of around 4 seconds per 1000 iterations. In terms of evaluation
time (Figure 5.4d) both algorithms grow linearly. As in the previous examples, the number
of parameters for the SMOK algorithm is very low and varies between 13 and 14 across the
different experiments.
(a) (b)
(c) (d)
Figure 5.4: Expected loss and computational time for varying control dimensions.
5.7.2 Shipment Planning
We next analyze a two-stage shipment planning problem, following the same problem setting
as in Bertsimas et al. (2022b) and Bertsimas and Kallus (2020). In this example, a decision
maker has access to side information x (market trends, advertisements, etc.) and the goal
is to ship items from the production facilities to multiple locations as to satisfy demand at
minimum cost. First, the decision maker chooses an initial inventory quantity u1f ≥ 0 to
be produced in each of the production facilities f ∈ [F ] at a per unit cost of p1. Next, the
demands wℓ ≥ 0 are observed in each location ℓ ∈ [L]. If needed, the decision maker can
produce additional units in each facility to satisfy demand, but at a higher per unit cost
p2 > p1. Finally, demand is fulfilled by shipping u2fℓ units from facility f to location ℓ at
139
per-unit cost cfℓ, and each unit of satisfied demand generates revenue a > 0. The multistage
optimization problem can then be written as
 [ ] 
F F L + ∣∑ ∑ ∑ ∑ ∑F ∑L ∣
minE w|x p1 u1f− a w + p u ∣ ℓ 2 2fℓ − u1f + cfℓ u2fℓ x = x
u ,u ∣
0
1 2
f=1 ℓ∈[L] f=1 ℓ=1 f=1 ℓ=1
∑F
s.t. u2fℓ ≥ wℓ, ∀ℓ ∈ [L],∀w ∈ W ,
f=1
where W is the set of all possible demand realizations. We reproduced the computational
experiments performed in Bertsimas et al. (2022b) using the same parameters, the same
data generation procedure as well as the same data set sizes. More specifically, we use
F = 4, L = 12, p1 = 5, p2 = 100 and a = 90. The costs c and covariates x are also generated
in an identical manner as in Bertsimas et al. (2022b).
We compare the SMOK and MOK algorithms against SRO, SRO-knn and SAA-knn.
We train all methods over 100 independent training sets and evaluate them on a test set of
size 100. The average out-of-sample profits achieved across the different methods are shown
in Table 5.2. We observe that both MOK and SMOK outperform the other methods, with
MOK achieving the highest revenues. However, as observed in Table 5.3, only the SAA and
SMOK methods have tractable growth as the data size increases. In particular, the SMOK
algorithm achieves high accuracies using only around 60 parameters per decision even when
the data size increases to large numbers.
N SRO SRO SRO-knn SRO-knn SAA-knn MOK SMOK
(ϵ = 100) (ϵ = 500) (ϵ = 100) (ϵ = 500)
100 160007.0 159866.7 157522.9 158671.5 156639.6 161536.9 160737.0
200 160221.1 160075.0 157863.5 159136.9 156911.9 164050.1 163039.2
300 160431.0 160145.6 158697.6 159656.2 157669.6 164860.6 163703.8
Table 5.2: Out-of-sample profit for the shipment planning problem.
140
N SRO SRO SRO-knn SRO-knn SAA-knn MOK SMOK
(ϵ = 100) (ϵ = 500) (ϵ = 100) (ϵ = 500)
100 8 6 30 35 4 38 150
200 11 12 78 75 4 42 260
300 19 21 125 132 4 46 240
500 38 39 276 280 5 50 245
1000 74 76 772 790 10 65 255
5000 559 581 54000 54100 54 288 252
Table 5.3: Total computation time (seconds) for solving one instance of the shipment planning
problem.
5.8 Conclusion
In this work, we developed a tractable data-driven approach for solving multistage stochastic
optimization problems in which the uncertainties are independent of previous decisions.
We represented the decision rules as elements of a reproducing kernel Hilbert space and
performed functional stochastic gradient descent to minimize the empirical regularized loss.
We next incorporated sparsification techniques based on function subspace projections, which
decreased the number of parameters per controller. We prove that the proposed approach is
asymptotically optimal for multistage stochastic programming with side information.
The practical value of the proposed data-driven approach was shown across various
computational experiments on stochastic inventory management problems, demonstrating
that it produces high-quality decisions, does not worsen in multidimensional settings and
remains tractable even with large data sizes. This approach does not rely on the traditional
use of approximation with scenario trees, and provides a novel method for leveraging advances
in machine learning to solve multistage stochastic optimization problems.
141
142
Appendix A
Chapter 2 Appendix
A.1 Proofs of Lemmas
In this Appendix, we prove Lemmas 2.3.3, 2.5.1, 2.5.2 and 2.5.3.
A.1.1 Proof of Lemma 2.3.3
Proof: By Assumption 2.3.1 with c = zLy (θ,x+ δ) we know
( )
min max L(y, zL(θ,x+ δ)) = min max L y,zL(θ,x+ δ)− zLy (θ,x+ δ)e .
θ δ∈U θ δ∈U
Define z̄(δ) := zL(θ,x + δ) − zLy (θ,x + δ)e and z̄′ = (maxδ∈U z̄1(δ), . . . ,maxδ∈U z̄K(δ)).
Notice that the yth coordinates of z̄(δ) and z̄′ are both zero, and therefore for all k ∈ [K] we
have
z̄k(δ)− z̄y(δ) = z̄k(δ) ≤ max z̄
′ ′ ′
k(δ) = z̄k = z̄k − z̄y.
δ∈U
143
Therefore, we can apply Assumption 2.3.2 with z = z̄(δ) and z′ = z̄′ to obtain
( )
min max L y, zL(θ,x+ δ)− zLy (θ,x+ δ)e
θ δ∈U
( ( ))
≤min L y, max zL1 (θ,x+ δ)− z
L
y (θ,x+ δ), . . . ,max z
L L
K(θ,x+ δ)− zy (θ,x+ δ) .
θ δ∈U δ∈U
We then conclude
min max L(y,zL(θ,x+ δ))
θ δ∈U
( ( ))
≤min L y, max zL1 (θ,x+ δ)− z
L
y (θ,x+ δ), . . . ,max z
L
K(θ,x+ δ)− z
L
y (θ,x+ δ) .
θ δ∈U δ∈U
A.1.2 Proof of Lemma 2.5.1
Proof. Since f is convex and closed, we have f = (f ∗)∗ (Rockafellar 1970), and applying the
definition of the convex conjugate function we obtain
f(z(δ)) = (f ⋆)⋆(z(δ)) = sup z(δ)⊤u− f ⋆(u),
u∈dom(f∗)
which implies
sup f(z(δ)) + g(z(δ)) = sup sup z(δ)Tu− f ⋆(u) + g(z(δ))
δ∈U δ∈U u∈dom(f⋆)
= sup sup z(δ)Tu− f ⋆(u) + g(z(δ)),
u∈dom(f⋆) δ∈U
as desired.
144
A.1.3 Proof of Lemma 2.5.2
Let Z = {z(δ) : δ ∈ U}. Defining the indicator function

 0 if z ∈ Z,
γ(z|Z) =
 ∞ otherwise,
and applying the Fenchel duality theorem (Rockafellar 1970), we obtain:
sup g(z(δ)) = sup g(z) = sup g(z)− γ(z|Z) = inf γ⋆(v|Z)− g⋆(v). (A.1)
δ∈U z∈Z z∈dom(g)∩dom(γ) v∈dom(g⋆)
Finally, since γ⋆(v|Z) = supz∈Z z⊤v, we conclude
sup g(z(δ)) = inf sup z⊤v − g⋆(v) = inf sup z(δ)
⊤v − g⋆(v).
δ∈U v∈dom(g⋆) z∈Z v∈dom(g⋆) δ∈U
A.1.4 Proof of Lemma 2.5.3
We first prove part a). By definition, we have
f ⋆(z) = sup z⊤x− p⊤[x]+.
x
Notice that if the ith component of z is negative for any i, then f ⋆(z) = ∞ because x can be
the vector with an arbitrarily large negative value in the ith coordinate and 0 everywhere
else. Similarly, if the ith component of z is larger than the ith coordinate of p for any i, then
again f ⋆(z) = ∞ because x can be the vector with an arbitrarily large positive value in the
ith coordinate and 0 everywhere else. Moreover, if 0 ≤ z ≤ p, then
sup z⊤x− p⊤[x]+ ≤ sup z⊤x− z⊤[x]+ = supz⊤(x− [x]+) ≤ 0.
x x x
145
Since x = 0 achieves an objective value of 0, we conclude that 0 ≤ z ≤ p implies f ⋆(z) = 0
as desired.
Next, we proceed to prove part b). By definition of the concave conjugate we have
g⋆(z) = inf z
⊤x− (x⊤u− q⊤[x]+) = inf(z − u)⊤x+ q⊤[x]+.
x x
If the ith component of z is larger than the ith component of u for any i, then g⋆(z) = ∞
because x can be the vector with an arbitrarily large negative value in the ith coordinate and
0 everywhere else. Similarly, if the ith component of z is smaller than the ith coordinate of
(u− q) for any i, then again g⋆(z) = ∞ because x can be the vector with an arbitrarily large
positive value in the ith coordinate and 0 everywhere else. In addition, if u− q ≤ z ≤ u,
then
inf(z − u)⊤x+ q⊤[x]+ ≥ inf(z − u)⊤x+ (u− z)⊤[x]+ = inf(u− z)⊤([x]+ − x) ≥ 0.
x x x
Since x = 0 achieves an objective value of 0, we conclude that u − q ≤ z ≤ u implies
g⋆(z) = 0 as desired.
146
A.2 Generalized Results
We now state and proof the generalization of Theorem 2.5.4, Corollary 2.5.5 and Theorem
2.5.6 for the case in which the neural network has more than 2 layers.
Theorem A.2.1 (Generalization of Theorem 2.5.4). For all 2 ≤ l ≤ L, it holds
y
sup (∆ek)
⊤zℓ(θ,x+ δ) = (A.2)
δ∈U
∑L−1
y
sup inf . . . sup inf sup (pl − q
⊤ l−1
l) z (θ,x+ δ) + (pℓ+1 − qℓ+1)
⊤bℓ + (∆e ⊤ Lk) b
s tL L s
t
l l δ∈U ℓ=l
s.t. p = [(W L)⊤
y
L (∆ek)]
+ ⊙ sL
L ⊤ yqL = [−(W ) (∆e )]
+
k ⊙ tL
p = (([W ℓ]+ℓ )
⊤p ℓ + ⊤ℓ+1 + ([−W ] ) qℓ+1)⊙ sℓ ∀ ℓ = l, . . . , L− 1
q = (([−W ℓℓ ]
+)⊤pℓ+1 + ([W
ℓ]+)⊤qℓ+1)⊙ tℓ ∀ ℓ = l, . . . , L− 1
0 ≤ sℓ, tℓ ≤ 1 ∀ ℓ = l, . . . , L.
(A.3)
Proof. We will proceed by backward induction on the layer number l.
Case l = L:
The proof is equivalent to the case L = 2 already proved in Section 2.5.
Case l − 1:
Suppose the theorem holds for some fixed l with l > 2. We have
(pl − ql)
⊤zl−1(θ,x+ δ)
=(p − q )⊤(W l−1[zl−2(θ,x+ δ)]+ + bl−1l l )
=f l−2 l−2 ⊤ l−1+(z (θ,x+ δ))− f−(z (θ,x+ δ)) + (pl − ql) b ,
147
where
f (x) = (p⊤[W l−1+ l ]
+ + q⊤l [−W
l−1]+)[x]+, and
f−(x) = (p
⊤
l [−W
l−1]+ + q⊤[W l−1]+l )[x]
+.
By Lemma 2.5.1 we then obtain
sup (p − q )⊤zl−1l l (θ,x+ δ)
δ∈U
=sup f+(z
l−2(θ,x+ δ))− f l−2−(z (θ,x+ δ)) + (p − q )
⊤bl−1l l
δ∈U
= sup supu⊤ zl−2(θ,x+ δ)− f (zl−2(θ,x+ δ)) + (p − q )⊤bl−1l−1 − l l .
u ⋆l−1∈dom(f δ∈U+)
Defining the concave function g(x) = u⊤l−1x− f−(x), and applying Lemma 2.5.2 we obtain
sup (pl − q
⊤ l−1
l) z (θ,x+ δ) (A.4)
δ∈U
= sup inf supv⊤ zl−2(θ,x+ δ) + (p ⊤ l−1l−1 l − ql) b . (A.5)
u ⋆ v ∈dom(g )l− ⋆1∈dom(f ) l−1 δ∈U+
Lastly, by Lemma 2.5.3 we can substitute
ul−1 = ([(W
l−1]+)⊤pl + ([−W
l−1]+)⊤ql)⊙ sl−1
= pl−1, and
v = (([W l−1]+)⊤p + ([−W l−1]+)⊤q )⊙ s − (p [−W l−1 +l−1 l l l−1 l ] + ql[W
l−1]+)⊙ tl−1
= pl−1 − ql−1,
which together with the induction hypothesis imply that Eq. (A.3) is equivalent to
148
y
sup (∆e )⊤k z
ℓ−1(θ,x+ δ) =
δ∈U
∑L−1
y
sup inf . . . sup inf sup (p ⊤ l−2 ⊤ ℓ ⊤ Ll−1 − ql−1) z (θ,x+ δ) + (pℓ+1 − qℓ+1) b + (∆ek) b
s t tL L sl−1 l−1 δ∈U ℓ=l−1
s.t. ypL = [(W L)⊤(∆e +k)] ⊙ sL
y
qL = [−(W
L)⊤(∆ek)]
+ ⊙ tL
pℓ = (([W
ℓ]+)⊤pℓ+1 + ([−W
ℓ]+)⊤qℓ+1)⊙ sℓ ∀ l − 1 ≤ ℓ ≤ L− 1
qℓ = (([−W
ℓ]+)⊤p + ([W ℓ]+ℓ+1 )
⊤qℓ+1)⊙ tℓ ∀ l − 1 ≤ ℓ ≤ L− 1
0 ≤ sℓ, tℓ ≤ 1 ∀ ℓ = l − 1, . . . , L,
and therefore the theorem holds for l − 1 as desired.
Corollary A.2.2 (Generalization of Corollary 2.5.5).
If U = {δ : ∥δ∥p ≤ ρ}, then:
y
sup (∆ek)
⊤zL(θ,x+ δ) (A.6)
δ∈U
∑L−1
y
=sup inf . . . sup inf ρ∥(p2 − q
⊤ 1 ⊤ 1 ⊤
2) W ∥q + (p2 − q2) W x+ (pℓ+1 − qℓ+1) b
ℓ+(∆ek)
⊤bL
s tL L s
t
2 2 ℓ=1
y
s.t. p = [(W L)⊤L (∆ek)]
+ ⊙ sL
q = [−(W L ⊤
y
L ) (∆e
+
k)] ⊙ tL
pℓ = (([W
ℓ]+)⊤p + ([−W ℓ]+)⊤ℓ+1 qℓ+1)⊙ sℓ ∀ ℓ = 2, . . . , L− 1
qℓ = (([−W
ℓ]+)⊤p ℓ + ⊤ℓ+1 + ([W ] ) qℓ+1)⊙ tℓ ∀ ℓ = 2, . . . , L− 1
0 ≤ sℓ, tℓ ≤ 1 ∀ ℓ = 2, . . . , L,
(A.7)
where ∥ · ∥q is the conjugate norm of ∥ · ∥p.
Proof. The proof follows directly after applying Theorem A.2.1 with l = 2 and using again
149
Eq. (2.26).
Definition A.2.2.1. We introduce the following definitions to simplify notation:
s := (s2, . . . , sL)
t := (t2, . . . , tL)
y
p (s, t) := [(W 2)⊤(∆e )]+L k ⊙ sL
y
qL(s, t) := [−(W
2)⊤(∆ek)]
+ ⊙ tL
( )
pℓ(s, t) := ([W
ℓ]+)⊤p (s, t) + ([−W ℓ]+)⊤ℓ+1 qℓ+1(s, t) ⊙ sℓ ∀ 1 ≤ ℓ < L
( )
qℓ(s, t) := ([−W
ℓ]+)⊤p ℓ + ⊤ℓ+1(s, t) + ([W ] ) qℓ+1(s, t) ⊙ tℓ ∀ 1 ≤ ℓ < L
∑L−1
⊤ ′
Rℓ(s, t) := (pℓ′+1(s, t)− qℓ′+1(s, t)) b
ℓ .
ℓ′=ℓ
Theorem A.2.3 (Generalization of Theorem 2.5.6).
y
sup (∆e )⊤zLk (θ,x+ δ)
δ:∥δ∥1≤ρ
{ }
≤ inf max max gL Lk,m(θ,x, t, ρ), gk,m(θ,x, t,−ρ) , (A.8)
0≤t≤1 m∈[M ]
where the new network g is defined by the equations
g1m(W ,x, t, a, r) = r(aW
1 +W 1m x+ b
1)
gℓm(θ,x, t, a, r) = [rW
ℓ]+[gℓ−1m (W ,x, t, a, 1)]
++ [−rW ℓ]+[gℓ−1m (W ,x, t, a,−1)]⊙ tℓ + rb
ℓ
y
gLk,m(θ,x, t, a) = [(∆ek)
⊤W L]+[gL−1m (θ,x, t, a, 1)]
++
y y
[−(∆ek)
⊤W L]+[gL−1m (θ,x, t, a,−1)]⊙ tL + (∆e
⊤ L
k) b ,
150
for all 1 < ℓ < L, 1 ≤ k ≤ K, a ∈ {ρ,−ρ}, and r ∈ {−1, 1}.
The proof of this theorem relies on the following lemma.
Lemma A.2.4. For all 2 ≤ ℓ ≤ L− 1 it holds
sup p (s, t)⊤gℓ−1(θ,x, t, a, 1) + q (s, t)⊤gℓ−1ℓ m ℓ m (θ,x, t, a,−1) (A.9)
0≤sℓ≤1
=p (s, t)⊤gℓℓ+1 m(θ,x, t, a, 1) + qℓ+1(s, t)
⊤gℓm(θ,x, t, a,−1)− (p
⊤ ℓ
ℓ+1(s, t)− qℓ+1(s, t)) b .
(A.10)
Proof. Let 2 ≤ ℓ ≤ L− 1, we have
sup pℓ(s, t)
⊤gℓ−1m (θ,x, t, a, 1) + qℓ(s, t)
⊤gℓ−1m (θ,x, t, a,−1)
0≤sℓ≤1
( )
ℓ + ⊤ ℓ + ⊤ ⊤= sup (([W ] ) p ℓ−1ℓ+1(s, t) + ([−W ] ) qℓ+1(s, t))⊙ sℓ gm (θ,x, t, a, 1)+
0≤sℓ≤1
( )⊤
(([−W ℓ]+)⊤pℓ+1(s, t) + ([W
ℓ]+)⊤q ℓ−1ℓ+1(s, t))⊙ tℓ gm (θ,x, t, a,−1)
( ) ( )
= sup ([W ℓ]+)⊤p ℓ + ⊤
⊤
ℓ+1(s, t) + ([−W ] ) qℓ+1(s, t) g
ℓ−1
m (θ,x, t, a, 1)⊙ sℓ +
0≤sℓ≤1
( )⊤ ( )
([−W ℓ]+)⊤pℓ+1(s, t) + ([W
ℓ]+)⊤q ℓ−1ℓ+1(s, t) gm (θ,x, t, a,−1)⊙ tℓ
( )⊤
= ([W ℓ]+)⊤pℓ+1(s, t) + ([−W
ℓ]+)⊤qℓ+1(s, t) [g
ℓ−1
m (θ,x, t, a, 1)]
++
( )⊤ ( )
([−W ℓ]+)⊤pℓ+1(s, t) + ([W
ℓ]+)⊤qℓ+1(s, t) g
ℓ−1
m (θ,x, t, a,−1)⊙ tℓ
=pℓ+1(s, t)
⊤gℓm(θ,x, t, a, 1) + qℓ+1(s, t)
⊤gℓm(θ,x, t, a,−1)− (pℓ+1(s, t)− qℓ+1(s, t))
⊤bℓ,
as desired.
Proof of theorem A.2.3.
151
By Corollary A.2.2 with p = 1 we know
y
sup (∆e ⊤k) (z
L(θ,x+ δ)− bL) (A.11)
δ:∥δ∥1≤ρ
≤ sup inf, . . . sup inf ρ∥(p2(s, t)− q2(s, t))
⊤W 1∥∞ + (p2(s, t)− q2(s, t))
⊤W 1x+R1(s, t)
s tL L s
t
2 2
(A.12)
≤ inf sup ρ∥(p2(s, t)− q2(s, t))
⊤W 1∥∞ + (p2(s, t)− q2(s, t))
⊤W 1x+R1(s, t), (A.13)
tL...t2 sL...s2
where the last inequality follows from the min-max inequality. Observe that pℓ′(s, t) and
qℓ′(s, t) are independent on sℓ for all ℓ′ > ℓ, which in turn implies that Rℓ′(s, t) does not
depend on sℓ for all ℓ′ > ℓ. We can then solve the optimization problem in Eq. (A.7) for
fixed t as follows:
sup ρ∥(p2(s, t)− q2(s, t))
⊤W 1∥∞ + (p2(s, t)− q2(s, t))
⊤W 1x+R1(s, t)
0≤sL,...,s2≤1
{
= max max sup (p ⊤ 1 12(s, t)− q2(s, t)) (W (x+ ρem) + b ) +R2(s, t),
m∈[M ] 0≤sL,...,s2≤1
}
sup (p2(s, t)− q2(s, t))
⊤(W 1(x− ρem) + b
1) +R2(s, t)
0≤sL,...,s2≤1
{
= max max sup p (s, t)⊤g1 ⊤ 12 m(θ,x, t, ρ, 1) + q2(s, t) gm(θ,x, t, ρ,−1) +R2(s, t),
m∈[M ] 0≤sL,...,s2≤1
}
sup p ⊤ 1 ⊤ 12(s, t) gm(θ,x, t,−ρ, 1)+q2(s, t) gm(θ,x, t,−ρ,−1)+R2(s, t) .
0≤sL,...,s2≤1
By repeatedly applying Lemma A.2.4 for each ℓ = 2, . . . , L− 1 we obtain
152
sup ρ∥(p2(s, t)− q2(s, t))
⊤W 1∥∞ + (p2(s, t)− q2(s, t))
⊤W 1x+R1(s, t)
0≤sL,...,s2≤1
{
= max max sup p (s, t)⊤gL−1(θ,x, t, ρ, 1) + q (s, t)⊤gL−1L m L m (θ,x, t, ρ,−1),
m∈[M ] 0≤sL≤1
}
sup p (s, t)⊤gL−1(θ,x, t,−ρ, 1) + q (s, t)⊤gL−1L m L m (θ,x, t,−ρ,−1)
0≤sL≤1
{
( )
y
= max max sup [(∆e ⊤ L + L−1k) W ] gm (θ,x, t, ρ, 1)⊙sL
m∈[M ] 0≤sL≤1
y
+[−(∆e )⊤W L]+(gL−1k m (θ,x, t, ρ,−1)⊙ tL),
( )
y
sup [(∆e )⊤W L]+k g
L−1
m (θ,x, t,−ρ, 1)⊙ sL
0≤sL≤1
}
y
+[−(∆e )⊤W L]+(gL−1k m (θ,x, t,−ρ,−1)⊙ tL)
{ }
y
= max max gLk,m(θ,x, t, ρ), g
L
k,m(θ,x, t,−ρ) − (∆ek)
⊤bL,
m∈[M ]
as desired.
153
A.3 Convolutional Neural Networks
While in this paper we only consider feed forward neural networks, it is possible to extend
the RUB method to convolutional neural networks that use ReLU and MaxPool activation
functions. In fact, Lemma 2.5.3 can be modified as follows:
Lemma A.3.1. Let x ∈ RA×B. Define MP (x) ∈ RC×D as the MaxPool function whose
(c, d) coordinate corresponds to maxi∈I {xi} for fixed sets of indices Icd, and denote by ⊛ thecd
convolution operation. If u ∈ RA×B, p, q ∈ RC×D have all nonnegative coordinates, then the
functions f(x) = p⊛MP [x]+ and g(x) = x⊛ u− q ⊛MP [x]+ satisfy

 ∑
0 if 0 ≤ i∈I zi ≤ pcd,∀ c ∈ [C], d ∈ [D],cd
a) f ⋆(z) = and

∞ otherwise,

 ∑0 if ucd − qcd ≤ i∈I zi ≤ ucd, ∀ c ∈ [C], d ∈ [D],cd
b) g⋆(z) =
−∞ otherwise.
The lemma above allows to obtain an upper bound for the adversarial loss of convolutional
networks with a very similar proof to that of Theorem A.2.3. However, convolutional networks
are notoriously more memory consuming and therefore computation of the robust upper
bound requires more resources. We have then left this computation for future work, but we
here report results for the other robust training methods using these more complex neural
networks.
We evaluate a convolutional neural network (denoted as CNN ) that has been commonly
used in previous works of adversarial robustness Madry et al. (2019). It has two convolutional
layers alternated with pooling operations, and two dense layers. We compare adversarial
accuracy across four different methods: aRUB-L∞, aRUB-L1, PGD-L∞ and Nominal training.
Results for the CIFAR data set are shown in Table A.1. We observe that the proposed
methods aRUB-L1 and aRUB-L∞ yield the highest adversarial accuracies with respect to
154
PGD-L2 attacks. For the MNIST data set and the FASHION MNIST data set (Table A.2
and Table A.3, respectively), we see that aRUB-L∞ has the highest adversarial accuracies
with respect to PGD-L2 attacks when ρ ≤ 0.1, whereas PGD-L∞ does best for larger values
of ρ.
Table A.1: Adversarial Accuracy for CIFAR with CNN architecture and PGD-L2 attacks.
ρ = 0.000 0.010 0.020 0.030 0.100 1.000 3.000 5.000 10.000
aRUB-L1 71.17 70.98 70.70 70.51 69.88 60.31 44.69 35.35 17.89
aRUB-L∞ 72.03 71.76 71.45 71.25 69.84 56.13 43.98 34.26 16.25
PGD-L∞ 70.78 70.55 70.43 70.39 69.65 59.96 43.59 29.10 16.21
Nominal 71.29 71.13 71.05 70.66 69.38 48.16 16.05 10.04 10.04
Table A.2: Adversarial Accuracy for Fashion MNIST with CNN architecture and PGD-L2
attacks.
ρ = 0.000 0.010 0.020 0.030 0.100 1.000 3.000 5.000 10.000
aRUB-L1 91.25 91.09 90.78 90.55 89.02 82.46 68.95 54.80 33.20
aRUB-L∞ 91.37 91.13 91.05 90.86 90.59 82.81 71.45 60.23 30.86
PGD-L∞ 90.59 90.55 90.43 90.23 89.30 84.45 73.20 67.54 57.77
Nominal 91.02 90.90 90.62 90.43 89.02 77.85 56.72 40.51 10.16
155
Table A.3: Adversarial Accuracy for MNIST with CNN architecture and PGD-L2 attacks.
ρ = 0.000 0.010 0.020 0.030 0.100 1.000 3.000 5.000 10.000
aRUB-L1 99.38 99.38 99.38 99.34 99.30 98.32 91.68 70.51 33.83
aRUB-L∞ 99.38 99.38 99.38 99.38 99.30 98.79 95.70 87.46 47.93
PGD-L∞ 99.22 99.22 99.14 99.14 99.02 98.55 96.37 91.29 55.39
Nominal 99.30 99.26 99.26 99.26 99.18 97.93 89.69 60.00 9.34
156
Appendix B
Chapter 3 Appendix
B.1 Results Tables
We present the evaluation results for natural accuracy, adversarial accuracy, stability, and
sparsity on the test sets across all data sets and methods discussed in the paper. The natural
accuracy results can be found in Table B.1, adversarial accuracy results in Table B.2, stability
results in Table B.3, and sparsity results in Table B.4.
157
Name n p k DL Rob. St. Sp. Rob. St. St. HDL
+Sp. +Sp. +Rob.
echocardiogram 131 7 3 40.00 42.22 42.96 40.00 42.22 42.22 37.78 36.30
hill-valley 605 100 2 47.21 47.54 53.61 47.70 47.87 52.30 51.48 51.64
planning-relax 181 12 2 53.51 52.97 54.05 54.05 54.05 54.05 54.59 54.05
poker-hand 25009 10 10 54.87 55.24 55.09 54.89 54.61 54.41 55.19 54.17
hill-valley-noise 605 100 2 56.23 53.44 59.67 57.87 51.31 53.11 53.61 50.82
yeast 1483 8 10 58.45 59.06 59.12 59.80 60.88 59.80 59.46 60.54
haberman-survival 305 3 2 62.58 61.94 61.61 63.23 61.61 61.94 60.97 62.90
glass-identification 213 9 6 64.65 59.53 65.58 61.40 61.40 63.72 61.40 61.40
brst-cancer-ws-prog 197 32 2 71.00 70.50 72.50 71.50 73.00 69.50 71.50 66.50
hayes-roth 131 4 3 74.81 79.26 77.78 73.33 82.22 77.78 74.81 79.26
spectf-heart 79 44 2 76.25 86.25 73.75 73.75 80.00 75.00 83.75 87.50
hepatitis 154 19 2 78.71 80.00 77.42 78.71 79.35 79.35 77.42 79.35
connectionist-bench 989 10 11 79.19 80.30 83.54 78.08 72.32 74.44 83.23 76.67
libras-movement 359 90 15 79.44 82.50 80.00 75.83 80.56 77.22 79.44 80.00
bld-transf-serv-ctr 747 4 2 79.47 79.07 79.07 79.33 77.20 78.93 79.20 79.60
connect-bench-sonar 207 60 2 83.33 86.67 89.52 86.19 85.71 88.10 83.81 86.67
image-segmentation 209 19 7 83.81 87.14 84.29 83.33 89.52 84.76 87.14 83.81
ecoli 335 7 8 84.71 84.41 85.29 85.59 85.59 85.29 84.71 85.00
qsar-biodegradation 1054 41 2 85.12 85.69 85.21 84.55 84.74 85.21 84.93 84.36
parkinsons 194 21 2 86.67 85.64 87.18 88.72 86.15 86.15 86.67 85.13
magic-gamma-tel 19019 10 2 87.11 87.20 86.98 87.17 86.66 86.50 87.28 86.80
letter-recognition 19999 16 26 90.26 89.57 91.15 82.63 87.60 86.31 90.84 88.79
statlog-proj-landsat 4434 36 6 90.39 90.67 90.64 90.44 89.97 90.53 90.46 90.03
wall-robot-nav-24 5455 24 4 92.36 92.23 92.11 92.49 92.89 92.88 92.47 93.08
spambase 4600 57 2 92.73 93.36 93.42 92.94 92.81 93.03 93.12 92.83
seeds 209 7 3 92.86 92.38 91.43 93.33 93.33 92.86 92.38 93.33
ozone-level-eight 2533 72 2 93.69 93.06 93.89 93.37 93.73 93.49 93.69 93.96
cnae-9 1079 856 9 93.70 93.98 94.07 92.87 93.24 92.59 93.89 92.87
balance-scale 624 4 3 94.24 92.96 93.60 91.68 88.64 93.12 94.24 94.56
ionosphere 350 34 2 94.93 94.93 94.37 95.21 91.83 93.24 95.77 94.08
brst-cancer-ws-orig 698 9 2 96.00 96.29 95.86 96.14 96.86 96.00 96.43 96.29
brst-cancer-ws-diag 568 30 2 97.37 96.49 97.37 96.49 96.49 96.14 97.19 96.67
ozone-level-one 2535 72 2 97.40 97.13 97.20 97.01 97.44 97.44 97.28 97.28
wall-robot-nav-4 5455 4 4 97.77 97.82 97.69 98.04 98.44 98.13 97.80 98.33
climate-simu-crash 539 18 2 97.78 97.59 97.78 96.67 96.85 96.67 97.59 97.04
optical-recog-digits 3822 64 10 97.83 98.01 97.73 97.59 98.09 98.07 98.14 97.93
wall-robot-nav-2 5455 2 4 97.88 98.13 98.02 98.30 98.13 98.30 98.33 98.39
dermatology 365 34 6 98.38 98.38 98.38 98.65 98.38 98.38 98.38 98.92
thyroid-disease-new 214 5 3 98.60 99.07 98.14 98.60 99.07 96.74 99.07 97.67
thyroid-disease-ann 3771 21 3 98.86 98.91 98.78 98.70 98.89 98.86 99.05 98.91
wine 177 13 3 98.89 98.33 98.89 100.00 97.78 97.22 97.78 98.33
pen-recog-digits 7493 16 10 99.23 99.05 99.16 99.11 99.41 99.31 99.19 99.41
skin-segmentation 2450563 2 99.91 99.90 99.90 99.90 99.88 99.89 99.91 99.89
banknote-authent 1371 4 2 99.93 100.00 100.00 99.85 97.16 89.75 100.00 95.71
iris 149 4 3 100.00 99.33 98.00 100.00 92.67 98.67 99.33 99.33
MNIST 70000 784 10 99.19 99.29 99.21 99.13 99.22 99.16 99.31 99.30
Fashion-MNIST 70000 784 10 91.77 91.88 91.93 91.54 92.12 91.43 92.00 92.15
CIFAR10 60000 102410 68.42 69.15 68.25 66.60 57.74 69.82 68.78 69.03
Table B.1: Natural accuracy results for all UCI and vision data sets, where n denotes the
data size, p denotes the data dimension, and k denotes the number of classes. Darker blue
corresponds to higher nominal DL natural accuracy for the UCI data sets.
158
Name n p k DL Rob. St. Sp. Rob. St. St. HDL
+Sp. +Sp. +Rob.
echocardiogram 131 7 3 4.44 44.44 7.41 17.78 44.44 17.78 41.48 44.44
hill-valley 605 100 2 7.54 40.16 28.52 20.00 39.02 20.16 38.52 36.39
planning-relax 181 12 2 21.62 54.05 44.86 49.19 54.05 54.05 54.05 54.05
poker-hand 25009 10 10 50.70 50.70 50.70 50.70 50.70 50.70 50.70 51.95
hill-valley-noise 605 100 2 11.31 21.48 22.30 12.79 30.00 20.98 24.10 34.43
yeast 1483 8 10 0.47 32.26 0.27 0.20 32.26 0.00 32.26 32.26
haberman-survival 305 3 2 17.74 59.68 20.97 47.74 59.68 34.19 59.68 59.68
glass-identification 213 9 6 4.65 20.93 5.58 5.12 20.93 9.30 25.58 25.58
brst-cancer-ws-prog 197 32 2 18.00 72.50 21.00 43.50 72.50 39.50 72.50 72.50
hayes-roth 131 4 3 9.63 32.59 8.89 5.19 36.30 13.33 40.74 40.74
spectf-heart 79 44 2 27.50 25.00 32.50 40.00 25.00 13.75 21.25 25.00
hepatitis 154 19 2 5.81 70.97 9.03 21.94 70.97 29.68 70.97 70.97
connectionist-bench 989 10 11 4.04 1.52 4.34 1.41 6.46 3.84 8.79 8.99
libras-movement 359 90 15 0.00 0.83 0.28 0.00 2.50 0.00 4.72 2.50
bld-transf-serv-ctr 747 4 2 0.00 72.67 0.13 17.60 72.67 43.60 72.67 72.67
connect-bench-sonar 207 60 2 5.71 20.00 18.10 32.38 43.81 33.81 22.86 40.00
image-segmentation 209 19 7 4.29 0.95 2.86 2.38 9.52 3.33 9.52 8.10
ecoli 335 7 8 0.59 41.18 1.76 2.06 51.47 0.88 41.18 51.47
qsar-biodegradation 1054 41 2 36.21 72.04 24.08 31.75 72.04 29.67 72.04 72.04
parkinsons 194 21 2 58.97 74.36 63.08 63.08 74.36 68.72 74.36 74.36
magic-gamma-tel 19019 10 2 15.07 64.59 15.28 18.62 64.59 10.36 64.59 64.59
letter-recognition 19999 16 26 0.76 3.73 1.11 0.52 3.68 0.34 3.57 3.74
statlog-proj-landsat 4434 36 6 8.66 25.48 10.35 7.51 25.48 5.77 20.79 25.39
wall-robot-nav-24 5455 24 4 3.04 39.69 3.28 1.63 39.69 3.06 39.69 39.65
spambase 4600 57 2 40.67 58.41 48.08 45.06 58.41 49.25 58.41 58.41
seeds 209 7 3 23.81 24.29 31.90 27.14 32.38 15.24 31.43 30.95
ozone-level-eight 2533 72 2 49.59 94.48 48.88 94.48 94.48 56.69 94.48 94.48
cnae-9 1079 856 9 0.00 1.02 0.09 0.83 1.94 2.50 3.80 3.80
balance-scale 624 4 3 17.44 40.80 13.28 8.16 43.84 14.72 49.92 49.92
ionosphere 350 34 2 19.44 57.75 25.35 34.65 57.75 10.99 57.75 57.75
brst-cancer-ws-orig 698 9 2 17.71 60.71 23.71 31.29 60.71 15.14 60.71 60.71
brst-cancer-ws-diag 568 30 2 25.44 47.02 25.61 47.02 58.77 13.33 47.02 58.77
ozone-level-one 2535 72 2 73.78 97.44 81.22 97.44 97.44 89.17 97.44 97.44
wall-robot-nav-4 5455 4 4 3.42 39.69 4.85 7.69 39.69 10.49 39.69 39.69
climate-simu-crash 539 18 2 0.00 94.44 2.41 75.56 94.44 83.33 94.44 94.44
optical-recog-digits 3822 64 10 1.36 7.48 1.39 0.76 7.16 0.60 8.94 10.27
wall-robot-nav-2 5455 2 4 5.26 39.69 5.48 17.75 39.69 21.03 39.69 39.69
dermatology 365 34 6 4.59 20.27 2.97 5.68 20.27 10.81 20.27 20.27
thyroid-disease-new 214 5 3 9.77 65.12 10.23 13.02 65.12 13.02 65.12 65.12
thyroid-disease-ann 3771 21 3 48.42 91.79 54.97 54.12 91.79 55.23 91.79 91.79
wine 177 13 3 1.67 44.44 1.11 3.89 37.78 5.56 43.33 31.67
pen-recog-digits 7493 16 10 2.76 10.13 2.63 2.05 10.13 1.80 8.71 10.27
skin-segmentation 2450563 2 79.30 79.30 79.30 79.30 79.30 79.28 79.30 79.30
banknote-authent 1371 4 2 39.20 54.25 42.25 54.25 54.25 43.13 54.25 54.25
iris 149 4 3 12.00 16.00 12.00 10.00 24.67 10.00 32.67 32.67
MNIST 70000 784 10 49.60 78.37 51.52 43.80 74.67 39.94 79.50 75.97
Fashion-MNIST 70000 784 10 78.70 87.74 78.30 80.76 87.07 80.24 87.80 86.91
CIFAR10 60000 102410 28.81 48.61 28.75 34.87 43.98 33.73 47.32 43.96
Table B.2: Adversarial accuracy results for all UCI and vision data sets, where n denotes the
data size, p denotes the data dimension, and k denotes the number of classes. We use ρ = 0.1
for all data sets except CIFAR10 and Fashion-MNIST, for which we set ρ = 0.01. Darker
blue corresponds to higher nominal (DL) natural accuracy.
159
Name n p k DL Rob. St. Sp. Rob. St. St. HDL
+Sp. +Sp. +Rob.
echocardiogram 131 7 3 40.74 37.04 37.04 40.74 33.33 33.33 40.74 33.33
hill-valley 605 100 2 43.44 43.44 49.18 43.44 46.72 50.82 45.08 49.18
planning-relax 181 12 2 51.35 48.65 54.05 54.05 54.05 51.35 54.05 54.05
poker-hand 25009 10 10 54.60 54.32 54.56 54.48 53.20 53.52 54.70 52.32
hill-valley-noise 605 100 2 54.92 47.54 52.46 54.92 47.54 47.54 48.36 50.00
yeast 1483 8 10 57.58 56.90 56.57 57.24 59.60 58.92 56.90 58.92
haberman-survival 305 3 2 59.68 59.68 59.68 59.68 59.68 59.68 59.68 59.68
glass-identification 213 9 6 55.81 58.14 53.49 58.14 55.81 58.14 58.14 58.14
brst-cancer-ws-prog 197 32 2 67.50 67.50 67.50 67.50 62.50 60.00 72.50 67.50
hayes-roth 131 4 3 70.37 74.07 70.37 70.37 70.37 74.07 70.37 74.07
spectf-heart 79 44 2 68.75 68.75 75.00 68.75 68.75 62.50 75.00 75.00
hepatitis 154 19 2 70.97 70.97 70.97 70.97 74.19 70.97 70.97 77.42
connectionist-bench 989 10 11 75.25 77.78 80.81 71.72 63.13 60.61 76.26 70.71
libras-movement 359 90 15 75.00 75.00 79.17 68.06 75.00 70.83 81.94 75.00
bld-transf-serv-ctr 747 4 2 79.33 77.33 78.00 78.67 74.67 74.00 78.00 79.33
connect-bench-sonar 207 60 2 78.57 80.95 83.33 80.95 83.33 83.33 80.95 83.33
image-segmentation 209 19 7 78.57 78.57 80.95 76.19 73.81 83.33 78.57 78.57
ecoli 335 7 8 80.88 77.94 83.82 82.35 83.82 83.82 83.82 82.35
qsar-biodegradation 1054 41 2 84.83 84.83 83.89 81.99 83.41 83.89 84.36 83.89
parkinsons 194 21 2 84.62 82.05 82.05 84.62 79.49 76.92 84.62 79.49
magic-gamma-tel 19019 10 2 86.65 86.93 86.44 86.75 85.73 86.01 87.01 86.25
letter-recognition 19999 16 26 86.98 86.75 89.60 82.27 84.47 83.85 90.20 86.55
statlog-proj-landsat 4434 36 6 88.84 88.05 90.08 90.19 88.61 89.40 89.29 88.39
wall-robot-nav-24 5455 24 4 92.03 91.12 91.58 92.31 92.03 92.22 90.75 92.40
spambase 4600 57 2 92.18 92.73 92.62 92.29 92.40 92.51 92.40 92.62
seeds 209 7 3 90.48 90.48 92.86 92.86 92.86 90.48 90.48 92.86
ozone-level-eight 2533 72 2 93.10 94.08 93.10 93.29 94.48 92.31 93.10 93.49
cnae-9 1079 856 9 91.20 93.06 92.59 92.13 92.59 92.13 93.06 91.20
balance-scale 624 4 3 91.20 91.20 92.00 90.40 72.80 91.20 91.20 92.80
ionosphere 350 34 2 92.96 90.14 92.96 91.55 94.37 91.55 90.14 91.55
brst-cancer-ws-orig 698 9 2 93.57 93.57 95.71 95.71 95.71 95.00 95.71 95.71
brst-cancer-ws-diag 568 30 2 95.61 94.74 95.61 94.74 95.61 93.86 94.74 95.61
ozone-level-one 2535 72 2 97.24 96.46 96.85 96.85 97.44 97.44 96.85 97.05
wall-robot-nav-4 5455 4 4 97.44 97.16 97.34 97.25 97.71 97.62 97.62 97.80
climate-simu-crash 539 18 2 96.30 96.30 95.37 94.44 96.30 93.52 96.30 96.30
optical-recog-digits 3822 64 10 97.39 97.39 96.99 97.12 97.91 97.91 97.78 97.65
wall-robot-nav-2 5455 2 4 96.98 97.53 97.34 97.25 97.53 98.17 97.89 97.44
dermatology 365 34 6 97.30 97.30 97.30 98.65 95.95 97.30 97.30 94.59
thyroid-disease-new 214 5 3 97.67 97.67 95.35 93.02 95.35 93.02 97.67 93.02
thyroid-disease-ann 3771 21 3 98.28 98.81 98.01 98.54 98.68 98.41 98.54 98.28
wine 177 13 3 94.44 97.22 94.44 94.44 97.22 88.89 97.22 97.22
pen-recog-digits 7493 16 10 98.93 98.67 99.00 99.07 99.13 99.33 99.07 99.07
skin-segmentation 2450563 2 99.90 99.89 99.89 99.89 99.86 99.87 99.90 99.87
banknote-authent 1371 4 2 99.64 100.00 100.00 99.64 87.27 70.18 100.00 87.27
iris 149 4 3 100.00 100.00 93.33 100.00 70.00 96.67 96.67 96.67
MNIST 70000 784 10 99.14 99.24 99.15 99.06 99.07 98.86 99.25 99.19
Fashion-MNIST 70000 784 10 91.53 91.36 91.75 91.29 91.65 91.15 91.38 90.19
CIFAR10 60000 102410 68.28 65.44 68.13 63.14 56.04 61.76 63.76 56.48
Table B.3: Stability (worst case accuracy) results for all UCI and vision data sets, where n
denotes the data size, p denotes the data dimension, and k denotes the number of classes.
Darker blue corresponds to higher nominal (DL) natural accuracy.
160
Name n p k DL Rob. St. Sp. Rob. St. St. HDL
+Sp. +Sp. +Rob.
echocardiogram 131 7 3 0.0 0.0 0.0 52.17 62.57 65.69 0.0 65.49
hill-valley 605 100 2 0.0 0.0 0.0 49.00 41.94 34.89 0.0 28.80
planning-relax 181 12 2 0.0 0.0 0.0 54.47 63.00 69.13 0.0 70.56
poker-hand 25009 10 10 0.0 0.0 0.0 29.56 49.23 51.31 0.0 49.83
hill-valley-noise 605 100 2 0.0 0.0 0.0 50.67 39.73 45.04 0.0 29.28
yeast 1483 8 10 0.0 0.0 0.0 53.08 53.41 54.71 0.0 54.36
haberman-survival 305 3 2 0.0 0.0 0.0 47.21 49.40 53.34 0.0 53.10
glass-identification 213 9 6 0.0 0.0 0.0 56.14 63.65 64.86 0.0 64.85
brst-cancer-ws-prog 197 32 2 0.0 0.0 0.0 61.87 67.34 69.49 0.0 67.78
hayes-roth 131 4 3 0.0 0.0 0.0 61.56 67.49 66.39 0.0 67.38
spectf-heart 79 44 2 0.0 0.0 0.0 81.02 85.84 86.10 0.0 85.58
hepatitis 154 19 2 0.0 0.0 0.0 64.56 68.71 74.01 0.0 72.74
connectionist-bench 989 10 11 0.0 0.0 0.0 57.95 63.20 63.34 0.0 63.45
libras-movement 359 90 15 0.0 0.0 0.0 66.76 70.90 71.00 0.0 70.73
bld-transf-serv-ctr 747 4 2 0.0 0.0 0.0 46.21 48.90 52.65 0.0 50.52
connect-bench-sonar 207 60 2 0.0 0.0 0.0 76.09 80.04 81.10 0.0 80.07
image-segmentation 209 19 7 0.0 0.0 0.0 60.95 66.47 69.23 0.0 67.24
ecoli 335 7 8 0.0 0.0 0.0 54.91 62.25 63.26 0.0 62.31
qsar-biodegradation 1054 41 2 0.0 0.0 0.0 44.89 44.32 51.82 0.0 47.65
parkinsons 194 21 2 0.0 0.0 0.0 59.42 60.95 67.40 0.0 63.23
magic-gamma-tel 19019 10 2 0.0 0.0 0.0 50.80 51.43 57.52 0.0 52.84
letter-recognition 19999 16 26 0.0 0.0 0.0 58.77 63.84 65.32 0.0 64.86
statlog-proj-landsat 4434 36 6 0.0 0.0 0.0 45.65 50.60 52.06 0.0 51.53
wall-robot-nav-24 5455 24 4 0.0 0.0 0.0 51.85 55.96 56.59 0.0 55.53
spambase 4600 57 2 0.0 0.0 0.0 56.81 56.56 59.90 0.0 55.84
seeds 209 7 3 0.0 0.0 0.0 65.56 71.45 72.73 0.0 72.33
ozone-level-eight 2533 72 2 0.0 0.0 0.0 66.09 68.13 67.86 0.0 67.38
cnae-9 1079 856 9 0.0 0.0 0.0 61.94 62.06 73.43 0.0 61.91
balance-scale 624 4 3 0.0 0.0 0.0 57.25 63.41 63.35 0.0 63.14
ionosphere 350 34 2 0.0 0.0 0.0 63.16 66.76 68.17 0.0 66.74
brst-cancer-ws-orig 698 9 2 0.0 0.0 0.0 48.29 53.44 55.78 0.0 55.97
brst-cancer-ws-diag 568 30 2 0.0 0.0 0.0 68.91 71.38 71.46 0.0 70.69
ozone-level-one 2535 72 2 0.0 0.0 0.0 66.22 68.25 67.97 0.0 68.58
wall-robot-nav-4 5455 4 4 0.0 0.0 0.0 52.89 61.43 60.75 0.0 60.50
climate-simu-crash 539 18 2 0.0 0.0 0.0 59.75 64.56 65.88 0.0 66.29
optical-recog-digits 3822 64 10 0.0 0.0 0.0 65.42 68.85 71.25 0.0 69.22
wall-robot-nav-2 5455 2 4 0.0 0.0 0.0 63.85 72.12 72.68 0.0 71.74
dermatology 365 34 6 0.0 0.0 0.0 67.03 73.49 74.61 0.0 74.21
thyroid-disease-new 214 5 3 0.0 0.0 0.0 67.25 74.48 74.62 0.0 74.12
thyroid-disease-ann 3771 21 3 0.0 0.0 0.0 48.81 50.15 53.22 0.0 50.36
wine 177 13 3 0.0 0.0 0.0 74.62 81.05 80.60 0.0 80.59
pen-recog-digits 7493 16 10 0.0 0.0 0.0 59.13 62.58 63.33 0.0 63.23
skin-segmentation 2450563 2 0.0 0.0 0.0 63.77 71.27 72.15 0.0 72.56
banknote-authent 1371 4 2 0.0 0.0 0.0 76.88 84.17 84.97 0.0 84.98
iris 149 4 3 0.0 0.0 0.0 69.89 78.03 77.56 0.0 77.79
MNIST 70000 784 10 0.0 0.0 0.0 39.20 75.06 44.77 0.0 76.05
Fashion-MNIST 70000 784 10 0.0 0.0 0.0 45.41 68.94 48.40 0.0 69.18
CIFAR10 60000 102410 0.0 0.0 0.0 50.51 82.34 55.58 0.0 81.41
Table B.4: Sparsity results for all UCI and vision data sets, where n denotes the data size, p
denotes the data dimension, and k denotes the number of classes. Darker blue corresponds
to higher nominal (DL) natural accuracy.
161
162
Appendix C
Chapter 4 Appendix
C.1 HHC Data Summary Statistics
We report the summary statistics of the data after the process of inclusion, exclusion, and
splits from the Machine Learning Modeling section. For each hospital HA-HG, the number
of patients, admissions, and patient days in the union of training, validation, and testing sets
are summarized in Table C.1.
Table C.1: Summary of Data Size.
Hospital HA HB HC HD HE HF HG
# Patients 105,184 15,493 7,956 20,011 15,576 11,624 4,838
# Admissions 171,072 23,354 12,822 29,490 21,612 15,319 6,924
# patient days 879,357 106,662 52,931 139,542 90,924 79,615 26,184
C.2 Accuracy for each Hospital at HHC
We report the accuracy, precision, and recall for green and red alerts at all seven hospitals in
Table C.2. Among the hospitals, green alerts have 0.687-0.768 accuracy, 0.588-0.629 precision,
and 0.701-0.8 recall; red alerts have 0.885-0.925 accuracy, 0.477-0.55 precision, and 0.471-0.715
recall.
163
Table C.2: Precision and Recall under Selected Thresholds for Alerts.
Alert Hospital HA HB HC HD HE HF HG
Accuracy 0.767 0.734 0.714 0.751 0.739 0.768 0.687
New green Precision 0.621 0.604 0.588 0.611 0.623 0.617 0.629
Recall 0.746 0.786 0.8 0.764 0.796 0.701 0.78
Accuracy 0.899 0.901 0.895 0.881 0.896 0.886 0.925
Red Precision 0.477 0.55 0.574 0.492 0.528 0.505 0.53
Recall 0.705 0.668 0.553 0.691 0.715 0.663 0.471
C.3 Empirical Treatment Effect for HHC
Table C.3 presents information and deployment progress of all units with general level of care
and offering cardiology, medicine or surgical services. By January 16, 2023, 15 treatment
units had fully integrated the predictions in their review process, where unit leads review the
predictions with the provider team daily and adjust decisions accordingly. As of April 15,
2023, 12 control units (with an NA Start Date) had not officially integrated the daily process.
We consider a linear regression approach to estimate the impact of the adoption of the
tool on four different outcomes (time is measured in days): average LOS in the unit (Avg.
LOS), its logarithm (log(Avg. LOS)), the average of the log-LOS in the unit (Avg. log(LOS)),
and the average time difference between the discharge order and the first occurrence of the
green alert (we refer to this outcome as ∆order - alert).
We use discharge data from all half-month periods in January 16, 2020 - April 15, 2023,
and for the same treatment and control units as in Section 4.2.3. We consider the unit to
be not treated (resp. treated) on all time periods before (resp. after) the utilization start
date from Table C.3. Periods containing the utilization start date are excluded. We include
two categories of control variables: calendar variables (year-month-half indicators) and unit
controls (hospital name, unit name, and the number of beds). We then construct a linear
regression model to predict each outcome as a function of a binary variable indicating whether
the tool is deployed in that unit at that time period, and the control variables. Formally, for
164
Table C.3: Unit Deployment Progress Information.
Hospital Unit Start Date Specialty Capacity
HA HA CONKLIN 2 NA Medicine/Oncology 27
HA HA CONKLIN 4 9/13/22 Medicine 25
HA HA CONKLIN 5 7/11/22 Medicine 47
HA HA BLISS 7 EAST 8/23/22 Medicine 17
HA HA BLISS 10 EAST NA Cardiology 14
HA HA CENTER 10 NA Cardiology 26
HA HA CENTER 12 7/11/22 Medicine 26
HA HA NORTH 10 NA Cardiology 27
HA HA NORTH 12 7/11/22 Medicine 20
HB HB A3 MEDSURG 8/23/22 Medicine/Surgical 30
HB HB E4 Cardiology 8/23/22 Cardiology 28
HC HC FOURTH FLOOR 8/23/22 Medicine/Surgical 28
HC HC FIFTH FLOOR 8/23/22 Medicine/Surgical 29
HD HD EAST 2 NA Medicine/Observation 12
HD HD WEST 2 NA Medicine 15
HD HD NORTH 3 1/15/23 Medicine 24
HD HD NORTH 4 10/22/22 Medicine/Cardiology 28
HD HD NORTH 5 8/23/22 Medicine/Stroke 30
HE HE PAVILION D NA Medicine 28
HE HE PAVILION E 1/15/23 Medicine 28
HF HF 6 NORTH NA Cardiology 20
HF HF 6 SOUTH NA Cardiology 20
HF HF 9 NORTH NA Medicine 22
HF HF 10 NORTH NA Medicine 29
HG HG 4 SHEA EAST 1/15/23 Medicine/Surgical 30
HG HG 4 SHEA NORTH 1/15/23 Medicine/Surgical 12
HG HG GREER NA Medicine/Surgical 23
every unit i and time period t, we consider a regression model of the form
Yi,t = αi + λt + µWi,t + βXi + εi,t,
where Yi,t is the outcome of interest, αi (resp. λt) is the unit (resp. time) fixed effect, Wi,t is
the binary treatment indicator variable for unit i at time t, and Xi is the set of additional
controls for unit i (in our case, hospital and number of beds). The coefficient µ corresponds
to the treatment effect, assumed homogeneous across units and time.
We select this approach given our staggered roll-out setup (different units started using
the tool at different times). This estimation strategy is referred to in the literature as
Difference-in-Difference with variations in treatment times (Callaway and Sant’Anna 2021,
165
Table C.4: Regression Results of our Difference-in-Difference Model for Estimating the Impact
of our Tool after Deployment.
Length of Stay
Outcome Avg. LOS log(Avg. LOS) Avg. log(LOS) Avg. ∆order - alert
Coefficient -0.457 -0.082 -0.055 -0.194
Standard Error 0.167 0.028 0.023 0.09
p-value 6.4× 10−3 3.9× 10−3 1.7× 10−2 3.3× 10−2
Obs. 1846 1846 1846 1827
R-square 0.394 0.454 0.508 0.424
Adjusted R-square 0.357 0.421 0.479 0.389
Controls: hospital, unit, number of beds, month-year-half.
Standard Errors: cluster-robust at the unit level.
Goodman-Bacon 2021), and it accounts for variations in LOS due to time non-stationarity
(calendar variables) and differences in patient populations and hospital types (hospital and
unit fixed effects). Given the correlation in our setting between the units and the treatment
assignment, the default standard errors could overestimate the precision of the estimator
(Cameron and Miller 2015), so we compute cluster-robust standard errors (Liang and Zeger
1986) at a unit-level instead.
The results of the linear regression model are shown in Table C.4. The treatment variable
has a negative and statistically significant coefficient across all outcomes. In particular, for
the Avg. LOS, the coefficient value of -0.457 (p-value of 0.006) indicates that, everything
else (i.e., all the control variables) being equal, using the tool reduces the average LOS by
0.457 days. Similarly, regressing the average of the log(LOS) or the logarithm of the average
LOS indicates a statistically significant negative treatment effect, yet in multiplicative rather
than additive terms (around 8–5% reduction in LOS). Lastly, the coefficient of the treatment
variable for the Avg. (∆order - alert) outcome has a value of -0.194 (p-value of 0.033), suggesting
that the reduction in LOS can be partially attributed to discharge orders being placed earlier
(0.194 days sooner), potentially thanks to observing the green alert.
There are limitations to our approach, including some discussed in Section 4.2.3. Our
analysis assumes that the treatment effect on the outcomes of interest can be decomposed
into two-way fixed effects and follows a linear relationship with the control and treatment
166
variables. However, there may be other confounding factors at the unit and patient levels
that can influence LOS, such as patient demographics and other initiatives aimed at reducing
LOS in these hospitals. Most importantly, the design of the staggered roll-out was not done
with empirical validation in mind. The units treated and the time where deployment started
in these units at random and may be correlated with LOS and the potential impact of the
tool (e.g., prioritizing units that would benefit the most or would be ‘easy adopters’). This
deficiency can lead to bias in our DiD estimates, in addition to issues of treatment effect
heterogeneity across units and over time (Bertrand et al. 2004). Lastly, discussions with the
hospital network reveal challenges in accurately measuring and quantifying the exact extent
of tool usage. For instance, treatment units exhibit varying degrees of tool utilization across
different medical teams.
167
168
Appendix D
Chapter 5 Appendix
D.1 Reproducing Kernel Hilbert Spaces Overview
A reproducing kernel Hilbert space (RKHS) is a Hilbert space in which the elements are
functions that preserve pointwise distance. Specifically, if two functions are close with respect
to the Hilbert space norm, then their pointwise evaluations are close with respect to the norm
of the functions’ output space. Each RKHS is generated by a positive definite kernel K(·, ·);
a function K : Z × Z → R satisfying
∑m ∑m
a i jiajK(z , z ) ≥ 0, ∀m ∈ N, z1, . . . ,zm ∈ Z, a1, . . . , am ∈ R .
i=1 j=1
Definition D.1.0.1. A reproducing kernel Hilbert space H generated by a positive definite
kernel K : Z × Z → R is the closure of the set of functions
{ ∣ }
∣ ∑L
∣
f : Z → R f(z) = acK(zc∣ , z), for z1, . . . ,zL ∈ Z and L ∈ N ,
∣
c=1
169
∑L ∑L
with inner product of f (z) = 1 c c 2 c c1 c=1 a1K(z1, z) and f2(z) = c=1 a2K(z2, z) defined as
∑L1 ∑L2
⟨f , f ⟩ = ac1ac2K(zc1 , zc21 2 H 1 2 1 2 ).
c1=1 c2=1
The complexity of a reproducing kernel Hilbert space depends on the kernel generating
it. A linear kernel, for example, generates the Hilbert space of linear functions. A Gaussian
kernel generates much more complex spaces; it has the property that for compact spaces Z it
generates spaces that are dense in C(Z) (the space of continuous bounded funcions on Z) in
the maximum norm. Kernels with this property are called universal kernels (Micchelli et al.
2006) and they are very useful for solving functional optimization problems over continuous
functions, since the problem can be solved over the RKHS instead.
One of the main reasons why RKHSs are so useful when working with data is the fact
that they transform pointwise evaluation into an inner product of elements in the Hilbert
space, and vice-versa. Specifically, if f belongs to the reproducing kernel Hilbert space H
generated by kernel K, we have
f(x) = ⟨K(x, ·), f⟩H, (D.1)
for all x in the domain of f . This equivalence is known as the reproducing property. The next
result, known as the Representer Theorem, illustrates how in many cases solving functional
optimization problems over a RKHS is equivalent to solving an optimization problem over a
real space, and the proof relies mostly on the reproducing property.
Theorem D.1.1 (Representer Theorem). Suppose we have a data matrix Z = [z1, . . . ,zN ]
for some fixed data points z1, . . . ,zN ∈ Z. Let H be the reproducing kernel Hilbert space
generated by a kernel K : Z ×Z → R. Then, for any arbitrary loss function c : R×Z → R
170
and any regularization parameter λ ≥ 0, there exists a solution to
1 ∑
N
inf c(h(zn
λ
), zn) + ∥h∥2 (D.2)
h∈H N 2 H
n=1
that takes the form
∑N
h∗(·) = anK(z
n, ·) , (D.3)
n=1
for some scalars a1, . . . , aN ∈ RN .
The Representer Theorem implies that the solution to the functional optimization problem
(D.2) can be found by solving instead the following finite dimensional optimization problem
N
1 ∑ ( ) λ
min c (K[Z,Z]a)n, z
n + aTK[Z,Z]a , (D.4)
a∈RN N 2
n=1
where K[Z, Z̃] is the kernel matrix (between equal size matrices Z, Z̃) whose (n,m) compo-
nent is K(zn, z̃m).
The proof of this theorem follows from the fact that any function in H can be decomposed
as the sum of a function of the form (D.3) and a function orthogonal to every function of
this form. The theorem then follows after showing that, thanks to the reproducing property,
the sum in the objective of (D.2) is independent of the orthogonal part, and the second
term in the objective is increasing in the orthogonal part. This theorem can be extended
to a multidimensional version in which the optimization problem is over multiple functions
h1, . . . , hr ∈ H. In this case, the Representer Theorem tells us that there exists a solution
h∗ ∈ Hr that takes the form
h∗(·) = AK(Z, ·) , (D.5)
where K(Z, ·) := [K(z1, ·), . . . , K(zN , ·)]T and A ∈ Rr×N . For more details about the
representer theorem’s proof and applications we refer the reader to Wahba (1990), Soentpiet
et al. (1999), Schölkopf et al. (2002), Shawe-Taylor et al. (2004).
171
D.2 Lemmas
In this appendix we will state and proof several lemmas needed for the proof of Theorems
5.5.4 and Corollary 5.5.5. For generality, we consider the constrained problem in Eq. 5.19,
and define:
Q
( ) ( ) ∑ ( ( ))2
cψ u(z), z := c u(z), z + ψ max 0, gq u(z) , (D.6)
q=1
E(u) := Ez [c(u(z), z)] , (D.7)
[ ( )]
Eψ(u) := Ez cψ u(z), z , (D.8)
Eλ,ψ
λ
(u) := Eψ(u) + ∥u∥2H, (D.9)2
∑N ( )
λ,ψ 1 λ
ES (u) := c
ψ u(zn), zn + ∥u∥2 , (D.10)
N 2 H
n=1
( )
Eλ,ψ
λ
n (u) := c
ψ u(zn), zn + ∥u∥2H. (D.11)2
Lemma D.2.1. Under assumption 1, we have
∥u(z)∥2 ≤ κ∥u∥H ∀ z ∈ Z, u ∈ H.
Proof. Let u ∈ H and z ∈ Z. We have
∑T ∑rt ∑T ∑rt
∥u(z)∥2 2 22 = ut,i(z1:t−1) = ⟨ut,i, Kt(z1:t−1, ·)⟩H (by Eq. (D.1)),t
t=1 i=1 t=1 i=1
∑T ∑rt
≤ ∥K (z , ·)∥2t 1:t−1 H ∥ut,i∥
2
H (Cauchy-Schwarts Ineq.),t,i t
t=1 i=1
≤ κ2∥u∥2H (Assumption 5.5.1),
and the lemma follows.
Lemma D.2.2. Under Assumptions 1 and 2, we have
172
a) The true minimizer uλ,ψ of Eλ,ψ defined in D.9, satisfies
∥uλ,ψ
κC
∥H ≤ .
λ
b) If Algorithm 2 is initialized such that ∥u0∥H ≤
κC , then
λ
∥un
κC
∥H ≤ ∀ n ∈ [N ].
λ
Proof. a) We proceed as in Kivinen et al. (2004). We define λ,ψuS as the minimizer of
λ,ψ
ES ,
and λ,ψû = (1− ϵ)uS for ϵ > 0. We have
λ,ψ λ,ψ λ,ψ
0 ≤ ES (û)− ES (uS ) (by Optimality of
λ,ψ
uS ),
1 ∑
N ( ) ( )
= cψ
λ,ψ λ λ,ψ
(û(zn), zn)− cψ(uS (z
n), zn) + ∥û∥2H− ∥u
2
N 2 S
∥H
n=1
∑NC λ( )λ,ψ λ,ψ
≤ ∥û(zn)− u (znS )∥2 + (1− ϵ)
2 − 1 ∥u ∥2S H (by Assumption 5.5.2),N 2
n=1
N
κC∑ λ,ψ λ,ψ λ λ,ψ
≤ ∥û− uS ∥H − λϵ∥u ∥
2 + ϵ2∥u ∥2
N S H 2 S H
(by Lemma D.2.1),
n=1
λ,ψ λ,ψ 2 λ 2 λ,ψ= κCϵ∥uS ∥H − λϵ∥uS ∥H + ϵ ∥u
2
2 S
∥H.
Dividing by λ,ψϵ∥uS ∥H on both sides and taking the limit as we obtain
λ,ψ
ϵ→ 0 ∥uS ∥ ≤
κC
H λ
and the desired result then follows by taking the limit as N → ∞.
b) To prove the upper bound for the decisions output by the algorithm in each iteration
we proceed by induction on the iteration number n. We have ∥u0∥ ≤ κCH by assumption.λ
173
Suppose the bound holds for n. Then, we have
∥ ∥
∥un+1∥ ∥H = ΠDn+1 [(1− ηnλ)u
n − η ψn∇uc (u
n(zn), zn)]∥ (by definition 5.12),
H
≤ ∥(1− ηnλ)u
n − η ∇ cψ(un(znn u ), z
n)∥H (by definition of ΠDn+1)
≤ (1− ηnλ)∥u
n∥H + η ∥∇ c
ψ(unn u (z
n), zn)∥H (by triangle inequality),
≤ (1− η nnλ)∥u ∥H + ηnκ∥∇
ψ
u(z)c (u
n(zn), zn)∥2 (by Eq. (5.9)),
κC
≤ (1− ηnλ) + ηnκ∥∇ c
ψ
u(z) (u
n(zn), zn)∥2 (by assumption for n),
λ
κC κC
≤ (1− ηnλ) + ηnκC = (by Eq. 5.17),
λ λ
and therefore the result holds for all n ∈ N as desired.
Lemma D.2.3. Under Assumptions 1-3, for any u ∈ H satisfying ∥u∥ ≤ κCH , we haveλ
[ ]
E ∥∇ Eλ,ψu n (u)∥2H ≤ 4κ2C2.t
Proof. Fix some t ∈ {1, . . . , T}. Using the fact that ∥a + b∥2H ≤ 2(∥a∥2H + ∥b∥2H) for any
a, b ∈ H, as well as Assumption 5.5.3, we obtain that for any u ∈ H, it holds
[ ] [ ]
E ∥∇ Eλ,ψ(u)∥2u n H ≤ 2E ∥∇ cψu (u(zn), zn)∥2H + 2λ2∥u∥2H,
[ ]
≤ 2κ2E ∥∇u(z)cψ(u(zn), zn)∥2 2H + 2λ ∥u∥2H (By Eq. (5.9)),
( )2
[ ]
2 κC≤ 2κ E ∥∇ ψ n n 2 2u(z)c (u(z ), z )∥H + 2λ (By Lemma D.2.2),λ
( )2
≤ 2κ2C2 + 2λ2
κC
(by Eq. (5.17)),
λ
= 4κ2C2.
Lemma D.2.4. Under Assumption 3, given independently and identically distributed realiza-
174
tions {zn} of z, we have
∥∇ Eλ,ψ(un)− ∇̃ Eλ,ψ(un
ϵn
u n u n )∥H ≤ ,ηn
where ∇̃ Eλ,ψu n (u
n) was defined in Eq. (5.15).
Proof. By definition of ∇̃ Eλ,ψ nu n (u ) we have that
∥∇ Eλ,ψu n (u
n)− ∇̃ Eλ,ψu n (u
n)∥2H
∥ ∥
n n λ,ψ n 2∥ ∥
∥ λ,ψ n
u − ΠDn+1 [u − ηn∇uEn (u )]= ∇uE (u )− ∥n ,∥ η ∥n H
∥ ∥
∥ 1 2
= Π [un − η ∇ Eλ,ψ n
1 n λ,ψ n ∥
∥ Dn+1 n u n (u )]− (u − ηn∇uEη η n
(u ))∥ ,
n n H
1 ∥ ∥
= ∥un+1 − ũn+1
2
∥ ,
η2 Hn
where ũn+1 := un − η ∇ Eλ,ψn u n (un) is the result of applying one FSGD iteration to un. By
∥ ∥
the stopping criterion of the KOMP algorithm we know that ∥un+1 − ũn+1∥ ≤ ϵn, andH
therefore the lemma follows after taking the square root on both sides.
Lemma D.2.5. Under Assumption 3, for any u ∈ H with ∥u∥ ≤ κCH , we haveλ
Eλ,ψ nn (u )− E
λ,ψ
n (u) (D.12)
1 ( ) ϵ 2n ϵ
≤ ∥un − u∥2H − ∥u
n+1 − u∥2 λ,ψ n 2 nH + ηn∥∇uEn (u )∥H + ∥u − u∥
n
H + . (D.13)
2ηn ηn ηn
Proof. Firstly, notice that
∥un+1 − u∥2H = ∥u
n − ηn∇̃ E
λ,ψ
u n (u
n)− u∥2H
= ⟨un − η ∇̃ Eλ,ψn u n (u
n)− u,un − η λ,ψ nn∇̃uEn (u )− u⟩
(D.14)
= ∥un − u∥2 nH − 2ηn⟨u − u,∇uE
λ,ψ
n (u
n)⟩ − 2ηn⟨u
n − u, ∇̃ Eλ,ψ(unu n )
−∇ Eλu n(u
n)⟩+ η2 λ,ψ n 2n∥∇̃uEn (u )∥H
175
By the Cauchy Schwartz inequality and Lemma D.2.4, we have
|⟨un − u, ∇̃ Eλ,ψu n (u
n)−∇uE
λ
n(u
n)⟩| ≤ ∥un − u∥ ∥∇̃ λ,ψ n λ,ψ nH uEn (u )−∇uEn (u )∥H (D.15)
ϵn
≤ ∥un − u∥H. (D.16)
ηn
Substituting Eq. (D.16) in Equation (D.14) and rearranging terms we obtain
n λ,ψ n ∥u
n−u∥2 − ∥un+1−u∥2 ϵn ηn
⟨u −u,∇ E (u )⟩≤ H H+ ∥un−u∥ + ∥∇̃ Eλ,ψ(unu n H u n )∥
2
H.2ηn ηn 2
(D.17)
Then, we have
Eλ,ψ n λ,ψn (u )− En (u) ≤ ⟨u
n − u,∇ Eλ,ψu n (u
n)⟩ (By convexity of Eλ,ψn (u
n)),
∥un−u∥2 −∥un+1−u∥2H H ϵn ηn≤ + ∥un − u∥H + ∥∇̃ E
λ,ψ
u n (u
n)∥2
2η η Hn n 2
(D.18)
Furthermore,
∥∇̃ Eλ,ψ(un)∥2 ≤ 2∥∇̃ Eλ,ψu n H u n (u
n)−∇uE
λ,ψ(un)∥2n H + 2∥∇ E
λ,ψ
u n (u
n)∥2H
2ϵ2
(D.19)
≤ n + 2∥∇ Eλ,ψu (u
n)∥2 (By Lemma D.2.4).
η2 n Hn
The Lemma follows from applying Eq (D.19) to Eq (D.18).
Lemma D.2.6. Under Assumptions 1-3, for ϵ = P η2n 2 n we have
∑ ( ) ∑
[ ] ∥uλ,ψ
N
−u0∥ η2
N
2P κC ϵ2
E ∗ H 2Eλ,ψ(uN )− Eλ,ψ(uλ,ψ) ≤ + n=1 n + 4κ2C2 + n=1 n∑ .N ∑N ∑
2 ηn ηn λ
N
n=1 n=1 n=1 ηn
176
Proof. Taking expectation over data and sampling on both sides of Equation (D.12) we have
[ ]
η E Eλ,ψ n λ,ψn (u )− E (u)
E [∥un−u∥2H−∥un+1− u∥2H+ 2ϵn∥un−u∥H]≤ +E[η2 λ,ψ n 2 2
2 n
∥∇uEn (u )∥H] + ϵn,
E [∥un− u∥2H− ∥un+1− u∥2H+ 2ϵ nn∥u − u∥H]≤ + 4η2nκ2C2 + ϵ22 n
,
where the inequality follows form Lemma D.2.3. Summing over n and evaluating at u = uλ,ψ
we obtain
∑N [ ]
η E Eλ,ψ(un)− Eλ,ψ(uλ,ψn )
n=0
N N N
1 ∑ ∑ ∑
≤ ∥uλ,ψ − u0∥ n λ,ψ 2 2 2 2H + ϵn∥u − u ∥H + 4κ C η
2 n
+ ϵn,
n=0 n=0 n=0 (D.20)
1 ∑
N
κC ∑
N ∑N
≤ ∥uλ,ψ − u0∥ + 2ϵ + 4κ2 2 2H n C ηn + ϵ
2
n (Lemma D.2.2),2 λ
n=0 n=0 n=0
( ) N N
1 2P ∑ ∑λ,ψ 1κC= ∥u − u0∥ + + 4κ2C2 2 2H ηn + ϵ2 λ n
.
n=0 n=0
By definition of u∗S we then have
( )
∑N ∑N[ ] [ ]
η E Eλ,ψn (u∗S)− Eλ,ψ(uλ,ψ) ≤ η E Eλ,ψ(un)− Eλ,ψ(uλ,ψn ) . (D.21)
n=1 n=1
∑
Dividing by Nn=1 ηn on both sides of Eq. (D.21) and applying the inequality (D.20) we
obtain
∑N ( ) ∑N
[ ] λ,ψ 0 2
λ,ψ N∗ λ,ψ λ,ψ ∥u −u ∥H n=1 ηn 2P2κC 2 2 n=1 ϵ
2
E E (u )− E (u ) ≤ ∑ + nN ∑ + 4κ C +N ∑ .λ N2 n=1 ηn n=1 ηn n=1 ηn
as desired.
Lemma D.2.7. Suppose that there exists a feasible decision û and a finite constant C0 such
177
that c(û(z), z) ≤ C0 for all z ∈ Z. Then,
[ ]
Q
∑ ( ( ))2
lim E max 0, g uλ,ψq (z) = 0.
ψ→∞
q=1
Proof. By definition of uλ,ψ, we know
[ ]
Q
∑ ( ( ))2
ψE max 0, g uλ,ψ
λ λ
q (z) ≤ Ez [c(û(z), z)] + ∥û∥2 2
2 H
≤ C0 + ∥û∥H.2
q=1
2
Therefore, for any violation tolerance δ > 0 we can choose 2C0+λ∥û∥ψ ≥ H to ensure
[ ] 2δ( )
∑ ( ) 2
E Qq=1 max 0, gq uλ,ψ(z) ≤ δ and the lemma follows.
D.3 Main Theorems
In this section we state and proof a more general version of Theorem 5.5.4 and Corollary
5.5.5, which correspond to the case ψ = 0.
λ,ψ
Theorem D.3.1 (Generalization of Theorem 1). Let u∗S := argminu∈{u1,...,uN}ES (u) be
the decisions generated by Algorithm 2 when given the set S = {zn}Nn=1 as input, and let u
λ,ψ
be the true minimizer of Eλ,ψ(u) over H. If we use constant step-size η and constant error
bounds ϵ = P2η
2 for some constant P2 > 0, then under Assumptions 1-3, we have that
[ ] ( )
E Eλ,ψ(u∗ )− Eλ,ψ(uλ,ψ
η
S ) ≤ O .λ
Proof. Applying Lemma D.2.6 with ηn = η yields
( ∑ )N
[ ] η2 (η)
E Eλ,ψ(u∗ )− Eλ(uλ,ψ) ≤ O n=1 nS ∑ = O ,N
λ n=1 ηn λ
178
as desired.
Corollary D.3.2 (Generalization of Corollary 1). Suppose that there exists a feasible decision
û and finite constants c0, C0 such that c(û(z), z) ≤ C0 and c0 ≤ c(u, z) for all z ∈ Z and
for all scalar arguments u. Let u∗ be the true minimizer of E(·) over F and let uψ be the
true minimizer of Eψ(·) over F . If we use constant step-size with η = √P1 < 1 , and P1 > 0,N λ
constant error bounds ϵ = P2η
2 for some constant P2 > 0, and regularization parameter λ
√
such that λ −−−→ 0 and λ N −−−→ ∞, then under Assumptions 1-4 we have that
N→∞ N→∞
[ ( ) ]
∑ ( ) 2Q
a) lim lim E max 0, g u∗ψ→∞ N→∞ q=1 q S(z) = 0.
[∣ ∣]
b) lim E ∣N→∞ Eψ(u∗S)− Eψ(uψ)∣ = 0 for all ψ > 0.
c) limψ→∞ limN→∞ E [E(u∗S)] ≤ E(u∗).
Proof. Part a) We have
[ ]
Q
∑ ( ( ))2 [ ]
lim ψE max 0, gq u∗S(z) ≤ lim E Eλ,ψ(u∗S) − c0,
N→∞ N→∞
q=1
≤ Eλ,ψ(uλ,ψ)− c0 (by Theorem 5.5.4),
≤ Eλ,ψ(û)− c0 (by optimality of uλ,ψ),
= Eλ(û)− c0 (by feasibility of û),
and therefore
[ ]
Q
∑ ( ( ))2 Eλ(û)− c0
0 ≤ lim lim E max 0, gq u∗S(z) ≤ lim = 0,
ψ→∞N→∞ ψ→∞ ψ
q=1
179
Part b) Let ψuH be the true minimizer of E
ψ over H. Adding and subtracting terms we obtain
( ) ( )
Eψ(u∗S)− E
ψ(uψ) = Eψ(u∗ )− Eλ,ψS (u
∗ ) + Eλ,ψ(u∗ )− Eλ,ψ(uλ,ψS S )
( ) ( )
+ Eλ,ψ(uλ,ψ)− Eψ
ψ ψ
(uH) + E
ψ(u )− Eψ(uψH ) .
The first term on the right hand side is negative, the second term vanishes because of Theorem
5.5.4, the third term vanishes with λ, and the fourth term is zero because we use universal
kernels (Assumption 5.5.1). Since Eψ(u∗S)− Eψ(uψ) is non-negative, we obtain
[∣ ∣]
lim E ∣Eψ(u∗ )− EψS (uψ)∣ = 0 ∀ψ ≥ 0.
N→∞
Part c) We have
[ ] [ ]
lim E Eψ(u∗ ψ ∗S) = lim E E (uS)− Eψ(uψ) + Eψ(uψ)
N→∞ N→∞
[ ] [ ]
=⇒ lim lim E Eψ(u∗S) = lim lim E Eψ(u∗S)− Eψ(uψ) + lim Eψ(uψ)
ψ→∞N→∞ ψ→∞N→∞ ψ→∞
[ ]
=⇒ lim lim E [E(u∗ )] = lim lim E EψS (u∗S)−Eψ(uψ) +limEψ(uψ) (By part a))
ψ→∞N→∞ ψ→∞N→∞ ψ→∞
=⇒ lim lim E [E(u∗ ψ ψS)] ≤ lim E (u ) (By part b))
ψ→∞N→∞ ψ→∞
=⇒ lim lim E [E(u∗ )] ≤ E(u∗S ), (By optimality of uψ)
ψ→∞N→∞
as desired.
D.4 Finding Lower Bounds
We emphasize that we only need to find lower bounds for the case in which the dimension
of the data and the dimension of the controls is equal to 1; since the experiments run for
multidimensional cases were designed to have the same objective value as the one dimensional
180
case. The exact problem we want to lower bound is then
[ ∣ ]
∑T ∣
+ +
min Ew|x 2 [s ] + [−s ] ∣t t x = x0 (D.22)
u ∣1:T
t=1
s.t. st = st−1 + ut − wt (D.23)
ut ≥ 0 ∀t ∈ [T ], (D.24)
ut ≤ 150 ∀t ∈ [T ], (D.25)
ut + ut+1 ≤ 200 ∀t ∈ [T − 1]. (D.26)
The demands wt were generated as a linear function of the covariates with some added
noise; specifically, wt = αtx + ϵt, where ϵt was sampled from a standard distribution and
the constants αt were selected to be close to 50 in order for the triangular constraints to
be relevant. In fact, for the specified parameters, we found that the constraints (D.25) and
(D.24) are quite lose and we can find a good lower bound for the the optimal objective value
by removing these constraints. The problem to solve can then be simplified as
[ [ ]
T t +
[ ]+]
∑ ∑ ∑t
minEϵ 2 ui(x0, ϵ1:i−1)− αix0 − ϵi + − ui(x0, ϵ1:i−1)− αix0 − ϵi
u1:T
t=1 i=1 i=1
s.t. ut(x0, ϵ1:t−1) + ut+1(x0, ϵ1:t) ≤ 200 ∀t ∈ [T − 1]. (D.27)
To solve the problem above, we use the fact that if ϵ has a normal distribution then the
function f(a) = Eϵ [2[a− ϵ]+ + [ϵ− a]+] is strictly convex and
[ ]
a0 := argmin Eϵ 2[a− ϵ]+ + [ϵ− a]+ (D.28)
a
∫ a 2 ∫e−x /2 ∞
2
e−x /2
=min 2(a− x) √ dx+ (x− a) √ dx
a −∞ 2π a 2π
2 ( ( ))
3e−a /2 3 a
=min √ + a 1 + erf √ − a
a 2π 2 2
( )
√ 1
=− 2erf−1 . (D.29)
3
181
We will show how to exactly solve problem (D.27) in the case T = 2 (the analysis is similar
for cases T = 3, 4, 5). Suppose then that T = 2. For a fix value of u1(x0), the optimization
problem (D.27) over u2(x0, ϵ1) becomes
 [ ]+ [ ]

∑2 2
+
∑
minE ϵ 2 ui(x 2 0, ϵ1:i−1)− αix0 − ϵi + − ui(x0, ϵ1:i−1)− αix0 − ϵi
u2
i=1 i=1
s.t. u2(x0, ϵ1) ≤ 200− u1(x0).
Applying the result from Eq. (D.29) we obtain
u∗2(x0, ϵ1) = min{(α1 + α2)x0 + ϵ1 − u1(x0) + a0, 200− u1(x0)},
∑
which implies that the term 2i=1 ui(x0, ϵ1:i−1)− αix0 − ϵi evaluated at u
∗
2(x0, ϵ1) is equal to
min{a0 − ϵ2, 200− (α1 + α2)x0 − ϵ1 − ϵ2), which is independent of u1(x0). We can then find
the optimal u1 by solving
[ ]
E + +min ϵ 2 [u1(x0)− α1x0 − ϵ1] + [α1x0 + ϵ1 − u1(x0)]1
u1
s.t. u1(x0, ) ≤ 200,
which again, using Eq. (D.29) yields u∗1(x0) = min{α1x0 + a0, 200}.
182
References
Abadi M, Agarwal A, et al (2015) TensorFlow: Large-scale machine learning on heterogeneous
systems. URL https://www.tensorflow.org/, software available from tensorflow.org.
Abadie A (2005) Semiparametric difference-in-differences estimators. The Review of Economic Studies
72(1):1–19.
Aghasi A, Abdi A, Romberg J (2020) Fast convex pruning of deep neural networks. SIAM Journal
on Mathematics of Data Science 2(1):158–188.
Alsentzer E, Murphy JR, Boag W, Weng WH, Jin D, Naumann T, McDermott MBA (2019) Publicly
available clinical bert embeddings. URL http://dx.doi.org/10.48550/ARXIV.1904.03323.
Amram M, Dunn J, Zhuo YD (2022) Optimal policy trees. Machine Learning 111:2741–2768.
Anderson R, Huchette J, Ma W, Tjandraatmadja C, Vielma JP (2020) Strong mixed-integer
programming formulations for trained neural networks. Mathematical Programming 1–37.
Anthony M, Bartlett PL (2009) Neural network learning: Theoretical foundations (cambridge
university press).
Arik SÖ, Pfister T (2021) TabNet: Attentive interpretable tabular learning. Proceedings of the AAAI
Conference on Artificial Intelligence 35(8):6679–6687.
Athalye A, Carlini N, Wagner D (2018) Obfuscated gradients give a false sense of security: Circum-
venting defenses to adversarial examples.
Baker AC, Larcker DF, Wang CC (2022) How much should we trust staggered difference-in-differences
estimates? Journal of Financial Economics 144(2):370–395.
Balunovic M, Vechev M (2019) Adversarial training and provable defenses: Bridging the gap.
International Conference on Learning Representations.
183
Ban GY, Gallien J, Mersereau AJ (2019) Dynamic procurement of new products with covariate
information: The residual tree method. Manufacturing & Service Operations Management
21(4):798–815.
Bellec G, Kappel D, Maass W, Legenstein RA (2017) Deep rewiring: Training very sparse deep
networks. CoRR abs/1711.05136, URL http://arxiv.org/abs/1711.05136.
Beltagy I, Peters ME, Cohan A (2020) Longformer: The long-document transformer. arXiv preprint
arXiv:2004.05150 .
Ben-Tal A, Den Hertog D, Vial JP (2015) Deriving robust counterparts of nonlinear uncertain
inequalities. Mathematical programming 149(1):265–299.
Ben-Tal A, El Ghaoui L, Nemirovski A (2009) Robust optimization (Princeton university press).
Bertrand M, Duflo E, Mullainathan S (2004) How much should we trust differences-in-differences
estimates? The Quarterly Journal of Economics 119(1):249–275.
Bertsimas D, Boix X, Carballo KV, Hertog Dd (2021a) Robust upper bounds for adversarial training.
arXiv preprint arXiv:2112.09279 .
Bertsimas D, Brown DB, Caramanis C (2011) Theory and applications of robust optimization. SIAM
Review 53(3):464–501.
Bertsimas D, Carballo KV (2023) Multistage stochastic optimization via kernels. arXiv preprint
arXiv:2303.06515 .
Bertsimas D, den Hertog D (2022) Robust and Adaptive Optimization (Dynamic Ideas LLC).
Bertsimas D, Dunn J (2017) Optimal classification trees. Machine Learning 106:1039–1082.
Bertsimas D, Dunn J, Paskov I (2022a) Stable classification. Journal of Machine Learning Research
23(296):1–53.
Bertsimas D, Dunn J, Pawlowski C, Zhuo YD (2019) Robust classification. INFORMS Journal on
Optimization 1(1):2–34.
Bertsimas D, Hertog Dd, Pauphilet J, Zhen J (2023) Robust convex optimization: A new perspective
that unifies and extends. Mathematical Programming 200(2):877–918.
Bertsimas D, Kallus N (2020) From predictive to prescriptive analytics. Management Science
66(3):1025–1044.
184
Bertsimas D, Koduri N (2022) Data-driven optimization: A reproducing kernel hilbert space approach.
Operations Research 70(1):454–471.
Bertsimas D, McCord C (2019) From predictions to prescriptions in multistage optimization problems.
Mathematical Programming, under review ArXiv preprint arXiv:1904.11637.
Bertsimas D, McCord C, Sturt B (2022b) Dynamic optimization with side information. European
Journal of Operational Research, to appear ArXiv preprint arXiv:1907.07307v2.
Bertsimas D, Paskov I (2020) Stable regression: On the power of optimization over randomization in
training regression problems. Journal of Machine Learning Research 21(230):1–25.
Bertsimas D, Pauphilet J, Parys BV (2020) Sparse Regression: Scalable Algorithms and Empirical
Performance. Statistical Science 35(4):555 – 578, URL http://dx.doi.org/10.1214/19-STS701.
Bertsimas D, Pauphilet J, Stevens J, Tandon M (2021b) Predicting inpatient flow at a major hospital
using interpretable analytics. Manufacturing & Service Operations Management 24(6):2809–
2824.
Bertsimas D, Pauphilet J, Van Parys B (2021c) Sparse classification: a scalable discrete optimization
perspective. Machine Learning 110(11):3177–3209.
Bertsimas D, Shtern S, Sturt B (2022c) A data-driven approach to multistage stochastic linear
optimization. Management Science, to appear Preprint at https://dbertsim.mit.edu/pdfs/
papers/2018-sturt-data-driven-two-stage-approach.pdf.
Bertsimas D, Villalobos Carballo K, Boussioux L, Li ML, Paskov A, Paskov I (2024) Holistic deep
learning. Machine Learning 113(1):159–183.
Birge JR, Louveaux F (2011) Introduction to Stochastic Programming (Springer).
Bunel R, Turkaslan I, Torr PH, Kohli P, Kumar MP (2017) A unified view of piecewise linear neural
network verification. arXiv preprint arXiv:1711.00455 .
Callaway B, Sant’Anna PH (2021) Difference-in-differences with multiple time periods. Journal of
Econometrics 225(2):200–230.
Cameron AC, Miller DL (2015) A practitioner’s guide to cluster-robust inference. Journal of Human
Resources 50(2):317–372.
Carballo KV, Na L, Ma Y, Boussioux L, Zeng C, Soenksen LR, Bertsimas D (2022) Tabtext: A
185
flexible and contextual approach to tabular data representation. arXiv preprint arXiv:2206.10381
.
Carlini N, Wagner D (2017) Towards evaluating the robustness of neural networks. 2017 IEEE
symposium on security and privacy (sp), 39–57.
Changpinyo S, Sandler M, Zhmoginov A (2017) The power of sparsity in convolutional neural
networks. arXiv preprint arXiv:1702.06257 .
Chassein A, Goerigk M (2019) On the complexity of robust geometric programming with polyhedral
uncertainty. Operations Research Letters 47(1):21–24.
Chen T, Guestrin C (2016) XGBoost: A scalable tree boosting system. Proceedings of the 22nd
ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 785–794
(Association for Computing Machinery).
Cohen J, Rosenfeld E, Kolter Z (2019) Certified adversarial robustness via randomized smoothing.
International Conference on Machine Learning, 1310–1320 (PMLR).
Dathathri S, Dvijotham K, Kurakin A, Raghunathan A, Uesato J, Bunel R, Shankar S, Steinhardt
J, Goodfellow I, Liang P, Kohli P (2020) Enabling certification of verification-agnostic networks
via memory-efficient semidefinite programming. arXiv preprint arXiv:2010.11645 .
Deng L (2012a) The mnist database of handwritten digit images for machine learning research. IEEE
Signal Processing Magazine 29(6):141–142.
Deng L (2012b) The mnist database of handwritten digit images for machine learning research. IEEE
Signal Processing Magazine 29(6):141–142.
Dua D, Graff C (2017) UCI machine learning repository. URL http://archive.ics.uci.edu/ml.
Dvijotham K, Gowal S, Stanforth R, Arandjelovic R, O’Donoghue B, Uesato J, Kohli P (2018a)
Training verified learners with learned verifiers. arXiv preprint arXiv:1805.10265 .
Dvijotham K, Stanforth R, Gowal S, Mann TA, Kohli P (2018b) A dual approach to scalable
verification of deep networks. UAI, volume 1, 3.
Ek A, Bernardy JP, Chatzikyriakidis S (2020) How does punctuation affect neural models in natural
language inference. URL https://aclanthology.org/2020.pam-1.15.
186
Engel Y, Mannor S, Meir R (2004) The kernel recursive least-squares algorithm. IEEE Transactions
on signal processing 52(8):2275–2285.
Gale T, Elsen E, Hooker S (2019) The state of sparsity in deep neural networks. CoRR abs/1902.09574,
URL http://arxiv.org/abs/1902.09574.
Gehr T, Mirman M, Drachsler-Cohen D, Tsankov P, Chaudhuri S, Vechev M (2018) Ai2: Safety and
robustness certification of neural networks with abstract interpretation. 2018 IEEE Symposium
on Security and Privacy (SP), 3–18 (IEEE).
Geneviève LD, Martani A, Mallet MC, Wangmo T, Elger BS (2019) Factors influencing harmonized
health data collection, sharing and linkage in denmark and switzerland: A systematic review.
PLOS ONE 14(12):e0226015, URL http://dx.doi.org/10.1371/journal.pone.0226015.
Glorot X, Bengio Y (2010a) Understanding the difficulty of training deep feedforward neural networks.
Proceedings of the thirteenth international conference on artificial intelligence and statistics,
249–256 (JMLR Workshop and Conference Proceedings).
Glorot X, Bengio Y (2010b) Understanding the difficulty of training deep feedforward neural networks.
Teh YW, Titterington M, eds., Proceedings of the Thirteenth International Conference on
Artificial Intelligence and Statistics, volume 9 of Proceedings of Machine Learning Research,
249–256 (Chia Laguna Resort, Sardinia, Italy: PMLR), URL https://proceedings.mlr.press/v9/
glorot10a.html.
Goldwasser S, Kalai AT, Kalai Y, Montasser O (2020) Beyond perturbations: Learning guarantees
with arbitrary adversarial test examples. Advances in Neural Information Processing Systems
33:15859–15870.
Goodfellow IJ, Shlens J, Szegedy C (2014) Explaining and harnessing adversarial examples.
Goodfellow IJ, Shlens J, Szegedy C (2015) Explaining and harnessing adversarial examples.
Goodman-Bacon A (2021) Difference-in-differences with variation in treatment timing. Journal of
Econometrics 225(2):254–277, ISSN 0304-4076, Themed Issue: Treatment Effect 1.
Gorishniy Y, Rubachev I, Babenko A (2022) On embeddings for numerical features in tabular deep
learning. URL http://dx.doi.org/10.48550/ARXIV.2203.05556.
Gowal S, Dvijotham KD, Stanforth R, Bunel R, Qin C, Uesato J, Arandjelovic R, Mann T, Kohli
187
P (2019) Scalable verified training for provably robust image classification. Proceedings of the
IEEE/CVF International Conference on Computer Vision, 4842–4851.
Han S, Pool J, Tran J, Dally W (2015) Learning both weights and connections for efficient neural
network. Advances in neural information processing systems 28.
Hanasusanto GA, Kuhn D (2013) Robust data-driven dynamic programming. Advances in Neural
Information Processing Systems 827–835, ISSN 10495258, URL http://papers.nips.cc/paper/
5123-robust-data-driven-dynamic-programming.
Harari A, Katz G (2022) Few-shot tabular data enrichment using fine-tuned transformer architectures.
Harris CR, Millman KJ, van der Walt SJ, Gommers R, Virtanen P, Cournapeau D, Wieser E, Taylor
J, Berg S, Smith NJ, Kern R, Picus M, Hoyer S, van Kerkwijk MH, Brett M, Haldane A, del Río
JF, Wiebe M, Peterson P, Gérard-Marchant P, Sheppard K, Reddy T, Weckesser W, Abbasi H,
Gohlke C, Oliphant TE (2020) Array programming with NumPy. Nature 585(7825):357–362,
URL http://dx.doi.org/10.1038/s41586-020-2649-2.
Hastie T, Tibshirani R, Friedman J (2001) The Elements of Statistical Learning. Springer Series in
Statistics (New York, NY, USA: Springer New York Inc.).
Hegselmann S, Buendia A, Lang H, Agrawal M, Jiang X, Sontag D (2023) Tabllm: Few-shot
classification of tabular data with large language models. Ruiz F, Dy J, van de Meent JW,
eds., Proceedings of The 26th International Conference on Artificial Intelligence and Statistics,
volume 206 of Proceedings of Machine Learning Research, 5549–5581 (PMLR), URL https:
//proceedings.mlr.press/v206/hegselmann23a.html.
Hein M, Andriushchenko M (2017) Formal guarantees on the robustness of a classifier against
adversarial manipulation. CoRR abs/1705.08475, URL http://arxiv.org/abs/1705.08475.
Herzig J, Nowak PK, Müller T, Piccinno F, Eisenschlos J (2020) TaPas: Weakly supervised table
parsing via pre-training. URL http://dx.doi.org/10.18653/v1/2020.acl-main.398.
Hoefler T, Alistarh D, Ben-Nun T, Dryden N, Peste A (2021) Sparsity in deep learning: Pruning
and growth for efficient inference and training in neural networks. The Journal of Machine
Learning Research 22(1):10882–11005.
188
Honeine P (2011) Online kernel principal component analysis: A reduced-order model. IEEE
transactions on pattern analysis and machine intelligence 34(9):1814–1826.
Huang R, Xu B, Schuurmans D, Szepesvari C (2016) Learning with a strong adversary.
Ilyas A, Jalal A, Asteri E, Daskalakis C, Dimakis AG (2017) The robust manifold defense: Adversarial
training using generative models. CoRR abs/1712.09196, URL http://arxiv.org/abs/1712.09196.
Ilyas A, Santurkar S, Tsipras D, Engstrom L, Tran B, Madry A (2019) Adversarial examples are not
bugs, they are features. arXiv preprint arXiv:1905.02175 .
Janowsky SA (1989) Pruning versus clipping in neural networks. Phys. Rev. A 39:6600–6603, URL
http://dx.doi.org/10.1103/PhysRevA.39.6600.
Johnson AE, Pollard TJ, Shen L, Lehman LwH, Feng M, Ghassemi M, Moody B, Szolovits P,
Anthony Celi L, Mark RG (2016) Mimic-iii, a freely accessible critical care database. Scientific
data 3(1):1–9.
Kabilan VM, Morris B, Nguyen A (2018) Vectordefense: Vectorization as a defense to adversarial
examples. CoRR abs/1804.08529, URL http://arxiv.org/abs/1804.08529.
Katz G, Barrett C, Dill DL, Julian K, Kochenderfer MJ (2017) Reluplex: An efficient smt solver
for verifying deep neural networks. International Conference on Computer Aided Verification,
97–117 (Springer).
Ke G, Meng Q, Finley T, Wang T, Chen W, Ma W, Ye Q, Liu TY (2017) Lightgbm: A highly
efficient gradient boosting decision tree. Advances in Neural Information Processing Systems
30.
Kivinen J, Smola AJ, Williamson RC (2004) Online learning with kernels. IEEE transactions on
signal processing 52(8):2165–2176.
Koppel A, Warnell G, Stump E, Ribeiro A (2016) Parsimonious online learning with kernels via
sparse projections in function space.
Krizhevsky A, Hinton G, et al. (2009a) Learning multiple layers of features from tiny images .
Krizhevsky A, Hinton G, et al. (2009b) Learning multiple layers of features from tiny images.
Technical report.
Krizhevsky A, Sutskever I, Hinton GE (2012) Imagenet classification with deep convolutional neural
189
networks. Proceedings of the 25th International Conference on Neural Information Processing
Systems - Volume 1, 1097–1105, NIPS’12 (Red Hook, NY, USA: Curran Associates Inc.).
Krogh A, Vedelsby J (1994) Neural network ensembles, cross validation, and active learning. Advances
in neural information processing systems 7.
Kurakin A, Goodfellow I, Bengio S, et al. (2016) Adversarial examples in the physical world.
Lamb A, Binas J, Goyal A, Serdyuk D, Subramanian S, Mitliagkas I, Bengio Y (2018) Fortified
networks: Improving the robustness of deep networks by modeling the manifold of hidden
representations. arXiv preprint arXiv:1804.02485 .
LeCun Y, Denker J, Solla S (1989) Optimal brain damage. Advances in neural information processing
systems 2.
Lecuyer M, Atlidakis V, Geambasu R, Hsu D, Jana S (2019) Certified robustness to adversarial
examples with differential privacy. 2019 IEEE Symposium on Security and Privacy (SP), 656–672
(IEEE).
Lee J, Yoon W, Kim S, Kim D, Kim S, So CH, Kang J (2019) BioBERT: a pre-trained biomedical
language representation model for biomedical text mining. Bioinformatics 36(4):1234–1240,
ISSN 1367-4803, URL http://dx.doi.org/10.1093/bioinformatics/btz682.
Li H, Kadav A, Durdanovic I, Samet H, Graf HP (2016) Pruning filters for efficient convnets. CoRR
abs/1608.08710, URL http://arxiv.org/abs/1608.08710.
Li Y, Wehbe RM, Ahmad FS, Wang H, Luo Y (2022) Clinical-longformer and clinical-bigbird:
Transformers for long clinical sequences. arXiv preprint arXiv:2201.11838 .
Liang KY, Zeger SL (1986) Longitudinal data analysis using generalized linear models. Biometrika
73(1):13–22.
Liu EZ, Haghgoo B, Chen AS, Raghunathan A, Koh PW, Sagawa S, Liang P, Finn C (2021) Just
train twice: Improving group robustness without training group information. International
Conference on Machine Learning, 6781–6792 (PMLR).
Lorena AC, Garcia LP, Lehmann J, Souto MC, Ho TK (2019) How complex is your classification
problem? a survey on measuring classification complexity. ACM Computing Surveys (CSUR)
52(5):1–34.
190
Louizos C, Welling M, Kingma DP (2017) Learning sparse neural networks through l_0 regularization.
arXiv preprint arXiv:1712.01312 .
Lundberg SM, Lee SI (2017) A unified approach to interpreting model predictions. Advances in
neural information processing systems 30.
Luo R, Sun L, Xia Y, Qin T, Zhang S, Poon H, Liu TY (2022) BioGPT: generative pre-trained
transformer for biomedical text generation and mining. Briefings in Bioinformatics 23(6), URL
http://dx.doi.org/10.1093/bib/bbac409.
Madry A, Makelov A, Schmidt L, Tsipras D, Vladu A (2017) Towards deep learning models resistant
to adversarial attacks.
Madry A, Makelov A, Schmidt L, Tsipras D, Vladu A (2019) Towards deep learning models resistant
to adversarial attacks.
May RJ, Maier HR, Dandy GC (2010) Data splitting for artificial neural networks using som-based
stratified sampling. Neural Networks 23(2):283–294.
Micchelli CA, Xu Y, Zhang H (2006) Universal kernels. Journal of Machine Learning Research 7(12).
Mirman M, Gehr T, Vechev M (2018) Differentiable abstract interpretation for provably robust
neural networks. International Conference on Machine Learning, 3578–3586 (PMLR).
Miyajiwala A, Ladkat A, Jagadale S, Joshi R (2022) On sensitivity of deep learning based text
classification algorithms to practical input perturbations.
Mocanu DC, Mocanu E, Stone P, Nguyen PH, Gibescu M, Liotta A (2017) Evolutionary training of
sparse artificial neural networks: A network science perspective. CoRR abs/1707.04780, URL
http://arxiv.org/abs/1707.04780.
Mostafa H, Wang X (2019) Parameter efficient training of deep convolutional neural networks by
dynamic sparse reparameterization. International Conference on Machine Learning, 4646–4655
(PMLR).
Na L, Villalobos Carballo K, Pauphilet J, Haddad-Sisakht A, Kombert D, Boisjoli-Langlois M,
Castiglione A, Khalifa M, Hebbal P, Stein B, Bertsimas D (2023) Patient outcome predictions
improve operations at a large hospital network. arXiv preprint arXiv:2305.15629 .
Nan Y, Ser JD, Walsh S, Schönlieb C, Roberts M, Selby I, Howard K, Owen J, Neville J, Guiot J,
191
Ernst B, Pastor A, Alberich-Bayarri A, Menzel MI, Walsh S, Vos W, Flerin N, Charbonnier JP,
van Rikxoort E, Chatterjee A, Woodruff H, Lambin P, Cerdá-Alberich L, Martí-Bonmatí L,
Herrera F, Yang G (2022) Data harmonisation for information fusion in digital healthcare: A
state-of-the-art systematic review, meta-analysis and future research directions. Information
Fusion 82:99–122, URL http://dx.doi.org/10.1016/j.inffus.2022.01.001.
Narang S, Elsen E, Diamos G, Sengupta S (2017) Exploring sparsity in recurrent neural networks.
arXiv preprint arXiv:1704.05119 .
Norkin V, Keyzer M (2009) On stochastic optimization and statistical learning in reproducing kernel
hilbert spaces by support vector machines (svm). Informatica 20(2):273–292.
Padhi I, Schiff Y, Melnyk I, Rigotti M, Mroueh Y, Dognin P, Ross J, Nair R, Altman E (2021)
Tabular transformers for modeling multivariate time series. URL http://dx.doi.org/10.1109/
ICASSP39728.2021.9414142.
Pati YC, Rezaiifar R, Krishnaprasad PS (1993) Orthogonal matching pursuit: Recursive function
approximation with applications to wavelet decomposition. Proceedings of 27th Asilomar
conference on signals, systems and computers, 40–44 (IEEE).
Pflug GC, Pichler A (2016) From empirical observations to tree models for stochastic optimization:
convergence properties. SIAM Journal on Optimization 26(3):1715–1740.
Prakash A, Moran N, Garber S, DiLillo A, Storer JA (2018) Deflecting adversarial attacks with pixel
deflection. CoRR abs/1801.08926, URL http://arxiv.org/abs/1801.08926.
Raghunathan A, Steinhardt J, Liang P (2018a) Certified defenses against adversarial examples.
arXiv preprint arXiv:1801.09344 .
Raghunathan A, Steinhardt J, Liang P (2018b) Semidefinite relaxations for certifying robustness to
adversarial examples. arXiv preprint arXiv:1811.01057 .
Rauber J, Brendel W, Bethge M (2017) Foolbox: A python toolbox to benchmark the robustness of
machine learning models. Reliable Machine Learning in the Wild Workshop, 34th International
Conference on Machine Learning, URL http://arxiv.org/abs/1707.04131.
Rauber J, Zimmermann R, Bethge M, Brendel W (2020) Foolbox native: Fast adversarial attacks to
192
benchmark the robustness of machine learning models in pytorch, tensorflow, and jax. Journal
of Open Source Software 5(53):2607, URL http://dx.doi.org/10.21105/joss.02607.
Rockafellar RT (1970) Convex analysis. Princeton Mathematical Series (Princeton, N. J.: Princeton
University Press).
Roos E, den Hertog D, Ben-Tal A, De Ruiter F, Zhen J (2020) Tractable approximation of hard
uncertain optimization problems. Available on Optimization-Online .
Ross AS, Doshi-Velez F (2017) Improving the adversarial robustness and interpretability of deep
neural networks by regularizing their input gradients.
Sagawa S, Koh PW, Hashimoto TB, Liang P (2019) Distributionally robust neural networks for group
shifts: On the importance of regularization for worst-case generalization. CoRR abs/1911.08731,
URL http://arxiv.org/abs/1911.08731.
Savarese P, Silva H, Maire M (2020) Winning the lottery with continuous sparsification. Advances in
Neural Information Processing Systems 33:11380–11390.
Schölkopf B, Smola AJ, Bach F, et al. (2002) Learning with kernels: support vector machines,
regularization, optimization, and beyond (MIT press).
Shafieezadeh-Abadeh S, Kuhn D, Esfahani PM (2019) Regularization via mass transportation.
Journal of Machine Learning Research 20(103):1–68.
Shapiro A, Dentcheva D, Ruszczyński A (2014) Lectures on stochastic programming: modeling and
theory (SIAM).
Shawe-Taylor J, Cristianini N, et al. (2004) Kernel methods for pattern analysis (Cambridge university
press).
Singh G, Gehr T, Mirman M, Püschel M, Vechev MT (2018) Fast and effective robustness certification.
NeurIPS 1(4):6.
Sinha A, Namkoong H, Volpi R, Duchi J (2017) Certifying some distributional robustness with
principled adversarial training. International Conference on Learning Representations .
Soentpiet R, et al. (1999) Advances in kernel methods: support vector learning (MIT press).
Somepalli G, Goldblum M, Schwarzschild A, Bruss CB, Goldstein T (2021) Saint: Improved
193
neural networks for tabular data via row attention and contrastive pre-training. URL http:
//dx.doi.org/10.48550/ARXIV.2106.01342.
Staib M, Jegelka S (2019) Distributionally robust optimization and generalization in kernel methods.
Advances in Neural Information Processing Systems 32.
Sweeney E (2017) Experts say ibm watson’s flaws are rooted in data collec-
tion and interoperability. URL https://www.fiercehealthcare.com/analytics/
ibm-watson-s-flaws-trace-back-to-data-collection-interoperability.
Szegedy C, Zaremba W, Sutskever I, Bruna J, Erhan D, Goodfellow I, Fergus R (2013) Intriguing
properties of neural networks. arXiv preprint arXiv:1312.6199 .
Szegedy C, Zaremba W, Sutskever I, Bruna J, Erhan D, Goodfellow I, Fergus R (2014) Intriguing
properties of neural networks. International Conference on Learning Representations, URL
http://arxiv.org/abs/1312.6199.
Thompson NC, Greenewald KH, Lee K, Manso GF (2020) The computational limits of deep learning.
CoRR abs/2007.05558, URL https://arxiv.org/abs/2007.05558.
Tjeng V, Xiao K, Tedrake R (2019) Evaluating robustness of neural networks with mixed integer
programming.
Van Rossum G, Drake FL (2009a) Python 3 Reference Manual (Scotts Valley, CA: CreateSpace),
ISBN 1441412697.
Van Rossum G, Drake FL (2009b) Python 3 Reference Manual (Scotts Valley, CA: CreateSpace),
ISBN 1441412697.
Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, Kaiser Ł, Polosukhin I (2017)
Attention is all you need. Advances in neural information processing systems 30.
Vincent P, Bengio Y (2002) Kernel matching pursuit. Machine learning 48(1):165–187.
Wahba G (1990) Spline models for observational data (SIAM).
Weng L, Zhang H, Chen H, Song Z, Hsieh CJ, Daniel L, Boning D, Dhillon I (2018) Towards fast
computation of certified robustness for relu networks. International Conference on Machine
Learning, 5276–5285 (PMLR).
Wheeden RL (2015) Measure and integral: an introduction to real analysis, volume 308 (CRC press).
194
Wong E, Kolter Z (2018) Provable defenses against adversarial examples via the convex outer
adversarial polytope. International Conference on Machine Learning, 5286–5295 (PMLR).
Wong E, Rice L, Kolter JZ (2020) Fast is better than free: Revisiting adversarial training. arXiv
preprint arXiv:2001.03994 .
Wong E, Schmidt FR, Metzen JH, Kolter JZ (2018) Scaling provable adversarial defenses. arXiv
preprint arXiv:1805.12514 .
Xiao H, Rasul K, Vollgraf R (2017a) Fashion-mnist: a novel image dataset for benchmarking machine
learning algorithms.
Xiao H, Rasul K, Vollgraf R (2017b) Fashion-mnist: a novel image dataset for benchmarking machine
learning algorithms.
Xie C, Wang J, Zhang Z, Ren Z, Yuille A (2017) Mitigating adversarial effects through randomization.
Xu Y, Goodacre R (2018) On splitting training and validation set: a comparative study of cross-
validation, bootstrap and systematic sampling for estimating the generalization performance of
supervised learning. Journal of analysis and testing 2(3):249–262.
Yan Z, Guo Y, Zhang C (2018) Deep defense: Training dnns with improved adversarial robustness.
arXiv preprint arXiv:1803.00404 .
Yin P, Neubig G, Yih Wt, Riedel S (2020) TaBERT: Pretraining for joint understanding of textual
and tabular data. URL http://dx.doi.org/10.18653/v1/2020.acl-main.745.
Zadrozny B, Elkan C (2002) Transforming classifier scores into accurate multiclass probability
estimates. Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge
Discovery and Data Mining, 694–699.
Zhang H, Chen H, Xiao C, Gowal S, Stanforth R, Li B, Boning D, Hsieh CJ (2019) Towards stable
and efficient training of verifiably robust neural networks. arXiv preprint arXiv:1906.06316 .
Zhang H, Weng TW, Chen PY, Hsieh CJ, Daniel L (2018) Efficient neural network robustness
certification with general activation functions. arXiv preprint arXiv:1811.00866 .
Zhang L, Yi J, Jin R, Lin M, He X (2013) Online kernel learning with a near optimal sparsity bound.
International Conference on Machine Learning, 621–629.
195
Zhen J, de Ruiter F, Den Hertog D (2017) Robust optimization for models with uncertain soc and
sdp constraints. Optimization Online .
Zhou DX (2002) The covering number in learning theory. Journal of Complexity 18(3):739–767,
ISSN 0885-064X, URL http://dx.doi.org/https://doi.org/10.1006/jcom.2002.0635.
Zhuang Z, Tan M, Zhuang B, Liu J, Guo Y, Wu Q, Huang J, Zhu J (2018) Discrimination-aware
channel pruning for deep neural networks. Advances in neural information processing systems
31.
196