1 — 16:20 — Soft regression trees: a model variant and a decomposition training algorithm
Decision trees are popular supervised Machine Learning (ML) methods used for classification and regression tasks arising in a variety of application fields. Unlike most black-box ML models, they reveal the sequence of decisions leading to the tree response for any input vector. Interpretability is of particular importance in applications where the ML models support the decisions of domain experts and justifiable predictions are required, as is the case, for instance, in medical diagnosis.
In the seminal CART method [3] for building univariate classification and regression trees, as well as in the later variants C4.5 and ID3, a top-down greedy approach is adopted in which, at each branch node, a split is determined by minimizing a local impurity measure.
Since training decision trees is known to be NP-hard, globally optimizing them over all the tree parameters is a challenging problem. During the past decade, growing attention has been devoted to globally optimized (trained) decision trees with deterministic or soft splitting rules at the branch nodes, leveraging the remarkable advances in mixed-integer and nonlinear optimization.
We propose a variant of soft multivariate regression trees (SRTs) in which, for every input vector, a potential linear prediction (a linear regression of the features) is available at each leaf node, but the actual tree prediction is the one associated with the single leaf node reached by routing the input vector from the root, following at each branch node the branch with the higher probability. Our nonlinear optimization formulation for training such soft trees is amenable to decomposition and to the inclusion of fairness constraints. After showing a universal approximation result for SRTs, we present a node-based decomposition training algorithm which includes a clustering-based initialization procedure and a heuristic for reassigning the data points along the tree. Under mild assumptions, we also establish asymptotic convergence guarantees. Experiments on 15 data sets from the UCI and KEEL repositories indicate that our SRT variant and decomposition algorithm yield improved testing accuracy compared with the soft regression trees discussed in [2], and a significant speed-up in training time with similar testing accuracy compared with the mixed-integer optimization approach described in [1]. Further experiments on the same data sets show that our SRT approach is significantly more robust than classical Hierarchical Mixtures of Experts trained with the Expectation-Maximization algorithm and provides single soft trees whose testing accuracy is comparable to that of Random Forests.
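To make the prediction rule concrete, the following minimal Python sketch shows how an input vector could be routed from the root, following at each branch node the child with the higher probability, and evaluated by the linear model at the reached leaf. The dictionary-based node layout, the field names, and the use of a logistic split probability are assumptions made for illustration, not the authors' implementation.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def srt_predict(x, node):
    # Branch nodes are assumed to hold a multivariate soft split
    # p_left = sigmoid(w @ x + b); leaf nodes hold a linear model.
    while "w" in node:                                   # branch node
        p_left = sigmoid(node["w"] @ x + node["b"])
        node = node["left"] if p_left >= 0.5 else node["right"]
    return node["coef"] @ x + node["intercept"]          # leaf: linear regression

# Tiny hand-made tree with one branch node and two leaves (illustrative only).
tree = {
    "w": np.array([1.0, -1.0]), "b": 0.0,
    "left":  {"coef": np.array([0.5, 0.0]), "intercept": 1.0},
    "right": {"coef": np.array([0.0, 2.0]), "intercept": -1.0},
}
print(srt_predict(np.array([2.0, 1.0]), tree))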
References
[1] Bertsimas, D. and Dunn, J.: Machine learning under a modern optimization lens. Dynamic Ideas LLC (2019).
[2] Blanquero, R., Carrizosa, E., Molero-Río, C., Romero Morales, D.: On sparse optimal regression trees. European Journal of Operational Research 299(3), 1045–1054 (2022).
[3] Breiman, L., Friedman, J.H., Olshen, R.A., Stone, C.J.: Classification and regression trees. CRC Press (1984).
[4] Irsoy, O., Yildiz, O.T., Alpaydin, E.: Soft decision trees. Proceedings of the 21st International Conference on Pattern Recognition (ICPR2012), 1819–1822 (2012).
2 — 16:50 — Mixed-Integer Linear Optimization for Cardinality-Constrained Random Forests
Random forests are among the best-known algorithms for solving classification problems, particularly for large-scale data sets. Given a set of labeled points, the method trains several decision trees and classifies a new point by majority vote. In some scenarios, however, labels are only available for a proper subset of the given points. Moreover, this subset can be non-representative, e.g., due to collection bias. Semi-supervised learning addresses the setting of labeled and unlabeled data and often improves the reliability of the results. In addition, it may be possible to obtain further information about class sizes from undisclosed sources. We propose a mixed-integer linear optimization model for computing a semi-supervised random forest that covers the setting of labeled and unlabeled data points and exploits knowledge of the overall number of points in each class for binary classification. Since the solution time grows rapidly as the number of variables increases, we present preprocessing techniques and an intuitive branching priority rule to reduce the model's size. Our numerical results show that our approach leads to better accuracy and a better Matthews correlation coefficient for biased samples compared to random forests with majority voting, even when only a few labeled points are available.
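As a rough sketch of how known class sizes can enter a mixed-integer model, the toy below assigns binary labels to unlabeled points so that the known class-1 size is met, while favoring assignments that agree with the votes of an existing forest. It uses the PuLP modeling library; the vote shares, variable names, and the simple agreement objective are assumptions for illustration, not the formulation from the talk, which couples the labeling with the construction of the forest itself.

import pulp

vote_share = {0: 0.9, 1: 0.2, 2: 0.6, 3: 0.4}  # assumed fraction of trees voting class 1
n_class1_total = 2                              # assumed known overall size of class 1
n_class1_labeled = 0                            # class-1 points that are already labeled

model = pulp.LpProblem("cardinality_constrained_labeling", pulp.LpMaximize)
y = {i: pulp.LpVariable(f"y_{i}", cat="Binary") for i in vote_share}

# Reward assignments that agree with the forest's votes.
model += pulp.lpSum((2 * vote_share[i] - 1) * y[i] for i in vote_share)

# Cardinality constraint: labeled plus newly assigned class-1 points match the known size.
model += n_class1_labeled + pulp.lpSum(y.values()) == n_class1_total

model.solve(pulp.PULP_CBC_CMD(msg=False))
print({i: int(y[i].value()) for i in vote_share})

In this toy instance the optimizer assigns class 1 to the two points with the largest vote shares, which is the behavior one would expect when the cardinality constraint corrects a class balance that majority voting alone would get wrong.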
3 — 17:20 — On Enhancing the Explainability and Fairness of Tree Ensembles
Tree ensembles are one of the most powerful methodologies in Machine Learning. In this talk, we investigate how to make tree ensembles more flexible so that explainability and fairness can be incorporated by design. While explainability helps the user understand the key features that play a role in the classification task, fairness ensures that the ensemble does not discriminate against a group of observations that share a sensitive attribute. We propose a Mixed-Integer Linear Optimization formulation to train an ensemble of trees that, apart from minimizing the misclassification error, controls for sparsity as well as for the accuracy in the sensitive group. Our formulation is scalable in the number of observations, since its number of binary decision variables is independent of the number of observations. In our numerical results, we show that, for standard data sets used in the fairness literature, we can dramatically enhance the fairness of the benchmark, namely the popular Random Forest, while using only a few features and without damaging the misclassification error.
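To illustrate the type of side constraints involved, a generic schematic (not the actual formulation presented in the talk, whose error terms and decision variables are defined so that the binary variables do not scale with the observations) can be written as follows, where $e_i$ is a misclassification measure for observation $i$, $S$ indexes the sensitive group, $\tau$ is a prescribed in-group accuracy level, $u_j \in \{0,1\}$ indicates whether feature $j$ is used anywhere in the ensemble, and $k$ is a feature budget:

\[
\min \ \frac{1}{n}\sum_{i=1}^{n} e_i
\quad \text{s.t.} \quad
\frac{1}{|S|}\sum_{i \in S} \bigl(1 - e_i\bigr) \ \ge\ \tau,
\qquad
\sum_{j=1}^{p} u_j \ \le\ k .
\]

The first constraint bounds the accuracy attained on the sensitive group from below, while the second caps the number of features the ensemble may rely on, which is what keeps the resulting classifier both fair and sparse.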