Learning with heavy-tailed inputs: Out-of-domain Generalization on Extremes

When:
04/05/2025 all-day


Laboratory/Company: Laboratoire MAP5, Université Paris-Cité
Duration: 4-6 months + PhD opportunity
Contact: anne.sabourin@u-paris.fr
Application deadline: 2025-05-04

Context:
The internship is intended to lead to a PhD thesis if everything goes as planned. The PhD will be funded by the ANR project EXSTA, led by A. Sabourin. The PhD candidate will benefit from interactions with other researchers in the field, e.g. through workshops organised within the project's framework, in addition to the usual participation in conferences.

Subject:
Context: Extreme Value Theory (EVT) is a field of probability and statistics concerned with the tails of distributions, that is, regions of the sample space located far away from the bulk and associated with rare and extreme events. Providing probabilistic descriptions and statistical inference methods for the tails requires sound theoretical assumptions pertaining to the theory of regular variation and maximum domains of attraction, ensuring that a limit distribution of extremes exists. This setting encompasses a wide range of applications in disciplines where extremes have tremendous impact, such as climate science, insurance, environmental risks and industrial monitoring systems [1].
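As a toy illustration of the regular-variation setting described above (not part of the offer itself), heavy-tailed data can be simulated and the tail index estimated from the largest order statistics with the classical Hill estimator. The distribution, sample size and number of retained extremes below are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)
alpha = 2.0  # true tail index (illustrative choice)
n = 100_000

# Classical Pareto(alpha) sample: P(X > x) = x^(-alpha) for x >= 1,
# a textbook example of a regularly varying tail.
x = rng.pareto(alpha, size=n) + 1.0

# Hill estimator of 1/alpha, built from the k largest order statistics only.
k = 1_000
order = np.sort(x)[::-1]  # descending order statistics
hill = np.mean(np.log(order[:k]) - np.log(order[k]))
alpha_hat = 1.0 / hill  # should be close to the true alpha = 2.0
```

Note that, exactly as in the learning problem discussed below, inference here uses only the k most extreme points out of n, so the effective sample size is k rather than n.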

In a supervised learning framework, the goal is to learn a good prediction function to predict new, unobserved labels. In many contexts (covariate-shifts, climate change), extrapolation (or out-of-sample) properties of the predictors thus constructed are crucial, and obtaining good generalization properties on unobserved regions of the covariate space is key. Recently, there has been significant interest in the ML literature regarding out-of-domain generalization (see e.g. [2]).

Recent works [3,4,5] focus on the problem of learning a tail predictor based on a small fraction of the most extreme observations, with non-asymptotic guarantees regarding the risk on extreme regions. For simplicity, the theoretical study in these works is limited to Empirical Risk Minimization (ERM) algorithms without a penalty term. In addition, the regression problem analysed in [5] covers least squares regression only. Moreover, with heavy-tailed targets, non-linear transformations of the target are required in order to satisfy boundedness assumptions.

Research Objectives: The general purpose of this internship and the subsequent thesis is to extend the scope of the supervised learning methods described above to a wider class of learning algorithms. One main limitation of least squares regression is that the optimal predictor (i.e. the conditional expectation given the covariate) is not invariant under non-linear transformations of the target. As a starting point, the least-squares framework will be extended to the quantile regression framework which, in contrast to least squares, is compatible with non-linear transformations. From a statistical learning perspective, we shall extend the ERM framework considered thus far to encompass penalized risk minimization procedures amenable to high-dimensional covariates or non-linear regression functions. SVM quantile regression [6] is a natural candidate for this purpose.

The goal will be to obtain finite-sample guarantees on the generalization error of quantile regression functions learnt with a subsample made of the largest observations, and hopefully to recover learning rates of an order comparable to those obtained in the classical framework, with the full sample size n replaced by the reduced subsample size. The bottleneck is that these largest observations may not be considered as an independent sample, because they are order statistics of the full sample. However, it is anticipated that proof techniques from recent works [7,8,9], based on conditioning arguments and concentration inequalities incorporating (small) variance terms, can be leveraged for this purpose.
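A minimal sketch of the core idea (not from the offer; the linear model, distributions, quantile level and sample sizes are all illustrative assumptions): fit a quantile regression by pinball-loss minimization using only the subsample of largest covariate values, i.e. the "extreme region".

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(1)
n, tau = 50_000, 0.9  # full sample size and quantile level (illustrative)

# Heavy-tailed covariate; the target scales linearly with x (toy model).
x = rng.pareto(2.0, size=n) + 1.0
y = 1.5 * x + 0.1 * x * rng.standard_normal(n)

# Retain only the k observations with the largest covariate values:
# this is the "subsample made of the largest observations".
k = 500
idx = np.argsort(x)[-k:]
xk, yk = x[idx], y[idx]

# Pinball (quantile) loss for a linear predictor y ~ a * x.
def pinball(a):
    r = yk - a * xk
    return np.mean(np.maximum(tau * r, (tau - 1.0) * r))

# The pinball loss is convex in a, so a bounded scalar minimizer suffices here.
res = minimize_scalar(pinball, bounds=(0.0, 5.0), method="bounded")
a_hat = res.x
```

In this synthetic model the conditional tau-quantile of Y given X = x is x * (1.5 + 0.1 * z_tau), about 1.63 x for tau = 0.9, and a_hat should land near that slope even though only k = 500 of the n = 50,000 points are used. Quantifying such errors non-asymptotically, when the retained extremes are dependent order statistics of the full sample, is precisely the difficulty the project addresses.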

References
[1] Beirlant, J., Goegebeur, Y., Segers, J., and Teugels, J. L. (2004). Statistics of Extremes: Theory and Applications, volume 558. John Wiley & Sons.

[2] Zhou, K., Liu, Z., Qiao, Y., Xiang, T., and Loy, C. C. (2022). Domain generalization: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(4):4396–4415.

[3] Jalalzai, H., Clémençon, S., and Sabourin, A. (2018). On binary classification in extreme regions. In NeurIPS Proceedings, volume 31.

[4] Clémençon, S., Jalalzai, H., Lhaut, S., Sabourin, A., and Segers, J. (2023). Concentration bounds for the empirical angular measure with statistical learning applications. Bernoulli, 29(4):2797–2827.

[5] Huet, N., Clémençon, S., and Sabourin, A. (2023). On regression in extreme regions. arXiv preprint arXiv:2303.03084.

[6] Takeuchi, I., Le, Q. V., Sears, T. D., Smola, A. J., and Williams, C. (2006). Nonparametric quantile estimation. Journal of Machine Learning Research, 7(7).

Supervisory Team/contact: Anne Sabourin (MAP5, Université Paris-Cité), Clément Dombry (LMB, Université de Franche-Comté)

Candidate profile:
Master's student (2nd year) in Applied Mathematics/Statistics/Statistical Machine Learning with an excellent academic record and a strong interest in mathematical statistics and learning theory. Some knowledge of R or Python is expected.

Applications for the PhD position from candidates who have already graduated from a Master's program will also be considered.

Required background and skills:
Being enrolled in, or having graduated from, a Master's program in Mathematics/Statistics/Statistical Machine Learning.

Work address:
Laboratoire MAP5, Université Paris Cité, 45 rue des Saints-Pères, Paris.

Attached document: 202410081334_offreStage2024.pdf