Offer linked to the Action/Network: - / -
Laboratory/Company: IRISA
Duration: 5 months
Contact: laetitia.chapel@irisa.fr
Publication deadline: 2024-03-01
Context:
AI methodologies typically depend on extensive datasets that may be tainted by noise, contain missing values, or be collected in heterogeneous yet related environments. Data with missing values are ubiquitous in many applications; they can result from equipment failure, incomplete information collection (e.g. clouds in remote sensing) or inadequate data entry, for instance. Nevertheless, conventional learning algorithms often assume that the data are complete and independent and identically distributed, that is to say, drawn randomly from a single distribution.
Data imputation aims at substituting missing data with plausible values, e.g. by filling them with the value of the nearest sample or by imputing with some relevant statistics. The imputation can have a high impact on the performance of the learning task at hand, leading to biased results or degraded performance. Most imputation methods rely on a missing (completely) at random assumption, with no pattern between the missingness of the data and any values. More challenging scenarios deal with random block missing or blackout missing, in which blocks of information are missing and where the structure of block-wise missing data should be further taken into consideration.
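As an illustration of the two baseline strategies mentioned above (imputing with a statistic, or with the value of the nearest sample), here is a minimal NumPy sketch. The function names `mean_impute` and `nearest_sample_impute` and the toy array are hypothetical and chosen for this example only. The sketch assumes missing entries are encoded as NaN and that at least one fully observed row exists.

```python
import numpy as np

def mean_impute(X):
    """Replace each NaN by the mean of the observed values in its column."""
    X = X.copy()
    col_means = np.nanmean(X, axis=0)
    rows, cols = np.where(np.isnan(X))
    X[rows, cols] = col_means[cols]
    return X

def nearest_sample_impute(X):
    """Fill the NaNs of each incomplete row with the values of the closest
    complete row; distances use only the features observed in that row.
    Assumes at least one row has no missing value."""
    X = X.copy()
    incomplete = np.isnan(X).any(axis=1)
    complete = X[~incomplete]
    for i in np.where(incomplete)[0]:
        missing = np.isnan(X[i])
        dists = np.linalg.norm(complete[:, ~missing] - X[i, ~missing], axis=1)
        X[i, missing] = complete[np.argmin(dists), missing]
    return X

# Toy data: the second sample is missing its second feature.
X = np.array([[1.0, 2.0],
              [1.1, np.nan],
              [5.0, 6.0]])
X_mean = mean_impute(X)           # fills with the column mean
X_nn = nearest_sample_impute(X)   # fills with the value from the first row
```

On this toy array the two strategies give different answers for the same entry, which is one way to see why the choice of imputation can bias a downstream learning task.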
In practice, the data are often collected from different yet related domains, offering the potential to enhance the generalization capability of the learning algorithm. For instance, in Earth observation, and especially for land cover mapping applications, the differences in weather, soil conditions or farmer practices between study sites are known to induce temporal shifts that can be corrected to enhance task performance. For predicting crop yield, the variability under changing climates and severe weather events has to be taken into account when using past data to predict the evolution of the yield.
Domain adaptation [6, 7] aims to transfer knowledge from one domain to another and has demonstrated significant enhancements in classification or clustering tasks when domain shifts are carefully managed.
Subject:
The aim of the internship is to study the potential of data imputation methods within the context of domain adaptation. Existing approaches mostly tackle missing values within an inferential framework, in which they are replaced with values derived from dataset statistics, relying on strong parametric assumptions. However, when a shift exists between the datasets, this strategy becomes inadequate. Instead, we propose to address the imputation and learning tasks jointly, with the additional difficulty that the data may originate from different domains.
The research directions will explore optimal transport-based solutions, known for their success in imputing missing values and aligning distributions in a domain adaptation context, especially when dealing with temporal data.
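To give a flavour of how optimal transport can align distributions across domains, here is a minimal NumPy sketch of entropy-regularised transport (Sinkhorn iterations) followed by a barycentric mapping of source samples onto the target domain. It is an illustrative assumption, not the method to be developed during the internship: the function names, the regularisation value `reg`, the iteration count and the synthetic data are all hypothetical, and the sketch handles neither missing values nor temporal structure.

```python
import numpy as np

def sinkhorn_plan(Xs, Xt, reg=0.1, n_iter=500):
    """Entropy-regularised OT plan between two point clouds with uniform weights."""
    a = np.full(len(Xs), 1.0 / len(Xs))   # source weights
    b = np.full(len(Xt), 1.0 / len(Xt))   # target weights
    # Squared Euclidean cost, rescaled to [0, 1] for numerical stability.
    C = ((Xs[:, None, :] - Xt[None, :, :]) ** 2).sum(axis=2)
    C = C / C.max()
    K = np.exp(-C / reg)
    u = np.ones_like(a)
    for _ in range(n_iter):
        v = b / (K.T @ u)   # rescale to match the target marginal
        u = a / (K @ v)     # rescale to match the source marginal
    return u[:, None] * K * v[None, :]

def barycentric_map(plan, Xt):
    """Send each source sample to the plan-weighted average of target samples."""
    return (plan @ Xt) / plan.sum(axis=1, keepdims=True)

# Synthetic example: the target domain is a shifted copy of the source distribution.
rng = np.random.default_rng(0)
Xs = rng.normal(size=(50, 2))
Xt = rng.normal(size=(60, 2)) + np.array([3.0, -1.0])

plan = sinkhorn_plan(Xs, Xt)
Xs_aligned = barycentric_map(plan, Xt)   # source samples moved towards the target
```

After the mapping, the aligned source samples lie within the target cloud, so a model trained on them can be applied to target data with a reduced shift.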
Candidate profile:
Master's student
== may possibly be followed by a PhD thesis ==
Required education and skills:
Applicants are expected to hold a degree in mathematics/statistics and in computer science and/or machine learning and/or signal & image processing, and to show an excellent academic record.
In addition, good programming skills are mandatory.
Work address:
IRISA laboratory, Rennes
Attached document: 202401180900_Missing_data_and_DA___internship-2.pdf