Offer related to Action/Network: not specified
Laboratory/Company: ICube laboratory
Duration: 3 years
Contact: lafabregue@unistra.fr
Publication deadline: 2024-05-23
Context:
This thesis falls within the field of unsupervised or weakly supervised learning applied to temporal data. Clustering, which partitions the set of analyzed objects into groups (clusters), is one of the most widely used approaches and relies on a similarity measure between objects. Sequence clustering in particular raises the problem of measuring similarity between two individuals. For example, in river monitoring, certain phenomena recur annually with the natural water cycle, but may be shifted in time due to geographical distance and local meteorology. Similarity measures must therefore be able to account for such shifts or slight temporal distortions. Numerous methods have been proposed in the literature to handle these specificities, such as Dynamic Time Warping, Longest Common Subsequence or, more recently, representations based on shapelets or neural networks.
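To illustrate why a shift-tolerant measure matters, here is a minimal sketch of the textbook Dynamic Time Warping recurrence in Python. It is not the method the thesis will develop: the function name `dtw_distance`, the absolute-difference local cost, and the toy `upstream`/`downstream` series are all assumptions made for illustration, and the sketch handles only complete, univariate, regularly sampled sequences.

```python
import math


def dtw_distance(a, b):
    """Textbook DTW between two univariate sequences.

    Assumption: the local cost is the absolute difference between values.
    """
    n, m = len(a), len(b)
    # cost[i][j] = minimal cumulative cost of aligning a[:i] with b[:j]
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of the three admissible predecessor alignments
            cost[i][j] = local + min(
                cost[i - 1][j],      # a advances, b stays
                cost[i][j - 1],      # b advances, a stays
                cost[i - 1][j - 1],  # both advance
            )
    return cost[n][m]


# Hypothetical toy data: the same peak, observed one time step later downstream
upstream = [0.0, 1.0, 3.0, 1.0, 0.0, 0.0]
downstream = [0.0, 0.0, 1.0, 3.0, 1.0, 0.0]

# Point-by-point Euclidean distance penalizes the shift
euclid = math.sqrt(sum((x - y) ** 2 for x, y in zip(upstream, downstream)))
# DTW realigns the two peaks before comparing them
dtw = dtw_distance(upstream, downstream)

print(f"Euclidean: {euclid:.3f}, DTW: {dtw:.3f}")
```

On this toy pair the Euclidean distance is about 3.162 while the DTW distance is 0, because the warping path absorbs the one-step shift. The sketch does not cover missing values, irregular sampling, or multiple variables, which are the cases the thesis targets.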
Subject:
The main objective of this thesis is to develop new approaches for measuring similarity between two multivariate time series, taking into account missing values distributed heterogeneously over time and across variables. The aim is to define solutions for integrating temporal information (spacing between two time steps, temporal frequencies of measurements, etc.) into the calculation of similarity. We will also look at how to integrate the expert’s knowledge via annotations, also known as constraints (e.g. proximity/remoteness between two individuals based on external information), concerning both temporal and spatial links between different individuals, in order to improve the correspondence between the clustering obtained and the expert’s expectations. These approaches will be evaluated on river monitoring data, which raise various problems due to their volume, diversity, and spatial and temporal heterogeneity.
Work will focus on the following questions:
– clustering vector sequences, where vectors contain parameters that can be measured at different time steps
– taking into account temporal (seasons) and geographical (hydrographic regions) constraints
– coupling physico-chemical and biological or hydrological data (different measurement frequencies)
– exploring the limits of the proposed methods in terms of number and size of sequences
Candidate profile:
– Master 2 in Computer Science
– Training in data science, data mining, machine learning
Required training and skills:
– Excellent knowledge of machine learning and knowledge modeling
– Excellent programming skills in Python or R
– Excellent communication and writing skills in English (French not mandatory)
– An interest in the application domain
Work location:
The thesis will be carried out at the ICube laboratory in Illkirch (near Strasbourg).
Attached document: 202403191713_sujet_these_hydro.pdf