Inferring phenotypes from genotypes with machine learning: an application to the global problem of antibiotic resistance


A thorough understanding of the relationship between the genomic characteristics of an individual (the genotype) and its biological state (the phenotype) is essential to personalized medicine, where treatments are tailored to each individual. Such an understanding notably makes it possible to anticipate diseases, estimate responses to treatments, and even identify new pharmaceutical targets. Machine learning is a science that aims to develop algorithms that learn from examples. Such algorithms can be used to learn models that estimate phenotypes based on genotypes, which can then be studied to elucidate the biological mechanisms that underlie the phenotypes. Nonetheless, the application of machine learning in this context poses significant algorithmic and theoretical challenges. The high dimensionality of genomic data and the small size of data samples can lead to overfitting; the large volume of genomic data requires algorithms adapted to limit their use of computational resources; and, importantly, the learned models must be interpretable by domain experts, which is not always possible. This thesis presents learning algorithms that produce interpretable models for the prediction of phenotypes based on genotypes. First, we explore the prediction of discrete phenotypes using rule-based learning algorithms. We propose new, highly optimized implementations and generalization guarantees adapted to genomic data. Second, we study a more theoretical problem, namely interval regression. We propose two new learning algorithms, one of which is rule-based. Finally, we show that this type of regression can be used to predict continuous phenotypes and that this leads to models that are more accurate than those of conventional approaches in the presence of censored or noisy data. The overarching theme of this thesis is an application to the prediction of antibiotic resistance, a major global public health problem.
We demonstrate that our algorithms can accurately predict resistance phenotypes and help to better understand them. Ultimately, we expect that our algorithms will contribute to the development of tools that enable better use of antibiotics and improved epidemiological surveillance, a key component of the solution to this problem.

PhD Thesis, Université Laval