10:50 - 12:30 |
Parallel sessions -
Each parallel session includes one invited talk (40') and three contributed talks (20').
Session 1 - Machine Learning [Room: L3 Auditoire 0/13A]
Chair : Rainer Von Sachs
Robin Van Oirbeek - mCube: multinomial micro-level reserving model[Joint work with Emmanuel Jordy Menvouta, Jolien Ponnet and Tim Verdonck]
The estimation of the claims reserve, i.e. the remaining future claim cost of a given set of open claims, is a crucial exercise to ensure the financial viability of a non-life insurance company. Typically, the claims reserve is estimated for the entire portfolio simultaneously using a macro-level reserving model, of which the chain ladder is the best-known example. However, it is also possible to estimate the claims reserve on a claim-by-claim basis by means of a micro-level reserving model. This type of model captures the entire lifecycle of a claim by explicitly modelling the underlying time and payment processes separately. In this presentation, we will focus on the mCube or multinomial micro-level reserving model, an adaptation of [1], in which both processes, as well as the IBNR ('Incurred But Not Reported') component, are modelled using separate multinomial regression models. The predictive performance and the different components of the model will be discussed during the presentation.
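Purely as an illustration of the multinomial building block described above (not the authors' mCube implementation), the sketch below fits a single multinomial regression for a hypothetical next-state payment process; the covariates, state coding and parameter values are invented for the example.

```python
# Illustrative sketch only: a multinomial regression as a building block of a
# micro-level reserving model; states and covariates are hypothetical.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
n = 5000
# Hypothetical claim-period covariates: development period, time since
# reporting (years), incurred amount so far (in 1000 EUR).
X = np.column_stack([
    rng.integers(1, 10, n),
    rng.exponential(2.0, n),
    rng.lognormal(1.0, 1.0, n),
])
# Hypothetical next-state labels: 0 = no payment, 1 = partial payment, 2 = settlement.
y = rng.choice([0, 1, 2], size=n, p=[0.6, 0.3, 0.1])

# One multinomial regression per process (only the payment process is shown);
# with the lbfgs solver, sklearn fits a multinomial model for multiclass labels.
payment_model = LogisticRegression(max_iter=1000)
payment_model.fit(X, y)

# Predicted transition probabilities for a new open claim.
new_claim = np.array([[3, 1.5, 2.0]])
print(payment_model.predict_proba(new_claim))
```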
Sophie Mathieu - Monitoring of sunspot number observations based on neural networks[Joint work with Rainer Von Sachs, Christian Ritter, Laure Lefèvre and Véronique Delouille]
The observation of sunspots is one of the most important empirical data sources on long-term solar activity. Sunspot observations extend from the seventeenth century to the present day. Surprisingly, determining the number of sunspots consistently over time remains a challenging problem. The difficulty stems from the absence of stationarity, different types of correlation and many kinds of observational error.
In this work, we construct an artificial neural network for monitoring these important series. The network is trained on simulations that are sufficiently general to allow predictions for unseen deviations of various types. The procedure can efficiently detect when the observations are deviating and takes the autocorrelation of the data into account. The network has been compared to a more classical procedure based on the CUSUM chart and appears to be consistent with the latter. It can also predict the size of the encountered deviations over a large range of values.
Using this method allows us to detect and identify a wide range of deviations. Many of these deviations are observer or equipment related. Detecting and understanding them will help improve future observations. Eliminating or correcting them in past data will lead to a more precise reconstruction of the International Sunspot Number, the world reference for solar activity.
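For readers unfamiliar with the classical benchmark mentioned above, the following minimal sketch implements a basic one-sided CUSUM chart on a simulated standardized series; the allowance and threshold values are illustrative choices, not the settings used by the authors.

```python
# Minimal one-sided CUSUM chart for detecting an upward shift in a
# standardized series; parameters are illustrative, not the authors' settings.
import numpy as np

def cusum_upper(z, k=0.5, h=5.0):
    """Return the upper CUSUM path and the first alarm index (or None)."""
    s, alarm, path = 0.0, None, []
    for i, zi in enumerate(z):
        s = max(0.0, s + zi - k)   # accumulate deviations above the allowance k
        path.append(s)
        if alarm is None and s > h:
            alarm = i              # first time the chart signals a shift
    return np.array(path), alarm

rng = np.random.default_rng(0)
z = rng.normal(0, 1, 200)
z[120:] += 1.0                      # simulated upward drift after time 120
path, alarm = cusum_upper(z)
print("alarm at index:", alarm)
```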
Jakob Raymaekers - Regularized k-means Through Hard-Thresholding[Joint work with Ruben H. Zamar]
The k-means algorithm remains a very popular and widely used clustering method in a variety of scientific fields due to its intuitive objective function and relative ease of computation. Whereas in classical k-means, all p features are used to partition the data, it can be desirable to identify a subset of features that partitions the data particularly well. This feature selection may lead to a more interpretable partitioning of the data and more accurate recovery of the 'true' clusters.
We study a framework for performing regularized k-means, based on direct penalization of the size of the cluster centers. Different penalization strategies are considered and compared in a theoretical analysis and an extensive Monte Carlo simulation study. Based on the results, we propose a new method called hard-threshold k-means (HTK-means), which uses an L0 penalty to induce sparsity. HTK-means is a fast and competitive sparse clustering method which is easily interpretable, as is illustrated on several real data examples. In this context, new graphical displays are presented and used to gain further insight into the data sets.
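The following toy sketch conveys the flavour of hard-thresholding cluster centres with an L0-type rule; it is a simplified stand-in, not the HTK-means algorithm of the talk, and the penalty value is an arbitrary illustrative choice.

```python
# Rough sketch of sparse k-means via hard-thresholding of cluster centers.
# This is only a caricature of an L0 penalty on center entries, not the
# HTK-means algorithm itself; the threshold lam is a free illustrative choice.
import numpy as np
from sklearn.cluster import KMeans

def htk_sketch(X, k, lam):
    X = (X - X.mean(axis=0)) / X.std(axis=0)     # work with standardized features
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    centers = km.cluster_centers_
    sizes = np.bincount(km.labels_, minlength=k)
    # Between-cluster sum of squares contributed by each feature.
    bcss = (sizes[:, None] * centers**2).sum(axis=0)
    # Keep a feature only if it "pays" for the penalty lam (L0-type rule).
    keep = bcss > lam
    sparse_centers = np.where(keep[None, :], centers, 0.0)
    return sparse_centers, keep

rng = np.random.default_rng(2)
# Two clusters that differ only in the first 2 of 10 features.
X = rng.normal(size=(200, 10))
X[:100, :2] += 3.0
centers, keep = htk_sketch(X, k=2, lam=50.0)
print("selected features:", np.where(keep)[0])
```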
Hugues Annoye - Statistical Matching using KCCA, Super-OM and Autoencoders-CCA[Joint work with Alessandro Beretta, Cédric Heuchenne and Ida-Marie Jensen]
The potential to study and improve different aspects of our lives is ever-growing thanks to the abundance of data available in today's society. Scientists and researchers often need to analyze data from different sources; the observations, which only share a subset of the variables, cannot always be paired to identify common individuals.
This is the case, for example, when the information required to study a certain phenomenon comes from different sample surveys. Statistical matching is a common practice to combine such data sets. In this talk, we investigate and extend to statistical matching three methods based on Kernel Canonical Correlation Analysis (KCCA), Super-Organizing Map (Super-OM) and Autoencoders-Canonical Correlation Analysis (A-CCA). These methods are designed to deal with various variable types, sample weights and incompatibilities among categorical variables. We use the 2017 Belgian Statistics on Income and Living Conditions (SILC) survey and compare the performance of the proposed statistical matching methods by means of a cross-validation technique, as if the data came from two separate sources.
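As a rough illustration of matching two files through canonical scores of their common variables, the sketch below uses plain linear CCA and nearest-neighbour hot-deck imputation on simulated data; it is only a simplified stand-in for the KCCA, Super-OM and A-CCA methods investigated in the talk.

```python
# Minimal linear-CCA hot-deck sketch of statistical matching: two files share
# the common variables X; file B additionally holds Z, which we impute onto
# file A by nearest-neighbour matching on canonical scores of X.
# Simplified stand-in only, not the KCCA / Super-OM / A-CCA methods of the talk.
import numpy as np
from sklearn.cross_decomposition import CCA
from scipy.spatial import cKDTree

rng = np.random.default_rng(3)
n_a, n_b, p, q = 300, 500, 5, 2
X_b = rng.normal(size=(n_b, p))                 # common variables, donor file
Z_b = X_b[:, :2] @ rng.normal(size=(2, q)) + 0.3 * rng.normal(size=(n_b, q))
X_a = rng.normal(size=(n_a, p))                 # common variables, recipient file

# Learn directions of X most related to Z on the donor file.
cca = CCA(n_components=2).fit(X_b, Z_b)
S_b = cca.transform(X_b)                        # canonical scores of donors
S_a = cca.transform(X_a)                        # canonical scores of recipients

# Nearest-neighbour donor in the canonical-score space.
donor_idx = cKDTree(S_b).query(S_a, k=1)[1]
Z_a_imputed = Z_b[donor_idx]
print(Z_a_imputed[:3])
```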
|
Session 2 - Asymptotics [Room: L3 (0/54)]
Chair : Amir Aboubacar
Eva Cantoni - Robust Fitting for Generalized Additive Models for Location, Scale and Shape[Joint work with William H. Aeberhard, Giampiero Marra and Rosalba Radice]
The validity of estimation and smoothing parameter selection for the wide class of generalized additive models for location, scale and shape (GAMLSS) relies on the correct specification of a likelihood function. Deviations from this assumption are known to mislead any likelihood-based inference and can hinder penalization schemes meant to ensure some degree of smoothness for non-linear effects. We propose a general approach to achieve robustness in fitting GAMLSSs by limiting the contribution of observations with low log-likelihood values. Robust selection of the smoothing parameters can be carried out either by minimizing information criteria that naturally arise from the robustified likelihood or via an extended Fellner-Schall method. The latter allows for automatic smoothing parameter selection and is particularly advantageous in applications with multiple smoothing parameters. We also address the challenge of tuning robust estimators for models with non-linear effects by proposing a novel median downweighting proportion criterion. This enables a fair comparison with existing robust estimators for the special case of generalized additive models, where our estimator competes favorably. The overall good performance of our proposal is illustrated by further simulations in the GAMLSS setting and by an application to functional magnetic resonance brain imaging using bivariate smoothing splines.
Lorenzo Tedesco - Estimation of a General Semiparametric Hazards Regression Model With Application to Change analysis Hazard[Joint work with Ingrid Van Keilegom]
We propose a method for the estimation of a general semiparametric hazards regression model that extends the accelerated failure time model and the Cox proportional hazards model in survival analysis. The method is based on a kernel-smoothed profile likelihood. The estimator is shown to be consistent and to achieve semiparametric efficiency. The method is then applied to a generalisation of the accelerated failure time model that considers a single change point in the hazard. The change point can depend not only on time but also on the covariate values in the case of a time-dependent covariate. Simulations are provided together with applications to a real data set and comparisons with alternative methods.
Guy Mélard - New method for the asymptotic properties of self-excited threshold models[Joint work with Marcella Niglio]
A new method for obtaining the asymptotic properties of self-excited threshold autoregressive (SETAR) or ARMA (SETARMA) models is introduced. Threshold models are non-linear models for time series with k regimes, where the regime depends on the value of a variable called the threshold variable, with respect to one (when k = 2) or several (when k > 2) threshold values. As for most non-linear models, the usual method for obtaining the asymptotic properties of such models consists in exhibiting a stationary and ergodic solution of the model equation and using ergodicity to prove the consistency and the asymptotic normality of an estimator of the model parameters. A new method for obtaining these asymptotic properties was recently proposed by the authors when the threshold variable is exogenous and independent of the innovations of the time series model. That method is based on an asymptotic theory for scalar or vector ARMA models, where the coefficients are not constant but are deterministic functions of time and a small number of parameters. The method is thus valid well beyond threshold autoregressive (TAR) models, for example for TARMA models and their multivariate counterparts, but still assuming an exogenous threshold variable. SETARMA models have random coefficients, so that the theory is not directly applicable. Nevertheless, it is possible to adapt these fundamental results to the case of the randomly varying coefficients that appear in SETARMA models and their vector generalization. The only problem with the new method is that the existence of the information matrix has to be assumed or proved.
Alexander Duerre - Depth conditioned functional curves[Joint work with Davy Paindaveine]
Statistical functionals are useful tools to extract essential information like location, scale or dependence from possibly complicated distributions. Popular examples are the expected value, the variance and the covariance matrix. A major appeal lies in their simplicity, which is also their major limitation. Sometimes the location of a distribution cannot be described solely by its expectation. Imagine a mixture distribution of a standard normal with large mixture weight and a point mass far away from 0. The expectation then captures neither the location of the standard normal nor the location of the 'outlying' probability mass. We develop the idea of conditional functional curves, which capture both the properties of the central probability mass and the properties of the outer or outlying probability mass.
For a subclass of functionals, we propose a consistent estimator for the corresponding conditional functional curve and derive pointwise asymptotic normality. We conclude with some examples underlining the various possibilities to apply this method.
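A toy one-dimensional computation of the motivating mixture example may help fix ideas: the plain mean describes neither component, whereas means conditioned on a crude notion of depth separate the central and outlying probability mass. The cut-off, mixture weights and point-mass location below are arbitrary illustrative choices, not part of the talk.

```python
# Tiny one-dimensional illustration of the motivating example: the plain mean
# of a normal-plus-point-mass mixture describes neither part, whereas means
# conditioned on a crude notion of depth (distance to the median) separate the
# two. Toy stand-in only for the depth-conditioned functional curves of the talk.
import numpy as np

rng = np.random.default_rng(4)
n = 10000
mass_far_away = rng.random(n) < 0.05           # 5% point mass far away from 0
x = np.where(mass_far_away, 20.0, rng.normal(0, 1, n))

dist = np.abs(x - np.median(x))                # crude "outlyingness": distance to the median
central = x[dist <= 4]                         # the deep, central part of the data
outlying = x[dist > 4]                         # the outlying probability mass

print("overall mean    :", x.mean())           # roughly 1.0, describes neither part
print("mean | central  :", central.mean())     # close to 0
print("mean | outlying :", outlying.mean())    # equal to 20
```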
|
Session 3 - Genetics [Room: L5 (1/1)]
Chair : Dirk Valkenborg
Kai Kammers - Transcriptional landscape of platelets and iPSC-derived megakaryocytes[Joint work with M.A. Taub, B. Rodriguez, L.R. Yanek, I. Ruczinski, J. Martin, K. Kanchan, A. Battle, L. Cheng, Z.Z. Wang, A.D. Johnson, J.T. Leek, N. Faraday, L.C. Becker and R.A. Mathias]
Genome-wide association studies have identified common variants associated with platelet-related phenotypes, but because these variants are largely intronic or intergenic, their link to platelet biology is unclear. Additionally, extensive missing heritability may be resolved by integrating genetics and transcriptomics. To better understand the transcriptome signature and its genetic regulatory landscape in platelets and induced pluripotent stem cell-derived megakaryocyte (MK) cell lines (platelet precursor cells), we performed expression quantitative trait locus (eQTL) analyses of whole-genome sequencing and RNA-sequencing data on both cell types in African American (AA) and European American (EA) subjects from the Genetic Studies of Atherosclerosis Risk (GeneSTAR) project.
By meta-analyzing the results of AAs and EAs and selecting the peak single-nucleotide polymorphism (SNP) for each expressed gene, we identified 946 cis-eQTLs in MKs and 1830 cis-eQTLs in platelets. Among the 57 eQTLs shared between the two tissues, the estimated directions of effect are very consistent (98.2% concordance). A high proportion of detected cis-eQTLs (74.9% in MKs and 84.3% in platelets) are unique to MKs and platelets compared with peak-associated SNP-expressed gene pairs of 48 other tissue types that are reported in version V7 of the Genotype-Tissue Expression Project. The locations of our identified eQTLs are significantly enriched for overlap with several annotation tracks highlighting genomic regions with specific functionality in MKs.
These results offer insights into the regulatory signature of MKs and platelets, with significant overlap in genes expressed, eQTLs detected, and enrichment within known superenhancers relevant to platelet biology.
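For readers unfamiliar with eQTL mapping, the minimal sketch below illustrates the generic idea of selecting a peak SNP for a gene by regressing expression on allele dosage; the data are simulated, and the simple linear model ignores the covariate adjustment and cross-population meta-analysis used in the GeneSTAR analysis.

```python
# Minimal cis-eQTL sketch: for one gene, regress expression on the allele
# dosage of each nearby SNP and keep the peak (smallest p-value) SNP.
# Data, SNP set and the plain linear model are purely illustrative.
import numpy as np
from scipy import stats

rng = np.random.default_rng(8)
n_subjects, n_snps = 200, 50
dosage = rng.binomial(2, 0.3, size=(n_subjects, n_snps)).astype(float)  # 0/1/2 alleles
expression = 0.8 * dosage[:, 7] + rng.normal(0, 1, n_subjects)          # SNP 7 is causal

pvals = np.array([stats.linregress(dosage[:, j], expression).pvalue
                  for j in range(n_snps)])
peak = int(np.argmin(pvals))
print(f"peak SNP: {peak}, p-value: {pvals[peak]:.2e}")
```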
Yao Chen - A Bootstrap Method for Variance Estimation in dPCR Experiments[Joint work with Ward De Spiegelaere, Wim Trypsteen, David Gleerup and Olivier Thas]
Digital PCR (dPCR) is a highly sensitive technique for quantification of a target molecule copy number in a biological sample. It proceeds by massive partitioning of the sample, followed by PCR reactions in all partitions, and eventually classifying the individual partitions as positive or negative based on the end-point fluorescence intensities. The data are often analysed by relying on the binomial or Poisson distribution. However, these assumptions may not hold when there are other sources of variation than sampling error. Moreover, when more complicated statistics need to be computed (see below), these parametric methods cannot easily be used for statistical inference.
We have developed a bootstrap method that takes into account not only the sampling variability and the inherent partitioning variation, but also other sources of error, such as partition loss and pipetting error. Furthermore, the method is generic, so that it can easily be used for variance estimation of more complicated non-linear statistics, such as copy number variation and the DNA shearing index. The method can also be extended to work with multiplex dPCR. We have evaluated the performance of the method under various realistic simulation scenarios.
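A generic nonparametric bootstrap over partitions, shown below for the sampling-variability component only, conveys the basic mechanics; the authors' method additionally accounts for partition loss, pipetting error and other sources of variation, which this sketch does not.

```python
# Generic nonparametric bootstrap over dPCR partitions, shown only for the
# sampling-variability component; the method in the talk additionally models
# partition loss, pipetting error and other sources of variation.
import numpy as np

rng = np.random.default_rng(5)
n_partitions = 20000
true_lambda = 0.8                                    # mean copies per partition
# Simulated end-point classification: a partition is positive if it received
# at least one target molecule (Poisson-distributed copies per partition).
positive = rng.poisson(true_lambda, n_partitions) > 0

def copies_per_partition(pos):
    """Poisson-based dPCR estimator: lambda = -log(1 - fraction positive)."""
    return -np.log(1.0 - pos.mean())

B = 2000
boot = np.empty(B)
for b in range(B):
    resample = rng.choice(positive, size=n_partitions, replace=True)
    boot[b] = copies_per_partition(resample)

print("estimate:", copies_per_partition(positive))
print("bootstrap standard error:", boot.std(ddof=1))
```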
The simulation results demonstrate the capability of this new bootstrap method for variance estimation even when many sources of variation are present. Another strength is that it also works well for the variance estimation of non-linear statistics.
Leyla Kodalci - Simple and Flexible Sign and Rank-Based Methods for Testing for Differential Abundance in Microbiome Studies[Joint work with Olivier Thas]
Microbiome data obtained from high-throughput sequencing are considered compositional data, characterised by a sum constraint. Hence, only ratios of count observations are informative. Furthermore, microbiome data are overdispersed and have many zero abundances. Many compositional data analysis methods make use of log ratios of the components of the observation vector. However, the many zero abundances cause problems when calculating ratios and logarithms.
In this work, we focus on the identification of taxa that are differentially abundant between two groups. We have developed semiparametric methods targeting the probability that the outcome of one taxon is smaller than the outcome of another taxon. The methods rely on logistic and probabilistic index models and hence inherit the flexibility that comes with these modelling frameworks. The estimation of this probability only requires information about the pairwise ordering of the taxa, and hence zero observations cause no problems. We have constructed several estimators of the effect size parameters in the model, and hypothesis tests based on these estimators. Results from a simulation study indicate that our methods control the false discovery rate at the nominal level and have good sensitivity compared to competitors.
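The quantity targeted by these methods can be illustrated with a crude empirical stand-in: the probability that the count of one taxon falls below that of a reference taxon, compared between two groups. The sketch below uses simulated zero-inflated counts and a naive normal-approximation test; it is not the logistic or probabilistic index model estimators of the talk.

```python
# Empirical illustration of the pairwise-ordering quantity targeted in the talk:
# P(count of taxon j < count of reference taxon k), compared between two groups.
# Crude moment-based stand-in, not the probabilistic-index-model estimators.
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
n_per_group = 60

def simulate(mu_j):
    """Hypothetical zero-inflated counts for taxon j and a reference taxon k."""
    counts_j = rng.negative_binomial(2, 2 / (2 + mu_j), n_per_group)
    counts_k = rng.negative_binomial(2, 2 / (2 + 5.0), n_per_group)
    counts_j[rng.random(n_per_group) < 0.3] = 0        # excess zeros
    return counts_j, counts_k

def pairwise_prob(cj, ck):
    """Estimate P(taxon j < taxon k), counting ties as 1/2."""
    return np.mean((cj < ck) + 0.5 * (cj == ck))

j1, k1 = simulate(mu_j=2.0)     # group 1
j2, k2 = simulate(mu_j=8.0)     # group 2: taxon j more abundant
p1, p2 = pairwise_prob(j1, k1), pairwise_prob(j2, k2)
# Naive normal-approximation test for a difference in the two probabilities.
se = np.sqrt(p1 * (1 - p1) / n_per_group + p2 * (1 - p2) / n_per_group)
z = (p1 - p2) / se
print(f"P1={p1:.2f}, P2={p2:.2f}, z={z:.2f}, p={2 * stats.norm.sf(abs(z)):.3f}")
```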
Mohamad Zafer Merhi - Single Cell RNAseq data: Application of Clustering and Biclustering for structure Identification of Antigen Specificity[Joint work with Dan Lin, Ahmed Essaghir and Ziv Shkedy]
The single cell RNA-sequencing technology allows the assessment of heterogeneous cell-specific changes and their biological characteristics. In this study, we focus on single cell omics data for immune profiling purposes. T cells exhibit a unique behaviour referred to as cross-reactivity: the ability of T cells to recognize two or more peptide-MHC complexes through the TCR. Our work uses single cell RNA-seq data (publicly available at https://support.10xgenomics.com/single-cell-vdj/datasets/) consisting of CD8+ T cells obtained with a single cell omics technology from 10X Genomics. Our aim is to understand the heterogeneous characteristics and the binding specificities of these T cells, i.e., to identify the specificity of the CD8+ T cells to one (or more) antigen(s). For the identification of specific CD8+ T cells, we propose an unsupervised data analysis pipeline. Biclustering methods are applied to recover and explore the cross-reactive behaviour of T cells and to identify a subset of cells that are specific to a subset of antigens. Clustering methods are used to link these subsets to the RNA-seq data. Furthermore, we discuss the challenges of the application and evaluation of clustering algorithms on single cell RNA-seq data.
|
|
14:15 - 15:55 |
Session 4 - Covid -
This session includes one invited talk (40') and three contributed talks (20').
[Room: L3 Auditoire 0/13A]
Chair : Stijn Vansteelandt
Niel Hens - Lessons Learned, Remaining Challenges and Pandemic Preparedness
In this talk, I will reflect on the COVID-19 pandemic from both a national and an international perspective, focussing on the lessons learned, remaining challenges and pandemic preparedness. I will focus on the design and analysis of infectious disease studies and the importance of peacetime research.
Hege Michiels - Estimation and interpretation of vaccine efficacy in COVID-19 trials[Joint work with An Vandebosch and Stijn Vansteelandt]
An exceptional effort by the scientific community has led to the development of multiple vaccines against COVID-19. Efficacy estimates for these vaccines have been widely communicated to the general public, but may nonetheless be challenging to compare quantitatively. Indeed, the phase 3 trials performed differ in study design, in the definition of vaccine efficacy and in how cases arising shortly after vaccination are handled. In this work, we investigate the impact of these choices on the obtained vaccine efficacy estimates, both theoretically and by re-analysing the Janssen and Pfizer COVID-19 trial data using a uniform protocol. We moreover study the causal interpretation that can be assigned to per-protocol analyses typically performed in vaccine trials. Finally, we propose alternative estimands to measure vaccine efficacy in settings with delayed immune response and provide insight into the intrinsic effect of the vaccine after achieving adequate immune response.
Cécile Kremer - Quantifying superspreading using Poisson mixture distributions[Joint work with Andrea Torneri, Sien Boesmans, Hanne Meuwissen, Selina Verdonschot, Koen Vanden Driessche, Christian L. Althaus, Christel Faes and Niel Hens]
An important parameter for the control of infectious diseases is the number of secondary cases, i.e. the number of new infections generated by an infectious individual. When individual variation in disease transmission is present, the distribution of the number of secondary cases is skewed and often modeled using a negative binomial distribution. However, this may not always be the best distribution to describe the underlying transmission process. We propose the use of three other offspring distributions to quantify heterogeneity in transmission, and find that estimates of the mean and variance may be biased when there is a substantial amount of heterogeneity. In addition, we (re-)analyze three COVID-19 datasets and find that for two of these datasets the distribution of the number of secondary cases is better described by a Poisson-lognormal distribution. Since conclusions regarding the superspreading potential of a disease are made based on the distribution used for modeling the number of secondary cases, we recommend comparing different distributions and selecting the most accurate one before making inferences on superspreading potential.
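As a minimal illustration of fitting an offspring distribution to secondary-case data, the sketch below obtains maximum-likelihood estimates of the mean and dispersion of a negative binomial from simulated counts; the real datasets and the alternative Poisson mixtures compared in the talk are not reproduced here.

```python
# Illustrative maximum-likelihood fit of a negative binomial offspring
# distribution to secondary-case counts; the talk compares this with other
# Poisson mixtures such as the Poisson-lognormal. Data below are simulated.
import numpy as np
from scipy import stats, optimize

rng = np.random.default_rng(7)
true_R, true_k = 2.0, 0.3          # mean and dispersion (small k = superspreading)
secondary = rng.negative_binomial(true_k, true_k / (true_k + true_R), size=400)

def neg_loglik(params):
    log_R, log_k = params          # work on the log scale to keep R, k positive
    R, k = np.exp(log_R), np.exp(log_k)
    p = k / (k + R)                # scipy's (n, p) parameterization of nbinom
    return -stats.nbinom.logpmf(secondary, k, p).sum()

fit = optimize.minimize(neg_loglik, x0=[0.0, 0.0], method="Nelder-Mead")
R_hat, k_hat = np.exp(fit.x)
print(f"estimated R = {R_hat:.2f}, dispersion k = {k_hat:.2f}")
```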
Lisa Hermans - Infectieradar.be, Crowdsourcing Surveillance System to Monitor the Spread of Infectious Diseases in Belgium[Joint work with Yannick Vandendijck, Sarah Vercruysse, Emiliano Mancini, Jakob Randa, Geert Jan Bex, Sajeeth Sadanand, Daniela Paolotti, Christel Faes, Philippe Beutels, Niel Hens and Pierre Van Damme]
Infectieradar monitors the spread of infectious diseases with the help of volunteers via the internet. On this platform, individuals can report symptoms and complaints related to their health condition and whether or not they seek medical care.
Traditional surveillance for respiratory infections (COVID-19, influenza, etc.) relies on patients who consult physicians. However, many individuals do not seek health care while ill. Also, not everyone is tested. On top of this, individuals may change their health-seeking behaviour over the course of an epidemic. Community participatory surveillance is therefore critical. Infectieradar receives the data directly from the population, creating a fast and flexible monitoring system. The platform is part of a larger network, Influenzanet, which allows the Belgian data to be placed in a European perspective.
It provides insights into the symptom burden and an estimate of the true number of infections, allowing us to monitor how health complaints are distributed in Belgium over time. These data are of utmost importance for scientific research into the spread of COVID-19, but also of other viruses and infectious diseases. The more people subscribe to the platform, the more accurate the predicted incidence of new infections will be.
|
17:15 - 18:15 |
Quetelet session [Room: L3 Auditoire 0/13A]
Chair : Beatrijs Moerkerke
Jelle Goeman - All-Resolutions Inference
Many fields of science nowadays gather data at a very fine resolution but do inference at a more aggregated level. For example, in neuroimaging, data are gathered at the level of 3 mm × 3 mm × 3 mm voxels, but the relevant biology happens at the level of cm-scale brain areas; in genetics, data are gathered at the level of single-DNA-base polymorphisms, but interesting questions arise at the level of genes or even gene groups; in spatial statistics, data may be gathered at street level, but interesting questions are about neighbourhoods or regions. Often, there is not just one natural way to aggregate data to prepare for inference. Multiple alternative criteria could be used to drive the grouping. Aggregation to large regions may give low specificity; more limited aggregation may give low power.
This talk presents how closed testing can be used to analyze this type of data at all resolutions simultaneously. The method allows how and how much to aggregate to be chosen freely by the researcher, in a data-dependent way, while still strictly controlling the probability of false positive findings. This allows researchers to adapt the inference to the amount and the shape of the signal present in the data: the stronger the signal, the better it will be pinpointed by the closed testing procedure.
I will review the general idea and theory of closed testing and recent progress in method development in this area. Several example contexts illustrate the wide applicability of all-resolutions inference.
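A toy closed-testing procedure with Bonferroni local tests (which, for elementary hypotheses, reduces to Holm's procedure) gives a flavour of the principle; the scalable shortcuts needed for genuine all-resolutions inference over aggregated hypotheses are not shown.

```python
# Toy closed-testing procedure with Bonferroni local tests, illustrating the
# principle behind all-resolutions inference: a hypothesis is rejected only if
# every intersection hypothesis containing it is rejected by its local test.
from itertools import combinations

def closed_testing(pvals, alpha=0.05):
    m = len(pvals)
    all_idx = range(m)

    def local_reject(S):
        # Bonferroni local test for the intersection hypothesis over subset S.
        return min(pvals[i] for i in S) <= alpha / len(S)

    rejected = []
    for i in all_idx:
        # Enumerate every subset of hypotheses that contains i.
        supersets = (set(S) | {i}
                     for r in range(m)
                     for S in combinations(all_idx, r))
        if all(local_reject(S) for S in supersets):
            rejected.append(i)
    return rejected

# Hypotheses 0 and 1 are rejected, matching Holm's procedure: prints [0, 1].
print(closed_testing([0.001, 0.02, 0.40]))
```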
Announcement of the Quetelet award winners |