## A data mining approach for the discovery of critical events in life courses

Keywords: data mining, machine learning, class imbalance, infrequent classes, longitudinal data, social sciences, life course analysis, sequence analysis, vulnerability, critical life event, critical life situation.

Conducted within the NCCR LIVES, this thesis aims to provide methodological tools for the study of vulnerability across the life course. More precisely, we want to explain why some individuals experience critical life situations while others succeed in avoiding them. The concept of vulnerability is broad and complex, so we limit the scope of our study to what we call manifest vulnerability: a critical life situation that can be observed and measured, such as a serious health problem or long-term unemployment. This formulation of vulnerability lends itself naturally to an ordinal scale of vulnerability levels. In its simplest form, this scale is a binary variable with two states: being vulnerable or not (for example, experiencing more than two years of unemployment or not). The reasons for experiencing critical life situations are most often complex and generally result from the interaction of several factors. By interaction we mean that the impact of one factor differs depending on the values of another covariate. For example, consider the loss of independence following a leg fracture caused by a domestic accident. We may assume that for young people this event will have no significant consequences, and several weeks of rest will be sufficient to recover. For elderly people, however, recovering from such an accident is more difficult. The accident can mark the transition to a loss of independence and should therefore be interpreted as a critical event that led to a critical life situation. Here we observe an interaction between the variable “experiencing a leg fracture” and the age class. When studying individual life courses, many covariates can interact, and it is quite difficult to figure out which interactions are the most significant.
Furthermore, a given critical situation generally results not from the interaction of only two covariates but from that of three, four or more factors. This makes the task of modeling the vulnerabilization process even more difficult. In addition, critical life events are usually rare, so we have fewer occasions to observe them, and it is harder to figure out by inspection alone which successions of critical events may lead to a critical life situation. These two reasons motivate the need for exploratory tools to “mine” data and highlight relevant and significant patterns. The discovered patterns will help the social scientist to set and/or refine her theoretical model. This theoretical model can then be tested empirically with the classical statistical methods used in the scientist's research field (for example, regression models, multi-level analysis, structural equation modeling).

I will start my dissertation by introducing the concepts of longitudinal data, life course, trajectories and categorical sequences, and will provide a review of classical sequence analysis methods. The first part of my thesis will then focus on supervised machine learning methods able to discover relevant associations between a categorical ordinal variable, or a specific class of this variable, and a set of categorical or quantitative explanatory variables. When working on a categorical dependent variable (or target), machine learning algorithms are often called classifiers. I will provide a review of the best-known classifiers, such as decision trees (Murthy, 1998), rule learners (Fürnkranz, 1999), association rule miners (Kotsiantis and Kanellopoulos, 2006), support vector machines (Vapnik, 1995) and neural network classifiers (Zhang, 2000). Then, I will review classical strategies for overcoming the class imbalance and infrequent class issue (Japkowicz and Stephen, 2002), such as oversampling/undersampling strategies (Chawla et al., 2002; Chawla, 2003), cost-sensitive methods (Chawla et al., 2008; Weiss, McCarthy, and Zabar, 2007), and ensemble methods such as bagging (Breiman, 1994) and boosting (Fan et al., 1999).

In the social sciences, the ability to understand results is just as important as the performance of the method itself, and graphical tools that help the user visualize and interpret results are more than welcome. In this context, my academic contributions will focus on two specific methods: decision trees and association rules. Decision tree methods make it possible to (1) make significant factors emerge from a huge set of variables, (2) discover interactions involving several variables at the same time and (3) render the model in a natural and intuitive graphical representation. These features make decision tree methods relevant for studying vulnerabilization processes. Basically, a decision tree creates a partition of the attribute space that “explains” the values of the dependent variable.
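
As a minimal sketch of this partitioning idea, the Python code below scores candidate splitting variables by information gain, an entropy-based criterion (CHAID relies on chi-squared tests instead). The records, the variable names `rel`, `fel` and `age`, and the helper functions are all hypothetical and chosen only for illustration. The first split on `rel` is imposed by hand. The point is that the best second variable then differs between the two branches, which is how a tree exposes an interaction.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def information_gain(rows, attribute, target):
    """Entropy reduction obtained by partitioning `rows` on `attribute`."""
    groups = {}
    for row in rows:
        groups.setdefault(row[attribute], []).append(row[target])
    parent = entropy([row[target] for row in rows])
    children = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return parent - children


def best_split(rows, attributes, target):
    """Attribute with the largest information gain."""
    return max(attributes, key=lambda a: information_gain(rows, a, target))


# Hypothetical toy records: (rel, fel, age, status); E = employed, U = unemployed.
records = [
    ("low", "low", "young", "U"), ("low", "low", "old", "U"),
    ("low", "high", "young", "E"), ("low", "high", "old", "E"),
    ("high", "low", "young", "U"), ("high", "high", "young", "U"),
    ("high", "low", "old", "E"), ("high", "high", "old", "E"),
]
rows = [dict(zip(("rel", "fel", "age", "status"), r)) for r in records]

# First split imposed on `rel`; then search the best second variable per branch.
for level in ("low", "high"):
    branch = [r for r in rows if r["rel"] == level]
    print(level, best_split(branch, ["fel", "age"], "status"))
```

On these toy records the script prints `low fel` and then `high age`: the second-level variable depends on the value taken by the first one.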

Figure 1: CHAID decision tree used in Guarin and Rousseaux (2013) to refine an assumption of the theoretical model. Legend: E = employed, U = unemployed, REL3 = respondent's educational level (3 groups), FEL3 = father's educational level (3 groups), Age4 = age of the respondent (4 groups).

An example of a decision tree is given in Figure 1. This example comes from research on the employment attainment of second-generation immigrants in Switzerland (Guarin and Rousseaux, 2013; Rousseaux and Guarin, 2013). We used decision tree methods in this article to refine one of our hypotheses. We expected the father's educational level to play a key role in the ability of his children to get their first job, but we could not find evidence of a significant impact with regression models. We therefore used decision trees to explore underlying interactions between variables. This analysis highlighted a meaningful interaction: the father's educational level is the second-best explanatory variable for individuals with a low level of education, whereas for individuals with a middle or high level of education the second-best explanatory variable is age (Figure 1). This interaction led us to refine our theoretical model: the father's educational level does not play a key role for all individuals but only for those with a low educational level. This can be easily explained. Most employers hire individuals on the basis of their grades and skills. Well-educated individuals therefore succeed in getting a job thanks to their education, and their father's educational level does not play a significant role. For people unable to take advantage of their education, getting a job is more difficult. For those individuals the father can be a significant help, for example by using his professional relations or by giving his children sound advice. Finally, we found empirical evidence for this assumption in our regression models. Like other supervised machine learning methods, decision trees are affected by class imbalance. This impact mainly concerns the splitting criteria used to grow the tree.
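
To see why imbalance matters for entropy-based splitting criteria, the sketch below compares binary Shannon entropy with an off-centred impurity of the form $p(1-p) / ((1-2w)p + w^2)$, which reaches its maximum of 1 at $p = w$. This particular form, the function names and the 5% minority prior are illustrative assumptions, not the criteria this thesis will propose.

```python
from math import log2


def shannon_entropy(p):
    """Binary Shannon entropy; maximal at p = 0.5 whatever the class prior."""
    if p in (0.0, 1.0):
        return 0.0
    return -(p * log2(p) + (1 - p) * log2(1 - p))


def off_centred_impurity(p, w):
    """Illustrative asymmetric impurity with maximum 1 at p = w (0 < w < 1)."""
    return p * (1 - p) / ((1 - 2 * w) * p + w ** 2)


prior = 0.05  # hypothetical share of the minority (vulnerable) class

# Node A reproduces the prior (no information gained about the minority class).
# Node B holds 50% minority cases (a tenfold enrichment over the prior).
for name, p in (("A", 0.05), ("B", 0.50)):
    print(name, round(shannon_entropy(p), 3), round(off_centred_impurity(p, prior), 3))
```

Shannon entropy rates node B as maximally impure and node A as nearly pure, although B is the node that isolates vulnerable individuals. The off-centred measure reverses this ranking because its reference point is the prior rather than the uniform distribution.
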
Several methods have been developed to overcome the imbalance and infrequent class issue in decision tree learning, especially for entropy-based tree methods (Lallich, Lenca, and Vaillant, 2007; Marcellin, 2008; Ritschard, Marcellin, and Zighed, 2009; Zighed, Ritschard, and Marcellin, 2010). In this thesis we will review these methods and put forward one or two new splitting criteria intended to outperform current methods. The final aim of this work on decision tree methods for imbalanced data is to provide social scientists with an efficient tool to discover profiles of vulnerable individuals. We will then pursue this work by focusing on classifiers based on association rules. Association rules play a major role in mining relevant associations within large data sets. Our motivation for using association rules is to be able to discover (1) which sequences of events may lead to a critical life situation and (2) which events can protect individuals against this critical situation. For example, assuming $C$ is a critical life situation, we would like to discover rules like R1, where $A$ and $B$ are life course events that could explain why the critical situation happened and $t_i$ are the times at which the events occurred. We would then like to highlight exceptions to the rules we found, such as R2, where $Z$ is another life course event.

$A_{\text{work traj.}}^{t_1} \land B_{\text{family traj.}}^{t_2} \Rightarrow C_{\text{health traj.}}^{t_3} \quad \text{(R1)}$

$A_{\text{work traj.}}^{t_1} \land B_{\text{family traj.}}^{t_2} \land Z_{\text{family traj.}}^{t_3} \Rightarrow \bar{C}_{\text{health traj.}} \quad \text{(R2)}$

To that end we shall focus on algorithms for mining positive and negative association rules (Antonie and Zaïane, 2004; Wu, C. Zhang, and S. Zhang, 2004; Cornelis et al., 2006). Whereas positive rule mining can only assess the impact of events that occur, negative rules can also assess the impact of events that do not occur. Furthermore, classical algorithms, such as the Apriori algorithm (Agrawal, Imieliński, and Swami, 1993), require the user to specify a minimum support in order to find frequent itemsets. With significantly imbalanced class distributions, this may cause interesting rules about minority classes to be missed. Conversely, lowering the minimum support would yield too many rules, most of them uninteresting. One interesting approach to this support threshold issue is to formulate rule mining as a combinatorial global optimization problem and to solve it with a metaheuristic optimization method (Yan, C. Zhang, and S. Zhang, 2009). In this thesis our contribution will concern the method used to solve the combinatorial problem. Recently, a new optimization method based on the ability of bees to discover good nectar sources has been proposed, and it seems to outperform other methods in most situations. This method, called artificial bee colony optimization (Karaboga and Basturk, 2007; Karaboga and Basturk, 2008; Karaboga and Akay, 2009), was initially developed for minimizing numerical functions but has recently been adapted to binary optimization problems (Pampara and Engelbrecht, 2011; M. H. Kashan, Nahavandi, and A. H. Kashan, 2012). In our contribution we will also suggest pruning discovered rules with a chi-squared test and assessing the quality of a rule by its lift.
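
The quantities involved in this pruning and ranking step can be sketched as follows. The function `rule_metrics`, the event names and the ten toy life courses are hypothetical, and the sketch ignores event times: each life course is reduced to the set of events it contains. The chi-squared statistic is the standard one for a 2x2 contingency table crossing "antecedent holds" with "consequent holds".

```python
def rule_metrics(transactions, antecedent, consequent):
    """Support, confidence, lift and 2x2 chi-squared of antecedent => consequent.

    Assumes every margin of the 2x2 contingency table is non-zero.
    """
    n = len(transactions)
    # Cells of the contingency table (antecedent yes/no x consequent yes/no).
    a = sum(1 for t in transactions if antecedent <= t and consequent in t)
    b = sum(1 for t in transactions if antecedent <= t and consequent not in t)
    c = sum(1 for t in transactions if not (antecedent <= t) and consequent in t)
    d = n - a - b - c
    confidence = a / (a + b)
    return {
        "support": a / n,
        "confidence": confidence,
        "lift": confidence / ((a + c) / n),
        "chi2": n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d)),
    }


# Hypothetical toy data: one set of observed events per individual.
life_courses = (
    [{"job_loss", "divorce", "health_problem"}] * 3
    + [{"job_loss", "divorce"}]
    + [{"job_loss", "health_problem"}]
    + [{"job_loss"}, {"divorce"}, {"divorce"}, {"marriage"}, set()]
)

metrics = rule_metrics(life_courses, {"job_loss", "divorce"}, "health_problem")
print(metrics)
```

A lift above 1 indicates that the consequent is more frequent when the antecedent holds than in the whole sample. With so few observations the chi-squared approximation is unreliable, so the toy figures only show how the quantities are computed.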

Finally, we will combine both approaches in a unified framework. For example, we want to start the data exploration by looking for relevant association rules and then conduct a decision tree analysis to discover which life profiles support these rules.