A data mining approach for the discovery of critical events in life courses

Keywords: data mining, machine learning, class imbalance, infrequent classes, longitudinal data, social sciences, life course analysis, sequence analysis, vulnerability, critical life event, critical life situation.

Conducted within the NCCR LIVES, this thesis aims to provide methodological tools for the study of vulnerability across the life course. More precisely, we want to explain why some individuals experience critical life situations while others succeed in avoiding them. The concept of vulnerability is broad and complex, so we limit the scope of our study to what we call manifest vulnerability: a critical life situation that can be observed and measured, such as a serious health problem or long-term unemployment. This formulation of vulnerability lends itself naturally to an ordinal scale of vulnerability levels. In its simplest form, this scale is a binary variable with two states: being vulnerable or not (for example, experiencing more than two years of unemployment or not). The reasons for experiencing critical life situations are most often complex and generally result from the interaction of several factors. By interaction we mean that the impact of one factor differs depending on the values of another covariate. For example, consider the loss of independence after a leg fracture caused by a domestic accident. We may assume that for young people this event will not have significant consequences, and several weeks of rest will be sufficient to recover. For elderly people, however, recovering from such an accident is more difficult. The domestic accident can mark the transition to a loss of independence and should therefore be interpreted as a critical event that led to a critical life situation. Here we observe an interaction between the variable “experiencing a leg fracture” and the age class. When studying individual life courses, many covariates can interact, and it is difficult to determine which interactions are the most significant. 
Furthermore, the causes of a given critical situation generally result from the interaction not of only two covariates but of three, four or more factors. This makes the task of modeling the vulnerabilization process even more difficult. In addition, critical life events are usually rare, so we have fewer occasions to observe them, and it is harder to work out unaided which successions of critical events may lead to a critical life situation. These two reasons motivate the need for exploratory tools to “mine” data and highlight relevant and significant patterns. The discovered patterns will help the social scientist to set up and/or refine her theoretical model. This theoretical model can then be tested empirically with the classical statistical methods used in the scientist's research field (for example, regression models, multi-level analysis, structural equation modeling).
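The notion of interaction used above can be illustrated with a minimal sketch. The counts below are invented purely for illustration and do not come from any survey; the names `records`, `rate_by_group` and `fracture_effect` are hypothetical. The sketch compares the effect of a leg fracture on the rate of loss of independence within each age class: an interaction shows up as a difference between these two effects.

```python
from collections import defaultdict

# Hypothetical toy records: (age_class, leg_fracture, lost_independence).
# All counts are invented for illustration only.
records = (
    [("young", True, False)] * 19 + [("young", True, True)] * 1
    + [("young", False, False)] * 76 + [("young", False, True)] * 4
    + [("elderly", True, False)] * 8 + [("elderly", True, True)] * 12
    + [("elderly", False, False)] * 72 + [("elderly", False, True)] * 8
)

def rate_by_group(rows):
    """Rate of loss of independence for each (age_class, leg_fracture) cell."""
    counts = defaultdict(lambda: [0, 0])  # cell -> [events, total]
    for age, fracture, lost in rows:
        counts[(age, fracture)][0] += int(lost)
        counts[(age, fracture)][1] += 1
    return {cell: events / total for cell, (events, total) in counts.items()}

rates = rate_by_group(records)

def fracture_effect(age):
    """Difference in rates with and without a fracture, within one age class."""
    return rates[(age, True)] - rates[(age, False)]

effect_young = fracture_effect("young")
effect_elderly = fracture_effect("elderly")

# If the fracture had the same impact in both age classes, this would be near 0.
interaction = effect_elderly - effect_young

print(f"effect among young:   {effect_young:.2f}")
print(f"effect among elderly: {effect_elderly:.2f}")
print(f"interaction contrast: {interaction:.2f}")
```

With only two covariates this contrast is easy to compute by hand; the difficulty described above arises when three, four or more factors are involved and the number of cells to inspect grows combinatorially.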

I will start my dissertation by introducing the concepts of longitudinal data, life course, trajectories and categorical sequences, and will provide a review of classical sequence analysis methods. The first part of my thesis will then focus on supervised machine learning methods able to discover relevant associations between a categorical ordinal variable, or a specific class of this variable, and a set of categorical or quantitative explanatory variables. When the dependent variable (or target) is categorical, machine learning algorithms are often called classifiers. I will provide a review of the best-known classifiers, such as decision trees (Murthy, 1998), rule learners (Fürnkranz, 1999), association rule miners (Kotsiantis and Kanellopoulos, 2006), support vector machines (Vapnik, 1995) and neural network classifiers (Zhang, 2000). Then, I will review classical strategies for overcoming the class imbalance and infrequent class issue (Japkowicz and Stephen, 2002), such as oversampling/undersampling strategies (Chawla et al., 2002; Chawla, 2003), cost-sensitive methods (Chawla et al., 2008; Weiss, McCarthy, and Zabar, 2007), and ensemble methods such as bagging (Breiman, 1994) and boosting (Fan et al., 1999).

In social sciences, the ability to understand results is just as important as the performance of the method itself. My academic contributions will therefore focus on two specific methods: decision trees and association rules. In this context, graphical tools that help the user to visualize and interpret results are especially valuable. Decision tree methods make it possible to (1) make significant factors emerge from a huge set of variables, (2) discover interactions involving several variables at the same time and (3) render the model in a natural and intuitive graphical representation. These features make decision tree methods relevant for studying vulnerabilization processes. Basically, a decision tree creates a partition of the attribute space that “explains” the values of the dependent variable.
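To make the idea of a partition that “explains” the dependent variable more concrete, here is a minimal sketch of how a tree might choose its first split. It assumes an entropy-based criterion (information gain), which is only one possible choice: other tree methods, such as CHAID, rely on a chi-squared criterion instead. The toy rows and the attribute names `education`, `age` and `status` are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, attribute, target):
    """Reduction in entropy of `target` obtained by partitioning on `attribute`."""
    parent = entropy([r[target] for r in rows])
    groups = {}
    for r in rows:
        groups.setdefault(r[attribute], []).append(r[target])
    weighted = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return parent - weighted

# Hypothetical toy data, invented for illustration (E = employed, U = unemployed).
rows = [
    {"education": "low", "age": "young", "status": "U"},
    {"education": "low", "age": "old", "status": "U"},
    {"education": "low", "age": "young", "status": "U"},
    {"education": "low", "age": "old", "status": "E"},
    {"education": "high", "age": "young", "status": "E"},
    {"education": "high", "age": "old", "status": "E"},
    {"education": "high", "age": "young", "status": "E"},
    {"education": "high", "age": "old", "status": "E"},
]

gains = {a: information_gain(rows, a, "status") for a in ("education", "age")}
best = max(gains, key=gains.get)  # attribute chosen for the first split

for attribute, gain in gains.items():
    print(f"{attribute}: gain = {gain:.3f}")
print("first split on:", best)
```

A full tree would repeat this selection recursively inside each resulting group, which is how interactions between several variables can emerge along a branch.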

Decision tree showing an interaction between father's and respondent's educational level
Figure 1: CHAID Decision tree used in Guarin and Rousseaux (2013) to refine an assumption of the theoretical model. Legend: E = employed, U = unemployed, REL3 = respondent educational level (3 groups), FEL3 = father educational level (3 groups), Age4 = age of the respondent (4 groups).

An example of a decision tree is given in Figure 1. This example comes from a research work on the employment attainment of second-generation immigrants in Switzerland (Guarin and Rousseaux, 2013; Rousseaux and Guarin, 2013). The motivation for using decision tree methods in this article was to refine one of our hypotheses. We expected the father's educational level to play a key role in the ability of his children to get their first job, but we were not able to find evidence of a significant impact with regression models. Therefore, we used decision trees to explore underlying interactions between variables. This analysis highlighted a meaningful interaction: the educational level of the father is the second best explanatory variable for individuals with a low education, whereas for individuals with a middle or high education the second best explanatory variable is age (Figure 1). This interaction led us to refine our theoretical model: the educational level of the father does not play a key role for all individuals but only for individuals with a low educational level. This can be easily explained. Most employers hire individuals on the basis of their grades and skills. Well-educated individuals therefore succeed in getting a job thanks to their education, and the educational level of their father does not play a significant role. But for people unable to take advantage of their education, getting a job is more difficult. For those individuals the father can be a significant help, for example by using his professional relations or by wisely advising his children. Finally, we found empirical evidence for this assumption in our regression models. Like other supervised machine learning methods, decision trees are affected by class imbalance. This impact mainly concerns the splitting criteria used to grow the tree. 
Several methods have been developed to overcome the imbalance and infrequent class issue for decision tree learning, especially for entropy-based tree methods (Lallich, Lenca, and Vaillant, 2007; Marcellin, 2008; Ritschard, Marcellin, and Zighed, 2009; Zighed, Ritschard, and Marcellin, 2010). In this thesis we will review these different methods and put forward one or two new splitting criteria that aim to outperform current methods. The final aim of this work on decision tree methods for imbalanced data is to provide social scientists with an efficient tool to discover profiles of vulnerable individuals. We will then pursue this work by focusing on classifiers based on association rules. Association rules play a major role in mining relevant associations within large data sets. Our motivation for using association rules is to be able to (1) discover which sequences of events may lead to a critical life situation, and (2) discover which events can protect individuals against this critical situation. For example, assuming $C$ is a critical life situation, we would like to discover rules like R1, where $A$ and $B$ are life course events that could explain why the critical situation happened, and the $t_i$ are the times when the events occurred. Then, we would like to highlight exceptions to the rules we found, such as R2, where $Z$ is another life course event.

\[ A_{\text{work traj.}}^{t_1} \land B_{\text{family traj.}}^{t_2} \Rightarrow C_{\text{health traj.}}^{t_3} \qquad \text{(R1)} \]
\[ A_{\text{work traj.}}^{t_1} \land B_{\text{family traj.}}^{t_2} \land Z_{\text{family traj.}}^{t_3} \Rightarrow \bar{C}_{\text{health traj.}} \qquad \text{(R2)} \]

To that end we shall focus on algorithms for mining positive and negative association rules (Antonie and Zaïane, 2004; Wu, C. Zhang, and S. Zhang, 2004; Cornelis et al., 2006). Whereas positive rule mining is only able to assess the impact of events that occur, negative rules are also able to assess the impact of events that do not occur. Furthermore, classical algorithms, such as the apriori algorithm (Agrawal, Imieliński, and Swami, 1993), require the user to specify a minimum support in order to find frequent itemsets. In the case of significantly imbalanced class distributions, this may cause interesting rules about minority classes to be missed. On the other hand, lowering the minimum support would produce too many rules, most of them uninteresting. An interesting approach to overcome this support threshold issue is, for example, to formulate the rule miner as a combinatorial global optimization problem and then solve it with a meta-heuristic optimization method (Yan, C. Zhang, and S. Zhang, 2009). In this thesis our contribution will be on the method used to solve the combinatorial problem. Recently, a new optimization method based on the ability of bees to discover good nectar sources has been proposed, and it seems to outperform other methods in most situations. This method, called artificial bee colony optimization (Karaboga and Basturk, 2007; Karaboga and Basturk, 2008; Karaboga and Akay, 2009), was initially developed for minimizing numerical functions, but has recently been adapted to binary optimization problems (Pampara and Engelbrecht, 2011; M. H. Kashan, Nahavandi, and A. H. Kashan, 2012). In our contribution we will then suggest pruning the discovered rules with a chi-squared test and assessing the quality of a rule by its lift.
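The support threshold issue and the two rule-quality measures mentioned above can be illustrated with a small sketch. It is not the algorithm developed in the thesis: it only computes the standard support, confidence and lift of a single rule, together with the Pearson chi-squared statistic of the 2x2 table crossing antecedent and consequent. The 100 toy life courses, the event names and the 0.10 support threshold are all invented for illustration.

```python
def rule_metrics(transactions, antecedent, consequent):
    """Support, confidence, lift and chi-squared statistic of antecedent => consequent.

    Assumes the antecedent and the consequent each occur at least once
    and neither occurs in every transaction.
    """
    n = len(transactions)
    a = sum(1 for t in transactions if antecedent <= t)   # antecedent present
    c = sum(1 for t in transactions if consequent <= t)   # consequent present
    both = sum(1 for t in transactions if (antecedent | consequent) <= t)

    support = both / n
    confidence = both / a
    lift = confidence / (c / n)

    # Pearson chi-squared statistic on the 2x2 contingency table (1 degree of freedom).
    observed = [[both, a - both], [c - both, n - a - c + both]]
    row_totals = [a, n - a]
    col_totals = [c, n - c]
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed[i][j] - expected) ** 2 / expected
    return {"support": support, "confidence": confidence, "lift": lift, "chi2": chi2}

# Hypothetical toy data: 100 life courses, critical situation "C" observed in only 5.
transactions = (
    [{"A", "B", "C"}] * 4   # events A and B followed by the critical situation
    + [{"A", "B"}] * 2      # A and B without the critical situation
    + [{"C"}] * 1           # critical situation without A and B
    + [{"D"}] * 93          # unrelated life courses
)

m = rule_metrics(transactions, antecedent={"A", "B"}, consequent={"C"})

MIN_SUPPORT = 0.10          # hypothetical threshold
CHI2_CRITICAL_5PCT = 3.841  # chi-squared critical value, 1 degree of freedom, alpha = 0.05

print(m)
print("dropped by the support threshold:", m["support"] < MIN_SUPPORT)
print("kept by the chi-squared pruning:  ", m["chi2"] > CHI2_CRITICAL_5PCT)
```

In this toy setting the rule concerns a rare class, so its support falls below the threshold even though its lift is high. Note that with such small expected counts the chi-squared approximation is itself questionable, so an exact test may be preferable in practice.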

Finally, we will combine both approaches in a unified framework. For example, we want to try starting the data exploration by looking for relevant association rules, and then conducting a decision tree analysis to discover which life profiles support these rules.

Population studies rely strongly on survey data. To cope with the increasing complexity and rigor of current research studies, the structure of databases has become increasingly complex over the last decades, with longitudinal data, network data, spatial data, etc. The increasing volume of structured data complicates the task of both documenting and manipulating data. However, in social sciences, the description of data is crucial. For example, assuming we have in our database a variable named health, what kind of health are we really talking about? Is it physical health or mental health? Was this health self-reported by the respondent, or assessed by a doctor? Is it the respondent's satisfaction with her health? It is not possible for a social scientist to carry out any analysis without more information on the nature of the health measure that was taken on respondents. Furthermore, as social analyses are conducted on a sample of the population studied, it is mandatory to take into account the sampling design of the survey, and also how respondents answer the survey. In particular, it is important to take non-response weights into account, and to document the different reasons why an individual did not answer a specific question. The R software is a popular, multi-platform, open-source and free-of-charge statistical software, which is increasingly used in research and industry. As I want to make my methodological contributions available to the scientific community in a ready-to-use way, I have chosen the R software for releasing my methods. However, the R software does not provide a native framework for survey data management. The native structure for storing data, the data.frame object, does not allow storing survey metadata such as short/long labels for values and variables. Missing values are limited to only one type, which makes it impossible to describe the different types of missing values that can occur in survey sampling. 
The quality of a survey rests on how representative it is of the population studied, and weights are often needed to correct for non-response bias. Although many statistical methods in R are able to deal with weights, the data.frame object does not provide any specific handling of weights. This is why there is a need for specific tools to assist the user in handling these complex data. We argue that to propose an operational solution to the research questions of this thesis, namely providing ad hoc tools for the study of vulnerability across the life course, we have to propose a framework for handling survey data in R, especially biographical data. The R package we will design, called Rsocialdata, aims at facilitating the management of survey data by providing researchers in social sciences with high-level tools for storing, documenting, exploring, recoding and sharing survey data in a secure and efficient way. It will make it possible to fully document both the survey data and the questionnaire design. Instead of storing data and documentation separately, the software will store all the documentation together with the data, and will make it interactively available to users during their analyses. Finally, our two methodological contributions will be integrated into the Rsocialdata package, offering the user an efficient framework for discovering critical events in life courses.
