International Workshop on High Dimensional Data Mining 2016
7 June 2016, Naples (Italy)

Invited speakers

 

 


Edwin Diday, CEREMADE, Paris-Dauphine University (France)

E. Diday is Professor of Exceptional Class in Computer Science and Mathematics at the University of Paris-Dauphine. He was a project leader at INRIA until 1994 and the scientific manager of two EUROSTAT projects, SODAS and ASSO (17 teams from 9 European countries), until 2003; he was involved in three other European consortia. He serves on the editorial board of the book series "Studies in Classification, Data Analysis and Knowledge Organization" and of several international journals. He is the author or editor of 14 books and of more than 50 refereed papers, and more than 50 doctoral dissertations have been completed under his direction. He is a past president of the Francophone Society of Classification and a member of the International Statistical Institute. His most recent contributions concern maximal and stochastic Galois lattices, Symbolic Data Analysis and spatial classification. He is a laureate of the Montyon Prize awarded by the French Academy of Sciences.

 

 

Thinking by classes in Data Science:

Symbolic Data Analysis framework and principles

Summary: Data Science, considered as a science in itself, is, in general terms, the extraction of knowledge from data. Basically, the input of Data Science is a standard set of "individual entities" (the statistical units), described by a set of qualitative or quantitative variables. Symbolic Data Analysis (SDA) gives a new way of thinking in Data Science by extending the standard input to a set of classes of individual entities: classes of a given population are considered to be the units of a higher-level population to be studied. Such classes often represent the real units of interest and provide a summary of the population. In order to take the variability between the members of each class into account, the classes are described by intervals, distributions, sets of (sometimes weighted) categories or numbers, and the like. In that way, we obtain new kinds of data, called "symbolic" because they cannot be reduced to numbers without losing much information. The first step in SDA is to build the symbolic data table, whose rows are classes and whose columns are symbolic-valued variables. The second step is to study and extract new knowledge from these new kinds of data, at the least by extending computational statistics and data mining to symbolic data. We show that SDA is a new paradigm which opens up a vast domain of research and applications by giving results complementary to those of classical methods applied to standard data. SDA also answers the big data and complex data challenges, as big data can be reduced and summarized by classes, and as complex data with multiple unstructured data tables and non-paired variables can be transformed into a structured data table with paired symbolic-valued variables. We give some principles and industrial applications.
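
To make the first SDA step concrete, here is a minimal Python sketch (an illustration, not part of the talk) that aggregates individual-level records into an interval-valued symbolic data table with pandas; the dataset and its column names are hypothetical.

    # Minimal sketch of SDA step 1: build an interval-valued symbolic data table.
    # The individual-level records and column names below are hypothetical.
    import pandas as pd

    # One row per individual entity (statistical unit).
    individuals = pd.DataFrame({
        "city":        ["Naples", "Naples", "Paris", "Paris", "Paris"],
        "temperature": [18.2, 22.5, 11.0, 14.3, 12.8],
        "humidity":    [71, 64, 80, 75, 78],
    })

    # Classes (here, cities) become the units of a higher-level population;
    # each numeric variable is summarized by an interval [min, max] so that
    # the variability between the members of each class is retained.
    symbolic_table = individuals.groupby("city").agg(["min", "max"])
    print(symbolic_table)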

 

 


Gilbert Saporta, CEDRIC, Conservatoire National des Arts et Métiers, Paris

Emeritus Professor of Applied Statistics at CNAM
EDUCATION: Ingénieur de l'Ecole Centrale des Arts et Manufactures (1968); Master in mathematical statistics, Université Paris VI (1969); graduate of the Institut de Statistique de l'Université de Paris (1970); Docteur (Ph.D.) in Statistics, Université Paris VI (1975); Docteur ès Sciences Mathématiques, Université Paris VI (1981).
RESEARCH: Data science, Categorical and mixed data analysis by means of optimal scaling techniques and correspondence analysis, Functional Data Analysis, Supervised classification with applications in risk analysis (credit scoring), Confidence region and validity in data analysis, Data mining, Big Data and Statistical Learning Theory.
ASSOCIATIONS: President of the Foundation "La Science Statistique"; Past President of IASC (International Association for Statistical Computing) 2005-2007; Former Vice-President of ISI (International Statistical Institute) 2005-2007; Former President of the Société Française de Statistique 2000-2002; Former President of the Société de Statistique de France and of the Société de Statistique de Paris; Former President of ASU (Association pour la Statistique et ses Utilisations) 1986-1988; Member of the International Biometric Society, the Société Francophone de Classification, the Psychometric Society, the International Association for Statistical Education and the Societa Italiana di Statistica; Honorary member of the Societatea de Probabilitati si Statistica din Romania. EDITORIAL BOARDS: Journal of Classification, Applied Stochastic Models in Business and Industry, Journal of Symbolic Data Analysis, Advances in Data Analysis and Classification.

 

Some "sparse" methods for high dimensional data


Summary: High dimensional data means that the number of variables p is far larger than the number of observations n. This occurs in fields such as genomics or chemometrics.
When p > n the OLS estimator does not exist for linear regression. Since this is a case of forced multicollinearity, one may use regularized methods such as ridge regression, principal component regression or PLS regression: these methods provide rather robust estimates through a dimension reduction approach or through constraints on the regression coefficients. The fact that all the predictors are kept may be a positive point in some cases. However, when p >> n it becomes a drawback, since a combination of thousands of variables cannot be interpreted. Sparse combinations, i.e. combinations with a large number of zero coefficients, are then preferred. The lasso, the elastic net and sparse PLS (sPLS) perform regularization and variable selection simultaneously thanks to non-quadratic penalties: L1, SCAD, etc. The group lasso is a generalization adapted to the case where the explanatory variables are structured in blocks. Recent works include sparse discriminant analysis and sparse canonical correlation analysis.
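
As an illustration of the preceding paragraph (not taken from the talk), the following Python sketch fits a lasso in a p >> n setting with scikit-learn; the dimensions, the simulated data and the penalty level alpha are arbitrary choices.

    # Sketch: lasso regression when p >> n. The L1 penalty performs
    # regularization and variable selection at once, driving most
    # coefficients exactly to zero.
    import numpy as np
    from sklearn.linear_model import Lasso

    rng = np.random.default_rng(0)
    n, p = 50, 1000                          # far more variables than observations
    X = rng.standard_normal((n, p))
    beta = np.zeros(p)
    beta[:5] = [3.0, -2.0, 1.5, 1.0, -1.0]   # only five truly active predictors
    y = X @ beta + 0.1 * rng.standard_normal(n)

    model = Lasso(alpha=0.1, max_iter=10000).fit(X, y)
    print("selected variables:", int(np.sum(model.coef_ != 0)))
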
In PCA, the singular value decomposition shows that if we regress the principal components onto the input variables, the vector of regression coefficients is equal to the vector of factor loadings. It then suffices to adapt sparse regression techniques to obtain sparse versions of PCA. Sparse Multiple Correspondence Analysis is derived from the group lasso with groups of indicator variables.
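
This regression view of PCA can be sketched in a few lines (again an illustration, not the speaker's code): regressing the first principal component on the centered variables recovers its loadings, and replacing ordinary least squares by an L1-penalized regression gives a sparse approximation of them.

    # Sketch: sparse loadings for PCA via the regression property.
    # OLS of the component scores on the centered X returns the exact
    # loadings; a lasso fit returns a sparse approximation of them.
    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.linear_model import Lasso

    rng = np.random.default_rng(0)
    X = rng.standard_normal((100, 20))
    X -= X.mean(axis=0)                      # center the variables

    scores = PCA(n_components=1).fit_transform(X).ravel()
    sparse_loadings = Lasso(alpha=0.05).fit(X, scores).coef_
    print("non-zero loadings:", int(np.sum(sparse_loadings != 0)))
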
Finally, when one has a large number of observations, unobserved heterogeneity frequently occurs: there is no single model but several local models, one for each cluster of a latent variable. Clusterwise methods optimize the partition and the local models simultaneously; they have already been extended to PLS regression. We will present here CS-PLS (Clusterwise Sparse PLS), a combination of clusterwise PLS and sPLS which is well suited to big data: large n, large p.
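
The alternating logic of clusterwise methods can be sketched as follows; this is a simplified illustration with ordinary least squares as the local model, whereas CS-PLS itself uses sparse PLS fits instead.

    # Sketch of clusterwise regression: alternate between (1) fitting one
    # local model per cluster and (2) reassigning each observation to the
    # local model with the smallest squared residual.
    import numpy as np
    from sklearn.linear_model import LinearRegression

    def clusterwise_regression(X, y, k=2, n_iter=20, seed=0):
        rng = np.random.default_rng(seed)
        labels = rng.integers(k, size=len(y))        # random initial partition
        for _ in range(n_iter):
            models = []
            for j in range(k):
                mask = labels == j
                if mask.sum() < 2:                   # guard against empty clusters
                    mask = np.ones(len(y), dtype=bool)
                models.append(LinearRegression().fit(X[mask], y[mask]))
            residuals = np.column_stack([(y - m.predict(X)) ** 2 for m in models])
            labels = residuals.argmin(axis=1)        # reassign to the best model
        return labels, models
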

 

 


Michel Verleysen, Université Catholique de Louvain, Belgium


Michel Verleysen is a Professor of Machine Learning at the Université catholique de Louvain, Belgium.  He was an invited professor at the E.P.F.L. (Ecole Polytechnique Fédérale de Lausanne, Switzerland) in 1992, at the Université d'Evry Val d'Essonne (France) in 2001, at the Université Paris I Panthéon-Sorbonne from 2002 to 2011, and at the Université Paris-Est in 2014.  He is an Honorary Research Director of the Belgian F.N.R.S. (National Fund for Scientific Research) and the Dean of the Louvain School of Engineering. He is editor-in-chief of the journal Neural Processing Letters (published by Springer), chairman of the annual ESANN conference (European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning), past associate editor of the IEEE Transactions on Neural Networks, and a member of the editorial boards and program committees of several journals and conferences on neural networks and learning. He was chairman of the IEEE Computational Intelligence Society Benelux chapter (2008-2010) and a member of the executive board of the European Neural Networks Society (2005-2010).  He is the author or co-author of more than 250 scientific papers in international journals, in books, or in conferences with reviewing committees. He is the co-author of a scientific popularization book on artificial neural networks in the series "Que Sais-Je?", in French, and of the book "Nonlinear Dimensionality Reduction" published by Springer in 2007. His research interests include machine learning, feature selection, nonlinear dimensionality reduction, visualization, high-dimensional data analysis, self-organization, time-series forecasting and biomedical signal processing.

 

Nonlinear dimensionality reduction

Summary: Dimensionality reduction (DR) aims at providing faithful low-dimensional (LD) representations of high-dimensional (HD) data. DR is a ubiquitous tool in many branches of science, such as sociology, psychometrics and statistics, and, more recently, in (big) data mining. A faithful LD representation of HD data preserves key properties of the original data.  A number of DR techniques have been developed recently; they differ in the key property they optimize.  For example, choosing between preserving Euclidean distances, geodesic distances or similarities largely influences the resulting representation.  As DR techniques are mostly unsupervised, the choice between key properties, or between objective functions to optimize, is not obvious and must be related to the goal of applying DR to the specific data at hand.  Quality assessment of DR techniques is thus also an important issue.  This talk presents modern DR methods relying on distance, neighborhood or similarity preservation, using either spectral methods or nonlinear optimization tools.  The talk will cover important issues such as scalability to big data, user interaction for dynamic exploration, reproducibility and stability.
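
As a small illustration of these ideas (not the speaker's material), the following Python sketch embeds the classic Swiss-roll dataset with two scikit-learn methods that optimize different key properties, and compares them with trustworthiness, one possible quality-assessment criterion; all settings below are arbitrary.

    # Sketch: two DR methods preserving different key properties, assessed
    # with trustworthiness (how well LD neighborhoods match HD ones).
    from sklearn.datasets import make_swiss_roll
    from sklearn.manifold import Isomap, TSNE, trustworthiness

    X, _ = make_swiss_roll(n_samples=1000, random_state=0)

    methods = {
        "Isomap (geodesic distances)": Isomap(n_components=2),
        "t-SNE (similarities)":        TSNE(n_components=2, random_state=0),
    }
    for name, method in methods.items():
        Y = method.fit_transform(X)
        print(name, "trustworthiness:",
              round(trustworthiness(X, Y, n_neighbors=10), 3))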
