
1 Information Theory and Statistical Learning

1.1 Summary

Machine learning relies heavily on entropy-based (e.g., Rényi entropy) information theory and kernel-based methods. For instance, Parzen-window kernels may be used to estimate various probability density functions, which facilitates expressing information-theoretic concepts as kernel matrices or statistics (e.g., mean vectors) in a Mercer kernel feature space. The parallels between machine learning and information theory allow computational methods from one field to be interpreted and understood in terms of their dual representations in the other.
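
As a concrete illustration, consider Rényi's quadratic entropy, \(H_2(p) = -\log \int p(x)^2\, dx\). Its Parzen-window plug-in estimator reduces to the mean of a Gaussian Gram (kernel) matrix over the sample, the so-called information potential, because the convolution of two Gaussian windows of width \(\sigma\) is a Gaussian of width \(\sigma\sqrt{2}\). The following minimal Python sketch (not part of the original text; the data and bandwidth are hypothetical) demonstrates this kernel-matrix formulation:

```python
# Minimal sketch: Parzen-window estimate of Renyi's quadratic entropy
# H2(p) = -log( integral p(x)^2 dx ) for 1-D data. The plug-in estimate
# is the mean of a Gaussian kernel (Gram) matrix of width sigma*sqrt(2),
# i.e., the "information potential" of the sample.
import numpy as np

def renyi_quadratic_entropy(x, sigma=0.5):
    """Parzen-window estimate of Renyi's quadratic entropy (1-D sample)."""
    x = np.asarray(x, dtype=float)
    s2 = 2.0 * sigma**2                       # variance of the convolved kernel
    diffs = x[:, None] - x[None, :]           # pairwise differences x_i - x_j
    gram = np.exp(-diffs**2 / (2.0 * s2)) / np.sqrt(2.0 * np.pi * s2)
    information_potential = gram.mean()       # (1/n^2) * sum_ij k(x_i, x_j)
    return -np.log(information_potential)

rng = np.random.default_rng(1234)             # hypothetical synthetic samples
narrow = rng.normal(0.0, 0.5, size=500)       # concentrated distribution
wide = rng.normal(0.0, 2.0, size=500)         # dispersed distribution
print(renyi_quadratic_entropy(narrow))        # smaller entropy (more peaked)
print(renyi_quadratic_entropy(wide))          # larger entropy (more spread out)
```

Note how the density estimate never has to be evaluated explicitly; the entropy is computed entirely from pairwise kernel evaluations, which is the duality the paragraph above refers to.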

Machine learning (ML) is the process of data-driven estimation (quantitative, evidence-based learning) of the optimal parameters of a model, network, or system that lead to output prediction, classification, regression, or forecasting based on a specific input (prospective, validation, or testing data, which may or may not be related to the original training data). Parameter optimality is tracked and assessed iteratively by a learning criterion that depends on the specific type of ML problem. Classical learning assessment criteria, such as mean squared error (MSE), accuracy, and \(R^2\) (see Chapter 13), may capture only low-order statistics of the data, e.g., first- or second-order moments. Higher-order learning criteria enable solving problems where sensitivity to higher moments is important, e.g., matching skewness or kurtosis for nonlinear clustering, classification, or dimensionality reduction.
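
To make this contrast concrete, the following minimal Python sketch (illustrative only; the residual samples are synthetic and hypothetical) compares two residual distributions with essentially identical MSE, a second-order criterion, but very different third- and fourth-order moments:

```python
# Minimal sketch: two residual samples with the same second-order fit
# quality (MSE ~ 1) but very different higher-order structure, which only
# skewness and kurtosis reveal.
import numpy as np
from scipy.stats import skew, kurtosis

rng = np.random.default_rng(42)               # hypothetical synthetic residuals
n = 2000
symmetric = rng.normal(0.0, 1.0, size=n)      # Gaussian: skew ~ 0, kurtosis ~ 0
skewed = rng.exponential(1.0, size=n) - 1.0   # mean 0, variance 1, heavily skewed

for name, r in [("symmetric", symmetric), ("skewed", skewed)]:
    mse = np.mean(r**2)                       # second-order criterion
    print(f"{name:9s}  MSE={mse:.3f}  "
          f"skewness={skew(r):.3f}  excess kurtosis={kurtosis(r):.3f}")
```

Both samples score nearly identically under MSE, yet the exponential residuals have skewness near 2 and excess kurtosis near 6, exactly the kind of structure a higher-order learning criterion is designed to penalize or exploit.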

The figure below provides a schematic description of the machine learning workflow.