... The machine learning models we have covered so far can also be interpreted from ...





## Discrete and continuous interpretation of the latent space






...



Many real-life observations involve nonlinear behavior, so the density distribution of the data in feature space usually cannot be described by a single distribution function such as the Gaussian. In engineering, it is common practice to convert a finite, nonlinear system into an infinite combination of linear functions so that we can analyze and predict the system's behavior. For instance, transient heat/mass/momentum transfer with source terms:






## Dimensionality reduction: why is it useful?



IMAGE






The same strategy can also be applied in data-driven learning. We can, for instance, approximate a nonlinear observation distribution as a linear combination of basic distribution functions, as in the case of Gaussian mixtures. With this linearization, we can further convert the observed variables X into discrete latent variables. In a GMM, for instance, we create the latent variables by assigning the observations to specific components of the mixture model (via the EM algorithm).
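As a minimal sketch of this idea (assuming scikit-learn and NumPy are available; the data here is synthetic and illustrative), we can fit a two-component Gaussian mixture and read off the discrete latent variable, i.e. the component assignment, for each observation:

```python
# Sketch: a GMM fitted by EM turns observations into discrete latent
# variables (component assignments). Data and parameters are made up.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Synthetic 1-D data drawn from two well-separated Gaussians.
X = np.concatenate([rng.normal(-5.0, 1.0, 200),
                    rng.normal(5.0, 1.0, 200)]).reshape(-1, 1)

gmm = GaussianMixture(n_components=2, random_state=0).fit(X)  # EM under the hood
z = gmm.predict(X)           # hard assignment: the discrete latent variable
resp = gmm.predict_proba(X)  # soft assignment: E-step responsibilities

print(np.sort(gmm.means_.ravel()))  # recovered component means
```

`predict` gives the hard cluster label, while `predict_proba` returns the per-component responsibilities computed in the E-step; either can serve as the discrete latent representation.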






Observations in the data space also tend to cluster on certain hyperplanes, particularly in high-dimensional systems ("high" typically refers to more than 50-100 dimensions). In such cases, the observations seem to float over these special surfaces, as if continuously. These hyperplanes towards which the data gravitates are the "patterns" we are after. This is somewhat similar to the Lagrangian flow treatment. As an analogy, imagine that you want to observe / control the pathways of flying cars in a futuristic city. In principle, cars can fly in any direction, so it is problematic to organize the flow of traffic. If we wanted to design such a flow of cars, we would create some preferred paths so that we do not end up with chaos. Imagine that we came up with a function Phi that can deduce the paths from a 100-dimensional feature space, including the coordinates x, y, z and many other parameters such as the traffic zone, time, car type, etc. In that case, we can consider Phi itself as a hidden state, embedding all these factors in a much lower dimensionality. In the simplest example, the preferred pathway will be the earthy path in a park rather than the paved roads.

The potential existence of such lower-dimensional, continuous hyperplanes in data space (on which the observations cluster) is the main motivation behind dimensionality reduction techniques. In other words, the practical problems of interest (classification, regression, reduced-order modelling, clustering) may be much easier to solve in these new coordinates (remember the curse of dimensionality?). The question is: how can we find such useful hyperplanes automatically in our data?






...






## Dimensionality reduction methods




... $$S = \frac{1}{N}\sum_{n=1}^{N} (x_n - \overline{x})(x_n - \overline{x})^T$$
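This covariance matrix takes only a few lines to compute. A minimal sketch (assuming NumPy; the data is made up for illustration):

```python
# Sketch: sample covariance matrix S of mean-centred data, matching
# S = (1/N) * sum_n (x_n - xbar)(x_n - xbar)^T.
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 3))     # N=500 observations, D=3 features
xbar = X.mean(axis=0)             # feature-wise mean
Xc = X - xbar                     # centre the data
S = (Xc.T @ Xc) / X.shape[0]      # D x D covariance matrix

# np.cov with bias=True uses the same 1/N normalisation.
assert np.allclose(S, np.cov(X, rowvar=False, bias=True))
```

Note the `1/N` normalisation from the formula above; NumPy's `np.cov` defaults to the unbiased `1/(N-1)` version unless `bias=True` is passed.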





In the third step, we maximize the variance of the data projected onto the new coordinate system. This step consists of several smaller steps.
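The outcome of this maximization can be sketched directly (assuming NumPy; the anisotropic data is synthetic): the unit direction that maximizes the variance of the projected data is the eigenvector of S with the largest eigenvalue, and the variance achieved equals that eigenvalue.

```python
# Sketch: the direction u1 maximizing the variance of the projection u1^T x
# is the leading eigenvector of S; the projected variance equals lambda_1.
import numpy as np

rng = np.random.default_rng(2)
# Anisotropic 2-D data: much more variance along the first axis.
X = rng.normal(size=(1000, 2)) * np.array([3.0, 0.5])
Xc = X - X.mean(axis=0)
S = (Xc.T @ Xc) / X.shape[0]

eigvals, eigvecs = np.linalg.eigh(S)  # eigh: S is symmetric
u1 = eigvecs[:, -1]                   # eigenvector of the largest eigenvalue
proj_var = np.var(Xc @ u1)            # variance of the projected data

# The projected variance matches lambda_1 (up to floating-point error).
assert np.isclose(proj_var, eigvals[-1])
```

Here `u1` recovers the high-variance axis of the data, which is exactly the "useful hyperplane" the earlier section asked for.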















... 