<img src="uploads/1c4d469427536cf5c7d70e600e3957a7/16b50c3a43dfb9b145fe76dd815afba6.jpg" width="600">
|
|
|
|
|
|
[Hand Shadows, Acrylic Print by Zapista OU](https://fineartamerica.com/featured/hand-shadows-taylan-apukovska.html)
|
|
|
</div>
|
|
|
|
|
|
"What is essential is invisible to the eye."
|
|
|
—The Little Prince, Antoine De Saint-Exupéry
|
|
|
|
|
|
</div>
|
|
|
|
|
|
|
|
|
|
|
|
In almost all real-life scenarios, the variables we are interested is not directly measurable (or are not practical to measure), “hidden” from us. We typically infer them by using “observable” variables. Think about stress. In modern life, stress is an inevitable ingredient. We cannot measure it directly, but make an estimate (modelling) via some observable variables such as sleep patterns, sweating, heart rate etc. This is more or less the case for a majority of medical diagnosis. Such a report would include observed symptoms, treatments applied, additional physiological measurements (e.g. blood analysis), opinions of the doctors, but not the disease itself. It is what is deduced from all this data.
|
|
|
|
|
|
The machine learning models we have covered so far can also be interpreted from this perspective, and as a matter of fact, this is what makes “learning” possible. Even in the simplest models (linear regression, k-means clustering), we learn the relationship between the observed variables (X) and the hidden variables (such as model weights, cluster stereotypes μ) during the learning process. In a bold summary, we can say that data driven learning perspective is built upon the existence of these latent states. The latent variables do not have to any meaning (such as cluster centres), for humans at least, and can be considered as an alternative feature space where learning is (hopefully) easier.
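
To make this concrete, here is a minimal sketch (using scikit-learn on synthetic data invented purely for illustration) of the two kinds of hidden quantities k-means infers from the observations X alone: the discrete assignments and the cluster stereotypes μ.

```python
import numpy as np
from sklearn.cluster import KMeans

# Observable variables X: two synthetic blobs in 2D.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=(0.0, 0.0), scale=0.5, size=(100, 2)),
               rng.normal(loc=(3.0, 3.0), scale=0.5, size=(100, 2))])

# Fitting infers the hidden quantities from X alone.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
z = kmeans.labels_            # discrete latent variable: cluster assignment per point
mu = kmeans.cluster_centers_  # continuous latent variables: the cluster stereotypes

print("inferred stereotypes (mu):\n", mu)
print("first five latent assignments (z):", z[:5])
```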

Many real-life observations include nonlinear behavior; hence, the density distribution of the data in feature space usually cannot be described by a single distribution function such as the Gaussian distribution. In engineering, it is common practice to convert a finite, nonlinear system into an infinite combination of linear functions so that we can analyze and predict the system behavior. For instance, transient heat/mass/momentum transfer with source terms:

<div align="center">

<img src="uploads/10baf30be90723e9cc23730b2309f920/dmd_1.png" width="600">

</div>
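
To make the idea explicit (as a generic sketch, not necessarily the exact decomposition shown in the figure), such a linearization expands the unknown field in a set of fixed linear basis functions, leaving only the expansion coefficients to be determined:

```math
u(x, t) \approx \sum_{n=1}^{\infty} a_n(t)\, \phi_n(x)
```

Each basis function φₙ is a simple, linear building block; the nonlinearity of the original system is absorbed into how the coefficients aₙ(t) evolve and combine.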

The same strategy can also be applied in data-driven learning. We can, for instance, approximate a nonlinear observation distribution as a linear combination of basic distribution functions, as in the case of Gaussian mixtures. With this linearization, we can further convert the observed variables X into discrete latent variables. In a GMM, for instance, we create the latent variables by assigning the observations to specific components of the mixture model (via the EM algorithm).
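
As a minimal sketch of this assignment step (scikit-learn's GaussianMixture on made-up one-dimensional data), EM fits the mixture weights and component parameters, and the per-point component assignments are exactly the discrete latent variables:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Synthetic observations drawn from two overlapping Gaussians.
rng = np.random.default_rng(1)
X = np.concatenate([rng.normal(-2.0, 1.0, 300),
                    rng.normal(2.0, 1.0, 300)]).reshape(-1, 1)

# EM fits p(x) = sum_k pi_k * N(x | mu_k, sigma_k^2).
gmm = GaussianMixture(n_components=2, random_state=0).fit(X)

z = gmm.predict(X)           # hard assignments: discrete latent variable per point
resp = gmm.predict_proba(X)  # soft assignments: the E-step responsibilities

print("mixture weights (pi):", gmm.weights_)
print("component means (mu):", gmm.means_.ravel())
print("first three responsibilities:\n", resp[:3])
```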

We have already discussed a component analysis method, PCA. At this point, you may ask: what is the difference between PCA and ICA? First of all, note that what we aim for here is different. In PCA, we aim for maximum variance, which is a weaker constraint than independence. The principal components are merely uncorrelated with one another, and each component may still carry information from more than one source dimension; this is typically the case, since we project multiple coordinates onto the principal components. It is better seen on a plot:

<div align="center">

<img src="uploads/876dc45febab4a5593bcce12383aa455/ica_1.png" width="600">

</div>

In PCA, the way we extract the unit vectors of the new coordinate system relies on the variance. If we apply it to a problem composed of two independent phenomena, it will lead to a merged transformation, which is definitely wrong (see the middle panel). In such cases, we first aim to filter out these "independent" behaviors within the data (see the right panel). So, how are we going to do that?
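
Before deriving the answer, here is a minimal numerical illustration (a sketch using scikit-learn's FastICA as one off-the-shelf ICA implementation, with two invented source signals mixed as X = S·Aᵀ): PCA returns decorrelated but still merged components, while ICA recovers the independent sources up to scale and ordering.

```python
import numpy as np
from sklearn.decomposition import PCA, FastICA

# Two independent source signals S (two distinct phenomena).
t = np.linspace(0, 8, 2000)
S = np.c_[np.sin(2 * t),               # source 1: sinusoid
          np.sign(np.sin(3 * t))]      # source 2: square wave
S += 0.05 * np.random.default_rng(0).normal(size=S.shape)

# Observations X: linear mixtures of the sources (rows are samples).
A = np.array([[1.0, 1.0],
              [0.5, 2.0]])
X = S @ A.T

# PCA decorrelates (maximum variance), but its components stay merged.
X_pca = PCA(n_components=2).fit_transform(X)

# ICA enforces statistical independence and un-mixes the sources.
X_ica = FastICA(n_components=2, random_state=0).fit_transform(X)

# Compare each recovered component against the true sources.
for name, Y in [("PCA", X_pca), ("ICA", X_ica)]:
    corr = np.corrcoef(S.T, Y.T)[:2, 2:]
    print(name, "|corr(source, component)|:\n", np.abs(corr).round(2))
```

Typically, each ICA component correlates strongly with exactly one source, while each PCA component mixes both: precisely the "merged transformation" seen in the middle panel above.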