|
|
## Descriptive learning
|
|
|
|
|
|
...
|
|
|
|
|
|
## Latent state: is there a hidden meaning here, or am I just dreaming?
|
|
|
|
|
|
...
|
|
|
|
|
|
## Clustering
|
|
|
|
|
|
Now our task is to identify groups, so-called clusters, in a multi-dimensional space. Here the dimensionality is a critical piece of information, as clustering typically relies on similarity- and/or distance-based learning strategies. As the number of dimensions in the data space increases, the “distances” between instances grow larger and larger, to such an extent that it becomes very difficult to judge “how similar” two instances really are (the curse of dimensionality).
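To make this concrete, here is a minimal sketch (assuming NumPy is available; the data and numbers are illustrative only) that draws random points in spaces of increasing dimension and compares the nearest and farthest distances from one instance to the rest. As D grows, the two become nearly indistinguishable, so “nearest” loses much of its meaning:

```python
import numpy as np

rng = np.random.default_rng(0)

for d in (2, 10, 100, 1000):
    # 500 random instances in a d-dimensional unit hypercube
    X = rng.random((500, d))
    # Euclidean distances from the first instance to all others
    dists = np.linalg.norm(X[1:] - X[0], axis=1)
    # relative contrast: how much farther is the farthest point than the nearest?
    contrast = (dists.max() - dists.min()) / dists.min()
    print(f"D={d:5d}  nearest={dists.min():.2f}  farthest={dists.max():.2f}  contrast={contrast:.2f}")
```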
|
|
|
|
|
|
I would like to start with the simplest case: imagine that we have K clusters in a D-dimensional data space. A group, or cluster, implies that the instances it contains are “similar” according to some “criterion”. One strategic question here is: “do we have a rough idea about the number of groups in our sample?”. We may not know explicitly why they are similar, or which instance belongs to which group (i.e., the labels), but we can still say that we expect on the order of tens of groups, maybe 5, maybe 35, but not 100. That is the first step. Next, we need to estimate what the characteristics of each of those groups look like as a “stereotype”.
|
|
|
|
|
|
Let’s give an example. Imagine that we have a 2D data space (for ease of visualization) and each instance has two features: x1 and x2. We also suspect that two groups exist (as in the Covid example from Lecture 1, where we had features PC1 and PC2 and two groups of people: sick and healthy). To cluster the groups, we still need a hypothesis on what x1 and x2 could be for healthy and for sick people. We do not know whether that hypothesis is actually true, but we need to start somewhere. For similarity/distance-based analysis, we have to approach the problem in relative terms: since we do not have a reference, we assign one, typically at random.
|
|
|
|
|
|
Let’s talk visually. Imagine that our data is spread in a 2D data space as shown in Fig. 1(i). As the user, we have decided that we expect two classes (shown in blue and red). The first step is to randomly assign the stereotypes for the clusters, μ1 and μ2 (shown as x symbols). Then we compute the distance between each sample and our stereotypes, and assign each sample to the nearest stereotype (Fig. 1(ii, iii)). At this stage, we do not know how accurate our “cluster centres” are. In the third step, we re-evaluate the stereotype centres (rather than assuming them) from the assigned labels (Fig. 1(iv)). In the fourth step, we reassign the samples to the new cluster centres. This procedure is repeated iteratively until a convergence criterion is met.
|
|
|
|
|
|
<img src="uploads/d91ac5ec305465c9366bfc57b3271229/cluster_0.png" width="600">
|
|
|
|
|
|
Fig. 1 Flat clustering: how k-means works
|
|
|
|
|
|
How do we keep track of the convergence criterion? We start by defining a way to represent the cluster assignments. One method to do this is the “1-of-K coding” scheme. Here we have a binary indicator $`r_{nk}`$, where the index k denotes the cluster number and n is the index of the sample. So $`r_{nk}`$ is 1 for the assigned cluster, and zero otherwise. With this encoding, we can define an objective function:
|
|
|
|
|
|
<img src="uploads/9a0b02298b9108f44c4a3e2ddf42f7db/cluster_1.png" width="600">
|
|
|
|
|
|
which is the sum of the squared distances of each example to its assigned cluster centre. Note that in this objective function we know neither the true “r” nor the stereotype centres μ, yet we have only one equation to work with. This means we need an iterative process to get an approximate solution, a procedure similar to what is described above:
|
|
|
|
|
|
1. Choose μ values,
|
|
|
2. Minimize cost function J with respect to r while keeping μ fixed,
|
|
|
3. Minimize cost function J with respect to μ while keeping r fixed.
|
|
|
|
|
|
Since the observations (n) are independent, we can find the best r for each sample separately; that is, the best class we can assign given the current stereotypes μ (step 1 in the figure below). By doing so, we fill up the matrix r. In the next step we want to minimize the cost J, so we take the partial derivative with respect to the positions of the stereotypes and set it equal to zero (step 2 in the figure below). Here r is fixed: we accept the current class assignments as true. Given that, we can work out the best stereotype for the current distribution; we basically solve for μ for every class. If we examine the equation a bit more closely, we see that the ratio hints at what we are doing: the denominator counts how many examples belong to class k, so in effect we take the average of the examples in X that belong to cluster k (we sum them up and divide by the number of instances in that class). As a matter of fact, this is why the method is called “k-means”: we use the mean behaviour as the stereotype.
|
|
|
|
|
|
<img src="uploads/4ba45b79f70386fef8acbd14a7b30527/cluster_2.png" width="600">
|
|
|
|
|
|
In this descriptive learning process, we repeat the above two steps several times, until we reach a convergence criterion for J, which is typically user-defined:
|
|
|
|
|
|
<img src="uploads/95dab4e8fcb08691b5f97a0012521dc1/cluster_3.png" width="300">
|
|
|
|
|
|
Another input from the user is the number of clusters to find. If we do not have prior knowledge, this is also found iteratively: we simply increase the number of clusters gradually, which produces a curve similar to the iteration plot, and the user decides (manually or automatically) on the number of clusters in the dataset. We should be aware that the similarity measure is distance-based, Euclidean to be precise. It has three practical limitations to remember: (i) it is not robust to outliers, (ii) it becomes ineffective when features are categorical (for example, only 0 or 1), and (iii) everything is categorized as “either black or white”, meaning there is no measure of closeness or confidence in the model predictions (an instance at the stereotype centre and one at the boundary are treated the same).
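A minimal sketch of this “increase the number of clusters and look at the curve” idea (often called the elbow method), assuming scikit-learn is available; the toy 2D data here is only a stand-in for the Fig. 1 example:

```python
import numpy as np
from sklearn.cluster import KMeans

# toy 2D data: two well-separated blobs
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (100, 2)),
               rng.normal(5.0, 1.0, (100, 2))])

# run k-means for a growing number of clusters and record the final J (inertia)
for k in range(1, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(f"k={k}  J={km.inertia_:.1f}")
# J drops sharply until the true number of clusters, then flattens (the "elbow")
```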
|
|
|
|
|
|
### Gaussian Mixtures
|
|
|
|
|
|
We have just discussed that in k-means we see everything in either black or white (similar to hard classification). This is reflected in the r matrix: in each row, all but one element is zero. From a probabilistic view, it means that we assign 100% probability to one cluster and set all other probabilities to zero, which is too much of an idealization. Alternatively, we can assign and learn a probabilistic representation of the matrix r. This is, in short, what we do with Gaussian mixtures (GM). Compared with k-means, the other important effect of adding probabilities is the way the model builds its decision boundaries. In 2D, k-means effectively draws “circles” centred on the stereotypes; in GM, the boundary is not necessarily circular.
|
|
|
|
|
|
How does the model work? As the name implies, we define the cluster boundaries by combining multiple Gaussian distributions. Therefore, we represent each stereotype with two quantities: a mean and a variance (more generally, a covariance), which lets us express the uncertainty in our clustering.
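For reference (the standard form of the model, assumed here rather than taken from the figures below), a Gaussian mixture writes the density of an instance $`x`$ as a weighted sum of K Gaussians:

```math
p(x) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k), \qquad \sum_{k=1}^{K} \pi_k = 1
```

where the weights $`\pi_k`$ are the mixing coefficients discussed in the next paragraph.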
|
|
|
|
|
|
In k-means, we saw how to apply an iterative algorithm to learn the stereotype cluster centres. In GM models, we use the [Expectation-Maximization (EM) algorithm](https://en.wikipedia.org/wiki/Expectation%E2%80%93maximization_algorithm). The idea behind it is the same. In the expectation step, we again calculate the r matrix, but this time each entry is the probability that an instance belongs to a given cluster, computed from the assumed means and variances (the stereotypes). Note that we have not one but several Gaussians representing the cluster distributions, so we also need to initialize a weight vector (the mixing coefficients), which combines the Gaussians according to their weights. Then, in the maximization step, for each cluster we find the new stereotype properties:
|
|
|
|
|
|
<img src="uploads/ea5baf619f11295c145bb1639e46a8d4/cluster_4.png" width="600">
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|