<img src="uploads/ea5baf619f11295c145bb1639e46a8d4/cluster_4.png" width="400">

### Hierarchical clustering
So far we have discussed how to partition instances into discrete groups with k-means and Gaussian mixtures. This way of grouping the samples is called “flat clustering”. Another strategy is hierarchical clustering, where we build “trees”. In flat clustering, all clusters live on the same “plane”: a sample belongs to only one cluster (with different degrees of certainty, depending on the model details). In hierarchical clustering (imagine decision trees), we create groups that are nested inside each other. The tree structure can be built top-down (like a decision tree), which is called divisive clustering. Alternatively, it can be built bottom-up (an inverted decision tree), which is called agglomerative clustering. The tree-building process does not optimize an objective function; samples are merged / split into groups purely by comparison. Therefore, it will always cluster the samples, even if there is no hidden structure in the observations (white noise).

In agglomerative clustering, we start with an over-fitted tree, where each leaf contains a single sample. At each iteration, we merge the two most similar groups into a sub-group (this takes O(N²) time, since we compare all pairs). Iterations continue until we reach a single group (O(N³) time in total). With a smarter technique (a priority queue), the time can be reduced to O(N² log N). Once the tree is created, it can be cut at any desired level, giving K clusters (analogous to pruning).
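A minimal sketch of this bottom-up loop is given below; the NumPy-based code and all names in it are illustrative and not part of the original notes. It assumes Euclidean distances, uses a single-link merge rule, and keeps the naive O(N³) pairwise scan described above:

```python
import numpy as np

def agglomerative(X, K):
    """Naive bottom-up clustering: start with one leaf per sample,
    repeatedly merge the two closest clusters until K remain."""
    # Each cluster is a list of sample indices; initially one per sample.
    clusters = [[i] for i in range(len(X))]
    # Pairwise Euclidean distances between samples (computed once).
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)

    while len(clusters) > K:                 # stopping early = "cutting" the tree
        best = (None, None, np.inf)
        for a in range(len(clusters)):       # O(N^2) scan over cluster pairs
            for b in range(a + 1, len(clusters)):
                # Single-link rule: distance of the closest pair of members.
                d = D[np.ix_(clusters[a], clusters[b])].min()
                if d < best[2]:
                    best = (a, b, d)
        a, b, _ = best
        clusters[a] += clusters[b]           # merge cluster b into cluster a
        del clusters[b]
    return clusters

# Example: two well-separated blobs should end up in two clusters.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (10, 2)), rng.normal(8, 1, (10, 2))])
print(agglomerative(X, K=2))
```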
In order to group the observations, we need a measure of distance. Once that is chosen, we need to decide how to compare the similarity between two groups of observations. There are three common ways of making that comparison:

1. If the distance between two groups A and B is measured between their closest members, this is called “single link”.
2. If the distance between two groups A and B is measured between their furthest members, this is called “complete link”.
3. If the average distance over all pairs of members is used, it is called “average link” (see the sketch after this list).
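The sketch expresses the three linkage rules directly in terms of all pairwise distances between the members of two groups; it is illustrative Python, not taken from the original notes, and assumes NumPy arrays of row vectors with a Euclidean metric:

```python
import numpy as np

def pairwise_distances(A, B):
    """All pairwise Euclidean distances between members of groups A and B."""
    return np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1)

def single_link(A, B):
    return pairwise_distances(A, B).min()    # closest pair of members

def complete_link(A, B):
    return pairwise_distances(A, B).max()    # furthest pair of members

def average_link(A, B):
    return pairwise_distances(A, B).mean()   # average over all pairs

A = np.array([[0.0, 0.0], [1.0, 0.0]])
B = np.array([[4.0, 0.0], [6.0, 0.0]])
print(single_link(A, B), complete_link(A, B), average_link(A, B))  # 3.0 6.0 4.5
```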
The third method is usually the preferred one. In any case, selecting the right number of clusters is not easy. Once the tree is constructed, we search for large gaps in the merge distances, which indicate strong dissimilarity between the merged sub-groups. The gap detection can be done by the user by inspecting the tree. Alternatively, we can follow a [Bayesian treatment](https://www2.stat.duke.edu/~kheller/bhcnew.pdf).
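As a practical illustration (this sketch uses SciPy, which the notes do not mention; the data and parameter choices are assumptions), the tree can be built with a chosen linkage rule, the merge distances inspected for gaps, and the tree cut into K clusters:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(10, 1, (20, 2))])

# Build the tree bottom-up with average-link on Euclidean distances.
Z = linkage(pdist(X, metric="euclidean"), method="average")

# Column 2 of Z holds the distance at each merge; a large jump ("gap")
# between consecutive merges suggests where to cut the tree.
gaps = np.diff(Z[:, 2])
print("largest gap occurs before merge:", int(np.argmax(gaps)) + 1)

# Cut the tree so that at most K clusters remain (analogous to pruning).
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)
```

Setting `method` to "single" or "complete" instead of "average" reproduces the other two linkage rules listed above.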