... | ... | @@ -90,15 +90,20 @@ One of the most common measures is the Euclidean distance, giving the dissimilar |
```math
Distance(x_{i},x_{i'}) = \sqrt{ \sum_{m=1}^{M} (x_{im}-x_{i'm})^2 }
```
which is nothing but the l2 norm we discussed in Chapter 2. This definition can be thought of as the distance "as the crow flies": the straight-line (shortest) path a bird would take through the feature space. Note that the differences between feature values are squared, so the overall distance is dominated by the large differences, while the small ones contribute very little. Here we also see the reasoning behind feature rescaling: if one feature ranges over 10,000-20,000 while another varies between 0.0001 and 0.001, the first feature will dominate the distance between instances. Features should therefore be scaled to a comparable range, so that the distance is not unrealistically biased toward some of them.
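
To make the scaling argument concrete, here is a minimal sketch in plain Python. The data values, the helper names (`euclidean_distance`, `min_max_scale`) and the choice of min-max scaling are all assumptions made for illustration; other rescaling schemes would serve the same purpose.

```python
import math


def euclidean_distance(a, b):
    """Square root of the summed squared feature differences (l2 norm of a - b)."""
    return math.sqrt(sum((am - bm) ** 2 for am, bm in zip(a, b)))


def min_max_scale(column):
    """Linearly rescale one feature column to [0, 1]; assumes it is not constant."""
    lo, hi = min(column), max(column)
    return [(value - lo) / (hi - lo) for value in column]


# Made-up data: one list per feature, one entry per instance.
feature_1 = [10_000.0, 20_000.0, 15_000.0]  # range 10,000-20,000
feature_2 = [0.0001, 0.001, 0.0005]         # range 0.0001-0.001

# Unscaled: instances 0 and 1 sit at opposite ends of BOTH features,
# yet the distance is almost exactly the feature-1 gap of 10,000.
instances = list(zip(feature_1, feature_2))
raw = euclidean_distance(instances[0], instances[1])

# Scaled: each feature now spans [0, 1], so both gaps contribute equally.
scaled = list(zip(min_max_scale(feature_1), min_max_scale(feature_2)))
rescaled = euclidean_distance(scaled[0], scaled[1])

print(raw, rescaled)
```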
Another popular option is the city block distance:
```math
Distance(x_{i},x_{i'}) = \sum_{m=1}^{M} \left| x_{im}-x_{i'm} \right|
```
In this case, we sum the absolute differences between feature values, not their squares. This is what we did with the l1 norm in the regression analysis. We can visualize it as driving a car from point A to point B in a city: this definition tells us how many city blocks (buildings) we have to travel in total, horizontally and vertically, for that journey.
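
The car journey can be sketched in a few lines of Python. The grid coordinates and the function name `city_block_distance` are hypothetical, chosen only for illustration; `math.dist` (Python 3.8+) gives the straight-line distance for comparison.

```python
import math


def city_block_distance(a, b):
    """Sum of the absolute feature differences (l1 norm of a - b)."""
    return sum(abs(am - bm) for am, bm in zip(a, b))


# Made-up street-grid positions: (column, row) of points A and B.
point_a = (1, 2)
point_b = (4, 6)

# The car must cover 3 blocks horizontally and 4 blocks vertically.
blocks = city_block_distance(point_a, point_b)

# The bird flying straight from A to B covers a shorter path.
straight_line = math.dist(point_a, point_b)

print(blocks, straight_line)  # 7 5.0
```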

Another important distance in ML is the projection-based cosine distance, which compares the directions of two feature vectors rather than their magnitudes. It is used particularly in text processing. Imagine that you want to classify novels according to their genre: here we can project the phrases onto classes such as Horror, Historical, Romance, etc. The point to remember is that distance-based learning is used in many of the models we studied, and how you define the distance will change the results.
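
As a rough sketch of the idea, the snippet below takes the cosine distance to be one minus the cosine of the angle between two vectors. That is a common convention, but it is an assumption here, since no formula is given above. The vocabulary, the word counts and the function name `cosine_distance` are invented for illustration.

```python
import math


def cosine_distance(a, b):
    """1 - cos(angle between a and b); assumes neither vector is all zeros."""
    dot = sum(am * bm for am, bm in zip(a, b))
    norm_a = math.sqrt(sum(am * am for am in a))
    norm_b = math.sqrt(sum(bm * bm for bm in b))
    return 1.0 - dot / (norm_a * norm_b)


# Made-up word counts over the vocabulary ("ghost", "castle", "kiss").
horror_excerpt = [4, 2, 0]
long_horror_novel = [40, 20, 0]  # same proportions, ten times longer
romance_excerpt = [0, 1, 5]

# Same direction, different length: the distance is (numerically) zero.
same_direction = cosine_distance(horror_excerpt, long_horror_novel)

# Mostly different words: the distance is close to its value for
# perpendicular vectors, which is 1.
different_direction = cosine_distance(horror_excerpt, romance_excerpt)

print(same_direction, different_direction)
```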

For more distance measures, you can also check the post [17 types of similarity and dissimilarity measures](https://towardsdatascience.com/17-types-of-similarity-and-dissimilarity-measures-used-in-data-science-3eb914d2681).
## Additional materials