...  ...  @@ 157,8 +157,62 @@ S = WX 





All we need to do is find the values of $`\alpha_{ij}`$ in W. Note that we need to find W when A is unknown, so we cannot simply invert A. What we do know is that W defines vectors in the mixture space, and each vector (e.g. $`[\alpha_{11},\alpha_{12}]`$) extracts one source signal (here $`s_{1}`$). In the ICA sketch above, each of these vectors must be orthogonal to the samples associated with all sources except the one it extracts. So we need to find W such that each vector in W is orthogonal to all sources but one. Now we are getting closer to defining an optimization problem.









We also said that we are after independent signals. By saying so, we assume that the sources are independent, or at least more independent than the mixed signals. With this assumption, we can say: "I will find the W that maximizes the independence of the extracted signals." In the simplest (brute-force) approach, we can take a vector $`w_{1}`$ ($`[\alpha_{11},\alpha_{12}]`$) and try different orientations by rotating it around the origin. For each candidate $`w_{1}`$, we compute the corresponding $`s_{1}`$ and keep the one with maximum independence. A smarter move would be to use a gradient-based search algorithm, but this is the core idea.
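The brute-force rotation search can be sketched as follows. This is only an illustration under assumptions that are not part of the lecture: the two sources are uniform noise, the mixing matrix `A` is made up, and the absolute excess kurtosis of the extracted signal is used as a stand-in score for "independence" (other scores are possible).

```python
import numpy as np

def excess_kurtosis(y):
    """Fourth standardized moment minus 3 (zero for a Gaussian)."""
    y = y - y.mean()
    return np.mean(y**4) / np.mean(y**2) ** 2 - 3.0

def brute_force_direction(X, n_angles=360):
    """Rotate a unit vector w over half a circle and keep the best score.

    X has shape (2, N): two mixed signals with N samples each.
    """
    best_w, best_score = None, -np.inf
    for theta in np.linspace(0.0, np.pi, n_angles, endpoint=False):
        w = np.array([np.cos(theta), np.sin(theta)])  # candidate w_1
        y = w @ X                                     # candidate s_1
        score = abs(excess_kurtosis(y))               # assumed score
        if score > best_score:
            best_w, best_score = w, score
    return best_w, best_score

# Hypothetical data: two independent uniform sources and a made-up A.
rng = np.random.default_rng(0)
S = rng.uniform(-1.0, 1.0, size=(2, 5000))
A = np.array([[1.0, 0.5],
              [0.3, 1.0]])
X = A @ S

w1, score = brute_force_direction(X)
y1 = w1 @ X

# How strongly does the extracted signal match each true source?
corr = [abs(np.corrcoef(y1, s)[0, 1]) for s in S]
print("w1 =", w1, "score =", score, "|corr| with sources =", corr)
```

In this toy setting the extracted signal should match one of the two sources closely, up to sign and scale, which ICA cannot recover anyway.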






At this point, after our discussions on regression, classification and clustering, you should ask "how can I measure this statistical independence?". The short answer is the [moments](https://en.wikipedia.org/wiki/Moment_(mathematics)) of [probability density functions (PDFs)](https://en.wikipedia.org/wiki/Probability_density_function): the first moment is the expected value (E), the second central moment is the variance, the third standardized moment is the [skewness](https://en.wikipedia.org/wiki/Skewness), and the fourth standardized moment is the [kurtosis](https://en.wikipedia.org/wiki/Kurtosis).
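To make the four moments concrete, here is a minimal NumPy sketch that estimates them from samples. The helper name `moments` and the two test distributions are chosen only for illustration.

```python
import numpy as np

def moments(x):
    """Return mean, variance, skewness and kurtosis of a 1-D sample."""
    x = np.asarray(x, dtype=float)
    mean = x.mean()                          # first moment
    centered = x - mean
    variance = np.mean(centered**2)          # second central moment
    std = np.sqrt(variance)
    skewness = np.mean((centered / std)**3)  # third standardized moment
    kurtosis = np.mean((centered / std)**4)  # fourth standardized moment
    return mean, variance, skewness, kurtosis

rng = np.random.default_rng(42)
gaussian = rng.normal(loc=0.0, scale=1.0, size=200_000)
uniform = rng.uniform(-1.0, 1.0, size=200_000)

g_mean, g_var, g_skew, g_kurt = moments(gaussian)
u_mean, u_var, u_skew, u_kurt = moments(uniform)
print("gaussian:", g_mean, g_var, g_skew, g_kurt)
print("uniform: ", u_mean, u_var, u_skew, u_kurt)
```

For a Gaussian the kurtosis is 3, and for a uniform distribution it is 1.8, so the estimates should land close to these values.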






Statistical independence is expressed in terms of PDFs. Two variables i and j are independent if their [joint PDF](https://en.wikipedia.org/wiki/Joint_probability_distribution) factorizes as:






```math
p_{ij}(i,j)=p_i(i)p_j(j)
```
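The factorization can be checked numerically with histograms. The sketch below is a rough illustration, not a formal test: the helper `factorization_gap`, the bin count and the sample data are all assumptions made for this example.

```python
import numpy as np

def factorization_gap(i, j, bins=10):
    """Largest difference between the joint histogram and the
    product of its marginals (0 would mean perfect factorization)."""
    counts, _, _ = np.histogram2d(i, j, bins=bins)
    joint = counts / counts.sum()   # estimated joint probabilities
    p_i = joint.sum(axis=1)         # marginal of i
    p_j = joint.sum(axis=0)         # marginal of j
    return np.max(np.abs(joint - np.outer(p_i, p_j)))

rng = np.random.default_rng(1)
i = rng.uniform(-1.0, 1.0, size=100_000)
j_indep = rng.uniform(-1.0, 1.0, size=100_000)        # independent of i
j_dep = i + 0.05 * rng.normal(size=100_000)           # depends on i

gap_indep = factorization_gap(i, j_indep)
gap_dep = factorization_gap(i, j_dep)
print("independent pair gap:", gap_indep)
print("dependent pair gap:  ", gap_dep)
```

The gap for the independent pair should be close to zero (only sampling noise), while the dependent pair shows a clearly larger gap.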






In simpler terms, if we know the individual PDFs of i and j, we can construct their joint PDF. This implies that, for any exponents p and q:






```math
E(i^p,j^q) = E(i^p)E(j^q)
```






Let's start with something we already saw in the lecture, p=q=1:






```math
E(i,j) = E(i)E(j)
```






where $`E(i,j)`$ is the covariance between i and j (assuming both i and j are zero mean):






```math
E(i,j) = \frac{1}{N}\sum_{n=1}^{N} i_n j_n
```






"Correlation" is nothing but the normalized covariance:






```math
\rho(i,j) = \frac{E(i,j)}{\sigma_i\sigma_j}
```






where $`\sigma`$ is the standard deviation:






```math
\sigma_i = E(i,i)^{0.5}, \quad \sigma_j = E(j,j)^{0.5}
```






As we discussed in the lecture, this normalization scales the value to lie between -1 and 1. To make this concrete, let's look at the limits. $`\rho(i,j)=1`$ means that j increases in proportion to i. $`\rho(i,j)=0`$ means that j neither increases nor decreases (on average, of course) as i increases. If $`\rho(i,j)=-1`$, j decreases in proportion as i increases.
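The covariance, standard deviation and correlation definitions, together with the limit cases, can be reproduced in a few lines of NumPy. The function names and the sample data below are assumptions made for this sketch. The signals are centered first so that the zero-mean assumption holds.

```python
import numpy as np

def covariance(i, j):
    """E(i,j) for zero-mean signals: the mean of the products."""
    i = i - i.mean()   # enforce the zero-mean assumption
    j = j - j.mean()
    return np.mean(i * j)

def correlation(i, j):
    """Covariance normalized by the two standard deviations."""
    sigma_i = covariance(i, i) ** 0.5
    sigma_j = covariance(j, j) ** 0.5
    return covariance(i, j) / (sigma_i * sigma_j)

rng = np.random.default_rng(7)
i = rng.normal(size=50_000)
noise = rng.normal(size=50_000)

rho_pos = correlation(i, 2.0 * i)     # j grows in proportion to i
rho_neg = correlation(i, -2.0 * i)    # j shrinks in proportion to i
rho_zero = correlation(i, noise)      # j unrelated to i
rho_mid = correlation(i, i + noise)   # partially related

print(rho_pos, rho_neg, rho_zero, rho_mid)
```

The first two values should be 1 and -1 up to rounding, the third close to 0, and the last somewhere in between. The result can also be cross-checked against `np.corrcoef(i, j)[0, 1]`.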






To check whether i and j are correlated, we look at a limited case, the first moment of the joint PDF (p=q=1), and say that they are uncorrelated if:

```math
E(i,j) = E(i)E(j)
```






To claim that i and j are independent, the general form of the equation must be satisfied (for all integers p>0 and q>0). So our criterion is:






```math
\rho(i^p,j^q) = 0
```
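The difference between "uncorrelated" and "independent" shows up clearly in a small experiment. The sketch below is an assumption-laden illustration: it only checks powers up to 3, so it is a practical, necessary-but-not-sufficient check rather than a proof of independence. The example pair (i uniform around zero, j equal to the square of i) is chosen because j is fully determined by i and yet uncorrelated with it.

```python
import numpy as np

def rho(a, b):
    """Correlation coefficient between two 1-D samples."""
    return np.corrcoef(a, b)[0, 1]

def max_moment_correlation(i, j, max_order=3):
    """Largest |rho(i^p, j^q)| over 1 <= p, q <= max_order."""
    return max(
        abs(rho(i**p, j**q))
        for p in range(1, max_order + 1)
        for q in range(1, max_order + 1)
    )

rng = np.random.default_rng(3)
i = rng.uniform(-1.0, 1.0, size=100_000)
j_indep = rng.uniform(-1.0, 1.0, size=100_000)  # independent of i
j_dep = i**2                                    # dependent, yet uncorrelated

rho_11_dep = rho(i, j_dep)        # p=q=1: close to 0
rho_21_dep = rho(i**2, j_dep)     # p=2, q=1: reveals the dependence
worst_indep = max_moment_correlation(i, j_indep)
worst_dep = max_moment_correlation(i, j_dep)

print("dependent pair:   rho(i,j) =", rho_11_dep, " rho(i^2,j) =", rho_21_dep)
print("max |rho(i^p,j^q)|: independent =", worst_indep, " dependent =", worst_dep)
```

For the dependent pair the p=q=1 check alone would wrongly suggest no relation, while the higher-order terms expose it. For the independent pair all checked terms stay near zero.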
















...  ...  