Commit 5deac1df authored by cihan.ates

Update Lecture_1.ipynb

parent a83b37ed
%% Cell type:markdown id: tags:
 
# Active Session I: Classification
 
%% Cell type:markdown id: tags:
 
# Important Note
 
Lecture notes and notebooks must not be copied and/or distributed without the express permission of ITS.
 
 
 
 
%% Cell type:markdown id: tags:
 
# 1. Problem Definition: Probe into the Data
 
In this dataset, we will look into the relationship between airfoil design and noise generation.
 
This NASA data set was obtained from a series of aerodynamic and acoustic tests of two and three-dimensional airfoil blade sections conducted in an anechoic wind tunnel. It comprises different size NACA 0012 airfoils at various wind tunnel speeds and angles of attack. The span of the airfoil and the observer position were the same in all of the experiments.
 
Attribute Information:
 
This problem has the following inputs:
1. Frequency, in Hertz.
2. Angle of attack, in degrees.
3. Chord length, in meters.
4. Free-stream velocity, in meters per second.
5. Suction side displacement thickness, in meters.
 
The only output is:
6. Scaled sound pressure level, in decibels.
 
%% Cell type:markdown id: tags:
 
 
# 2. Preparing the environment
 
Import the Python libraries that we will need to (i) load the data, (ii) analyze it, (iii) create our model, (iv) process the results.
 
%% Cell type:code id: tags:
 
```
#Importing the necessary libraries
import numpy as np
import pandas as pd
import os
import matplotlib.pyplot as plt
import seaborn as sns
color = sns.color_palette()
import matplotlib as mpl
```
 
%% Cell type:code id: tags:
 
```
# Data Preparation
from sklearn import preprocessing as pp
from scipy.stats import pearsonr
from sklearn.model_selection import train_test_split
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import log_loss
from sklearn.metrics import precision_recall_curve, average_precision_score
from sklearn.metrics import roc_curve, auc, roc_auc_score
from sklearn.metrics import confusion_matrix, classification_report
```
 
%% Cell type:code id: tags:
 
```
# ML Algorithms to be used
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import SGDClassifier
from sklearn.ensemble import RandomForestClassifier
import lightgbm as LGBMClassifier  # note: this binds the whole lightgbm module (not the LGBMClassifier class) to this name
 
#If you are working on a local machine:
# pip install lightgbm
```
 
%% Cell type:markdown id: tags:
 
# 3. Pre-processing
 
%% Cell type:markdown id: tags:
 
## Loading the Data
 
We need to load the dataset into the Colab environment. The pandas library offers a practical way to read the data directly from a URL. Since the file contains only tabulated values without a header row, we also need to supply the names of the features.
 
%% Cell type:code id: tags:
 
```
# Loading the data
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/00291/airfoil_self_noise.dat'
new_names = ['frequency','angle_attack','chord_length','Free-stream_velocity','displacement_thickness','sound_pressure']
data = pd.read_csv(url, names=new_names, delimiter='\t')
data.head()
```
 
%%%% Output: execute_result
 
frequency angle_attack ... displacement_thickness sound_pressure
0 800 0.0 ... 0.002663 126.201
1 1000 0.0 ... 0.002663 125.201
2 1250 0.0 ... 0.002663 125.951
3 1600 0.0 ... 0.002663 127.591
4 2000 0.0 ... 0.002663 127.461
[5 rows x 6 columns]
 
%% Cell type:markdown id: tags:
 
## Data Exploration
Here we will look into the statistics of the data and identify any missing values or categorical features that need further processing.
 
Let’s analyze our dataset first. Use data.head(n) to display the first n rows. You can change data.head(n) to data.sample(n) to display n randomly picked rows:
 
%% Cell type:code id: tags:
 
```
data.head(5)
```
 
%%%% Output: execute_result
 
frequency angle_attack ... displacement_thickness sound_pressure
0 800 0.0 ... 0.002663 126.201
1 1000 0.0 ... 0.002663 125.201
2 1250 0.0 ... 0.002663 125.951
3 1600 0.0 ... 0.002663 127.591
4 2000 0.0 ... 0.002663 127.461
[5 rows x 6 columns]
 
%% Cell type:code id: tags:
 
```
data.sample(5)
```
 
%%%% Output: execute_result
 
frequency angle_attack ... displacement_thickness sound_pressure
1384 1000 8.9 ... 0.010309 135.433
1015 12500 4.8 ... 0.000849 127.688
265 2000 2.0 ... 0.003135 124.612
226 4000 0.0 ... 0.002535 123.465
1042 8000 4.8 ... 0.000907 131.346
[5 rows x 6 columns]
 
%% Cell type:code id: tags:
 
```
data.info()
```
 
%%%% Output: stream
 
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1503 entries, 0 to 1502
Data columns (total 6 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 frequency 1503 non-null int64
1 angle_attack 1503 non-null float64
2 chord_length 1503 non-null float64
3 Free-stream_velocity 1503 non-null float64
4 displacement_thickness 1503 non-null float64
5 sound_pressure 1503 non-null float64
dtypes: float64(5), int64(1)
memory usage: 70.6 KB
 
%% Cell type:markdown id: tags:
 
Let's look into the statistics of the data. This is usually a good starting point to get an idea about the range of the data, its nature, as well as the missing information for different features:
 
%% Cell type:code id: tags:
 
```
data.describe()
```
 
%%%% Output: execute_result
 
frequency angle_attack ... displacement_thickness sound_pressure
count 1503.000000 1503.000000 ... 1503.000000 1503.000000
mean 2886.380572 6.782302 ... 0.011140 124.835943
std 3152.573137 5.918128 ... 0.013150 6.898657
min 200.000000 0.000000 ... 0.000401 103.380000
25% 800.000000 2.000000 ... 0.002535 120.191000
50% 1600.000000 5.400000 ... 0.004957 125.721000
75% 4000.000000 9.900000 ... 0.015576 129.995500
max 20000.000000 22.200000 ... 0.058411 140.987000
[8 rows x 6 columns]
 
%% Cell type:markdown id: tags:
 
If you did not define the feature names yourself, you can list them with the following command:
 
%% Cell type:code id: tags:
 
```
data.columns
```
 
%%%% Output: execute_result
 
Index(['frequency', 'angle_attack', 'chord_length', 'Free-stream_velocity',
'displacement_thickness', 'sound_pressure'],
dtype='object')
 
%% Cell type:markdown id: tags:
 
It is also possible to explore individual features:
 
%% Cell type:code id: tags:
 
```
data['sound_pressure'].median()
```
 
%%%% Output: execute_result
 
125.721
 
%% Cell type:code id: tags:
 
```
data['sound_pressure'].mean()
```
 
%%%% Output: execute_result
 
124.83594278110434
 
%% Cell type:markdown id: tags:
 
### Data Visualization
 
Another important pre-processing step is data visualization. Histograms are suitable for a holistic view, where we can probe into the distribution of each attribute.
 
We can use the hist method of the pandas DataFrame, which draws with matplotlib, for that purpose:
 
%% Cell type:code id: tags:
 
```
data.hist(bins=30, figsize=(15,15))
plt.show()
```
 
%%%% Output: display_data
 
![]()
 
%% Cell type:markdown id: tags:
 
It is always a good exercise to look into the data visually and try to see the distributions of the features.
 
%% Cell type:markdown id: tags:
 
### Identify non-numerical values
Some ML algorithms cannot handle non-numerical or missing values (missing entries appear as NaN: not a number), so you may need to identify the type of the data for each feature and modify it if necessary. It is also quite common that feature values are missing for some instances / examples, so you need to decide what to do: (i) omit the instance; (ii) replace the missing values with the mean / median / mode of the feature; (iii) substitute them with a value of your choice.
 
The following line counts the NaNs for each feature for us:
 
%% Cell type:code id: tags:
 
```
nanCounter = np.isnan(data).sum()
print(nanCounter)
```
 
%%%% Output: stream
 
frequency 0
angle_attack 0
chord_length 0
Free-stream_velocity 0
displacement_thickness 0
sound_pressure 0
dtype: int64
 
%% Cell type:markdown id: tags:
 
The data is entirely numerical and contains no missing values. We will see in the following sessions how to handle datasets including non-numerical features in a smart way!
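
Had the dataset contained gaps, the three options listed earlier could be sketched as follows. This is only an illustration on a hypothetical toy frame (`toy` and its values are made up, they are not part of the airfoil data); the pandas calls `dropna`, `fillna` and `median` are standard.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with gaps (the airfoil data itself has none).
toy = pd.DataFrame({
    "frequency": [800.0, 1000.0, np.nan, 1600.0],
    "chord_length": [0.30, np.nan, 0.30, 0.15],
})

# (i) Omit every instance that has at least one missing value.
dropped = toy.dropna()

# (ii) Replace missing values with the median of each feature.
filled_median = toy.fillna(toy.median())

# (iii) Substitute a value of your own choice (0.0 is an arbitrary example).
filled_constant = toy.fillna(0.0)

print(toy.isna().sum())              # NaN count per feature
print(dropped.shape)                 # rows left after dropping
print(filled_median)
```

Which option is appropriate depends on how many values are missing and on why they are missing; dropping rows is the simplest choice but costs data.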
 
It is also a good exercise to check the uniqueness of the dataset, that is, whether there exist values repeating at different instances:
 
%% Cell type:code id: tags:
 
```
distinctCounter = data.apply(lambda x: len(x.unique()))
print(distinctCounter)
```
 
%%%% Output: stream
 
frequency 21
angle_attack 27
chord_length 6
Free-stream_velocity 4
displacement_thickness 105
sound_pressure 1456
dtype: int64
 
%% Cell type:markdown id: tags:
 
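As identified earlier, there are 1503 instances (experimental measurements). Here we see that the five input parameters take 21, 27, 6, 4 and 105 distinct values, respectively. A full grid of all combinations would contain 21x27x6x4x105 = 1,428,840 points, so the experiments were planned as a small subset of the possible combinations.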
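
The per-feature counts alone do not tell how many distinct input combinations were actually measured. A minimal sketch of how one could check this, using a hypothetical toy frame (`toy` is made up for illustration; on the real data the same calls would be applied to the five input columns):

```python
import pandas as pd

# Hypothetical toy measurements: 2 frequencies and 3 angles allow 6 combinations.
toy = pd.DataFrame({
    "frequency": [800, 800, 1000, 1000, 800],
    "angle_attack": [0.0, 2.0, 0.0, 4.0, 0.0],
})

distinct_per_feature = toy.nunique()           # distinct values in each column
full_grid = int(distinct_per_feature.prod())   # size of the full factorial grid
observed = len(toy.drop_duplicates())          # combinations that actually occur

print(distinct_per_feature.to_dict())
print("full grid:", full_grid, "observed combinations:", observed)
```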
 
### Supervised Algorithms: Preparing the Labels
 
A supervised approach requires labelled data for training. In this dataset, the target is the noise level (classification) or the noise value itself (regression), depending on the question. In both cases, we need to create a feature matrix, say X, and a label vector Y. We will use Y to train and test our model.
 
%% Cell type:markdown id: tags:
 
### Creating the Feature Matrix and Labels
 
%% Cell type:code id: tags:
 
```
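# Feature matrix: all columns except the target
dataX = data.copy().drop(['sound_pressure'],axis=1)
# Label vector: note that astype(int) truncates the decimals (e.g. 120.7 becomes 120)
dataY = data['sound_pressure'].astype(int).copy()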
```
 
%% Cell type:markdown id: tags:
 
### Modifying the label vector for classification:
 
If we are interested in a classification problem, such as identifying the noise levels depending on the airfoil design, we can modify the measured noise levels by converting them into categorical values.
 
Assume that we will be performing binary classification with the following criterion:
If the noise is at most 120 dB, it is "low"; if it is greater than 120 dB, it is classified as "high". In this exercise, we will use '0' for 'low' and '1' for 'high'. In this scenario, we can use Binarizer, which maps values strictly greater than the threshold to 1 and all other values to 0:
 
%% Cell type:code id: tags:
 
```
from sklearn.preprocessing import Binarizer
#Create an object:
transformer = Binarizer(threshold=120,copy=False)
transformer
```
 
%%%% Output: execute_result
 
Binarizer(copy=False, threshold=120)
 
%% Cell type:code id: tags:
 
```
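#Transforming the data with the Binarizer:
#With copy=False, the Binarizer overwrites the array passed to it; here that array shares memory with dataY.
transformer.fit_transform(dataY.values.reshape(-1, 1))
 
#Let's see how it changes:
#(dataY.head without parentheses displays the bound method together with the whole Series;
# use dataY.head() to show only the first five rows)
dataY.head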
```
 
%%%% Output: execute_result
 
<bound method NDFrame.head of 0 1
1 1
2 1
3 1
4 1
..
1498 0
1499 0
1500 0
1501 0
1502 0
Name: sound_pressure, Length: 1503, dtype: int64>
 
%% Cell type:code id: tags:
 
```
dataY.value_counts()
```
 
%%%% Output: execute_result
 
1 1073
0 430
Name: sound_pressure, dtype: int64
 
%% Cell type:markdown id: tags:
 
For more pre-processing options, you may check:
 
https://scikit-learn.org/stable/modules/classes.html#module-sklearn.preprocessing
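
The thresholding rule can also be checked on a handful of values. The sketch below uses hypothetical sound pressure levels (`levels` is made up) and compares Binarizer with a plain NumPy comparison. It also shows, as an aside, what happens when the values are truncated to integers before thresholding: a level of 120.7 dB becomes 120 and is no longer greater than the threshold.

```python
import numpy as np
from sklearn.preprocessing import Binarizer

# Hypothetical sound pressure levels in dB.
levels = np.array([118.4, 120.0, 120.7, 126.2])

# Binarizer maps values strictly greater than the threshold to 1, all others to 0.
labels_binarizer = Binarizer(threshold=120).fit_transform(levels.reshape(-1, 1)).ravel()

# The same rule written as a plain NumPy comparison.
labels_numpy = (levels > 120).astype(int)

# Truncating to integers first moves 120.7 into the "low" class.
labels_truncated = (levels.astype(int) > 120).astype(int)

print(labels_binarizer, labels_numpy, labels_truncated)
```

Whether truncation matters depends on how many measurements fall between 120 and 121 dB, so it is worth checking the class counts when the threshold sits inside the bulk of the data.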
 
%% Cell type:markdown id: tags:
 
### Feature Standardization: Rescaling the Data
 
Feature engineering is an inseparable aspect of ML models. In many engineering problems, we know from experience that combining different features significantly simplifies the problem and helps us to focus our experimental / numerical work on the correct data plane. For example, we combine characteristic length, velocity, density and viscosity into the Reynolds number and "classify" the flow regime in a pipe in a quite straightforward way. The same is true for ML algorithms. You can combine features, delete the unrelated ones and rescale the data (similar to what we do in non-dimensional analysis in engineering) to help ML algorithms seek patterns from an unbiased perspective. In our example, the number of features is already small, so we do not need to perform any feature engineering. We will come back to this topic in the following weeks.
 
For the time being, let's just see how much the features are correlated. But first, we will rescale the data. It is important to remember that many ML algorithms work better if the data is standardized, that is, if each feature has a mean value of zero and a standard deviation of one. Let's try it ourselves:
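
What standardization computes can first be sketched by hand on a single toy column (the array `x` is an illustrative assumption): subtract the mean and divide by the standard deviation. StandardScaler uses the population standard deviation (division by N), which is also the NumPy default.

```python
import numpy as np

# Hypothetical feature column (e.g. free-stream velocities in m/s).
x = np.array([31.7, 39.6, 55.5, 71.3])

# Standardization: subtract the mean, divide by the population standard deviation.
z = (x - x.mean()) / x.std()      # np.std uses ddof=0 by default

print(z.mean(), z.std())          # approximately 0 and 1
print(z.std(ddof=1))              # the sample std (division by N-1) is slightly above 1
```

The last line explains why a summary based on the sample standard deviation does not report exactly 1 for standardized data.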
 
%% Cell type:code id: tags:
 
```
#Rescaling the data
featuresToScale = dataX.columns
sX = pp.StandardScaler(copy=True)
dataX.loc[:,featuresToScale] = sX.fit_transform(dataX[featuresToScale])
#Looking into the statistics again:
dataX.describe()
```
 
%%%% Output: execute_result
 
frequency angle_attack ... Free-stream_velocity displacement_thickness
count 1.503000e+03 1.503000e+03 ... 1.503000e+03 1.503000e+03
mean 2.837975e-16 -3.495393e-16 ... -8.558246e-16 2.866045e-16
std 1.000333e+00 1.000333e+00 ... 1.000333e+00 1.000333e+00
min -8.524068e-01 -1.146403e+00 ... -1.230809e+00 -8.169263e-01
25% -6.620227e-01 -8.083458e-01 ... -7.233448e-01 -6.545613e-01
50% -4.081773e-01 -2.336486e-01 ... -7.233448e-01 -4.702979e-01
75% 3.533590e-01 5.269801e-01 ... 1.312935e+00 3.374462e-01
max 5.430267e+00 2.606032e+00 ... 1.312935e+00 3.595917e+00
[8 rows x 5 columns]
 
%% Cell type:markdown id: tags:
 
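As you can see, the mean value of each feature is now zero (up to floating-point precision) with a standard deviation (std) of about 1. The reported std is 1.000333 rather than exactly 1 because describe() computes the sample standard deviation (dividing by N-1), whereas StandardScaler divides by N. Now let's try to visualize how correlated the data is by creating a correlation matrix.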
 
%% Cell type:code id: tags:
 
```
correlationMatrix = pd.DataFrame(dataX).corr()
 
f = plt.figure(figsize=(12, 6))
plt.matshow(correlationMatrix, fignum=f.number)
plt.xticks(range(dataX.shape[1]), dataX.columns, fontsize=14, rotation=75)
plt.yticks(range(dataX.shape[1]), dataX.columns, fontsize=14)
cb = plt.colorbar()
cb.ax.tick_params(labelsize=14)
plt.show()
```
 
%%%% Output: display_data
 
![]()
 
%% Cell type:code id: tags:
 
```
#we can also simply look at the table via pandas:
#(Styler.set_precision was removed in pandas 2.0; format(precision=2) needs pandas >= 1.3)
correlationMatrix.style.background_gradient(cmap='viridis').format(precision=2)
```
 
%%%% Output: execute_result
 
<pandas.io.formats.style.Styler at 0x7feafc15dcc0>
 
%% Cell type:markdown id: tags:
 
What do these numbers mean?
 
A value close to 1 means that there is a strong positive correlation; at the other extreme, a value close to -1 implies a strong negative correlation. Values near 0 indicate little or no linear correlation. For more:
 
https://en.wikipedia.org/wiki/Correlation_and_dependence
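
To make the two extremes concrete, the Pearson coefficient can be computed by hand for toy data. In this sketch the helper `pearson` and the arrays are illustrative assumptions; `DataFrame.corr()` uses the same coefficient by default.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_up = 2.0 * x + 1.0          # increases linearly with x
y_down = -0.5 * x + 3.0       # decreases linearly with x

def pearson(a, b):
    """Pearson correlation: covariance divided by the product of the spreads."""
    a_c, b_c = a - a.mean(), b - b.mean()
    return (a_c * b_c).sum() / np.sqrt((a_c ** 2).sum() * (b_c ** 2).sum())

print(pearson(x, y_up))       # close to +1
print(pearson(x, y_down))     # close to -1
```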
 
%% Cell type:markdown id: tags:
 
## Preparing the Dataset for Model
 
We need to divide our entire dataset into fractions so that we have a training set from which the machine learning algorithm learns. We also need another set to test the predictions of the ML algorithm. There is no golden rule here: you need to consider the size of your entire dataset. Sometimes 5% is more than enough; sometimes we need to set aside 1/3 to have enough test samples.
 
In our current example, the number of cases is quite low for an ML project. Therefore, let's leave a sufficient number of test cases:
 
 
 
 
%% Cell type:code id: tags:
 
```
X_train, X_test, y_train, y_test = train_test_split(dataX,
dataY, test_size=0.25,
random_state=2020, stratify=dataY)
```
 
%% Cell type:markdown id: tags:
 
Here we have frozen the randomness to make the results reproducible. Otherwise, the results would change at every run.
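
Both effects, the fixed seed and the `stratify` argument, can be seen on a small synthetic example. The labels below are hypothetical (70 % ones, 30 % zeros); only the arguments mirror the call used above.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 70 ones and 30 zeros.
y = np.array([1] * 70 + [0] * 30)
X = np.arange(len(y)).reshape(-1, 1)

split_a = train_test_split(X, y, test_size=0.25, random_state=2020, stratify=y)
split_b = train_test_split(X, y, test_size=0.25, random_state=2020, stratify=y)

X_tr, X_te, y_tr, y_te = split_a
print(y_tr.mean(), y_te.mean())                  # class ratio is close to 0.7 in both parts
print(np.array_equal(split_a[1], split_b[1]))    # same seed, same test rows
```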
 
%% Cell type:markdown id: tags:
 
## Cross-Validation
Another rule of thumb is to split the training set into a sub-training set and a validation set before evaluating the model's true performance on the test set (the 25% of the dataset reserved above). This policy is called k-fold cross-validation. The training data is divided into k fractions (folds); the model is trained on (k-1) of them and validated on the k-th, and this is repeated so that each fold serves as the validation set once. Here the idea is to obtain a model that generalizes as well as possible.
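
How the folds are formed can be sketched on synthetic labels (the arrays and the name `k_fold_demo` are assumptions for illustration): every sample lands in the validation part exactly once, and each fold keeps roughly the same class ratio.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical labels: 70 ones and 30 zeros; the features are irrelevant for splitting.
y = np.array([1] * 70 + [0] * 30)
X = np.zeros((len(y), 1))

k_fold_demo = StratifiedKFold(n_splits=5, shuffle=True, random_state=2020)
fold_sizes = []
seen = []
for train_idx, cv_idx in k_fold_demo.split(X, y):
    fold_sizes.append((len(train_idx), len(cv_idx)))   # (training size, validation size)
    seen.extend(cv_idx.tolist())                       # collect validation indices

print(fold_sizes)
print(sorted(seen) == list(range(len(y))))   # each sample is validated exactly once
```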
 
%% Cell type:code id: tags:
 
```
k_fold = StratifiedKFold(n_splits=5, shuffle=True, random_state=2020)
```
 
%% Cell type:markdown id: tags:
 
## How to calculate the performance?
In any supervised approach, we need to select a cost function to compare ML predictions with the true values (labels). The ML algorithm will minimize the cost function by changing its fitting parameters. You should spend some time deciding what the best cost function is for your dataset and your objective.
 
In our case, we are dealing with binary classification, so let's try the binary classification log loss. You may see the following links for further information:
 
https://en.wikipedia.org/wiki/Cross_entropy
 
https://scikit-learn.org/stable/modules/generated/sklearn.metrics.log_loss.html
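
As a small worked example, the binary log loss is the negative mean of y·log(p) + (1−y)·log(1−p), where p is the predicted probability of class 1. The sketch below evaluates it by hand on hypothetical labels and probabilities and compares the result with `log_loss` from scikit-learn.

```python
import numpy as np
from sklearn.metrics import log_loss

# Hypothetical true labels and predicted probabilities of class 1.
y_true = np.array([1, 0, 1, 1])
p_high = np.array([0.9, 0.2, 0.6, 0.4])

# Negative mean log-likelihood of the true labels under the predictions.
manual = -np.mean(y_true * np.log(p_high) + (1 - y_true) * np.log(1 - p_high))

print(manual, log_loss(y_true, p_high))   # both values agree
```

Confident correct predictions contribute little to the loss, while confident wrong ones are penalized heavily, which is why the last sample (true label 1, predicted 0.4) dominates here.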
 
%% Cell type:markdown id: tags:
 
# ML 101: Logistic Regression
 
See the lecture notes for the model. Here is a review:
 
https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
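
In short, a binary logistic regression passes a linear combination of the features through the sigmoid function to obtain the probability of class 1. The sketch below checks this on synthetic data (the data, `clf`, and `p_manual` are illustrative assumptions, not part of the lecture notebook).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic two-feature data with a linear decision boundary.
rng = np.random.default_rng(2020)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

# Default L2 penalty, same solver and C as in the lecture setup.
clf = LogisticRegression(C=1.0, solver='liblinear', random_state=2020)
clf.fit(X, y)

# Linear score followed by the sigmoid gives the probability of class 1.
z = X @ clf.coef_.ravel() + clf.intercept_[0]
p_manual = 1.0 / (1.0 + np.exp(-z))

print(np.allclose(p_manual, clf.predict_proba(X)[:, 1]))
```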
 
%% Cell type:code id: tags:
 
```
# Hyperparameters:
penalty = 'l2'
C = 1.0 #regularization strength. The smaller the value, the stronger the regularization.
random_state = 2020
solver = 'liblinear' # For small datasets, it is good.
logReg = LogisticRegression(penalty=penalty, C=C,random_state=random_state, solver=solver)
```
 
%% Cell type:code id: tags:
 
```
# Model Training:
#-----------------------------------------------------------------------------
#Lists for storing scores
trainingScores = []
cvScores = []

#DataFrame is a 2-dimensional labeled data structure. You can think of it like a spreadsheet.
#Indexing in pandas means simply selecting particular rows and columns of data from a DataFrame.
#It is also known as Subset Selection.
predictionsBasedOnKFolds = pd.DataFrame(data=[],index=y_train.index,columns=[0,1])
model = logReg

#k_fold.split will generate indices to split data into training and validation (cv) sets:
for train_index, cv_index in k_fold.split(np.zeros(len(X_train)),y_train.ravel()):

    #'iloc' selects rows by integer position, so the index labels do not need to be known:
    #Here we are filtering the data based on indices. Data is divided as 902/225.
    X_train_fold, X_cv_fold = X_train.iloc[train_index,:], X_train.iloc[cv_index,:]
    y_train_fold, y_cv_fold = y_train.iloc[train_index], y_train.iloc[cv_index]

    #Fitting the model according to given data:
    #Note that model refers to logReg.
    model.fit(X_train_fold, y_train_fold)

    #Let's check how good the fit is. Remember we decided to use log loss.
    # Log loss, aka logistic loss or cross-entropy loss. This is the loss function
    # used in (multinomial) logistic regression and extensions of it such as neural networks.

    # We will first look at the log loss on the training dataset.
    loglossTraining = log_loss(y_train_fold,model.predict_proba(X_train_fold)[:,1])
    #Saving our analysis on the list:
    trainingScores.append(loglossTraining)

    #Let's see how good it is on the CV dataset:
    predictionsBasedOnKFolds.loc[X_cv_fold.index,:] = model.predict_proba(X_cv_fold)
    loglossCV = log_loss(y_cv_fold,predictionsBasedOnKFolds.loc[X_cv_fold.index,1])
    #Saving our analysis on the list:
    cvScores.append(loglossCV)

    #printing the results:
    print('Training Log Loss: ', loglossTraining)
    print('CV Log Loss: ', loglossCV)

#Let's see the overall log loss for the entire training set (1127)
loglossLogisticRegression = log_loss(y_train,predictionsBasedOnKFolds.loc[:,1])
print('Logistic Regression Log Loss: ', loglossLogisticRegression)
```
 
%%%% Output: stream
 
Training Log Loss: 0.356738871444132
CV Log Loss: 0.3908671819349346
Training Log Loss: 0.3502213540664088
CV Log Loss: 0.41923690251002477
Training Log Loss: 0.36966720091349914
CV Log Loss: 0.3416906749773056
Training Log Loss: 0.3713443836483212
CV Log Loss: 0.33109771302939867
Training Log Loss: 0.3628451865203291
CV Log Loss: 0.36564683099773154
Logistic Regression Log Loss: 0.3697705832835482
 
%% Cell type:markdown id: tags:
 
Let's review the results.
 
i) In general, what we expect is a training loss smaller than the CV loss. The reason is simple: the ML algorithm learns from the training data, so it should perform better on the training set. CV is the data not included in the training set. In our case, we see that the training and cross-validation losses are similar for each run.