Lecture notes and notebooks must not be copied and/or distributed without the express permission of ITS.
%% Cell type:markdown id: tags:
# 1. Problem Definition: Probe into the Data
In this dataset, we will look into the relationship between airfoil design and noise generation.
This NASA data set was obtained from a series of aerodynamic and acoustic tests of two- and three-dimensional airfoil blade sections conducted in an anechoic wind tunnel. It comprises NACA 0012 airfoils of different sizes at various wind tunnel speeds and angles of attack. The span of the airfoil and the observer position were the same in all of the experiments.
Attribute Information:
This problem has the following inputs:
1. Frequency, in Hertz.
2. Angle of attack, in degrees.
3. Chord length, in meters.
4. Free-stream velocity, in meters per second.
5. Suction side displacement thickness, in meters.
The only output is:
6. Scaled sound pressure level, in decibels.
%% Cell type:markdown id: tags:
# 2. Preparing the environment
Import the Python libraries that we will need to (i) load the data, (ii) analyze it, (iii) create our model, and (iv) process the results.
%% Cell type:code id: tags:
```
#Importing the necessary libraries
import numpy as np
import pandas as pd
import os
import matplotlib.pyplot as plt
import seaborn as sns
color = sns.color_palette()
import matplotlib as mpl
```
%% Cell type:code id: tags:
```
# Data Preparation
from sklearn import preprocessing as pp
from scipy.stats import pearsonr
from sklearn.model_selection import train_test_split
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import log_loss
from sklearn.metrics import precision_recall_curve, average_precision_score
from sklearn.metrics import roc_curve, auc, roc_auc_score
from sklearn.metrics import confusion_matrix, classification_report
```
%% Cell type:code id: tags:
```
# ML Algorithms to be used
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import SGDClassifier
from sklearn.ensemble import RandomForestClassifier
from lightgbm import LGBMClassifier
# If you are working on a local machine, you may first need to run:
# pip install lightgbm
```
%% Cell type:markdown id: tags:
# 3. Pre-processing
%% Cell type:markdown id: tags:
## Loading the Data
We need to upload the dataset to the Colab environment. The pandas library is a practical way to load and read the data from a URL. Since the data is only given as tabulated values, we need to add the names of the features as well.
%% Cell type:code id: tags:
```
# The dataset is hosted as tab-separated values in the UCI ML repository
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/00291/airfoil_self_noise.dat'
new_names = ['frequency', 'angle_attack', 'chord_length',
             'Free-stream_velocity', 'displacement_thickness', 'sound_pressure']
data = pd.read_csv(url, names=new_names, delimiter='\t')
data.head()
```
%%%% Output: execute_result
frequency angle_attack ... displacement_thickness sound_pressure
0 800 0.0 ... 0.002663 126.201
1 1000 0.0 ... 0.002663 125.201
2 1250 0.0 ... 0.002663 125.951
3 1600 0.0 ... 0.002663 127.591
4 2000 0.0 ... 0.002663 127.461
[5 rows x 6 columns]
%% Cell type:markdown id: tags:
## Data Exploration
Here we will look into the statistics of the data and identify any missing values or categorical features that need further processing.
Let's analyze our dataset first. Use data.head(n) to display the top n rows. You can change data.head(n) to data.sample(n) to display randomly picked rows:
%% Cell type:code id: tags:
```
data.head(5)
```
%%%% Output: execute_result
frequency angle_attack ... displacement_thickness sound_pressure
0 800 0.0 ... 0.002663 126.201
1 1000 0.0 ... 0.002663 125.201
2 1250 0.0 ... 0.002663 125.951
3 1600 0.0 ... 0.002663 127.591
4 2000 0.0 ... 0.002663 127.461
[5 rows x 6 columns]
%% Cell type:code id: tags:
```
data.sample(5)
```
%%%% Output: execute_result
frequency angle_attack ... displacement_thickness sound_pressure
1384 1000 8.9 ... 0.010309 135.433
1015 12500 4.8 ... 0.000849 127.688
265 2000 2.0 ... 0.003135 124.612
226 4000 0.0 ... 0.002535 123.465
1042 8000 4.8 ... 0.000907 131.346
[5 rows x 6 columns]
%% Cell type:code id: tags:
```
data.info()
```
%%%% Output: stream
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1503 entries, 0 to 1502
Data columns (total 6 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 frequency 1503 non-null int64
1 angle_attack 1503 non-null float64
2 chord_length 1503 non-null float64
3 Free-stream_velocity 1503 non-null float64
4 displacement_thickness 1503 non-null float64
5 sound_pressure 1503 non-null float64
dtypes: float64(5), int64(1)
memory usage: 70.6 KB
%% Cell type:markdown id: tags:
Let's look into the statistics of the data. This is usually a good starting point to have an idea about the range of the data, its nature, as well as the missing information for different features:
%% Cell type:code id: tags:
```
data.describe()
```
%%%% Output: execute_result
frequency angle_attack ... displacement_thickness sound_pressure
%% Cell type:markdown id: tags:
It is also possible to explore individual features:
%% Cell type:code id: tags:
```
data['sound_pressure'].median()
```
%%%% Output: execute_result
125.721
%% Cell type:code id: tags:
```
data['sound_pressure'].mean()
```
%%%% Output: execute_result
124.83594278110434
%% Cell type:markdown id: tags:
### Data Visualization
Another important pre-processing step is data visualization. Histograms are suitable for a holistic view, letting us probe into the data for each attribute.
It is always a good exercise to look at the data visually and inspect the distributions of the features.
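%% Cell type:markdown id: tags:
One call produces a histogram per feature. A minimal sketch is shown below; it builds a small synthetic stand-in so the cell runs on its own, but with the real DataFrame the call is simply `data.hist(...)`:
%% Cell type:code id: tags:
```
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use('Agg')  # non-interactive backend; drop this line inside Colab
import matplotlib.pyplot as plt

# Small synthetic stand-in for 'data' (illustration only)
rng = np.random.default_rng(0)
demo = pd.DataFrame({'frequency': rng.integers(200, 20000, 200),
                     'sound_pressure': rng.normal(125.0, 6.0, 200)})

# One histogram per column, 20 bins each
axes = demo.hist(bins=20, figsize=(8, 3))
plt.tight_layout()
```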
%% Cell type:markdown id: tags:
### Identify non-numerical values
Some ML algorithms cannot handle missing or non-numerical values (missing entries show up as NaN, "not a number"), so you may need to identify the type of the data for each feature and modify it if necessary. It is also quite common that different feature values are missing for different instances / examples, so you may need to decide what to do: (i) omit the instance; (ii) replace them with the mean / median / mode of the feature; (iii) substitute them with a value of your choice.
The following line counts the NaNs for each feature for us:
%% Cell type:code id: tags:
```
nanCounter = data.isna().sum()  # equivalently np.isnan(data).sum() for all-numeric data
print(nanCounter)
```
%%%% Output: stream
frequency 0
angle_attack 0
chord_length 0
Free-stream_velocity 0
displacement_thickness 0
sound_pressure 0
dtype: int64
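%% Cell type:markdown id: tags:
Our data turned out to have no missing values, but when a dataset does, the three strategies listed above can be sketched as follows (on a small hypothetical frame, not the airfoil data):
%% Cell type:code id: tags:
```
import numpy as np
import pandas as pd

# Small hypothetical frame with missing values (not the airfoil data)
df = pd.DataFrame({'speed': [39.6, np.nan, 71.3],
                   'chord': [0.1, 0.2, np.nan]})

dropped = df.dropna()               # (i) omit incomplete instances
filled_mean = df.fillna(df.mean())  # (ii) replace with the feature mean
filled_const = df.fillna(0.0)       # (iii) substitute a value of your choice

print(dropped.shape)                       # only fully observed rows remain
print(filled_mean.isna().sum().sum())      # 0 -> no NaNs left
```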
%% Cell type:markdown id: tags:
The data was entirely numerical, so no conversion is needed here. We will see in the following sessions how to handle datasets including non-numerical features in a smart way!
It is also a good exercise to check the uniqueness of the dataset, that is, whether there are values repeating at different instances:
As identified earlier, there are 1503 instances (experimental measurements). Here we realize that these experiments were planned as combinations of the 5 input parameters, which take 21, 27, 6, 4 and 105 unique values, respectively.
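%% Cell type:markdown id: tags:
The per-feature unique-value counts come from `nunique`. A sketch on a tiny stand-in frame (on the full airfoil data the same call reports 21, 27, 6, 4 and 105 for the five inputs):
%% Cell type:code id: tags:
```
import pandas as pd

# Tiny stand-in; with the real data the call is simply data.nunique()
demo = pd.DataFrame({'frequency': [800, 1000, 800, 1250],
                     'angle_attack': [0.0, 0.0, 1.5, 1.5]})
print(demo.nunique())  # distinct values per column
```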
### Supervised Algorithms: Preparing the Labels
A supervised approach requires labelled data for training. In this dataset, the target is either the noise level category (classification) or the predicted noise value (regression), depending on the question. In both cases, we need to create a feature matrix, say X, and a label vector Y. We will use Y to train and test our model.
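%% Cell type:markdown id: tags:
Separating the feature matrix and label vector is a one-liner in pandas. A sketch on a tiny stand-in with the same column names as above:
%% Cell type:code id: tags:
```
import pandas as pd

# Tiny stand-in for the airfoil data (same column names as above)
demo = pd.DataFrame({
    'frequency': [800, 1000, 1250],
    'angle_attack': [0.0, 0.0, 0.0],
    'sound_pressure': [126.201, 125.201, 125.951],
})

X = demo.drop(columns=['sound_pressure'])  # feature matrix
y = demo['sound_pressure']                 # label vector
print(X.shape, y.shape)  # (3, 2) (3,)
```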
### Modifying the label vector for classification:
If we are interested in a classification problem, such as identifying the noise levels depending on the airfoil design, we can modify the measured noise levels by converting them into categorical values.
Assume that we will be performing binary classification with the following criteria:
If the noise is less than 120 dB, it is "low", and it is classified as "high" if greater than 120 dB. In this exercise, we will use '0' for 'low' and '1' for 'high'. In this scenario, we can use Binarizer:
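%% Cell type:markdown id: tags:
A minimal sketch of this thresholding with scikit-learn's `Binarizer` (the dB values below are made up for illustration):
%% Cell type:code id: tags:
```
import numpy as np
from sklearn.preprocessing import Binarizer

# Made-up noise levels in dB; values above the 120 dB threshold
# become 1 ('high'), the rest become 0 ('low')
levels = np.array([[110.0], [119.9], [126.2], [135.4]])
labels = Binarizer(threshold=120.0).fit_transform(levels)
print(labels.ravel())  # [0. 0. 1. 1.]
```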
Feature engineering is an inseparable aspect of ML models. In many engineering problems, we know from experience that combining different features significantly simplifies the problem and helps us focus our experimental / numerical work on the correct data plane. For example, we combine characteristic length, velocity, density and viscosity into the Reynolds number and "classify" the flow regime in a pipe in a quite straightforward way. The same is true for ML algorithms: you can combine features, delete unrelated ones and rescale the data (similar to what we do in non-dimensional analysis in engineering) to help ML algorithms seek patterns from an unbiased perspective. In our example, the number of features is already small, so we do not need to perform any feature engineering. We will come back to this topic in the following weeks.
For the time being, let's just see how correlated the features are. But first, we will rescale the data. It is important to remember that most ML algorithms work better if the data is normalized around zero, that is, if it has a mean value of zero and a standard deviation of one. Let's try it ourselves:
%%%% Output: execute_result
max 5.430267e+00 2.606032e+00 ... 1.312935e+00 3.595917e+00
[8 rows x 5 columns]
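%% Cell type:markdown id: tags:
The rescaling can be sketched with scikit-learn's `StandardScaler` (the numbers below are a small synthetic stand-in, not the full feature matrix):
%% Cell type:code id: tags:
```
import numpy as np
from sklearn.preprocessing import StandardScaler

# Rescale each feature to zero mean and unit standard deviation
X = np.array([[800.0, 0.3048],
              [1000.0, 0.3048],
              [4000.0, 0.0254]])
X_scaled = StandardScaler().fit_transform(X)

print(X_scaled.mean(axis=0).round(6))  # ~[0. 0.]
print(X_scaled.std(axis=0).round(6))   # ~[1. 1.]
```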
%% Cell type:markdown id: tags:
As you can see, mean value is fixed as zero with a standard deviation (std) of 1. Now let's try to visualize how correlated the data is by creating a correlation matrix.
%%%% Output: execute_result
<pandas.io.formats.style.Styler at 0x7feafc15dcc0>
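%% Cell type:markdown id: tags:
The correlation matrix itself comes from `DataFrame.corr` (Pearson by default). A sketch on a small synthetic stand-in with two deliberately correlated columns:
%% Cell type:code id: tags:
```
import numpy as np
import pandas as pd

# Two synthetic, deliberately correlated columns (illustration only)
rng = np.random.default_rng(1)
v = rng.normal(size=200)
demo = pd.DataFrame({'velocity': v,
                     'sound_pressure': 0.8 * v + 0.2 * rng.normal(size=200)})

corr = demo.corr()  # Pearson correlation by default
print(corr.round(2))
# A styled view as above: corr.style.background_gradient(cmap='coolwarm')
```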
%% Cell type:markdown id: tags:
What do these numbers mean?
A value close to 1 means a strong positive correlation; at the other extreme, a value close to –1 implies a strong negative correlation. For more:
%% Cell type:markdown id: tags:
## Splitting the Data
We need to divide our entire dataset into fractions so that we have a training set from which the machine learning algorithm learns. We also need another set to test the predictions of the ML algorithm. There is no golden rule here: you need to consider the size of your entire dataset. Sometimes 5% is more than enough for testing; sometimes we need to set aside 1/3 to have enough test samples.
In our current example, the number of cases is quite low for an ML project. Therefore, let's leave a sufficient number of test cases:
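%% Cell type:markdown id: tags:
A split along these lines can be sketched with `train_test_split`, holding out 25% for testing (the X and y below are synthetic stand-ins for our feature matrix and labels):
%% Cell type:code id: tags:
```
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-ins for the feature matrix X and label vector y
X = np.arange(40).reshape(20, 2)
y = np.array([0, 1] * 10)

# 25% reserved for testing; random_state freezes the shuffling
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)
print(X_train.shape, X_test.shape)  # (15, 2) (5, 2)
```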
Here we have frozen the randomness to make the results reproducible. Otherwise, the results would change at every run.
%% Cell type:markdown id: tags:
## Cross-Validation
Another rule of thumb is to split the training set into sub-training sets and a validation set before seeing its true performance on the test set (25% of the dataset reserved above). This policy is called k-fold cross-validation: the training data is divided into k fractions, the model is trained on (k-1) of them and validated on the k-th. The idea is to increase the generalization of the model as much as possible.
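%% Cell type:markdown id: tags:
With the `StratifiedKFold` class imported earlier, each fold keeps the class ratio of the labels. A sketch on synthetic binary labels:
%% Cell type:code id: tags:
```
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Synthetic features and balanced binary labels (illustration only)
X = np.arange(20).reshape(10, 2)
y = np.array([0, 1] * 5)

# 5 folds: train on 4/5 of the data, validate on the remaining 1/5
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for fold, (train_idx, val_idx) in enumerate(skf.split(X, y)):
    print(f"fold {fold}: train {len(train_idx)}, validate {len(val_idx)}")
```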
In any supervised approach, we need to select a cost function to compare ML predictions with the true values (labels). The ML algorithm will minimize the cost function by changing its fitting parameters. You should spend some time defining the best cost function for your dataset and your objective.
In our case, we are dealing with binary classification, so let's try the binary classification log loss. You may see the following links for further information:
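%% Cell type:markdown id: tags:
The log loss is computed from the true labels and the predicted probabilities of the 'high' class. A sketch with made-up values (not our model's actual predictions):
%% Cell type:code id: tags:
```
import numpy as np
from sklearn.metrics import log_loss

# Made-up true labels and predicted 'high-noise' probabilities
y_true = np.array([0, 0, 1, 1])
y_prob = np.array([0.1, 0.3, 0.8, 0.9])

# Mean of -log(p) assigned to the correct class; lower is better
print(round(log_loss(y_true, y_prob), 4))  # 0.1976
```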
i) In general, we expect the training loss to be smaller than the CV loss. The reason is simple: the ML algorithm learns from the training data, so it should perform better on the training set; the CV data is not included in the training. In our case, we see that the training and cross-validation losses are similar for each run.