Commit c847416e authored by Cihan Ates

Colab buttons

parent 488c335f
%% Cell type:markdown id: tags:
# Active Session 8: Autoencoders
%% Cell type:markdown id: tags:
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/cihan-ates/data-driven-engineering/blob/master/DDE_I_ML_Dynamical_Systems/Lecture%2010/Lecture_8.ipynb)
%% Cell type:markdown id: tags:
# Important Note
Lecture notes and notebooks must not be copied and/or distributed without the express permission of ITS.
%% Cell type:markdown id: tags:
# 1. Problem Definition: Probe into the Data
In this session, we look into manufacturing error analysis. The dataset contains modified sensor readings collected on a manufacturing line, and we will use this multi-dimensional input to identify defective products via dimensionality reduction.
Note that the data set includes over 280,000 instances, where only a small fraction (~500) is defective.
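To put the imbalance in numbers, the following sketch uses the class counts that `data["Class"].value_counts()` reports later in this notebook (284,315 regular and 492 defective). The variable names are illustrative only.

```python
# Class counts as reported by data["Class"].value_counts() in this notebook
n_regular = 284315
n_defective = 492

n_total = n_regular + n_defective
defect_rate = n_defective / n_total

# A trivial model that always predicts "regular" already reaches this accuracy,
# so plain accuracy says little about how well defects are detected.
baseline_accuracy = n_regular / n_total

print(f"total instances  : {n_total}")
print(f"defect rate      : {defect_rate:.4%}")
print(f"baseline accuracy: {baseline_accuracy:.4%}")
```

This is one reason why precision and recall based metrics are a more informative choice for this problem.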
%% Cell type:markdown id: tags:
# 2. Preparing the environment
Import the Python libraries that we will need to (i) load the data, (ii) analyze it, (iii) create our model, (iv) process the results.
%% Cell type:code id: tags:
```
!pip install ipython-autotime
%load_ext autotime
```
%%%% Output: stream
Requirement already satisfied: ipython-autotime in /usr/local/lib/python3.6/dist-packages (0.3.1)
Requirement already satisfied: ipython in /usr/local/lib/python3.6/dist-packages (from ipython-autotime) (5.5.0)
Requirement already satisfied: pickleshare in /usr/local/lib/python3.6/dist-packages (from ipython->ipython-autotime) (0.7.5)
Requirement already satisfied: simplegeneric>0.8 in /usr/local/lib/python3.6/dist-packages (from ipython->ipython-autotime) (0.8.1)
Requirement already satisfied: pexpect; sys_platform != "win32" in /usr/local/lib/python3.6/dist-packages (from ipython->ipython-autotime) (4.8.0)
Requirement already satisfied: prompt-toolkit<2.0.0,>=1.0.4 in /usr/local/lib/python3.6/dist-packages (from ipython->ipython-autotime) (1.0.18)
Requirement already satisfied: traitlets>=4.2 in /usr/local/lib/python3.6/dist-packages (from ipython->ipython-autotime) (4.3.3)
Requirement already satisfied: decorator in /usr/local/lib/python3.6/dist-packages (from ipython->ipython-autotime) (4.4.2)
Requirement already satisfied: setuptools>=18.5 in /usr/local/lib/python3.6/dist-packages (from ipython->ipython-autotime) (51.1.1)
Requirement already satisfied: pygments in /usr/local/lib/python3.6/dist-packages (from ipython->ipython-autotime) (2.6.1)
Requirement already satisfied: ptyprocess>=0.5 in /usr/local/lib/python3.6/dist-packages (from pexpect; sys_platform != "win32"->ipython->ipython-autotime) (0.7.0)
Requirement already satisfied: wcwidth in /usr/local/lib/python3.6/dist-packages (from prompt-toolkit<2.0.0,>=1.0.4->ipython->ipython-autotime) (0.2.5)
Requirement already satisfied: six>=1.9.0 in /usr/local/lib/python3.6/dist-packages (from prompt-toolkit<2.0.0,>=1.0.4->ipython->ipython-autotime) (1.15.0)
Requirement already satisfied: ipython-genutils in /usr/local/lib/python3.6/dist-packages (from traitlets>=4.2->ipython->ipython-autotime) (0.2.0)
time: 1.37 ms (started: 2021-01-18 16:03:04 +00:00)
%% Cell type:code id: tags:
```
#Importing the necessary libraries
import numpy as np
import pandas as pd
import os
import io
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
import matplotlib.cm as cm
import seaborn as sns
color = sns.color_palette()
import matplotlib as mpl
#import altair as alt
import plotly.express as px
```
%%%% Output: stream
time: 586 ms (started: 2021-01-18 16:03:04 +00:00)
%% Cell type:code id: tags:
```
# Data Preparation
from sklearn import preprocessing as pp
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_recall_curve, average_precision_score
from sklearn.metrics import confusion_matrix, classification_report
```
%%%% Output: stream
time: 39 ms (started: 2021-01-18 16:03:05 +00:00)
%% Cell type:code id: tags:
```
# ML Algorithms to be used
import tensorflow as tf
import keras
from keras import backend as K
from keras.models import Sequential, Model
from keras import optimizers, models, layers, regularizers
from keras.layers import Activation, Dense, Dropout
from keras.layers import BatchNormalization, Input, Lambda
from keras.losses import mse, binary_crossentropy
```
%%%% Output: stream
time: 1.42 s (started: 2021-01-18 16:03:05 +00:00)
%% Cell type:markdown id: tags:
# 3. Pre-processing
%% Cell type:markdown id: tags:
## Loading the Data
We need to upload the dataset to the Colab environment. The pandas library offers a practical way to read the data, either from a local file or from a URL.
The data is on ILIAS, so this week you need to either upload it from your local PC or load it via a Google Drive link.
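For the local-upload route, an alternative to reading the file by name is to parse the uploaded bytes directly. The sketch below assumes that `files.upload()` in Colab returns a dictionary mapping each file name to its content as bytes (please verify this in your own session); the dictionary and the two-row CSV content here are made-up stand-ins.

```python
import io
import pandas as pd

# Stand-in for the dictionary assumed to be returned by files.upload();
# the CSV content is made up for illustration.
uploaded = {
    "manufacturing.csv": b"Time,S1,S2,Class\n0.0,-1.36,-0.07,regular\n1.0,1.19,0.27,defective\n"
}

# Wrap the raw bytes in a file-like object so that pandas can parse them
data_demo = pd.read_csv(io.BytesIO(uploaded["manufacturing.csv"]))
print(data_demo.head())
```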
%% Cell type:code id: tags:
```
# Loading the data
#Local drive:
#from google.colab import files
#uploaded = files.upload()
#data = pd.read_csv('manufacturing.csv')
#data.head()
```
%%%% Output: stream
time: 1.49 ms (started: 2021-01-18 16:03:06 +00:00)
%% Cell type:code id: tags:
```
# Loading the data
!pip install -U -q PyDrive
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials
# Authenticate and create the PyDrive client.
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)
```
%%%% Output: stream
time: 2.22 s (started: 2021-01-18 16:03:06 +00:00)
%% Cell type:code id: tags:
```
downloaded = drive.CreateFile({'id':'1mQZTB5gEkal_qH0TiivXAJY082Wclfbo'})
downloaded.GetContentFile('manufacturing.csv')
data = pd.read_csv('manufacturing.csv')
data.head()
```
%%%% Output: execute_result
Time S1 S2 S3 ... S26 S27 S28 Class
0 0.0 -1.359807 -0.072781 2.536347 ... -0.189115 0.133558 -0.021053 regular
1 0.0 1.191857 0.266151 0.166480 ... 0.125895 -0.008983 0.014724 regular
2 1.0 -1.358354 -1.340163 1.773209 ... -0.139097 -0.055353 -0.059752 regular
3 1.0 -0.966272 -0.185226 1.792993 ... -0.221929 0.062723 0.061458 regular
4 2.0 -1.158233 0.877737 1.548718 ... 0.502292 0.219422 0.215153 regular
[5 rows x 30 columns]
%%%% Output: stream
time: 9.51 s (started: 2021-01-18 16:03:08 +00:00)
%% Cell type:code id: tags:
```
os.listdir('.')
```
%%%% Output: execute_result
['.config',
'my_best_model_AE_2.h5',
'my_best_model_AE_reLU.h5',
'manufacturing.csv',
'my_best_model_AE_int.h5',
'adc.json',
'my_best_model_AE.h5',
'my_best_model_noise.h5',
'my_best_model_AE_Sparse.h5',
'my_best_model_AE_Sparse_2.h5',
'sample_data']
%%%% Output: stream
time: 3.48 ms (started: 2021-01-18 16:03:18 +00:00)
%% Cell type:markdown id: tags:
## Data Exploration
Here we look into the statistics of the data and identify any missing values or categorical features that need further processing.
Let’s analyze our dataset first. Use `data.head(n)` to display the first n rows. You can change `data.head(n)` to `data.sample(n)` to display n randomly picked rows:
%% Cell type:code id: tags:
```
data.sample(5)
```
%%%% Output: execute_result
Time S1 S2 ... S27 S28 Class
132851 80143.0 -2.635638 2.371788 ... -0.250755 -0.006164 regular
118123 74958.0 -1.060513 0.080976 ... -0.075486 0.053509 regular
132847 80141.0 -1.347806 0.513449 ... -0.370661 -0.022077 regular
48752 43773.0 1.015943 -0.013354 ... 0.025135 0.015417 regular
17983 29087.0 1.356317 -0.392223 ... -0.030075 -0.002300 regular
[5 rows x 30 columns]
%%%% Output: stream
time: 54.6 ms (started: 2021-01-18 16:03:18 +00:00)
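%% Cell type:markdown id: tags:
Note that `sample` draws different rows on every call. One possible way to make the draw repeatable is to pass `random_state`; the small frame below is a made-up stand-in for `data`.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for the manufacturing data
demo = pd.DataFrame({"Time": np.arange(10.0), "S1": np.linspace(-1.0, 1.0, 10)})

first_rows = demo.head(3)                     # always the first three rows
random_rows = demo.sample(3, random_state=0)  # three random rows, fixed seed
same_rows = demo.sample(3, random_state=0)    # same seed, hence the same rows

print(first_rows)
print(random_rows)
```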
%% Cell type:code id: tags:
```
data.info()
```
%%%% Output: stream
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 284807 entries, 0 to 284806
Data columns (total 30 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Time 284807 non-null float64
1 S1 284807 non-null float64
2 S2 284807 non-null float64
3 S3 284807 non-null float64
4 S4 284807 non-null float64
5 S5 284807 non-null float64
6 S6 284807 non-null float64
7 S7 284807 non-null float64
8 S8 284807 non-null float64
9 S9 284807 non-null float64
10 S10 284807 non-null float64
11 S11 284807 non-null float64
12 S12 284807 non-null float64
13 S13 284807 non-null float64
14 S14 284807 non-null float64
15 S15 284807 non-null float64
16 S16 284807 non-null float64
17 S17 284807 non-null float64
18 S18 284807 non-null float64
19 S19 284807 non-null float64
20 S20 284807 non-null float64
21 S21 284807 non-null float64
22 S22 284807 non-null float64
23 S23 284807 non-null float64
24 S24 284807 non-null float64
25 S25 284807 non-null float64
26 S26 284807 non-null float64
27 S27 284807 non-null float64
28 S28 284807 non-null float64
29 Class 284807 non-null object
dtypes: float64(29), object(1)
memory usage: 65.2+ MB
time: 45 ms (started: 2021-01-18 16:03:18 +00:00)
%% Cell type:markdown id: tags:
## Label Encoding
Unlike the previous airfoil example, here the labels are stored in an object (string) column (see `Class`). Let's convert these string labels into numerical values. You can either work on the pandas dataframe directly or use scikit-learn. For the scikit-learn implementation, you may check:
https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html
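For comparison, a minimal sketch of the scikit-learn route is given below on a made-up label column. `LabelEncoder` assigns the integer codes in sorted order of the class names, so `defective` would become 0 and `regular` would become 1, which is the opposite of the convention we want here.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Made-up label column that mimics data["Class"]
labels = pd.Series(["regular", "defective", "regular", "regular"])

encoder = LabelEncoder()
encoded = encoder.fit_transform(labels)

# classes_ lists the original names in the order of their integer codes
print(list(encoder.classes_))
print(encoded)
```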
For our case, we will directly work on the dataframe and customize the labels as we want:
%% Cell type:code id: tags:
```
data["Class"].value_counts()
```
%%%% Output: execute_result
regular 284315
defective 492
Name: Class, dtype: int64
%%%% Output: stream
time: 31.7 ms (started: 2021-01-18 16:03:18 +00:00)
%% Cell type:markdown id: tags:
As you can see, we have two categories: `regular` and `defective`. We want to label them so that regular products are `0` and defective ones are `1`. To do so, we can use the `str` accessor together with `np.where` to modify the target column:
%% Cell type:code id: tags:
```
data["Class"] = np.where(data["Class"].str.contains("reg"), 0, 1)
data["Class"].value_counts()
```
%%%% Output: execute_result
0 284315
1 492
Name: Class, dtype: int64
%%%% Output: stream
time: 115 ms (started: 2021-01-18 16:03:18 +00:00)
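%% Cell type:markdown id: tags:
To see what the two steps do individually, here is a small sketch on a made-up column. An explicit dictionary with `map` is shown as a possible alternative; with it, an unexpected label shows up as NaN instead of silently becoming 1.

```python
import numpy as np
import pandas as pd

toy = pd.Series(["regular", "defective", "regular"])  # made-up labels

# Step 1: the str accessor returns a boolean mask
mask = toy.str.contains("reg")

# Step 2: np.where picks 0 where the mask is True and 1 elsewhere
encoded_where = np.where(mask, 0, 1)

# Alternative: explicit mapping; labels missing from the dictionary become NaN
encoded_map = toy.map({"regular": 0, "defective": 1})

print(mask.tolist())
print(encoded_where)
print(encoded_map.tolist())
```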
%% Cell type:code id: tags:
```
data.info()
```
%%%% Output: stream
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 284807 entries, 0 to 284806
Data columns (total 30 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Time 284807 non-null float64
1 S1 284807 non-null float64
2 S2 284807 non-null float64
3 S3 284807 non-null float64
4 S4 284807 non-null float64
5 S5 284807 non-null float64
6 S6 284807 non-null float64
7 S7 284807 non-null float64
8 S8 284807 non-null float64
9 S9 284807 non-null float64
10 S10 284807 non-null float64
11 S11 284807 non-null float64
12 S12 284807 non-null float64
13 S13 284807 non-null float64
14 S14 284807 non-null float64
15 S15 284807 non-null float64
16 S16 284807 non-null float64
17 S17 284807 non-null float64
18 S18 284807 non-null float64
19 S19 284807 non-null float64
20 S20 284807 non-null float64
21 S21 284807 non-null float64
22 S22 284807 non-null float64
23 S23 284807 non-null float64
24 S24 284807 non-null float64
25 S25 284807 non-null float64
26 S26 284807 non-null float64
27 S27 284807 non-null float64
28 S28 284807 non-null float64
29 Class 284807 non-null int64
dtypes: float64(29), int64(1)
memory usage: 65.2 MB
time: 30.8 ms (started: 2021-01-18 16:03:18 +00:00)
%% Cell type:markdown id: tags:
Let's look into the statistics of the data. This is usually a good starting point to get an idea about the range and nature of the data, as well as about missing information for individual features (a `count` below the number of rows indicates missing values):
%% Cell type:code id: tags:
```
data.describe()
```
%%%% Output: execute_result
Time S1 ... S28 Class
count 284807.000000 2.848070e+05 ... 2.848070e+05 284807.000000
mean 94813.859575 1.758743e-12 ... -3.518886e-12 0.001727
std 47488.145955 1.958696e+00 ... 3.300833e-01 0.041527
min 0.000000 -5.640751e+01 ... -1.543008e+01 0.000000
25% 54201.500000 -9.203734e-01 ... -5.295979e-02 0.000000
50% 84692.000000 1.810880e-02 ... 1.124383e-02 0.000000
75% 139320.500000 1.315642e+00 ... 7.827995e-02 0.000000
max 172792.000000 2.454930e+00 ... 3.384781e+01 1.000000
[8 rows x 30 columns]
%%%% Output: stream
time: 381 ms (started: 2021-01-18 16:03:18 +00:00)
%% Cell type:markdown id: tags:
If you did not define the column names yourself, you can list the features with the following command:
%% Cell type:code id: tags:
```
data.columns
```
%%%% Output: execute_result
Index(['Time', 'S1', 'S2', 'S3', 'S4', 'S5', 'S6', 'S7', 'S8', 'S9', 'S10',
'S11', 'S12', 'S13', 'S14', 'S15', 'S16', 'S17', 'S18', 'S19', 'S20',
'S21', 'S22', 'S23', 'S24', 'S25', 'S26', 'S27', 'S28', 'Class'],
dtype='object')
%%%% Output: stream
time: 4.93 ms (started: 2021-01-18 16:03:19 +00:00)
%% Cell type:markdown id: tags:
It is also possible to explore individual features:
%% Cell type:code id: tags:
```
data['S12'].median()
```
%%%% Output: execute_result
0.140032588
%%%% Output: stream
time: 7.77 ms (started: 2021-01-18 16:03:19 +00:00)
%% Cell type:code id: tags:
```
data['S12'].mean()
```
%%%% Output: execute_result
1.0534877568289322e-12
%%%% Output: stream
time: 2.81 ms (started: 2021-01-18 16:03:19 +00:00)
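%% Cell type:markdown id: tags:
For `S12`, the median (about 0.14) lies above the mean (about 0). Such a gap often hints at a tail towards low values. The sketch below illustrates the effect on a made-up series with one low outlier; it is not computed from the actual dataset.

```python
import pandas as pd

# Made-up feature values with a single low outlier
toy = pd.Series([-10.0, 0.0, 0.1, 0.2, 0.3])

toy_mean = toy.mean()      # pulled down by the outlier
toy_median = toy.median()  # robust against the outlier
toy_skew = toy.skew()      # negative for a tail towards low values

print(toy_mean, toy_median, toy_skew)
```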
%% Cell type:markdown id: tags:
## Identify non-numerical values
Some ML algorithms cannot handle missing or non-numerical entries (NaN: not a number), so you may need to identify the type of the data for each feature and modify it if necessary. It is also quite common that different feature values are missing for different instances / examples, so you need to decide what to do: (i) omit the instance; (ii) replace the missing values with the mean / median / mode of the feature; (iii) substitute them with a value of your choice.
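The three options can be sketched as follows on a made-up frame with one missing entry (the column names are borrowed from the dataset, the values are not):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"S1": [1.0, np.nan, 3.0, 5.0], "S2": [0.5, 0.1, 0.2, 0.4]})

dropped = toy.dropna()              # omit instances that have a missing value
imputed = toy.fillna(toy.median())  # replace with the median of each feature
constant = toy.fillna(0.0)          # substitute a value of your choice

print(dropped.shape)
print(imputed["S1"].tolist())
print(constant["S1"].tolist())
```

Which option is appropriate depends on how many values are missing and on how sensitive the model is to the imputed values.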
The following line counts the NaNs for each feature. Note that in previous exercises we used NumPy for this task, not pandas. Here we use `pd.isnull` instead, since `np.isnan` only accepts numeric input and raises a TypeError on object columns such as the original string labels. `pd.isnull` takes a scalar or array-like object and indicates whether values are missing (NaN in numeric arrays, None or NaN in object arrays, NaT in datetimelike).
%% Cell type:code id: tags:
```
nanCounter = pd.isnull(data).sum()
print(nanCounter)
```
%%%% Output: stream
Time 0
S1 0
S2 0
S3 0
S4 0
S5 0
S6 0
S7 0
S8 0
S9 0
S10 0
S11 0
S12 0
S13 0
S14 0
S15 0
S16 0
S17 0
S18 0
S19 0
S20 0
S21 0
S22 0
S23 0
S24 0
S25 0
S26 0
S27 0
S28 0
Class 0
dtype: int64
time: 29.6 ms (started: 2021-01-18 16:03:19 +00:00)
%% Cell type:markdown id: tags:
It is also a good exercise to check the uniqueness of the data, that is, how many distinct values each feature takes and whether values repeat across instances:
%% Cell type:code id: tags:
```
distinctCounter = data.apply(lambda x: len(x.unique()))
print(distinctCounter)
```
%%%% Output: stream
Time 124592
S1 275653
S2 275655
S3 275657
S4 275654
S5 275657
S6 275652
S7 275651
S8 275643
S9 275656
S10 275646
S11 275648
S12 275654
S13 275657
S14 275653
S15 275653
S16 275645
S17 275646
S18 275655
S19 275645
S20 275632
S21 275617
S22 275644
S23 275611
S24 275645
S25 275640
S26 275647
S27 275597
S28 275558
Class 2
dtype: int64
time: 216 ms (started: 2021-01-18 16:03:19 +00:00)
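%% Cell type:markdown id: tags:
Counting distinct values per column does not tell us whether entire rows repeat. A possible follow-up, sketched here on a made-up frame, is `DataFrame.duplicated`. `DataFrame.nunique` is also shown as a shorter route to the per-column counts; it ignores NaN by default, which makes no difference when nothing is missing.

```python
import pandas as pd

# Made-up frame in which the third row repeats the first one
toy = pd.DataFrame({"S1": [0.1, 0.2, 0.1, 0.3], "Class": [0, 0, 0, 1]})

distinct_per_column = toy.nunique()        # distinct values in each column
n_duplicate_rows = toy.duplicated().sum()  # rows identical to an earlier row

print(distinct_per_column)
print(n_duplicate_rows)
```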
%% Cell type:markdown id: tags:
## Data Visualization
Another important pre-processing step is data visualization. Histograms are suitable for a holistic view, where we can probe into the distribution of each attribute.
We can use the `hist` method of the pandas dataframe, which draws the plots with matplotlib, for that purpose:
%% Cell type:code id: tags:
```
data.hist(bins=50, figsize=(25,15))
plt.show()
```
%%%% Output: display_data
[Figure: histograms of all 30 columns, 50 bins each]
%%%% Output: stream
time: 5.3 s (started: 2021-01-18 16:03:19 +00:00)
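%% Cell type:markdown id: tags:
Each histogram panel is a binned count of one column. When you need the numbers behind a plot, the same count can be obtained with `np.histogram`; the sketch below does this for made-up, normally distributed values rather than for the real features.

```python
import numpy as np

rng = np.random.default_rng(0)
values = rng.normal(loc=0.0, scale=1.0, size=1000)  # made-up feature values

# 50 equal-width bins between the smallest and the largest value
counts, edges = np.histogram(values, bins=50)

print(counts.sum())  # total number of values that were binned
print(len(edges))    # number of bin edges
```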
%% Cell type:markdown id: tags: