Commit 488c335f authored by Cihan Ates

headers

parent 70c2d8df
%% Cell type:markdown id: tags:
# Active Session 8: Autoencoders
%% Cell type:markdown id: tags:
# Important Note
Lecture notes and notebooks must not be copied and/or distributed without the express permission of ITS.
%% Cell type:markdown id: tags:
# 1. Problem Definition: Probe into the Data
In this session, we will look into manufacturing error analysis. The dataset includes modified sensory inputs collected on the manufacturing line, and we will use this multi-dimensional input data to predict defective products via dimensionality reduction.
Note that the dataset includes over 280,000 instances, of which only a small fraction (~500) are defective.
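%% Cell type:markdown id: tags:
To make this imbalance concrete, the cell below is a minimal conceptual sketch (with made-up reconstruction errors, not the real sensor data) of the approach we will follow: compress the inputs, measure how well each instance is reconstructed, and flag the worst-reconstructed instances as potential defects.
%% Cell type:code id: tags:
```
# Conceptual sketch only (hypothetical numbers, not the manufacturing data):
# train a compression model on mostly-regular data, then flag instances with a
# large reconstruction error as potentially defective.
import numpy as np

rng = np.random.default_rng(0)
reconstruction_error = rng.exponential(scale=1.0, size=280_000)  # stand-in errors
threshold = np.quantile(reconstruction_error, 0.998)             # keep only the worst ~0.2%
flagged = reconstruction_error > threshold
print(f"Flagged {flagged.sum()} of {flagged.size} instances as potentially defective")
```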
%% Cell type:markdown id: tags:
# 2. Preparing the environment
Import the Python libraries that we will need to (i) load the data, (ii) analyze it, (iii) create our model, and (iv) process the results.
%% Cell type:code id: tags:
```
!pip install ipython-autotime
%load_ext autotime
```
%%%% Output: stream
Requirement already satisfied: ipython-autotime in /usr/local/lib/python3.6/dist-packages (0.3.1)
Requirement already satisfied: ipython in /usr/local/lib/python3.6/dist-packages (from ipython-autotime) (5.5.0)
Requirement already satisfied: pickleshare in /usr/local/lib/python3.6/dist-packages (from ipython->ipython-autotime) (0.7.5)
Requirement already satisfied: simplegeneric>0.8 in /usr/local/lib/python3.6/dist-packages (from ipython->ipython-autotime) (0.8.1)
Requirement already satisfied: pexpect; sys_platform != "win32" in /usr/local/lib/python3.6/dist-packages (from ipython->ipython-autotime) (4.8.0)
Requirement already satisfied: prompt-toolkit<2.0.0,>=1.0.4 in /usr/local/lib/python3.6/dist-packages (from ipython->ipython-autotime) (1.0.18)
Requirement already satisfied: traitlets>=4.2 in /usr/local/lib/python3.6/dist-packages (from ipython->ipython-autotime) (4.3.3)
Requirement already satisfied: decorator in /usr/local/lib/python3.6/dist-packages (from ipython->ipython-autotime) (4.4.2)
Requirement already satisfied: setuptools>=18.5 in /usr/local/lib/python3.6/dist-packages (from ipython->ipython-autotime) (51.1.1)
Requirement already satisfied: pygments in /usr/local/lib/python3.6/dist-packages (from ipython->ipython-autotime) (2.6.1)
Requirement already satisfied: ptyprocess>=0.5 in /usr/local/lib/python3.6/dist-packages (from pexpect; sys_platform != "win32"->ipython->ipython-autotime) (0.7.0)
Requirement already satisfied: wcwidth in /usr/local/lib/python3.6/dist-packages (from prompt-toolkit<2.0.0,>=1.0.4->ipython->ipython-autotime) (0.2.5)
Requirement already satisfied: six>=1.9.0 in /usr/local/lib/python3.6/dist-packages (from prompt-toolkit<2.0.0,>=1.0.4->ipython->ipython-autotime) (1.15.0)
Requirement already satisfied: ipython-genutils in /usr/local/lib/python3.6/dist-packages (from traitlets>=4.2->ipython->ipython-autotime) (0.2.0)
time: 1.37 ms (started: 2021-01-18 16:03:04 +00:00)
%% Cell type:code id: tags:
```
# Importing the necessary libraries
import numpy as np
import pandas as pd
import os
import io
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
import matplotlib.cm as cm
import seaborn as sns
color = sns.color_palette()
import matplotlib as mpl
#import altair as alt
import plotly.express as px
```
%%%% Output: stream
time: 586 ms (started: 2021-01-18 16:03:04 +00:00)
%% Cell type:code id: tags:
```
# Data Preparation
from sklearn import preprocessing as pp
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_recall_curve, average_precision_score
from sklearn.metrics import confusion_matrix, classification_report
```
%%%% Output: stream
time: 39 ms (started: 2021-01-18 16:03:05 +00:00)
%% Cell type:code id: tags:
```
# ML Algorithms to be used
import tensorflow as tf
import keras
from keras import backend as K
from keras.models import Sequential, Model
from keras import optimizers, models, layers, regularizers
from keras.layers import Activation, Dense, Dropout
from keras.layers import BatchNormalization, Input, Lambda
from keras.losses import mse, binary_crossentropy
```
%%%% Output: stream
time: 1.42 s (started: 2021-01-18 16:03:05 +00:00)
%% Cell type:markdown id: tags:
# 3. Pre-processing
%% Cell type:markdown id: tags:
## Loading the Data
We need to upload the dataset to the Colab environment. The pandas library is a practical way to load and read the data from a URL.
The data is on ILIAS, so this week you need to upload it from your local PC or via a Google Drive link.
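%% Cell type:markdown id: tags:
If you host the file somewhere reachable by URL, pandas can read it directly; the commented sketch below assumes a plain CSV behind a link (the address is a placeholder, not the actual course location):
%% Cell type:code id: tags:
```
# Alternative (sketch): read the CSV directly from a URL with pandas.
# The URL below is a placeholder -- replace it with your own link to manufacturing.csv.
#data = pd.read_csv('https://example.com/path/to/manufacturing.csv')
#data.head()
```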
%% Cell type:code id: tags:
```
# Loading the data
#Local drive:
#from google.colab import files
#uploaded = files.upload()
#data = pd.read_csv('manufacturing.csv')
#data.head()
```
%%%% Output: stream
time: 1.49 ms (started: 2021-01-18 16:03:06 +00:00)
%% Cell type:code id: tags:
```
# Loading the data
!pip install -U -q PyDrive
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials
# Authenticate and create the PyDrive client.
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)
```
%%%% Output: stream
time: 2.22 s (started: 2021-01-18 16:03:06 +00:00)
%% Cell type:code id: tags:
```
downloaded = drive.CreateFile({'id':'1mQZTB5gEkal_qH0TiivXAJY082Wclfbo'})
downloaded.GetContentFile('manufacturing.csv')
data = pd.read_csv('manufacturing.csv')
data.head()
```
%%%% Output: execute_result
Time S1 S2 S3 ... S26 S27 S28 Class
0 0.0 -1.359807 -0.072781 2.536347 ... -0.189115 0.133558 -0.021053 regular
1 0.0 1.191857 0.266151 0.166480 ... 0.125895 -0.008983 0.014724 regular
2 1.0 -1.358354 -1.340163 1.773209 ... -0.139097 -0.055353 -0.059752 regular
3 1.0 -0.966272 -0.185226 1.792993 ... -0.221929 0.062723 0.061458 regular
4 2.0 -1.158233 0.877737 1.548718 ... 0.502292 0.219422 0.215153 regular
[5 rows x 30 columns]
%%%% Output: stream
time: 9.51 s (started: 2021-01-18 16:03:08 +00:00)
%% Cell type:code id: tags:
```
os.listdir('.')
```
%%%% Output: execute_result
['.config',
'my_best_model_AE_2.h5',
'my_best_model_AE_reLU.h5',
'manufacturing.csv',
'my_best_model_AE_int.h5',
'adc.json',
'my_best_model_AE.h5',
'my_best_model_noise.h5',
'my_best_model_AE_Sparse.h5',
'my_best_model_AE_Sparse_2.h5',
'sample_data']
%%%% Output: stream
time: 3.48 ms (started: 2021-01-18 16:03:18 +00:00)
%% Cell type:markdown id: tags:
## Data Exploration
Here we will look into the statistics of the data and identify any missing values or categorical features that need further processing.
Let's analyze our dataset first. Use `data.head(n)` to display the first n rows; change it to `data.sample(n)` to display randomly picked rows:
%% Cell type:code id: tags:
```
data.sample(5)
```
%%%% Output: execute_result
Time S1 S2 ... S27 S28 Class
132851 80143.0 -2.635638 2.371788 ... -0.250755 -0.006164 regular
118123 74958.0 -1.060513 0.080976 ... -0.075486 0.053509 regular
132847 80141.0 -1.347806 0.513449 ... -0.370661 -0.022077 regular
48752 43773.0 1.015943 -0.013354 ... 0.025135 0.015417 regular
17983 29087.0 1.356317 -0.392223 ... -0.030075 -0.002300 regular
[5 rows x 30 columns]
%%%% Output: stream
time: 54.6 ms (started: 2021-01-18 16:03:18 +00:00)
%% Cell type:code id: tags:
```
data.info()
```
%%%% Output: stream
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 284807 entries, 0 to 284806
Data columns (total 30 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Time 284807 non-null float64
1 S1 284807 non-null float64
2 S2 284807 non-null float64
3 S3 284807 non-null float64
4 S4 284807 non-null float64
5 S5 284807 non-null float64
6 S6 284807 non-null float64
7 S7 284807 non-null float64
8 S8 284807 non-null float64
9 S9 284807 non-null float64
10 S10 284807 non-null float64
11 S11 284807 non-null float64
12 S12 284807 non-null float64
13 S13 284807 non-null float64
14 S14 284807 non-null float64
15 S15 284807 non-null float64
16 S16 284807 non-null float64
17 S17 284807 non-null float64
18 S18 284807 non-null float64
19 S19 284807 non-null float64
20 S20 284807 non-null float64
21 S21 284807 non-null float64
22 S22 284807 non-null float64
23 S23 284807 non-null float64
24 S24 284807 non-null float64
25 S25 284807 non-null float64
26 S26 284807 non-null float64
27 S27 284807 non-null float64
28 S28 284807 non-null float64
29 Class 284807 non-null object
dtypes: float64(29), object(1)
memory usage: 65.2+ MB
time: 45 ms (started: 2021-01-18 16:03:18 +00:00)
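%% Cell type:markdown id: tags:
As a quick follow-up on the statistics and missing values mentioned at the start of this section, a short optional check could look like this:
%% Cell type:code id: tags:
```
# Optional check: summary statistics of the numeric columns and a missing-value count
print(data.describe())
print(data.isnull().sum().sum(), "missing entries in total")   # expected: 0
```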
%% Cell type:markdown id: tags:
## Label Encoding
Unlike the previous airfoil example, here we have an object column for the labels (see `Class`). Let's convert these categorical values into numerical labels. Here you can either work on the pandas DataFrame directly or use scikit-learn. For the scikit-learn implementation, you may check:
https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html
For our case, we will work directly on the DataFrame and customize the labels as we want:
%% Cell type:code id: tags:
```
data["Class"].value_counts()
```
%%%% Output: execute_result
regular 284315
defective 492
Name: Class, dtype: int64
%%%% Output: stream
time: 31.7 ms (started: 2021-01-18 16:03:18 +00:00)
%% Cell type:markdown id: tags:
As you can see, we have two categories: `regular` and `defective`. We want to label them such that regular instances are '0' and defective ones are '1'. In this case, we can use the `str` accessor plus `np.where` to modify the target column:
%% Cell type:code id: tags:
```
data["Class"] = np.where(data["Class"].str.contains("reg"), 0, 1)
data["Class"].value_counts()
```
%%%% Output: execute_result
0 284315
1 492
Name: Class, dtype: int64
%%%% Output: stream
time: 115 ms (started: 2021-01-18 16:03:18 +00:00)
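%% Cell type:markdown id: tags:
The scikit-learn route linked above would look roughly like the sketch below, shown on a small toy series since `Class` has already been converted to 0/1 at this point. Note that `LabelEncoder` assigns integer codes alphabetically, so `defective` would become 0 and `regular` 1, i.e. the opposite of our custom mapping:
%% Cell type:code id: tags:
```
# Alternative sketch with scikit-learn's LabelEncoder (pp = sklearn.preprocessing, imported above).
# Demonstrated on a toy series; LabelEncoder orders the classes alphabetically.
toy_labels = pd.Series(["regular", "defective", "regular"])
le = pp.LabelEncoder()
print(le.fit_transform(toy_labels))   # -> [1 0 1]
print(list(le.classes_))              # -> ['defective', 'regular']
```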
%% Cell type:code id: tags:
```
data.info()
```
%%%% Output: stream
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 284807 entries, 0 to 284806
Data columns (total 30 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Time 284807 non-null float64
1 S1 284807 non-null float64
2 S2 284807 non-null float64
3 S3 284807 non-null float64
4 S4 284807 non-null float64
5 S5 284807 non-null float64
6 S6 284807 non-null float64
7 S7 284807 non-null float64
8 S8 284807 non-null float64
9 S9 284807 non-null float64
10 S10 284807 non-null float64