Lecture notes and notebooks must not be copied and/or distributed without the express permission of ITS.
%% Cell type:markdown id: tags:
# 1. Problem Definition: Probe into the Data
In this exercise, we look into manufacturing error analysis. The dataset contains modified sensor readings collected on the manufacturing line, and we will use this multi-dimensional input data to predict defective products via dimensionality reduction.
Note that the dataset includes over 280,000 instances, of which only a small fraction (~500) are defective.
%% Cell type:markdown id: tags:
# 2. Preparing the environment
Import the Python libraries that we will need to (i) load the data, (ii) analyze it, (iii) create our model, (iv) process the results.
%% Cell type:code id: tags:
```
!pip install ipython-autotime
%load_ext autotime
```
%%%% Output: stream
Requirement already satisfied: ipython-autotime in /usr/local/lib/python3.6/dist-packages (0.3.1)
Requirement already satisfied: ipython in /usr/local/lib/python3.6/dist-packages (from ipython-autotime) (5.5.0)
Requirement already satisfied: pickleshare in /usr/local/lib/python3.6/dist-packages (from ipython->ipython-autotime) (0.7.5)
Requirement already satisfied: simplegeneric>0.8 in /usr/local/lib/python3.6/dist-packages (from ipython->ipython-autotime) (0.8.1)
time: 9.51 s (started: 2021-01-18 16:03:08 +00:00)
%% Cell type:code id: tags:
```
import os  # needed for os.listdir
os.listdir('.')
```
%%%% Output: execute_result
['.config',
'my_best_model_AE_2.h5',
'my_best_model_AE_reLU.h5',
'manufacturing.csv',
'my_best_model_AE_int.h5',
'adc.json',
'my_best_model_AE.h5',
'my_best_model_noise.h5',
'my_best_model_AE_Sparse.h5',
'my_best_model_AE_Sparse_2.h5',
'sample_data']
%%%% Output: stream
time: 3.48 ms (started: 2021-01-18 16:03:18 +00:00)
%% Cell type:markdown id: tags:
## Data Exploration
Here we will look into the statistics of the data and identify any missing values or categorical features that need further processing.
Let's analyze our dataset first. Use `dataset.head(n)` to display the first n rows. You can change `dataset.head(n)` to `dataset.sample(n)` to display n randomly picked rows:
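The loading cell is not shown in these notes, so the following is only a minimal sketch. It assumes the data comes from `manufacturing.csv` (the file in the directory listing above) and is read with `pd.read_csv` into a frame called `data`, the name used in the later cells (the text above calls it `dataset`). The fallback toy frame is hypothetical and only lets the sketch run without the file.

```python
import os

import numpy as np
import pandas as pd

csv_path = "manufacturing.csv"  # assumed: the CSV shown in the directory listing

if os.path.exists(csv_path):
    data = pd.read_csv(csv_path)
else:
    # Hypothetical stand-in so the sketch runs without the file:
    # a few random sensor columns plus a string label column.
    rng = np.random.default_rng(0)
    data = pd.DataFrame(rng.normal(size=(10, 3)), columns=["Time", "S1", "S2"])
    data["Class"] = ["regular"] * 9 + ["defective"]

top = data.head(5)                       # first 5 rows
picked = data.sample(5, random_state=0)  # 5 randomly picked rows (reproducible)
print(top)
print(picked)
```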
time: 54.6 ms (started: 2021-01-18 16:03:18 +00:00)
%% Cell type:code id: tags:
```
data.info()
```
%%%% Output: stream
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 284807 entries, 0 to 284806
Data columns (total 30 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Time 284807 non-null float64
1 S1 284807 non-null float64
2 S2 284807 non-null float64
3 S3 284807 non-null float64
4 S4 284807 non-null float64
5 S5 284807 non-null float64
6 S6 284807 non-null float64
7 S7 284807 non-null float64
8 S8 284807 non-null float64
9 S9 284807 non-null float64
10 S10 284807 non-null float64
11 S11 284807 non-null float64
12 S12 284807 non-null float64
13 S13 284807 non-null float64
14 S14 284807 non-null float64
15 S15 284807 non-null float64
16 S16 284807 non-null float64
17 S17 284807 non-null float64
18 S18 284807 non-null float64
19 S19 284807 non-null float64
20 S20 284807 non-null float64
21 S21 284807 non-null float64
22 S22 284807 non-null float64
23 S23 284807 non-null float64
24 S24 284807 non-null float64
25 S25 284807 non-null float64
26 S26 284807 non-null float64
27 S27 284807 non-null float64
28 S28 284807 non-null float64
29 Class 284807 non-null object
dtypes: float64(29), object(1)
memory usage: 65.2+ MB
time: 45 ms (started: 2021-01-18 16:03:18 +00:00)
%% Cell type:markdown id: tags:
## Label Encoding
Unlike the previous airfoil example, here we have an object column for the labels (see `Class`). Let's convert these non-numerical labels into numerical categories. You can do this either directly on the pandas DataFrame or with scikit-learn. For a scikit-learn implementation, you may check:
For our case, we will work directly on the DataFrame and customize the labels as we want:
%% Cell type:code id: tags:
```
data["Class"].value_counts()
```
%%%% Output: execute_result
regular 284315
defective 492
Name: Class, dtype: int64
%%%% Output: stream
time: 31.7 ms (started: 2021-01-18 16:03:18 +00:00)
%% Cell type:markdown id: tags:
As you can see, we have two categories: `regular` and `defective`. We want to label them so that regulars are `0` and defectives are `1`. In this case, we can use the `str` accessor plus `np.where` to modify the target column:
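The cell that performs this conversion is not shown here, so the following is a sketch of one way to combine the `str` accessor with `np.where`. The `toy` frame is a hypothetical stand-in for `data`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy labels standing in for data["Class"]
toy = pd.DataFrame({"Class": ["regular", "defective", "regular", "regular"]})

# str.contains yields a boolean mask; np.where maps True -> 1 and False -> 0
toy["Class"] = np.where(toy["Class"].str.contains("defective"), 1, 0)
print(toy["Class"].tolist())  # [0, 1, 0, 0]
```

On the notebook's frame the same line would read `data["Class"] = np.where(data["Class"].str.contains("defective"), 1, 0)`.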
time: 115 ms (started: 2021-01-18 16:03:18 +00:00)
%% Cell type:code id: tags:
```
data.info()
```
%%%% Output: stream
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 284807 entries, 0 to 284806
Data columns (total 30 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Time 284807 non-null float64
1 S1 284807 non-null float64
2 S2 284807 non-null float64
3 S3 284807 non-null float64
4 S4 284807 non-null float64
5 S5 284807 non-null float64
6 S6 284807 non-null float64
7 S7 284807 non-null float64
8 S8 284807 non-null float64
9 S9 284807 non-null float64
10 S10 284807 non-null float64
11 S11 284807 non-null float64
12 S12 284807 non-null float64
13 S13 284807 non-null float64
14 S14 284807 non-null float64
15 S15 284807 non-null float64
16 S16 284807 non-null float64
17 S17 284807 non-null float64
18 S18 284807 non-null float64
19 S19 284807 non-null float64
20 S20 284807 non-null float64
21 S21 284807 non-null float64
22 S22 284807 non-null float64
23 S23 284807 non-null float64
24 S24 284807 non-null float64
25 S25 284807 non-null float64
26 S26 284807 non-null float64
27 S27 284807 non-null float64
28 S28 284807 non-null float64
29 Class 284807 non-null int64
dtypes: float64(29), int64(1)
memory usage: 65.2 MB
time: 30.8 ms (started: 2021-01-18 16:03:18 +00:00)
%% Cell type:markdown id: tags:
Let's look into the statistics of the data. This is usually a good starting point for getting an idea of the range and nature of the data.
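The statistics cell is not included in these notes. A common way to get such a summary is `DataFrame.describe()`, sketched here on a hypothetical toy frame; using it on `data` is an assumption, not necessarily the original cell.

```python
import pandas as pd

# Hypothetical toy frame standing in for the sensor columns
toy = pd.DataFrame({"S1": [1.0, 2.0, 3.0, 4.0],
                    "S2": [10.0, 10.0, 20.0, 40.0]})

# count, mean, std, min, quartiles and max for each numeric column
stats = toy.describe()
print(stats)
```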
time: 4.93 ms (started: 2021-01-18 16:03:19 +00:00)
%% Cell type:markdown id: tags:
It is also possible to explore individual features:
%% Cell type:code id: tags:
```
data['S12'].median()
```
%%%% Output: execute_result
0.140032588
%%%% Output: stream
time: 7.77 ms (started: 2021-01-18 16:03:19 +00:00)
%% Cell type:code id: tags:
```
data['S12'].mean()
```
%%%% Output: execute_result
1.0534877568289322e-12
%%%% Output: stream
time: 2.81 ms (started: 2021-01-18 16:03:19 +00:00)
%% Cell type:markdown id: tags:
## Identify non-numerical values
Some ML algorithms cannot handle non-numerical values (NaN: not a number), so you may need to identify the data type of each feature and modify it if necessary. It is also quite common for different feature values to be missing in different instances / examples, so you may need to decide what to do: (i) omit the instance; (ii) replace the missing values with the mean / median / mode of the feature; (iii) substitute them with a value of your choice.
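The three options above can be sketched with pandas as follows. The `toy` frame and its values are hypothetical; the manufacturing data itself turns out to have no missing entries.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one missing value per column
toy = pd.DataFrame({"S1": [1.0, np.nan, 3.0],
                    "S2": [4.0, 5.0, np.nan]})

dropped = toy.dropna()                # omit every instance that has a NaN
mean_filled = toy.fillna(toy.mean())  # replace NaNs with the column mean
const_filled = toy.fillna(0.0)        # substitute a value of your choice

print(dropped)
print(mean_filled)
print(const_filled)
```

Swapping `toy.mean()` for `toy.median()` gives the median variant of the second option.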
The following line counts the NaNs in each feature. Note that in previous exercises we used NumPy (`np`) for this task, not pandas. Here the DataFrame mixes column types, so we use `pd.isnull`, which handles any dtype, whereas `np.isnan` is defined only for numeric arrays and fails on object columns. `pd.isnull` takes a scalar or array-like object and indicates whether values are missing (NaN in numeric arrays, None or NaN in object arrays, NaT in datetime-like arrays).
%% Cell type:code id: tags:
```
nanCounter = pd.isnull(data).sum()
print(nanCounter)
```
%%%% Output: stream
Time 0
S1 0
S2 0
S3 0
S4 0
S5 0
S6 0
S7 0
S8 0
S9 0
S10 0
S11 0
S12 0
S13 0
S14 0
S15 0
S16 0
S17 0
S18 0
S19 0
S20 0
S21 0
S22 0
S23 0
S24 0
S25 0
S26 0
S27 0
S28 0
Class 0
dtype: int64
time: 29.6 ms (started: 2021-01-18 16:03:19 +00:00)
%% Cell type:markdown id: tags:
It is also a good exercise to check the uniqueness of the data, that is, whether values repeat across different instances:
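The original cell for this check is not shown. As a sketch under that caveat, `duplicated()` and `nunique()` are two pandas calls that answer the question; the `toy` frame is hypothetical.

```python
import pandas as pd

# Hypothetical toy frame; the third row repeats the first
toy = pd.DataFrame({"S1": [0.1, 0.2, 0.1, 0.3],
                    "S2": [1.0, 2.0, 1.0, 2.0]})

# Number of rows that are exact copies of an earlier row
n_duplicate_rows = toy.duplicated().sum()

# Number of distinct values in each column
distinct_per_column = toy.nunique()

print(n_duplicate_rows)
print(distinct_per_column)
```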
time: 216 ms (started: 2021-01-18 16:03:19 +00:00)
%% Cell type:markdown id: tags:
## Data Visualization
Another important pre-processing step is data visualization. Histograms are suitable for a holistic view, where we can probe into the distribution of each attribute.
We can use `hist` from matplotlib for that purpose: