Commit 6df313ae authored by tills's avatar tills

Merge carlo, sophia, justus

parent c9374670
# bda-analytics-challenge-template
## Group Members
- Forename: Justus
- Surname: Knierim
- Matriculation Number: 1980956
- Forename: Peter
- Surname: Fabisch
- Matriculation Number: 2067112
- Forename: Sophia
- Surname: Sommerrock
- Matriculation Number: 1957933
- Forename: Bianca
- Surname: Crusius
- Matriculation Number: 1939806
- Forename: Till
- Surname: Schelhorn
- Matriculation Number: 2276251
## Special Stuff
#### Content of Jupyter Notebooks
###### 1 data_preprocessing
Methods for the preprocessing steps:
- reverse_data (restores the chronological order)
- dec_data (decodes the data, removes the unit letters, and casts the values to integers)
- remove_outliers (removes temperature outliers)
- write_preprocessed_data (writes the dataframe into a .csv file)
- detect_collection (detects trash collections when sudden changes in height occur)
- get_collection_data (summarizes the data since the last collection)

The preprocessed data and the collection data are written into new files in data/preprocessed.
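The decoding step can be illustrated in isolation. A minimal sketch with made-up values (the column names follow the notebook; the numbers are invented):

```python
import pandas as pd

# Raw sensor fields arrive as strings with unit suffixes ("57cm", "3653mV")
# and are turned into integer columns, as dec_data does in notebook 1.
raw = pd.DataFrame({
    "sensor_data.Height 1": ["57cm", "60cm"],
    "sensor_data.Voltage": ["3653mV", "3650mV"],
})
decoded = pd.DataFrame({
    "Height 1 (cm)": raw["sensor_data.Height 1"].str.replace("cm", "").astype(int),
    "Voltage (mV)": raw["sensor_data.Voltage"].str.replace("mV", "").astype(int),
})
print(decoded["Height 1 (cm)"].tolist())
```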
###### 2 exploring_data
Notebook to explore the preprocessed data of the containers with different filter options:
![image-20210705104754696](https://git.scc.kit.edu/ufesk/bda-analytics-challenge-template/-/raw/Sophia/notebooks/pictures/image-20210705104754696.png)
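One of the filter options is the time resolution; a minimal, self-contained sketch of that kind of resampling (invented values, column name as in the notebooks):

```python
import pandas as pd

# Two readings per day averaged down to a daily mean, as the explorer's
# resolution option does with pandas resample.
idx = pd.date_range("2020-05-09", periods=4, freq="12H")
series = pd.DataFrame({"Height 1 (cm)": [100, 90, 80, 70]}, index=idx)
daily = series.resample("1D").mean()
print(daily["Height 1 (cm)"].tolist())
```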
###### 3 data_visualization_one_node
Visualization of the data of one container node (based on lecture):
![image-20210705112452614](https://git.scc.kit.edu/ufesk/bda-analytics-challenge-template/-/raw/Sophia/notebooks/pictures/image-20210705112452614.png)
###### 4 detect_collections
Notebook to detect collections based on the height difference.
A collection is detected when the pre_height difference between two consecutive time periods exceeds a threshold.
The detected collections and the relevant data are summarized in an extra data file.
Example of a collection detection:
![image-20210705112346995](https://git.scc.kit.edu/ufesk/bda-analytics-challenge-template/-/raw/Sophia/notebooks/pictures/image-20210705112346995.png)
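The detection rule can be sketched independently of the notebook. A toy version using the thresholds from the detect_collection code (a 25 % relative jump and a post-collection height above 110 cm); the series values are invented:

```python
import pandas as pd

# Toy fill-height series: the container fills (height shrinks towards the
# sensor), then the jump from 20 cm back to 150 cm marks an emptying.
heights = pd.Series([150, 120, 80, 40, 20, 150, 130], name="Height (cm)")

def detect_jumps(series, limit=0.25, min_post_height=110):
    collections = []
    last = None
    for idx, height in series.items():
        if last is not None:
            if height - last > last * limit and height > min_post_height:
                collections.append(idx)
        last = height
    return collections

print(detect_jumps(heights))
```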
###### 5 modell_clustering
Clustering of the containers based on the mean emptying intervals and the pre_height before emptying:
![image-20210705112027118](https://git.scc.kit.edu/ufesk/bda-analytics-challenge-template/-/raw/Sophia/notebooks/pictures/image-20210705112027118.png)
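The clustering notebook itself is not reproduced in this diff; a hedged sketch of the idea using k-means on the two features named above (all numbers invented, scikit-learn assumed available):

```python
import numpy as np
from sklearn.cluster import KMeans

# One row per container: [mean emptying interval (days),
#                         mean pre_height before emptying (cm)]
features = np.array([
    [7.0, 30.0], [8.0, 35.0], [6.0, 28.0],     # emptied often, fairly full
    [30.0, 90.0], [28.0, 95.0], [32.0, 85.0],  # emptied rarely, barely used
])
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(features)
print(labels)
```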
###### 6 container_map
Visualization of the containers on a map
![image-20210705111842890](https://git.scc.kit.edu/ufesk/bda-analytics-challenge-template/-/raw/Sophia/notebooks/pictures/image-20210705111842890.png)
###### 7 data_visualization
Notebook for the visualization of available data.
Structure:
7.1 General Overview
- Number of containers
- Features of raw data and decoded data
- Container overview on map
7.2 Data Quality
- Container Height
- Order of data
- Temperature Outliers
7.3 Derived Insights
- Emptying intervals
![image-20210705114505018](https://git.scc.kit.edu/ufesk/bda-analytics-challenge-template/-/raw/Sophia/notebooks/pictures/image-20210705114505018.png)
- Mean pre_height of container before emptying
![image-20210705114521409](https://git.scc.kit.edu/ufesk/bda-analytics-challenge-template/-/raw/Sophia/notebooks/pictures/image-20210705114521409.png)
7.4 Derived Insights with Clusters
- Emptying intervals
- Mean pre_height of container before emptying
![image-20210705112548071](https://git.scc.kit.edu/ufesk/bda-analytics-challenge-template/-/raw/Sophia/notebooks/pictures/image-20210705112548071.png)
- Boxplot of pre_height
![image-20210705114521409](https://git.scc.kit.edu/ufesk/bda-analytics-challenge-template/-/raw/Sophia/notebooks/pictures/image-20210705114521410.png)
## Installation
1 Clone Repository
```
git clone https://git.scc.kit.edu/urkid/bda-analytics-challenge-template.git
```
2 Go to folder
```
cd bda-analytics-challenge-template
```
3 Install packages
```
pipenv install
```
4 Open notebook
```
jupyter notebook
```
## Project Organization
------------
```
├── README.md <-- this file. insert group members here
├── .gitignore <-- prevents you from submitting several clutter files
├── data
│   ├── modeling
│   │   ├── dev <-- your development set goes here
│   │   ├── test <-- your test set goes here
│   │   └── train <-- your train set goes here
│   ├── preprocessed <-- your preprocessed data goes here
│   ├── raw <-- the provided raw data for modeling goes here
│   └── addition data <-- additional data for modeling and visualization
├── docs <-- provided explanation of raw input data goes here
├── models <-- dump models here
├── notebooks <-- your playground for jupyter notebooks
├── requirements.txt <-- required packages to run your submission (use a virtualenv!)
└── src
   ├── additional_features.py <-- your creation of additional features/data goes here
   ├── predict.py <-- your prediction script goes here
   ├── preprocessing.py <-- your preprocessing script goes here
   └── train.py <-- your training script goes here
```
%% Cell type:code id:de2f5927 tags:
``` python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import json
import plotly.graph_objects as go
import os
from ipywidgets import interact, interact_manual
import ipywidgets as widgets
import datetime
from scipy.signal import savgol_filter
```
%% Cell type:code id:047d7eec tags:
``` python
# Build the path to data/preprocessed relative to the notebooks directory
file = os.getcwd()
file_m = file[0:len(file) - 9]  # strip the trailing "notebooks" folder
file_dz = 'data\\preprocessed'
file_data = file_m + file_dz
data_name = os.listdir(file_data)
```
%% Cell type:code id:e79fd35b tags:
``` python
# Loads the preprocessed data of one container and restricts it to [start, end]
def load_data(container: str, start, end):
    file_number = file_m + file_dz + '\\' + container + '.txt'
    df = pd.read_csv(file_number)
    df["time"] = pd.to_datetime(list(map(lambda st: str(st)[0:19], df["created_at"])))
    data = df[df["time"] >= pd.to_datetime(start)]
    data = data[data["time"] <= pd.to_datetime(end)]
    data = data.reset_index()
    data = data.set_index('time')
    return data
```
%% Cell type:code id:75698570 tags:
``` python
# Resamples the data to the given resolution and takes the arithmetic mean
def data_arith_mean(data_g, auflösung: str):
    data_g = data_g.reset_index()
    data = data_g.resample(auflösung, on='time').mean()
    return data
```
%% Cell type:code id:9755dbcc tags:
``` python
# Smooths a sensor column with a Savitzky-Golay filter
def data_filter(data_g, sensor: str, intervallgröße: int, polynom: int):
    data_g[sensor] = savgol_filter(data_g[sensor], intervallgröße, polynom)
    return data_g
```
%% Cell type:code id:8c41551e tags:
``` python
sensor_list = ['Height 1 (cm)', 'Height 2 (cm)', 'Height 3 (cm)', 'Height 4 (cm)',
               'Voltage (mV)', 'Temperature (C)', 'Tilt (Degree)']
resolution = ['1H', '6H', '12H', '1D', '7D', '14D', '1M', '3M']
fig = go.FigureWidget()
fig.add_scatter()

@interact(Container=list(map(lambda st: str.replace(st, '.txt', ''), data_name)),
          Sensor=list(sensor_list),
          Start_date=widgets.DatePicker(value=pd.to_datetime('2020-05-09')),
          End_date=widgets.DatePicker(value=pd.to_datetime('2021-05-09')),
          Auflösung_Ein_Aus=False, Auflösung=list(resolution), Filter=False,
          Intervallgröße=widgets.IntSlider(value=25, min=1, max=201, step=2, continuous_update=False),
          Polynom=widgets.IntSlider(value=1, min=1, max=5, step=1, continuous_update=False),
          Mittelwert=False, Median=False, Erwartungswert_plus_minus_Standardabweichung=False,
          Quantil_75=False, Quantil_25=False, Varianz=False)
def update(Container='70b3d50070001704', Sensor='Height 1 (cm)',
           Start_date=pd.to_datetime('2020-05-09'), End_date=pd.to_datetime('2021-05-09'),
           Auflösung_Ein_Aus=False, Auflösung='1D', Filter=False, Intervallgröße=25, Polynom=1,
           Mittelwert=False, Median=False, Erwartungswert_plus_minus_Standardabweichung=False,
           Quantil_75=False, Quantil_25=False, Varianz=False):
    data = load_data(Container, Start_date, End_date)
    if Auflösung_Ein_Aus:
        data = data_arith_mean(data, Auflösung)
    if Filter:
        data = data_filter(data, Sensor, Intervallgröße, Polynom)
    with fig.batch_update():
        fig.data[0].x = data.index.tolist()
        fig.data[0].y = data[Sensor].tolist()
        fig.data[0].name = Sensor
        fig.update_layout(title='Data Visualization Container: ' + str(Container),
                          xaxis_title="Time", yaxis_title=Sensor)
    n = len(data.index)
    if Mittelwert:
        fig.add_scatter()
        fig.data[1].x = data.index.tolist()
        fig.data[1].y = np.ones(n) * np.mean(data[Sensor].tolist())
        fig.data[1].name = "Arithmetic mean"
    else:
        fig.add_scatter()
        fig.data[1].x = []
        fig.data[1].y = []
    if Median:
        fig.add_scatter()
        fig.data[2].x = data.index.tolist()
        fig.data[2].y = np.ones(n) * np.median(data[Sensor].tolist())
        fig.data[2].name = "Median"
    else:
        fig.add_scatter()
        fig.data[2].x = []
        fig.data[2].y = []
    if Quantil_75:
        fig.add_scatter()
        fig.data[3].x = data.index.tolist()
        fig.data[3].y = np.ones(n) * np.quantile(data[Sensor].tolist(), 0.75)
        fig.data[3].name = "75% quantile"
    else:
        fig.add_scatter()
        fig.data[3].x = []
        fig.data[3].y = []
    if Quantil_25:
        fig.add_scatter()
        fig.data[4].x = data.index.tolist()
        fig.data[4].y = np.ones(n) * np.quantile(data[Sensor].tolist(), 0.25)
        fig.data[4].name = "25% quantile"
    else:
        fig.add_scatter()
        fig.data[4].x = []
        fig.data[4].y = []
    if Erwartungswert_plus_minus_Standardabweichung:
        fig.add_scatter()
        fig.data[5].x = data.index.tolist()
        fig.data[5].y = np.ones(n) * (np.mean(data[Sensor].tolist()) + np.std(data[Sensor].tolist()))
        fig.data[5].name = "Mean + stdev"
        fig.add_scatter()
        fig.data[6].x = data.index.tolist()  # was data.created_at; use the time index like the other traces
        fig.data[6].y = np.ones(n) * (np.mean(data[Sensor].tolist()) - np.std(data[Sensor].tolist()))
        fig.data[6].name = "Mean - stdev"
    else:
        fig.add_scatter()
        fig.data[5].x = []
        fig.data[5].y = []
        fig.add_scatter()
        fig.data[6].x = []
        fig.data[6].y = []
    if Varianz:
        fig.add_scatter()
        fig.data[7].x = data.index.tolist()
        fig.data[7].y = np.ones(n) * np.var(data[Sensor].tolist())  # was data["Sensor"], a KeyError
        fig.data[7].name = "Variance"
    else:
        fig.add_scatter()
        fig.data[7].x = []
        fig.data[7].y = []

fig
```
%%%% Output: display_data
%%%% Output: display_data
%% Cell type:code id:73e1b6b5 tags:
``` python
```
%% Cell type:code id:58a9607d tags:
``` python
```
%% Cell type:code id: tags:
``` python
import pandas as pd
import numpy as np
import datetime
import os
```
%% Cell type:code id: tags:
``` python
# File paths of the raw data and the preprocessed output
file_data = r'..\data\raw'
data_name = os.listdir(file_data)
write_to_path = r'..\data\preprocessed'
```
%% Cell type:code id: tags:
``` python
# Loads the raw json file for a specific container
def load_data(container):
    file_number = file_data + '\\' + container + '.txt'
    df = pd.read_json(file_number, lines=True)
    raw_data = pd.DataFrame(df[1][0])
    return raw_data
```
%% Cell type:code id: tags:
``` python
# Reverses the dataframe passed into the function
# (the raw data is in reverse chronological order: the youngest entry comes first)
def reverse_data(raw_data):
    reversed_data = raw_data.loc[::-1].reset_index(drop=True)
    return reversed_data
```
%% Cell type:code id: tags:
``` python
# Decodes the json data in the 'decoded_data' column of the dataframe
# Also removes the unit letters from the columns listed below and transforms them into integers
def dec_data(data):
    df_decoded_data = pd.json_normalize(data.decoded_data)
    data["time"] = pd.to_datetime(list(map(lambda st: str(st)[0:19], data["created_at"])))
    data["Height (cm)"] = df_decoded_data["sensor_data.Height 1"].str.replace("cm", "").astype(int)
    data["Height 1 (cm)"] = df_decoded_data["sensor_data.Height 1"].str.replace("cm", "").astype(int)
    data["Height 2 (cm)"] = df_decoded_data["sensor_data.Height 2"].str.replace("cm", "").astype(int)
    data["Height 3 (cm)"] = df_decoded_data["sensor_data.Height 3"].str.replace("cm", "").astype(int)
    data["Height 4 (cm)"] = df_decoded_data["sensor_data.Height 4"].str.replace("cm", "").astype(int)
    data["Voltage (mV)"] = df_decoded_data["sensor_data.Voltage"].str.replace("mV", "").astype(int)
    data["Temperature (C)"] = df_decoded_data["sensor_data.Temperature"].str.replace("C", "").astype(int)
    data["Tilt (Degree)"] = df_decoded_data["sensor_data.Tilt"].str.replace("Degree", "").astype(int)
    return data
```
%% Cell type:code id: tags:
``` python
# Removes outliers from the data by replacing them with a neighbouring value
# TODO: Right now only temperature outliers are removed
# TODO: Smoothing
def remove_outliers(data):
    for idx, row in data[data['Temperature (C)'] >= 150].iterrows():
        if idx != 0:
            new_value = data.iloc[idx - 1]['Temperature (C)']
        else:
            new_value = data.iloc[idx + 1]['Temperature (C)']
        data.at[idx, 'Temperature (C)'] = new_value
    return data
```
%% Cell type:code id: tags:
``` python
# Writes the dataframe into a .csv file
# container specifies the name of the file
def write_preprocessed_data(container, data):
    data.to_csv(path_or_buf=write_to_path + '/' + container + '.txt')
```
%% Cell type:code id: tags:
``` python
# Detects trash collections when sudden changes in height occur
# TODO: improve algorithm
# TODO: test on multiple container files
def detect_collection(data):
    last_value = None
    collections = []
    limit = 0.25
    for idx, row in data.iterrows():
        if last_value is not None:
            height = row['Height (cm)']
            difference = height - last_value
            if difference > last_value * limit and height > 110:
                collections.append(idx)
        last_value = row['Height (cm)']
    return collections
```
%% Cell type:code id: tags:
``` python
# For each trash collection detected, summarizes the data since the last collection
# Data: timestamp of collection, container ID, time since last collection,
# height before and after the collection, mean, max and min temperature
# TODO: More data?
def get_collection_data(data, collections):
    collections_data = []
    for collection in collections:
        idx = collections.index(collection)
        row = data.iloc[collection]
        collection_data = {}
        last_collection = 0
        if idx != 0:
            last_collection = collections[idx - 1]
        # Get data since last collection and data of last collection
        data_since_last_collection = data.iloc[last_collection:collection]
        data_last_collection = data.iloc[last_collection]
        data_last_measurement = data.iloc[collection - 1]
        # Calculate the time difference (current minus last, so the duration is positive)
        collection_time = datetime.datetime.fromtimestamp(int(row['unix_time']) / 1000)
        last_collection_time = datetime.datetime.fromtimestamp(int(data_last_collection['unix_time']) / 1000)
        time_difference = collection_time - last_collection_time
        # Calculate values since last collection
        temperature_mean = data_since_last_collection['Temperature (C)'].mean()
        temperature_max = data_since_last_collection['Temperature (C)'].max()
        temperature_min = data_since_last_collection['Temperature (C)'].min()
        # Create collection data entry
        collection_data['timestamp'] = row['created_at']
        collection_data['container_id'] = row['deveui']
        collection_data['last_collection'] = time_difference
        collection_data['pre_height'] = data_last_measurement['Height (cm)']
        collection_data['post_height'] = row['Height (cm)']
        collection_data['mean_temperature'] = temperature_mean
        collection_data['max_temperature'] = temperature_max
        collection_data['min_temperature'] = temperature_min
        collections_data.append(collection_data)
    return collections_data
```
%% Cell type:code id: tags:
``` python
# Each raw data file gets processed and then written into a new file in data/preprocessed
collection_df = pd.DataFrame()
for file in data_name:
    container_id = file.replace('.txt', '')
    raw_data = load_data(container_id)
    reversed_data = reverse_data(raw_data)
    decoded_data = dec_data(reversed_data)
    preprocessed = remove_outliers(decoded_data)
    collections = detect_collection(decoded_data)
    collections_data = get_collection_data(decoded_data, collections)
    collection_df = collection_df.append(collections_data, ignore_index=True, sort=False)
    write_preprocessed_data(container_id, preprocessed)
```
%% Cell type:code id: tags:
``` python
# Write all available collection data into one file in data/preprocessed
collection_df.to_csv(path_or_buf=write_to_path + '/collection_data.txt')
```
%% Cell type:code id: tags:
``` python
```