---
title: 'Using the standard deviation'
author: "Helene Angot"
date: "1/26/2021"
output: html_document
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

```{r warning=FALSE, include=FALSE}
library(DT)
library(dplyr)
library(tidyr)
library(stringr)
library(ggplot2)
library(openair)
library(RColorBrewer)
library(lubridate)
library(data.table)
library(reshape2)
Sys.setenv(TZ = "UTC")
```

[El Yazidi et al., 2018 (AMT)](https://amt.copernicus.org/articles/11/1599/2018/)   
The SD method (Drewnick et al., 2012) considers that a time series is a combination of a smooth signal superimposed with a fast variable signal. The variable signal component in our case is related to local emissions causing spikes. To determine the variability of background concentration levels we calculated the standard deviation (σ) of data falling between the first and the third quartile of the entire dataset. The SD method is applied over 1-day and 1-week time windows, i.e. the standard deviation over a day and a week is used for threshold calculation. Using a longer period (e.g. 1 year) would give more weight to the seasonal and long-term variabilities, which are not relevant to identify short-term spikes.    

We then select the first available data point, called Cunf (unflagged data), assuming that it is not in a spike. The next data point in the time series Ci is evaluated with respect to Cunf; spikes are defined by data values higher than a threshold defined as Cunf plus an additive value, where alpha is a parameter to control the selection threshold, and n is the number of points between Cunf and Ci. The value of alpha depends on the time series variability. We set a default value of alpha = 1 for CO2 and CH4 and alpha = 3 for CO (Drewnick et al., 2012). The integer n contains a temporal information about the evolution of the time series. Indeed, while identifying a spike Ci, the next data point is evaluated against Cunf using an increased threshold to take in consideration the variability of the baseline during the spike event. If Ci is lower than the threshold from Eq. (1), it is considered as “non-spike” and becomes the new reference value Cunf. The following data will then be compared to this updated Cunf.    
Cunf + alpha * σ + sqrt(n) * σ  (1)  


I first load calibrated data.       

```{r eval=FALSE, message=FALSE, warning=FALSE, include=FALSE, paged.print=FALSE}
load(file = '/Users/helene/Documents/MOSAiC/__EPFL_DATA/archive_raw_data/PICARRO/data_final.RData')
CRDS = data_final %>% 
  select(Date, CO2_final, CH4_final, CO_final) %>% 
  rename(CO2 = CO2_final) %>% 
  rename(CH4 = CH4_final) %>% 
  rename(CO = CO_final)
```

### CO2    

First calculate daily and weekly standard deviation based on values between the 25th and 75th percentiles.        

```{r eval=FALSE, message=FALSE, warning=FALSE, include=FALSE, paged.print=FALSE}
test = CRDS %>% 
  # filter(Date >= as.POSIXct('2019-10-01')) %>%
  filter(month(Date) == '1') %>% 
  filter(CO2 > 390) %>% # To avoid first data point in the time series being a spike
  mutate(day = as.numeric(format(Date, '%j'))) %>% 
  group_by(day) %>% 
  mutate(p25_day = quantile(CO2, na.rm = TRUE, 0.25)) %>% 
  mutate(p75_day = quantile(CO2, na.rm = TRUE, 0.75)) %>% 
  mutate(CO2_interquartile_day = ifelse((CO2 > p25_day) & (CO2 < p75_day), CO2, NA)) %>% 
  mutate(sd_day = sd(CO2_interquartile_day, na.rm = TRUE)) %>% 
  ungroup() %>% 
  mutate(week = week(Date)) %>%
  group_by(week) %>%
  mutate(p25_week = quantile(CO2, na.rm = TRUE, 0.25)) %>%
  mutate(p75_week = quantile(CO2, na.rm = TRUE, 0.75)) %>%
  mutate(CO2_interquartile_week = ifelse((CO2 > p25_week) & (CO2 < p75_week), CO2, NA)) %>%
  mutate(sd_week = sd(CO2_interquartile_week, na.rm = TRUE)) %>%
  ungroup()
```

Test based on the daily standard deviation        

```{r eval=FALSE, message=FALSE, warning=FALSE, include=FALSE, paged.print=FALSE}
CO2_dataset_day = NULL

alpha = 1

for (i in test$day[which.min(test$day)]:test$day[which.max(test$day)]) {

  temp = test %>% 
    filter(day == i) %>% 
    filter(CO2 > 0) %>% 
    mutate(spike_day = rep(NA)) %>% 
    mutate(threshold_up_day = rep(NA)) %>% 
    mutate(threshold_low_day = rep(NA)) %>% 
    mutate(n = row_number() - 1) %>% 
    filter(is.na(CO2) == 'FALSE')
  
  if (length(temp$Date) == '0') {
  next
  }

  Cunf = temp$CO2[1]
  temp$spike_day[1] = 'non-spike'  # We assume that the first data point is not a spike
  temp$threshold_up_day[1] = Cunf
  temp$threshold_low_day[1] = Cunf
  
  for (j in 2:length(temp$CO2)) {

    temp$threshold_up_day[j] = temp$CO2[max(which(temp$spike_day == 'non-spike'))] + alpha * temp$sd_day[j] + sqrt(temp$n[j]) * temp$sd_day[j]
    temp$threshold_low_day[j] = temp$CO2[max(which(temp$spike_day == 'non-spike'))] - alpha * temp$sd_day[j] - sqrt(temp$n[j]) * temp$sd_day[j]
  
    if (temp$CO2[j] >= temp$threshold_up_day[j] | temp$CO2[j] <= temp$threshold_low_day[j]) {temp$spike_day[j] = 'spike'} else
   
 # If Ci < threshold it is not a spike and becomes the new Cunf     
       {
      temp$spike_day[j] = 'non-spike'
      
      temp = temp %>% 
        mutate(n = n - n[max(which(temp$spike_day == 'non-spike'))])
    }
  }
  
  if(is.null(CO2_dataset_day)) {CO2_dataset_day = temp} else {CO2_dataset_day = rbind(CO2_dataset_day, temp)}
}
```

Test based on the weekly standard deviation        
        
```{r eval=FALSE, message=FALSE, warning=FALSE, include=FALSE, paged.print=FALSE}
final_CO2_dataset = NULL

alpha = 1

for (i in CO2_dataset_day$week[which.min(CO2_dataset_day$week)]:CO2_dataset_day$week[which.max(CO2_dataset_day$week)]) {

  temp = CO2_dataset_day %>% 
    filter(week == i) %>% 
    filter(CO2 > 0) %>% 
    mutate(spike_week = rep(NA)) %>% 
    mutate(threshold_up_week = rep(NA)) %>% 
    mutate(threshold_low_week = rep(NA)) %>% 
    mutate(n = row_number() - 1) %>% 
    filter(is.na(CO2) == 'FALSE')
  
  if (length(temp$Date) == '0') {
  next
  }

  Cunf = temp$CO2[1]
  temp$spike_week[1] = 'non-spike'  
  temp$threshold_up_week[1] = Cunf
  temp$threshold_low_week[1] = Cunf
  
  for (j in 2:length(temp$CO2)) {

    temp$threshold_up_week[j] = temp$CO2[max(which(temp$spike_week == 'non-spike'))] + alpha * temp$sd_week[j] + sqrt(temp$n[j]) * temp$sd_week[j]
    temp$threshold_low_week[j] = temp$CO2[max(which(temp$spike_week == 'non-spike'))] - alpha * temp$sd_week[j] - sqrt(temp$n[j]) * temp$sd_week[j]
  
    if (temp$CO2[j] >= temp$threshold_up_week[j] | temp$CO2[j] <= temp$threshold_low_week[j]) {temp$spike_week[j] = 'spike'} else
    {
      temp$spike_week[j] = 'non-spike'
      
      temp = temp %>% 
        mutate(n = n - n[max(which(temp$spike_week == 'non-spike'))])
    }
  }
  
  if(is.null(final_CO2_dataset)) {final_CO2_dataset = temp} else {final_CO2_dataset = rbind(final_CO2_dataset, temp)}
}
```

```{r eval=FALSE, message=FALSE, warning=FALSE, include=FALSE, paged.print=FALSE}
final_CO2_dataset = final_CO2_dataset %>% 
  mutate(spike = rep(NA)) %>% 
  mutate(spike = ifelse((spike_day == 'spike') | (spike_week == 'spike'), 'spike', 'non-spike'))

save(final_CO2_dataset, file = '/Volumes/Seagate Backup Plus Drive/MOSAiC/__EPFL_DATA/scripts/picarro/final_CO2_dataset.RData')
```