7 Predictive Modeling
Now that the data has been cleaned and we have come up with a game plan to understand the efficacy of the models, we finally have everything we need to start making predictive models.
7.1 Example Simple Model
We can start by making a simple linear regression model:
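# Regress target_price_24h on every other column of cryptodata
# (this call is shown verbatim in the model output below)
lm(target_price_24h ~ ., data = cryptodata)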
##
## Call:
## lm(formula = target_price_24h ~ ., data = cryptodata)
##
## Coefficients:
## (Intercept) symbolADA symbolAGIX symbolANT
## 2235.60320161 -0.00381281 -0.01674889 0.76321535
## symbolAPFC symbolAPT symbolARB symbolATOM
## -0.01538729 1.38620757 0.15024699 1.75503650
## symbolAVA symbolBAT symbolBCUG symbolBNT
## 0.04324331 -0.02309496 -0.05140092 -0.07433763
## symbolBSV symbolBTC symbolBTG symbolCHZ
## 7.59496521 6026.79089391 2.87797538 -0.04829540
## symbolCTC symbolDASH symbolDOT symbolDYDX
## 0.58441541 6.62008851 0.96885786 0.34498438
## symbolELF symbolENJ symbolEOS symbolETH
## -0.00691463 -0.00373924 0.08682494 377.48688326
## symbolETHW symbolGMT symbolGMTT symbolGODS
## -0.09193953 -0.03826624 0.03114603 -0.04280682
## symbolID symbolIOTA symbolKNC symbolLAZIO
## -0.07276949 -0.02888322 0.06223899 0.28400350
## symbolLQTY symbolLTC symbolMANA symbolMTLX
## 0.13310542 18.31260637 0.01374987 0.03041260
## symbolNEXO symbolOXT symbolPERP symbolREN
## 0.05080175 -0.05289287 0.03734625 -0.05043923
## symbolRIF symbolSKL symbolSTETH symbolSTORJ
## -0.04720152 -0.05882645 375.82284852 0.04466188
## symbolTHETA symbolTOMO symbolTRX symbolVGX
## 0.07115705 0.19382214 -0.04792843 0.28064780
## symbolXCH symbolXDC symbolXMR symbolZEC
## 6.54125999 -0.09320641 32.99061946 6.03042400
## symbolZRX date_time_utc date price_usd
## -0.02081919 0.00001155 -1.11261191 0.69399379
## lagged_price_1h lagged_price_2h lagged_price_3h lagged_price_6h
## 0.01259633 0.04844143 0.01599413 -0.04636081
## lagged_price_12h lagged_price_24h lagged_price_3d trainingtest
## 0.06861117 0.08378996 -0.07863864 -0.76890868
## trainingtrain split
## -0.58263595 -0.23074965
We defined the formula for the model as target_price_24h ~ ., which means we want to make predictions for the target_price_24h field using (~) every other column found in the data (.). In other words, we specified a model that uses the target_price_24h field as the dependent variable and all other columns (.) as the independent variables. We are looking to predict target_price_24h, which is the only column that refers to the future, using all the information that was available at the time the rest of the data was collected, in order to infer statistical relationships that can help us forecast target_price_24h when it is still unknown for new data we want to make predictions on.
In the example above we used the cryptodata object which contained all the non-nested data, and was a big oversimplification of the process we will actually use.
7.1.1 Using Functional Programming
From this point forward, we will deal with the new dataset cryptodata_nested; review the previous section where it was created if you missed it. Here is a preview of the data again:
## # A tibble: 265 x 5
## # Groups: symbol, split [265]
## symbol split train_data test_data holdout_data
## <chr> <dbl> <list> <list> <list>
## 1 BTC 1 <tibble [157 x 11]> <tibble [60 x 11]> <tibble [63 x 11]>
## 2 ETH 1 <tibble [157 x 11]> <tibble [60 x 11]> <tibble [63 x 11]>
## 3 EOS 1 <tibble [157 x 11]> <tibble [60 x 11]> <tibble [62 x 11]>
## 4 LTC 1 <tibble [157 x 11]> <tibble [60 x 11]> <tibble [63 x 11]>
## 5 BSV 1 <tibble [157 x 11]> <tibble [60 x 11]> <tibble [63 x 11]>
## 6 ADA 1 <tibble [157 x 11]> <tibble [60 x 11]> <tibble [63 x 11]>
## 7 TRX 1 <tibble [157 x 11]> <tibble [60 x 11]> <tibble [63 x 11]>
## 8 ZEC 1 <tibble [157 x 11]> <tibble [60 x 11]> <tibble [63 x 11]>
## 9 XMR 1 <tibble [64 x 11]> <tibble [56 x 11]> <tibble [60 x 11]>
## 10 KNC 1 <tibble [157 x 11]> <tibble [60 x 11]> <tibble [63 x 11]>
## # ... with 255 more rows
Because we are now dealing with a nested dataframe, performing operations on the individual nested datasets is not as straightforward. We could extract the individual elements out of the data using indexing, for example we can return the first element of the column train_data by running this code:
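# Return the first element of the train_data column using list indexing
cryptodata_nested$train_data[[1]]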
## # A tibble: 157 x 11
## date_time_utc date price_usd target_price_24h lagged_price_1h
## <dttm> <date> <dbl> <dbl> <dbl>
## 1 2023-06-17 00:00:00 2023-06-17 26346. 26519. 26369.
## 2 2023-06-17 01:00:00 2023-06-17 26316. 26459. 26346.
## 3 2023-06-17 02:00:00 2023-06-17 26199. 26497. 26316.
## 4 2023-06-17 03:00:00 2023-06-17 26278. 26478. 26199.
## 5 2023-06-17 04:00:00 2023-06-17 26265. 26552. 26278.
## 6 2023-06-17 05:00:00 2023-06-17 26360. 26538. 26265.
## 7 2023-06-17 06:00:00 2023-06-17 26696. 26541. 26360.
## 8 2023-06-17 07:00:00 2023-06-17 26669. 26542. 26696.
## 9 2023-06-17 08:00:00 2023-06-17 26604. 26584. 26669.
## 10 2023-06-17 09:00:00 2023-06-17 26603. 26498. 26604.
## # ... with 147 more rows, and 6 more variables: lagged_price_2h <dbl>,
## # lagged_price_3h <dbl>, lagged_price_6h <dbl>, lagged_price_12h <dbl>,
## # lagged_price_24h <dbl>, lagged_price_3d <dbl>
Note: the STORJ cryptocurrency is removed at this point to resolve an issue that arose on March 3rd, 2021, which is why the nested data drops from 265 rows to 260 in the outputs that follow.
As we already saw, dataframes are really flexible as a data structure. We can create a new column in the data to store the models themselves, one associated with each row of the data. There are several ways we could go about doing this (this tutorial itself was written to execute the same commands using three fundamentally different methodologies), but here we will take a functional programming approach. This means we will focus our operations on the actions we want to take, which can be contrasted with a for loop, which emphasizes the objects and would use indexing similar to the example above that returned the first element of the train_data column.
When using a functional programming approach, we first need to create functions for the operations we want to perform. Let’s wrap the lm() function we used as an example earlier and create a new custom function called linear_model, which takes a dataframe as an input (the train_data we will provide for each row of the nested dataset), and generates a linear regression model:
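A minimal sketch of what this function could look like, reusing the same formula as the earlier example:

linear_model <- function(df){
  # Fit a linear regression predicting target_price_24h from all other columns
  lm(target_price_24h ~ ., data = df)
}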
We can now use the map() function from the purrr package in conjunction with the mutate() function from dplyr to create a new column in the data which contains an individual linear regression model for each row of train_data:
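# Create the lm_model column by mapping linear_model over each train_data element
cryptodata_nested <- mutate(cryptodata_nested,
                            lm_model = map(train_data, linear_model))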
## # A tibble: 260 x 6
## # Groups: symbol, split [260]
## symbol split train_data test_data holdout_data lm_model
## <chr> <dbl> <list> <list> <list> <list>
## 1 BTC 1 <tibble [157 x 11]> <tibble [60 x 11~ <tibble [63 x 11~ <lm>
## 2 ETH 1 <tibble [157 x 11]> <tibble [60 x 11~ <tibble [63 x 11~ <lm>
## 3 EOS 1 <tibble [157 x 11]> <tibble [60 x 11~ <tibble [62 x 11~ <lm>
## 4 LTC 1 <tibble [157 x 11]> <tibble [60 x 11~ <tibble [63 x 11~ <lm>
## 5 BSV 1 <tibble [157 x 11]> <tibble [60 x 11~ <tibble [63 x 11~ <lm>
## 6 ADA 1 <tibble [157 x 11]> <tibble [60 x 11~ <tibble [63 x 11~ <lm>
## 7 TRX 1 <tibble [157 x 11]> <tibble [60 x 11~ <tibble [63 x 11~ <lm>
## 8 ZEC 1 <tibble [157 x 11]> <tibble [60 x 11~ <tibble [63 x 11~ <lm>
## 9 XMR 1 <tibble [64 x 11]> <tibble [56 x 11~ <tibble [60 x 11~ <lm>
## 10 KNC 1 <tibble [157 x 11]> <tibble [60 x 11~ <tibble [63 x 11~ <lm>
## # ... with 250 more rows
Awesome! Now we can use the same tools we learned in the high-level version to make and test a wider variety of predictive models.
7.2 Caret
Refer back to the high-level version of the tutorial for an explanation of the caret package, or consult this document: https://topepo.github.io/caret/index.html
7.2.1 Parallel Processing
R is a single threaded application, meaning it only uses one CPU at a time when performing operations. The step below is optional and uses the parallel and doParallel packages to allow R to use more than a single CPU when creating the predictive models, which will speed up the process considerably:
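One way to set this up (the cluster object name cl and the choice of using all but one core are our choices here, not prescribed by the tutorial):

library(parallel)
library(doParallel)

# Create a cluster using all but one of the available CPU cores
cl <- makePSOCKcluster(detectCores() - 1)
# Register the cluster so caret can train models in parallel
registerDoParallel(cl)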
7.2.2 More Functional Programming
Now we can repeat the process we used earlier to create the exact same linear regression models, but this time using the caret package.
linear_model_caret <- function(df){
  # Train a linear regression through caret, excluding the date fields
  # from the predictors
  train(target_price_24h ~ . -date_time_utc -date, data = df,
        method = "lm",
        trControl = trainControl(method = "none"))
}
We specified the method as lm for linear regression. See the high-level version for a refresher on how to use different methods to make different models: https://cryptocurrencyresearch.org/high-level/#/method-options. The trControl argument tells the caret package to avoid additional resampling of the data. By default, caret will resample the data and perform hyperparameter tuning to select the parameter values that give the best results, but we will avoid that discussion in this tutorial. See the official caret documentation for more details.
Here is the full list of models that we can make using the caret package and the steps described in the high-level version of the tutorial: https://topepo.github.io/caret/available-models.html
We can now use the new function we created linear_model_caret in conjunction with map() and mutate() to create a new column in the cryptodata_nested dataset called lm_model with the trained linear regression model for each split of the data (by cryptocurrency symbol and split):
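# Create the lm_model column, this time training each model through caret
cryptodata_nested <- mutate(cryptodata_nested,
                            lm_model = map(train_data, linear_model_caret))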
We can see the new column called lm_model alongside the grouping variables of the nested dataframe:
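# One way to view it; selecting lm_model on the grouped tibble also
# brings along the grouping columns symbol and split
select(cryptodata_nested, lm_model)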
## # A tibble: 260 x 3
## # Groups: symbol, split [260]
## symbol split lm_model
## <chr> <dbl> <list>
## 1 BTC 1 <train>
## 2 ETH 1 <train>
## 3 EOS 1 <train>
## 4 LTC 1 <train>
## 5 BSV 1 <train>
## 6 ADA 1 <train>
## 7 TRX 1 <train>
## 8 ZEC 1 <train>
## 9 XMR 1 <train>
## 10 KNC 1 <train>
## # ... with 250 more rows
And we can view the summarized contents of the first trained model:
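# Print the first trained model, indexing the lm_model column as before
cryptodata_nested$lm_model[[1]]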
## Linear Regression
##
## 157 samples
## 10 predictor
##
## No pre-processing
## Resampling: None
7.2.3 Generalize the Function
We can adapt the function we built earlier for the linear regression models using caret, and add a parameter that allows us to specify the method we want to use (as in what predictive model):
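A sketch of the generalized function; the function name model_caret is used in the next section, while the parameter name method_choice is our choice here:

model_caret <- function(df, method_choice){
  # Train a model of the given caret method on the provided dataframe
  train(target_price_24h ~ . -date_time_utc -date, data = df,
        method = method_choice,
        trControl = trainControl(method = "none"))
}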
7.2.4 XGBoost Models
Now we can do the same thing we did earlier for the linear regression models, but this time use the new model_caret function with the map2() function, which lets us also specify the method xgbLinear to create an XGBoost model:
cryptodata_nested <- mutate(cryptodata_nested,
xgb_model = map2(train_data, "xgbLinear", model_caret))
We won’t dive into the specifics of each individual model, as the correct one to use may depend on many factors and that discussion is outside the scope of this tutorial. We chose the XGBoost model as an example because it has recently gained a lot of popularity as a very effective framework for a variety of problems, and it is an essential model for any data scientist to have at their disposal.
There are several possible configurations for XGBoost models; you can find the official documentation here: https://xgboost.readthedocs.io/en/latest/parameter.html
7.2.5 Neural Network Models
We can keep adding models. As we saw, caret allows for the usage of over 200 predictive models. Let’s make another set of models, this time setting the method to dnn to create deep neural networks:
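# Deep neural network models using the dnn method, stored in nnet_model
cryptodata_nested <- mutate(cryptodata_nested,
                            nnet_model = map2(train_data, "dnn", model_caret))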
Again, we will not dive into the specifics of the individual models, but a quick Google search will return a myriad of information on the subject.
7.2.6 Random Forest Models
Next, let’s create Random Forest models using the method ctree:
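# Tree models using caret's ctree method, stored in rf_model
cryptodata_nested <- mutate(cryptodata_nested,
                            rf_model = map2(train_data, "ctree", model_caret))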
7.2.7 Principal Component Regression
For one last set of models, let’s make Principal Component Regression models using the method pcr:
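# Principal Component Regression models using the pcr method, stored in pcr_model
cryptodata_nested <- mutate(cryptodata_nested,
                            pcr_model = map2(train_data, "pcr", model_caret))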
7.2.8 Caret Options
Caret offers some additional options to help pre-process the data as well. We outlined an example of this in the high-level version of the tutorial when showing how to make a Support Vector Machine model, which requires the data to be centered and scaled to avoid running into problems (which we won’t discuss further here).
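For example, here is a hypothetical sketch of a Support Vector Machine function using caret's svmRadial method (this model is not used elsewhere in this section), where the preProcess argument requests that predictors be centered and scaled before training:

svm_model_caret <- function(df){
  # Hypothetical example for illustration only: center and scale the
  # predictors before training a radial kernel SVM
  train(target_price_24h ~ . -date_time_utc -date, data = df,
        method = "svmRadial",
        preProcess = c("center", "scale"),
        trControl = trainControl(method = "none"))
}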
7.3 Make Predictions
Awesome! We have trained the predictive models, and we want to start getting a better understanding of how accurate the models are on data they have never seen before. In order to make these comparisons, we will want to make predictions on the test and holdout datasets, and compare those predictions to what actually ended up happening.
In order to make predictions, we can use the predict() function. Here is an example on the first elements of the nested dataframe:
predict(object = cryptodata_nested$lm_model[[1]],
newdata = cryptodata_nested$test_data[[1]],
na.action = na.pass)
## 1 2 3 4 5 6 7 8
## 31596.80 31620.10 31750.04 31906.10 31728.68 32019.42 32057.49 32169.79
## 9 10 11 12 13 14 15 16
## 32139.34 31935.95 31940.34 32015.62 31985.89 32019.68 32130.93 32315.17
## 17 18 19 20 21 22 23 24
## 32404.13 32561.04 32462.70 32405.61 32160.85 32220.09 32143.98 32089.84
## 25 26 27 28 29 30 31 32
## 32095.14 32120.89 32223.77 31831.56 31918.18 31992.28 32002.91 31926.18
## 33 34 35 36 37 38 39 40
## 31939.59 31880.00 31949.85 31931.20 31885.78 31862.00 31580.68 31717.49
## 41 42 43 44 45 46 47 48
## 31796.75 31751.88 31734.62 32059.72 NA NA NA 31933.51
## 49 50 51 52 53 54 55 56
## 31937.87 NA 32030.98 32546.45 32082.23 32186.65 32163.33 NA
## 57 58 59 60
## 32107.15 31993.39 32125.00 32200.38
Now we can create a new custom function called make_predictions that wraps this functionality in a way that we can use with map() to iterate through all options of the nested dataframe:
make_predictions <- function(model, test){
  # na.action = na.pass returns NA predictions for rows with missing values
  predict(object = model, newdata = test, na.action = na.pass)
}
Now we can create the new columns lm_test_predictions and lm_holdout_predictions with the predictions:
cryptodata_nested <- mutate(cryptodata_nested,
lm_test_predictions = map2(lm_model,
test_data,
make_predictions),
lm_holdout_predictions = map2(lm_model,
holdout_data,
make_predictions))
The predictions were made using the models that had only seen the training data, and we can start assessing how good the model is on data it has not seen before in the test and holdout sets. Let’s view the results from the previous step:
## # A tibble: 260 x 4
## # Groups: symbol, split [260]
## symbol split lm_test_predictions lm_holdout_predictions
## <chr> <dbl> <list> <list>
## 1 BTC 1 <dbl [60]> <dbl [63]>
## 2 ETH 1 <dbl [60]> <dbl [63]>
## 3 EOS 1 <dbl [60]> <dbl [62]>
## 4 LTC 1 <dbl [60]> <dbl [63]>
## 5 BSV 1 <dbl [60]> <dbl [63]>
## 6 ADA 1 <dbl [60]> <dbl [63]>
## 7 TRX 1 <dbl [60]> <dbl [63]>
## 8 ZEC 1 <dbl [60]> <dbl [63]>
## 9 XMR 1 <dbl [56]> <dbl [60]>
## 10 KNC 1 <dbl [60]> <dbl [63]>
## # ... with 250 more rows
Now we can do the same for the rest of the models:
cryptodata_nested <- mutate(cryptodata_nested,
# XGBoost:
xgb_test_predictions = map2(xgb_model,
test_data,
make_predictions),
# holdout
xgb_holdout_predictions = map2(xgb_model,
holdout_data,
make_predictions),
# Neural Network:
nnet_test_predictions = map2(nnet_model,
test_data,
make_predictions),
# holdout
nnet_holdout_predictions = map2(nnet_model,
holdout_data,
make_predictions),
# Random Forest:
rf_test_predictions = map2(rf_model,
test_data,
make_predictions),
# holdout
rf_holdout_predictions = map2(rf_model,
holdout_data,
make_predictions),
# PCR:
pcr_test_predictions = map2(pcr_model,
test_data,
make_predictions),
# holdout
pcr_holdout_predictions = map2(pcr_model,
holdout_data,
make_predictions))
We are done using the caret package and can stop the parallel processing cluster:
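# Stop the cluster (the cl object from the parallel processing sketch in
# section 7.2.1) and return to sequential execution
stopCluster(cl)
registerDoSEQ()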
7.4 Timeseries
Because this tutorial is already very dense, we will just focus on the models we created above. When creating predictive models on timeseries data, there are other excellent options that take into account when the information was collected, in ways similar to (but more intricate than) the lagged variables we created.
For more information on using excellent tools for ARIMA and ETS models, consult the high-level version of this tutorial where they were discussed.
Move on to the next section ➡️ to assess the accuracy of the models, following the game plan described in the previous section.