4 Data Prep

Next we will do some data cleaning to make sure our data is in the format we need it to be in. For a gentler introduction to data prep using the dplyr package (Wickham, François, et al. 2020) consult the high-level version.

4.1 Remove Nulls

First off, we aren’t able to do anything at all with a row of data if we don’t know when the data was collected. The specific price doesn’t matter if we can’t tie it to a timestamp, given by the date_time_utc field.

We can exclude all rows where the date_time_utc field has a Null (missing) value by using the filter() function from the dplyr package:

cryptodata <- filter(cryptodata, !is.na(date_time_utc))

This step removed 0 rows from the data on the latest run (2023-08-16). The is.na() function returns TRUE for every row where the date_time_utc field has a Null value. Preceding it with the ! operator negates the result, so filter() keeps only the rows where date_time_utc is not missing.
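As a minimal sketch of how the negated is.na() test behaves, the toy table below has one missing timestamp. The values are made up for illustration; only the column names mirror the tutorial’s data.

```r
library(dplyr)

# Toy data: made-up values, not taken from the real dataset
toy <- tibble(
  symbol        = c("BTC", "ETH", "LTC"),
  date_time_utc = as.POSIXct(c("2023-08-16 00:00:00", NA, "2023-08-16 00:00:05"),
                             tz = "UTC")
)

# is.na() flags the missing timestamp (FALSE, TRUE, FALSE here)
is.na(toy$date_time_utc)

# ! flips the flags, so filter() keeps only rows with a known timestamp
toy_clean <- filter(toy, !is.na(date_time_utc))
toy_clean
```

Only the BTC and LTC rows remain in toy_clean, because the ETH row has no timestamp.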

By the same logic, if we don’t know what the price was for any of the rows, the whole row of data is useless and should be removed. But how will we define the price of a cryptocurrency?

4.2 Calculate price_usd Column

In the previous section we discussed the intricacies of a cryptocurrency’s price. We could complicate our definition of a price by considering both the bid and ask prices from the perspective of someone who wants to perform trades, but this is not a trading tutorial. Instead, we will define the price of a cryptocurrency as the price we could purchase it for. We will calculate the price_usd field using the cheapest price available from the ask side where at least $15 worth of the cryptocurrency is being sold.

Therefore, let’s figure out the lowest price from the order book data that would allow us to purchase at least $15 worth of the cryptocurrency. To do this, for each ask price and quantity, let’s figure out the value of the trade in US Dollars. We can create each of the new trade_usd columns using the mutate() function. The trade_usd_1 column should be calculated as the ask_1_price multiplied by the ask_1_quantity. The next one, trade_usd_2, should use the ask_2_price but multiply it by the sum of ask_1_quantity and ask_2_quantity, because at the ask_2_price price point we can also purchase the quantity available at the ask_1_price price point:

cryptodata <- mutate(cryptodata, 
                     trade_usd_1 = ask_1_price * ask_1_quantity,
                     trade_usd_2 = ask_2_price * (ask_1_quantity + ask_2_quantity),
                     trade_usd_3 = ask_3_price * (ask_1_quantity + ask_2_quantity + ask_3_quantity),
                     trade_usd_4 = ask_4_price * (ask_1_quantity + ask_2_quantity + ask_3_quantity + ask_4_quantity),
                     trade_usd_5 = ask_5_price * (ask_1_quantity + ask_2_quantity + ask_3_quantity + ask_4_quantity + ask_5_quantity))

We can confirm that the trade_usd_1 field is calculating the $ value of the lowest ask price and quantity:

head(select(cryptodata, symbol, date_time_utc, ask_1_price, ask_1_quantity, trade_usd_1))
## # A tibble: 6 x 5
##   symbol date_time_utc       ask_1_price ask_1_quantity trade_usd_1
##   <chr>  <dttm>                    <dbl>          <dbl>       <dbl>
## 1 BTC    2023-08-16 00:00:00   29204.            0.0412      1204. 
## 2 ETH    2023-08-16 00:00:01    1830.            0.158        290. 
## 3 EOS    2023-08-16 00:00:02       0.680        63.8           43.3
## 4 LTC    2023-08-16 00:00:05      79.3           1.46         116. 
## 5 BSV    2023-08-16 00:00:11      34.6           2.62          90.7
## 6 ADA    2023-08-16 00:00:11       0.282      2092            590.

Now we can use the mutate() function to create the new field price_usd and find the lowest price at which we could have purchased at least $15 worth of the cryptocurrency. We can use the case_when() function to find the first trade_usd value that is greater than or equal to $15, and assign the corresponding ask price to the new column price_usd:

cryptodata <- mutate(cryptodata, 
                     # Conditions are evaluated in order; the first TRUE one sets the price
                     price_usd = case_when(
                       trade_usd_1 >= 15 ~ ask_1_price,
                       trade_usd_2 >= 15 ~ ask_2_price,
                       trade_usd_3 >= 15 ~ ask_3_price,
                       trade_usd_4 >= 15 ~ ask_4_price,
                       trade_usd_5 >= 15 ~ ask_5_price))
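To make the selection logic concrete, here is a small sketch on a hypothetical two-level order book. The symbols and numbers are made up, and only two ask levels are used instead of five to keep it short.

```r
library(dplyr)

# Hypothetical order book with two ask levels (made-up numbers)
toy <- tibble(
  symbol         = c("AAA", "BBB", "CCC"),
  ask_1_price    = c(10, 2.00, 1.00),
  ask_1_quantity = c(5, 3, 1),
  ask_2_price    = c(11, 2.10, 1.05),
  ask_2_quantity = c(5, 6, 2)
)

toy <- mutate(toy,
              trade_usd_1 = ask_1_price * ask_1_quantity,
              trade_usd_2 = ask_2_price * (ask_1_quantity + ask_2_quantity),
              # case_when() checks conditions top to bottom; the first TRUE one
              # wins, and rows matching no condition get NA
              price_usd = case_when(
                trade_usd_1 >= 15 ~ ask_1_price,
                trade_usd_2 >= 15 ~ ask_2_price))

select(toy, symbol, trade_usd_1, trade_usd_2, price_usd)
```

AAA clears $15 at the first level (10 * 5 = 50), so it gets ask_1_price. BBB only clears it at the second level (2.10 * 9 = 18.9), so it gets ask_2_price. CCC never reaches $15 (1.05 * 3 = 3.15), so its price_usd is NA and the row would be dropped by the filter that follows.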

Let’s also remove any rows that have Null values for the new price_usd field like we did for the date_time_utc field in a previous step. These will mostly be made up of rows where the value of trades through the 5th lowest ask price was lower than $15.

cryptodata <- filter(cryptodata, !is.na(price_usd))

This step removed 34257 rows on the latest run.

4.3 Clean Data by Group

In the high-level version of this tutorial we only dealt with one cryptocurrency. In this version however, we will be creating independent models for each cryptocurrency. Because of this, we need to ensure data quality not only for the data as a whole, but also for the data associated with each individual cryptocurrency. Instead of considering all rows when applying a transformation, we can group the data by the individual cryptocurrency and apply the transformation to each group. This will only work with compatible functions from dplyr and the tidyverse.

For example, we could count the number of observations by running the count() function on the data:

count(cryptodata)
## # A tibble: 1 x 1
##        n
##    <int>
## 1 265743

But what if we wanted to know how many observations in the data are associated with each cryptocurrency separately?

We can group the data using the group_by() function from the dplyr package and group the data by the cryptocurrency symbol:

cryptodata <- group_by(cryptodata, symbol)

Now if we run the same operation using the count() function, the operation is performed grouped by the cryptocurrency symbol:

count(cryptodata)
## # A tibble: 297 x 2
## # Groups:   symbol [297]
##    symbol     n
##    <chr>  <int>
##  1 1INCH   1493
##  2 AAB      719
##  3 ABBC     417
##  4 ACT       77
##  5 ADA     1493
##  6 AGIX    1492
##  7 AKRO     371
##  8 ALCX     229
##  9 ALI      568
## 10 ALICE    569
## # ... with 287 more rows

We can remove the grouping at any point by running the ungroup() function:

count(ungroup(cryptodata))
## # A tibble: 1 x 1
##        n
##    <int>
## 1 265743
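The same grouped versus ungrouped behavior can be sketched on a toy table with made-up values. The example also shows that count() accepts the grouping column directly, which gives the per-symbol counts without changing the grouping of the original object.

```r
library(dplyr)

# Toy data: made-up prices for two symbols
toy <- tibble(symbol    = c("BTC", "BTC", "ETH", "BTC", "ETH"),
              price_usd = c(29204, 29190, 1830, 29199, 1828))

# Grouping first gives one row per symbol
by_symbol <- count(group_by(toy, symbol))

# Passing the column to count() gives the same counts; toy itself stays ungrouped
by_symbol_short <- count(toy, symbol)

by_symbol
by_symbol_short
```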

4.3.1 Remove symbols without enough rows

Because this dataset evolves over time, we will need to be proactive about issues that may arise even if they aren’t currently a problem.

What happens if a new cryptocurrency gets added to the cryptocurrency exchange? If we only had a couple of days of data for an asset, that would not be enough information to build effective predictive models. We might also run into actual errors, since the data will be further split into more groups in order to validate the results of the models against several datasets using cross validation (more to come on that later).

To ensure we have a reasonable amount of data for each individual cryptocurrency, let’s filter out any cryptocurrencies that don’t have at least 1,000 observations using the filter() function:

cryptodata <- filter(cryptodata, n() >= 1000)

The number of rows for the cryptodata dataset before the filtering step was 265743 and is now 145476. This step removed 197 cryptocurrencies from the analysis that did not have enough observations associated with them.
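The filter works per group because n() returns the number of rows in the current group when the data is grouped. A minimal sketch on toy data, with made-up symbols and an arbitrary threshold of 3 instead of 1,000:

```r
library(dplyr)

# Toy grouped data: "OLD" has 5 rows, "NEW" has only 2
toy <- group_by(tibble(symbol    = c(rep("OLD", 5), rep("NEW", 2)),
                       price_usd = c(1, 2, 3, 4, 5, 10, 11)),
                symbol)

# Inside a grouped filter(), n() is the row count of the current group,
# so all rows of a group are kept or dropped together
kept <- filter(toy, n() >= 3)
kept
```

All five OLD rows are kept and both NEW rows are dropped, which is the behavior we want for recently listed assets.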

4.3.2 Remove symbols without data from the last 3 days

If no data was collected for a cryptocurrency over the last 3 days, let’s exclude that asset from the dataset, since we are only looking to model data that is currently flowing through the process. If an asset is removed from the exchange (if a project is a scam, for example) or is no longer being actively captured by the data collection process, we can’t make new predictions for it, so we might as well exclude it ahead of time.

cryptodata <- filter(cryptodata, max(date) > Sys.Date()-3)

The number of rows for the cryptodata dataset before this filtering step was 145476 and is now 140876.
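As with n(), max(date) is evaluated once per group, so a symbol is kept or dropped as a whole. The sketch below uses made-up symbols and dates, and a fixed reference date in place of Sys.Date() so that the result is reproducible:

```r
library(dplyr)

# Fixed reference date for reproducibility (the tutorial code uses Sys.Date())
today <- as.Date("2023-08-16")

toy <- group_by(tibble(
  symbol = c("LIVE", "LIVE", "STALE", "STALE"),
  date   = as.Date(c("2023-08-10", "2023-08-15", "2023-08-01", "2023-08-05"))
), symbol)

# A symbol survives only if its most recent date falls within the last 3 days
recent <- filter(toy, max(date) > today - 3)
recent
```

Both LIVE rows are kept, including the older one from 2023-08-10, because the condition looks at the most recent date of the group rather than at each row.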

4.4 Calculate Target

Our goal is to be able to make predictions on the price of each cryptocurrency 24 hours into the future from when the data was collected. Therefore, the target variable for the predictive models is the price 24 hours into the future relative to when the data was collected.

To do this we can create a new column in the data that is the price_usd offset by 24 rows (one for each hour), but before we can do that we need to make sure there are no gaps anywhere in the data.

4.4.1 Convert to tsibble

We can fill any gaps in the data using the tsibble package (Wang et al. 2020), which was covered in more detail in the high-level version of the tutorial.

4.4.1.1 Convert to hourly data

The data we are using was collected between the 0th and the 5th minute of every hour; it is collected in the same order every hour to try and get the timing as consistent as possible for each cryptocurrency, but the cadence is not exactly one hour. Therefore, if we convert the data now to a tsibble object, it would recognize the data as being collected on the wrong cadence.

To fix this issue, let’s create a new column called ts_index using the mutate() function which will keep the information relating to the date and hour collected, but generalize the minutes and seconds as “00:00”, which will be correctly recognized by the tsibble package as being data collected on an hourly basis. The pkDummy field contains the date and hour, so we can add the text “:00:00” to the end of that field, and then convert the new string to a date time object using the anytime() function from the anytime package (Eddelbuettel 2020):

cryptodata <- mutate(cryptodata, ts_index = anytime(paste0(pkDummy,':00:00')))
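The effect of padding the hour with “:00:00” can be sketched without the real data. Two assumptions are made here: pkDummy is assumed to look like "YYYY-MM-DD HH" (the real field may differ), and base R’s as.POSIXct() is used as a stand-in for anytime() so that the time zone is explicit.

```r
# Assumed pkDummy format ("YYYY-MM-DD HH"); the real field may differ
pk_dummy <- c("2023-08-16 00", "2023-08-16 01")

# Base-R stand-in for anytime(): parse the padded string as UTC
ts_index <- as.POSIXct(paste0(pk_dummy, ":00:00"), tz = "UTC")

# Minutes and seconds are now always zero
format(ts_index, "%Y-%m-%d %H:%M:%S")

# Consecutive observations are exactly one hour apart
as.numeric(diff(ts_index), units = "hours")
```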

Before we can convert the data to be a tsibble and easily fill in the gaps, we also need to make sure there are no duplicate values in the ts_index column for each cryptocurrency. There shouldn’t be any duplicates, but just in case any make their way into the data somehow, we can use the distinct() function from the dplyr package to prevent the issue from potentially arising:

cryptodata <- distinct(cryptodata, symbol, ts_index, .keep_all=TRUE)
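A small sketch of what distinct() does when a duplicate slips in. The values are made up; with .keep_all = TRUE the first row of each symbol and ts_index combination is kept along with all of its other columns.

```r
library(dplyr)

ts <- as.POSIXct("2023-08-16 01:00:00", tz = "UTC")

# Toy data: BTC appears twice for the same hour
toy <- tibble(symbol    = c("BTC", "BTC", "ETH"),
              ts_index  = rep(ts, 3),
              price_usd = c(29204, 29210, 1830))

# Keep the first row per (symbol, ts_index) pair, with all columns
deduped <- distinct(toy, symbol, ts_index, .keep_all = TRUE)
deduped
```

The second BTC row (price 29210) is discarded, leaving one row per symbol and hour.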

Now we can finally convert the table to a tsibble data type by using the as_tsibble() function from the tsibble package (Wang et al. 2020), and providing the symbol column for the key parameter to preserve the grouped structure:

cryptodata <- as_tsibble(cryptodata, index = ts_index, key = symbol)

Notice how the preview of the data below looks a bit different from the summary we were seeing up to this point: it now says “A tsibble”, and next to the table dimensions it says [1h], indicating the observations are 1 hour apart from each other. The second row tells us the “Key” of the tsibble is the symbol column.

cryptodata
## # A tsibble: 140,876 x 34 [1h] <UTC>
## # Key:       symbol [96]
## # Groups:    symbol [96]
##    pair  symbol quote_currency ask_1_price ask_1_quantity ask_2_price
##    <chr> <chr>  <chr>                <dbl>          <dbl>       <dbl>
##  1 1INC~ 1INCH  USD                  0.269           191        0.269
##  2 1INC~ 1INCH  USD                  0.269          8719.       0.269
##  3 1INC~ 1INCH  USD                  0.270           191        0.270
##  4 1INC~ 1INCH  USD                  0.271          8719.       0.271
##  5 1INC~ 1INCH  USD                  0.270          8719.       0.270
##  6 1INC~ 1INCH  USD                  0.269          8719.       0.270
##  7 1INC~ 1INCH  USD                  0.268          8719.       0.268
##  8 1INC~ 1INCH  USD                  0.268           954.       0.268
##  9 1INC~ 1INCH  USD                  0.267          8719.       0.267
## 10 1INC~ 1INCH  USD                  0.268          8719.       0.268
## # ... with 140,866 more rows, and 28 more variables: ask_2_quantity <dbl>,
## #   ask_3_price <dbl>, ask_3_quantity <dbl>, ask_4_price <dbl>,
## #   ask_4_quantity <dbl>, ask_5_price <dbl>, ask_5_quantity <dbl>,
## #   bid_1_price <dbl>, bid_1_quantity <dbl>, bid_2_price <dbl>,
## #   bid_2_quantity <dbl>, bid_3_price <dbl>, bid_3_quantity <dbl>,
## #   bid_4_price <dbl>, bid_4_quantity <dbl>, bid_5_price <dbl>,
## #   bid_5_quantity <dbl>, date_time_utc <dttm>, date <date>, pkDummy <chr>,
## #   pkey <chr>, trade_usd_1 <dbl>, trade_usd_2 <dbl>, trade_usd_3 <dbl>,
## #   trade_usd_4 <dbl>, trade_usd_5 <dbl>, price_usd <dbl>, ts_index <dttm>

4.4.2 Fill gaps

Now we can use the fill_gaps() function from the tsibble package to turn any gaps found in the data (hours that are implicitly missing) into explicit rows. In other words, we will add these rows into the data with NA values for everything except the symbol and ts_index fields. This will allow us to safely compute the target price found 24 hours into the future relative to when each row was collected.

cryptodata <- fill_gaps(cryptodata)

Now looking at the data again, 4085 additional rows were added for the hours that were previously missing from the data:

cryptodata
## # A tsibble: 144,961 x 34 [1h] <UTC>
## # Key:       symbol [96]
## # Groups:    symbol [96]
##    pair  symbol quote_currency ask_1_price ask_1_quantity ask_2_price
##    <chr> <chr>  <chr>                <dbl>          <dbl>       <dbl>
##  1 1INC~ 1INCH  USD                  0.269           191        0.269
##  2 1INC~ 1INCH  USD                  0.269          8719.       0.269
##  3 1INC~ 1INCH  USD                  0.270           191        0.270
##  4 1INC~ 1INCH  USD                  0.271          8719.       0.271
##  5 1INC~ 1INCH  USD                  0.270          8719.       0.270
##  6 1INC~ 1INCH  USD                  0.269          8719.       0.270
##  7 1INC~ 1INCH  USD                  0.268          8719.       0.268
##  8 1INC~ 1INCH  USD                  0.268           954.       0.268
##  9 1INC~ 1INCH  USD                  0.267          8719.       0.267
## 10 1INC~ 1INCH  USD                  0.268          8719.       0.268
## # ... with 144,951 more rows, and 28 more variables: ask_2_quantity <dbl>,
## #   ask_3_price <dbl>, ask_3_quantity <dbl>, ask_4_price <dbl>,
## #   ask_4_quantity <dbl>, ask_5_price <dbl>, ask_5_quantity <dbl>,
## #   bid_1_price <dbl>, bid_1_quantity <dbl>, bid_2_price <dbl>,
## #   bid_2_quantity <dbl>, bid_3_price <dbl>, bid_3_quantity <dbl>,
## #   bid_4_price <dbl>, bid_4_quantity <dbl>, bid_5_price <dbl>,
## #   bid_5_quantity <dbl>, date_time_utc <dttm>, date <date>, pkDummy <chr>,
## #   pkey <chr>, trade_usd_1 <dbl>, trade_usd_2 <dbl>, trade_usd_3 <dbl>,
## #   trade_usd_4 <dbl>, trade_usd_5 <dbl>, price_usd <dbl>, ts_index <dttm>
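The gap-filling step is easier to see on a tiny example. The sketch below uses made-up prices for a single symbol with the 02:00 observation missing:

```r
library(dplyr)
library(tsibble)

# Toy hourly data with the 02:00 observation missing (made-up prices)
toy <- tibble(
  symbol    = "BTC",
  ts_index  = as.POSIXct(c("2023-08-16 00:00:00", "2023-08-16 01:00:00",
                           "2023-08-16 03:00:00"), tz = "UTC"),
  price_usd = c(29204, 29190, 29234)
)

toy_ts <- as_tsibble(toy, index = ts_index, key = symbol)

# fill_gaps() inserts a row for 02:00; price_usd is NA for that row
filled <- fill_gaps(toy_ts)
filled
```

The result has four rows instead of three. The new row carries the symbol and the timestamp, while price_usd is NA.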

Now that all of the gaps have been filled in, let’s convert the data back to a tibble, the data structure that supports the grouping we discussed previously, and group the data by the symbol again:

cryptodata <- group_by(as_tibble(cryptodata), symbol)

4.4.3 Calculate Target

Now we finally have everything we need to calculate the target variable containing the price 24 hours into the future relative to when the data was collected. We can use the usual mutate() function to add a new column to the data called target_price_24h, and use the lead() function from dplyr to offset the price_usd column by 24 hours:

cryptodata <- mutate(cryptodata, 
                     target_price_24h = lead(price_usd, 24, order_by=ts_index))
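A minimal sketch of how lead() shifts values, using a made-up price vector and an offset of 2 instead of 24. The second call illustrates why order_by matters when rows are not already sorted by time.

```r
library(dplyr)

price <- c(100, 101, 102, 103, 104)

# Each position receives the value found 2 positions later;
# the last 2 positions have no future value, so they become NA
lead(price, 2)
# → 102 103 104  NA  NA

# Rows deliberately out of chronological order (ts is a stand-in for ts_index)
ts             <- c(3, 1, 2, 5, 4)
price_shuffled <- c(102, 100, 101, 104, 103)

# order_by makes lead() follow the time order rather than the row order
lead(price_shuffled, 1, order_by = ts)
# → 103 101 102  NA 104
```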

4.4.4 Calculate Lagged Prices

What about doing the opposite? If we added a new column showing the price from 24 hours earlier, could the price movement between then and when the data was collected help us predict where the price is headed next? If the price has gone down significantly over the previous 24 hours, is the price for the next 24 hours more likely to increase or decrease? What if the price has gone down significantly over the previous 24 hours, but has increased significantly since the past hour?

These relationships around the sensitivity of a price to recent price changes may help our models come up with more accurate forecasts about the future, so let’s go ahead and add some lagged prices using the same methodology used to calculate the target variable, but this time using the lag() function to get past observations instead of the lead() function used before:

cryptodata <- mutate(cryptodata,
                     lagged_price_1h  = lag(price_usd, 1, order_by=ts_index),
                     lagged_price_2h  = lag(price_usd, 2, order_by=ts_index),
                     lagged_price_3h  = lag(price_usd, 3, order_by=ts_index),
                     lagged_price_6h  = lag(price_usd, 6, order_by=ts_index),
                     lagged_price_12h = lag(price_usd, 12, order_by=ts_index),
                     lagged_price_24h = lag(price_usd, 24, order_by=ts_index),
                     lagged_price_3d  = lag(price_usd, 24*3, order_by=ts_index))

This step can be thought of as data engineering more than data cleaning, because rather than fixing an issue we are enhancing the dataset with columns that may help with the forecasts.
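Because the data is grouped by symbol, the lagged values never cross from one cryptocurrency into another. A short sketch on toy data with made-up prices, where the integers 1 to 3 stand in for hourly timestamps:

```r
library(dplyr)

toy <- group_by(tibble(
  symbol    = c("BTC", "BTC", "BTC", "ETH", "ETH", "ETH"),
  ts_index  = rep(1:3, times = 2),   # stand-in for hourly timestamps
  price_usd = c(100, 110, 121, 10, 9, 12)
), symbol)

# lag() restarts for every symbol, so the first ETH row gets NA
# rather than the last BTC price
toy <- mutate(toy, lagged_price_1h = lag(price_usd, 1, order_by = ts_index))
toy
```

The first row of each symbol has an NA lagged price, which is the same pattern we expect to see at the start of each cryptocurrency’s history in the real data.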

Let’s view an example of the oldest 30 rows of data associated with the Bitcoin cryptocurrency (symbol == "BTC"). With the oldest data starting from the top, the lagged_price_1h field should have an NA value for the first row because we don’t have any prices before that point. By that same logic, the lagged_price_24h column should be missing the first 24 values, and its last 6 values should show the first 6 rows of the price_usd column. The target_price_24h column would have values for the oldest data because the opposite is true: we don’t know the future values for the most recent 24 rows of the data:

print(select(filter(cryptodata, symbol == 'BTC'), 
             ts_index, price_usd, lagged_price_1h, 
             lagged_price_24h, target_price_24h), n=30)
## # A tibble: 1,513 x 6
## # Groups:   symbol [1]
##    symbol ts_index            price_usd lagged_price_1h lagged_price_24h
##    <chr>  <dttm>                  <dbl>           <dbl>            <dbl>
##  1 BTC    2023-06-14 00:00:00    25933.             NA               NA 
##  2 BTC    2023-06-14 01:00:00    25990.          25933.              NA 
##  3 BTC    2023-06-14 02:00:00    26046.          25990.              NA 
##  4 BTC    2023-06-14 03:00:00    26025.          26046.              NA 
##  5 BTC    2023-06-14 04:00:00    25985.          26025.              NA 
##  6 BTC    2023-06-14 05:00:00    25980.          25985.              NA 
##  7 BTC    2023-06-14 06:00:00    25845.          25980.              NA 
##  8 BTC    2023-06-14 07:00:00    25895.          25845.              NA 
##  9 BTC    2023-06-14 08:00:00    25897.          25895.              NA 
## 10 BTC    2023-06-14 09:00:00    25929.          25897.              NA 
## 11 BTC    2023-06-14 10:00:00    25982.          25929.              NA 
## 12 BTC    2023-06-14 11:00:00    25957.          25982.              NA 
## 13 BTC    2023-06-14 12:00:00       NA           25957.              NA 
## 14 BTC    2023-06-14 13:00:00       NA              NA               NA 
## 15 BTC    2023-06-14 14:00:00       NA              NA               NA 
## 16 BTC    2023-06-14 15:00:00       NA              NA               NA 
## 17 BTC    2023-06-14 16:00:00       NA              NA               NA 
## 18 BTC    2023-06-14 17:00:00       NA              NA               NA 
## 19 BTC    2023-06-14 18:00:00       NA              NA               NA 
## 20 BTC    2023-06-14 19:00:00       NA              NA               NA 
## 21 BTC    2023-06-14 20:00:00       NA              NA               NA 
## 22 BTC    2023-06-14 21:00:00       NA              NA               NA 
## 23 BTC    2023-06-14 22:00:00       NA              NA               NA 
## 24 BTC    2023-06-14 23:00:00       NA              NA               NA 
## 25 BTC    2023-06-15 00:00:00    25130.             NA            25933.
## 26 BTC    2023-06-15 01:00:00    25214.          25130.           25990.
## 27 BTC    2023-06-15 02:00:00    25073.          25214.           26046.
## 28 BTC    2023-06-15 03:00:00    25092.          25073.           26025.
## 29 BTC    2023-06-15 04:00:00    25072.          25092.           25985.
## 30 BTC    2023-06-15 05:00:00    25033.          25072.           25980.
## # ... with 1,483 more rows, and 1 more variable: target_price_24h <dbl>

We can wrap the code used above in the tail() function to show the most recent data and see the opposite dynamic with the new fields we created:

print(tail(select(filter(cryptodata, symbol == 'BTC'), 
                  ts_index, price_usd, lagged_price_24h, 
                  target_price_24h),30), n=30)
## # A tibble: 30 x 5
## # Groups:   symbol [1]
##    symbol ts_index            price_usd lagged_price_24h target_price_24h
##    <chr>  <dttm>                  <dbl>            <dbl>            <dbl>
##  1 BTC    2023-08-14 19:00:00    29326.           29422.           29322.
##  2 BTC    2023-08-14 20:00:00    29358.           29448.           29190.
##  3 BTC    2023-08-14 21:00:00    29403.           29424.           29199.
##  4 BTC    2023-08-14 22:00:00    29404.           29373.           29234.
##  5 BTC    2023-08-14 23:00:00    29435.           29320.           29186.
##  6 BTC    2023-08-15 00:00:00    29431.           29305.           29204.
##  7 BTC    2023-08-15 01:00:00    29432.           29271.              NA 
##  8 BTC    2023-08-15 02:00:00    29413.           29269.              NA 
##  9 BTC    2023-08-15 03:00:00    29386.           29325.              NA 
## 10 BTC    2023-08-15 04:00:00    29403.           29410.              NA 
## 11 BTC    2023-08-15 05:00:00    29376.           29416.              NA 
## 12 BTC    2023-08-15 06:00:00    29366.           29448               NA 
## 13 BTC    2023-08-15 07:00:00    29371.           29420.              NA 
## 14 BTC    2023-08-15 08:00:00    29408.           29420.              NA 
## 15 BTC    2023-08-15 09:00:00    29427.           29423.              NA 
## 16 BTC    2023-08-15 10:00:00    29424.           29408.              NA 
## 17 BTC    2023-08-15 11:00:00    29407.           29426.              NA 
## 18 BTC    2023-08-15 12:00:00    29369.           29407.              NA 
## 19 BTC    2023-08-15 13:00:00    29369.           29382.              NA 
## 20 BTC    2023-08-15 14:00:00    29377.           29382.              NA 
## 21 BTC    2023-08-15 15:00:00    29455.           29538.              NA 
## 22 BTC    2023-08-15 16:00:00    29335.           29602.              NA 
## 23 BTC    2023-08-15 17:00:00    29287.           29655.              NA 
## 24 BTC    2023-08-15 18:00:00    29327.           29497.              NA 
## 25 BTC    2023-08-15 19:00:00    29322.           29326.              NA 
## 26 BTC    2023-08-15 20:00:00    29190.           29358.              NA 
## 27 BTC    2023-08-15 21:00:00    29199.           29403.              NA 
## 28 BTC    2023-08-15 22:00:00    29234.           29404.              NA 
## 29 BTC    2023-08-15 23:00:00    29186.           29435.              NA 
## 30 BTC    2023-08-16 00:00:00    29204.           29431.              NA

Reading the code shown above is less than ideal. One of the more popular tools introduced by the tidyverse is the %>% operator, which works by starting with the object/data you want to make changes to and then applying each transformation step by step. It’s simply a way of rewriting the same code so that it is easier to read: each function call gets its own step, rather than functions being nested inside each other on a single line that becomes really hard to read. In the example above it becomes difficult to keep track of where things begin, the order of operations, and the parameters associated with the specific functions. Compare that to the code below:

# Start with the object/data to manipulate
cryptodata %>% 
  # Filter the data to only the BTC symbol
  filter(symbol == 'BTC') %>% 
  # Select columns to display
  select(ts_index, price_usd, lagged_price_24h, target_price_24h) %>% 
  # Show the last 30 elements of the data
  tail(30) %>% 
  # Show all 30 elements instead of the default 10 for a tibble dataframe
  print(n = 30)
## # A tibble: 30 x 5
## # Groups:   symbol [1]
##    symbol ts_index            price_usd lagged_price_24h target_price_24h
##    <chr>  <dttm>                  <dbl>            <dbl>            <dbl>
##  1 BTC    2023-08-14 19:00:00    29326.           29422.           29322.
##  2 BTC    2023-08-14 20:00:00    29358.           29448.           29190.
##  3 BTC    2023-08-14 21:00:00    29403.           29424.           29199.
##  4 BTC    2023-08-14 22:00:00    29404.           29373.           29234.
##  5 BTC    2023-08-14 23:00:00    29435.           29320.           29186.
##  6 BTC    2023-08-15 00:00:00    29431.           29305.           29204.
##  7 BTC    2023-08-15 01:00:00    29432.           29271.              NA 
##  8 BTC    2023-08-15 02:00:00    29413.           29269.              NA 
##  9 BTC    2023-08-15 03:00:00    29386.           29325.              NA 
## 10 BTC    2023-08-15 04:00:00    29403.           29410.              NA 
## 11 BTC    2023-08-15 05:00:00    29376.           29416.              NA 
## 12 BTC    2023-08-15 06:00:00    29366.           29448               NA 
## 13 BTC    2023-08-15 07:00:00    29371.           29420.              NA 
## 14 BTC    2023-08-15 08:00:00    29408.           29420.              NA 
## 15 BTC    2023-08-15 09:00:00    29427.           29423.              NA 
## 16 BTC    2023-08-15 10:00:00    29424.           29408.              NA 
## 17 BTC    2023-08-15 11:00:00    29407.           29426.              NA 
## 18 BTC    2023-08-15 12:00:00    29369.           29407.              NA 
## 19 BTC    2023-08-15 13:00:00    29369.           29382.              NA 
## 20 BTC    2023-08-15 14:00:00    29377.           29382.              NA 
## 21 BTC    2023-08-15 15:00:00    29455.           29538.              NA 
## 22 BTC    2023-08-15 16:00:00    29335.           29602.              NA 
## 23 BTC    2023-08-15 17:00:00    29287.           29655.              NA 
## 24 BTC    2023-08-15 18:00:00    29327.           29497.              NA 
## 25 BTC    2023-08-15 19:00:00    29322.           29326.              NA 
## 26 BTC    2023-08-15 20:00:00    29190.           29358.              NA 
## 27 BTC    2023-08-15 21:00:00    29199.           29403.              NA 
## 28 BTC    2023-08-15 22:00:00    29234.           29404.              NA 
## 29 BTC    2023-08-15 23:00:00    29186.           29435.              NA 
## 30 BTC    2023-08-16 00:00:00    29204.           29431.              NA

There are several advantages to writing code the tidy way, but while some love it others hate it. We won’t force anyone to understand how the %>% operator works, so we have stayed away from it for the rest of the code shown, but we do encourage the use of this tool: https://magrittr.tidyverse.org/reference/pipe.html
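For anyone unsure whether the two styles really do the same thing, the following sketch compares a nested call with its piped equivalent on toy data (made-up values):

```r
library(dplyr)

toy <- tibble(symbol    = c("BTC", "ETH", "BTC"),
              price_usd = c(29204, 1830, 29190))

# Nested style: read from the inside out
nested <- select(filter(toy, symbol == "BTC"), price_usd)

# Piped style: read from top to bottom
piped <- toy %>%
  filter(symbol == "BTC") %>%
  select(price_usd)

# Both approaches should produce the same table
identical(nested, piped)
```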

4.5 Remove Nulls

We can’t do anything with a row of data if we don’t know when the data was collected, so let’s double-check that all rows have a value for the date_time_utc field by using the filter() function from the dplyr package to exclude any rows with NA values for the column:

# Remove all NA values of date_time_utc:
cryptodata <- filter(cryptodata, !is.na(date_time_utc))

This step removed 4085 rows from the data (the gap rows added by fill_gaps() earlier, which have no date_time_utc value). This step mainly helps us avoid issues when programmatically labeling charts in the next section. Move on to the next section ➡️ to learn some amazingly powerful tools to visualize data!

References

Eddelbuettel, Dirk. 2020. Anytime: Anything to Posixct or Date Converter. http://dirk.eddelbuettel.com/code/anytime.html.

Wang, Earo, Di Cook, Rob Hyndman, and Mitchell O’Hara-Wild. 2020. Tsibble: Tidy Temporal Data Frames and Tools. https://tsibble.tidyverts.org.

Wickham, Hadley, Romain François, Lionel Henry, and Kirill Müller. 2020. Dplyr: A Grammar of Data Manipulation. https://CRAN.R-project.org/package=dplyr.