Data Preparation
To fully utilize the power of HftBacktest, you need to feed it tick-by-tick full order book and trade data. Unfortunately, unlike the daily bar data provided by platforms such as Yahoo Finance, free tick-by-tick full order book and trade feed data for HFT is not available. However, in the case of cryptocurrency, you can collect the full raw feed yourself.
Getting started from Binance Futures’ raw feed data
You can collect the Binance Futures feed yourself using Data Collector.
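Data Collector stores one message per line: the local receipt timestamp in nanoseconds, a space, then the raw combined-stream JSON. To illustrate the idea only (this is not the actual Data Collector), here is a minimal sketch using the third-party websockets package; the stream names follow the Binance Futures combined-stream convention visible in the sample output below:

import asyncio
import gzip
import time

import websockets  # pip install websockets

async def collect(symbol, filename):
    # Subscribe to depth diffs (0ms), bookTicker, and trades via the combined stream.
    url = ('wss://fstream.binance.com/stream?streams='
           f'{symbol}@depth@0ms/{symbol}@bookTicker/{symbol}@trade')
    async with websockets.connect(url) as ws:
        with gzip.open(filename, 'at') as f:
            async for msg in ws:
                # Prefix each raw message with the local receipt timestamp in nanoseconds.
                f.write(f'{time.time_ns()} {msg}\n')

# asyncio.run(collect('btcusdt', 'usdm/btcusdt_20240808.gz'))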
[1]:
import gzip
with gzip.open('usdm/btcusdt_20240808.gz', 'r') as f:
    for i in range(5):
        line = f.readline()
        print(line)
b'1723161255030314667 {"stream":"btcusdt@depth@0ms","data":{"e":"depthUpdate","E":1723161256299,"T":1723161256298,"s":"BTCUSDT","U":5123107832006,"u":5123107837557,"pu":5123107831937,"b":[["58710.20","0.014"],["61496.50","0.010"],["61510.90","0.000"],["61641.50","1.211"],["61652.80","0.195"],["61653.30","0.072"],["61653.70","0.067"],["61657.90","0.067"],["61668.50","0.086"],["61670.60","0.161"],["61672.50","0.821"],["61673.60","0.048"],["61675.60","0.050"],["61684.50","0.765"],["61686.20","0.008"],["61701.80","0.331"],["61703.10","0.238"],["61715.90","0.308"],["61721.60","0.235"],["61724.10","0.002"],["61737.00","0.015"],["61739.00","0.000"],["61740.10","0.008"],["61740.50","12.111"],["61756.90","0.550"],["61758.70","0.003"],["61763.20","0.014"],["61764.10","0.168"],["61764.30","0.000"],["61765.50","0.000"],["61767.40","0.004"],["61768.20","0.120"],["61768.60","0.020"],["61768.90","0.099"],["61770.80","0.049"],["61771.10","0.612"],["61771.70","0.010"],["61773.50","0.035"],["61773.80","0.025"],["61774.00","0.112"],["61775.60","0.010"],["61776.00","0.084"],["61778.30","0.000"],["61778.60","0.408"],["61779.30","0.020"],["61779.60","0.220"],["61783.80","0.002"],["61784.90","0.102"],["61785.00","0.000"],["61788.10","0.140"],["61789.50","0.000"],["61798.70","0.153"],["61800.20","2.507"]],"a":[["61800.30","3.330"],["61804.60","0.057"],["61810.00","0.285"],["61812.00","0.732"],["61814.90","0.000"],["61817.20","0.000"],["61818.70","0.040"],["61824.00","0.860"],["61829.10","0.185"],["61831.30","0.008"],["61831.40","0.501"],["61839.00","0.002"],["61840.00","0.192"],["61856.30","0.003"],["61857.10","0.027"],["61857.40","0.000"],["61858.80","0.005"],["61858.90","0.032"],["61859.60","0.034"],["61874.80","0.006"],["61893.40","0.335"],["61911.90","0.014"],["61925.90","0.000"],["61930.50","0.015"],["61945.10","0.000"],["61953.70","0.000"],["62144.00","0.006"],["63113.70","0.000"],["65880.70","15.918"]]}}\n'
b'1723161255088169167 {"stream":"btcusdt@bookTicker","data":{"e":"bookTicker","u":5123107839020,"s":"BTCUSDT","b":"61800.20","B":"2.507","a":"61800.30","A":"2.510","T":1723161256313,"E":1723161256313}}\n'
b'1723161255088176367 {"stream":"btcusdt@trade","data":{"e":"trade","E":1723161256322,"T":1723161256322,"s":"BTCUSDT","t":5266583935,"p":"61800.30","q":"0.006","X":"MARKET","m":false}}\n'
b'1723161255088181667 {"stream":"btcusdt@bookTicker","data":{"e":"bookTicker","u":5123107840008,"s":"BTCUSDT","b":"61800.20","B":"2.507","a":"61800.30","A":"2.504","T":1723161256322,"E":1723161256322}}\n'
b'1723161255088182467 {"stream":"btcusdt@bookTicker","data":{"e":"bookTicker","u":5123107840016,"s":"BTCUSDT","b":"61800.20","B":"2.507","a":"61800.30","A":"2.522","T":1723161256322,"E":1723161256322}}\n'
The first token of each line is the timestamp at which the message was received locally.
Note: The timestamp is in nanoseconds.
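For example, each line can be split into the local timestamp and the JSON payload (a quick illustration; the converter below handles this for you):

import gzip
import json

with gzip.open('usdm/btcusdt_20240808.gz', 'rt') as f:
    line = f.readline()
    local_ts, raw = line.split(' ', 1)
    msg = json.loads(raw)
    # local_ts is in nanoseconds; the payload's E field is in milliseconds.
    print(int(local_ts), msg['stream'], msg['data']['E'])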
The convert method also attempts to correct timestamps by reordering the rows.
[2]:
import numpy as np
from hftbacktest.data.utils import binancefutures
data = binancefutures.convert(
    'usdm/btcusdt_20240808.gz',
    combined_stream=True
)
Correcting the latency
local_timestamp is ahead of exch_timestamp by 1272156851
Correcting the event order
The normalized data is shown below. You can find more details in Data.
[3]:
import polars as pl
pl.DataFrame(data)
[3]:
ev | exch_ts | local_ts | px | qty | order_id | ival | fval |
---|---|---|---|---|---|---|---|
u64 | i64 | i64 | f64 | f64 | u64 | i64 | f64 |
3758096385 | 1723161256298000000 | 1723161256302471518 | 58710.2 | 0.014 | 0 | 0 | 0.0 |
3758096385 | 1723161256298000000 | 1723161256302471518 | 61496.5 | 0.01 | 0 | 0 | 0.0 |
3758096385 | 1723161256298000000 | 1723161256302471518 | 61510.9 | 0.0 | 0 | 0 | 0.0 |
3758096385 | 1723161256298000000 | 1723161256302471518 | 61641.5 | 1.211 | 0 | 0 | 0.0 |
3758096385 | 1723161256298000000 | 1723161256302471518 | 61652.8 | 0.195 | 0 | 0 | 0.0 |
… | … | … | … | … | … | … | … |
3489660929 | 1723161600030000000 | 1723161600043617932 | 62292.9 | 0.0 | 0 | 0 | 0.0 |
3758096385 | 1723161600319000000 | 1723161600370793433 | 5000.0 | 2.321 | 0 | 0 | 0.0 |
3489660929 | 1723161600709000000 | 1723161600760777134 | 61659.8 | 0.981 | 0 | 0 | 0.0 |
3758096385 | 1723161601054000000 | 1723161601105649435 | 61631.7 | 0.283 | 0 | 0 | 0.0 |
3758096385 | 1723161601054000000 | 1723161601105649435 | 61632.6 | 0.0 | 0 | 0 | 0.0 |
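Since the result is a NumPy structured array with the fields shown above, a quick sanity check confirms the latency correction:

# After correction, no event should appear to be received locally
# before it occurred at the exchange.
assert (data['local_ts'] >= data['exch_ts']).all()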
You can save the data directly to a file by providing output_filename.
[4]:
_ = binancefutures.convert(
    'usdm/btcusdt_20240808.gz',
    output_filename='usdm/btcusdt_20240808.npz',
    combined_stream=True
)
Correcting the latency
local_timestamp is ahead of exch_timestamp by 1272156851
Correcting the event order
Saving to usdm/btcusdt_20240808.npz
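The saved file is a regular .npz archive; assuming the event array is stored under the data key (as in the hftbacktest tutorials), you can load it back with NumPy:

import numpy as np

# Load the converted events back from the .npz file.
data = np.load('usdm/btcusdt_20240808.npz')['data']
print(data.dtype.names)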
Creating a market depth snapshot
[5]:
from hftbacktest.data.utils.snapshot import create_last_snapshot
# Builds the 20240808 end-of-day snapshot. It will be used as the initial snapshot for 20240809.
data = create_last_snapshot(
    ['usdm/btcusdt_20240808.npz'],
    tick_size=0.1,
    lot_size=0.001
)
Bid levels are shown before ask levels in the snapshot, and levels are sorted from the best price to the farthest price.
[6]:
pl.DataFrame(data)
[6]:
ev | exch_ts | local_ts | px | qty | order_id | ival | fval |
---|---|---|---|---|---|---|---|
u64 | i64 | i64 | f64 | f64 | u64 | i64 | f64 |
3758096388 | 0 | 0 | 61659.7 | 1.486 | 0 | 0 | 0.0 |
3758096388 | 0 | 0 | 61659.0 | 0.002 | 0 | 0 | 0.0 |
3758096388 | 0 | 0 | 61658.1 | 0.033 | 0 | 0 | 0.0 |
3758096388 | 0 | 0 | 61658.0 | 6.718 | 0 | 0 | 0.0 |
3758096388 | 0 | 0 | 61657.9 | 0.007 | 0 | 0 | 0.0 |
… | … | … | … | … | … | … | … |
3489660932 | 0 | 0 | 77354.3 | 0.015 | 0 | 0 | 0.0 |
3489660932 | 0 | 0 | 77905.9 | 0.003 | 0 | 0 | 0.0 |
3489660932 | 0 | 0 | 80000.0 | 10.708 | 0 | 0 | 0.0 |
3489660932 | 0 | 0 | 104765.0 | 0.034 | 0 | 0 | 0.0 |
3489660932 | 0 | 0 | 617050.0 | 0.003 | 0 | 0 | 0.0 |
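The ev column encodes the side and the event type as flag bits. As a small illustration, assuming the BUY_EVENT and SELL_EVENT flag constants exported by hftbacktest, you can split the snapshot into sides; because each side is sorted from the best price outward, the first row per side is the best quote:

from hftbacktest import BUY_EVENT, SELL_EVENT

# Split snapshot rows by side using the flag bits in 'ev'.
bids = data[(data['ev'] & BUY_EVENT) == BUY_EVENT]
asks = data[(data['ev'] & SELL_EVENT) == SELL_EVENT]
print('best bid:', bids['px'][0], 'best ask:', asks['px'][0])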
[7]:
from hftbacktest.data.utils.snapshot import create_last_snapshot
# Builds the 20240808 end-of-day snapshot. It will be used as the initial snapshot for 20240809.
_ = create_last_snapshot(
    ['usdm/btcusdt_20240808.npz'],
    tick_size=0.1,
    lot_size=0.001,
    output_snapshot_filename='usdm/btcusdt_20240808_eod.npz'
)
[8]:
# Converts the 20240809 data.
_ = binancefutures.convert(
    'usdm/btcusdt_20240809.gz',
    output_filename='usdm/btcusdt_20240809.npz',
    combined_stream=True
)
# Builds 20240809's last snapshot.
# Due to GitHub's file size limit, btcusdt_20240809.npz does not contain data for the entire day.
_ = create_last_snapshot(
    ['usdm/btcusdt_20240809.npz'],
    tick_size=0.1,
    lot_size=0.001,
    output_snapshot_filename='usdm/btcusdt_20240809_last.npz',
    initial_snapshot='usdm/btcusdt_20240808_eod.npz',
)
Correcting the latency
local_timestamp is ahead of exch_timestamp by 1273873720
Correcting the event order
Saving to usdm/btcusdt_20240809.npz
[9]:
# Builds 20240809's last snapshot without the initial snapshot.
_ = create_last_snapshot(
    ['usdm/btcusdt_20240809.npz'],
    tick_size=0.1,
    lot_size=0.001,
    output_snapshot_filename='usdm/btcusdt_20240809_last_wo_ss.npz'
)

# Builds 20240809's last snapshot from the 20240808 data without an initial snapshot.
_ = create_last_snapshot(
    [
        'usdm/btcusdt_20240808.npz',
        'usdm/btcusdt_20240809.npz'
    ],
    tick_size=0.1,
    lot_size=0.001,
    output_snapshot_filename='usdm/btcusdt_20240809_last.npz'
)
Getting started from Tardis.dev data
A few vendors offer tick-by-tick full market depth data along with snapshot and trade data, and Tardis.dev is among them.
Note: Some data may have an issue with the exchange timestamp. Ideally, the exchange timestamp should reflect the moment the event occurred at the matching engine, but some data instead carries the timestamp at which the server sent the message.
[10]:
# https://docs.tardis.dev/historical-data-details/binance-futures
# Downloads sample Binance futures BTCUSDT trades
!wget https://datasets.tardis.dev/v1/binance-futures/trades/2020/02/01/BTCUSDT.csv.gz -O BTCUSDT_trades.csv.gz
# Downloads sample Binance futures BTCUSDT book
!wget https://datasets.tardis.dev/v1/binance-futures/incremental_book_L2/2020/02/01/BTCUSDT.csv.gz -O BTCUSDT_book.csv.gz
--2024-08-09 09:42:51-- https://datasets.tardis.dev/v1/binance-futures/trades/2020/02/01/BTCUSDT.csv.gz
Resolving datasets.tardis.dev (datasets.tardis.dev)... 104.18.6.96, 104.18.7.96, 2606:4700::6812:760, ...
Connecting to datasets.tardis.dev (datasets.tardis.dev)|104.18.6.96|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3090479 (2.9M) [text/csv]
Saving to: ‘BTCUSDT_trades.csv.gz’
BTCUSDT_trades.csv. 100%[===================>] 2.95M 5.66MB/s in 0.5s
2024-08-09 09:42:52 (5.66 MB/s) - ‘BTCUSDT_trades.csv.gz’ saved [3090479/3090479]
--2024-08-09 09:42:52-- https://datasets.tardis.dev/v1/binance-futures/incremental_book_L2/2020/02/01/BTCUSDT.csv.gz
Resolving datasets.tardis.dev (datasets.tardis.dev)... 104.18.7.96, 104.18.6.96, 2606:4700::6812:760, ...
Connecting to datasets.tardis.dev (datasets.tardis.dev)|104.18.7.96|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 250016849 (238M) [text/csv]
Saving to: ‘BTCUSDT_book.csv.gz’
BTCUSDT_book.csv.gz 100%[===================>] 238.43M 9.93MB/s in 23s
2024-08-09 09:43:16 (10.3 MB/s) - ‘BTCUSDT_book.csv.gz’ saved [250016849/250016849]
It is recommended to input trade files before depth files: if a depth event occurs as a result of a trade event, having the trade event come first can provide a more realistic fill during backtesting. When both events have the same timestamp, the sorting process prioritizes events from the first input file.
[11]:
from hftbacktest.data.utils import tardis
data = tardis.convert(
    ['BTCUSDT_trades.csv.gz', 'BTCUSDT_book.csv.gz']
)
Reading BTCUSDT_trades.csv.gz
Reading BTCUSDT_book.csv.gz
Correcting the latency
Correcting the event order
[12]:
pl.DataFrame(data)
[12]:
ev | exch_ts | local_ts | px | qty | order_id | ival | fval |
---|---|---|---|---|---|---|---|
u64 | i64 | i64 | f64 | f64 | u64 | i64 | f64 |
3758096386 | 1580515202342000000 | 1580515202497052000 | 9364.51 | 1.197 | 0 | 0 | 0.0 |
3758096386 | 1580515202342000000 | 1580515202497346000 | 9365.67 | 0.02 | 0 | 0 | 0.0 |
3758096386 | 1580515202342000000 | 1580515202497352000 | 9365.86 | 0.01 | 0 | 0 | 0.0 |
3758096386 | 1580515202342000000 | 1580515202497357000 | 9366.36 | 0.002 | 0 | 0 | 0.0 |
3758096386 | 1580515202342000000 | 1580515202497363000 | 9366.36 | 0.003 | 0 | 0 | 0.0 |
… | … | … | … | … | … | … | … |
3489660929 | 1580601599812000000 | 1580601599944404000 | 9397.79 | 0.0 | 0 | 0 | 0.0 |
3758096385 | 1580601599826000000 | 1580601599952176000 | 9354.8 | 4.07 | 0 | 0 | 0.0 |
3758096385 | 1580601599836000000 | 1580601599962961000 | 9351.47 | 3.914 | 0 | 0 | 0.0 |
3489660929 | 1580601599836000000 | 1580601599963461000 | 9397.78 | 0.1 | 0 | 0 | 0.0 |
3758096385 | 1580601599848000000 | 1580601599973647000 | 9348.14 | 3.98 | 0 | 0 | 0.0 |
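Given the note above about exchange timestamps, it is worth inspecting the feed latency (local_ts minus exch_ts) before backtesting; for example:

import numpy as np

# Feed latency in milliseconds; very large or negative values hint at timestamp issues.
latency_ms = (data['local_ts'] - data['exch_ts']) / 1_000_000
print('min:', latency_ms.min(), 'median:', np.median(latency_ms), 'max:', latency_ms.max())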
You can save the data directly to a file by providing output_filename. If there are too many rows, you need to increase buffer_size.
[13]:
_ = tardis.convert(
    ['BTCUSDT_trades.csv.gz', 'BTCUSDT_book.csv.gz'],
    output_filename='btcusdt_20200201.npz',
    buffer_size=200_000_000
)
Reading BTCUSDT_trades.csv.gz
Reading BTCUSDT_book.csv.gz
Correcting the latency
Correcting the event order
Saving to btcusdt_20200201.npz
Tardis.dev artificially inserts the SOD (start-of-day) snapshot at the start of each daily file. If you backtest multiple days continuously, you don't need the snapshot at every day's start, and processing it adds backtesting time. You can choose whether to include Tardis.dev's SOD snapshot in the converted file using the corresponding convert option.
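Alternatively, you can filter the snapshot rows out of already-converted data. This sketch assumes the DEPTH_SNAPSHOT_EVENT constant exported by hftbacktest and that the event kind occupies the low bits of ev, as the ev values shown above suggest:

from hftbacktest import DEPTH_SNAPSHOT_EVENT

# Drop the artificially inserted SOD snapshot rows (event kind assumed to be
# stored in the low bits of 'ev').
wo_sod = data[(data['ev'] & 0xFF) != DEPTH_SNAPSHOT_EVENT]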