Data Preparation

To fully utilize the power of HftBacktest, you need to feed it tick-by-tick full order book and trade data. Unfortunately, unlike the daily bar data provided by platforms such as Yahoo Finance, free tick-by-tick full order book and trade feed data for HFT is not available. However, in the case of cryptocurrency, you can collect the full raw feed yourself.

Getting started from Binance Futures’ raw feed data

You can collect Binance Futures feed yourself using Data Collector.

[1]:
import gzip

with gzip.open('usdm/btcusdt_20240808.gz', 'r') as f:
    for i in range(5):
        line = f.readline()
        print(line)
b'1723161255030314667 {"stream":"btcusdt@depth@0ms","data":{"e":"depthUpdate","E":1723161256299,"T":1723161256298,"s":"BTCUSDT","U":5123107832006,"u":5123107837557,"pu":5123107831937,"b":[["58710.20","0.014"],["61496.50","0.010"],["61510.90","0.000"],["61641.50","1.211"],["61652.80","0.195"],["61653.30","0.072"],["61653.70","0.067"],["61657.90","0.067"],["61668.50","0.086"],["61670.60","0.161"],["61672.50","0.821"],["61673.60","0.048"],["61675.60","0.050"],["61684.50","0.765"],["61686.20","0.008"],["61701.80","0.331"],["61703.10","0.238"],["61715.90","0.308"],["61721.60","0.235"],["61724.10","0.002"],["61737.00","0.015"],["61739.00","0.000"],["61740.10","0.008"],["61740.50","12.111"],["61756.90","0.550"],["61758.70","0.003"],["61763.20","0.014"],["61764.10","0.168"],["61764.30","0.000"],["61765.50","0.000"],["61767.40","0.004"],["61768.20","0.120"],["61768.60","0.020"],["61768.90","0.099"],["61770.80","0.049"],["61771.10","0.612"],["61771.70","0.010"],["61773.50","0.035"],["61773.80","0.025"],["61774.00","0.112"],["61775.60","0.010"],["61776.00","0.084"],["61778.30","0.000"],["61778.60","0.408"],["61779.30","0.020"],["61779.60","0.220"],["61783.80","0.002"],["61784.90","0.102"],["61785.00","0.000"],["61788.10","0.140"],["61789.50","0.000"],["61798.70","0.153"],["61800.20","2.507"]],"a":[["61800.30","3.330"],["61804.60","0.057"],["61810.00","0.285"],["61812.00","0.732"],["61814.90","0.000"],["61817.20","0.000"],["61818.70","0.040"],["61824.00","0.860"],["61829.10","0.185"],["61831.30","0.008"],["61831.40","0.501"],["61839.00","0.002"],["61840.00","0.192"],["61856.30","0.003"],["61857.10","0.027"],["61857.40","0.000"],["61858.80","0.005"],["61858.90","0.032"],["61859.60","0.034"],["61874.80","0.006"],["61893.40","0.335"],["61911.90","0.014"],["61925.90","0.000"],["61930.50","0.015"],["61945.10","0.000"],["61953.70","0.000"],["62144.00","0.006"],["63113.70","0.000"],["65880.70","15.918"]]}}\n'
b'1723161255088169167 {"stream":"btcusdt@bookTicker","data":{"e":"bookTicker","u":5123107839020,"s":"BTCUSDT","b":"61800.20","B":"2.507","a":"61800.30","A":"2.510","T":1723161256313,"E":1723161256313}}\n'
b'1723161255088176367 {"stream":"btcusdt@trade","data":{"e":"trade","E":1723161256322,"T":1723161256322,"s":"BTCUSDT","t":5266583935,"p":"61800.30","q":"0.006","X":"MARKET","m":false}}\n'
b'1723161255088181667 {"stream":"btcusdt@bookTicker","data":{"e":"bookTicker","u":5123107840008,"s":"BTCUSDT","b":"61800.20","B":"2.507","a":"61800.30","A":"2.504","T":1723161256322,"E":1723161256322}}\n'
b'1723161255088182467 {"stream":"btcusdt@bookTicker","data":{"e":"bookTicker","u":5123107840016,"s":"BTCUSDT","b":"61800.20","B":"2.507","a":"61800.30","A":"2.522","T":1723161256322,"E":1723161256322}}\n'

The first token of each line is the timestamp at which the message was received locally.

Note: The timestamp is in nanoseconds.
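
For example, each line can be split into the receive timestamp and the JSON payload. Here is a minimal sketch, assuming the single-space separator shown in the output above:

import gzip
import json

with gzip.open('usdm/btcusdt_20240808.gz', 'rt') as f:
    line = f.readline()
    # The local receive timestamp (in nanoseconds) and the raw JSON
    # payload are separated by a single space.
    local_ts, payload = line.split(' ', 1)
    message = json.loads(payload)
    print(int(local_ts), message['stream'], message['data']['e'])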

The data needs to be converted into normalized data that can be fed into HftBacktest.
The convert method also attempts to correct timestamps by reordering the rows.
[2]:
import numpy as np

from hftbacktest.data.utils import binancefutures

data = binancefutures.convert(
    'usdm/btcusdt_20240808.gz',
    combined_stream=True
)
Correcting the latency
local_timestamp is ahead of exch_timestamp by 1272156851
Correcting the event order

The normalized data is shown below. You can find more details in Data.

[3]:
import polars as pl

pl.DataFrame(data)
[3]:
shape: (491_973, 8)
ev          exch_ts              local_ts             px       qty    order_id  ival  fval
u64         i64                  i64                  f64      f64    u64       i64   f64
3758096385  1723161256298000000  1723161256302471518  58710.2  0.014  0         0     0.0
3758096385  1723161256298000000  1723161256302471518  61496.5  0.01   0         0     0.0
3758096385  1723161256298000000  1723161256302471518  61510.9  0.0    0         0     0.0
3758096385  1723161256298000000  1723161256302471518  61641.5  1.211  0         0     0.0
3758096385  1723161256298000000  1723161256302471518  61652.8  0.195  0         0     0.0
…
3489660929  1723161600030000000  1723161600043617932  62292.9  0.0    0         0     0.0
3758096385  1723161600319000000  1723161600370793433  5000.0   2.321  0         0     0.0
3489660929  1723161600709000000  1723161600760777134  61659.8  0.981  0         0     0.0
3758096385  1723161601054000000  1723161601105649435  61631.7  0.283  0         0     0.0
3758096385  1723161601054000000  1723161601105649435  61632.6  0.0    0         0     0.0
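
The ev column packs the event kind together with side and venue flags. As a rough sketch of how to decode it, assuming the flag constants exported by the hftbacktest package:

from hftbacktest import BUY_EVENT, SELL_EVENT, DEPTH_EVENT, TRADE_EVENT

# Inspect the first row's event flags. Converting to a Python int avoids
# unsigned/signed mixing issues with NumPy's bitwise operators.
ev = int(data['ev'][0])
side = 'bid' if ev & BUY_EVENT else 'ask' if ev & SELL_EVENT else 'n/a'
# Assumption: the low bits hold the event kind, consistent with the
# values shown above (1 = depth update, 2 = trade).
kind = {DEPTH_EVENT: 'depth', TRADE_EVENT: 'trade'}.get(ev & 0xFF, 'other')
print(side, kind)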

You can save the data directly to a file by providing output_filename.

[4]:
_ = binancefutures.convert(
    'usdm/btcusdt_20240808.gz',
    output_filename='usdm/btcusdt_20240808.npz',
    combined_stream=True
)
Correcting the latency
local_timestamp is ahead of exch_timestamp by 1272156851
Correcting the event order
Saving to usdm/btcusdt_20240808.npz

Creating a market depth snapshot

As the Binance Futures exchange runs 24/7, you need an initial snapshot to obtain the (almost) complete market depth.
Data Collector fetches a snapshot only when it establishes the connection, so you need to build the initial snapshot from the start of the collected feed data.
[5]:
from hftbacktest.data.utils.snapshot import create_last_snapshot

# Builds 20240808 End of Day snapshot. It will be used for the initial snapshot for 20240809.
data = create_last_snapshot(
    ['usdm/btcusdt_20240808.npz'],
    tick_size=0.1,
    lot_size=0.001
)

Bid levels are shown before ask levels in the snapshot, and levels are sorted from the best price to the farthest price.

[6]:
pl.DataFrame(data)
[6]:
shape: (9_597, 8)
ev          exch_ts  local_ts  px        qty     order_id  ival  fval
u64         i64      i64       f64       f64     u64       i64   f64
3758096388  0        0         61659.7   1.486   0         0     0.0
3758096388  0        0         61659.0   0.002   0         0     0.0
3758096388  0        0         61658.1   0.033   0         0     0.0
3758096388  0        0         61658.0   6.718   0         0     0.0
3758096388  0        0         61657.9   0.007   0         0     0.0
…
3489660932  0        0         77354.3   0.015   0         0     0.0
3489660932  0        0         77905.9   0.003   0         0     0.0
3489660932  0        0         80000.0   10.708  0         0     0.0
3489660932  0        0         104765.0  0.034   0         0     0.0
3489660932  0        0         617050.0  0.003   0         0     0.0
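
Because the ev column carries a side flag, you can split the snapshot into bid and ask sides and read the best level off the top of each. A minimal sketch, again assuming the flag constants exported by hftbacktest:

import numpy as np

from hftbacktest import BUY_EVENT, SELL_EVENT

# Select each side via the side flag bits of the ev column.
bids = data[(data['ev'] & np.uint64(BUY_EVENT)) != 0]
asks = data[(data['ev'] & np.uint64(SELL_EVENT)) != 0]

# Each side is sorted from the best price outward, so the first row
# of each side is the best level.
print('best bid:', bids['px'][0], 'best ask:', asks['px'][0])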
[7]:
from hftbacktest.data.utils.snapshot import create_last_snapshot

# Builds 20240808 End of Day snapshot. It will be used for the initial snapshot for 20240809.
_ = create_last_snapshot(
    ['usdm/btcusdt_20240808.npz'],
    tick_size=0.1,
    lot_size=0.001,
    output_snapshot_filename='usdm/btcusdt_20240808_eod.npz'
)
[8]:
# Converts 20240809 data.
_ = binancefutures.convert(
    'usdm/btcusdt_20240809.gz',
    output_filename='usdm/btcusdt_20240809.npz',
    combined_stream=True
)

# Builds 20240809's last snapshot.
# Due to the file size limitation of GitHub, btcusdt_20240809.npz does not contain data for the entire day.
_ = create_last_snapshot(
    ['usdm/btcusdt_20240809.npz'],
    tick_size=0.1,
    lot_size=0.001,
    output_snapshot_filename='usdm/btcusdt_20240809_last.npz',
    initial_snapshot='usdm/btcusdt_20240808_eod.npz',
)
Correcting the latency
local_timestamp is ahead of exch_timestamp by 1273873720
Correcting the event order
Saving to usdm/btcusdt_20240809.npz
[9]:
# Builds 20240809's last snapshot without the initial snapshot.
_ = create_last_snapshot(
    ['usdm/btcusdt_20240809.npz'],
    tick_size=0.1,
    lot_size=0.001,
    output_snapshot_filename='usdm/btcusdt_20240809_last_wo_ss.npz'
)

# Builds 20240809's last snapshot from 20240808 onward, without the initial snapshot.
_ = create_last_snapshot(
    [
        'usdm/btcusdt_20240808.npz',
        'usdm/btcusdt_20240809.npz'
    ],
    tick_size=0.1,
    lot_size=0.001,
    output_snapshot_filename='usdm/btcusdt_20240809_last.npz'
)
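
Both the converted data and the snapshots are plain NumPy .npz files, so you can load and inspect them directly. A minimal sketch, assuming the structured array is stored under the data key as in the converters above:

import numpy as np

# Load the saved last snapshot; the structured array is assumed to be
# stored under the 'data' key.
snapshot = np.load('usdm/btcusdt_20240809_last.npz')['data']
print(snapshot.dtype.names)
print(snapshot[:3])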

Getting started from Tardis.dev data

A few vendors offer tick-by-tick full market depth data along with snapshot and trade data, and Tardis.dev is among them.

Note: Some data may have an issue with the exchange timestamp. Ideally, the exchange timestamp should reflect the moment the event occurs at the matching engine. However, some data uses the server's send timestamp instead of the matching engine timestamp.

[10]:
# https://docs.tardis.dev/historical-data-details/binance-futures

# Downloads sample Binance futures BTCUSDT trades
!wget https://datasets.tardis.dev/v1/binance-futures/trades/2020/02/01/BTCUSDT.csv.gz -O BTCUSDT_trades.csv.gz

# Downloads sample Binance futures BTCUSDT book
!wget https://datasets.tardis.dev/v1/binance-futures/incremental_book_L2/2020/02/01/BTCUSDT.csv.gz -O BTCUSDT_book.csv.gz
--2024-08-09 09:42:51--  https://datasets.tardis.dev/v1/binance-futures/trades/2020/02/01/BTCUSDT.csv.gz
Resolving datasets.tardis.dev (datasets.tardis.dev)... 104.18.6.96, 104.18.7.96, 2606:4700::6812:760, ...
Connecting to datasets.tardis.dev (datasets.tardis.dev)|104.18.6.96|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3090479 (2.9M) [text/csv]
Saving to: ‘BTCUSDT_trades.csv.gz’

BTCUSDT_trades.csv. 100%[===================>]   2.95M  5.66MB/s    in 0.5s

2024-08-09 09:42:52 (5.66 MB/s) - ‘BTCUSDT_trades.csv.gz’ saved [3090479/3090479]

--2024-08-09 09:42:52--  https://datasets.tardis.dev/v1/binance-futures/incremental_book_L2/2020/02/01/BTCUSDT.csv.gz
Resolving datasets.tardis.dev (datasets.tardis.dev)... 104.18.7.96, 104.18.6.96, 2606:4700::6812:760, ...
Connecting to datasets.tardis.dev (datasets.tardis.dev)|104.18.7.96|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 250016849 (238M) [text/csv]
Saving to: ‘BTCUSDT_book.csv.gz’

BTCUSDT_book.csv.gz 100%[===================>] 238.43M  9.93MB/s    in 23s

2024-08-09 09:43:16 (10.3 MB/s) - ‘BTCUSDT_book.csv.gz’ saved [250016849/250016849]

It is recommended to input trade files before depth files: if a depth event is caused by a trade event, having the trade event precede the depth event can produce a more realistic fill during backtesting. The sorting process prioritizes events from the first input file when events share the same timestamp.

[11]:
from hftbacktest.data.utils import tardis

data = tardis.convert(
    ['BTCUSDT_trades.csv.gz', 'BTCUSDT_book.csv.gz']
)
Reading BTCUSDT_trades.csv.gz
Reading BTCUSDT_book.csv.gz
Correcting the latency
Correcting the event order
[12]:
pl.DataFrame(data)
[12]:
shape: (27_532_602, 8)
ev          exch_ts              local_ts             px       qty    order_id  ival  fval
u64         i64                  i64                  f64      f64    u64       i64   f64
3758096386  1580515202342000000  1580515202497052000  9364.51  1.197  0         0     0.0
3758096386  1580515202342000000  1580515202497346000  9365.67  0.02   0         0     0.0
3758096386  1580515202342000000  1580515202497352000  9365.86  0.01   0         0     0.0
3758096386  1580515202342000000  1580515202497357000  9366.36  0.002  0         0     0.0
3758096386  1580515202342000000  1580515202497363000  9366.36  0.003  0         0     0.0
…
3489660929  1580601599812000000  1580601599944404000  9397.79  0.0    0         0     0.0
3758096385  1580601599826000000  1580601599952176000  9354.8   4.07   0         0     0.0
3758096385  1580601599836000000  1580601599962961000  9351.47  3.914  0         0     0.0
3489660929  1580601599836000000  1580601599963461000  9397.78  0.1    0         0     0.0
3758096385  1580601599848000000  1580601599973647000  9348.14  3.98   0         0     0.0

You can save the data directly to a file by providing output_filename. If there are too many rows, you need to increase buffer_size.

[13]:
_ = tardis.convert(
    ['BTCUSDT_trades.csv.gz', 'BTCUSDT_book.csv.gz'],
    output_filename='btcusdt_20200201.npz',
    buffer_size=200_000_000
)
Reading BTCUSDT_trades.csv.gz
Reading BTCUSDT_book.csv.gz
Correcting the latency
Correcting the event order
Saving to btcusdt_20200201.npz

Tardis.dev artificially inserts the SOD (start-of-day) snapshot at the start of each daily file. If you backtest multiple days continuously, you don't need the snapshot at the start of every day, and processing it may increase the backtesting time. You can choose whether to include Tardis.dev's SOD snapshot in the converted file using the corresponding option.