About

This BigQuery cookbook takes some interesting public datasets and presents queries you can try out using Google Colab. I also include Python code snippets to plot results of the queries for some examples. This is a work in progress and I will add queries from time to time.

Setup

Go to https://colab.research.google.com/notebooks/bigquery.ipynb to setup Colab and your access to BigQuery.

Bitcoin Transactions

The table we will query is bigquery-public-data.crypto_bitcoin.transactions and is partitioned by block_timestamp_month field. The schema of this dataset is reproduced below.

Field name Type Mode Description

hash

STRING

REQUIRED

The hash of this transaction

size

INTEGER

NULLABLE

The size of this transaction in bytes

virtual_size

INTEGER

NULLABLE

The virtual transaction size (differs from size for witness transactions)

version

INTEGER

NULLABLE

Protocol version specified in block which contained this transaction

lock_time

INTEGER

NULLABLE

Earliest time that miners can include the transaction in their hashing of the Merkle root to attach it in the latest block of the blockchain

block_hash

STRING

REQUIRED

Hash of the block which contains this transaction

block_number

INTEGER

REQUIRED

Number of the block which contains this transaction

block_timestamp

TIMESTAMP

REQUIRED

Timestamp of the block which contains this transaction

block_timestamp_month

DATE

REQUIRED

Month of the block which contains this transaction

input_count

INTEGER

NULLABLE

The number of inputs in the transaction

output_count

INTEGER

NULLABLE

The number of outputs in the transaction

input_value

NUMERIC

NULLABLE

Total value of inputs in the transaction

output_value

NUMERIC

NULLABLE

Total value of outputs in the transaction

is_coinbase

BOOLEAN

NULLABLE

true if this transaction is a coinbase transaction

fee

NUMERIC

NULLABLE

The fee paid by this transaction

inputs

RECORD

REPEATED

Transaction inputs

inputs. index

INTEGER

REQUIRED

0-indexed number of an input within a transaction

inputs. spent_transaction_hash

STRING

NULLABLE

The hash of the transaction which contains the output that this input spends

inputs. spent_output_index

INTEGER

NULLABLE

The index of the output this input spends

inputs. script_asm

STRING

NULLABLE

Symbolic representation of the bitcoin’s script language op-codes

inputs. script_hex

STRING

NULLABLE

Hexadecimal representation of the bitcoin’s script language op-codes

inputs. sequence

INTEGER

NULLABLE

A number intended to allow unconfirmed time-locked transactions to be updated before being finalized; not currently used except to disable locktime in a transaction

inputs. required_signatures

INTEGER

NULLABLE

The number of signatures required to authorize the spent output

inputs. type

STRING

NULLABLE

The address type of the spent output

inputs. addresses

STRING

REPEATED

Addresses which own the spent output

inputs. value

NUMERIC

NULLABLE

The value in base currency attached to the spent output

outputs

RECORD

REPEATED

Transaction outputs

outputs. index

INTEGER

REQUIRED

0-indexed number of an output within a transaction used by a later transaction to refer to that specific output

outputs. script_asm

STRING

NULLABLE

Symbolic representation of the bitcoin’s script language op-codes

outputs. script_hex

STRING

NULLABLE

Hexadecimal representation of the bitcoin’s script language op-codes

outputs. required_signatures

INTEGER

NULLABLE

The number of signatures required to authorize spending of this output

outputs. type

STRING

NULLABLE

The address type of the output

outputs. addresses

STRING

REPEATED

Addresses which own this output

outputs. value

NUMERIC

NULLABLE

The value in base currency attached to this output

The screenshot below shows the dataset bigquery-public-data.crypto_bitcoin.transactions as it appears in the public datasets catalog.

bitcoin transactions schema

Head over to this guide that goes into more details about Bitcoin transactions if you would like a refresher.


Selecting Bitcoin transactions and preparing a dataset for further analysis

Selecting a random collection of rows allows you to figure out general trends in your data without having to analyze the entire dataset. Below, we will create a table and populate it with a random selection of bitcoin transactions.

create table `mltest-202903.bitcoin_playground.sample_transactions`  as (1)
select
    *
from
    `bigquery-public-data.crypto_bitcoin.transactions` (2)
where
    block_timestamp_month = '2020-05-01' (3)
    and rand() <= 0.01 (4)
;
1 The name of the table created is composed of project name.dataset name.table name where the project name is mltest-202903, the dataset name is bitcoin_playground and the table name is sample_transactions.
2 The name of the public dataset table from which to read the Bitcoin transactions data.
3 Table is partitioned on the block_timestamp_month column and we are filtering for 2020-05-01.
4 rand() returns a random value between (0, 1]. We filter for random values ⇐ 0.01. This should samples 1% of rows from the table.

The table created is approximately 80 MB in size.

You can run the query in Google Colab (screenshot below)

bitcoin random table create

Another way to select a random subset is to extract blocks so you get all the transactions inside of the block. This is useful if you want to analyze mining fees and need to look at all the transactions inside the block.

create table `mltest-202903.bitcoin_playground.sample_blocks`  as
select
    *
from
    `bigquery-public-data.crypto_bitcoin.transactions`
where
    block_timestamp_month = '2020-05-01'
    and mod(block_number,100) = 1 (1)
;
1 MOD(a, b) returns the remainder of the division of a by b. In this case, we divide block_number by 100 and check if the remainder is 1. This should sample 1% of the blocks. You can choose to use any number between 0 and 99 in place of 1 in mod(block_number,100) = 1 and you should sample 1% of the blocks. The underlying assumption is that block numbers are incremented uniformly.

The table created is approximately 93 MB in size.

We will create a dataset of a few columns that will be utilized for further analyses into transactions growth and fees.

create table `mltest-202903.bitcoin_playground.all_transactions_until_dec_04_2020`  as
select
    block_number,
    block_timestamp,
    input_count,
    output_count,
    input_value,
    output_value,
    is_coinbase,
    fee
from
    `bigquery-public-data.crypto_bitcoin.transactions`
where
    block_timestamp <= '2020-12-05'
;

The dataset created is approximately 44 GB in size.

all trans until dec 2020

Schema of the table created above.

Field name Type Mode Description

block_number

INTEGER

REQUIRED

Number of the block which contains this transaction

block_timestamp

TIMESTAMP

REQUIRED

Timestamp of the block which contains this transaction

input_count

INTEGER

NULLABLE

The number of inputs in the transaction

output_count

INTEGER

NULLABLE

The number of outputs in the transaction

input_value

NUMERIC

NULLABLE

Total value of inputs in the transaction

output_value

NUMERIC

NULLABLE

Total value of outputs in the transaction

is_coinbase

BOOLEAN

NULLABLE

true if this transaction is a coinbase transaction

fee

NUMERIC

NULLABLE

The fee paid by this transaction

Run the query below if you would like to drop the table.

drop table if exists `mltest-202903.bitcoin_playground.all_transactions_until_dec_04_2020`;

Transactions per block

The number of transactions per block has been steadily increasing but is bound by the scalability problem. Lets analyze transactions per block and quantify the number of blocks by the transactions they contain.

with hash_ct_by_block_num as ( (1)
  select
    block_number,
    count(*) as per_block_transaction_hash_ct
  from `mltest-202903.bitcoin_playground.all_transactions_until_dec_04_2020`
  group by
    block_number
)
select
  floor(per_block_transaction_hash_ct/100) as bkt_100, (2)
  count(distinct block_number) as block_ct
from hash_ct_by_block_num (3)
group by
  bkt_100
order by
  bkt_100
;
1 CTE to aggregate count of transactions by block_number. Common Table Expression (CTE) are very useful when writing complex queries. Think of them as creating tables on fly that you can use in later sections of the same query. Pretty neat!
2 Bucket the number of transactions by each block in buckets of size 100. For e.g., if per_block_transaction_hash_ct is 50, floor(50/100) will be 0. The function floor rounds down to the nearest integer. The result of floor(per_block_transaction_hash_ct/100) if 0 identifies the bucket [1, 99], 1 identifies [100, 199], 2 identifies [200, 299] and so on.
3 Reference the CTE query name hash_ct_by_block_num which acts like a temporary table name.

The Python snippet below can be used to visualize the distribution of the number of blocks for each transaction count bucket.

# function to create string label for the buckets
def create_label(val, bucket_width=100):
    return "[" + str(int(val)*bucket_width) + ',' + str((int(val)*bucket_width)+(bucket_width-1)) + "]"

df_hash_per_block['bkt_100_label'] = df_hash_per_block['bkt_100'].map(create_label) (1)
1 df_hash_per_block is the pandas dataframe that contains the results of the preceding query. We also create an additional column,bkt_100_label to use as labels for better readability in the plot.

Lets plot the distribution of number of blocks against the number of transactions per block.

import plotly.express as px

fig = px.bar(df_hash_per_block,
  x='bkt_100_label',
  y='block_ct')

fig.show()

This is what the Colab cells look like for the above.

transactions per block across years

Seems like the vast majority of blocks have < 100 transactions.

Given the rise in popularity of Bitcoin and increase in hash power, the number of transactions per block has been steadily rising. We will now look at the average transactions per block over the years.

with hash_ct_by_block_num as (
  select
    extract(year from block_timestamp) as block_timestamp_year, (1)
    block_number,
    count(*) as per_block_transaction_hash_ct
  from `mltest-202903.bitcoin_playground.all_transactions_until_dec_04_2020`
  group by
    block_timestamp_year,
    block_number
)
select
  block_timestamp_year,
  avg(per_block_transaction_hash_ct) as avg_per_block_transaction_hash_ct (2)
from hash_ct_by_block_num
group by
  block_timestamp_year
order by
  block_timestamp_year
;
1 extract is a Timestamp function that can extract certain parts of a timestamp such as the year, month, day and so on.
2 avg is an Aggregate function.

Plot the results using the snippet below.

import plotly.express as px
fig = px.line(df_hash_per_block,
    x="block_timestamp_year",
    y="avg_per_block_transaction_hash_ct",
    title="Average transactions per block across the years")
fig.show()
average trans per block across years

The average transactions per block has been steadily increasing. Lets also check the median number of transactions per block across the years since average can be influenced by very large numbers.

median requires ordering the data and can be a very expensive operation when dealing with very large amounts of data. In BigQuery, the median is computed using Approximate Aggregate function, APPROX_QUANTILES.

with hash_ct_by_block_num as (
  select
    extract(year from block_timestamp) as block_timestamp_year, (1)
    block_number,
    count(*) as per_block_transaction_hash_ct
  from `mltest-202903.bitcoin_playground.all_transactions_until_dec_04_2020`
  group by
    block_timestamp_year,
    block_number
)
select
  block_timestamp_year,
  approx_quantiles(per_block_transaction_hash_ct, 100)[OFFSET (50)] as median_per_block_transaction_hash_ct (2)
from hash_ct_by_block_num
group by
  block_timestamp_year
order by
  block_timestamp_year
;
1 extract is a Timestamp function that can extract certain parts of a timestamp such as the year, month, day and so on.
2 approx_quantiles is an Approximate Aggregate function. The way this function works is that it returns an array of number + 1 elements where the first element is the approximate minimum and the last element is the approximate maximum. Using the array function offset you can extract the specific percentile value you desire by indexing the array returned by approx_quantiles.

Plot the results using the snippet below.

import plotly.express as px
fig = px.line(df_hash_per_block,
  x="block_timestamp_year",
  y="median_per_block_transaction_hash_ct",
  title="Median transactions per block across the years")
fig.show()

This is the result of the above query and Python snippet. The average and the median follow a similar trend. Eyeballing the values does not show any large deviations.

median transactions per block across years

Lets write a query to extract the average, median and other order statistics.

with hash_ct_by_block_num as (
  select
    extract(year from block_timestamp) as block_timestamp_year, (1)
    block_number,
    count(*) as per_block_transaction_hash_ct
  from `mltest-202903.bitcoin_playground.all_transactions_until_dec_04_2020`
  group by
    block_timestamp_year,
    block_number
)
select
  block_timestamp_year,
  avg(per_block_transaction_hash_ct) as avg_per_block_transaction_hash_ct,
-- Extract ordered statistics
  approx_quantiles(per_block_transaction_hash_ct, 100)[OFFSET (25)] as per_25_per_block_transaction_hash_ct,
  approx_quantiles(per_block_transaction_hash_ct, 100)[OFFSET (50)] as median_per_block_transaction_hash_ct,
  approx_quantiles(per_block_transaction_hash_ct, 100)[OFFSET (75)] as per_75_per_block_transaction_hash_ct,
  approx_quantiles(per_block_transaction_hash_ct, 100)[OFFSET (95)] as per_95_per_block_transaction_hash_ct,
  approx_quantiles(per_block_transaction_hash_ct, 100)[OFFSET (99)] as per_99_per_block_transaction_hash_ct
from hash_ct_by_block_num
group by
  block_timestamp_year
order by
  block_timestamp_year
;

Plot using the snippet below.

import plotly.graph_objects as go

fig = go.Figure()

fig.add_trace(go.Scatter(
  x=df_hash_per_block['block_timestamp_year'],
  y=df_hash_per_block['avg_per_block_transaction_hash_ct'],
  mode='lines',
  name='Avg')
  )
fig.add_trace(go.Scatter(
  x=df_hash_per_block['block_timestamp_year'],
  y=df_hash_per_block['per25_per_block_transaction_hash_ct'],
  mode='lines',
  name='25th_percentile')
  )
fig.add_trace(go.Scatter(
  x=df_hash_per_block['block_timestamp_year'],
  y=df_hash_per_block['median_per_block_transaction_hash_ct'],
  mode='lines',
  name='Median')
  )
fig.add_trace(go.Scatter(
  x=df_hash_per_block['block_timestamp_year'],
  y=df_hash_per_block['per75_per_block_transaction_hash_ct'],
  mode='lines',
  name='75th_percentile')
  )
fig.add_trace(go.Scatter(
  x=df_hash_per_block['block_timestamp_year'],
  y=df_hash_per_block['per95_per_block_transaction_hash_ct'],
  mode='lines',
  name='95th_percentile')
  )
fig.add_trace(go.Scatter(
  x=df_hash_per_block['block_timestamp_year'],
  y=df_hash_per_block['per99_per_block_transaction_hash_ct'],
  mode='lines',
  name='99th_percentile')
  )

fig.show()

The screenshot below show the Colab cells for the above.

ordered stats transactions per block

You can also loop through the y variables as shown in the snippet below to produce the plot.

import plotly.graph_objects as go

y_vars = ['avg_per_block_transaction_hash_ct',
'per25_per_block_transaction_hash_ct', 'median_per_block_transaction_hash_ct',
'per75_per_block_transaction_hash_ct', 'per95_per_block_transaction_hash_ct',
'per99_per_block_transaction_hash_ct']

fig = go.Figure()

for y_var in y_vars:
  fig.add_trace(go.Scatter(
    x=df_hash_per_block['block_timestamp_year'],
    y=df_hash_per_block[y_var],
    mode='lines',
    name=y_var.split('_')[0]
  ))

# Figure titles
fig.update_layout(
    title="Statistics related of transactions per block",
    xaxis_title="Year",
    yaxis_title="Transactions per block"
)

fig.show()
ordered stats trans per block just plot

Transactions Growth

The number of transactions over the years has been steadily increasing. Lets quantify the daily transactions count and plot it.

select
    date(block_timestamp) as block_timestamp_date,
    count(*) as ct_transactions
  from `mltest-202903.bitcoin_playground.all_transactions_until_dec_04_2020`
  group by
    block_timestamp_date
order by
  block_timestamp_date
;

Plot the daily transactions count.

import plotly.express as px

fig = px.line(df_hash_per_block,
    x="block_timestamp_date",
    y="ct_transactions",
    title="Daily Transactions")

fig.show()
transactions per day

Lets compute a moving average of the transactions count to smooth out some of the daily variations. The query below will show you how to compute a 7-day moving average.

with daily_transactions as (
    select
      date(block_timestamp) as block_timestamp_date,
      count(*) as ct_transactions
    from `mltest-202903.bitcoin_playground.all_transactions_until_dec_04_2020`
    group by
      block_timestamp_date
)
select
  block_timestamp_date,
  ct_transactions,
  AVG(ct_transactions) (1)
    OVER ( (2)
      ORDER BY block_timestamp_date (3)
      ROWS BETWEEN 6 PRECEDING AND CURRENT ROW (4)
    ) as ma_7d_transactions
from
  daily_transactions
order by
  block_timestamp_date
;
1 Example of an aggregate analytic function.
2 The over clause defines the window of rows over which the aggregate function, avg in this case will be computed.
3 The rows are ordered by the block_timestamp_date column.
4 We are looking to compute a 7-day moving average so we include the current row along with the 6 previous rows.
import plotly.graph_objects as go

fig = go.Figure()

fig.add_trace(go.Scatter(
  x=df_transactions_daily['block_timestamp_date'],
  y=df_transactions_daily['ct_transactions'],
  mode='lines',
  name='Daily Transactions Count')
  )
fig.add_trace(go.Scatter(
  x=df_transactions_daily['block_timestamp_date'],
  y=df_transactions_daily['ma_7d_transactions'],
  mode='lines',
  name='7-day Moving Average of Daily Transactions Count')
  )

fig.update_layout(
    title="Daily Transactions Count",
    xaxis_title="Date",
    yaxis_title="Transactions/day"
)

fig.show()
ma 7d daily trans count

We will now look at the BTC to USD price movement in relation to daily transactions. I’ve downloaded historical daily BTC/USD data on the Coinbase Gemini exchange from here.

I uploaded the CSV file into Google Sheets and created a table in BigQuery from this spreadhseet. See instructions on how this can be done here.

The query below combines the computation to quantify daily transactions count, 7-day moving average of daily transactions count and joins with open USD price of 1 BTC on Coinbase exchange.

with daily_transactions as (
    select
      date(block_timestamp) as block_timestamp_date,
      count(*) as ct_transactions
    from `mltest-202903.bitcoin_playground.all_transactions_until_dec_04_2020`
    group by
      block_timestamp_date
),
moving_avg_d as (
    select
      block_timestamp_date,
      ct_transactions,
      AVG(ct_transactions)
        OVER (
          ORDER BY block_timestamp_date
          ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
        ) as ma_7d_transactions
    from
      daily_transactions
    order by
      block_timestamp_date
),
daily_usd_per_btc as (
    select
      date(`date`) as date_val, (1)
      `open` as usd_open
    from `mltest-202903.bitcoin_playground.btc_to_usd`
)
select
  a.date_val,
  a.usd_open,
  b.ct_transactions,
  b.ma_7d_transactions
from daily_usd_per_btc a
join moving_avg_d b on (a.date_val = b.block_timestamp_date)
;
1 Using backticks since the column name is a reserved word. You can also change the original spreadhseet column name to avoid the confusion or the need for backticks.

Plot using the snippet below.

import plotly.graph_objects as go
from plotly.subplots import make_subplots

# Create figure with secondary y-axis
fig = make_subplots(specs=[[{"secondary_y": True}]])

fig.add_trace(go.Scatter(
  x=df_transactions_daily['date_val'],
  y=df_transactions_daily['ct_transactions'],
  mode='lines',
  name='Daily Transactions Count')
  )
fig.add_trace(go.Scatter(
  x=df_transactions_daily['date_val'],
  y=df_transactions_daily['ma_7d_transactions'],
  mode='lines',
  name='7-day Moving Average')
  )

fig.add_trace(go.Scatter(
  x=df_transactions_daily['date_val'],
  y=df_transactions_daily['usd_open'],
  mode='lines',
  name='USD Open Price of 1 BTC'
  ),
  secondary_y = True
  )


# Figure title
fig.update_layout(
    title_text="Daily Transactions and Open Price of 1 BTC in USD"
)

# Set x-axis title
fig.update_xaxes(title_text="Date")

# Set y-axes titles
fig.update_yaxes(title_text="Transactions/day", secondary_y=False)
fig.update_yaxes(title_text="Open Price in USD of 1 BTC", secondary_y=True)

# Horizontal legend layout and move below the plotting area
fig.update_layout(legend=dict(
    orientation="h",
    yanchor="top",
    y=-0.15,
    xanchor="left",
    x=0.3
))

fig.show()
daily trans open btc price

Growth in Daily Input Addresses

Analyzing growth in input addresses.

with base_data as (  (1)
    select
        date(block_timestamp) as block_timestamp_date,
        input_data.addresses as input_data_addresses
    from
        `bigquery-public-data.crypto_bitcoin.transactions`
    CROSS JOIN UNNEST(inputs) as input_data
    where
        block_timestamp_month >= '2015-01-01'
),
individual_input_address as ( (2)
    select
        block_timestamp_date,
        input_address
    from base_data
    CROSS JOIN UNNEST(input_data_addresses) as input_address
),
dist_input_addresses_per_day as ( (3)
    select
        block_timestamp_date,
        count(distinct input_address) as distinct_input_addresses
    from individual_input_address
    group by
        block_timestamp_date
    order by
        block_timestamp_date
),
moving_avg_d as ( (4)
    select
        block_timestamp_date,
        distinct_input_addresses,
        AVG(distinct_input_addresses)
        OVER (
            ORDER BY block_timestamp_date
            ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
        ) as ma_7d_input_addresses,
        AVG(distinct_input_addresses)
        OVER (
            ORDER BY block_timestamp_date
            ROWS BETWEEN 29 PRECEDING AND CURRENT ROW
      ) as ma_30d_input_addresses
    from
        dist_input_addresses_per_day
),
daily_usd_per_btc as ( (5)
    select
        date(`date`) as date_val,
        `open` as usd_open
    from `mltest-202903.bitcoin_playground.btc_to_usd`
)
select
    a.block_timestamp_date,
    a.distinct_input_addresses,
    a.ma_7d_input_addresses,
    a.ma_30d_input_addresses,
    b.usd_open
from moving_avg_d a
join daily_usd_per_btc b on (a.block_timestamp_date = b.date_val)
order by
   block_timestamp_date
;
1 Extract the addresses array from the inputs column.
2 Extract each individual input addresses.
3 Count the distinct number of input addresses by day.
4 Compute moving averages.
5 Select the daily USD open price of 1 BTC.

Plot the daily input addresses count with the USD open price of 1 BTC

import plotly.graph_objects as go
from plotly.subplots import make_subplots

fig = make_subplots(specs=[[{"secondary_y": True}]])
fig.add_trace(go.Scatter( x=df['block_timestamp_date'], y=df['distinct_input_addresses'], mode='lines',name='Daily'))
fig.add_trace(go.Scatter( x=df['block_timestamp_date'], y=df['ma_7d_input_addresses'], mode='lines',name='7d MA'))
fig.add_trace(go.Scatter( x=df['block_timestamp_date'], y=df['ma_30d_input_addresses'], mode='lines',name='30d MA'))
fig.add_trace(go.Scatter( x=df['block_timestamp_date'], y=df['usd_open'], mode='lines',name='USD Open Price of 1 BTC'), secondary_y = True)

# Figure title
fig.update_layout(
    title_text="Daily Distinct Input Addresses and Open Price of 1 BTC in USD"
)

# Set x-axis title
fig.update_xaxes(title_text="Date")

# Set y-axes titles
fig.update_yaxes(title_text="Distinct Input Addresses/day", secondary_y=False)
fig.update_yaxes(title_text="Open Price in USD of 1 BTC", secondary_y=True)

# Horizontal legend layout and move below the plotting area
fig.update_layout(legend=dict(
    orientation="h",
    yanchor="top",
    y=-0.15,
    xanchor="left",
    x=0.3
))

fig.show()
distinct input addres d

Here is the Colab notebook if you would like to reproduce the analysis.