# Pastebin JjfwEylh

So, a summary of merkle grinding.

The header format (https://en.bitcoin.it/wiki/Block_hashing_algorithm) is:

    version (4 bytes) || prevBlock (32 bytes) || merkleRoot (32 bytes) || time (4 bytes) || bits (4 bytes) || nonce (4 bytes) = 80 bytes

sha256 works on 64-byte chunks, so the header is processed in two chunks. sha256 pads the data to a multiple of 64 bytes by appending a 0x80 byte, then zero bytes, then the 64-bit message length in bits (big-endian, so hiCount comes before loCount).

There is an inner hash and an outer hash. Inner first. The data actually hashed is:

    inner hashed data = version (4 bytes) || prevBlock (32 bytes) || merkleRoot (32 bytes) || time (4 bytes) || bits (4 bytes) || nonce (4 bytes) || 0x80 || <39 bytes of 0> || hiCount (4 bytes) || loCount (4-byte value 640 = 80 bytes * 8)

hiCount is always 0. IV is the magic constants (the sha256 initial state).

    stateA = transform call A( IV, version || prevBlock[0-31] || merkleRoot[0-27] )
    inner digest = transform call B( stateA, merkleRoot[28-31] || time || bits || nonce || 0x80 || <39 bytes of 0> || hiCount || loCount )

    outer hashed data = inner digest || 0x80 || <23 bytes of 0> || hiCount || loCount (4-byte value 256 = 32 bytes * 8)
    outer = transform call C( IV, inner digest || 0x80 || <23 bytes of 0> || hiCount || loCount )

If the target bits of outer are all 0, a proof of work has been found.

stateA is precomputed, and transform call A is only redone when extraNonce changes, which changes merkleRoot. So most of the work is repeating call B while changing nonce (and maybe some low-order bits of time) and then doing transform call C.

Now transform itself is in two parts:

    W array = transform_part1( data )
    state = transform_part2( state, W )

part1 does 13 operations (a mix of rightrotate, rightshift, xor and 32-bit unsigned add) 48 times. Importantly, transform_part1 does not depend on state, and so doesn't depend on the first block. part2 does 23 operations (a mix of rightrotate, xor, and, and 32-bit unsigned add) 64 times. It costs more than part1.
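To make the chunk layout above concrete, here is a minimal Python sketch that builds the three 64-byte blocks fed to transform calls A, B and C. The header field values are made up purely for illustration, and `sha256_pad` is a hypothetical helper name, not part of any library. The digests come from `hashlib`, which does not expose the individual transform calls, so this only shows what data each call would see.

```python
import hashlib
import struct

def sha256_pad(msg_len: int) -> bytes:
    """Padding that sha256 appends to a message of msg_len bytes."""
    zeros = (55 - msg_len) % 64              # zero bytes after the 0x80 marker
    # 64-bit big-endian length in bits: hiCount word first, then loCount.
    return b"\x80" + b"\x00" * zeros + struct.pack(">Q", msg_len * 8)

# Hypothetical header fields (assumed values, not a real block).
version     = struct.pack("<I", 2)
prev_block  = bytes(32)
merkle_root = bytes(range(32))
time_       = struct.pack("<I", 1_400_000_000)
bits        = struct.pack("<I", 0x1D00FFFF)
nonce       = struct.pack("<I", 12345)

header = version + prev_block + merkle_root + time_ + bits + nonce
padded = header + sha256_pad(len(header))

block_a = padded[:64]    # input to transform call A
block_b = padded[64:]    # input to transform call B

inner_digest = hashlib.sha256(header).digest()
block_c = inner_digest + sha256_pad(32)      # input to transform call C
outer_digest = hashlib.sha256(inner_digest).digest()
```

Only the first 16 bytes of `block_b` vary while mining (the merkleRoot tail, time, bits and nonce); the remaining 48 bytes are fixed padding.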
Now, if we precompute multiple merkleRoots that have the same last 4 bytes, then transform_part1 in transform call B can be reused like this.

Expensive precompute (e.g. on an FPGA):

    (mrA, mrB, mrC, mrD) = precompute_merkle_collision()

such that mrA[28-31] == mrB[28-31] == mrC[28-31] == mrD[28-31].

Cheap precompute:

    stateA1 = transform call A( IV, version || prevBlock || mrA[0-27] )
    stateB1 = transform call A( IV, version || prevBlock || mrB[0-27] )
    stateC1 = transform call A( IV, version || prevBlock || mrC[0-27] )
    stateD1 = transform call A( IV, version || prevBlock || mrD[0-27] )

Then repeat in a loop, changing the 4-byte nonce, and maybe some low bits of time:

    inner W = transform_part1( mrA[28-31] || time || bits || nonce || 0x80 || <39 bytes of 0> || hiCount || loCount )

    inner digest A1 = transform_part2( stateA1, inner W )
    inner digest B1 = transform_part2( stateB1, inner W )
    inner digest C1 = transform_part2( stateC1, inner W )
    inner digest D1 = transform_part2( stateD1, inner W )

    outerA = transform call C( IV, inner digest A1 || 0x80 || <23 bytes of 0> || hiCount || loCount )
    outerB = transform call C( IV, inner digest B1 || 0x80 || <23 bytes of 0> || hiCount || loCount )
    outerC = transform call C( IV, inner digest C1 || 0x80 || <23 bytes of 0> || hiCount || loCount )
    outerD = transform call C( IV, inner digest D1 || 0x80 || <23 bytes of 0> || hiCount || loCount )
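The sharing of inner W can be sketched in plain Python as follows. This is an illustrative sketch under stated assumptions, not a miner: the four merkle roots are faked by forcing a common 4-byte tail (a real `precompute_merkle_collision()` would have to grind for them), `grind_shared`, `pad` and `pack` are hypothetical helper names, and the header values are made up. The sha256 constants are derived from prime roots rather than typed in, and the result is checked against `hashlib`.

```python
import hashlib
import math
import struct

MASK = 0xFFFFFFFF

def _primes(n):
    found, c = [], 2
    while len(found) < n:
        if all(c % p for p in found):
            found.append(c)
        c += 1
    return found

def _icbrt(n):
    """Exact integer cube root by binary search."""
    lo, hi = 0, 1 << (n.bit_length() // 3 + 2)
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if mid ** 3 <= n:
            lo = mid
        else:
            hi = mid - 1
    return lo

# sha256 constants: first 32 fractional bits of square/cube roots of primes.
IV = tuple(math.isqrt(p << 64) & MASK for p in _primes(8))
K = [_icbrt(p << 96) & MASK for p in _primes(64)]

def rotr(x, n):
    return ((x >> n) | (x << (32 - n))) & MASK

def transform_part1(block):
    """Message schedule: 64-byte block -> 64 words. Does not touch the state."""
    w = list(struct.unpack(">16I", block))
    for i in range(16, 64):
        s0 = rotr(w[i - 15], 7) ^ rotr(w[i - 15], 18) ^ (w[i - 15] >> 3)
        s1 = rotr(w[i - 2], 17) ^ rotr(w[i - 2], 19) ^ (w[i - 2] >> 10)
        w.append((w[i - 16] + s0 + w[i - 7] + s1) & MASK)
    return w

def transform_part2(state, w):
    """The 64 compression rounds, followed by the feed-forward add."""
    a, b, c, d, e, f, g, h = state
    for i in range(64):
        t1 = (h + (rotr(e, 6) ^ rotr(e, 11) ^ rotr(e, 25))
              + ((e & f) ^ (~e & g)) + K[i] + w[i]) & MASK
        t2 = ((rotr(a, 2) ^ rotr(a, 13) ^ rotr(a, 22))
              + ((a & b) ^ (a & c) ^ (b & c))) & MASK
        a, b, c, d, e, f, g, h = (t1 + t2) & MASK, a, b, c, (d + t1) & MASK, e, f, g
    return tuple((x + y) & MASK for x, y in zip(state, (a, b, c, d, e, f, g, h)))

def pad(msg_len):
    return b"\x80" + bytes((55 - msg_len) % 64) + struct.pack(">Q", msg_len * 8)

def pack(state):
    return struct.pack(">8I", *state)

def grind_shared(version, prev_block, roots, time_, bits, nonce):
    """Outer digest for every root, computing the call B schedule only once."""
    # Call A: one midstate per root. A real loop would hoist this out, since
    # it only changes when the roots change.
    mids = [transform_part2(IV, transform_part1(version + prev_block + r[:28]))
            for r in roots]
    # Call B, part1: identical for all roots because r[28:] is shared.
    w_inner = transform_part1(roots[0][28:] + time_ + bits + nonce + pad(80))
    outers = []
    for mid in mids:
        inner = pack(transform_part2(mid, w_inner))             # call B, part2 only
        outers.append(pack(transform_part2(IV, transform_part1(inner + pad(32)))))  # call C
    return outers

# Assumed demo values; the shared tail is forced, not ground for.
tail = bytes.fromhex("deadbeef")
roots = [hashlib.sha256(bytes([i])).digest()[:28] + tail for i in range(4)]
version, prev_block = struct.pack("<I", 2), bytes(32)
time_, bits = struct.pack("<I", 1_400_000_000), struct.pack("<I", 0x1D00FFFF)
nonce = struct.pack("<I", 7)

outers = grind_shared(version, prev_block, roots, time_, bits, nonce)
for root, outer in zip(roots, outers):
    header = version + prev_block + root + time_ + bits + nonce
    assert outer == hashlib.sha256(hashlib.sha256(header).digest()).digest()
```

Per nonce, the four candidates cost one transform_part1 plus four transform_part2 calls for call B, instead of four of each. Call C still needs its own part1 and part2 per candidate, because each inner digest is different.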