Flaying the Blockchain Ledger for Fun, Profit, and Hip Hop
Andrew Morris
BSides Las Vegas, 2016
Acknowledgements• Colin Morrell• Chris Donaher• Dr Richard Seymour Esquire Jr M.D.• Bobby (guessing you probably don’t want your last name on this)• Andrew Askins• Eduard Iskandarov – Author of PyBlockchain• Sentdex on YouTube• Tom Gebhar• Curt Barnard• Satoshi Nakamoto (whoever the hell you are)• Authors of Bitcoin developer documentation• The two guys that got in a fight at Shmoocon two years ago• Shout out to Endgame for being an incredible company to work at
Outline• Intro• Pre Background• Bitcoin Primer• Blockchain Primer• Pulling apart the ledger• Observations• Future work• Conclusion
http://pastebin.com/fyvKK3wE
About Me
• Andrew Morris
• Twitter - @Andrew___Morris
• Work at Endgame R&D team
• Background is in offensive cyber stuff (private sector, gov stuff, etc)
• Been doing computer stuff for the majority of my life, somehow still pretty bad at it
• Dropped out of high school a couple years ago and don’t really feel like going back
• Enjoy computers, playing music, tweeting dumb jokes
Disclaimers
• Today I’m speaking on behalf of myself, not on behalf of my employer
• I’ll be discussing strictly technical observations
• My employer doesn’t do work involving the blockchain or bitcoin
• This is my own personal recreational research
• I’m not a bitcoin or blockchain expert
• Don’t expect to walk away from this talk with a solid understanding of blockchain. It’s really, really complicated
• Please correct me if you see any inaccuracies
August 26, 2015 - Guy buys Wu-Tang Clan Album for $2 Million
February 11, 2016 - Guy tries to buy Kanye West album T.L.O.P exclusively for $10 Million on Twitter
February 14, 2016 – Guy claims to have money stolen
Source: https://twitter.com/MartinShkreli/status/698748496153866240
Source: https://twitter.com/MartinShkreli/status/698749973769363456
Source: https://twitter.com/martinshkreli/status/698750951889506305
Is it possible to find this Bitcoin transaction?
• $15 million is a lot of money
• Bitcoin records all transactions on a distributed ledger
• The ledger is publicly available and not encrypted
• Let’s replicate the ledger, search through it, and see if we can find a transaction that matches ~$15 million around that time
• Given a date range, find transactions that fall in a USD value range
My Approach
• Replicate entire blockchain ledger
• Parse raw bitcoin transactions
• Get transactions into consumable format
  • “Payer | Payee | Amount (BTC) | Amount (USD) | Time”
• Shove data into a database of some kind
• Write interesting queries
• Review results
• ???
• Profit!
Pseudo Query
SELECT * FROM all_bitcoin_transactions_ever WHERE
Date < Feb 14, 2016 AND Date > Feb 10, 2016
AND USD(BTC_Amount) > $14,000,000 AND USD(BTC_Amount) < $16,000,000;
Bitcoin Primer
• Bitcoin is a cryptocurrency
• Peer-to-Peer
• No central authority (no trusted third party)
• Think of it as the “cash” of the Internet
• Uses cryptographic protocols to ensure integrity
• Distributed blockchain ledger to record transactions
• Uses proof of work to prevent double-spending
• Everything is auditable back to the first block (“genesis block”)
• Coins are “mined” with CPU and “traded” for <whatever>
• Uses a non-Turing-complete, Bitcoin-specific scripting language called Bitcoin Script to validate transactions
• Takes ~10 minutes on average for a transaction to get its first confirmation (one block); more confirmations take longer
• More here: https://bitcoin.org/bitcoin.pdf
Bitcoin Primer (cont’d)
• When you send someone a bitcoin, you are signing a “coin” value in a way that the amount can now only be spent by the intended recipient
• This transaction is broadcast to the entire world
• Only the recipient can spend that bitcoin
• All transactions are known by everyone
• Network self-regulates with “difficulty”
• Dollar value fluctuates (as with any currency)
Is Bitcoin anonymous?• Sort of• Everyone knows about every transaction• But you don’t always know who the wallet belongs to IRL • One person can use lots of wallets• Bitcoin can be “mixed” or “tumbled” to make it harder to trace• Wallets can be created offline without Internet access
Is Bitcoin secure?
• Yes
• The implementation is brilliant
• There are attacks, but they are unlikely (not impossible)
• As long as the majority of the network (by mining power) does not collude against the network, it works
  • 51% attack
• Most existing attacks exploit shitty opsec
Blockchain 101
• List of every bitcoin transaction ever
• Basically a giant series of linked lists in the form of serialized data
• Eternally growing chain of transactions
• Computers are incentivized to host the full ledger and keep the party going by getting rewarded for mining a coin
• Nothing that has happened in the past can ever be changed
• Everything is hashed, is hashed, is hashed
• No state, necessarily, just a log of activity
Blockchain 101 (cont’d)
• The blockchain is made up of blocks (surprise)
• Blocks are made up of transactions
• Transactions are made up of inputs and outputs
• An output is one wallet “sending” coins to another wallet (kind of)
• An input is someone claiming the coins sent to them previously
• Every input corresponds to an output using a “Previous Hash” field
• An output without a corresponding input means the coins have not yet been spent
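The input/output relationship above can be sketched in a few lines of Python. This is only an illustration: the record layout (`txid`, `index`, `prev_txid`, `prev_index`) and the sample transaction IDs are made up for the example and are not the format of any particular parser.

```python
from typing import NamedTuple

class Output(NamedTuple):
    txid: str      # transaction that created this output
    index: int     # position of the output inside that transaction
    satoshis: int

class Input(NamedTuple):
    txid: str        # transaction doing the spending
    prev_txid: str   # the "Previous Hash" field
    prev_index: int  # which output of the previous transaction is claimed

def unspent_outputs(outputs, inputs):
    """Return the outputs that no input has claimed yet."""
    spent = {(i.prev_txid, i.prev_index) for i in inputs}
    return [o for o in outputs if (o.txid, o.index) not in spent]

# Hypothetical data: transaction "bb" spends output 0 of transaction "aa".
outputs = [Output("aa", 0, 500), Output("aa", 1, 1500), Output("bb", 0, 400)]
inputs = [Input("bb", "aa", 0)]
utxos = unspent_outputs(outputs, inputs)
```

Here `utxos` keeps output 1 of "aa" and output 0 of "bb", because nothing has claimed them.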
Accessing the Blockchain like a plebeian• Blockchain.info• Webbtc.com• Blockexplorer.com• Blockr.io• ABE (offline)• There are tons
This is so 2015
• There’s no way to query “Show me all transactions above a certain dollar value in this time range”
• They have the data but there is no API exposed for that
• You just have to go around and click stuff
• We need to query based on dollar value and date/time
• Also maybe you don’t want some random website having logs of your queries
SELECT * FROM all_bitcoin_transactions WHERE
DATE < Feb 14, 2016 AND DATE > Feb 10, 2016
AND USD(BTC_Amount) > 14,000,000 AND USD(BTC_Amount) < 16,000,000;
Potential Cheat Codes
• Can download a dump of webbtc.com’s Postgres database (~80 GB compressed)
  • http://dumps.webbtc.com/bitcoin/
• Generate the db yourself with “ABE”
  • https://github.com/bitcoin-abe/bitcoin-abe
Cheat Codes (cont’d)
• Parse out the blockchain into a SQLite database with Docker + Bitcoin ABE in ONE COMMAND
  • https://github.com/c0achmcguirk/docker-bitcoin-abe
docker run -d --name abe -P -p 49001:80 \
  -v <PATH_TO_YOUR_BITCOIN_DIR>:/datadir poliver/bitcoin-abe
Blockchain Ledger
• Lots of “.dat” files
• Currently about 600 .dat files of 128 MB each
  • blk00000.dat
  • blk00001.dat
  • …
• Each .dat file is a serialized binary blob
• Each file contains blocks
• Each block contains a header and transactions
• Each transaction contains inputs and outputs
• Data structure is complex (if you’ve never done anything with data structures before)
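To make the layout concrete, here is a minimal sketch of walking a blk*.dat stream and unpacking each 80-byte block header with Python's `struct` module. It assumes the standard mainnet on-disk framing (4 magic bytes, a 4-byte little-endian block size, then the block). The function name and the synthetic test blob are made up for this example; this is not the parser used in the talk.

```python
import struct
from io import BytesIO

MAINNET_MAGIC = bytes.fromhex("f9beb4d9")
# version, previous block hash, merkle root, time, bits, nonce = 80 bytes
HEADER = struct.Struct("<i32s32sIII")

def read_block_headers(stream):
    """Yield one dict per block header found in a blk*.dat style stream."""
    while True:
        prefix = stream.read(8)
        # Stop at end of file or at zero padding / unknown data.
        if len(prefix) < 8 or prefix[:4] != MAINNET_MAGIC:
            return
        (size,) = struct.unpack("<I", prefix[4:])
        block = stream.read(size)
        if len(block) < HEADER.size:
            return
        version, prev_hash, merkle_root, timestamp, bits, nonce = HEADER.unpack_from(block)
        yield {
            "size": size,
            "version": version,
            # Hashes are conventionally displayed byte-reversed.
            "prev_hash": prev_hash[::-1].hex(),
            "merkle_root": merkle_root[::-1].hex(),
            "time": timestamp,
            "bits": bits,
            "nonce": nonce,
        }

# Synthetic single-block blob for demonstration only (not real chain data).
body = HEADER.pack(1, b"\x00" * 32, b"\x11" * 32, 1231006505, 0x1D00FFFF, 2083236893) + b"\x00"
blob = MAINNET_MAGIC + struct.pack("<I", len(body)) + body
headers = list(read_block_headers(BytesIO(blob)))
```

With a real file you would pass `open(path, "rb")` instead of the `BytesIO` object. Parsing the transactions that follow the header is a separate, messier step.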
[Diagram: block layout showing the Block Header, Transactions, Outputs, and Inputs]
Getting the Ledger (Good Way)
• Install Bitcoin Core
• Ensure you have ~100GB of disk space
• Replicate the ledger over the network for a few days
• Done!
• ~80 GB in size (as of July 2016)
Getting the Ledger (Lazy but Faster Way)• Download a torrent of the ledger .dat files• I’ve seen a few hovering around the Internet• I’ll provide a torrent for my copy of the ledger (as of July 2016)
THE ANSWERS ARE IN HERE SOMEWHERE
Parsing out the Blockchain (or, Andrew learns 2 use structs)
Parsing the blockchain sucks
• Started writing my own parser a la this guide:
  • http://blog.gebhartom.com/posts/Parsing%20the%20Bitcoin%20Blockchain
  • Shout out Tom Gebhar
• Remembered that I’m a terrible programmer
• Found a bunch of libraries that *almost* did what I need
• Found THIS
  • https://github.com/toidi/pyblockchain
  • (committed two months ago, great timing)
• Shout out to this guy tho
  • https://github.com/tenthirtyone/blocktools
• And this guy
  • https://github.com/znort987/blockparser
Parsing the stuff that I need• Transaction ID• Payer Wallet• Receiver Wallet• Time of transaction• BTC amount• USD amount
Problems• Inputs + Outputs• Change• Payer (in general)• Exchange Rate• Big Data?• Transaction Patterns
Transaction ID
• Unique transaction ID. Think UUID of a transaction.
• This is easy to get. It’s a double SHA256 of the entire serialized transaction (block explorers display it byte-reversed).
[Diagram: block layout showing SHA256( ) applied to a transaction (Input 0, Output 0) to produce the Transaction ID]
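As a small sketch of the transaction ID step: Bitcoin applies SHA-256 twice to the serialized transaction bytes, and the resulting digest is conventionally shown byte-reversed as hex. The function name and the placeholder bytes below are made up for illustration.

```python
import hashlib

def transaction_id(raw_tx: bytes) -> str:
    """Double SHA-256 of the serialized transaction, shown byte-reversed as hex."""
    digest = hashlib.sha256(hashlib.sha256(raw_tx).digest()).digest()
    return digest[::-1].hex()

# Placeholder bytes, NOT a real transaction; real input is the raw serialized tx.
example_id = transaction_id(b"not a real transaction")
```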
Time• Easy. An epoch time stamp is always included in the block header• Convert that to datetime with my code
[Diagram: block layout with the Time field highlighted in the Block Header]
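A minimal sketch of the epoch conversion, using only the standard library. This is not the code from the talk; the function name is made up, and the sample value is the widely cited timestamp of the genesis block.

```python
from datetime import datetime, timezone

def block_time(epoch_seconds: int) -> datetime:
    """Convert a block header timestamp (Unix epoch seconds) to an aware UTC datetime."""
    return datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)

genesis = block_time(1231006505)
print(genesis.isoformat())  # 2009-01-03T18:15:05+00:00
```

Keeping everything in UTC avoids off-by-a-day surprises when the timestamp is later matched against a daily exchange-rate table.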
Receiver Wallet
• This was kind of hard
• Wallets/addresses are actually shorthand for a public key (usually a hash of it)
• Wallets are included as part of the “Script” payload of a transaction output
• Bitcoin script pushes the public key data onto its stack and does some magic to verify the key
[Diagram: block layout with the Script field highlighted inside Outputs]
Receiver Wallet (cont’d)
• There are multiple different “Script” address opcode patterns
• The majority of transactions fall under ~six patterns
• The library I’m using to parse the blockchain (pyblockchain) had only implemented one pattern (it’s still being actively developed)
• I forked it and implemented the two other most common patterns
• Missing data? Yes.
• My data set is missing <00.5% transactions
• Deal with it
• Multi Signature Transactions???? Nah
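To give a feel for what "opcode patterns" means, here is a hedged sketch that recognizes three standard output script shapes (pay-to-pubkey-hash, pay-to-script-hash, and pay-to-pubkey) and turns the first two into Base58Check addresses. Which patterns the pyblockchain fork actually handles is not stated in the slides, so this selection is an assumption, and every function name here is made up for the example.

```python
import hashlib

B58_ALPHABET = "123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz"

def base58check(version: int, payload: bytes) -> str:
    """Encode version byte + payload + 4-byte double-SHA256 checksum in Base58."""
    data = bytes([version]) + payload
    checksum = hashlib.sha256(hashlib.sha256(data).digest()).digest()[:4]
    raw = data + checksum
    n = int.from_bytes(raw, "big")
    encoded = ""
    while n > 0:
        n, rem = divmod(n, 58)
        encoded = B58_ALPHABET[rem] + encoded
    # Each leading zero byte is represented by a leading "1".
    leading_zeros = len(raw) - len(raw.lstrip(b"\x00"))
    return "1" * leading_zeros + encoded

def classify_output_script(script: bytes):
    """Return (pattern_name, address_or_key) for a few standard script shapes."""
    # P2PKH: OP_DUP OP_HASH160 <20 bytes> OP_EQUALVERIFY OP_CHECKSIG
    if len(script) == 25 and script[:3] == b"\x76\xa9\x14" and script[23:] == b"\x88\xac":
        return "p2pkh", base58check(0x00, script[3:23])
    # P2SH: OP_HASH160 <20 bytes> OP_EQUAL
    if len(script) == 23 and script[:2] == b"\xa9\x14" and script[22:] == b"\x87":
        return "p2sh", base58check(0x05, script[2:22])
    # P2PK: <33- or 65-byte public key> OP_CHECKSIG
    if len(script) in (35, 67) and script[0] == len(script) - 2 and script[-1] == 0xAC:
        # Returned as raw hex; deriving an address also needs RIPEMD-160,
        # which some Python/OpenSSL builds no longer expose via hashlib.
        return "p2pk", script[1:-1].hex()
    return "unknown", None

# Made-up 20-byte hash, used only to exercise the pattern matching.
fake_hash = bytes(range(20))
kind, address = classify_output_script(b"\x76\xa9\x14" + fake_hash + b"\x88\xac")
```

Anything that falls through to "unknown" (multisig, data-carrying outputs, and so on) is the kind of record that ends up missing from the data set.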
The “change” problem
• I have a Bitcoin wallet with 20 BTC in it and I want to send 5 to Kanye West
  • I cryptographically sign a 5.00 BTC output to Kanye West’s address (public key)
  • I cryptographically sign a 15.00 BTC output to my own address (public key)
• Looking only at outputs, change transactions and non-change transactions are indistinguishable
• Need the whole chain to solve this
• A wallet containing $100M worth of Bitcoin sending $5 to a coffee shop and giving itself ~$99.99M in change may look like a straight-up ~$99.99M transaction to our analytics if we aren’t smart
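One simple way to reason about change, sketched below, is to treat any output that goes back to an address that also funded the transaction as change. This is only a heuristic and an assumption on my part, not the method used in the talk; wallets that send change to a fresh address will slip past it. Names and addresses are invented.

```python
SATOSHIS_PER_BTC = 100_000_000

def split_change(input_addresses, outputs):
    """Split (address, satoshis) outputs into payments and change.

    An output counts as change when its address also appears among the
    addresses that funded the transaction.
    """
    senders = set(input_addresses)
    payments = [(addr, amt) for addr, amt in outputs if addr not in senders]
    change = [(addr, amt) for addr, amt in outputs if addr in senders]
    return payments, change

# The 20 BTC example from the slide, with hypothetical address labels.
payments, change = split_change(
    ["my_address"],
    [("kanye_address", 5 * SATOSHIS_PER_BTC), ("my_address", 15 * SATOSHIS_PER_BTC)],
)
```

Here only the 5 BTC output is reported as a payment, so the analytics would not mistake the 15 BTC returned to the sender for a transfer.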
Sender Wallet• The public key that signed the coin before you got it• This is the worst thing in the universe to parse• It technically doesn’t exist, at least in the ledger• You can derive it from the previous transaction
• Previous Transaction + Input Index <- Transaction ID + Output Index <- Receiver Wallet
• Everything is a chain• Chris Donaher wrote a long ass query that makes these connections• Come back to this…
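The chain of lookups above can be sketched in memory with a dictionary keyed on (transaction ID, output index). This is a toy illustration of the idea, not the query from the talk; field names and data are made up, and at the scale of the full ledger the same join needs a database.

```python
def resolve_payers(outputs, inputs):
    """Map each spending transaction ID to the set of addresses that funded it.

    outputs: dicts with "txid", "index", "receiver"
    inputs:  dicts with "txid", "prev_txid", "prev_index"
    """
    receiver_of = {(o["txid"], o["index"]): o["receiver"] for o in outputs}
    payers = {}
    for i in inputs:
        # The receiver of the previous output is the payer of this transaction.
        payer = receiver_of.get((i["prev_txid"], i["prev_index"]))
        if payer is not None:  # None for coinbase inputs or unparsed scripts
            payers.setdefault(i["txid"], set()).add(payer)
    return payers

# Hypothetical data: "alice" received coins in t1 and spends them in t2.
outputs = [
    {"txid": "t1", "index": 0, "receiver": "alice"},
    {"txid": "t2", "index": 0, "receiver": "bob"},
    {"txid": "t2", "index": 1, "receiver": "alice"},
]
inputs = [{"txid": "t2", "prev_txid": "t1", "prev_index": 0}]
payers = resolve_payers(outputs, inputs)
```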
Sender Wallet (cont’d)
• Straight up didn’t have enough RAM to build the correlated “VIEW” in Clickhouse
• Overnighted 32 GB of RAM to myself a few days before the talk
• …….still didn’t have nearly enough RAM to make the query
• WELP
Sender Wallet (cont’d)(cont’d)
• Sitting on 100GB of input/outputs (total of 700 million records) a week before this talk
• Shitty South Carolina residential Internet connection
• Not enough RAM to make final query
Andrew (Askins) and Curt to the rescue
• Drive to Andrew’s office downtown which has gigabit fiber
• Compress 100GB dataset, upload to S3
• Spin up an AWS instance with ~200 GB of RAM ($5 per hour, dear god)
• Install Clickhouse
• Load the database
• Make the giant frankenquery
• Export data
• Download data
• Terminate instance
• I’ll come back to this
Amount• Amount of BTC in the transaction• This comes from the transaction outputs• In “Satoshis”• 1 BTC == 100,000,000 Satoshis
[Diagram: block layout with the Amount field highlighted inside Outputs]
Historical USD Exchange Rate
• I built a BTC -> USD lookup table
• Given an epoch timestamp, give me the BTC -> USD value at that time
• Downloaded historical BTC pricing data from blockchain.info
  • https://blockchain.info/charts/market-price?timespan=all#
• USD amount = (Satoshis / 100,000,000) * USD exchange rate that day
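A sketch of the lookup-table idea: the satoshi amount is divided by 100,000,000 to get BTC, then multiplied by the most recent known USD rate at or before the transaction's timestamp. The class and function names are made up, and the rates in the example are invented numbers, not real price data.

```python
import bisect

SATOSHIS_PER_BTC = 100_000_000

class RateTable:
    """Historical BTC -> USD rates, looked up by epoch timestamp."""

    def __init__(self, rows):
        # rows: iterable of (epoch_seconds, usd_per_btc)
        rows = sorted(rows)
        self.times = [t for t, _ in rows]
        self.rates = [r for _, r in rows]

    def rate_at(self, epoch: int) -> float:
        """Return the latest rate recorded at or before the given time."""
        i = bisect.bisect_right(self.times, epoch) - 1
        if i < 0:
            raise ValueError("no rate recorded at or before this timestamp")
        return self.rates[i]

def satoshis_to_usd(satoshis: int, epoch: int, table: RateTable) -> float:
    return satoshis / SATOSHIS_PER_BTC * table.rate_at(epoch)

# Invented rates for illustration only.
table = RateTable([(1000, 400.0), (2000, 410.0)])
usd = satoshis_to_usd(250_000_000, 1500, table)  # 2.5 BTC at the 400.0 rate
```

Binary search keeps each lookup cheap, which matters when the conversion runs once per output across hundreds of millions of records.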
That’s everything!
Parser source code
• Get my fork of the PyBlockchain library here:
  • https://github.com/andrew-morris/pyblockchain
• My parser source code here:
  • https://github.com/andrew-morris/blockchain_research
• Note:
  • Requires Python 3.5 (will not work with < Python 3.4)
  • Definitely won’t work with Python 2.7
  • Slow, not multi threaded yet
  • I’m a generally shitty developer, sorry bout it
Load into a DBMS• Decided to use Yandex Clickhouse• Why?• Because my coworker is a lot smarter than me and recommended it• He was right!
• https://clickhouse.yandex• Allows “views” or pseudo tables made out of other queries• Very, very, very fast• Only downside is that it can only make queries that fit into RAM• Indexes to disk
Getting the data into the DB• ~98GB plaintext CSV• Took ~13 hours to parse• Took ~15 minutes to load into the DB
cat outputs.csv | clickhouse-client --query="INSERT INTO blockchain_outputs FORMAT CSV"
cat inputs.csv | clickhouse-client --query="INSERT INTO blockchain_inputs FORMAT CSV"
Linking outputs to inputs
• Need to create a database “view” with Clickhouse to link inputs to outputs, getting us the “Payer” field
• The RAM problem…
-- Link every input back to the output it spends, so each row gets a Payer.
CREATE TABLE transactions
ENGINE = MergeTree(Date, (TXID, Receiver, Payer), 8192)
AS
SELECT TXID, Payer, Receiver, Satoshis, USD, Epoch, Date FROM
(
    -- Receiver side: every output of every transaction
    SELECT TXID, Receiver FROM output
)
ALL LEFT JOIN
(
    -- Payer side: the Receiver of the previous output becomes the Payer
    SELECT Receiver AS Payer, TXID, Satoshis, USD, Epoch, Date FROM
    (
        SELECT previousHash AS hash, inputIndex AS num, TXID FROM input
    )
    ALL LEFT JOIN
    (
        SELECT TXID AS hash, outputIndex AS num, Receiver, Satoshis, USD, Epoch, Date FROM output
    )
    USING hash, num
)
USING TXID;
I love you Chris Donaher
Inputs parsed
Stats• ~1.4 billion records in “transactions” VIEW table
How many transfers worth OVER $1M
60,152
How many transfers worth OVER $10M
414
How many transfers worth OVER $100M
7
Biggest USD transfers of all time
Biggest USD transfers of all time (cont’d)
• $127 Million Dollars
• November 22, 2013
• Someone wrote an article about it apparently
• https://blockchain.info/tx/1c12443203a48f42cdf7b1acee5b4b1c1fedc144cb909a3bf5edbffafb0cd204
SELECT *, bar(USD, 10000000, 149000000, 40) FROM outputs ORDER BY USD DESC LIMIT 50;
How many transactions, ever?
140M
Largest Unredeemed BTC Transaction• TXID -2a29fdb4e188f827da3c3175856b3ed95819b323bb303a46b8036534e78c76db• $34M USD, unspent for two years, still unspent
Days where most USD was traded
How much money was donated to…• Wikileaks - $125k• The Pirate Bay - $3k• etc
Other• Average number of transactions per month• Average USD moved in a day• Most USD lost, ever• Find automated trades (timestamp/block height difference)• Find tumblers
AND FINALLY….
AND FINALLY….
Contenders
• I found an $8M transaction four days before the tweet
• Looks like it may be exiting a tumbler the day of the tweet, which would make sense if they were stolen
• Or not?
  • https://blockchain.info/tx/c9b93760b545f17cad6e7308692419c3b75db21d200a0d139d61b0b559ec29b6
  • …coming from here
  • https://blockchain.info/tx/0f99c7b37dd88ef202a3f8025589c72c20b16dcd900fba46ec149bc1a9f58b04
• Obviously I can’t confirm this
Contenders (cont’d)
• What if the transaction was split into multiple smaller transactions?
• What if dude bro was tricked into sending the BTC in 10 different transactions?
• This would avoid triggering my analytic
Show me top 50 wallets that received the most money in February 2016
SELECT Receiver, sum(USD) AS total_usd_received FROM output
WHERE (Epoch > (Feb 1 2016)) AND (Epoch < (Feb 29 2016))
GROUP BY Receiver ORDER BY total_usd_received DESC LIMIT 50;
│ Receiver │ total_usd_received │
│ 1D12giTaEK9zVePUX9d5R5boQJpGZeLet │ 21554740.961761475 │
│ 1FAv42GaDuQixSzEzSbx6aP1Kf4WVWpQUY │ 19422711.717998274 │
│ 1FCBQRfEVrPnHcLLpuChesYTRyJQ8YwNeu │ 18656293.95471573 │
│ 1MrQKQ7ZGrkEVtzVjQ6M8obRKqXsseEEeK │ 18528720.99706912 │
│ 1MLchJtjmcugGbv7pmKdGnpqJnfNvVYNFo │ 15983460 │
│ 13WLPHwqc81itvJxjq3ijAbotMDR3B6Qfz │ 15305706.36877884 │
│ 1ATChjYwUkiT1FXPDMwER3Gbzq3aQmvqJX │ 15180784.070986748 │
│ 1Bzcrtsy3FjrTmTbB8dsvBmGGCsadnXPFW │ 15118540.048821181 │
│ 1GyDFwbK8AKks5jvouaSwp12Q4LEwSM4BK │ 13009113.512695312 │
│ 17irB8xLxhVRerCoUyypnmpoak3QBpVp2z │ 12868709.04 │
│ 1M5sKoqV3cLmmvV2MM7os8rEqmR3t9Pisk │ 12803530.500000006 │
│ 15EhvB17N8z2jRRrFczfkomNFcvgcTiZCg │ 11690543.555664062 │
│ 1CLCYgXddMwFDVJsVMiWtqrQxJbGAEE9bQ │ 11662185.460601807 │
│ 18cVXEkRyWU8Z4VT7TvzgWZhqKJLdC2w3M │ 11649447.772470951 │
│ 1PXNPo29gbZV7h9Htd2NGvP4bXeLaPJ837 │ 11488992 │
Interesting transaction - https://blockchain.info/address/1PXNPo29gbZV7h9Htd2NGvP4bXeLaPJ837 https://blockchain.info/tx/e99efe3d33fb93bc95d90eca3bc1106c808b878103b121a416083022adc64fda
Things for me to do
• Open up a public website to allow anyone to make these queries without needing serious $$$, disk, and cycles
• Maybe build a front end for it?
• Change architecture
• Stream straight from the blockchain network instead of periodic ETL
• Explore opportunities with graph databases to find fun relationships
• Make my parser code less shitty
• Multithread it
• More address patterns
Things for me to do (cont’d)• Watch everyone spend bitcoin earned from donation links• Build analytic signatures for known tumblers, gambling sites• Can some simple analytics• Build an API• Harvest bitcoin wallets from search engines, watch those• Build an alerting engine
Use Cases• Law enforcement? Intelligence community?• Investigation purposes
• FinTech use cases• Investment use cases (hedge funds, currency exchanges)• AML (Anti Money Laundering)
Use Cases (evil)• Correlate addresses + prices + timing with Darknet, find people
buying…• Drugs• Guns• Porn• Whatever
• Use as a targeting platform to find high-value wallets• Violate people’s privacy???• Idk
What did we do?
• Blockchain is cool, but it’s not formatted to allow queries
• We ripped the data apart and shoved it into a new, flexible data structure
• Basically built a lightweight blockchain search engine
• Made queries based on social media posts
• Identified possible transactions of interest
Did anyone move $15 million around the time of the Tweet?• Not that I observed• Found a couple close ones, but nothing at $15M• Maybe my time range is off? idk
Am I deanonymizing the Blockchain?
• Nah
• Only if Google “deanonymized” the Internet by creating a search engine
• The data is already there
• Someone else is probably doing this a lot better than I am
What can this database do?• Ask any questions of the blockchain, fast• Literally anything• Show me who has the most unspent bitcoin• Show me how much money someone donated to Wikileaks• View a given bitcoin wallet’s coin / USD balance at any point in history• Correlate the price of bitcoin with:• Events• Stock Market• Other (crypto)currencies
Questions?
Thank You!• Twitter – @andrew___morris• Email – [email protected]• Website – https://morris.guru• GitHub – https://github.com/andrew-morris