Current Work:
Future Work:
– A webservice that, when given an address and a few other parameters, will output Gephi data files that give users the ability to fully explore the big problems associated with linkages. Stay tuned….
After several failures due to strange anomalies in the block chain (generally because things weren’t standardized as well as they are now), a VERY frustrating API that took weeks to wrangle, and 5 days of running the application without a major failure – the graph is finally complete as of yesterday:
Gephi, the visualization software I was planning to use, doesn’t work very well with graphs larger than about 1 million entities (vertices + edges), mainly because it loads the entire graph into memory. A computer with 16 GB of RAM was only able to load about 5 million vertices and 0 edges before it failed. As a result, I do not have a visualization of the graph at this time (not like it would have been very useful at that size anyway – but it is one of the things I really want to do). The next step of this process will be to abstract related entities from the graph, hopefully reducing the number of nodes and relationships somewhat.
If there are any graph hobbyists/theorists who would like a copy of the neo4j database (~5 GB compressed, ~17 GB uncompressed, including the JSON files), shoot me an email and I’d be happy to get that to you. As much as I would like to publicly link to it right now, I simply do not want to pay for the bandwidth (sorry brah). If you would like to build the graph yourself from scratch, feel free to fork my project -> https://github.com/thallium205/BlockchainNeo4J. The program first downloads the raw JSON files (nearly 200,000 of them) from blockchain.info, validates their consistency (optional, but highly recommended on the first run) and tries to fix any files that may be missing, then persists them to a running neo4j instance that you specify by URI. Transactions are indexed, so linking an input to the existing output it redeems is a constant-time operation. For any block on a side branch that deviates from the main chain, the program performs a breadth-first traversal to find its parent block, which is usually very fast since these branches are not very long.
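To make the constant-time claim concrete, here is a minimal sketch of the idea using a plain in-memory hash map. It is only a stand-in: the actual project relies on an index inside neo4j, and the class name, the "hash:index" key format, and the node ids below are all assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for an output index; not the project's actual code. */
final class OutputIndex {
    // Key: "<transaction hash>:<output index>". Value: an assumed graph node id.
    private final Map<String, Long> outputs = new HashMap<>();

    private static String key(String txHash, int n) {
        return txHash + ":" + n;
    }

    /** Record an output at the moment its transaction is persisted. */
    public void addOutput(String txHash, int n, long nodeId) {
        outputs.put(key(txHash, n), nodeId);
    }

    /** Find the output that an input redeems; null if it has not been indexed yet. */
    public Long findRedeemedOutput(String prevTxHash, int prevN) {
        return outputs.get(key(prevTxHash, prevN));
    }

    public static void main(String[] args) {
        OutputIndex index = new OutputIndex();
        index.addOutput("aaaa", 0, 1L);
        index.addOutput("aaaa", 1, 2L);
        // An input that references output 1 of transaction "aaaa":
        System.out.println(index.findRedeemedOutput("aaaa", 1)); // prints 2
    }
}
```

Each lookup is a single hash probe, so the cost of wiring an input to its output does not grow with the size of the chain.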
This has been a really cool project, and I look forward to what kind of information can be derived from the data. Soon, I will be transitioning into a new project that I can assure you will be far more entertaining and that is completely unrelated to Bitcoin. As always, stay tuned!
Bitcoin’s block chain lottery system that is… Block creation, at its essence, is nothing more than a relatively predictable lottery where whoever’s computer calculates a correct value first wins the reward. This happens every 10 minutes or so, and the winner gets 50 BTC (~$260 right now). Sometimes, however, two clients find a correct value at nearly the same time, and for a while both blocks are accepted by different parts of the network! This results in a situation where the blockchain has a split, and clients then have to determine which chain is valid and continue it from there (only the block on the chain that wins out keeps its reward). Here is an example I encountered while building the chain into Neo4j:
Because clients have the genesis block programmed directly into them, they always have a common starting point from which to determine which chain to append to. This genesis block is the very first block in the chain, and it is the last resort when two clients find themselves on different chains. Discrepancies are typically resolved by stepping backward down the chain toward the genesis block in exponentially growing strides until a common block is reached; once that happens, block exchange continues.
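The exponential walk back toward the genesis block can be sketched as follows. This is a simplified illustration of how a "block locator" list of heights might be built: the ten most recent blocks one by one, then strides that double each time, with genesis always included last. The class and method names are hypothetical, and working with heights instead of block hashes is an assumption made to keep the example short.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of building a block locator; heights stand in for block hashes. */
final class BlockLocator {
    /** Heights to offer a peer for a chain whose tip is at tipHeight (genesis is height 0). */
    public static List<Integer> locatorHeights(int tipHeight) {
        List<Integer> heights = new ArrayList<>();
        int step = 1;
        int height = tipHeight;
        while (height > 0) {
            heights.add(height);
            if (heights.size() >= 10) {
                step *= 2; // after the 10 most recent blocks, double the stride each time
            }
            height -= step;
        }
        heights.add(0); // the genesis block is the fallback every client shares
        return heights;
    }

    public static void main(String[] args) {
        // prints [20, 19, 18, 17, 16, 15, 14, 13, 12, 11, 9, 5, 0]
        System.out.println(locatorHeights(20));
    }
}
```

A peer receiving such a list looks for the first entry it also has on its own chain and resumes block exchange from there, so even a very long chain needs only a few dozen entries to find common ground.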
My next post will hopefully be the finalized build of the graph, so stay tuned!

Here is a screenshot of our built graph database modeling some of the earliest transactions in the chain. You may find the source code of the program I developed to do this here: https://github.com/thallium205/BlockchainNeo4J
By properly modeling the Bitcoin blockchain and grouping these multi-input (and multi-output!) transactions into entities, a single identified address could reveal large pools of addresses. Combined with the thousands of addresses attributed to people’s signatures on the Bitcoin forums, or even privileged market data from Mt. Gox, we believe portions of the economy could be identified. Even if an entity was not explicitly marked, its relationships with other entities may reveal its associations, thus implicitly identifying them.
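One way such grouping could work is a union-find (disjoint-set) structure that merges every address appearing together as an input of the same transaction. The sketch below is an assumption about how it might be done, not the method used in this project; it only covers the multi-input case, and the class name and the example addresses are made up.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical union-find over addresses; co-spent inputs end up in one entity. */
final class AddressClusters {
    private final Map<String, String> parent = new HashMap<>();

    /** Representative address of the entity containing addr (with path compression). */
    public String find(String addr) {
        parent.putIfAbsent(addr, addr);
        String root = addr;
        while (!parent.get(root).equals(root)) {
            root = parent.get(root);
        }
        // Point every address on the walked path directly at the root.
        String current = addr;
        while (!current.equals(root)) {
            String next = parent.get(current);
            parent.put(current, root);
            current = next;
        }
        return root;
    }

    /** Merge all input addresses of one transaction into a single entity. */
    public void addTransaction(List<String> inputAddresses) {
        if (inputAddresses.isEmpty()) {
            return;
        }
        String first = find(inputAddresses.get(0));
        for (String address : inputAddresses) {
            String root = find(address);
            if (!root.equals(first)) {
                parent.put(root, first);
            }
        }
    }

    public boolean sameEntity(String a, String b) {
        return find(a).equals(find(b));
    }

    public static void main(String[] args) {
        AddressClusters clusters = new AddressClusters();
        clusters.addTransaction(Arrays.asList("addrA", "addrB"));
        clusters.addTransaction(Arrays.asList("addrB", "addrC"));
        System.out.println(clusters.sameEntity("addrA", "addrC")); // prints true
    }
}
```

Once a single address in a cluster is tied to a person, for example through a forum signature, every other address in that cluster is implicated along with it.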
Nonetheless, we will still generate a highly visual model of the blockchain and perform analysis on that with tools such as https://gephi.org/, something that simply has not happened yet. Special permission was given to me from blockchain.info which allows me to hit their API unthrottled, so expect to see some cool stuff in the future!
Three weeks… sheesh. I could bore you with the details as to why this is the case or I can jump right into why the Lucene library is super awesome. Recently I had to generate n-grams, particularly 2-grams, which, given a sentence such as “Leave me now, you’re unworthy.”, would result in tokenized strings like “Leave me”, “me now”, “now you’re”, and “you’re unworthy”. I could have rolled my own implementation, but then I would have missed out on the awesome capabilities of using Lucene.
This post won’t do the power of this library justice, but one can simply think of it as a way of extracting “fields of text” from “documents”. In this example, we:
1) Run a document through an Analyzer which filters out the stuff we don’t care about. SimpleAnalyzer, in this case, applies a lower case filter and a letter tokenizer, which makes all text lowercase and divides text at non-letters, respectively.
2) Wrap this analyzer with ShingleAnalyzerWrapper which constructs shingles (token n-grams) from a stream. This is the main thing we want to accomplish.
3) Generate a TokenStream, which enumerates (a fancy word for lists one by one) the “fields” from a “document” (what I mentioned earlier).
4) Given a token stream, we want to extract certain things from it, like just the characters and not all the other stuff that comes along with the stream. We’ll use CharTermAttribute, which extracts just the words from the stream.
5) Finally, we iterate over the stream by incrementing the tokens, extracting each CharTermAttribute from the tokens.
Okay, enough talking, let’s look at the code:
    // file, min, max and LOG are fields of the enclosing class
    public void run() {
        try {
            FileReader reader = new FileReader(file);

            // Parse the file into n-gram tokens
            SimpleAnalyzer simpleAnalyzer = new SimpleAnalyzer(Version.LUCENE_35);
            ShingleAnalyzerWrapper shingleAnalyzer = new ShingleAnalyzerWrapper(simpleAnalyzer, min, max);
            TokenStream stream = shingleAnalyzer.tokenStream("contents", reader);

            // addAttribute returns the existing attribute, or registers it if it is missing
            CharTermAttribute charTermAttribute = stream.addAttribute(CharTermAttribute.class);

            // Do something with the tokens
            stream.reset(); // good practice here, and required by later Lucene versions
            while (stream.incrementToken()) {
                System.out.println(charTermAttribute.toString());
            }
            stream.end();
            stream.close(); // also closes the underlying reader

            LOG.log(Level.INFO, file.getName() + " completed.");
        } catch (FileNotFoundException e) {
            LOG.log(Level.SEVERE, "Parse Failed. Reason: " + e.getMessage(), e);
        } catch (IOException e) {
            LOG.log(Level.SEVERE, "Parse Failed. Reason: " + e.getMessage(), e);
        }
    }
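If you just want to see what the pipeline does to a sentence without pulling in Lucene, here is a dependency-free sketch that mimics the same steps: lowercase everything, split at non-letters, then join neighboring words into n-grams. It is a rough approximation, not Lucene's implementation. The class and method names are made up, and unlike ShingleAnalyzerWrapper's default behavior it emits only the n-grams and not the single words.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Hypothetical plain-Java approximation of SimpleAnalyzer + shingles. */
final class Shingles {
    /** Lowercase the text, split it at non-letters, and return every run of n adjacent words. */
    public static List<String> shingles(String text, int n) {
        if (n < 1) {
            throw new IllegalArgumentException("n must be at least 1");
        }
        List<String> words = new ArrayList<>();
        for (String word : text.toLowerCase(Locale.ROOT).split("[^\\p{L}]+")) {
            if (!word.isEmpty()) {
                words.add(word); // split() can yield a leading empty string; skip it
            }
        }
        List<String> grams = new ArrayList<>();
        for (int i = 0; i + n <= words.size(); i++) {
            grams.add(String.join(" ", words.subList(i, i + n)));
        }
        return grams;
    }

    public static void main(String[] args) {
        // prints [leave me, me now, now you, you re, re unworthy]
        System.out.println(shingles("Leave me now, you're unworthy.", 2));
    }
}
```

Note that splitting at non-letters breaks “you’re” into “you” and “re”, so the real output of this kind of analyzer is a little less tidy than the hand-written example earlier in the post.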
My overall application source can be viewed here -> https://github.com/thallium205/Shakespeare-Generator. It is not complete at this time, but when it is, it should produce relatively coherent Shakespearean sentences thanks in part to the power of n-grams. Enjoy!
I am introducing a tiny application I wrote that exploits the awesome archiving power of Google Reader to let you archive any website into a SQLite database in a matter of seconds! Don’t believe me?
As of today:
Here are all 38 blog posts from this blog -> goo.gl/NSvO5
Here are all 2,169 blog posts from Google’s official blog -> goo.gl/JNSW3
Here is a screenshot of the contents of the database it created from Google’s blog:

It even stores all the site’s categories:

I was even able to capture a forum with tens of thousands of posts within minutes.
To use this application, just feed it the website’s RSS feed, a time at which you want to start collecting posts (just pass in 0 to get an entire site), and the path to where you want the database file stored, and watch the magic happen. Example usage:
    java -jar Website_Archiver.jar http://googleblog.blogspot.com/feeds/posts/default 0 C:\Users\John\Desktop\google.sqlite
Any SQLite browser will allow you to view the data… in this example I used http://sqlitebrowser.sourceforge.net/ for the screenshots.
You can check out the source here -> https://github.com/thallium205/Website_Archiver
The ready-to-go executable can be downloaded here -> https://github.com/thallium205/Website_Archiver/raw/master/Website_Archiver.jar
Let me know if any of you find any cool uses for this! Enjoy!
My latest and greatest project is ‘AS SEEN ON REDDIT’ -> http://apps.facebook.com/asseenonreddit/ It uses a full MVP architecture built on the GWTP framework and runs on Google App Engine, and I used it as an opportunity to solidify these core concepts. By rerouting the traffic through one of my private servers, I was able to bypass Reddit’s blacklist of App Engine for API calls. The source is available here, enjoy!
Blockchain.info has a really cool tool that gives users the ability to visualize transactions. Given the database that the Bitcoin Updater builds, I wanted to see if I could take this concept and also include the date at which each transaction was added to an approved block. After feeding it a random transaction hash, the dendrogram from the website returns this:
With just a few SQL strokes, I was able to reproduce this result as well, including the dates from the block it belongs to:
Haha! Business.
    SELECT Incoming.prev_out AS 'Current Transaction',
           Curr_Block.TIME AS 'Current Time',
           Outputs.`Redeemed at input` AS 'Next Transaction',
           Next_Block.TIME AS 'Next Time',
           Outputs.amount AS 'Next Value'
    FROM Incoming
    JOIN TRANSACTION AS Next_Transaction ON Next_Transaction.hash = Incoming.Transaction_hash
    JOIN TRANSACTION AS Curr_Transaction ON Curr_Transaction.hash = Incoming.prev_out
    JOIN Block AS Next_Block ON Next_Block.hash = Next_Transaction.Block_hash
    JOIN Block AS Curr_Block ON Curr_Block.hash = Curr_Transaction.Block_hash
    JOIN Outputs ON Outputs.`Transaction Hash` = Incoming.prev_out
    WHERE Incoming.prev_out = 'b08ff6b529c09337de8e4e09ac6e7e1cd3697ecddab76f253a40f0d105c0ac8e'
    GROUP BY `Next Transaction`
The Outputs table I am using is a view:
    SELECT TRANSACTION.hash AS 'Transaction Hash',
           Outgoing.n AS 'Index',
           IFNULL(Incoming.Transaction_hash, 'Not yet redeemed') AS 'Redeemed at input',
           Outgoing.VALUE AS 'Amount',
           TRIM(REPLACE(REPLACE(REPLACE(Outgoing.scriptPubKey, 'OP_CHECKSIG', ''), 'OP_EQUALVERIFY', ''), 'OP_DUP OP_HASH160', '')) AS 'To Address',
           '???' AS 'Type',
           Outgoing.scriptPubKey AS 'ScriptPubKey'
    FROM TRANSACTION
    JOIN Outgoing ON Outgoing.Transaction_hash = TRANSACTION.hash
    LEFT OUTER JOIN Incoming ON Incoming.prev_out = Outgoing.Transaction_hash
                            AND Incoming.n = Outgoing.n
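The nested REPLACE calls in the 'To Address' column simply peel the standard opcodes off the output script. Here is the same string manipulation as a small Java sketch, in case you want to do it outside the database. The class and method names are hypothetical, and the sample scripts are made-up placeholders. Keep in mind that what is left over is the public key hash (or the raw public key), not a human-readable Bitcoin address.

```java
/** Hypothetical helper mirroring the view's TRIM(REPLACE(REPLACE(REPLACE(...)))) chain. */
final class ScriptAddress {
    /** Strip the standard opcodes from a scriptPubKey string, leaving the embedded hash or key. */
    public static String toAddressField(String scriptPubKey) {
        return scriptPubKey
                .replace("OP_CHECKSIG", "")
                .replace("OP_EQUALVERIFY", "")
                .replace("OP_DUP OP_HASH160", "")
                .trim();
    }

    public static void main(String[] args) {
        // Pay-to-pubkey-hash style script with a placeholder hash; prints abcd1234
        System.out.println(toAddressField("OP_DUP OP_HASH160 abcd1234 OP_EQUALVERIFY OP_CHECKSIG"));
        // Pay-to-pubkey style script with a placeholder key; prints 04ab
        System.out.println(toAddressField("04ab OP_CHECKSIG"));
    }
}
```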
How cool would it be to have a colossal dendrogram visualizing the entire transaction chain? What if one of the axes were based upon time? Would we be able to see the evolution of the currency from its infancy to the complex multi-organ behemoth it is today? … definitely. So stay tuned!
PS: I will soon be releasing a tool that will allow you to download and archive any website that has an RSS feed, thanks to Google Reader. I’m still figuring out what kind of format would be the most useful to the most people. I may or may not be using this to scrape the entire official bitcoin forums and look for relationships among post frequency, market events, and blockchain events.
You can now have the Bitcoin blockchain and historical market data in a searchable database of your own! Simply assemble the little program I made, run it, and in 24-30 hours you too can reap the sweet sweet useless awesome benefits as I have. Check it out here!
Want to build a histogram of rounded transaction values to the nearest whole Bitcoin for the year 2011? Of course you do and now you can!
It looks like people have an affinity for nice, even, round amounts: e.g. 100, 150, 200 when sending Bitcoin to one another. But who doesn’t? (This query took 4 hours to run so APPRECIATE IT)
    SELECT ROUND(VALUE), COUNT(VALUE)
    FROM Outgoing
    JOIN TRANSACTION ON TRANSACTION.hash = Outgoing.Transaction_hash
    JOIN Block ON Block.hash = TRANSACTION.Block_hash
    WHERE TIME BETWEEN '2011-1-1' AND '2012-1-1'
    GROUP BY ROUND(VALUE)
    ORDER BY ROUND(VALUE)
Want to see if there is a relationship between the number of trades on the US market and the value of Bitcoin in USD? Well say no more!
It looks like people got pretty interested in trading when the value shot up!
    SELECT symbol AS 'Market',
           DATE(TIME) AS 'Date',
           ROUND(AVG(price), 2) AS 'Average Price in USD',
           ROUND(SUM(amount)) AS '# of Trades',
           ROUND(SUM(price * amount), 2) AS 'Bitcoin Traded'
    FROM Trade
    JOIN Market ON Market.symbol = Trade.Market_Symbol
    WHERE symbol = 'mtgoxUSD'
    GROUP BY DATE(TIME)
    ORDER BY DATE(TIME)
Let me know what you think! You can access the data points and interactive charts here (Google has a slight bug trying to embed a 2 vertical axis chart, which is why the second graph is just an image) -> https://docs.google.com/spreadsheet/ccc?key=0AjEiltOWxrwvdE5Xb1hMbTFZS09oeGFWQS1IUm9iV1E
In order to learn more about Bitcoin, I thought I would start by storing the chain itself in a normal MySQL database. It is for a school project I am doing, in an effort to familiarize myself with the chain before working on the Visualizer. (Hint: nothing is better than tying something you are interested in to a class you are taking.) I was asked to put together a quick little presentation comparing blockchain statistics to Bitcoin market statistics, to give the class some ideas for their semester-long projects. Here is the PowerPoint I plan on presenting tomorrow. Enjoy!