Tuesday, December 7, 2010

Chrome OS is about disruption, not domination

Today I used Google's Chrome Web Store for the first time, and I had an epiphany. I suddenly understood why Chrome, the Web Store, and Chrome OS are actually a rather big deal.

Here's the future, as I saw it this morning.

The web (i.e. JavaScript + DOM) will be THE runtime for software. For the most part as web services, but with some applications still running locally as extensions or Chrome Web Store 'applications'.

The latter will be more prevalent where local caching of assets and volatile data is necessary to match the performance of existing native applications in frequent use. The boundary between the two will be blurry and, for the most part, abstracted away from the end user.

You might, for example, visit Google Docs to work on a document, which will take advantage of a local helper application on your machine to make sure the whole thing loads quickly and is stable even over a transient internet connection. It will do so in a private sandbox isolated from the rest of your data, and you may well not even be told it's happening. After all, you don't really care, you just want to get to your doc.

But even with such edge-caching, storage of application data and user preferences/credentials will always be hosted off site. This means that you can lose or wipe your machine, and the only delay to being fully back up and running is the time taken to refresh your 'cache'.

And in this world, a web browser is the only actual program on your desktop you'll ever need.

Sound crazy? You'd be hard pressed today to find a major tech company that doesn't have a web delivery strategy of some kind. Microsoft Office is becoming Microsoft Office Live. Skype is widely expected to have a web-based version from 2011. Adobe's product suite is becoming increasingly web-based, and HTML5 at that.

Chrome is a stepping stone in the transition to this state. It starts as just a tool for looking at HTML, but increasingly becomes a runtime for everything a user needs. Case in point: I've just installed the Tweetdeck web store app and deleted Echofon from my desktop. At a not-too-distant point in the future, I won't need Office anymore. I probably don't need Apple Mail or Yammer right now.

As people start to reach this point in the next few years, they will need to ask themselves if they really need a traditional OS at all, or whether they wouldn't be better off with something like Chrome OS that literally just runs Chrome. After all, wouldn't it be nice to have faster boot times, super-secure access, few-to-no IT admin headaches, no licensing fees, no backups, and complete crash recovery? For most people, who aren't buried in hard-core desktop tools like ray-tracers or IDEs, this would actually be pretty nice, right?

It's especially true in the enterprise, where the driver is the significantly reduced IT cost. Not only would there be far less support needed, but the hardware would be cheaper and more standardised too. Hell, my old employer could shift most of its employees to Chrome OS today if it were stable enough, and save a ton of money and hassle in the process. Everything they do is web-based anyway.

If none of that made any sense, spend a minute watching this. The Epipheo guys explain it much better than I can:


Anyway, back to the vision. To get to this state, a bunch of things that are good today need, in the short term, to get better. Chrome is fast, but it needs to be faster. And more stable. For developers, you really need to tool up to build desktop-grade apps in the browser. Forget jQuery; you need something like GWT. Not surprisingly, Google is leading the charge here too. Now you know why.

And in the long term? Like any architecture transition, it will take a while. For many orgs, a LONG while, as legacy cruft remains that can't cost-effectively be 'ported' to the web. And for those of us running specialised tools it may never happen entirely, but it is a logical transition because of the user benefits and cost reduction.

It's effectively a return to a thin client architecture that puts the web front and centre.

Platform plays are nothing new, but the really surprising thing is that Google's aim here doesn't seem to be platform dominance (unlike Microsoft with Windows, or Apple with iOS), but rather disruption - preventing any other player from 'owning the screen' of the devices users use to access Google's services (and thereby blocking them).

I think this explains why they picked the web as the ideal runtime. It's an appalling choice in many respects - it was never designed to support rich applications - but every OS and smartphone has to support it to stay relevant, so they are trapped into tacitly supporting Google's platform of choice. It's also why Google can afford to go open source on Chromium - since it's not about vendor lock-in, it's fine if Novell or whoever decide to clone it and tweak it for their own purposes. That being said, I've no doubt Google will achieve significant market penetration in many workplaces and households.

They don't need to win in the sense of making tons of money or having Chrome OS on every device; they just need to do well enough that the web remains a first-class citizen and no other player can own the whole market and block their services. In this, I think they'll succeed.

I suspect this strategy has been obvious to many for a while, but today I suddenly 'got' it. And I have to admit it's actually quite well thought out and rather elegant.

The 'Web Store' is something of a last-minute addition to this strategy, but a logical one and entirely consistent. It will do very well for Google, and will hopefully side-step the closed-garden problem. And where the hell does Android sit in all of this? Did Palm have the right idea all along with WebOS? (Answer: No.) But that will have to be the subject of another post.

Monday, November 29, 2010

What high technology startups can learn from the Wizard of Oz

A guest post of mine, originally posted at Pollenizer.com

One of the principles of Lean development, particularly as it applies to software, is that of the Minimum Viable Product – the absolute bare minimum you can get away with in order to determine whether your idea will be successful or not. The idea is that an MVP should require as little investment as possible, because the sooner it is up, the sooner you can validate whether your brilliant, world-changing idea actually makes sense as a business.

As you discover that perhaps your idea didn't stand up in the real world as well as you'd like, or wasn't adopted in the way you expected it would be, you will be less reticent to throw away a misconceived early idea if it didn't take too much energy to build in the first place.

This model works well for many web-based businesses – your MVP can be as simple as building a basic website (perhaps with a limited degree of interactivity), adding in some tracking, throwing some limited marketing at it, and seeing if it will fly.

But what about those truly innovative ideas, the ones that require a substantial investment in research and development before you can even be convinced you have a product with traction? Something that might take millions of dollars, specialised teams, and months or even years of work. How do you execute on a project like that while remaining true to the principles of Lean?

One such project we encountered was Friendorse.com – a great little service that helps you find information from experts in your local area. Users visit the site, connect with Facebook, ask a question, and a few minutes later get e-mailed an answer from a local expert.

It sounds simple, but as we fleshed the idea out it became clear that the technology required to run a business like Friendorse at scale is actually pretty complicated. Questions must be tokenised, term vectors extracted, categories clustered, taxonomies developed, users' social and spatial distances calculated and indexed, and all this data fed into a fairly sophisticated machine learning algorithm.
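To give a feel for just the first of those steps, here's a minimal sketch in Python of tokenising a question into a term vector. The stop-word list and the plain frequency weighting are illustrative assumptions, not Friendorse's actual pipeline:

```python
import re
from collections import Counter

# Tiny illustrative stop-word list; a real pipeline would use a much larger one.
STOP_WORDS = {"the", "a", "an", "of", "in", "to", "how", "do", "i", "these", "is"}

def term_vector(question):
    """Lowercase, tokenise, drop stop words, and count the remaining terms."""
    tokens = re.findall(r"[a-z']+", question.lower())
    return Counter(t for t in tokens if t not in STOP_WORDS)

vec = term_vector("How do I get rid of these damn cockroaches?")
# vec maps each content word ("cockroaches", "rid", ...) to its frequency
```

From vectors like this you can start computing similarity between questions and experts' past answers, which is where the clustering and matching work begins.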

Such technology would require a significant investment – many months of development by skilled engineers – as well as a significant amount of data in order to build and verify a working product. That's if it could be done at all. All in all, a pretty big barrier just to test whether the idea even worked. Furthermore, we recognised that the Friendorse matching algorithm, like the Google search algorithm, would never be truly 'finished', but would have to continue to evolve as the product matured. Could we ever justify investing in something like this without even knowing if (or how) people would use it?

Our solution was simple – don’t build an algorithm.

Inspired by a great discussion from the founders of Aardvark.com, we realised that to test if the product worked in the market – all we really needed to do was create a site in which people could ask and answer questions. But behind the scenes the crucial step of ‘routing’ a user’s question to a local expert would actually be handled by a person (working under a strict confidentiality clause) rather than a machine. The Aardvark guys call this the ‘Wizard of Oz’ approach, and we think it’s a pretty apt term for the process. Behind the shiny stage and the fancy lights, it’s actually a person pulling the strings.

Building this project ‘Wizard of Oz’ style allowed us to deliver a product much faster, cheaper and with greater certainty than we otherwise could. It also allowed us to experiment and pivot more quickly – dropping those ideas which didn’t work in the wild and iterating towards those that did.

From a technical perspective, it had another significant benefit – having a live project in production early on, taking queries from real users that were being manually routed and managed by human operators, gave us a significant cache of training data with which to embark on the next stage of the project.

We know now for example that people don’t typically ask questions like “What’s the best pest controller in Melbourne?”, they ask “How do I get rid of these damn cockroaches?”. We know that coffee suppliers in Brisbane don’t know all that much about the cafe scene in Sydney. And so it goes.

Having data like this before seriously sitting down and figuring out the internals of the algorithm is immensely valuable and a huge competitive advantage for the Friendorse team.

By following a Lean approach, Friendorse has charged into its first battle (market acceptance) as a lean and agile creature. Not only did it survive, but it will face its next battle (fund-raising and scale) better equipped and far more informed than it otherwise would have.

Tuesday, October 5, 2010

Extracting e-mail addresses from Apple Mail

Ever wanted to extract every e-mail address you've ever sent to or received from in Apple Mail? With a bit of *nix-fu, it's fairly straightforward.

Just open up a terminal and run the following three commands in order (the last two will take some time to execute).

cd ~/Library/Mail

find . -name '*.emlx' -print0 | xargs -0 perl -wne'while(/[\w\.]+@[\w\.]+/g){print "$&\n"}' > ~/Desktop/extracted_emails.txt

sort ~/Desktop/extracted_emails.txt | uniq > ~/Desktop/sorted_emails.txt

You'll now have a file on your desktop with a sorted, de-duped list of e-mail addresses.
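If you'd rather avoid the shell pipeline, the same extraction can be sketched in a few lines of Python. It uses the same deliberately simple regex as the perl one-liner above, and assumes the standard ~/Library/Mail location:

```python
import os
import re

# Same simple pattern as the shell version; it will match most addresses.
EMAIL_RE = re.compile(r"[\w.]+@[\w.]+")
MAIL_DIR = os.path.expanduser("~/Library/Mail")

addresses = set()  # a set gives us de-duplication for free
for root, _dirs, files in os.walk(MAIL_DIR):
    for name in files:
        if not name.endswith(".emlx"):
            continue
        # .emlx files are mostly text; skip any undecodable bytes
        with open(os.path.join(root, name), errors="ignore") as f:
            addresses.update(EMAIL_RE.findall(f.read()))

for addr in sorted(addresses):
    print(addr)
```

Redirect the output to a file and you end up with the same sorted, de-duped list.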

Tuesday, September 7, 2010

A simple example of technology in action

This post might not interest some of the more technically minded readers of this blog, but I thought the following video (taken from my laptop this afternoon) was a poignant example not only of just how much data is now available for general consumption by both man and machine, but of how accessible that data really is to the average citizen.

The video on the left is a live stream of independent Australian MP Rob Oakeshott, provided by the Sydney Morning Herald, announcing which major political party he intended to support; in effect, announcing which political party would be able to form government.

On the right is a live concurrent feed of tweets with the hashtag #ausvotes - the collective reaction of a nation that had had to wait more than two weeks since the polls closed to find out who would be leading the country. Over 1,000 tweets were recorded in the couple of minutes the video runs for.

Rob Oakeshott proved himself a master of spinning out a story whose punchline everyone just wants to get to, as reflected in the tweet stream. So is the elation/frustration when the announcement is finally made.

Here's the video:

Saturday, August 28, 2010

My presentation at Sydney GTUG, August

UPDATE: The fine folks at Google/GTUG managed to film the presentation [spoken very quickly], as well as host a live Wave of the talks.

Hey folks, I'm giving a presentation at the Sydney Google Tech Users Group on August 30, talking about my experiences building http://beste.st/ with Google App Engine, Google Web Toolkit, appengine-mapreduce, and a few other things with the word Google in the title.

If you're interested and in the Sydney area, come on down! Details here. It's held at Google Sydney's schmantzy new offices.

There will be cake.

Wednesday, August 11, 2010

What 180 nuns can teach us about happiness

The video of my talk at Sydney Ignite 4, on Nuns and Happiness. My first time speaking at Ignite, and it was a great experience.

While negative emotions like anger or fear serve an obvious purpose in helping our species survive, why did we evolve positive emotions, like gratitude, compassion and forgiveness? What purpose could such emotions have beyond making us feel good? How could they help us survive, and thrive, as a species?

More importantly, what could 180 nuns teach us about this whole process?

Reflections on appengine-mapreduce

Google have recently released a framework for performing map-reduce type operations using Google App Engine and task queues. While it's an elegant solution for doing batch processing on AppEngine, it's not a substitute for 'full-stack' map-reduce frameworks like Hadoop.

I've been doing a bit of AppEngine development recently. I have much praise for the platform, and in many respects it's perfect for my current project - Beste.st. But I am coming up against a serious headache that many developers also seem to struggle with - managing data. To cut a long story short, performing bulk operations on data in the AppEngine datastore is very, very painful. Which makes managing any kind of real product on AppEngine very painful as well.

This is partly due to the lack of decent tools out there for bulk data management - a situation that will hopefully correct itself in time. But it's also due to the fact that it's impossible in AppEngine to run any process for longer than 30 seconds. That's right - any process. Even background ones.

Think about the issues this causes when you need to do some sort of background processing, even of a fairly trivial nature (like, say, deleting all entities of a certain type). What do you do if you can't delete a million records in 30 seconds? It quickly raises the question: what's the point of building on highly scalable infrastructure if you can't do anything practical with the data afterwards?

Google's original (mid-2009) solution to this was a thing called Task Queues - an SQS-like service that requires you to break a long-running task into chunks (each of which must take less than 30 seconds to run) and put them into a queue that is then managed by AppEngine. This helps, of course, but actually creates another problem: you must figure out how to compartmentalise your job into 30-second slots ahead of time (thanks to the 30-second limit, you can't have any kind of monitor process regulate this for you).
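The chunking discipline this forces on you can be sketched in plain Python, independent of any App Engine API. The batch size and the work function here are hypothetical stand-ins for whatever each queued task would actually do:

```python
def delete_in_chunks(keys, delete_batch, batch_size=500):
    """Split a long-running delete into bounded chunks, the way a task
    queue forces you to: each call to delete_batch stands in for one
    queued task, which must finish well inside the 30-second deadline."""
    for start in range(0, len(keys), batch_size):
        delete_batch(keys[start:start + batch_size])

# Toy usage: 'deleting' just records what each chunk would receive.
chunks = []
delete_in_chunks(list(range(1200)), chunks.append, batch_size=500)
# chunks now holds three batches, of 500, 500 and 200 keys
```

The awkward part on AppEngine is that you must pick batch_size conservatively up front, since no single task can supervise the whole job.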

Enter appengine-mapreduce. You'd expect Google to come along with something like this - after all, they invented map-reduce to process enormous search engine indexes. Map-reduce is a scalable data-processing paradigm that complements the scalable datastore. At first glance you might expect that what Google would do here is expose some of their own MapReduce infrastructure for developers to unleash on their own private data-sets, but it turns out that's not what's happening.
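For readers new to the paradigm, here is the canonical word-count example sketched in plain Python - map emits (word, 1) pairs, shuffle groups them by key, and reduce sums each group. This illustrates the model itself, not the appengine-mapreduce API:

```python
from collections import defaultdict

def map_phase(docs):
    # Emit a (word, 1) pair for every word in every document.
    for doc in docs:
        for word in doc.split():
            yield word, 1

def shuffle(pairs):
    # Group all emitted values by their key.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    # Combine each group of values into a final count per word.
    return {word: sum(counts) for word, counts in grouped.items()}

counts = reduce_phase(shuffle(map_phase(["the cat", "the dog"])))
# counts == {"the": 2, "cat": 1, "dog": 1}
```

The power of the paradigm is that each phase parallelises trivially, so the same three-step structure scales from this toy to a search-index-sized corpus.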

Google's Mike Aizatsky gave some good reasons during a recent tech-talk as to why they didn't take this approach. So, instead of a low-level service, Google has simply given us a library that provides a completely independent implementation of map-reduce with Hadoop-like APIs, using task queues to schedule the work. They've only delivered the Mapper API so far, but we're assured the app-engine team are hard at work on the Shuffle and Reduce steps.

When I first read this I was pretty stoked. Although it seemed like a bit of a brute-force way of solving simple problems, it did at least guarantee they would be solved in a scalable way. And it was nice to be given an opportunity to work with MapReduce that didn't require setting up a cluster of machines and installing and maintaining Hadoop.

And it works surprisingly well for tasks that require iterating over large numbers of datastore entities. Ikai Lan from the appengine team gives a great tutorial for Java devs on how to perform basic tasks like batch deletions and transformations. I set about using the library to solve a pressing problem of mine - building a site-map of 1,000 or so pages on Beste.st - and I had a running solution in production in a couple of hours. Not bad.
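For the curious, the Python flavour of the library wires a mapper up through a mapreduce.yaml file, roughly like the sketch below. The handler and model names here are hypothetical, and you should verify the reader and parameter names against the appengine-mapreduce docs before relying on them:

```yaml
# mapreduce.yaml - illustrative only; check names against the
# appengine-mapreduce documentation.
mapreduce:
- name: Delete all Entry entities
  mapper:
    input_reader: mapreduce.input_readers.DatastoreInputReader
    handler: tasks.delete_entry   # your mapper function (hypothetical)
    params:
    - name: entity_kind
      default: models.Entry       # your model class (hypothetical)
```

The library then takes care of slicing the datastore scan into task-queue-sized pieces for you, which is exactly the bookkeeping the raw Task Queue API leaves in your lap.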

Congrats to the appengine team on the library, it's effective and functional. But while it's great that app-engine now has a viable solution for data processing, what it still doesn't have is an efficient one.

This is because part of the premise of working with distributed data is that, particularly for tasks with a high data-to-CPU ratio (like word counting, for example), performance is typically bounded by disk and network I/O rather than CPU throughput. In other words, when you have a lot of data to crunch, it makes sense to move the processing task to where the data is (i.e. onto the same machine where the data is stored on disk) rather than move the data to where the processing is.

This is where full-stack map-reduce frameworks really shine, because these frameworks are aware of how their data is partitioned across a cluster, as well as how their cluster is configured (which machines are running down the corridor, which machines are in the same rack etc.), and can thus make appropriate decisions around directing jobs to data or data to jobs. This in turn makes the overall processing effort much more efficient.

Unfortunately, the appengine-mapreduce library is no more privileged than any other code executing on AppEngine. So, unless the Google infrastructure is really smart, there's no guarantee that the CPU you are running map/reduce tasks on is anywhere near the disk the data sits on. This means a lot of cycles behind the scenes - which you're paying for - to shunt data around Google's infrastructure.

So in summary? If you're already an AppEngine developer, I'd seriously consider making the Mapper API part of your arsenal. It will allow you to bring scalable data processing to your app in a big way, and in doing so it brings AppEngine another step closer to being a really compelling platform.

On the other hand, if you're looking at AppEngine as simply a tool for offloading your map-reduce tasks efficiently, you might still want to consider the more traditional solutions like Hadoop, which offer rack and data partitioning awareness.