Saturday, August 28, 2010

My presentation at Sydney GTUG, August

UPDATE: The fine folks at Google/GTUG managed to film the presentation [spoken very quickly], as well as host a live Wave of the talks.



Hey folks, I'm giving a presentation at the Sydney Google Tech Users Group on August 30, talking about my experiences building http://beste.st/ with Google App Engine, Google Web Toolkit, appengine-mapreduce, and a few other things with the word Google in the title.

If you're interested and in the Sydney area, come on down! Details here. It's held at Google Sydney's schmantzy new offices.

There will be cake.

Wednesday, August 11, 2010

What 180 nuns can teach us about happiness

The video of my talk at Sydney Ignite 4, on Nuns and Happiness. My first time speaking at Ignite, and it was a great experience.

While negative emotions like anger or fear serve an obvious purpose in helping our species survive, why did we evolve positive emotions, like gratitude, compassion and forgiveness? What purpose could such emotions have beyond making us feel good? How could they help us survive, and thrive, as a species?

More importantly, what could 180 nuns teach us about this whole process?

Reflections on appengine-mapreduce

Google have recently released a framework for performing map-reduce type operations using Google App Engine and task queues. While it's an elegant solution for doing batch processing on AppEngine, it's not a substitute for 'full-stack' map-reduce frameworks like Hadoop.

I've been doing a bit of AppEngine development recently. I have many praises for the platform, and in many respects it's perfect for my current project - Beste.st. But I am coming up against a serious headache that many developers have also seemed to struggle with - managing data. To cut a long story short, performing bulk operations on data in the AppEngine datastore is very, very painful. Which makes managing any kind of real product in AppEngine very painful as well.

This is partly due to the lack of decent tools out there for bulk data management - a situation that will hopefully correct itself in time. But it's also due to the fact that it's impossible in AppEngine to run any process for longer than 30 seconds. That's right - any process. Even background ones.

Think about the issues this causes when you need to do some sort of background processing, even of a fairly trivial nature (like, say, deleting all entities of a certain type). What do you do if you can't delete a million records in 30 seconds? It quickly raises the question - what's the point of building on highly scalable infrastructure if you can't do anything practical with the data afterwards?

Google's original (mid-2009) solution to this was a thing called Task Queues - an SQS-like service that requires you to break down a long running task into chunks (each of which must take less than 30 seconds to run), and put them into a queue that is then managed by AppEngine. This helps of course, but actually creates another problem: you must figure out how to compartmentalise your job into 30 second slots ahead of time (due to the 30 second limit, you can't have any kind of overseeing monitor process to regulate this for you).
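As a rough illustration of the chunking pattern, here is a minimal sketch in plain JavaScript. It does not use the App Engine API: the store, the queue and the names runDeleteTask and drain are all made up for the example, and a fixed batch size stands in for "whatever fits into 30 seconds".

```javascript
// Hypothetical in-memory stand-ins for a datastore and a task queue.
// Each task handles one fixed-size batch and enqueues a follow-up task
// for the remainder, so no single task runs for long.
function runDeleteTask(store, task, queue) {
  const end = Math.min(task.start + task.batchSize, store.length);
  for (let i = task.start; i < end; i++) {
    store[i].deleted = true; // "delete" one entity
  }
  if (end < store.length) {
    // Continuation task picks up where this one stopped.
    queue.push({ start: end, batchSize: task.batchSize });
  }
  return end - task.start;
}

// Runs tasks until the queue is empty and reports how many tasks it took.
function drain(store, batchSize) {
  const queue = [{ start: 0, batchSize: batchSize }];
  let tasksRun = 0;
  while (queue.length > 0) {
    runDeleteTask(store, queue.shift(), queue);
    tasksRun += 1;
  }
  return tasksRun;
}
```

The awkward part is visible in the batchSize parameter: it has to be picked up front, which is exactly the guesswork described above.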

Enter appengine-mapreduce. You'd expect Google to come along with something like this, after all they invented map-reduce to process enormous search engine indexes. Mapreduce is a scalable data processing paradigm that complements the scalable data store. At first glance you might expect that what Google would do here is expose some of their own MapReduce infrastructure to developers to unleash on their own private data-sets, but it turns out that's not what's happening.

Google's Mike Aizatsky gave some good reasons during a recent tech-talk as to why they didn't take this approach. So, instead of a low level service, Google has given us simply a library, that provides a completely independent implementation of map-reduce with Hadoop-like APIs, and using task queues to schedule the work. They've only delivered the Mapper API so far, but we're assured the app-engine team are hard at work on the Shuffle and Reducer steps.
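To make the three steps concrete, here is a toy word count in plain JavaScript that runs in a single process. It is only a sketch of the map, shuffle and reduce phases; none of the function names come from appengine-mapreduce or Hadoop.

```javascript
// Map: emit a [word, 1] pair for every word in one input record.
function mapWords(line) {
  return line.toLowerCase().split(/\W+/).filter(Boolean).map(w => [w, 1]);
}

// Shuffle: group the emitted values by key.
function shuffle(pairs) {
  const groups = new Map();
  for (const [key, value] of pairs) {
    if (!groups.has(key)) groups.set(key, []);
    groups.get(key).push(value);
  }
  return groups;
}

// Reduce: collapse each key's list of values into a single total.
function reduceCounts(groups) {
  const totals = {};
  for (const [key, values] of groups) {
    totals[key] = values.reduce((a, b) => a + b, 0);
  }
  return totals;
}

function wordCount(lines) {
  const pairs = lines.flatMap(mapWords);
  return reduceCounts(shuffle(pairs));
}
```

With only the Mapper step available, you get the equivalent of mapWords applied to every entity; the grouping and totalling would have to be done some other way.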

When I first read this I was pretty stoked. Although it seemed like a bit of a brute-force way of solving simple problems, it did at least guarantee they would be solved in a scalable way. And it was nice to be given an opportunity to work with MapReduce that didn't require setting up a cluster of machines and installing and maintaining Hadoop.

And it works surprisingly well for tasks that require iterating over large numbers of datastore entities. Ikai Lan from the appengine team gives a great tutorial for Java devs on how to perform basic tasks like batch deletions and transformation. I set about using the library to solve a pressing problem for me - building a site-map of 1000 or so pages on Beste.st, and I had a running solution in production in a couple of hours. Not bad.

Congrats to the appengine team on the library, it's effective and functional. But while it's great that app-engine now has a viable solution for data processing, what it still doesn't have is an efficient one.

This is because part of the premise of working with distributed data is that, particularly for tasks that have a high data-to-CPU ratio (like word counting, for example), performance is typically bounded by disk and network I/O rather than CPU throughput. In other words, when you have a lot of data to crunch, it makes sense to move the processing task to where the data is (i.e. onto the same machine as the disk where the data is stored) rather than move the data to where the processing is.

This is where full-stack map-reduce frameworks really shine, because these frameworks are aware of how their data is partitioned across a cluster, as well as how their cluster is configured (which machines are running down the corridor, which machines are in the same rack etc.), and can thus make appropriate decisions around directing jobs to data or data to jobs. This in turn makes the overall processing effort much more efficient.
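A very simplified sketch of that kind of decision might look like the following. The node, rack and free fields are assumptions invented for this example, and real schedulers weigh far more than this.

```javascript
// Pick a worker for a data block: prefer the node that stores the block,
// then any node in the same rack, then any free node at all.
function pickWorker(block, workers) {
  const free = workers.filter(w => w.free);
  return (
    free.find(w => w.node === block.node) ||
    free.find(w => w.rack === block.rack) ||
    free[0] ||
    null
  );
}
```

Code running as an ordinary application has no access to this placement information, so it cannot make this choice for itself.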

Unfortunately the appengine-mapreduce library is no more privileged than any other code executing on AppEngine. So, unless the Google infrastructure is really smart - there's no guarantee that the CPU you are running map/reduce tasks on is optimised to be close to the disk that the data sits on. This means a lot of cycles behind the scenes, which you're paying for, to shunt data around Google's infrastructure.

So in summary? If you're already an AppEngine developer, I'd seriously consider making the Mapper API part of your arsenal. It will allow you to bring scalable data processing to your app in a big way, and in doing so it brings AppEngine another step closer to being a really compelling platform.

On the other hand, if you're looking at AppEngine as simply a tool for offloading your map-reduce tasks efficiently, you might still want to consider the more traditional solutions like Hadoop, which offer rack and data partitioning awareness.

Wednesday, May 26, 2010

Design goals for Pluto - an open source OT library for node.js

Inspired by the resurgent interest in both OT and Node.js, and by recent projects such as LakTEK's announcement of a collaborative editing tool in Node.js, I'm announcing my own little node.js project - Pluto.

Pluto will be an Operational Transform library written in JavaScript and optimised to run on Node.js, with the aim of greatly simplifying the development of collaborative web applications. The library should be sufficiently abstract that it can handle more or less any document that you could persist in a browser (such as word processing documents, spreadsheets, drawing programs, game states etc.), and not be bound to any particular transport system (such as WebSockets or long polling).

Pluto's High Level Architecture


I've sketched out how I think this library could work, and how it can take the meat of the OT problem away from a developer. In essence, on a web page that would normally have events from the user (onPress, onClick etc.) relayed to UI elements in the DOM, those events are instead captured by a PlutoDocument object that translates them into OT-expressible mutation commands.

These are then sent to the PlutoClient object (where the actual OT magic happens). The PlutoClient object relays back any changes that need to be made to the document (also as OT commands, which may come not just from this client but also from other clients editing the same document), and these are translated into actual changes in the client's DOM.
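Since there is no code for Pluto yet, the following is purely a hypothetical sketch of that flow. Every method name and the shape of the operations are invented here, the transformation step is left out entirely, and a plain string stands in for the DOM.

```javascript
// Hypothetical sketch only: Pluto has no published API, so all names are invented.
class PlutoClient {
  constructor(send) {
    this.send = send; // transport is injected (WebSockets, long polling, ...)
    this.listeners = [];
  }
  onChange(listener) {
    this.listeners.push(listener);
  }
  // Local operations go to the server and are relayed back to the document.
  submit(op) {
    this.send(op);
    this.listeners.forEach(l => l(op));
  }
  // Operations from other clients arrive here (transformation omitted).
  receive(op) {
    this.listeners.forEach(l => l(op));
  }
}

class PlutoDocument {
  constructor(client) {
    this.text = ''; // stand-in for the state that would live in the DOM
    this.client = client;
    client.onChange(op => this.apply(op));
  }
  // Translate a UI event into a mutation command instead of editing directly.
  handleKeyPress(position, char) {
    this.client.submit({ type: 'insert', position: position, text: char });
  }
  apply(op) {
    if (op.type === 'insert') {
      this.text =
        this.text.slice(0, op.position) + op.text + this.text.slice(op.position);
    }
  }
}
```

The point of the split is that the document only ever changes in apply, whether the operation started locally or on someone else's machine.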


Why care about Operational Transform?


If you've taken a look at how Google Wave works, you'll have come across the concept of Operational Transform. And if you've used Google Wave you'll see why it kicks ass. Basically, OT is a mechanism for keeping complex documents in sync across different locations, without locking. This last part is crucial, because it means that lots of people can make large numbers of changes simultaneously on a document and not have to stop working because someone else is also making changes at the same time.
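Here is a minimal sketch of the core trick, restricted to text inserts. The operation shape ({ pos, text }) and the tie-break flag are assumptions for this example, not Wave's actual representation.

```javascript
// Apply an insert operation to a string document.
function applyInsert(doc, op) {
  return doc.slice(0, op.pos) + op.text + doc.slice(op.pos);
}

// Rewrite `op` so that it can be applied after `other` has already been applied.
// `opWinsTies` decides who goes first when both inserts hit the same position;
// the two sides must use opposite values so they agree on the order.
function transformInsert(op, other, opWinsTies) {
  if (other.pos < op.pos || (other.pos === op.pos && !opWinsTies)) {
    return { pos: op.pos + other.text.length, text: op.text };
  }
  return op;
}
```

Two people can apply their own edit immediately, then apply the transformed version of the other person's edit when it arrives, and both end up with the same text without either having waited for a lock.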

Although Google Wave is geared for managing Waves, the underlying OT implementation is actually more generic - and can be used to collaboratively manage almost any kind of structured document. The intent behind Pluto is to provide this low level OT management functionality, while leaving the binding of it to a specific type of document up to the application developer.

Two other great features of OT follow from the fact that it expresses documents as a series of mutations (a kind of collaborative command pattern). First, it's fairly straightforward to add history support to systems that use it. Second, using a technique called composition you can compile a sequence of many small individual OT operations into larger, coarser-grained operations that are still equivalent. This means while you can collaboratively edit a document character by character, you don't necessarily need to store or relay each character change out to every other client.

OT is a big field and it's been around for some time - there are many different applications of the algorithm that vary in performance characteristics. The mechanism that Wave (and Pluto) follows is a client-server implementation where the server holds a canonical document state, and the clients are responsible for staying up to date with it. This pushes a whole bunch of the grunt work onto the client, which is OK since it's much easier for a single client to figure out how to catch up to the server, than a server trying to keep track of the state of many clients. This will help Pluto stay true to its efficiency goal.
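As a small sketch of composition, assuming operations are simple { pos, text } inserts (an invented shape, not Wave's format), consecutive keystrokes can be merged like this:

```javascript
// Merge runs of adjacent inserts into single, larger inserts.
function composeInserts(ops) {
  const composed = [];
  for (const op of ops) {
    const last = composed[composed.length - 1];
    // An insert that starts exactly where the previous one ended extends it.
    if (last && op.pos === last.pos + last.text.length) {
      last.text += op.text;
    } else {
      composed.push({ pos: op.pos, text: op.text });
    }
  }
  return composed;
}
```

Typing "hi!" one character at a time produces three operations, but only one needs to be stored or sent once they have been composed.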
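A toy version of that division of labour, with invented class names and with conflicting local edits left out, could look like this. The server only keeps an ordered history, and each client remembers how far through that history it has got.

```javascript
// Hypothetical sketch: the server stores the canonical, ordered list of operations.
class CanonicalServer {
  constructor() {
    this.history = [];
  }
  // Append an operation and return the new version number.
  commit(op) {
    this.history.push(op);
    return this.history.length;
  }
  // Everything a client at `version` has not seen yet.
  since(version) {
    return this.history.slice(version);
  }
}

// The client does the work of catching up from its own last known version.
class CatchUpClient {
  constructor() {
    this.version = 0;
    this.text = '';
  }
  catchUp(server) {
    for (const op of server.since(this.version)) {
      this.text = this.text.slice(0, op.pos) + op.text + this.text.slice(op.pos);
      this.version += 1;
    }
  }
}
```

The server never needs to know where any individual client is up to, which is what keeps its per-client cost low.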

For a detailed theoretical explanation on how this type of OT works, I strongly recommend Daniel Spiewak's brilliant introduction to Operational Transformation.

Why Node.js?


Node.js, if you haven't heard, is an event driven Javascript runtime environment. An oversimplified description would be to say it's Google's V8 engine ripped out of Chrome, with some custom extensions that allow it to talk to things like TCP sockets and local file systems instead of a browser's DOM.

Any potentially blocking operation (such as reading from a disk) is accessed through node via a callback. So instead of writing code like:


var fileContents = fm.readFile("/somefile.txt");
var processedContents = fp.Process(fileContents);


You typically express it like:


fm.readFile("/somefile.txt", function(fileContents) {
  fp.Process(fileContents, function(processedContents) {
    // Do more stuff..
  });
});


These aren't real code examples but you get the idea. The advantage of using Node and this sequence of callbacks is that node will handle the queuing and memory allocation while the process is waiting, which means in situations where large numbers of operations are requesting access to a shared resource, it can be handled much more efficiently.
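For comparison, here is roughly what the same shape looks like with Node's real fs module. The readAndProcess name and the upper-casing step are just placeholders for whatever processing you actually need.

```javascript
const fs = require('fs');

// Read a file without blocking, then hand the processed text to `done`.
// By convention, Node callbacks receive an error (or null) as their first argument.
function readAndProcess(path, done) {
  fs.readFile(path, 'utf8', function (err, fileContents) {
    if (err) return done(err);
    done(null, fileContents.toUpperCase());
  });
}
```

While the read is in flight, the process is free to service other requests; the callback only runs once the data is ready.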

The Javascript language is naturally well suited to this style of writing code, and of course it makes sense for concurrent servers to be event driven.

There's a lot of choice for concurrent server frameworks, like Twisted, Tornado, EventMachine etc., many of which are far more mature - but apart from the reasons above, Node is a standout candidate for this project for two reasons.

Reason 1 - Node's concurrency performance is awesome

I won't get drawn into the inevitable benchmarking flamewar - I'm not qualified, but these early stats look promising.

Efficient concurrency is obviously crucial to a project that enables collaboration - and means we can hopefully delay worrying about scalability.

Reason 2 - It's written in Javascript

OT is an algorithm that relies strongly on complex complementary operations being performed on both the client and the server. So it helps a lot to be able to write the same code to deploy on the client and the server. Given the number of browser JS inconsistencies I might be naive in this view, but it's a start.

(As an aside, Google Wave solved this problem by going the other way - they wrote their OT implementation in Java and compile those Java libraries to JavaScript using the excellent Google Web Toolkit.)

DESIGN GOALS

The exact implementation will no doubt vary as we go, but will stay true to the following key design goals.

  • Be efficient by optimising application design for non-blocking concurrency and pushing as much concurrent state management to the client as possible. This should allow a single server to handle a large number of clients mutating the same document, without the need for federation.
  • Support document playback
  • Be independent of transport and persistence (although we'll probably focus on key/value support for the latter)
  • Be largely independent of the document being worked on - such that the library can be utilised for a wide range of applications
  • Be open source (probably an MIT-style license)
The expected deliverables of the project will be:
  • A comprehensively tested library that can support OT between browsers on a wide range of applications, independent of transport mechanism and storage
  • A reference implementation of the above, that includes a working application (probably a document editor of some kind), as well as support for transport

What next?


This project is a long way from completion - there's not even a repo for it yet. But I'm announcing this now because I want your help to get this moving.

If you're thinking about building something similar, I'd love to hear from you so that hopefully we can join forces - or at least swap war stories.

If you're just curious or enthusiastic about the project, comment, wave, tweet or drop me a line to show your support (which will hopefully inspire me to give up more weekends to devote to this project).

If you like making cute logos, this project needs one.

And if you're interested in sponsoring the development of a tool like this, I'd obviously love to hear from you.

And of course, if you like it, spread the word.



Sunday, April 18, 2010

Technical Debt - A non-technical manager's guide

There are a few concepts that in my opinion any manager in a software company must understand, whether they are themselves of a technical background or not, and one of these is Technical Debt.

Technical debt is a phrase coined by the programmer Ward Cunningham and popularised by Martin Fowler. It posits that, just as someone who borrows money incurs a financial penalty that must be serviced later, a technology product that grows in features and complexity becomes proportionally more difficult to extend and build upon unless appropriate software design standards and architecture are maintained.

Occasionally it may be necessary to short-change these practices in order to ship on time, but the "debt" incurred in this process must be "repaid" with time to refactor and clean up the mess later, or else the project suffers an ongoing cost via a higher defect rate, longer development time for new features, and ultimately developer attrition.

The analogy is an apt one - like financial debt, technical debt isn't fundamentally a bad thing. However, conventional wisdom has it that, just as a financial debt should be paid down as soon as practical to minimise the long term cost, so too should unnecessary complexity be stripped from software whenever possible through good architecture and regular refactoring, in order to minimise the ongoing cost of managing future releases of the software. If technical debt is not accounted for, the result can be "bankrupt" projects that require complete re-writes in order to remain viable. This situation can be devastating for a startup that is rapidly iterating over just one or two products.

Do you have technical debt in your project? Here are a few symptoms...

Progress is slow. If, after a few cycles of feature-packed releases, your software is now so buggy and/or difficult to maintain that no progress is being made, that's a good sign you've got debt on your hands. If you're using Scrum to manage your team, this often manifests as a plateau on your burn-down charts.

Your defect rate is increasing. Often this is a corollary of the first symptom. As your product is burdened with unnecessary complexity due to rushed work or poor design decisions, mistakes are increasingly made. This same complexity means the mistakes are less likely to be detected until after they are in production.

Nobody wants to work on your product. This is often more noticeable in larger organisations, as the ambitious and talented developers look to younger, less tainted projects that are more fun to work on. Even in smaller companies, good developers have been known to resign over bad code.

How worried should I be about technical debt?

Just like financial debt, accruing technical debt can be a necessary and even appropriate strategy in product development, as long as it is in fact a strategy.

There's often a trade off that must be made between shipping a feature quickly and shipping one with minimal debt attached, and the right answer is not simply the latter. The right answer will depend on the business as well as the technical context. A prototype or early beta might reasonably trade high technical debt in return for quick implementation of new features. Getting these decisions right is what separates the wheat from the chaff of developers, and is absolutely crucial to the long term success of any software product and team.

Get it wrong and you'll end up with a product so loaded with technical debt it will be impossible to pay it off, or at the other end of the spectrum an even worse fate - a product so beautifully engineered that it ran out of cash before it could even be shipped! Great teams therefore don't seek simply to minimise technical debt so much as manage it for appropriate conditions.

But a great team can only do this under the right conditions. Crucially, the whole team should have an appreciation for the significance of what they are doing, and the impact of, say, dropping features or missing a deadline. Further, since technical debt, unlike financial debt, is difficult to quantify, the decisions in managing it must be made or at least informed by competent software engineers. Their opinions need to be understood and respected by anyone managing such a team. In a perfect world, every dev team would instinctively know when to pick the quick and dirty solution over the beautifully engineered one, or when to slow feature development to focus on fixing dirty legacy code. And in turn, they would have a management team that measures progress by more than just features.

This last point is difficult for non-technical managers to follow, especially if they are inexperienced. Features are tangible positive outcomes, with obvious benefits. Technical debt, by contrast, often isn't an obvious problem until it is too late to do anything about it.

If you are not a technical manager then there are a few things you can do.

1. Make sure your engineers are accountable to business goals not software goals. Make sure your devs understand why they are building the features they are. Include them as much as possible in feature design and prioritisation. In a really well functioning team, developers will understand the business and its customers so well that they can do these themselves.

2. Create a culture where it's okay for the guys on the ground to push back. While great developers recognize debt, nobody likes saying no to the boss. Good developers will argue that minimizing technical debt should be an eternal priority for the business, and (rightly) ensure they have the time and tools to do it. Make sure your team knows they are responsible not just for shipping features, but for informing you on the technical debt incurred in doing so. If they don't think something should be shipped, make sure a genuine conversation results around that.

3. Worry about it. It's a real problem, and your development team are probably facing it - even if they are not telling you. If they aren't facing it, maybe they need more work to do :) Management must genuinely understand and buy in to the problem before it can be solved.

4. If you must incur technical debt now, make sure you have a plan in place to pay it off at some suitably near point in the future, and stick to it.

Ultimately, it comes down to good management, and trust. Awesome projects are inevitably built on trust and a mutual understanding of the often competing priorities between a great development team, and the business they are working for.

Monday, March 29, 2010

A hunch on what's next for Facebook... the trinity of presence

Facebook is a great company, and one that's been kicking real goals. It's become the second most trafficked site in the world with a growing cachet of repeat users, and with only ~300 developers. Impressive stuff. Perhaps more importantly from an advertisers' standpoint Facebook have far richer contextual data and taste graphs on their user base than any of their competitors to use when targeting ads and interests.

And yet in spite of this amazing pole position and comparatively lean operation, the company was purportedly only cash flow positive in the back end of last year and speculative projections put FB's 2010 revenue stream somewhere in the billion dollar mark.

A billion dollars might sound like a lot, but it's a long way behind Yahoo ($6.5bn last year) despite Facebook being more highly trafficked, and lags well behind Google ($23.7bn). Now, one could argue that Facebook's recent success belies a fundamental problem: it just cannot pry cash from its users' hands as efficiently as the major search engines.

There's doubtless no single root cause of this problem, and thus no single solution. Some of the more superficial reasons will in due course be solved - lack of advertiser depth, lack of efficiency in the targeting algorithms (I'm still occasionally getting ads in Polish!), and lack of depth in its ad products.

But doubtless the biggest issue that FB has is simply that users do not want to transact on the site. You go to Facebook to catch up with the latest on your friend's wedding, you go to Google when you're researching which flatscreen to buy. You go to Google when you want to transact and are exhibiting purchase intent.

I think the theme at f8 this year will be squarely focused on solving this problem. If I were running Facebook, I'd be looking at building my own ad network, run in-house by Facebook, that could leverage the user profile data when a user is logged in to decide which advertisements to display on 3rd party sites in its network. This would allow FB to leverage its considerable contextual data and extract revenue at a time when users are showing transaction intent on a 3rd party site, without requiring the 3rd party to actually share any confidential user data.

With this in mind, it's interesting to see what's been leaked ahead of the upcoming f8 conference. Crucially, we've seen talk of the 'Open' social graph project, and more recently the Facebook bar. What these both point to is a desire to extend Facebook beyond the garden of Facebook.com, and in effect to provide a global rolodex/feed reader infrastructure for the rest of the web. It's a compelling proposition to publishers - include our handy tools on your site, and you'll see an instant uplift in impressions and distribution as your user-base distributes, discusses and shares your content via Facebook. It's like the 'Publish on Facebook' button on steroids.

But more importantly for Facebook, this gives Facebook licence to share your logged-in presence with that site. A recent addition to the Privacy Policy is particularly telling:

In order to provide you with useful social experiences off of Facebook, we occasionally need to provide General Information about you to pre-approved third party websites and applications that use Platform at the time you visit them (if you are still logged in to Facebook). ... The term General Information includes your and your friends’ names, profile pictures, gender, connections, and any content shared using the Everyone privacy setting.

What's the rationale for FB in doing this? Apart from digging its tentacles even deeper into the web, and making even more sites dependent upon a Facebook lifeline to engage their audiences, it will also extend FB's presence into a realm well beyond social networking or the conventional use cases of Facebook.

All that's missing is a way to make money off it. Perhaps an Ad network that can help media sites actually monetize the additional distribution will be the missing piece in the trinity? Roll on April 21.

Thursday, February 11, 2010

Google Wave is more interesting as infrastructure than as a product

For the last 10 years or so, one of the biggest engineering challenges for big web companies was scaling data processing and data storage as more and more users adopted their systems, spent more time on line, and generated more data. Truly linear scaling of the order of magnitude these guys have had to deal with was a hard problem to solve.

Google have been successful as a product company, as much as anything because they've continually been able to take innovations developed in response to the scalability challenges they faced with one problem, and re-apply those solutions to many other domains. MapReduce, the Google File System, and BigTable are all examples of this in action. The result is an army of developers who simply don't have to worry about many scalability problems - the underlying infrastructure 'just takes care of it'.

One of the newer challenges for these organisations, though, is what some are branding the 'Real Time Web'. Enabled mainly by the prevalence of broadband, RTW is an application model in which data emerges and changes in real time, live, as you're looking at your screen. Online chat, Facebook's live stream, PubSubHubbub and pretty much all of FriendFeed are all examples of this in action. The key really from an engineering perspective is that you PUSH data to a client's screen, rather than have the client frequently poll for changes.

More recently, the RTW has been applied to collaborative domains, where Google Wave is the archetypical example, although applications such as Google Docs, SubEthaEdit, and even something like Team Fortress would also qualify. In such applications, you effectively have a single document (such as a wave, or a spreadsheet) to which deltas are being constantly applied from many different participants. These deltas also need to be broadcast back out to other participants so their models remain consistent. These deltas might be adding characters, changing cell formulas, or whacking a guy from the blue team with a baseball bat - they are changes to the state of a common, shared object.

It's very cool, and enables all sorts of new interactions on the web, but it's not easy. Amongst other things, the I/O-contentious shared nothing architectures that power most of the web today just do not scale well for real-time applications. And when you move your application from simply having to pass chat messages around, to maintaining and keeping track of the state of a shared object, programmers need new software models to address a whole bunch of new thorny issues like concurrency and eventual consistency.
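In code, the difference between push and poll is mostly about who initiates the update. Here is a minimal in-process sketch using Node's built-in EventEmitter; the Feed class and the 'item' event name are made up for the example, and a real system would push over a network connection rather than within one process.

```javascript
const EventEmitter = require('events');

// A feed that pushes each new item to its subscribers the moment it is published,
// so subscribers never have to ask "anything new yet?" on a timer.
class Feed extends EventEmitter {
  publish(item) {
    this.emit('item', item);
  }
}

// Subscribe by registering a handler; returns the array it fills for convenience.
function collectItems(feed) {
  const received = [];
  feed.on('item', item => received.push(item));
  return received;
}
```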

Imagine a situation, for example, when someone in Ireland deletes a word from a document, while simultaneously someone from Africa editing the same document changes the spelling of the word, and because of network latency, neither's machine knows about each other's actions until a couple of seconds after they happened. It's tricky - and it gets trickier as you start to roll in the same linear scaling problems mentioned before.

Problems like these are why most 'Real Time' collaborative applications aren't really that. It's also why Google Docs sucks.

The good news is that the underlying (non UI) component of Wave - the Wave Federation Server - a piece of infrastructure - largely solves this problem quite elegantly. WFS is basically a way for lots of nodes to make changes to the same object at the same time, and for the whole thing to be kept coherent, even taking latency into account. And when you start thinking about waves as a data store you quickly realise you can store just about anything with them. Documents, spreadsheets, game maps - the works. And a neat side effect of the whole federation thing is that you've got a scalability model baked right in.

Perhaps this then will be the first winner of the inevitable clash between Google Wave and the rest of the Google Apps suite. Rather than Wave replacing Docs as a product, perhaps they'll simply incorporate Wave's state management model into Docs itself. And hell, while they're at it they could do the same to chat, GMail, Picasa, Blogger, Buzz... pretty much anything that they wanted to make truly collaborative and real time. It's not trivial to refactor the data and event models of applications used by billions of people all over the world, but then it is Google we're talking about. And the result - a whole new generation of app developers will be building real time shared state applications, and the underlying infrastructure will just take care of it. Plus a considerable loss of suckage in Docs.

The really cool thing though, is that unlike BigTable, GFS and MapReduce, the Wave Federation Server is open source. But rather than thinking about it as it's billed - a way to make your own Wave server, perhaps we need to think about it as part of the infrastructure stack for the next generation of real time web applications?