Friday, 17 May 2013

Book Review : Scalable Internet Architectures

About the author: Theo Schlossnagle is the founder and CEO of OmniTI.

The Book
This book aims to be a comprehensive, technology-stack-agnostic compendium of strategies and guidelines for achieving scalability objectives for internet applications. It is quite thin (262 pages) and came out in 2007, when the DevOps meme was not yet around in its current form.

Why I like this book
I like this book because it's oriented towards building a solid foundation on topics related to scaling. Compare this book with 'Web Operations: Keeping the Data On Time' (published in late 2010), and you'll find the former more grounded in fundamental principles, and the latter more oriented towards new trends. Now there's nothing wrong with the 'latest-trend' books, but it's better to read this kind first to get a good grounding.

Overview of Chapters:
The first three chapters cover basic principles, managing release cycles and operations teams. 

A big part of chapter 4 is devoted to explaining the difference between high availability and load balancing. There's no coverage of cloud-based options here – this is for you if you manage your own datacenters. Also, cloud-based options will invariably be tied to specific vendors. Different HA options are considered with almost academic rigour.

Chapter 5 examines load balancing options at different layers of the OSI network stack.

Chapter 6 is a mini-guide to building your own Content Delivery Network. From calculating expected traffic and estimating costs, through inter-node synchronization in a cluster, to choosing the OS and an HA network configuration – it's an interesting journey. It brings out the challenges that are invisible to most of us who push our static content to a third-party CDN and forget about it. There's a section on DNS issues as well, covering Anycast.

Chapter 7 covers five caching techniques. True to the general theme of the book, it does not talk about specific technologies but about theory that can be studied and applied to the problem at hand. An example of speeding up a news website is used to illustrate how to deploy and tune memcached (for that specific site's design).
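The book's memcached walkthrough is specific to that news site's design, but the underlying cache-aside idea it applies can be sketched generically. The class and method names below are hypothetical, invented for illustration, and a plain HashMap stands in for memcached:

```java
import java.util.HashMap;
import java.util.Map;

public class CacheAsideSketch {
    private final Map<String, String> cache = new HashMap<>(); // stand-in for memcached
    private int backendHits = 0;

    // Hypothetical slow backend query (e.g. the news site's article store).
    private String loadFromDatabase(String key) {
        backendHits++;
        return "article-body-for-" + key;
    }

    // Cache-aside: check the cache first, fall back to the backend on a miss,
    // then populate the cache so later reads are cheap.
    public String get(String key) {
        String v = cache.get(key);
        if (v == null) {
            v = loadFromDatabase(key);
            cache.put(key, v);
        }
        return v;
    }

    public static void main(String[] args) {
        CacheAsideSketch c = new CacheAsideSketch();
        c.get("front-page");
        c.get("front-page"); // second read served from the cache
        System.out.println("backend hits: " + c.backendHits);
    }
}
```

The second lookup never touches the backend, which is the whole point of the technique: tuning then becomes a question of hit rates, expiry, and invalidation for that specific site.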

In Chapter 8, we see an overview of distributed databases, including an overview of different database replication strategies. Managing, storing, aggregating and parsing logs is a challenge we all face – this is covered in Chapter 9. This chapter is dated now as there have been many advances on this topic.

Overall, a must-have for anybody who is interested in, or works on, scaling internet-facing applications.

Saturday, 4 May 2013

Thoughts on "A Note on Distributed Computing"

A Note on Distributed Computing by Jim Waldo, Geoff Wyant, Ann Wollrath, and Sam Kendall is a widely cited paper. I have been reading and trying to understand it for some time. It's available here -

The title of the paper is innocuous but it's much more than a "note". It analyzes the key differences between local and distributed computing, and explains why attempts to unify their programming models are misguided because of the fundamental differences underlying them.

The authors were at the erstwhile Sun Microsystems when they wrote it - it dates from 1994. Later, some of them were members of the Jini technology team and also wrote RMI; if you look at the Java RMI source code, you can see some of their names.

But in 1994, when this paper was written, Java had not yet emerged. There was no J2EE, and CORBA was still young.

I wish to share my thoughts after reading it, and the realization that the opinions expressed in it influenced the design of Java's RMI.


Briefly, unification = unification of the local and the distributed programming models. Note that we are talking about distributed object-oriented systems here.

The unification attempts assume that objects are essentially of a single top-level type (as in Java), which might span different address spaces on the same or different machines (like different JVMs on the same or different machines, in the case of Java), and that they can communicate in the same way irrespective of where they are located. In other words, location (same JVM versus another JVM in another country) is merely an implementation detail that can be abstracted away behind the interfaces used to communicate between two objects, without any side effects.

Such a (hypothetical) system would have the following characteristics

  1. Program functionality is not affected by the location of the object on which an operation has been invoked. Or viewing it from a slightly higher level, there is a single design as to how a system communicates irrespective of whether it's deployed in one address space or in multiple ones. 
  2. Maintenance and upgrades can be done to individual objects without affecting the rest of the system. 
  3. There is no need to handle failure and performance issues in the system design.
  4. Object interfaces are always the same regardless of the context (i.e. remote or local).
The authors contend that all these statements are flawed. I won't attempt to go into those details - the paper explains them well.
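The unified view the paper argues against can be sketched in Java (all names here are hypothetical, invented for illustration): a single interface is used identically whether the implementation is local or a stand-in for a remote proxy, hiding latency, shared-memory differences, and failure from the caller.

```java
public class UnifiedModelSketch {
    // The unification assumption: one interface, used identically whether the
    // implementation lives in this address space or on another machine.
    interface Account {
        long balance();
    }

    // Local implementation: a field read -- nanoseconds, cannot partially fail.
    static class LocalAccount implements Account {
        private final long cents;
        LocalAccount(long cents) { this.cents = cents; }
        public long balance() { return cents; }
    }

    // Hypothetical remote proxy: same interface, but a real call would cross
    // the network -- latency, partial failure, and the absence of shared
    // memory are all invisible to the caller.
    static class RemoteAccountProxy implements Account {
        public long balance() {
            // ... would marshal the request, send it, block, unmarshal the reply ...
            return 4200; // stand-in for a value fetched over the wire
        }
    }

    public static void main(String[] args) {
        Account a = new LocalAccount(4200);
        Account b = new RemoteAccountProxy();
        // The two call sites are indistinguishable -- exactly the property
        // the paper argues is an illusion.
        System.out.println(a.balance() + " " + b.balance());
    }
}
```

Nothing in the `Account` interface tells the caller that one call is a memory read and the other could time out, fail halfway, or take a million times longer.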

The paper then goes on to examine the four areas where local and distributed computing differ drastically:

    Latency
    Memory access
    Partial failure
    Concurrency

Those of us who have worked on distributed enterprise and internet software have come across these. These 4 differences cannot be papered over to present a 'unified' view of objects which lie on different machines.


Java RMI
If you look at RMI, you can see its design influenced by the recognition that the above 4 points are invalid.
  • A remote object's interface has to extend the java.rmi.Remote interface. "Remote" objects - objects that can be invoked from another JVM - are thus distinguished from local objects at the type level.
  • Remote (inter-JVM) method calls have to explicitly handle java.rmi.RemoteException, which is a checked exception, thus highlighting the fact that a distributed call is subject to modes of failure that are non-existent in a local call. In fact, RemoteException extends java.io.IOException, and its javadoc is explicit about network issues: it "is the common superclass for a number of communication-related exceptions that may occur during the execution of a remote method call".
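These two design decisions can be sketched side by side (the interface and class names below are made up, and a local stand-in replaces a real exported RMI stub):

```java
import java.rmi.Remote;
import java.rmi.RemoteException;

public class RmiSketch {
    // A local interface: no declared failure modes.
    interface LocalGreeter {
        String greet(String name);
    }

    // A remote interface: must extend java.rmi.Remote, and every method
    // must declare RemoteException -- the network can fail at any call.
    interface RemoteGreeter extends Remote {
        String greet(String name) throws RemoteException;
    }

    // A stand-in implementation; a real RMI stub would marshal the call
    // over the wire to another JVM.
    static class GreeterImpl implements RemoteGreeter {
        public String greet(String name) throws RemoteException {
            return "hello, " + name;
        }
    }

    public static void main(String[] args) {
        RemoteGreeter g = new GreeterImpl();
        try {
            System.out.println(g.greet("waldo"));
        } catch (RemoteException e) {
            // In a real deployment, partial failure surfaces here: the caller
            // cannot tell whether the remote method ran before the failure.
            System.out.println("remote call failed: " + e.getMessage());
        }
    }
}
```

The compiler forces the try/catch on the caller: the distributed nature of the call is part of the type signature, not an implementation detail hidden behind a unified interface.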
Let's look at #2 again. From the paper:
"As long as the interfaces between objects remain constant, the implementations of those objects can be altered at will".
Premonition of SOA, anyone? This concept would be familiar today to anybody who is acquainted with the fundamental principles of service oriented system design (replace 'object' with 'service'). But since these statements are challenged and refuted later in the paper, the question naturally arises - how come SOA is successful?  

SOA assumes that services are independent and distributed, and any invocation of a service assumes that there are failure modes which exist because the communication is distributed. This builds on the same concept as RMI's requirement to explicitly declare RemoteException when making a remote (distributed) call. The same concept is taken into consideration while writing any SOA system, which is another way of saying that the authors of the paper were correct.

Note: A short and readable summary of Java RMI is to be found in Jim Waldo's book Java: The Good Parts.