Saturday, 23 November 2013

Graphite Tip: Disabling data averaging while viewing graphs

Graphite, the superb graphing tool, has gained a lot of popularity lately, and with good reason. It's flexible, fairly easy to set up, very easy to use, and has a thriving community with plugins for many monitoring systems. It can store any kind of numeric data over time.

By default, Graphite stores data in Whisper, a fixed-size database with configurable retention periods for various resolutions. What this means is that you can store higher resolution data (say, one data point every 5 seconds) for a shorter period of time (e.g. 1 month) and then store the same data at a lower resolution (say, one data point every hour) beyond that time period. The data will be consolidated based on the method you configure (e.g. sum or average). This behaviour of Graphite is well known.
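To make the consolidation concrete, here is a minimal sketch in Python. It is a simplified model and not Whisper's own code: the function name downsample and the sample readings are made up for illustration. It collapses twelve 5-second readings into a single one-minute point, once by average and once by sum.

```python
from statistics import mean

def downsample(points, factor, method="average"):
    """Collapse every `factor` consecutive points into one archived point.

    Hypothetical helper: a simplified model of retention consolidation,
    not Whisper's actual implementation.
    """
    aggregate = {"average": mean, "sum": sum}[method]
    return [aggregate(points[i:i + factor]) for i in range(0, len(points), factor)]

# Twelve 5-second readings (one minute of data); the values are invented.
raw = [10, 12, 11, 13, 200, 12, 10, 11, 12, 13, 11, 10]

per_minute_avg = downsample(raw, 12, "average")
per_minute_sum = downsample(raw, 12, "sum")
```

Note how the single reading of 200 disappears into a much smaller averaged value, which is why the choice of method matters for the kind of data you store.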

What is not so well known is that Graphite also consolidates data when you view the graphs. This happens when the number of data points in the requested time span is more than the number of pixels available across the width of the graph. In such cases, the Graphite graph renderer consolidates the data points that share a pixel into one point using an aggregation function. The default aggregation function is average, so you might end up seeing smaller peak values than you expect.
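The effect is easy to reproduce with a toy model. The sketch below is an assumption-laden simplification, not Graphite's rendering code: consolidate_for_width is a hypothetical function that averages points into one bucket per pixel column, and the series is invented.

```python
def consolidate_for_width(values, width_px):
    """Average values into at most `width_px` buckets, one per pixel column.

    Hypothetical model of render-time consolidation; Graphite's real
    renderer computes its buckets differently.
    """
    if len(values) <= width_px:
        return list(values)  # enough pixels, nothing to consolidate
    per_bucket = -(-len(values) // width_px)  # ceiling division
    buckets = [values[i:i + per_bucket] for i in range(0, len(values), per_bucket)]
    return [sum(bucket) / len(bucket) for bucket in buckets]

# 1000 points with a flat baseline of 20 and a single spike of 205.
series = [20] * 1000
series[500] = 205

narrow = consolidate_for_width(series, 100)   # 10 points share each pixel
wide = consolidate_for_width(series, 1000)    # one point per pixel
```

In the narrow rendering the spike is averaged with nine baseline points, so the highest value drawn is 38.5. In the wide rendering the spike of 205 survives untouched.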

Here's an example of a graph where there are more data points than pixels. The actual peak value was a little over 200, but you cannot see it here due to averaging.

Here is the same graph (same data for the time span) where the image width has been increased* (== more pixels). You can see the peak is almost 200.

Sometimes this behaviour may not be what you want. To see the "actual" data points irrespective of the size of your image, Graphite's URL API provides a property called minXStep. To use it, simply add the property as a request parameter (with value 0) to the graph URL. From the documentation:
To disable render-time point consolidation entirely, set this to 0 though note that series with more points than there are pixels in the graph area (e.g. a few month’s worth of per-minute data) will look very ‘smooshed’ as there will be a good deal of line overlap.
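As a rough illustration, a graph URL with these parameters can be assembled like this in Python. The host name and metric name are placeholders I made up, and render_url is a hypothetical helper; target, width and minXStep are the request parameters discussed in this post.

```python
from urllib.parse import urlencode

def render_url(base_url, target, **params):
    """Build a Graphite render URL (hypothetical helper for illustration)."""
    query = urlencode({"target": target, **params})
    return f"{base_url}/render?{query}"

# Placeholder host and metric; minXStep=0 disables render-time consolidation.
url = render_url(
    "http://graphite.example.com",
    "servers.web01.load",
    width=800,
    minXStep=0,
)
```

Dropping minXStep from the call gives back the default, consolidated rendering, which makes it easy to compare the two images side by side.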

The same graph with minXStep=0 now looks like this:

A bit "smooshed" but with the exact data that was collected.

* Pass width=x as a request parameter to the graph URL, x in pixels

Monday, 30 September 2013

Revoking private key access to EC2 instances, and other random tips

Consider the following scenario:

  • You have many EC2 instances running production code
  • Access to those instances is using a passphrase-protected key
  • A member of your operations team who has access to the key leaves so you have to change the key. Or, you need to change the existing key as a matter of some internal security policy.

How do you do it?

  • Generate a new keypair
  • Add the public key to the EC2 instances' <login user's home dir>/.ssh/authorized_keys
  • Remove the old public key from the same authorized_keys file
  • Done. The old key is useless now.
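  • Note that this is not actually revocation in the strict sense: the old key still exists, it is simply no longer authorized on these instances.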
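The add-and-remove steps above can be sketched as a small Python function that rewrites the contents of an authorized_keys file. This is only a sketch under stated assumptions: rotate_key is a hypothetical helper, it assumes plain "type blob comment" entries without an options prefix, and the key strings are placeholders rather than real keys.

```python
def key_blob(line):
    """Return the base64 key field of a '<type> <blob> [comment]' entry."""
    parts = line.split()
    return parts[1] if len(parts) >= 2 else None

def rotate_key(authorized_keys_text, old_public_key, new_public_key):
    """Return authorized_keys content with the old key dropped and the new one added.

    Hypothetical helper: assumes entries carry no options prefix.
    """
    old_blob = key_blob(old_public_key)
    new_blob = key_blob(new_public_key)
    kept = [
        line for line in authorized_keys_text.splitlines()
        if key_blob(line) != old_blob  # drop every entry for the old key
    ]
    if all(key_blob(line) != new_blob for line in kept):
        kept.append(new_public_key.strip())  # add the new key only once
    return "\n".join(kept) + "\n"

# Placeholder entries, not real keys.
current = "ssh-rsa AAAAOLD alice@laptop\nssh-rsa AAAAKEEP deploy@ci\n"
updated = rotate_key(
    current,
    "ssh-rsa AAAAOLD alice@laptop",
    "ssh-rsa AAAANEW ops@2013",
)
```

Comparing on the key field rather than the whole line means the old key is removed even if its comment was edited on some instances. In practice you would run something like this on every instance and confirm that the new key works before closing your existing session.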

Some things to note about AWS keypairs
  • EC2 metadata for the instance(s) will continue to show the original keypair name it was created with, whatever keys you add or remove from authorized_keys. The original public key may not even exist on the instance anymore, if you have gone through the steps above, but the metadata will still show it. This is because AWS has no way of knowing that you changed the authorized_keys file.
  • You can upload keys generated by yourself to the AWS console and they will be available for use while launching EC2 instances. Your generated keys have to be RSA keys of 1024, 2048 or 4096 bits.
  • AWS keypairs are said to be confined to a single region. This is true only if you consider the default state of affairs. You can get around it.
    • For keys that you generate, you can import them to all the regions you want using the AWS console or the CLI tools.
    • For keys that AWS generates, you can take the public key from an EC2 instance launched with that key, and import that in a similar manner to all the regions you want. The private key is available for download when you generate the key.

Friday, 27 September 2013

Private Cloud Options with Amazon Web Services - Part 1

Amazon Web Services is the largest IaaS provider, according to this Gartner report, in terms of compute capacity. AWS also has a wider geographical presence than other similar companies.  

AWS offers an option to have a private cloud inside their public cloud. You can run this as a small personal cloud, or use one of Amazon's connectivity offerings to connect it securely to your existing infrastructure. This is an overview of the private cloud options with AWS, followed by an overview of the various connectivity options.

Private Cloud Options
When you launch a regular EC2 instance, it has a public IP address. It is always reachable from the public internet whether you want it or not. You can configure the instance's AWS security group (the inbuilt firewall) to allow access to specific ports only, but this may not serve your security needs. You might want traffic to flow only between your instances and not from the internet.

The obvious way to do this is to not have IP addresses which are reachable from the internet, i.e., use private IP addresses, which is exactly what a VPC (Virtual Private Cloud) offers.

IP Ranges
A VPC is like a private network inside Amazon's cloud where you can create smaller subnets and instances inside them. While creating a VPC, you'll need to define the range of IP addresses that the VPC will cover.

The basic unit of a VPC is a subnet - a logical network where you can create instances, define the range of private IP addresses that the instances inside it will have and create routing tables to define how traffic is routed to and from the subnet.
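To get a feel for the address planning involved, here is a small sketch using Python's standard ipaddress module. The 10.0.0.0/16 range and the /24 subnet size are example values I picked, not AWS defaults or recommendations.

```python
import ipaddress

# Example VPC range; any private range of a suitable size would do.
vpc = ipaddress.ip_network("10.0.0.0/16")

# Carve the VPC range into /24 subnets and pick the first two.
subnets = list(vpc.subnets(new_prefix=24))
public_subnet, private_subnet = subnets[0], subnets[1]

# Every subnet must fall inside the VPC range and must not overlap another.
inside = private_subnet.subnet_of(vpc)
overlapping = public_subnet.overlaps(private_subnet)
```

Here public_subnet is 10.0.0.0/24 and private_subnet is 10.0.1.0/24, two of the 256 possible /24 subnets inside the VPC range.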

Kinds of Subnets
There are two kinds of subnets you can create inside a VPC:
Private: EC2 instances created inside it cannot talk to the internet and vice versa.
Public: EC2 instances created inside it can access the internet and can also be made accessible from the internet.

Private and public are just names and not inbuilt properties. What actually makes them "private" and "public" are the routing tables you create and assign to the subnets. So you must first create the subnets, then create the tables, assign them to the subnets and finally give them descriptive names. If you use the VPC wizard, it will do this for you. You can create multiple subnets of each type.
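The point that routing decides whether a subnet is "public" can be shown with a toy model in Python. The dictionary layout below is my own invention for illustration and not the shape of any AWS API response, and is_public is a hypothetical helper. It treats a subnet as public when its routing table sends the default route (0.0.0.0/0) to an internet gateway, whose IDs start with "igw-".

```python
def is_public(route_table):
    """True when the default route targets an internet gateway.

    Hypothetical helper over a made-up route table layout.
    """
    return any(
        route["destination"] == "0.0.0.0/0" and route["target"].startswith("igw-")
        for route in route_table
    )

# Routing table for a "public" subnet: local traffic plus a default route
# to an internet gateway (placeholder ID).
public_routes = [
    {"destination": "10.0.0.0/16", "target": "local"},
    {"destination": "0.0.0.0/0", "target": "igw-example"},
]

# Routing table for a "private" subnet: local traffic only.
private_routes = [
    {"destination": "10.0.0.0/16", "target": "local"},
]
```

Swapping the two routing tables would swap which subnet is public, whatever the subnets happen to be named.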

Communication between a public subnet and the internet
If you want your instances to access the internet, you have the option of adding an "internet gateway". The internet gateway here is an AWS abstraction: it is attached to the VPC, and the routing table of your public subnet (or subnets) sends internet-bound traffic to it. Once the gateway is in place, you must assign an elastic IP to an instance inside that subnet. This instance is the one that will be able to communicate with the outside world.

VPC places a limit on the number of elastic IPs (5). If you have many instances which need to access the internet, you would put all of them behind a single instance with an EIP instead of assigning each an EIP, and use NAT to access the internet from the "hidden" instances.

Communication between a private and a public subnet
Setting up communication between a private and a public subnet is a straightforward configuration in the routing table.

A typical example of using both private and public subnets in a VPC is from the AWS documentation:

Here, the database servers are extra-secure inside a private subnet, while the webservers are in the public subnet, as they have to serve traffic to end users. 

The "private" nature of a VPC is not limited to the network alone. Inside a VPC, you have the option of launching a regular EC2 instance, which is a virtual machine on a host shared with other guest VMs. You can also choose to launch a dedicated instance - which is a truly dedicated machine used only by your instance, giving you isolation at the hardware level as well. Costs are slightly higher for dedicated instances.

A VPC lets you set up your own private cloud with isolation at the hardware and the network levels. I'll explore the various connectivity options between VPCs and your own datacenter in the next post.

Friday, 17 May 2013

Book Review : The Art of Scalability

About the author: Theo Schlossnagle is the founder and CEO of OmniTI.

The Book
This book aims to be a comprehensive, technology stack-agnostic compendium of strategies and guidelines for achieving scalability objectives for internet applications. It is quite thin (262 pages) and came out in 2007, when the DevOps meme was not around in its current form.

Why I like this book
I like this book because it's oriented towards building a solid foundation on topics related to scaling. Compare this book with 'Web Operations: Keeping the Data on Time' (published in late 2010), and you'll find the book under discussion to be more grounded in fundamental principles, and the latter more oriented towards new trends. Now there's nothing wrong with the 'latest-trend' books, but it's better if one reads this kind first to get a good grounding.

Overview of Chapters:
The first three chapters cover basic principles, managing release cycles and operations teams. 

A big part of chapter 4 is devoted to explaining the difference between high availability and load balancing. There's no coverage of Cloud based options here – this is for you if you manage your own datacenters. Also, cloud based options will invariably be tied to specific vendors. Different HA options are considered with almost academic rigour. 

Chapter 5 examines load balancing options at different layers of the OSI network stack.

Chapter 6 is a mini-guide to building your own Content Delivery Network. From calculating your expected traffic, cost estimates, inter-node synchronization in a cluster to choosing the OS and having an HA network configuration – it's an interesting journey. It brings out the challenges which are invisible to most of us who push our static content to a third party CDN and forget about it. There's a section on DNS issues as well covering Anycast.

Chapter 7 covers five caching techniques. True to the general theme of the book, it does not talk about specific technologies but about theory that can be studied and applied to the problem at hand. An example of speeding up a news website is used to illustrate how to deploy and tune memcached (for that specific site's design).

In Chapter 8, we see an overview of distributed databases, including an overview of different database replication strategies. Managing, storing, aggregating and parsing logs is a challenge we all face – this is covered in Chapter 9. This chapter is dated now as there have been many advances on this topic.

Overall, a must-have for anybody who is interested in, or works on, scaling internet-facing applications.

Saturday, 4 May 2013

Thoughts on "A Note on Distributed Computing"

A Note on Distributed Computing by Jim Waldo, Geoff Wyant, Ann Wollrath, and Sam Kendall is a widely cited paper. I have been reading and trying to understand it for some time. It's available here -

The title of the paper is innocuous but it's much more than a "note". It analyzes the key differences between local and distributed computing, and explains why attempts to unify their programming models are misguided because of the fundamental differences underlying them.

The authors were at Sun Microsystems when they wrote it in 1994. Some of them were later members of the Jini technology team and also wrote RMI; if you look at the Java RMI source code, you can see some of their names.

But in 1994 when this paper was written, Java had not emerged yet. There was no J2EE and CORBA was still young.

I wish to share my thoughts after reading it, and the realization that the opinions expressed in it influenced the design of Java's RMI.


Briefly, unification = unification of the local and the distributed programming models. Note that we are talking about distributed object oriented systems here.

The unification attempts assume that objects are essentially of a single top level type (like in Java), which might span different address spaces on the same or different machines (like different JVMs on the same or different machines in the case of Java) and they can communicate in the same way irrespective of where they are located. In other words, location (same JVM versus another JVM in another country) is merely an implementation detail that can be abstracted away behind the interfaces used to communicate between two objects without any side effects.

Such a (hypothetical) system would have the following characteristics

  1. Program functionality is not affected by the location of the object on which an operation has been invoked. Or viewing it from a slightly higher level, there is a single design as to how a system communicates irrespective of whether it's deployed in one address space or in multiple ones. 
  2. Maintenance and upgrades can be done to individual objects without affecting the rest of the system. 
  3. There is no need to handle failure and performance issues in the system design.
  4. Object interfaces are always the same regardless of the context (i.e. remote or local)
The authors contend that all these statements are flawed. I'll not attempt to go into those details - the paper explains them well.

The paper then goes on to examine the 4 areas where local and distributed computing differ drastically:

    Latency
    Memory Access
    Partial Failure
    Concurrency

Those of us who have worked on distributed enterprise and internet software have come across these. These 4 differences cannot be papered over to present a 'unified' view of objects which lie on different machines.


Java RMI
If you look at RMI, you can see its design influenced by the assumption that the above 4 points are invalid.
  • "Remote" objects have to implement an interface that extends java.rmi.Remote. "Remote" objects - objects whose methods can be invoked from another JVM - are different from local objects.
  • Remote (inter-JVM) method calls have to explicitly handle java.rmi.RemoteException, which is a checked exception, thus highlighting the fact that a distributed call is subject to modes of failure that are non-existent in a local call. In fact, it extends java.io.IOException, and the javadoc is explicit about network issues: it "is the common superclass for a number of communication-related exceptions that may occur during the execution of a remote method call".
Let's look at #2 again. From the paper:
"As long as the interfaces between objects remain constant, the implementations of those objects can be altered at will".
Premonition of SOA, anyone? This concept would be familiar today to anybody who is acquainted with the fundamental principles of service oriented system design (replace 'object' with 'service'). But since these statements are challenged and refuted later in the paper, the question naturally arises - how come SOA is successful?  

SOA assumes that services are independent and distributed, and any invocation of a service assumes failure modes that exist because of the distributed nature of the communication. This builds on the same concept as RMI's requirement to explicitly handle RemoteException when making a remote (distributed) call. The same concept is taken into consideration while writing any SOA system, which is another way of saying that the authors of the paper were correct.
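The idea of making remote failure explicit is not tied to Java. Below is a minimal Python sketch of it, with every name invented for illustration (RemoteCallError, remote_price, flaky_transport and so on). Python has no checked exceptions, so the separate exception type is only a convention here, whereas RMI's RemoteException is enforced by the compiler.

```python
class RemoteCallError(Exception):
    """Raised when a call that crosses a process or machine boundary fails."""

def local_price(item):
    # A local call: it either returns or fails together with the caller.
    return {"book": 12.5}[item]

def remote_price(item, transport):
    """The same lookup through a transport that can fail independently."""
    try:
        return transport(item)
    except (ConnectionError, TimeoutError) as exc:
        # Surface the distributed failure mode as a distinct exception type.
        raise RemoteCallError(f"price lookup for {item!r} failed") from exc

def flaky_transport(item):
    # Stand-in for a network call whose peer never answers.
    raise TimeoutError("no reply from price service")

def price_or_default(item, transport, default=None):
    # The caller has to decide what a remote failure means for it.
    try:
        return remote_price(item, transport)
    except RemoteCallError:
        return default

unavailable = price_or_default("book", flaky_transport)
available = price_or_default("book", local_price)
```

The caller of remote_price cannot pretend the call is local: it has to decide what to do when the other side does not answer, which is exactly the distinction the paper insists on.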

Note: A short and readable summary of Java RMI is to be found in Jim Waldo's book Java: The Good Parts.

Sunday, 28 April 2013

"Upgrading" to Fedora 18

I have been running Fedora 16 on my work laptop. It was EOL'ed early this year, which means no more updates, including for things like Firefox. There was no option but to upgrade. I had two choices - opt for something with long term support like Ubuntu LTS or try the new Fedora (and try the newer one in 6 months).

I opted for the latter, since I've been using Fedora for a while, hoping that I would not have to do a Windows-style post installation cleanup. 

No such luck. The things that were broken, still are.

Some highlights from the experience:

Nepomuk, Akonadi
Disable them? Sure. They are disabled in System Settings, but insist on starting up anyways. Uninstall them? Not possible. They're so tightly coupled with KDE that uninstalling them uninstalls all of KDE. The developers don't seem to be listening to the users here. 

Disk space
My installation ran out of disk space 20 minutes after I rebooted post-installation. There seemed to be some continuous process in the background which was eating up space. Some investigation identified the culprit.
[talonx@****** apps]$ pwd
[talonx@****** apps]$ find . -type f -size +50000k -exec ls -lh {} \; | awk '{ print $9 ": " $5 }'
./nepomuk/repository/main/data/virtuosobackend/soprano-virtuoso.log: 151G
./nepomuk/repository/main/data/virtuosobackend/soprano-virtuoso.db: 68M
Yes, it created a log file of 151G within 20 minutes. What kind of application does that? What about basic stuff like log file rotation?

Another of those which does not go away, and causes endless irritation.

Fedora is not something that I would prescribe to new Linux users. Others have pointed out that its instability and some features are probably the result of staying at the cutting edge. Granting that, it remains difficult to get it to a state where even people like software developers can use it to be productive.