On HTTP Last-Modified and ETag

24 May 2005
16:39

Anne van Kesteren has a tendency to dive into some of the details of basic web technologies such as HTTP, HTML and CSS. Yesterday's post is about HTTP cache revalidation with conditional requests, using the headers Last-Modified and ETag, and the status code 304 Not Modified:

There are a few headers I want to introduce to you. Bear with me, as they are important. Last-Modified returns the date when the retrieved page was, well, last modified. This is a trivial thing, but when I first implemented it I forgot comments. And after that I forgot to check if the post was perhaps modified after the last comment was made. Don’t you do the same! The whole purpose behind writing things up is that others don’t make the same mistake. Or at least, that’s the purpose of this entry.

Besides Last-Modified we need ETag. For reasons not entirely clear to me by the way, but as it seems to work and was simple to implement I decided not to care. The day I will pay for that probably comes sooner than expected so I hope one of you can enlighten me on this. (Does it have to do with HTTP 1.0 versus 1.1?)

I'll try to explain here why Last-Modified is sometimes not enough, and why you may want to use ETags (short for entity tags).

Why do entity tags exist?

Because often, the representation of a resource in a dynamic web site depends on some factors in addition to the time when the content was last modified:

  • Whether the remote user is logged in somehow, and under which name. For example, do you have a little "Logged in as john" widget on your pages when the user is logged in, but "Register" and "Login" links when he's not? Last-Modified is not enough then.
  • Any preferences or settings that are not part of the URL, such as options stored in the user's session on the server. For example, do you let your users choose their time zone and then adjust the display of time data throughout the site to match that time zone? Last-Modified won't cut it here.

Basically, the Last-Modified header is based on the model of static documents, and isn't flexible enough to express the cache validation aspects required by many web applications.

You could hack the above facets into the Last-Modified header, basically by using the time of the most recent of the following events: the user logged in or out, the user changed a setting stored in his session, or the content was changed. But that would result in more cache misses than necessary, because any of those events invalidates the cached copy even when the representation ends up identical to one the client already has. And you'd need to store time stamps for those events. If you aren't doing so already, adding them is probably just silly, because you could've just used entity tags in the first place.
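To make the hack concrete, here is a minimal sketch of what it would look like in Python. The function name and its three arguments are made up for illustration; it assumes you already have a timezone-aware UTC time stamp for each kind of event, which is exactly the bookkeeping most applications don't have.

```python
from datetime import datetime, timezone
from email.utils import format_datetime


def last_modified_header(content_changed, login_changed, settings_changed):
    """Build a Last-Modified value from the most recent of three events.

    All three arguments are hypothetical, timezone-aware UTC datetimes
    that the application would have to record somewhere.
    """
    latest = max(content_changed, login_changed, settings_changed)
    # HTTP dates only have one-second resolution, so drop the microseconds.
    return format_datetime(latest.replace(microsecond=0), usegmt=True)


if __name__ == "__main__":
    content = datetime.fromtimestamp(1116859300, tz=timezone.utc)
    login = datetime(2005, 5, 24, 9, 0, 0, tzinfo=timezone.utc)
    settings = datetime(2005, 5, 20, 12, 0, 0, tzinfo=timezone.utc)
    print("Last-Modified:", last_modified_header(content, login, settings))
```

Note that a fresh login moves the date forward even if the page looks exactly as it did the last time that user was logged in, which is where the unnecessary cache misses come from.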

A real world example

Some time ago, I added support for entity tags to Trac. To see this in action, have a look at Changeset 1715. When the page is first requested, the HTTP headers will include the following line:

   ETag: W/"cmlenz/1116859300/inline-U3"

The entity tag here consists of three parts, separated by slashes:

cmlenz
The login name of the remote user (you'll probably see "anonymous" here).
1116859300
That's the time of the changeset, in seconds since 1970. No need to format as a date here, as the tag is opaque to the user agent.
inline-U3
The diff is displayed "inline" as opposed to "side by side" (a bit of a misnomer, but I haven't come up with a better term here). And it should display 3 lines of context around every change. These settings are stored in the user's session on the server.
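Putting such a tag together is just string formatting. The following is a rough sketch, not the actual Trac code: the function name and parameters are invented for this post, and it writes the weak marker as W/ in front of the quoted string, which is the form the HTTP specification defines.

```python
def make_etag(user, changeset_time, diff_style, context_lines, extra_flags=()):
    """Build a weak entity tag from the three parts described above.

    Hypothetical helper: `user` is the login name (or None when not
    logged in), `changeset_time` is seconds since 1970, and the rest
    are the diff options kept in the server-side session.
    """
    options = "-".join([diff_style, "U%d" % context_lines, *extra_flags])
    # The tag is opaque to the user agent, so any unambiguous format will do.
    return 'W/"%s/%d/%s"' % (user or "anonymous", changeset_time, options)


if __name__ == "__main__":
    print("ETag:", make_etag("cmlenz", 1116859300, "inline", 3))
    print("ETag:", make_etag(None, 1116859300, "inline", 5, ("B", "b")))
```

Anything that influences the rendered page has to end up somewhere in the tag; if it doesn't, two different representations will share a tag and the client will be shown a stale one.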

The tag is prefixed with "W/" because it's a weak validator: When returning a 304 Not Modified response on a request with an If-None-Match header referring to a weak entity tag, you don't guarantee that the resource hasn't changed since the cached version was retrieved, but you do say that no substantial changes have been made. In this case, we're using a weak validator because some of the text and titles of wiki links might have changed due to changes somewhere else in the system (a ticket was closed, a new wiki page created, etc).
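On the server side, revalidation boils down to comparing the tag the client sends in If-None-Match with the tag of the current representation. Here is a simplified sketch of that check, again with made-up function names rather than anything taken from Trac. It ignores the weak marker when comparing, and it splits the header on commas, which is good enough as long as the tags themselves contain no commas.

```python
def strip_weak(tag):
    """Drop a leading W/ so weak and strong forms of a tag compare equal."""
    tag = tag.strip()
    return tag[2:] if tag.startswith("W/") else tag


def is_not_modified(if_none_match, current_etag):
    """Return True if the client's cached copy can still be used."""
    if if_none_match is None:
        return False
    if if_none_match.strip() == "*":
        return True
    # Simplification: assumes no commas inside the quoted tags.
    candidates = [strip_weak(t) for t in if_none_match.split(",")]
    return strip_weak(current_etag) in candidates


def respond(request_headers, current_etag, render_body):
    """Return (status, headers, body); the body is only rendered when needed."""
    if is_not_modified(request_headers.get("If-None-Match"), current_etag):
        return 304, {"ETag": current_etag}, b""
    return 200, {"ETag": current_etag}, render_body()


if __name__ == "__main__":
    tag = 'W/"cmlenz/1116859300/inline-U3"'
    print(respond({}, tag, lambda: b"<html>...</html>")[0])
    print(respond({"If-None-Match": tag}, tag, lambda: b"<html>...</html>")[0])
```

The nice part is that the expensive rendering is skipped entirely for the 304 case; all you need to compute up front is the tag.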

Now if you change any of the diff options on the changeset page, you'll get a resource with a different entity tag. If I log out and change the diff options to show 5 lines of context and ignore blank lines and white space changes, the entity tag looks like this:

   ETag: W/"anonymous/1116859300/inline-U5-B-b"

It should have become clear that this is much easier than trying to base cache revalidation solely on a single date. And it should also have become clear that if you don't alter the representation of a page based on login state and/or data stored in server-side sessions, you don't need entity tags. Just use Last-Modified.

What about the Vary header?

If you don't use server-side sessions but do have resources changing based on the login status of the user, theory suggests you could use the Vary header. Assuming you're using a cookie for user authorization, you'd add the following header to the response:

   Vary: Cookie

If you're using HTTP authentication (Basic or Digest), you'd instead add:

   Vary: Authorization

This should tell the user agent that if the Cookie or Authorization header changes, it must not use a cached response based on a different value for that header.
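To illustrate the idea, this is roughly how a cache that honours Vary could key its entries. It is a toy sketch under my own assumptions (the function and its arguments are hypothetical, and a real cache has many more rules to follow), but it shows why a changed Cookie or Authorization header should lead to a cache miss.

```python
def cache_key(url, request_headers, vary):
    """Build a cache key from the URL plus the request headers named in Vary.

    `vary` is the value of the response's Vary header, for example
    "Cookie" or "Authorization"; header names are compared
    case-insensitively.
    """
    names = sorted(n.strip().lower() for n in vary.split(",") if n.strip())
    lowered = {k.lower(): v for k, v in request_headers.items()}
    # A missing header is recorded as None, which differs from any real value.
    return (url, tuple((n, lowered.get(n)) for n in names))


if __name__ == "__main__":
    url = "/changeset/1715"
    logged_in = cache_key(url, {"Cookie": "auth=cmlenz"}, "Cookie")
    logged_out = cache_key(url, {}, "Cookie")
    print(logged_in == logged_out)
```

With a key like this, the response cached while logged in is simply never found once the cookie is gone.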

Very elegant and straightforward, right? Unfortunately, I don't know of any browser that actually supports this. (See for example Mozilla bug #94123.)