Server Log Analysis Lessons

by Philip Greenspun for the Web Tools Review

I remember when my Web site was new, back in the winter of '93-94. I'd just put up Travels with Samantha, and every day or two I'd load the whole HTTPD access log into Emacs and lovingly read through the latest lines, inferring from the host name which of my friends it was, tracing users' paths through the book, seeing where they gave up for the day.

Now the only time I look at my server log is when my Web server is melting down. When you are getting 25 hits/second from 20 or 30 simultaneous users, it is pretty difficult to do "Emacs thread analysis".

There must be a happy middle ground.

Why Analyze?

The best thing you can do with a log analysis report is find out which of your URLs are barfing up errors. If you have hundreds of thousands of hits/day, casual inspection of your logs isn't going to reveal the 404 File Not Found errors that make users miserable. This is especially true when you have Web servers that log errors and successful hits into the same file.
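
You don't need a fancy package to find them, either. Here's a minimal sketch in Python, assuming the common log format; the log path is made up, so point it at your own access log:

    # find_404s.py -- which URLs are returning 404 Not Found?
    import re
    from collections import Counter

    # common log format request + status: ... "GET /url HTTP/1.0" 404 ...
    REQUEST = re.compile(r'"[A-Z]+ (\S+) [^"]*" (\d{3}) ')

    missing = Counter()
    with open("/usr/local/etc/httpd/logs/access_log") as log:   # made-up path
        for line in log:
            m = REQUEST.search(line)
            if m and m.group(2) == "404":
                missing[m.group(1)] += 1

    for url, n in missing.most_common(20):
        print(n, url)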

You can also refine content. By poring over my logs, I discovered that half of my visitors were just looking at the slide show for Travels with Samantha. Did that mean they thought my writing sucked? Well, maybe, but it actually looked like my talents as a hypertext designer were lame. The slide show was the very first link on the page. Users had to scroll way down past a bunch of photos to get to the "Chapter 1" link. I reshuffled the links a bit and traffic on the slide show fell to 10%.
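
If you want to check your own site for this kind of problem, the same sort of script will tell you what share of total hits each URL gets (again, the log file name is made up):

    # url_share.py -- what fraction of total traffic does each URL get?
    import re
    from collections import Counter

    REQUEST = re.compile(r'"[A-Z]+ (\S+) [^"]*" \d{3}')

    hits = Counter()
    with open("access_log") as log:
        for line in log:
            m = REQUEST.search(line)
            if m:
                hits[m.group(1)] += 1

    total = sum(hits.values()) or 1   # avoid dividing by zero on an empty log
    for url, n in hits.most_common(10):
        print("%5.1f%%  %s" % (100.0 * n / total, url))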

Finally, once your site gets sufficiently popular, you will probably turn off hostname lookup (I did it when I crossed the 200,000 hit/day mark). The Unix name daemon (named) is slow and sometimes causes odd server hangs. Anyway, after you turn lookup off, your log will fill up with plain IP addresses. You'll probably want to run a log analyzer on a separate machine to at least figure out whether your users are foreign, domestic, internal, or what.
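
A sketch of what that after-the-fact lookup might look like, assuming you've already extracted the distinct IP addresses into a file (the file name is made up):

    # classify_hosts.py -- resolve raw IP addresses after the fact, on a
    # machine that isn't serving pages, and bucket them by top-level domain
    import socket
    from collections import Counter

    domains = Counter()
    with open("unique_ips.txt") as f:          # one dotted-quad address per line
        for line in f:
            ip = line.strip()
            try:
                host = socket.gethostbyaddr(ip)[0]
                tld = host.rsplit(".", 1)[-1].lower()   # "com", "edu", "de", ...
            except OSError:
                tld = "(unresolved)"
            domains[tld] += 1

    for tld, n in domains.most_common():
        print(n, tld)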

For me, log analyzers break down along two dimensions: whether or not source code is included, and whether the program builds on a proven substrate or stands alone.

Whether or not the source code is available is extremely important in a new field like Web service. You don't really need the source code to a high-end WYSIWYG word processor, because word processors have been around since 1975 and by now the authors have figured out what users require. However, the Web is new, and authors can't anticipate your needs or the evolution of Web standards. If you don't have the source code, you are probably going to be screwed. Generally the free public-domain packages come with source code and the commercial packages don't.

A substrate-based log analyzer makes use of a well-known and proven system to do storage management and sometimes more; popular substrates for log analyzers include perl and relational databases. A standalone log analyzer is one that tries to do everything by itself. Usually these programs are written in primitive programming languages like C and do storage management in an ad hoc manner. This leads to complex source code that you might not want to tangle with, and ultimately to core dumps on logs of moderate size.

Here's my experience with a few programs...

wwwstat

This is an ancient public-domain perl script, available for download and editing. I find that it doesn't work very well on my sites for the following reasons:

- it doesn't report any of the information in the referer and user-agent logs
- it won't let me lump related URLs together under a single report line
- it reports activity by domain but won't give me statistics on distinct hosts
- it won't do DNS lookups on the raw IP addresses left behind once you turn hostname lookup off

On the plus side, it is robust: I've fed wwwstat 50 MB and larger log files without once seeing it fail. There is a companion tool called gwstat that makes pretty graphs from the wwwstat output. It is free, but you have to be something of a Unix wizard to make it work (I cried like a baby and got my friend Noah to install it).

There are a lot of newer tools than wwwstat in the big Yahoo list. A promising candidate is this Python program from Australia, but I haven't personally tried it. A lot of my readers seem to like analog (referenced from Yahoo), but again I haven't tried it.

WebReporter

This is a standalone commercial product, written in C and sold for $500 by OpenMarket. It took me two solid days of reading the manual and playing around with the tail of a log file to figure out how to tell WebReporter 1.0 what I wanted. For a brief time, I was in love with the program. It would let me lump certain URLs together and print nice reports saying "Travels with Samantha cover page". I fell out of love when I realized that the program dumped core whenever I fed it a full-sized log file. When I complained about the core dumps, they said "oh yes, we might have a fix for that in the next release. Just wait four months." So I waited and let some friends at a commercial site beta test the new release. "How do you like it?" I asked. Their response was simple: "It dumps core."
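
The lumping feature, at least, is easy to approximate without a $500 standalone product. A sketch, with made-up aliases, that folds related URLs into one report line using wildcard patterns:

    # lump_urls.py -- fold families of URLs into single report lines
    import fnmatch

    # made-up aliases for illustration; first matching pattern wins
    ALIASES = [
        ("/samantha/index.html", "Travels with Samantha cover page"),
        ("/samantha/slides/*",   "Travels with Samantha slide show"),
        ("/samantha/*",          "Travels with Samantha chapters"),
    ]

    def report_line(url):
        for pattern, name in ALIASES:
            if fnmatch.fnmatch(url, pattern):
                return name
        return url   # anything unaliased reports under its own name

    # e.g., report_line("/samantha/slides/12.html")
    #       => "Travels with Samantha slide show"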

I've been wanting to try the new release myself, but OpenMarket is so far ahead of the pack in making Internet commerce a reality that they leave old-timers like me in the dust: I never managed to clear the obstacles between me and a downloadable copy of the new version.

Of course, I went through a similar experience getting the first version of WebReporter downloaded and installed. So for the moment I've given up on the product. My main reservation about WebReporter in particular is that it seems to require 50 or 60 phone calls/year plus money to keep it current. I've managed to serve about 700 million hits with server programs from Netscape and NaviSoft (now AOL) without really ever having to call either company.

My experience with WebReporter has made me wary of standalone commercial products in general. Cumulative log data may actually be important to you. Why do you want to store it in a proprietary format accessible only to a C program for which you do not have source code? What guarantee do you have that the people who made the program will keep it up to date? Or even stay in business?

Relational-database-backed Tools

What are the characteristics of our problem anyway? Here are some obvious ones:

- the data set is large and grows every day
- every record has the same simple structure: who requested which URL, when, and with what result
- you want to ask ad hoc questions of the data, questions nobody anticipated when the logging code was written
- new records arrive continuously, even while you are running queries
- the data are valuable over the long term, so you don't want them trapped in somebody's proprietary format

Do these sound like the problems that IBM thought they were solving in the early 1970s with the relational model? Call me an Oracle whore, but it seems apparent that the correct tool is a relational database management system.

So brilliant and original was my thinking on this subject that the net.Genesis guys apparently had the idea a long time ago. They make a product called net.Analysis that purports to stuff server logs into a (bundled) Informix RDBMS in real time.

Probably this is the best of all possible worlds. You do not surrender control of your data. With a database-backed system, the data model is exposed. If you want to do a custom query, or if the monolithic program dumps core, you don't have to wait four months for a new version. Just go into SQL*PLUS and type your query. Any of these little Web companies might fold and/or decide that they've got better things to do than write log analyzers, but Oracle, Informix, and Sybase will be around. Furthermore, SQL is standard enough that you can always dump your data out of Brand O into Brand I or vice versa.
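
To make that concrete, here's a sketch of the kind of data model and ad hoc query I mean. Python's bundled sqlite3 stands in for Oracle or Informix here, and the table and column names are made up:

    # adhoc_query.py -- with an exposed data model, any SQL client can ask new questions
    import sqlite3

    db = sqlite3.connect("hits.db")
    db.execute("""CREATE TABLE IF NOT EXISTS hits (
                      host     TEXT,     -- client hostname or raw IP address
                      happened TEXT,     -- timestamp of the request
                      url      TEXT,     -- what was asked for
                      status   INTEGER,  -- 200, 404, ...
                      bytes    INTEGER   -- bytes sent back
                  )""")

    # the kind of question no canned report anticipates:
    # which URLs are producing the most 404s?
    query = """SELECT url, COUNT(*) AS n
               FROM hits
               WHERE status = 404
               GROUP BY url
               ORDER BY n DESC
               LIMIT 10"""
    for url, n in db.execute(query):
        print(n, url)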

Caveats? Maintaining a relational database is not such a thrill, though the fact that your Web site need not depend on the RDBMS being up (you can log to a file and then do a batch insert) makes it easier. I haven't tried the net.Genesis product myself, but I have built custom sites that use the NaviServer API to log to Oracle and Illustra RDBMS installations. You may want to benchmark your database before going wild with real-time inserts.
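
The log-to-a-file-then-batch-insert approach might look like this, again with sqlite3 standing in for a real RDBMS and assuming the hits table from the sketch above:

    # batch_insert.py -- parse an access log and load it in one transaction,
    # so the Web server never waits on the database
    import re
    import sqlite3

    # common log format: host ident user [date] "METHOD /url HTTP/1.0" status bytes
    LINE = re.compile(r'^(\S+) \S+ \S+ \[([^\]]+)\] "[A-Z]+ (\S+) [^"]*" (\d{3}) (\d+|-)')

    def rows(path):
        with open(path) as log:
            for line in log:
                m = LINE.match(line)
                if m:
                    host, when, url, status, nbytes = m.groups()
                    yield (host, when, url, int(status),
                           0 if nbytes == "-" else int(nbytes))

    db = sqlite3.connect("hits.db")
    db.executemany("INSERT INTO hits VALUES (?, ?, ?, ?, ?)", rows("access_log"))
    db.commit()   # one batch commit instead of a disk write per hit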

More

New kludges show up every day. Keep checking the big Yahoo list.


philg@mit.edu

Reader's Comments

One of those "new kludges" that I've had experience with is analog. It's highly configurable as to what information you want reported, and it addresses all of the issues that Phil has with wwwstat except for non-reporting of referer and browser information: you can alias a file to be reported as another file (with wildcards, even), get distinct host information as well as stats by domain, and get it to do DNS lookups on raw addresses for you. The latest version even uses pretty GIF bargraphs.

-- Jin Choi, January 9, 1997
I lied, it even deals with all those extra logs.

-- Jin Choi, January 9, 1997
analog is by far the best freely available web stats program.

-- ray@carpe.net --, February 18, 1997
Like others, I really appreciate Analog as a stats program (my stats can be viewed by anyone at http://www.ca-probate.com/stats/stats.html).

But for a huge number of web site developers, it is impossible to use a program like Analog because the users have no access to their site logs! For those users, the only way to get statistics would be to use a "counter" or "tracker" service. For a list of such services, see http://www.ca-probate.com/counter.htm


-- Mark J. Welch, March 7, 1997
