The Webalizer

1. Main Headings

Hits

Hits represent the total number of requests made to the server during the given time period (month, day, hour etc.).

Files

Files represent the total number of hits (requests) that actually resulted in something being sent back to the user. Not all hits will send data, such as 404-Not Found requests and requests for pages that are already in the browser's cache.

Pages

Pages are those URLs that would be considered the actual page being requested, and not all of the individual items that make it up (such as graphics and audio clips). Some people call this metric page views or page impressions. By default, a page is any URL that has an extension of .htm, .html or .cgi.

Tip:

By looking at the difference between hits and files, you can get a rough indication of repeat visitors, as the greater the difference between the two, the more people are requesting pages they already have cached (have viewed already).

Visits

Visits occur when some remote site makes a request for a page on your server for the first time. As long as the same site keeps making requests within a given time-out period, they will all be considered part of the same Visit. If the site makes a request to your server, and the length of time since the last request is greater than the specified time-out period (default is 30 minutes), a new Visit is started and counted, and the sequence repeats. Since only pages will trigger a visit, remote sites that link to graphics and other non-page URLs will not be counted in the visit totals, reducing the number of false visits.
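The time-out rule above can be sketched in a few lines of Python. This is only an illustration of the logic as described, not The Webalizer's actual code; the function name count_visits and the sample data are made up for the example, and the input is assumed to contain page requests only, already in chronological order.

```python
from datetime import datetime, timedelta

VISIT_TIMEOUT = timedelta(minutes=30)  # the default time-out named in the text


def count_visits(page_requests, timeout=VISIT_TIMEOUT):
    """Count visits from (site, timestamp) pairs for page requests.

    Assumes the pairs are already in chronological order.
    """
    last_seen = {}  # site -> time of its most recent page request
    visits = 0
    for site, when in page_requests:
        previous = last_seen.get(site)
        # A visit starts on the first request from a site, or when the
        # gap since its last request is greater than the time-out.
        if previous is None or when - previous > timeout:
            visits += 1
        last_seen[site] = when
    return visits


sample_requests = [
    ("192.168.45.13", datetime(2005, 5, 24, 11, 20)),
    ("192.168.45.13", datetime(2005, 5, 24, 11, 35)),  # 15 minutes later: same visit
    ("192.168.45.13", datetime(2005, 5, 24, 12, 30)),  # 55 minutes later: new visit
    ("192.168.45.77", datetime(2005, 5, 24, 12, 31)),  # different site: new visit
]
print(count_visits(sample_requests))  # 3
```

Note that the sketch keys everything on the site (IP address or hostname), which is exactly why visit totals inherit all the uncertainty of the Sites metric.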

Sites

Sites shows the number of unique IP addresses/hostnames that made requests to the server. Care should be taken when using this metric for anything other than that. Many users can appear to come from a single site, and they can also appear to come from many IP addresses, so it should be used simply as a rough gauge of the number of visitors to your server.

KBytes

A KByte (KB) is 1024 bytes (1 Kilobyte). It is used to show the amount of data that was transferred between the server and the remote machine, based on the data found in the server log.

2. Common Definitions

Countries

Countries are determined based on the top level domain of the requesting site. This is somewhat questionable however, as there is no longer strong enforcement of domains as there was in the past. A .COM domain may reside in the US, or somewhere else. An .IL domain may actually be in Israel, however it may also be located in the US or elsewhere. The most common domains seen are .COM (US Commercial), .NET (Network), .ORG (Nonprofit Organization) and .EDU (Educational). A large percentage may also be shown as Unresolved/Unknown, as a fairly large percentage of dialup and other customer access points do not resolve to a name and are left as an IP address.
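As a rough sketch of the idea, the following Python function pulls the top level domain out of a logged hostname and reports nothing for a bare IP address. The function name top_level_domain is invented for this example, and mapping the result to a country name is left out; it is not how The Webalizer itself is implemented.

```python
import ipaddress


def top_level_domain(host):
    """Return the lowercase top level domain, or None if the host is unresolved."""
    host = host.strip().rstrip(".")
    try:
        ipaddress.ip_address(host)
    except ValueError:
        pass  # not a bare IP address, so treat it as a hostname
    else:
        return None  # unresolved: the log only holds an IP address
    if "." not in host:
        return None  # no domain part to look at
    return host.rsplit(".", 1)[1].lower()


print(top_level_domain("dialup-12.example.co.il"))
print(top_level_domain("192.168.45.13"))
```

The first call yields the TLD of the hostname, while the second returns None, which is the case that ends up counted as Unresolved/Unknown.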

Entry/Exit

Entry/Exit pages are those pages that were the first requested in a visit (Entry), and the last requested (Exit). These pages are calculated using the Visits logic above. When a visit is first triggered, the requested page is counted as an Entry page, and whatever the last requested URL was, is counted as an Exit page.

Referrers

Referrers are those URLs that lead a user to your site or caused the browser to request something from your server. The vast majority of requests are made from your own URLs, since most HTML pages contain links to other objects such as graphics files. If one of your HTML pages contains links to 10 graphic images, then each request for the HTML page will produce 10 more hits with the referrer specified as the URL of your own HTML page.

Response Codes

Response Codes are defined as part of the HTTP/1.1 protocol. These codes are generated by the web server and indicate the completion status of each request made to it.

Search Strings

Search Strings are obtained from examining the referrer string and looking for known patterns from various search engines. The search engines and the patterns to look for can be specified by the user within a configuration file. The default will catch most of the major ones.

Note: Only available if that information is contained in the server logs.
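To illustrate the general approach, here is a minimal Python sketch that looks for a known pattern in a referrer and extracts the search terms. The engine hostnames and parameter names in SEARCH_ENGINES are hypothetical placeholders, not The Webalizer's default list, and the function name search_string is made up for the example.

```python
from urllib.parse import urlsplit, parse_qs

# Hypothetical patterns: hostname fragment -> query parameter holding the terms.
SEARCH_ENGINES = {
    "search.example.com": "q",
    "find.example.net": "query",
}


def search_string(referrer):
    """Return the search terms found in a referrer URL, or None."""
    parts = urlsplit(referrer)
    host = parts.hostname or ""
    for fragment, param in SEARCH_ENGINES.items():
        if fragment in host:
            values = parse_qs(parts.query).get(param)
            if values:
                return values[0].lower()
    return None


print(search_string("http://search.example.com/results?q=web+log+analysis&page=2"))
```

A referrer from a site that is not in the table simply yields None, so only requests that arrived from a recognized search engine contribute to the totals.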

Uniform Resource Locator (URL)

All requests made to a web server need to request something. A URL is that something, and represents an object somewhere on your server that is accessible to the remote user, or results in an error (e.g. 404 - Not Found). URLs can be of any type (HTML, Audio, Graphics, etc.).

User Agents

User Agents are a fancy name for browsers. Netscape, Opera, Konqueror, etc. are all User Agents, and each reports itself in a unique way to your server. Keep in mind, however, that many browsers allow the user to change the reported name, so you might see some obvious fake names in the listing.

Note: Only available if that information is contained in the server logs.

3. Frequently Asked Questions

3.1 The Webalizer doesn't show me Referrers or User Agents?

In order for the Webalizer to produce statistics for user agents (browsers) and referrers, that information needs to be in the log files produced by the web server. Most servers by default only produce CLF logs, which do not include the extra information. The way you have your server include this information depends on what server you are running. For apache, you need to edit the httpd.conf file (in the server's /conf directory) as follows.

For apache 1.2, add the line:

LogFormat "%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-agent}i\""

For apache 1.3, use the line:

CustomLog /var/lib/httpd/logs/access_log combined

Other servers are similar. Refer to your server's documentation for additional information on how to enable referrers and user agents.

3.2 On what systems will The Webalizer run?

The Webalizer was designed on and for an Intel system running Linux, but was written to be as close to ANSI/POSIX specs as possible so that it can be easily ported to other platforms. I currently only have access to Linux systems running on Intel and PowerPC hardware, so I can verify that it runs on those. In addition, I have received lots of mail from users indicating that The Webalizer will run on just about any *NIX machine out there, from AIX to XENIX. (Other platforms are also supported, such as OS/2 and MacOSX. Check the download page.)

3.3 I get "libgd not found" errors when compiling.

The Webalizer uses the gd graphics library written by Tom Boutell for producing its inline graphics. If you don't have this library, or don't have it installed correctly, you will get this error. The Webalizer expects this library to be in the standard library path (e.g. /usr/lib), so if you have it someplace else, you need to add an '-L[path]' flag to CFLAGS in the Makefile.

3.4 I get "No File or Directory" errors when compiling.

The Makefile supplied with The Webalizer expects to find the header files for the gd graphics library in /usr/local/include/gd. If they are located somewhere else, you can either create a symbolic link to them, or edit the Makefile and specify the correct location.

3.5 What is the difference between 'HITS' and 'FILES'?

Basically, HITS is the total number of HTTP requests that the server received during the reporting period. Any request made to the server is considered a hit. FILES is the number of hits that actually resulted in something being sent back to the user, such as an HTML page or image. 'Total Files' and '200 - OK' totals should be the same. If you add up the totals in the 'Hits by Response Code' section, it should be the same as the 'Total Hits' figure.
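The relationship between these totals can be checked with a small Python sketch. It assumes you already have the response code of each log record as an integer; the function name summarize and the sample codes are invented for the example.

```python
from collections import Counter


def summarize(status_codes):
    """Return (total hits, total files, hits by response code)."""
    by_code = Counter(status_codes)
    total_hits = sum(by_code.values())  # every request is a hit
    total_files = by_code[200]          # '200 - OK' responses sent something back
    return total_hits, total_files, by_code


sample_codes = [200, 200, 304, 404, 200, 304]
hits, files, by_code = summarize(sample_codes)
print(hits, files)  # 6 3
```

Here the two 304 (Not Modified) responses and the single 404 account for the difference of three between hits and files.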

3.6 My logs are HUGE! Can I run The Webalizer on partial logs?

Yes. With the release of version 1.2x, The Webalizer now supports incremental processing. This allows you to rotate your logs as often as needed without the loss of statistical detail between runs. Use the "Incremental" keyword in your configuration file, or the "-p" command line switch to enable incremental processing.

3.7 Why does the country section show only 100% unresolved?

Most likely because your web server is not doing name lookups and is simply logging IP addresses. In order to determine the top level domain of the remote host, the program needs a resolved hostname, not an IP address. The simple fix is to just turn on name lookups on your web server so it starts logging names. Otherwise, you can pre-process your logs with something like the logresolve program supplied with apache or similar utilities, or you can use the Webalizer's built-in DNS lookup code (see 3.8 below).

3.8 My Server doesn't do name lookups. Will The Webalizer?

Yes. Version 2.00 and higher supports reverse DNS lookups. If you don't enable hostname lookups on your web server, you will get "100% Unresolved/Unknown" country totals. This is because your log files only have IP addresses and not names. While it is recommended that you let your web server handle the DNS lookups, the built-in DNS support can be used for those sites where DNS resolution on the web server is not an option.

3.9 I used the [Hide*] option, but it still shows up in the totals?

Using the Hide* options only prevents that object from being displayed in the 'Top' tables generated by The Webalizer. It is still counted in the totals. Version 0.99 of The Webalizer added "Ignore*" options, which allow you to completely ignore certain objects for statistical purposes.

3.10 I used the [Group*] option, but it still shows up?

Grouped items, by default, are not hidden. This allows you to display a group total as well as all the items that make up the grouping. If you don't want to see the individual items that match the group, then follow the "Group*" keyword with an identical Hide* one.

3.11 Changing the configuration file has no effect?

Which configuration file are you changing? The Webalizer looks in the current directory for a file named "webalizer.conf", which it will process before any other configuration files. If one is not present in the current directory, it will look for the file "/etc/webalizer.conf" and process it before any other configuration files. Some configuration options allow you to toggle settings on or off, while some cannot be reversed. If you, for example, specify the configuration option "HideURL *.gif" in the system wide default file /etc/webalizer.conf, you cannot 'un-hide' that object using a local configuration file. In general, single sites should have a single configuration file, such as /etc/webalizer.conf. Larger sites that have multiple hosts/virtual hosts probably should use a different configuration file for each host and not have a default "webalizer.conf" file.

3.12 My configuration file is being read twice.

Do not use "-c webalizer.conf" on the command line. This file is always read if found, regardless of any other configuration files that may be specified. If you do specify it on the command line, it will be read twice.

3.13 I get "Error adding xxx node, skipping ..." errors. Why?

You ran out of memory. The error occurs when a malloc call is made to allocate free memory, and fails. You can increase your swap space, but the only real solution is to add more physical memory.

3.14 I get "Warning: Truncating oversized xxx" or "String exceeds storage size" warnings. Why?

Internally, The Webalizer has a fixed maximum size for various parts of the log record. If a particular field is longer than will fit, you will see these warnings. The most common is that for the "request" field on sites that have a lot of CGI interaction. They can be safely ignored. If you don't want to see warnings or errors, you can use the ReallyQuiet option (-Q command line switch) to suppress them.

3.15 Why don't the daily visit totals add up to the monthly total?

You cannot add up the daily visit totals and compare them to the monthly total, because they are different reporting periods. For example, if someone visits your site at 11:45pm and stays until 12:15am, the monthly total would show one visit, while the daily totals will show two (one for each day).

3.16 Why do my reports show more "Sites" than "Visits"?

Visits are only triggered when a valid request is found for a "page", as defined by your "PageType" setting (or a URL that ends with a slash, which is also considered a page type). Sites however, are counted regardless of the request type. It is very common to have more sites than visits, particularly if you host non-pagetype URLs on your site that are linked to from the outside. If you are not hosting URLs that are linked to from outside sites, then make sure your PageType setting is correct. The default is .htm, .html and .cgi extensions, unless you specify otherwise.
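The page test described here can be sketched as a small Python function. It is a simplified illustration of the rule (listed extension, or a URL ending in a slash), not The Webalizer's own matching code; is_page and DEFAULT_PAGE_TYPES are names made up for the example.

```python
from urllib.parse import urlsplit

DEFAULT_PAGE_TYPES = ("htm", "html", "cgi")  # the defaults named in the text


def is_page(url, page_types=DEFAULT_PAGE_TYPES):
    """Return True if the URL counts as a page under the given page types."""
    path = urlsplit(url).path  # drop any query string
    if path.endswith("/"):
        return True  # a trailing slash is also considered a page
    last_part = path.rsplit("/", 1)[-1]
    if "." not in last_part:
        return False
    extension = last_part.rsplit(".", 1)[1].lower()
    return extension in page_types


print(is_page("/index.html"), is_page("/images/logo.gif"))  # True False
print(is_page("/news.php"), is_page("/news.php", ("php",)))  # False True
```

The last line shows why a wrong PageType setting matters: on a site built from .php URLs, the default types would count no pages at all, and so no visits either.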

3.17 How can I process multiple logs from a server farm?

I have multiple load-balanced servers (or I'm using DNS round-robin to accomplish the same thing) and I want to generate one webalizer report for the whole farm, but each server generates its own log file. When I run webalizer on each of the logfiles in turn, it ignores a lot of the records because it thinks they're out of order!

You need to merge all of the logfiles together so that webalizer sees the records in chronological order. One good way to do that on the fly is with mergelog, a quick common logfile sorter. An example, where the log file names are placeholders for your own:

mergelog server1.log server2.log server3.log | webalizer

Another method is to simply combine your logs and then sort them into chronological order. Here is a simple shell script that uses the GNU sort utility to sort an already combined log file:

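#!/bin/sh
# Sort a combined log file into chronological order using GNU sort.
if [ ! -f "$1" ]; then
    echo "Usage: $0 <logfile>"
    exit 1
fi
echo "Sorting $1"
# Field 4 is the timestamp, e.g. [24/May/2005:11:20:39
# Sort keys, in order: year, month, day, hour, minute, second.
sort -t ' ' -k 4.9,4.12n -k 4.5,4.7M -k 4.2,4.3n -k 4.14,4.15n -k 4.17,4.18n -k 4.20,4.21n "$1" > "$1.sorted"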
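If each server's log is already in chronological order on its own, the logs can also be merged without a full sort. The following Python sketch shows the idea using the standard heapq.merge function; the helper names timestamp and merge_logs and the sample lines are invented for the example, and the logs are assumed to be in CLF or combined format.

```python
import heapq
from datetime import datetime


def timestamp(line):
    """Parse the bracketed timestamp, e.g. [24/May/2005:11:20:39 -0400]."""
    start = line.index("[") + 1
    end = line.index("]", start)
    return datetime.strptime(line[start:end], "%d/%b/%Y:%H:%M:%S %z")


def merge_logs(*logs):
    """Lazily merge already-sorted iterables of log lines by timestamp."""
    return heapq.merge(*logs, key=timestamp)


server_a = [
    '10.0.0.1 - - [24/May/2005:11:20:39 -0400] "GET /a.html HTTP/1.1" 200 117',
    '10.0.0.1 - - [24/May/2005:11:20:45 -0400] "GET /c.html HTTP/1.1" 200 50',
]
server_b = [
    '10.0.0.2 - - [24/May/2005:11:20:41 -0400] "GET /b.html HTTP/1.1" 200 231',
]
for line in merge_logs(server_a, server_b):
    print(line)
```

Because the merge is lazy, the same function works on open file objects, so huge logs never have to be held in memory all at once. Unlike the shell script above, this version also takes the time zone offset into account.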

3.18 How can I easily process multiple virtual hosts?

There are many ways to process multiple virtual hosts on the same machine. The easiest way I have found, provided that each host generates its own log file, is as follows:

  1. Create a central directory for your configuration files. (I use "/etc/webalizer")
  2. Make a configuration file for each virtual host and place them in the central directory. Each configuration file should have at least the "HostName" (domain), "OutputDir" and "LogFile" configuration settings specified. You probably will want to specify other settings specific to the domain, such as "HideReferrer", "HideSite" and maybe some others as well. Name the file the same as the domain name, and end it with a ".conf" extension, so you can easily tell what vhost the configuration is for.
  3. To process all your virtual sites with a single command, a simple shell command can now be used:

    for i in /etc/webalizer/*.conf; do webalizer -c "$i"; done

After you have it set up, to add a new host, all you need to do is create a new configuration file and put it in the directory. It will be automatically picked up the next time you run the command.

3.19 I am having problems compiling w/DNS support.

Lots of people have problems compiling DNS support into the Webalizer, which is why you have to specifically enable it. If you don't really need DNS support, don't try to compile it in. The vast majority of sites can get by with simply turning on hostname lookups on their web server, which will do the DNS resolution for you automatically.

If you really need built-in DNS support, it's really quite simple. However, different distributions place the required headers and libraries in different places, so the configure script can fail to find them. In a nutshell, you need the Berkeley DB library, and it needs to be configured with V1.85 API support. Most Linux distributions already have this library present. If you get errors about not finding the proper header file (db_185.h), locate it on your system and create a symbolic link to it in the /usr/include directory.

You will also need to specify the correct library to use for the header. This may mean that you need to first run the configure script, then, if the compile fails due to unresolved references, edit the Makefile and change the "-ldb" (or "-ldb1") reference to the correct library, such as "-ldb-3.2" for RedHat, or just "-ldb" for Slackware 8. Each distribution either names the library something different, or puts it in a different location, so you will have to play around with it on your distro to get it to work. If the library is in some non-standard location, you may also need to specify its path using the "--with-dblib=..." switch when you run the configure script.

Bottom line is, if you don't really need it (and most people don't), just compile without DNS support and let your web server do the name lookups for you.

Submitted by Victor Brilon:

To compile under RH 7.x or 8.x using DNS resolver, you must use this command line:

./configure --enable-dns --with-db=/usr/include/db1

You must have the db1 and the db1-devel RPMs installed in order to do this.

3.20 My stats are not updating, why?

In order to produce or update statistic reports, the Webalizer must be run. In most cases, if your stat report is not being updated, then the program isn't being run. If it is, then it may be encountering a problem during processing that needs to be taken care of. The easiest way to determine this is to manually run the program from the command line and observe the informational messages it produces. Make sure your config file does not prevent messages from being displayed (Quiet and ReallyQuiet options). Without these messages, there is no way to determine what, if any, problems the program may be having, and any attempt to correct the problem would simply be a random guess.

Last modified May 12, 2005 by Bradford L. Barrett

4. Simpleton's Guide to Web Server Analysis

Welcome to the wonderful world of web server usage analysis! This guide is intended to provide the necessary background and insight into how web server analysis works, things to look for and things to watch out for. Specifically, this guide is intended for the users of the Webalizer, but can be applied to most any analysis package out there. If you are new to web server analysis, or just want to find out how things work, then this guide is for you.

4.1 Hit me Please!

OK, so you've got a web site and you want to know if anybody is looking at it, and if so, what they are looking at and how many times. Lucky for you, (most) every web server keeps a log of what it's doing, so you can just go look and see. The logs are just plain ASCII text files, so any text editor or viewer would work just fine. Each time someone (using a web browser) asks for one of your web pages, or any component thereof (known as URLs, or Uniform Resource Locators), the web server will write a line to the end of the log representing that request. Unfortunately, the raw logs are rather cryptic for everyday humans to read. While you might be able to determine if anybody was looking at your web site, any other information would require some sort of processing to determine. A typical log entry might look something like the following:

192.168.45.13 - - [24/May/2005:11:20:39 -0400] "GET /mypage.html HTTP/1.1" 200 117

This represents a request from a computer with the IP address 192.168.45.13 for the URL "/mypage.html" on the web server. It also gives the time and date the request was made, the type of request, the result code for that request and how many bytes were sent to the remote browser. There will be a line similar to this one for each and every request made to the web server over the period covered by the log. A "Hit" is another way to say "request made to the server," so as you may have noticed, each line in the log represents a Hit. If you want to know how many Hits your server received, just count the number of lines in the log. And since each log line represents a request for a specific URL, from a specific IP address, you can easily figure out how many hits you got for each of your web pages or how many hits you received from a particular IP address by just counting the lines in the log that contain them. Yes, it really is that simple. And while you could do this manually with a text editor or other simple text processing tools, it is much more practical and easier to use a program specifically designed to analyze the logs for you, such as the Webalizer. They take the work out of it for you, provide totals for many other aspects of your server, and allow you to visualize the data in a way not possible by just looking at the raw logs.
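To make the "just count the lines" idea concrete, here is a minimal Python sketch that splits a CLF line into its fields and tallies hits per URL and per IP address. The regular expression and the names parse_line and count_hits are assumptions made for this example; it has nothing to do with how The Webalizer parses logs internally.

```python
import re
from collections import Counter

# host ident user [time] "request" status bytes
CLF_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<bytes>\d+|-)'
)


def parse_line(line):
    """Return a dict of CLF fields, or None if the line does not match."""
    match = CLF_PATTERN.match(line)
    if match is None:
        return None
    record = match.groupdict()
    parts = record["request"].split()  # e.g. ['GET', '/mypage.html', 'HTTP/1.1']
    record["url"] = parts[1] if len(parts) >= 2 else record["request"]
    record["status"] = int(record["status"])
    record["bytes"] = 0 if record["bytes"] == "-" else int(record["bytes"])
    return record


def count_hits(lines):
    """Return (hits per URL, hits per host) for the lines that parse."""
    hits_by_url, hits_by_host = Counter(), Counter()
    for line in lines:
        record = parse_line(line)
        if record is None:
            continue  # skip lines that are not valid CLF records
        hits_by_url[record["url"]] += 1
        hits_by_host[record["host"]] += 1
    return hits_by_url, hits_by_host


sample = '192.168.45.13 - - [24/May/2005:11:20:39 -0400] "GET /mypage.html HTTP/1.1" 200 117'
print(parse_line(sample)["url"])  # /mypage.html
```

Passing an open log file to count_hits and calling most_common() on either counter gives a crude "top URLs" or "top sites" listing.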

4.2 How does it all work?

Well, to understand what you can analyze, you really should know what information is provided by your web server and how it gets there. At the very least, you should know how HTTP (the HyperText Transfer Protocol) works, and its strengths and weaknesses. At its simplest, a web server just sits there listening on the network for a web browser to make a request. Once a request is received, the server processes it and then sends something back to the requesting browser (and as explained above, the request is logged to a log file). These requests are typically for some URL, although there are other types of information a browser can request, such as server type, HTTP protocol versions supported, modification dates, etc., but those types are not as common. To visualize the interaction between server, browser and web pages, let's use an example to illustrate the information flow. Imagine a simple web page, "mypage.html", which is an HTML web page that contains two graphic images, "myimage1.jpg" and "myimage2.jpg". A typical server/browser interaction might go something like this:

  1. The browser requests "mypage.html" and the server sends it back.
  2. The browser reads the page, finds the first image and requests "myimage1.jpg", which the server sends back.
  3. The browser requests the second image, "myimage2.jpg", and the server sends it back.

In the web server log, the following lines would be added:

192.168.45.13 - - [24/May/2005:11:20:39 -0400] "GET /mypage.html HTTP/1.1" 200 117
192.168.45.13 - - [24/May/2005:11:20:40 -0400] "GET /myimage1.jpg HTTP/1.1" 200 231
192.168.45.13 - - [24/May/2005:11:20:41 -0400] "GET /myimage2.jpg HTTP/1.1" 200 432

So what can we gather from this exchange? Well, based on what we learned above, we can count the number of lines in the log file and determine that the server received 3 hits during the period that this log file covers. We can also calculate the number of hits each URL received (in this case, 1 hit each). Along the same lines, we can see that the server received 3 hits from the IP address 192.168.45.13, and when those requests were received. The two numbers at the end of each line represent the response code and the number of bytes sent back to the requestor. The response code is how the web server indicates how it handled the request, and the codes are defined as part of the HTTP protocol. In this example, they are all 200, which means everything went OK. One response code you may be very familiar with is the all too common "404 - Not Found," which means that the requested URL could not be found on the server. There are several other response codes defined, however these two are the most common.

And that, in a nutshell, is about all you can accurately determine from the logs. "But wait!" you might be screaming, "most analysis programs have lots of other numbers displayed!", and you would be right. Some more obscure numbers can be calculated, like the number of different response codes, number of hits within a given time period, total number of bytes sent to remote browsers, etc. Other numbers can be implied based on certain assumptions, however those cannot be considered entirely accurate, and some can even be wildly inaccurate. Other log formats might be used by a web server as well, which provide additional information above what the CLF format does, and those will be discussed shortly. For now, just realize that the only thing you can really, accurately determine is what IP address requested which URL, and when it requested that URL, as shown in the example above.

4.3 The Good, the Bad and the Ugly

So now you have a good grasp of how your web server works and what information can be obtained from its logs, like number of hits (to the server and to individual URLs), number of IP addresses making the requests (and how many hits each IP address made), and when those requests were made. Given just that information, you can answer questions such as "What is the most popular URL on my site?", "What was the next most popular URL?", "What IP address made the most requests to my server?", and "How busy was my server during this time period?". Most analysis programs will also make it easy to answer such questions as "What time of day is my web server the most active?", or "What day of the week is the busiest?". They can give you an insight into usage patterns that may not be apparent by just looking at the raw logs. All of these questions can be answered with completely accurate answers, based just on the simple analysis of your web server logs. That's the good news!
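A question such as "What time of day is my web server the most active?" only needs the timestamp from each record. As a minimal sketch (the function name hits_by_hour and the sample lines are made up, and the hour is taken as logged, in the server's own time zone), it could look like this in Python:

```python
from collections import Counter
from datetime import datetime


def hits_by_hour(lines):
    """Count hits per hour of day (0-23), skipping lines without a valid timestamp."""
    counts = Counter()
    for line in lines:
        start = line.find("[")
        end = line.find("]", start)
        if start == -1 or end == -1:
            continue
        try:
            when = datetime.strptime(line[start + 1:end], "%d/%b/%Y:%H:%M:%S %z")
        except ValueError:
            continue  # malformed timestamp
        counts[when.hour] += 1
    return counts


sample_lines = [
    '192.168.45.13 - - [24/May/2005:11:20:39 -0400] "GET /mypage.html HTTP/1.1" 200 117',
    '192.168.45.13 - - [24/May/2005:11:20:40 -0400] "GET /myimage1.jpg HTTP/1.1" 200 231',
    '192.168.45.13 - - [24/May/2005:11:20:41 -0400] "GET /myimage2.jpg HTTP/1.1" 200 432',
    '192.168.45.77 - - [24/May/2005:23:05:10 -0400] "GET /mypage.html HTTP/1.1" 200 117',
]
print(hits_by_hour(sample_lines).most_common(1))  # [(11, 3)]
```

Grouping by when.weekday() instead of when.hour would answer the "busiest day of the week" question in the same way.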

The bad news? Well, with all the things you can determine by looking at your logs, there are a lot of things you can't accurately calculate. Unfortunately, some analysis programs (particularly commercial packages) lead you to believe otherwise, and forget to mention that these are not much more than assumptions and cannot be considered at all accurate. "Like what?" you ask. Well, how about those things that some programs call 'user trails' or 'paths', that are supposed to tell you what pages a user traveled through on your site, and in what order? Or how about the length of time a user spends on your site? Another less than accurate metric would be that of 'visits', or how many users 'visited' your site during a given time period. None of these can be accurately calculated, for a couple of different reasons. Let's look at some of them:

4.4 But wait, there's more!

OK, so what have we learned here? Well, in short, you don't know who or what is making requests to your server, and you can't assume that a single IP address is really a single user. Sure, you can make all kinds of assumptions and guesses, but that is all they really are, and you should not consider them at all accurate. Take the following example: IP address A makes a request to your server. One minute later, IP address B makes a request, and then 10 minutes later, address A makes another request. What can we determine from that sequence? Well, we can assume that two users visited. But what if address A was that of a firewall? Those two requests from address A could have been two different users. What if the user at address A got disconnected and dialed back in, getting a different address (address B), and someone else dialed in at the same time and got the now free address A? Or maybe the user was sitting behind a reverse-proxy, and all three requests were really from the same user. And can we tell what 'path' or 'trail' these users took while at the web site or how long they remained? Hopefully, you should now see that the answer to all these things is a big resounding No, we can't! Without being able to identify individual unique users, there is no way to tell what an individual unique user does.

Other metrics you CAN determine

Now that you have seen what is possible, you may be thinking that there are some other things these programs display, and wondering about how accurate they might be. Hopefully, based on what you have already seen thus far, you should be able to figure them out on your own. One such metric is that of a 'page' or 'page view'. As we already know, a web page is made up of an HTML text document and usually other elements such as graphic images, audio or other multimedia objects, style sheets, etc. One request for a web page might generate dozens of requests for these other elements, but a lot of people just want to know how many web pages were requested without counting all the stuff that makes them up. You can get this number, if you know what type of files you consider a 'page'. On a normal server, these would be just the URLs that end with a .htm or .html extension. Perhaps you have a dynamic site, and your web pages use an .asp, .pl or .php extension instead. You obviously would not want to count .gif or .jpg images as pages, nor would you want to count style sheets, Flash graphics and other elements. You could go through the logs and just count up the requests for whatever URLs meet your criteria for a 'page', but most analysis programs (including the Webalizer) allow you to specify what you consider a page and will count them up for you.

Other information

Up to now, we have just discussed the CLF (Common Log Format) log format. There are others. The most common is called 'combined', and takes the basic CLF format and adds two new pieces of information. Tacked on the end are the 'user agent' and 'referrer'. A user agent is just the name of the browser or program being used to generate the request to the web server. The 'referrer' is supposed to be the page that referred the user to your web server. Unfortunately, both of these can be completely misleading. The user agent string can be set to anything in some modern browsers. One common trick for Opera users is to set their user agent string to that of MS Internet Explorer so they can view sites that only allow MSIE visitors. And the referrer string, according to the standards document (RFC) for the HTTP protocol, may or may not be used at the browser's choosing, and if used, does not have to be accurate or even informative. The apache web server (which is the most common on the Internet) allows other things to be logged, such as cookie information, length of time to handle the request and lots of other stuff. Unfortunately, the inclusion and placement of this information in the server logs are not standard. Another format, developed by the W3C (World Wide Web Consortium), allows log records to be made up of many different pieces of information, and their location can be anywhere in the log entry, with a header record needed to map them. Some analysis programs handle these and other formats better than others.
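As an illustration, a 'combined' record can be picked apart with a regular expression that extends the CLF layout by two quoted fields. This Python sketch assumes the field order used by the apache LogFormat line shown in the FAQ (referrer first, then user agent) and that neither field contains an embedded double quote; the sample line, the example.com URL and the name parse_combined are made up for the example.

```python
import re

# CLF fields followed by "referrer" "user agent"
COMBINED_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<bytes>\d+|-) "(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)


def parse_combined(line):
    """Return a dict of fields for a combined-format line, or None."""
    match = COMBINED_PATTERN.match(line)
    return match.groupdict() if match else None


sample = (
    '192.168.45.13 - - [24/May/2005:11:20:40 -0400] "GET /myimage1.jpg HTTP/1.1" '
    '200 231 "http://www.example.com/mypage.html" '
    '"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"'
)
record = parse_combined(sample)
print(record["referrer"])
print(record["agent"])
```

A plain CLF line does not match the pattern and returns None, which mirrors the "Only available if that information is contained in the server logs" notes above.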

Analysis techniques

The only true way to get an accurate picture of what your web server is doing is to look at its logs. This is how most of the analysis packages out there get their information, and it is the most accurate method. Other methods can be used, with different results. One common method, which was widely popular for a while, was the use of a 'page counter'. Basically, it was a dynamic bit included in a web page that incremented a counter and displayed its value each time the page was requested. Normally, it was included in the page as if it were a standard image file. One problem with this method was that you had to include a different 'image' file for each page you wanted to track. Another problem occurred if the remote user had image display turned off in their browser, or could not display images at all (such as in a text based web browser). You could also easily inflate the number by just hitting the 'reload' button on your browser over and over again. Similar methods were developed using Java and JavaScript, in an attempt to get even more information about the visiting browser, such as screen resolution and operating system type. Of course, these can easily be circumvented as well. Some companies set up systems that claim to track your server usage remotely, by including an image or JavaScript element on your site which would then contact the company's system each time the image or JavaScript element was requested. These all have the same problems and limitations. In all of these, you can simply turn off images and Java/JavaScript and then browse the web site completely uncounted and unseen (except in the web server logs). Beware of these types of counters and remote usage sites. They are not quite as accurate as they may lead you to believe.

Conclusion

It should now be obvious that there are only certain things you can determine from a web server log. There are some completely accurate numbers you can generate without question. And then, there are some wildly inaccurate and misleading numbers you can garner depending on what assumptions you make. Want to know how many requests generated a 404 (not found) result? Go right ahead and count them up and be completely confident with the number you get. Want to know how many 'users' visited your web site? Good luck with that one. Unless you go 'outside the logs', it will be a hit or miss stab in the dark. But now you should have a good idea of what is and isn't possible, so when you look at your usage report, you will be able to determine what the numbers mean and how much to trust them. You should also now see that a lot can depend on how the program is configured, and that the wrong configuration can lead to wrong results. Take the example of 'pages'. If your analysis software thinks that only URLs with a .htm or .html extension are pages, and all you have is .php pages on your site, that number will be completely wrong. Not because the program is wrong, but because someone gave it the wrong information to base its calculations on. Remember, knowledge is power, so now you have the power to ask the proper questions and get the proper results. The next time you look at a server analysis report, hopefully you will see it in a different light given your newfound knowledge.
