1. Main Headings
Hits
Hits represent the total number of requests made to the server during the given time period (month, day, hour etc.).
Files
Files represent the total number of hits (requests) that actually resulted in something being sent back to the user. Not all hits will send data, such as 404-Not Found requests and requests for pages that are already in the browser's cache.
Pages
Pages are those URLs that would be considered the actual page being requested, and not all of the individual items that make it up (such as graphics and audio clips). Some people call this metric page views or page impressions. By default, a page is any URL that has an extension of .htm, .html or .cgi.
Tip: By looking at the difference between hits and files, you can get a rough indication of repeat visitors: the greater the difference between the two, the more people are requesting pages they already have cached (have viewed already).
Visits
Visits occur when some remote site makes a request for a page on your server for the first time. As long as the same site keeps making requests within a given time-out period, they will all be considered part of the same Visit. If the site makes a request to your server, and the length of time since the last request is greater than the specified time-out period (default is 30 minutes), a new Visit is started and counted, and the sequence repeats. Since only pages will trigger a visit, remote sites that link to graphic and other non-page URLs will not be counted in the visit totals, reducing the number of false visits.
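The time-out rule above can be sketched in a few lines of Python. This is only an illustration of the logic as described, not the Webalizer's actual code; the function name, the sample addresses and the assumption that the input already contains only page requests in chronological order are all hypothetical.

```python
from datetime import datetime, timedelta

VISIT_TIMEOUT = timedelta(minutes=30)  # default time-out named in the text


def count_visits(page_requests, timeout=VISIT_TIMEOUT):
    """Count visits from (site, timestamp) pairs for page requests only.

    Requests must be in chronological order, as they are in a server log.
    """
    last_seen = {}  # site -> time of its most recent page request
    visits = 0
    for site, when in page_requests:
        previous = last_seen.get(site)
        if previous is None or when - previous > timeout:
            visits += 1  # first request, or time-out exceeded: new visit
        last_seen[site] = when
    return visits


requests = [
    ("10.0.0.1", datetime(2005, 5, 24, 11, 0)),
    ("10.0.0.1", datetime(2005, 5, 24, 11, 20)),  # 20 minute gap: same visit
    ("10.0.0.2", datetime(2005, 5, 24, 11, 25)),  # first request from this site
    ("10.0.0.1", datetime(2005, 5, 24, 12, 0)),   # 40 minute gap: new visit
]
print(count_visits(requests))  # 3
```

Note that the gap is measured from the site's most recent request, not from the start of the visit, so a site that keeps requesting pages every few minutes stays in one visit however long it lasts.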
Sites
Sites shows the number of unique IP addresses/hostnames that made requests to the server. Care should be taken when using this metric for anything other than that. Many users can appear to come from a single site, and they can also appear to come from many IP addresses, so it should be used simply as a rough gauge of the number of visitors to your server.
KBytes
A KByte (KB) is 1024 bytes (1 Kilobyte). It is used to show the amount of data that was transferred between the server and the remote machine, based on the data found in the server log.
Countries
Countries are determined based on the top level domain of the requesting site. This is somewhat questionable, however, as there is no longer strong enforcement of domains as there was in the past. A .COM domain may reside in the US, or somewhere else. An .IL domain may actually be in Israel, but it may also be located in the US or elsewhere. The most common domains seen are .COM (US Commercial), .NET (Network), .ORG (Nonprofit Organization) and .EDU (Educational). A large percentage may also be shown as Unresolved/Unknown, as a fairly large percentage of dialup and other customer access points do not resolve to a name and are left as an IP address.
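As a rough Python sketch of the idea (not the Webalizer's own lookup), the country label can be taken from the last part of the hostname, with bare IP addresses falling into Unresolved/Unknown. The function name and the tiny table below are assumptions made for illustration; a real table would cover every top level domain.

```python
import ipaddress

# Tiny illustrative table; a real one would cover every top level domain.
TLD_NAMES = {
    "com": "US Commercial",
    "net": "Network",
    "org": "Nonprofit Organization",
    "edu": "Educational",
    "il": "Israel",
}


def country_for(host):
    """Return the label for a host's top level domain, or Unresolved/Unknown."""
    try:
        ipaddress.ip_address(host)
        return "Unresolved/Unknown"  # a bare IP address has no domain
    except ValueError:
        pass  # not an IP address, so treat it as a hostname
    tld = host.rstrip(".").rsplit(".", 1)[-1].lower()
    return TLD_NAMES.get(tld, "Unresolved/Unknown")


print(country_for("www.example.com"))  # US Commercial
print(country_for("192.168.45.13"))    # Unresolved/Unknown
```

The label only reflects where the domain is registered, which is exactly why the totals should be read with the caution described above.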
Entry/Exit
Entry/Exit pages are those pages that were the first requested in a visit (Entry), and the last requested (Exit). These pages are calculated using the Visits logic above. When a visit is first triggered, the requested page is counted as an Entry page, and the last URL requested in that visit is counted as an Exit page.
Referrers
Referrers are those URLs that lead a user to your site or caused the browser to request something from your server. The vast majority of requests are made from your own URLs, since most HTML pages contain links to other objects such as graphics files. If one of your HTML pages contains links to 10 graphic images, then each request for the HTML page will produce 10 more hits with the referrer specified as the URL of your own HTML page.
Response Codes
Response Codes are defined as part of the HTTP/1.1 protocol. These codes are generated by the web server and indicate the completion status of each request made to it.
Search Strings
Search Strings are obtained from examining the referrer string and looking for known patterns from various search engines. The search engines and the patterns to look for can be specified by the user within a configuration file. The default will catch most of the major ones.
Note: Only available if that information is contained in the server logs.
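The pattern matching described for search strings might look something like the following Python sketch. The pattern table, its two entries and the function name are hypothetical stand-ins for whatever the configuration file defines; they are not the Webalizer's defaults.

```python
from urllib.parse import urlsplit, parse_qs

# Hypothetical pattern table: part of the referrer host -> query parameter.
SEARCH_ENGINES = {
    "google.": "q",
    "yahoo.": "p",
}


def search_string(referrer):
    """Return the search phrase in a referrer URL, or None if none is found."""
    parts = urlsplit(referrer)
    host = parts.netloc.lower()
    for pattern, param in SEARCH_ENGINES.items():
        if pattern in host:
            values = parse_qs(parts.query).get(param)
            if values:
                return values[0].lower()
    return None


print(search_string("http://www.google.com/search?q=Web+Server+Stats&hl=en"))
# web server stats
```

A referrer from a site that is not in the table simply yields no search string, even if its URL happens to contain the same parameter name.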
Uniform Resource Locator (URL)
All requests made to a web server need to request something. A URL is that something, and represents an object somewhere on your server that is accessible to the remote user, or results in an error (e.g. 404 - Not Found). URLs can be of any type (HTML, audio, graphics, etc.).
User Agents
User Agents are a fancy name for browsers. Netscape, Opera, Konqueror, etc. are all User Agents, and each reports itself in a unique way to your server. Keep in mind, however, that many browsers allow the user to change the reported name, so you might see some obvious fake names in the listing.
Note: Only available if that information is contained in the server logs.
3.1 The Webalizer doesn't show me Referrers or User Agents?
In order for the Webalizer to produce statistics for user agents (browsers) and referrers, that information needs to be in the log files produced by the web server. Most servers by default only produce CLF logs, which do not include the extra information. How you get your server to include this information depends on what server you are running. For Apache, you need to edit the httpd.conf file (in the server's /conf directory) and...
For Apache 1.2, add the line:
LogFormat "%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-agent}i\""
For Apache 1.3, use the line:
CustomLog /var/lib/httpd/logs/access_log combined
Other servers are similar. Refer to your server's documentation for additional information on how to enable referrers and user agents.
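To check whether your logs already carry the extra fields, a small Python sketch like the one below can help. It assumes the field layout of the LogFormat line shown above; the regular expression, the function name and the sample referrer and user agent values are illustrative assumptions, and quoted fields containing escaped quotes are not handled.

```python
import re

# Combined format: the CLF fields followed by quoted referrer and user agent.
COMBINED = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<bytes>\S+)'
    r'(?: "(?P<referrer>[^"]*)" "(?P<agent>[^"]*)")?$'
)


def parse_line(line):
    """Return a dict of fields, or None if the line does not match."""
    match = COMBINED.match(line.strip())
    return match.groupdict() if match else None


clf = ('192.168.45.13 - - [24/May/2005:11:20:39 -0400] '
       '"GET /mypage.html HTTP/1.1" 200 117')
combined = clf + ' "http://www.example.com/" "Mozilla/5.0"'

print(parse_line(clf)["referrer"])    # None: plain CLF carries no referrer
print(parse_line(combined)["agent"])  # Mozilla/5.0
```

If every line in your log comes back with no referrer and no agent, the server is still writing plain CLF and the configuration change has not taken effect.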
3.2 On what systems will The Webalizer run?
The Webalizer was designed on and for an Intel system running Linux, but was written to be as close to ANSI/POSIX specs as possible in order to be easily ported to other platforms. I currently only have access to Linux systems running on Intel and PowerPC hardware, so I can verify that it runs on those. In addition, I have received lots of mail from users indicating that The Webalizer will run on just about any *NIX machine out there, from AIX to XENIX. (Other platforms are also supported, such as OS/2 and MacOSX. Check the download page.)
3.3 I get "libgd not found" errors when compiling.
The Webalizer uses the gd graphics library written by Tom Boutell for producing its inline graphics. If you don't have this library, or don't have it installed correctly, you will get this error. The Webalizer expects this library to be in the standard library path (e.g. /usr/lib), so if you have it someplace else, you need to add a '-L[path]' flag to CFLAGS in the Makefile.
3.4 I get "No File or Directory" errors when compiling.
The Makefile supplied with The Webalizer expects to find the header files for the gd graphics library in /usr/local/include/gd. If they are located somewhere else, you can either create a symbolic link to them, or edit the Makefile and specify the correct location.
3.5 What is the difference between 'HITS' and 'FILES'?
Basically, HITS is the total number of HTTP requests that the server received during the reporting period. Any request made to the server is considered a hit. FILES is the number of hits that actually resulted in something being sent back to the user, such as an HTML page or image. 'Total Files' and '200 - OK' totals should be the same. If you add up the totals in the 'Hits by Response Code' section, it should be the same as the 'Total Hits' figure.
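The relationship between the three totals can be sketched in Python as follows. This is an illustration of the arithmetic described above, with a made-up list of response codes and a hypothetical function name, not the Webalizer's own counting code.

```python
from collections import Counter


def summarize(status_codes):
    """Return (hits, files, hits-by-response-code) for a list of status codes."""
    by_code = Counter(status_codes)
    hits = sum(by_code.values())  # every request is a hit
    files = by_code[200]          # requests answered with '200 - OK'
    return hits, files, by_code


codes = [200, 200, 304, 404, 200, 304]
hits, files, by_code = summarize(codes)
print(hits, files, dict(by_code))  # 6 3 {200: 3, 304: 2, 404: 1}
```

Here the three hits that are not files are two cached pages (304) and one missing URL (404), which matches the kinds of requests that send nothing back to the user.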
3.6 My logs are HUGE! Can I run The Webalizer on partial logs?
Yes. With the release of version 1.2x, The Webalizer now supports incremental processing. This allows you to rotate your logs as often as needed without the loss of statistical detail between runs. Use the "Incremental" keyword in your configuration file, or the "-p" command line switch to enable incremental processing.
3.7 Why does the country section show only 100% unresolved?
Most likely because your web server is not doing name lookups and is simply logging IP addresses. In order to determine the top level domain of the remote host, the program needs a resolved hostname, not an IP address. The simple fix is to just turn on name lookups on your web server so it starts logging names. Otherwise, you can pre-process your logs with something like the logresolve program supplied with Apache or similar utilities, or you can use the Webalizer's built-in DNS lookup code (see 3.8 below).
3.8 My Server doesn't do name lookups. Will The Webalizer?
Yes. Version 2.00 and higher supports reverse DNS lookups. If you don't enable hostname lookups on your web server, you will get "100% Unresolved/Unknown" country totals. This is because your log files only have IP addresses and not names. While it is recommended that you let your web server handle the DNS lookups, the Webalizer's DNS support can be used for those sites where DNS resolution on the web server is not an option.
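For readers curious what a reverse lookup pass over a log involves, here is a minimal Python sketch. It is not the Webalizer's DNS code and not logresolve; the function names and the in-memory cache are assumptions for illustration. The lookup function is passed in as a parameter so that it can be swapped out, and it defaults to the standard library's socket.gethostbyaddr.

```python
import socket


def resolve(ip, cache, lookup=socket.gethostbyaddr):
    """Return the hostname for ip, or ip itself if the lookup fails."""
    if ip not in cache:
        try:
            # gethostbyaddr returns (hostname, aliases, addresses)
            cache[ip] = lookup(ip)[0]
        except OSError:
            cache[ip] = ip  # leave unresolved addresses as they are
    return cache[ip]


def resolve_line(line, cache, lookup=socket.gethostbyaddr):
    """Replace the leading address of a log line with its hostname."""
    ip, sep, rest = line.partition(" ")
    return resolve(ip, cache, lookup) + sep + rest
```

Calling resolve_line on each line of a log with a shared cache dictionary means every address is looked up at most once, which matters because reverse lookups are slow compared to everything else in log processing.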
3.9 I used the [Hide*] option, but it still shows up in the totals?
Using the Hide* options only prevents that object from being displayed in the 'Top' table generated by The Webalizer. It is still counted in the totals. Since version 0.99, The Webalizer has "Ignore*" options, which allow you to completely ignore certain objects for statistical purposes.
3.10 I used the [Group*] option, but it still shows up?
Grouped items, by default, are not hidden. This allows you to display a group total as well as all the items that make up the grouping. If you don't want to see the individual items that match the group, then follow the "Group*" keyword with an identical Hide* one.
3.11 Changing the configuration file has no effect?
Which configuration file are you changing? The Webalizer looks in the current directory for a file named "webalizer.conf", which it will process before any other configuration files. If one is not present in the current directory, it will look for the file "/etc/webalizer.conf" and process it before any other configuration files. Some configuration options allow you to toggle settings on or off, while some cannot be reversed. If, for example, you specify the configuration option "HideURL *.gif" in the system wide default file /etc/webalizer.conf, you cannot 'un-hide' that object using a local configuration file. In general, single sites should have a single configuration file, such as /etc/webalizer.conf. Larger sites that have multiple hosts/virtual hosts probably should use different configuration files for each host and not have a default "webalizer.conf" file.
3.12 My configuration file is being read twice.
Do not use "-c webalizer.conf" on the command line. This file is always read if found, regardless of any other configuration files that may be specified. If you do specify it on the command line, it will be read twice.
3.13 I get "Error adding xxx node, skipping ..." errors. Why?
You ran out of memory. The error occurs when a malloc call is made to allocate free memory, and fails. You can increase your swap space, but the only real solution is to add more physical memory.
3.14 I get "Warning: Truncating oversized xxx" or "String exceeds storage size" warnings. Why?
Internally, The Webalizer has a fixed maximum size for various parts of the log record. If a particular field is longer than will fit, you will see these warnings. The most common is that for the "request" field on sites that have a lot of CGI interaction. They can be safely ignored. If you don't want to see warnings or errors, you can use the ReallyQuiet option (-Q command line switch) to suppress them.
3.15 Why don't the daily visit totals add up to the monthly total?
You cannot add up the daily visit totals and compare them to the monthly total, because they are different reporting periods. For example, if someone visits your site at 11:45pm and stays until 12:15am, the monthly total would show one visit, while the daily totals will show two (one for each day).
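The midnight example can be worked through in Python. The sketch below assumes, purely for illustration, that each day's total is computed from that day's requests alone and that the visitor requests a page every 10 minutes; the function name and timestamps are made up.

```python
from datetime import datetime, timedelta

TIMEOUT = timedelta(minutes=30)  # default visit time-out


def count_visits(times):
    """Count visits from one site's chronologically ordered page requests."""
    visits, previous = 0, None
    for when in times:
        if previous is None or when - previous > TIMEOUT:
            visits += 1
        previous = when
    return visits


# One visitor browsing across midnight, a page request every 10 minutes.
times = [
    datetime(2005, 5, 24, 23, 45),
    datetime(2005, 5, 24, 23, 55),
    datetime(2005, 5, 25, 0, 5),
    datetime(2005, 5, 25, 0, 15),
]

monthly = count_visits(times)

# Split the same requests by calendar day and count each day on its own.
daily = {}
for when in times:
    daily.setdefault(when.date(), []).append(when)
daily_total = sum(count_visits(day) for day in daily.values())

print(monthly, daily_total)  # 1 2
```

The same four requests produce one visit for the month and two when the days are summed, so the difference is a property of the reporting periods, not a counting error.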
3.16 Why do my reports show more "Sites" than "Visits"?
Visits are only triggered when a valid request is found for a "page", as defined by your "PageType" setting (or a URL that ends with a slash, which is also considered a page type). Sites however, are counted regardless of the request type. It is very common to have more sites than visits, particularly if you host non-pagetype URLs on your site that are linked to from the outside. If you are not hosting URLs that are linked to from outside sites, then make sure your PageType setting is correct. The default is .htm, .html and .cgi extensions, unless you specify otherwise.
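A page test along these lines can be sketched in Python. It follows the rule as stated here (listed extensions, plus any URL ending in a slash) using exact extension matches; the function name is hypothetical and the Webalizer's real matching may differ in detail.

```python
import posixpath
from urllib.parse import urlsplit

DEFAULT_PAGE_TYPES = ("htm", "html", "cgi")  # defaults named in the text


def is_page(url, page_types=DEFAULT_PAGE_TYPES):
    """Return True if the URL counts as a page rather than a page component."""
    path = urlsplit(url).path  # drop any query string
    if path.endswith("/"):
        return True  # a URL ending in a slash is also treated as a page
    extension = posixpath.splitext(path)[1].lstrip(".").lower()
    return extension in page_types


print(is_page("/index.html"))        # True
print(is_page("/images/logo.gif"))   # False
print(is_page("/shop.php"))          # False with the default page types
print(is_page("/shop.php", ("php",)))  # True once .php is listed
```

The last two calls show why a wrong PageType setting inflates the gap between sites and visits: requests for a dynamic site's real pages are counted as sites but never trigger a visit.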
3.17 How can I process multiple logs from a server farm?
I have multiple load-balanced servers (or I'm using DNS round-robin to accomplish the same thing) and I want to generate one webalizer report for the whole farm, but each server generates its own log file. When I run webalizer on each of the logfiles in turn, it ignores a lot of the records because it thinks they're out of order!
You need to merge all of the logfiles together so that webalizer sees the records in chronological order. One good way to do that on the fly is with mergelog, a quick common logfile sorter. An example:
mergelog
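To show what merging on the fly means, here is a small Python sketch using the standard library's heapq.merge. It is not mergelog and makes no claim about how that tool works; the function names and sample records are hypothetical, and it assumes each input log is already in chronological order with CLF style timestamps.

```python
import heapq
from datetime import datetime


def timestamp(line):
    """Parse the [dd/Mon/yyyy:hh:mm:ss +zzzz] field of a CLF record."""
    field = line[line.index("[") + 1:line.index("]")]
    return datetime.strptime(field, "%d/%b/%Y:%H:%M:%S %z")


def merge_logs(*logs):
    """Lazily merge logs that are each already in chronological order."""
    return heapq.merge(*logs, key=timestamp)


server_a = [
    '10.0.0.1 - - [24/May/2005:11:20:39 -0400] "GET /a.html HTTP/1.1" 200 10',
    '10.0.0.1 - - [24/May/2005:11:20:45 -0400] "GET /c.html HTTP/1.1" 200 30',
]
server_b = [
    '10.0.0.2 - - [24/May/2005:11:20:41 -0400] "GET /b.html HTTP/1.1" 200 20',
]

for record in merge_logs(server_a, server_b):
    print(record)
```

Because the merge is lazy, the inputs can just as well be open file objects, and only one pending record per log is held in memory at a time.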
Another method is to simply combine your logs and then sort them into chronological order. Here is a simple shell script that uses the GNU sort utility to sort an already combined log file:
#!/bin/sh
3.18 How can I easily process multiple virtual hosts?
There are many ways to process multiple virtual hosts on the same machine. The easiest way I have found, provided that each host generates its own log file, is as follows:
for i in /etc/webalizer/*.conf; do webalizer -c $i; done
After you have it set up, to add a new host, all you need to do is create a new configuration file and put it in the directory. It will be automatically picked up the next time you run the command.
3.19 I am having problems compiling w/DNS support.
Lots of people have problems compiling DNS support into the Webalizer, which
is why you have to specifically enable it. If you don't really need DNS support,
don't try to compile it in. The vast majority of sites can get by with simply
turning on hostname lookups on their web server, which will do the DNS resolution
for you automatically. If you really need built in DNS support, it's really
quite simple, however different distributions place the required headers and
libraries in different places, so the configure script fails to find them.
In a nutshell, you need the Berkeley DB library, and it needs to be configured
with V1.85 API support. Most Linux distributions already have this library
present. If you get errors about not finding the proper header file (db_185.h),
locate it on your system and create a symbolic link to it in the /usr/include
directory. You will also need to specify the correct library to use for the
header. This may mean that you need to first run the configure script, then
if the compile fails due to unresolved references, edit the Makefile and change
the "-ldb" (or "-ldb1") reference to the correct library, such as -ldb-3.2
for RedHat, or just "-ldb" for Slackware 8. Each distribution either names
the library something different, or puts it in a different location, so you
will have to play around with it on your distro to get it to work. If the
library is in some non-standard location, you may also need to specify its
path using the "--with-dblib=..." switch when you run the configure script.
Bottom line is, if you don't really need it (and most people don't), just
compile without DNS support and let your web server do the name lookups for
you.
Submitted by Victor Brilon:
./configure --enable-dns --with-db=/usr/include/db1
You must have the db1 and the db1-devel RPMs installed in order to do this.
3.20 My stats are not updating, why?
In order to produce or update statistic reports, the Webalizer must be run. In most cases, if your stat report is not being updated, then the program isn't being run. If it is, then it may be encountering a problem during processing that needs to be taken care of. The easiest way to determine this is to manually run the program from the command line and observe the informational messages it produces. Make sure your config file does not prevent messages from being displayed (Quiet and ReallyQuiet options). Without these messages, there is no way to determine what, if any, problems the program may be having, and any attempt to correct the problem would simply be a random guess.
Last modified May 12, 2005 by Bradford L. Barrett
4. Simpleton's Guide to Web Server Analysis
4.1 Hit me Please!
Ok, so you got a web site and you want to know if anybody is looking at it,
and if so, what they are looking at and how many times. Lucky for you, (most)
every web server keeps a log of what it's doing, so you can just go look and
see. The logs are just plain ASCII text files, so any text editor or viewer
would work just fine. Each time someone (using a web browser) asks for one
of your web pages, or any component thereof (known as URLs, or Uniform Resource
Locators), the web server will write a line to the end of the log representing
that request. Unfortunately, the raw logs are rather cryptic for everyday
humans to read. While you might be able to determine if anybody was
looking at your web site, any other information would require some sort of
processing to determine. A typical log entry might look something like the
following:
192.168.45.13 - - [24/May/2005:11:20:39 -0400] "GET /mypage.html HTTP/1.1" 200 117
This represents a request from a computer with the IP address 192.168.45.13
for the URL "/mypage.html" on the web server. It also gives the time and date
the request was made, the type of request, the result code for that request
and how many bytes were sent to the remote browser. There will be a line similar
to this one for each and every request made to the web server over the period
covered by the log. A "Hit" is another way to say "request made to the server,"
so as you may have noticed, each line in the log represents a Hit. If you
want to know how many Hits your server received, just count the number of
lines in the log. And since each log line represents a request for a specific
URL, from a specific IP address, you can easily figure out how many hits you
got for each of your web pages or how many hits you received from a particular
IP address by just counting the lines in the log that contain them. Yes, it
really is that simple. And while you could do this manually with a text editor
or other simple text processing tools, it is much more practical and easier
to use a program specifically designed to analyze the logs for you, such as
the Webalizer. They take the work out of it for you, provide totals for many
other aspects of your server, and allow you to visualize the data in a way
not possible by just looking at the raw logs.
4.2 How does it all work?
Well, to understand what you can analyze, you really should know what
information is provided by your web server and how it gets there. At
the very least, you should know how HTTP (the HyperText Transfer
Protocol) works, and its strengths and weaknesses.
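Before going further, the line counting described in section 4.1 can be made concrete with a short Python sketch. The first log line is the example used in this guide; the other two lines and all variable names are made up for illustration.

```python
from collections import Counter

log_lines = [
    '192.168.45.13 - - [24/May/2005:11:20:39 -0400] "GET /mypage.html HTTP/1.1" 200 117',
    '192.168.45.13 - - [24/May/2005:11:25:02 -0400] "GET /mypage.html HTTP/1.1" 200 117',
    '192.168.45.20 - - [24/May/2005:11:30:10 -0400] "GET /other.html HTTP/1.1" 200 512',
]

hits = len(log_lines)  # one log line per request, so one line per hit

# The requesting address is the first whitespace-separated field.
hits_by_address = Counter(line.split()[0] for line in log_lines)

# The request is the first quoted field: 'GET /mypage.html HTTP/1.1'.
hits_by_url = Counter(line.split('"')[1].split()[1] for line in log_lines)

print(hits)                            # 3
print(hits_by_address.most_common(1))  # [('192.168.45.13', 2)]
print(hits_by_url.most_common(1))      # [('/mypage.html', 2)]
```

Everything here is plain counting of what the log actually records, which is why these particular totals can be trusted.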
At its simplest, a web server just sits there listening on the network
for a web browser to make a request. Once a request is received, the
server processes it and then sends something back to the requesting
browser (and as explained above, the request is logged to a log file).
These requests are typically for some URL, although there are other types
of information a browser can request, such as server type, HTTP protocol
versions supported, modification dates, etc., but those types are
not as common. To visualize the interaction between server, browser and
web pages, let's use an example to illustrate the information flow. Imagine
a simple web page, "mypage.html," which is an HTML web
page that contains two graphic images, "myimage1.jpg," and
"myimage2.jpg." A typical server/browser interaction might go
something like this:
In the web server log, the following lines would be added:
192.168.45.13 - - [24/May/2005:11:20:39 -0400] "GET /mypage.html HTTP/1.1" 200 117
So what can we gather from this exchange? Well, based on what we learned
above, we can count the number of lines in the log file and determine that
the server received 3 hits during the period that this log file covers. We
can also calculate the number of hits each URL received (in this case, 1 hit
each). Along the same lines, we can see that the server received 3 hits from
the IP address 192.168.45.13, and when those requests were received. The two
numbers at the end of each line represent the response code and the
number of bytes sent back to the requestor. The response code is how
the web server indicates how it handled the request, and the codes are defined
as part of the HTTP protocol. In this example, they are all 200, which means
everything went OK. One response code you may be very familiar with is the
all too common "404 - Not Found," which means that the requested URL could
not be found on the server. There are several other response codes defined,
however these two are the most common. And that, in a nutshell, is about all you can accurately determine
from the logs. "But wait!" you might be screaming, "most analysis
programs have lots of other numbers displayed!", and you would be
right. Some more obscure numbers can be calculated, like the number
of different response codes, number of hits within a given time
period, total number of bytes sent to remote browsers, etc. Other
numbers can be implied based on certain assumptions, however those
cannot be considered entirely accurate, and some can even be wildly
inaccurate. Other log formats might be used by a web server as well,
which provide additional information above what the CLF format does,
and those will be discussed shortly. For now, just realize that the
only thing you can really, accurately determine is what IP address
requested which URL, and when it requested that URL, as shown in the
example above.
4.3 The Good, the Bad and the Ugly
So now you have a good grasp of how your web server works and what
information can be obtained from its logs, like number of hits (to
the server and to individual URLs), number of IP addresses making
the requests (and how many hits each IP address made), and when
those requests were made. Given just that information, you can
answer questions such as "What is the most popular URL on my site?",
"What was the next most popular URL?", "What IP address made the
most requests to my server?", and "How busy was my server during
this time period?". Most analysis programs will also make it easy
to answer such questions as "What time of day is my web server the
most active?", or "What day of the week is the busiest?". They
can give you an insight into usage patterns that may not be apparent
by just looking at the raw logs. All of these questions can be
answered with completely accurate answers, based just on the simple
analysis of your web server logs. That's the good news! The bad news? Well, with all the things you can determine by looking at your
logs, there are a lot of things you can't accurately calculate. Unfortunately,
some analysis programs lead you to believe otherwise, and forget to mention
(particularly commercial packages) that these are not much more than assumptions
and cannot be considered at all accurate. Like what, you ask? Well, how about
those things that some programs call 'user trails' or 'paths', that are supposed
to tell you what pages and in what order a user traveled through your site.
Or how about the length of time a user spends on your site. Another less than
accurate metric would be that of 'visits', or how many users 'visited' your
site during a given time period. All of these cannot be accurately calculated,
for a couple of different reasons. Let's look at some of them:
The HTTP protocol is stateless
In a typical computer program that you run on your own machine, you can
always determine what the user is doing. They log in, do some stuff,
and when finished, they log out. The HTTP protocol however is different.
Your web server only sees requests from some remote IP address. The
remote address connects, sends a request, receives a response and then
disconnects. The web server has no idea what the remote side is doing
between these requests, or even what it did with the response sent to
it. This makes it impossible to determine things like how long a user
spends on your site. For example, if an IP address makes a request to
your server for your home page, then 15 minutes later makes a request
for some other page on your site, can you determine how long the user
had been at your site? The answer is of course, no. Just because
15 minutes expired between requests, you have no idea what the remote
address was doing between those two requests. They could have hit your
site, then immediately gone somewhere else on the web, only to come back
15 minutes later to request another page. Some analysis packages will
say that the user stayed on your site for at least 15 minutes plus some
'fudge' time for viewing the last page requested (like 5 minutes or so).
This is actually just a guess, and nothing more.
You cannot determine individual users
Web servers see requests and send results to IP addresses only. There
is no way to determine what is at that address, only that some
request came from it. It could be a real person, it could be some
program running on a machine, or it could be lots of people all using
the same IP address (more on that below). Some of you will note that
the HTTP protocol does provide a mechanism for user authentication,
where a username and password are required to gain access to a web site
or individual pages. And while that is true, it isn't something that
a normal, public web site uses (otherwise it wouldn't be public!). As
an example, say that one IP address makes a request to your server, and
then a minute later, some other IP address makes a request. Can you
say how many people visited your site? Again, the answer is no.
One of those requests may have come from a search engine 'spider', a
program designed to scour the web looking for links and such. Both
requests could have been from the same user, but at different addresses.
Some analysis programs will try to determine the number of users based
on things like IP address plus browser type, but even so, these are
nothing more than guesses made on some rather faulty assumptions.
Network topology makes even IP addresses problematic
In the good old days, every machine that wanted to talk on the Internet
had its own unique IP address. However, as the Internet grew, so did
the demand for addresses. As a result, several methods of connecting to
the Internet were developed to ease the addressing problem. Take, for
example, a normal dial-up user sitting at home. They call their service
provider, the machines negotiate the connection, and an IP address is
assigned from a re-usable 'pool' of IP addresses that have been assigned
to the provider. Once the user disconnects, that IP address is made available
to other users dialing in. The home user will typically get a different
IP address each time they connect, meaning that if for some reason they
are disconnected, they will re-connect and get a new IP address. Given
this situation, a single user can appear to be at many different IP addresses
over a given time. Another typical situation is in a corporate environment,
where all the PCs in the organization use private IP addresses to talk
on the network, and they connect to the Internet through a gateway or
firewall machine that translates their private address to the public one
the gateway/firewall uses. This can make all the users within the organization
appear as if they were all using the same IP address. Proxy servers are
similar, where there can be thousands of users, all appearing to come
from the same address. Then there are reverse-proxy servers, typical of
many large providers such as AOL, that can make a single machine appear
to use many different IP addresses while they are connected (the reverse-proxy
keeps track of the addresses and translates them back to the user). Given
this situation, can you say how many users visited your site if your logs
show 10 requests from the same IP address over an hour? Again, the answer
is no. It could have been the same user, or it could have been multiple
users sitting behind a firewall. Or how about if your logs show 10 requests
from 10 different IP addresses? Think it was from 10 different users?
Of course not. It could have been 10 different users, could have been
a couple of users sitting behind a reverse proxy, could have been one
or more users along with a search engine 'spider', or it could be any
combination of them all.
4.4 But wait, there's more!
OK, so what have we learned here? Well, in short, you don't know who or what
is making requests to your server, and you can't assume that a single IP address
is really a single user. Sure, you can make all kinds of assumptions and guesses,
but that is all they really are, and you should not consider them at all accurate.
Take the following example: IP address A makes a request to your server, 1
minute later, IP address B makes a request, and then 10 minutes later, address
A makes another request. What can we determine from that sequence? Well, we
can assume that two users visited. But what if address A was that of a firewall?
Those two requests from address A could have been two different users. What
if the user at address A got disconnected and dialed back in, getting a different
address (address B) and someone else dialed in at the same time and got the
now free address A? Or maybe the user was sitting behind a reverse-proxy,
and all three requests were really from the same user. And can we tell what
'path' or 'trail' these users took while at the web site or how long they
remained? Hopefully, you should now see that the answer to all these things
is a big resounding No, we can't! Without being able to identify individual
unique users, there is no way to tell what an individual unique user does.
Other metrics you CAN determine
Now that you have seen what is possible, you may be thinking that there
are some other things these programs display, and wondering about how
accurate they might be. Hopefully, based on what you have already seen
thus far, you should be able to figure them out on your own. One such
metric is that of a 'page' or 'page view'. As we already know, a web
page is made up of an HTML text document and usually other elements
such as graphic images, audio or other multimedia objects, style sheets,
etc. One request for a web page might generate dozens of requests for
these other elements, but a lot of people just want to know how many
web pages were requested without counting all the stuff that makes them up.
You can get this number, if you know what type of files you may consider
a 'page'. In a normal server, these would be just the URLs that end with
a .htm or .html extension. Perhaps you have a dynamic site, and your web
pages use an .asp, .pl or .php extension instead. You obviously would
not want to count .gif or .jpg images as pages, nor would you want to
count style sheets, Flash graphics and other elements. You could go
through the logs and just count up the requests for whatever URL meets
your criteria for a 'page', but most analysis programs (including the
Webalizer) allow you to specify what you consider a page and will
count them up for you.
Other information
Up to now, we have just discussed the CLF (Common Log Format) log format.
There are others. The most common is called 'combined', and takes the basic
CLF format and adds two new pieces of information. Tacked on the end is the
'user agent' and 'referrer'. A user agent is just the name of the browser
or program being used to generate the request to the web server. The 'referrer'
is supposed to be the page that referred the user to your web server. Unfortunately,
both of these can be completely misleading. The user agent string can be set
to anything in some modern browsers. One common trick for Opera users is to
set their user agent string to that of MS Internet Explorer so they can view
sites that only allow MSIE visitors. And the referrer string, according to
the standards document (RFC) for the HTTP protocol, may or may not be sent
at the browser's choosing, and if sent, does not have to be accurate or even
informative. The Apache web server (which is the most common on the Internet)
allows other things to be logged, such as cookie information, length of time
to handle the request and lots of other stuff. Unfortunately, the inclusion
and placement of this information in the server logs are not standard. Another
format, developed by the W3C (World Wide Web Consortium), allows log records
to be made up of many different pieces of information, and their location
can be anywhere in the log entry with a header record needed to map them.
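
To make the idea of a header record concrete, here is a minimal Python sketch of reading a W3C-style log in which a "#Fields:" line names the columns that follow. The function name and the sample lines are made up for illustration, and the sketch assumes simple whitespace-separated values (it does not handle quoted strings that contain spaces).

```python
def parse_w3c_extended(lines):
    """Yield one dict per log entry, keyed by the names in the latest #Fields header."""
    fields = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith("#"):
            # Directive lines such as "#Fields: date time c-ip" describe the layout.
            name, _, value = line[1:].partition(":")
            if name.strip().lower() == "fields":
                fields = value.split()
            continue
        if fields:
            # Map each value to its column name using the header seen so far.
            yield dict(zip(fields, line.split()))


# Hypothetical sample: the second #Fields line changes the column order.
SAMPLE = [
    "#Version: 1.0",
    "#Fields: date time c-ip cs-method cs-uri-stem sc-status",
    "2005-05-24 11:20:40 192.168.45.13 GET /myimage1.jpg 200",
    "#Fields: c-ip sc-status cs-uri-stem",
    "192.168.45.13 200 /myimage2.jpg",
]
entries = list(parse_w3c_extended(SAMPLE))
```

Because the columns are looked up by name, the same code reads both entries even though the URL sits in a different position in each.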
Some analysis programs handle these and other formats better than others.
Analysis techniques
The only true way to get an accurate picture of what your web server is doing
is to look at its logs. This is how most of the analysis packages out there
get their information, and is the most accurate. Other methods can be used,
with different results. One common method, which was widely popular for a
while, was the use of a 'page counter'. Basically, it was a dynamic bit included
in a web page that incremented a counter and displayed its value each time
the page was requested. Normally, it was included in the page as if it were
a standard image file. One problem with this method was that you had to include
a different 'image' file for each page you wanted to track. Another problem
occurred if the remote user had image display turned off in their browser,
or could not display images at all (such as in a text based web browser).
You could also easily inflate the number by just hitting the 'reload' button
on your browser over and over again. Similar methods were developed using
Java and JavaScript, in an attempt to get even more information about the
visiting browser, such as screen resolution and operating system type. Of
course, these can easily be circumvented as well. Some companies set up systems
that claim to track your server usage remotely, by including an image or JavaScript
element on your site which would then contact the company's system each time
the image or JavaScript element was requested. These all have the same problems
and limitations. In all of these, you can simply turn off images and Java/JavaScript
and then browse the web site completely uncounted and unseen (except in the
web server logs). Beware of these types of counters and remote usage sites;
they are not quite as accurate as they may lead you to believe.
Conclusion
It should now be obvious that there are only certain things you can determine
from a web server log. There are some completely accurate numbers you can
generate without question. And then, there are some wildly inaccurate and
misleading numbers you can garner depending on what assumptions you make.
Want to know how many requests generated a 404 (not found) result? Go right
ahead and count them up and be completely confident with the number you get.
Want to know how many 'users' visited your web site? Good luck with that one.
Unless you go 'outside the logs', it will be a hit-or-miss stab in the dark.
But now you should have a good idea of what is and isn't possible, so when
you look at your usage report, you will be able to determine what the numbers
mean and how much to trust them. You should also now see that a lot can depend
on how the program is configured, and that the wrong configuration can lead
to wrong results. Take the example of 'pages'. If your analysis software
thinks that only URLs with a .htm or .html extension are pages, and all you
have are .php pages on your site, that number will be completely wrong. Not
because the program is wrong, but because someone told it the wrong information
to base its calculations on. Remember, knowledge is power, so now you have
the power to ask the proper questions and get the proper results. The next
time you look at a server analysis report, hopefully you will see it in a
different light given your newfound knowledge.
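
The contrast between a number you can trust (404 responses) and one that depends on configuration (pages) can be sketched in a few lines of Python. This is only an illustration, not how the Webalizer or any other package is implemented. The function name, the regular expression, and the first and last sample entries are assumptions made for the example.

```python
import re
from urllib.parse import urlsplit

# Matches the quoted request and the status code of a CLF or combined entry.
REQUEST_RE = re.compile(r'"(?P<method>\S+) (?P<url>\S+)[^"]*" (?P<status>\d{3}) ')


def summarize(lines, page_extensions=(".htm", ".html")):
    """Count hits, 404 responses, and requests whose URL path has a 'page' extension."""
    totals = {"hits": 0, "not_found": 0, "pages": 0}
    for line in lines:
        match = REQUEST_RE.search(line)
        if match is None:
            continue  # skip lines that do not look like log entries
        totals["hits"] += 1
        if match["status"] == "404":
            totals["not_found"] += 1
        # Ignore any query string when checking the extension.
        path = urlsplit(match["url"]).path.lower()
        if path.endswith(tuple(page_extensions)):
            totals["pages"] += 1
    return totals


# The first and last entries are hypothetical; the middle two follow the sample entries.
LOG = [
    '192.168.45.13 - - [24/May/2005:11:20:39 -0400] "GET /index.php HTTP/1.1" 200 1043',
    '192.168.45.13 - - [24/May/2005:11:20:40 -0400] "GET /myimage1.jpg HTTP/1.1" 200 231',
    '192.168.45.13 - - [24/May/2005:11:20:41 -0400] "GET /myimage2.jpg HTTP/1.1" 200 432',
    '192.168.45.13 - - [24/May/2005:11:20:45 -0400] "GET /favicon.ico HTTP/1.1" 404 209',
]
default_totals = summarize(LOG)
php_totals = summarize(LOG, page_extensions=(".php",))
```

With the default extensions this log reports zero pages, while telling the function that .php is a page extension reports one. The 404 count is the same either way, because it does not rest on any assumption.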
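# Sort a CLF or combined format log file chronologically by its timestamp.
if [ ! -f "$1" ]; then
    echo "Usage: $0 logfile"
    exit 1
fi
echo "Sorting $1"
# Field 4 looks like [24/May/2005:11:20:40, so the keys are, in order:
# year, month name, day, hour, minute, second.
sort -t ' ' -k 4.9,4.12n -k 4.5,4.7M -k 4.2,4.3n -k 4.14,4.15n -k 4.17,4.18n -k 4.20,4.21n "$1" > "$1.sorted"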
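
For comparison, the same ordering can be sketched in Python by parsing the bracketed timestamp instead of sorting on character positions. The function names here are made up for the example, and the sketch assumes English month abbreviations. Unlike the sort command above, which only looks at field 4, it also takes the time zone offset into account.

```python
import re
from datetime import datetime

# Captures the text between the square brackets, e.g. 24/May/2005:11:20:40 -0400
TIMESTAMP_RE = re.compile(r"\[([^\]]+)\]")


def entry_time(line):
    """Return the timestamp of a log entry as a timezone-aware datetime."""
    match = TIMESTAMP_RE.search(line)
    if match is None:
        raise ValueError(f"no timestamp in log line: {line!r}")
    # %b assumes English month abbreviations (the default C locale).
    return datetime.strptime(match.group(1), "%d/%b/%Y:%H:%M:%S %z")


def sort_log_lines(lines):
    """Return the lines in chronological order; ties keep their input order."""
    return sorted(lines, key=entry_time)


unsorted_lines = [
    '192.168.45.13 - - [24/May/2005:11:20:41 -0400] "GET /myimage2.jpg HTTP/1.1" 200 432',
    '192.168.45.13 - - [24/May/2005:11:20:40 -0400] "GET /myimage1.jpg HTTP/1.1" 200 231',
]
sorted_lines = sort_log_lines(unsorted_lines)
```

After sorting, the request for /myimage1.jpg comes first, since it was logged one second earlier.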
192.168.45.13 - - [24/May/2005:11:20:40 -0400] "GET /myimage1.jpg HTTP/1.1" 200 231
192.168.45.13 - - [24/May/2005:11:20:41 -0400] "GET /myimage2.jpg HTTP/1.1" 200 432