COP 2344 (Shell Scripting) Project #5
Process Apache Web Server Access Log

 

Due: by the start of class on the date shown on the syllabus

Description:

In this project we will learn how to extract useful information from log files.  The Apache web server maintains an access log that records each request made to the website.  Each entry includes some of the HTTP/1.1 request headers.  These include the Referer header, which holds the exact URL of the page the user was on when following a link to your site.

Users often find our websites through the Google search engine.  When a user goes through Google and clicks a result to visit our site, the referring URL includes the search phrase used (in the query string).  It can be very useful to know what search phrase was used to find a website.

In this project you will create a pipeline of filter commands to extract the Google search phrase from an Apache access log.  You can use any filter commands you wish.

The data is in the file /var/log/httpd/access.log (on Linux systems), with one log entry per line.

Since this file is restricted to root access on YborStudent, a readable copy can be found (on YborStudent) at ~wpollock/access.log.  Use this file for your project.

Requirements:

Create a one-liner (a shell pipeline or grouped command) that shows each Google search phrase used, along with a count of how many times that phrase was used, in order from most to least frequent.  You must use the access.log provided.  It is up to you to decide whether to use case-sensitive or case-insensitive matching of search terms.

Additional Notes:

Examine the log file.  First, develop a command that shows only the log entries (lines) with Google searches.  Such lines contain both “google.” and “search” in the URL before the query string starts.
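One possible way to approach this first step is sketched below.  It is only a sketch, not the model solution: the names sample_log and google_lines are made up for illustration, and two of the three sample entries are invented test data.  The pattern assumes the search URL appears as a double-quoted field that starts with “http://” or “https://”.

```shell
#!/bin/sh
# sample_log: print three log entries to experiment with.
# (Hypothetical test data; only the first entry is modeled on a real one.)
sample_log() {
    cat <<'EOF'
72.158.245.66 - - [24/Feb/2007:08:15:49 -0500] "GET /2005/artists/gustavo_matamoros.html HTTP/1.1" 200 6267 "http://www.google.com/search?tab=vw&hl=en&q=gustavo%20matamoros" "Mozilla/4.0 (compatible; MSIE 6.0)"
10.0.0.1 - - [24/Feb/2007:08:16:02 -0500] "GET /index.html HTTP/1.1" 200 1024 "-" "Mozilla/5.0"
10.0.0.2 - - [24/Feb/2007:08:17:30 -0500] "GET /find?q=google.com+search HTTP/1.1" 200 512 "http://example.com/links.html" "Mozilla/5.0"
EOF
}

# google_lines: keep only lines holding a quoted URL in which both
# "google." and "search" appear before the "?" that starts the query string.
google_lines() {
    grep -E '"https?://[^"?]*google\.[^"?]*search[^"?]*\?'
}

sample_log | google_lines
```

Only the first sample entry should be printed.  The third entry mentions “google.com” and “search” too, but only inside the requested page's own query string, so it is rejected.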

Next you need to extract the one field of interest from these lines.  Examine a typical line from the access log (note this is a single line in the log; it is wrapped here, and each continuation line is marked with “➥”):

72.158.245.66 - - [24/Feb/2007:08:15:49 -0500]
➥ "GET /2005/artists/gustavo_matamoros.html HTTP/1.1" 200 6267
➥ "http://www.google.com/search?tab=vw&hl=en&q=gustavo%20matamoros"
➥ "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; InfoPath.1;
➥ .NET CLR 1.1.4322; .NET CLR 2.0.50727)"

Notice which field contains the search URL.
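To see how the entry breaks into fields, it may help to split it on the double quote character.  The sketch below does this with awk; the function name quoted_fields is made up for illustration, and the entry is the example above rejoined into one line (assuming the wrapped pieces were separated by single spaces).  It also assumes no field contains an embedded quote mark.

```shell
#!/bin/sh
# The example entry, rejoined into a single line.
entry='72.158.245.66 - - [24/Feb/2007:08:15:49 -0500] "GET /2005/artists/gustavo_matamoros.html HTTP/1.1" 200 6267 "http://www.google.com/search?tab=vw&hl=en&q=gustavo%20matamoros" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; InfoPath.1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)"'

# quoted_fields: with '"' as the field separator, the even-numbered
# fields are the quoted strings.  Print each one with its field number.
quoted_fields() {
    awk -F'"' '{ for (i = 2; i <= NF; i += 2) print i ": " $i }'
}

printf '%s\n' "$entry" | quoted_fields
```

The field number printed next to the Google URL is the one to select with awk (for example, with a print of that single field).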

It helps to know something about URLs.  In the example above, the URL is this part of the log entry:

  http://www.google.com/search?tab=vw&hl=en&q=gustavo%20matamoros

Note it is surrounded by double quote marks in the log entry.  A URL has many parts (most are optional and may be missing): the protocol part is “http://”, the host part is “www.google.com”, the path part is “/search”, and the query string is everything after the question mark (“?”), or “tab=vw&hl=en&q=gustavo%20matamoros”.  The search terms the user entered into Google are the part of the query string after “q=”, up to either the end of the URL (marked by the closing quote mark) or the next part of the query string (marked by an ampersand): “gustavo%20matamoros”.  Note this string is encoded; the search terms are “gustavo matamoros”.
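The parts named above can be pulled out of the example URL with ordinary shell parameter expansion.  This is just an illustration of the URL's structure (the variable names are made up), not a step required by the project:

```shell
#!/bin/sh
url='http://www.google.com/search?tab=vw&hl=en&q=gustavo%20matamoros'

protocol=${url%%://*}       # text before "://"
rest=${url#*://}            # everything after "://"
host=${rest%%/*}            # up to the first "/"
path_query=/${rest#*/}      # path plus query string
path=${path_query%%\?*}     # text before the "?"
query=${path_query#*\?}     # text after the "?"

printf 'protocol: %s\nhost:     %s\npath:     %s\nquery:    %s\n' \
    "$protocol" "$host" "$path" "$query"
```

Note this simple approach assumes the URL has both a path and a query string; a URL missing either part would need extra checks.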

After extracting this field, you need to extract the search phrase.  To do so it is helpful to know that this data is URL encoded (also called “percent encoding”).  Briefly, the data passed in a URL is called a query string.  It follows a question mark.  Each data item is separated with an ampersand (“&”).  Each item is of the form name=value.

Google puts the search phrase in an item with the name “q”.  You need to extract the value of this item (everything after the equals sign, up to the end of the item).  Note the end of the last item is marked not by an ampersand but by the closing quote mark.
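One way to isolate the “q” item, among several, is to break the URL into one item per line and keep the line that starts with “q=”.  The sketch below assumes its input is the URL field only (not the whole log entry); the function name q_value and the second test URL, including its “aq” item, are made up for illustration.

```shell
#!/bin/sh
# q_value: read URLs on standard input and print the value of the "q" item.
# Turning "?", "&", and '"' into newlines puts every name=value item on its
# own line, so "q=" can be matched at the start of a line.  This avoids
# matching items whose names merely end in "q".
q_value() {
    tr '&?"' '\n\n\n' | sed -n 's/^q=//p'
}

printf '%s\n' 'http://www.google.com/search?tab=vw&hl=en&q=gustavo%20matamoros' | q_value
printf '%s\n' 'http://www.google.com/search?q=reef+sponge&aq=f' | q_value
```

The value printed is still URL encoded at this point.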

Next, you still have to decode the result to know the search phrase.  In URL encoding, spaces are translated to plus signs (“+”) or to “%20”.  Only a subset of ASCII is allowed in a URL.  The remaining characters are represented by a percent symbol (“%”) followed by the ASCII value of the character (as a pair of hex digits).  So the search phrase “Hymie Piffl” will appear as “...q=Hymie+Piffl”, and “$10 + 7% tax” will appear as “...q=%2410+%2B+7%25+tax”.

While you can perform URL decoding fairly easily in Perl or Python, for this assignment you can take a shortcut and convert all occurrences of “percent<hex-digit><hex-digit>”, plus signs, and even any quote marks ("), to a space.  This is because in this application, we only care about the words used in the search phrase.
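The shortcut might be sketched with sed as shown below.  The function name rough_decode is made up, and squeezing runs of spaces and trimming the ends are extra steps added here (an assumption, not a stated requirement) so that phrases differing only in spacing compare as equal.

```shell
#!/bin/sh
# rough_decode: replace every %XX sequence, plus sign, and double quote
# with a space, then squeeze repeated spaces and trim both ends.
# This is NOT real URL decoding; encoded punctuation is simply lost.
rough_decode() {
    sed -e 's/%[0-9A-Fa-f][0-9A-Fa-f]/ /g' \
        -e 's/[+"]/ /g' \
        -e 's/  */ /g' \
        -e 's/^ //' -e 's/ $//'
}

printf '%s\n' 'gustavo%20matamoros' 'Hymie+Piffl' '%2410+%2B+7%25+tax' | rough_decode
```

With this shortcut the encoded phrase “%2410+%2B+7%25+tax” comes out as just the words “10 7 tax”, which is enough when only the words matter.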

Finally when you have the list of search phrases, you can do the usual “dance” to generate the most frequently used:  sort, count duplicates, and sort the result in reverse numerical order.
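The “dance” looks like this on a small made-up list of phrases (the function name count_phrases is hypothetical; the phrases are sample data, not results from the log):

```shell
#!/bin/sh
# count_phrases: read one phrase per line; print each distinct phrase
# preceded by its count, most frequent first.
# uniq -c only merges adjacent duplicates, so the input is sorted first.
count_phrases() {
    sort | uniq -c | sort -rn
}

printf '%s\n' 'hawk radio' 'reef sponge' 'hawk radio' \
    'moving pitchers' 'reef sponge' 'hawk radio' | count_phrases
```

If you choose case-insensitive matching, one option is to fold the phrases to a single case (for example with tr) before the first sort.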

As a hint, here are the top results found with the model solution, along with the number of distinct phrases and the total number of searches:

$ accesslog.sh |head -n 5
     30 hawk radio
     18 reef sponge
     17 moving pitchers
     16 the hawk radio
     14 reef sponges

$ accesslog.sh |wc -l
214

$ accesslog.sh |awk '{sum += $1}; END {print sum}'
502

To be turned in:

A copy of your pipeline, and the results of running it against the log file supplied.

You can type or send as email to .  Please see your syllabus for more information about submitting projects.