In this project we will learn how to extract useful information from log files. The Apache web server maintains an access log that records the requests made to that website. Each entry includes some of the HTTP 1.1 headers, including the exact URL the user followed to find your site (the Referer header).
Users often reach our websites through the Google search engine. When a user runs a Google search and clicks a result link to visit our site, the referring URL includes the search phrase that was used (in its query string). It can be very useful to know what search phrases were used to find a website.
In this project you will create a pipeline of filter commands to extract the Google search phrase from an Apache access log. You can use any filter commands you wish.
The data is in the Apache access log file (on Linux systems), with one log entry per line. Since this file is restricted to root access on YborStudent, a readable copy, /home/wpollock/access.log, can be found on YborStudent. Use this file for your project.
Create a one-liner (a shell pipeline or grouped command)
that shows the Google search phrase used, and a count of
how many times each search phrase was used, in
order of most to least frequent.
You must use the
It is up to you to decide if you wish to use case-sensitive
or case-insensitive matching of search terms.
Examine the log file. First, develop a command that shows only the log entries (lines) with Google searches. Such lines contain both “google.” and “search” in the URL before the query string starts.
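One way to sketch this first step (matching case-insensitively; a here-document stands in for the real log file here, so the example is self-contained):

```shell
# Keep only lines whose referrer mentions both "google." and "search".
# The second sample line is not a Google search and is filtered out.
grep -i 'google\..*search' <<'EOF'
1.2.3.4 - - [24/Feb/2007:08:15:49 -0500] "GET /a.html HTTP/1.1" 200 99 "http://www.google.com/search?q=hawk+radio" "Mozilla/4.0"
5.6.7.8 - - [24/Feb/2007:08:16:02 -0500] "GET /b.html HTTP/1.1" 200 99 "http://example.com/" "Mozilla/4.0"
EOF
```

In your pipeline you would read the log file instead of a here-document, and dropping the “-i” option is one way to make the matching case-sensitive.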
Next you need to extract the one field of interest from these lines. Examine a typical line from the access log:
126.96.36.199 - - [24/Feb/2007:08:15:49 -0500] "GET /2005/artists/gustavo_matamoros.html HTTP/1.1" 200 6267 "http://www.google.com/search?tab=vw&hl=en&q=gustavo%20matamoros" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; InfoPath.1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)"
Notice which field contains the search URL.
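In this example line, the referrer URL is the fourth field when the line is split on double quote marks, so one possible sketch (not the only approach) uses awk:

```shell
# Split each line on double quotes; field 4 is then the referrer URL.
awk -F'"' '{ print $4 }' <<'EOF'
126.96.36.199 - - [24/Feb/2007:08:15:49 -0500] "GET /2005/artists/gustavo_matamoros.html HTTP/1.1" 200 6267 "http://www.google.com/search?tab=vw&hl=en&q=gustavo%20matamoros" "Mozilla/4.0 (compatible; MSIE 6.0)"
EOF
# prints: http://www.google.com/search?tab=vw&hl=en&q=gustavo%20matamoros
```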
It helps to know something about URLs. In the example above, the URL is this part of the log entry:

http://www.google.com/search?tab=vw&hl=en&q=gustavo%20matamoros

Note it is surrounded with double quote marks in the log entry. A URL has many parts (most are optional and may be omitted): the protocol part is “http:”, the host part is “www.google.com”, the path part is “/search”, and the query string is everything after the question mark: “tab=vw&hl=en&q=gustavo%20matamoros”. The search terms entered by the user to Google are the part of the query string after “q=” up to the end of the URL (marked by the close quote mark) or the next part of the query string (marked by an ampersand): “gustavo%20matamoros”. Note this string is encoded; the search terms are actually “gustavo matamoros”.
After extracting this field, you need to extract the search phrase. To do so it is helpful to know that this data is URL encoded (also called “percent encoding”). Briefly, the data passed in a URL is called a query string. It follows a question mark. Each data item is separated with an ampersand (“&”). Each item is of the form name=value.
Google puts the search phrase in an item with the name “q”. You need to extract the value of this item (everything from the equals sign to the end of the item). Note the end of the last item is marked not by an ampersand but by the closing quote mark.
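One hedged sketch of this extraction step, using grep's “-o” option (a GNU extension) to keep only the matching part, and sed to strip the leading delimiter and name (the input is the example URL from above):

```shell
# Match "?q=" or "&q=" plus everything up to the next "&" or a quote mark,
# then drop the leading "?q=" or "&q=".
echo 'http://www.google.com/search?tab=vw&hl=en&q=gustavo%20matamoros' |
grep -o '[?&]q=[^&"]*' |
sed 's/^[?&]q=//'
# prints: gustavo%20matamoros
```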
Finally you still have to decode the result to know
the search phrase.
In URL encoding, spaces are translated to plus signs (“+”). Only a subset of ASCII is allowed in a URL. The remaining characters are represented by a percent symbol (“%”) followed by the ASCII value of the character (as a pair of hex digits). So the search phrase “Hymie Piffl” will appear as “Hymie+Piffl”, and “$10 + 7% tax” will appear as “%2410+%2B+7%25+tax”.
While you can perform URL decoding fairly easily in Perl or Python, for this assignment you can take a short-cut and convert all occurrences of percent escapes (“%” followed by two hex digits) and plus signs, and even any quote marks (&quot;), to a space. This is because in this application, we only care about the words used in the search phrase.
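For example, this sed command (one possible sketch of the short-cut; the exact expressions are up to you) turns percent escapes, plus signs, and quote marks into spaces:

```shell
# Replace each "%" plus two hex digits, each "+", and each '"' with a space.
echo 'gustavo%20matamoros' |
sed -e 's/%[0-9A-Fa-f][0-9A-Fa-f]/ /g' -e 's/+/ /g' -e 's/"/ /g'
# prints: gustavo matamoros
```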
Finally when you have the list of search phrases, you can do the usual “dance” to generate the most frequently used: sort, count duplicates, and sort the result in reverse numerical order.
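That “dance” might look like this (the phrases here are made-up sample input, not real log data):

```shell
# sort groups identical phrases together, uniq -c counts each group,
# and sort -rn orders the counts from highest to lowest.
printf '%s\n' 'hawk radio' 'reef sponge' 'hawk radio' |
sort | uniq -c | sort -rn
# the most frequent phrase ("hawk radio", count 2) comes out first
```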
As a hint, here are the top two results found with the model solution:

$ accesslog.sh | head -n 3
parsing /home/wpollock/access.log...
     30 hawk radio
     18 reef sponge
Turn in a copy of your pipeline, and the results of running it against the supplied log file.
You can type it in, or send it as email to . Please see your syllabus for more information about submitting projects.