COP 2805 (Java II) Project
Building a Search Engine, Part I:  The UI

Due: by the start of class on the date shown on the syllabus

Background:

In computer science, an inverted index is an index data structure storing a mapping from content, such as words or numbers, to its locations in a document or set of documents (or database).  The purpose of an inverted index is to allow fast, full-text searches.  It is the most popular data structure used in document retrieval systems, used on a large scale (for example) in search engines.

Simple Inverted Indexes

There are two main variants of inverted indexes.  The simplest is an inverted file index (or just inverted file).  This type of index contains a list of references to documents, for each word.  For example, indexing these three (very short) documents:

Example Documents
FileContents
File0.txt it is what it is
File1.txt what is it
File2.txt it is a banana

results in the following inverted file index:

Inverted File Index for Example Documents
Word Set of Files containing word
a {2}
banana {2}
is {0, 1, 2}
it {0, 1, 2}
what {0, 1}

Typically, a small integer is used to represent each file.  So “0”, “1”, and “2” in this example refer to the documents “File0.txt”, “File1.txt”, and “File2.txt”.  The curly braces denote “sets”.  As you can see, the index is nothing more than a list of these sets, with one set for each word that appears in at least one document.  (The list of file names is part of the index, but not shown here.  This additional list (or array) is used to convert the numbers back to file names.) 

An AND search for the terms “what”, “is” and “it” would give the set (“∩” means set intersection):

{0,1} ∩ {0,1,2} ∩ {0,1,2} = {0,1}

In this example, the word “what” appears in files 0 and 1, the words “is” and “it” both appear in all three files.  With an AND search, we need the set of documents that have all three terms.  All three search terms appear only in documents 0 and 1.

An OR search for the same terms would give {0, 1, 2} (that is, all three documents).  With an OR search, the results include the documents that include at least one of the search terms.

An AND search for the terms “banana” and “what” results in no files found (we say the number of hits was zero).  There are no files that contain both of these words.

Full Inverted Indexes

A full inverted index is the same as the simple inverted index, but additionally contains the positions of each word within a document.  This offers more functionality (like phrase searches), but needs more time and space to be created.

The number pairs are the document numbers and the position of the word within that document, starting with zero for the first word).  So the entry:

banana: {(2, 3)}

means the word “banana” is in document 2 (File2.txt), and it is the fourth word in that document (position 3).  With the full inverted index, instead of a set of document numbers for each word, you have a set of pairs (document number plus position) for each word.  Here is the full inverted index for the three sample files:

Full Inverted Index for Example Documents
Word Set of (File,Position) Pairs
a {(2, 2)}
banana {(2, 3)}
is {(0, 1), (0, 4), (1, 1), (2, 1)}
it {(0, 0), (0, 3), (1, 2), (2, 0)}
what {(0, 2), (1, 0)}

As you can see, the full index is a list (probably sorted) of sets, one set per word; each set contains pairs (document number + position).  Real-world data structures are often complicated like this, with lists of sets of whatever.

For AND and OR searches, the position doesn't matter and can be ignored.  (Doing AND and OR searches is fairly straight-forward, and is left as an exercise for the reader.)

If we run a phrase search for “what is it”, we get “hits” for all three search terms in File0.txt and File1.txt.  But the terms occur consecutively only in File1.txt, so a PHRASE search should only show that one document.  How can you use a full inverted index to perform a PHRASE search?  There are several ways;  here's one way:

You need to produce the set of documents that contain all the search terms, in order.  You can determine this by first creating a result set of all the document+position pairs that contain the first word, in any position.  Only these documents can possibly contain the whole phrase.  (Note that a word such as “it” appears in two places in File0.txt, so either might be the start of the phrase.  You need to list both pairs in the result set.)  The result set at this point shows the pairs for “what” looks like this:

{(0, 2), (1, 0)}

This list tells us that only documents zero or one could possibly contain the phrase “what is it”.  Now we have to see which of these (if any) is followed by the next word in the phrase.  That is, we need to check if document zero contains the word “is” in position 3, and if document one contains that word in position 1.

Create a second result set that is initially empty.  For each document+position pair in the first result set, check if the next word (“is” in our example) appears in the same document, but in the following position.  If not, that entry can't possibly indicate the phrase in question.  But if so, you add the document+position pair for the second word to the second result set.  In the example phrase search, you need to see if document zero contains the word “is” at position 3, and if document one contains that word at position 1.  The new result set looks like this:

{(1, 1)}

As you can see, only document one can possible contain the phrase: it has “what” at position 0 and “is” at position 1.  Once the new result set has been built, you no longer need the original one.

Repeat the previous steps for each word in the phrase, until you either run out of words (and the resulting set of documents are just those that contain the phrase) or until you get an empty result set (in which case, the phrase doesn't appear in any of the indexed documents).  In our example, we need to see if document one contains the word “it” at position 2.  After this, the new result set looks like this:

{(1, 2)}

Having run out of words in our phrase, we are done!  Document one contains the phrase in positions 0 through 2.

Search Engine Projects Description:

This is a group project.  It is also a real-world, complex application.  It is my hope that when done, you will understand files, collections, and the process of software development much more fully than you would by reading the textbook alone.  You must kept your project as simple as possible, with any “extras” saved for a later version (if any).  Keep the user interface plain and simple.  Time management will be critical for success: set milestones for your project, with due dates.  (For example, milestone 1 might be to read and write the index file, milestone 2 might be the basic user interface, milestone 3 might be OR searching, and so on.)

Students should form their own groups of no more than four students per group (unless permission to form a larger group is given).  It will be up to you to determine how to organize and manage your group, and when, where, how, and how often to meet.

Build a simple GUI search engine that can do AND, OR, and PHRASE searches on a small set of text files.  The user should be able to say the type of search to do, and enter some search terms.  The results should be a list of files that match the search.  This should be a stand-alone application but not an applet.

It should be easy to add and remove files (from the set of indexed files), and to regenerate the index anytime.  When starting, your application should check if any of the files have been changed or deleted since the application last saved the index.  If so, the user should be able to have the inverted index file(s) updated.

(Note that with HTML or Word documents, you would need to extract a plain text version before indexing.  In this project, all the files are plain ASCII text.)

The inverted index must be stored in one or more file(s), and that file should be read whenever your application starts.  The file(s) should be updated (or recreated) when you add, update, or remove documents from your set (of indexed documents).  The file format is up to you, but should have a format that is fast and simple to search.  However, to keep things simpler, in this project you can assume that only a small set of documents will be indexed, and thus the index can be kept in memory.  All you need to do is be able to read the index data from a file at startup into memory, and write it back when updating the index.  Note, the names (pathnames) of the files as well as their last modification time must be stored as well.  It is your choice to use a single file or multiple files, plain text or XML to hold the persistent data.  (Don't use any DBMS however, just files.)  In any case, your file format(s) must be documented completely, so that someone else, without access to your source code could use your file(s).

You can define an XML schema for your file, and have some tool, such as Microsoft's XML Notepad utility or Notepad++, validate your file format for you.  XML may have other benefits, but it isn't as simple as plain text files.  In any case, don't forget to include the list of file (path) names, along with the index itself.

Part I Requirements:

The class must work in groups of three or four students per group.  Any student not part of a group must let the instructor know immediately.  In this case, the instructor will form the groups.

Your group will agree to use a single GitHub repo for this project.  Every student must make commits to this repo for their part of the project.  (So every member of the project must do their share of the code.)

This project has been split into three parts.  Each part counts as a separate project.  In this first part, your group must design and implement a non-functional graphic user interface for the application.  The user interface must support both searching as well as various index operations: adding files to the index, remove files from the index, and update the index (when files have been modified since they were indexed).

The user interface should be complete, but none of the functionality needs to be implemented at this time.  You should implement stub methods for the functions not yet implemented, and invoke them from your event handlers.  The stub methods can either return “canned” (fake but realistic) data, or throw an appropriate exception.

You can download a Search Engine model solution, to play with it and inspect its user interface, but please keep in mind you should not copy that user interface; instead, invent a better, nicer-looking one.

Preview of next projects:  In part II (files), your group will implement the file operations of the search engine.  That includes reading and writing the inverted index and file list from/to a file, reading other files a word at a time, and storing file metadata such as last modified time.  In part III (collections), you will implement the inverted index operations, including Boolean searching, adding to the index, and removing files from the index.  (The index is a complex collection of collections.)

Hints:

Keep your code simple.  You can always add features later, time permitting.  If you start with a complex, hard-to-implement design, you may run out of time.

How your group is organized is up to the group members.  Some suggestions include:

To be Turned in:

A working link to your group's GitHub repo used for this project and your individual peer ratings (see below).  Your project's final version should receive a Git tag of “Proj 3 UI”, so I know which version to grade.

Be sure the names of all group members are listed in the comments of all files.  You should use GitHub's issue tracker, wiki, email, Facebook, Skype, or any means you wish, to communicate within your group.  (It is suggested you hold short group meetings before or just after class.)

Grading will be done for the group's results and individual commits.  Individuals in the group will have their grades adjusted by peer ratings.  A rating of each team member's level of participation should be sent by individually by every member directly to the instructor.  Be sure to include yourself in the ratings!  The rating is a number from 0 (didn't participate at all), 1 (less than their fair share of the work), 2 (participated fully), or 3 (did more than their fair share of the work).  Additional comments are allowed if you wish to elaborate.

Send your group ratings, including the GitHub link, as email to (preferred).

Please see your syllabus for more information about projects, and about submitting projects.