COP 2805 (Java II) Project
Building a Search Engine, Part II:  Files

Due: by the start of class on the date shown on the syllabus

Background:

Please read the background information and full project description from Search Engine Project, Part I.

It should be easy to add files to and remove files from the set of indexed files, and to regenerate the index at any time.  When starting, your application should check whether any of the indexed files have been changed or deleted since the application last saved the index.  If so, the user should be able to have the inverted index file(s) updated.
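One possible way to do the startup check is sketched below.  This is only an illustration, not a required design: the class name FileChangeCheck, the Status enum, and the choice to compare timestamps in milliseconds are all assumptions made for this example.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical helper: compares an indexed file against its saved timestamp. */
final class FileChangeCheck {
    enum Status { UNCHANGED, MODIFIED, DELETED }

    /** Pure decision logic, kept separate from file access so it is easy to test. */
    static Status classify(boolean exists, long savedMillis, long currentMillis) {
        if (!exists) {
            return Status.DELETED;
        }
        // Any difference counts as a change; a file restored from a backup
        // may carry an older timestamp than the saved one.
        return currentMillis == savedMillis ? Status.UNCHANGED : Status.MODIFIED;
    }

    /** Checks the file on disk against the modification time saved with the index. */
    static Status check(Path file, long savedMillis) {
        if (!Files.isRegularFile(file)) {
            return Status.DELETED;
        }
        try {
            long currentMillis = Files.getLastModifiedTime(file).toMillis();
            return classify(true, savedMillis, currentMillis);
        } catch (IOException e) {
            // Assumption: a file whose timestamp cannot be read is treated as gone.
            return Status.DELETED;
        }
    }
}
```

At startup, the application could call check once for each saved pathname and offer to update the index if any result is not UNCHANGED.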

(Note that with HTML or Word documents, you would need to extract a plain text version before indexing.)  In this project, all the “indexable” files are plain text.  You are free to assume the system-default text file encoding, or assume UTF-8 encoding, for all files.

The inverted index must be stored in one or more files, and the file(s) should be read whenever your application starts.  The file(s) should be updated (or recreated) when you add, update, or remove documents from your set of indexed documents.  The file format is up to you, but it should be fast and simple to search.  However, to keep things simpler, in this project you can assume that only a small set of documents will be indexed, and thus the index can be kept in memory.  All you need to do is read the index data from a file into memory at startup, and write it back when updating the index.  Note that the names (pathnames) of the indexed files, as well as their last modification times, must be stored too.  It is your choice whether to use a single file or multiple files, and plain text or XML, to hold the persistent data.  (Do not use a DBMS, however; use only files.)  In any case, your file format(s) must be documented completely, so that someone else, without access to your source code, could use your file(s).
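As an illustration of one simple plain-text format for the list of indexed files, the sketch below stores one file per line: the last modification time in milliseconds, a tab, then the pathname.  The class name IndexedFileList, the Entry type, and this line layout are assumptions made for the example, not a required format; your group may choose something quite different (including XML).

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical storage for the list of indexed files, one file per line. */
final class IndexedFileList {
    /** One indexed file: its full pathname and last modification time. */
    static final class Entry {
        final String path;
        final long lastModifiedMillis;

        Entry(String path, long lastModifiedMillis) {
            this.path = path;
            this.lastModifiedMillis = lastModifiedMillis;
        }
    }

    /** Assumed line format: timestamp, a tab, then the pathname. */
    static String formatLine(Entry e) {
        // The timestamp comes first so that a tab inside a pathname is harmless.
        return e.lastModifiedMillis + "\t" + e.path;
    }

    static Entry parseLine(String line) {
        int tab = line.indexOf('\t');
        if (tab < 0) {
            throw new IllegalArgumentException("Malformed line: " + line);
        }
        long millis = Long.parseLong(line.substring(0, tab));
        return new Entry(line.substring(tab + 1), millis);
    }

    /** Writes (or recreates) the file list, always as UTF-8. */
    static void save(Path file, List<Entry> entries) throws IOException {
        List<String> lines = new ArrayList<>();
        for (Entry e : entries) {
            lines.add(formatLine(e));
        }
        Files.write(file, lines, StandardCharsets.UTF_8);
    }

    /** Reads the file list back into memory at startup. */
    static List<Entry> load(Path file) throws IOException {
        List<Entry> entries = new ArrayList<>();
        for (String line : Files.readAllLines(file, StandardCharsets.UTF_8)) {
            if (!line.isEmpty()) {
                entries.add(parseLine(line));
            }
        }
        return entries;
    }
}
```

A format like this is easy to document in a sentence or two.  Its main limitation is that a pathname containing a newline character would break it.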

You can define an XML schema for your file, and have some tool, such as Microsoft's XML Notepad utility or Notepad++, validate your file format for you.  XML may have other benefits, but it isn't as simple as plain text files.  In any case, don't forget to include the list of file (path) names, along with the index itself.

Part II Requirements:

The class must work in groups of three or four students per group.  Any student not part of a group must let the instructor know immediately.  In this case, the instructor will form the groups.

This project has been split into three parts.  Each part counts as a separate project.  In the first part, your group designed and implemented a (non-functional) graphical user interface for the application.

Your group will agree to use a single GitHub repo for this project.  Every student must make commits to this repo for their part of the project.  (So every member of the group must do their share of the code.)

In this part, you must implement the file operations of your search engine application.  That includes reading and updating your persistent data (that is, the inverted index as well as any other information you need to store between runs of your application, such as the list of files that have been indexed).  The main file operations are reading files to be indexed, a “word” at a time, and checking if the previously indexed files still exist, or have been modified since last indexed.
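Reading a file a “word” at a time might be sketched as follows.  What counts as a word is up to your group; this example assumes a word is any run of letters or digits and that words are folded to lowercase.  The class name WordReader is hypothetical.

```java
import java.io.Reader;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Scanner;

/** Hypothetical helper that splits text into lowercase "words". */
final class WordReader {
    /**
     * Reads all words from the source.  Assumption: a word is a maximal run
     * of Unicode letters or decimal digits; everything else is a separator.
     */
    static List<String> readWords(Reader source) {
        List<String> words = new ArrayList<>();
        try (Scanner in = new Scanner(source)) {
            in.useDelimiter("[^\\p{L}\\p{Nd}]+");
            while (in.hasNext()) {
                String word = in.next();
                if (!word.isEmpty()) {  // guard against an empty token
                    words.add(word.toLowerCase(Locale.ROOT));
                }
            }
        }
        return words;
    }
}
```

To read an actual file, the caller could pass Files.newBufferedReader(path, StandardCharsets.UTF_8) as the source.  Note that under this definition a word such as “don't” is split into “don” and “t”; a different delimiter pattern would be needed to keep apostrophes.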

The maintenance part of the user interface should allow users to select files for indexing, and should keep track of which files have been added to the index.  For each file, you need to keep the full pathname of the file as well as the file's last modification time.  Your code should correctly handle the user entering nonexistent or unreadable files.  (How you handle those errors is up to your group.)
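A minimal sketch of checking a user-selected file before adding it to the index is shown below.  The class name FileCheck and the wording of the messages are assumptions for illustration; how the problem is reported to the user (a dialog, a status line, and so on) is left to your group.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Optional;

/** Hypothetical validation of a file the user wants to index. */
final class FileCheck {
    /** Returns a description of the problem, or an empty Optional if the file looks usable. */
    static Optional<String> problemWith(Path file) {
        if (!Files.exists(file)) {
            return Optional.of("File does not exist: " + file);
        }
        if (!Files.isRegularFile(file)) {
            return Optional.of("Not a regular file: " + file);
        }
        if (!Files.isReadable(file)) {
            return Optional.of("File is not readable: " + file);
        }
        return Optional.empty();
    }
}
```

Even when this check passes, reading the file can still fail later (for example, if the file is deleted in the meantime), so the code that reads the file should still handle IOException.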

You can download a Search Engine model solution, to play with it and inspect its user interface, but please keep in mind you should not copy that user interface; instead, invent a better, nicer-looking one.

Preview of the last part of this project:  In part III, you will implement the index operations, including Boolean searching, adding to the index, and removing files from the index.  (The index is a complex collection of collections.)

Hints:

Keep your code simple.  You can always add features later, time permitting.  If you start with a complex, hard-to-implement design, you may run out of time.

How your group is organized is up to the group members.  Some suggestions include:

To be Turned in:

A working link to your group's GitHub repo used for this project and your individual peer ratings (see below).  Your project's final version should receive a Git tag of “SearchEngine Project - Files”, so I know which version to grade.

Be sure the names of all group members are listed in the comments of all files.  You should use GitHub's issue tracker, wiki, email, Facebook, Skype, or any means you wish, to communicate within your group.  (It is suggested you hold short group meetings before or just after class.)

Grading will be done for the group's results and individual commits.  Individuals in the group will have their grades adjusted by peer ratings.  Every member should individually send a rating of each team member's level of participation directly to the instructor.  Be sure to include yourself in the ratings!  Each rating is a number from 0 to 3:  0 (didn't participate at all), 1 (less than their fair share of the work), 2 (participated fully), or 3 (did more than their fair share of the work).  Additional comments are allowed if you wish to elaborate.

Send your group ratings, including the GitHub link, as email to (preferred).

Please see your syllabus for more information about projects, and about submitting projects.