DocSearcher is a search tool.

Developer Info

This is a Java (Swing) Application.

To extract the source, unzip the source.jar file:
jar -xf source.jar

Notes:
The main class is DocSearch.java.
The method that performs indexing is createNewIndex.
Other classes of interest are the wrapper objects for the various file types:

WordProps ; works with the POI HDF API
ExcelProps ; works with the POI HSSF API
PdfToText ; works with the PDFBox API
RtfToText ; uses the javax.swing.text.rtf API
OoToText ; uses the java.util.zip API to unzip StarOffice / OpenOffice documents and extract the content XML file
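
To give an idea of the unzip approach described for OoToText, here is a minimal sketch. It is not the actual OoToText source: the class name OoContentSketch and both method names are made up for illustration, and it assumes the document body lives in an archive entry named "content.xml". The tag stripping is deliberately crude (it does not decode XML entities), so treat it as a starting point only.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

// Hypothetical helper for illustration; NOT the real OoToText class.
class OoContentSketch {

    /** Returns the XML stored in the "content.xml" entry, or null if the entry is missing. */
    static String readContentXml(String path) throws IOException {
        try (ZipFile zip = new ZipFile(path)) {
            ZipEntry entry = zip.getEntry("content.xml"); // assumed entry name
            if (entry == null) {
                return null;
            }
            try (InputStream in = zip.getInputStream(entry)) {
                ByteArrayOutputStream out = new ByteArrayOutputStream();
                byte[] buf = new byte[8192];
                int n;
                while ((n = in.read(buf)) != -1) {
                    out.write(buf, 0, n);
                }
                return new String(out.toByteArray(), StandardCharsets.UTF_8);
            }
        }
    }

    /** Crude text extraction: replaces every tag with a space and collapses whitespace. */
    static String stripTags(String xml) {
        return xml.replaceAll("<[^>]*>", " ").replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java OoContentSketch <document>");
            return;
        }
        String xml = readContentXml(args[0]);
        System.out.println(xml == null ? "no content.xml entry found" : stripTags(xml));
    }
}
```

A real converter would use an XML parser rather than a regular expression, but the zip handling is the part that matters here.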

DocSearcher creates and stores its indexes and all related files in the ".docSearcher" folder underneath the user's home directory. On a Linux system this might be:
/home/john/.docSearcher
and on a Windows system it might be something like:
C:\Documents and Settings\username\.docSearcher
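
If you need to locate that folder from your own code (a servlet, for instance), something along these lines should work. This is only a sketch: the class name DocSearcherPaths and its methods are hypothetical, and it assumes the home directory is the one reported by the "user.home" system property and that the indexes sit in an "indexes" subfolder.

```java
import java.io.File;

// Hypothetical helper for illustration; not part of the DocSearcher source.
class DocSearcherPaths {

    /** The ".docSearcher" folder underneath the given home directory. */
    static File workingDir(String userHome) {
        return new File(userHome, ".docSearcher");
    }

    /** The "indexes" subfolder of the working directory (assumed layout). */
    static File indexesDir(String userHome) {
        return new File(workingDir(userHome), "indexes");
    }

    public static void main(String[] args) {
        String home = System.getProperty("user.home");
        System.out.println("working dir: " + workingDir(home));
        System.out.println("indexes dir: " + indexesDir(home));
    }
}
```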

DocSearcher indexes are Lucene indexes with the following fields and types:

Field	Description	Indexing Properties
author	taken from the document's metadata	text
path	file handle	unindexed
mod_date	date the document was last modified	text (Lucene DateField text object)
title	title obtained from metadata (if it exists), otherwise a grab of the first few lines or characters	text
summary	first few lines of text	text
body	text of the entire document (without metadata)	text
URL	if the index is created as a "web" index, DocSearcher will construct a URL for each file	text
keywords	taken from document metadata (if it exists); mostly relevant for indexed web page documents	text
size	size in bytes	keyword
type	document suffix (htm, doc, pdf, etc...)	text

If you want to construct a search JSP or servlet, the above table should be very helpful.
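
As a rough illustration of how the field names in the table could be used, the sketch below builds a fielded search string in the usual Lucene query syntax (field:"phrase", clauses joined with AND). The class QueryStringSketch and its methods are hypothetical, and the string DocSearcher itself builds may well look different. Note that "path" is stored unindexed, so it can be read from a hit but not searched.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical helper for illustration; not part of the DocSearcher source.
class QueryStringSketch {

    /** Builds one clause such as author:"smith", escaping backslashes and quotes. */
    static String fieldClause(String field, String value) {
        String escaped = value.replace("\\", "\\\\").replace("\"", "\\\"");
        return field + ":\"" + escaped + "\"";
    }

    /** Joins one clause per map entry with AND, in the map's iteration order. */
    static String allOf(Map<String, String> terms) {
        StringJoiner joiner = new StringJoiner(" AND ");
        for (Map.Entry<String, String> e : terms.entrySet()) {
            joiner.add(fieldClause(e.getKey(), e.getValue()));
        }
        return joiner.toString();
    }

    public static void main(String[] args) {
        Map<String, String> terms = new LinkedHashMap<>();
        terms.put("body", "lucene index");
        terms.put("type", "pdf");
        // Prints: body:"lucene index" AND type:"pdf"
        System.out.println(allOf(terms));
    }
}
```

The resulting string would then be handed to Lucene's query parser in your JSP or servlet.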

In addition, you may want to review the doSearch() method in DocSearch.java, which shows how dates and other metadata are searched. The Lucene indexes themselves are stored in the $user_home/.docSearcher/indexes directory.

I hope to have an example JSP ready soon.

Another resource that may help is the standard output of DocSearch.jar, for example:
java -jar DocSearch.jar

DocSearcher will display the Lucene search string that it builds from your GUI input so that you can see what this search string looks like to the Lucene API.

If you are curious how index updates are performed, take a look at the source file DocSearcherIndex.java, and then at the DocSearch.java method updateIndex(docSearcherIndex di), which performs the actual updates to the Lucene indexes.

I've attempted to tune this method to scale fairly well even on large indexes, but suggestions for improvement are always welcome. ;)

Command Line Arguments

java -jar DocSearch.jar ["action"]  ["index" or log file name]

... where the action can be:

update : updates an index
export : exports an index to a zip file
list : lists the indexes
analyze_log : analyzes search log data (from a servlet)
"Search:text to find" : performs a search and outputs the text result to the console
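
If you want to wrap these actions in a script or tool of your own, the sketch below shows one way the argument shapes listed above could be told apart. It is not how DocSearch.java parses its arguments: the class CliActionSketch, the describe method and the returned strings are all invented for illustration, and treating "no arguments" as a plain GUI start is an assumption.

```java
// Hypothetical helper for illustration; NOT the argument handling in DocSearch.java.
class CliActionSketch {

    /** Returns a human-readable description of what the given arguments ask for. */
    static String describe(String[] args) {
        if (args.length == 0) {
            return "start the GUI"; // assumption: no action means a normal start
        }
        String action = args[0];
        if (action.startsWith("Search:")) {
            return "search for: " + action.substring("Search:".length());
        }
        if (action.equals("list")) {
            return "list indexes";
        }
        boolean takesArgument = action.equals("update")
                || action.equals("export")
                || action.equals("analyze_log");
        if (!takesArgument) {
            return "unknown action: " + action;
        }
        if (args.length < 2) {
            return "missing argument for " + action;
        }
        switch (action) {
            case "update":
                return "update index: " + args[1];
            case "export":
                return "export index: " + args[1];
            default: // analyze_log takes a log file name
                return "analyze log file: " + args[1];
        }
    }

    public static void main(String[] args) {
        System.out.println(describe(args));
    }
}
```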