Date of Award

1999

Document Type

Dissertation

Degree Name

Doctor of Philosophy (PhD)

Department

Engineering Science (Interdepartmental Program)

First Advisor

Evangelos Triantaphyllou

Abstract

The main problem investigated in this dissertation is as follows. Given two samples of documents, each drawn from one of two disjoint collections of documents, the question is how to obtain a set of patterns of text features that allows every document in the two samples (and other unclassified documents) to be classified correctly into one and only one document class. A sample of 2,897 documents from the TIPSTER collection was used to investigate this problem. The problem was divided into the following four subproblems. The first subproblem consists of identifying the set of keywords that describe the documents' content. Computational results from twenty experiments suggested that single-word keywords addressed the main problem effectively. The second subproblem requires a methodology for constructing classification rules that infer the class of unclassified documents. A logical analysis approach called the One Clause At a Time (OCAT) algorithm is used to address this subproblem. Its accuracy is compared to that of the Vector Space Model (VSM), a benchmark methodology in document classification. Under identical experimental conditions, the computational results suggest that the OCAT algorithm is more accurate than the VSM in solving the main problem. The third subproblem consists of providing a methodology for constructing better rules as more documents become available. This subproblem has been investigated using the OCAT algorithm under a guided and a random learning approach. Computational results on three samples of 510 documents indicate that the guided learning approach constructs more accurate rules. The fourth subproblem requires an incremental version of the OCAT algorithm, which is needed to speed up the construction of the classification rules. Computational results on three samples of 336 documents each show that: (i) the classification rules become accurate more rapidly, (ii) the CPU times are substantially reduced, and (iii) the rules become more complex as more documents are added to the experiment. In summary, the results of this research suggest with high confidence that the incremental OCAT algorithm can perform better than the VSM and that it can deliver better and faster results for the classification of large collections of documents.

ISBN

9780599636316

Pages

98
