Creating Conceptual Access:
Faceted Knowledge Organization in the unrev-ii email archives.    (01)

Kathryn La Barre and Chris Dent    (02)

klabarre@indiana.edu cjdent@indiana.edu    (03)

Indiana University
011 Main Library
1320 E. 10th St.
Bloomington, IN, USA 47405-3907
phone: 812-336-9136
http://ella.slis.indiana.edu/~klabarre/unrev_firstpage.html    (04)

Abstract: The email archive of the unrev-ii list is the basis for this ongoing project to build an access tool for an email archive that also functions as a knowledge repository. Methods utilized in future iterations of the project will include traditional semantic analysis, clustering algorithms, and facet analysis.    (05)

1. Introduction    (06)

The scope of this project includes the conceptualization and construction of a prototype infrastructure designed to augment a community attempting to gain or to create enhanced access to a knowledge repository. Future iterations of this project will utilize traditional approaches for semantic analysis in order to generate clusters. These clusters could then be used as aids in the human process of facet analysis in order to generate a faceted access structure1 for an electronic archive or similar textual knowledge repository. Computational methods for determining conceptually related clusters amongst messages will be tested using the archive contents. Extracted clusters will then be used in an iterative process to augment human identification of the characteristics that make up facets and can provide conceptual access to the archive.    (07)

This inquiry is grounded in the following assumptions or expectations. Formal knowledge representations provide effective access to knowledge. The majority of human communication is composed of basic components: associative structures, analogical reasoning and persuasion. Discourse contained in an email archive that also functions as a knowledge repository demonstrates these same features. As such, semantic analysis tools may be effective in the identification of associative structures. The application of clustering software, in addition to semantic processing, may expose these associative connections. Analysis of the generated clusters may assist human researchers in identifying embedded concepts, which will then be used in the construction of a tool that enhances access to the knowledge repository.    (08)

The prototype infrastructure described in this paper will serve as a framework for exploration of several issues. Can facet analysis assist in the creation of an effective access tool? Are clusters generated by semantic analysis conceptually valid? Are some semantic analysis tools better suited for the pre-processing of the knowledge repository than others? Is the augmentation model that informs this work valid?    (09)

2. Background    (010)

In late 2001, participants on the unrev-ii listserv renewed a conversation in which the community wondered how best to create access to the conceptual content of their archived postings. The listserv was created to provide a discussion space for participants in the Unfinished Revolution Colloquium held at Stanford University (January until March 2000). Persons unable to participate in the colloquium and anyone interested in the activities of the Bootstrap Institute2 were invited to join the discussion. Recently the focus of the discussion group shifted to concrete actualization of Douglas Engelbart's vision, and another discussion group superseded unrev-ii. The initial impetus for this inquiry is the fact that the Bootstrap Institute hopes to mine material from the unrev-ii list in order to create a Bootstrapping "handbook" for the creation of a Dynamic Knowledge Repository (DKR)3 and Open Hyperdocument System (OHS)4. The Bootstrap Institute (2002, online) envisions such a handbook as a tool useful "for more efficiently solving urgent, complex problems in the private and public sectors of world society."    (011)

The threaded contents of unrev-ii are accessible at http://www.bootstrap.org/dkr/discussion/. The group has periodically expressed interest in enhanced access to such elements as archived references to books, websites and related projects. The list contains various other kinds of discussions: social concerns, tool developments and announcements of progress being made by the Bootstrap Institute. These materials mark the unrev-ii archive as a valuable knowledge repository. As a point of comparison, both the PORT-l list and the Peirce manuscripts affiliated with PORT (Peirce Online Repository Testbed) function in much the same way. Each of these entities serves as a knowledge repository; we posit that the same level of conceptual access requested by the unrev-ii community will prove useful to PORT. The archive of unrev-ii postings represents a fertile area for testing tools that may prove useful in providing conceptual access to such knowledge repositories.    (012)

3. Knowledge Repositories    (013)

When human discourse has been conducted with the express (or implied) purpose of collecting or creating a storehouse of knowledge, an archive of the discourse may be considered a knowledge repository. Enhanced access to such an archive is often a desirable goal. Knowledge repositories may or may not represent a specific closed domain. They are valuable as a source of reference and a locus for learning. Most discourse that has been archived with the goal of creating a knowledge repository predates formal knowledge representation structures or was created and consequently archived without knowledge of such structures. While newly-created annotative systems for existing discourse may be successfully structured around formal representations, the enormous labor costs involved in translating or transcribing the discourse itself into formal structures present a significant constraint. Systems that facilitate access can promote use of knowledge archives.    (014)

This paper describes the framework for a multi-level inquiry into the creation of such systems. The first level of analysis asks whether or not computerized tools for semantic analysis (such as vector space analysis and latent semantic analysis) in combination with a clustering algorithm can be harnessed to provide useful sets of clustered documents in such a knowledge repository. We assume that clusters generated in this fashion may represent material that is semantically associated. We also assume these documents may be conceptually related. In other words, each document reflects at least one concept, and the clusters potentially represent not only conceptually related documents, but also a category or set of categories for which the concepts have not yet been identified.    (015)
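
As a minimal sketch of this first level of analysis, the following Python fragment builds a simple vector space model (TF-IDF weights) over a handful of invented messages and compares them by cosine similarity. The sample messages, the function names and the weighting scheme are illustrative assumptions rather than the tools used in the project. Latent semantic analysis would additionally reduce such vectors with a singular value decomposition, which is omitted here.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase a message and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_vectors(messages):
    """Return one sparse TF-IDF vector (dict of term -> weight) per message."""
    token_lists = [tokenize(m) for m in messages]
    # Document frequency: in how many messages does each term occur?
    doc_freq = Counter(term for tokens in token_lists for term in set(tokens))
    n_docs = len(messages)
    vectors = []
    for tokens in token_lists:
        counts = Counter(tokens)
        vectors.append({
            term: count * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Invented sample messages (not taken from the unrev-ii archive).
messages = [
    "The red Martians looked shiny and scary",
    "Scary red Martians again, shiny as ever",
    "Notes on the open hyperdocument system design",
]
vectors = tfidf_vectors(messages)
sim_01 = cosine(vectors[0], vectors[1])
sim_02 = cosine(vectors[0], vectors[2])
print(sim_01 > sim_02)
```

In this toy collection the two Martian messages score as more similar to each other than either does to the third message, which is the kind of signal a clustering algorithm would then group on.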

The terms concept and category, as used here, are not interchangeable. We adopt Tversky and Kahneman's distinction, that these terms represent two sides of the same coin, or the "inside" and "outside" view. The inside view, or concept, contains all of the characteristics and structure(s) that bind a concept together. The outside view, or category, contains some or all of the instances that may be potentially included in the category referred to by the concept (Tversky & Kahneman, 1983, pp. 293-315). (Please see the glossary, below, for more detail on relevant terms.)    (016)

The second level of inquiry explores human review of the clusters using facet analysis. The process of facet analysis will help us uncover the existence of the characteristics that will determine the concepts and their related categories. This review will also identify the individual instances belonging to each category. The product of this second stage of analysis would then be a faceted access structure for use by anyone wishing to have enhanced access to the conceptual content of the archive.    (017)

4. Facet Analysis    (018)

[Facet analysis consists of] …the sorting of terms in a given universe of entities (field of knowledge) into homogeneous, mutually exclusive facets. (Facets can consist of characteristics, objects or attributes.) Each facet is derived from the parent universe by a single characteristic of division (Vickery, 1966, p. 36).    (019)

In this research framework, the process of facet analysis begins after the entire archive of email messages has been subjected to semantic analysis and clusters of documents have been produced. To demonstrate the process of analysis we have chosen a whimsical example. At this stage of inquiry, we will apply facet analysis to each cluster in the dataset. Each cluster potentially contains a number of associative connections. It is these associations that can assist in identifying concepts contained within the cluster, as well as the characteristics (facets) that define each concept. For example, we examine the document set associated with one cluster, and note that common themes (but not necessarily words) in the cluster are red, blue, yellow, rough, shiny, scary, and skinny. We postulate that there are two levels of association in this cluster: all messages are talking about Martians (we corroborate this by viewing a sample of documents), and most of the messages share some combination of the words red, blue, yellow, rough, shiny, scary, and skinny. We list all of the kinds of things we know or can know about Martians: appearance, location, color and effect on humans. This list represents the characteristics (facets) of Martians. We compare this to the common words existing in the cluster: red, blue, yellow, rough, shiny, scary, skinny. These are instances of the concept of Martians as they are represented by documents contained in the cluster. As part of facet analysis we have just deconstructed a concept (Martians) into a set of characteristics or facets (an exhaustive list of what we know about a given category; in this case appearance, location, color, and effect on humans). In this process it is the characteristics (facets) of a given concept that are used to define the members of a given category. Each characteristic is assigned a term or label which is used to identify it. It is preferable that the term chosen be a term used in the dataset or one that is used by the community that created the knowledge repository.    (020)

Our next step is to sort the instances by characteristic (facet). Characteristics (facets) will be used to collect documents contained in the cluster to the proper category or categories. By the principle of division, one characteristic, of the many that compose a concept, is chosen to identify which instance belongs to a given category. Members of each category are instances of a given concept. Once sorted and assigned to a category, the instances become manifestations of our category. Each category is also the set of manifestations of a concept. This is possible because each concept contains an exhaustive set of characteristics (facets), and each characteristic (facet) is mutually exclusive and homogeneous.    (021)

Appearance    (022)

(characteristic/facet)    (023)

shiny - manifestation    (024)

skinny - manifestation    (025)

rough - manifestation    (026)

Color    (027)

(characteristic/facet)    (028)

red - manifestation    (029)

blue - manifestation    (030)

yellow - manifestation    (031)

Effect    (032)

(characteristic/facet)    (033)

scary - manifestation    (034)
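
The sorting step illustrated by this listing can be sketched as a small data structure. The facet names and instances come from the Martian example; the lookup table that assigns each instance to a facet stands in for the human analyst's judgment and is hard-coded here as an assumption. Location, the fourth facet named earlier, has no manifestations in this cluster and therefore does not appear.

```python
from collections import defaultdict

# Instances observed in the example cluster.
instances = ["red", "blue", "yellow", "rough", "shiny", "scary", "skinny"]

# The analyst's decision about which characteristic (facet) each instance
# belongs to. In practice a human makes this judgment; it is hard-coded
# here purely for illustration.
facet_of = {
    "red": "Color", "blue": "Color", "yellow": "Color",
    "rough": "Appearance", "shiny": "Appearance", "skinny": "Appearance",
    "scary": "Effect",
}


def sort_by_facet(instances, facet_of):
    """Group instances under their facet; grouped instances are manifestations."""
    manifestations = defaultdict(list)
    for instance in instances:
        manifestations[facet_of[instance]].append(instance)
    return dict(manifestations)


facets = sort_by_facet(instances, facet_of)
for facet, members in facets.items():
    print(facet, "->", ", ".join(members))
```

Because each instance maps to exactly one facet, the resulting groups are mutually exclusive, as the principle of division requires.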

Faceted access structures can be implemented with a search interface that utilizes the facets that have been identified through facet analysis. This allows users to construct a search query by selecting one characteristic (facet) or a combination of facets. Users may also choose one dimension or set of dimensions from the characteristics. This places powerful control over the search process where it belongs, in the hands of the user. Users are not constrained by the need to learn a complex or artificial vocabulary, as terms used to label facets are drawn from the universe of documents and can be presented as a simple drop-down list. The faceted system provides both a point of entry for conducting a search, and a structure upon which traversal and browsing of information can be accomplished. It dispenses with the rigid hierarchical structure of traditional classification systems and many browsing interfaces.    (035)
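
A faceted query of this kind can be sketched in a few lines of Python. The message identifiers, the facet assignments and the function names below are hypothetical; the sketch only shows how selecting one facet value, or a combination of them, narrows the set of retrieved messages, and how the available values could be listed for a drop-down menu.

```python
# Hypothetical message records: message id -> {facet name: set of manifestations}.
messages = {
    101: {"Color": {"red"}, "Appearance": {"shiny"}},
    102: {"Color": {"blue"}, "Effect": {"scary"}},
    103: {"Color": {"red"}, "Appearance": {"rough"}, "Effect": {"scary"}},
}


def facet_search(messages, **selected):
    """Return ids of messages that match every selected facet value."""
    hits = []
    for msg_id, tags in messages.items():
        if all(value in tags.get(facet, set())
               for facet, value in selected.items()):
            hits.append(msg_id)
    return sorted(hits)


def facet_menu(messages):
    """Collect the values available under each facet, e.g. for drop-down lists."""
    menu = {}
    for tags in messages.values():
        for facet, values in tags.items():
            menu.setdefault(facet, set()).update(values)
    return {facet: sorted(values) for facet, values in menu.items()}


red_messages = facet_search(messages, Color="red")
red_and_scary = facet_search(messages, Color="red", Effect="scary")
print(red_messages, red_and_scary)
print(facet_menu(messages))
```

Each additional facet in the query acts as a further filter, so the user can broaden or narrow a search without knowing any vocabulary beyond the labels shown in the menu.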

5. Process Augmentation    (036)

The labor-intensive process of generating a faceted access structure is too complicated to fully automate with existing computational tools. In traditional settings a domain expert is needed to understand the full scope of the subject area for which access is being created. Computational expert systems exist for known, small and closed worlds of discourse, but in the more diverse world of informal knowledge repositories human understanding is needed to create full access. It is possible, and indeed preferable, to augment the human (or humans) generating the structure to hasten the overall process. Such augmentation involves analysis of the smaller tasks that make up the larger process, breaking things down until small tasks that may be automated are identified. Our analysis reveals automatic cluster generation as a possible source of assistance in the larger process of generating faceted access structures. Many available systems are capable of performing the semantic analysis used to create clusters. These must be evaluated to determine which, if any, are effective in augmenting the facet generation process.    (037)

We assert that the associative structures created by semantic analysis tools are not—because of their informal nature—amenable to fully accurate and comprehensive evaluation by computers. The associative structures represented by clusters of documents indicate analogical relationships among the documents. The nature of those relationships is not initially known but may be based on, indicate or assist in identifying and making explicit the embedded conceptual content of the documents. Identification of these associative relationships through the process of facet analysis permits identification of the instances that make up the characteristics that structure a faceted access system. In other words, concepts are hoisted out of the raw data in cluster format, and categorized according to analogical perceptions of similarity.    (038)

Analogical reasoning combines induction and deduction, and is source dependent. It is the reasoning process that underpins the iterative nature of this inquiry. Analogical structures are context-dependent, probabilistic, complex, may lack clear definition and are created by the process of induction. In contrast, formal structures are precise, well defined, often symbolic in composition and created by the process of deduction. This inquiry is grounded in the understanding that human reasoning and the process of knowledge representation require both in order to be fully comprehensible, accessible and successful.    (039)

Automated systems have had some success identifying associative relationships in closed domains or in systems where the objects under study are themselves represented by formal structures (Hall, 1989; Spanoudakis & Constantopoulos, 1996). Human discourse knowledge repositories are neither closed nor formal. They are, instead, dynamic systems that call for flexible approaches. While computational tools can expose the framework of associative structures in these repositories, humans and their analogical reasoning processes are required for interpretation. A system built upon analogical reasoning is especially suitable for knowledge discovery because such a system identifies the analogical link embedded in the associative connection, and stores it in the formal faceted structure for later access or retrieval. The existence of the associative connections is the locus of discovery, not the definition of these connections.    (040)

6. System Details    (041)

We built a testbed for evaluating semantic analysis tools and the clusters they generate. The testbed involves several aspects: the archive of the unrev-ii mailing list, semantic processing which creates clusters of conceptually related documents and a web interface that allows for the retrieval and display of messages in the generated clusters.    (042)

The clusters generated by this process are subject to evaluation. The facet analysis evaluation, conducted by the authors of this paper, has already been described. During the next stage of inquiry, members of the unrev-ii list will be invited to view the clusters and experiment with using them as a rudimentary access structure. In a process similar to the one previously discussed, conducted as a further iteration of the inquiry, group members will be able to code messages for meaning in various ways, ranking messages by order of importance, assigning keywords or generating a short phrase or two to indicate aboutness.    (043)

An example interaction might proceed as follows: Latent Semantic Analysis, an associative approach to knowledge representation, is combined with a clustering algorithm, such as Ward’s method (Ward, 1963), to create a collection of potentially meaningful clusters. The meaning of the clusters is not initially known. An unrev-ii community member, visiting the web interface, retrieves the messages comprising a single cluster and reads them. The user then describes and ranks individual messages. Once finished, the user may also describe the entire cluster.    (044)
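
The clustering step in this interaction can be sketched as follows. The fragment is a small, unoptimized Python illustration of Ward's criterion (merge the pair of clusters whose union least increases the within-cluster sum of squares). The two-dimensional toy vectors are invented; in the scenario above the input would instead be the message vectors produced by latent semantic analysis, and a production system would use an optimized library routine.

```python
def ward_cost(a, b):
    """Increase in within-cluster sum of squares if clusters a and b merge."""
    size_a, size_b = len(a["members"]), len(b["members"])
    dist_sq = sum((x - y) ** 2 for x, y in zip(a["centroid"], b["centroid"]))
    return size_a * size_b / (size_a + size_b) * dist_sq


def merge(a, b):
    """Combine two clusters, weighting the new centroid by cluster size."""
    size_a, size_b = len(a["members"]), len(b["members"])
    total = size_a + size_b
    centroid = [(size_a * x + size_b * y) / total
                for x, y in zip(a["centroid"], b["centroid"])]
    return {"members": a["members"] + b["members"], "centroid": centroid}


def ward_clusters(vectors, n_clusters):
    """Agglomerate vectors until n_clusters remain; return lists of indexes."""
    clusters = [{"members": [i], "centroid": list(v)}
                for i, v in enumerate(vectors)]
    while len(clusters) > n_clusters:
        pairs = ((i, j) for i in range(len(clusters))
                 for j in range(i + 1, len(clusters)))
        i, j = min(pairs, key=lambda p: ward_cost(clusters[p[0]], clusters[p[1]]))
        merged = merge(clusters[i], clusters[j])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return [sorted(c["members"]) for c in clusters]


# Invented toy vectors: three points near the origin, two near (5, 5).
vectors = [[0.0, 0.1], [0.1, 0.0], [5.0, 5.1], [5.1, 5.0], [0.05, 0.05]]
groups = ward_clusters(vectors, 2)
print(groups)
```

The algorithm returns only index groups; as the text notes, what each group means is left for a human reader to determine.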

A subsequent visitor to the interface may use the descriptors created by previous visitors to access messages. Later visitors also may create new descriptors, rank existing descriptors, or add or remove individual messages from the cluster represented by an existing descriptor. In this iteration, highly ranked descriptors can become the basis for the concepts and characteristics that form the faceted access structure, or may be used to refine the concepts that have already been identified.    (045)

Thus far, experimentation has been divided among creating a database representation of the archive, latent semantic analysis of a small subset of the data and extraction of recommended resources submitted by the members of the list. The data, consisting of a ~27MB mbox-format mail archive, was initially parsed and injected into an Oracle 8 database by a Perl script, parser.pl (http://ella.slis.indiana.edu/~cjdent/parser.pl) 5. The text file contains 3679 messages and spans a time period of two years, January 2000 until January 2002. Mail messages that did not have a MIME Content-Type of text/plain were not loaded into the database. This lowered the total number of messages from 3685 to 3131. There are a total of 1550 different subjects posted by 154 individuals.    (046)
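
The loading step can be illustrated with a short Python sketch. It is not the parser.pl script: it substitutes Python's standard mailbox module for Perl and an in-memory SQLite database for Oracle, and the table layout, the function name and the two sample messages are assumptions made for the example. It shows only the filtering rule described above, in which messages without a text/plain content type are skipped.

```python
import mailbox
import os
import sqlite3
import tempfile

# Two invented messages in mbox format; only the first is text/plain.
SAMPLE_MBOX = "\n".join([
    "From alice@example.org Mon Jan 10 10:00:00 2000",
    "From: alice@example.org",
    "Subject: Welcome",
    "Message-ID: <1@example.org>",
    "Content-Type: text/plain",
    "",
    "Hello list.",
    "",
    "From bob@example.org Tue Jan 11 11:00:00 2000",
    "From: bob@example.org",
    "Subject: Re: Welcome",
    "Message-ID: <2@example.org>",
    "In-Reply-To: <1@example.org>",
    "Content-Type: text/html",
    "",
    "<p>Hi Alice</p>",
    "",
])


def load_plain_text(mbox_path, conn):
    """Insert text/plain messages into the database; return (kept, skipped)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS messages "
        "(message_id TEXT, author TEXT, subject TEXT, body TEXT)"
    )
    kept = skipped = 0
    box = mailbox.mbox(mbox_path)
    try:
        for msg in box:
            if msg.get_content_type() != "text/plain":
                skipped += 1
                continue
            conn.execute(
                "INSERT INTO messages VALUES (?, ?, ?, ?)",
                (msg["Message-ID"], msg["From"], msg["Subject"], msg.get_payload()),
            )
            kept += 1
    finally:
        box.close()
    conn.commit()
    return kept, skipped


conn = sqlite3.connect(":memory:")
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.mbox")
    with open(path, "w", newline="\n") as handle:
        handle.write(SAMPLE_MBOX)
    kept, skipped = load_plain_text(path, conn)

subjects = conn.execute("SELECT subject FROM messages").fetchall()
print(kept, skipped, subjects)
```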

The web interface is designed to allow multi-dimensional traversal of the archive: through clusters; between clusters; as well as by author, subject and date. A rudimentary prototype of the interface is available on the Internet (http://www.burningchrome.com/~cdent/uviz/cgi/index.cgi).    (047)
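
The idea of multi-dimensional traversal can be sketched with one index per dimension. The message records, field names and helper functions below are hypothetical and do not describe the prototype's implementation; they only show how a reader could move from one message to others that share its author, subject, date or cluster.

```python
from collections import defaultdict

# Hypothetical message records with a cluster label already assigned.
messages = [
    {"id": 1, "author": "alice", "subject": "OHS design", "date": "2000-01-10", "cluster": 0},
    {"id": 2, "author": "bob", "subject": "OHS design", "date": "2000-01-11", "cluster": 0},
    {"id": 3, "author": "alice", "subject": "Reading list", "date": "2000-01-11", "cluster": 1},
]

DIMENSIONS = ("author", "subject", "date", "cluster")


def build_indexes(messages, dimensions=DIMENSIONS):
    """Map each value along each dimension to the ids of matching messages."""
    indexes = {dim: defaultdict(list) for dim in dimensions}
    for msg in messages:
        for dim in dimensions:
            indexes[dim][msg[dim]].append(msg["id"])
    return indexes


def neighbours(indexes, msg, dimension):
    """Other messages sharing this message's value along one dimension."""
    return [i for i in indexes[dimension][msg[dimension]] if i != msg["id"]]


indexes = build_indexes(messages)
same_author = neighbours(indexes, messages[0], "author")
same_cluster = neighbours(indexes, messages[0], "cluster")
print(same_author, same_cluster)
```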

7. Points of convergence between unrev-ii and the work of PORT    (048)

PORT is currently evaluating tools that attempt to distinguish means by which "inference processes in knowledge representation require human reasoning and which are better served by automated reasoning, as knowledge processing technology evolves, to accomplish effective partnerships between human and machine intelligence in any particular context of operation." (http://www.lml.acad.bg/iccs2002/PORT.htm) Our inquiry proceeds from a similar framework. To what extent can the unrev-ii postings be subject to automated processing that will yield conceptual clusters, which can then be utilized in the creation of an access structure? Our goal is to analyze the tools while also creating enhanced knowledge repository access. Since this is an intensively iterative process, increased access to the archive will both help and hinder our ability to evaluate the archive itself, and will concurrently shape the assumptions and methods we bring to the experiment. It is our intention that the application of both formal representational structures and associative structures, in combination with analogical reasoning processes, will ultimately yield flexible and dynamic hybrid access tools suitable for knowledge repositories.    (049)

8. Future directions    (050)

There are a number of ways to approach creating access to an archive such as unrev-ii. One means relies on fully automated term and cluster generation. Another method relies on a combination of automated cluster generation with human intervention to label messages with facets. Yet another relies on human work most akin to traditional indexing or tagging of messages without computational assistance. Such tagging would occur either by the author at the time the message is generated, or by someone charged with maintenance of the archive.    (051)

Having the ability both to automatically generate clusters and to tag messages would serve to enhance conceptual access in a way that neither method can do alone. Post hoc analysis of the clusters by human evaluators can catch and make adjustments for errors in the automated processes.    (052)

The current state of this inquiry is quite preliminary. We face a number of challenges, primary among them the format of the data. For various reasons, data from email archives is considered "dirty" as it contains a great deal of irrelevant text; different message lengths; and unusual text fragments created by typographical errors, domain-specific terms and poor formatting. All of these can corrupt the vector space model. Stop word lists are also a challenge as the type-to-token ratio is quite high. Connections between messages are difficult to maintain in the database, as there are few standard conventions for replying to an email message. Relying solely on subject line offers no guarantee, since often the content of a message differs from the subject line. Another significant challenge is the sheer volume of the text in an archive of even a moderately active list. Many semantic analysis processes require extensive computer resources and processing time.    (053)
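
Two of these data problems can be made concrete with a short Python sketch: removing some of the irrelevant text from a message body, and measuring the type-to-token ratio (distinct words divided by total words). The cleaning rules (dropping lines quoted with ">" and everything after a "--" signature delimiter) and the sample message are assumptions chosen for illustration; real archives would need more careful handling.

```python
import re


def strip_noise(body):
    """Drop quoted reply lines and anything after a signature delimiter."""
    kept = []
    for line in body.splitlines():
        if line.rstrip() == "--":  # conventional signature delimiter
            break
        if line.lstrip().startswith(">"):  # text quoted from an earlier message
            continue
        kept.append(line)
    return "\n".join(kept)


def type_token_ratio(text):
    """Number of distinct word types divided by the number of word tokens."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)


# An invented message body with a quoted line and a signature.
body = "\n".join([
    "> What do the Martians look like?",
    "The Martians are red and the Martians are shiny.",
    "-- ",
    "Sent from the planet Earth",
])

clean = strip_noise(body)
ratio = type_token_ratio(clean)
print(clean)
print(round(ratio, 3))
```

A ratio close to 1 means that few words repeat, which leaves a stop word list built from frequency counts with little to work on.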

The future for inquiry involves evaluating and refining the processes outlined herein. Which semantic analysis tools generate the most effective clusters? Does faceting provide an effective access method for knowledge repositories? Is a human interlocutor an effective mediator between the associative structures of discourse and the formal structures required for access?    (054)

9. Conclusion    (055)

Our work attempts to build a path from the diverse world of human discourse, through the informal associative structures of semantic analysis, to effective human analysis of those associations and on to the formal access structures provided by faceted classification systems. Although much work remains, our initial investigations have shown much promise. The combination of computational tools to automate cluster identification with computer-based tools to assist human-based evaluation and refinement of those clusters may prove useful in generating conceptual access to the unrev-ii archive based on flexible faceted structures. Our investigations will reveal which tools and which tasks are most suited to machine intelligence, which are most suited to human intelligence and the best means by which to facilitate effective partnerships between the two. This understanding will transfer well to the Peirce archive envisioned by PORT participants.    (056)

The same framework described here can be carried into the creation of formal access structures for PORT materials. Human analogical reasoning is uniquely suited to efficient utilization of conceptual relationships embedded in the associative structures of semantic analysis, for such purposes as engendering new knowledge, and knowledge discovery. Such a dynamic and embodied approach transcends the re-representations of existing knowledge exposed by formal knowledge representations.    (057)

Glossary    (058)

Analogy: One of several basic human reasoning processes. Used to predict, explain, and construct a framework of existing understanding. Analogies are composed of two parts: source (what is known) and target (novel experience or knowledge). (Dunbar, 2002).    (059)

Analogical reasoning: Involves mapping source features onto the target (see analogy above). (Gentner, 2000).    (060)

Category: Grouping of experiences, objects, or entities. A collection of instances, objects, or entities treated as if they are the same. Categories can consist of objects, kinds, people, events, and ideas.    (061)

Concept: Collocation of all the knowledge we have about a category. An idea that characterizes or identifies a set, or category, of objects. Concepts help us categorize. Concepts help define categories.    (062)

Characteristic: Can be used to group concepts or divide instances.    (063)

Facet: A characteristic (one of many) that composes a concept. Facets are extracted by analysis during the process of division; this consists of selecting one characteristic (facet) and dividing the entities of a given universe accordingly. The sum total of individual members of each category constitutes the manifestation of a facet. The facet represents one characteristic of a concept. Facets are composed of instances. Instances that have been categorized are manifestations.    (064)

Facet analysis: Process of identifying concepts and analyzing identified characteristics which compose the concepts but which have not yet been organized (instances). Consists of choosing a characteristic as a principle of division, and using each identified characteristic to order the instances into categories.    (065)

Facet synthesis: Once subjects are divided into component single-concept or single-characteristic parts, these parts can be used to provide subject access by combining them according to the interests and needs of the searcher.    (066)

Instance: A discovered feature that leads to a characteristic. If redness is discovered, color is known to be a characteristic that divides the universe.    (067)

Manifestation: Consists of all of the identified and categorized instances. Instances that have been categorized are manifestations. The listing of manifestations that exist in the dataset should be exhaustive.    (068)

Term: A label applied to each individual facet, preferably drawn from the universe.    (069)

References    (070)

Atherton, P. (1965). Ranganathan's classification ideas: An analytico-synthetic discussion. Library Resources and Technical Services, 9(4), 463-472.    (071)

Berry, M. et al. (1993). SVDPACKC (Version 1.0). Available at http://www.netlib.org/svdpack/index.html    (072)

Bootstrap Institute. (2002). Retrieved March 7, 2002, from http://www.bootstrap.org/    (073)

Dunbar, K. (2002). Analogy. Retrieved July 3, 2002, from http://www.psych.mcgill.ca/perpg/fac/dunbar/analogy.html    (074)

Gentner, D., Holyoak, K.J., Kokinov, B. (Eds.) (2000). Analogy: Perspectives from cognitive science. Cambridge, MA: MIT Press.    (075)

Hall, R. P. (1989). Computational approaches to analogical reasoning: A comparative analysis. Artificial Intelligence, 39, 39-120.    (076)

Spanoudakis, G., & Constantopoulos, P. (1996). Elaborating analogies from conceptual models. International Journal of Intelligent Systems, 11(11), 917-974.    (077)

Tversky, A., & Kahneman, D. (1983). Extensional versus intuitive reasoning: The conjunction fallacy in probability judgment. Psychological Review, 90, 293-315.    (078)

unrev-ii mailing list archive. (2002). Retrieved May 3, 2002 from http://www.bootstrap.org/dkr/discussion/subject.html    (079)

Vickery, B. C. (1960). Faceted classification. London: ASLIB.    (080)

Vickery, B. C. (1966). Faceted classification schemes. New Brunswick, N.J: Graduate School of Library Services, Rutgers the State University.    (081)

Ward, J.H., Jr. (1963). Hierarchical grouping to optimize an objective function. Journal of the American Statistical Association, 58, 236-244.    (082)

Xao, M. & Börner, K. (2001). Information Visualization Software Repository. Retrieved March 14, 2002 from http://ella.slis.indiana.edu/~katy/L697/code/    (083)

1 For further discussions of faceting refer to Atherton (1965), and Vickery (1960).    (084)

2 http://www.bootstrap.org    (085)

3 http://www.eekim.com/talks/cap2002/index-3.html    (086)

4 http://www.bootstrap.org/ohs/index.jsp    (087)

5 The data has since been migrated to a MySQL database to take advantage of the text indexing systems provided therein.    (088)