Getting to Know the Unknown User

Chris Dent, L505 Essay 7

2001-02-18


Ellis, Ford and Furner, in a monument to abstraction, describe the challenges of indexing large, general-use databases. They summarize the matter this way: “The problem is that of indexing for the unknown user” [44]. On the World Wide Web the user is unknown because the database (the set of web pages) and the browsing user are both distant from the typical indexing system, the search engine. The search engine knows little about either the content or the user, so building connections between the two, especially connections based on more than word frequencies, is difficult. The major publicly accessible search engines address these problems in a variety of ways.

Google.com uses a technology it calls PageRank, which is interesting in itself and may also point to further developments that make indexing on the web more successful. The technology first determines the set of pages that exactly match the given search terms. That set is scored for relevancy by a variety of methods, including the “nearness” of the various terms to one another. The set is then further scored by the PageRank system. Documentation at google.com states: “PageRank relies on the uniquely democratic nature of the web by using its vast link structure as an indicator of an individual page's value.”
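As a rough illustration of the first two steps, the following minimal sketch keeps only documents that contain every search term and scores them by how closely the terms sit together. The function names and the scoring formula (number of terms divided by the smallest span of words containing all of them) are assumptions made for this example; google.com does not document its actual nearness measure, and the sketch ignores punctuation.

```python
from itertools import product


def term_positions(text):
    """Map each lowercased word to the list of positions where it occurs."""
    positions = {}
    for index, word in enumerate(text.lower().split()):
        positions.setdefault(word, []).append(index)
    return positions


def nearness_score(text, terms):
    """Return 0.0 if any term is missing, else len(terms) / smallest span.

    The span is the number of words in the shortest stretch of text that
    contains one occurrence of every term (a hypothetical measure).
    """
    positions = term_positions(text)
    terms = [term.lower() for term in terms]
    if any(term not in positions for term in terms):
        return 0.0  # the page does not match the search terms exactly
    best_span = min(
        max(combo) - min(combo) + 1
        for combo in product(*(positions[term] for term in terms))
    )
    return len(terms) / best_span


# Hypothetical page texts, invented for illustration.
adjacent = nearness_score("linux laptops are great", ["linux", "laptops"])
distant = nearness_score("linux runs well on most laptops", ["linux", "laptops"])
missing = nearness_score("windows laptops", ["linux", "laptops"])
```

Under this measure a page where the terms are adjacent scores higher than one where they are far apart, and a page missing a term drops out of the set entirely.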

PageRank determines how many other web pages link to the page currently being scored. If the number of links is high, it is assumed that the link-creating populace considers the page valuable. The score is further refined by the PageRank values of the pages where the links originate: a link from a page considered important is assumed to make the page it points to more important. The PageRank score is then mixed with the other relevancy scores and a result set is presented to the user.
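The idea that a page's score depends on the scores of the pages linking to it can be sketched as a simple repeated calculation. This is a simplified version based on the publicly described form of PageRank, not google.com's production system; the damping value of 0.85, the fixed iteration count, the handling of pages with no outgoing links, and the four-page link graph are all assumptions chosen for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Estimate PageRank for a graph given as {page: [pages it links to]}."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}  # start with equal scores
    for _ in range(iterations):
        # Every page receives a small baseline score.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page passes its score on, split across its outgoing links.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its score evenly.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank


# Hypothetical link graph: a, b and d all link to c; c links back to a.
example_links = {"a": ["c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(example_links)
```

In this small graph, page c collects the most links and ends up with the highest score, while a outranks b and d because its single incoming link comes from the highly ranked c.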

In my own experience google.com is quite effective. Linking activity by the world at large does appear to help indicate a valuable page.

In effect, google.com is making the unknown user a little more known. The software assumes that the searching person (A) is similar to the aggregated persona represented by worldwide linking behavior (B). If A is looking for information about “linux laptops,” then among the documents found, what B thinks is most valuable is likely to be what A will think is most valuable. This is effectively the same profiling behavior used to make film, book and other recommendations at many online stores.

At this time, the information available to google.com about what A likes is limited to the search terms provided. That is not really enough to create the overlapping views required for accurate profiling: google.com is viewing a single map instead of two overlapping maps. What would happen if A’s preferences were available? Assume that google.com had a mechanism for remembering every search a person has made, and every link that person has selected, since record keeping began. Then when A searched on “linux laptops” several things could happen:

· A’s profile could be compared with others. A similar profile that had also searched for “linux laptops” and taken particular links could help to score links that match the search terms.

· Profiles that are similar could be analyzed to create a controlled vocabulary used for synonyms or to provide suggestions to the searcher.

· With some involvement from the community, followed links could be scored. Those scores could be retained in the profiles for later use.

· Direct marketing droids could have a field day.

· Unscrupulous investigators could ruin lives.
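The first of these possibilities, comparing A's profile with similar ones, can be sketched in a few lines. This is only an illustration under stated assumptions: a profile is taken to be a record of which links were followed for which searches, similarity is measured with the Jaccard index (one choice among many), and the names, searches and sites below are all invented.

```python
def profile_pairs(profile):
    """Flatten {query: {followed links}} into a set of (query, link) pairs."""
    return {(query, link) for query, links in profile.items() for link in links}


def jaccard(a, b):
    """Overlap between two sets: shared items divided by all items."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def score_links(user, query, profiles):
    """Score links others followed for `query`, weighted by profile similarity."""
    mine = profile_pairs(profiles[user])
    scores = {}
    for other, profile in profiles.items():
        if other == user:
            continue
        similarity = jaccard(mine, profile_pairs(profile))
        for link in profile.get(query, ()):
            scores[link] = scores.get(link, 0.0) + similarity
    return scores


# Hypothetical search histories, invented for illustration.
profiles = {
    "A": {"debian install": {"distro.example"},
          "laptop review": {"reviews.example"}},
    "B": {"debian install": {"distro.example"},
          "laptop review": {"reviews.example"},
          "linux laptops": {"laptops.example"}},
    "C": {"cheap flights": {"travel.example"},
          "linux laptops": {"store.example"}},
}
link_scores = score_links("A", "linux laptops", profiles)
```

Here the link followed by B, whose history overlaps with A's, is scored above the link followed by C, whose history shares nothing with A's. Even this toy version requires storing each person's complete search history, which is exactly what makes the last two possibilities in the list plausible.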

There are huge privacy issues here. These issues are important not just for search-engine profiles but for any system in which profiling data may be used. In the future it is possible that profile information will not be distributed amongst services (e.g. google.com has the google.com profile for me, netflix.com has the netflix.com profile for me) but will instead be centralized with the individual. When a new service is encountered, a conversation will occur between the user’s profile and the service. How that profile information is shared, owned, protected and used will be of major concern. The problem is that of indexing for the unknown user because the user must have the option of remaining unknown.

REFERENCES

Ellis, David; Ford, Nigel; Furner, Jonathan (1998). In search of the unknown user: indexing, hypertext and the World Wide Web. Journal of Documentation, 54: 28-47.

Google Search Technology, http://www.google.com/technology/index.html.

Lester, Toby (2001). The Reinvention of Privacy. http://theatlantic.com/issues/2001/03/lester.htm.