
Web search engine
Teoma is a search engine created by Rutgers professor Apostolos Gerasoulis
and his associates. It uses a clustering algorithm to weight relevance
rankings. Each indexed page is assigned to one or more "communities"
- sets of pages about the same subject. Inbound links from pages in the
community of related pages are ranked higher than similar links from
outside the community. In addition to relevance weighting, this page classification
is the basis for two additional features of the search engine user interface.
The "Refine - suggestions to narrow your search list " list lets the user
select the appropriate classification for a given request. Similarly, the
"Resources - link collections from experts and enthusiasts" list presents
pages that Teoma has essentially determined to be bibliographies (lists
of resources) on the topic at hand.
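To make the community idea concrete, here is a minimal sketch of how a result list tagged with community labels could drive features like "Refine" (one entry per community) and "Resources" (pages whose outbound links point heavily into the topic's community, i.e. bibliography-like pages). The data model, labels, and thresholds are my own illustration, not Teoma's implementation.

```python
from collections import defaultdict

# Toy result records: (url, community label, links into that community, total outbound links).
# These fields and thresholds are illustrative assumptions, not Teoma's data model.
results = [
    ("http://example.edu/dl-overview", "Digital libraries", 4, 30),
    ("http://example.org/dl-links",    "Digital libraries", 42, 50),
    ("http://example.com/popcorn",     "Snack foods",        1, 12),
]

def refine_suggestions(results):
    """Group a result list by community label -- the raw material for a 'Refine'-style list."""
    groups = defaultdict(list)
    for url, community, _, _ in results:
        groups[community].append(url)
    return dict(groups)

def resource_pages(results, min_links=20, min_ratio=0.6):
    """Flag pages whose outbound links point overwhelmingly into one community --
    a crude stand-in for detecting bibliography-like 'Resources' pages."""
    hubs = []
    for url, community, into_community, total in results:
        if total >= min_links and into_community / total >= min_ratio:
            hubs.append((url, community))
    return hubs

print(refine_suggestions(results))   # {'Digital libraries': [...], 'Snack foods': [...]}
print(resource_pages(results))       # [('http://example.org/dl-links', 'Digital libraries')]
```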
Simple search
- Name: Teoma simple search (www.teoma.com)
- Operators: +, -
- Default join: AND (results in pages where all the
words are found)
- Quoted phrases: Yes. The search screen also has a check
box labelled "Find this phrase" that does the quoting for you
behind the scenes. Nice UI touch.
- Truncation symbol: None available. Their instructions
say "Different
word stems or endings can lead to different results. Try all endings."
- Stemming: Not supported, see above.
- Relevance algorithm: Each result
list is grouped into "communities": groups of sites with common
subject material. The communities are listed on the result page, and a
user may click on one to limit responses to members of that community.
The initial result set is ranked by a PageRank-like algorithm with the
enhancement of weighting links from within a community higher than links
from outside the community (a toy version of this weighting is sketched
after this list). The description is purposefully vague about how
relative weights between communities are assigned.
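Since Teoma does not publish the algorithm, the following is only a minimal sketch of the kind of community-weighted link scoring described above; the weights, page identifiers, and function name are hypothetical.

```python
# Hypothetical weights: Teoma does not publish its actual values.
INTRA_COMMUNITY_WEIGHT = 1.0   # link from a page in the same community
EXTRA_COMMUNITY_WEIGHT = 0.3   # link from a page outside the community

def community_link_score(page_community, inbound_links, community_of):
    """Sum link weights, boosting links whose source shares the target page's community.

    inbound_links: iterable of source-page identifiers
    community_of:  dict mapping page identifier -> community label
    """
    score = 0.0
    for source in inbound_links:
        if community_of.get(source) == page_community:
            score += INTRA_COMMUNITY_WEIGHT
        else:
            score += EXTRA_COMMUNITY_WEIGHT
    return score

# Example: a "Digital libraries" page with two in-community links and one outside link.
communities = {"a.edu/page": "Digital libraries",
               "b.org/page": "Digital libraries",
               "c.com/page": "Snack foods"}
print(community_link_score("Digital libraries",
                           ["a.edu/page", "b.org/page", "c.com/page"],
                           communities))   # 2.3
```

A real implementation would fold this weighting into an iterative link-analysis computation rather than a single pass, but the asymmetry between in-community and out-of-community links is the point being described.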
Advanced search
Teoma offers an advanced search. The user interface tries to demystify
the options, providing choices in lay language instead of computer or logic
talk. Choices include:
- Phrase selectivity: "must have", "must not have", "should have" (a sketch of these semantics follows this list)
- Phrase location: phrase is "anywhere on page", "in page title", "in
URL"
- Language: "Any language" or a specific language from a list of ten
(Danish, Dutch, English, French, German, Italian, Norwegian, Portuguese,
Spanish, Swedish)
- Site location: You can specify a domain or any of nine geographical
regions (Africa, Central America, Europe, India or Asia, Middle East,
North America, Oceania, South America, Southeast Asia)
- Date page was modified: choice of range, specific date, before or after
a given date.
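Before the examples, here is a minimal sketch of the boolean semantics that the "must have" / "must not have" / "should have" choices imply; the function and its arguments are my own illustration of those semantics, not Teoma's code.

```python
def matches(page_text, must_have=(), must_not_have=(), should_have=()):
    """Apply 'must have' / 'must not have' as hard filters and count 'should have'
    terms as a soft ranking score. Purely illustrative semantics."""
    text = page_text.lower()
    if not all(term.lower() in text for term in must_have):
        return None                                   # page is filtered out
    if any(term.lower() in text for term in must_not_have):
        return None                                   # page is filtered out
    return sum(term.lower() in text for term in should_have)  # higher = ranked higher

# Example: require "digital library", exclude "museum", prefer "collection development".
print(matches("A digital library collection development policy statement ...",
              must_have=["digital library"],
              must_not_have=["museum"],
              should_have=["collection development"]))   # 1
```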
Search examples
- popcorn energy machine: This is what
I started with... I was trying to think of three unrelated terms, but
of course, there are lots of relationships among these words! Did some
variations to explore the syntax rules; tried various quote marks; put them
into the advanced search to explore how the pulldowns and the options
like "must have", "should have", and "must not have" worked. Tried
eliminating a keyword, and the character of the result list changed
dramatically.
- Tried a bunch of searches inspired by Dr. C. Jorgensen: running,
happy, fun, sad... "conceptual" searches, hard
searches in the image world. In the text world Teoma always finds something,
but the results were rather scattered and the secondary search tools
(Refine and Resources) were similarly unimpressive.
- digital library collection development: When entered
as unquoted terms, this search resulted in a high-quality list of
digital library resources: NYPL digital library, SunSITE, Yale, IFLA,
a CLIR report, DLib, Glasgow. The ninth entry on the list was the first
I did not know. When the check box "Find this
phrase" was checked, the list changed to a tighter focus on specific
policy statements of organizations. Other searches: "digital
library collection policy" only yielded two hits. "digital
library collection management" had twelve hits and no refinements
or resources. "digital library" "collection
management" was overwhelming, 4,000+ hits, but the Refine and
Resources that it generated were worthless.
The "Refine" and "Resources" sometimes provide powerful
enhancements to web searching, but the idea seems more useful than this
implementation delivers. Google has a feature similar to "Refine" but
it is not as sophisticated. Teoma lists classification titles so the user knows what
community is being selected. Google's "Similar pages" feature,
in contrast, is a
"classification by example". You pick an individual site and
see pages that are similar, but you do not know the criteria or classification
system determining the similarity. Teoma lets you make fine distinctions
explicitly, where in Google you need to guess. The "Resources" feature
has no equivalent in Google. I used Teoma as my default search machinery
for four days by making it my browser home page. I found that I did not
trust the results; if it was a search I really cared about, I ran the same
search in Google.

Web directory
The "open directory project" is an open source directory constructed and
maintained by volunteers.
Simple search
- Name: Open Directory Project (dmoz.org)
- Operators: +, -, OR, AND, ANDNOT (at least they don't
call it NAND),
and field operator prefixes ("t:" for title search, "d:" for
description search, and "u:" for URL search); a sketch of how these
prefixes and the trailing-only wildcard might be handled follows this list.
- Default join: AND (results in pages where all the
words are found)
- Quoted phrases: Yes.
- Truncation symbol: The * is a limited wildcard character
usable only at the end of a search term, e.g. "GRADUAT*". Other positional
forms like "*GRAD" and "GRAD*UA" are not supported.
- Stemming: Not described; I'd assume it is only
supported via the truncation symbol.
- Relevance algorithm: The dmoz search
algorithm is derived from Isearch, and
the descriptions say that relevance ranking is supported.
I could not find a description of the algorithm on the web or in ACM
Digital Library, ArticleFirst, IEEE Xplore, Internet & Personal Computing
Abstracts, or ScienceDirect.
I downloaded
the code and couldn't find the module where relevance ranking was
calculated.
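Since the published documentation covers only the syntax itself, here is a small sketch of how a query parser might honor the t:/d:/u: field prefixes and the trailing-only wildcard described above; the function is my own illustration, not dmoz's Isearch code.

```python
FIELDS = {"t:": "title", "d:": "description", "u:": "url"}

def term_matcher(term):
    """Return (field, predicate) for one ODP-style search term.
    Honors the t:/d:/u: prefixes and a wildcard only at the end of the term."""
    field = "any"
    for prefix, name in FIELDS.items():
        if term.startswith(prefix):
            field, term = name, term[len(prefix):]
            break
    if "*" in term[:-1]:
        raise ValueError("wildcard is only supported at the end of a term")
    if term.endswith("*"):
        stem = term[:-1].lower()
        predicate = lambda word: word.lower().startswith(stem)
    else:
        exact = term.lower()
        predicate = lambda word: word.lower() == exact
    return field, predicate

field, pred = term_matcher("t:graduat*")
print(field, pred("Graduate"), pred("undergrad"))   # title True False
```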
Advanced search
The Open Directory Project offers an advanced search, albeit with very
few choices.
Choices include:
- Only show results in category: Limits results
to Arts, Business, Computers, Games, Health, Home, Kids
and Teens, News, Recreation, Reference, Regional, Science, Shopping,
Society, Sports, World, and Adult
- Search "Categories Only", "Sites Only", "Sites
and Categories": Allows user to differentiate between
searching for a category and searching for a site
- Kids and teens sites: Three checkboxes ("Kids", "Teens",
"Mature Teens"). In a rather awkward user interface, this filter is always
present but only active when the category "Kids and Teens" is selected.
- Random: Searching for a null term (blank input box)
returns four categories chosen at random.
- Relevance algorithm: There is no description of the
relevance algorithm on the website. The search engine started with Isearch.
Search examples
- isearch relevance ranking: This is recursive but ineffectual,
as you would guess: "No results found." I tried permutations
of this phrase as well. The term "isearch" turns up a few
of the same stale results seen in Teoma.
- digital library collection development: This gives
two directories and two sites. One site was irrelevant; one was an excellent
find (The Digital Library Center
at University of Tennessee) but not specifically on the topic of
collection development in digital libraries. One directory was irrelevant
(but interesting) and Reference:Libraries:Digital was
close. Refining the search reveals why the original
was not successful: "collection development" is not a category in dmoz.
"Collection policy" is the phrase they use. However "digital library
collection policy" returns zero hits. Truncate to "digital library" and
the world opens up: 7 categories, 624 sites. Exploring leads to nothing:
Top: Reference: Libraries: Library and Information Science: Digital Library
Development is closest, but still does not contain what I'm looking for.
Explore more. Conclusion: Collection policy in digital libraries is too
specialized to have a category; Reference:
Libraries: Library and Information Science: Technical Services: Collection
Development is the best I'll get in dmoz.
- directory crawl: The value of a directory is in its
classification system rather than its search. I used dmoz for several
reference needs. My elder son couldn't find his English-Spanish
dictionary. When I searched for English Spanish dictionary, I rapidly
located http://dmoz.org/Reference/Dictionaries/World_Languages/S/Spanish/English/. Trying
to find it via directory crawl took much longer. Reference/Dictionaries
is easy, but then you have to decide between English and World Languages.
English? Wrong answer. Once in World Languages, a new interface convention
shows up, an alphabet, from which you must choose the first letter of
the language. No prompt, no explanation; very hard to figure out. If
you guess "S" you can easily find Spanish and you are all set. However,
the search engine was easier and faster.
I haven't used dmoz in years and thought it would be worth another
look. It forms the foundation for the Google Directory
and other directories behind search engines. Since it is open, you and
I can create and edit categories. I'm looking forward to seeing http://dmoz.org/Reference/Libraries/Library_and_Information_Science/Librarians/Kazmer,_Michelle.
The directory structure is logical; the user interface is clean and easy
to traverse. Downsides? The classification system is not very deep, so my
topic (the LCSH heading Digital libraries -- Collection development) is not included.
Over the three days I was using it for this paper, the server seemed very
sluggish and at times failed to respond to HTTP requests.

Metasearch service
Clusty is a new metasearch tool from Vivisimo,
a company that previously sold search software components and now offers
consumer-level search. It is rich in features and new ideas, works quickly,
and has a nice interface. The offering was developed by computer scientists
from Carnegie Mellon. The CEO is Raul
Valdes-Perez, who has published
widely on a variety of topics.
Clusty's tabbed top lets a user select from nine sources: Web+, News,
Images, Shopping, Encyclopedia, Gossip, eBay, Blogs, Slashdot. Choices
like "Gossip", "Blogs",
and "Slashdot" differentiate this search machine from the competition.
There's even a customize function that lets you create your own tab and
title, with a customized set of search sources.
Simple search
- Name: Clusty (clusty.com)
- Operators: +, -, OR, or, AND, and, NOT, not; domain:,
host:, site:, link:, linktext:, text:, title:, and url:.
- Default join: AND (results in pages where all the
words are found)
- Quoted phrases: Yes, exact phrase match
- Truncation symbol: Not supported.
- Stemming: Not supported
- Relevance algorithm: Nothing on
the Clusty websites describes this. Vivisimo has descriptions of two
products used in Clusty: the Clustering Engine and the Content Integrator.
The former takes a result list and categorizes it, using some variant
of the vector space model that is not described. The Clustering Engine
has a cool feature: a licensee can tailor the clustering weights for
keywords, phrases, and terms specific to their trade or industry. The
Content Integrator is a tool that quickly translates Clusty queries into
queries for other search engines and assembles the results. Clusty is
clearly built on these two modules (a generic sketch of this
fan-out-then-cluster pipeline follows this list).
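Since neither module is documented in detail, the following is only a generic sketch of the two-stage pipeline those descriptions imply: fan a query out to several sources, merge and de-duplicate the results, then group them after retrieval. The source functions, stopword list, and keyword-based grouping are hypothetical stand-ins; the real Clustering Engine's vector-space weighting is not published.

```python
import concurrent.futures
from collections import Counter, defaultdict

# Hypothetical per-source search functions; each returns a list of (title, url) pairs.
def search_engine_a(query):
    return [("Digital library collection policies", "http://a.example/1")]

def search_engine_b(query):
    return [("Digital library collection policies", "http://a.example/1"),
            ("Popcorn machine energy ratings", "http://b.example/2")]

SOURCES = [search_engine_a, search_engine_b]

def metasearch(query, timeout=5):
    """Fan the query out to every source in parallel, merge results, drop duplicate URLs."""
    merged, seen = [], set()
    with concurrent.futures.ThreadPoolExecutor() as pool:
        futures = [pool.submit(source, query) for source in SOURCES]
        for future in concurrent.futures.as_completed(futures, timeout=timeout):
            for title, url in future.result():
                if url not in seen:
                    seen.add(url)
                    merged.append((title, url))
    return merged

def cluster_by_keyword(results, stopwords=("the", "of", "and")):
    """Crude post-retrieval clustering: label each result with its most frequent title word."""
    counts = Counter(word.lower() for title, _ in results for word in title.split()
                     if word.lower() not in stopwords)
    clusters = defaultdict(list)
    for title, url in results:
        words = [w.lower() for w in title.split() if w.lower() not in stopwords]
        label = max(words, key=lambda w: counts[w]) if words else "other"
        clusters[label].append(url)
    return dict(clusters)

results = metasearch("digital library collection development")
print(cluster_by_keyword(results))
```

The per-source timeout mirrors the timeout control exposed in Clusty's advanced search, and the keyword grouping stands in for the undocumented vector-space clustering.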
Advanced search
Advanced search allows the user more control over sources, clustering,
and type of content.
- Sources: Choose from three clusters of sources: web,
news, and extra. Clicking a cluster selects or deselects all the sources
in the cluster. Users can also select or deselect individual sources.
The web cluster searches GigaBlast, MSN, Lycos, Looksmart, Wisenut,
Open Directory, and Overture. The news cluster searches Reuters,
Clusty's NYTimes, Clusty's Yahoo News, CNN, USA Today, and BBC News.
(They don't explain what "Clusty's" means for the NYTimes and Yahoo
News sources.) The extra cluster searches BizRate, Images, LII,
eBay, FirstGov, and PubMed@NIH.
- Clustering: The user controls cluster size (100, 200,
500 results) and timeout (2, 5, 10, 30 seconds)
- Content type: The user controls language (50+ different
choices) and obscenity level (Filtering, no filtering).
Search examples
- vivisimo relevance ranking: This time recursion yields
a panoply of sources, but the algorithmic description still remains
unfound. I suspect they consider it a trade secret. I've looked in trade
publications and the computer science literature.
- digital library collection development: This is the
best result for this search from any of the engines I have used.
The first sites found are the highest-quality ones:
California Digital Library policy pages, SunSITE, D-Lib, American Memory.
The classification choices are excellent: University, Development
Policy, Science, California Digital Library, Library of Congress, Library
Research, Library Resources, Conference, Framework, Issues, and (more...).
- defaults: The morning after the first Presidential
debate, the default clusters on the "Gossip" tab included a "Bush, Debate" category.
At first I thought it was strange; shouldn't that be in the "News" tab?
Then I started reading, and son of a gun! It is gossip about the debate!
"News", on the other hand, had a "Kerry, Debate" category with substantive
articles. Political bias in the Clustering Engine? The "Encyclopedia"
tab had links to Wikipedia articles about the 2004 debate program,
along with links to the candidates' pages. Other tabs displayed only a
blank search box.
- images: I'm wandering through gossip land (Paris is
everywhere!) and I see this picture: ... What
is Maureen Dowd doing in the gossip
pages? The image search quickly shows me the source of my confusion:
Melissa Etheridge or Maureen Dowd?
The nice thing about this tool is that once you find an initial category
that is close to the type of information you seek, you can navigate through
the classification system to narrow or broaden your search. The results
returned were excellent. I've been using it for several days and it seems
to produce high quality results consistently. I have yet to find something
that would make me want to return to Google as my default search engine.