Thursday, November 6, 2008

Week 10 Readings

Web Search Engines Parts 1 & 2

These two articles were very helpful in defining a lot of terms and explaining how search engines are able to provide high-quality answers to queries. The explanation of how data centers are set up with clusters, and how that makes it possible to distribute the load of answering 2,000+ queries/second, was very interesting. The amount of web data out there that search engines crawl through is staggering, and it makes sense that crawling is carried out by many, many machines dedicated specifically to this purpose. The 'politeness delay' was also interesting to learn about, and it makes a lot of sense to build one into the crawler algorithms.
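
Just to make the idea concrete for myself, here is a rough Python sketch of what a per-host politeness delay could look like; the 2-second delay, the fetch() stub, and the example URLs are my own made-up details, not anything from the articles.

```python
import time
from urllib.parse import urlparse

# A minimal sketch of a "politeness delay": the crawler waits a fixed
# interval between requests to the same host so it does not overload it.
# The 2-second delay and the fetch() stub are illustrative assumptions.
POLITENESS_DELAY = 2.0          # seconds to wait between hits on one host
last_fetch_time = {}            # host -> timestamp of our last request

def fetch(url):
    """Placeholder for the real HTTP download step."""
    print("fetching", url)

def polite_fetch(url):
    host = urlparse(url).netloc
    elapsed = time.time() - last_fetch_time.get(host, 0.0)
    if elapsed < POLITENESS_DELAY:
        time.sleep(POLITENESS_DELAY - elapsed)   # be polite to this server
    last_fetch_time[host] = time.time()
    fetch(url)

for u in ["http://example.com/a", "http://example.com/b", "http://example.org/x"]:
    polite_fetch(u)
```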

The second part explained certain indexing tricks that make it possible to process phrases more effectively and increase the quality of results. These are all things I take for granted when searching, but it's great to know that there's this massive infrastructure working away when I Google, for instance, "cat paw growth" like I just did today! As always though, the Internet will never replace a vet's examination!
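
One trick along these lines is a positional index, which records where each word occurs so the engine can check that query terms actually sit next to each other. This toy Python sketch is only my own illustration of the general idea (the two little 'documents' are made up), not the exact techniques the article describes.

```python
from collections import defaultdict

def build_positional_index(docs):
    # term -> doc_id -> list of positions where the term occurs
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.lower().split()):
            index[term][doc_id].append(pos)
    return index

def phrase_search(index, phrase):
    terms = phrase.lower().split()
    if not all(t in index for t in terms):
        return set()
    hits = set()
    for doc_id, positions in index[terms[0]].items():
        for start in positions:
            # every following term must appear at the next position over
            if all(doc_id in index[t] and start + i in index[t][doc_id]
                   for i, t in enumerate(terms[1:], 1)):
                hits.add(doc_id)
    return hits

docs = {1: "my cat has a growth on her paw",
        2: "cat paw growth causes and treatment"}
index = build_positional_index(docs)
print(phrase_search(index, "paw growth"))   # only doc 2 has the exact phrase
```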

OAI Protocol for Metadata Harvesting article

This article provides a brief description of the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) and assumes a 'relatively high level of familiarity with how the protocol works...' Even though I am not familiar with OAI, the article does describe the two parts of OAI (data providers and harvesters) and explains that the protocol provides access to parts of the "invisible web". Since different communities use this protocol to meet their different needs, it obviously has to be nonspecific. The article lists future enhancements and future directions, and since it was written, I'm curious whether these enhancements have been made. The most interesting future direction they mention is the importance of using a controlled vocabulary. They mention that normalizing the numerous different controlled vocabularies used by different data providers is 'prohibitively resource intensive', but they do note that in the future 'authority agencies' could use their thesauri to provide access to items in the repositories. This should probably be a high priority, because there's really no point in having all of this data in one place if you can't easily find what you need.
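
As I understand it, the harvester side of the protocol is basically just HTTP requests with a 'verb' parameter sent to a data provider's base URL. Here is a tiny Python sketch of what issuing a ListRecords request might look like; the base URL is a made-up placeholder, though ListRecords and the oai_dc metadata prefix are part of the protocol itself.

```python
from urllib.request import urlopen
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Hypothetical data provider; a real harvester would use an actual repository URL.
BASE_URL = "http://example.org/oai"

params = {"verb": "ListRecords", "metadataPrefix": "oai_dc"}
url = BASE_URL + "?" + urlencode(params)

with urlopen(url) as response:          # the harvester pulls metadata over plain HTTP
    tree = ET.parse(response)

ns = {"oai": "http://www.openarchives.org/OAI/2.0/"}
for record in tree.findall(".//oai:record", ns):
    header = record.find("oai:header", ns)
    print(header.findtext("oai:identifier", namespaces=ns))
```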

Deep Web White Paper

This paper from 2001 describes the Deep Web and is the first known attempt to study the Deep Web in quantifiable terms. The Deep Web is the large part of the web that is not searchable using typical search engines like Google. According to their 2001 data, only 0.03% of web pages are reached when using search engines. This number has probably gone up, since search engines probably have better crawlers now, but by how much?

It is curious that Deep Web sites, at least in 2001, got far more traffic than surface web sites. At first I found this surprising, since search engines do not typically turn up Deep Web sites for a searcher's query, and hence I thought they would receive far less traffic. I guess it is safe to assume that they get more hits because they are used by specific groups of people who reach them without using search engines. This is also surprising considering that the subject areas of Deep Web sites don't appear to be very different from those of the surface web (Table 6). The fact that the Deep Web has more quality results than the surface web is somewhat worrisome, and I wonder if the number of Deep Web pages has decreased because they can now be crawled, unlike 7 years ago, thanks to Google, Yahoo, etc. coming up with better crawler algorithms. As information seekers, we obviously have to take care and figure out better ways to find relevant information in the Deep Web.

1 comment:

Kristina Grube Lacroix said...

I agree that the first two articles were very informative; they explained very well how search engines work. You wrote that search engines only reach about 0.03% of web pages. This seems like such a surprising number; there are so many pages out there that may be helpful but are not used on a regular basis. I also think that the share of pages search engines miss might be smaller now, since it has been 7 years since the article was written and web searches may have improved since then.