The Good, the Bad and the Ugly

Darkbird18's Internet Information Research: Internet Search FAQ 1/2 (updated 01/05/2022): Web Search Engines FAQS: Questions, Answers, and Issues



Internet Search FAQ 1/2
( Part1 - Part2 )
Archive-name: internet/research-faq/part1
Posting-Frequency: last week-end of each month
Last-modified: 27 June 2003.
Copyright-Notice: see end.


Path: senator-bedfellow.mit.edu!bloom-beacon.mit.edu!nycmny1-snh1.gtei.net!news.gtei.net!newsfeed!wn13feed!wn12feed!wn14feed!worldnet.att.net!24.30.200.11!news-east.rr.com!bigfeed.bellsouth.net!news.bellsouth.net!peer01.cox.net!cox.net!news-hub.cableinet.net!blueyonder!internal-news-hub.cableinet.net!news-binary.blueyonder.co.uk.POSTED!53ab2750!not-for-mail
From: charlie@harris.u-net.com (Charlie Harris)
Newsgroups: misc.writing,misc.writing.screenplays,alt.answers,misc.answers,news.answers
Approved: news-answers-request@MIT.EDU
Subject: Internet Search FAQ 1/2
Followup-To: misc.writing
Summary: Part 1 of 2: This posting gives help for writers and others in using the Internet for research, giving suggestions as to which methods are best for different needs, and including worked examples.
Reply-To: charlie@harris.u-net.com
Expires: Sun, 27 Jul 2003 00:00:00 GMT
Message-ID: <3eff490a.31147928@news.blueyonder.co.uk>
X-Newsreader: Forte Free Agent 1.1/32.230
Lines: 1518
Date: Sun, 29 Jun 2003 20:21:39 GMT
NNTP-Posting-Host: 82.35.42.38
X-Complaints-To: abuse@blueyonder.co.uk
X-Trace: news-binary.blueyonder.co.uk 1056918187 82.35.42.38 (Sun, 29 Jun 2003 20:23:07 GMT)
NNTP-Posting-Date: Sun, 29 Jun 2003 20:23:07 GMT
Organization: blueyonder (post doesn't reflect views of blueyonder)
Xref: senator-bedfellow.mit.edu misc.writing:729659 misc.writing.screenplays:362837 alt.answers:68111 misc.answers:16112 news.answers:253500

Internet Search FAQ

Part 1 of 2

********************************************************************

Part 1

1. WHAT IS THIS FAQ?
2. DISCLAIMER
3. WHY USE THE INTERNET AT ALL?
4. HOW CAN I FIND...?
4.1 How Can I Find Specific files, text, multi-media or people?
4.2 How Can I Find Specific information?
4.3 How Can I Find More General Background Information?
5. HOW CAN I FIND INFORMATION FASTER?
6. SHOULD I PAY FOR INFORMATION?
7. WHERE CAN I GET FURTHER HELP?
8. HOW CAN I VALIDATE WHAT I FIND?
8.1 How reliable is the Net?
8.2 What can I do about it?
8.3 How should Internet sources be cited?
9. WHAT ABOUT THE FUTURE?

Part 2

10. URLS FOR A RAINY DAY - useful links for research
11. END CREDITS

This FAQ is also available on the web, along with an archive of
changes.

It is updated and posted roughly the first weekend of each month,
circumstances permitting, to: misc.writing, alt.movies.independent,
alt.union.natl-writers, misc.writing.screenplays, alt.answers,
misc.answers and news.answers.

At the same time, any changes to the FAQ, including new resource
links, are posted to all except the *.news groups in a separate
message "Internet Search FAQ - What's New", to save users having to
download the entire FAQ each month just to find out what's been added.
This can also be found at the same web address.

Keen searchers are urged to check this out regularly for new ideas and
links, and those with clever browsers (Navigator and Internet Explorer
4 and above) can set them to "subscribe" to the page on a regular
basis - if they can work out how.

All suggestions and comments are welcome. Please send to
charlie@harris.u-net.com.

********************************************************************

1. WHAT IS THIS FAQ?

Although this posting was compiled originally for writers, it has
become increasingly clear that this FAQ (Frequently Asked Questions)
list is of use to anyone who wants to find their way around the Net.

It grew out of a cry for help that I sent out, in desperation. As a
professional writer, I wanted information of a variety of types. One
day I might want specific dates, another day just background
information. I wanted to know if I could use the Internet to find
these different types of information quickly and reliably. And I
wanted to know which of the many different bits of the Internet would
be good for which different type of search.

However, the vast majority of books, articles and Usenet postings do
not address the question from the point of view of the user, and tend
to be obsessed with either vague surfing or searching out free
software. The last thing I wanted was yet more software.

I was pleased to receive a number of responses to that original cry
for assistance - useful and supportive answers, which gradually became
the foundation of this FAQ.

The FAQ tries to look at the Net from the point of view of the user.
So it is divided into the kinds of questions that Net searchers might
have. It also includes "worked examples" where possible, to clarify
the methods that can be used. Finally there is a list of useful URLs
(Urls For A Rainy Day) which includes most of those mentioned in the
main text plus a load more, and is also available on the web.

I haven't tried to explain what all the technical terms mean (eg: URL,
Archie, FTP...). These are very adequately explained in a thousand
postings, books and magazines. The problem is knowing which to use in
which circumstances.

The Internet is constantly changing, and so I welcome any suggestions,
criticisms and additions. However, most Net users are snowed under
with URLs, etc, so please send personal recommendations, or those of
someone you know, and say why or how it is useful. (For example,
state that a particular URL is good for geographical queries, or how
you used Gopher to research background for your romantic novel).

************************************************************************

2. *DISCLAIMER*

Information, advice, URLs, e-mail addresses, etc, are generally
included on the recommendation of satisfied users. They are passed on
herewith without prejudice! I've not checked all of them out, and make
no guarantees that they are accurate, useful, or still appropriate, or
in fact ever were. I take no responsibility for any loss, damage or
waste of time in using them. Sorry. But please do tell me if a URL
turns out to be useless, or non-existent, so that the information can
be kept up-to-date.

************************************************************************

3. WHY USE THE INTERNET AT ALL?

3.1 If you want to use the Net effectively, you need to be prepared
for what it can and can't do.

The Internet is not a substitute for a good library. The Internet can
be very frustrating. The Internet is very variable. The Internet is
not well indexed. And the Internet is not comprehensive. So is it
worth using at all? Well...


3.2 The Internet is an additional source of information, which often
can't be found, or isn't as up-to-date, elsewhere.

"Searching for data on Internet can be frustrating but what you find
often can't be found in a library -- the same is true in reverse. I
didn't stop using the library when I started using the Internet."
(writer Laurence A. Moore)


3.3 The Internet is convenient, and supplies information in usable
form.

"One handy thing about Internet research is that when I'm done, the
results are on my computer. With the library, the best I can do is
photocopy what I find, or bring the books home and type the data in.

"Looking out the window above my computer, I see birds and
autumn-colored trees and calm, quiet, gently-falling rain. As soon
as I send this, I'm going to bring a mug of fresh coffee back from the
kitchen and take off on Internet. Can't do that at my local
library!"
(Laurence A. Moore)


3.4 However, the Internet has to be worked at. The "superhighway" is
still substantially under construction. As one writer put it: "the
Internet is an enormous library in which someone has turned out the
lights and tipped the index cards all over the floor." (Or, variously,
"Like trying to work off the librarian's notes after discarding the
card catalogue," Allen Schaaf).


3.5 Be realistic and focused about what you want to find. Do you
want a precise fact, or more general background material? How will
you know when you've found enough information - or when to stop
trying? Faced with the enormous size of the Net, it's tempting to
believe that the ideal link is just around the next corner, but some
types of information simply aren't there, while other information may
exist on the Net, but be extremely difficult to locate. Sometimes, to
be honest, there are easier ways: a phone call, the local bookshop, a
friend of a friend.

Nevertheless, the more you learn about the Internet, the more you
become aware of what it can and can't do. The most difficult way to
approach the Internet is when you already have a large and urgent
piece of research to conduct. Better to check out small areas of it
without stress, for a few minutes at a time, on a regular basis. Give
yourself a chance to play about with the Net when the pressure is off,
so that when the pressure is on you can find what you need quickly and
efficiently.

***********************************************************************

4. HOW CAN I FIND...?

What's the best and most efficient way to look for what I need? (Here
we look at some ways of finding the different kinds of information
that's on the Net.)


4.1 How can I find Specific Files, Texts, Media (images, sounds, etc)
or People?
-------------------------------------------------------


4.1.1 How can I find a specific file by name?

The more precise you can be with your search, the better. So if you
have a precise filename, you've got the best chance of finding what
you want.

Many search engines and meta-search engines now have facilities for
searching for software files, etc. Try Google, for example, or many of
the others listed in URLs For A Rainy Day (9.3.4).

There are many books, articles, etc, on the Internet which show how to
search for specific filenames, using Archie, etc, so this is not dealt
with further in the FAQ. However, researchers rarely have a precise,
or even imprecise, filename. So....


4.1.2 How can I find a specific text?

There are an increasing number of web and FTP sites which hold public
domain copies of a wide range of classic texts, song lyrics, etc.
Some links are given in part two of this FAQ - URLs For A Rainy Day
(Section 9.7).

You can also link to some of these via the sites listed there.

There are history archives on the Internet and a number of libraries
on the Net. For example, David Brager suggests the Library of
Congress' American Memory section - "Large collections of primary
source and archival material relating to American culture and
history."


In addition, increasing numbers of search engines will allow you to
search across a number of search engines for specific items such as
lyrics. One such is OnlineSpy. See Section 4.2.2 for discussion of
other such "metasearch" engines and 9.3.4 for a list of metasearch
engines to use.

4.1.3 How can I find a specific image, movie or sound?

Many metasearch engines, such as OnlineSpy (see above) will allow you
to search for images specifically - or even sounds or movie clips.
You may however need to be very precise with the terms you search with
(see 4.2.1 below for how to use search engines with precision).

One particularly useful site is Image Surfer
recently developed by Yahoo. Image Surfer is a search engine which
you can search by category or using search terms, but instead of
giving its answers in text form it produces a series of small
thumbnail images. Much the most useful image searcher I've yet seen,
Image Surfer's capacity is still small, but Yahoo promise it will grow
in size. Well worth checking out.

ImageFinder gives you a
number of different databases to search for a variety of types of
image - eg: the Smithsonian Photographic Collection or Columbia
University Image and Video Catalog.

Useful for both pictures and sound is the search engine HotBot, which
provides tick boxes to allow your search
to include still images, video or audio sound clips, or even shockwave
animations. Said to be one of the best MP3 search engines at the
moment.

4.1.4 How can I find specific people?

There are many resources on the Net that can help you locate and even
make contact with specific people - famous or not, individuals or
companies. Whether they'll be of any use to you will depend on a
number of factors, not least geographical.

As with so much on the Internet, the vast majority of resources are
devoted to the USA. So there's little difficulty in finding
directories and databases with look-up or even reverse look-up
facilities covering just about every member of the US population,
alive or dead.

(Particularly intriguing, in passing, is Ancestry.com
which among its useful resources for
genealogical research allows you to find the social security number
and other details of any dead American.... and then offers a facility
to write a letter! Do they know of some postal service that we
don't?)

More wide-ranging are the directories of email addresses. However
these are far from all-inclusive, even assuming your target has an
email address. Some Internet Service Providers - such as CompuServe
and AOL - used to provide a look-up service which included all
subscribers (and probably still do), but only for other subscribers,
as I understand.

For the rest, directories such as BigFoot
rely on finding email addresses of those who have web-pages or post
regularly to newsgroups. By no means does this include everybody.
Expect to have to try a number of sites before you find a lead.

In Urls For A Rainy Day - Section 9 - there are numerous search
facilities. 9.3.4, 9.3.5 and 9.11.2 give a number of meta-search
engines, people searchers and reference sites which offer specific
people-finding databases. Particularly useful are those such as
All-In-One or Langenberg, which have links to many different
"people" sites on one page.

There are also databases devoted to certain types, eg: politicians
(9.13.2).

Organisations are generally easier to find through a search engine.
But even then it is not always easy - especially if the organisation
doesn't have a web page of its own. However, David Brager writes to
inform us that if you know a domain name you can use it to find all
kinds of details, from contact e-mail and snail-mail addresses to
phone numbers.

Whether looking for people or organisations, in difficult cases you
may need to try the more refined methods for finding information by
using Search Engines, or posting questions on Newsgroups or Mailing
Lists, as described in the next section.


4.2 How can I find Specific Information?
----------------------------------------------

(eg: dates and places. Or questions like: "what is a...?" "who
is...?")

4.2.1 SEARCH ENGINES are popular for this. You type in a key word
or phrase (such as Spain, or Spanish Civil War) and wait to see what
they provide.

The popularity of search engines on the Net can be changeable. When I
started this FAQ there was no clear winner. Then Alta Vista appeared,
and for some time beat all the others hands down. For at least two
years now, Google has been at the top.

Google has many strong points, including simplicity, a lack of adverts
and the ability to check its own "cache" of pages if the page you're
looking for has temporarily disappeared. But no search engine is
perfect and different people have their different favourites. You can
find many other good search engines, each with its own particular
strengths, in our list of links - Urls for a Rainy Day.

The trick with using a search engine is to know what each is good for
and to look carefully at the hints and tips that they offer. For
example, some engines will only search for a precise phrase if you put
it in quotes - such as: "Spanish Civil War."

Planning is necessary for any search. Do some advance work with a
thesaurus and list a fair number of relevant search terms. Remember
that search engines aren't like "Find" facilities on word processors.
So you can afford a scattergun approach, trying a number of related
words at the same time in case one of them hits home. For example: in
starting a search for items on dealing with tiredness you might type
the following related terms into the search box: fatigue overwork
tired exhausted exhaustion sleep.
• FEATURE •
Web Search Engines FAQS:
Questions, Answers, and Issues
by Gary Price • Gary Price Research and Internet Consulting
Only a few years ago, the phrase "Web search" did not exist. Then the term began to move rapidly into the awareness of information professionals, about as fast as a Japanese bullet train. Today, much, though not all, of the work we do revolves in one way or another around the Web.

With so much to keep on top of, precious time becomes even more precious. A couple of years ago I wrote an article trying to figure out a way to make the day 26 or 27 hours long. Unfortunately, that idea never reached the implementation stage, though it remains an idea worth considering. Even within the narrow bounds of 24/7/365, we must all still try to keep up to date about what is happening with Web search engines. The fact that they seem to change on a weekly, if not daily, basis is no excuse.

We as professionals do not use every search engine or Web directory daily; nevertheless, we have to know how each works and what data each does and does not contain. I fully understand that this is easier said than done, but today information access is a topic that everyone is aware of and talking about. Pick up any newspaper. Turn on the television. Every day more and more articles and reports discuss searching the Web. Many of these articles and reports are written for and by non-information professionals. We have to stay ahead of our clients and patrons if we hope to help them. Excite or AllTheWeb may not be your search engines of choice, but I bet they are for someone you know. Our colleagues, co-workers, and friends come to us as the "search experts" and we must do our best to help. Our knowledge and understanding in this area are great ways to make our profession look good and to make our already valuable jobs even more valuable.

With this said, the following reviews the latest goings on in the search world and tries to provide some suggestions and tools to make you more knowledgeable and save you some time.
 

Price's Priceless Tips
The Web search world changes on what sometimes seems like an hourly basis. What follows are a few selected tips and resources for some of the most well-known of engines. This is just the tip of the iceberg. Resources like Search Engine Showdown and Search Engine Watch are essential for learning and keeping up with how these tools work and change over time.
Ten Things to Know About Google

1. The database that Google licenses to Yahoo! [http://google.yahoo.com] is not the same as Google's own: it's smaller than the Google.com database. It does not contain links to cached versions of pages. This database is also used to supply "fall-through" content (material not in Yahoo!'s own database). It is often found listed as "Web page" content.

2. Google utilizes the Open Directory Project database as its Web Directory [http://directory.google.com].

3. You can search stop words by placing a + in front of the word (ex. "+To +Be +Or Not +To +Be").

4. At the present time the Google database is refreshed about once every month.

5. You can limit your search to only .pdf files by using the syntax filetype:pdf.

6. Google is the only major search engine to crawl Adobe Acrobat .pdf files.

7. If you are a frequent Google searcher, save time by using the Google Toolbar [http://toolbar.google.com] and Google Buttons [http://www.google.com/options/buttons.html].

8. A Boolean "OR" is available with Google. For it to function, capitalize the OR.

9. Google only crawls and makes searchable the first 110 k of a page. Long documents may have substantial content invisible to Google.

10. Entering a U.S. street address into the query box will return a link to a map of that address location. Typing in a person or business name, city, and state will also run the query to the Google phone directory. Several other combinations are available that will also query the phone directory service, including typing in the area code and number to run a reverse search [http://www.google.com/help/features.html#wp].
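Several of these operators (tips 3, 5, and 8) combine freely in a single query string. As a minimal sketch, assuming the long-standing http://www.google.com/search?q= URL pattern, such a query can be built and URL-encoded like this; the particular search terms are illustrative only:

```python
from urllib.parse import quote_plus

def google_query_url(query):
    # Encode the raw query; operators such as filetype:pdf, a
    # capitalized OR, or "+to +be" stop-word forcing simply pass
    # through as part of the query text.
    return "http://www.google.com/search?q=" + quote_plus(query)

# Tips 3, 5 and 8 combined: quoted phrases, a capitalized Boolean OR,
# and a filetype: restriction in one query string.
url = google_query_url('"spanish civil war" OR "civil war in spain" filetype:pdf')
print(url)
```

The engine itself interprets the operators; the client's only job is to encode the string so quotes, spaces, and colons survive the trip.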
 

Ten Things to Know About AllTheWeb

1. AllTheWeb licenses its database to Lycos. The identical database is searched and makes up some of the content on a Lycos results page.

2. Unlike Google and AltaVista, this search engine does not have a limit on the amount of content crawled on a Web page.

3. AllTheWeb indexes every word. Words traditionally considered as "stop words" are searchable.

4. AllTheWeb does not permit the use of Boolean operators.

5. If plus and/or minus signs are not used, AllTheWeb implies a plus sign in front of each term or phrase. This results in an implied "anding" of terms.

6. AllTheWeb is now promising a complete refresh of its database every 9-12 days.

7. AllTheWeb permits syntax to be used direct from the "basic" search page to limit a query. See http://www.alltheweb.com/help/basic.html#special.

8. A query to the AllTheWeb text database simultaneously runs the search in the AllTheWeb Image, Video, MP3, and FTP databases. If it finds anything, these results are linked on the right side of the results page.

9. AllTheWeb offers a search engine dedicated to Mobile Web content [http://mobile.alltheweb.com].

10. Fast Search and Transfer (FAST), the company behind AllTheWeb, has deployed its software to power the Scirus science search engine from Elsevier.
 

Ten Things to Know About AltaVista

1. AltaVista is the only major search engine that allows a searcher to use the proximity operator, NEAR (in simple search) or near (in advanced search). Using this operator finds terms within 10 words of each other in either direction.

2. AltaVista indexes only the first 100 k of text on a page.

3. An asterisk (*) can be used in a phrase to represent an entire word. (Ex. "One small step for man, one giant * leap for mankind")

4. AltaVista News [http://news.altavista.com] is "powered" by Moreover. This continuous feed of material can be searched using AltaVista syntax.

5. The use of the "sort by" box on the AltaVista Advanced interface allows you to give certain words or phrases a higher relevancy weighting.

6. Caveat: If you use Advanced Search, make sure to place some term or terms in the Sort-By box; otherwise, results return in completely random order.

7. AltaVista's directory comes from Looksmart.

8. AltaVista's advanced search does not allow for the use of + and - signs.

9. If you search AltaVista in the "simple" mode entering multiple terms without syntax, it will result in an "implied" OR. In the advanced mode, multiple terms are considered a phrase.

10. AltaVista software powers the Health Resources and Services (U.S. government) search engine. This means that all AltaVista syntax can be utilized there. This site also illustrates AltaVista capability of indexing full-text .pdf documents on the site-specific and intranet level [http://search.hrsa.gov].
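Tip 9 describes two very different multi-term behaviours: an implied OR in simple mode and phrase matching in advanced mode. A minimal sketch of that difference over plain lowercased text (this is an illustration of the semantics, not AltaVista's actual matching or ranking):

```python
def simple_mode_match(doc, terms):
    # Implied OR: the document matches if ANY of the terms appears.
    words = doc.lower().split()
    return any(term.lower() in words for term in terms)

def advanced_mode_match(doc, terms):
    # Phrase semantics: the terms must appear as one exact sequence.
    return " ".join(terms).lower() in doc.lower()

doc = "The Spanish Civil War began in 1936"
print(simple_mode_match(doc, ["civil", "aviation"]))    # True ("civil" appears)
print(advanced_mode_match(doc, ["civil", "aviation"]))  # False (no such phrase)
```

The practical upshot for searchers is the one the tip implies: the same terms that cast a wide net in simple mode can silently narrow to a single exact phrase in advanced mode.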
 

Ten Things to Know About MSN Search

1. MSN (Microsoft Search Network) Search is "powered" by an Inktomi database. Remember that Inktomi licenses its database to many search sites. Each site gets a different "flavor" of the total database.

2. The MSN Advanced Search interface offers numerous limiting options via fill-in boxes and pull-down menus [http://search.msn.com/advanced.asp].

3. The Advanced Search interface permits limiting to pages at a certain depth in the site. For example, limiting to Depth 3 will restrict the search to pages no more than three directories deep in a site [e.g., http://www.testsearch.com/Directory1/Directory2/Directory3/].

4. MSN Search allows use of the asterisk (*) as a truncation symbol.

5. According to the most current Search Engine Showdown rankings, MSN Search has the largest database of any Inktomi partner.

6. The directory portion of MSN search is powered by the Looksmart database.

7. On the Advanced Search interface, checking the "Acrobat" box will retrieve pages with links to pages that contain .pdf files. It does not search content "inside" these files.

8. Greg Notess points out that the same syntax available to limit Hotbot will also work with MSN Search [http://hotbot.lycos.com/help/tips/search_features.asp].

9. Danny Sullivan notes that MSN also employs human editors to "hand-pick" key sites in the Web Directory and Featured Link sections of the site. Although most of the time the "Featured Links" represent major MSN advertisers, editors can add other content.

10. Selecting and searching under the MSN "News Search" tab returns results predominantly from MSNBC.
 

Ten Things to Know About Northern Light

1. Make sure to study the Northern Light "Power" search page. It provides many limiting options without requiring knowledge of any syntax [http://nlresearch.northernlight.com/power_research.html].

2. Instead of entering http://www.northernlight.com, use http://www.nlresearch.com to go straight to the Northern Light Research site. This site, aimed at the enterprise market (but available to any searcher), contains access to several databases not available from the main URL. Most of these resources are fee-based. They include EIU Search and market research content from FIND/SVP and MarkIntel.

3. Northern Light provides FREE full-text access to a database of continuously updating news content from 56 newswires. Material stays in this database, available for free access, for 2 weeks. Then the content moves to the Northern Light Special Collection database.

4. Northern Light's Special Editions are subject-specific portals that combine material from the "open Web" and NL's proprietary databases. Topics of Special Editions include XML, managed care, and electronic commerce.

5. The Northern Light Special Collection currently contains content (fee-based, pay-per-document) from over 7,100 sources. A catalog of these publications is available at http://nlresearch.northernlight.com/docs/specoll_help_catlook.html.

6. Northern Light allows the use of Boolean operators and + and - signs.

7. Multiple truncation symbols can be used in a query. Northern Light has two truncation symbols: the asterisk (*) for multiple letters and the percent symbol (%) for a single or absent letter, e.g., medieval/mediaeval.

8. In addition to the limiting capabilities of the "Power" search page, NL has several terms available for field searching. These include text: and pub:. (This last prefix allows searching in a specific Special Collection publication title.) You can find a complete list at http://nlresearch.northernlight.com/docs/search_help_quickref.html.

9. Northern Light's free "Alerts" feature is one resource you must know about. This feature allows you to set up search strategies in ANY/ALL of the NL databases and have those strategies searched up to three times daily. If any new material hits on the strategy, results will be delivered to you via e-mail. I use this tool to bring me a customized feed of news via the NL News Search database. Remember, the full-text content is free to access for 2 weeks.

10. Northern Light's "Geo Search" provides an opportunity to search the Web with keywords and U.S. and Canadian address information. Results also get the benefit of NL's organization into its "custom folders."
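Tip 7's two truncation symbols can be read as simple wildcard patterns. One way to make the semantics concrete is to translate them into regular expressions; this sketch assumes '*' means any run of letters (possibly none) and '%' exactly zero or one letter, which fits the medieval/mediaeval example, though Northern Light's real matcher may differ:

```python
import re

def nl_truncation_to_regex(pattern):
    # Assumed semantics: '*' matches any run of letters (possibly
    # empty), '%' matches exactly zero or one letter; every other
    # character is matched literally.
    parts = []
    for ch in pattern:
        if ch == "*":
            parts.append("[a-z]*")
        elif ch == "%":
            parts.append("[a-z]?")
        else:
            parts.append(re.escape(ch.lower()))
    return "^" + "".join(parts) + "$"

medi = re.compile(nl_truncation_to_regex("medi%eval"))
print(bool(medi.match("medieval")))   # True: '%' matches no letter
print(bool(medi.match("mediaeval")))  # True: '%' matches the extra 'a'
```

The same translation makes "comput*" cover compute, computer, and computing, which is the usual reason to reach for the asterisk.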
 

Ontologies, Controlled Vocabularies, XML, and Web Search Engines
I am very excited to see that controlled vocabularies and the building of ontologies have come into vogue.

Some of this "hipness" has been caused by the promise and excitement surrounding XML (eXtensible Markup Language). However, I am not sure if the coming of XML will help the general-purpose search engine, though it should clearly help specialized, focused, and Invisible Web engines become much more useful resources.

Why the hesitation?

The general-purpose engines, as we know and love them today, hypothetically index every page: massive amounts of data coming from just about anyone who wants to produce Web content and put it on a publicly accessible server.

The problem for implementation of a controlled vocabulary with this material is really one of creation. Who would create it? Who would maintain it? Who would do the cataloging? Would entire sites be cataloged at the page level or only a specific page (the top page)? Who would manage such a project? Where would the money come from?

Controlled vocabularies and XML show a great deal of promise for certain types of search engines because these types of engines can much more easily create and enforce a set of agreed upon standards. Many issues would need resolution before we could apply controlled vocabularies to make searching the massive amount of material on the open Web more effective.
 

The Future: New Tools on the Way
When you learn about new search tools and share that knowledge with others, you not only improve your own searching, but you help to make a better future for all searchers.

Here are some new search products that show a lot of promise, a few more potential "quick hits." With the vulnerability of the Internet industry of late, let's hope these products survive. Even if the actual companies do not survive, the technology is still worth knowing about. Have fun!!!

Three New General Purpose Search Engines
Competition for Google?


A New Image Search Tool


Real-Time Search
Patented technology to search resources updated in real-time.


Natural Language Search Technology
This product is getting a lot of attention.


Now let's see if you've learned your lessons. How long will it take before you've tried all these new promising sites out? The test clock starts...now!
 

This Article Contains Inaccuracies: 
Essential Reading
In the time it takes this article to move from the author to the editor to the publisher to the printer to you, undoubtedly something mentioned in this article will have changed. Some feature will have appeared, another vanished. The working searcher must simply make a policy of staying on top of those changes.

Those of you who need to keep current on the Web search world should monitor the following sites as often as possible. All these sites are free, and most offer free e-mail newsletters and updates.

SearchDay
http://www.searchenginewatch.com/searchday/
Written by Chris Sherman. Daily updates.

Search Engine Watch
http://www.searchenginewatch.com
A resource rich site that offers a free monthly newsletter.

Search Engine Showdown
http://www.searchengineshowdown.com
Librarian Greg Notess's site. Updated on a regular basis. Greg also manages the Search-L list.

ResearchBuzz
http://www.researchbuzz.com
Written and compiled by Tara Calishain. Daily updates.

TVC (The Virtual Chase) Alert
http://www.thevirtualchase.com
Written and compiled by Genie Tyburski. Daily updates.

The Virtual Acquisition Shelf and News Desk
http://resourceshelf.blogspot.com
Compiled by Gary Price. Daily updates.

Free Pint
http://www.freepint.com
Fortnightly newsletter edited by Will Hann. Also offers Web discussion boards.

News Breaks from Info Today
https://www.infotoday.com/newsbreaks/
General information industry coverage of breaking news that often features news of the Web search world.
 

Scope Notes
Before we begin, we need to get a definition straight — a definition that I think many of us have thought about. What does "Web search" mean to the information professional? In the early days of the Web, it meant exactly how it sounds — material found on the open Web.

However, as we move forward, the term "Web search" has taken on new meanings. Does a Web search involve tools like Google or AltaVista to reach "open access" material? Does it mean using the Web as a vehicle to log on to proprietary databases such as Factiva or Dialog? Not too long ago, logging into proprietary services required individual connections to each one. Today, any Web browser with an Internet connection can reach those services. Perhaps it means both. This lack of a common understanding can cause confusion, and resolving the issue is outside the scope of this article.

This article will primarily focus on the "traditional" Web search, i.e., search engines that assist in locating open Web content. The approach I have taken is to try to answer the questions I seem to get, in one form or another, at every conference, every workshop, and in every day's stack of e-mail messages.
 

The Never-Ending Amount to Learn, No Sign of Slowing Down
The single most difficult issue for the Web searcher to face is the sheer volume and speed of change on both the Web and the search engines that try to cope with it. The sense of doom most searchers feel in struggling to keep pace occurs not because of any lack of intelligence, nor any lack of interest in the subject — far from it. Most often the cause is the reality of having only 24 hours in a day and the fact that life exists away from the computer.

I monitor what's going on in the Web search world on a daily basis and it's almost routine for something new to arrive or for something established to change each day. For example, at the time of writing this article, AllTheWeb had just undergone major changes, Google released an image search tool, and WISEnut, a new general search tool, had come on the scene. When you couple the dynamic nature of Web searching (both individual pages and entire resources coming and going) and the need to stay up-to-date with traditional electronic tools (which undergo plenty of changes as well), print resources, and other issues of the day (can you say "copyright" or spell "Tasini"!), there is so much to do and so little time to do it.

A lack of knowledge and understanding about how a particular search tool works, e.g., a new way to narrow your search, or ignorance of a more useful tool, e.g., a new search engine going online, can waste time and produce poor results.

What Should the Searcher Do?

I realize this is easier said than done, but Web searchers MUST devote at least 1-2 hours a week to staying current. This informal "continuing education" is crucial. Often, the knowledge you gain from these sessions will pay off handsomely with time saved and better query results in the future. The best way to learn how a search engine works is by using it. Conducting preemptive research on a favorite topic makes it easy to spot differences, both in content and in the way results are presented, and at the same time to gather new resources for your own bookmarks or intranet sites. For a list of suggested sources to keep you current, see the "Essential Reading" sidebar.
 

Is an "Open Web" Search Engine Always the Place to Begin? What Type of Information Can I Count on Finding There?
Lately, I have spent a great deal of time thinking about this issue. As someone who often gives presentations about Web searching, I have tried to provide session attendees with lists of what you can and can't find "on the Web" using a general-purpose Web search tool. Even in the most general sense, my attempts must fail. A few minutes after beginning, I inevitably realize that one can't boil down a dynamic universe of data like the Web into bullet points. Knowing, or better, understanding, where to start in this world of information resources is perhaps the most important information to know and share. There is no simple way of doing this. It takes time and commitment. I start learning about new resources by asking the most basic questions: What is this database or search engine? What kinds of questions would it help me answer?

Often the open Web may not be the place to begin. While it's nice to get quality material free, how long did it take to get it? Would standing up and walking to a bookshelf produce a useful answer in a much shorter period of time? Would a commercial full-text search service scan the decade-long archives of 50 or 100 newspapers in a matter of minutes? At issue are the time and money it takes to reach your answer.

Even if you choose the open Web as your target, would a specialized or targeted search engine more easily find your answer, rather than one of the all-encompassing engines? Regardless, understanding how each search engine works and the many ways an engine allows you to limit and control searches will make general-purpose engines more productive and waste less of your time.

We need to do this "learning" much the same way we have always "learned" traditional databases and print resources. Think about how much focus information vendors like Factiva and Dialog place on training. Unfortunately, Web engine companies do not offer this kind of training, but the learning process remains crucial. For me, the best part about being an information professional is the knowledge of where to find an answer. This is knowledge that non-professionals desire and makes our already important jobs even more valuable, especially with so many new databases and new online resources becoming available.

What Should the Searcher Do?

Consider the open Web more of a directory to answers and less of an all-knowing answer machine. Sometimes, this directory WILL become an authoritative reference book and provide you with a timely and authoritative answer. Other times it will assist by providing you with background knowledge that can make using a fee-based service or a print collection more productive. Don't forget — shifting from one format to another can be a two-way street. What you learn from a print or commercial online source can produce an effective search strategy for the open Web. A Web search engine may also provide you with specific names of people to contact. Remember, the telephone and e-mail will always be very important reference resources.
 

The Quality of Information: The Biggest Challenge to Web Searching
For this Web searcher, information quality constitutes the greatest challenge faced as both a searcher and a teacher. We live in an age when anyone can become a publisher. All they need is a Web connection, server space, and something to say and/or share. Once the content goes onto a server and once a crawler finds it, the Web search engines will make it available to everyone. Within minutes or days, anyone with Web access can find that information. Amazing! And frightening!

Once they have found it, the major challenge to searchers is evaluating content. They must judge its quality, and often very quickly, using the criteria that information professionals have always used to evaluate information. How does one do this? Well, this is the topic of other articles, books, and dissertations. The most important point is to take a step back, if only for a second, to ask yourself where this information is coming from and why it is being placed online. Since anyone can become a publisher with the Web as a publishing medium, the reputation and background of the site creator, their qualifications, etc., are crucial. I would strongly recommend taking a look at the resources our colleague Genie Tyburski makes available on her site for judging quality [http://www.virtualchase.com/quality/index.html].

Evaluating information quality, something that our profession has always done, offers another in-road for sharing our skills with the public. Many who search the Web take whatever they find to be accurate, current, and worthwhile. As information professionals, we must protect them, often from themselves.

One more thing. In my opinion, the challenges that information quality pose for the Web searcher prove how important it is for our profession to include Web resources as part of our collection development. We must try to make the Web a more effective tool for researchers. The Web is a living organism and, unlike an annual reference book, can change at a moment's notice. In an already busy workday, finding time to search out Web resources in an organized manner can be difficult. But all of us need to have an idea of what is available and where to turn before we actually need the resource to answer a query. Just knowing a top-level site exists that may contain the answer will not suffice. We learn our print collections, let's learn our Web collections and bookmarks.

Easier said than done? Of course. Still, it remains a goal we should strive to attain.
 

The Domination of Google
Everyone, including me, loves Google. How could you not like it? In most cases, it delivers highly relevant results (though this does not always mean authoritative) in a short amount of time. When you add in features like Google Cache (a powerful way to find pages that might have just gone AWOL), you have a search engine that works and works well.

Google is simple to use at a basic search level, but still returns good results. This is why non-professional searchers love it so much. The clean, single-box home page is simple for non-sophisticated searchers to understand. It doesn't even allow you to directly use all three Boolean operators to return results, yet it works! Wow! More advanced searchers will be interested to know that Google uses AND as a default between search terms, permits the use of OR (it must be in all caps), and can exclude a word or phrase if you prefix it with a minus (-) sign.
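To make that syntax concrete, here are a few illustrative queries (the topics are hypothetical; the behavior is as described above):

```
screenwriting contests           both words required (AND is the default)
screenwriting OR scriptwriting   either term may appear (OR in capitals)
screenwriting -software          pages containing "software" are excluded
```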

What I like most about Google is its quest to improve on what it already has. Google always seems to be introducing something new and innovative. In February 2001, it started tracking portable document format (.pdf) material. The general public may not put a high demand on some of this content, but PDF documents offer information professionals masses of authoritative content from respected sources. At the time of writing, Google was still the only general search engine to make PDF files searchable on a large scale.

What Should the Searcher Do?

The advanced searcher must get to know and make use of Google at a more than "put the words in the box" level. It's very easy. Begin by looking at the Google Advanced Search page [http://www.google.com/advanced_search.html], and at the same time learn the syntax that will allow you to limit your searches directly without having to use this page. To learn more about Google, especially how it compares to other search engines, go to Greg Notess's Search Engine Showdown site [http://www.searchengineshowdown.com].
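As a small illustration of why the syntax is worth learning: the whole query, operators and all, simply travels URL-encoded in the q parameter of the search URL, so the same limiting syntax works from the simple box, the advanced page, or any script. A minimal sketch in Python (the helper function is hypothetical; only the google.com/search?q= form is the conventional one):

```python
from urllib.parse import quote_plus

def google_query_url(query: str) -> str:
    """Build a Google search URL by URL-encoding the query string.

    Operators such as OR and a leading minus sign travel inside
    the q parameter just like ordinary search terms.
    """
    return "http://www.google.com/search?q=" + quote_plus(query)

# Terms separated by spaces are ANDed by default; OR must be
# upper-case, and -software excludes a term.
print(google_query_url("screenwriting contests -software"))
```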

Here's hoping that Google continues to improve and add new useful features. Here's also hoping that Google continues to properly separate advertising content from result sets. Yet with all of Google's wonderful abilities, good searchers know that they must never make any single Web search engine the only tool used. No single engine makes "everything" searchable.
 

Understanding the Limitations of General Web Search Tools
No single Web search tool is the end-all/be-all. In fact, most have limitations that need careful consideration if you plan to use them regularly or teach others to use them. What do I mean by limitations? Here are just a few of many possible examples:

  • Search spiders or crawlers (the software that brings back material to a database so you can search it) do not crawl the Web in real time. A page made available on the Web on Thursday could wait weeks before a crawler reaches it. The major search services are improving turnaround on recrawling and adding pages, but in general, expect to wait many days before a keyword search will return a recent page.
  • If a site or page is not linked to or submitted by someone (Webmaster, page author, etc.), it will not be accessible from a search engine. Engines primarily use these two methods of finding out about new sites and pages.
  • Simply because one, 1,000, or even more pages from a site are available does not mean that the engine makes every page of an entire site searchable.

What Should the Searcher Do?

Understand from the outset that these limitations exist and can affect your search results. Rely on more than one search engine. Make use of specialty search tools that often go "deeper" into a site to collect more content. Take advantage of "Invisible Web" resources. Use Web directories like the Librarians' Index to the Internet to "mine" specific sites. When you find something of value, bookmark it.
 

Using Invisible/Hidden Web Resources
Over the last couple of years, the phrase "the Invisible Web" has come into use; others call it the hidden or deep Web. However, for the most part all the terms are synonymous. Searchers need to know about the material in this section of the open Web. In many cases the material comes from well-known, authoritative sources, is available at low or no cost, but is not accessible using a Web search engine.

Resources you interact with, sites where you fill in a set of variables and then have a "custom" page returned to you, are examples of Invisible Web pages. So is a site that contains data that you can use for free, but only after you register. Why don't the search engines access this material? The search spider software seeking out material to bring back to the database finds nothing to retrieve in these examples. In the case of the custom page, the material is not accessible until the user calls for it and the system creates the page on the fly. In the other example, search spiders from general-purpose Web search engines do not fill out registration forms. So once the spider hits a page that requires registration, the spider stops and moves on. None of the material behind that registration interface is searchable from general engines.

One other factor can block search engine access: the "no-robot" tag. Webmasters can specify that they don't want to be spidered, and most good, responsible crawlers will respect that request, whether for all or any portion of the content on a Web site. Sometimes, Webmasters, perhaps concerned about possible excessive usage, may block the spiders without fully considering how this decision can eliminate a substantial audience for the material they have taken the time, trouble, and expense of loading.
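The "no-robot" request mentioned above is usually expressed through the robots exclusion convention: a robots.txt file at the root of the server, or a robots meta tag on an individual page. A minimal sketch (the directory names here are hypothetical):

```
# robots.txt, placed at the root of the Web server
User-agent: *          # addressed to all crawlers
Disallow: /members/    # please do not crawl this directory
Disallow: /cgi-bin/
```

An individual HTML page can make the same request with <meta name="robots" content="noindex,nofollow"> in its head section. Well-behaved spiders honor these requests; nothing enforces them.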

Prime examples of Invisible Web databases include American FactFinder from the U.S. Census, most Web-accessible library catalogs, and many of the databases available via GPO Access.

What Should the Searcher Do?

Know what is available before you need it. Of course, this takes time and practice. We do much the same when becoming aware of the databases from LexisNexis or Dialog. What makes this an even larger challenge is that there are thousands of these databases available and, unlike Dialog, no common search syntax. Use compilations of Invisible Web databases such as the one Chris Sherman and I have created to support our book [http://www.invisible-web.net]. Conduct Invisible Web collection development. Develop and learn your own collection. Using the "open Web" to attempt to find something with the boss breathing down your neck is both difficult and inefficient.

One Further Thought
A great deal of research and time is devoted to making the information inside these Invisible Web databases more easily accessible from general-purpose Web search tools and other resources. The challenge is that many of these Invisible Web databases offer "custom" interfaces and database tools specifically to enable interaction with the data. Although the ability to crawl all of this data is coming and, in some cases, available now, without the proper limiting tools to harness this information, we could face even worse problems. We might make already massive uncontrolled databases the size of Google's, Excite's, or AltaVista's even larger, without the proper mechanisms to get the data out in a precise manner. In librarian speak, this translates into increasing recall, lowering precision.
 

Specialized, Focused, and Site-Specific Search Tools: Important and Necessary
I often get a bit unsettled when people and companies refer to the Invisible Web. What many understand as the Invisible Web encompasses content actually visible to general-purpose engines like Google and AltaVista. What many label as Invisible, deep, or hidden Web content actually refers to basic HTML material, easy for the general search engines to index and make accessible. Many of the databases that are often reported as Invisible Web are actually just beyond the reach of general Web search engine policies and procedures. More aggressive and focused or targeted Web crawlers may go where the general search engines have balked. For example, specialized search engines were the first to start handling .pdf formatted files.

To penetrate these resources, users should learn to turn to specialized or focused search engines, important and effective tools at getting to the best answer possible on the open Web. Well-known specialized Web search engines include Psychcrawler, PoliticalInformation.Com, and Inomics.Com, each of which focuses on a specific subject (psychology, political science, and economics, respectively). Site-specific engines refer to the search engines that many sites make available to cover their own material.

The general search tools can, and often do, crawl material that you can also find using a specialized, focused, and site-specific search engine. However, in some cases, the general search engines may not cover this material as well as the specialized ones. For example, the engines may not crawl the key sites in a timely manner or at a deep enough level. Bottom line: Coverage of this material by general search engines like Excite or AllTheWeb may be spottier than the specialized search tools.

Here are just a few of the reasons why this problem occurs:

  • Time Lag. Unless paid for, spiders visit pages unannounced. Material changed or added since the spider last crawled the content (a gap that can run a month, a quarter, or longer) remains, for all practical purposes, invisible. News material is a good illustration. A normal page from the CNN site is technically crawlable by any general-purpose engine, but until a spider next visits, a new or changed page will not be searchable there.
  • Depth of Crawl. Simply because a search engine makes one, 10, or 100,000 pages of a site accessible does not mean that it has crawled the entire site. Some engines only take a certain amount of material and then move on.
  • Each Search Engine Database Is Unique. As the work of Greg Notess makes clear, each search engine database differs. What Google knows about, Excite may not have in its database. What AltaVista can find, AllTheWeb/Fast may not make accessible.
  • Dead-End Pages. If a basic HTML page sits on your server and is not linked from any other page that a search tool already knows about and you don't submit it, then it will, most likely, not be discovered and crawled. A site-specific engine can crawl every page sitting on an entire server and make the page searchable.

Why would you want to use one of these search engines? Several reasons. Smaller, more targeted databases make for greater precision though lower recall. Think about the world with only one massive Dialog database. Just as you select the correct database for the specific task, it works the same with specialized search engines.

Additionally, these resources often offer human interaction, with a knowledgeable editor telling the crawler where to go, how often to return, and how deep to crawl. I think this job of human database editor will become more and more important in the future. What a great new career for information professionals!

Finally, some of these specialized engines, the BBC News engine for example [http://newssearch.bbc.co.uk/ksenglish/query.htm], provide extra functionality, such as constant, even daily, updating and additional options for limiting search strategies.

What Should the Searcher Do?

Check out and use the good sources identifying and collecting specialized and focused databases. I like Profusion [http://www.profusion.com], labeled here as "Invisible Web" and the always reliable and always wonderful Librarians' Index to the Internet [http://www.lii.org], which covers a large amount of specialized and Invisible Web databases. Once you have found good tools in your areas of interest, use them and learn their features in depth.
 

Using Search Tools on Specific Sites and Possible Intranet Solutions
This is a simple idea that I think is often overlooked by searchers. We all know that information professionals should take full advantage of the special searching features, such as limiting, and other resources Web search tools offer. However, the fact that many general-purpose engines (AltaVista, Google, Ultraseek/Inktomi) are also licensed and available to search specific sites often goes unnoticed and unused. It shouldn't.

The power searcher should identify when a specific "site-search" tool is actually the same software as that of a general-purpose engine. Then we should make use of the syntax, limiting functions, etc., still available as if the engine was being used to search the entire Web.

Here are a few examples to illustrate my point. To read the full article, click on this link:

Web Search Engines FAQS: Questions, Answers, and Issues





Charles E. Wharry(Darkbird18):


The Internet Search FAQ is very important to have and to understand, because without this knowledge of how to search and research the Net you will get lost and be taken advantage of. The Internet Search FAQ is one of the first documents I downloaded, back in 1995, and it has helped me do my research online and has also led me to many interesting websites about online research.


