Search engines are the single most used method for finding information on the Internet. Comscore.com's October 2008 research study reports that 'more than 750 million people age 15 and older – or 95 percent of the worldwide Internet audience – conducted 61 billion searches worldwide in August, an average of more than 80 searches per searcher.'

Ecommerce entrepreneurs have a huge stake in winning the search engine sweepstakes. The higher your listing appears on search engine results pages, the more viewers will read it and visit your site.

Although the specific algorithms used by the major search engines (Google, Yahoo, MSN, etc.) are proprietary (though subject to intense scrutiny by Web watchers), the underlying principles of search engines can be studied. These principles are 'spidering'; assessment and storage; and retrieval and ranking. In this first of several articles on search engine optimization, we will look at the spidering and assessment/storage processes.

Web spiders or crawlers

A web spider is an automated program that crawls the Web, gathering URLs and sending them back to a repository, where they are analysed and sorted. Web spiders make searching the Web much simpler and more efficient because much of the work of gathering and sorting has been done days, weeks or even months before you search for that content.

A search engine uses many Web spiders to crawl the pages on the Internet, return their contents and index those contents according to the usefulness of the information.
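To make the process concrete, here is a minimal sketch (in Python, using only the standard library) of what a single crawl step might look like: fetch one page and collect the URLs it links to. The class and function names are illustrative only; a production spider is distributed, respects robots.txt and feeds a far larger repository.

```python
# A minimal, illustrative sketch of a single crawl step: fetch one page and
# collect the URLs it links to. Real spiders are distributed, respect
# robots.txt, and feed a large repository; this only shows the basic loop.
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class LinkCollector(HTMLParser):
    """Collects href values from anchor tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl_step(url):
    """Fetch one URL and return the absolute URLs found on that page."""
    with urlopen(url) as response:
        html = response.read().decode("utf-8", errors="ignore")
    parser = LinkCollector()
    parser.feed(html)
    return [urljoin(url, link) for link in parser.links]
```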

Spiders operate according to a set of rules, for example (the politeness and re-visit policies are sketched in code after the list):

  • A selection policy that states which pages to download;

  • A re-visit policy that states when to check for changes to the pages;

  • A politeness policy that states how to avoid overloading websites by accessing URLs too frequently;

  • A parallelisation policy that states how to coordinate distributed web crawlers, that is, how to avoid too many crawlers accessing the same site at the same time.
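The politeness and re-visit policies can be illustrated with a small sketch. The delay and re-visit interval values below are assumptions chosen for the example, not figures used by any real engine.

```python
# An illustrative sketch of the politeness and re-visit policies described
# above: the crawler waits a minimum interval between hits to the same host
# and only re-fetches a page after a set period. Both values are assumed.
import time
from urllib.parse import urlparse

CRAWL_DELAY = 10.0        # seconds between requests to the same host (assumed)
REVISIT_INTERVAL = 86400  # re-check a page roughly once a day (assumed)

last_hit = {}             # host -> time of last request
last_seen = {}            # url  -> time page was last fetched


def polite_to_fetch(url, now=None):
    """Return True only if both the politeness and re-visit rules allow a fetch."""
    now = now or time.time()
    host = urlparse(url).netloc
    if now - last_hit.get(host, 0.0) < CRAWL_DELAY:
        return False                      # politeness: host hit too recently
    if now - last_seen.get(url, 0.0) < REVISIT_INTERVAL:
        return False                      # re-visit: page checked recently enough
    last_hit[host] = now
    last_seen[url] = now
    return True
```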

Once the spider has retrieved URLs and sent them back to its repository, the pages must be assessed for value.

Assessment and storage

During page assessment, a second search engine program scans each page sent by the spider, analysing the content of the page, i.e., studying 'on-page' factors. This program indexes which words are used, how often they are used and whether or not they carry special emphasis (bold, italicised, used in a heading, part of a link). The results of this analysis are stored in the search engine's document index.
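As a rough illustration of the kind of record this program might build, the sketch below counts each word on a page and weights it by where it appears. The tag weights are invented for the example; actual engines keep their weightings secret.

```python
# A rough sketch of an 'on-page' record: how often each word occurs on a page,
# weighted by where it appears. The tag weights are illustrative assumptions.
import re
from collections import Counter

TAG_WEIGHTS = {"title": 5, "h1": 4, "h2": 3, "b": 2, "a": 2, "p": 1}  # assumed


def index_page(segments):
    """segments: list of (tag, text) pairs extracted from one page.
    Returns a Counter mapping each word to a weighted score."""
    scores = Counter()
    for tag, text in segments:
        weight = TAG_WEIGHTS.get(tag, 1)
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            scores[word] += weight
    return scores


# Example: the word 'garden' in an <h1> counts more than in body text.
page = [("h1", "Garden supplies"), ("p", "We sell garden tools and garden gnomes.")]
print(index_page(page).most_common(3))
```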

Some of the typical positive on-page factors include (density and proximity are sketched in code after the list):

  • Keywords located in headings and meta tags;
  • Keywords in URL and domain name;
  • Keyword density (5-20%);
  • Keyword proximity (for 2+ keywords).
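Keyword density and proximity are easy to measure in principle. The sketch below shows one straightforward way to compute both, assuming plain text has already been extracted from the page; the 5-20% density range above is the article's figure, not something the code enforces.

```python
# A sketch of how keyword density and proximity might be measured on a page's
# extracted text. Density is occurrences over total words; proximity is the
# smallest distance in words between two keywords.
import re


def words(text):
    return re.findall(r"[a-z0-9]+", text.lower())


def keyword_density(text, keyword):
    """Share of the page's words that are the keyword, as a percentage."""
    tokens = words(text)
    if not tokens:
        return 0.0
    return 100.0 * tokens.count(keyword.lower()) / len(tokens)


def keyword_proximity(text, kw1, kw2):
    """Smallest number of words separating two keywords (None if either is absent)."""
    tokens = words(text)
    pos1 = [i for i, w in enumerate(tokens) if w == kw1.lower()]
    pos2 = [i for i, w in enumerate(tokens) if w == kw2.lower()]
    if not pos1 or not pos2:
        return None
    return min(abs(a - b) for a in pos1 for b in pos2)
```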

Negative on-page factors include:

  • Mostly graphics, little text;
  • Bad language;
  • Stolen material;
  • Keyword over-density.

The program later analyses 'off-page' factors, i.e., the links a page makes to other pages and the links that other pages make to it.

Positive link factors include (a simple scoring sketch follows the list):

  • Incoming links from high-ranking sites;
  • Number of incoming links;
  • Age of link;
  • Keyword presence in link.
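As an illustration only, the sketch below turns these positive link factors into a naive additive score. The weights and the backlink record format are assumptions made for the example; real engines combine these signals in far subtler ways.

```python
# An illustrative sketch of scoring 'off-page' link factors: number of
# incoming links, the rank of the linking site, and whether the anchor text
# contains a keyword. All weights here are assumptions for the example.
def score_incoming_links(links, keyword):
    """links: list of dicts with 'source_rank' (0-10) and 'anchor_text'."""
    score = 0.0
    for link in links:
        score += 1.0                                  # each incoming link counts
        score += 0.5 * link.get("source_rank", 0)     # high-ranking sources count more
        if keyword.lower() in link.get("anchor_text", "").lower():
            score += 1.0                              # keyword present in the link
    return score


backlinks = [
    {"source_rank": 7, "anchor_text": "garden gnomes for sale"},
    {"source_rank": 2, "anchor_text": "click here"},
]
print(score_incoming_links(backlinks, "garden"))  # 5.5 + 2.0 = 7.5
```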

Negative link strategies include:

  • Link buying;
  • Cloaking – showing one link to the spider and another to users;
  • Links to or from bad sites.

Once these analyses are complete, the search engine can match a user query with web pages that have been dissected into 'component values' based on the search engine's particular algorithm. In Part II of Search Engines Deconstructed, we discuss how search engines establish page rank, which determines which results the engine displays for your search.
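To close the loop, here is a bare-bones sketch of that matching step, assuming (for the sketch) that the document index maps each page to a dictionary of word scores like the one built earlier. It simply adds up the stored scores for each query word; how the resulting matches are ranked is the subject of Part II.

```python
# A bare-bones sketch of query matching: look up each query word in the
# document index and sum the stored scores per page. Ranking proper is
# deliberately left out here.
def match_query(query, document_index):
    """document_index: dict mapping page URL -> {word: score}.
    Returns pages ordered by a naive sum of their scores for the query words."""
    terms = query.lower().split()
    results = {}
    for url, word_scores in document_index.items():
        total = sum(word_scores.get(t, 0) for t in terms)
        if total > 0:
            results[url] = total
    return sorted(results, key=results.get, reverse=True)
```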