Thursday, May 28, 2009

What Are Robots, Spiders, and Crawlers?

A robot, spider, or crawler is a piece of software run by a search engine to build a textual summary of a website’s content (a content index). It creates a text-based summary of the content and records an address (URL) for each webpage. Crawlers are programmed to “crawl” from one web page to another by following the links on those pages. As a crawler makes its way around the Internet, it collects content (such as text and links) from web sites and saves it in a database that is indexed and ranked according to the search engine’s algorithm.
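The link-following behavior described above can be sketched in a few lines of Python. This is only a minimal illustration, not how any particular search engine works: the names `LinkAndTextParser`, `crawl`, and `FAKE_WEB` are made up for this example, and the "web" is a small in-memory dictionary so the sketch runs without network access.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkAndTextParser(HTMLParser):
    """Collects link targets and text from one HTML page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.text_parts = []

    def handle_starttag(self, tag, attrs):
        # Record the href of every <a> tag so the crawler can follow it.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

    def handle_data(self, data):
        # Keep non-empty text as the page's text summary.
        if data.strip():
            self.text_parts.append(data.strip())


def crawl(start_url, fetch, max_pages=100):
    """Breadth-first crawl; returns {url: text summary}."""
    index = {}
    queue = deque([start_url])
    seen = {start_url}
    while queue and len(index) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:  # page could not be fetched
            continue
        parser = LinkAndTextParser()
        parser.feed(html)
        index[url] = " ".join(parser.text_parts)
        for href in parser.links:
            absolute = urljoin(url, href)  # resolve relative links
            if absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return index


# Hypothetical in-memory "web" standing in for real HTTP requests.
FAKE_WEB = {
    "http://example.com/": '<h1>Home</h1><a href="/about">About</a>',
    "http://example.com/about": '<p>About us</p><a href="/">Home</a>',
}

crawl_index = crawl("http://example.com/", FAKE_WEB.get)
```

Starting from the home page, the sketch finds the link to the "about" page, visits it, and stores a text summary under each URL. The `seen` set keeps it from revisiting the home page when the "about" page links back to it.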

When a person searches, the keywords they enter are compared with the available website content indexes. Because so many webpages are indexed, search engines rarely rely on direct text-only matching; instead, they use sophisticated logic (algorithms) to rank potential matches. For example, the underlying information hierarchy of a webpage (semantic markup) may be factored into the ranking a webpage is assigned.
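To make the idea of comparing keywords against a content index concrete, here is a toy sketch in Python. It assumes the simplest possible ranking rule, counting how often each query word appears on a page. Real search engine algorithms are far more sophisticated and are not public, and the function names and sample pages here are invented for illustration.

```python
from collections import Counter, defaultdict


def build_index(pages):
    """Map each lowercase word to {url: occurrence count}."""
    inverted = defaultdict(Counter)
    for url, text in pages.items():
        for word in text.lower().split():
            inverted[word][url] += 1
    return inverted


def search(inverted, query):
    """Rank URLs by the summed counts of the query words (toy scoring)."""
    scores = Counter()
    for word in query.lower().split():
        for url, count in inverted.get(word, {}).items():
            scores[url] += count
    # Highest score first; ties are broken alphabetically by URL.
    return sorted(scores, key=lambda u: (-scores[u], u))


# Hypothetical page summaries, as a crawler might have stored them.
pages = {
    "/a": "spiders crawl the web",
    "/b": "web crawlers index web pages",
    "/c": "cooking recipes",
}
inverted_index = build_index(pages)
results = search(inverted_index, "web pages")
```

For the query "web pages", page `/b` ranks first because it contains more of the query words, `/a` comes second, and `/c` is not returned at all.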

As to what actually happens when a crawler begins reviewing a site, it’s a little more complicated than simply saying that it “reads” the site. The crawler sends a request to the web server where the web site resides, asking for pages to be delivered to it in the same manner that your web browser requests the pages you view. The difference between what your browser sees and what the crawler sees is that the crawler views the pages through a text-only interface. No graphics or other types of media files are displayed. It’s all text, and it’s encoded in HTML, so to you it might look like gibberish.
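As a rough illustration of that request, the Python sketch below builds the kind of HTTP request a crawler might send. The bot name "ExampleBot" and both URLs are hypothetical, and the request is only constructed, never sent, so the example runs without a network connection.

```python
from urllib.request import Request

# "ExampleBot" and its info URL are hypothetical, not a real crawler.
request = Request(
    "http://example.com/index.html",
    headers={"User-Agent": "ExampleBot/1.0 (+http://example.com/bot)"},
)

# The request is only built here, never sent.
method = request.get_method()
# urllib stores header names capitalized as "User-agent".
agent = request.get_header("User-agent")
```

The request is an ordinary GET, just like the one a browser makes. The main visible difference is the User-Agent header, which is how a crawler typically identifies itself to the web server.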

The crawler can request as many or as few pages as it’s programmed to request at any given time. This can sometimes cause problems for web sites that aren’t prepared to serve up dozens of pages of content at once. The requests may overload the site and cause it to crash, or they may slow traffic to the site considerably. It’s even possible that the requests will be fulfilled so slowly that the crawler gives up and goes away.

If the crawler does go away, it will eventually return to try the task again, and it might try several times before it gives up entirely. But if your site doesn’t eventually begin to cooperate with the crawler, it’s penalized for the failures and its search engine ranking will fall.
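The try-again-then-give-up behavior can be sketched as a bounded retry loop. This is a simplified assumption about how a crawler might behave: a real crawler would come back hours or days later rather than retrying immediately, and the names `fetch_with_retries` and `flaky_fetch` are invented for this example.

```python
def fetch_with_retries(url, fetch, max_attempts=3):
    """Call fetch(url) up to max_attempts times; return (content, attempts)."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch(url), attempt
        except TimeoutError:
            continue  # the server was too slow; try again
    return None, max_attempts  # gave up entirely


# Hypothetical flaky server: times out twice, then answers.
calls = {"n": 0}


def flaky_fetch(url):
    calls["n"] += 1
    if calls["n"] < 3:
        raise TimeoutError("server too slow")
    return "<html>ok</html>"


content, attempts = fetch_with_retries("http://example.com/", flaky_fetch)
```

Here the third attempt succeeds. If the server had kept timing out, the function would have returned `None`, which is the point at which a site would simply go unindexed.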

Reasons a URL may not be included in the index

Below is a list of common reasons that a document may not be indexed:
  • ROBOTS.TXT ACCESS DENIED: The site's "/robots.txt" file prevents access to the document.
  • YOUR PAGE IS UNDER CONSTRUCTION. If you can avoid it, you don’t want a crawler to index your site while this is happening. If you can’t avoid it, however, be sure that any pages that are being changed or worked on are excluded from the crawler’s territory. Later, when your page is ready, you can allow the page to be indexed again.
  • PAGES OF LINKS. Having links leading to and away from your site is an essential way to ensure that crawlers find you. However, pages that consist of nothing but links can seem suspicious to a search crawler, and it may classify your site as a spam site. Instead of having pages that are all links, break the links up with descriptions and text. If that’s not possible, block the link pages from being indexed by crawlers.
  • DYNAMIC PAGES: Dynamic pages are often ignored by search engine spiders. In fact, a URL containing special symbols like a question mark (?) or an ampersand (&) may be ignored by some engines, and pages generated on the fly from a database often contain these symbols. In this situation, it's important to generate "static" versions of each page you wish to be indexed. As far as the search engines are concerned, the simpler the page is, the better. Does this mean, for example, that having a JavaScript snippet to count visits to the page will prevent you from being indexed, or lower your rankings? No. It simply means that the search engine will most likely ignore the script and index the remaining areas of the page. There is some evidence, however, that going too far with fancy scripts and code can hurt your rankings if the bulk of your page consists of JavaScript or VBScript.
  • PAGES OF OLD CONTENT. Old content, like blog archives, doesn’t necessarily harm your search engine rankings, but it also doesn’t help them much. One worrisome issue with archives, however, is the number of times that archived content appears on your site. With a blog, for example, a post may appear on the page where it was originally displayed, appear again in the archives, and possibly be linked from some other area of your site. Although this is all legitimate, crawlers might mistake multiple instances of the same content for spam. Instead of risking it, place your archives off limits to crawlers.
  • REDIRECTS: If your site contains redirects or meta refresh tags, the engines can sometimes have trouble indexing it. Generally an engine will index the page being redirected TO, but if it thinks you are trying to "trick" it by using "cloaking" or IP redirection technology that it can detect, there is a chance that it may not index the site at all.
  • PRIVATE INFORMATION. It really makes better sense not to have private information (or proprietary information) on a web site. But if there is some reason that you must have it on your site, then definitely block crawlers from accessing it. Better yet, password-protect the information so that no one can stumble on it accidentally.
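Several of the items above come down to blocking crawlers from parts of a site, which is what the "/robots.txt" file is for. The sketch below uses Python's standard `urllib.robotparser` module to check an example robots.txt the way a well-behaved crawler might. The directory names and the "ExampleBot" user agent are hypothetical, chosen only to mirror the cases in the list.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt covering the cases listed above.
ROBOTS_TXT = """\
User-agent: *
Disallow: /under-construction/
Disallow: /links/
Disallow: /archives/
Disallow: /private/
"""

parser = RobotFileParser()
# Parse the rules from text instead of downloading them from a site.
parser.parse(ROBOTS_TXT.splitlines())


def allowed(path, agent="ExampleBot"):
    """Return True if the agent may fetch the given path on example.com."""
    return parser.can_fetch(agent, "http://example.com" + path)


home_ok = allowed("/index.html")
private_ok = allowed("/private/report.html")
archive_ok = allowed("/archives/2009/05/")
```

With these rules the home page may be crawled, while the private and archive paths may not. Keep in mind that robots.txt is only a request that cooperative crawlers honor. It does not stop anyone from opening the URL directly, which is why password protection is the better choice for private information.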
