How Search Engines Work

You’ve probably been using search engines for years, both as a searcher and in your marketing role. Sometimes the results you get make sense; other times they leave you puzzled.

Understanding search engine results is easier if you first understand the principles of how search engines work.

Here is a brief background on the basics.

Search engines today (like Google, Yahoo, and MSN) are text-based. They work by connecting words (queries) that you enter into a search box to a database of pages from the web (the index). The search engine then displays a list of URLs along with a summary of the pages which it believes are relevant to your query.

The search engine consists of three components: the spider, the index, and the query processor. The query processor is the runtime system that connects the query to the index and manages the browser interface to display the results.

Spider

The spider software (or crawler) sits on a server and sends out requests for web pages on the internet, similar to what your browser does. It then hands off the web page to the indexer. A search engine like Google has hundreds of spiders running simultaneously in order to continuously crawl the web to find new and changed pages.

Whereas your browser reads all the HTML and scripts on a web page in order to display it properly, the crawler is primarily interested in the text components of the page. The crawler for the most part ignores JavaScript, CSS, Flash, most meta tags, and images. To give you an idea of what the search engine sees, try the Spider Simulator from SEO Chat at the bottom of this page.
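To make the "text only" view concrete, here is a minimal sketch in Python using the standard library's html.parser. It is an illustration, not how any particular search engine does it: the class name TextOnlyParser and the function visible_text are made up for this example, and it only skips script and style content.

```python
from html.parser import HTMLParser


class TextOnlyParser(HTMLParser):
    """Collects the text of a page, skipping <script> and <style> contents."""

    SKIPPED = {"script", "style"}  # assumption: only these two are ignored here

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped element.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def visible_text(html):
    """Return the page text roughly as a text-focused crawler might see it."""
    parser = TextOnlyParser()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)
```

Calling visible_text on a page returns the headings and body copy joined into one string, with the script and stylesheet content dropped.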

The crawler also records all the hyperlinks on the page it is reading and places those links on its request list. It then continues working its way down the request list, fetching each page and handing it to the indexer. Recording these links matters for two reasons: it lets the spider reach all the sites on the web, and it captures the links and their anchor text (the text on the page that describes the link) for ranking purposes.
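The request list described above can be sketched as a simple queue. This Python example is a simplified illustration under stated assumptions: the "web" is a hypothetical in-memory dictionary called PAGES rather than real HTTP requests, and the names LinkParser and crawl are invented for the example.

```python
from collections import deque
from html.parser import HTMLParser


class LinkParser(HTMLParser):
    """Records (href, anchor text) pairs for every <a> element on a page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def crawl(seed, fetch):
    """Work down the request list, recording links and their anchor text."""
    request_list = deque([seed])
    seen = {seed}
    anchors = []  # (source page, target page, anchor text)
    while request_list:
        url = request_list.popleft()
        html = fetch(url)
        if html is None:  # page could not be fetched
            continue
        parser = LinkParser()
        parser.feed(html)
        for href, text in parser.links:
            anchors.append((url, href, text))
            if href not in seen:  # queue each page only once
                seen.add(href)
                request_list.append(href)
    return anchors


# Hypothetical in-memory "web" used in place of real HTTP requests.
PAGES = {
    "/": '<a href="/toffee">toffee recipes</a> <a href="/about">about us</a>',
    "/toffee": '<a href="/">home</a>',
    "/about": "<p>No links here.</p>",
}
anchors = crawl("/", PAGES.get)
```

A real crawler would also resolve relative URLs, respect robots.txt, and limit its request rate; those details are left out here.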

Spiders work to crawl the web frequently enough to keep up with changes to web pages, but it’s a large task. Instead of crawling all pages on the web with the same frequency, the crawler partitions the task. Pages that change infrequently are visited less frequently, perhaps once every six weeks, whereas pages on news sites may be visited many times each day. If you change your pages infrequently you can expect that the spider will not visit often.
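One way to picture this partitioning is a revisit interval that shrinks when a page has changed and grows when it has not. The Python sketch below is an assumed heuristic for illustration only; the halving and doubling rule, the one-hour floor, and the function name next_interval are not taken from any search engine. The six-week ceiling matches the figure mentioned above.

```python
import hashlib

MIN_INTERVAL_HOURS = 1            # assumed floor for fast-changing pages
MAX_INTERVAL_HOURS = 6 * 7 * 24   # six weeks, the ceiling mentioned above


def next_interval(current_hours, old_hash, new_content):
    """Return (hours until the next visit, hash of the content just fetched)."""
    new_hash = hashlib.sha256(new_content.encode("utf-8")).hexdigest()
    if new_hash != old_hash:
        # The page changed since the last visit: come back sooner.
        interval = max(MIN_INTERVAL_HOURS, current_hours / 2)
    else:
        # The page is unchanged: wait longer before the next visit.
        interval = min(MAX_INTERVAL_HOURS, current_hours * 2)
    return interval, new_hash
```

Under this rule, a page that never changes drifts toward a six-week revisit cycle, while a news page that changes on every visit is checked more and more often.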

Index

The index is a large database. The raw index is a list of web pages organized by domain. For each domain the index lists all the pages on the site and all the relevant information on the page. This includes the text on the page, the links, and the anchor text for the links. Now the search engine can tell what words are associated with each URL.

The search engine analyzes the text and the HTML markup on the page to understand what the page is about and how relevant it is to a particular query. It also looks at who the page links to (forward links) and who links to the page (back or inbound links) for relevance. It is interested not only in the credibility of the links, but also the trustworthiness of the “neighborhood” where the links reside. The search engines invest considerable energy in this analysis, looking at over a hundred factors to determine relevance and ranking.

The next step in the indexing process is to invert the raw index and create a runtime index. The search engine uses the runtime index to connect search queries with matching pages in the index. The search index can be thought of as a list of every word accompanied by a code for all the pages that contain that word. Now when you query the search engine for “chocolate toffee bar” it can immediately find all the URLs that are associated with the keyword phrase.
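The inversion step can be sketched in a few lines of Python. This is a toy illustration: the raw index here is simplified to a dictionary of made-up URLs and page text (the real raw index is organized by domain and holds far more), and the names invert and lookup are invented for the example.

```python
from collections import defaultdict

# Simplified raw index: URL -> page text. The URLs and text are made up.
raw_index = {
    "example.com/toffee": "chocolate toffee bar recipe",
    "example.com/fudge": "chocolate fudge recipe",
    "example.com/bars": "granola bar and toffee bar ideas",
}


def invert(raw):
    """Build the runtime index: word -> set of URLs containing that word."""
    inverted = defaultdict(set)
    for url, text in raw.items():
        for word in text.lower().split():
            inverted[word].add(url)
    return inverted


def lookup(inverted, query):
    """Return the URLs that contain every word in the query."""
    words = query.lower().split()
    if not words:
        return set()
    result = set(inverted.get(words[0], set()))
    for word in words[1:]:
        result &= inverted.get(word, set())
    return result


runtime_index = invert(raw_index)
matches = lookup(runtime_index, "chocolate toffee bar")
```

Here matches contains only example.com/toffee, the one page with all three words. Note that this sketch matches pages containing all the words in any order; matching an exact phrase would also require storing word positions.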

Query processor and user interface

The query processor software receives the search query you enter in the search box, sends it to the runtime index, and displays the response from the index in the search engine results page (SERP). The search results include a title, a snippet, and the URL of the page. The title is usually taken directly from the title element on the page. The snippet, or description, is often the first 150 characters of the meta description tag on the page.
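As a rough illustration of how one result entry could be assembled from a page, the Python sketch below pulls the title element and the meta description and trims the description to 150 characters. The names HeadParser and result_entry are hypothetical, and real search engines often choose snippets differently.

```python
from html.parser import HTMLParser

SNIPPET_LENGTH = 150  # the snippet length mentioned above


class HeadParser(HTMLParser):
    """Extracts the <title> text and the meta description from a page."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.description = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and (attrs.get("name") or "").lower() == "description":
            self.description = attrs.get("content") or ""

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def result_entry(url, html):
    """Build the title, snippet, and URL shown for one search result."""
    parser = HeadParser()
    parser.feed(html)
    return {
        "title": parser.title.strip(),
        "snippet": parser.description[:SNIPPET_LENGTH],
        "url": url,
    }
```

If a page has no meta description, this sketch returns an empty snippet; a real engine would typically fall back to text from the page body.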

The first step for the query processor is to analyze the query and to decide how best to retrieve a match in the index. The query processor tries to interpret what you mean in your query and find the best possible matches. It looks at possible misspellings, word variants (e.g. plural/singular), the presence of a phrase, word order, and possible stop words (extremely common words that occur so frequently that they are ignored).
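A small part of this query analysis, dropping stop words and folding plurals into singulars, might look like the Python sketch below. The stop word list and the naive "strip a trailing s" rule are assumptions made for illustration; spelling correction and phrase detection are not covered.

```python
# Assumed, deliberately tiny stop word list for illustration.
STOP_WORDS = {"a", "an", "and", "the", "of", "in", "to", "for"}


def singular(word):
    """Naive plural handling: drop a trailing 's' (but keep words like 'glass')."""
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word


def normalize_query(query):
    """Lowercase the query, strip punctuation, drop stop words, fold plurals."""
    words = [w.strip(".,!?\"'") for w in query.lower().split()]
    return [singular(w) for w in words if w and w not in STOP_WORDS]
```

For example, normalize_query("The Chocolate Toffee Bars") yields the words chocolate, toffee, and bar, which can then be looked up in the index.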

The search results are ranked by relevance for the search query. The search engine usually retrieves only a portion of the results that it has stored in the index. For example, Google retrieves the first 1000 results and ranks them.
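Retrieving only the top portion of the matches can be sketched with Python's heapq module. The relevance scores here are assumed to have been computed already, and the function name top_results is invented for the example; how the scores are produced is outside this sketch.

```python
import heapq


def top_results(scored_pages, limit=1000):
    """Return up to `limit` (url, score) pairs, highest score first."""
    return heapq.nlargest(limit, scored_pages, key=lambda pair: pair[1])
```

Using a heap avoids sorting every match when only the first 1000 are needed.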

Recommended

Spider Simulator from SEO Chat to see what a particular web page looks like to the search engine.

On Search, The Series, Tim Bray, 2003 – An excellent series of online essays on how text-based search engines work.

The Search: How Google and Its Rivals Rewrote the Rules of Business and Transformed Our Culture – John Battelle – 2005 – A founding editor of Wired and the Industry Standard tells the story. Good description of how the Google search engine works in an early chapter.

Matt Cutts Blog – Gadgets, Google, and SEO – Matt is an engineer at Google, so he not only understands how Google works, he also knows about changes that Google makes to its algorithms. He can’t share all of it with you, but he shares as much as he can. An important blog to read if you want to understand Google.

SearchEngineLand articles and blogs

SEOMoz – lots of articles and tools for beginner to advanced search engine optimizers from Rand Fishkin and company.

About David Crankshaw

Web Analytics for B2B companies. Improve demand creation by increasing your website traffic, sales leads and revenue. Connect with David on Google+
