Building a PHP search indexer

February 12, 2008

I’ve been working on a demo website called MovServDex for quite a few years now. I’m calling it a demo website, but it’s really a fully featured website on TV shows and movies. In the latest version I decided to add a search engine. In this post I’ll shed some light on how you can create a PHP script that will ‘crawl’ the web for pages.

Before I continue please note that this is not meant to be a replacement for a real search engine like Google. But it may be useful for you to use on your own website.

One of the first things you must do is decide how you want to structure the data you index from the web. I won’t go into much detail on that right now, since it’s probably a university study on its own. But I will go into detail on:

  • How to feed pages into an existing index
  • Extracting useful links from indexed pages
  • Extracting information from a page, like the title, description, etc.
  • Filtering out useless content from pages

Feeding new pages into the index

The simplest approach would be manually entering new pages to index. But if you’re anything like me, you can turn something simple into something complex. What you need to consider is how you want to do this: do you want the engine to index only the given page, or the pages it links to as well?

If you have these two possibilities then your engine will look something like this:

[code language='php']class Engine {
  function indexPage($page_url, $subpages_included = false)
  {}
}[/code]
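
Calling it could look something like this (a hypothetical usage sketch; the URL is just a placeholder):

[code language='php']$engine = new Engine();
// Index the page itself, plus the pages it links to
$engine->indexPage('http://www.example.com/', true);[/code]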

Looks simple, doesn’t it? Well, that’s because so far it is. In the next couple of paragraphs I will slowly keep adding code to this class to show how I built my search engine. 😉

So let’s start by adding the first feature you will need. When adding a new page to the index, you have to fetch that page, so that’s the first thing we are going to build. Something I built to prevent accessing non-existing URLs is a checker that verifies whether a page exists. I do this as follows:

[code language='php']function verifyUrl($page_url)
{[/code]

Open a connection to the URL provided by the engine (user):

[code language='php']
   $real_url = parse_url($page_url);
   $use_port = isset($real_url['port']) ? $real_url['port'] : 80;
   $socket = fsockopen($real_url['host'], $use_port, $errno, $errstr, 30);
[/code]

If we successfully opened a connection, send the URL we need to the web server.

[code language='php']
   if ($socket) {
     // Default to '/' when the URL has no path component
     $path = isset($real_url['path']) ? $real_url['path'] : '/';
     $sent_header = "HEAD " . $path .
       " HTTP/1.0\r\nHost: " . $real_url['host'] . "\r\n\r\n";
     fputs($socket, $sent_header);
[/code]

As long as we don’t have all the header information we need from the web server, we keep reading the response.

[code language='php']
     $headers = array();
     while (!feof($socket)) {
       if ($rec_header = trim(fgets($socket, 1024))) {
         // The status line has no colon, so it ends up at $headers[0]
         $ar = explode(':', $rec_header);
         $key = array_shift($ar);
         if ($key == $rec_header)
           $headers[] = $rec_header;
         else
           $headers[$key] = substr($rec_header, strlen($key) + 2);
         unset($key);
       }
     }
     fclose($socket);
   }
[/code]

Check whether the server’s response says the page actually exists and is OK.

[code language='php']
  if (isset($headers[0]) && stripos($headers[0], '200 OK') !== false)
    return true;
  return false;
}
[/code]
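
A quick way to try the function out (again, the URL is just a placeholder):

[code language='php']// Hypothetical check before feeding a page into the index
if ($engine->verifyUrl('http://www.example.com/'))
  echo 'Page exists, safe to index.';[/code]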

Now that we have a function to check whether URLs exist and don’t return a 404 (meaning ‘uh, I lost the page’), we can start writing the function that fetches the actual content of the URL.

Reading the content of the URL is pretty simple; just add the following code:

[code language='php']function getUrlContents($url_string)
{[/code]

Prevent the fetching of non-existing pages, so the index never gets polluted.

[code language='php']  if (!$this->verifyUrl($url_string))
      return;[/code]

Now fetch the content and extract some basic information from it, so we have some details on the page.

[code language='php']
  $page_content = @file_get_contents($url_string);
  if (!$page_content) return;

  $urls = array();
  // Find all opening anchor tags in the page
  @preg_match_all('/<a\s[^>]*>/i', $page_content, $href_tags);
  foreach ($href_tags[0] as $href_tag) {
    // Pull the absolute URL out of each anchor tag
    @preg_match_all('/http(s)?:\/\/[\w\-_]+(\.[\w\-_]+)+([\w\-.,@?^=%&:\/~+#]*[\w\-@?^=%&\/~+#])?/', $href_tag, $url);
    if (isset($url[0]) && isset($url[0][0]))
      $urls[] = $url[0][0];
    unset($url);
  }
  return $urls;
}[/code]

Now that all the links are extracted from the page and stored in an array, we can use them to fetch those pages, strip their links, and continue, as sketched below. This would automatically fill the index using a simple methodology.
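
To give an idea, here is a minimal sketch of how indexPage could tie the pieces together. Note that the $depth limit and the addToIndex() helper are my assumptions here, not something we built above:

[code language='php']function indexPage($page_url, $subpages_included = false, $depth = 2)
{
  // getUrlContents() returns the links found on the page, or nothing on failure
  $urls = $this->getUrlContents($page_url);
  if (!is_array($urls)) return;

  // addToIndex() is a hypothetical placeholder for your storage logic
  $this->addToIndex($page_url);

  if ($subpages_included && $depth > 0)
    foreach ($urls as $url)
      $this->indexPage($url, true, $depth - 1);
}[/code]

In practice you would also keep a list of URLs you have already visited, so the crawler doesn’t end up going around in circles.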

From this point on, you may want to store the entire content of the page in the index, together with the links on the page. So much for the basics of feeding new pages into the index.
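
As a rough idea of what that storage step could look like, here is a sketch using the classic mysql_* functions. The credentials, the pages table and its columns are all made up for this example:

[code language='php']// Assumed table: pages (url, content, indexed_at)
$db = mysql_connect('localhost', 'user', 'password');
mysql_select_db('search_index', $db);
mysql_query(sprintf(
  "INSERT INTO pages (url, content, indexed_at) VALUES ('%s', '%s', NOW())",
  mysql_real_escape_string($page_url, $db),
  mysql_real_escape_string($page_content, $db)
), $db);[/code]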

More about building a PHP search indexer

For now that is enough on how to build your own search indexer. In the near future I will expand on this article with:

  • Extracting useful information from a web page
  • Storing the indexed pages in a useful manner
  • Calculating static weights of web pages
