<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Narnio&#187; Building One</title>
	<atom:link href="http://www.narnio.com/category/search-engines/building-one/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.narnio.com</link>
	<description>A day in the life of a software engineer</description>
	<lastBuildDate>Sat, 04 Feb 2012 18:31:54 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.2</generator>
		<item>
		<title>Building a PHP search indexer</title>
		<link>http://www.narnio.com/2008/02/12/building-a-php-search-indexer/</link>
		<comments>http://www.narnio.com/2008/02/12/building-a-php-search-indexer/#comments</comments>
		<pubDate>Tue, 12 Feb 2008 15:18:32 +0000</pubDate>
		<dc:creator>Jongerius</dc:creator>
				<category><![CDATA[Building One]]></category>
		<category><![CDATA[Search Engines]]></category>
		<category><![CDATA[Webdevelopment]]></category>
		<category><![CDATA[google]]></category>
		<category><![CDATA[website]]></category>
		<category><![CDATA[work]]></category>

		<guid isPermaLink="false">http://www.narnio.com/2008/02/12/building-a-php-search-indexer/</guid>
		<description><![CDATA[I&#8217;ve been working on a demo website called MovServDex for quite some years now. I&#8217;m calling it a demo website, but it&#8217;s really a fully featured website on TV shows and movies. In the latest version I have decided to add a search engine. In this post I&#8217;ll shed some light on how you can [...]]]></description>
			<content:encoded><![CDATA[<p>I&#8217;ve been working on a demo website called <a href="http://www.movservdex.com">MovServDex</a> for several years now. I&#8217;m calling it a demo website, but it&#8217;s really a fully featured website about TV shows and movies. In the latest version I decided to add a search engine. In this post I&#8217;ll shed some light on how you can create a PHP script that will &#8216;crawl&#8217; the web for pages.</p>
<p><strong>Before I continue please note that this is not meant to be a replacement for a real search engine like Google. But it may be useful for you to use on your own website.</strong></p>
<p><span id="more-176"></span></p>
<p>One of the first things you must do is decide how you want to store the data you index from the web. I won&#8217;t go into much detail on this right now, since that&#8217;s probably a university study on its own. But I will go into detail on:</p>
<ul>
<li>How to feed pages into an existing index</li>
<li>Extracting useful links from indexed pages</li>
<li>Extracting nice information on the page like title, description, etc</li>
<li>Filtering out useless content of pages</li>
</ul>
<h2>Feeding new pages into the index</h2>
<p>The simplest part should be manually entering new pages to index. But if you&#8217;re anything like me, you can turn something simple into something complex. What you need to consider is how you want to do this: do you want the engine to index only the given page, or the pages it links to as well?</p>
<p>If you support both possibilities, your engine will look something like this:</p>
<pre class="brush: php">class Engine {
  function indexPage($page_url, $subpages_included = false)
  {}
}</pre>
<p>Looks simple, doesn&#8217;t it? Well, that&#8217;s because so far it is. In the next couple of paragraphs I will keep adding code to this class to show how I built my search engine. <img src='http://www.narnio.com/wp-includes/images/smilies/icon_wink.gif' alt=';)' class='wp-smiley' /> </p>
<p>So let&#8217;s start with the first feature you will need. When adding a new page to the index, you first have to fetch that page, so that&#8217;s what we are going to build first. To avoid requesting non-existent URLs, I built a checker that verifies that a page exists. I do this as follows:</p>
<pre class="brush: php">function verifyUrl($page_url)
{</pre>
<p>Open a connection to the URL provided by the engine (user).</p>
<pre class="brush: php">
   $real_url = parse_url($page_url);
   $use_port = isset($real_url[&#039;port&#039;]) ? $real_url[&#039;port&#039;] : 80;
   $socket = fsockopen($real_url[&#039;host&#039;], $use_port, $errno, $errstr, 30);
</pre>
<p>If we successfully opened a connection, send the request for the URL to the webserver.</p>
<pre class="brush: php">
   if ($socket){
     // Fall back to the site root when the URL has no path.
     $use_path = isset($real_url[&#039;path&#039;]) ? $real_url[&#039;path&#039;] : &#039;/&#039;;
     $sent_header = &quot;HEAD &quot;. $use_path .
       &quot; HTTP/1.0\r\nHost: &quot;. $real_url[&#039;host&#039;] .&quot;\r\n\r\n&quot;;
     fputs($socket, $sent_header);
</pre>
<p>We keep reading the response until the webserver has sent all of its header information.</p>
<pre class="brush: php">
     $headers = array();
     while (!feof($socket)) {
       if ($rec_header = trim(fgets($socket, 1024))) {
         $ar = explode(&#039;:&#039;, $rec_header);
         $key = array_shift($ar);
         if ($key == $rec_header)
          // No colon, so this is the status line (e.g. &quot;HTTP/1.1 200 OK&quot;).
          $headers[] = $rec_header;
         else
          $headers[$key] = substr($rec_header, strlen($key)+2);
         unset($key);
       }
     }
     fclose($socket);
  }
</pre>
<p>Check if the response from the server was that the page actually exists and is OK.</p>
<pre class="brush: php">
  if (isset($headers[0]) &amp;&amp; stripos($headers[0], &#039;200 OK&#039;) !== false)
    return true;
  return false;
}
</pre>
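<p>The check above only accepts a literal &#8216;200 OK&#8217;. As a rough sketch of a slightly more forgiving variant, the two helpers below pull the numeric status code out of the status line and accept any 2xx response. The names parseStatusCode and isSuccessStatus are made up for this illustration and are not part of the engine, and intdiv assumes PHP 7 or newer.</p>
<pre class="brush: php">
// Hypothetical helper, not part of the original engine.
// Returns the numeric HTTP status code from a status line, or 0 if it cannot be parsed.
function parseStatusCode($status_line)
{
  if (preg_match('/^HTTP\/\d+(?:\.\d+)?\s+(\d{3})\b/i', trim($status_line), $match))
    return (int) $match[1];
  return 0;
}

// Hypothetical helper: treats every 2xx response as an existing page.
function isSuccessStatus($status_line)
{
  // 200 through 299 all give 2 when divided by 100 (requires PHP 7+ for intdiv).
  return intdiv(parseStatusCode($status_line), 100) === 2;
}
</pre>
<p>Inside verifyUrl you could then pass the first header line to isSuccessStatus instead of searching it for &#8216;200 OK&#8217;. Whether a redirect (3xx) should count as an existing page is a separate decision that this sketch leaves out.</p>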
<p>Now that we have a function to check that URLs exist and don&#8217;t return a 404 (meaning &#8216;uh, I lost the page&#8217;), we can start writing the function that fetches the actual content of the URL.</p>
<p>Reading the content of the URL is pretty simple with the following code:</p>
<pre class="brush: php">function getUrlContents($url_string)
{</pre>
<p>Skip pages that don&#8217;t exist, so the index never gets polluted.</p>
<pre class="brush: php">  if (!$this-&gt;verifyUrl($url_string))
     return;</pre>
<p>Now fetch the content and extract some basic information from it (the links), so we have some details on the page.</p>
<pre class="brush: php">
  $page_content = @file_get_contents($url_string);
  if (!$page_content) return;

  $urls = array();
  // First collect every anchor tag, then pull the absolute http(s) URL out of each one.
  @preg_match_all(&#039;/&lt;a[^&gt;]*&gt;/&#039;, $page_content, $href_tags);
  foreach ($href_tags[0] as $href_tag) {
     @preg_match_all(&#039;/http(s)?:\/\/[\w\-_]+(\.[\w\-_]+)+([\w\-\.,@?^=%&amp;:\/~\+#]*[\w\-\@?^=%&amp;\/~\+#])?/&#039;, $href_tag, $url);
     if (isset($url[0]) &amp;&amp; isset($url[0][0]))
        $urls[] = $url[0][0];
     unset($url);
   }
  return $urls;
}</pre>
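<p>Regular expressions are only one way to get at the links. As a hedged alternative, the sketch below uses PHP&#8217;s DOM extension, which hands back attribute values without their surrounding quotes. The function name extractLinks is made up for this example, and like the expression above it keeps absolute http(s) links only.</p>
<pre class="brush: php">
// Hypothetical alternative to the regular expressions above (needs PHP's DOM extension).
function extractLinks($page_content)
{
  $urls = array();
  // loadHTML() refuses an empty string, so bail out early.
  if (trim($page_content) === '')
    return $urls;

  $document = new DOMDocument();
  // Real-world HTML is rarely valid, so collect parser warnings instead of printing them.
  $previous = libxml_use_internal_errors(true);
  $document->loadHTML($page_content);
  libxml_clear_errors();
  libxml_use_internal_errors($previous);

  foreach ($document->getElementsByTagName('a') as $anchor) {
    $href = trim($anchor->getAttribute('href'));
    // Keep absolute http(s) links only; relative links are skipped in this sketch.
    if (preg_match('/^https?:\/\//i', $href))
      $urls[] = $href;
  }
  // Drop duplicates and renumber the array from 0.
  return array_values(array_unique($urls));
}
</pre>
<p>Resolving relative links against the URL of the page is left out here. A real crawler would need it to reach most sub pages.</p>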
<p>Now that all the links are extracted from the page and stored in an array, we can fetch those pages in turn, extract their links and continue. This automatically fills the index using a simple method.</p>
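<p>That fetch, extract and continue cycle can be sketched as a queue of pages still to visit. This is only an illustration under some assumptions: crawl and $fetch_links are made-up names, $fetch_links stands for any callable that returns the links found on a page, and $max_pages is a cap added here so the loop cannot run forever.</p>
<pre class="brush: php">
// Hypothetical crawl loop, not the engine's actual code.
// $fetch_links: callable that takes a URL and returns an array of links on that page.
function crawl($start_url, $fetch_links, $max_pages = 50)
{
  $queue = array($start_url);   // pages still to visit, oldest first
  $visited = array();           // URL => true for every page already handled

  while (count($queue) > 0) {
    if (count($visited) >= $max_pages)
      break;

    $url = array_shift($queue);
    // A page can be queued more than once; only handle it the first time.
    if (isset($visited[$url]))
      continue;
    $visited[$url] = true;

    foreach ($fetch_links($url) as $link) {
      if (!isset($visited[$link]))
        $queue[] = $link;
    }
  }
  // The visited pages, in the order they were crawled.
  return array_keys($visited);
}
</pre>
<p>Passing the link fetcher in as a parameter keeps the loop easy to try out with a fake set of pages before pointing it at the real web.</p>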
<p>From this point on you may want to store the entire content of the page in the index, together with the links on the page. So much for the basics of feeding new pages into the index.</p>
<h2>More about building a PHP search indexer</h2>
<p>For now that is enough on how to build your own search indexer. In the near future I will expand on this article with:</p>
<ul>
<li>Extracting useful information from a web page</li>
<li>Storing the indexed pages in a useful manner</li>
<li>Calculating <em>static</em> weights of web pages</li>
</ul>
<hr/>Copyright &copy; 2012 <strong><a  href="http://www.narnio.com">Narnio</a></strong>. This Feed is for personal non-commercial use only. If you are not reading this material in your news aggregator, the site you are looking at is guilty of copyright infringement. Please contact legal@jong-soft.com so we can take legal action immediately.]]></content:encoded>
			<wfw:commentRss>http://www.narnio.com/2008/02/12/building-a-php-search-indexer/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Problems In Searching</title>
		<link>http://www.narnio.com/2008/01/10/problems-in-searching/</link>
		<comments>http://www.narnio.com/2008/01/10/problems-in-searching/#comments</comments>
		<pubDate>Thu, 10 Jan 2008 08:37:56 +0000</pubDate>
		<dc:creator>Jongerius</dc:creator>
				<category><![CDATA[Building One]]></category>
		<category><![CDATA[General Rant]]></category>
		<category><![CDATA[Search Engines]]></category>
		<category><![CDATA[Webdevelopment]]></category>

		<guid isPermaLink="false">http://www.narnio.com/2008/01/10/problems-in-searching/</guid>
		<description><![CDATA[A while ago I started building a search engine for a demo website I have. Nothing too fancy or anything, but good enough to index pages and return results. I&#8217;ll post an article some time from now about how you can build a search indexer. I&#8217;m pretty far along but have encountered some problems with [...]]]></description>
			<content:encoded><![CDATA[<p>A while ago I started building a search engine for a demo website I have. Nothing too fancy, but good enough to index pages and return results. I&#8217;ll post an article some time from now about how you can build a search indexer.</p>
<p>I&#8217;m pretty far along, but I have encountered some problems with indexing web pages. The first was that I was not able to index sub pages successfully. As it turns out, my link extractor made a mistake: it included the quotes as part of the URL.</p>
<p>The second problem was that the indexer crawled the same page multiple times in one refresh: the first time when the page was fetched from the existing index, and then once more every time another page linked to it. To solve this I added a list of pages already visited, which actually worked <img src='http://www.narnio.com/wp-includes/images/smilies/icon_wink.gif' alt=';)' class='wp-smiley' /> .</p>
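<p>A list of visited pages works best when different spellings of the same URL end up as one entry. The sketch below is one possible way to do that, not the code I actually use: visitedKey and VisitedList are names made up for this example, and the rules (lowercase scheme and host, drop the #fragment, treat a missing path as /) are assumptions you may want to adjust.</p>
<pre class="brush: php">
// Hypothetical helper: reduces a URL to a key, so trivially different
// spellings of the same page are only crawled once per refresh.
function visitedKey($url)
{
  $parts = parse_url(trim($url));
  // Leave anything we cannot parse untouched.
  if ($parts === false || !isset($parts['host']))
    return $url;

  $scheme = isset($parts['scheme']) ? strtolower($parts['scheme']) : 'http';
  $host = strtolower($parts['host']);
  $port = isset($parts['port']) ? ':' . $parts['port'] : '';
  $path = isset($parts['path']) ? $parts['path'] : '/';
  $query = isset($parts['query']) ? '?' . $parts['query'] : '';
  // The fragment (#...) never reaches the server, so it is dropped.
  return $scheme . '://' . $host . $port . $path . $query;
}

// Hypothetical list of pages already visited during one refresh.
class VisitedList
{
  private $seen = array();

  // Returns true the first time a URL is offered, false on every repeat.
  function add($url)
  {
    $key = visitedKey($url);
    if (isset($this->seen[$key]))
      return false;
    $this->seen[$key] = true;
    return true;
  }
}
</pre>
<p>The indexer would call add() before fetching a page and skip the page whenever it returns false.</p>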
<p>The last problem, which I haven&#8217;t solved so far, is that the indexer seems unable to find new pages once the index reaches a fixed size. I have no idea why at this point, but it may have something to do with a bug in the crawler.</p>
<p>So hopefully I will have all of these problems fixed before my next article on how to build a search indexer.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.narnio.com/2008/01/10/problems-in-searching/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>

