Posts in Building One


Building a PHP search indexer

Posted by Jongerius under Building One, Search Engines, Webdevelopment

I’ve been working on a demo website called MovServDex for several years now. I call it a demo website, but it’s really a fully featured website about TV shows and movies. For the latest version I decided to add a search engine. In this post I’ll shed some light on how you can create a PHP script that ‘crawls’ the web for pages.

Before I continue, please note that this is not meant to replace a real search engine like Google, but it may be useful on your own website.



Problems In Searching

Posted by Jongerius under Building One, General Rant, Search Engines, Webdevelopment

A while ago I started building a search engine for a demo website I have. Nothing too fancy, but good enough to index pages and return results. I’ll post an article later on about how you can build a search indexer.

I’m pretty far along but have run into some problems with indexing web pages. The first was that I could not index sub pages successfully. As it turns out, my link extractor was including the surrounding quotes as part of the URL, so the links it produced were invalid.
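
To illustrate the quote problem, here is a minimal sketch of a link extractor that captures only the text between the quotes. It is an illustration under assumptions rather than the actual MovServDex code: the function name `extractLinks` and the regular expression are made up for this example, and a regex like this will not handle every piece of malformed HTML.

```php
<?php
// Hypothetical helper: returns the href values found in <a> tags,
// without the quotes that surround them in the HTML source.
function extractLinks(string $html): array
{
    // Three alternatives: "double quoted", 'single quoted', or unquoted.
    // The capture groups sit inside the quotes, so the quotes are never
    // part of the captured URL.
    $pattern = '/<a\s[^>]*?href\s*=\s*(?:"([^"]*)"|\'([^\']*)\'|([^\s>]+))/i';
    preg_match_all($pattern, $html, $matches, PREG_SET_ORDER);

    $links = [];
    foreach ($matches as $m) {
        // At most one of the three groups is non-empty for a given match.
        $url = '';
        for ($i = 1; $i <= 3; $i++) {
            if (isset($m[$i]) && $m[$i] !== '') {
                $url = $m[$i];
                break;
            }
        }
        // Defensive: strip whitespace and any stray quote characters.
        $url = trim($url, " \t\n\r\"'");
        if ($url !== '') {
            $links[] = $url;
        }
    }

    // Drop duplicates and reindex from 0.
    return array_values(array_unique($links));
}
```

Calling `extractLinks()` on a page that mixes double-quoted, single-quoted and unquoted `href` attributes should give back the bare URLs in all three cases.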

The second problem was that the indexer crawled the same page multiple times in one refresh: first when the page was fetched from the existing index, and then once more every time another page linked to it. To solve this I added a list of pages already visited, which actually worked ;)
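
The visited list can be sketched roughly as follows. This is a simplified illustration, not the real indexer: `crawl` and the `$fetchLinks` callback are hypothetical names, and the callback stands in for whatever code downloads a page and returns the URLs it links to.

```php
<?php
// Hypothetical crawl loop that visits every URL at most once per refresh.
// $seedUrls:   URLs already in the index (assumed starting point).
// $fetchLinks: callback taking a URL and returning an array of linked URLs.
function crawl(array $seedUrls, callable $fetchLinks): array
{
    $visited = [];      // URL => true, used as a set for fast lookups
    $queue = $seedUrls; // URLs still waiting to be crawled

    while ($queue !== []) {
        $url = array_shift($queue);

        // Skip pages that were already crawled in this refresh.
        if (isset($visited[$url])) {
            continue;
        }
        $visited[$url] = true;

        foreach ($fetchLinks($url) as $link) {
            // Only queue links that have not been crawled yet.
            if (!isset($visited[$link])) {
                $queue[] = $link;
            }
        }
    }

    // The pages crawled, in the order they were visited.
    return array_keys($visited);
}
```

Using the URLs as array keys keeps the "already visited?" check cheap, because `isset()` on a key does not have to scan the whole list the way `in_array()` would.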

The last problem, which I haven’t solved so far, is that the indexer seems unable to find new pages once the index reaches a certain size. I have no idea why at this point, but it may have something to do with a bug in the crawler.

So hopefully I will have all of these problems fixed before my next article on how to build a search indexer.