A while ago I started building a search engine for a demo website I have. Nothing too fancy, but good enough to index pages and return results. I’ll post an article some time from now about how you can build a search indexer.
I’m pretty far along but have run into some problems with indexing web pages. The first was that I couldn’t index sub pages successfully. As it turns out, my link extractor had a bug: it included the quotes as part of the URL.
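To illustrate the quote problem, here is a minimal sketch in Python of a link extractor that lets an HTML parser handle the attribute quoting instead of cutting the URL out of the markup by hand. This is not the code from my indexer; the names LinkExtractor and extract_links are made up for the example, and it assumes links only come from the href attribute of <a> tags.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Hypothetical helper: collects the href of every <a> tag on a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            # The parser hands over attribute values with the quotes
            # already removed, so they cannot end up in the URL.
            if name == "href" and value:
                # Resolve relative links (sub pages) against the page URL.
                self.links.append(urljoin(self.base_url, value.strip()))


def extract_links(html, base_url):
    """Return the absolute URLs of all links found in the given HTML."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links
```

Calling extract_links(page_html, page_url) gives back absolute URLs whether the page used double quotes, single quotes, or relative paths for its links.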
The second problem was that the indexer crawled the same page multiple times in one refresh: once when the page was fetched from the existing index, and then once more every time another page linked to it. To solve this I added a list of pages already visited, which actually worked ;).
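The visited list idea looks roughly like this in Python. It is only a sketch under a few assumptions: crawl and fetch_links are hypothetical names, fetch_links(url) stands in for whatever downloads a page and returns the links on it, and I use a set rather than a list so the lookup stays fast as the index grows.

```python
from collections import deque


def crawl(seed_urls, fetch_links):
    """Visit every reachable URL once and return the URLs in visit order.

    seed_urls: URLs to start from, e.g. the pages already in the index.
    fetch_links: hypothetical callable returning the links found on a URL.
    """
    visited = set()
    queue = deque(seed_urls)
    order = []
    while queue:
        url = queue.popleft()
        if url in visited:
            # Already crawled during this refresh, so skip it.
            continue
        visited.add(url)
        order.append(url)
        for link in fetch_links(url):
            if link not in visited:
                queue.append(link)
    return order
```

A page can still be queued more than once when several pages link to it, but the check at the top of the loop makes sure it is only fetched the first time.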
The last problem, which I haven’t solved so far, is that the indexer seems unable to find new pages once the index reaches a fixed size. I have no idea why at this point, but it may have something to do with a bug in the crawler.
So hopefully I will have all of these problems fixed before my next article on how to build a search indexer.