creating a search engine for fun

Search engines have become a wasteland of ads and AI generated content.

Google once stood as the champion of the internet, unmatched in its quality of results and speed. In the modern day, Google has become bloated and slow. Like all things, Google has succumbed to the rot economy, lowering search quality.

For a software engineer, search is an extremely important tool. Documentation is often dense, and in some cases non-existent. Without a powerful search engine, finding solutions or specific idiomatic code examples becomes impossible.

In an effort to remedy this I've set out to build my own personal search engine. While this may seem like a monumental task, most of the concepts that were once new and shiny (vector databases, for example) have become commonplace. Web scrapers, while still not exactly trivial, have become more widely understood, especially in the age of AI scraping every corner of the internet for training data.

Requirements:

Free: it must be free to operate and free to use. Kagi is a great option for a paid search engine, but I believe I can build something for free that has "good enough" quality. This also means I don't want to use things like Google's API, as that costs money per X searches.

Fast: like its alternatives, it must be fast. It's 2024; no one wants to wait minutes for results. For this I want a language with async/await so a single server with limited threads can serve as many users as possible. Since I want it cheap, the server will not be fast or have lots of RAM, meaning I'd prefer a compiled language. I will likely have to do some web scraping, so the language will need to be performant.

Safe: I want this to be legal... I don't want to be sued or violate a TOS. I also believe in an open internet that does not track, so I am making the decision to not have accounts or log searches. I also want my code to be as secure as possible, which at my skill level means no C or C++.

For these reasons, as well as my interest in learning the language, I am choosing to write this in Rust.

I am still relatively young in my career and even fresher in the Rust ecosystem. Before jumping straight to building a search engine, there are a few tricks I can employ to get some fast results.

Search provider Kagi is well known for its high quality results, and it actively uses results from Google, DDG, and Bing as some of its data sources to achieve this. I already daily-drive DDG for most of my searches and far prefer it to modern Google. However, Google's image search is vastly superior. Google's scale allows them to record which images users click when searching for a specific term, boosting those images the next time someone searches for that term, making it a self-reinforcing loop. Eventually this leads to a high quality source of data for which images belong to which search term. Of course this is only one of the many things Google does to find related images, but it is one of the main reasons why Google's image search almost always wins out.

In step one of the plan, my engine will serve results from DDG and images from Google.

DDG's main site is quite dense and provides a large amount of data back. To limit this I've elected to use DDG lite, which has significantly less styling and promises no JavaScript, making it easier to scrape.

Searching DDG provides an easy interface through a GET request:

https://lite.duckduckgo.com/lite?q=TERM

where TERM is replaced with the term the user would like to search for.
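One detail worth noting is that the term has to be percent-encoded before it goes into the query string. Here's a quick sketch of fetching the page with reqwest and the urlencoding crate (the term and error handling are just for illustration):

```rust
// Sketch: fetch the DDG lite results page for a single term.
// Assumes the reqwest, urlencoding, and tokio crates.
#[tokio::main]
async fn main() -> Result<(), reqwest::Error> {
    let term = "rust async await";
    let url = format!(
        "https://lite.duckduckgo.com/lite?q={}",
        urlencoding::encode(term) // -> rust%20async%20await
    );
    let html = reqwest::get(url).await?.text().await?;
    println!("got {} bytes of html back", html.len());
    Ok(())
}
```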

This works great, giving back an HTML page that can be broken down, but what if we want more than just the initial 30 results?

This then turns into a POST request with a handful of form parameters. At first glance this seems like an easy addition: the "s" and "dc" parameters clearly control the number of results and the pagination.
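Replaying that POST with reqwest's form support looks roughly like this. This is only a sketch: I'm assuming the same endpoint as the GET request, and only showing the fields discussed here; the real form carries more parameters.

```rust
// Sketch: replay the lite POST form to ask for a later page of results.
async fn next_page(term: &str, s: u32, dc: u32) -> Result<String, reqwest::Error> {
    let client = reqwest::Client::new();
    client
        .post("https://lite.duckduckgo.com/lite")
        .form(&[
            ("q", term.to_string()),
            ("s", s.to_string()),   // result offset
            ("dc", dc.to_string()), // also appears tied to pagination
        ])
        .send()
        .await?
        .text()
        .await
}

#[tokio::main]
async fn main() -> Result<(), reqwest::Error> {
    // Example offsets; the exact values DDG expects are still unclear.
    let page_two = next_page("speed test", 30, 31).await?;
    println!("got {} bytes back", page_two.len());
    Ok(())
}
```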

Despite mimicking the parameters, the HTML page seems to not give back the next page's results. Manipulating the "s" and "dc" parameters changes the numbers that appear in the HTML page but not the actual links. I will have to investigate whether this is related to some other sort of sessionization mechanism. For the first test I will just make use of the first 30 results.

The code starts pretty simple. This site is already running on an axum server, so all we have to do is craft a JSON API. For this I made use of reqwest for non-blocking API calls, scraper to parse the DOM performantly, and urlencoding to decode the hrefs on the anchor tags.
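Here is a minimal sketch of what that looks like. The route, struct, and selector wiring are placeholders of mine, and error handling is boiled down to expects rather than anything production-worthy:

```rust
use axum::{extract::Query, routing::get, Json, Router};
use serde::Deserialize;

#[derive(Deserialize)]
struct SearchQuery {
    q: String,
}

// Fetch the DDG lite page for the query, then hand the HTML to a
// synchronous parsing step.
async fn search(Query(query): Query<SearchQuery>) -> Json<Vec<String>> {
    let url = format!(
        "https://lite.duckduckgo.com/lite?q={}",
        urlencoding::encode(&query.q)
    );
    let html = reqwest::get(url)
        .await
        .expect("request to DDG failed")
        .text()
        .await
        .expect("response body was not text");
    Json(parse_results(&html))
}

// Pull the result anchors out of the page and turn their redirect
// hrefs back into real URLs.
fn parse_results(html: &str) -> Vec<String> {
    let doc = scraper::Html::parse_document(html);
    let links = scraper::Selector::parse("a.result-link").unwrap();
    doc.select(&links)
        .filter_map(|a| a.value().attr("href"))
        .filter_map(|href| {
            // DDG wraps the destination in a redirect; decode it and
            // split out the value of the uddg parameter (details below).
            let decoded = urlencoding::decode(href).ok()?;
            let real = decoded.split("uddg=").nth(1)?.split("&rut").next()?;
            Some(real.to_string())
        })
        .collect()
}

#[tokio::main]
async fn main() {
    let app = Router::new().route("/search", get(search));
    let listener = tokio::net::TcpListener::bind("0.0.0.0:3000").await.unwrap();
    axum::serve(listener, app).await.unwrap();
}
```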

That's essentially the entire function: we send a request, parse out the results, then perform a string manipulation on the URLs. This last part is important due to the format of the URLs in DDG's HTML.

The links in the raw response are not what inspect element shows. Querying the DDG endpoint directly gives back anchor tags wrapped in a DDG redirect; the response my parser sees looks like

"a class="result-link" rel="nofollow" href="//duckduckgo.com/l/?uddg=https%3A%2F%2Fwww.speedtest.net%2F&rut=e22e2d095d31b01205099e099b7965b4a292dc8d26f8a4e98054643a5b304744"

To obtain the real link, we first use our urlencoding crate to decode the URL, then use several splits to obtain the original link. In this case that gives us

https://www.speedtest.net/
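Walking through that cleanup on the href above, as a small standalone sketch of the same string manipulation:

```rust
fn main() {
    let href = "//duckduckgo.com/l/?uddg=https%3A%2F%2Fwww.speedtest.net%2F&rut=e22e2d095d31b01205099e099b7965b4a292dc8d26f8a4e98054643a5b304744";

    // 1. Percent-decode the whole href:
    //    //duckduckgo.com/l/?uddg=https://www.speedtest.net/&rut=e22e...
    let decoded = urlencoding::decode(href).expect("href was not valid utf-8");

    // 2. Keep everything after "uddg=", then drop the trailing "&rut=..." part.
    let real = decoded
        .split("uddg=")
        .nth(1)
        .and_then(|rest| rest.split("&rut").next())
        .expect("href did not contain a uddg parameter");

    assert_eq!(real, "https://www.speedtest.net/");
    println!("{real}");
}
```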

Putting this all together with some light JavaScript and HTML, I get something serviceable, shown in the image below.

Unfortunately, only certain terms work. Upon searching "cats" I discover my parser immediately breaks.

got doc Html { errors: ["Found special tag while closing generic tag", "Found special tag while closing generic tag"

Ultimately I believe this to be a flaw in the crate I'm using. Behind the scenes I tested multiple other crates and hit similar errors. For now I'll have to leave this until I either find a fix or switch crates.

A possible solution is to look into Servo's codebase. Servo is a Rust-based browser engine that was mostly abandoned by Mozilla, only ever making it into a collaboration on the Microsoft HoloLens browser after Edge took away augmented reality support.

A browser engine like Servo is likely to have a fast parser; however, there may be better alternatives. For now the search is open to anyone, but do not expect good results, as it will likely fail on most searches.

The other possible issue is rate limiting: with multiple users at the same time I'd likely get blocked extremely fast. For those reasons I'm not directly posting a link to the search on my blog (although it is extremely easy to find).

As I continue to learn about better scraping methods and search engine tech this project will be expanded on.