A Search Proxy in .NET
Introduction
We all use Internet search. Bookmarks have become somewhat outdated now that modern search engines are so good at finding what we want. But what if we need to perform a search in code? We can obviously send an HTTP request to our search engine of choice and then parse the result, but it would be useful if there were a library that could do it for us. And this is how Search.NET was born!
Concepts
First, let's define a contract for a search proxy in .NET. A concrete proxy implementation will be specific to a search engine, such as Google, or Bing, but the contract will be the same. Here is what I propose, a contract named ISearch:
public interface ISearch
{
Task<SearchResult> Search(string query, CancellationToken cancellationToken = default);
Task<SearchResult> Search(string query, QueryOptions options, CancellationToken cancellationToken = default);
}
As you can see, it has essentially one Search method, plus an overload that takes some query options; when no options are supplied, the implementation should assume defaults. Both methods are asynchronous and return a SearchResult.
The QueryOptions class looks like this:
public class QueryOptions
{
public uint? Page { get; set; }
public uint? Size { get; set; }
public string? Site { get; set; }
}
It takes an optional page size (Size) and starting record (Page), as well as a site to which the results should be restricted (Site).
As for SearchResult:
public class SearchResult : IEnumerable<SearchHit>
{
public List<SearchHit> Hits { get; } = new List<SearchHit>();
public int Count => Hits.Count;
public int? TotalCount { get; init; }
IEnumerator<SearchHit> IEnumerable<SearchHit>.GetEnumerator() => Hits.GetEnumerator();
IEnumerator IEnumerable.GetEnumerator() => Hits.GetEnumerator();
}
This class contains the results themselves (Hits) and a count (Count), and possibly the total results (TotalCount). In the future, or in inherited classes, we can return more info.
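To make the difference between Count and TotalCount concrete, here is a small sketch. The SearchHit and SearchResult types are repeated from this article so that the block compiles on its own; the SearchResultDemo class, its method names, and the sample data are hypothetical and only there for illustration.

```csharp
using System.Collections;
using System.Collections.Generic;
using System.Linq;

// Types repeated from the article so the sketch compiles on its own.
public record SearchHit
{
    public required string Title { get; init; }
    public required string Url { get; init; }
    public required string Content { get; init; }
    public string? Image { get; init; }
    public string? Date { get; init; }
}

public class SearchResult : IEnumerable<SearchHit>
{
    public List<SearchHit> Hits { get; } = new List<SearchHit>();
    public int Count => Hits.Count;
    public int? TotalCount { get; init; }
    IEnumerator<SearchHit> IEnumerable<SearchHit>.GetEnumerator() => Hits.GetEnumerator();
    IEnumerator IEnumerable.GetEnumerator() => Hits.GetEnumerator();
}

// Hypothetical helper class, not part of the library.
public static class SearchResultDemo
{
    // Builds a page with two hits out of a (made-up) total of 120 results.
    public static SearchResult Sample() => new SearchResult
    {
        TotalCount = 120,
        Hits =
        {
            new SearchHit { Title = "First", Url = "https://example.com/1", Content = "..." },
            new SearchHit { Title = "Second", Url = "https://example.com/2", Content = "..." }
        }
    };

    // Count is the size of this page; TotalCount may be null if the provider cannot supply it.
    public static string Describe(SearchResult result)
    {
        var total = result.TotalCount?.ToString() ?? "unknown";
        return $"{result.Count} of {total} hits";
    }

    // Because SearchResult is an IEnumerable<SearchHit>, LINQ works on it directly.
    public static List<string> Urls(SearchResult result) => result.Select(hit => hit.Url).ToList();
}
```

Since SearchResult implements IEnumerable<SearchHit>, callers can use foreach or LINQ on the result itself without going through Hits.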
A search result hit is represented by the SearchHit record:
public record SearchHit
{
public required string Title { get; init; }
public required string Url { get; init; }
public required string Content { get; init; }
public string? Image { get; init; }
public string? Date { get; init; }
}
Each hit has some required properties (Title, Url, Content) and some that only exist for some results (Image, Date).
So, a specific implementation must implement ISearch and return results, optionally using the QueryOptions class, or another class inheriting from it.
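As an illustration of what implementing the contract involves, here is a minimal sketch of a hypothetical in-memory provider, the kind of thing that could be handy in unit tests. The contract types are repeated from this article so that the block compiles on its own. InMemorySearch, its default page size of 10, and the interpretation of Page as a zero-based page index are all my assumptions for the example, not something the library defines.

```csharp
using System;
using System.Collections;
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

// Contract types repeated from the article so the sketch compiles on its own.
public record SearchHit
{
    public required string Title { get; init; }
    public required string Url { get; init; }
    public required string Content { get; init; }
    public string? Image { get; init; }
    public string? Date { get; init; }
}

public class SearchResult : IEnumerable<SearchHit>
{
    public List<SearchHit> Hits { get; } = new List<SearchHit>();
    public int Count => Hits.Count;
    public int? TotalCount { get; init; }
    IEnumerator<SearchHit> IEnumerable<SearchHit>.GetEnumerator() => Hits.GetEnumerator();
    IEnumerator IEnumerable.GetEnumerator() => Hits.GetEnumerator();
}

public class QueryOptions
{
    public uint? Page { get; set; }
    public uint? Size { get; set; }
    public string? Site { get; set; }
}

public interface ISearch
{
    Task<SearchResult> Search(string query, CancellationToken cancellationToken = default);
    Task<SearchResult> Search(string query, QueryOptions options, CancellationToken cancellationToken = default);
}

// Hypothetical provider that searches a fixed list of documents.
public sealed class InMemorySearch : ISearch
{
    private const uint DefaultSize = 10; // assumed default page size
    private readonly IReadOnlyList<SearchHit> _documents;

    public InMemorySearch(IEnumerable<SearchHit> documents) => _documents = documents.ToList();

    // The overload without options delegates to the other one with default options.
    public Task<SearchResult> Search(string query, CancellationToken cancellationToken = default)
        => Search(query, new QueryOptions(), cancellationToken);

    public Task<SearchResult> Search(string query, QueryOptions options, CancellationToken cancellationToken = default)
    {
        cancellationToken.ThrowIfCancellationRequested();

        var matches = _documents
            // Case-insensitive match on title or content.
            .Where(d => d.Title.Contains(query, StringComparison.OrdinalIgnoreCase)
                     || d.Content.Contains(query, StringComparison.OrdinalIgnoreCase))
            // Naive site filter: the host must end with the requested site.
            .Where(d => options.Site is null
                     || new Uri(d.Url).Host.EndsWith(options.Site, StringComparison.OrdinalIgnoreCase))
            .ToList();

        // Assumption: Page is a zero-based page index and Size is the page size.
        var size = (int)(options.Size ?? DefaultSize);
        var page = (int)(options.Page ?? 0);

        var result = new SearchResult { TotalCount = matches.Count };
        result.Hits.AddRange(matches.Skip(page * size).Take(size));
        return Task.FromResult(result);
    }
}
```

A real provider would issue an HTTP request and parse the response instead of filtering a list, but the shape is the same: honour the options where possible and fall back to defaults otherwise.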
There is also a global options class, SearchOptions, which you can use when you register the service with the dependency injection (DI) container:
public class SearchOptions
{
public string? UserAgent { get; set; }
public List<string> AcceptLanguages { get; } = new List<string>();
}
This allows setting global parameters, such as the user agent and accept languages. If set, these will be sent as headers on each request. It is up to the search implementation to honour these.
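To show how a provider might honour these options, here is a rough sketch that copies them onto an outgoing HttpRequestMessage. The SearchOptions class is repeated from this article; SearchRequestFactory and its Create method are hypothetical names, and this is not necessarily how the Google provider does it.

```csharp
using System.Collections.Generic;
using System.Net.Http;

// Repeated from the article so the sketch compiles on its own.
public class SearchOptions
{
    public string? UserAgent { get; set; }
    public List<string> AcceptLanguages { get; } = new List<string>();
}

// Hypothetical helper, not part of the library.
public static class SearchRequestFactory
{
    public static HttpRequestMessage Create(string url, SearchOptions options)
    {
        var request = new HttpRequestMessage(HttpMethod.Get, url);

        if (!string.IsNullOrWhiteSpace(options.UserAgent))
        {
            // Browser user agent strings may not pass strict header validation,
            // so the value is added as-is.
            request.Headers.TryAddWithoutValidation("User-Agent", options.UserAgent);
        }

        if (options.AcceptLanguages.Count > 0)
        {
            // Sent as a single comma-separated Accept-Language header, e.g. "pt, en".
            request.Headers.TryAddWithoutValidation("Accept-Language", string.Join(", ", options.AcceptLanguages));
        }

        return request;
    }
}
```

When neither option is set, the request goes out with whatever defaults the HttpClient has.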
Google Search
I implemented a single search provider, for now, and that is for Google. You register it using the AddGoogleSearch extension method:
builder.Services.AddGoogleSearch();
Or, with some global options:
builder.Services.AddGoogleSearch(static options =>
{
options.AcceptLanguages.Add("pt");
options.UserAgent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/129.0.0.0 Safari/537.36 Edg/128.0.2739.90";
});
There are some extension methods over SearchOptions that help with this, namely, setting some of the more popular browsers as the user agent and also adding or clearing the accepted languages:
builder.Services.AddGoogleSearch(static options =>
{
options.AcceptLanguages("pt", "en");
options.SetChromeUserAgent();
});
Now, we need to get a reference to an ISearch to use:
public async Task<IActionResult> Search([FromServices] ISearch search, [FromQuery] string query, CancellationToken cancellationToken)
{
var results = await search.Search(query, cancellationToken);
//do something with the results
return Ok(results);
}
If needed, we can pass query options to filter the results. For Google, we have an extended QueryOptions class:
public class GoogleQueryOptions: QueryOptions
{
public GoogleSearchType? SearchType { get; init; }
}
public enum GoogleSearchType { Video, News, Images, Web }
//...
var results = await search.Search(query, new GoogleQueryOptions
{
Site = "bbc.co.uk",
SearchType = GoogleSearchType.News
}, cancellationToken);
The addition here is the type of the query, which can be videos, news, images, or plain web results. In this example we are also restricting the results to a single site, bbc.co.uk.
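One plausible way for a provider to apply the Site option is Google's site: search operator, appended to the query before it is URL-encoded. The sketch below assumes exactly that; GoogleQueryBuilder and BuildUrl are hypothetical names, and the actual provider may build its request differently.

```csharp
using System;

// Hypothetical helper, not part of the library.
public static class GoogleQueryBuilder
{
    public static string BuildUrl(string query, string? site)
    {
        // Assumption: the site restriction is expressed with the "site:" operator.
        var terms = string.IsNullOrWhiteSpace(site) ? query : $"{query} site:{site}";

        // Escape the whole term so that spaces and the colon are safe in a query string.
        return "https://www.google.com/search?q=" + Uri.EscapeDataString(terms);
    }
}
```

For example, BuildUrl("dotnet news", "bbc.co.uk") yields https://www.google.com/search?q=dotnet%20news%20site%3Abbc.co.uk.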
Now we can go through all of the results:
foreach (var hit in results)
{
//do something with each hit
}
Limitations
The Google provider cannot currently specify how many records to return at a time, nor does it return the total number of results. I'm trying to find a way to implement this.
Behind the scenes, it uses a provider mechanism that tries several implementations to parse out the results. It works best if you set the user agent.
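The idea of trying several parsers can be sketched as a simple fallback chain: each parser gets the HTML in turn, and the first one that produces any hits wins. All the names here (ParsedHit, IResultParser, FallbackParser, DelegateParser) are hypothetical and the hit type is a simplified stand-in for SearchHit; the real provider's internals may look quite different.

```csharp
using System;
using System.Collections.Generic;

// Simplified stand-in for SearchHit, to keep the sketch short.
public record ParsedHit(string Title, string Url);

// Hypothetical parser abstraction: returns an empty list when the markup is not recognised.
public interface IResultParser
{
    IReadOnlyList<ParsedHit> Parse(string html);
}

// Tries each parser in order and returns the first non-empty result.
public sealed class FallbackParser : IResultParser
{
    private readonly IReadOnlyList<IResultParser> _parsers;

    public FallbackParser(params IResultParser[] parsers) => _parsers = parsers;

    public IReadOnlyList<ParsedHit> Parse(string html)
    {
        foreach (var parser in _parsers)
        {
            var hits = parser.Parse(html);
            if (hits.Count > 0)
            {
                return hits;
            }
        }

        // No parser recognised the markup.
        return Array.Empty<ParsedHit>();
    }
}

// Small adapter so that a parser can be written as a lambda.
public sealed class DelegateParser : IResultParser
{
    private readonly Func<string, IReadOnlyList<ParsedHit>> _parse;

    public DelegateParser(Func<string, IReadOnlyList<ParsedHit>> parse) => _parse = parse;

    public IReadOnlyList<ParsedHit> Parse(string html) => _parse(html);
}
```

A design like this makes it easy to add a new parser when the page markup changes, without touching the ones that already work for other layouts.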
Future Work
Some things on the roadmap include:
But it's really too soon to know when, or if, these will be available.
Conclusion
To parse the results in the Google provider, I used AngleSharp and AngleSharp.Css, which are excellent tools for parsing HTML. I made the code fully available on GitHub and published the packages on NuGet.
This is still a work in progress. Let me know if this is useful to you; I'm always keen to hear feedback about my projects!
Source code: https://github.com/rjperes/NetSearch