Good search results matter to the end user, and that means helping them find the best possible search terms. For a commercial site, getting users to the correct product page quickly is a value proposition that directly benefits the revenue stream of the business. With this in mind, could we do better by suggesting possible search results before the user has even finished typing? It's possible with Elasticsearch's completion suggester.
Elasticsearch is a highly scalable full-text search and analytics engine that helps developers build applications that bring out the best in data, whether it's a sophisticated search for products or custom dashboards that can perform complex business intelligence queries. It's open-source, well-supported, and used by industry-leading companies such as eBay, Netflix, Microsoft, and Facebook.
Let's say we're developing a website where users can find restaurants, view menus and booking information, leave reviews and so on. The client wants a search box where users can type a restaurant name and get instant feedback. However, a restaurant could go by different names, or the user may type 'Birmingham' or 'Italian', wishing to search by properties like location or cuisine style. The client expects to scrape restaurant data from around the world, potentially storing millions of records. This article explores one approach to satisfying this requirement using Elasticsearch.
As always, there's a companion Symfony project that offers a working code example.
Integrating Elasticsearch into your Symfony project
Important: The following article was written with Elasticsearch 2.4 in mind. We've since upgraded the code example to be compatible with 5.x and included a Docker environment to run it in. The original is tagged as v1.0; if you're upgrading your own project feel free to reference this commit!
To begin with, let's assume we've yet to install Elasticsearch in our development environment. For Linux/Windows visit the Elasticsearch download page and follow their instructions. For Mac OS X, I'd recommend using Homebrew.
I'd also highly recommend installing Kibana & Sense, which give us a nice web interface to RESTfully query our Elasticsearch instance.
Next, install and configure the FOSElasticaBundle, making sure that we enable the bundle in our AppKernel. At the time of writing there is no stable 4.x release of FOSElasticaBundle, and so I've had to work around the Elastica library constraint to use a release that supports Elasticsearch 2.x. If there is a stable release, feel free to only require FOSElasticaBundle without a version constraint specified.
Here is a minimal FOSElasticaBundle configuration required for us to begin indexing our Restaurant entity. Add it to config.yml, making sure to also define and set elasticsearch_host and elasticsearch_port in parameters.yml (the host default should be 'localhost' and the port default should be 9200).
There are a couple of things going on here. In Elasticsearch, one index can contain many different types of documents, each with their own fields. This is not too dissimilar from Doctrine's own entity manager, and so for our use-case it makes sense to have a single 'app' index where our entities will reside. Our restaurant configuration enables the bundle to hydrate Elasticsearch results from a restaurant type query back into Restaurant objects. It will also listen for any changes to them and update their Elasticsearch document counterparts automatically, since it synchronises the entity and document by primary ID.
Moving on, we still need to define some field mappings so that the bundle knows which Restaurant fields to persist in an Elasticsearch document for us to query on.
Completion suggester mapping
Our goal is to start typing a restaurant name and get suggestions. However, a user may struggle to find 'L'Escargot Blanc' unless they type the name exactly as it begins. This is where analyzers come into play. By default, Elasticsearch will apply simple analysis and lowercase all letters, both when the document is stored at index time and when it is queried at search time (note: this doesn't affect how the field's data is actually stored). For our French restaurant this won't be good enough, as savvy users may start typing 'es' (omitting the leading L', meaning 'the'), in which case it won't be suggested. By applying a number of token filters and wrapping them up in a custom analyzer, we can still suggest restaurants with elisions or accents in the name, even if the user doesn't include them in their search term.
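To make the effect of those token filters concrete, here's a rough Python simulation of lowercasing, elision removal and ASCII folding. It is only a sketch of the idea, not the actual Elasticsearch analyzer or the companion project's configuration: the function names and the list of elided articles are assumptions for illustration, and it splits on whitespace only.

```python
import unicodedata

# Assumed list of French elided articles, for illustration only.
ELISION_ARTICLES = ("l", "d", "m", "t", "qu", "n", "s", "j")


def strip_elision(token):
    """Drop a leading elided article, e.g. "l'escargot" -> "escargot"."""
    head, sep, tail = token.partition("'")
    if sep and head in ELISION_ARTICLES and tail:
        return tail
    return token


def ascii_fold(token):
    """Remove accents by decomposing characters and dropping combining marks."""
    decomposed = unicodedata.normalize("NFKD", token)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def analyze(text):
    """Lowercase, split on whitespace, then apply elision and folding per token."""
    tokens = text.lower().replace("\u2019", "'").split()
    return [ascii_fold(strip_elision(token)) for token in tokens]


print(analyze("L'Escargot Blanc"))  # ['escargot', 'blanc']
print(analyze("Café Crème"))        # ['cafe', 'creme']
```

Because the analyzed form now begins with 'escargot', a user typing 'es' has a prefix that can match.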
Next, we need to store the restaurant name so it can be queried by the user's search term. Rather than simply map the field, we need to provide a data structure for the completion suggester that describes possible input and what the output is. We'll start by creating a function in our Restaurant class to do this.
For now the input and output parameters will simply be the restaurant's name, so when a user starts typing the name will be returned as a suggestion, keeping things nice and consistent. We've also included a payload parameter, which allows us to include whatever we wish (as long as it's serializable to JSON!) and is returned as part of the suggestion result. Most of the time you'll want to include the object's ID, as this will enable you to load the suggested entity without performing another search. This is important since our completion query won't be performed using FOSElasticaBundle's finder service, meaning we won't be relying on the bundle to hydrate the results (because suggestions aren't returned as documents).
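As a language-neutral illustration of that data structure, here's a minimal Python sketch, assuming the Elasticsearch 2.x layout this article was written against. The function name and the sample ID are hypothetical; in the companion project this lives in a method on the Restaurant class.

```python
import json


def build_name_suggest(restaurant_id, name):
    """Return the completion suggester structure for a single restaurant."""
    return {
        "input": [name],                   # what the user may type
        "output": name,                    # what is shown as the suggestion
        "payload": {"id": restaurant_id},  # returned with each suggestion
    }


suggest = build_name_suggest(42, "Pizza Express")
print(json.dumps(suggest))  # the payload must survive JSON serialisation
```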
Let's add our mapping using the new analyzer we configured earlier.
With the mapping in place we can create and populate the index with a console command. Note: you'll want to do this each time you adjust your index and type configuration during development, or if you manually change a value in the database.
Using Sense and executing GET app/restaurant/_search, we can see our restaurants are now stored in Elasticsearch.
Implementing search-as-you-type
Now that our restaurants are indexed correctly we'll need a UI element that takes a user's search term (while they're typing), performs a suggestion query, and returns any results. For brevity I've omitted the Symfony controller actions, as these can be viewed in the companion project.
First we need a suggestions endpoint. The query itself is quite simple: a new completion suggester is created that uses the name_suggest mapping that we defined earlier ('suggest' is just a name that's used to identify different groups of results if you use more than one suggester). We then use the service automatically exposed by the fos_elastica configuration to search restaurant types by wrapping the suggester in a query.
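For reference, here's a sketch of the JSON such a suggester boils down to, written as a Python dictionary and assuming the Elasticsearch 2.x request layout. The companion project builds the equivalent through the bundle's PHP objects rather than by hand, and the function name here is hypothetical.

```python
def build_suggest_body(term, field="name_suggest", name="suggest"):
    """Build a completion suggester request for the user's partial input."""
    return {
        name: {                             # label identifying this group of results
            "text": term,                   # what the user has typed so far
            "completion": {"field": field}  # the mapped completion field
        }
    }


print(build_suggest_body("piz"))
```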
Next I'll be using the Select2 library to supercharge a select box, which will call our suggestions endpoint whenever a user types in it, displaying any restaurants returned from the backend. Let's make it easier for ourselves by mapping the results set to a data structure more suitable for Select2.
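The mapping itself can be sketched as follows. This is a Python illustration rather than the companion project's code: it assumes the Elasticsearch 2.x response shape (a list of groups, each with an options list carrying text and payload) and a Select2 format of a results list whose items have id and text keys. The sample data is made up.

```python
def to_select2(suggest_groups):
    """Flatten completion suggester groups into Select2-style results."""
    results = []
    for group in suggest_groups:
        for option in group.get("options", []):
            results.append({
                "id": option["payload"]["id"],  # entity ID stored in the payload
                "text": option["text"],         # label shown in the dropdown
            })
    return {"results": results}


sample = [{
    "text": "piz",
    "offset": 0,
    "length": 3,
    "options": [
        {"text": "Pizza Express", "score": 1.0, "payload": {"id": 42}},
        {"text": "Pizza Hut", "score": 1.0, "payload": {"id": 7}},
    ],
}]
print(to_select2(sample))
```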
To be able to cleanly specify the endpoint URL we'll use the excellent FOSJsRoutingBundle to expose application routing in the frontend. There are multiple ways to expose routes; in this example I'm using annotations so I'll add the expose option to the endpoint.
Now it's quite straightforward configuring Select2 to use the suggestions endpoint, since we don't have to process the results. When dealing with a large dataset we'd want to extend this implementation to include pagination (as demonstrated in the Select2 ajax example), and increase the minimum input length (3 is recommended) so the potential search space is smaller.
There is a lot of scope for what can be done next, such as custom results formatting (the suggestions payload could include an image URL or description to help with this). I've opted for the simple approach of sending the user off to the restaurant page once they select a result.
In reality, if the user has no suggestions to select from, they should be able to transition to performing a full-text search. Although it's outside the scope of this article, it's worth keeping in mind that we're building a feature that will improve the user's search experience, not something that will replace search entirely.
Improving results
We're off to a good start: our users can start typing a restaurant name and get meaningful suggestions back. But what if they don't quite know the name of the restaurant they're searching for? Perhaps they start typing 'grill'. Why isn't 'PJ's Bar & Grill' returned?
This is because Elasticsearch isn't searching for words in a phrase using a regular match query; rather, it's using something called a finite state transducer (FST). An FST is a big graph that the completion suggester builds once it's fed all possible completions (i.e. our inputs), where each input is broken into individual paths by character. Working from left to right, Elasticsearch traverses the graph until it runs out of input before returning all possible endings of the current path, allowing for extremely fast, low-memory access to valid suggestions. Additionally, common suggestions are not treated with higher relevance, as word frequency isn't taken into account like it would be with a full-text search.
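A plain prefix tree (trie) is enough to illustrate the left-to-right traversal. The sketch below is a deliberate simplification and not how Elasticsearch implements it: a real FST also shares common suffixes and carries weights. The class and method names are hypothetical.

```python
class PrefixIndex:
    """Character-by-character prefix tree over suggestion inputs."""

    def __init__(self):
        self.root = {}

    def add(self, text, output):
        node = self.root
        for ch in text.lower():
            node = node.setdefault(ch, {})
        # The None key marks the end of an input and holds its outputs.
        node.setdefault(None, []).append(output)

    def suggest(self, prefix):
        node = self.root
        for ch in prefix.lower():
            if ch not in node:
                return []  # ran off the graph: no input starts this way
            node = node[ch]
        return sorted(set(self._collect(node)))

    def _collect(self, node):
        found = []
        for key, child in node.items():
            if key is None:
                found.extend(child)
            else:
                found.extend(self._collect(child))
        return found


index = PrefixIndex()
index.add("PJ's Bar & Grill", "PJ's Bar & Grill")
index.add("Pizza Express", "Pizza Express")

print(index.suggest("pi"))     # ['Pizza Express']
print(index.suggest("grill"))  # []
```

No input begins with 'grill', so the traversal fails on the first character and nothing is suggested.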
So, our solution to the problem of the user starting mid-sentence or using related phrases is providing the completion suggester multiple inputs to assist in covering different search phrases. This means we can add our restaurant's city and cuisine styles as inputs so that users can search by these terms and get suggestions.
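Here's a small Python sketch of collecting those inputs. The function name and the sample restaurant are hypothetical, and de-duplicating the values is my own assumption rather than something the completion suggester requires.

```python
def build_inputs(name, city, cuisines):
    """Collect the name, city and cuisine styles as suggester inputs."""
    inputs = []
    seen = set()
    for value in [name, city, *cuisines]:
        cleaned = value.strip()
        key = cleaned.lower()
        if key and key not in seen:  # skip blanks and case-insensitive repeats
            seen.add(key)
            inputs.append(cleaned)
    return inputs


print(build_inputs("PJ's Bar & Grill", "Birmingham", ["Grill", "American"]))
# ["PJ's Bar & Grill", 'Birmingham', 'Grill', 'American']
```

With 'Grill' and 'Birmingham' as inputs of their own, a user typing either term now has a matching prefix.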
We can also control the order in which results are returned. For example, perhaps we'd like to bubble up restaurants with good ratings, or those who have paid to be promoted. Here's a very simple demonstration of how that can be achieved.
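One way to express that idea, sketched in Python: derive an integer weight from the rating and add a bonus for promoted restaurants. The formula, the 0 to 5 rating scale and the bonus value are all assumptions for illustration, not the companion project's logic.

```python
def suggestion_weight(average_rating, is_promoted, promoted_bonus=50):
    """Return a non-negative integer weight; higher weights are suggested first."""
    rating = min(max(average_rating, 0.0), 5.0)  # clamp to the assumed 0-5 scale
    weight = round(rating * 10)
    if is_promoted:
        weight += promoted_bonus
    return weight


print(suggestion_weight(4.5, False))  # 45
print(suggestion_weight(4.5, True))   # 95
```

The resulting value would be stored under the weight key of the suggester data structure, alongside the inputs.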
And here is the final result of our updated completion suggester data structure.
We can also modify the completion suggester query to introduce a factor of fuzziness. This means that if a user starts typing 'pza' instead of 'pizza', they'll still get back suggestions such as 'Pizza Express'. However, fuzziness can be a double-edged sword, as you may find less relevant results are included.
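To get a feel for that trade-off, here's a simplified Python model of fuzzy prefix matching: a candidate matches if some prefix of it is within a given number of edits of what the user typed. This is an approximation for illustration only. Elasticsearch's real implementation differs and has further options, and the restaurant names other than 'Pizza Express' are made up.

```python
def edit_distance(a, b):
    """Plain Levenshtein distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # delete from a
                               current[j - 1] + 1,       # insert into a
                               previous[j - 1] + cost))  # substitute
        previous = current
    return previous[-1]


def fuzzy_prefix_match(term, candidate, max_edits):
    """True if any prefix of candidate is within max_edits of term."""
    term, candidate = term.lower(), candidate.lower()
    return any(edit_distance(term, candidate[:n]) <= max_edits
               for n in range(len(candidate) + 1))


for name in ["Pizza Express", "Plaza Cafe", "Burger Barn"]:
    print(name, fuzzy_prefix_match("pza", name, max_edits=2))
```

In this model, allowing two edits lets 'pza' reach 'Pizza Express', but it also lets in 'Plaza Cafe', which is exactly the kind of less relevant result to watch out for.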
On the subject of relevance, Elastic (the creators of Elasticsearch) recommend the following:
- Continuously log searches. Find the most important/common ones, and add good suggestions first.
- Additionally, log the suggestion(s) selected by the user.
- Refine your inputs based on what users are searching and what they're selecting. See if there are any obvious gaps.
- Use weights with reasonable logic behind them, in accordance with whatever your definition of a best result is.
Rounding up
In the end it was fairly straightforward to implement search-as-you-type functionality thanks to existing open-source libraries, with the query itself being only a handful of lines long. In fact, most of the effort was upfront, in understanding how Elasticsearch and the completion suggester work. Today we only scratched the surface of what is possible, and while (thankfully) the online documentation is great, I can't stress enough the need to fully understand how you wish to index and query your data to get the most out of Elasticsearch.
As for the completion suggester: it is extremely fast, but it does require getting ahead of your user's search terms and fine-tuning inputs to provide the most relevant results (so they don't have to search at all!). For searches that are less predictable in terms of word order, another technique called edge n-grams may be better suited. If your problem space involves dealing with human language (e.g. stemming), deep searching of documents, or cases where relevance is king, then the completion suggester is not a viable path and you'd be better off simply performing a slower but more relevant full-text search.