Every IT team wants to get the right information to employees and vendors as quickly as possible. Yet the task is always getting harder as more information becomes available and results invariably become stale. Disparate internal systems hold vital information. Search capabilities are not consistent across tools. No universal system exists. And even inside Google we can’t use our web search technology, because that assumes a fully public dataset, a lot of traffic, and more active content owners.
All too often, this Sisyphean task ends up requiring huge amounts of manual labor, or leads to inferior results and frustrated people.
At Google we transitioned our internal search to rank results using machine learning models. We found this helps surface the most relevant resources to employees – even when needs change rapidly and new information becomes available.
Our internal search site, Moma, is Googlers’ primary way to find information. It covers a large number of data sources, from internal sites to engineering documentation to the files our employees collaborate on. More than 130,000 users issue queries each week to get their jobs done and to learn the latest about what’s going on at Google.
With COVID-19 and working-from-home changing so much so rapidly, lots of new content and guidance for Googlers was created quickly and needed to be easily accessible and discoverable by all employees. But how to make sure it gets shown?
Before adopting ML for search ranking, we tweaked ranking formulas that had literally hundreds of individual weights and factors for different data sources and signals. Adding new corpora of information and teaching the search engine new terminology was always possible, but laborious in practice. Synonyms, for example, relied on separate datasets that needed manual updating to make sure that searches for “Covid19”, “Covid”, and “Coronavirus” all returned the relevant pages.
The human effort involved in carefully crafting, validating and deploying changes often meant that new content for new topics was slow to rank highly. Even then, search results could be hit-or-miss depending on how users formulated their queries, because writers often don’t know exactly which keywords to use in their content, especially when trends emerge quickly and the terminology is evolving in real time.
We now use ML for scoring and ranking results based on many signals, and our model learns quickly because we continuously train on our own usage logs of the last four weeks. Our team integrated this ranking method in 2018, and it has served us well through recent shifts in search patterns. When new content becomes available for new needs, the model can pick up new patterns and correlations that would otherwise have taken careful manual modelling. This is the fruit of our investments over the last few years, including automatic model releases and validation, measurement and experimentation, which allowed us to get to daily ranking model rollouts.
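As a rough illustration of training on a rolling window of recent usage logs, the sketch below keeps only log rows from the last four weeks before each training run. The row layout (`query`, `timestamp`) and the function name `select_training_window` are assumptions made for this example, not a description of Moma's actual pipeline.

```python
from datetime import datetime, timedelta

# Four weeks, as described above; everything else here is hypothetical.
TRAINING_WINDOW = timedelta(weeks=4)


def select_training_window(log_rows, now):
    """Keep only log rows whose timestamp falls within the last four weeks."""
    cutoff = now - TRAINING_WINDOW
    return [row for row in log_rows if cutoff <= row["timestamp"] <= now]


# Made-up log rows, purely for illustration.
logs = [
    {"query": "covid", "timestamp": datetime(2020, 3, 1)},
    {"query": "wfh equipment", "timestamp": datetime(2020, 4, 10)},
    {"query": "vm access", "timestamp": datetime(2020, 4, 20)},
]

recent = select_training_window(logs, now=datetime(2020, 4, 21))
print([row["query"] for row in recent])
```

Re-running a selection like this before every scheduled training job is one simple way to let older query patterns age out while new ones enter the training set.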
Create training data
Creating training sets is the prerequisite for any application of machine learning, and in this case it’s actually pretty straightforward:
Generate the training data from search logs that capture which results were clicked for which queries.
Choose a simple initial set of model features to keep complexity low and make the model robust. The click-through rate for pages by query and a simple topicality score like TF-IDF can serve as starting points.
Give each click on a document a label of 1, and everything else a label of 0. Each search impression that gets a click should become a training example for the ML model. Don’t aggregate examples by query or anything similar; the model will learn these patterns by itself.
Feed the training data into an ML ranking library, such as TensorFlow Ranking (tensorflow_ranking).
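The steps above can be sketched in a few lines of plain Python. This is a minimal sketch under stated assumptions: the impression layout (`query`, `results`, `clicked`), the function names, and the `topicality` lookup standing in for a TF-IDF score are all hypothetical. In a real setup the click-through rates would likely come from an earlier log period than the labels, to avoid leaking the label into the feature.

```python
from collections import defaultdict


def ctr_by_query_doc(impressions):
    """Click-through rate per (query, doc) pair, computed from logged impressions."""
    shown = defaultdict(int)
    clicked = defaultdict(int)
    for imp in impressions:
        for doc in imp["results"]:
            shown[(imp["query"], doc)] += 1
        for doc in imp["clicked"]:
            clicked[(imp["query"], doc)] += 1
    return {key: clicked[key] / shown[key] for key in shown}


def build_examples(impressions, ctr, topicality):
    """Turn every impression that received a click into one training example."""
    examples = []
    for imp in impressions:
        if not imp["clicked"]:
            continue  # only impressions with at least one click become examples
        rows = []
        for doc in imp["results"]:
            key = (imp["query"], doc)
            rows.append({
                "doc": doc,
                # Two starter features: click-through rate and a topicality score.
                "features": [ctr.get(key, 0.0), topicality.get(key, 0.0)],
                # Clicked documents get label 1, everything else label 0.
                "label": 1 if doc in imp["clicked"] else 0,
            })
        examples.append({"query": imp["query"], "rows": rows})
    return examples


# Made-up log data, purely for illustration.
impressions = [
    {"query": "covid", "results": ["a", "b", "c"], "clicked": ["b"]},
    {"query": "covid", "results": ["a", "b"], "clicked": []},
    {"query": "vpn", "results": ["d", "e"], "clicked": ["d"]},
]
topicality = {("covid", "b"): 0.8, ("vpn", "d"): 0.6}  # stand-in for TF-IDF scores

ctr = ctr_by_query_doc(impressions)
examples = build_examples(impressions, ctr, topicality)
for example in examples:
    print(example["query"], [(r["doc"], r["label"]) for r in example["rows"]])
```

Each example keeps the full result list of one impression together, which is the per-query grouping that listwise and pairwise ranking losses typically expect.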
Once the basics are working, you’ll want to gauge the performance of the model, and improve it. We combine offline analysis – replaying queries from logs and measuring if the clicked results ranked higher on average – and live experimentation, where we divert a share of traffic to a different ranking model for direct comparison. Robust search quality analysis is key, and in practice it’s helpful to consider that higher-up results will always get more clicks (position bias), and that not all clicks are good. When users immediately come back to the search results page to click on something different, that indicates the page wasn’t what they were looking for.
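One way to sketch the offline part of this analysis is to replay logged queries through two scoring functions and compare the mean reciprocal rank (MRR) of the clicked results. MRR is just one possible metric, chosen here for illustration. The 30-second dwell threshold used to discard quick bounce-back clicks, the log layout, and all names are assumptions. The sketch does not correct for position bias, which a real analysis would need to address.

```python
def is_good_click(click, min_dwell_seconds=30.0):
    """Treat a click as good only if the user stayed on the page for a while."""
    return click["dwell_seconds"] >= min_dwell_seconds


def mean_reciprocal_rank(logged_queries, score_fn):
    """Replay logged queries and average 1/rank of the best-ranked good click."""
    total = 0.0
    counted = 0
    for entry in logged_queries:
        good = {c["doc"] for c in entry["clicks"] if is_good_click(c)}
        ranked = sorted(
            entry["candidates"],
            key=lambda doc: score_fn(entry["query"], doc),
            reverse=True,
        )
        rank = next((i for i, doc in enumerate(ranked, start=1) if doc in good), None)
        if rank is None:
            continue  # no good click among the candidates: skip this query
        total += 1.0 / rank
        counted += 1
    return total / counted if counted else 0.0


# Made-up logs: the click on "a" is a quick bounce, the click on "c" is a good one.
logged = [
    {"query": "covid", "candidates": ["a", "b", "c"],
     "clicks": [{"doc": "a", "dwell_seconds": 3}, {"doc": "c", "dwell_seconds": 120}]},
    {"query": "vpn", "candidates": ["d", "e"],
     "clicks": [{"doc": "d", "dwell_seconds": 60}]},
]

# Two hypothetical rankers, represented as fixed score tables.
baseline_scores = {"a": 3, "b": 2, "c": 1, "d": 1, "e": 2}
candidate_scores = {"a": 1, "b": 2, "c": 3, "d": 2, "e": 1}

baseline_mrr = mean_reciprocal_rank(logged, lambda q, d: baseline_scores[d])
candidate_mrr = mean_reciprocal_rank(logged, lambda q, d: candidate_scores[d])
print(f"baseline MRR={baseline_mrr:.3f}, candidate MRR={candidate_mrr:.3f}")
```

A higher MRR for the candidate model suggests that it places well-received results nearer the top, which is the kind of signal that would justify a live experiment on a share of traffic.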
Expanding the model
With more signals and page attributes available, you can train more sophisticated models that consider e.g. page popularity, freshness, content type, data source or even user attributes like their job role. When structured data is available, it can make for powerful features, too. Word embeddings can outperform manually defined synonyms while reducing reliance on human curation, especially on the “long tail” of search queries.
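To illustrate how embeddings can stand in for a hand-maintained synonym list, the sketch below scores term similarity with cosine similarity. The three-dimensional vectors are toy values invented for this example; real embeddings would come from a trained model and have many more dimensions.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Toy embeddings, invented for illustration only.
TOY_EMBEDDINGS = {
    "covid": [0.9, 0.1, 0.0],
    "coronavirus": [0.8, 0.2, 0.1],
    "laptop": [0.0, 0.1, 0.9],
}


def term_similarity(term_a, term_b, embeddings=TOY_EMBEDDINGS):
    """Similarity of two terms, usable as one more ranking feature."""
    return cosine_similarity(embeddings[term_a], embeddings[term_b])


related = term_similarity("covid", "coronavirus")
unrelated = term_similarity("covid", "laptop")
print(f"covid~coronavirus: {related:.2f}, covid~laptop: {unrelated:.2f}")
```

A score like this can be added as a model feature alongside click-through rate and topicality, so that related terms match without anyone curating a synonym table.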
Running machine learning in production with regular model training, validation and deployment isn’t trivial, and comes with quite a learning curve for teams new to the technology. TFX does a lot of the heavy lifting for you, helping to follow best practices and to focus on model performance rather than infrastructure.
The ML-driven approach allows us to have a relatively small team that doesn’t have to tweak ranking formulas and perform manual optimizations. We can operate driven by usage data only, and don’t employ human raters for internal search.
This ultimately enabled us to focus our energy on identifying user needs and emerging query patterns from search logs in real time, using statistical modelling and clustering techniques. Equipped with these insights, we consulted partner teams across the company on their content strategy and delivered tailor-made, personalized search features (called Instant Answers) to get the most helpful responses in front of Googlers where they needed them most.
For example, we could spot skyrocketing demand for (and issues with!) virtual machines and work-from-home IT equipment early, influencing policy, spurring content creation and informing custom, rich promotions in search for topical queries. As a result, 4 out of 5 Googlers said they find it easy to find the right information on COVID-19, working from home, and updated company services.
Give it a try
Interested in improving your own search results? Good! Let’s put the pieces together. To get started you’ll need:
Detailed logging, ranking quality measurements and integrated A/B testing capabilities. These are the foundations to train models and evaluate their performance. Frameworks like Apache Beam can be very helpful to process raw logs and generate useful signals from them.
A ranking model built with TensorFlow Ranking, based on usage signals. In many open source search systems like Elasticsearch or Apache Solr, you can modify, extend or override scoring functions, which can allow you to plug your model into an existing system.
Production pipelines for model training, validation and deployment using TFX.
We want to acknowledge Anton Krohmer, Senior Software Engineer, who contributed technical insight and expertise to this post.