Websolr

Search made easy

RubyGems.org — A case study in upgrading to full-text search (Part 2)

Part 2: Choosing a client and indexing the data

The changes made to upgrade the RubyGems search were straightforward, and I’m happy to break them down for you. As you’ll see, the actual code changes were minimal: it took me about one afternoon to make them, and another to polish up the tests and documentation so that other developers in the community could understand what was going on.

Step one: choosing a Solr client

Sunspot is arguably the most popular Solr client for Ruby applications. It’s got a nice DSL syntax for configuring your models to be indexed by Solr, lots of helpful ActiveRecord hooks to keep your index up to date as changes get made, and a search DSL for generating complex queries.

Another decent option, if you’re somewhat familiar with Solr’s APIs, is to use RSolr. RSolr is an extremely minimal wrapper around the Solr API itself. In fact, Sunspot itself uses RSolr under the hood. Personally, I alternate between the two, or even combine them, based on the needs of any particular project.

For a Ruby application, the advantage of RSolr is its minimalism. You very clearly control every interaction point between your application and the Solr server, and if you are content to work with plain Ruby hashes, you end up with a smaller, more streamlined codebase with less “magic.”
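
To make the “hashes in Ruby” point concrete, here is a minimal sketch of the kind of plain-hash document and query parameters you might hand to RSolr. The field names and the two helper methods are hypothetical, not taken from the RubyGems code. The RSolr calls appear only as comments, since they need the rsolr gem and a running Solr server, and they follow the style of recent RSolr releases.

```ruby
# Hypothetical helper: turn a gem's attributes into a flat hash for Solr.
# Field names (id, name, summary) are illustrative, not the RubyGems schema.
def gem_to_solr_doc(id, name, summary)
  { :id => "Rubygem #{id}", :name => name, :summary => summary }
end

# Hypothetical helper: query parameters are just another hash.
def search_params(query, rows = 10)
  { :q => query, :rows => rows }
end

doc    = gem_to_solr_doc(1, "rails", "Full-stack web framework")
params = search_params("name:rails")

# With the rsolr gem installed and Solr running, the interaction would look
# roughly like this (assumed URL, recent RSolr versions):
#   solr = RSolr.connect(:url => "http://localhost:8983/solr")
#   solr.add(doc)
#   solr.commit
#   solr.get("select", :params => params)
```

Nothing here is hidden behind a DSL, which is the trade-off: you see every field and parameter, but you also write all of them yourself.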

That said, if you’re in a similar position to most developers, you’ll probably benefit from the Sunspot approach of abstracting away the details for you. In particular, Sunspot provides some good tools for self-hosting a local Solr instance in development mode.

Ultimately I chose to use Sunspot for the RubyGems site, for the benefit of other volunteer developers who may have less hands-on experience with Solr itself.

Step two: installing Sunspot

As of this writing, the RubyGems site is a Rails 3 application, and the current version of Sunspot is 1.2.1. As of its 1.2 release, Sunspot is really easy to install into a Rails 3 application. While Sunspot itself is designed to be framework agnostic, it provides the sunspot_rails gem to do all of the interfacing work with a Rails application.

To install Sunspot in your Rails application, simply add this line to your Gemfile:

gem 'sunspot_rails'

After running bundle install to download and install the gem, you’re all set!

Step three: thinking about how search is used

Before we dive into the model code, let’s pause and think about how search is being used.

When designing your search functionality, it is important to consider in advance some of the use cases you’ll be supporting. Solr does most of its heavy lifting at indexing time — you send over your data, Solr pre-processes it, and then you can search it. That gives you great query time performance at the cost of some up-front processing.
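
As a rough illustration of that trade-off, here is a toy inverted index in plain Ruby. It is only a sketch of the general idea, not how Solr is implemented: the `ToyIndex` class and its methods are made up for this example. All of the tokenizing happens when a document is added, so a search is nothing more than a hash lookup.

```ruby
# Toy inverted index: all tokenizing happens once, at indexing time.
class ToyIndex
  def initialize
    # Maps each term to the list of document ids that contain it.
    @postings = Hash.new { |hash, term| hash[term] = [] }
  end

  # Index-time work: lowercase, split into words, record which doc has each.
  def add(doc_id, text)
    text.downcase.scan(/\w+/).uniq.each { |term| @postings[term] << doc_id }
  end

  # Query-time work: a single hash lookup per term.
  def search(term)
    @postings.fetch(term.downcase, [])
  end
end

index = ToyIndex.new
index.add(1, "Rake is a Make-like build tool")
index.add(2, "A build system for Ruby gems")

index.search("build") # => [1, 2]
index.search("Ruby")  # => [2]
```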

Solr itself is even pretty quick about that pre-processing phase, but it does mean that your application is liable to spend a lot of time collecting and serializing and uploading your data over to Solr itself. This is a simple fact of life for most any indexing search server, and so it’s important to keep that in mind to avoid spending too much time indexing unnecessarily.

When looking at your use cases, a great place to start is the logs for an existing search page. That tells you how users are already using the search for your site.

You should also consider ways to learn about how users are trying to use your search, but not receiving the results they’re looking for. Keeping a log of searches that don’t return results, or tracking which searches are receiving clicks or not, would be a fantastic idea for any application with heavy search usage.
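
A minimal sketch of that kind of logging might look like the following. The `SearchLog` class is hypothetical and keeps everything in memory; a real application would persist these counts somewhere and would probably record clicks as well.

```ruby
# Hypothetical in-memory tracker for searches that return no results.
class SearchLog
  def initialize
    @misses = Hash.new(0)
  end

  # Call this after every search with the number of results it returned.
  def record(query, result_count)
    key = query.strip.downcase
    @misses[key] += 1 if result_count.zero?
  end

  # Queries that returned nothing, most frequent first.
  def top_misses(limit = 10)
    @misses.sort_by { |query, count| [-count, query] }.first(limit).map(&:first)
  end
end

log = SearchLog.new
log.record("rails", 42)
log.record("Ralis", 0)
log.record("ralis ", 0)
log.record("activerecrod", 0)

log.top_misses # => ["ralis", "activerecrod"]
```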

Finally, keep in mind that your users’ behavior might already be trained based on your site’s current behavior. In this case, or when you’re starting from scratch, it may be up to your ingenuity as a developer or entrepreneur to design from your own personal insight into the domain.

Step four: configure your models with Sunspot

Starting with the simple

For the RubyGems site, Nick Quaranto provided a list of the top 500 search queries. That’s a great baseline to start from. It was easy to see that most searches were simply for the name of a particular gem.

So, for starters, we need to index the name of a gem. Pretty straightforward. In Sunspot, that might be as simple as the following:

class Rubygem < ActiveRecord::Base
  searchable do
    text :name
  end
end

It’s worth noting at this point that Solr has support for different data types. In this case, Sunspot is using the value from the model’s name attribute, and sending it over to Solr as a text field.

The text field is Solr’s bread and butter — the values for these fields get all the fancy tokenization and pre-processing that you generally associate with full-text search. I’ll write more on that in future articles here.

Moving on, we also considered the existing code, which searches against the summary of the most recent version of a gem. Because a Solr index is fundamentally a flat collection of documents, with no formal associations between them, we have to denormalize this data from the associated object onto the object that we’re searching.

Fortunately Sunspot lets us supply a block for it to evaluate at run-time when it prepares the document to index. This lets us perform exactly the kind of denormalization we’re looking for in this case:

class Rubygem < ActiveRecord::Base
  searchable do
    text :name
    text :summary do
      # try returns nil instead of raising when a gem has no versions yet
      versions.most_recent.try(:summary)
    end
  end
end
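
To see what that denormalization amounts to outside of Sunspot, here is a plain-Ruby sketch that flattens an associated version’s summary onto the gem’s own document. The `GemVersion` and `GemRecord` structs and the `to_search_document` method are stand-ins invented for this example; they are not the RubyGems models.

```ruby
# Stand-ins for the ActiveRecord models, for illustration only.
GemVersion = Struct.new(:number, :summary)

GemRecord = Struct.new(:name, :versions) do
  # Pick the highest version number; nil when there are no versions.
  def most_recent_version
    versions.max_by { |version| Gem::Version.new(version.number) }
  end

  # Flatten the associated version's summary onto the gem's own document.
  def to_search_document
    latest = most_recent_version
    { :name => name, :summary => (latest && latest.summary) }
  end
end

rake = GemRecord.new("rake", [
  GemVersion.new("0.8.7", "Old summary"),
  GemVersion.new("0.9.0", "Ruby based make-like utility"),
])

rake.to_search_document
# => {:name=>"rake", :summary=>"Ruby based make-like utility"}
```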

The final full example

For the final RubyGems search enhancements, we took advantage of this denormalization to index more data from associated gem versions, including author names, and gem dependencies. We also index the number of downloads on a gem, and a boolean flag to give us a coarse control over whether a gem should show in search results at all.

My final searchable block then looks like this:

searchable do
  text :name, :as => 'rubygem_name'
  text :authors do
    versions.most_recent.try(:authors)
  end
  text :description do
    versions.most_recent.try(:description)
  end
  text :summary do
    versions.most_recent.try(:summary)
  end
  text(:dependencies, :as => 'dependency_name') do
    if versions.most_recent
      versions.most_recent.dependencies.collect(&:rubygem).collect(&:name)
    end
  end
  integer :downloads
  boolean :indexed do
    versions.indexed.most_recent.present?
  end
end

The usage here has some additional Sunspot and Solr features. Let’s walk through those briefly:

Custom field tokenizing
text :name, :as => 'rubygem_name'

In our case here, I created a new kind of field in our Solr schema.xml called a name. It’s essentially the same as a text field, except we tokenize it a bit differently. Instead of the standard tokenization used for general full text, which primarily splits on whitespace, we are instead splitting on dashes and underscores, which is more useful for the well-defined format of gem names.

The same applies to the dependency_name field, which lists the names of the gems that this gem depends on.

I would call this an intermediate usage of Sunspot and Solr, since it requires a customization to be made to Sunspot’s otherwise extremely generic and flexible default schema.xml.
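
The difference between the two tokenizing strategies can be sketched in a few lines of plain Ruby. This is a deliberate simplification: Solr’s real analysis chain is configured in schema.xml and does more than split strings, and both method names below are made up for this example.

```ruby
# General text: split on whitespace only (a simplification of Solr's analysis).
def whitespace_tokens(text)
  text.downcase.split(/\s+/)
end

# Gem names: also split on dashes and underscores.
def gem_name_tokens(name)
  name.downcase.split(/[\s\-_]+/)
end

whitespace_tokens("sunspot_rails") # => ["sunspot_rails"]
gem_name_tokens("sunspot_rails")   # => ["sunspot", "rails"]
gem_name_tokens("rack-test")       # => ["rack", "test"]
```

With the second strategy, a search for “rails” can match sunspot_rails, which whitespace splitting alone would miss.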

More associated model data

Authors, description, summary and dependencies are all coming from the most recent gem version. Dependencies even goes a bit further, stepping out along a few more associations to collect the names of gems declared as dependencies for this gem.

Tracking the downloads

We index the number of times that a gem has been downloaded, so we can later ask Solr to factor this into the ordering of our search results. All else equal, we would like a gem with more downloads to show higher in the search results than a gem with fewer downloads.
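
The “all else equal” idea can be sketched in plain Ruby as a tie-break. This is not how Solr combines relevance and downloads; the scores below are made up, and the `rank` method is a hypothetical helper that only shows the intended ordering.

```ruby
# Each hit carries a relevance score (made-up numbers) and a download count.
hits = [
  { :name => "rack-test",  :score => 1.0, :downloads => 500 },
  { :name => "rack",       :score => 2.0, :downloads => 900 },
  { :name => "rack-cache", :score => 1.0, :downloads => 800 },
]

# Sort by relevance first; downloads only break ties between equal scores.
def rank(hits)
  hits.sort_by { |hit| [-hit[:score], -hit[:downloads]] }
end

rank(hits).map { |hit| hit[:name] }
# => ["rack", "rack-cache", "rack-test"]
```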

Controlling indexing

When a gem gets its only published version yanked, we need a way to quickly drop it from the search results. I’m going for a pretty coarse mechanism to control this, indexing a boolean true or false on whether there even is a most recent version of the gem that should be indexed. This allows me quite a bit of confidence that we’re not going to let anything leak out that shouldn’t show up in search results.

Step five: Indexing your data for the first time

The above is all we needed to match the current state of RubyGems search, plus throw in a few upgrades in terms of performance and more flexible querying. Now might be a good time to run an index and test things out.

Sunspot provides some Rake tasks to start and stop a local development Solr instance. You can run rake -T sunspot to see the documentation for its Rake tasks. To start a Solr server in the background for your development environment, you can run the following command:

rake sunspot:solr:start

Now with a local Solr server up and running, you can run another rake task to index your data:

rake sunspot:reindex

When you’re simply using your application, Sunspot provides basic after_save and after_destroy hooks to make sure your ActiveRecord objects send their changes over to Solr when they’re created, updated or removed in the database. That helps keep your search index up to date with the contents of your database.
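
The effect of those hooks can be sketched with a toy model and a toy index in plain Ruby. Both classes are invented for this example and involve neither ActiveRecord nor Sunspot; they only show the pattern of touching the index whenever a record is saved or destroyed.

```ruby
# Toy stand-in for the Solr index: a hash keyed by record id.
class FakeIndex
  attr_reader :docs

  def initialize
    @docs = {}
  end

  def add(id, doc)
    @docs[id] = doc
  end

  def remove(id)
    @docs.delete(id)
  end
end

# Toy model: saving reindexes the record, destroying removes it.
class ToyGem
  attr_accessor :name

  def initialize(id, name, index)
    @id, @name, @index = id, name, index
  end

  def save
    # ...persist to the database here...
    @index.add(@id, { :name => name }) # plays the role of the save hook
    self
  end

  def destroy
    # ...delete from the database here...
    @index.remove(@id)                 # plays the role of the removal hook
    self
  end
end

index  = FakeIndex.new
record = ToyGem.new(1, "rake", index).save
index.docs # => {1=>{:name=>"rake"}}
record.destroy
index.docs # => {}
```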
