Search made easy


Sunspot 2.0 pre-release

[At websolr, we help maintain some open source Solr clients. Here’s an announcement cross-posted to the Sunspot user mailing list about the first pre-release of Sunspot 2.0.]

In order to channel all the momentum that went into the recent Sunspot 1.3.0 release, I intend to release semi-frequent (every other week or so) pre-release gems of Sunspot 2.0. With the first pre-release, I figure it’s a good idea to let everyone know a bit more about what I, and other contributors, have been thinking with respect to Sunspot’s current trajectory.

Why the major version bump to 2.0 so soon after 1.3.0? A few reasons.

First, Sunspot is now being developed against Solr 3. That’s a pretty big-sounding change, which in practice hasn’t had any major impact, but it’s worth attaching some more visibility to the fact. You’ll want to reindex your data to get the most out of Solr 3.

Along with this update to Solr 3, we are taking the liberty of making some updates to the standard Sunspot configurations, and I definitely feel like changes to those configurations warrant a major version bump. This library would be no fun to use if you had to worry about merging your schema.xml tweaks with every release. That said, it’s my hope that you’ll be able to clobber your development solrconfig.xml, and have a clear picture of what you need to do to maintain your schema.xml against the latest Sunspot canonical schema.

With that out of the way, here’s a quick tour of the features that are being worked on right now:

Spatial Search: Based on the work and feedback of @brupm, @ericxtang, @justinko and @alindeman on GitHub. We’re (finally!) integrating with the official Solr 3 Spatial Search API, replacing the Solr 1.4-era plugins and client-side geohashing. Its usage is documented in the main README on https://github.com/sunspot/sunspot#readme, the diff is here: https://github.com/sunspot/sunspot/pull/144, and the official Solr docs are here: http://wiki.apache.org/solr/SpatialSearch

Field Collapsing: A fun bit of functionality that lets you show the top N search results per some category or attribute. Implemented by @alindeman. Usage also documented in the main README, the diff is here: https://github.com/sunspot/sunspot/pull/136, and the Solr docs are here: http://wiki.apache.org/solr/FieldCollapsing

So when will 2.0 be released?

No clue. It’s still a ways out. There are a few Issues on GitHub tagged for development against 2.0 — I think once the configuration changes and anything backwards-incompatible are behind us, we’ll focus on packaging things up for a release. Enhancements that are backwards compatible may be reserved for minor-version releases, to help keep the main 2.0 release from getting too bogged down.

If you’re wondering if some specific Issue or Pull Request will be addressed in 2.0, I encourage you to bring it up on GitHub.

All that said, the pre-releases should be quite stable, now that we have some excellent integration testing on TravisCI. I’m moving a few apps of mine over to this pre-release over the next couple of days, and I encourage you to do the same. If you have any feedback on documentation and installation, or notice any bugs that need fixing, let us know in a Pull Request or new Issue on the GitHub repository.


Oh, and one more thing…

Did you know there’s a @sunspot_ruby Twitter account? Because there is. You should follow it for periodic non-annoying tweets and retweets of Sunspot- and Solr-related news and trivia.


RubyGems.org — A case study in upgrading to full-text search (Part 3)

Part 3: Searching the index

The simplest case

With our models configured, and some initial indexing having taken place, it’s time to actually use Solr to do what it was built for: searching that data! Here is what a simple controller action might look like:

class ArticlesController < ApplicationController

  def index
    @search = Article.search do
      keywords params[:q]
    end
    @articles = @search.results
  end

  # ...

end


Sunspot defines a search method on the model, which accepts a block that provides a DSL for building Solr queries. In the simplest case, we are passing a query to the keywords method. This query gets applied to all of the text fields that we have sent over to Solr. Pretty straightforward!

The search method returns an object which I am saving here to a @search instance variable. It’s got some extra metadata about the search itself, in addition to the results, which I am storing in the @articles instance variable.

The RubyGems case

That was the simple example, already quite an upgrade for most applications. But let’s continue and take a look at where we ended up for RubyGems.org.

def self.search(query, options={})
  options = {
    :page => 1
  }.merge(options)

  self.solr_search(:include => :versions) do
    keywords query do
      minimum_match 0
      boost_fields :gem_name => 100.0, :authors => 2.0, :dependency_name => 0.5
      boost(function { :downloads })
    end
    with(:indexed, true)
    paginate :page => options[:page]
  end
end

Right away, you can see that we’re defining our own search class method. Sunspot itself actually defines solr_search and creates the search alias if you don’t have your own search method. This let me encapsulate the new Solr search into a method that maintains the syntax and assumptions already present in the application.

Next, we specify a default page number for our pagination. Page one is a good place to start. Solr itself supports pagination, and Sunspot will return its results in a collection that is compatible with the will_paginate gem, when present.

As we call the solr_search method, we give it a parameter to ask that Sunspot later perform an eager join on the versions association when it fetches the search results from the database. (Values from these Version objects are used later in the search results interface when displayed to the user.)

Within the solr_search block, we start with our basic keywords query. In this case, we’re providing it a block that specifies some behavior of the keywords query — specifically, behavior of Solr’s DisMax Query Parser.

  • We provide a minimum_match of 0 — essentially, treating all the search terms as optional rather than mandatory in matching results. (To learn more about this Solr feature, see my article on minimum match and boolean querying in Solr.)
  • We specify that certain fields receive a boost relative to their importance. In this case, we’re giving matches of the name itself a heavy weight relative to the authors and dependencies.
  • Finally, we specify an extra boost function that multiplies the score of a result against its downloads count, to roughly sort by download count.

The boosting in particular is pretty naive, but it’s a great place for us to start in tweaking the relevance ordering of our search results.
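To build some intuition for how a multiplicative boost function reorders results, here’s a toy, self-contained sketch — plain Ruby, not Sunspot or Solr code, with made-up names and numbers:

```ruby
# Toy illustration: a multiplicative boost scales each document's
# keyword score by its download count, so a hugely popular gem can
# outrank a slightly better keyword match.
Result = Struct.new(:name, :keyword_score, :downloads)

results = [
  Result.new("rails-clone", 1.2, 100),
  Result.new("rails", 1.0, 500_000),
]

# Sort descending by the boosted score.
ranked = results.sort_by { |r| -(r.keyword_score * r.downloads) }
ranked.map(&:name) # => ["rails", "rails-clone"]
```

Solr applies the same idea inside its scoring pass, which is why a download-count boost gives us a rough popularity ordering for free.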

Finally, we call the results method against our Sunspot search result object, which fetches our search results from the database.

See for yourself!

As of this writing, the gem search described above is deployed as a demo application on Heroku at http://gemcutter-solr.heroku.com/.

Search for a few of your favorite gems, or gem authors. Compare that against http://rubygems.org/ and if you have any ideas on how search at one of our favorite community tools can be further improved, I’m all ears!

Likewise, if you would like to comment on my pull request, which as of this writing is still open for feedback, your thoughts would be welcome.

Back to your app — Why accept less?

Personally, I am of the opinion that we as developers and entrepreneurs set much too low a bar for ourselves when it comes to the quality of our search pages. Whether it’s simple latency, unintuitively rigid query syntax, or largely irrelevant search results, our users deserve better.

RubyGems has lucked out in terms of usability thus far, insofar as users have adapted to searching only for gem names. But I think we can do better.

Even for those of us who are experienced with the power and potential of open source full-text search engines like Apache Solr, it has historically carried too much cost in terms of time and expertise to properly set up and manage such a service. And even when the budget is there, there is the simple matter of having the specialized man-hours available for all the standard ancillary issues that go into quality system administration.

Given that context, it is lamentable, but understandable, that most developers and clients simply resign themselves to a lower standard of search quality.

As a developer myself, I have experienced this phenomenon too many times over the years, and it’s exactly this compromise in quality that I want to confront. It’s what motivates me as I help to build out quality hosted full-text search, powered by Apache Solr, as one of the co-founders over at Websolr.

In taking the complexity and cost out of hosting a powerful open source search technology, I’m hoping we can inspire you to raise the bar on quality for your users and customers.

An open offer

Working on the site search was a lot of fun for me personally, particularly considering my own history within the open source and Ruby communities. Over at Websolr, we love having the opportunity to give back to the developer community at large, to whom we owe so much.

So if you manage, or use, a similar community tool that could benefit from some upgrades to its search, even if just to make things snappy again, be sure to get in touch! We’re happy to sponsor such sites with free hosted Solr search, and some consulting to help you make the most of it.

If you have more questions about Solr, or just want to chat about life, you can drop us a line any time at info@onemorecloud.com.


RubyGems.org — A case study in upgrading to full-text search (Part 2)

Part 2: Choosing a client and indexing the data

The changes made to upgrade the RubyGems search were pretty straightforward, and I’m happy to break it all down for you. As you’ll see, the actual code changes were pretty minimal. It took me about one afternoon to make the changes, and another to polish up the tests and documentation to make it easy for other developers in the community to understand what was going on.

Step one: choosing a Solr client

Sunspot is arguably the most popular Solr client for Ruby applications. It’s got a nice DSL syntax for configuring your models to be indexed by Solr, lots of helpful ActiveRecord hooks to keep your index up to date as changes get made, and a search DSL for generating complex queries.

Another decent option, if you’re somewhat familiar with Solr’s APIs, is to use RSolr. RSolr is an extremely minimal wrapper around the Solr API itself. In fact, Sunspot itself uses RSolr under the hood. Personally, I alternate between the two, or even combine them, based on the needs of any particular project.

For Ruby applications, the advantage offered by RSolr would be that of a much more minimal code base. You very clearly control all of the interaction points between your application and the Solr server. If all you care to work with are hashes in Ruby, then you can create a much smaller and more streamlined codebase with less “magic.”

That said, if you’re in a similar position to most developers, you’ll probably benefit from the Sunspot approach of abstracting away the details for you. In particular, Sunspot provides some good tools for self-hosting a local Solr instance in development mode.

Ultimately I chose to use Sunspot for the RubyGems site, for the benefit of other volunteer developers who may have less hands-on experience with Solr itself.

Step two: installing Sunspot

As of this writing, the RubyGems site is a Rails 3 application, and the current version of Sunspot is 1.2.1. As of its 1.2 release, Sunspot is really easy to install into a Rails 3 application. While Sunspot itself is designed to be framework agnostic, it provides the sunspot_rails gem to do all of the interfacing work with a Rails application.

To install Sunspot in your Rails application, simply add this line to your Gemfile:

gem 'sunspot_rails'

After running bundle install to download and install the gem, you’re all set!

Step three: thinking about your use cases

Before we dive into the model code, let’s pause and think about how search is being used.

When designing your search functionality, it is important to consider in advance some of the use cases you’ll be supporting. Solr does most of its heavy lifting at indexing time — you send over your data, Solr pre-processes it, and then you can search it. That gives you great query time performance at the cost of some up-front processing.

Solr itself is even pretty quick about that pre-processing phase, but it does mean that your application is liable to spend a lot of time collecting, serializing, and uploading your data to Solr. This is a simple fact of life for almost any indexing search server, so it’s important to keep it in mind to avoid spending too much time indexing unnecessarily.

When looking at your use cases, a great place to start is the logs for an existing search page. That tells you how users are already using the search for your site.

You should also consider ways to learn about how users are trying to use your search, but not receiving the results they’re looking for. Keeping a log of searches that don’t return results, or tracking which searches are receiving clicks or not, would be a fantastic idea for any application with heavy search usage.

Finally, keep in mind that your users’ behavior might already be trained based on your site’s current behavior. In this case, or when you’re starting from scratch, it may be up to your ingenuity as a developer or entrepreneur to design from your own personal insight into the domain.

Step four: configure your models with Sunspot

Starting with the simple

For the RubyGems site, Nick Quaranto provided a list of the queries for the top 500 searches. That’s a great baseline to start from. It was easy to see that most pages were simply for the name of a particular gem.

So, for starters, we need to index the name of a gem. Pretty straightforward. In Sunspot, that might be as simple as the following:

class Rubygem
  searchable do
    text :name
  end
end

It’s worth noting at this point that Solr has support for different data types. In this case, Sunspot is using the value from the model’s name attribute, and sending it over to Solr as a text field.

The text field is Solr’s bread and butter — the values for these fields get all the fancy tokenization and pre-processing that you generally associate with full-text search. I’ll write more on that in future articles here.

Moving on, we also considered the existing code, which searches against the summary of the most recent version of a gem. Because Solr is a fundamentally flat collection of documents, with no formal associations between them, we have to denormalize this data from the associated object onto the object that we’re searching.

Fortunately Sunspot lets us supply a block for it to evaluate at run-time when it prepares the document to index. This lets us perform exactly the kind of denormalization we’re looking for in this case:

class Rubygem
  searchable do
    text :name
    text :summary do
      versions.most_recent.summary if versions.most_recent
    end
  end
end

The final full example

For the final RubyGems search enhancements, we took advantage of this denormalization to index more data from associated gem versions, including author names, and gem dependencies. We also index the number of downloads on a gem, and a boolean flag to give us a coarse control over whether a gem should show in search results at all.

My final searchable block then looks like this:

searchable do
  text :name, :as => 'rubygem_name'
  text :authors do
    versions.most_recent.authors if versions.most_recent
  end
  text :description do
    versions.most_recent.description if versions.most_recent
  end
  text :summary do
    versions.most_recent.summary if versions.most_recent
  end
  text(:dependencies, :as => 'dependency_name') do
    versions.most_recent.dependencies.map { |d| d.rubygem.name } if versions.most_recent
  end
  integer :downloads
  boolean :indexed do
    versions.most_recent.present?
  end
end

The usage here has some additional Sunspot and Solr features. Let’s walk through those briefly:

Custom field tokenizing
text :name, :as => 'rubygem_name'

In our case here, I created a new kind of field in our Solr schema.xml for gem names. It’s essentially the same as a text field, except we tokenize it a bit differently: instead of the standard tokenization used for general full text, which primarily splits on whitespace, we split on dashes and underscores, which is more useful for the well-defined format of gem names.
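That splitting behavior can be sketched in plain Ruby — purely an illustration, since the real tokenization happens inside Solr’s analyzer, not in your application:

```ruby
# Illustrative only: split a gem name the way our custom field type
# does, on dashes and underscores instead of just whitespace.
def gem_name_tokens(name)
  name.downcase.split(/[-_]/)
end

gem_name_tokens("sunspot_rails") # => ["sunspot", "rails"]
gem_name_tokens("will-paginate") # => ["will", "paginate"]
```

This is what lets a search for “sunspot” match the gem named “sunspot_rails”.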

The same applies for the dependency_name column — a listing of all dependent gems.

I would call this an intermediate usage of Sunspot and Solr, since it requires a customization to be made to Sunspot’s otherwise extremely generic and flexible default schema.xml.
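For the curious, such a customization might look roughly like the following in schema.xml. This is a sketch, not Sunspot’s shipped configuration — the type name and the exact tokenizer pattern here are my own assumptions:

```xml
<fieldType name="rubygem_name" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- Split on whitespace, dashes, and underscores, then lowercase
         the resulting tokens. -->
    <tokenizer class="solr.PatternTokenizerFactory" pattern="[\s_\-]+"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

A dynamic field matching `*_name` can then be pointed at this type, which is what the `:as => 'rubygem_name'` option takes advantage of.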

More associated model data

Authors, description, summary and dependencies are all coming from the most recent gem version. Dependencies even goes a bit further, stepping out along a few more associations to collect the names of gems declared as dependencies for this gem.

Tracking the downloads

We index the number of times that a gem has been downloaded, so we can later ask Solr to factor this into the ordering of our search results. All else equal, we would like a gem with more downloads to show higher in the search results than a gem with less downloads.

Controlling indexing

When a gem gets its only published version yanked, we need a way to quickly drop it from the search results. I’m going for a pretty coarse mechanism to control this, indexing a boolean true or false on whether there even is a most recent version of the gem that should be indexed. This allows me quite a bit of confidence that we’re not going to let anything leak out that shouldn’t show up in search results.

Step five: Indexing your data for the first time

The above is all we needed to match the current state of RubyGems search, plus throw in a few upgrades in terms of performance and more flexible querying. Now might be a good time to run an index and test things out.

Sunspot provides some Rake tasks to start and stop a local development Solr instance. You can run rake -T sunspot to see the documentation for its Rake tasks. To start a Solr server in the background for your development environment, you can run the following command:

rake sunspot:solr:start

Now with a local Solr server up and running, you can run another rake task to index your data:

rake sunspot:reindex

When you’re simply using your application, Sunspot provides basic after_save hooks to make sure your ActiveRecord objects are sending over their changes to Solr when they’re created, updated or removed in the database. That helps keep your search index up to date with the contents of your database.


RubyGems.org — A case study in upgrading to full-text search (Part 1)

Part 1: Background and benchmarking

RubyGems.org is a wonderful community resource for discovering and distributing Ruby Gems. The relaunch of its front-end in February 2010 based on the Gemcutter project provided an excellent improvement to the entire process of creating and distributing gems.

Like many projects focused on releasing early and often, the history of the Gemcutter search has continued to be one improvement after another, starting as simple as possible and moving forward from there. And, as a developer, what simpler way is there to implement your search than starting with SQL LIKE? And that is exactly where our friend Gemcutter started.

Does the following look familiar?

scope :search, lambda { |query|
  where(["versions.indexed and (upper(name) like upper(:query) or upper(versions.description) like upper(:query))", {:query => "%#{query.strip}%"}]).
    order("rubygems.downloads desc")
}

This is the search method currently in use at RubyGems.org. In truth, this is a great place to start as an agile developer, especially in early development. It certainly beats hard-coding your search pages with lorem ipsum in terms of usefulness to your stakeholders.

Unfortunately, as your site starts to grow, SQL LIKE is not going to keep up in the performance department. As the sheer size of the columns that you’re searching against starts to grow, the amount of time spent searching them will grow with it. Your site’s users will be punished for the growing popularity of the site itself.

How slow?

Well, let’s run a quick benchmark. Mind you, this will be highly un-scientific, but it should have some legitimacy in terms of relative user experience. My setup is using ab, from my home machine, over the Internet, so take this with a grain of salt.

And the results? A median total response time of 2,150ms, with a standard deviation of 95ms, for 10 requests.

If that sounds terrible, let’s be generous and remember that this is accounting for general Internet latency as well. Which in my case is an average ping time of… 75ms. Okay, yes, that 2,150ms is pretty terrible.

Now let’s take a look at my demo implementation using Solr. It’s running against a recent database dump from RubyGems.org, graciously provided by Nick Quaranto just a few days ago. The site itself is running on a modest single dyno on Heroku, with the search index itself being hosted by yours truly over at Websolr.

The result? A median total response time of 1,230ms, with a standard deviation of 15ms. And a pretty similar average ping time, too.

Already, with a very quick and dirty test setup, we’re seeing a statistically significant improvement from the perspective of where a user would be sitting.

How quick and dirty was that test? Well, comparing the home page actions, my demo clocks in at a median total request time of 1,200ms, plus or minus 280ms. Similar to the search action itself. And RubyGems.org proper? 160ms plus or minus 6ms.

After seeing something like that, it’s evident that the SQL-based search is at a clear disadvantage in terms of performance alone. My guess is that, if it were running on the same hardware as RubyGems proper, the Solr search would be yet another order of magnitude faster than the nearly halved response time we’ve already seen.

In my next post, we’ll take a look at how much work it is to move that search traffic out of SQL and onto Solr.


Binpress Programming Contest

If you’re in the business of crafting and selling software, and you haven’t seen this already, you should definitely check out the Binpress Programming Contest which is currently underway!

Binpress is building a marketplace of source code, where developers can buy, or build and sell, reusable source code. For example, if you’re reading this blog post, we’re guessing you may have some interesting search modules and UI patterns to offer to the world. Solr spelling correction hints and autocompleting search fields, anyone?

We’re excited about the potential of Binpress, and excited to count ourselves among the sponsors of their Programming Contest. We’re looking forward to what everyone comes up with!


“Did you mean…?”

It’s been a long time coming, but Websolr now supports the Solr SpellCheckComponent in our configuration panel. Just check one box and you’re all set to use one of Solr’s most impressive features for improving search relevancy.

Spellcheck: Check!

Spelling correction in Solr works much the same way that you might check your own spelling. If you’re not sure how to spell a word — look it up in a dictionary! There are a number of different sorts of dictionaries to use, but a pretty flexible way to set things up is to configure your own index to be the authoritative source for how words ought to be spelled. That way you don’t have to worry about languages or synonyms or proper nouns and jargon specific to your domain.

Getting started with spellcheck

When you first enable the spellcheck component on your index, you’ll need to reindex your data. This populates a textSpell field with all of the terms from your text fields.

Once you’ve reindexed, you’ll need to instruct the spellcheck component to build its dictionary. You can do this by issuing an empty query with the spellcheck.build parameter set to true.

Now that your dictionary is built, you can inspect the response from Solr to find suggestions for mis-spelled words. Consider this abbreviated sample response for a search where “world” is mis-spelled as “wrld”:

  "responseHeader": {
    "status": 0,
    "QTime": 2
  },
  "response": {
    "numFound": 0,
    "start": 0,
    "docs": []
  },
  "spellcheck": {
    "suggestions": [
      "wrld", {
        "numFound": 1,
        "startOffset": 0,
        "endOffset": 4,
        "origFreq": 0,
        "suggestion": [{ "word": "world", "freq": 42 }]
      },
      "correctlySpelled", false,
      "collation", "world"
    ]
  }

As you can see, there is now a “spellcheck” block with some suggestions for the incorrectly spelled word, and some information about how often the alternative choice appears in your index. You can then use this extra information to provide UI hints to your users that can guide them in the right direction.
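As a sketch of how an application might dig the top suggestion out of that response — assuming Solr’s extendedResults format, with the JSON already parsed into a Ruby hash, and a helper name of my own invention:

```ruby
# Solr returns suggestions as a flattened key/value array; entries whose
# value is a hash are misspelled terms carrying suggested corrections.
def top_suggestion(spellcheck)
  _, info = spellcheck['suggestions'].each_slice(2)
                                     .find { |_, value| value.is_a?(Hash) }
  info && info['suggestion'].first['word']
end

response = {
  'suggestions' => [
    'wrld', { 'numFound' => 1, 'suggestion' => [{ 'word' => 'world', 'freq' => 42 }] },
    'correctlySpelled', false
  ]
}

top_suggestion(response) # => "world"
```

From there it’s easy to render a “Did you mean…?” link that re-runs the search with the corrected term.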

Configuring spellcheck for your local environment


If you’re into the technical details, and keen to try this out in your own development Solr instance, here’s what our searchComponent looks like:

<searchComponent name="spellcheck" class="solr.SpellCheckComponent">
  <str name="queryAnalyzerFieldType">textSpell</str>
  <lst name="spellchecker">
    <str name="name">default</str>
    <str name="field">textSpell</str>
    <str name="spellcheckIndexDir">./spellchecker</str>
  </lst>
</searchComponent>

Here are the default options we are sending in with queries to the standard request handler:

<str name="spellcheck">true</str>
<str name="spellcheck.dictionary">default</str>
<str name="spellcheck.onlyMorePopular">true</str>
<str name="spellcheck.extendedResults">true</str>
<str name="spellcheck.collate">true</str>
<str name="spellcheck.count">5</str>

And last but not least, don’t forget to add spellcheck to the last-components directive in the relevant requestHandler!

<arr name="last-components">
  <str>spellcheck</str>
</arr>

For more information on the configuration of the SpellCheckComponent, you should refer to the SpellCheckComponent docs at the Solr Wiki.


In the schema.xml, we expect a textSpell field to be present. You are welcome to define that field however you like. If one doesn’t exist, we use the following field definition:

<field name="textSpell" type="text" stored="false" indexed="true" multiValued="true" />

And, finally, a copyField directive to collect all of your fields matching *_text:

<copyField source="*_text" dest="textSpell" />

If you have a different naming convention for your text fields, feel free to define the textSpell field in your schema.xml however you like, and we’ll defer to your settings.

How’s it going?

Are you adding spellcheck to your app? Want to show it off? We love seeing all the cool sites our customers are building with Solr — from simple keyword searches all the way up to interactive faceted geospatial maps! Send a tweet to @websolr with your URL so we can brag about you to our friends.

Looking for a production Solr host, or even just a place to take Solr for a spin without hassling with all the installation or configuration? Sign up for a month of our Silver plan for free, on us, with the coupon code SPELLCHECK1101.


IBM DeveloperWorks - Look ahead to emerging web technologies in 2011

Cloud computing has been widely hyped as the cure for the common cold and even better than whole-grain sliced bread. Perhaps those claims exaggerate its potential, but not by much. Cloud computing has had a remarkable impact. Services like Amazon Elastic Compute Cloud (Amazon EC2), the Rackspace Cloud, and other similar virtual server farms have made the machine room and the need for capital investments obsolete. Turning a machine on is as simple as clicking the mouse.

Further, turnkey hosting providers such as Heroku provide clouds with silver linings: application hosting with no system administration. Just install your code and go. Heroku is especially notable because it has fomented a cottage industry of specialized cloud providers that complement its core services. Need search? Plug in Websolr. Global SMS? Use Moonshado. Prefer a “no SQL” database? Choose from Mongo, Redis, or JasonDB.

So long, root prompt! It’s been nice knowing you. With little or no systems administration, developers will be free to focus on application code in 2011.


Geospatial searching in Sunspot 1.2

Geohash is a very cool algorithm for encoding a latitude and longitude into a string of characters. These strings represent geographical quadrants of increasing precision, where a smaller, more precise quadrant shares the prefix of the quadrant that contains it.

For example, the coordinates (51.5001524, -0.1262362) represent a point in London. This point is encoded into the geohash “gcpuvpk44kprq” which includes a bounding box size of about 2 meters. Shortening that geohash to “gcpuvpk” translates to the rounded-down (51.50, -0.13) with a much less precise bounding box size of about 10 km.
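To make that prefix property concrete, here’s a minimal geohash encoder in plain Ruby. This is a sketch for illustration — in a real application you’d reach for a dedicated geohash library:

```ruby
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"

# Encode a (lat, lng) pair by repeatedly bisecting the longitude and
# latitude ranges, interleaving the resulting bits, and packing each
# group of five bits into one base-32 character.
def geohash(lat, lng, precision = 12)
  lat_range = [-90.0, 90.0]
  lng_range = [-180.0, 180.0]
  hash = ""
  even = true # geohash alternates bits, starting with longitude
  bits = 0
  ch = 0
  while hash.length < precision
    range, value = even ? [lng_range, lng] : [lat_range, lat]
    mid = (range[0] + range[1]) / 2
    if value >= mid
      ch = (ch << 1) | 1
      range[0] = mid
    else
      ch <<= 1
      range[1] = mid
    end
    even = !even
    bits += 1
    if bits == 5
      hash << BASE32[ch]
      bits = 0
      ch = 0
    end
  end
  hash
end

geohash(51.5001524, -0.1262362, 7) # => "gcpuvpk"
```

Shorter hashes of the same point are pure prefixes of longer ones, which is exactly the property Sunspot leans on.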

This is where Sunspot comes in to use the lexical properties of a geohash to leverage Solr’s strengths.

Rather than asking Solr to explicitly calculate distances on the server side (a feature scheduled for Solr 1.5, and only somewhat-successfully backported to 1.4), Sunspot stores and compares these geohashes lexically. It does so by building a series of clauses in your query attempting to lexically match decreasingly-precise geohashes for the point you are searching near. With a corresponding set of decreasing boost factors, this allows Solr to assign scores to your results based on their nearness.

This is a rough ordering, to be sure. However it is very easy to set up and use, and much kinder on your Solr 1.4 install than calculating distances, particularly if you have a lot of data.

One side effect that you might see if you’re just getting started with Sunspot 1.2’s geohash-based searching is that you can’t seem to find any results more than a certain distance away: around 389 miles, to be precise.

Because Sunspot uses the DisMax query parser by default, if you are using its default “minimum match” of 100%, it treats all clauses as mandatory. A minimum match of 100% effectively requires that results be at least within the least-precise bounding geohash box of about 389 miles.

To search outside of that distance, we need to make sure the spatial clauses are treated as optional. Setting minimum_match to 0 would make all clauses strictly optional. Or setting it to -1 would make all but one optional. The choice is up to you. If you’d like to learn more about minimum match in the DisMax query parser, you should refer to an earlier article I wrote on querying with boolean logic using Sunspot and the DisMax query parser.

Using this approach, sorting will by default be a combination of keyword relevancy (if you are searching with a query string) and proximity. The Sunspot documentation on its near search restriction goes into more detail about how to weight one over the other.

Ultimately, to display actual distances to your users, or to enforce precise distance ordering, you will need to perform distance calculations on your results. Fortunately, so long as your weighting is okay, you should only have to do this for a page of results at a time. Here is an example of sorting your results by their exact distance to the point you are searching near:

@results = @search.results.sort_by { |result| result.distance_to(lat, lng) }

That distance_to method is left as an exercise to the reader :)
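If you’d rather skip the exercise, the haversine formula is one reasonable answer. The method and constant names below are my own, not Sunspot’s:

```ruby
EARTH_RADIUS_MILES = 3958.8

# Great-circle distance in miles between two (lat, lng) points,
# computed with the haversine formula.
def haversine_distance(lat1, lng1, lat2, lng2)
  to_rad = Math::PI / 180
  dlat = (lat2 - lat1) * to_rad
  dlng = (lng2 - lng1) * to_rad
  a = Math.sin(dlat / 2)**2 +
      Math.cos(lat1 * to_rad) * Math.cos(lat2 * to_rad) * Math.sin(dlng / 2)**2
  2 * EARTH_RADIUS_MILES * Math.asin(Math.sqrt(a))
end

# London to Paris is roughly 214 miles as the crow flies.
haversine_distance(51.5074, -0.1278, 48.8566, 2.3522)
```

You could define `distance_to(lat, lng)` on your model in terms of this helper and its own stored coordinates.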

In practice, Sunspot’s new geohash search is trivial to set up and use, and covers all of the most common use cases for applications that need spatial search. Until we see baked-in spatial search in Solr 1.5, this approach works beautifully for the majority of applications out there, and I encourage you to give it a try.

And if you’re not using Sunspot, this is an easy win to add to your application or Solr client :)