Sphinx: MySQL Full-text Search Replacement
The cornerstone of the entire internet is search. The internet is so vast and sparse that the only way to make it truly useful is to let a person easily find whatever they are looking for. That is why search engines and web portals have been so successful. At Grooveshark, search, along with recommendations, is the fundamental way people find music.
In the past, using MySQL’s full-text search was convenient and rather fast. As the amount of information grew, it became apparent that MySQL would not be able to handle the size of the data and the number of searches. Looking around for replacements, the two best solutions I found were Lucene and Sphinx. Lucene is a nice tool that integrates with a bunch of other Apache projects, but Sphinx is small, fast and really easy to use.
Setting up Sphinx is a cinch. Using the official documentation and this IBM article, I was able to get Sphinx running in less than 30 minutes. You have to compile the source, and getting the data into Sphinx can take a while depending on your data source (MySQL, PostgreSQL or XML). The Sphinx download also includes PHP and Python examples to help you get started with its easy-to-use API. For international support, you can modify the charset_table option in your configuration file using Sphinx’s Unicode character mapping.
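To give a feel for the API, here is a minimal sketch of a query helper in Python. It is not our production code. It assumes the sphinxapi.py client that ships in the Sphinx download, a running searchd, and an index called "songs" (a name made up for this example). The helper takes the client as an argument so it can be tried without a live server.

```python
def search_ids(client, query, index="songs", limit=20):
    """Run a query and return (document id, weight) pairs.

    `client` is expected to behave like sphinxapi.SphinxClient.
    The index name "songs" is a hypothetical placeholder.
    """
    # Ask for the first `limit` matches only.
    client.SetLimits(0, limit)
    result = client.Query(query, index)
    # The bundled client returns None when the query fails.
    if result is None:
        raise RuntimeError("Sphinx query failed: %s" % client.GetLastError())
    return [(match["id"], match["weight"]) for match in result["matches"]]


# Usage against a real server (host and port are assumptions, use
# whatever your searchd is configured to listen on):
#
#   from sphinxapi import SphinxClient
#   client = SphinxClient()
#   client.SetServer("localhost", 9312)
#   print(search_ids(client, "creep"))
```

Sphinx only hands back document ids and weights (plus any attributes you configured), so the ids still have to be looked up in the database to get the full rows.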
With Grooveshark, even Sphinx is not the perfect solution because we don’t have “perfect” data. After we get results back from Sphinx (on our slow test machine, we never had a query take longer than 0.3 seconds!), we put the results through a filter and reorder them accordingly. One example is preventing songs that contain the artist name in multiple places from being considered a more “relevant” result than the same song with the correct metadata. The current solution is definitely not perfect and there is still more work to do, but searches are now quicker and more relevant than they have ever been.
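The reordering idea can be sketched in a few lines of Python. This is only an illustration of the artist-name case, not the actual Grooveshark filter. It assumes each result has already been joined with its metadata into a dict, and the field names ("title", "artist", "album", "weight") and the penalty value are made up for the example.

```python
def rerank(results, penalty=0.5):
    """Demote songs whose artist name is repeated in other fields.

    Each result is a dict with hypothetical keys: "id", "weight",
    "title", "artist" and "album". The penalty of 0.5 is arbitrary.
    """
    def adjusted_weight(song):
        artist = song.get("artist", "").strip().lower()
        if not artist:
            return song["weight"]
        # Count the non-artist fields that also contain the artist name.
        repeats = sum(
            1 for field in ("title", "album")
            if artist in song.get(field, "").lower()
        )
        # Scale the weight down once for every repeat.
        return song["weight"] * (penalty ** repeats)

    return sorted(results, key=adjusted_weight, reverse=True)


results = [
    {"id": 1, "weight": 300, "title": "Radiohead - Creep",
     "artist": "Radiohead", "album": "Radiohead"},
    {"id": 2, "weight": 200, "title": "Creep",
     "artist": "Radiohead", "album": "Pablo Honey"},
]
ranked = rerank(results)
```

Here the first song matches "Radiohead" in three fields and gets the higher raw weight, but after the penalty the cleanly tagged second song comes out on top.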