The cornerstone of the entire internet is search. The internet is so vast and sparse that the only way to make it truly useful is to let people easily find whatever they are looking for. That is why search engines and web portals have been so successful. At Grooveshark, search, along with recommendations, is the fundamental way people find music.

In the past, using MySQL’s full-text search was convenient and rather fast. As the amount of information grew, it became very apparent that MySQL would not be able to handle the size of the data and the number of searches. Looking around for replacements, the two best solutions I found were Lucene and Sphinx. Lucene is a nice tool that integrates with a bunch of other Apache projects, but Sphinx is small, fast and really easy to use.

Setting up Sphinx is a cinch. Using the official documentation and this IBM article, I was able to get Sphinx running in less than 30 minutes. You have to compile the source, and getting the data into Sphinx can take a while depending on your data source (MySQL, PostgreSQL or XML). The Sphinx download also includes PHP and Python examples to help you get started with its easy-to-use API. For international support, you can modify the charset_table option in your configuration file using Sphinx’s Unicode character mapping.

With Grooveshark, even Sphinx is not the perfect solution because we don’t have “perfect” data. After we get results back from Sphinx (on our slow test machine, we never had a query go over 0.3 seconds!), we put the results through a filter and reorder them accordingly. One example is preventing a song that contains the artist name in multiple places from being considered a more “relevant” result than the same song with the correct metadata. The current solution is definitely not perfect and there is still more work to do, but searches are now quicker and more relevant than they ever were before.
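The filter-and-reorder step could look roughly like the sketch below. This is not our actual filter; the result shape (title, artist, album, score) and the penalty value are assumptions made up for illustration. It lowers the score of any result whose artist name also shows up in the title or album fields, then sorts by the adjusted score.

```javascript
// Count non-overlapping, case-insensitive occurrences of `term` in `text`.
function countOccurrences(text, term) {
  if (!text || !term) return 0;
  var haystack = text.toLowerCase();
  var needle = term.toLowerCase();
  var count = 0;
  var pos = haystack.indexOf(needle);
  while (pos !== -1) {
    count++;
    pos = haystack.indexOf(needle, pos + needle.length);
  }
  return count;
}

// Hypothetical result shape: { title, artist, album, score }, where `score`
// is the relevance weight that came back from the search engine.
// Each extra appearance of the artist name costs `penalty` points.
function rerank(results, penalty) {
  return results
    .map(function (r) {
      var extra = countOccurrences(r.title, r.artist) +
                  countOccurrences(r.album, r.artist);
      return { result: r, adjusted: r.score - penalty * extra };
    })
    .sort(function (a, b) { return b.adjusted - a.adjusted; })
    .map(function (entry) { return entry.result; });
}
```

Calling rerank(results, 1.5) returns a new array and leaves the original scores untouched, so the penalty can be tuned without re-running the search.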


Memcached is a wonderful tool to offload database calls by storing the needed information in cache. One trick I’ve been using to increase productivity is that whenever I know I will be using complex or long running queries, I use a local instance of Memcached to cache the initial database calls. As I debug the output or tweak the logic, the average runtime of the script stays almost constant because the bulk of the processing, database calls, has been cached.
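Here is a minimal sketch of that trick. To keep it self-contained, a plain in-memory Map stands in for the local Memcached instance, and cachedQuery and slowQuery are hypothetical names, not part of any Memcached client.

```javascript
// Stand-in for a Memcached client: an in-memory Map keyed by query string.
// A real setup would talk to a local memcached instance instead.
var localCache = new Map();

// Run the expensive lookup only on a cache miss.
// `fetchFn` is a hypothetical function that performs the real database call.
function cachedQuery(key, fetchFn) {
  if (localCache.has(key)) {
    return localCache.get(key);
  }
  var rows = fetchFn(key);
  localCache.set(key, rows);
  return rows;
}

// Fake "database" that counts how often it is actually hit.
var dbCalls = 0;
function slowQuery(sql) {
  dbCalls++;
  return [{ sql: sql }];
}

cachedQuery("SELECT * FROM songs", slowQuery); // miss: runs the query
cachedQuery("SELECT * FROM songs", slowQuery); // hit: served from the cache
```

After the first run, every rerun of the script while debugging is served from the cache, which is why the runtime stays flat.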

Sigh….

After months of hard work, it’s finally here: Grooveshark Lite. Everyone at Grooveshark has been working their tails off trying to bring you the best music listening experience the internet has to offer. Now you don’t have to download Sharkbyte to stream music, though it is still needed to download and purchase songs. This Flash application allows users to easily find any song in our system. The best way to find out about Grooveshark Lite is to see it for yourself. Go to http://listen.grooveshark.com to see it in action.

Working Hard for the Man

April 5th, 2008

It’s 1:40 AM on a Friday night and I’m still at work. Something is very wrong with that picture.

Over at Grooveshark, LOTS of changes are being made that are going to turn a pretty good music site into an even better one. Better organized data, more online songs and a better user experience are all on the horizon. For all those impatient folks out there, of which I am one, just hold on and this major update will knock you out of your boots, or socks… or feet, so make sure you wear some kind of footwear!

I want MY music. I want it fast, organized, high-quality and now. This is not too much to ask, and I, a consumer, have been begging for it for years.

Before BitTorrent, Napster or CDs, there was the mixtape. I would spend hours listening to the radio and when a song I liked came on, I recorded it. It was MY music. I chose to keep the notes, the voice, the rhythms with me for all time. I would mix and match songs according to my mood and the occasion. I would swap tapes with friends with similar tastes. And if they were lucky, they might have received one as a gift. Of course, being a kid, I didn’t have a lot of money. But nothing says “I care” more than a well-crafted mixtape (and no, I’m not ripping off Nick Hornby).

Then came the CD with its shiny new cover and its obvious similarity to the venerable vinyl record. But this was the 90’s. A new time, a new era, full of ones and zeros. Anything digital was all the rage. So I paid $20 a pop for a piece of plastic that did sound better than a tape, but was still inferior to a vinyl record. Granted, the CD came with a small booklet full of useful information: more pictures, sometimes lyrics, and even the name of the sound engineer from that song I never listened to. With a mixture of mass marketing and an era of economic prosperity, YOU, the labels, reaped record profits and I was left with piles of CDs containing just a few songs of enjoyable music.

In the end, I was left unfulfilled. I returned to the trusty mixtape to create the album I really wanted. Sure, it was lower quality and I couldn’t magically skip to the next song, but at least I had the songs I wanted in the order I wanted. It was my music my way. At the end of the 90’s, the party was over. The floodgates were open and a sea of unlimited free music was unleashed upon us. But this wasn’t a revolution. People didn’t storm the castle or burn down the walls. It was the same evolutionary process that brought about transistors, computers and the internet.

As the technology advanced with cheaper disk space and faster web connections, I wanted more. First there was Usenet, full of silly chatter and annoying trolls, but I could download free music. Then there was IRC with better file downloading but added viruses. Each successive iteration provided faster downloads and more content. The tide of progress couldn’t be stopped, so technology finally reached a tipping point where the will, getting the music I want, found a way: Napster.

In truth, Napster changed everything for internet users. Before, advanced internet users had to download from large, clunky servers that might not have the data you wanted, or use a command-line interface to connect directly to files you wanted. What Napster did correctly was provide an adequate interface to content that was searchable and downloadable. What Napster neglected to do was filter out all the junk that people tossed in the system. Despite their drawbacks, what made Napster and other P2P networks even more worthwhile was recordable CDs. Now you had the best of both worlds: the crisp, clean digital sound of a CD along with the malleable, personal nature of a mixtape.

During this era of free music, a connection was broken. I was no longer a customer. I became a common thief and had to be dealt with immediately. You tried suing me and that did not work. You tried suing my little siblings and grandmother and you lost my respect. You did not ask me what I wanted or what it would take to change my ways. You could have created a new and better system with higher quality files and organized information. You could have created a faster system with special content like videos or live concerts. You could have done a lot of things, but you chose to ignore me. That was your decision, but now I have made mine: I WILL have my music.

There is BitTorrent, which lets me download massive amounts of music with the click of a button. There are decentralized P2P networks which let me find the exact song I want without being sued. There are social music sites where random strangers and I can share our individual tastes with each other and the world. There are programs out there that let me organize, tag and correct the gigabytes of music I have on my hard drive, and portable music players that can carry my entire library and fit in my pocket. I do this because I still love music and listen to it every day. Now I download music and create playlists instead of buying CDs and burning mixed CDs. I do this because you drove me away and made me an outcast. Now a new era is upon us and you have to decide what your next step will be.

It’s starting out with a trickle: Radiohead, Trent Reznor and Barenaked Ladies. But that’s only the beginning of your troubles. When I finally build a system that lets me listen to my music when, where and how I want it, I will need you less. When I can easily share and broadcast my music with friends or the world, you will see my influence. When I can discover the music I like or find music I never knew I would like, I will be empowered. And when I can finally pay the people directly who create, perform and produce the music I love, you will be nothing.

There are some who claim that they can do these things already. That they can provide a solution to my problems. I will be the judge of that. In the meantime, I refuse to wait for a solution because I am the solution. I choose to build the very system I pleaded for you to build. I have built Wikipedia, Flickr, Digg, Last.fm and Grooveshark. These are just the beginning and more is coming ahead. You can be part of this future if you really want. Just remember, I will have my music, but will YOU be part of the system?


When working on any website, using the model-view-controller (MVC) design pattern is a must. One way to achieve this is by using templates. Within PHP, there is a large divide on whether a formal template system is necessary. Most people on the “not necessary” side will claim that PHP itself is a template system (see WordPress and its countless themes). Lately I have come to really like Smarty, a PHP template engine.

Over at Grooveshark, we’ve been making A LOT of changes. Basically, the brains of Grooveshark are improving with a different database design and backend code while the face of the site stays the same. This is where Smarty has made my life so much easier. All I do is make sure that the same variables are assigned with the same information and Smarty handles the rest.
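To illustrate the idea (in JavaScript rather than PHP, and with a toy render function that is NOT how Smarty is implemented), here is a rough sketch. The two backends and their field names are made up; the point is only that both assign the same variables, so the template never has to change.

```javascript
// Toy stand-in for a template engine: replaces {$name} placeholders with
// values from `vars`. It does no HTML escaping, unlike what you would want
// in production.
function render(template, vars) {
  return template.replace(/\{\$(\w+)\}/g, function (match, name) {
    return Object.prototype.hasOwnProperty.call(vars, name)
      ? String(vars[name])
      : match; // leave unknown placeholders visible
  });
}

var songTemplate = "<li>{$title} by {$artist}</li>";

// Hypothetical old backend: flat database row.
function oldBackend() {
  var row = { SongName: "Creep", ArtistName: "Radiohead" };
  return { title: row.SongName, artist: row.ArtistName };
}

// Hypothetical new backend: nested objects from a redesigned schema.
function newBackend() {
  var song = { name: "Creep", artist: { name: "Radiohead" } };
  return { title: song.name, artist: song.artist.name };
}

// Same assigned variables, so the rendered output is identical.
var sameOutput =
  render(songTemplate, oldBackend()) === render(songTemplate, newBackend());
```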

Smarty has other handy features like caching to compensate for the extra overhead of processing the templates. For really dynamic sites, Smarty provides really fine control of the cache so nothing is ever stale. Smarty is really adaptable so that you can use it to produce your feeds (interchange XML for HTML and you are done). As of right now, I’m really liking Smarty.

Firefox is the best browser out there. That’s one point nobody will ever convince me otherwise on. Sure, Opera uses less memory, Safari renders pages faster and IE isn’t even in the conversation. In the end, Firefox provides the best experience for everyone from the less technically inclined to the most advanced users.

For a web developer, Firefox has some of the best tools out there for testing forms, debugging JavaScript and interactive HTML/CSS editing. If you are a web developer and haven’t heard of Firebug, then you really aren’t a web developer. Other great tools include the Web Developer toolbar and Fasterfox.

Recently, Jay mentioned Firefox 3 Beta 4’s release and I really wanted to try it out. I had to keep Firefox 2 for testing purposes, so I did a Google search and discovered this post showing how to run both versions on Linux without them clashing (they can’t run at the same time). Now I’m running a much improved Firefox with better memory management (not THAT much better), faster page rendering and the best developer plugins possible. Talk about having your cake and eating it too (that saying makes no sense).

March Madness is the best tournament on earth. For 3 weeks at the end of March and the beginning of April, 65 college basketball teams square off in an orgy of upsets and thrillers. This year, CBS expanded its online on-demand live streaming. Since I work all day, the only time I have to watch games is at night. Streaming the videos gives me a chance to at least listen to the games while I work. This way everyone wins: I get to keep track of my favorite games and CBS gets to serve me more ads.

Being a geek, I really wanted to see how the March Madness on-demand service would handle the bandwidth of serving the video. I found this InformationWeek article that points out that Akamai is used to stream the videos. This makes sense because Akamai has always been a huge player in the content delivery market. On their site, Akamai claims it “handles 20% of the world’s total Web traffic.” Now those are big numbers. Even with Akamai’s large content delivery network, “CBSSports.com monitors and throttles its system based on usage and historical data patterns” so that it won’t overload their system. The fact that CBS has to restrict the number of people using this service shows how far the US has to go if it wants to be completely digital.

For a really cool visualization of global web traffic, check out this Akamai Flash app.

Music Information Retrieval

March 18th, 2008

Over the weekend, I really got into music information retrieval (MIR). It’s basically grabbing meta-information about an audio file by analyzing its waveform. This type of information is really valuable, especially for a music company (i.e., Grooveshark). If I ever have time, this would be a really fun side project. A really good source of information about this topic is this bibliography page (too bad it hasn’t been updated since August 2007). A list of up and running MIR systems can be found here.

What makes MIR systems so important is that for music sites, they can generate a lot of useful data without anyone having to enter it by hand. For iTunes, this is not a problem because labels give them all the information they need, but for sites where song files can come from anywhere and anyone, there’s no way you can handle the variability in data quality and availability. By having a system that could automatically fetch the required info, within certain bounds of error, you create a vast collection of information that you can use to generate recommendations, provide more accurate searches, and create better categorization of all that music.
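To give a flavor of what “analyzing the waveform” means, here is a small sketch of two classic low-level audio features computed from raw samples. It is nowhere near a full MIR system, and it assumes the audio has already been decoded into an array of numbers between -1 and 1; the function names are my own.

```javascript
// Root-mean-square energy: a rough measure of how loud the clip is.
function rmsEnergy(samples) {
  if (samples.length === 0) return 0;
  var sumSquares = 0;
  for (var i = 0; i < samples.length; i++) {
    sumSquares += samples[i] * samples[i];
  }
  return Math.sqrt(sumSquares / samples.length);
}

// Zero-crossing rate: the fraction of adjacent sample pairs that change
// sign. Noisy or percussive audio tends to score higher than smooth tones.
function zeroCrossingRate(samples) {
  if (samples.length < 2) return 0;
  var crossings = 0;
  for (var i = 1; i < samples.length; i++) {
    if ((samples[i - 1] >= 0) !== (samples[i] >= 0)) crossings++;
  }
  return crossings / (samples.length - 1);
}
```

Real systems compute features like these over short windows of each song and feed them into classifiers or similarity measures, which is where the storage and processing costs start to pile up.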

The problem with MIR systems is that they require large amounts of storage space and processing power. The cost of both storage and processing is dropping every day, which is great for the future of MIR systems. Processing power is the largest inhibiting factor, especially when you try to analyze millions of songs. The only companies that could probably do a project like this on a large scale would be Google, Amazon and their ilk. Still, I’m very hopeful that a startup with the right mix of programmers, hardware, and music can compete with the big boys ;)

Browsers do a lousy job of providing an interface to the HTML document. The DOM is supposed to be that interface, but it is horribly slow and clunky. Traversing the DOM tree extensively is one of the sure-fire ways to slow down your site. In an effort to help out JavaScript coders, the DOM does have functions like getElementsByTagName, getElementsByClassName and getElementsByName, but they do not all work across all browsers. This is why you should create a function called getElementsById:

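// Cache of the nodes already collected for each id.
var groupCache = {};

function getElementsById(id) {
  if (!groupCache[id]) {
    groupCache[id] = [];
  }
  var nodes = groupCache[id];

  // Drop cached nodes that have since been given a new id.
  for (var x = 0; x < nodes.length; x++) {
    if (nodes[x].id !== "") {
      nodes.splice(x, 1);
      x--;
    }
  }

  // getElementById returns only one match, so blank each match's id
  // and ask again until no element with this id remains.
  var tmpNode = document.getElementById(id);
  while (tmpNode) {
    nodes.push(tmpNode);
    tmpNode.id = "";
    tmpNode = document.getElementById(id);
  }
  return nodes;
}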

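Now whenever you want a collection of DOM objects, just give all of them the same id and call this function to grab an array of the objects you want. Keep in mind that the function blanks the id of every element it collects, so anything else that relies on that id (CSS rules, a plain getElementById call) will stop matching afterwards. This is far from ideal and it’s actually a pretty big hack. But sometimes, speed is more important than form.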