back to article Sphinx - text search The Pirate Bay way

In 2001, Andrew Aksyonoff had an itch. He ran a website that indexed song lyrics and allowed users to search them. At the time, none of the open source options for text search worked. Lucene, the standard response at an engineering team meeting when the problem of search comes up, was still a baby. Indexing took too long, and it …

COMMENTS

This topic is closed for new posts.
  1. Chris
    Flame

    MySQL? Oh do fuck off.

    Ted, you're showing your limited knowledge here. Decent databases (including a certain open source one that's got a thing about elephants) have supported full text searching properly for years. Just because MySQL has half-arsed full text indexing and piss poor locking doesn't mean there's anything exciting about Sphinx - those in the know wouldn't use MySQL in the first place.

  2. Adrian Midgley

    ht:/Dig?

    Am I wide of the mark in thinking this is an effective free text search engine?

  3. Anonymous Coward
    Anonymous Coward

    PostgreSQL

    I've been using PostgreSQL to implement a website search engine. Far outperforms both Lucene, MySQL and most other FTS implementation. It is highly flexible and configurable, built into a standard high performance database makes it easy to integrate into most systems.

    I'm still amused by how many people use the nightmare that is MySQL when PostgreSQL is more scalable, quicker, powerful and featured. Yet just as easy to setup and use. Packed with features suitable for enterprises, for example the Slony replication system.

    PostgreSQL for me was a no-brainer.

  4. drag
    Stop

    STOP THE POSTGRE MADNESS. WE KNOW. WE UNDERSTAND.

    "I'm still amused by how many people use the nightmare that is MySQL when PostgreSQL is more scalable, quicker, powerful and featured. "

    Well that would be amusing indeed. Except, of course, PostgreSQL is not actually more scalable, nor is it quicker, nor does it have nearly the same amount of features and flexibility (aka power) that MySQL does.

    What Postgresql fans seem to like most about Postgresql is that it's simply not MySQL.

    This is a point that has become VERY clear every single time that MySQL comes up in any article anywere on the entire fucking internet. I could write a entire 400 page article about the history of databases and computing and put a single line mentioning MySQL anywere and, with absolute reliability, you'll have at least one PostgreSQL fanboy stand up and say how much he loves Postgresql becuase it is not MySQL and he does not understand why MySQL is so popular despite the fact that it's MySQL and MySQL sucks.

    The reality of hte situation is that there are lot of things that are not MySQL (Dirt, humans, Oracle, IBM, video cards, desks, etc etc) and while Postgresql is one of them it's not a really good reason to actually use Postgresql.

    While there are actually very good reasons to use MySQL over Postgresql and people actually do know databases very well actually do end up choosing MySQL for really valid reasons. Despite the fact that a average postgre fanboi doesn't see this or understand why...

  5. bell

    Postgres

    I too am amazed the people continue to torture themselves with MySqueal while ignoring the pachyderm in the Open Source Database room.

    As much of a fan as I am of Postgres I do think that a dedicated free text search engine has its place. I recently implemented Sphinx search for a content management project I was working on and found that it provided some query time sorting tweaks that Postgres just didn't quite. If you have someone sane setting the requirements for result ranking you probably wouldn't need the features, but Sphinx was the one thing that beat Postgres's free text performance.

    This is in no way a slur on Postgres, which does the job of managing relational databases briliiantly. I'm not even sure that free text search should be regarded as being part of that job.

  6. Dean Higginbotham
    Thumb Up

    Index servers

    Great article. Sounds like you're pulling a lot from your Pressflip experience.

    I can emphasize with all the developers trying to figure out enterprise search. For our site, I went through nearly a year, and many index engines to finally get it right. There are some clever things programmers have to do do beyond the normal Lucene and engine tweaks. By thinking out of the box for their particular application (wrappers, pre-processing, etc), results can be much more useful than the normal search engine scoring scheme.

  7. Anonymous Coward
    Anonymous Coward

    At least...

    ...he didn't go with a trendy naming technique ala grokster, tumblr, ... otherwise we'd have Sphinxter, and that just wouldn't be any good at all.

  8. Christopher Ahrens
    Go

    @STOP THE POSTGRE MADNESS. WE KNOW. WE UNDERSTAND.

    You just managed to sum up nearly all debates about why certain people love certain pieces of software. Especially true in the Windows vs. Linux vs Mac vs. evryone lese wars that have gone on for decades now....

  9. Mo

    There are plenty of alternatives

    In my experience, searching TPB for anything has given awful results without some serious massaging of the query.

    I've not looked at Sphinx itself, but Xapian, on the other hand, is absolutely fantastic (and is used by all manner of sites out there).

  10. Reg Sim
    Happy

    @ David Wiernicki

    for making me laugh. Other wise all this is way over my head, I am sad to say.

  11. Anonymous Coward
    Anonymous Coward

    No title

    I switched my website's search from MySQL fulltext to Sphinx a while ago. Search times plummeted by at least an order of magnitude, results got not just usable, but really good. Overall load on my server went right down. During the switch, I increased the number of fields being indexed as well, meaning Sphinx is searching at least three times as much text as MySQL was before.

    Indexing a GB of text out of a MySQL database on a live, busy webserver (single machine, four cores) takes about ten minutes. My Sphinx has been told to be quite nice with server resources too, to keep response times down for humans browsing the site while the indexer runs, so I suspect it might be able to index a bit faster if it was allowed to.

    Installing Sphinx was as simple as a quick apt-get and then a brief bash at a config file, unlike getting Lucene or Solr going, which has in the past taken several days of swearing. That could just be me though.

    Sphinx does more than just search though. It has a nice "Did you mean...." spellchecky feature, related searches suggestions, and it has an easy result filtering system - which is extremely useful for the kind of searching my site needs. Plus it's fast. Very fast.

    In summary, I wholeheartedly endorse this product or service.

    ps. the InnoDB engine in MySQL provides much more sane locking (can I get a row-lock? Yes I can!)

  12. Anonymous Coward
    Thumb Up

    @ David Wiernicki

    Hah! Thanks :)

  13. Anonymous Coward
    Anonymous Coward

    @ drag the MySQL fanbois

    that's about it really.

  14. Anonymous Coward
    Anonymous Coward

    @drag

    No, MySQL does not outperform PostgreSQL - except for queries with no joins on tables using the MyISAM table type (which has no referential integrity). MySQL performance is far worse than PostgreSQL if you use joins on MyISAM tables or use InnoDB tables (the ones with referential integrity). MySQL conformance to the SQL standards is poor - far poorer than any other general purpose RDBMS I'm aware of. It has a history of corrupting data, such as when configuration changes are made, despite the manual saying such changes are safe (I was bitten by this when MySQL silently dropped all indexes and foreign key constraints after an InnoDB configuration tweak to a live system - it had worked fine on the test system). Some data corruption is "intentional", as MySQL allows you to insert invalid values which are outside the domain of the data type. It also massages values to what it thinks you intended, not something an RDBMS should be doing. As for features, PostgreSQL is as fully featured as most commercial RDBMS and certainly more fully featured than MySQL. Even features that MySQL has on paper than PostgreSQL also has are often of limited use in MySQL (such as full text searching - I recall having to recompile to use a different stopword list than the default, and MySQL would only support one language at a time).

  15. Francis Fish
    Happy

    Can I be the first to say it also works really well with Rails?

    Just install the Thinking Sphinx gem and off y'go.

    We've used it to implement combined searches across many heterogenous tables (as in type in a reference to a company, client or claim and find all of the places it's referenced for example)

    It Just Works

    The MySql/Postgres debate - err ... guys ... relational databases are just a commodity now. Just somewhere to stuff things you want to be stored permanently. Soon as you start using the extensions you've locked yourself in, which is why using another engine for free text makes sense. You can swap it out if it starts not doing what you want without having a ton of embedded code.

    This is why I'd never use the Oracle Enterprise search thing, but that's a discussion for another day...

  16. Anonymous Coward
    Anonymous Coward

    Re: Can I be the first to say it also works really well with Rails?

    "The MySql/Postgres debate - err ... guys ... relational databases are just a commodity now. Just somewhere to stuff things you want to be stored permanently."

    It's all one big hashtable in the sky, is it? I wouldn't expect anything more insightful from a Rails advocate.

    Meanwhile, for a competitive solution offering a decent level of support for the SQL standards, actively yet cautiously developed, the smart money has always been on PostgreSQL amongst the Free Software offerings. MySQL has always been catching up, trying to remedy the outspoken (and frequently laughable) views of its founding developers.

    Confidence is a significant factor when choosing something to manage one's data. According to those whose opinions are worth listening to - in other words, not those who use MySQL as some kind of soil pipe for their Web site logs - MySQL has never really been able to match PostgreSQL on that front.

  17. drag
    Heart

    @Big Chris

    Way to miss the point here.

    The point is that I know you fanboys love your database and every time you see the term 'MySQL' printed anywere... in comments or in a article, even in a article (like this one) that talks about how you DO NOT WANT TO USE MYSQL... you people show up and talk about how much postgre is. It's like a compulsion, a disease, and it has to stop.

    Nobody cares anymore.

    Beyond that... I HATE ALL SQL DATABASES.

    Why? Because they are the most abused thing imaginable.

    First off... The vast majority of websites that use SQL database for storage.. well.. they don't need it. In fact they would be much faster and be better designed if they didn't use SQL at all and used flat files and file-system backed storage rather then shoving everything into some binary blob. They would be simplier, easier to back up, easier to setup, permissions handled easier, less prone to corruption, and quite a bit faster.

    That covers about 90% of the users that go on and on about Postgre vs MySQL. They are wrong and stupid for using a SQL database in the first place. It's just stupid.

    There are SQL databases that store my URL history... I have video games that store scores in databases... there are music playing applications that store music databases in SQL format. If I try to install a stupid web application for doing some simple-stupid automation on a remote machine it tries to pull in MySQL or Postgresql or some other shitastic thing.

    Why? Because web developers don't know how to do jack shit other then using SQLfor storage. They think it scales, they think having websites running from multiuser relational databases make sense. They are under some delusion that SQL databases are fast and secure.. they won't think out of the box and never learned how to do anything else. Then they stop being web designers and try to be application designers and well... when all you know how to use a hammer then every problem is a nail waiting to get hammered into submission. Round hole, Square Peg? Goddammit.. just smash it hard enough with SQLite and it'll fit perfectly!!! The repair manual says to tighten to 15 foot pounds? Well.. dammit Postgre is ACID COMPLIANT! Smash that bolt down with that and your garrenteed success. If you hit it hard enough then the leaks are going to have to stop, it's physics!

    Sure, sure for a lot of things SQL makes sense, but usually it doesn't.

  18. Goat Jam
    Thumb Up

    DBMS fanboy war

    I for one just hate all Databases, they are 'orrible, finicky things .

    I can't imagine why anyone would like ANY DBMS to the point of enjoining an INTERNET FLAMEWAR over them!

    It's like a bunch inmates at the Guantanamo Hilton having a fist fight regarding the relative merits of waterboarding vs extreme sleep deprivation.

    Keep up the good work chaps <reaches for beer and crisps>

  19. Anonymous Coward
    Anonymous Coward

    hmmm 2 useless artivles in a row...

    Who would of thunk it.

  20. Ian Michael Gumby
    Flame

    @Francis Fish

    Relational Databases are not a commodity. At least to those that know about relational databases.

    Its like saying that cars are a commodity. Just try and say that a Porche 911 and a Ford Mustang are equivalent vehicles.

    The real problem is that a lot of the tips and tricks that are in the commercial products are not done well in the 'open source' community. Sorry, but query optimization developers isn't a commodity skill.

    If the author did his homework, he would have found that IBM's IDS has a VTI (Virtual Table Interface) that also has a way of creating advanced indexing. So you can write you own indexing scheme.

    The point is that what you praise in the OpenSource world has been around in the commerical world. Too bad certain web 2.0 companies would rather be a billionaire on paper than actually develop the rock solid infrastructure needed to efficiently support high volumes of users.

  21. BlueGreen

    @drag, @Ian Michael Gumby

    I find the people with the loudest voices and strongest opinions generally have least knowledge. Learn how to use sql properly, in anger, with a grownup db, understand what goes on under the hood and on your next project involving storage you might save a lot of time and bugs.

    @Ian Michael Gumby: "query optimization developers isn't a commodity skill". Well said.

This topic is closed for new posts.

Other stories you might like