← Pizza & Beer:Learning with Atlassian A Better Analogy for Agile Software... →

SphinxSearch | Full Text Searching

11/21/2013 By Stephen Dunn

This article explains how to use SphinxSE, Sphinx real time indexing, and set up Sphinx in order to improve search query performance.   First some background about issues surrounding full-text search.

Why does someone need a full text search engine? A MySql database with over a million records will take a very large hit when the following statement is run

`Select * FROM table WHERE title LIKE ‘%someneedle%’;`

For more details on fulltext search performance issues visit: http://dev.mysql.com/doc/refman/5.1/en/fulltext-search.html SphinxSeach is here to rescue us from this kind of performance issue and make our searches perform lighting fast. Sphinx is a full text search engine that integrates easily with MySql.

Let’s get started by removing MySql from the box and upgrade it MariaDB. MariaDB is a better database and a drop-in replacement for MySql. There will not be any need to rewrite code or anything since MariaDB is a drop in replacement for MySql.  MariaDB is faster than mysql ( see here: http://slashdot.org/topic/bi/mariadb-vs-mysql-a-comparison/), and has several modules that come with it.  One of  these modules is SphinxSE.   This article will not cover how to upgrade to MariaDB, the details can be found here: https://downloads.mariadb.org/mariadb/repositories/. Setting up a MariaDB is not required but it does speed up mysql and saves time installing SphinxSE.

Great, so we have MariaDB install and now we can go ahead and install SphinxSearch. There is a Debian Wheezy/Squeeze/Lenny package for it so it’s as simple as, apt-get install sphinxsearch. If you want to compile the latest version from source it is also very simple to do. However, Wheezy currently has Sphinx 2.0.4 and at the time of the article it is 2.1.2. I was unable to see any huge differences between the releases.

Our server is now set up with SphinxSearch, SphinxSE and MariaDB. We are now ready to start adding to the SphinxSearch. You can do this a number of ways. Some of the more common ways are to use indexer. This will select from your MariaDB and build up the required SphinxSearch files to perform searches on. You can also specify deltas to perform even faster indexing and searching. However, this article is not going to talk about that since a new indexer is going to need to be performed on each database change! That is great but typically a database is going to grow every day by several thousand rows, so we are going to want Real Time Indexing.

Real Time indexing will index the new row instantly without running indexer. This is a great time saver since running indexer on a mysql table with several million rows can take quite a bit of time. To set up Real Time indexing with full wildcard searching we are going to open up /etc/sphinxsearch/sphinx.conf file. Inside of the sphinx.conf file you can simply go down to where it says index rt and then that is the only block we are going to need to worry about. In this block you’re going to want to specify:

type = rt
path = /var/lib/sphinxsearch/data/rtindexfile
rt_fields = title
enable_star = 1
dict = keywords
min_infix_len = 3
infix_fields = title

That is going to set up RT for full wildcards so you can search for items with *title* in it. The Path is where the rtindexfile. will live. This will get very big depending on your database and min_infix_len. min_linfix_len is how deep you want to go with your keyword. So if you have min_infix_len set to 3 you will not be able to search less than three so *ti* will not work proper to return ‘titlesomename’. However, setting it to three will be fine for a search needle of *som* to match titlesomename. Setting the min_infix_len low does have a direct impact on the size of your rtindexfile so might want to keep that in mind. Setting infix and not prefix allows us to perform both wildcards *title* and not just title* or *title. enable_star turns on the star feature for the infixed keywords. There are some other items that I set like rt_mem_limit and charset_table but both are dependent upon your box and datasets So you can read about them here: http://sphinxsearch.com/docs/2.1.2/confgroup-source.html. In the RT block you can also set up more fields for the RT index to add but this example just has one field ‘title’ another one you might use is gid.

Go ahead and restart your ‘searchd’ /etc/init.d/searchd restart to update the ‘searchd’ with your new config file. You can now begin to start adding in data to the RT index using mysql -h0 -p 9306 this is the SphinxSearchinterface. If you open up a connection to port 9306 ( default port ) you might see other tables other than the RT you can remove that by editing your sphinx.conf file described above. What I typically do is look at my MariaDB and select all the fields that I want into my RT into a CSV file and then import that CSV file into mysql -h0 -p 9306 into the RT table. This will make your rtindexfile start to grow as the RT table fills up. Now let’s say you filled up your rtindexfile with your current dataset and want to make a change to the index RT block in sphinx.conf. If you want to do this go ahead but remember to erase your rtindexfile because you’re going to have to remake it.

So we now have a populated RT table but don’t know how to keep it in sync with the real database. Well this can be done a few different ways but I will go ahead and tackle it with what I feel is the best way ( for me ). Ultimately it depends on how your current code is setup. How I do this is by setting up another table in my database called to_add_for_sphinx and set up a trigger on insert for my existing table so that on each new insert there is also an insert to the to_add_for_sphinx table. Then I have a crontab task that updates my mysql -h0 -p 9306 with the contents of to_add_for_sphinx and erases the inserted row from to_add_for_sphinx. Another approach might just be to edit your code that is making all of the sql inserts to also connect to mysql -h0 -p 9306 and make an insert into your RT table the same time that you make an insert to the existing MariaDB table.

We have come a long away and now have an insync existing table to the RT indexer table. Well that’s great, but let’s use it to perform the full text search. In MariaDB you’re going to create a new table and have it connect to your ‘searchd’. You must type in the exact table ‘create’ statement with at least the same number of fields as the one provided or it will simply not work. Depending on what fields you added to your RT index you can provide more but no less.

query    VARCHAR(3072) NOT NULL,
group_id  INTEGER,
) ENGINE=SPHINX CONNECTION=”sphinx://localhost:9312/rt”;

When you create the engine you can choose to not specify /rt at the end. Then it will open up all of your indexes to SphinxSE. Great, so now let’s provide a search query to it:

SELECT * FROM t1 JOIN existingtable ON exisitingtable.id = t1.id WHERE query=’*title*;mode=extended;’

The above will select using full text *title* and join it to the existing table so that you can get all of the other existing data back so that you don’t need to overfly populate your RT index with fields that don’t need to be indexed. SphinxSE lets you do several nice searching features and has more modes than simple extended. For a complete list please visit: http://sphinxsearch.com/docs/2.1.2/searching.html

This article provides you the knowledge on how to set up SphinxSearch using RT indexing so that your database does not take over a minute to do a Full Time Search when there are over 1 million rows.

← Pizza & Beer:Learning with Atlassian A Better Analogy for Agile Software... →
Source Allies Logo © Source Allies, Inc.