Wednesday, March 10, 2010

Indexing substrings with PostgreSQL

So, a couple years ago one of my e-commerce clients was complaining that the search function on their web site was crap. At the time, all the search was doing was hitting the database with a LIKE constraint. In terms of performance and quality of results, it was awful. So I did some research and decided to pursue the Full Text Search option (It comes built-in with Pg, btw).

NOTE: It has come to my attention that there is a package called pg_trgm which is built-in with Pg. It has a very similar functionality to what is described below, but differs slightly. The functionality of trgm is based on string similarity and a certain threshold when performing a match query. If the similarity is below a certain threshold, the row will not be returned. For example, if I have I want to search for "ABC" and I would like a product whose part number is "ABC1234567891011" to be returned, a query using pgtrgm will not return it; because the threshold is based on a % of character matches in the two strings. In this case, the similarity % would be 0.1667 or 16.67% match. The default threshold is 0.3. I realize that I could just lower the threshold, but that would not only make the result set larger and quality poorer, but it wouldn't solve the problem. There would always be matches being missed.

The results were not only 1000x better, the speed was incredible. The only drawback was that I lost the capability to do partial string searches. To be clear, for my client's needs, it really only applied to being able to search for a partial manufacturer's part number. So if the product that I was looking for, had a part number of "ABC123", they wanted it to come up when searching for keywords like "ABC12", "BC12", "123", etc. Full Text Search doesn't really support this. So, I had to think up a way to do this on my own. I could have just gone back to using LIKE on the part number column, but that would have brought the performance back down to what it used to be. So, how could I index a partial string?

First, I'll start from Step 1 and we'll get to that.

Let's say that our table in question is called Products. Let's use the following definitions.


CREATE TABLE Products
(
id serial NOT NULL,
mfgno character varying(32) NOT NULL,
man_id smallint NOT NULL,
upc character varying(32),
"name" character varying(500),
keywords text,
categoryid character varying(25),
ts_mfgno_name_keywords tsvector,
img character varying(255),
CONSTRAINT Products_pkey PRIMARY KEY (id),
CONSTRAINT Products_mfgno_man_id_key UNIQUE (mfgno_stripped, man_id)
)
WITH (
OIDS=FALSE
);


CREATE INDEX Products_ts_mfgno_name_keywords
ON Products
USING gin
(ts_mfgno_name_keywords);


CREATE TRIGGER ts_mfgno_name_keywords_update
BEFORE INSERT OR UPDATE
ON Products
FOR EACH ROW
EXECUTE PROCEDURE tsvector_universe_trigger();


CREATE OR REPLACE FUNCTION tsvector_universe_trigger()
RETURNS trigger AS
$BODY$
DECLARE rec RECORD;
BEGIN
NEW.ts_mfgno_name_keywords := setweight(to_tsvector('pg_catalog.english', COALESCE(NEW.name,'')), 'A');

NEW.ts_mfgno_name_keywords := NEW.ts_mfgno_name_keywords ||
(SELECT setweight(to_tsvector('pg_catalog.english', COALESCE(name,'')), 'A') FROM man WHERE id = NEW.man_id);

NEW.ts_mfgno_name_keywords := NEW.ts_mfgno_name_keywords || (SELECT setweight(to_tsvector('pg_catalog.english', COALESCE(NEW.keywords,'')), 'C'));

RETURN NEW;
END
$BODY$
LANGUAGE 'plpgsql' VOLATILE
COST 100;


So, in a nutshell, column "ts_mfgno_name_keywords" is of type tsvector. This holds all the information needed for the FTS to work. How do we populate it? Well, we create a trigger so that when a record gets inserted (or when it's updated) to table Products, we can build it. As you can see, the trigger function throws a bunch of stuff into the keyword "bucket". You don't want to throw too much stuff in, because that will hurt the quality of search results. I decided to add the product name, manufacturer name and a keyword column which contains other relevant data. You'll also notice that I added weights to each piece. If you think about it, the product name and it's manufacturer are more product specific than any other column, so I gave it a weight of 'A' (the highest). This dramatically helps the quality of results. If a keyword is in the product name of one product, and only the keywords column of another, the two would basically rank the same -- when that's obviously not the case. Now that we have what we need, to speed things up, we put an index on column "ts_mfgno_name_keywords".

So, now that we have everything setup, we can query the table in the following way:


SELECT * FROM Products WHERE ts_mfgno_name_keywords @@ to_tsquery('printers');


If you have more than one keyword, you simply separate them with a '&', like:


SELECT * FROM Products WHERE ts_mfgno_name_keywords @@ to_tsquery('inkjet&printers');


Nnow we can get back to the issue at hand -- handling partial strings. After about 57 cups of coffee, I finally found a solution. How about, I create an index of *every* substring on our "mfgno" column that is of 3 characters in length! So, if the value was, "ABC123", I would want "ABC", "BC1", "C12", and "123". This may not seem obvious now, but this is important.

First, I had to write a pg function that would take a string and return all the substring of length 3:


CREATE OR REPLACE FUNCTION text_to_trigrams(str character varying)
RETURNS character varying AS
$BODY$
DECLARE tg varchar[];
DECLARE i smallint;
BEGIN
FOR i IN 1..(char_length(str) - 2) LOOP
tg[i] := substring(str,i,3);
END LOOP;
RETURN array_to_string(tg,' ');
END;
$BODY$
LANGUAGE 'plpgsql' IMMUTABLE
COST 100;


This returns a string of 3 character substrings, separated by a space. Using our example, the returned string would be : "ABC BC1 C12 123".

Now, I can create a ts_vector based on that value:


CREATE INDEX Products_ts_substring_mfgno
ON Products
USING gin
(to_tsvector('simple'::regconfig, text_to_trigrams(mfgno)::text));


Now, to use this, I take the search term and I split it up into 3 character substrings (similar to the function we just created) So, if I wanted to search for 'ABC1', I would do the following query:


SELECT * 
FROM Products
WHERE to_tsvector('simple'::regconfig, text_to_trigrams(mfgno)::text) @@ to_tsquery('simple','ABC&BC1')


This would return the product who's part number is ABC123 (and be indexed!).

The important part to note, is that I refer to the 'simple' dictionary that PG has. This will sub-stain from converting the keyword to an english word.
So, in conclusion, we now have a way to perform indexed substring searches in PostgreSQL without needing to use LIKE or a regular expression!

3 comments:

  1. It gets even better: Trigram-indexing is available as a contrib-module: http://www.postgresql.org/docs/current/static/pgtrgm.html

    ReplyDelete
  2. Hey Alex,

    This actually looks pretty cool, but it differs a little from mine. This will return similar searches, unlike mine, which will return more exact matches. To be honest, for my needs, my implementation is more applicable.

    ReplyDelete
  3. This concept is really nice with a nice program supporting it.Thanks for sharing this useful information.

    ReplyDelete