Wednesday, March 10, 2010

Indexing substrings with PostgreSQL

So, a couple years ago one of my e-commerce clients was complaining that the search function on their web site was crap. At the time, all the search was doing was hitting the database with a LIKE constraint. In terms of performance and quality of results, it was awful. So I did some research and decided to pursue the Full Text Search option (It comes built-in with Pg, btw).

NOTE: It has come to my attention that there is a package called pg_trgm which is built-in with Pg. It has a very similar functionality to what is described below, but differs slightly. The functionality of trgm is based on string similarity and a certain threshold when performing a match query. If the similarity is below a certain threshold, the row will not be returned. For example, if I have I want to search for "ABC" and I would like a product whose part number is "ABC1234567891011" to be returned, a query using pgtrgm will not return it; because the threshold is based on a % of character matches in the two strings. In this case, the similarity % would be 0.1667 or 16.67% match. The default threshold is 0.3. I realize that I could just lower the threshold, but that would not only make the result set larger and quality poorer, but it wouldn't solve the problem. There would always be matches being missed.

The results were not only 1000x better, the speed was incredible. The only drawback was that I lost the capability to do partial string searches. To be clear, for my client's needs, it really only applied to being able to search for a partial manufacturer's part number. So if the product that I was looking for, had a part number of "ABC123", they wanted it to come up when searching for keywords like "ABC12", "BC12", "123", etc. Full Text Search doesn't really support this. So, I had to think up a way to do this on my own. I could have just gone back to using LIKE on the part number column, but that would have brought the performance back down to what it used to be. So, how could I index a partial string?

First, I'll start from Step 1 and we'll get to that.

Let's say that our table in question is called Products. Let's use the following definitions.


CREATE TABLE Products
(
id serial NOT NULL,
mfgno character varying(32) NOT NULL,
man_id smallint NOT NULL,
upc character varying(32),
"name" character varying(500),
keywords text,
categoryid character varying(25),
ts_mfgno_name_keywords tsvector,
img character varying(255),
CONSTRAINT Products_pkey PRIMARY KEY (id),
CONSTRAINT Products_mfgno_man_id_key UNIQUE (mfgno_stripped, man_id)
)
WITH (
OIDS=FALSE
);


CREATE INDEX Products_ts_mfgno_name_keywords
ON Products
USING gin
(ts_mfgno_name_keywords);


CREATE TRIGGER ts_mfgno_name_keywords_update
BEFORE INSERT OR UPDATE
ON Products
FOR EACH ROW
EXECUTE PROCEDURE tsvector_universe_trigger();


CREATE OR REPLACE FUNCTION tsvector_universe_trigger()
RETURNS trigger AS
$BODY$
DECLARE rec RECORD;
BEGIN
NEW.ts_mfgno_name_keywords := setweight(to_tsvector('pg_catalog.english', COALESCE(NEW.name,'')), 'A');

NEW.ts_mfgno_name_keywords := NEW.ts_mfgno_name_keywords ||
(SELECT setweight(to_tsvector('pg_catalog.english', COALESCE(name,'')), 'A') FROM man WHERE id = NEW.man_id);

NEW.ts_mfgno_name_keywords := NEW.ts_mfgno_name_keywords || (SELECT setweight(to_tsvector('pg_catalog.english', COALESCE(NEW.keywords,'')), 'C'));

RETURN NEW;
END
$BODY$
LANGUAGE 'plpgsql' VOLATILE
COST 100;


So, in a nutshell, column "ts_mfgno_name_keywords" is of type tsvector. This holds all the information needed for the FTS to work. How do we populate it? Well, we create a trigger so that when a record gets inserted (or when it's updated) to table Products, we can build it. As you can see, the trigger function throws a bunch of stuff into the keyword "bucket". You don't want to throw too much stuff in, because that will hurt the quality of search results. I decided to add the product name, manufacturer name and a keyword column which contains other relevant data. You'll also notice that I added weights to each piece. If you think about it, the product name and it's manufacturer are more product specific than any other column, so I gave it a weight of 'A' (the highest). This dramatically helps the quality of results. If a keyword is in the product name of one product, and only the keywords column of another, the two would basically rank the same -- when that's obviously not the case. Now that we have what we need, to speed things up, we put an index on column "ts_mfgno_name_keywords".

So, now that we have everything setup, we can query the table in the following way:


SELECT * FROM Products WHERE ts_mfgno_name_keywords @@ to_tsquery('printers');


If you have more than one keyword, you simply separate them with a '&', like:


SELECT * FROM Products WHERE ts_mfgno_name_keywords @@ to_tsquery('inkjet&printers');


Nnow we can get back to the issue at hand -- handling partial strings. After about 57 cups of coffee, I finally found a solution. How about, I create an index of *every* substring on our "mfgno" column that is of 3 characters in length! So, if the value was, "ABC123", I would want "ABC", "BC1", "C12", and "123". This may not seem obvious now, but this is important.

First, I had to write a pg function that would take a string and return all the substring of length 3:


CREATE OR REPLACE FUNCTION text_to_trigrams(str character varying)
RETURNS character varying AS
$BODY$
DECLARE tg varchar[];
DECLARE i smallint;
BEGIN
FOR i IN 1..(char_length(str) - 2) LOOP
tg[i] := substring(str,i,3);
END LOOP;
RETURN array_to_string(tg,' ');
END;
$BODY$
LANGUAGE 'plpgsql' IMMUTABLE
COST 100;


This returns a string of 3 character substrings, separated by a space. Using our example, the returned string would be : "ABC BC1 C12 123".

Now, I can create a ts_vector based on that value:


CREATE INDEX Products_ts_substring_mfgno
ON Products
USING gin
(to_tsvector('simple'::regconfig, text_to_trigrams(mfgno)::text));


Now, to use this, I take the search term and I split it up into 3 character substrings (similar to the function we just created) So, if I wanted to search for 'ABC1', I would do the following query:


SELECT * 
FROM Products
WHERE to_tsvector('simple'::regconfig, text_to_trigrams(mfgno)::text) @@ to_tsquery('simple','ABC&BC1')


This would return the product who's part number is ABC123 (and be indexed!).

The important part to note, is that I refer to the 'simple' dictionary that PG has. This will sub-stain from converting the keyword to an english word.
So, in conclusion, we now have a way to perform indexed substring searches in PostgreSQL without needing to use LIKE or a regular expression!

Unraveling SEO

In the e-commerce world, if SEO isn't everything to you, then you are more than a few steps behind. For most e-tailers working on a shoe-string margin -- cost is everything. Everything that you can do to increase profit margin can and must be done. An excellent way to do this is through the implementation of SEO tactics. I mean, how can you argue against this? Free traffic is free traffic.

I've had the opportunity to work with many SEO companies and to be honest, I am not very impressed. SEO has grown into this magical entity, where most people think it takes someone with the hands of God to produce. In reality, this is not the case at all.

SEO comes down to a small set of rules and techniques. Mainly, it's about good content. Put yourself in the shoes of Google. If you were making a search engine, you would want your users to get search results with the best content, right? Well of course. I don't want to have to wade through 1,000's of pages to find the one I'm really looking for. So, how does one get good content? Well, it really boils down to writing content rich descriptions, that aren't flooded with the keywords you're looking to target. Just having this content is important, but what will really set you apart from the rest is how and where you place this content. Let's first start out on defining the main elements of a web page.


1) Page Title.

This is what you see at the top of your browser. It's one of the most important pieces of your page. It should clearly state what your page is about. It shouldn't be too long, but one or two words won't cut it either.


2) Page URL.

This should be very similar to the page title. It will not only add to SEO value, but when a users see's it, they will have more confidence in clicking it because of it's description. Example: http://www.mydomain.com/options-trading.html. It's not to hard to figure out what this page is about without even clicking on it.


3) Meta keywords and descriptions.

This is basically a way to tell search engine crawlers a little more about your page. It should basically follow the lines of your page title, but go into a little more depth.


4)<h1>, <h2>, <h3>, etc.. tags.

You want to place these tags at the beginning of sections on your page. It will designate what's most important (in terms of content) on your page.


5) Paragraph (<p>) tags.

Sections that are surrounded by <p> tags are meant to go into depth on what your page is really about. Since you've already told the search engines what the general topic is about, with your page title, meta and h* tags; your p tags are for the nitty gritty. Your user is really here for these. So although,they aren't #1 on the list, they are definitely very important because without them, your page would just be a title. Let's not forget that creating a successful website is not just about making the search engines happy, but making the user happy as well. I mean that's why you're creating this to begin with.


6) Link (<a>) tags. Since web pages are broken down into multiple pages (preventing the need to have a page long enough to reach Pluto and back), we use links to allow the user to easily navigate to a particular section. An important factor to remember when creating links is to always remember to include the title attribute to further describe what the link is pointing to. Another good measure is to follow links with a more in-depth description of what it's about. For example, if you wanted to link to a page about options trading, you would want something like this:


<a title="Investing in Options" href="/options.html">Options Trading</a> Options trading can be a very lucrative venture. One must realize that in the stock market, there are normal stocks and there are options. Options allow one to have the "option" to purchase a stock at a certain date/time and strike price.

7) Link backs.

I was thinking about putting this higher in the list, but decided that 1-6 should be done much sooner than this one. Link backs should only be done once you have your website up and running. Once you get to this step, you'll find that it's very time consuming and tedious. Link backs are links from other sites that link to yours. It's like vouching for another web site. Just like in real life, someone vouching for someone else adds confidence. An important factor here is that the "importance" of the person vouching for you is most important. If I have a few CEO's from large companies saying that I'm a hard worker, wouldn't that be better than having 50 bumbs off of the street saying the same? So, getting other important web sites to link to your site will make you look a lot more appealing to search engines. The goal would be to find web sites with a high Google Page Rank and get a link on their site pointing to yours. I should point out, that if the link has a rel="nofollow" attribute in it, it will be useless in terms of SEO.
This process can be a little tricky though; there are a lot of companies that offer a way to "buy" link backs. I would not recommend going this route, as it can upset Google and you could end up getting a permanent ban from their search engine. Stick with the basics -- just get in contact with sites that you want links from, and just ask them. You'd be surprised how much just being nice can do for you. You may be asking, well how do I find these sites? Well a good starting point, would be to download a program like SEO Elite. This program shows you all the links that other sites (let's start with competitors) are getting. I would start out by contacting these sites and seeing if you can come to some arrangement to either trade links or provide some kind of service in return.
This should point you in the right direction. Once you start getting back links, you won't see much of an impact for several months (Google likes to take it's time when calculating Page Ranks mainly because of spam prevention purposes).

In conclusion, if you stick to what's natural, and just produce a quality web site that makes your users happy, you will most likely make the search engines happy. There's no magic involved here, so don't be fooled by companies pitching a "#1 on Google guarantee".

Good Luck!

C++ and Web Development

In my first few years as a young programmer, I thought the whole programming world revolved around C and C++. To best honest, they still are a very popular choice for a programming language. In terms of web development, it may not be, but I've been playing with the idea of using C++ to develop my next website. How can this be done you ask?

Well, there is a C++ class library that will definitely help you get started. GNU has release a library call cgicc. After checking it out, and creating a fairly bare-bones website, It definitely opened up. It offers functionality which will make dealing with HTTP headers and tags much easier, and speed up development time. The class offers functions to create any HTML tag you like. At first this sounded pretty cool, but after actually developing a pretty simple HTML page, I found out that it was pretty frustrating and tedious. Naturally, I'm (and I'm sure you are as well) used to developing HTML by just writing code in a flat file. So, I went searching, and found another C++ library that will let me do this. It's called ctemplate, and it's brought to you by Google Code. With this class, you can write all your HTML in a file and then make some simple function calls within your codebase to make use of it. It's like most templating systems, in that you can set variables, create sections (and conditionals), loops and all the like so that developing with typical structures is seamless and efficient.

Now to the good stuff...

Here is a sample file to get started:


main.cpp:

#include <iostream>
#include <string>
#include <stdio.h>
#include <stdlib.h>

#include <ctemplate/template.h>

using namespace std;

int main(int argc, char **argv) {
string content = "";

ctemplate::TemplateDictionary dict("example");
ctemplate::Template* tpl = ctemplate::Template::GetTemplate("main.tpl", ctemplate::DO_NOT_STRIP);

// If user is logged in
if(1 == 1) {
dict.ShowSection("USER");
dict.SetValue("USERNAME","John Doe");
}
else {
dict.ShowSection("NOUSER");
}

tpl->Expand(&content, &dict);

cout << content;

return 0;
}

Here is the corresponding template file (main.tpl):

<div id="content">

<div id="header">
{{#USER}}
<div id="userinfo">{{USERNAME}} | <a href="/logout.html">logout</a></div>
<ul>
<li><a href="/projects.ehtml">Projects</a></li>
</ul>
{{/USER}}
{{#NOUSER}}
<ul><li> </li></ul>
{{/NOUSER}}
</div>

</div>

Making the ShowSection call will show the HTML between {{#USER}} and {{/USER}}. Normal variables are just like {{USERNAME}} or {{VARIABLE}}.

Although this is a fairly trivial application, I think it shows that developing with C++ doesn't have to be brain surgery or pain-staking. With these tools one could easily develop a lightly dynamic website with ease. Now, if you want to get into database use, sessions, and all kinds of other stuff, then one has to put a little more elbow grease and crank out some classes. Although, I wouldn't be surprised if there were some already out there.