Jun 28 2014

Comparing String Values for Similarities

Category: C# | SQL Server � Administrator @ 12:41

Many moons ago I embarked on a proof-of-concept project to see if I could use SQL Server to perform a matching process.

It was successful but I still had a lingering suspicion that the underlying algorithm (Levenshtein) used to determine the sameness between human full names was less than optimal.  So what ensued was a period where I would look around the web, once in a while, to see if I missed something. 

I can't tell you how many algorithms I researched, converted to C# and tested.  Don't ask me why, but anyone who starts with Levenshtein will most likely never find this algorithm since so many others like Levenshtein vie for attention.

I finally found what I was looking for purely by accident:

Strike-A-Match: http://www.catalysoft.com/articles/StrikeAMatch.html

and kudos to the paste bin for the C# version: http://pastebin.com/EfcmR3Xx#

So now if you ever need to compare human full names whether they be in any order like:

  • last name, first name compared to first name, last name

Strike-A-Match will do the trick. Take a look at this comparison of a human name: "Jimi Hendrix" to "Hendrix Jimi"

Using Strike-A-Match will compute that these two are exactly equivalent.

Enough said.

I embedded this into A SQLCLR function and it works like a charm.

Levenshtein and all your brethren really don't get the job done when all you really want to do is compare for similarity.

The web has the brightest ideas but try to look in all the dark corners.

Tags: , ,

May 20 2012

SQL Server 2012 -- Full-Text Search (Matching Engine)

Category: � Administrator @ 06:48

Having been immersed in a Full-Text Search (FTS) proof-of-concept over the past two months I thought others would benefit from this experience.  First, let me start by saying that the way this came about was rather unusual.  A co-worker of mine had started to play with FTS and turned around one afternoon and asked me for some T-SQL help.  When I saw the language constructs of FTS that he was using I had what I call a h$!y s^!t moment.  The last time that I looked at using anything close was with SoundEx which was a mess for my purposes.  I quickly saw how I could make great use of this technology.  And I'm only scrapping the surface in what I'm doing but check out the video link below for more on what FTS is supposed to (documents) be used for:

http://channel9.msdn.com/Events/TechDays/Techdays-2012-the-Netherlands/2297

 

In my day job we do lots and lots of matching.  By that I mean we get in files with a person's name and a name of a piece of copyrighted material.  So internally we have a database of person's names and their related pieces of copyrighted material.  What we need to do is take the incoming data and find the match in our database.  Sounds easy right.  Well not with our current technology.

My thought was to take FTS and use it on the database of data we currently have and then take the incoming data and try to find the closest match.

If you think you should deploy this on SQL 2008R2 then you'll not benefit from the changes in 2012 which increased performance dramatically.  And performance is the key to this whole matching process.

In my matching tables, I have over 44 million rows to search, bring back results and score the results.  I was able to attain just over four attempted matches per second.  Not bad considering that I'm running the Developer edition on my local workstation with just two crappy SATA drives.  And don't worry about space consumption since for my test case of 44 million rows, I used just under one gig for the FT index after the population of the FT index completed.

Let start:

First you'll need to install FTS which is no big deal and I'll leave that to you.

 

Now the important steps:

  • Create a set of FileGroups (one for each table you'll be using to search/indexing.)  The best case scenario is to have many disks and split the table's clustered index on the searched tables away from the FT index.
-- Add FILEGROUP(s)
ALTER DATABASE Dmatch ADD FILEGROUP fgDmatchX;
GO
ALTER DATABASE Dmatch ADD FILEGROUP fgDmatchY;
GO

ALTER DATABASE Dmatch 
ADD FILE 
( NAME = DmatchdatX,
  FILENAME = 'X:\MSSQL\data\DmatchdatX.ndf',
  SIZE = 20000MB,
  MAXSIZE = 28000MB,
  FILEGROWTH = 1000MB)
TO FILEGROUP fgDmatchX

go

ALTER DATABASE Dmatch 
ADD FILE 
( NAME = DmatchdatY,
  FILENAME = 'Y:\MSSQL\data\DmatchdatY.ndf',
  SIZE = 20000MB,
  MAXSIZE = 28000MB,
  FILEGROWTH = 1000MB)
TO FILEGROUP fgDmatchY

go

--ALTER DATABASE Dmatch 
--REMOVE FILE DmatchdatX

--ALTER DATABASE Dmatch 
--REMOVE FILE DmatchdatY

 

  • Create a FTS Catalog:
USE Dmatch;
GO

--DROP FULLTEXT CATALOG DmatchWrkPtyCatalog
CREATE FULLTEXT CATALOG DmatchWrkPtyWrtCatalog WITH ACCENT_SENSITIVITY = OFF AS DEFAULT ;
GO
USE Dmatch;
GO

--DROP FULLTEXT CATALOG DmatchWrkPtyCatalog
CREATE FULLTEXT CATALOG DmatchWrkPtyPubCatalog WITH ACCENT_SENSITIVITY = OFF AS DEFAULT ;
GO

 

  • Create a StopList -- for this example I'm not going into what entirely I added and took away from the StopList but I'll include the SQL to show how:

 

--drop fulltext stoplist Dmatch1StopList;
GO
Create FullText StopList Dmatch1StopList from System Stoplist;
GO

  • Modify the StopList to suit your needs:
alter fulltext stoplist Dmatch1StopList  drop 'about' language 1033;
alter fulltext stoplist Dmatch1StopList  add 'company' language 1033;
 
  • Create a Full-Text Index

 

--DROP FULLTEXT INDEX ON tblWrkPtyWriterSearch

--Create FTS index on work and Pty name

CREATE FULLTEXT INDEX ON tblWrkPtyWriterSearch(WrkNa,PtyNa) 
   KEY INDEX ix1WrkPtySearch on ([DmatchWrkPtyWrtCatalog], FILEGROUP [ fgDmatchY])
   WITH STOPLIST = Dmatch1StopList;


--DROP FULLTEXT INDEX ON tblWrkPtyPublisherSearch

--Create FTS index on work and Pty name

CREATE FULLTEXT INDEX ON tblWrkPtyPublisherSearch(WrkNa,PtyNa) 
   KEY INDEX ix1WrkPtySearch on ([DmatchWrkPtyPubCatalog], FILEGROUP [fgDmatchX])
   WITH STOPLIST = Dmatch1StopList;
   
GO

 

 
  •  Check the status of the Population of the FT index
-- Number of full-text indexed items currently in the full-text catalog 
-- plus the population status and size:

SELECT 'DmatchWrkPtyWrtCatalog'
,FULLTEXTCATALOGPROPERTY('DmatchWrkPtyWrtCatalog', 'PopulateStatus') AS [Populate Status]
,FULLTEXTCATALOGPROPERTY('DmatchWrkPtyWrtCatalog', 'ItemCount')AS [Item Count]
, FULLTEXTCATALOGPROPERTY('DmatchWrkPtyWrtCatalog', 'IndexSize')AS [Size in MB];

SELECT 'DmatchWrkPtyPubCatalog'
,FULLTEXTCATALOGPROPERTY('DmatchWrkPtyPubCatalog', 'PopulateStatus') AS [Populate Status]
,FULLTEXTCATALOGPROPERTY('DmatchWrkPtyPubCatalog', 'ItemCount')AS [Item Count]
, FULLTEXTCATALOGPROPERTY('DmatchWrkPtyPubCatalog', 'IndexSize')AS [Size in MB];

 

  • Create procs and functions to return possible matches:

What I did here was to use the ContainsTable-SQL in conjunction with other functions to return a result of the top (x) matches.  I did the matching/scoring with help of the Levenshtein algorithm.

My overall match rate was just over 73% and of excellent quality, meaning that I didn't match to something that was not supposed to match.

I'll continue the next blog entry with the actual stored procedures and functions that really do the work.

Full-Text Search on SQL 2012 is a gift.

 

 

 

 

 

 

 

Tags: , , , , ,