“Pattern Matching” in Data Quality Services –Domain Rules

Guess what guys?  I’m speaking at PASS Summit this year!  My session titled Data Quality Services—Finally!    was selected and I get to go to the premier SQL Server conference and present on a topic which I find truly exciting.  Data Quality Services (DQS) is new with SQL Server 2012.  Along with Master Data Services (MDS), it represents Microsoft’s entry into Master Data Management (MDM).

The ideal use of MDS and DQS represent to me a paradigm shift in the way I.T. folks and business folks interact.  These tools are not just another way for I.T. to develop solutions for “the business” based on requirements (whether those requirements are thoroughly documented via a watershed lifecycle or verbally shared with agile).  DQS and MDS were explicitly designed for continued and iterative use by business users with expert Knowledge about their data.  In MDM speak, these experts are called Data Stewards.  I recommend that anyone in Data continue getting familiar with these concepts.  Data Governance and Master Data Management are terms you are likely to see from now on.  For more information I recommend David Loshin (http://mdmbook.com/) .  His book,  Master Data Management is a comprehensive guide to this topic.  I’ve also found useful articles here , including articles on ways for I.T. to get “buy in “ from C-levels.

So, back to today’s post.  As I was prepping my session for PASS Summit on Data Quality Services, I was extremely fortunate to be in communication with Matthew Roche and Matt Masson of Microsoft, who are also presenting on DQS  (and MDS– actually all of EIM) at Summit.  These two are both uber “ersum”, always willing to help me, answering emails at all hours of day and night.

Let’s dive in:  Keeping data clean with DQS starts with creating a Knowledge Base.  This is done in the Data Quality Client.  Inside a Knowledge Base you create and maintain Domains.  Single domains are the equivalent to a column or attribute in a database.

The Data Quality Client.

Domain properties are configured using a combination of five tabs in the Data Quality Client:

  1. Domain Properties –Data Type and other basic properties
  2. Domain Values –area where possible values, correct, invalid and incorrect are stored, as well as corrected values.
  3. Reference Data—area to map the domain value to Reference Data Sources, like those supplied by vendors such as Melissa Data.
  4. Term-Based Relations—area to configure strings within a domain and a replacement value.  Where the Domain Value would replace the entire content of an attribute, a term based relation would replace a part of it.  So if I have “Ave.”  related to “Avenue” as a term based relation, and Apple Ave comes through, the TBR will turn it into Apple Avenue.
  5. Domain Rules—area to create validation rules for values in the domain.
The Domain Tabs

Domain rules are pretty exciting to me.  (I might need to get out more though).  If I want to make sure my Canadian friends receive my chain letters (Matt Masson LOVES snail mail chain letters), I could create a domain for Canadian Postal Codes:  For a domain with a string data type, you get a plethora of options for defining valid data:

But wait, there’s even more:

Most of these are self-explanatory.  But I wanted to know what type of wildcard symbols were used with the “pattern” choices.  The definition on MSDN had a definition for it, but not a complete one:

BOL definition is not quite complete.

So I emailed Matthew Roche with the question, and he sent me a complete (unofficial but trustworthy, no guarantee expressed or implied ) definition within a few hours from the DQS dev team.  Many thanks to Matthew and the team:

The rules of the patterns are as following:

 

  1. Any letter (not only A) is considered as a pattern for any letter
  2. Any digit (not only 1) is considered as a pattern for any digit
  3. Any special character (not a letter and not a digit) is considered as pattern for itself
  4. Square brackets define optional matching

 Examples:

 Pattern for the current standard of Washington vehicle license: AAA-1111

1.      The same pattern can be written as ABC-1234 and will present the same pattern
2.      “-“ sign represents itself, while letters and digits represent any letters and digits
3.      Matching samples: AIA-4831

4.    Not matching samples: AIA-2A34 (not a digit), AIA+3456 (wrong special character)
 Let us say the new license standard will allow 4 or 5 digits in the end. Here we can use square brackets to define optional part.

New pattern would be: AAA-1111[1]

Matching samples: AIA-4831 (without additional digit), AIA-34561 (with additional digit)

So back to my domain rule.  The format for Canadian postal codes is:  letter-number-letter space number-letter-number.

Using A for a letter and 1 for a number, I could define my rule as this:  A1A 1A1

But I could get silly and also define it like this, where H is the letter and 0 (zero) is the number:  H0H 0H0 (this is actually the postal code used for Santa Claus in Canada—pretty cute eh?)

With that, the following value would be valid: V5S 3Y2

While this value would be invalid:  VSS 3Y2  — “S” is a letter, not a number.

And running data through a cleansing project, both valid and invalid would go something like this:

Note that some of these are invalid as “Domain Value” because they were in the kb previously as invalid.  The ones designated as failed rule, were new values to the kb.

I hope this “sneak peek” of DQS has been of interest to you.  And I hope to see you all in Seattle!

Resources:

http://blogs.msdn.com/b/dqs/archive/2012/03/09/one-stop-resource-for-data-quality-services-dqs-in-sql-server-2012.aspx

Channel 9 videos of TechEd Elad Ziklik, Matt Masson, Matthew Roche

Published by

Julie Smith

One half of the Datachix.com. ETL person. This picture is the best of me ever taken. Don't be too sad when you meet me for real :)

5 thoughts on ““Pattern Matching” in Data Quality Services –Domain Rules”

  1. Hi Julie! This is Sarah Boswell. I’m a consultant, and I’m using DQS for my current project. I am trying to see if I can change a value based on a pattern. This is specifically for Social Security Numbers, where I need to change the values from 111111111 format to 111-11-1111 format. Do you know if that can be done with DQS?

    1. HI Sarah,

      In DQS, you could make a rule so that the values are invalid if they lack the dashes. It is not the easiest place to do the transformation however. In order to do the transforms in DQS, you would have to put every possible ssn in the dqs domain values.

Leave a Reply