Using Fuzzy Grouping in SQL Server Integration Services for removing duplicates, fixing spelling errors, and converting abbreviations to full job titles

I had a dataset with this list of occupations:

CUST SERVICE REP.
CUSTOMER SERVICE REPRESENTATIVE
SHIFT LEAD
REGION MANAGER
ASSISTANT
REGIONAL MANAGER
STORE MGR.
CUSTOMER SERVICE REP
CUSTOMER SERVICE REPRESENTATIVE
STORE MANAGER
DISTRICT MANAGER
ASSIST. MANAGER
CUTSOMER SERIVCE REPRESENTATIVE
SHIFT LEADER
REGIONAL MANAGER
STORE MANAGER
DISTRICT MGR.
REGIONAL MGR
SHIFT LEADER
CUSTOMER SERIVCE REP
ASSISTANT MANAGER
DISTRICT MANAGER

In the notation a -> b meaning a should be replaced with b, with the data I received, I wanted the system to do the following for me. I also wanted to return the list of distinct values after applying the replacement aforementioned:

CUST SERVICE REP.               -> CUSTOMER SERVICE REPRESENTATIVE
CUSTOMER SERVICE REPRESENTATIVE -> CUSTOMER SERVICE REPRESENTATIVE
SHIFT LEAD                      -> SHIFT LEADER
REGION MANAGER                  -> REGIONAL MANAGER
ASSISTANT                       -> ASSISTANT MANAGER
REGIONAL MANAGER                -> REGIONAL MANAGER
STORE MGR.                      -> STORE MANAGER
CUSTOMER SERVICE REP            -> CUSTOMER SERVICE REPRESENTATIVE
CUSTOMER SERVICE REPRESENTATIVE -> CUSTOMER SERVICE REPRESENTATIVE
STORE MANAGER                   -> REGIONAL MANAGER
DISTRICT MANAGER                -> DISTRICT MANAGER
ASSIST. MANAGER                 -> ASSISTANT MANAGER
CUTSOMER SERIVCE REPRESENTATIVE -> CUSTOMER SERVICE REPRESENTATIVE
SHIFT LEADER                    -> SHIFT LEADER
REGIONAL MANAGER                -> REGIONAL MANAGER
STORE MANAGER                   -> STORE MANAGER
DISTRICT MGR.                   -> DISTRICT MANAGER
REGIONAL MGR                    -> REGIONAL MANAGER
SHIFT LEADER                    -> SHIFT LEADER
CUSTOMER SERIVCE REP            -> CUSTOMER SERVICE REPRESENTATIVE
ASSISTANT MANAGER               -> ASSISTANT MANAGER
DISTRICT MANAGER                -> DISTRICT MANAGER

My dataset was FuzzyGroupingExample.txt. I started by created my my reference table from that database. I opened Visual Studio and created a new project:

Create new project

I provided a name and location for my new project:

Provide name and location for new project

I created a Data Flow Task:

Create Data Flow Task

I right-clicked the Connection Managers area and then clicked New Flat File Connection...

Create New Flat File Connection

I provided a name and chose the file to use (FuzzyGroupingExample.txt). Then I clicked Columns to define the columns to use for this connection manager:

Flat File Connection Manager Editor

Notice the warning in the image above: Columns are not defined for this connection manager. It disappeared when clicking Columns and going back to General without further actions. From the Columns section, I confirmed that my data was displaying correctly, and clicked OK.

Confirm that data is displaying correctly

My connection was created successfully. I double clicked my Data Flow Task to open it:

Connection created successfully

I created a Flat File Source and double clicked to open it.

Create Flat File Source

From the Flat File Source Editor, I clicked Columns:

Click Columns from Flat File Source Editor

I unchecked LASTNAME and FIRSTNAME because I was creating my reference table, which should not have employee names, but only occupations. I renamed the Output Column to OccupationLabel. Then I clicked OK:

Uncheck columns and define Output Column

I created another connection, this time a New OLE DB Connection...

New OLE DB Connection

I chose the database that I wanted to use and clicked OK:

Configure OLE DB Connection Manager

I renamed my connection to SQLServerDatabase:

New OLE DB Connection created successfully

I created a Fuzzy Grouping Transformation and connected the arrow from Flat File Source - Occupation Load to Fuzzy Grouping. Then I double clicked Fuzzy Grouping to open it.

Create Fuzzy Grouping Transform

I clicked the Columns tab:

Click Columns from Flat File Source Editor

I checked OccupationLabel and clicked the Advanced tab:

Choose columns from Fuzzy Grouping Transformation Editor

I unchecked the Tab and Line feed Token delimiters. I replaced ,.;:-"'&/\@!?()<>[]{}|#*^% with . in the Additional delimiters textbox. I knew I had spaces and periods in the values of the TITLE column of my FuzzyGroupingExample.txt file. For that reason, I included both spaces and periods as Token delimiters in my configurations. Since I knew I did not have any other special characters in my data, I did not specify additional Token delimiters. I changed the Similarity threshold from 0.8 to 0.5. Finally, I clicked OK.

Specify Token delimiters

In order to specify which rows to return, I used the Conditional Split Transformation. Each of those returned rows represent groups. For example, CUST SERVICE REP., CUSTOMER SERVICE REPRESENTATIVE, CUSTOMER SERVICE REP, CUTSOMER SERIVCE REPRESENTATIVE, and CUSTOMER SERIVCE REP are all part of the same group whose only returned row is CUSTOMER SERVICE REPRESENTATIVE. I created a Conditional Split Transformation and connected the arrow from Fuzzy Grouping to Conditional Split. Then I double clicked Conditional Split:

Add Conditional Split Transform

I specified my condition in the Default output name textbox. I replaced Conditional Split Default Output with Delete. Then I clicked OK:

Define condition from Conditional Split Transformation Editor

I created an OLE DB Destination and connected the arrow from Conditional Split to OLE DB Destination. The Input Output Selection pop-up appeared:

Insert OLE DB Destination

I chose Best Match as my Output and clicked OK:

Choose BestMatch as Input Output Selection

The arrow from Conditional Split to OLE DB Destination appeared correctly with the Best Match label on it. I double clicked OLE DB Destination to open it:

Best Match arrow displayed

I clicked the New... button and in the Create Table box, I provided the code to create the destination table to hold the final output. Then I clicked OK:

CREATE TABLE [Occupation_FuzzyGrouping] (
  [OccupationID] [smallint] IDENTITY(1,1) NOT NULL,
  [OccupationLabel] [varchar](50) NOT NULL
)
Create table to hold final output

Name of the table or the view did not seem to have been populated properly. What I saw was [Occupation_FuzzyGrouping] (. That opening parenthesis did not look correct as part of a table or view name. I clicked Mappings. Notice the warning: Map the columns on the Mappings page.

Table or view created for OLE DB Destination

I was not allowed to go to the Mappings section. The following warning appeared.

Warning when trying to navigate to Mappings (part 1 of 2):

Table or view created for OLE DB Destination

Warning when trying to navigate to Mappings (part 2 of 2):

Table or view created for OLE DB Destination

I went to the database and discovered that the table had been created successfully:

Table created successfully

In the value for Name of the table or the view, I replaced [Occupation_FuzzyGrouping] ( with [dbo].[Occupation_FuzzyGrouping]. The warning changed back to Map the columns on the Mappings page. Then I clicked Mappings:

Mappings working properly

The mapping was wrong because it was linking OccupationLabel from Available Input Columns to OccupationLabel from Available Destination Columns:

Wrong mapping

I fixed the mappings by linking OccupationLabel_clean from Available Input Columns to OccupationLabel from Available Destination Columns. Then I clicked OK:

Fix mapping

The explanation of why the mapping was wrong and it had to be fixed, was because the OccupationLabel Output Alias returns several rows, while OccupationLabel_clean Group Output Alias returns the grouping of all those rows. For example, OccupationLabel Output Alias would return CUST SERVICE REP., CUSTOMER SERVICE REPRESENTATIVE, CUSTOMER SERVICE REP, CUTSOMER SERIVCE REPRESENTATIVE, and CUSTOMER SERIVCE REP; while OccupationLabel_clean Group Output Alias would only return CUSTOMER SERVICE REPRESENTATIVE:

Explain difference between Output Alias and Group Output Alias

I right clicked on the package and then clicked Execute Package:

Execute Package

The execution was successful, returning 6 rows from the original input of 22 rows:

Successful execution

I clicked Progress and scrolled all the way down to see how the execution was completed successfully. Its elapsed time was 21.563 seconds:

Show elapsed time

The final output contains only 6 rows, without duplicates, which were written to the database:

Final result written to database

What happens behind the scenes? In order to find out, I deleted the Conditional Split Transformation. I connected Fuzzy Grouping to OLE DB Destination. Then I double clicked Fuzzy Grouping:

Delete Conditional Split Transformation

I clicked the Advanced tab:

Click the Advanced tab

I changed the Similarity threshold to 0.00 and clicked OK:

Change Similarity threshold to 0

I deleted my Occupation_FuzzyGrouping table that I had previously used as my destination of final results.

After deleting it, I created a table with the same name, but with another set of fields. In order to do that, I started by double clicking OLE Destination DB:

Double click OLE Destination DB

I clicked New...:

Click New...

I provided the following code in the Create Table pop-up box:

CREATE TABLE [Occupation_FuzzyGrouping] (
  [_key_in] int,
  [_key_out] int,
  [_score] real,
  [OccupationLabel] varchar(50),
  [OccupationLabel_clean] varchar(50),
  [_Similarity_OccupationLabel] real
)

I clicked OK. As soon as the OK button was clicked, the new table was created in the database:

Provide code to create new table

I confirmed from the database that my table was created successfully:

New table created successfully

I clicked Mappings:

Click Mappings

I did not change anything. The mappings were already correct as I needed it, autimatically done by the tool based on exactly matching names. I clicked OK:

See Mappings and click OK

Click Start to execute the package:

Click Start to execute the package

From an input of 22 rows, the output is also 22 rows:

Show output of 22 rows

This was the result in the database:

Result in database with 0.00 Similarity threshold

I deleted all rows from the Occupation_FuzzyGrouping table. I changed the Similarity threshold to 0.99:

Change Similarity threshold to 0.99

I executed the package again. This was the new result in the database:

Results with 0.99 Similarity threshold

I deleted all rows from the Occupation_FuzzyGrouping table. I changed the Similarity threshold to 0.5:

Change Similarity threshold to 0.5

I executed the package again. This was the new result in the database:

Final result with 0.5 Similarity threshold

Notice how all values in _Similarity_OccupationLabel are higher than 0.5. That is because 0.5 was used as the Similarity threshold from the Fuzzy Grouping Transformation Editor. Observe how the query SELECT DISTINCT OccupationLabel_clean FROM Occupation_FuzzyGrouping returns the same 6 rows that had previously been obtained from the SSIS package, when writing to the Occupation_FuzzyGrouping table that had only two fields: OccupationID and OccupationLabel. Modifying it and writing to a table with more fields, I did it with the purpose of showing and explaining a little deeper part of what happends behand the scenes to obtain the desired final output:

Show final output

It almost feels like a spelling correction system, where if the value in the left column is not exactly the same as the value in the right column, we could ask, "Did you mean this?" offering the value in the right column as a suggestion:

Did you mean this feature to suggest values
Published: 5:59 AM GMT · Dec 12, 2023