Data Deduplication - Melissa UK
Data deduplication is the process of matching and merging duplicate customer records that build up as data streams into a company’s systems from multiple channels and sources. With more than one communication channel, pieces of the same information can end up spread across different fields, records and databases. This can leave a system or an entire record inaccurate or inconsistent, which undermines the single customer view.

What is Duplicate Data?

Duplicate data is multiple records that refer to the same individual. For example, Miss Elizabeth Tailor of 123 Albion Street, London may also appear in the same database as Mrs Tailor of 123 Albion St, London. 10% of a business database will accumulate duplicate data in a year as people move, change marital status, or change email addresses and phone numbers, to name just a few causes.

Duplicate data can also enter a company’s database through the multiple points of contact customers use to communicate with the organisation. For example, a customer may give a call centre different information than they would give online, leaving pertinent information either uncollected or stored in a separate account that is not linked to their original one. The resulting information can be inconsistent, and as it spreads across an organisation’s departments it can affect the single customer view and “golden record” initiatives.

Consider all the noise that comes with customer data entering systems, such as:

  1. Bad data where people have keyed in incorrect information
  2. Missing information/data
  3. Formatting issues, such as standardisation
  4. Mis-fielded data (data in the wrong fields)
  5. Structural differences for information in different databases or systems, potentially causing duplicate accounts for the same customer
  6. Conflicting information for multiple customers with similar demographics and determining which is right, wrong, or if they are separate accounts
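Formatting noise of the kind listed above is usually reduced before any matching takes place. The following is a minimal Python sketch of that idea, not a description of any particular product: the function name `standardise_address` and the small abbreviation table are assumptions made for illustration.

```python
import re

# Hypothetical abbreviation table for illustration; a real one would be far larger.
STREET_ABBREVIATIONS = {"st": "street", "rd": "road", "ave": "avenue"}


def standardise_address(raw: str) -> str:
    """Lower-case, strip punctuation, collapse whitespace and expand abbreviations."""
    cleaned = re.sub(r"[^\w\s]", " ", raw.lower())  # punctuation becomes spaces
    tokens = cleaned.split()                        # split() also collapses repeated spaces
    return " ".join(STREET_ABBREVIATIONS.get(token, token) for token in tokens)


a = standardise_address("123 Albion Street, London")
b = standardise_address("123 Albion St, London")
print(a == b)  # True: both become "123 albion street london"
```

A naive table like this has limits: "St" can also mean "Saint", so real standardisation needs context about where in the address a token appears.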

This lack of data quality can cause costly issues for organisations, which is why it is crucial to deduplicate your data regularly as part of the data quality life cycle.

Learn more about data quality in our What is Data Quality article.

Figure: Data deduplication example showing duplicate entries of the same data

How Does Data Deduplication Work?

Data deduplication tools match and merge existing data by using a match code to determine whether two records should be considered duplicates. The following fields can be used to identify duplicates:

  • Prefix
  • First Name
  • Middle Name
  • Last Name
  • Suffix
  • Gender
  • First/Nickname
  • Middle/Nickname
  • Department/Title
  • Company
  • Company Acronym
  • Street Pre-Directional
  • Street Name
  • Street Suffix
  • Street Post-Directional
  • PO BOX
  • Street Secondary
  • Address
  • City
  • State/Province
  • Postal Code
  • Street Number
  • ZIP Code
  • Country
  • Phone/Fax
  • Email Address
  • Credit Card Number
  • Date
  • Numeric
  • Proximity
  • General ID
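To make the idea of a match code concrete, here is a minimal Python sketch that builds a key from a handful of the fields above and treats two records as duplicates when their keys are equal. The `Contact` class, the `match_key` function, the choice of fields and the sample postcode are all assumptions for illustration; they do not describe how any specific tool builds its match codes.

```python
from dataclasses import dataclass


@dataclass
class Contact:
    first_name: str
    last_name: str
    street_number: str
    street_name: str
    postal_code: str


def match_key(contact: Contact) -> str:
    """Hypothetical match code: surname, first initial, street number, street name, postcode."""
    parts = [
        contact.last_name.strip().lower(),
        contact.first_name.strip().lower()[:1],       # first initial only
        contact.street_number.strip(),
        contact.street_name.strip().lower(),
        contact.postal_code.replace(" ", "").lower(),  # ignore spacing in the postcode
    ]
    return "|".join(parts)


def is_duplicate(a: Contact, b: Contact) -> bool:
    """Two records are considered duplicates when their match codes are identical."""
    return match_key(a) == match_key(b)


# Made-up sample records; the postcode is invented for the example.
first = Contact("Elizabeth", "Tailor", "123", "Albion Street", "E1 1AA")
second = Contact("E.", "Tailor", "123", "albion street", "e11aa")
print(match_key(first))             # tailor|e|123|albion street|e11aa
print(is_duplicate(first, second))  # True
```

Which fields go into the key is a trade-off: more fields reduce false matches but miss duplicates where one field differs slightly.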

A data deduplication tool can also use "fuzzy matching" algorithms, combined with deep domain knowledge of contact data, to match similar records and quickly dedupe your database.

The following fuzzy matching algorithms can identify "non-exact" matches between duplicate records:

  • Phonetics
  • Consonants Only
  • Smith-Waterman-Gotoh
  • Soundex
  • Alphas Only
  • Jaccard Similarity Coefficient
  • Containment
  • Numerics Only
  • Overlap Coefficient
  • Frequency
  • MD Keyboard
  • Longest Common Substring
  • Fast Near
  • Jaro
  • Double Metaphone
  • Accurate Near
  • Jaro-Winkler
  • Frequency Near
  • n-Gram
  • UTF-8 Near
  • Needleman-Wunsch
  • Vowels Only
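Three of the algorithms named above are simple enough to sketch in a few lines of Python: Soundex, the Jaccard similarity coefficient and the overlap coefficient (the last two computed here over character bigrams). These are textbook versions written for illustration, with function names chosen for this example; a commercial tool's implementations and tuning may differ.

```python
SOUNDEX_CODES = {
    **dict.fromkeys("bfpv", "1"),
    **dict.fromkeys("cgjkqsxz", "2"),
    **dict.fromkeys("dt", "3"),
    "l": "4",
    **dict.fromkeys("mn", "5"),
    "r": "6",
}


def soundex(name: str) -> str:
    """Classic four-character Soundex code (first letter plus three digits)."""
    letters = [ch for ch in name.lower() if ch.isalpha()]
    if not letters:
        return ""
    code = letters[0].upper()
    previous = SOUNDEX_CODES.get(letters[0], "")
    for ch in letters[1:]:
        digit = SOUNDEX_CODES.get(ch, "")
        if digit and digit != previous:
            code += digit
        if ch not in "hw":  # h and w do not separate repeated codes; vowels do
            previous = digit
    return (code + "000")[:4]


def ngrams(text: str, n: int = 2) -> set:
    """Set of character n-grams of the lower-cased text."""
    s = text.lower()
    return {s[i:i + n] for i in range(len(s) - n + 1)}


def jaccard(a: str, b: str, n: int = 2) -> float:
    """Shared n-grams divided by all distinct n-grams in either string."""
    x, y = ngrams(a, n), ngrams(b, n)
    if not x and not y:
        return 1.0
    return len(x & y) / len(x | y)


def overlap_coefficient(a: str, b: str, n: int = 2) -> float:
    """Shared n-grams divided by the size of the smaller n-gram set."""
    x, y = ngrams(a, n), ngrams(b, n)
    if not x or not y:
        return 0.0
    return len(x & y) / min(len(x), len(y))


print(soundex("Tailor"), soundex("Taylor"))                         # T460 T460
print(round(jaccard("Albion Street", "Albion St"), 2))              # 0.67
print(overlap_coefficient("Albion Street", "Albion St"))            # 1.0
```

The example shows why several measures are useful together: "Albion St" is fully contained in "Albion Street", so the overlap coefficient scores it as a perfect match while Jaccard penalises the extra characters.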

The Best Practice For Deduplicating Your Data

You can deduplicate your data in two ways. The first is as part of data cleansing, a process in which a company scrubs its data against multiple sources to ensure it is accurate, verified and compliant. The second is through a deduplication API built into an organisation’s systems, which works in the background to keep the data deduplicated over time.

Learn more about data cleansing in our What is Data Cleansing article.

There are three methods for achieving data deduplication.

  1. Read / Write Deduping
    This process compares records in one or more databases at once. Within each group of matching records, one record receives an “output” status and the other matching records receive a “duplicate” status. This is ideal for matching entire databases at one time.
  2. Incremental Deduping
    Incremental deduping enables real-time matching by comparing each record as it comes in (e.g. from a webform or a call centre) against the existing master database. If the incoming record is not a duplicate, a new record will be created.
  3. Hybrid Deduping
    Hybrid deduping provides a combination of the first two methods, with the flexibility to customise the process of matching an incoming record against a small cluster of potential matches. Hybrid deduping allows you to store the match keys in a proprietary manner. This is ideal for real-time data entry or batch processing of entire lists.
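The first two methods above can be sketched in Python as follows. This is a minimal illustration under stated assumptions: records are plain dictionaries, the key function `simple_key` (surname plus postcode) is a deliberately crude stand-in for a real match code, and the names `batch_dedupe` and `IncrementalDeduper` are invented for this example.

```python
def simple_key(record: dict) -> str:
    """Hypothetical match key: lower-cased surname plus postcode without spaces."""
    surname = record["last_name"].strip().lower()
    postcode = record["postcode"].replace(" ", "").lower()
    return f"{surname}|{postcode}"


def batch_dedupe(records: list) -> list:
    """Read/write style: the first record of each group is 'output', the rest 'duplicate'."""
    seen = set()
    statuses = []
    for record in records:
        key = simple_key(record)
        if key in seen:
            statuses.append("duplicate")
        else:
            seen.add(key)
            statuses.append("output")
    return statuses


class IncrementalDeduper:
    """Incremental style: check each incoming record against the master data."""

    def __init__(self) -> None:
        self.master = {}  # match key -> stored record

    def submit(self, record: dict) -> bool:
        """Return True if a new record was created, False if it was a duplicate."""
        key = simple_key(record)
        if key in self.master:
            return False
        self.master[key] = record
        return True


# Made-up sample data; the postcodes are invented for the example.
records = [
    {"last_name": "Tailor", "postcode": "E1 1AA"},
    {"last_name": "tailor", "postcode": "e11aa"},
    {"last_name": "Smith", "postcode": "E1 1AA"},
]
print(batch_dedupe(records))  # ['output', 'duplicate', 'output']

deduper = IncrementalDeduper()
print([deduper.submit(r) for r in records])  # [True, False, True]
```

A hybrid approach could build on the same stored key index, using it to pull a small cluster of candidate records and then applying a customised comparison to only those candidates.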

Why Melissa?

Melissa has been helping businesses improve their data quality for over 35 years with smart solutions that correct, verify, update and enrich customer data. Our full spectrum of data quality solutions gives businesses the tools they need to maintain clean, current and consistent data for more efficient operations and improved marketing and sales efforts. Melissa’s Data Quality Suite instantly verifies contact data at the point of entry for over 240 countries and territories, with flexible tools that are available as on-premise APIs or Web services to meet your specific needs.