Dirty data, or unclean data, is any data that contains inaccurate, incomplete, inconsistent, or outdated information. While these errors are usually tiny (think “Mr. Smith” vs. “Mr. Smyth” in a letter’s greeting) and caused by human error, dirty data can have far-reaching consequences, especially for data-critical industries such as finance and healthcare.
Bad data is estimated to cost the US economy around $3.1 trillion (Forbes) in lost productivity, system outages, and higher maintenance costs every year. Experts project that this number will only increase in the next few years, especially as an estimated 463 exabytes of data will be created each day globally by 2025 (World Economic Forum).
To clarify, an exabyte is one billion billion, or one quintillion, bytes. To put this into further perspective, the Commonwealth Scientific and Industrial Research Organisation (CSIRO) in Australia is planning to upgrade its Square Kilometre Array (SKA), a next-generation radio telescope, to generate 300 petabytes of data per year in the next decade. Considering that 1 petabyte is only 0.001 exabyte, and that the SKA observes celestial objects light-years away, even this pales in comparison to the vast amount of data we produce, and will keep producing, each day on Earth.
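The comparison is easier to see as arithmetic. The short sketch below assumes decimal (SI) units, where 1 petabyte is 10^15 bytes and 1 exabyte is 10^18 bytes, and uses only the two figures quoted above. The variable names are illustrative.

```python
# Decimal (SI) byte units: 1 PB = 10**15 bytes, 1 EB = 10**18 bytes.
PETABYTE = 10**15
EXABYTE = 10**18

ska_per_year_pb = 300      # SKA figure quoted in the text (petabytes per year)
global_per_day_eb = 463    # global figure quoted in the text (exabytes per day)

# Convert the SKA's yearly output to exabytes: 300 PB = 0.3 EB.
ska_per_year_eb = ska_per_year_pb * PETABYTE / EXABYTE

# How many years of SKA output fit into one day of global data creation.
years_of_ska_output = global_per_day_eb / ska_per_year_eb

print(f"SKA output: {ska_per_year_eb} EB per year")
print(f"One day of global data is about {years_of_ska_output:,.0f} years of SKA output")
```

Under these assumptions, a single day of projected global data creation is on the order of 1,500 years of the telescope's output.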
So, while a misspelling may seem harmless, the millions of Mr. Smiths who receive an invoice or letter addressed to a Mr. Smyth from the companies they deal with may have a different opinion, and that irritation could ultimately lead to lost sales.
How does data get dirty?
1. Human error
The most common reason data gets dirty is human error. While the well-loved phrase “No one is perfect” is meant to soothe people as they make mistakes in life, it applies just as well to slip-ups in data entry, such as typos. Over time, these human errors can pile up and slowly compromise the integrity of your otherwise reliable data. Human error is also one of the leading causes of cybersecurity vulnerabilities.
You can’t eliminate human imperfection, but there are many ways to mitigate this risk. For example, you can train your employees to always double-check their work before submitting it. Even then, it’s worth creating a process in which an editor or proofreader checks the same entries to confirm their validity.
2. Fake customer entries
Have you ever intentionally entered the wrong name or email address because you didn’t want a company to gain private information? You are not alone. Your customers don’t owe you their information, and many will not willingly give you their sensitive information if they don’t trust you.
The best way to reduce this risk is to build client trust. Be as transparent with them as possible, and never use black-hat practices to extract information from prospects. Be genuine: That’s the best way to improve your trust rating.
3. No data strategy
It’s important that your departments are not siloed, especially if they share data points. The lack of a data collection strategy can lead to a careless approach to treating your customers and their data. For example, if your marketing team needs to interview the same people as your sales team, both teams must coordinate so that no one is asked for the same information twice. This also ensures consistent messaging in your branding.
It may be a good idea to assign a data checker within your organization to double-check all data points, even across teams.
4. No data audits
The truth is that every organization is likely to have some level of bad data at some point, particularly if the company is rapidly expanding. Your website is a perfect example of this. For instance, you may say you serve X number of people on your website, which would have been perfectly accurate when the website went live. However, if your company grows, this number could be inaccurate in two, six, or however many months.
Proactively auditing your data is vital to maintaining reliable records. In this age of GDPR, HIPAA compliance, and other increasingly strict consumer privacy laws, the importance of conducting regular data audits cannot be overstated.
Examples of dirty data
1. Duplicate data
This refers to any data that partially or fully shares the same information. This typically occurs when the same information is entered multiple times, usually in different formats. For example, a customer may call numerous times and be received each time by a different IT technician who types their name slightly differently. Duplicate data can look like this:
- Raine Grey
- Raine Gray
- Rain Grey
- Reine Grey
- Rainey Grey
Duplicate data may also be considered redundant data, which occurs when data between teams is not synced. Thus, even if the system refers to one person (such as Raine Grey, the author of this article), I would show up as five different people.
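Near-duplicates like the name variants above can be surfaced with fuzzy string matching. The following is a minimal sketch using `difflib.SequenceMatcher` from Python's standard library. The function names and the 0.8 threshold are assumptions chosen for illustration, not a recommended setting, and any flagged pair should still be reviewed by a person before records are merged.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Return a 0..1 similarity score for two names, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def find_likely_duplicates(names, threshold=0.8):
    """Compare every pair of names and keep those scoring at or above threshold.

    The threshold is an assumed starting point; tune it against real data.
    """
    pairs = []
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            score = similarity(first, second)
            if score >= threshold:
                pairs.append((first, second, round(score, 2)))
    return pairs


names = [
    "Raine Grey",
    "Raine Gray",
    "Rain Grey",
    "Reine Grey",
    "Rainey Grey",
    "John Smith",  # unrelated record, added here for contrast
]

pairs = find_likely_duplicates(names)
for first, second, score in pairs:
    print(f"{first!r} ~ {second!r} (score {score})")
```

Pairwise comparison grows quadratically with the number of records, so this approach suits small lists or a pre-filtered subset rather than an entire customer database.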
2. Incomplete data
This is data that lacks information. For example, if you ask a prospect for their complete name for your email newsletter but don’t indicate that these fields are mandatory, you may have only a first or last name, making your email campaign less personalized.
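A completeness check can catch these gaps before a campaign goes out. The sketch below assumes each sign-up is stored as a dictionary. The field names (`first_name`, `last_name`, `email`) and the sample records are hypothetical.

```python
# Hypothetical list of fields the newsletter needs for personalization.
REQUIRED_FIELDS = ("first_name", "last_name", "email")


def missing_fields(record, required=REQUIRED_FIELDS):
    """Return the required fields that are absent, None, or blank."""
    missing = []
    for field in required:
        value = record.get(field)
        if value is None or not str(value).strip():
            missing.append(field)
    return missing


signups = [
    {"first_name": "Raine", "last_name": "Grey", "email": "raine@example.com"},
    {"first_name": "Raine", "last_name": "", "email": "raine@example.com"},
    {"email": "anon@example.com"},
]

# Map each incomplete record's position to the fields it lacks.
incomplete = {}
for index, record in enumerate(signups):
    gaps = missing_fields(record)
    if gaps:
        incomplete[index] = gaps

print(incomplete)
```

The same rule is best enforced in the sign-up form itself by marking the fields as mandatory, with a check like this one acting as a safety net for data that has already been collected.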
3. Inaccurate data
Inaccurate data is misleading information or any data that contains mistakes. On some occasions, inaccurate data can also be duplicate data, which would require you or one of your team members to manually check each data entry to find the true one.
4. Outdated data
Outdated data is any data that used to be accurate but is no longer valid for whatever reason. Common examples of this are old email addresses and changes of titles (e.g., Ms. to Mrs. or Mr. to Dr., etc.). This is why regular data audits are especially important.
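One way to make such audits routine is to store the date each record was last verified and flag anything older than a chosen review window. This is a minimal sketch. The `last_verified` field, the one-year window, and the sample contacts are assumptions for illustration.

```python
from datetime import date, timedelta

# Assumed review window: re-verify any record older than one year.
MAX_AGE = timedelta(days=365)


def is_stale(last_verified: date, today: date, max_age: timedelta = MAX_AGE) -> bool:
    """Return True if the record was last verified more than max_age ago."""
    return today - last_verified > max_age


contacts = [
    {"email": "old.address@example.com", "last_verified": date(2023, 1, 15)},
    {"email": "recent.address@example.com", "last_verified": date(2024, 3, 1)},
]

audit_date = date(2024, 6, 1)  # fixed date so the example is reproducible
stale = [c["email"] for c in contacts if is_stale(c["last_verified"], audit_date)]

print(stale)
```

A stale flag does not mean the record is wrong, only that it is due for confirmation, for example by asking the customer to verify their details.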
5. Insecure data
This is any data vulnerable to a cyber threat, such as spear phishing. Insecure data points are not encrypted by any security protocol or are not protected by multifactor authentication. Essentially, insecure data can be accessed by anyone in your company.
How to clean your data
Data management can be simple if you have the necessary tools and resources. Most importantly, you must be steady in your commitment to auditing your customer data regularly to know where to begin and what to do. After all, you don’t know what you don’t know.
This usually starts with a data warehouse, a centralized repository that provides a unified view of all of an organization’s data. From here, you gain a more comprehensive understanding of the scope of potential issues and can determine the severity of each. This process of discovering patterns in your data falls under the umbrella of data mining.
You can then develop action plans to resolve any detected dirty data. Typically, this is done manually, but some IT teams may use Microsoft Excel. You may also consider the tools and software available in the market today that help you identify and clean dirty data.
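For teams that prefer a script to a spreadsheet, a basic cleaning pass might look like the sketch below. It assumes records are dictionaries with `name` and `email` keys (hypothetical field names), normalizes whitespace and letter case, and then keeps only the first record seen for each email address. Keeping the first record is an arbitrary policy chosen for illustration, so a real action plan should decide which copy is authoritative.

```python
def normalize(record):
    """Collapse stray whitespace in the name and lower-case the email."""
    return {
        "name": " ".join(record["name"].split()),
        "email": record["email"].strip().lower(),
    }


def clean(records):
    """Normalize every record and drop later records that repeat an email."""
    seen_emails = set()
    cleaned = []
    for record in records:
        normalized = normalize(record)
        if normalized["email"] in seen_emails:
            continue  # duplicate of a record already kept
        seen_emails.add(normalized["email"])
        cleaned.append(normalized)
    return cleaned


raw_records = [
    {"name": "  Raine   Grey ", "email": "Raine.Grey@Example.com "},
    {"name": "Raine Grey", "email": "raine.grey@example.com"},
    {"name": "John Smith", "email": "john.smith@example.com"},
]

cleaned_records = clean(raw_records)
for record in cleaned_records:
    print(record)
```

Normalizing before comparing matters here: the first two records differ only in spacing and capitalization, so they would not be recognized as the same customer if compared as entered.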
Protecting yourself against dirty data
Given the volume of data companies need to manage today, it is impossible to keep all of it clean. That said, you can minimize the organizational impact of dirty data by being proactive about all the information you receive and handle. It is highly recommended that you regularly audit and clean your data. While this cannot wholly eliminate dirty data from your organization, it can make its threat to your bottom line negligible.