What Dirty Data Looks Like
Emma Kessinger
January 9, 2020
Companies are being forced to process and parse more data than ever, and that deluge can lead to serious problems. Dirty data, meaning data that is unusable for analysis, costs businesses more than 3 trillion dollars annually.
As your business grows and begins to extract and analyze greater amounts of data, you need to be prepared to deal with the spectre of dirty data. There is no universal fix for cleaning up your data, but there are a number of steps you can take to ensure that you’re dealing with the best data possible.
Keeping Your Data Clean
Data cleansing often comes down to identifying specific problems within your data sets. Here are a few common issues that can keep your data from being used properly:
1. Formatting Issues
One of the most common causes of dirty data is inconsistent formatting. If formatting isn’t uniform across your whole dataset, you’re likely to run into serious problems when it comes time to crunch the numbers.
Data type formatting is usually a good place to look. Make sure any dates are all formatted the same way, with the day, month, and year in the same order and separated by the same punctuation marks; it’s also best to use the date format recommended by the particular database you’re using. Likewise, check that dollar amounts are uniformly formatted, so cents are handled the same way across all figures. Dollar amounts are best stored in a decimal data type, since a float can introduce rounding errors.
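To make this concrete, here is a minimal sketch in Python with pandas. The order_date and amount columns and their sample values are hypothetical, but the pattern holds for any tabular export: parse every date to one canonical format, and store dollar amounts as Decimal rather than float.

```python
from decimal import Decimal

from dateutil import parser
import pandas as pd

# Hypothetical raw export with inconsistent date and dollar formatting.
raw = pd.DataFrame({
    "order_date": ["2020-01-09", "01/09/2020", "9 Jan 2020"],
    "amount": ["$1,200.5", "1200.50", "$1,200.50"],
})

# Normalize every date to one canonical format (ISO 8601 here);
# dateutil parses each string individually, whatever its layout.
raw["order_date"] = raw["order_date"].map(
    lambda s: parser.parse(s).strftime("%Y-%m-%d")
)

# Strip currency symbols and commas, then store amounts as Decimal
# quantized to cents, so no float rounding error creeps in.
raw["amount"] = (
    raw["amount"]
    .str.replace(r"[$,]", "", regex=True)
    .map(lambda s: Decimal(s).quantize(Decimal("0.01")))
)

print(raw)
```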
2. Duplicate Data
Duplicate data is the bane of data analysts and data scientists, and it can have dire consequences if it makes it into reports: your numbers will be inflated. Good database index design and a solid ETL process go a long way toward eliminating duplicate records. When choosing an ETL tool, be sure deduplication is a core focus.
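Here is a small illustration in pandas; the order_id key and sample rows are made up, but they show why deduplicating on a natural key before aggregating matters. In a database, a unique index on that key provides the same safeguard.

```python
import pandas as pd

# Hypothetical extract in which the same order was loaded twice.
orders = pd.DataFrame({
    "order_id": [101, 102, 102, 103],
    "amount": [25.00, 40.00, 40.00, 15.00],
})

# Deduplicate on the natural key before any aggregation; without this,
# summing amount double-counts order 102.
deduped = orders.drop_duplicates(subset="order_id", keep="first")

print(orders["amount"].sum())   # 120.0 -- inflated by the duplicate
print(deduped["amount"].sum())  # 80.0  -- the correct total
```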
3. Improper Inclusion
If you’re compiling data to calculate your business’s revenue, you don’t want expense numbers included in the calculation. Data that slips into a dataset where it doesn’t belong skews results and can render entire datasets unusable.
To avoid any inclusion issues, keep different datasets siloed off from one another. Cramming everything into a single spreadsheet is a recipe for disaster, so always be careful when compiling or centralizing your data.
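As a sketch of what proper filtering looks like, the hypothetical ledger below mixes revenue and expense rows; selecting only the relevant rows before summing keeps the calculation honest.

```python
import pandas as pd

# Hypothetical ledger that mixes revenue and expense rows in one table.
ledger = pd.DataFrame({
    "type": ["sale", "sale", "expense", "sale"],
    "amount": [100.0, 250.0, 80.0, 50.0],
})

# Select only the rows that belong in a revenue calculation instead of
# summing the whole mixed table.
revenue = ledger.loc[ledger["type"] == "sale", "amount"].sum()

print(revenue)  # 400.0 rather than 480.0
```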
4. Incomplete Data
Just as the data transfer process can sometimes duplicate data, it can also leave some data behind. An incomplete dataset means you’re not able to see the full picture of whatever you’re analyzing.
Identify bits of data you can use as “markers” and search for all of them once your data has been transferred, as in the sketch below. Keeping an eye on a few random bits of data is a good heuristic for checking that everything came across properly. That said, if you encounter any issues during the transfer itself, it’s worth doing a more thorough search.
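One way to automate those marker checks, sketched below with assumed file names and an assumed record_id key column:

```python
import pandas as pd

# Hypothetical before/after extracts from a data transfer; the file
# names and the record_id key column are assumptions for illustration.
source = pd.read_csv("source_export.csv")
dest = pd.read_csv("warehouse_export.csv")

# Cheap completeness check: row counts should agree.
assert len(source) == len(dest), "row counts differ after transfer"

# Marker check: a handful of keys sampled from the source should all
# be present in the destination.
markers = source["record_id"].sample(n=10, random_state=0)
missing = set(markers) - set(dest["record_id"])
assert not missing, f"marker records missing after transfer: {missing}"
```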
5. Contradictory Data
About 27 percent of business leaders can’t reliably say how accurate their data is, and much of that uncertainty is due to the high volume of contradictory data. Contradictory data is less likely to stem from transfer or formatting issues and more likely from data entry: data entered into the wrong category, recorded in the wrong units, or attributed to the wrong timespan.
Contradictory data can often stand out starkly from the rest of your dataset, but it can also hide insidiously. The best safeguard against it is having clearly defined policies for data entry and intake.
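Those policies can also be enforced in code. The sketch below assumes a hypothetical entries table and an arbitrary plausible range; the point is to flag contradictory rows for review rather than let them silently skew the analysis.

```python
import pandas as pd

# Hypothetical January entries: one row was keyed in with the wrong
# units (cents instead of dollars) and one carries the wrong date.
entries = pd.DataFrame({
    "entered_at": pd.to_datetime(["2019-12-30", "2020-01-05", "2020-01-07"]),
    "amount": [120.00, 12000.00, 95.00],
})

# Intake rules written down as code: dates must fall in the reporting
# period and amounts inside an assumed plausible range.
in_period = entries["entered_at"].between("2020-01-01", "2020-01-31")
plausible = entries["amount"].between(1, 1000)

# Flag suspect rows for review rather than silently dropping them.
suspect = entries[~(in_period & plausible)]
print(suspect)
```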
Dirty data can be a major headache for your business, in terms of both time and money. To keep your data as clean as possible, always follow best practices for data entry, transfer, and analysis. And if you don’t have time to do it all yourself, it may be time to invest in tools like ETLrobot that can help you clean, deduplicate, and properly structure your data during the ETL process.