What Dirty Data Looks Like

Emma Kessinger

January 9th, 2020

Companies are being forced to process and parse more data than ever, and that kind of deluge can lead to issues. Dirty data — data that is unusable for analysis — costs businesses over 3 trillion dollars annually. 

As your business grows and begins to extract and analyze greater amounts of data, you need to be prepared to deal with the spectre of dirty data. There is no universal fix for cleaning up your data, but there are a number of steps you can take to ensure that you’re dealing with the best data possible.

Keeping Your Data Clean

Data cleansing often comes down to identifying specific problems within your data sets. Here are a few common issues that can keep your data from being used properly:

1. Formatting Issues

One of the most common causes of dirty data is inconsistent formatting. If formatting isn’t uniform across your whole data set, you’re likely to run into serious problems when it comes time to crunch the numbers.

Data types and formats are usually a good place to start. Make sure that all dates are formatted the same way, with the day, month, and year in the same order and separated by the same punctuation marks. It’s also best to use the format recommended by the particular database you are using. Check that any dollar amounts are uniformly formatted too, so that cents are handled the same way across all numbers. Dollar amounts are best stored in a decimal data type, because a float can introduce rounding errors.
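As a rough illustration of the points above, here is a minimal Python sketch that normalizes dates to a single format and parses dollar amounts into decimals. The list of input formats and the function names are assumptions made for this example, not part of any particular tool.

```python
from datetime import datetime
from decimal import Decimal, ROUND_HALF_UP

# Hypothetical list of date formats seen in the source data; adjust to your own.
# Order matters: the first format that parses successfully wins.
KNOWN_DATE_FORMATS = ["%Y-%m-%d", "%m/%d/%Y", "%d.%m.%Y"]


def normalize_date(raw: str) -> str:
    """Return the date as YYYY-MM-DD, or raise ValueError if no format matches."""
    for fmt in KNOWN_DATE_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"Unrecognized date format: {raw!r}")


def normalize_amount(raw: str) -> Decimal:
    """Parse a dollar string such as '$1,234.5' into a Decimal with two places."""
    cleaned = raw.strip().replace("$", "").replace(",", "")
    return Decimal(cleaned).quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


print(normalize_date("01/09/2020"))        # parsed as month/day/year
print(normalize_amount("$1,234.5"))        # two decimal places
print(0.1 + 0.2 == 0.3)                    # float comparison fails
print(Decimal("0.10") + Decimal("0.20") == Decimal("0.30"))  # decimal is exact
```

Note that a value like 01/09/2020 is ambiguous on its own. The order of the format list encodes an assumption about the source, so it is worth confirming that assumption with whoever produces the data.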

2. Duplicate Data

Duplicate data is the bane of data analysts and data scientists, and reporting on it can have dire consequences because the numbers will be inflated. Good database index design and a sound ETL process can help eliminate duplicate records. When choosing an ETL tool, be sure deduplication is a core focus.
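One way an index can guard against duplicates is sketched below using Python’s built-in sqlite3 module. The table name, column names, and sample rows are hypothetical. The idea is simply that a primary key (a unique index) lets the database reject a record it has already seen.

```python
import sqlite3


def load_orders(rows):
    """Load (order_id, amount_cents) rows, skipping any repeated order_id."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE orders ("
        " order_id TEXT PRIMARY KEY,"      # unique index on the business key
        " amount_cents INTEGER NOT NULL)"
    )
    # INSERT OR IGNORE silently skips rows that would violate the primary key.
    conn.executemany(
        "INSERT OR IGNORE INTO orders (order_id, amount_cents) VALUES (?, ?)",
        rows,
    )
    conn.commit()
    return conn


# Hypothetical sample: order A1 was delivered twice by the upstream system.
rows = [("A1", 1000), ("A2", 2500), ("A1", 1000)]
conn = load_orders(rows)
count, total = conn.execute(
    "SELECT COUNT(*), SUM(amount_cents) FROM orders"
).fetchone()
print(count, total)
```

Without the key, the repeated row would be stored twice and the total would be overstated by the amount of the duplicate order.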

3. Improper Inclusion

If you’re compiling data in an attempt to calculate your business’s revenue, you don’t want your expense numbers included in the calculations. The improper inclusion of certain data into datasets where it doesn’t belong can make calculations difficult and can result in multiple datasets being rendered unusable. 

To avoid any inclusion issues, keep different datasets siloed off from one another. Cramming everything into a single spreadsheet is a recipe for disaster, so always be careful when compiling or centralizing your data.
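If records from different datasets do end up side by side, a simple guard before the calculation can catch the mix-up. The sketch below assumes each record carries a hypothetical "category" field; your own data may label records differently.

```python
from decimal import Decimal


def total_revenue(records):
    """Sum amounts, refusing to proceed if a non-revenue record slipped in."""
    stray = [r for r in records if r["category"] != "revenue"]
    if stray:
        raise ValueError(
            f"{len(stray)} non-revenue record(s) found in revenue data"
        )
    return sum((r["amount"] for r in records), Decimal("0"))


# Hypothetical sample records.
clean = [
    {"category": "revenue", "amount": Decimal("10.00")},
    {"category": "revenue", "amount": Decimal("5.50")},
]
print(total_revenue(clean))
```

Failing loudly here is a deliberate choice: a wrong revenue figure that looks plausible is usually more costly than a calculation that stops and asks for attention.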

4. Incomplete Data 

Just as the data transfer process can sometimes overproduce data, it can also leave some data unaccounted for. An incomplete data set means that you’re not able to see the full picture of whatever you’re analyzing. 

Identify bits of data you can use as “markers” and do a search for all of them once your data has been transferred. Keeping your eye on a few random bits of data can be a good heuristic for seeing if everything has been transferred properly. That being said, if you encounter any issues in the data transfer process, it might be worthwhile doing a more thorough search. 
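The marker check described above can be sketched in a few lines of Python. The function names are made up for this example, and it assumes every record has some unique identifier that exists on both sides of the transfer.

```python
import random


def pick_markers(source_ids, k=5, seed=None):
    """Choose up to k random record IDs from the source to use as markers."""
    rng = random.Random(seed)
    ids = list(source_ids)
    return rng.sample(ids, min(k, len(ids)))


def find_missing_markers(marker_ids, transferred_ids):
    """Return the markers that did not arrive in the transferred data."""
    transferred = set(transferred_ids)
    return [m for m in marker_ids if m not in transferred]


# Hypothetical example: record "z" was lost during the transfer.
missing = find_missing_markers(["a", "c", "z"], ["a", "b", "c"])
print(missing)
```

An empty result only means the sampled markers arrived. It does not prove the whole transfer is complete, which is why a fuller comparison is worth running when anything looks off.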

5. Contradictory Data 

About 27 percent of business leaders can’t reliably say how accurate their data is, and much of that issue is due to the high volume of contradictory data. Contradictory data is less likely to come from transfer or formatting issues and more likely to come from data entry: data that has been entered into the wrong category, recorded with the wrong units, or taken from the wrong timespan.

Contradictory data can often stand out starkly from the rest of your dataset, but it can also be much harder to spot. The best safeguard against it is to have clearly defined policies for data entry and intake.
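A data entry policy is easier to enforce when it is written down as checks that run at intake. The following Python sketch covers the three problems named above (wrong category, wrong units, wrong timespan). The allowed values and field names are assumptions chosen for illustration.

```python
from datetime import date

# Hypothetical policy values; replace with your own.
ALLOWED_CATEGORIES = {"revenue", "expense"}
ALLOWED_UNITS = {"USD"}


def validate_entry(entry, period_start, period_end):
    """Return a list of policy violations for one record (empty if none)."""
    problems = []
    if entry.get("category") not in ALLOWED_CATEGORIES:
        problems.append(f"unknown category: {entry.get('category')!r}")
    if entry.get("unit") not in ALLOWED_UNITS:
        problems.append(f"unexpected unit: {entry.get('unit')!r}")
    entry_date = entry.get("date")
    if entry_date is None or not (period_start <= entry_date <= period_end):
        problems.append(f"date outside reporting period: {entry_date!r}")
    return problems


# Hypothetical record entered with the wrong unit and from the wrong year.
bad = {"category": "revenue", "unit": "EUR", "date": date(2019, 6, 1)}
for problem in validate_entry(bad, date(2020, 1, 1), date(2020, 12, 31)):
    print(problem)
```

Returning every violation at once, rather than stopping at the first, gives whoever entered the record a complete list to correct in one pass.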

Dirty data can be a major headache for your business, both in terms of time and finances. In order to keep your data as clean as possible, always follow best practices when it comes to data entry, transfer, and analysis. And if you don’t think you have time to do this all yourself, it may be time to invest in tools like ETLrobot that can help you clean, deduplicate, and properly structure your data during the ETL process.

Copyright © 2020 ETLrobot. All rights reserved.