What is dirty data? Sources, impact, key strategies

Business data is critical to business success. Businesses around the world understand this and leverage platforms like Snowflake to get the most out of information streaming in from a variety of sources. However, more often than not, this data can become “dirty”: at any stage of the pipeline it can lose key attributes such as accuracy, accessibility and completeness (among others), becoming unsuitable for the downstream uses the organization originally intended.

“Some data may be objectively erroneous. Fields left blank, misspelled or inaccurate names, addresses and phone numbers, and duplicate records… are some examples. However, whether this data can be classified as dirty depends a lot on the context. For example, a missing or incorrect email address is not needed to close an in-store sale, but a marketing team that wants to contact customers by email with promotional information will classify that same record as dirty,” Jason Medd, research director at Gartner, told VentureBeat.

Moreover, untimely and inconsistent flows of information can also compound the problem of dirty data within an organization. This is especially common when information is merged from two or more systems that use different standards. For example, if one system stores names in a single field while the other splits them into two, only one format will be considered valid and the other will require cleaning.
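
To make the merge problem concrete, here is a minimal sketch, assuming two hypothetical exports (a CRM with a single full_name field and an orders system with separate first_name/last_name fields), that normalizes the split fields to one standard before combining the records. The column names and data are illustrative only, not from the article.

    import pandas as pd

    # Hypothetical exports from two systems that use different name standards
    crm = pd.DataFrame({"customer_id": [1, 2], "full_name": ["Ada Lovelace", "Alan Turing"]})
    orders = pd.DataFrame({"customer_id": [3], "first_name": ["Grace"], "last_name": ["Hopper"]})

    # Normalize the split-field system to the single-field standard before merging
    orders["full_name"] = orders["first_name"].str.strip() + " " + orders["last_name"].str.strip()
    combined = pd.concat(
        [crm[["customer_id", "full_name"]], orders[["customer_id", "full_name"]]],
        ignore_index=True,
    )
    print(combined)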

Dirty data sources

Overall, the problem comes down to five key sources:

People

As Medd explained, dirty data can stem from human error during input. This can be the result of shoddy work by the person entering the data, a lack of training, or poorly defined roles and responsibilities. Many organizations don’t even consider establishing a collaborative, data-driven culture.

Process

Gaps in process oversight can also lead to dirty data. For example, poorly defined data lifecycles can allow outdated information to linger across systems (people change phone numbers and addresses over time). Issues can also arise from a lack of data quality firewalls at critical data capture points or the absence of clear cross-functional data processes.
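
As a small illustration of the lifecycle point, the sketch below flags contact records that have not been re-verified within a chosen window. The one-year threshold, column names and data are assumptions for illustration, not part of the article.

    import pandas as pd

    # Hypothetical contact records with a last-verified timestamp
    contacts = pd.DataFrame({
        "customer_id": [1, 2, 3],
        "phone": ["555-0100", "555-0101", "555-0102"],
        "last_verified": pd.to_datetime(["2022-05-01", "2020-01-15", "2021-11-30"]),
    })

    # Flag records that have not been re-verified within the assumed lifecycle window
    stale_after = pd.Timestamp.now() - pd.Timedelta(days=365)
    contacts["stale"] = contacts["last_verified"] < stale_after
    print(contacts[contacts["stale"]])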

Technology

Technology issues such as programming errors or poorly maintained internal/external interfaces can affect data quality and consistency. Many organizations also fail to deploy data quality tools, or end up maintaining multiple diverging copies of the same data because of system fragmentation.

Organization

Among other things, activities at the broader organizational level, such as mergers and acquisitions, can also disrupt data practices. This problem is particularly common in large companies. Not to mention that, given the complexity of such organizations, the heads of many functional areas may resort to keeping and managing data in silos.

Governance

Gaps in governance, which provides authority and control over data assets, can be another reason for quality issues. Organizations that fail to set data capture standards, appoint data owners and stewards, or enforce policies that keep pace with the scale, speed and distribution of data can end up with sloppy first-party and third-party data.

“Data governance is the specification of decision rights and an accountability framework to ensure appropriate behavior in the assessment, creation, consumption and control of data. It also defines a policy management framework to ensure data quality throughout the company’s value chains. Managing dirty data is not just a technological problem. It requires the application and coordination of people, process and technology. Data governance is a key pillar not only to identify dirty data, but also to ensure issues are addressed and continuously monitored,” Medd added.

Enterprise-wide impact

Regardless of the source, data quality issues can have a significant impact on downstream analytics, leading to poor business decisions, inefficiencies, missed opportunities and reputational damage. There can also be smaller problems, such as sending the same message to a customer multiple times because their name was recorded differently in the same system.

All of this ultimately translates into additional costs, churn and poor customer experiences. In fact, Medd pointed out that poor data quality can cost organizations an average of $12.9 million each year. Stewart Bond, director of data integration and intelligence research at IDC, agrees, noting that his organization’s recent data trust survey found that low levels of data quality and trust have the greatest impact on operational costs.

Key actions to address data quality challenges

To keep the data pipeline clean, organizations should implement a scalable and comprehensive data quality program that covers tactical data quality issues as well as the strategic work of aligning data resources with business objectives. This, as Medd explained, can be done by building a strong foundation reinforced by modern technology, metrics, processes, policies, roles and responsibilities.

“Organizations have typically addressed data quality issues as point solutions in individual business units, where the problems show up the most. This could be a good starting point for a data quality initiative. However, solutions often focus on specific use cases and often overlook the broader business context, which may involve other business units. It is critical for organizations to have scalable data quality programs so they can build on their successes in terms of experience and skills,” said Medd.

In a nutshell, a data quality program should have five main layers:

Definition

As part of this, the organization should define the larger purpose of the program, detailing which data it plans to keep under scrutiny, which business processes can produce bad data (and how), and which departments may ultimately be affected by it. Based on this information, the organization can then set data rules and appoint data owners and stewards for accountability.

A good example is customer records. An organization whose goal is to ensure unique and accurate customer records for use by marketing teams might set rules such as: the name and address collected from a new order must be unique as a combination, and addresses must be verified against an authoritative database.
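
A minimal sketch of checking those two rules might look like the following, assuming a hypothetical orders table and a small set of verified addresses standing in for the authoritative database.

    import pandas as pd

    # Hypothetical new-order records and an authoritative address reference
    orders = pd.DataFrame({
        "name": ["Ada Lovelace", "Alan Turing", "Ada Lovelace"],
        "address": ["12 Main St", "34 Oak Ave", "12 Main St"],
    })
    verified_addresses = {"12 Main St", "34 Oak Ave"}

    # Rule 1: the name + address combination must be unique
    duplicates = orders[orders.duplicated(subset=["name", "address"], keep=False)]

    # Rule 2: the address must exist in the authoritative reference
    unverified = orders[~orders["address"].isin(verified_addresses)]

    print(f"{len(duplicates)} duplicate records, {len(unverified)} unverified addresses")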

Evaluation

Once the rules are defined, the organization should use them to check new records (at the source) as well as existing ones against key quality attributes, from accuracy and completeness through to consistency and timeliness. The process typically involves qualitative and quantitative tools, as most companies deal with a wide variety and volume of information from different systems.

“There are many data quality solutions available on the market, ranging from domain-specific offerings (customers, addresses, products, locations, etc.) to software that detects bad data based on rules that define what good data is. There is also an emerging set of software vendors that use data science and machine learning techniques to flag anomalies in data as possible data quality issues. The first line of defense, however, is to have data standards in place for data capture,” IDC’s Bond told VentureBeat.
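
The anomaly-based approach Bond describes is vendor-specific in practice, but a generic sketch with scikit-learn gives the flavor: fit an isolation forest on numeric fields and surface outlying records for human review. The table, columns and contamination setting below are assumptions for illustration, not any particular vendor’s method.

    import pandas as pd
    from sklearn.ensemble import IsolationForest

    # Hypothetical numeric order data; the last row looks like a fat-finger entry
    orders = pd.DataFrame({
        "quantity": [1, 2, 1, 3, 2, 500],
        "unit_price": [9.99, 9.99, 10.49, 9.99, 9.99, 0.01],
    })

    # Flag statistical outliers as candidate data quality issues for review
    model = IsolationForest(contamination=0.1, random_state=0)
    orders["suspect"] = model.fit_predict(orders[["quantity", "unit_price"]]) == -1
    print(orders[orders["suspect"]])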

Analysis

Following the evaluation, the results must be analyzed. At this point, the data team should identify the quality gaps (if any) and determine the root cause of each issue (incorrect entry, duplication or otherwise). This shows how far the current data is from the organization’s original goal and what needs to be done to move forward.

Cleanup

With the root cause in view, the organization must develop and implement plans to resolve the issue at hand. This should include steps to correct the problem as well as policy, technology or process changes to ensure it does not recur. Note that these steps should be executed with resource and cost considerations in mind, and some changes may take longer to implement than others.
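
For example, if the root cause turns out to be duplicate customer records created by inconsistent formatting, a minimal cleanup sketch (with hypothetical columns and data) could normalize the offending fields and then collapse the duplicates.

    import pandas as pd

    # Hypothetical customer table where the same person was entered twice
    customers = pd.DataFrame({
        "name": ["Ada Lovelace", "ada  lovelace", "Alan Turing"],
        "email": ["ada@example.com", "ADA@EXAMPLE.COM", "alan@example.com"],
    })

    # Normalize the fields that caused the duplicates, then keep one record per person
    customers["name_norm"] = customers["name"].str.lower().str.split().str.join(" ")
    customers["email_norm"] = customers["email"].str.lower()
    cleaned = customers.drop_duplicates(subset=["name_norm", "email_norm"], keep="first")
    print(cleaned[["name", "email"]])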

Control

Finally, the organization must ensure that the changes remain in effect and that the data continues to comply with the data rules. Information about current standards and the state of the data should be communicated throughout the organization, cultivating a culture of collaboration that sustains data quality on an ongoing basis.
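
As a last sketch, ongoing control can be as simple as re-running the data rules on each new batch and tracking the pass rates over time. The rules, column names and sample batch below are illustrative assumptions; in practice such a check would run on a scheduler and feed a dashboard or alert.

    import pandas as pd

    def quality_report(records: pd.DataFrame) -> dict:
        """Re-run the data rules and report the share of records that pass."""
        total = len(records)
        complete = records["email"].notna().sum()                              # completeness
        unique = total - records.duplicated(subset=["name", "address"]).sum()  # uniqueness
        return {"completeness": complete / total, "uniqueness": unique / total}

    # Hypothetical nightly batch of customer records
    batch = pd.DataFrame({
        "name": ["Ada Lovelace", "Alan Turing"],
        "address": ["12 Main St", "34 Oak Ave"],
        "email": ["ada@example.com", None],
    })
    print(quality_report(batch))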
