Do data scientists really spend 80% of their time tackling data?

Yes and no. The implication is clear: if the statistic is accurate, the burden of preparing data for their models keeps data scientists from applying the skills they were hired for.

Lost in this argument is that the “data battle” itself calls on considerable data science skill. Moreover, the struggle yields downstream benefits for others, who reuse what the scientist discovers along the way. Finally, sweeping generalizations make little sense when the work of data scientists is far from uniform across industries and data platforms.

The 80/20 claim first surfaced at least a decade ago, and it lives on. There is no rigorous evidence that the 80% figure is accurate; it surely varies by organization, by application and, certainly, by the skills and tools in place. However, there is no denying that finding and preparing data for analytics and data science is a major effort, whatever the true percentage.

Nonetheless, acquiring valid data for analysis is an overwhelming data management problem in an increasingly complex, hybrid, and distributed world. It’s too big for even highly trained analysts and scientists to handle on their own. The solution is a platform that provides consistent, connected services such as data relationship discovery, data flow, sensitive data discovery, data drift, impact analysis, and redundant data analysis. The entire suite should be driven by AI working alongside experts so the system can relearn and adapt. Rather than piecemeal approaches, a semantically rich data catalog underpinned by a knowledge graph is what makes all of that effort pay off.
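As a rough illustration of what “underpinned by a knowledge graph” could mean in practice, here is a minimal sketch in Python. The dataset names, relationship types, and confidence scores are hypothetical, and networkx stands in for whatever graph store a real catalog would use; the point is simply that datasets, their active metadata, and their discovered relationships live on a graph, so questions such as impact analysis become traversals.

```python
# A rough sketch of a knowledge-graph-backed catalog, with hypothetical
# dataset names, relationships, and confidence scores. networkx stands in
# for whatever graph store a real catalog would use.
import networkx as nx

catalog = nx.MultiDiGraph()

# Nodes carry active metadata about each dataset.
catalog.add_node("claims_2023", location="s3://lake/claims/", contains_pii=True)
catalog.add_node("patient_demographics", location="postgres://emr/patients", contains_pii=True)
catalog.add_node("lab_results", location="hdfs://cluster/labs/", contains_pii=False)

# Edges record discovered relationships: join keys, lineage, redundancy.
catalog.add_edge("claims_2023", "patient_demographics",
                 relation="joins_on", key="patient_id", confidence=0.97)
catalog.add_edge("lab_results", "patient_demographics",
                 relation="joins_on", key="patient_id", confidence=0.91)
catalog.add_edge("claims_2023", "lab_results",
                 relation="overlaps_with", discovered_by="redundancy_scan")

# Impact analysis becomes a traversal: what would a change to
# patient_demographics affect?
affected = [u for u, _ in catalog.in_edges("patient_demographics")]
print(affected)  # ['claims_2023', 'lab_results']
```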

Things to consider are:

  • Why embedded machine learning is essential for populating and maintaining a knowledge graph, mapping relationships in distributed data that manual processes cannot surface.
  • Why data discovery is a dynamic, ongoing process, not a one-time ETL mapping to a stable schema.
  • How active metadata is not neatly tucked away in a drawer but in play throughout the process, from discovery to a semantically rich, dynamic data catalog.
  • Why even machine learning-based software is inadequate if it stops at metadata such as column names and never examines the actual instances of the data itself (see the sketch after this list).
  • The role of continuous learning: as experts inspect the results of the models, their input in the form of additions, deletions, or corrections is fed back to the algorithms so they relearn and adapt.
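To make the fourth point concrete, here is a minimal sketch with hypothetical column names and only two hard-coded patterns. A real system would learn such classifiers rather than enumerate regexes, and the feedback record at the end is likewise a stand-in; the sketch only shows why looking at the values beats looking at the column name.

```python
# A minimal sketch of inferring a semantic type from the data values
# themselves rather than from the column name. Column names, patterns, and
# the feedback record are hypothetical.
import pandas as pd

VALUE_PATTERNS = {
    "us_phone": r"^\(?\d{3}\)?[-. ]?\d{3}[-. ]?\d{4}$",
    "email":    r"^[^@\s]+@[^@\s]+\.[^@\s]+$",
}

def infer_semantic_type(series, threshold=0.9):
    """Guess a semantic type from a sample of the actual values."""
    sample = series.dropna().astype(str).head(1000)
    if sample.empty:
        return None
    for label, pattern in VALUE_PATTERNS.items():
        if sample.str.match(pattern).mean() >= threshold:
            return label
    return None

df = pd.DataFrame({"col_7": ["(212) 555-0101", "646-555-0199", "917.555.0133"]})
print(infer_semantic_type(df["col_7"]))  # -> "us_phone"

# Continuous learning hook: an expert's confirmation or correction is
# captured and can be fed back to retrain the discovery models.
expert_feedback = {"dataset": "crm_export", "column": "col_7",
                   "predicted": "us_phone", "confirmed_by_expert": True}
```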

Better tools are needed to improve the productivity (and job satisfaction) of these highly skilled, highly paid professionals. More traditional analysts in organizations can benefit just as much from an intelligent, integrated product that takes them from data ingestion to an active, semantically rich data catalog.

It cannot be done with traditional methods. There is too much data, from too diverse a set of sources, for hand-built programmatic solutions. Data scientists (we use the term “data scientist” broadly to mean anyone doing analytical and quantitative work with data) need help. Interestingly, that help comes from the same disciplines they use in their own work: the solutions that work today rely heavily on machine learning.

The promise of machine learning and AI-infused applications catapulting us to awesome capabilities is being driven by advances in processing, storage and networking technologies, the ability to process data at fantastic scale and the growing skills of data scientists. This technological base allows an innovative approach to data management, impossible barely ten years ago.

The current volume of data adds complexity to the problem, yet that data still has to be understood at scale.

Only a few years ago, things seemed more orderly. Before the rise of Big Data, followed by the “data lake” and cloud object stores, the data warehouse was the primary repository for analytics data. The extraction and integration technology for the data warehouse was Extract, Transform and Load (ETL). ETL work was concentrated upstream in the data warehouse development process, extracting from reasonably well-known data sources into a known schema. Once settled, ETL ran mostly as a routine process. Mostly. Data warehouses are stable but not static, so there is usually some ongoing ETL development, but for the most part the routines simply run in production.
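For contrast with what follows, here is a minimal sketch of that classic ETL pattern, with hypothetical table and column names: the source layout and the warehouse schema are both known in advance, the mapping is written once, and the routine is simply re-run on a schedule.

```python
# A minimal sketch of classic ETL against a known source and a stable
# warehouse schema. Table and column names are hypothetical.
import sqlite3
import pandas as pd

TARGET_COLUMNS = ["order_id", "customer_id", "order_date", "amount_usd"]

def nightly_etl(source_conn, warehouse_conn):
    # Extract from a known source table.
    orders = pd.read_sql("SELECT id, cust, dt, total FROM src_orders", source_conn)

    # Transform with a fixed, hand-written mapping to the stable schema.
    orders = orders.rename(columns={"id": "order_id", "cust": "customer_id",
                                    "dt": "order_date", "total": "amount_usd"})
    orders["order_date"] = pd.to_datetime(orders["order_date"]).dt.date

    # Load into the stable warehouse table.
    orders[TARGET_COLUMNS].to_sql("fact_orders", warehouse_conn,
                                  if_exists="append", index=False)

# e.g. nightly_etl(sqlite3.connect("source.db"), sqlite3.connect("warehouse.db"))
```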

Finding and integrating data for data warehouses was not easy. Source systems were not designed to be data providers for a data warehouse, semantics were not aligned, and there were data quality issues.

The current fascination with “digital transformation” has organizations scrambling to develop skills in machine learning, AI, deep learning, or even just simple predictive models. At the same time, the data sources worth considering have exploded. For example:

  • Social media platforms offer a wide variety of views of their data.
  • Data.gov contains more than a quarter of a million datasets ranging from Coast Guard casualties to bird populations, demographics to Department of Commerce information.
  • Healthdata.gov contains 125 years of US health care data, including claims-level Medicare data, epidemiological and demographic statistics. These are just a few of thousands of external data sources.

Even within an organization, disjoint data sources, each designed to capture data for a single domain, are now seen as critical inputs to new applications that were not possible before. Population health management, for example, requires at a minimum the following data sources:

  • Patient demographics
  • Vital signs
  • Laboratory results
  • Progress notes
  • Problem lists and diagnoses
  • Procedure codes
  • Allergy lists
  • Drug data
  • Admission, discharge and transfer
  • Skilled nursing and home care
  • Social determinants of health

No data warehouse can easily integrate all of this data. There are too many domains and too many data types, and the sheer effort of cleansing and curating it would overwhelm any data warehouse schema. The logical home for this data is some combination of cloud and on-premises storage, distributed across Hadoop or Spark or a data lake (or several). These repositories provide a convenient way to manage data ingestion, but they lack the functionality to activate the data, to make it meaningful to the investigator.

The problem arises because none of these data sources is semantically compatible with the others, yet combining and integrating data from multiple sources is precisely what adds richness to the models. This is where the 80% problem lies.
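A contrived example of that semantic gap, built on two hypothetical extracts: both sources describe the same patients, but identifiers, codes, and date formats all differ until someone discovers, or hand-writes, the reconciliation.

```python
# Two hypothetical extracts describing the same patients with incompatible
# identifiers, codings, and date formats.
import pandas as pd

emr = pd.DataFrame({"patient_id": [101, 102],
                    "sex": ["M", "F"],
                    "birth_date": ["1959-03-02", "1987-11-20"]})

claims = pd.DataFrame({"member_num": ["000101", "000102"],
                       "gender_cd": [1, 2],                 # 1 = male, 2 = female
                       "dob": ["03/02/1959", "11/20/1987"]})

# None of this mapping exists in either source; it has to be discovered or
# hand-written before the two can be joined.
claims_harmonized = pd.DataFrame({
    "patient_id": claims["member_num"].astype(int),
    "sex": claims["gender_cd"].map({1: "M", 2: "F"}),
    "birth_date": pd.to_datetime(claims["dob"], format="%m/%d/%Y").dt.strftime("%Y-%m-%d"),
})

merged = emr.merge(claims_harmonized, on=["patient_id", "sex", "birth_date"], how="outer")
print(merged)
```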

Data science work is very often ad hoc. It is a multi-step process: profiling the data, cleaning some of it, repeatedly transforming data from different sources into a single format, saving the result under a name the scientist can remember, and tracking versions. Each investigation starts with a model and selects data for it. Creating training data involves still more data processing, and multiple runs or versions of the model are also named and saved. Another contrast between ETL and today’s data discovery is that ETL is mapped to a stable schema.
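A minimal sketch of that ad hoc cycle, with hypothetical file and column names: each iteration re-profiles, re-cleans, reconciles formats, and saves a named, versioned training set.

```python
# An ad hoc, repeatable-by-hand data preparation cycle. File names, column
# names, and the versioning scheme are hypothetical.
from datetime import date
import pandas as pd

def build_training_set(raw_paths, version_tag):
    frames = []
    for path in raw_paths:
        df = pd.read_csv(path)

        # Profile: shapes, null rates, anything suspicious.
        print(path, df.shape, df.isna().mean().round(3).to_dict())

        # Clean and transform each source into one common format.
        df.columns = [c.strip().lower() for c in df.columns]
        df = df.dropna(subset=["patient_id"])
        frames.append(df)

    combined = pd.concat(frames, ignore_index=True)

    # Save under a name the scientist will remember, tagged with a version.
    out_path = f"training_{version_tag}_{date.today():%Y%m%d}.parquet"
    combined.to_parquet(out_path, index=False)
    return out_path

# e.g. build_training_set(["demographics.csv", "labs.csv"], "readmission_v3")
```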

My take

There is considerably more data processing for each experiment than there ever was in extracting curated data from a data warehouse. That is why it takes so long, and it is a natural productivity killer for data scientists. Even with tools designed for big data and data science, there are many steps, often spanning multiple technologies with incompatible metadata and little to no carryover between them. There is a better way.