The advent of AI, automation and smart bots raises the question: Is it possible that data scientists will become redundant in the future, or are they indispensable? The ideal approach appears to be automation complementing the work data scientists do. This would make better use of the tremendous volumes of data being generated throughout the world every day.
Data scientists are currently very much in demand, but there is a question about whether they can automate themselves out of their jobs. Can artificial intelligence replace data scientists? If so, to what extent can their tasks be automated? Gartner recently reported that 40 per cent of data science tasks will be automated by 2020. So what kind of tasks can be efficiently handled by automation? All this speculation adds fuel to the ongoing ‘Man vs Machine’ debate.
Data scientists need a strong mathematical mind, quantitative skills, computer programming skills and business acumen to make decisions. They need to gather large volumes of unstructured data and transform it into results and insights that can be understood by laymen or business executives. The whole process is highly customised, depending on the application domain. Some degree of human interaction will always be needed due to the subjective nature of the process, and what percentage of the work can be automated depends on the specific use case and is open to debate. To understand how much, or which parts, can be automated, we need a deep understanding of the process.
Data scientists are expensive to hire and there is a shortage of this skill in the industry, as it’s a relatively new field, so many companies look for alternative solutions. Several AI algorithms have now been developed that can analyse data and provide insights much as a data scientist would. Such an algorithm has to deliver its output and make accurate predictions in a form users can act on, which is where Natural Language Processing (NLP) comes in.
NLP can be used to communicate with AI in the same way that laymen interact with data scientists to put forth their demands. For example, IBM Watson has NLP facilities that interact with business intelligence (BI) tools to perform data science tasks. Microsoft’s Cortana also has a powerful BI tool, and users can process Big Data sets by just speaking to it. All these are simple forms of automation that are already widely available. Data engineering tasks such as cleansing, normalisation, skewness removal and transformation, as well as modelling steps like champion model selection, feature selection, algorithm selection and fitness metric selection, are tasks for which automated tools are currently available in the market.
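As a rough illustration of why such data engineering steps automate well, normalisation and skewness removal are purely mechanical once the rule is chosen. The sketch below, in plain Python with invented income figures (the function names and data are hypothetical, not from any particular tool), shows two such transformations.

```python
import math

def min_max_normalise(values):
    """Rescale a numeric column to the [0, 1] range, a routine cleansing step."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]    # constant column: nothing to spread out
    return [(v - lo) / (hi - lo) for v in values]

def reduce_skew(values):
    """Apply log1p to positive, right-skewed data to pull in the long tail."""
    return [math.log1p(v) for v in values]

incomes = [20_000, 22_000, 25_000, 30_000, 500_000]   # one extreme value skews the column
print(min_max_normalise(incomes))
print([round(v, 2) for v in reduce_skew(incomes)])
```

An automated pipeline would simply apply transformations like these to every numeric column it detects, with no human in the loop.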
Automation in data science will squeeze some manual labour out of the workflow instead of completely replacing the data scientists. Low-level functions can be efficiently handled by AI systems. There are many technologies to do this.
The Alteryx Designer tool automatically generates customised REST APIs and Docker images around machine learning models during the promotion and deployment stage. Designer workflows can also be set up to automatically retrain machine learning models, using fresh data, and then to automatically redeploy them.
Data integration, model building and optimising model hyperparameters are areas where automation can be helpful. Data integration combines data from multiple sources to provide a uniform data set; automation here can pull trusted data from multiple sources for a data scientist to analyse. Model building requires collecting data, searching for patterns and making predictions, and each of these steps can be at least partly automated.
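Hyperparameter optimisation is a good example of automatable model building: a machine can simply try every candidate setting and keep the best. The sketch below shows the idea behind an exhaustive grid search, using a toy threshold classifier and invented data rather than any real tool's API.

```python
# Hypothetical training data: (feature, label) pairs.
data = [(0.1, 0), (0.4, 0), (0.6, 1), (0.9, 1), (0.5, 1), (0.3, 0)]

def accuracy(threshold, samples):
    """Fraction of samples the rule 'predict 1 when x >= threshold' gets right."""
    return sum((x >= threshold) == bool(y) for x, y in samples) / len(samples)

# Exhaustive grid search: the simplest form of automated hyperparameter tuning.
grid = [round(t * 0.1, 1) for t in range(1, 10)]
best = max(grid, key=lambda t: accuracy(t, data))
print(best, accuracy(best, data))
```

Real AutoML tools search far larger spaces with smarter strategies (random search, Bayesian optimisation), but the loop is the same: propose, evaluate, keep the best.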
Machines are getting smarter every day due to the integration of AI principles that help them learn from the types of patterns they have historically been trying to detect. An added advantage is that machines will not make the kind of errors that humans do.
Automation has its own set of limitations, however; it can only go so far. Artificial intelligence can automate data engineering and machine learning processes, but AI can’t automate itself. Data wrangling (data munging) consists of manually converting raw data to an easily consumable form. The process still requires human judgment to turn raw data into insights that make sense for an organisation and take all of its complexities into account.
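A small example of why wrangling resists full automation: the cleaning rules below, written in plain Python against an invented, messy export, exist only because a human decided what ‘clean’ should mean for this particular data set.

```python
# Hypothetical raw export: padded strings, blanks, inconsistent casing.
raw = [
    {"name": "  Alice ", "age": "34", "city": "london"},
    {"name": "Bob",      "age": "",   "city": "PARIS"},
    {"name": "carol",    "age": "29", "city": " Berlin"},
]

def wrangle(record):
    """Human-chosen rules: trim and title-case text, treat blank ages as missing."""
    return {
        "name": record["name"].strip().title(),
        "age": int(record["age"]) if record["age"].strip() else None,
        "city": record["city"].strip().title(),
    }

clean = [wrangle(r) for r in raw]
print(clean[0])  # {'name': 'Alice', 'age': 34, 'city': 'London'}
```

A tool can execute these rules at scale, but choosing them (is a blank age missing, zero, or an error?) is the judgment call the paragraph above is about.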
Even unsupervised learning is not entirely automated. Data scientists still prepare data sets, clean them, specify which algorithms to use, and interpret the findings. Data visualisation, most of the time, needs a human, as the findings presented to laymen have to be highly customised depending on the technical knowledge of the audience. A machine can’t possibly be trained to do that.
Low-level visualisations can be automated, but human intelligence is required to interpret and explain the data, and to write the AI algorithms that handle the mundane visualisation tasks. Moreover, intangibles like human curiosity, intuition or the desire to create and validate experiments can’t be simulated by AI. This aspect of data science is unlikely to be handled by AI in the near future, as the technology hasn’t evolved to that extent.
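The low-level visualisation automation mentioned above can be as simple as a few rules mapping column types to chart types. The sketch below is a deliberately naive, hypothetical version in plain Python; real BI tools apply far richer heuristics, but the audience-aware explanation of the resulting chart still falls to a human.

```python
def suggest_chart(values):
    """Rule-based chart suggestion of the kind simple BI tools automate."""
    if all(isinstance(v, (int, float)) and not isinstance(v, bool) for v in values):
        return "histogram"   # numeric column: show the distribution
    if len(set(values)) <= 10:
        return "bar chart"   # low-cardinality categories: compare counts
    return "table"           # anything else: fall back to a raw listing

print(suggest_chart([1.2, 3.4, 5.6]))          # histogram
print(suggest_chart(["red", "blue", "red"]))   # bar chart
```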
While thinking about automation, we should also consider the quality of the output, which here means the validity or relevance of the insights. With automation, the quantity and throughput of data science artefacts will increase, but that doesn’t translate into an increase in quality. The process of extracting insights and applying them within the context of particular data-driven applications is still inherently a creative, exploratory process that demands human judgment. Feature engineering is an essential part of gaining a deeper understanding of the data; it allows us to make maximum use of the data available to us. Automating feature engineering is difficult because it requires human domain knowledge and a real-world understanding, which is tough for a machine to acquire. Even if AI is used, it can’t provide the same level of feedback that a human expert in the domain can. While automation can help identify patterns in an organisation, machines cannot truly understand what the data means for an organisation and the relationships between its different, unconnected operations.
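To make the feature engineering point concrete, the sketch below derives two features from invented transaction records. The field names and thresholds are hypothetical; the point is that deciding weekend spending or above-average amounts matter at all is exactly the domain judgment a machine lacks.

```python
from datetime import date

# Hypothetical transaction records.
transactions = [
    {"amount": 120.0, "timestamp": date(2019, 7, 6)},   # a Saturday
    {"amount": 40.0,  "timestamp": date(2019, 7, 8)},   # a Monday
]

def engineer(tx, mean_amount):
    """Derive features a domain expert judged relevant for this business."""
    return {
        "amount": tx["amount"],
        "is_weekend": tx["timestamp"].weekday() >= 5,   # weekend spending differs
        "above_average": tx["amount"] > mean_amount,    # relative spend level
    }

mean_amount = sum(t["amount"] for t in transactions) / len(transactions)
features = [engineer(t, mean_amount) for t in transactions]
print(features[0])
```

Computing the features is trivial to automate; knowing which ones carry business meaning is not.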
You can’t teach a machine to be creative. After getting results from a pipeline, a data scientist can seek further domain knowledge in order to add value and improve the pipeline. Working alongside marketing, sales and engineering teams, data scientists then implement and deploy solutions based on these findings to improve the model. It’s an iterative process, and the creativity with which data scientists plan each successive phase is what differentiates them from bots. The interactions and conversations driving these initiatives, fuelled by abstract, creative thinking, surpass the capabilities of any modern-day machine.
Data scientists today shouldn’t worry about losing their jobs to computers due to automation, as they are an amalgamation of thought leaders, coders and statisticians. A successful data science project will always need a strong team of humans working together to solve a problem. AI will have a tough time collaborating, which is essential in order to transform data into actionable insights. Even if automation is used to some extent, a data scientist will always have to manually validate the results of a pipeline to make sure they make sense in the real world. Automation can be thought of as a supplementary tool that will help scale data science and make the work more efficient. Bots can handle the lower-level tasks and leave the problem-solving to human experts. The combination of automation with human problem-solving will empower, rather than threaten, the jobs of data scientists, with bots acting as their assistants.
Automation can never completely replace a data scientist because no amount of advanced AI can emulate the most important quality a skilful data scientist must possess—intuition.
This article was first published in Open Source For You October 2019 issue.
Preet Gandhi is an avid Big Data and data science enthusiast.