A sophisticated artificial intelligence (AI) system usually follows a two-stage learning mechanism—training and feedback. However, during the process of AI training, one should be aware of the mistakes they might make in order to achieve better performance and accuracy.
If you want your artificial intelligence (AI) system to perform better, you ought to teach it better; and how do you verify if you have trained it well? By testing it better!
The majority of AI solutions learn in two different stages. The first learning happens while working with controlled data sets and formulated base models. The second learning occurs on the go or periodically with the help of user interactions in the form of feedback.
A sophisticated AI system usually follows a two-stage learning mechanism, whereas a simple AI system may have only one formative stage. The two stages in which AI learns are:
The training stage is where you form basic machine learning models for the first time and train the system using these models. The accuracy of these models is highly dependent on the training dataset.
In contrast, once you deploy the AI system, some (or all) of its data can be fed back to the system for continuous learning, which we term feedback-stage learning. This stage is relatively vulnerable to various ongoing risks.
This is mainly because, no matter how accurate your base models were when you developed them, new data (coming through the feedback system) can cause these models to readjust. A lot also depends on how the machine is learning in each stage.
Avoid These Five Mistakes When Training Your AI
Meticulous teaching is the fundamental requirement for consistently excellent performance. However, there are a few mistakes to which traditional (and contemporary) data analytics or statistical methods are susceptible. Collectively, such issues are often known as "garbage in, garbage out."
1. Not having enough data to train
How much data does one need to train the AI system effectively? Well, it depends!
It’s not the answer you would expect when you are at the pointy end of your machine learning stage. Nevertheless, it does depend on the complexity of your problem and the complexity of the algorithm you plan to use. Either way, the best way would be to use empirical investigation and arrive at an optimal number.
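One way to run such an empirical investigation is a learning curve: train on progressively larger subsets and watch the error on held-out data. The sketch below is only an illustration under stated assumptions. The data is synthetic (a noisy straight line), the model is a simple least-squares fit, and all function names are hypothetical. With real data, you would substitute your own model and dataset and look for the size at which the hold-out error stops improving.

```python
import random

def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def mean_squared_error(model, xs, ys):
    """Average squared prediction error of a fitted line."""
    slope, intercept = model
    return sum((slope * x + intercept - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def learning_curve(train, holdout, sizes):
    """Return (size, hold-out error) pairs for models trained on growing subsets."""
    hold_x, hold_y = zip(*holdout)
    curve = []
    for size in sizes:
        xs, ys = zip(*train[:size])
        model = fit_line(xs, ys)
        curve.append((size, mean_squared_error(model, hold_x, hold_y)))
    return curve

# Synthetic data, purely illustrative: y = 2x + 1 plus Gaussian noise.
rng = random.Random(0)

def make_points(n):
    points = []
    for _ in range(n):
        x = rng.uniform(0, 10)
        points.append((x, 2 * x + 1 + rng.gauss(0, 1)))
    return points

train = make_points(2000)
holdout = make_points(500)
curve = learning_curve(train, holdout, sizes=[5, 20, 100, 500, 2000])
for size, error in curve:
    print(f"{size:>5} training rows -> hold-out MSE {error:.3f}")
```

If the error is still falling noticeably at the largest size you can afford, that is a hint you may not have enough data yet.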
You may want to use standard sampling methods to collect the required data, and standard sample size calculators of the kind found in statistical analysis tools. However, due to the nature of machine learning algorithms, the amount of data these suggest is often insufficient. You would most likely need more than what a standard sample size calculation formula tells you.
Having more data may not be as big a problem as having less data. You have to make sure that there is enough data to reasonably capture the relationships that might exist among input parameters (features) and between input and output.
You may also use your domain expertise to reasonably assess how much data is enough to exhibit a full cycle of your business problem. It should cover all the possible seasonality and variations.
The model developed with the help of this data will only be as good as the data you provide for training, so make sure that enough of it is available. If you feel that the data is not enough, which may be a rare scenario in the current Big Data world, don’t rush; wait until you get enough of it.
2. Not cleaning and validating the dataset
Too much data is of no use if it is of poor quality. It could mean one or more of the following three things:
Data has noise. There is too much conflicting and misleading information. Confounding variables or parameters are present, and the essential variables are missing. Cleaning this type of data needs additional data points because the current set is unusable and hence not enough.
It is dirty data. Several values are missing (though parameters are available), or the data has inconsistencies, errors, and a mix of numerical or categorical values in the same column. This type of data needs careful manual cleaning by subject matter experts and may often need re-validation. Depending on resource availability, you may find it easier to obtain additional data instead of cleaning dirty data.
Inadequate or sparse data. It is a scenario where very few data points have actual values, and a significant part of the dataset is full of nulls or zeroes.
The type of issues present within the dataset is often not clear from the dataset itself, which is why I always recommend exploratory analysis and visualisation to be applied at the outset. Doing this first pass not only gives you a level of confidence in data quality but also tells you if there is something amiss.
Based on the visual representation, an interesting question would be—do you see what you expected to see?
If the answer is ‘no,’ then your data may be of poor quality and needs cleaning.
If the answer is ‘yes,’ it might be useful in finding some preliminary insights. This validation of the dataset is essential to proceed, and you should never miss it.
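Before any visualisation, a quick numeric profile of each column can already surface the three kinds of problems listed above. The following is a minimal sketch, assuming the data arrives as a list of dictionaries and that missing values appear as None or an empty string. The function names and the sample rows are hypothetical.

```python
def is_number(value):
    """True for ints and floats, but not for booleans."""
    return isinstance(value, (int, float)) and not isinstance(value, bool)

def profile_column(values):
    """Summarise basic quality signals for one column of raw values."""
    total = len(values)
    present = [v for v in values if v is not None and v != ""]
    numeric = [v for v in present if is_number(v)]
    zeros = [v for v in numeric if v == 0]
    return {
        # Share of rows with no value at all (dirty or sparse data).
        "missing_fraction": (total - len(present)) / total if total else 0.0,
        # Share of rows that are exactly zero (possible sparsity).
        "zero_fraction": len(zeros) / total if total else 0.0,
        # Numbers and non-numbers in the same column (dirty data).
        "mixed_types": 0 < len(numeric) < len(present),
    }

# Hypothetical sample rows, for illustration only.
rows = [
    {"age": 34, "income": 52000},
    {"age": None, "income": 0},
    {"age": "unknown", "income": 0},
    {"age": 41, "income": None},
]

report = {
    column: profile_column([row[column] for row in rows])
    for column in ("age", "income")
}
for column, stats in report.items():
    print(column, stats)
```

A profile like this does not replace looking at the data, but it tells you which columns deserve a closer look first.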
3. Not having enough spread in data
Having a large amount of data is not always a good thing unless it represents all the possible use cases or scenarios. If the data is missing variety, it can lead to problems in future: you increase the chances of missing low-frequency, high-risk scenarios.
For traditional predictive analysis, there is a point of diminishing returns as you obtain more and more data for training. Your data science team can often spot this point empirically.
However, since machine learning is an inductive process, your base model can only cover what it has seen in the data. So, if you miss edge cases, your model will not support them. It simply means your AI will fail when that scenario occurs. This is the most crucial reason why your training data should have enough spread to represent the real population.
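A simple way to check spread is to list the scenarios you expect the system to face and count how often each appears in the training data. The sketch below assumes each training row carries a scenario label. The label names, the counts, and the minimum of 10 examples are hypothetical values chosen only for illustration.

```python
from collections import Counter

def underrepresented(rows, key, expected_values, min_count):
    """Return expected values that appear fewer than min_count times."""
    counts = Counter(row[key] for row in rows)
    return {
        value: counts[value]  # Counter yields 0 for values never seen
        for value in expected_values
        if counts[value] < min_count
    }

# Hypothetical training set: many ordinary cases, few edge cases.
training_rows = (
    [{"scenario": "normal"}] * 95
    + [{"scenario": "late_payment"}] * 5
)
expected = ["normal", "late_payment", "fraud"]

gaps = underrepresented(training_rows, "scenario", expected, min_count=10)
print(gaps)
```

Here the rare scenario is thin and one expected scenario is absent altogether, which is exactly the kind of gap that a large row count can hide.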
4. Ignoring near-misses and overrides
During initial training, it is hard to identify near-misses and disregarded data points. However, in a continuous learning loop with feedback, it becomes essential to pay close attention to near-misses and to human or machine overrides.
When you deploy your AI system for the first time, it has only a base model governing its performance. However, as system operation continues, the feedback loop feeds in live data, and the system starts to adjust, either live or at regular intervals.
If the model has missed the correct prediction or calculation by only a small amount, and the decision has changed as a result, it is a near-miss. For example, in a loan approval system, if a score of 88.5 per cent means ‘loan approved’ and 88.6 per cent results in ‘loan declined,’ this scenario is a near-miss.
From a technical and purely statistical point of view, the decision is correct. However, from a real-life perspective, a margin of error may play a significant role. If the decision is contested by the affected party, such as a loan applicant, the chances of it being changed are higher. Therefore these types of data points are of particular interest, and you should not ignore them.
The same applies when a human operator is supervising the AI system’s output and can decide to override it. You must treat the human operator override as a special-case scenario, and feed it back to the training model. Each of these scenarios either highlights inadequacies in the base model or provides new situations that never existed before. Ignoring overrides can degrade the model performance over time.
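As a rough sketch of how near-misses and overrides might be singled out for review before they are fed back, the code below tags any decision whose score falls within a margin of the cut-off, and any decision a human changed. The cut-off of 88.55, the margin of 0.5, the record layout, and the function name are all assumptions made for illustration; they follow the loan example above but are not taken from a real system.

```python
def review_reasons(score, threshold, margin, model_decision, final_decision):
    """Return the reasons, if any, why a decision deserves a closer look."""
    reasons = []
    if abs(score - threshold) <= margin:
        reasons.append("near-miss")  # score sits close to the cut-off
    if final_decision != model_decision:
        reasons.append("override")  # a human changed the outcome
    return reasons

# Hypothetical decision log; "final" is the outcome after human supervision.
decisions = [
    {"score": 88.5, "model": "approved", "final": "approved"},
    {"score": 88.6, "model": "declined", "final": "declined"},
    {"score": 95.0, "model": "declined", "final": "approved"},
    {"score": 60.0, "model": "approved", "final": "approved"},
]

THRESHOLD = 88.55  # assumed cut-off between the two example scores
MARGIN = 0.5       # assumed width of the "too close to call" band

review_queue = []
for record in decisions:
    reasons = review_reasons(
        record["score"], THRESHOLD, MARGIN, record["model"], record["final"]
    )
    if reasons:
        review_queue.append((record, reasons))

for record, reasons in review_queue:
    print(record["score"], reasons)
```

How wide the margin should be is a business decision rather than a statistical one, which is why it is kept as an explicit parameter here.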
5. Conflating correlation and causation
In statistics, we often say, “Correlation does not imply causation.” It usually refers to the inability to legitimately deduce a cause-and-effect relationship between input variables and output. The resulting conclusion may still turn out to be true, but the failure to establish this relationship is often an indicator of a lurking problem.
Similarly, the predictive power of your model does not necessarily imply that you have established an exact cause-and-effect relationship in your model. Your model may very well be picking up correlations among input parameters and predicting output based on those.
You may think, “As long as it works, it shouldn’t matter.” However, the distinction matters since many machine learning algorithms pick up on parameters simply because there is a high correlation. Determining causality based on correlations can be very tricky and can potentially lead to contradictory conclusions. It would be much better to be able to prove that there truly is a causal relationship.
However, these days, developers and data scientists are merely relying on statistical patterns. Many of them fail to recognise that these patterns are only correlations amongst vast amounts of data rather than causative truths or natural laws, which govern the real world.
So, how do you deal with conflation?
Try this: during initial training and model building, don’t conclude too quickly after you find a correlation. Take time to look for other underlying or hidden factors, verify them, and only then conclude.
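To see why a hidden factor matters, consider a small synthetic illustration. In the sketch below, two variables never influence each other; both are driven by a third, hidden factor. Measured over the whole dataset they look strongly correlated, yet within each level of the hidden factor the correlation largely disappears. The data and all names are made up for this example, and checking within groups is only one simple way to probe for a confounder, not a proof of causality.

```python
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

# Synthetic data: x and y each depend on `hidden`, never on each other.
rng = random.Random(42)
rows = []
for i in range(1000):
    hidden = i % 2  # the hidden factor, with two levels
    x = 5 * hidden + rng.gauss(0, 1)
    y = 5 * hidden + rng.gauss(0, 1)
    rows.append((hidden, x, y))

# Correlation over the whole dataset, ignoring the hidden factor.
overall = pearson([x for _, x, _ in rows], [y for _, _, y in rows])

# Correlation within each level of the hidden factor.
within = {}
for level in (0, 1):
    xs = [x for h, x, _ in rows if h == level]
    ys = [y for h, _, y in rows if h == level]
    within[level] = pearson(xs, ys)

print(f"overall correlation: {overall:.2f}")
for level, value in within.items():
    print(f"within hidden={level}: {value:.2f}")
```

A model trained on x to predict y would score well on this data, and would stop working the moment the hidden factor behaved differently.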
The basis for trust
As technology is improving day by day, it is placing powerful tools in the hands of people who do not understand how these work. It is creating significant business as well as societal risks.
Developers and data scientists are becoming increasingly detached from the intricacies of the tools they are using and the systems they’re creating.
“An AI system is a black box” is becoming commonly accepted rhetoric, and the only sure-fire way to trust this black box is to train it meticulously and test it rigorously!
Anand Tamboli is a serial entrepreneur, speaker, award-winning published author and emerging technology thought leader