Machine Learning Basics For Newbies


Preparing data.

Once data is collected, check the quality of what will be fed to the system as training data. Set aside time to assess that quality and then fix issues such as outliers and missing values. Exploratory analysis is one methodology for studying the nuances of the data in detail, which in turn improves the quality of the data.
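The two fixes named above, missing values and outliers, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the column of ages is made up, missing entries are filled with the median, and outliers are clipped to the usual 1.5 x IQR fences. Other treatments (dropping rows, model-based imputation) may suit a real data set better.

```python
import statistics


def clean_column(values):
    """Fill missing entries with the median, then clip outliers to the IQR fences."""
    present = [v for v in values if v is not None]
    median = statistics.median(present)
    filled = [median if v is None else v for v in values]

    # Quartiles of the filled column; the fences sit 1.5 * IQR beyond them.
    q1, _, q3 = statistics.quantiles(filled, n=4)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [min(max(v, low), high) for v in filled]


# Hypothetical column: one missing value (None) and one obvious outlier (240).
ages = [23, 25, None, 31, 29, 27, 240, 26]
cleaned = clean_column(ages)
print(cleaned)
```

The missing entry takes the median of the observed values, and the outlier is pulled back to the upper fence instead of being deleted, so the column keeps its original length.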

Training a model.

This step involves selecting an appropriate algorithm and representing the data in the form of a model. The cleaned data is split into two parts, training and test (the proportion depends on the requirements of the problem). The first part (training data) is used to develop the model, whereas the second part (test data) is held back as a reference for evaluation.
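As a rough sketch of this step, the snippet below shuffles a small data set, holds back 20 per cent of it as test data and fits a straight line to the rest by least squares. The data, the 80/20 ratio and the function names `split_rows` and `fit_line` are all assumptions made for illustration; a real project would normally use a library for both tasks.

```python
import random


def split_rows(rows, test_fraction=0.2, seed=0):
    """Shuffle a copy of the rows and return (training rows, test rows)."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def fit_line(points):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Hypothetical data that follows y = 2x + 1 exactly.
rows = [(x, 2 * x + 1) for x in range(10)]
train_rows, test_rows = split_rows(rows, test_fraction=0.2)

# Only the training rows are used to develop the model.
slope, intercept = fit_line(train_rows)
print(len(train_rows), len(test_rows), slope, intercept)
```

Fixing the seed keeps the split reproducible, which makes later comparisons between models fair.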

Evaluating the model.

This step evaluates the machine learning model you have chosen to implement. The second part of the data (test data) is used to test the accuracy of the model. This step determines how accurate the selected algorithm is, based on the outcome.

A better test of accuracy is to see how the model performs on data that has not been used at all while building it.
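Measuring accuracy on held-out data can be sketched as follows. The threshold rule standing in for a trained model and the five labelled test examples are hypothetical; the point is only that accuracy is the share of test examples the model labels correctly.

```python
def accuracy(model, examples):
    """Fraction of (features, label) pairs that the model labels correctly."""
    correct = sum(1 for features, label in examples if model(features) == label)
    return correct / len(examples)


def threshold_model(x):
    """Hypothetical stand-in for a trained classifier: predict 1 when x >= 5."""
    return 1 if x >= 5 else 0


# Held-out examples that played no part in building the model.
test_data = [(1, 0), (3, 0), (4, 1), (6, 1), (8, 1)]
test_accuracy = accuracy(threshold_model, test_data)
print(test_accuracy)
```

Here the model misses one of the five examples (the point at 4), so the accuracy comes to four out of five.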

Improving the performance.

This step may involve choosing a different model altogether, or introducing more variables to improve the efficiency of the learning model. If the model is changed, it needs to be evaluated again and its performance rechecked. This is why a lot of time needs to be spent on collecting and preparing data.
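One way to picture this loop is to score several candidate models on the same test data and keep the one with the lowest error. The sketch below compares a baseline that always predicts the training mean with a least-squares line. The data points and the names `baseline` and `line` are assumptions made for illustration.

```python
def mean_squared_error(model, examples):
    """Average squared difference between predictions and true values."""
    return sum((model(x) - y) ** 2 for x, y in examples) / len(examples)


# Hypothetical training and test data, roughly following y = 2x + 1.
train = [(0, 1.0), (1, 3.1), (2, 4.9), (3, 7.2), (4, 9.0)]
test = [(5, 11.1), (6, 12.8)]

mean_x = sum(x for x, _ in train) / len(train)
mean_y = sum(y for _, y in train) / len(train)

# Least-squares slope and intercept computed from the training data only.
slope = (sum((x - mean_x) * (y - mean_y) for x, y in train)
         / sum((x - mean_x) ** 2 for x, _ in train))
intercept = mean_y - slope * mean_x


def baseline(x):
    """Candidate 1: ignore x and always predict the training mean."""
    return mean_y


def line(x):
    """Candidate 2: the fitted straight line."""
    return slope * x + intercept


# Every candidate is evaluated again on the same held-out data.
candidates = {"baseline": baseline, "line": line}
scores = {name: mean_squared_error(model, test) for name, model in candidates.items()}
best = min(scores, key=scores.get)
print(best, scores)
```

Each time a model is swapped or a variable is added, the same evaluation is rerun, so the candidates stay directly comparable.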

Tools for implementing machine learning

There are plenty of open source tools, software packages and frameworks for implementing machine learning in any scenario, and you can choose among them based on your preferred language or environment. Let us take a look at some of these.


Shogun is one of the oldest and most venerable machine learning libraries available. It was first developed in 1999 in C++, but it is no longer limited to C++; it can be used transparently from many languages and environments such as Java, C#, Python, Ruby, R, Octave, Lua and MATLAB. It is easy to use, and is quite fast at compilation and execution.


Weka was developed at the University of Waikato in New Zealand. It collects a set of Java machine learning algorithms engineered specifically for data mining. This GNU GPLv3-licensed collection has a package system that can be used to extend its functionality, with both official and unofficial packages available.

Weka comes with a book that explains the software and the techniques used in it. While Weka is not aimed specifically at Hadoop users, it can be used with Hadoop as well, thanks to a set of wrappers produced for the most recent versions of Weka. It does not support Spark, but Clojure users can also use Weka.


CUDA-Convnet is a machine learning library used mainly for neural network applications. It is written in C++ in order to exploit Nvidia’s CUDA GPU processing technology. It can also be used by those who prefer Python over C++: the neural nets this library produces can be saved as Python-pickled objects and, hence, can be accessed from Python.

Note that the original version of the project is no longer being developed, but has been reworked into a successor named CUDA-Convnet2. The successor supports multiple GPUs and Kepler-generation GPUs.
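To show what saving a net as a Python-pickled object means in practice, here is a minimal round trip with the standard `pickle` module. The dictionary below is a hypothetical stand-in for a trained network. It is not CUDA-Convnet's actual checkpoint layout, which is not described here.

```python
import pickle

# Hypothetical stand-in for a trained net: layer names plus a few weights.
net = {
    "layers": ["conv1", "pool1", "fc10"],
    "weights": {"fc10": [0.1, -0.2, 0.3]},
}

# Serialise to bytes (this is what would be written to a checkpoint file) ...
blob = pickle.dumps(net)

# ... and load it back as an ordinary Python object.
restored = pickle.loads(blob)
print(restored["layers"])
```

Because the restored object is plain Python data, it can be inspected or post-processed without touching any C++ code. Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.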


H2O is an open source machine learning framework developed by 0xdata. H2O’s algorithms are geared toward business processes such as fraud or trend prediction. H2O can interact in a standalone fashion with different HDFS stores. It can run in MapReduce, on top of YARN or directly in an Amazon EC2 instance.

Hadoop mavens can use Java to interact with H2O, but the framework also provides bindings for R, Python and Scala, which enables cross-interaction with all the libraries available on those platforms.

Applications of machine learning

The world is on the path to becoming smarter through automation of all possible manual tasks. Google and Facebook use machine learning to push their respective advertisements to relevant users. Given below are a few applications that you should know of.

Banking and financial services.

Machine learning is widely used to predict which customers are likely to default on credit card bills or loan repayments. This is of utmost importance, as it helps banks identify customers who can be given credit cards and loans.


In healthcare, machine learning is widely used to diagnose deadly diseases (such as cancer) on the basis of patients’ symptoms, by tallying these against the past data available for similar patients.

Fig. 4: H2O on Hadoop


