ML Series – #3: Data In/Data Out or Garbage In/Garbage Out?

Now that we have looked at the broad categories into which we can slot Machine Learning programs, it is time to focus on the MOST important aspect of an ML system: data.

Understanding your data, analyzing it, cleaning it and preparing it for Machine Learning is extremely tedious (maybe second only to the endless trials of selecting/creating the right model) and the most unglamorous part of the job! But it is the MOST IMPORTANT part of the process, as every subsequent step relies on the correctness of the data selected.

There are many new terms and concepts to learn when dealing with data: outliers, dimensionality reduction, feature extraction, sampling bias and sampling error, to name a few!

Overall, there are two aspects of the data you collect:

  1. The Quality of data
  2. The Relevance of data

For each point above, we have created a mind map below that lists some of the important terms and the typical issues you run into when dealing with data!

  • Quality of data: Here we focus on the amount of data collected and on its quality in terms of accuracy, outliers, missing attributes etc., and on how we handle such scenarios.

[Mind map: quality of data]
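
To make two of the quality issues above concrete, here is a minimal sketch of handling missing attributes and spotting outliers. It is only one possible approach, not a recommendation for every dataset: it assumes median imputation for missing values and the common 1.5 × IQR rule for outliers, and the `ages` list and both helper functions are made up purely for illustration.

```python
import statistics

def impute_missing(values):
    """Replace None entries with the median of the observed values.

    Median imputation is an assumption made for this sketch; dropping
    rows or using a model-based fill may suit other datasets better.
    """
    observed = [v for v in values if v is not None]
    median = statistics.median(observed)
    return [median if v is None else v for v in values]

def iqr_outliers(values, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

# Hypothetical "age" column with two missing entries and one typo (240).
ages = [23, 25, None, 31, 29, 27, 240, 26, None, 30]

filled = impute_missing(ages)
outliers = iqr_outliers(filled)

print(filled)
print(outliers)
```

Whether a flagged value is a data-entry error to fix or a genuine rare case to keep is a judgment call that depends on the problem; the code only surfaces the candidates.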

  • Relevance of data: Next we need to check whether the large amount of good-quality data we have collected is actually relevant to the problem we are trying to solve, or whether it needs a little massaging to make it more relevant!

[Mind map: relevance of data (feature selection)]
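
As a rough illustration of checking relevance, the sketch below ranks candidate features by the absolute Pearson correlation with the target. This is just one simple filter-style heuristic chosen for the example (it only captures linear relationships and looks at one feature at a time), and the feature names, values and helper functions are all hypothetical.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 for a constant column."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        return 0.0  # a feature that never varies carries no signal
    return cov / (spread_x * spread_y)

def rank_features(features, target):
    """Sort feature names by |correlation with target|, strongest first."""
    scores = {name: abs(pearson(col, target)) for name, col in features.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Hypothetical house-price data: one useful feature, one irrelevant
# feature, and one constant column.
features = {
    "size_sqft": [800, 1000, 1200, 1500, 1800],
    "house_number": [12, 7, 44, 3, 21],
    "constant_flag": [1, 1, 1, 1, 1],
}
price = [160, 200, 245, 300, 355]

ranking = rank_features(features, price)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

A low score here is a hint to investigate, not proof that a feature is useless: a feature can matter through a non-linear relationship or only in combination with others.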

Hopefully these two simple mind maps serve their purpose of helping you remember what to watch out for when dealing with data and what the new terms mean.

Until Next Time!

Team Cennest
