While many people think machine learning (ML) is synonymous with artificial intelligence (AI) and robots that will eventually become our overlords… this isn’t quite what we mean.
Machine learning is a humble set of tools born out of the fields of statistics and computer science. These statistical algorithms are quirky and sometimes surprisingly frail, typically requiring quite a bit of cajoling to do what you want.
The goal is simple… create the ability to algorithmically (automatically via a computer program) derive information from data.
Now, machine learning has a few different flavors depending on what you are trying to do. These can be broken into supervised learning, unsupervised learning, semi-supervised learning, and reinforcement learning.
Supervised Learning
In supervised learning, the data contains both attributes (we usually call them features) and labels. This learning style can be broken into two categories: classification and regression.
Classification
An example of classification is predicting whether or not it will rain today. The answer is simply yes or no.
Data for this supervised classification problem would have features like:
- Cloud Cover (%)
- Humidity
- Temperature
- Barometric Pressure
- Etc.
We might gather a year’s worth of data, one row for each day. The key here is that each row would have numbers for the above features and the answer. Did it rain that day? Yes or No.
Some examples of algorithms that can be used for classification are k-Nearest Neighbors and Logistic Regression (yes, I know it says regression… that’s what it’s called. But I promise it’s for classification).
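To make that concrete, here's a minimal sketch of what this could look like in Python with scikit-learn (a popular ML library). All the weather numbers here are made up purely for illustration:

```python
# Toy example: predict "did it rain today?" from daily weather features.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Features: [cloud cover (%), humidity (%), temperature (F), pressure (hPa)]
# Each row is one day; the numbers are invented for illustration.
X = np.array([
    [90, 85, 62, 1002],  # overcast, humid day
    [10, 30, 75, 1020],  # clear, dry day
    [75, 70, 58, 1008],
    [20, 40, 80, 1018],
])
y = np.array([1, 0, 1, 0])  # labels: 1 = it rained, 0 = it did not

model = LogisticRegression()
model.fit(X, y)

# Predict for a new day: mostly cloudy and fairly humid
print(model.predict([[80, 75, 60, 1005]]))  # e.g. array([1]) -> rain
```

Swapping in k-Nearest Neighbors instead is a one-line change (`KNeighborsClassifier` from `sklearn.neighbors`).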
Regression
Regression is slightly different. Instead of the label (the answer) being yes or no, we try to predict how much. So, in the above example we could have the exact same features, but for regression the answer would be how many inches of rain fell on a particular day.
Both classification and regression are useful, depending on the situation. And many classification problems can be transformed into regression problems, and vice versa (as we saw with the rain example).
Some examples of algorithms that can be used for regression are k-Nearest Neighbors (yep, it can do both!) and Linear Regression.
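Here's the same idea as a quick sketch, again with invented numbers, this time predicting inches of rainfall with Linear Regression:

```python
# Same features as before, but now the label is a quantity: inches of rain.
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([
    [90, 85, 62, 1002],
    [10, 30, 75, 1020],
    [75, 70, 58, 1008],
    [20, 40, 80, 1018],
])
y = np.array([0.8, 0.0, 0.4, 0.0])  # inches of rainfall that day (made up)

model = LinearRegression()
model.fit(X, y)

# Predict how much rain falls on a new day
print(model.predict([[80, 75, 60, 1005]]))  # e.g. array([0.6...])
```

And just like the classification case, `KNeighborsRegressor` from `sklearn.neighbors` would slot right in here.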
Unsupervised Learning
In unsupervised learning, the data contains only features with no labels. This means we don't have an answer (or label), and the only thing we can do is look for similarities in the data. Many times there is interesting, sometimes very intricate, structure hidden among the unlabeled data. It's the job of an unsupervised machine learning algorithm to find this structure.
This can come in the form of natural groupings (clusters), associations, or even anomalies among the samples in the data.
Clustering is a good example of unsupervised learning. If, for example, you wanted to understand the different types of users on a personal finance app, clustering would be a great option. You don’t know the answer ahead of time, but you might have quite a bit of data about each user’s behavior, demographics, etc. that would be fed into a clustering algorithm.
Some examples of clustering algorithms are K-Means Clustering and Spectral Clustering.
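As a rough illustration of the finance-app example, here's a sketch using scikit-learn's K-Means. The user behavior numbers are invented, and the features are just one plausible set you might track:

```python
# Toy example: group app users by behavior. Each row describes one user.
import numpy as np
from sklearn.cluster import KMeans

# Features: [logins per week, avg session minutes, transactions per month]
X = np.array([
    [14, 12, 30],  # heavy daily users
    [15, 10, 28],
    [2, 45, 3],    # infrequent, but long sessions
    [1, 50, 2],
    [7, 5, 10],    # quick check-ins
    [6, 6, 12],
])

# Note: there are no labels here -- the algorithm finds the groups itself
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)
print(labels)  # cluster IDs are arbitrary, e.g. [0 0 1 1 2 2]
```

In practice you'd usually scale the features first (e.g., with `StandardScaler`), since K-Means is sensitive to features with large ranges.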
Semi-Supervised and Reinforcement Learning
Although these are not topics we'll cover in this series, or much at all at The Data Hackr, they're important to call out. Semi-supervised learning makes use of both labeled and unlabeled data to learn a task; it's handy when labels are expensive to collect, so only part of the data has them. Reinforcement learning is a different paradigm altogether (though the two are sometimes confused), and it's typically used for tasks like learning to play a game (like Go or Chess), or things like self-driving cars.
In reinforcement learning, the machine learning algorithm (called the agent) observes the state of its environment (the game situation, or the road and surrounding cars), makes a decision, and is rewarded or penalized based on that decision. After doing this over and over, the algorithm learns to avoid situations where it gets penalized and to optimize for decisions that lead to the highest reward.
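To give a flavor of that loop, here's a minimal sketch of tabular Q-learning (one classic reinforcement learning algorithm) on a made-up toy environment: a five-cell corridor where the agent earns a reward for reaching the right end. Everything here (the environment, the states, the rewards) is invented for illustration:

```python
import random

random.seed(0)                          # make the toy run repeatable

n_states, n_actions = 5, 2              # a 5-cell corridor; actions: 0 = left, 1 = right
Q = [[0.0] * n_actions for _ in range(n_states)]
alpha, gamma, epsilon = 0.5, 0.9, 0.1   # learning rate, discount, exploration rate

def greedy(state):
    # Pick the highest-value action, breaking ties randomly
    best = max(Q[state])
    return random.choice([a for a in range(n_actions) if Q[state][a] == best])

for episode in range(200):
    state = 2                           # the agent starts in the middle
    done = False
    while not done:
        # Occasionally explore; otherwise exploit what has been learned so far
        action = random.randrange(n_actions) if random.random() < epsilon else greedy(state)
        next_state = state + (1 if action == 1 else -1)
        reward = 1.0 if next_state == n_states - 1 else 0.0  # reward only at the right end
        done = next_state in (0, n_states - 1)               # both ends end the episode

        # Q-learning update: nudge this state/action's value estimate toward
        # the reward plus the discounted value of the best next action
        target = reward + (0.0 if done else gamma * max(Q[next_state]))
        Q[state][action] += alpha * (target - Q[state][action])
        state = next_state

# After training, the agent has learned to head for the rewarding end
print([greedy(s) for s in range(1, n_states - 1)])  # typically [1, 1, 1]: always go right
```

The table `Q` is the agent's running estimate of how good each decision is in each situation; the reward signal, trickling backwards through those updates, is what stands in for the "answer" a supervised learner would get directly.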