Amazon Machine Learning for a binary classification dataset

Machine learning as a service is real.  Clean data with a well organized schema can be fed to cloud-based machine learning services with a decent ML model returned in less than 30 minutes.  The resulting model can be used for inferring  target values of new datasets in the form of batch or real-time predictions.  All three public cloud vendors (AWS, Microsoft, Google) are competing in this space which makes the services cheap and easy to consume.

In our last discussion we ran the Kaggle red wine quality dataset through the Amazon Machine Learning service.  The data was fed to AWS without any manipulation, which AWS interpreted as a regression problem, returning a linear regression model.  Each of the subjective wine quality ratings was treated as an integer from 0 (worst) to 10 (best), with a resulting model that could predict wine quality scores.  Honestly, the results weren’t spectacular – we could have gotten similar results by just guessing the median value (6) every time, and we would have scored nearly as well on our RMSE value.

Our goal was to demonstrate Amazon’s machine learning capabilities in solving a regression problem and was not to create the most accurate model.  Linear regression may not be the best way to approach our Kaggle red wine quality dataset.  A (somewhat) arbitrary judge’s score from 0-10 probably does not have a linear relationship with all of the wine’s chemical measurements.

What other options do we have to solve this problem using the Amazon Machine Learning service?

Binary Classification

What is a machine learning binary classification problem?  It’s one that tries to predict a yes/no or true/false answer – the outcome is binary.  A dataset is labeled with the yes/no or true/false values.  This dataset is used to create a model to predict yes/no or true/false values on new data that is unlabeled.

An example of binary classification is a medical test for a specific disease.  Data is collected from a large group of patients who are known to have the disease and known not to have the disease.  New patients can then be tested by collecting the same data points and feeding them to a model.  The model will predict (with error rates) whether it believes the new patients have the disease.

Let’s explore how Amazon Machine Learning performs with a simple binary classification dataset.

Kaggle Red Wine Quality Dataset

We ran the Kaggle Red Wine Quality dataset through the Amazon machine learning regression algorithms in the last post.  Why this dataset?  Because it was clean data with a relatively simple objective – to predict the wine quality from its chemical measurements.  We also had no data wrangling to perform – we simply uploaded the CSV to AWS and had our model created with an RMSE evaluation score ready to review.

This time we want to change things a bit.  Rather than treat the 0-10 sensory quality ratings as integers, we want to turn the quality ratings into a binary rating.  Good or bad.  Great or not-so-great.  This greatly simplifies our problem – rather than deal with the wide swing of a ten point wine rating, we can simply categorize each wine as great or not-so-great.  In order to do this we need to edit our dataset and change the quality ratings to a one (great) or a zero (not-so-great).

Data preparation – feature engineering

Go ahead and download the “winequality-red.csv” file from Kaggle.  Open up the .CSV file as a spreadsheet.  We need to replace the 0-10 quality scores with a 1 (great) or 0 (not-so-great).  Let’s assume most wines in this dataset are fairly average – rated a 5 or 6.  The truly great wines are rated above 6 and the bad wines below 5.  So for the sake of this exercise, we’ll say wines rated 7 and up are great and wines rated 6 and under are not-so-great.

All we have to do is edit our CSV file with the new 0 or 1 categories, easy right?  Well, kind of.  The spreadsheet has ~1600 ratings and manually doing a search and replace is tedious and not easily repeatable.  Most machine learning datasets aren’t coming from simple and small CSV files but rather from big datasets hosted in SQL/NoSQL databases, object stores, or even distributed filesystems like HDFS.  Manual editing often won’t work and definitely won’t scale for larger problems.

Most data scientists will spend a decent amount of time manipulating and cleaning up datasets with tools built on some type of high level programming language.  Jupyter notebooks are a popular tool and can support your programming language of choice.  Jupyter notebooks are much more efficient for data wrangling using code instead of working manually with spreadsheets.  Amazon even hosts cloud-based Jupyter notebooks within Amazon SageMaker.

Converting the wine ratings from 0-10 to a binary 0/1 is pretty easy in Python.  Just open the CSV file, test if each quality rating is a 7 or higher (> 6.5), and convert the resulting true/false value to an integer.  Then we’ll write out the changes to a new CSV file which we’ll use for our new datasource.

Python code
import pandas as pd

wine = pd.read_csv('winequality-red.csv')
# Ratings of 7 and up become 1 (great), 6 and under become 0 (not-so-great)
wine['quality'] = (wine['quality'] > 6.5).astype(int)
# index=False keeps pandas from writing an extra index column,
# which would otherwise show up as a new field in the AWS schema
wine.to_csv('binary.wine.csv', index=False)
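Before uploading the new file, it’s worth a quick sanity check on the class balance – only a small minority of the real wines clear the 7-and-up threshold, and a lopsided great/not-so-great split is good to know about before training.  A minimal sketch, using a handful of made-up quality scores in place of the real column:

```python
import pandas as pd

# A few invented quality scores standing in for the real 'quality' column
quality = pd.Series([5, 6, 7, 5, 6, 8, 6, 5, 7, 6])

# Same thresholding as above: 7 and up -> 1 (great), else 0 (not-so-great)
binary = (quality > 6.5).astype(int)

# How many great vs. not-so-great wines would we be training on?
print(binary.value_counts())
```

Running the same value_counts() against the real binary.wine.csv would show exactly how imbalanced the two classes are.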

Running the binary classification dataset through Amazon Machine Learning

Our dataset now has a binary wine rating – 1 for great wine and 0 for not-so-great wine.  Upload this new CSV file to an AWS S3 bucket that we will use for machine learning.

The process to create a dataset, model, and evaluation of the model is the same for binary classification as we documented in the blog post about linear regression.  Create a new datasource and ML model from the main dashboard and point the datasource to the S3 bucket with our new binary CSV file.  The wizard will verify the schema and confirm that our file is clean.

What is different from linear regression is the schema: we want to make sure the ‘quality’ column is a ‘binary’ type rather than a ‘numerical’ type.  All other values are numerical; only the quality rating is binary.  This should be the default behavior but it’s best to double-check.  Also select “yes” to indicate that the first line in the CSV file contains the column names so they are removed from the model.

SCHEMA

After finishing the schema, select your target prediction value which is the ‘quality’ column.  Continue through the rest of the wizard and accept all the default values.  Return to the machine learning dashboard and wait for the model to complete and get evaluated.

How did we do?

If all went well, we should have a completed model and evaluation of that model with no failures in our dashboard.  Let’s check out the results by opening the evaluation.

evaluation

Looking at our evaluation summary, things look pretty good.  We have a summary saying our ML model’s quality score was measured with an AUC score of .778.  That summary is somewhat vague so let’s click “explore model performance”.

Model Performance

By default the wizard saves 30% of our dataset for evaluation so we can measure the accuracy of our model’s predictions.  Binary classification algorithms are measured with an AUC or Area Under the Curve score.  The measurement is a value between 0 and 1 with 1 being a perfect model that predicts 100% of the values correctly.
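The AUC that AWS reports is the same metric scikit-learn exposes as roc_auc_score.  A toy sketch with invented labels and scores (chosen here so the result happens to land at 0.778, like our model’s score):

```python
from sklearn.metrics import roc_auc_score

# Invented true labels (1 = great wine) and model scores for six wines
y_true  = [0, 0, 0, 1, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8, 0.3, 0.9]

# AUC is the probability that a random positive outscores a random negative
auc = roc_auc_score(y_true, y_score)
print(round(auc, 3))  # 0.778
```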

Our score shows our model got 83% of our predictions correct and 17% incorrect.  Not bad!  What is nice about this type of scoring is we can also see our false positives (not-so-great wine classified as great) and false negatives (great wine predicted as not-so-great).  Specifically, our model had 55 false positives and 25 false negatives.

Predicting quality wine is not exactly a matter of life and death.  Which means we aren’t necessarily concerned with false positives or false negatives as long as we have a decent prediction model.  But for other binary classification problems we may want to adjust our model to avoid false positives or false negatives.  This adjustment is made using the slider on the model performance screen shown above.

The adjustments come at a cost – if we want fewer false positives (bad wine predicted as great) then we’ll have more false negatives (great wine accidentally predicted as bad).  The reverse is also true: if we want fewer false negatives (great wine predicted as bad), we will have more false positives (bad wine predicted as great).
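The tradeoff is easy to demonstrate in plain Python.  The sketch below uses invented scores and labels – sliding the cutoff up trades false positives for false negatives, just like the AWS slider:

```python
# Invented model scores and true labels to illustrate the threshold tradeoff
y_true  = [0, 0, 0, 0, 1, 1, 1, 1]
y_score = [0.2, 0.4, 0.6, 0.3, 0.55, 0.7, 0.45, 0.9]

def fp_fn(threshold):
    """Count false positives and false negatives at a given cutoff."""
    preds = [1 if s >= threshold else 0 for s in y_score]
    fp = sum(1 for p, t in zip(preds, y_true) if p == 1 and t == 0)
    fn = sum(1 for p, t in zip(preds, y_true) if p == 0 and t == 1)
    return fp, fn

print(fp_fn(0.3))  # low cutoff:  (3, 0) - more false positives
print(fp_fn(0.5))  # middle:      (1, 1)
print(fp_fn(0.7))  # high cutoff: (0, 2) - more false negatives
```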

Final thoughts

AWS uses specific learning algorithms for solving the three types of machine learning problems supported.  For binary classification, Amazon ML uses logistic regression which is a logistic loss function plus SGD.  Most beginners won’t mind these limitations but if we want to use other algorithms we’ll have to look at more advanced services like Amazon SageMaker.

Machine learning as a service is easy, fast, and cheap.  By using the AWS black box approach we built a working binary classification model in less than 30 minutes with minimal coding and only a high level knowledge of machine learning.

Thanks for reading!

 

Amazon Machine Learning for a simple regression dataset

Amazon Machine Learning is a public cloud service that offers developers access to ML algorithms as a service.  An overview can be found in the Amazon documentation and in my last blog post.   Consider the Amazon Machine Learning service a black box that automatically solves a few types of common ML problems – binary classification, multi-class classification, and regression.  If a dataset fits into one of these categories, we can quickly get a machine learning project started with minimal effort.

But just how easy is it to use this service?  What is it good for and what are the limitations?  The best way to outline the strengths and weaknesses of Amazon’s offering is to run a few examples through the black box and see the results.  In this post we’ll take a look at how AWS holds up against a very clean and simple regression dataset.

Regression

First off, what is a machine learning regression problem?  Regression modeling takes a dataset and creates a predictive algorithm to approximate one of the numerical values in that dataset.  Basically it’s a statistical model used to estimate relationships between data points.

An example of regression modeling is a real estate price prediction algorithm.  Historical housing closing prices could be fed into an algorithm with as many attributes about each house as relevant (called features – ie. square feet, number of bathrooms, etc).  The model could then be fed attributes of houses about to go up for sale and predict the selling price.
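As a rough sketch of the idea (with invented house data, not a real listing feed), the same kind of model can be fit in a few lines of scikit-learn:

```python
from sklearn.linear_model import LinearRegression

# Invented training data: [square feet, bathrooms] -> closing price
X = [[1000, 1], [1500, 2], [2000, 2], [2500, 3], [3000, 3]]
y = [200_000, 280_000, 340_000, 420_000, 480_000]

model = LinearRegression().fit(X, y)

# Predict the sale price of a house about to go on the market
predicted = model.predict([[1800, 2]])[0]
print(round(predicted))
```

AWS is doing a fancier version of this fit behind the scenes – the point is just that a regression model maps numeric features to a numeric target.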

Let’s explore how Amazon Machine Learning models and performs with a simple regression dataset.

Kaggle Red Wine Quality Dataset

Kaggle is a community-based data science and machine learning site that hosts thousands of public datasets covering the spectrum of machine learning problems.  Our interest lies in taking Amazon Machine Learning for a test drive on a simple regression problem so I’m using a simple and clean regression dataset from Kaggle – Red Wine Quality.

Why this dataset?  Because we aren’t going to cover any data cleaning or transformations – we’re more interested in finding out how the AWS black box performs on a dataset with minimal effort.  Of course this is not a real world example but it keeps things simple.

The dataset contains a set of numerical attributes of 1600 red wine variants of the Portuguese “Vinho Verde” wine along with a numerical quality sensory rating between 0 (worst) and 10 (best).  Our goal will be to train an AWS model to predict the quality of this variety of red wine based off the chemically measurable attributes (features).  Can data science predict the quality of wine?

Amazon Machine Learning Walkthrough

First download the “winequality-red.csv” file from Kaggle.  Open up the .CSV file in a spreadsheet and take a look around.  We can see the column names for each column which include a “quality” column that rates the wine from 0-10.  Notice all fields are populated with numbers and there are no missing values.

We’ll need to upload our .CSV file into AWS storage to get our process started so log into the AWS console and open the S3 service.

S3 console

Create a new bucket with the default settings to dedicate to our machine learning datasets.

Create a new bucket

Upload the winequality-red.csv file into this bucket and keep the default settings, permissions will be adjusted later.

Upload the winequality-red.csv file

Now open AWS Machine Learning service in the AWS console.

Amazon Machine Learning Console

Click the “get started” button and then “view dashboard”.

Machine Learning Dashboard Launch

Click “create a new datasource and ML model”.

Create new datasource and ML model

AWS will prompt us for the input data for the datasource, we will use our S3 bucket name and the wine quality CSV file that was uploaded (it should auto-populate once we start typing).  Click “verify” to confirm the CSV file is intact.  At this point AWS will adjust the permissions for the ML service to have access to the S3 bucket and file (click “yes” to give it permissions).  Once we have a successful validation message, click “continue”.

Create datasource from CSV in S3 bucket

Successful validation

The wizard will now review the schema.  The schema is simply the fields in our CSV file.  Because our CSV contains the column names (fixed acidity, volatile acidity, etc.), we want to select “yes” so we don’t try to model the quality based off the column names; selecting “yes” will remove the names from the model.

Schema

Now we want to select the target value that we want the model to predict. As stated earlier, we want to predict the quality of the wine based off the measurable attributes so select “quality” and continue.

Target

Keep the row identifier as default “no”.  We’ll skip this topic but it would be important if we were making predictions in a production environment.  Review the selections and “continue”.

Row Identifier

Review

Accept the default ML model settings which will split our dataset and set aside 30% of the data for evaluating the performance of the model once it is trained using 70% of the data.

Review one last time and then select “create ML model”.

Model settings

Review

We are done!  The dashboard will show the progress; the entire process takes about 10-20 minutes.  AWS will take over and read the data, perform a 70/30 split, create a model, train the model using 70% of the data, and then evaluate the model using the remaining 30% of the data.
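The 70/30 split AWS performs is nothing exotic – it is equivalent to a random holdout split, roughly like this scikit-learn sketch (the row numbers stand in for the real wine records):

```python
from sklearn.model_selection import train_test_split

# Stand-in rows for the ~1600 records in the wine dataset
rows = list(range(100))

# Hold out 30% for evaluating the model trained on the other 70%
train, evaluate = train_test_split(rows, test_size=0.3, random_state=42)

print(len(train), len(evaluate))  # 70 30
```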

How did we do?

If all went well, we should have a completed model and evaluation of that model with no failures in our dashboard.  Let’s check out the results by opening the evaluation.

Once again, this isn’t a real world example.  A data scientist would typically spend much more time looking at the data prior to evaluating an algorithm.  But since this is a black box approach, we will entertain the notion of skipping to the end without doing any data exploration.

EVALUATION SUMMARY

Looking at our evaluation summary, things look pretty good.  We have a green box saying our ML model’s quality score was measured with an RMSE of 0.729 and was better than the baseline value of 0.793.  That summary is somewhat vague so let’s click “explore model performance”.

ML Model performance

Since we saved 30% of our data for evaluation, we can measure the accuracy of our model’s predictions.  A common way to measure regression accuracy is the RMSE score or root-mean-square-error. This RMSE score is measuring the difference between the model’s prediction and the actual value in the evaluation set.  A RMSE of 0 is perfect, the smaller the value the better the model.

We can see a somewhat bell-shaped curve and our model’s error is pretty close to zero, which seems great.  However, if we look back at our summary we see that the RMSE baseline is .793 and our score is .729.  The baseline would be our score if we just guessed the median (middle) quality score for every prediction.  So although we are a bit better, we aren’t better by much – we would have come close by ignoring all the attributes and guessing the median quality score (6) every time.
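To make the comparison concrete, here is a small sketch of how RMSE and the median-guess baseline are computed – the numbers below are invented stand-ins, not the actual wine scores:

```python
import statistics

# Invented actual quality scores and model predictions for illustration
actual    = [5, 6, 5, 7, 6, 4, 6, 8, 5, 6]
predicted = [5.4, 5.8, 5.2, 6.3, 6.1, 4.9, 5.7, 6.8, 5.3, 5.9]

def rmse(preds, truth):
    """Root-mean-square error between predictions and actual values."""
    return (sum((p - t) ** 2 for p, t in zip(preds, truth)) / len(truth)) ** 0.5

model_rmse = rmse(predicted, actual)

# Baseline: ignore all the features and guess the median quality every time
median = statistics.median(actual)
baseline_rmse = rmse([median] * len(actual), actual)

print(round(model_rmse, 3), round(baseline_rmse, 3))
```

A model is only earning its keep to the extent its RMSE beats that median-guess baseline – which is exactly the comparison the AWS summary makes for us.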

Amazon Machine Learning – the good

Although our results were not incredibly impressive, the easy wizard-driven process and time to results were very impressive.  We can feed the AWS ML black box data and have a working model with an evaluation of its accuracy in less than 30 minutes.  And this can be done with little to no knowledge of data science, ML techniques, computer programming, or statistics.

And this entire process can be automated via the AWS API sets.  Think of an automated process that collects data and automatically feeds it to the AWS machine learning engine using code instead of manually clicking buttons in the console.  Predictions could be generated automatically on constantly changing data using code and the black box machine learning algorithms in the public cloud.

Lastly, the service is inexpensive.  It cost about $3 to run half a dozen datasets through the service over the past month while researching this blog.  All we need is some data in S3 and the service can start providing ML insights at a very low cost.

Amazon Machine Learning – the bad

A few points become clear after using the ML service.  The first is we have to adjust our expectations when feeding the AWS black box data.  Regression problems fed to Amazon Machine Learning are going to be solved using linear regression with stochastic gradient descent.  If a problem is easily solved using those algorithms then Amazon will do a great job making predictions.  If other algorithms like random forests or decision trees get better results then we need to look at more advanced Amazon services to solve those types of problems.

The second point is that data needs to be explored, cleaned, and transformed prior to feeding it to the machine learning service.  While we can look at correlations and data distributions after we create a datasource in AWS, there is no easy way to manipulate your dataset other than directly editing your CSV file.  There is an advanced function to use built-in data transformations as part of the ML model wizard, but this is more of an advanced topic and is limited to the data transformations referenced in the documentation.  Accepting the defaults and not wrangling your input data may not get great results from our AWS ML black box.

Use case

Have a machine learning problem that is easily solved using linear regression with SGD?  Have an existing data pipeline that already cleans and transforms your data?  Amazon machine learning can outsource your machine learning very quickly and cheaply in this case without much time spent learning the service.  Just don’t expect a shortcut to your ML process, send Amazon clean data and you can quickly get an API prediction endpoint for linear regression problems.

 

Amazon Machine Learning – Commodity Machine Learning as a Service

Commodity turns into “as a service”

Find something in computing that is expensive and cumbersome to work with and Amazon will find a way to commoditize it.  AWS created commodity storage, networking, and compute services for end users with resources leftover from their online retail business.  The margins may not have been great, but Amazon could make this type of business profitable due to their scale.

But what happens now that Microsoft and Google can offer the same services at the same scale?  The need for differentiation drives innovative cloud products and platforms with the goal of attracting new customers and keeping those customers within a single ecosystem.  Using basic services like storage, networking, and compute may not be differentiated between the cloud providers but new software as a service (SaaS) offerings are more enticing when shopping for cloud services.

Software as a service (SaaS) probably isn’t the best way to describe these differentiated services offered by public cloud providers.  SaaS typically refers to subscription-based software running in the cloud like Salesforce.com or ServiceNow.  Recent cloud services are better described by their functionality with an “as a service” tacked on.  Database as a service.  Data warehousing as a service.  Analytics as a service.  Cloud providers build the physical infrastructure, write the software, and implement vertical scaling with a self-service provisioning portal for easy consumption.  Consumers simply implement the service within their application without worrying about the underlying details.

Legacy datacenter hardware and software vendors may not appreciate this approach due to lost revenue but the “as a service” model is a good thing for IT consumers.  Services that were previously unavailable to everyday users have been democratized and are available to anyone with an AWS account and a credit card.  Cost isn’t necessarily the primary benefit to consumers but rather the accessibility and the ability to consume using a public utility model.  All users now have access to previously exotic technologies and can pay by the hour to use them.

Machine Learning as a Service

Machine learning is at the peak of the emerging technology hype cycle.  But is it really all hype?  Machine learning (ML) has been around for 20+ years.  ML techniques allow users to “teach” computers to solve problems without hard-coding hundreds or thousands of explicit rules.    This isn’t a new idea but the hype is definitely a recent phenomenon for a number of reasons.

So why all the machine learning hype?  The startup costs, both in terms of hardware/software and accessibility, are much lower, which presents an opportunity to implement machine learning that wasn’t available in the past.  Data is more abundant and data storage is cheap.  Computing resources are abundant and CPU/GPU cycles are (relatively) cheap.  Most importantly, the barriers have been lifted in terms of combining access to advanced computing resources and vast sets of data.  What used to require onsite specialized HPC clusters and expensive data storage arrays can now be performed in the cloud for a reasonable cost.  Services like Amazon Machine Learning are giving everyday users the ability to perform complicated machine learning computations on datasets that in the past were only available to researchers at universities.

How does Amazon provide machine learning as a service?  Think of the service as a black box.  The implementation details within the black box are unimportant.  Data is fed into the system, a predictive model/algorithm is created and trained, the system is tweaked for its effectiveness, and then the system is used to make predictions using the trained model.  Users can use the predictive system without really knowing what is happening inside the black box.

This machine learning black box isn’t magical.  It is limited to a few basic types of models (algorithms) – regression, binary classification, and multiclass classification.  More advanced operations require users to look to AWS SageMaker and demand a higher skill level than the basic AWS machine learning black box.  However, these three basic machine learning models can get you started on some real-world problems very quickly without knowing much math or programming.

Amazon Machine Learning  Workflow

So how does this process work at a high level?  If a dataset and use case can be identified as a regression or binary/multiclass classification problem, then the data can simply be fed to the AWS machine learning black box.  AWS will use the data to automatically select a model and train the model using your input data.  The effectiveness of the model is then evaluated and summarized with a numerical score.  The model is ready to use at this point but can also be tweaked to improve the scoring.  Bulk data can be fed to the trained model for batch predictions, or ad-hoc predictions can be performed using the AWS console or programmatically through the AWS API.

Knowing that a problem can be solved by AWS takes a bit of high-level machine learning knowledge.  The end user needs to have an understanding of their data and of the three model types offered in the AWS black box.  Reading through the AWS Machine Learning developer guide is a good start in terms of an overview.  Regressions models solve problems that need to predict numeric values.  Binary classification models predict binary outcomes (true/false, yes/no) and multiclass classification models can predict more than two outcomes (categories).

Why use Amazon machine learning?

For those starting out with machine learning this AWS service may sound overcomplicated or of questionable value.  Most tutorials show this type of work done on a laptop with free programming tools, why is AWS necessary?  The point is the novice user can do some basic machine learning in AWS without the high startup opportunity costs of learning programming or learning how to use machine learning software packages.  Simply feed the AWS machine learning engine data and get instant results.

Anyone can run a proof of concept data pipeline into AWS and perform some basic machine learning predictions.  Some light programming skills would be helpful but are not mandatory.  Having a dataset with a well-defined schema is a start as well as having a business goal of using that dataset to make predictions on similar sets of incoming data.  AWS can provide these predictions for pennies an hour and eliminate the startup costs that would normally delay or even halt these types of projects.

Amazon machine learning is a way to quickly get productive and get a project past the proof of concept phase in making predictions based on company data.  Access to the platform is open to all users so don’t rely on the AWS models for a competitive advantage or a product differentiator.  Instead use AWS machine learning as a tool to quickly get a machine learning project started without much investment.

Thanks for reading, some future blog posts here will include running some well-known machine learning problems through Amazon Machine Learning to highlight the process and results.