Handwritten digit recognition – MNIST via Amazon Machine Learning APIs

In our last post we explored how to use the Amazon Machine Learning APIs to solve machine learning problems.  We looked at the SDK, we looked at the machine learning client documentation, and we even looked at an example on Github.  But we didn’t write any code.

If it is so easy to consume AWS APIs with code, then why don’t we take a shot at writing some?  Point taken, this post will include code!  We will use Python to write some functions that call Amazon Machine Learning APIs to create a machine learning model and run some batch predictions.  The code I’ve written is heavily influenced by AWS Python sample code – I wrote my own functions and main body but borrowed a few bits, like the polling functions and random character generation.

MNIST dataset – handwritten digit recognition

The MNIST dataset is a “hello world” type machine learning problem that engineers typically use to smoke test an algorithm or ML process.  The dataset contains 70,000 handwritten digits from 0-9, each scanned into a 28×28 pixel representation.  Each of the 784 (28 x 28) pixel values represents one pixel’s intensity, from 0 (white) to 255 (black).  See the Wikipedia entry for a sample visualization of a subset of the images.
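To make the 784-column layout concrete, here’s a quick sanity check that reshapes one flat row of pixel values back into its 28×28 grid.  The stroke pattern below is made up purely for illustration:

```python
import numpy as np

# A made-up row of 784 pixel intensities (0 = white, 255 = black),
# with a crude vertical stroke standing in for a handwritten "1".
row = np.zeros(784, dtype=int)
row[14::28] = 255  # darken column 14 of every image row

image = row.reshape(28, 28)  # back to the original 28x28 grid
print(image.shape)           # (28, 28)
print(image[:, 14].max())    # 255
```

Each row in the Kaggle CSV is exactly this flattened form, one value per pixel.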

We are going to pull the MNIST data from Kaggle to make things easy.  Why pull it from Kaggle?  It saves us some time: the Kaggle version has conveniently split the data into 42,000 images in a training CSV and 28,000 images in a test CSV.  The training CSV has the labels in the first column, which are just the digits represented by all the pixels in the row.  The test CSV file has the labels removed for use in predictions.  Kaggle has made things a bit easier for us; we don’t have to do any manipulation of the data.

Code on Github – https://github.com/keithxanderson

The code I have written to run MNIST through Amazon Machine Learning via APIs is up on Github.  See the README.md for a detailed explanation of how to use the code and what exact steps are required to get things working.  We are going to use Python and specifically the ‘machinelearning’ client of Boto3 in order to pass our calls to AWS.

While the README.md explains things pretty well, let’s summarize here.  The main MNIST.py Python file uses the MNIST training data to create and evaluate an ML model and then uses the test data to create a batch prediction.  Data in the test file contains the pixels but no label, our model is predicting what digit is represented by the pixels after “learning” from the training data with labels (the answers).

There is a Python function written for each of the high-level operations:

create_training_datasource – takes a training file from an S3 bucket, a schema file from an S3 bucket, and a percentage argument and splits the training data into a training dataset and an evaluation dataset.  The number of rows in each dataset is determined by the percentage set.  The schema file defines the schema (obviously).

create_model – takes a training datasource ID and a recipe file to create and train a model.  The function is hard-coded to create a multiclass classification model (multinomial logistic regression algorithm).

create_evaluation – takes our model ID and our evaluation datasource ID and creates an evaluation, which simply scores the performance of our model using the reserved evaluation data.  Results are seen manually in the AWS console but could be added to our code if we wanted.

create_batch_prediction_dataset – takes a test file from an S3 bucket and a schema file (a different schema file than our training schema file) and creates a batch dataset.  We will use this to make predictions, this data does not contain any labels.

batch_prediction – takes an existing model ID and a batch datasource ID and runs predictions of the batch data against our model.  Results are written to a flat file in the S3 output location passed as an argument.
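As a rough sketch of how the first of these functions maps onto Boto3 (the function and argument names below are my illustrative stand-ins, not the exact code from the repo), the training/evaluation split is expressed through the DataRearrangement parameter of create_data_source_from_s3:

```python
import json

def split_spec(percent_begin, percent_end):
    """DataRearrangement JSON string selecting a slice of the CSV rows."""
    return json.dumps(
        {"splitting": {"percentBegin": percent_begin, "percentEnd": percent_end}}
    )

def create_training_datasource(ml, ds_id, data_url, schema_url, percent=70):
    """Create a datasource from the first `percent`% of the training CSV.

    `ml` is a boto3 'machinelearning' client; the evaluation datasource
    would be created the same way with the complementary slice
    (percent, 100).
    """
    return ml.create_data_source_from_s3(
        DataSourceId=ds_id,
        DataSourceName=ds_id + "-training",
        DataSpec={
            "DataLocationS3": data_url,
            "DataSchemaLocationS3": schema_url,
            "DataRearrangement": split_spec(0, percent),
        },
        ComputeStatistics=True,  # statistics are required for training datasources
    )

print(split_spec(0, 70))  # {"splitting": {"percentBegin": 0, "percentEnd": 70}}
```

The other functions follow the same pattern with create_ml_model, create_evaluation, and create_batch_prediction.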

Schema files and recipe

Our datasources all need a schema and we can define these with a schema file.  The schema for the training and evaluation data are the same and can use the same file.  The batch datasource does not contain labels so needs a separate schema file.

The training and evaluation schema file for MNIST can be found in the Github project with the Python code.  We simply define every column and declare each of the columns as “CATEGORICAL” since this is a multiclass classification model.  Each handwritten number is a category (0-9) and each pixel value is treated as a category (0-255).  The format of the schema file is defined by AWS; if you create a datasource manually in the console, you can copy the AWS-generated schema into your own schema file.
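Rather than typing 785 attribute entries by hand, a schema file in this shape can be generated with a few lines of Python.  The column names below (label, pixel0-pixel783) match the Kaggle CSV header; the output filename is just an example:

```python
import json

# Build an Amazon ML style schema: the label plus all 784 pixels,
# every attribute declared CATEGORICAL.
attributes = [{"attributeName": "label", "attributeType": "CATEGORICAL"}]
attributes += [
    {"attributeName": "pixel%d" % i, "attributeType": "CATEGORICAL"}
    for i in range(784)
]

schema = {
    "version": "1.0",
    "targetAttributeName": "label",  # the digit we want to predict
    "dataFormat": "CSV",
    "dataFileContainsHeader": True,
    "attributes": attributes,
}

with open("mnist-train.csv.schema", "w") as f:
    json.dump(schema, f)

print(len(schema["attributes"]))  # 785
```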

The batch schema file is also found in the same project.  The only difference between the batch schema file and the training/evaluation schema file is that the batch schema does not contain the labels (numbers).

Lastly, our recipe file is found on Github with the Python code and schema files.  The recipe simply instructs AWS how to treat the input and output data.  Our example is very simple since all features (columns) are categorical and we just define all of our outputs as “ALL_CATEGORICAL”.  This saves us from manually declaring each pixel feature as a category.
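For reference, a minimal recipe along those lines would look something like the following.  This is a sketch of the format, not a copy of the file in the repo:

```python
import json

# Minimal recipe: no groups or intermediate assignments, just pass every
# categorical feature straight through to the model.
recipe = {
    "groups": {},
    "assignments": {},
    "outputs": ["ALL_CATEGORICAL"],
}
print(json.dumps(recipe))
```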

Running the code

See the README.md for detailed instructions on running the code.  Provided we have our CSV files, schema files, and recipe file in an S3 bucket, we would just need to modify the MNIST.py file with the S3 locations for each file in the main body of the code along with an output S3 bucket for the prediction results (all can share the same S3 bucket).

After executing the MNIST.py file we will see the training, evaluation, and batch dataset creation execute in parallel and the model and evaluation process go pending.  Once we have our datasets created we will see the model training start and once we see the model training complete we will see the evaluation start.

A polling function is called within the batch prediction function to wait on the evaluation process to complete prior to running a batch prediction.  This is optional; there is really no dependency, since the batch prediction can run in parallel with the evaluation scoring of our model.
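The polling idea can be sketched as a small helper that repeatedly checks an entity’s status until it reaches a terminal state.  This is a simplified stand-in for the borrowed polling code, with made-up defaults:

```python
import time

def poll_until_done(get_status, delay=30, max_polls=120):
    """Poll a status callable until Amazon ML reports a terminal state.

    `get_status` returns one of the Amazon ML entity states, e.g.
    'PENDING', 'INPROGRESS', 'COMPLETED', 'FAILED'.
    """
    for _ in range(max_polls):
        status = get_status()
        if status in ("COMPLETED", "FAILED", "INVALID"):
            return status
        time.sleep(delay)  # don't hammer the API while the job runs
    raise TimeoutError("entity never reached a terminal state")
```

In the real script, get_status would wrap a call such as ml.get_evaluation(EvaluationId=...)["Status"].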

Model evaluation performance

If all went well we will have an evaluation score in the AWS console.  Let’s take a look!  Looking at our ML model evaluation performance in the console, we see an F1 score of 0.914 compared against a baseline score of 0.020, which is very good.

ML Model Performance

Exploring our model performance against our evaluation data, we see that the model is very good at predicting the handwritten digits.  The confusion matrix below shows dark blue wherever a digit was correctly classified.  This is not terribly surprising; most algorithms perform pretty well with the MNIST dataset, but give credit to AWS for tuning the multiclass classification algorithm to this level of performance.

Explore Model Performance

Batch prediction results

We should also have an output file in our batch prediction output S3 bucket if everything ran to completion.  I’ve included a sample output CSV file within my MNIST project to view.  See the AWS documentation for more information on how to interpret the output.  Each column contains one of the categories for prediction, which are the digits from 0-9.  Each row gives the predicted probability that the corresponding test image belongs to each category.

For example, in our sample CSV file the first prediction shows about a 99% probability the number is the digit 2, and our second prediction shows a 99% probability the next number is a 0.
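To make the output format concrete, here is a tiny hypothetical snippet shaped like those two rows and how we might pick the winning digit from each.  The probabilities below are made up to match the example above:

```python
import csv, io

# Hypothetical snippet shaped like the batch output described above:
# one column per class (0-9), one row of probabilities per test image.
sample = """0,1,2,3,4,5,6,7,8,9
0.001,0.002,0.990,0.001,0.001,0.001,0.001,0.001,0.001,0.001
0.992,0.001,0.001,0.001,0.001,0.001,0.001,0.001,0.001,0.000
"""

reader = csv.reader(io.StringIO(sample))
classes = next(reader)  # header row holds the category names
predictions = []
for row in reader:
    scores = [float(s) for s in row]
    # the predicted digit is the class with the highest probability
    predictions.append(classes[scores.index(max(scores))])

print(predictions)  # ['2', '0']
```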

Why use Amazon Machine Learning?

We’ve discussed the value of using Amazon Machine Learning before but it is worth repeating.  It is very easy to incorporate basic machine learning algorithms into your application by offloading this work to AWS.  Avoid spending time and money building out infrastructure or expertise for an in-house machine learning engine.  Just follow the basic workflow outlined above and make a few API calls to Amazon Machine Learning.

Thanks for reading!

How to use the Amazon Machine Learning API

Machine learning, AI, deep learning – all of these technologies are popular and the hype cycle is intense.  When one digs into the topic of machine learning, it turns out to be a lot of math, data wrangling, and code writing, which means one can quickly become overwhelmed by the details and lose sight of the practicality of machine learning.

Take a step back from all the complexity and ask a fundamental question.  Why should we even look at machine learning?  Simple: we want our applications to get smarter and make better decisions.  Ignore the complexities of machine learning and think about the end result instead.  Machine learning is a powerful tool that can be used to our advantage when building applications.

Now consider how we would implement machine learning into our applications.  We could write lots of code within our app to pull historical data, use this data to create and train models, evaluate and tweak our models, and finally make predictions.  Or we could offload this work to public cloud services like Amazon Machine Learning.  All it takes is some code within our application to make calls to Amazon Machine Learning APIs.

Machine learning via APIs?

I’ve been an infrastructure guy for most of my career.  While I have worked for software companies and have worked closely with software developers – I’m not a developer and have not made my living writing code.

That being said, machine learning requires learning to write code.  One wouldn’t go out and buy machine learning from a vendor but would rather write code to build and use machine learning models.  Furthermore, machine learning models don’t provide much value by themselves but are valuable when built into an application.  So coding is important for using machine learning.

Think along the same lines regarding consumption of public cloud.  Cloud isn’t just about cost; it’s about applications consuming infrastructure and services with code.  Anything that can be created, configured, or consumed in AWS can (and should?) be done so with code.  So coding is also important for consuming public cloud.

Fortunately, writing code to use Amazon Machine Learning isn’t very hard at all.  Amazon publishes a machine learning developer guide and API reference documentation that shows how to write code to add machine learning to your application.  That, combined with some machine learning examples on Github, makes the process easy to follow.

Overview of an application using Amazon Machine Learning

What is the high-level architecture of an application using Amazon Machine Learning?  Let’s use an example as we walk through the architecture.  Think of a hypothetical customer loyalty application that sends rebates to customers we think may stop using our theoretical company’s services.  We could incorporate Amazon machine learning into the app to flag at-risk customers for rebates.

Raw data – first we need data.  This could be in a database, data warehouse, or even flat files in a filesystem.  In our example, these are the historical records of all customers that have used our service and have either remained loyal to our company or stopped using our services (the historical data is labeled with the outcomes).

Data processed and uploaded to S3 – Our historical data needs to be extracted from our original source (above), cleaned up, converted into a CSV file, and then uploaded to an AWS S3 bucket.

Training datasource – our application needs to invoke Amazon Machine Learning via an API and create an ML datasource out of our CSV file.

Model – our application needs to create and train a model via an AWS API using the datasource we created above.

Model evaluation – our app will need to measure how well our model performs with some of the data reserved from the datasource.  Our evaluation scoring gives us an idea of how well our model makes predictions on new data.

Prediction Datasource –  if we want to make predictions with machine learning, we need new customer data that matches our training data schema but does not contain the customer outcome.    So we need to collect this new data into a separate CSV file and upload it to S3.

Batch predictions – now for the fun part, we can now call our model via AWS APIs to predict the target value of the new data we uploaded to S3.  The output will be contained in a flat file in an S3 bucket we specify that we can read with our app and take appropriate action (do nothing or send a rebate).

How to get started?

First pick a programming language.  Python is popular both for machine learning and for interacting with AWS, so we’ll assume Python is our language.  Next read through the AWS Python SDK Quickstart guide to get a feel for how to use AWS with Python.  Python uses the Boto3 SDK, which needs to be installed and configured to use your AWS account credentials.

Lastly, read through the machine learning section of the Boto3 SDK guide to get a feel of the logical flow of how to consume Amazon Machine Learning via APIs.  Within this guide we would need to call the Python methods below using the recommended syntax given.

create_data_source_from_s3 – how we create a datasource from a CSV file

create_ml_model – how we create an ML model from a datasource

create_evaluation – how we create an evaluation score for the performance of our model

create_batch_prediction –  how we predict target values for new datasources
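Chained together, those four calls form the whole workflow.  The sketch below compresses it into one function with made-up IDs and no polling or error handling; in practice each entity must reach a COMPLETED state before the next step can consume it:

```python
def run_ml_pipeline(ml, train_url, schema_url, recipe, out_url):
    """Chain the four Boto3 ML calls: datasource -> model -> evaluation -> prediction.

    `ml` is a boto3 'machinelearning' client.  All IDs here are made up,
    and real code would poll each entity to COMPLETED between steps.
    """
    ml.create_data_source_from_s3(
        DataSourceId="ds-train",
        DataSpec={"DataLocationS3": train_url, "DataSchemaLocationS3": schema_url},
        ComputeStatistics=True,
    )
    ml.create_ml_model(
        MLModelId="model-1",
        MLModelType="BINARY",  # our loyalty example is a yes/no prediction
        TrainingDataSourceId="ds-train",
        Recipe=recipe,
    )
    ml.create_evaluation(
        EvaluationId="eval-1",
        MLModelId="model-1",
        EvaluationDataSourceId="ds-eval",  # a held-out datasource, created the same way
    )
    ml.create_batch_prediction(
        BatchPredictionId="bp-1",
        MLModelId="model-1",
        BatchPredictionDataSourceId="ds-new",  # the unlabeled customer data
        OutputUri=out_url,
    )
```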

Getting started with an example

At this point we have a good idea of what we need to do to incorporate Amazon machine learning into a Python app.  Those already proficient in Python could start writing new functions to call each of these Boto3 ML methods following the architecture we described above.

What about an example for those of us who aren’t as proficient in Python?  Fortunately there are plenty of examples online of how to write apps that call AWS APIs.  Even better, AWS released some in-depth ML examples on Github that walk through examples in several languages, including Python.

Specifically, let’s look at the targeted marketing example for Python.  First check out the README.md.  We see that we have to have our AWS credentials set up locally where we will run the code so that we can authenticate to our AWS account.  We’ll also need Python installed locally (as well as Boto3) so that we can run Python executables (.py files).

Next take a look at the Python code to build an ML model.  A few things jump out in the program:

—  The code imports a few libraries to use in the code (boto3, base64, json, os, sys).

—  The code then defines the S3 bucket and CSV file that we will use to train our model (“TRAINING_DATA_S3_URL”).

—  Functions are written for each subtask that needs to be performed for our machine learning application.  We have functions to create a datasource (def create_data_sources), build a model (def build_model), train a model (def create_model), and perform an evaluation of the model (def create_evaluation).

—  The main body of the code (line 123 – ‘if __name__‘) ties everything together using the CSV file, a user defined schema file, and a user defined recipe file to call each function and results in a ML model with a performance measurement.

Schema and recipe

Along with the Python code we find a .schema file and recipe.json file.  We explored these in previous posts when using Amazon Machine Learning through the AWS console wizard.  When using machine learning with code through the AWS API, the schema and recipe must be created and formatted manually and passed as arguments when calling the Boto3 ML functions.

Data pulled out of a database probably already has a defined schema – these are just the rows and columns that describe the data fields and define the types of data inside them.  The ML schema needs to be defined in the correct format with our attributes and our target value along with some other information.

A recipe is a bit more complicated; the format for creating a recipe file gives the details.  The recipe simply defines how our model treats the input and output data defined in our schema.  Our example recipe file only defines outputs using some predefined groups (ALL_TEXT, ALL_NUMERIC, ALL_CATEGORICAL, ALL_BINARY) and does a few transformations on the data (quantile bin the numbers and convert text to lower case).
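As a hedged illustration of that recipe structure (the bin count and exact output list here are my guesses for illustration, not the file from the example), the JSON looks roughly like:

```python
import json

# Illustrative recipe: numeric features are quantile-binned into 200
# categories and text is lowercased before training.  This is a sketch
# of the format, not the recipe.json shipped with the AWS example.
recipe = {
    "groups": {},
    "assignments": {},
    "outputs": [
        "ALL_BINARY",
        "ALL_CATEGORICAL",
        "quantile_bin(ALL_NUMERIC,200)",
        "lowercase(ALL_TEXT)",
    ],
}
print(json.dumps(recipe, indent=2))
```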

Making predictions in our application 

Assuming everything worked when we ran the ‘build_model.py’ executable, we would have a model available to make predictions on unlabeled data.  The ‘use_model.py’ executable does just that.  We just need to get this new unlabeled data into a CSV file and into an S3 bucket (“UNSCORED_DATA_S3_URL”) and let the ‘use_model.py’ code create a new datasource to use for predictions.  The main section of this ‘use_model.py’ code ties everything together and calls the functions with the input/output parameters required.

Think about it: we have all the code syntax, formatting, and steps necessary to add predictive ability to our application by simply calling AWS APIs.  No complex rules in our application, no need to write an entire machine learning framework into our application – just offload to Amazon Machine Learning via APIs.

Thanks for reading!


Amazon Machine Learning for a multiclass classification dataset

We’ve taken a tour of Amazon Machine Learning over the last three posts.  Quickly recapping, Amazon supports three types of ML models with their machine learning as a service (MLaaS) engine – regression, binary classification, and multiclass classification.  Public cloud economics and automation make MLaaS an attractive option to prospective users looking to outsource their machine learning using public cloud API endpoints.

To demonstrate Amazon’s service we’ve taken the Kaggle red wine quality dataset and adjusted the dataset to demonstrate each of the AWS MLaaS model types.  Regression worked fairly well with very little effort.  Binary classification worked better with a bit of effort.  Now to finish the tour of Amazon Machine Learning we will look at altering the wine dataset once more to turn our wine quality dataset prediction engine into a multiclass classification problem.

Multiclass Classification

What is a machine learning multiclass classification problem?  It’s one that tries to predict if an instance or set of data belongs to one of three or more categories.  A dataset is first labeled with the known categories and used to train an ML model.  That model is then fed new unlabeled data and attempts to predict what categories the new data belongs to.

An example of multiclass classification is an image recognition system.  Say we wanted to digitally scan handwritten zip codes on envelopes and have our image recognition system predict the numbers.  Our multiclass classification model would first need to be trained to detect ten digits (0-9) using existing labeled data.  Then our model could use newly scanned handwritten zip codes (unlabeled) and predict (with an accuracy value) what digits were written on the envelope.

Let’s explore how Amazon Machine Learning performs with a multiclass classification dataset.

Kaggle Red Wine Quality Dataset

We ran the Kaggle Red Wine Quality dataset untouched through the Amazon machine learning regression algorithm.  The algorithm interpreted the scores as floating point numbers rather than integer categories, which isn’t necessarily what we were after.  We then doctored up the ratings into binary categories of “great” and “not-so-great” and ran the dataset through the binary classification algorithm.  The binary categories were closer to what we wanted: treat the ratings as categories instead of decimal numbers.

This time we want to change things again.  We are going to add a third category to our rating system and change the quality score again.  Rather than try to predict the wide swing of a 0-10 rating system, we will label the quality ratings as either good, bad, or average – three classes.  And Amazon ML will create a multiclass classification model to predict one of the three ratings on new data.

Data preparation – converting to three classes

We will go ahead and download the “winequality-red.csv” file again from Kaggle.  We need to replace the existing 0-10 quality scores with categories of bad, average, and good.  We’ll use numeric categories to represent quality – bad will be 1, average will be 2, and good will be 3.

Looking at the dataset histogram below, most wines in this dataset are average – rated a 5 or 6.  Good wines are rated > 6 and the bad wines are < 5.  So for the sake of this exercise, we’ll say wines rated 7 and higher are good (3), wines rated 5 or 6 are average (2), and those rated less than 5 are bad (1).  Even though the rating system is 0-10, we don’t have any wines rated less than 3 or greater than 8.

Wine quality Histogram

We could use spreadsheet formulas or write a simple Python ‘for’ loop to amend our CSV “quality” column.  Replace the ratings from 3-4 with a 1 (bad), replace our 5-6 ratings with a 2 (average), and replace our 7-8 ratings with a 3 (good).  Then we’ll write out the changes to a new CSV file which we’ll use for our new ML model datasource.
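With pandas this relabeling is a one-liner; the tiny DataFrame below stands in for the real winequality-red.csv, and the output filename is just an example:

```python
import pandas as pd

# Small stand-in for the Kaggle file; the real code would use
# pd.read_csv('winequality-red.csv') instead.
wine = pd.DataFrame({"quality": [3, 4, 5, 6, 7, 8]})

# Map the 0-10 scores onto three classes: bad (1), average (2), good (3).
# Bins are (0,4] -> 1, (4,6] -> 2, (6,10] -> 3.
wine["quality"] = pd.cut(wine["quality"], bins=[0, 4, 6, 10], labels=[1, 2, 3]).astype(int)

wine.to_csv("winequality-red-3class.csv", index=False)
print(wine["quality"].tolist())  # [1, 1, 2, 2, 3, 3]
```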

Wine quality Histogram AFTER PROCESSING

Running the dataset through Amazon Machine Learning

Our wine dataset now has a multiclass style rating – 1 for bad wine, 2 for average wine, and 3 for good wine.  Upload this new CSV file to an AWS S3 bucket that we will use for machine learning.

The process to create a datasource, model, and evaluation of the model is the same as we documented in the earlier posts on linear regression and binary classification.  Create a new datasource and ML model from the main dashboard and point the datasource to the S3 bucket with our multiclass CSV file.  The wizard will verify the schema and confirm that our file is clean.

We need to identify our quality rating as a category rather than numeric for Amazon to recognize this as a multiclass classification problem.  When setting up the schema, edit the ‘quality’ column and set the data type to ‘categorical’ rather than the default ‘numerical’.  We will again select “yes” to show our CSV file contains column names.


After finishing the schema, select the target prediction value which is the ‘quality’ column.  Continue through the rest of the wizard and accept all the default values.  Return to the machine learning dashboard and wait for the model to complete and get evaluated.

How did we do?

Let’s first look at our source dataset prior to the 70/30 split just to get an idea of our data distribution.  82% of the wines are rated a 2 (average), 14% are rated a 3 (good), and 4% are rated a 1 (bad).  By default, the ML wizard is going to randomly split this data 70/30 to use for training and performance evaluation.

Datasource Histogram prior to Train/test split

If all went well, we should have a completed model and evaluation of that model with no failures in our dashboard.  Let’s take a look at our evaluation summary and see how we did.


Looking at our evaluation summary, things look pretty good.  We have a summary saying our ML model’s quality score was measured with an average F1 score of 0.421.  That summary is somewhat vague, so let’s click “explore model performance”.

Model Performance- Confusion matrix

Multiclass prediction accuracy can be measured as a weighted sum of the individual binary predictions.  How well did we predict good wines (3), average wines (2), and bad wines (1)?  Each category’s accuracy score is combined into what is called an F1 score where a higher F1 score is better than a lower score.
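If AWS’s “average F1” is the unweighted (macro) average of the per-class F1 scores, which is how I read the evaluation summary, scikit-learn can reproduce the arithmetic on a toy example:

```python
from sklearn.metrics import f1_score

# Toy predictions for the three wine classes (1=bad, 2=average, 3=good).
y_true = [2, 2, 2, 2, 2, 2, 3, 3, 1, 1]
y_pred = [2, 2, 2, 2, 2, 1, 3, 2, 2, 1]

# Per-class F1 scores (ordered by class label), then their unweighted
# mean -- the macro-average F1.
per_class = f1_score(y_true, y_pred, average=None)
macro = f1_score(y_true, y_pred, average="macro")

print([round(f, 2) for f in per_class], round(macro, 2))
```

A model that only ever guesses the majority class gets a high F1 on that class but near-zero on the others, which drags the macro average down – exactly the pattern in our wine evaluation.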

A visualization of this accuracy is displayed above in a confusion matrix.  The matrix makes it easy to see where the model is successful and where it is not so successful.  The darker the blue, the more accurate the correct prediction, the darker the orange/red, the more inaccurate the prediction.

Looking at our accuracy above, it seems we are good at predicting average wine (2) – our accuracy is almost 90% and we predicted 361/404 wines correctly as average (2).  However, the model was not so good at predicting good (3) and bad (1) wines.  We only correctly predicted 13/47 (28%) as good and only 2/26 (8%) as bad (1).  Our model is good at predicting the easy (2) category but not so good at predicting the more difficult and less common (1) and (3) categories.

Can we do better?

Our model is disappointing; we could have done almost as well if we just predicted that every wine was average (2).  We want to train a model that is better at predicting good (3) wines and bad (1) wines.  But how?  We could use a few ML techniques to get a better model:

Collect more data – More data would help train our model better, especially having more samples of good (3) and bad (1) wines.  In our case this isn’t possible since we are working with a static public dataset.

Remove some features – Our algorithm may have an easier time evaluating fewer features.  We have 11 features (pH, sulphates, alcohol, etc.) and we really aren’t sure if all 11 features have a direct impact on wine quality.  Removing some features could make for a better model but would require some trial and error.  We’ll skip this for now.

Use a more powerful algorithm – Amazon uses  multinomial logistic regression (multinomial logistic loss + SGD) for multiclass classification problems.  This may not be the best algorithm for our wine dataset, there may be a more powerful algorithm that works better.  However this isn’t an option when using Amazon Machine Learning – we’d have to look at a more powerful tool like Amazon SageMaker if we wanted to experiment with different algorithms.

Data wrangling –  Feeding raw data to the Amazon Machine Learning service untouched is easy but if we aren’t getting good results we will need to perform some pre-processing to get a better model.  Some features have wide ranges and some do not – for example the citric acid values range from 0-1 while the total sulfur dioxide values range from 6-289.  So some feature scaling might be a good idea.

Also, the default random 70/30 training and testing data split may not be the greatest way to train and test our model.  We may want to use a more powerful method to split the folds of the dataset ourselves rather than to let Amazon randomly split it.  Running a Stratified ShuffleSplit might be helpful prior to uploading it to Amazon.
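A sketch of that pre-split using scikit-learn’s StratifiedShuffleSplit, with toy labels mimicking the wine imbalance (the real code would load the actual CSV):

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

# Toy labels with the same kind of imbalance as the wine ratings:
# 82 average (2), 14 good (3), 4 bad (1).
y = np.array([2] * 82 + [3] * 14 + [1] * 4)
X = np.arange(len(y)).reshape(-1, 1)  # stand-in features

# One 70/30 split that preserves the class proportions in both halves.
splitter = StratifiedShuffleSplit(n_splits=1, test_size=0.3, random_state=42)
train_idx, test_idx = next(splitter.split(X, y))

print(len(train_idx), len(test_idx))  # 70 30
```

The two index arrays could then be used to write out separate training and test CSV files before uploading to S3.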

Lastly, Amazon Machine Learning uses a technique called quantile binning for numeric values.  Instead of treating a range of numbers as discrete values, Amazon puts the range of values in “bins” and converts them into categories.  This may work well for non-linear features but may not work great for features with a direct linear correlation to our quality ratings.  Amazon recommends some experimentation with their data transformation recipes to tweak model performance.  The default recipes may not be the best for all problems.
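Pandas’ qcut shows the idea behind quantile binning – this is an illustration of the technique, not Amazon’s implementation, and the values below are made up:

```python
import pandas as pd

# Values with a wide range, like total sulfur dioxide (6-289 in our data).
values = pd.Series([6, 15, 30, 60, 90, 120, 180, 240, 289])

# Quantile binning: cut into 3 equal-population bins and treat the bin
# label as a category instead of the raw number.
bins = pd.qcut(values, q=3, labels=["low", "mid", "high"])
print(bins.tolist())  # ['low', 'low', 'low', 'mid', 'mid', 'mid', 'high', 'high', 'high']
```

Each bin holds the same number of samples, so the model sees categories with balanced support rather than a raw numeric scale.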

Final Thoughts

Machine learning is hard.  And while Amazon’s MLaaS is a powerful tool – it isn’t perfect and it doesn’t do everything.  Data needs to be clean going in and most likely needs some manipulation using various ML techniques if we want to get decent predictions from Amazon algorithms.

Just for fun I did some data wrangling on the red wine dataset to see if I could get some better prediction results.  I manually split the dataset myself using a stratified shuffle split and then ran it through the machine learning wizard using the “custom” option which allowed me to turn off the AWS 70/30 split.  The results?  Just doing a bit of work I improved the prediction accuracy for our good (3) wines to 65% correct, bad (1) wines to 25% correct, and raised the F1 score to .59 (up from .42, higher is better).

Confusion matrix with stratified shuffling

Thanks for reading!

Amazon Machine Learning for a binary classification dataset

Machine learning as a service is real.  Clean data with a well organized schema can be fed to cloud-based machine learning services with a decent ML model returned in less than 30 minutes.  The resulting model can be used for inferring target values of new datasets in the form of batch or real-time predictions.  All three public cloud vendors (AWS, Microsoft, Google) are competing in this space, which makes the services cheap and easy to consume.

In our last discussion we ran the Kaggle red wine quality dataset through the Amazon Machine Learning service.  The data was fed to AWS without any manipulation, which AWS interpreted as a regression problem with a linear regression model returned.  Each of the subjective wine quality ratings was treated as an integer from 0 (worst) to 10 (best) with a resulting model that could predict wine quality scores.  Honestly, the results weren’t spectacular – we could have gotten similar results by just guessing the median value (6) every time and we almost would have scored just as well on our RMSE value.

Our goal was to demonstrate Amazon’s machine learning capabilities in solving a regression problem and was not to create the most accurate model.  Linear regression may not be the best way to approach our Kaggle red wine quality dataset.  A (somewhat) arbitrary judge’s score from 0-10 probably does not have a linear relationship with all of the wine’s chemical measurements.

What other options do we have to solve this problem using the Amazon Machine Learning service?

Binary Classification

What is a machine learning binary classification problem?  It’s one that tries to predict a yes/no or true/false answer – the outcome is binary.  A dataset is labeled with the yes/no or true/false values.  This dataset is used to create a model to predict yes/no or true/false values on new data that is unlabeled.

An example of binary classification is a medical test for a specific disease.  Data is collected from a large group of patients who are known to have the disease and known not to have the disease.  New patients can then be tested by collecting the same data points and feeding them to a model.  The model will predict (with error rates) whether it believes the new patients have the disease.

Let’s explore how Amazon Machine Learning performs with a simple binary classification dataset.

Kaggle Red Wine Quality Dataset

We ran the Kaggle Red Wine Quality dataset through the Amazon machine learning regression algorithms in the last post.  Why this dataset?  Because it was clean data with a relatively simple objective – to predict the wine quality from its chemical measurements.  We also had no data wrangling to perform – we simply uploaded the CSV to AWS and had our model created with an RMSE evaluation score ready to review.

This time we want to change things a bit.  Rather than treat the 0-10 sensory quality ratings as integers, we want to turn the quality ratings into a binary rating.  Good or bad.  Great or not-so-great.  This greatly simplifies our problem – rather than have such a wide swing of a ten point wine rating, we can simply categorize the wine as great or not-so-great.  In order to do this we need to edit our dataset and change the quality ratings to a one (great) or a zero (not-so-great).

Data preparation – feature engineering

Go ahead and download the “winequality-red.csv” file from Kaggle.  Open up the CSV file as a spreadsheet.  We need to replace the 0-10 quality scores with a 1 (great) or 0 (not-so-great).  Let’s assume most wines in this dataset are fairly average – rated a 5 or 6.  The truly great wines are rated > 6 and the bad wines are < 5.  So for the sake of this exercise, we’ll say wines rated 7 and up are great and wines rated 6 and under are not-so-great.

All we have to do is edit our CSV file with the new 0 or 1 categories, easy right?  Well, kind of.  The spreadsheet has ~1600 ratings and manually doing a search and replace is tedious and not easily repeatable.  Most machine learning datasets aren’t coming from simple and small CSV files but rather from big datasets hosted in SQL/NoSQL databases, object stores, or even distributed filesystems like HDFS.  Manual editing often won’t work and definitely won’t scale for larger problems.

Most data scientists will spend a decent amount of time manipulating and cleaning up datasets with tools that utilize some type of high-level programming language.  Jupyter notebooks are a popular tool and can support your programming language of choice.  Jupyter notebooks make data wrangling with code much more efficient than working manually with spreadsheets.  Amazon even hosts cloud-based Jupyter notebooks within Amazon SageMaker.

Converting the wine ratings from 0-10 to a binary 0/1 is pretty easy in Python.  Just open the CSV file, test if each quality rating is a 7 or higher (> 6.5) and convert the true/false to an integer by multiplying by 1.  Then we’ll write out the changes to a new CSV file which we’ll use for our new datasource.

Python code
import pandas as pd

wine = pd.read_csv('winequality-red.csv')
wine['quality'] = (wine['quality'] > 6.5) * 1   # True/False -> 1/0
wine.to_csv('winequality-red-binary.csv', index=False)   # write out the new datasource file

Running the binary classification dataset through Amazon Machine Learning

Our dataset now has a binary wine rating – 1 for great wine and 0 for not-so-great wine.  Upload this new CSV file to an AWS S3 bucket that we will use for machine learning.

The process to create a dataset, model, and evaluation of the model is the same for binary classification as we documented in the blog post about linear regression.  Create a new datasource and ML model from the main dashboard and point the datasource to the S3 bucket with our new binary CSV file.  The wizard will verify the schema and confirm that our file is clean.
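The same wizard steps can also be scripted with the Boto3 ‘machinelearning’ client mentioned at the top of this series.  Here is a minimal sketch, not production code – the bucket, key, and entity IDs are hypothetical, and the schema attributes simply follow the wine CSV columns:

```python
import json

# Column names follow the Kaggle winequality-red.csv file
FEATURES = ["fixed acidity", "volatile acidity", "citric acid", "residual sugar",
            "chlorides", "free sulfur dioxide", "total sulfur dioxide",
            "density", "pH", "sulphates", "alcohol"]

def build_schema(target="quality"):
    """Build the data schema JSON for our binary CSV; the target is BINARY."""
    attributes = [{"attributeName": n, "attributeType": "NUMERIC"} for n in FEATURES]
    attributes.append({"attributeName": target, "attributeType": "BINARY"})
    return {
        "version": "1.0",
        "targetAttributeName": target,
        "dataFormat": "CSV",
        "dataFileContainsHeader": True,
        "attributes": attributes,
    }

def create_datasource_and_model(bucket, key):
    """Sketch of the API calls; needs AWS credentials, so it is not run here."""
    import boto3
    ml = boto3.client("machinelearning")
    ml.create_data_source_from_s3(
        DataSourceId="ds-wine-binary",                  # hypothetical ID
        DataSpec={
            "DataLocationS3": "s3://%s/%s" % (bucket, key),
            "DataSchema": json.dumps(build_schema()),
        },
        ComputeStatistics=True,
    )
    ml.create_ml_model(
        MLModelId="ml-wine-binary",                     # hypothetical ID
        MLModelType="BINARY",
        TrainingDataSourceId="ds-wine-binary",
    )
```

The schema is the piece the console wizard builds for you; scripting it is what makes the process repeatable.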

What is different from linear regression is that when we look at the schema, we want to make sure the ‘quality’ column is a ‘binary’ type rather than a ‘numerical’ type.  All other values are numerical, only the quality rating is binary.  This should be the default behavior but it’s best to double check.  Also select “yes” to indicate that the first line in the CSV file contains the column names so they are excluded from the model.


After finishing the schema, select your target prediction value which is the ‘quality’ column.  Continue through the rest of the wizard and accept all the default values.  Return to the machine learning dashboard and wait for the model to complete and get evaluated.

How did we do?

If all went well, we should have a completed model and evaluation of that model with no failures in our dashboard.  Let’s check out the results by opening the evaluation.


Looking at our evaluation summary, things look pretty good.  We have a summary saying our ML model’s quality score was measured with an AUC score of 0.778.  That summary is somewhat vague so let’s click “explore model performance”.

Model Performance

By default the wizard saves 30% of our dataset for evaluation so we can measure the accuracy of our model’s predictions.  Binary classification algorithms are measured with an AUC or Area Under the Curve score.  The measurement is a value between 0 and 1 with 1 being a perfect model that predicts 100% of the values correctly.
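The AUC score itself is simple to compute by hand: it is the probability that a randomly chosen positive example gets a higher model score than a randomly chosen negative one.  A pure-Python sketch with made-up labels and scores:

```python
def auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs where the positive
    example outscores the negative one (ties count as half)."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

labels = [1, 1, 0, 0]
scores = [0.9, 0.4, 0.6, 0.2]   # made-up model scores

# One positive (0.4) is outscored by one negative (0.6), so AUC = 3/4
print(auc(labels, scores))      # -> 0.75
```

A model that ranks every positive above every negative scores a perfect 1.0; random guessing hovers around 0.5.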

Our score shows our model got 83% of our predictions correct and 17% incorrect.  Not bad!  What is nice about this type of scoring is we can also see our false positives (not-so-great wine classified as great) and false negatives (great wine that was predicted as not-so-great).  Specifically our model had 55 false positives and 25 false negatives.
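These counts are easy to reproduce by hand from a list of actual labels and predictions.  A pure-Python sketch on toy data (the six labels below are made up, not our wine evaluation set):

```python
def confusion_counts(actual, predicted):
    """Count true/false positives and negatives for binary labels (1 = great wine)."""
    tp = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
    fp = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)  # bad wine called great
    fn = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)  # great wine called bad
    tn = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 0)
    return tp, fp, fn, tn

actual    = [1, 0, 1, 0, 0, 1]
predicted = [1, 1, 0, 0, 0, 1]
tp, fp, fn, tn = confusion_counts(actual, predicted)
accuracy = (tp + tn) / len(actual)   # 4 of 6 correct
```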

Predicting quality wine is not exactly a matter of life and death.  Which means we aren’t necessarily concerned with false positives or false negatives as long as we have a decent prediction model.  But for other binary classification problems we may want to adjust our model to avoid false positives or false negatives.  This adjustment is made using the slider on the model performance screen shown above.

The adjustments come at a cost – if we want fewer false positives (bad wine predicted as great) then we’ll have more false negatives (great wine accidentally predicted as bad).  The reverse is also true, if we want fewer false negatives (great wine predicted as bad), we will have more false positives (bad wine predicted as great).
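Under the hood the slider is just moving a score cutoff.  The trade-off is easy to see on a handful of made-up prediction scores – raising the cutoff removes a false positive but creates a false negative:

```python
def classify(scores, threshold):
    """Turn raw model scores (0..1) into binary predictions at a given cutoff."""
    return [1 if s >= threshold else 0 for s in scores]

# Made-up scores for six wines; the first three are truly great (label 1)
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.7, 0.55, 0.6, 0.3, 0.1]

low  = classify(scores, 0.5)   # lenient cutoff: all great wines caught, one false positive
high = classify(scores, 0.65)  # strict cutoff: no false positives, one great wine missed
```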

Final thoughts

AWS uses specific learning algorithms for solving the three types of machine learning problems supported.  For binary classification, Amazon ML uses logistic regression – a logistic loss function trained with stochastic gradient descent (SGD).  Most beginners won’t mind these limitations but if we want to use other algorithms we’ll have to look at more advanced services like Amazon SageMaker.
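To peek inside the black box, logistic regression with SGD fits in a few lines of plain Python.  This is a toy sketch of the technique, not what AWS actually runs internally:

```python
import math
import random

def sgd_logistic(rows, labels, lr=0.1, epochs=500, seed=0):
    """Fit logistic regression weights with stochastic gradient descent."""
    rnd = random.Random(seed)
    w = [0.0] * (len(rows[0]) + 1)          # last weight is the bias term
    for _ in range(epochs):
        i = rnd.randrange(len(rows))        # 'stochastic': one random example per step
        x = rows[i] + [1.0]
        z = sum(wj * xj for wj, xj in zip(w, x))
        p = 1.0 / (1.0 + math.exp(-z))      # sigmoid
        for j in range(len(w)):             # gradient of the logistic loss
            w[j] -= lr * (p - labels[i]) * x[j]
    return w

def predict(w, row):
    z = sum(wj * xj for wj, xj in zip(w, row + [1.0]))
    return 1.0 / (1.0 + math.exp(-z))

# Tiny made-up dataset: one feature, high values mean label 1
rows, labels = [[0.0], [0.2], [0.8], [1.0]], [0, 0, 1, 1]
w = sgd_logistic(rows, labels)
```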

Machine learning as a service is easy, fast, and cheap.  By using the AWS black box approach we built a working binary classification model in less than 30 minutes with minimal coding and only a high level knowledge of machine learning.

Thanks for reading!


Amazon Machine Learning for a simple regression dataset

Amazon Machine Learning is a public cloud service that offers developers access to ML algorithms as a service.  An overview can be found in the Amazon documentation and in my last blog post.   Consider the Amazon Machine Learning service a black box that automatically solves a few types of common ML problems – binary classification, multi-class classification, and regression.  If a dataset fits into one of these categories, we can quickly get a machine learning project started with minimal effort.

But just how easy is it to use this service?  What is it good for and what are the limitations?  The best way to outline the strengths and weaknesses of Amazon’s offering is to run a few examples through the black box and see the results.  In this post we’ll take a look at how AWS holds up against a very clean and simple regression dataset.


First off, what is a machine learning regression problem?  Regression modeling takes a dataset and creates a predictive algorithm to approximate one of the numerical values in that dataset.  Basically it’s a statistical model used to estimate relationships between data points.

An example of regression modeling is a real estate price prediction algorithm.  Historical housing closing prices could be fed into an algorithm with as many relevant attributes about each house as possible (called features – e.g. square feet, number of bathrooms, etc).  The model could then be fed attributes of houses about to go up for sale and predict the selling price.

Let’s explore how Amazon Machine Learning models and performs with a simple regression dataset.

Kaggle Red Wine Quality Dataset

Kaggle is a community-based data science and machine learning site that hosts thousands of public datasets covering the spectrum of machine learning problems.  Our interest lies in taking Amazon Machine Learning for a test drive on a simple regression problem so I’m using a simple and clean regression dataset from Kaggle – Red Wine Quality.

Why this dataset?  Because we aren’t going to cover any data cleaning or transformations – we’re more interested in finding out how the AWS black box performs on a dataset with minimal effort.  Of course this is not a real world example but it keeps things simple.

The dataset contains a set of numerical attributes of 1600 red wine variants of the Portuguese “Vinho Verde” wine along with a numerical quality sensory rating between 0 (worst) and 10 (best).  Our goal will be to train an AWS model to predict the quality of this variety of red wine based off the chemically measurable attributes (features).  Can data science predict the quality of wine?

Amazon Machine Learning Walkthrough

First download the “winequality-red.csv” file from Kaggle.  Open up the .CSV file in a spreadsheet and take a look around.  We can see the names for each column, which include a “quality” column that rates the wine from 0-10.  Notice all fields are populated with numbers and there are no missing values.

We’ll need to upload our .CSV file into AWS storage to get our process started so log into the AWS console and open the S3 service.

S3 console

Create a new bucket with the default settings to dedicate to our machine learning datasets.

Create a new bucket

Upload the winequality-red.csv file into this bucket and keep the default settings, permissions will be adjusted later.

Upload the winequality-red.csv file

Now open AWS Machine Learning service in the AWS console.

Amazon Machine Learning Console

Click the “get started” button and then “view dashboard”.

Machine Learning Dashboard Launch

Click “create a new datasource and ML model”.

Create new datasource and ML model

AWS will prompt us for the input data for the datasource; we will use our S3 bucket name and the wine quality CSV file that was uploaded (it should auto-populate once we start typing).  Click “verify” to confirm the CSV file is intact.  At this point AWS will adjust the permissions for the ML service to have access to the S3 bucket and file (click “yes” to give it permissions).  Once we have a successful validation message, click “continue”.

Create datasource from CSV in S3 bucket

Successful validation

The wizard will now review the schema.  The schema is simply the fields in our CSV file.  Because our CSV contains the column names (fixed acidity, volatile acidity, etc) we want to select “yes” so we don’t try to model the quality based off the column names – selecting “yes” removes the names from the model.


Now we want to select the target value that we want the model to predict. As stated earlier, we want to predict the quality of the wine based off the measurable attributes so select “quality” and continue.


Keep the row identifier as default “no”.  We’ll skip this topic but it would be important if we were making predictions in a production environment.  Review the selections and “continue”.

Row Identifier


Accept the default ML model settings which will split our dataset and set aside 30% of the data for evaluating the performance of the model once it is trained using 70% of the data.
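The split itself is nothing exotic.  The same 70/30 hold-out can be sketched in a few lines of Python, with a fixed seed so it is repeatable (the integer rows below just stand in for wine records):

```python
import random

def train_test_split(rows, test_fraction=0.3, seed=42):
    """Shuffle a copy of the rows and carve off a hold-out set for evaluation."""
    shuffled = rows[:]                       # don't mutate the caller's list
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]    # (train, test)

rows = list(range(100))                      # stand-in for 100 records
train, test = train_test_split(rows)         # 70 for training, 30 for evaluation
```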

Review one last time and then select “create ML model”.

Model settings


We are done!  The dashboard will show the progress, and the entire process takes about 10-20 minutes.  AWS will take over and read the data, perform a 70/30 split, create a model, train the model using 70% of the data, and then evaluate the model using the remaining 30% of the data.

How did we do?

If all went well, we should have a completed model and evaluation of that model with no failures in our dashboard.  Let’s check out the results by opening the evaluation.

Once again, this isn’t a real world example.  A data scientist would typically spend much more time looking at the data prior to evaluating an algorithm.  But since this is a black box approach, we will entertain the notion of skipping to the end without doing any data exploration.


Looking at our evaluation summary, things look pretty good.  We have a green box saying our ML model’s quality score was measured with an RMSE of 0.729 and was better than the baseline value of 0.793.  That summary is somewhat vague so let’s click “explore model performance”.

ML Model performance

Since we saved 30% of our data for evaluation, we can measure the accuracy of our model’s predictions.  A common way to measure regression accuracy is the RMSE score, or root-mean-square error.  The RMSE score measures the difference between the model’s predictions and the actual values in the evaluation set.  An RMSE of 0 is perfect – the smaller the value, the better the model.

We can see a somewhat bell-shaped curve and our model is pretty close to zero which seems great.  However, if we look back at our summary we see that the RMSE baseline is 0.793 and our score is 0.729.  The baseline would be our score if we just guessed the median (middle) quality score for every prediction.  So although we are a bit better, we aren’t better by much and would be close if we just ignored all the attributes and guessed the median quality score (6) every time.
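RMSE and its baseline are simple to compute by hand.  With a few made-up quality ratings and predictions, always guessing the median gives the baseline, and a model only earns its keep if its predictions land closer to the actual values:

```python
import math
import statistics

def rmse(actual, predicted):
    """Root-mean-square error between actual and predicted values."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

actual = [5, 6, 5, 7, 6, 4]                   # made-up quality ratings
model  = [5.5, 6.0, 5.0, 6.0, 6.0, 5.0]       # made-up model predictions

baseline_guess = statistics.median(actual)    # "always guess the median"
baseline = rmse(actual, [baseline_guess] * len(actual))
score = rmse(actual, model)                   # a useful model beats the baseline
```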

Amazon Machine Learning – the good

Although our results were not incredibly impressive, the easy wizard-driven process and time to results were very impressive.  We can feed the AWS ML black box data and have a working model with an evaluation of its accuracy in less than 30 minutes.  And this can be done with little to no knowledge of data science, ML techniques, computer programming, or statistics.

And this entire process can be automated via the AWS API sets.  Think of an automated process that collects data and automatically feeds it to the AWS machine learning engine using code instead of manually clicking buttons in the console.  Predictions could be generated automatically on constantly changing data using code and the black box machine learning algorithms in the public cloud.

Lastly, the service is inexpensive.  It cost about $3 to run half a dozen datasets through the service over the past month while researching this blog.  All we need is some data in S3 and the service can start providing ML insights at a very low cost.

Amazon Machine Learning – the bad

A few points become clear after using the ML service.  The first is we have to adjust our expectations when feeding the AWS black box data.  Regression problems fed to Amazon Machine Learning are going to be solved using linear regression with stochastic gradient descent.  If a problem is easily solved using those algorithms then Amazon will do a great job making predictions.  If other algorithms like random forests or decision trees get better results then we need to look at more advanced Amazon services to solve those types of problems.

The second point is that data needs to be explored, cleaned, and transformed prior to feeding it to the machine learning service.  While we can look at correlations and data distributions after we create a datasource in AWS, there is no easy way to manipulate your dataset other than directly editing your CSV file.  There is an advanced option to use built-in data transformations as part of the ML model wizard, but this is a more advanced topic and is limited to the data transformations referenced in the documentation.  Accepting the defaults and not wrangling your input data may not get great results from our AWS ML black box.

Use case

Have a machine learning problem that is easily solved using linear regression with SGD?  Have an existing data pipeline that already cleans and transforms your data?  In this case Amazon Machine Learning can outsource your machine learning very quickly and cheaply without much time spent learning the service.  Just don’t expect a shortcut to your ML process – send Amazon clean data and you can quickly get an API prediction endpoint for linear regression problems.


Amazon Machine Learning – Commodity Machine Learning as a Service

Commodity turns into “as a service”

Find something in computing that is expensive and cumbersome to work with and Amazon will find a way to commoditize it.  AWS created commodity storage, networking, and compute services for end users with resources left over from their online retail business.  The margins may not have been great, but Amazon could make this type of business profitable due to their scale.

But what happens now that Microsoft and Google can offer the same services at the same scale?  The need for differentiation drives innovative cloud products and platforms with the goal of attracting new customers and keeping those customers within a single ecosystem.  Using basic services like storage, networking, and compute may not be differentiated between the cloud providers but new software as a service (SaaS) offerings are more enticing when shopping for cloud services.

Software as a service (SaaS) probably isn’t the best way to describe these differentiated services offered by public cloud providers.  SaaS typically refers to subscription-based software running in the cloud like Salesforce.com or ServiceNow.  Recent cloud services are better described by their functionality with an “as a service” tacked on.  Database as a service.  Data warehousing as a service.  Analytics as a service.  Cloud providers build the physical infrastructure, write the software, and implement vertical scaling with a self-service provisioning portal for easy consumption.  Consumers simply implement the service within their application without worrying about the underlying details.

Legacy datacenter hardware and software vendors may not appreciate this approach due to lost revenue but the “as a service” model is a good thing for IT consumers.  Services that were previously unavailable to everyday users have been democratized and are available to anyone with an AWS account and a credit card.  Cost isn’t necessarily the primary benefit to consumers but rather the accessibility and the ability to consume using a public utility model.  All users now have access to previously exotic technologies and can pay by the hour to use them.

Machine Learning as a Service

Machine learning is at the peak of the emerging technology hype cycle.  But is it really all hype?  Machine learning (ML) has been around for 20+ years.  ML techniques allow users to “teach” computers to solve problems without hard-coding hundreds or thousands of explicit rules.  This isn’t a new idea but the hype is definitely a recent phenomenon for a number of reasons.

So why all the machine learning hype?  The startup costs, both in terms of hardware/software and accessibility, are much lower, which presents an opportunity to implement machine learning that wasn’t available in the past.  Data is more abundant and data storage is cheap.  Computing resources are abundant and CPU/GPU cycles are (relatively) cheap.  Most importantly, the barriers have been lifted in terms of combining access to advanced computing resources and vast sets of data.  What used to require onsite specialized HPC clusters and expensive data storage arrays can now be performed in the cloud for a reasonable cost.  Services like Amazon Machine Learning are giving everyday users the ability to perform complicated machine learning computations on datasets that in the past were only available to researchers at universities.

How does Amazon provide machine learning as a service?  Think of the service as a black box.  The implementation details within the black box are unimportant.  Data is fed into the system, a predictive model/algorithm is created and trained, the system is tweaked for its effectiveness, and then the system is used to make predictions using the trained model.  Users can use the predictive system without really knowing what is happening inside the black box.

This machine learning black box isn’t magical.  It is limited to a few basic types of models (algorithms) – regression, binary classification, and multiclass classification.  More advanced operations require users to look to Amazon SageMaker and require a higher skill level than the basic AWS machine learning black box.  However, these three basic machine learning models can get you started on some real-world problems very quickly without really knowing much math or programming.

Amazon Machine Learning Workflow

So how does this process work at a high level?  If a dataset and use case can be identified as a regression or binary/multiclass classification problem, then the data can simply be fed to the AWS machine learning black box.  AWS will use the data to automatically select a model and train the model using your input data.  The effectiveness of the model is then evaluated and scored numerically.  The model is ready to use at this point but can also be tweaked to improve the scoring.  Bulk data can be fed to the trained model for batch predictions, or ad-hoc predictions can be performed using the AWS console or programmatically through the AWS API.
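The batch prediction step is the part most likely to be scripted.  A hedged sketch using the Boto3 ‘machinelearning’ client – the IDs and S3 output URI are hypothetical, and the AWS call itself needs credentials so it is wrapped in a function rather than run directly:

```python
import time

def wait_for_status(describe, entity_id, poll_seconds=30):
    """Poll an AWS ML entity until it reaches a terminal state.
    'describe' is any callable returning a dict with a 'Status' key."""
    while True:
        status = describe(entity_id)["Status"]
        if status in ("COMPLETED", "FAILED"):
            return status
        time.sleep(poll_seconds)

def run_batch_prediction(model_id, datasource_id, output_s3):
    """Sketch of kicking off a batch prediction; not run without credentials."""
    import boto3
    ml = boto3.client("machinelearning")
    ml.create_batch_prediction(
        BatchPredictionId="bp-example",       # hypothetical ID
        MLModelId=model_id,
        BatchPredictionDataSourceId=datasource_id,
        OutputUri=output_s3,                  # e.g. "s3://my-ml-bucket/output/"
    )
    return wait_for_status(
        lambda i: ml.get_batch_prediction(BatchPredictionId=i), "bp-example")
```

The polling loop mirrors the sample code mentioned in the earlier post – AWS ML jobs are asynchronous, so the caller waits for COMPLETED or FAILED before reading results out of S3.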

Knowing that a problem can be solved by AWS takes a bit of high-level machine learning knowledge.  The end user needs to have an understanding of their data and of the three model types offered in the AWS black box.  Reading through the AWS Machine Learning developer guide is a good start in terms of an overview.  Regression models solve problems that need to predict numeric values.  Binary classification models predict binary outcomes (true/false, yes/no) and multiclass classification models can predict more than two outcomes (categories).

Why use Amazon machine learning?

For those starting out with machine learning this AWS service may sound overcomplicated or of questionable value.  Most tutorials show this type of work done on a laptop with free programming tools, why is AWS necessary?  The point is the novice user can do some basic machine learning in AWS without the high startup opportunity costs of learning programming or learning how to use machine learning software packages.  Simply feed the AWS machine learning engine data and get instant results.

Anyone can run a proof of concept data pipeline into AWS and perform some basic machine learning predictions.  Some light programming skills would be helpful but are not mandatory.  Having a dataset with a well-defined schema is a start as well as having a business goal of using that dataset to make predictions on similar sets of incoming data.  AWS can provide these predictions for pennies an hour and eliminate the startup costs that would normally delay or even halt these types of projects.

Amazon machine learning is a way to quickly get productive and get a project past the proof of concept phase in making predictions based on company data.  Access to the platform is open to all users so don’t rely on the AWS models for a competitive advantage or a product differentiator.  Instead use AWS machine learning as a tool to quickly get a machine learning project started without much investment.

Thanks for reading, some future blog posts here will include running some well-known machine learning problems through Amazon Machine Learning to highlight the process and results.