Exporting Lightsail instance snapshots to Amazon EC2

I’ve been using Amazon Lightsail for the past six months to host my WordPress blog site. I’m happy with Lightsail – I have full control over both my Linux instance OS and my WordPress blog site administration but I don’t have to deal with the headaches (or the price) of a full-blown EC2 instance.

A few months back I wrote a blog post listing the pros and cons of Amazon Lightsail.  Why would one use Lightsail versus Amazon EC2?  Within the Amazon ecosystem, Lightsail is the cheapest and easiest to use platform if the goal is to get a Linux instance and simple application up and running in AWS.

Think of Lightsail as a lightweight version of AWS EC2.  There really is no learning curve to get started on Lightsail – just pick some basic options and launch the instance.  The Lightsail pricing structure is straightforward and predictable.  Both the pricing and the ease of use make Lightsail an attractive hosting platform for bloggers and developers getting started on Amazon: the experience is quick, cheap, and easy.

With this simplicity come limitations.  A lightweight version of an Amazon EC2 instance doesn’t have the full complement of EC2 features.  This makes sense – Lightsail is geared towards bloggers and developers and is kept separate from EC2.  A lightweight version works fine when starting a project but what if we hit a point where the advanced networking features of a VPC are required?  Or autoscaling groups?  Or spot instances?

Fortunately this has changed as of AWS re:Invent 2018 – we can now convert a Lightsail instance into an EC2 instance using Lightsail instance snapshots!   I’ll walk through the process in this blog and show some of the strategies and considerations needed to move a Lightsail instance out of the limited Lightsail environment into the feature-rich EC2 environment.

Why should I move my instance out of Lightsail?

Let’s face it – we aren’t going to move our instance out of Lightsail unless we have to.  Lightsail is the cheapest option for running an instance in AWS.  Instance monthly plans start as low as $3.50/month as long as the instance stays within the resource bounds of the pricing plan.

Adding more horsepower to a Lightsail instance is also not a reason to move to EC2.  Lightsail allows upgrading an instance to use more resources (for a price) – just use an instance snapshot to clone the existing instance to a new plan with more resources (the application inside the instance will persist using instance snapshots).

But what if we want to take our application out of the Lightsail environment and place it into more of a production environment?  What if we have found limitations in Lightsail and found that our application needs to reside in EC2 to take advantage of all the available EC2 features?  What if we need web-scale features, advanced networking, and access to services that are limited in the Lightsail environment?  In that use case we need to move to EC2.

AWS terminology and concepts

Before we go through the conversion process, let’s clear up some terminology and concepts that will help illustrate the task.

Instance – this is just Amazon’s fancy word for a virtual machine.  The term “virtual machine” probably sounded too much like VMware so Amazon calls virtual machines “instances”.  An instance is a virtual machine but runs on Amazon’s proprietary hypervisor environment.

Lightsail instance – An AWS image running in the Lightsail environment.  Lightsail is separate from other AWS environments and runs on a dedicated  hardware stack in selected data centers.

EC2 instance – An AWS image running in the EC2 environment.  EC2 is completely separate from Lightsail and runs on its own hardware and has many more management and configuration options.

Instance snapshot  – A snapshot of either a Lightsail or EC2 instance that not only includes the state of the virtual machine but also a snapshot of the attached OS disk.

Disk snapshot – Unlike an instance snapshot, a disk snapshot only tracks changes to the physical disk attached to the virtual machine.  No virtual machine state is captured so this type of snapshot is only good for keeping track of block disk data, not for the virtual machine state.

CloudFormation stack – a collection of AWS resources that is created and managed as a single unit from a CloudFormation template.  CloudFormation can be used to define the configuration details of an AWS instance.

Key pair – quite simply, a .PEM file used to connect to an instance.  We can’t connect to an instance remotely (SSH, PuTTY) without a .PEM key pair.

AMI – a preconfigured template that is also associated with an EBS snapshot and is used to launch an instance with a specified configuration.

How exporting an instance snapshot works

Instance snapshots are portable.  When an instance snapshot is taken of a Lightsail instance, the OS and disk state are stored to flat files and kept in some form of S3 object storage.  The flat files can be shared within Lightsail to create a new instance from the same snapshot – we can even assign more CPU or RAM to an instance snapshot to upgrade the instance to something faster.

With the latest AWS Lightsail EC2 instance snapshot export feature we can now take the same Lightsail snapshot and share it with AWS EC2 infrastructure.  This allows EC2 to import the Lightsail snapshot  to clone our Lightsail instance and spin it up in EC2.  Lightsail snapshots were always portable but the recent EC2 export feature enabled sharing of snapshots outside of Lightsail.

Behind the scenes Amazon uses CloudFormation to enable this conversion process, which simply means a CloudFormation stack is created automatically when we move our Lightsail instance snapshot out to EC2.  We don’t have to worry about CloudFormation at all: this is transparent to the user and doesn’t even require opening the CloudFormation console.

Walkthrough of moving a Lightsail instance to EC2

Exporting a Lightsail instance is very easy.  First log into your AWS account and open the Lightsail console.  Select your existing Lightsail instance to see the options below.

lightsail Console

Open the snapshots tab and create an instance snapshot (if you don’t already have one).  Use any name you like and click “create snapshot”.  The process will most likely take a few minutes and the progress is shown in the console.

Create Instance Snapshot in Lightsail console

After the snapshot is completed it will display in the list of “recent snapshots”.  We can then select this snapshot (three orange dots on the right) and select “Export to Amazon EC2”.  This will move the snapshot out of Lightsail object storage over to EC2.

Recent Snapshots in Lightsail Console

An informational pop-up will appear explaining the outcome (EBS snapshot and AMI) along with a warning that EC2 billing rates will apply.  Click “Yes, continue”.

Export Lightsail Snapshot to EC2 in Lightsail console

One last warning will appear alerting that the existing default Lightsail key pairs (.PEM file) should be replaced with an individual key pair.  Click “Acknowledged”.

Export Security Warning in Lightsail Console

The progress can be monitored using the top “gear” in the Lightsail console which will “spin” during the export process.  When completed, the gear will stop “spinning” and the task will show as completed and “Exported to EC2” in the task history.

Running Snapshot export task in Lightsail Console

Completed Snapshot export in lightsail console
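
The console steps above can also be scripted.  The sketch below is an illustration, not part of the original walkthrough: it assumes a boto3 Lightsail client and uses its create_instance_snapshot, get_instance_snapshot, and export_snapshot calls.  The function name, the polling defaults, and the instance and snapshot names in the usage note are hypothetical.

```python
import time


def export_instance_to_ec2(client, instance_name, snapshot_name,
                           poll_seconds=15, max_polls=80, sleep=time.sleep):
    """Snapshot a Lightsail instance, wait for it, then export it to EC2.

    `client` is expected to behave like boto3.client("lightsail").
    """
    # Step 1: create the instance snapshot (the "create snapshot" button).
    client.create_instance_snapshot(
        instanceName=instance_name,
        instanceSnapshotName=snapshot_name,
    )

    # Step 2: poll until the snapshot leaves the "pending" state.
    for _ in range(max_polls):
        snapshot = client.get_instance_snapshot(
            instanceSnapshotName=snapshot_name)["instanceSnapshot"]
        state = snapshot["state"]
        if state == "available":
            break
        if state == "error":
            raise RuntimeError(f"snapshot {snapshot_name} failed")
        sleep(poll_seconds)
    else:
        raise TimeoutError(f"snapshot {snapshot_name} is still not available")

    # Step 3: start the export (the "Export to Amazon EC2" menu item).
    return client.export_snapshot(sourceSnapshotName=snapshot_name)
```

With credentials configured, a call might look like export_instance_to_ec2(boto3.client("lightsail", region_name="us-east-2"), "my-wordpress", "my-wordpress-snap").  The export itself still runs asynchronously, just as the spinning gear in the console suggests.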

Using the exported EC2 Lightsail instance

Now that we have exported our Lightsail instance snapshot to EC2, what happens next?  By default, nothing.  Looking at the completed task screenshot (above), we can open the EC2 console to see what we have or we can use the Lightsail console to launch an EC2 instance from our exported snapshot.

Lightsail is great and I love the simplified console but we really should use the EC2 console at this point.  Lightsail has all new wizard screens to manage our instance in EC2 but my feeling is that if we are going to export to EC2, we should switch over and learn to use the EC2 console.

As shown in the screenshot below, open the EC2 console and make sure the region is the same one that was used for Lightsail (in my case, US East (Ohio)).

EC2 in the AWS Console

Open the “Elastic Block Store” link on the left side of the EC2 console.  On the top right side of the console we can see the instance snapshot that was created by Lightsail.  Since this snapshot is an instance snapshot we will also find an AMI associated with this EBS snapshot.

elastic block storage snapshots in EC2 Console

EBS snapshots in ec2 console

Open the “Images” – “AMI” link in the EC2 console.  We can see in the upper right frame that the Lightsail snapshot has been converted to an EC2 AMI.  This was done with AWS CloudFormation but that doesn’t really matter, the AMI is ready for us to use without worrying about how it was created.

AMIs in ec2 console

AMis in ec2 console

Now the fun part, we can launch our Lightsail instance in EC2 using the AMI.  Just click “launch”!  A few things to consider when launching the AMI (the details are out of scope for this post):

Instance size – pick an EC2 instance type for your Lightsail instance; the EC2 options have different pricing plans and different levels of CPU, RAM, GPU, storage, etc.

Security (key pair, security groups, VPC info) –  design a strategy (if you don’t already have one) to secure your new instance with a VPC, security group, key pair, etc.

Additional storage – additional disk storage can be attached during the EC2 launch

Tags – optionally you can tag your instance for easier AWS billing.
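
The launch options above map onto parameters of the EC2 RunInstances API.  As a rough sketch (not from the original post), the helper below builds the keyword arguments for the boto3 EC2 client’s run_instances call.  The helper name, the device name, and every ID shown in the usage note are hypothetical placeholders.

```python
def build_launch_request(ami_id, instance_type, key_name, security_group_ids,
                         subnet_id, extra_volume_gib=None, tags=None):
    """Build keyword arguments for an EC2 run_instances call."""
    request = {
        "ImageId": ami_id,               # AMI created by the Lightsail export
        "InstanceType": instance_type,   # instance size, e.g. "t3.micro"
        "KeyName": key_name,             # your own key pair, not the Lightsail default
        "SecurityGroupIds": list(security_group_ids),
        "SubnetId": subnet_id,           # places the instance in a VPC subnet
        "MinCount": 1,
        "MaxCount": 1,
    }
    if extra_volume_gib is not None:
        # Additional storage: attach one extra EBS volume at launch.
        request["BlockDeviceMappings"] = [{
            "DeviceName": "/dev/sdf",
            "Ebs": {"VolumeSize": extra_volume_gib,
                    "DeleteOnTermination": True},
        }]
    if tags:
        # Tags: applied to the instance for easier billing reports.
        request["TagSpecifications"] = [{
            "ResourceType": "instance",
            "Tags": [{"Key": k, "Value": v} for k, v in sorted(tags.items())],
        }]
    return request
```

The result can be passed straight through, for example boto3.client("ec2").run_instances(**build_launch_request("ami-0123456789abcdef0", "t3.micro", "my-key", ["sg-0123"], "subnet-0123")).  Keeping the request as a plain dictionary makes it easy to review the choices before anything is launched (and billed).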

Pick your options and you are ready to go, your Lightsail instance is now running in EC2!  Thanks for reading!

Handwritten digit recognition – MNIST via Amazon Machine Learning APIs

In our last post we explored how to use the Amazon Machine Learning APIs to solve machine learning problems.  We looked at the SDK, we looked at the machine learning client documentation, and we even looked at an example on Github.  But we didn’t write any code.

If it is so easy to consume AWS APIs with code then why don’t we take a shot at writing some code?  Point taken, this post will include code!  We will use Python to write some functions to call Amazon Machine Learning APIs to create a machine learning model and run some batch predictions.  The code I’ve written is heavily influenced by AWS Python sample code – I’ve written my functions and main body but borrowed a few bits like the polling functions and random character generation.

MNIST dataset – handwritten digit recognition

The MNIST dataset is a “hello world” type machine learning problem that engineers typically use to smoke test an algorithm or ML process.  The dataset contains 70,000 handwritten digits from 0-9, each scanned into a 28×28 pixel image.  Each of the 784 (28 x 28) pixel values represents that pixel’s intensity, from 0 (white) to 255 (black).  See the Wikipedia entry for a sample visualization of a subset of the images.

We are going to pull the MNIST data from Kaggle to make things easy.  Why pull it from Kaggle?  It saves us some time: the Kaggle version has conveniently split the data into 42,000 images in a training CSV and 28,000 images in a test CSV.  The training CSV has the labels in the first column, which are just the digits represented by all the pixels in the row.  The test CSV file has the label removed so it can be used for predictions.  Kaggle has made things a bit easier for us; we don’t have to do any manipulation of the data.
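
To make the CSV layout concrete, here is a small sketch (not taken from the project code) that splits one training row into its label and a 28 x 28 grid.  It assumes the Kaggle column order of a label followed by pixel0 through pixel783 in row-major order; the sample row is made up.

```python
import csv
import io


def parse_training_row(row):
    """Split one training row into (label, 28x28 grid of pixel intensities)."""
    label = int(row[0])
    pixels = [int(value) for value in row[1:]]
    if len(pixels) != 28 * 28:
        raise ValueError(f"expected 784 pixels, got {len(pixels)}")
    # Regroup the flat list of 784 values into 28 rows of 28 pixels.
    grid = [pixels[start:start + 28] for start in range(0, 784, 28)]
    return label, grid


# Made-up sample: a header plus one all-white image labeled 7 with one dark pixel.
header = ["label"] + [f"pixel{i}" for i in range(784)]
values = ["7"] + ["0"] * 784
values[1 + 28 * 3 + 5] = "255"    # row 3, column 5 of the image
sample_csv = ",".join(header) + "\n" + ",".join(values) + "\n"

reader = csv.reader(io.StringIO(sample_csv))
next(reader)                       # skip the header row
label, grid = parse_training_row(next(reader))
print(label, len(grid), len(grid[0]), grid[3][5])   # prints: 7 28 28 255
```

A test row would look the same minus the first column, which is why the batch data needs its own schema later on.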

Code on Github – https://github.com/keithxanderson

The code I have written to run MNIST through Amazon Machine learning via APIs is up on Github.  See the README.md for a detailed explanation of how to use the code and what exact steps are required to get things working.  We are going to use Python and specifically the ‘machinelearning’ client of Boto3 in order to pass our calls to AWS.

While the README.md explains things pretty well, let’s summarize here.  The main MNIST.py Python file uses the MNIST training data to create and evaluate an ML model and then uses the test data to create a batch prediction.  Data in the test file contains the pixels but no label, our model is predicting what digit is represented by the pixels after “learning” from the training data with labels (the answers).

There is a Python function written for each of the high-level operations:

create_training_datasource – takes a training file from an S3 bucket, a schema file from an S3 bucket, and a percentage argument and splits the training data into a training dataset and an evaluation dataset.  The number of rows in each dataset is determined by the percentage set.  The schema file defines the schema (obviously).

create_model – takes a training datasource ID and a recipe file to create and train a model.  The function is hard-coded to create a multiclass classification model (multinomial logistic regression algorithm).

create_evaluation – takes our model ID and our evaluation datasource ID and creates an evaluation which simply scores the performance of our model using the reserved evaluation data.  Results are seen manually in the AWS console but could be added to our code if we wanted.

create_batch_prediction_dataset – takes a test file from an S3 bucket and a schema file (a different schema file than our training schema file) and creates a batch dataset.  We will use this to make predictions, this data does not contain any labels.

batch_prediction – takes an existing model ID and a batch datasource ID and runs predictions of the batch data against our model.  Results are written to a flat file in the S3 location given by the output argument.
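
As an illustration of the first function in this list, here is one way the percentage split could be expressed.  This is a sketch under assumptions, not the code from the Github project: the helper name and ID scheme are hypothetical, while DataSpec, DataRearrangement, and ComputeStatistics are parameters of the Boto3 create_data_source_from_s3 call.

```python
import json


def split_datasource_requests(base_id, data_s3_url, schema_s3_url, train_percent):
    """Build the two create_data_source_from_s3 requests for a train/eval split."""
    def request(suffix, begin, end):
        rearrangement = {"splitting": {"percentBegin": begin, "percentEnd": end}}
        return {
            "DataSourceId": f"{base_id}-{suffix}",
            "DataSourceName": f"{base_id} {suffix}",
            "DataSpec": {
                "DataLocationS3": data_s3_url,
                "DataSchemaLocationS3": schema_s3_url,
                # DataRearrangement is passed as a JSON string.
                "DataRearrangement": json.dumps(rearrangement),
            },
            # Statistics are needed when a datasource is used for training.
            "ComputeStatistics": True,
        }

    # Training data is the first train_percent of rows, evaluation data the rest.
    return (request("train", 0, train_percent),
            request("eval", train_percent, 100))
```

Each request would then be sent with client.create_data_source_from_s3(**request), where client is boto3.client("machinelearning").  Both datasources point at the same CSV file; only the row range differs.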

Schema files and recipe

Our datasources all need a schema and we can define these with a schema file.  The schema for the training and evaluation data are the same and can use the same file.  The batch datasource does not contain labels so needs a separate schema file.

The training and evaluation schema file for MNIST can be found in the Github project with the Python code.  We simply define every column and declare each of the columns as “CATEGORICAL” since this is a multiclass classification model.  Each handwritten number is a category (0-9) and each pixel is a category (0-255).  The format of the schema file is defined by AWS; if you create a datasource manually in the console you can copy the AWS-generated schema into your schema file.

The batch schema file is also found in the same project.  The only difference between the batch schema file and the training/evaluation schema file is that the batch schema does not contain the labels (numbers).

Lastly, our recipe file is found on Github with the Python code and schema files.  The recipe simply instructs AWS how to treat the input and output data.  Our example is very simple since all features (columns) are categorical and we just define all of our outputs as “ALL_CATEGORICAL”.  This saves us from manually declaring each pixel feature as a category.
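
Typing 785 column definitions by hand is tedious, so the files can be generated.  The sketch below shows one way to do that; it assumes the Kaggle column names (label, pixel0 through pixel783) and is not necessarily identical to the schema and recipe files in the Github project.

```python
import json


def mnist_schema(include_label=True):
    """Return an Amazon ML schema dict for the Kaggle MNIST CSV layout."""
    attributes = [{"attributeName": f"pixel{i}", "attributeType": "CATEGORICAL"}
                  for i in range(784)]
    schema = {
        "version": "1.0",
        "dataFormat": "CSV",
        "dataFileContainsHeader": True,
        "attributes": attributes,
    }
    if include_label:
        # Training/evaluation data: the label column comes first and is the target.
        attributes.insert(0, {"attributeName": "label",
                              "attributeType": "CATEGORICAL"})
        schema["targetAttributeName"] = "label"
    return schema


# Recipe that passes every categorical column through unchanged.
recipe = {"groups": {}, "assignments": {}, "outputs": ["ALL_CATEGORICAL"]}

training_schema_json = json.dumps(mnist_schema(include_label=True), indent=2)
batch_schema_json = json.dumps(mnist_schema(include_label=False), indent=2)
recipe_json = json.dumps(recipe, indent=2)
```

The three JSON strings can be written to files and uploaded to the S3 bucket next to the CSV files.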

Running the code

See the README.md for detailed instructions on running the code.  Provided we have our CSV files, schema files, and recipe file in an S3 bucket, we would just need to modify the MNIST.py file with the S3 locations for each file in the main body of the code along with an output S3 bucket for the prediction results (all can share the same S3 bucket).

After executing the MNIST.py file we will see the training, evaluation, and batch dataset creation execute in parallel and the model and evaluation process go pending.  Once we have our datasets created we will see the model training start and once we see the model training complete we will see the evaluation start.

A polling function is called within the batch prediction function to wait on the evaluation process to complete prior to running a batch prediction.  This is optional, there is really no dependency since the batch prediction can run in parallel with the evaluation scoring of our model.
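
For readers who have not seen a polling function before, the idea fits in a few lines.  This is an independent sketch rather than the AWS sample’s helper: it assumes a client that behaves like boto3.client("machinelearning") and relies on the Status field returned by get_evaluation; the function name and defaults are hypothetical.

```python
import time


def wait_for_evaluation(client, evaluation_id, poll_seconds=30,
                        max_polls=120, sleep=time.sleep):
    """Poll get_evaluation until the evaluation finishes; return the final response."""
    for _ in range(max_polls):
        response = client.get_evaluation(EvaluationId=evaluation_id)
        status = response["Status"]
        if status == "COMPLETED":
            return response
        if status in ("FAILED", "DELETED"):
            raise RuntimeError(f"evaluation {evaluation_id} ended as {status}")
        # PENDING or INPROGRESS: wait and ask again.
        sleep(poll_seconds)
    raise TimeoutError(f"evaluation {evaluation_id} did not finish in time")
```

Passing sleep as a parameter is only a convenience for testing the loop without waiting.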

Model evaluation performance

If all went well we will have an evaluation score in the AWS console.  Let’s take a look!  Looking at our ML model evaluation performance in the console we see an F1 score of .914 compared against a baseline score of .020 which is very good.

ML Model Performance

Exploring our model performance against our evaluation data we see that the model is very good at predicting the handwritten digits. In the matrix below, the dark blue cells mark the digits that were classified correctly.  This is not terribly surprising, most algorithms perform pretty well with the MNIST dataset but give credit to AWS for tuning the multiclass classification algorithm to this level of performance.

Explore Model Performance

Batch prediction results

We should also have an output file in our batch prediction output S3 bucket if everything ran to completion.  I’ve included a sample output CSV file within my MNIST project to view.  See the AWS documentation for more information on how to interpret the output.  Each column corresponds to one prediction category (the digits 0-9), and each row holds the predicted probabilities that one test image belongs to each category.

For example, in our sample CSV file the first prediction shows about a 99% probability the number is a 2 digit and our second prediction shows a 99% probability the next number is a 0.
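
Reading the file by eye works for a few rows; for the full test set we would pick the highest-scoring column in code.  The sketch below assumes the layout described above (a header row of class labels followed by one row of probabilities per test image) and uses invented numbers with only three classes to stay short.

```python
import csv
import io


def most_likely_digits(output_csv_text):
    """Return (predicted_label, probability) for each row of a batch output file."""
    reader = csv.reader(io.StringIO(output_csv_text))
    labels = next(reader)                  # header row: one column per class
    predictions = []
    for row in reader:
        scores = [float(value) for value in row]
        # Index of the column with the highest probability.
        best = max(range(len(scores)), key=scores.__getitem__)
        predictions.append((labels[best], scores[best]))
    return predictions


# Invented sample with only three classes to keep it short.
sample = "0,1,2\n0.004,0.006,0.990\n0.991,0.005,0.004\n"
print(most_likely_digits(sample))   # prints: [('2', 0.99), ('0', 0.991)]
```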

Why use Amazon Machine Learning?

We’ve discussed the value of using Amazon Machine Learning before but it is worth repeating.  It is very easy to incorporate basic machine learning algorithms into your application by offloading this work to AWS.  Avoid spending time and money building out infrastructure or expertise for an in-house machine learning engine.  Just follow the basic workflow outlined above and make a few API calls to Amazon Machine Learning.

Thanks for reading!

How to use the Amazon Machine Learning API

Machine learning, AI, deep learning – all of these technologies are popular and the hype cycle is intense.  When one digs into the topic of machine learning, it turns out to be a lot of math, data wrangling, and writing code.  Which means one can quickly become overwhelmed by the details and lose sight of the practicality of machine learning.

Take a step back from all the complexity and ask a fundamental question.  Why should we even look at machine learning?   Simple, we want our applications to get smarter and make better decisions.   Ignore the complexities of machine learning and think about the end result instead.  Machine learning is a powerful tool that can be used to our advantage when building applications.

Now consider how we would implement machine learning into our applications.  We could write lots of code within our app to pull historical data, use this data to create and train models, evaluate and tweak our models, and finally make predictions.  Or we could offload this work to public cloud services like Amazon Machine Learning.  All it takes is some code within our application to make calls to Amazon Machine Learning APIs.

Machine learning via APIs?

I’ve been an infrastructure guy for most of my career.  While I have worked for software companies and have worked closely with software developers – I’m not a developer and have not made my living writing code.

That being said, machine learning requires learning to write code.  One wouldn’t go out and buy machine learning from a vendor but would rather write code to build and use machine learning models.  Furthermore, machine learning models don’t provide much value by themselves but are valuable when built into an application.  So coding is important for using machine learning.

Think along the same lines regarding consumption of public cloud.  Cloud isn’t just about cost, it’s about applications consuming infrastructure  and services with code.  Anything that can be created, configured, or consumed in AWS can (and should?) be done so with code.   So coding is also important for consuming public cloud.

Fortunately, writing code to use Amazon Machine Learning isn’t very hard at all.  Amazon publishes a machine learning developer guide and API reference documentation that shows how to write code to add machine learning to your application.  That combined with some machine learning examples on Github makes the process easy to follow.

Overview of an application using Amazon Machine Learning

What is the high-level architecture of an application using Amazon Machine Learning?  Let’s use an example as we walk through the architecture.  Think of a hypothetical customer loyalty application that sends rebates to customers we think may stop using our theoretical company’s services.  We could incorporate Amazon machine learning into the app to flag at-risk customers for rebates.

Raw data – first we need data.  This could be in a database, data warehouse, or even flat files in a filesystem.  In our example, these are the historical records of all customers that have used our service and have either remained loyal to our company or stopped using our services (the historical data is labeled with the outcomes).

Data processed and uploaded to S3 – Our historical data needs to be extracted from our original source (above), cleaned up, converted into a CSV file,  and then uploaded to an AWS S3 bucket.

Training datasource – our application needs to invoke Amazon Machine learning via an API and create a ML datasource out of our CSV file.

Model – our application needs to create and train a model via an AWS API using the datasource we created above.

Model evaluation – our app will need to measure how well our model performs with some of the data reserved from the datasource.  Our evaluation scoring gives us an idea of how well our model makes predictions on new data.

Prediction Datasource –  if we want to make predictions with machine learning, we need new customer data that matches our training data schema but does not contain the customer outcome.    So we need to collect this new data into a separate CSV file and upload it to S3.

Batch predictions – now for the fun part, we can now call our model via AWS APIs to predict the target value of the new data we uploaded to S3.  The output will be contained in a flat file in an S3 bucket we specify that we can read with our app and take appropriate action (do nothing or send a rebate).

How to get started?

First pick a programming language.  Python is popular both for machine learning and for interacting with AWS so we’ll assume Python is our language.  Next read through the AWS Python SDK Quickstart guide to get a feel for how to use AWS with Python.  Python uses the SDK Boto3 which needs to be installed and also needs to be configured to use your AWS account credentials.

Lastly, read through the machine learning section of the Boto3 SDK guide to get a feel of the logical flow of how to consume Amazon Machine Learning via APIs.  Within this guide we would need to call the Python methods below using the recommended syntax given.

create_data_source_from_s3 – how we create a datasource from a CSV file

create_ml_model – how we create an ML model from a datasource

create_evaluation – how we create an evaluation score for the performance of our model

create_batch_prediction –  how we predict target values for new datasources
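
Put together, the four methods above line up with the architecture we walked through.  The sketch below issues them in order for the hypothetical customer loyalty example (a binary outcome: stayed or left).  The function name, ID scheme, and 70/30 split are assumptions for illustration; client is expected to behave like boto3.client("machinelearning").

```python
import json


def run_pipeline(client, run_id, labeled_url, labeled_schema_url, recipe,
                 unlabeled_url, unlabeled_schema_url, output_url):
    """Issue the create_* calls in order and return the generated IDs."""
    ids = {name: f"{run_id}-{name}" for name in
           ("train", "eval", "model", "evaluation", "batch-data", "batch")}

    def create_datasource(key, url, schema_url, begin=0, end=100):
        split = {"splitting": {"percentBegin": begin, "percentEnd": end}}
        client.create_data_source_from_s3(
            DataSourceId=ids[key],
            DataSpec={"DataLocationS3": url,
                      "DataSchemaLocationS3": schema_url,
                      "DataRearrangement": json.dumps(split)},
            ComputeStatistics=True,
        )

    # Datasources: 70% of labeled rows for training, 30% for evaluation,
    # plus all of the new unlabeled rows for batch prediction.
    create_datasource("train", labeled_url, labeled_schema_url, 0, 70)
    create_datasource("eval", labeled_url, labeled_schema_url, 70, 100)
    create_datasource("batch-data", unlabeled_url, unlabeled_schema_url)

    # Model, evaluation, and batch prediction all refer back to earlier IDs.
    client.create_ml_model(MLModelId=ids["model"], MLModelType="BINARY",
                           TrainingDataSourceId=ids["train"],
                           Recipe=json.dumps(recipe))
    client.create_evaluation(EvaluationId=ids["evaluation"],
                             MLModelId=ids["model"],
                             EvaluationDataSourceId=ids["eval"])
    client.create_batch_prediction(BatchPredictionId=ids["batch"],
                                   MLModelId=ids["model"],
                                   BatchPredictionDataSourceId=ids["batch-data"],
                                   OutputUri=output_url)
    return ids
```

The calls return quickly because the work is queued on the AWS side; the returned IDs are what we would use later to check status and find the results.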

Getting started with an example

At this point we have a good idea of what we need to do to incorporate Amazon machine learning into a Python app.  Those already proficient in Python could start writing new functions to call each of these Boto3 ML methods following the architecture we described above.

What about an example for those of us who aren’t as proficient in Python?  Fortunately there are plenty of examples online of how to write apps to call AWS APIs.  Even better, AWS released some in-depth ML examples on Github that walk through examples in several languages including Python.

Specifically let’s look at the targeted marketing example for Python.  First check out the README.md.  We see that we have to have our AWS credentials set up locally where we will run the code so that we can authenticate to our AWS account.  We’ll also need Python installed locally (as well as Boto3) so that we can run Python executables (.py files).

Next take a look at the Python code to build an ML model.  A few things jump out in the program:

–  The code imports a few libraries to use in the code (boto3, base64, json, os, sys).

–  The code then defines the S3 bucket and CSV file that we will use to train our model (“TRAINING_DATA_S3_URL”).

–  Functions are written for each subtask that needs to be performed for our machine learning application.  We have functions to create a datasource (def create_data_sources), build a model (def build_model), train a model (def create_model), and perform an evaluation of the model (def create_evaluation).

–  The main body of the code (line 123 – ‘if __name__’) ties everything together using the CSV file, a user defined schema file, and a user defined recipe file to call each function and results in a ML model with a performance measurement.

Schema and recipe

Along with the Python code we find a .schema file and recipe.json file.  We explored these in previous posts when using Amazon Machine Learning through the AWS console wizard.  When using machine learning with code through the AWS API, the schema and recipe must be created and formatted manually and passed as arguments when calling the Boto3 ML functions.

Data pulled out of a database probably already has a defined schema – these are just the rows and columns that describe the data fields and define the types of data inside them.  The ML schema needs to be defined in the correct format with our attributes and our target value along with some other information.

A recipe is a bit more complicated, the format for creating a recipe file gives the details.  The recipe simply defines how our model treats the input and output data defined in our schema.  Our example recipe file only defines outputs using some predefined groups (ALL_TEXT, ALL_NUMERIC, ALL_CATEGORICAL, ALL_BINARY ) and does a few transformations on the data (quantile bin the numbers and convert text to lower case).
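
To give a feel for the format, here is a sketch of a recipe along the lines described above, built as a Python dictionary and serialized to JSON.  The bin count of 50 is an assumption for illustration, and the example’s actual recipe.json may differ in its details.

```python
import json

# Sketch of a recipe: no custom groups or assignments, only outputs.
recipe = {
    "groups": {},
    "assignments": {},
    "outputs": [
        "ALL_BINARY",
        "ALL_CATEGORICAL",
        "quantile_bin(ALL_NUMERIC,50)",   # bucket numeric values into quantile bins
        "lowercase(ALL_TEXT)",            # convert text to lower case
    ],
}

recipe_json = json.dumps(recipe, indent=2)
print(recipe_json)
```

The resulting string is what gets passed as the Recipe argument of create_ml_model.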

Making predictions in our application 

Assuming everything worked when we ran the ‘build_model.py’ executable, we would have a model available to make predictions on unlabeled data.  The ‘use_model.py’ executable does just that.  We just need to get this new unlabeled data into a CSV file and into an S3 bucket (“UNSCORED_DATA_S3_URL”) and allow the ‘use_model.py’ code to create a new datasource to use for predictions.  The main section of this ‘use_model.py’ code ties everything together and calls the functions with the input/output parameters required.

Think about it, we have all the code syntax, formatting, and steps necessary to add predictive ability to our application by simply calling AWS APIs.  No complex rules in our application, no need to write an entire machine learning framework into our application, just offload to Amazon Machine Learning via APIs.

Thanks for reading!


Amazon Machine Learning for a multiclass classification dataset

We’ve taken a tour of Amazon Machine Learning over the last three posts.  Quickly recapping, Amazon supports three types of ML models with their machine learning as a service (MLaaS) engine – regression, binary classification, and multiclass classification.  Public cloud economics and automation make MLaaS an attractive option to prospective users looking to outsource their machine learning using public cloud API endpoints.

To demonstrate Amazon’s service we’ve taken the Kaggle red wine quality dataset and adjusted the dataset to demonstrate each of the AWS MLaaS model types.  Regression worked fairly well with very little effort.  Binary classification worked better with a bit of effort.  Now to finish the tour of Amazon Machine Learning we will look at altering the wine dataset once more to turn our wine quality dataset prediction engine into a multiclass classification problem.

Multiclass Classification

What is a machine learning multiclass classification problem?  It’s one that tries to predict if an instance or set of data belongs to one of three or more categories.  A dataset is first labeled with the known categories and used to train an ML model.  That model is then fed new unlabeled data and attempts to predict what categories the new data belongs to.

An example of multiclass classification is an image recognition system.  Say we wanted to digitally scan handwritten zip codes on envelopes and have our image recognition system predict the numbers.  Our multiclass classification model would first need to be trained to detect ten digits (0-9) using existing labeled data.  Then our model could use newly scanned handwritten zip codes (unlabeled) and predict (with an accuracy value) what digits were written on the envelope.

Let’s explore how Amazon Machine Learning performs with a multiclass classification dataset.

Kaggle Red Wine Quality Dataset

We ran the Kaggle Red Wine Quality dataset untouched through the Amazon machine learning regression algorithm.  The algorithm interpreted the scores as floating point numbers rather than integer categories which isn’t necessarily what we were after.  We then doctored up the ratings into binary categories of “great” and “not-so-great” and ran the dataset through the binary classification algorithm.  The binary categories were closer to what we wanted: treating the ratings as categories instead of decimal numbers.

This time we want to change things again.  We are going to add a third category to our rating system and change the quality score again.  Rather than try to predict the wide swing of a 0-10 rating system, we will label the quality ratings as either good, bad, or average – three classes.  And Amazon ML will create a multiclass classification model to predict one of the three ratings on new data.

Data preparation – converting to three classes

We will go ahead and download the “winequality-red.csv” file again from Kaggle.  We need to replace the existing 0-10 quality scores with categories of bad, average, and good.  We’ll use numeric categories to represent quality – bad will be 1, average will be 2, and good will be 3.

Looking at the dataset histogram below, most wines in this dataset are average  – rated a 5 or 6.  Good wines are rated > 6 and the bad wines are < 5.  So for the sake of this exercise, we’ll say wines rated 7 and higher are good (3), wines rated 5 or 6 are average (2), and those rated less than 5 are bad (1).  Even though the rating system is 0-10, we don’t have any wines rated less than 3 or greater than 8.

Wine quality Histogram

We could use spreadsheet formulas or write a simple Python ‘for’ loop to rewrite our CSV “quality” column.  Replace the ratings from 3-4 with a 1 (bad), replace our 5-6 ratings with a 2 (average), and replace our 7-8 ratings with a 3 (good).  Then we’ll write out the changes to a new CSV file which we’ll use for our new ML model datasource.
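
A minimal sketch of that Python loop is shown below.  It assumes a comma-delimited file with a header row that includes a “quality” column; the sample rows are made up and use only two columns to keep the example short.  If your copy of the file uses a different delimiter, pass it to the csv reader and writer.

```python
import csv
import io


def quality_class(score):
    """Map a 0-10 quality score to 1 (bad), 2 (average), or 3 (good)."""
    if score >= 7:
        return 3
    if score >= 5:
        return 2
    return 1


def relabel_quality(csv_text):
    """Return CSV text with the 'quality' column replaced by its class."""
    reader = csv.DictReader(io.StringIO(csv_text))
    output = io.StringIO()
    writer = csv.DictWriter(output, fieldnames=reader.fieldnames,
                            lineterminator="\n")
    writer.writeheader()
    for row in reader:
        row["quality"] = quality_class(int(row["quality"]))
        writer.writerow(row)
    return output.getvalue()


# Made-up rows with only two columns, to keep the sample short.
sample = "alcohol,quality\n9.4,5\n10.5,7\n9.0,4\n"
print(relabel_quality(sample))
```

For the real file, read it with open("winequality-red.csv", newline="") and write the returned text to a new CSV file.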

Wine quality Histogram AFTER PROCESSING

Running the dataset through Amazon Machine Learning

Our wine dataset now has a multiclass style rating – 1 for bad wine, 2 for average wine, and 3 for good wine.  Upload this new CSV file to an AWS S3 bucket that we will use for machine learning.

The process to create a dataset, model, and evaluation of the model is the same for multiclass classification as the one we documented in the blog post about linear regression.  Create a new datasource and ML model from the main dashboard and point the datasource to the S3 bucket with our multiclass CSV file.  The wizard will verify the schema and confirm that our file is clean.

We need to identify our quality rating as a category rather than a number for Amazon to recognize this as a multiclass classification problem.  When setting up the schema, edit the ‘quality’ column and set the data type to ‘categorical’ rather than the default ‘numerical’.  We will again select “yes” to show our CSV file contains column names.
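The same schema choice can also be written down as a schema document instead of being clicked through in the wizard.  The sketch below builds one in Python.  Treat it as an assumption-laden illustration: the field names follow the Amazon ML schema format as I understand it from the developer guide, the column names are taken from the Kaggle CSV header, and `build_schema` is a made-up helper.  Check it against the AWS documentation before relying on it.

```python
import json

# Column names as they appear in the Kaggle winequality-red.csv header.
FEATURES = [
    "fixed acidity", "volatile acidity", "citric acid", "residual sugar",
    "chlorides", "free sulfur dioxide", "total sulfur dioxide", "density",
    "pH", "sulphates", "alcohol",
]

def build_schema(target="quality", target_type="CATEGORICAL"):
    """Build a schema dict: every feature is numeric, only the target type varies.

    Field names are an assumption based on the Amazon ML schema format.
    """
    attributes = [
        {"attributeName": name, "attributeType": "NUMERIC"} for name in FEATURES
    ]
    attributes.append({"attributeName": target, "attributeType": target_type})
    return {
        "version": "1.0",
        "targetAttributeName": target,
        "dataFormat": "CSV",
        "dataFileContainsHeader": True,
        "attributes": attributes,
    }

schema = build_schema()
print(json.dumps(schema, indent=2))
```

The only line that matters for this post is the target's type: marking ‘quality’ as categorical is what turns the problem into multiclass classification.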


After finishing the schema, select the target prediction value which is the ‘quality’ column.  Continue through the rest of the wizard and accept all the default values.  Return to the machine learning dashboard and wait for the model to complete and get evaluated.

How did we do?

Let’s first look at our source dataset prior to the 70/30 split just to get an idea of our data distribution.  82% of the wines are rated a 2 (average), 14% are rated a 3 (good), and 4% are rated a 1 (bad).  By default, the ML wizard is going to randomly split this data 70/30 to use for training and performance evaluation.

Datasource Histogram prior to Train/test split

If all went well, we should have a completed model and evaluation of that model with no failures in our dashboard.  Let’s take a look at our evaluation summary and see how we did.


Looking at our evaluation summary, things look pretty good.  We have a summary saying our ML model’s quality score was measured with an average F1 score of 0.421.  That summary is somewhat vague so let’s click “explore model performance”.

Model Performance- Confusion matrix

Multiclass prediction quality can be measured by treating each class as its own binary prediction and combining the results.  How well did we predict good wines (3), average wines (2), and bad wines (1)?  Each category gets its own F1 score (the harmonic mean of precision and recall for that category), and those scores are averaged into a single F1 score where a higher score is better than a lower score.
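To make the averaged F1 score concrete, here is a minimal sketch that computes a per-class F1 from a confusion matrix and then takes the unweighted (macro) average.  The function name `macro_f1` and the small matrix are made up for illustration; the matrix is not the evaluation data from this post.

```python
def macro_f1(confusion):
    """Macro-average F1 from a square confusion matrix.

    confusion[i][j] is the number of records whose true class is i
    and whose predicted class is j.
    """
    n = len(confusion)
    f1_scores = []
    for k in range(n):
        tp = confusion[k][k]
        fn = sum(confusion[k]) - tp                         # true k, predicted something else
        fp = sum(confusion[i][k] for i in range(n)) - tp    # predicted k, actually something else
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        if precision + recall:
            f1_scores.append(2 * precision * recall / (precision + recall))
        else:
            f1_scores.append(0.0)
    return sum(f1_scores) / n

# Hypothetical two-class matrix, only to exercise the function.
toy = [[1, 1],
       [0, 2]]
print(round(macro_f1(toy), 3))
```

Because every class counts equally in the average, a model that does well on the common class and poorly on the rare ones ends up with a low score.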

A visualization of this accuracy is displayed above in a confusion matrix.  The matrix makes it easy to see where the model is successful and where it is not so successful.  The darker the blue, the more accurate the correct prediction; the darker the orange/red, the more inaccurate the prediction.

Looking at our accuracy above, it seems we are good at predicting average wine (2) – our accuracy is almost 90% and we predicted 361/404 wines correctly as average (2).  However, the model was not so good at predicting good (3) and bad (1) wines.  We only correctly predicted 13/47 (28%) as good and only correctly predicted 2/26 (8%) as bad (1).  Our model is good at predicting the easy (2) category but not so good at predicting the more difficult and less common (1) and (3) categories.
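The counts quoted above are enough to work out per-class hit rates, overall accuracy, and a simple "always guess average" baseline.  This sketch assumes the three counts cover the whole evaluation set; the variable names are made up.

```python
# (correctly predicted, total) per class, taken from the counts quoted above.
per_class = {1: (2, 26), 2: (361, 404), 3: (13, 47)}

total = sum(t for _, t in per_class.values())
correct = sum(c for c, _ in per_class.values())

# Fraction of each class the model got right.
hit_rate = {label: c / t for label, (c, t) in per_class.items()}

# Overall accuracy of the model versus always predicting the largest class.
model_accuracy = correct / total
baseline_accuracy = max(t for _, t in per_class.values()) / total

for label in sorted(hit_rate):
    print(f"class {label}: {hit_rate[label]:.1%} correct")
print(f"model accuracy:    {model_accuracy:.1%}")
print(f"baseline accuracy: {baseline_accuracy:.1%}")
```

On these counts the always-average baseline is at least as accurate as the model overall, which is why raw accuracy is a poor yardstick for an imbalanced dataset and why the averaged F1 score is the number to watch.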

Can we do better?

Our model is disappointing.  We could have done almost as well if we just predicted that every wine was average (2).  We want to train a model that is better at predicting good (3) wines and bad (1) wines.  But how?  We could use a few ML techniques to get a better model:

Collect more data – More data would help train our model better, especially having more samples of good (3) and bad (1) wines.  In our case this isn’t possible since we are working with a static public dataset.

Remove some features – Our algorithm may have an easier time evaluating fewer features.  We have 11 features (pH, sulphates, alcohol, etc.) and we really aren’t sure if all 11 features have a direct impact on wine quality.  Removing some features could make for a better model but would require some trial and error.  We’ll skip this for now.

Use a more powerful algorithm – Amazon uses multinomial logistic regression (multinomial logistic loss + SGD) for multiclass classification problems.  This may not be the best algorithm for our wine dataset; there may be a more powerful algorithm that works better.  However, this isn’t an option when using Amazon Machine Learning – we’d have to look at a more powerful tool like Amazon SageMaker if we wanted to experiment with different algorithms.

Data wrangling –  Feeding raw data to the Amazon Machine Learning service untouched is easy but if we aren’t getting good results we will need to perform some pre-processing to get a better model.  Some features have wide ranges and some do not – for example the citric acid values range from 0-1 while the total sulfur dioxide values range from 6-289.  So some feature scaling might be a good idea.
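As an illustration of feature scaling, here is a minimal min-max scaling sketch in pandas that squeezes each chosen column into the 0-1 range.  The helper name `min_max_scale` and the three-row sample are made up; the sample simply reuses the two ranges mentioned above.

```python
import pandas as pd

def min_max_scale(df, columns):
    """Return a copy of df with each listed column rescaled to the 0-1 range."""
    scaled = df.copy()
    for col in columns:
        lo, hi = scaled[col].min(), scaled[col].max()
        if hi > lo:
            scaled[col] = (scaled[col] - lo) / (hi - lo)
        else:
            scaled[col] = 0.0  # constant column: nothing to scale
    return scaled

# Stand-in rows using the ranges quoted in the text.
sample = pd.DataFrame({
    'citric acid': [0.0, 0.5, 1.0],
    'total sulfur dioxide': [6.0, 147.5, 289.0],
    'quality': [1, 2, 3],
})
scaled = min_max_scale(sample, ['citric acid', 'total sulfur dioxide'])
print(scaled)
```

The target column is left alone on purpose; only the features are rescaled.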

Also, the default random 70/30 training and testing data split may not be the greatest way to train and test our model.  With so few good (3) and bad (1) wines, a purely random split can leave the rare classes unevenly represented.  We may want to split the dataset ourselves rather than let Amazon randomly split it.  Running a Stratified ShuffleSplit might be helpful prior to uploading it to Amazon.
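A stratified split keeps the class proportions the same in the training and test sets.  The sketch below does this with pandas alone by sampling 70% of each class separately; it approximates a single split from scikit-learn's StratifiedShuffleSplit rather than calling it.  The helper name `stratified_split`, the seed, and the synthetic sample are assumptions for illustration.

```python
import pandas as pd

def stratified_split(df, label_col, train_frac=0.7, seed=42):
    """Split df so each class contributes train_frac of its rows to the training set."""
    train = df.groupby(label_col, group_keys=False).sample(
        frac=train_frac, random_state=seed
    )
    test = df.drop(train.index)  # everything not sampled into train
    return train, test

# Synthetic, imbalanced stand-in: 10 bad, 70 average, 20 good wines.
sample = pd.DataFrame({'quality': [1] * 10 + [2] * 70 + [3] * 20})
train, test = stratified_split(sample, 'quality')

print(train['quality'].value_counts().sort_index().to_dict())
print(test['quality'].value_counts().sort_index().to_dict())
```

Each half can then be written to its own CSV and uploaded as a separate datasource, which is what the “custom” option in the wizard expects.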

Lastly, Amazon Machine Learning uses a technique called quantile binning for numeric values.  Instead of treating a range of numbers as discrete values, Amazon puts the range of values in “bins” and converts them into categories.  This may work well for non-linear features but may not work great for features with a direct linear correlation to our quality ratings.  Amazon recommends some experimentation with their data transformation recipes to tweak model performance.  The default recipes may not be the best for all problems.
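To see what quantile binning does to a numeric column, here is a small pandas sketch using `pd.qcut`.  It illustrates the general technique only; it is not Amazon's implementation, and the alcohol values and the choice of four bins are made up.

```python
import pandas as pd

# Made-up alcohol percentages, sorted for readability.
alcohol = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])

# Split the values into 4 bins holding roughly equal numbers of rows.
# labels=False returns the bin index (0-3) instead of the interval itself.
bins = pd.qcut(alcohol, q=4, labels=False)

print(bins.tolist())
```

After binning, the model sees a category such as "bin 2" instead of the original number, so the ordering and spacing inside each bin are lost.  That is the trade-off mentioned above for features with a direct linear relationship to quality.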

Final Thoughts

Machine learning is hard.  While Amazon’s MLaaS is a powerful tool, it isn’t perfect and it doesn’t do everything.  Data needs to be clean going in and most likely needs some manipulation using various ML techniques if we want to get decent predictions from Amazon algorithms.

Just for fun I did some data wrangling on the red wine dataset to see if I could get some better prediction results.  I manually split the dataset myself using a stratified shuffle split and then ran it through the machine learning wizard using the “custom” option which allowed me to turn off the AWS 70/30 split.  The results?  Just doing a bit of work I improved the prediction accuracy for our good (3) wines to 65% correct, bad (1) wines to 25% correct, and raised the F1 score to .59 (up from .42, higher is better).

Confusion matrix with stratified shuffling

Thanks for reading!

Amazon Machine Learning for a binary classification dataset

Machine learning as a service is real.  Clean data with a well organized schema can be fed to cloud-based machine learning services with a decent ML model returned in less than 30 minutes.  The resulting model can be used for inferring target values of new datasets in the form of batch or real-time predictions.  The three major public cloud vendors (AWS, Microsoft, Google) are all competing in this space, which makes the services cheap and easy to consume.

In our last discussion we ran the Kaggle red wine quality dataset through the Amazon Machine Learning service.  The data was fed to AWS without any manipulation which AWS interpreted as a regression problem with a linear regression model returned.  Each of the subjective wine quality ratings was treated as an integer from 0 (worst) to 10 (best) with a resulting model that could predict wine quality scores.  Honestly, the results weren’t spectacular – we could have gotten similar results by just guessing the median value (6) every time and we would have scored almost as well on our RMSE value.
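The "just guess the median" comparison is easy to reproduce.  The sketch below computes RMSE for a constant median prediction; the `rmse` helper and the ten ratings are made up for illustration and are not the real dataset or the real model score.

```python
import math
from statistics import median

def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))

# Made-up quality ratings standing in for the real column.
ratings = [5, 5, 6, 6, 6, 7, 5, 6, 4, 8]

# Baseline: predict the median rating for every wine.
baseline = [median(ratings)] * len(ratings)

print(round(rmse(ratings, baseline), 4))
```

A regression model is only earning its keep if its RMSE on held-out data is clearly lower than this kind of baseline.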

Our goal was to demonstrate Amazon’s machine learning capabilities in solving a regression problem, not to create the most accurate model.  Linear regression may not be the best way to approach our Kaggle red wine quality dataset.  A (somewhat) arbitrary judge’s score from 0-10 probably does not have a linear relationship with all of the wine’s chemical measurements.

What other options do we have to solve this problem using the Amazon Machine Learning service?

Binary Classification

What is a machine learning binary classification problem?  It’s one that tries to predict a yes/no or true/false answer – the outcome is binary.  A dataset is labeled with the yes/no or true/false values.  This dataset is used to create a model to predict yes/no or true/false values on new data that is unlabeled.

An example of binary classification is a medical test for a specific disease.  Data is collected from a large group of patients who are known to have the disease and known not to have the disease.  New patients can then be tested by collecting the same data points and feeding them to a model.  The model will predict (with error rates) whether it believes the new patients have the disease.

Let’s explore how Amazon Machine Learning performs with a simple binary classification dataset.

Kaggle Red Wine Quality Dataset

We ran the Kaggle Red Wine Quality dataset through the Amazon machine learning regression algorithms in the last post.  Why this dataset?  Because it was clean data with a relatively simple objective – to predict the wine quality from its chemical measurements.  We also had no data wrangling to perform – we simply uploaded the CSV to AWS and had our model created with an RMSE evaluation score ready to review.

This time we want to change things a bit.  Rather than treat the 0-10 sensory quality ratings as integers, we want to turn the quality ratings into a binary rating.  Good or bad.  Great or not-so-great.  This greatly simplifies our problem – rather than have such a wide swing of a ten-point wine rating, we can simply categorize the wine as great or not-so-great.  In order to do this we need to edit our dataset and change the quality ratings to a one (great) or a zero (not-so-great).

Data preparation – feature engineering

Go ahead and download the “winequality-red.csv” file from Kaggle. Open up the .CSV file as a spreadsheet.  We need to replace the 0-10 quality scores with a 1 (great) or 0 (not-so-great).  Let’s assume most wines in this dataset are fairly average – rated a 5 or 6.  The truly great wines are rated > 6 and the bad wines are < 5.  So for the sake of this exercise, we’ll say wines rated 7 and up are great and wines rated 6 and under are not-so-great.

All we have to do is edit our CSV file with the new 0 or 1 categories, easy right?  Well, kind of.  The spreadsheet has ~1600 ratings and manually doing a search and replace is tedious and not easily repeatable.  Most machine learning datasets aren’t coming from simple and small CSV files but rather from big datasets hosted in SQL/NoSQL databases, object stores, or even distributed filesystems like HDFS.  Manual editing often won’t work and definitely won’t scale for larger problems.

Most data scientists will spend a decent amount of time manipulating and cleaning up datasets with tools that utilize some type of high level programming language.  Jupyter notebooks are a popular tool and can support your programming language of choice.  Data wrangling with code in a Jupyter notebook is much more efficient than working manually with spreadsheets.  Amazon even hosts cloud-based Jupyter notebooks within Amazon SageMaker.

Converting the wine ratings from 0-10 to a binary 0/1 is pretty easy in Python.  Just open the CSV file, test if each quality rating is a 7 or higher (> 6.5) and convert the true/false to an integer by multiplying by 1.  Then we’ll write out the changes to a new CSV file which we’ll use for our new datasource.

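Python code
import pandas as pd
wine = pd.read_csv('winequality-red.csv')
# Ratings of 7 and higher become 1 (great); everything else becomes 0 (not-so-great)
wine['quality'] = (wine['quality'] > 6.5)*1
# Write the result to a new CSV (the output filename here is just an example)
wine.to_csv('winequality-red-binary.csv', index=False)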

Running the binary classification dataset through Amazon Machine Learning

Our dataset now has a binary wine rating – 1 for great wine and 0 for not-so-great wine.  Upload this new CSV file to an AWS S3 bucket that we will use for machine learning.

The process to create a dataset, model, and evaluation of the model is the same for binary classification as we documented in the blog post about linear regression.  Create a new datasource and ML model from the main dashboard and point the datasource to the S3 bucket with our new binary CSV file.  The wizard will verify the schema and confirm that our file is clean.

What is different from linear regression is the schema: we want to make sure the ‘quality’ column is a ‘binary’ type rather than a ‘numerical’ type.  All other values are numerical; only the quality rating is binary.  This should be the default behavior but it is best to double check.  Also select “yes” to show that the first line of the CSV file contains the column names, so the header row is not treated as data.


After finishing the schema, select your target prediction value which is the ‘quality’ column.  Continue through the rest of the wizard and accept all the default values.  Return to the machine learning dashboard and wait for the model to complete and get evaluated.

How did we do?

If all went well, we should have a completed model and evaluation of that model with no failures in our dashboard.  Let’s check out the results by opening the evaluation.


Looking at our evaluation summary, things look pretty good.  We have a summary saying our ML model’s quality score was measured with an AUC score of .778.  That summary is somewhat vague so let’s click “explore model performance”.

Model Performance

By default the wizard saves 30% of our dataset for evaluation so we can measure the accuracy of our model’s predictions.  Binary classification algorithms are measured with an AUC or Area Under the Curve score.  The measurement is a value between 0 and 1, with 1 being a perfect model that predicts 100% of the values correctly and 0.5 being no better than random guessing.
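One way to build intuition for AUC: it equals the chance that a randomly chosen great wine gets a higher score from the model than a randomly chosen not-so-great wine.  The sketch below computes it that way by comparing every positive/negative pair.  The `auc` helper, the scores, and the labels are all made up; this is not how Amazon computes the number internally.

```python
def auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Made-up model scores and true labels (1 = great, 0 = not-so-great).
scores = [0.05, 0.2, 0.35, 0.45, 0.55, 0.6, 0.8, 0.95]
labels = [0, 0, 1, 0, 1, 0, 1, 1]

print(auc(scores, labels))
```

The pairwise loop is fine for a toy example; on a large evaluation set a rank-based calculation would be the more practical choice.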

The performance screen shows our model got 83% of our predictions correct and 17% incorrect.  Not bad!  What is nice about this type of scoring is we can also see our false positives (not-so-great wine classified as great) and false negatives (great wine that was predicted as not-so-great).  Specifically our model had 55 false positives and 25 false negatives.

Predicting quality wine is not exactly a matter of life and death.  Which means we aren’t necessarily concerned with false positives or false negatives as long as we have a decent prediction model.  But for other binary classification problems we may want to adjust our model to avoid false positives or false negatives.  This adjustment is made using the slider on the model performance screen shown above.

The adjustments come at a cost – if we want fewer false positives (bad wine predicted as great) then we’ll have more false negatives (great wine that is accidentally predicted as bad).  The reverse is also true: if we want fewer false negatives (great wine predicted as bad), we will have more false positives (bad wine predicted as great).
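The trade-off is easy to see with a few numbers.  In this sketch the slider corresponds to a score threshold: anything scored at or above it is called great.  The helper name `false_counts`, the scores, and the labels are made up for illustration.

```python
def false_counts(scores, labels, threshold):
    """Return (false positives, false negatives) at a given score threshold."""
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    return fp, fn

# Made-up model scores and true labels (1 = great, 0 = not-so-great).
scores = [0.05, 0.2, 0.35, 0.45, 0.55, 0.6, 0.8, 0.95]
labels = [0, 0, 1, 0, 1, 0, 1, 1]

for threshold in (0.3, 0.5, 0.7):
    fp, fn = false_counts(scores, labels, threshold)
    print(f"threshold {threshold}: {fp} false positives, {fn} false negatives")
```

Raising the threshold removes false positives one by one while false negatives climb, which is exactly the behavior the slider exposes.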

Final thoughts

AWS uses specific learning algorithms for solving the three types of machine learning problems supported.  For binary classification, Amazon ML uses logistic regression which is a logistic loss function plus SGD.  Most beginners won’t mind these limitations but if we want to use other algorithms we’ll have to look at more advanced services like Amazon SageMaker.

Machine learning as a service is easy, fast, and cheap.  By using the AWS black box approach we built a working binary classification model in less than 30 minutes with minimal coding and only a high level knowledge of machine learning.

Thanks for reading!


Amazon Machine Learning – Commodity Machine Learning as a Service

Commodity turns into “as a service”

Find something in computing that is expensive and cumbersome to work with and Amazon will find a way to commoditize it.  AWS created commodity storage, networking, and compute services for end users with resources left over from their online retail business.  The margins may not have been great, but Amazon could make this type of business profitable due to their scale.

But what happens now that Microsoft and Google can offer the same services at the same scale?  The need for differentiation drives innovative cloud products and platforms with the goal of attracting new customers and keeping those customers within a single ecosystem.  Using basic services like storage, networking, and compute may not be differentiated between the cloud providers but new software as a service (SaaS) offerings are more enticing when shopping for cloud services.

Software as a service (SaaS) probably isn’t the best way to describe these differentiated services offered by public cloud providers.  SaaS typically refers to subscription-based software running in the cloud like Salesforce.com or ServiceNow.  Recent cloud services are better described by their functionality with an “as a service” tacked on.  Database as a service.  Data warehousing as a service.  Analytics as a service.  Cloud providers build the physical infrastructure, write the software, and implement vertical scaling with a self-service provisioning portal for easy consumption.  Consumers simply implement the service within their application without worrying about the underlying details.

Legacy datacenter hardware and software vendors may not appreciate this approach due to lost revenue but the “as a service” model is a good thing for IT consumers.  Services that were previously unavailable to everyday users have been democratized and are available to anyone with an AWS account and a credit card.  Cost isn’t necessarily the primary benefit to consumers but rather the accessibility and the ability to consume using a public utility model.  All users can now have access to previously exotic technologies and can pay by the hour to use them.

Machine Learning as a Service

Machine learning is at the peak of the emerging technology hype cycle.  But is it really all hype?  Machine learning (ML) has been around for 20+ years.  ML techniques allow users to “teach” computers to solve problems without hard-coding hundreds or thousands of explicit rules.    This isn’t a new idea but the hype is definitely a recent phenomenon for a number of reasons.

So why all the machine learning hype?  The startup costs, both in terms of hardware/software and accessibility, are much lower, which presents an opportunity to implement machine learning that wasn’t available in the past.  Data is more abundant and data storage is cheap.  Computing resources are abundant and CPU/GPU cycles are (relatively) cheap.  Most importantly, the barriers have been lifted in terms of combining access to advanced computing resources and vast sets of data.  What used to require onsite specialized HPC clusters and expensive data storage arrays can now be performed in the cloud for a reasonable cost.  Services like Amazon Machine Learning are giving everyday users the ability to perform complicated machine learning computations on datasets that were only available to researchers at universities in the past.

How does Amazon provide machine learning as a service?  Think of the service as a black box.  The implementation details within the black box are unimportant.  Data is fed into the system, a predictive model/algorithm is created and trained, the system is tweaked for its effectiveness, and then the system is used to make predictions using the trained model.  Users can use the predictive system without really knowing what is happening inside the black box.

This machine learning black box isn’t magical.  It is limited to a few basic types of models (algorithms) – regression, binary classification, and multiclass classification.  More advanced operations require users to look to Amazon SageMaker and require a higher skill level than the basic AWS machine learning black box.  However, these three basic machine learning models can get you started on some real-world problems very quickly without really knowing much math or programming.

Amazon Machine Learning  Workflow

So how does this process work at a high level?  If a dataset and use case can be identified as a regression or binary/multiclass classification problem, then the data can simply be fed to the AWS machine learning black box.  AWS  will use the data to automatically select a model and train the model using your input data.  The effectiveness of the model is then evaluated with some measurable results that determine effectiveness of the model with a numerical score.  This model is ready to use at this point but can also be tweaked to improve the scoring.  Bulk data can get fed to the trained model for batch predictions or ad-hoc predictions can be performed using the AWS console or programmatically through the AWS API.

Knowing that a problem can be solved by AWS takes a bit of high-level machine learning knowledge.  The end user needs to have an understanding of their data and of the three model types offered in the AWS black box.  Reading through the AWS Machine Learning developer guide is a good start in terms of an overview.  Regression models solve problems that need to predict numeric values.  Binary classification models predict binary outcomes (true/false, yes/no) and multiclass classification models can predict more than two outcomes (categories).
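The mapping from the kind of value being predicted to the model type can be written as a tiny lookup.  This is an illustrative sketch only: `model_type_for` is a made-up helper, and the type names are written the way I understand Amazon ML labels them, so treat the exact strings as an assumption.

```python
# Kind of target value -> model type (string values are assumed, see lead-in).
MODEL_TYPE_FOR_TARGET = {
    "NUMERIC": "REGRESSION",      # predict a number, e.g. a 0-10 quality score
    "BINARY": "BINARY",           # predict yes/no, e.g. great vs not-so-great
    "CATEGORICAL": "MULTICLASS",  # predict one of several categories
}

def model_type_for(target_attribute_type):
    """Return the model type that fits the given target attribute type."""
    key = target_attribute_type.upper()
    if key not in MODEL_TYPE_FOR_TARGET:
        raise ValueError(f"unsupported target type: {target_attribute_type}")
    return MODEL_TYPE_FOR_TARGET[key]

for target_type in ("numeric", "binary", "categorical"):
    print(target_type, "->", model_type_for(target_type))
```

In practice this decision is made for you once the target column's type is set in the schema, which is why getting that one column right matters so much.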

Why use Amazon machine learning?

For those starting out with machine learning this AWS service may sound overcomplicated or of questionable value.  Most tutorials show this type of work done on a laptop with free programming tools, why is AWS necessary?  The point is the novice user can do some basic machine learning in AWS without the high startup opportunity costs of learning programming or learning how to use machine learning software packages.  Simply feed the AWS machine learning engine data and get instant results.

Anyone can run a proof of concept data pipeline into AWS and perform some basic machine learning predictions.  Some light programming skills would be helpful but are not mandatory.  Having a dataset with a well-defined schema is a start as well as having a business goal of using that dataset to make predictions on similar sets of incoming data.  AWS can provide these predictions for pennies an hour and eliminate the startup costs that would normally delay or even halt these types of projects.

Amazon machine learning is a way to quickly get productive and get a project past the proof of concept phase in making predictions based on company data.  Access to the platform is open to all users so don’t rely on the AWS models for a competitive advantage or a product differentiator.  Instead use AWS machine learning as a tool to quickly get a machine learning project started without much investment.

Thanks for reading!  Future blog posts here will run some well-known machine learning problems through Amazon Machine Learning to highlight the process and results.

Hello world

“Hello world” blog posts are lame.  They are not interesting and I avoid reading them.  But I need to post this one.

Why?  Well, I’m in the IT industry and I’ve blogged on my previous employer’s communities site.  I have recently left that company and joined a new one.  The point-and-click corporate blogging experience I used previously made it easy to push out content but I wasn’t crazy about the rigid formatting or the need to keep writing content related to products sold by my company.  Now that I have left I want the freedom to build my own site and blog about whatever.

So what is the best way for an IT infrastructure guy to start blogging?  I’ve recently gotten interested (and certified) in AWS.  Mind you, I’m not an AWS expert but I’ve gotten through the learning curve and can get around AWS fairly easily.  So I figured I’d find a way to spin up a blog in AWS to keep my skills sharp while working on the site.

I started looking around AWS and found Lightsail.  Lightsail has its pros/cons which I’ll continue to explore in future posts but it takes some of the drudgery away from getting an instance running in the AWS cloud.  That ease of use comes at the expense of scalability, resiliency, and performance.   But it is quick & easy to run and will scale up (a bit) if you wind up needing more resources.

I’m specifically using WordPress on Lightsail.  WordPress for me is a work in progress, I haven’t worked with websites since the late 90s.  So I’m using a canned WordPress template and working out the kinks as I go.  Some things work, some things don’t.  And the formatting looks pretty goofy.

Which all means to say that this “hello world” post is a way to help me work through the setup with a short intro post.  I’ll do more posts about various things soon and get everything working.

So there it is, a short post to help get this blog off the ground.  It isn’t particularly pretty or creative, but thanks for reading anyway.