Exporting Lightsail instance snapshots to Amazon EC2

I’ve been using Amazon Lightsail for the past six months to host my WordPress blog.  I’m happy with Lightsail – I have full control over both my Linux instance OS and my WordPress administration, but I don’t have to deal with the headaches (or the price) of a full-blown EC2 instance.

A few months back I wrote a blog post listing the pros and cons of Amazon Lightsail.  Why would one use Lightsail versus Amazon EC2?  Within the Amazon ecosystem, Lightsail is the cheapest and easiest to use platform if the goal is to get a Linux instance and simple application up and running in AWS.

Think of Lightsail as a lightweight version of AWS EC2.  There really is no learning curve to get started on Lightsail – just pick some basic options and launch the instance.  The Lightsail pricing structure is straightforward and predictable.  Both the pricing and the ease of use make Lightsail an attractive hosting platform for bloggers and developers getting started on Amazon – the experience is quick, cheap, and easy.

With this simplicity comes limitations.  A lightweight version of an Amazon EC2 instance doesn’t have the full complement of every EC2 feature.  This makes sense – Lightsail is geared towards bloggers and developers and is kept separate from EC2.  A lightweight version works fine when starting a project but what if we hit a point where the advanced networking features of a VPC are required?  Or autoscaling groups?  Or spot instances?

Fortunately this has changed as of AWS re:Invent 2018 – we can now convert a Lightsail instance into an EC2 instance using Lightsail instance snapshots!   I’ll walk through the process in this blog and show some of the strategies and considerations needed to move a Lightsail instance out of the limited Lightsail environment into the feature-rich EC2 environment.

Why should I move my instance out of Lightsail?

Let’s face it – we aren’t going to move our instance out of Lightsail unless we have to.  Lightsail is the cheapest option for running an instance in AWS.  Instance monthly plans start as low as $3.50/month as long as the instance stays within the resource bounds of the pricing plan.

Adding more horsepower to a Lightsail instance is also not a reason to move to EC2.  Lightsail allows upgrading an instance to use more resources (for a price) – just use an instance snapshot to clone the existing instance to a new plan with more resources (the application inside the instance will persist using instance snapshots).

But what if we want to take our application out of the Lightsail environment and place it into more of a production environment?  What if we have hit limitations in Lightsail and found that our application needs to reside in EC2 to take advantage of all the available EC2 features?  What if we need web-scale features, advanced networking, and access to services that are limited in the Lightsail environment?  In that use case we need to move to EC2.

AWS terminology and concepts

Before we go through the conversion process, let’s clear up some terminology and concepts that will help illustrate the task.

Instance – this is just Amazon’s fancy word for a virtual machine.  The term “virtual machine” probably sounded too much like VMware so Amazon calls virtual machines “instances”.  An instance is a virtual machine but runs on Amazon’s proprietary hypervisor environment.

Lightsail instance – An AWS image running in the Lightsail environment.  Lightsail is separate from other AWS environments and runs on a dedicated  hardware stack in selected data centers.

EC2 instance – An AWS image running in the EC2 environment.  EC2 is completely separate from Lightsail and runs on its own hardware and has many more management and configuration options.

Instance snapshot  – A snapshot of either a Lightsail or EC2 instance that not only includes the state of the virtual machine but also a snapshot of the attached OS disk.

Disk snapshot – Unlike an instance snapshot, a disk snapshot only tracks changes to the physical disk attached to the virtual machine.  No virtual machine state is captured so this type of snapshot is only good for keeping track of block disk data, not for the virtual machine state.

CloudFormation Stack – a collection of AWS resources created and managed as a single unit through the use of a template.  CloudFormation can be used to define the configuration details of an AWS instance.

Key pair – quite simply, a .PEM file used to connect to an instance.  We can’t connect to an instance remotely (SSH, Putty) without a .PEM key pair.

AMI – a preconfigured template, associated with an EBS snapshot, that is used to launch an instance with a specified configuration.

How exporting an instance snapshot works

Instance snapshots are portable.  When an instance snapshot is taken of a Lightsail instance, the OS and disk state are stored to flat files and kept in some form of S3 object storage.  The flat files can be shared within Lightsail to create a new instance from the same snapshot – we can even assign more CPU or RAM to an instance snapshot to upgrade the instance to something faster.

With the latest AWS Lightsail EC2 instance snapshot export feature we can now take the same Lightsail snapshot and share it with AWS EC2 infrastructure.  This allows EC2 to import the Lightsail snapshot  to clone our Lightsail instance and spin it up in EC2.  Lightsail snapshots were always portable but the recent EC2 export feature enabled sharing of snapshots outside of Lightsail.

Behind the scenes Amazon uses CloudFormation to enable this conversion process, which simply means a CloudFormation stack (or template) is created automatically when we move our Lightsail instance snapshot out to EC2.  But we don’t have to worry about CloudFormation at all – the process is transparent and never requires us to open the CloudFormation console or interact with CloudFormation directly.
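For anyone who prefers code to the console, the same export can also be kicked off with the AWS SDK.  Here’s a minimal boto3 sketch – the snapshot name is a placeholder, and I’m assuming credentials and region are already configured:

Python code
import boto3

lightsail = boto3.client('lightsail')

# Kick off the export of an existing Lightsail instance snapshot to EC2.
# 'wordpress-snapshot-1' is a placeholder - use your own snapshot name.
response = lightsail.export_snapshot(sourceSnapshotName='wordpress-snapshot-1')
print(response['operations'])

# The export runs asynchronously; the export records show its progress.
records = lightsail.get_export_snapshot_records()
for record in records['exportSnapshotRecords']:
    print(record['name'], record['state'])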

Walkthrough of moving a Lightsail instance to EC2

Exporting a Lightsail instance is very easy.  First log into your AWS account and open the Lightsail console.  Select your existing Lightsail instance to see the options below.

lightsail Console

Open the snapshots tab and create an instance snapshot (if you don’t already have one).  Use any name you like and click “create snapshot”.  The process will most likely take a few minutes and the progress is shown in the console.

Create Instance Snapshot in Lightsail console

After the snapshot is completed it will display in the list of “recent snapshots”.  We can then select this snapshot (three orange dots on the right) and select “Export to Amazon EC2”.  This will move the snapshot out of Lightsail object storage over to EC2.

Recent Snapshots in Lightsail Console

An informational pop-up will appear explaining the outcome (an EBS snapshot and an AMI) along with a warning that EC2 billing rates will apply.  Click “Yes, continue”.

Export Lightsail Snapshot to EC2 in Lightsail console

One last warning will appear alerting that the existing default Lightsail key pairs (.PEM file) should be replaced with an individual key pair.  Click “Acknowledged”.

Export Security Warning in Lightsail Console

The progress can be monitored using the top “gear” in the Lightsail console which will “spin” during the export process.  When completed, the gear will stop “spinning” and the task will show as completed and “Exported to EC2” in the task history.

Running Snapshot export task in Lightsail Console

Completed Snapshot export in lightsail console

Using the exported EC2 Lightsail instance

Now that we have exported our Lightsail instance snapshot to EC2, what happens next?  By default, nothing.  Looking at the completed task screenshot (above), we can open the EC2 console to see what we have or we can use the Lightsail console to launch an EC2 instance from our exported snapshot.

Lightsail is great and I love the simplified console but we really should use the EC2 console at this point.  Lightsail has all new wizard screens to manage our instance in EC2 but my feeling is that if we are going to export to EC2, we should switch over and learn to use the EC2 console.

See the screenshot below – open the EC2 console and make sure the region is the same one that was used for Lightsail (in my case, US East (Ohio)).

EC2 in the AWS Console

Open the “Elastic Block Storage” link on the left side of the EC2 console.  On the top right side of the console we can see the instance snapshot that was created by Lightsail.  Since this snapshot is an instance snapshot we will also find an AMI associated with this EBS snapshot.

elastic block storage snapshots in EC2 Console

EBS snapshots in ec2 console

Open the “Images” – “AMI” link in the EC2 console.  We can see in the upper right frame that the Lightsail snapshot has been converted to an EC2 AMI.  This was done with AWS CloudFormation, but that doesn’t really matter – the AMI is ready for us to use without worrying about how it was created.

AMIs in ec2 console

AMIs in EC2 console

Now for the fun part – we can launch our Lightsail instance in EC2 using the AMI.  Just click “launch”!  A few things to consider when launching the AMI (the details are out of scope for this post, but there’s a rough boto3 sketch after this list):

Instance size – pick an EC2 instance type for your Lightsail instance; the EC2 options have different pricing plans and different levels of CPU, RAM, GPU, storage, etc.

Security (key pair, security groups, VPC info) –  design a strategy (if you don’t already have one) to secure your new image with a VPC, security group, key pair, etc.

Additional storage – additional disk storage can be attached during the EC2 launch.

Tags – optionally you can tag your instance for easier AWS billing.
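If you’d rather script the launch than click through the console, here’s a rough boto3 sketch of launching an instance from the exported AMI – every ID below is a placeholder, so substitute your own AMI, key pair, security group, and subnet:

Python code
import boto3

ec2 = boto3.client('ec2')

# All IDs below are placeholders - substitute the AMI created by the export,
# plus your own key pair, security group, and subnet.
response = ec2.run_instances(
    ImageId='ami-0123456789abcdef0',           # AMI produced by the Lightsail export
    InstanceType='t3.micro',                    # pick a size that fits your workload
    KeyName='my-ec2-keypair',                   # a new EC2 key pair, not the Lightsail default
    SecurityGroupIds=['sg-0123456789abcdef0'],
    SubnetId='subnet-0123456789abcdef0',
    MinCount=1,
    MaxCount=1,
    TagSpecifications=[{
        'ResourceType': 'instance',
        'Tags': [{'Key': 'Name', 'Value': 'wordpress-from-lightsail'}]
    }],
)
print(response['Instances'][0]['InstanceId'])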

Pick your options and you are ready to go – your Lightsail instance is now running in EC2!  Thanks for reading!

Handwritten digit recognition – MNIST via Amazon Machine Learning APIs

In our last post we explored how to use the Amazon Machine Learning APIs to solve machine learning problems.  We looked at the SDK, we looked at the machine learning client documentation, and we even looked at an example on Github.  But we didn’t write any code.

If it is so easy to consume AWS APIs with code then why don’t we take a shot at writing some code?  Point taken, this post will include code!  We will use Python to write some functions that call Amazon Machine Learning APIs to create a machine learning model and run some batch predictions.  The code I’ve written is heavily influenced by AWS Python sample code – I’ve written my own functions and main body but borrowed a few bits like the polling functions and random character generation.

MNIST dataset – handwritten digit recognition

The MNIST dataset is a “hello world” type machine learning problem that engineers typically use to smoke test an algorithm or ML process.  The dataset contains 70,000 handwritten digits from 0-9, each scanned into a 28×28 pixel representation.  Each of the 784 (28 x 28) values represents a pixel’s intensity, from 0 (white) to 255 (black).  See the Wikipedia entry for a sample visualization of a subset of the images.

We are going to pull the MNIST data from Kaggle to make things easy.  Why pull it from Kaggle?  It saves us some time – the Kaggle version has conveniently split the data into 42,000 images in a training CSV and 28,000 images in a test CSV.  The training CSV has the labels in the first column, which are just the numbers represented by all the pixels in the row.  The test CSV file has the labels removed for use in predictions.  Kaggle has made things a bit easier for us – we don’t have to do any manipulation of the data.

Code on Github – https://github.com/keithxanderson

The code I have written to run MNIST through Amazon Machine Learning via APIs is up on Github.  See the README.md for a detailed explanation of how to use the code and what exact steps are required to get things working.  We are going to use Python, specifically the ‘machinelearning’ client of Boto3, to pass our calls to AWS.

While the README.md explains things pretty well, let’s summarize here.  The main MNIST.py Python file uses the MNIST training data to create and evaluate an ML model and then uses the test data to create a batch prediction.  Data in the test file contains the pixels but no label – our model is predicting what digit is represented by the pixels after “learning” from the training data with labels (the answers).

There is a Python function written for each of the high-level operations (a simplified sketch of the underlying API calls follows the list):

create_training_datasource – takes a training file from an S3 bucket, a schema file from an S3 bucket, and a percentage argument and splits the training data into a training dataset and an evaluation dataset.  The number of rows in each dataset is determined by the percentage set.  The schema file defines the schema (obviously).

create_model – takes a training datasource ID and a recipe file to create and train a model.  The function is hard-coded to create a multiclass classification model (multinomial logistic regression algorithm).

create_evaluation – takes our model ID and our evaluation datasource ID and creates an evaluation, which simply scores the performance of our model using the reserved evaluation data.  Results are viewed manually in the AWS console but could be pulled into our code if we wanted.

create_batch_prediction_dataset – takes a test file from an S3 bucket and a schema file (a different schema file than our training schema file) and creates a batch datasource.  We will use this to make predictions – this data does not contain any labels.

batch_prediction – takes an existing model ID and a batch datasource ID and runs predictions of the batch data against our model.  Results are written to a flat file in the S3 output location passed as an argument.
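To give a flavor of what those functions do under the hood, here is a simplified sketch of the datasource and model calls using the Boto3 ‘machinelearning’ client – the IDs and S3 paths are placeholders, and the real functions in the repo add naming, polling, and error handling:

Python code
import boto3

ml = boto3.client('machinelearning')

# Placeholders - substitute your own S3 locations and IDs.
TRAIN_CSV   = 's3://my-ml-bucket/mnist/train.csv'
SCHEMA_FILE = 's3://my-ml-bucket/mnist/train.csv.schema'
RECIPE_FILE = 's3://my-ml-bucket/mnist/recipe.json'

# Training datasource built from the first 70% of the training CSV.
ml.create_data_source_from_s3(
    DataSourceId='mnist-train-ds',
    DataSourceName='MNIST training data',
    DataSpec={
        'DataLocationS3': TRAIN_CSV,
        'DataSchemaLocationS3': SCHEMA_FILE,
        'DataRearrangement': '{"splitting":{"percentBegin":0,"percentEnd":70}}',
    },
    ComputeStatistics=True,
)

# Multiclass classification model trained from that datasource.
ml.create_ml_model(
    MLModelId='mnist-model',
    MLModelName='MNIST multiclass model',
    MLModelType='MULTICLASS',
    TrainingDataSourceId='mnist-train-ds',
    RecipeUri=RECIPE_FILE,
)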

Schema files and recipe

Our datasources all need a schema and we can define these with a schema file.  The schema for the training and evaluation data are the same and can use the same file.  The batch datasource does not contain labels so needs a separate schema file.

The training and evaluation schema file for MNIST can be found in the Github project with the Python code.  We simply define every column and declare each of the columns as “CATEGORICAL” since this is a multiclass classification model.  Each handwritten number is a category (0-9) and each pixel value is a category (0-255).  The format of the schema file is defined by AWS – if you create a datasource manually in the console, you can copy the AWS-generated schema into your own schema file.

The batch schema file is also found in the same project.  The only difference between the batch schema file and the training/evaluation schema file is that the batch schema does not contain the labels (numbers).

Lastly, our recipe file is found on Github with the Python code and schema files.  The recipe simply instructs AWS how to treat the input and output data.  Our example is very simple since all features (columns) are categorical – we just define all of our outputs as “ALL_CATEGORICAL”.  This saves us from manually declaring each pixel feature as a category.
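For illustration, here is a rough sketch of how the schema and recipe could be generated in Python – the attribute names follow the Kaggle CSV header (‘label’, ‘pixel0’ through ‘pixel783’), and the actual files in the Github repo are the source of truth:

Python code
import json

# Sketch of the training/evaluation schema: every column declared CATEGORICAL,
# with 'label' as the target attribute.
schema = {
    'version': '1.0',
    'targetAttributeName': 'label',
    'dataFormat': 'CSV',
    'dataFileContainsHeader': True,
    'attributes': [{'attributeName': 'label', 'attributeType': 'CATEGORICAL'}] +
                  [{'attributeName': 'pixel{}'.format(i), 'attributeType': 'CATEGORICAL'}
                   for i in range(784)],
}
with open('train.csv.schema', 'w') as f:
    json.dump(schema, f)

# The recipe just declares every feature categorical in the output.
recipe = {'groups': {}, 'assignments': {}, 'outputs': ['ALL_CATEGORICAL']}
with open('recipe.json', 'w') as f:
    json.dump(recipe, f)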

Running the code

See the README.md for detailed instructions on running the code.  Provided we have our CSV files, schema files, and recipe file in an S3 bucket, we would just need to modify the MNIST.py file with the S3 locations for each file in the main body of the code along with an output S3 bucket for the prediction results (all can share the same S3 bucket).

After executing the MNIST.py file we will see the training, evaluation, and batch datasource creation execute in parallel while the model and evaluation processes sit pending.  Once the datasources are created, the model training starts, and once training completes, the evaluation starts.

A polling function is called within the batch prediction function to wait for the evaluation process to complete prior to running a batch prediction.  This is optional – there is really no dependency, since the batch prediction can run in parallel with the evaluation scoring of our model.
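A simplified version of that polling logic might look something like this (the real polling function in the repo is a bit more elaborate):

Python code
import time
import boto3

ml = boto3.client('machinelearning')

def wait_for_evaluation(evaluation_id, delay=30):
    """Poll an evaluation until it reaches a terminal state."""
    while True:
        status = ml.get_evaluation(EvaluationId=evaluation_id)['Status']
        if status in ('COMPLETED', 'FAILED'):
            return status
        time.sleep(delay)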

Model evaluation performance

If all went well we will have an evaluation score in the AWS console.  Let’s take a look!  Looking at our ML model evaluation performance in the console we see an F1 score of .914 compared against a baseline score of .020 which is very good.

ML Model Performance

Exploring our model performance against our evaluation data we see that the model is very good at predicting the handwritten digits. The matrix below shows dark blue for every digit that was correctly classified.  This is not terribly surprising – most algorithms perform pretty well with the MNIST dataset – but give credit to AWS for tuning the multiclass classification algorithm to this level of performance.

Explore Model Performance

Batch prediction results

We should also have an output file in our batch prediction output S3 bucket if everything ran to completion.  I’ve included a sample output CSV file within my MNIST project to view.  See the AWS documentation for more information on how to interpret the output.  Each column corresponds to a prediction category – the digits 0-9.  Each row contains the predicted probability that the test image belongs to each category.

For example, in our sample CSV file the first prediction shows about a 99% probability the number is a 2, and the second prediction shows a 99% probability the next number is a 0.
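If we wanted to post-process the results in code, a quick pandas sketch like the one below would pull out the most likely digit per row – this assumes the output CSV has one probability column per digit as described above, so check the actual file format against the AWS documentation:

Python code
import pandas as pd

# Read the batch prediction output and pick the highest-probability class per row.
preds = pd.read_csv('batch-prediction-results.csv')
digit_cols = [c for c in preds.columns if c in [str(d) for d in range(10)]]
preds['predicted_digit'] = preds[digit_cols].idxmax(axis=1)
print(preds['predicted_digit'].head())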

Why use Amazon Machine Learning?

We’ve discussed the value of using Amazon Machine Learning before but it is worth repeating.  It is very easy to incorporate basic machine learning algorithms into your application by offloading this work to AWS.  Avoid spending time and money building out infrastructure or expertise for an in-house machine learning engine.  Just follow the basic workflow outlined above and make a few API calls to Amazon Machine Learning.

Thanks for reading!

How to use the Amazon Machine Learning API

Machine learning, AI, deep learning – all of these technologies are popular and the hype cycle is intense.  When one digs into the topic of machine learning, it turns out to be a lot of math, data wrangling, and writing code.  Which means one can quickly become overwhelmed by the details and lose sight of the practicality of machine learning.

Take a step back from all the complexity and ask a fundamental question.  Why should we even look at machine learning?   Simple, we want our applications to get smarter and make better decisions.   Ignore the complexities of machine learning and think about the end result instead.  Machine learning is a powerful tool that can be used to our advantage when building applications.

Now consider how we would implement machine learning into our applications.  We could write lots of code within our app to pull historical data, use this data to create and train models, evaluate and tweak our models, and finally make predictions.  Or we could offload this work to public cloud services like Amazon Machine Learning.  All it takes is some code within our application to make calls to the Amazon Machine Learning APIs.

Machine learning via APIs?

I’ve been an infrastructure guy for most of my career.  While I have worked for software companies and have worked closely with software developers – I’m not a developer and have not made my living writing code.

That being said, machine learning requires learning to write code.  One wouldn’t go out and buy machine learning from a vendor but would rather write code to build and use machine learning models.  Furthermore, machine learning models don’t provide much value by themselves but are valuable when built into an application.  So coding is important for using machine learning.

Think along the same lines regarding consumption of public cloud.  Cloud isn’t just about cost, it’s about applications consuming infrastructure  and services with code.  Anything that can be created, configured, or consumed in AWS can (and should?) be done so with code.   So coding is also important for consuming public cloud.

Fortunately, writing code to use Amazon Machine Learning isn’t very hard at all.  Amazon publishes a machine learning developer guide and API reference documentation that shows how to write code to add machine learning to your application.  That, combined with some machine learning examples on Github, makes the process easy to follow.

Overview of an application using Amazon Machine Learning

What is the high-level architecture of an application using Amazon Machine Learning?  Let’s use an example as we walk through the architecture.  Think of a hypothetical customer loyalty application that sends rebates to customers we think may stop using our theoretical company’s services.  We could incorporate Amazon machine learning into the app to flag at-risk customers for rebates.

Raw data – first we need data.  This could be in a database, data warehouse, or even flat files in a filesystem.  In our example, these are the historical records of all customers that have used our service and have either remained loyal to our company or stopped using our services (the historical data is labeled with the outcomes).

Data processed and uploaded to S3 – Our historical data needs to be extracted from our original source (above), cleaned up, converted into a CSV file, and then uploaded to an AWS S3 bucket.

Training datasource – our application needs to invoke Amazon Machine Learning via an API and create an ML datasource out of our CSV file.

Model – our application needs to create and train a model via an AWS API using the datasource we created above.

Model evaluation – our app will need to measure how well our model performs with some of the data reserved from the datasource.  Our evaluation scoring gives us an idea of how well our model makes predictions on new data.

Prediction Datasource –  if we want to make predictions with machine learning, we need new customer data that matches our training data schema but does not contain the customer outcome.    So we need to collect this new data into a separate CSV file and upload it to S3.

Batch predictions – now for the fun part, we can now call our model via AWS APIs to predict the target value of the new data we uploaded to S3.  The output will be contained in a flat file in an S3 bucket we specify that we can read with our app and take appropriate action (do nothing or send a rebate).

How to get started?

First pick a programming language.  Python is popular both for machine learning and for interacting with AWS so we’ll assume Python is our language.  Next read through the AWS Python SDK Quickstart guide to get a feel for how to use AWS with Python.  Python uses the Boto3 SDK, which needs to be installed and configured with your AWS account credentials.

Lastly, read through the machine learning section of the Boto3 SDK guide to get a feel for the logical flow of how to consume Amazon Machine Learning via APIs.  Within this guide we would need to call the Python methods below using the recommended syntax (a rough sketch of how the calls chain together follows the list).

create_data_source_from_s3 – how we create a datasource from a CSV file

create_ml_model – how we create an ML model from a datasource

create_evaluation – how we create an evaluation score for the performance of our model

create_batch_prediction –  how we predict target values for new datasources
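To show how those four methods chain together, here is a sketch using made-up IDs, datasource names, and S3 paths for our hypothetical customer loyalty example – a real application would also need to create the evaluation and prediction datasources and poll each resource until it is ready:

Python code
import boto3

ml = boto3.client('machinelearning')

# Training datasource built from the historical customer CSV (paths are placeholders).
ml.create_data_source_from_s3(
    DataSourceId='loyalty-train-ds',
    DataSourceName='Customer history',
    DataSpec={
        'DataLocationS3': 's3://my-ml-bucket/customers.csv',
        'DataSchemaLocationS3': 's3://my-ml-bucket/customers.csv.schema',
    },
    ComputeStatistics=True,
)

# Binary model: will the customer stay or leave?
ml.create_ml_model(
    MLModelId='loyalty-model',
    MLModelName='Churn model',
    MLModelType='BINARY',
    TrainingDataSourceId='loyalty-train-ds',
)

# Score the model against a held-out evaluation datasource (created separately).
ml.create_evaluation(
    EvaluationId='loyalty-eval',
    EvaluationName='Churn model evaluation',
    MLModelId='loyalty-model',
    EvaluationDataSourceId='loyalty-eval-ds',
)

# Predict outcomes for new, unlabeled customers (datasource created separately).
ml.create_batch_prediction(
    BatchPredictionId='loyalty-predictions',
    BatchPredictionName='At-risk customers',
    MLModelId='loyalty-model',
    BatchPredictionDataSourceId='loyalty-new-customers-ds',
    OutputUri='s3://my-ml-bucket/predictions/',
)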

Getting started with an example

At this point we have a good idea of what we need to do to incorporate Amazon machine learning into a Python app.  Those already proficient in Python could start writing new functions to call each of these Boto3 ML methods following the architecture we described above.

What about an example for those of us who aren’t as proficient in Python?  Fortunately there are plenty of examples online of how to write apps that call AWS APIs.  Even better, AWS released some in-depth ML examples on Github that walk through examples in several languages, including Python.

Specifically, let’s look at the targeted marketing example for Python.  First check out the README.md.  We see that we have to have our AWS credentials set up locally where we will run the code so that we can authenticate to our AWS account.  We’ll also need Python installed locally (as well as Boto3) so that we can run Python executables (.py files).

Next take a look at the Python code to build an ML model.  A few things jump out in the program:

—  The code imports a few libraries to use in the code (boto3, base64, json, os, sys).

—  The code then defines the S3 bucket and CSV file that we will use to train our model (“TRAINING_DATA_S3_URL”).

—  Functions are written for each subtask that needs to be performed for our machine learning application.  We have functions to create a datasource (def create_data_sources), build a model (def build_model), train a model (def create_model), and perform an evaluation of the model (def create_evaluation).

—  The main body of the code (line 123 – ‘if __name__‘) ties everything together, using the CSV file, a user-defined schema file, and a user-defined recipe file to call each function, resulting in an ML model with a performance measurement.

Schema and recipe

Along with the Python code we find a .schema file and recipe.json file.  We explored these in previous posts when using Amazon Machine Learning through the AWS console wizard.  When using machine learning with code through the AWS API, the schema and recipe must be created and formatted manually and passed as arguments when calling the Boto3 ML functions.

Data pulled out of a database probably already has a defined schema – these are just the rows and columns that describe the data fields and define the types of data inside them.  The ML schema needs to be defined in the correct format with our attributes and our target value along with some other information.

A recipe is a bit more complicated – the format for creating a recipe file gives the details.  The recipe simply defines how our model treats the input and output data defined in our schema.  Our example recipe file only defines outputs using some predefined groups (ALL_TEXT, ALL_NUMERIC, ALL_CATEGORICAL, ALL_BINARY) and does a few transformations on the data (quantile binning the numbers and converting text to lower case).

Making predictions in our application 

Assuming everything worked when we ran the ‘build_model.py’ executable, we would have a model available to make predictions on unlabeled data.  The ‘use_model.py’ executable does just that.  We just need to get this new unlabeled data into a CSV file and into an S3 bucket (“UNSCORED_DATA_S3_URL”) and let the ‘use_model.py’ code create a new datasource to use for predictions.  The main section of this ‘use_model.py’ code ties everything together and calls the functions with the input/output parameters required.

Think about it – we have all the code syntax, formatting, and steps necessary to add predictive ability to our application by simply calling AWS APIs.  No complex rules in our application, no need to write an entire machine learning framework into our application – just offload to Amazon Machine Learning via APIs.

Thanks for reading!

 

Amazon Machine Learning for a multiclass classification dataset

We’ve taken a tour of Amazon Machine Learning over the last three posts.  Quickly recapping, Amazon supports three types of ML models with their machine learning as a service (MLaaS) engine – regression, binary classification, and multiclass classification.  Public cloud economics and automation make MLaaS an attractive option to prospective users looking to outsource their machine learning using public cloud API endpoints.

To demonstrate Amazon’s service we’ve taken the Kaggle red wine quality dataset and adjusted the dataset to demonstrate each of the AWS MLaaS model types.  Regression worked fairly well with very little effort.  Binary classification worked better with a bit of effort.  Now to finish the tour of Amazon Machine Learning we will look at altering the wine dataset once more to turn our wine quality dataset prediction engine into a multiclass classification problem.

Multiclass Classification

What is a machine learning multiclass classification problem?  It’s one that tries to predict if an instance or set of data belongs to one of three or more categories.  A dataset is first labeled with the known categories and used to train an ML model.  That model is then fed new unlabeled data and attempts to predict what categories the new data belongs to.

An example of multiclass classification is an image recognition system.  Say we wanted to digitally scan handwritten zip codes on envelopes and have our image recognition system predict the numbers.  Our multiclass classification model would first need to be trained to detect ten digits (0-9) using existing labeled data.  Then our model could use newly scanned handwritten zip codes (unlabeled) and predict (with an accuracy value) what digits were written on the envelope.

Let’s explore how Amazon Machine Learning performs with a multiclass classification dataset.

Kaggle Red Wine Quality Dataset

We ran the Kaggle Red Wine Quality dataset untouched through the Amazon machine learning regression algorithm.  The algorithm interpreted the scores as floating point numbers rather than integer categories, which isn’t necessarily what we were after.  We then doctored up the ratings into binary categories of “great” and “not-so-great” and ran the dataset through the binary classification algorithm.  The binary categories were closer to what we wanted – treat the ratings as categories instead of decimal numbers.

This time we want to change things again.  We are going to add a third category to our rating system and change the quality score again.  Rather than try to predict the wide swing of a 0-10 rating system, we will label the quality ratings as either good, bad, or average – three classes.  And Amazon ML will create a multiclass classification model to predict one of the three ratings on new data.

Data preparation – converting to three classes

We will go ahead and download the “winequality-red.csv” file again from Kaggle.  We need to replace the existing 0-10 quality scores with categories of bad, average, and good.  We’ll use numeric categories to represent quality – bad will be 1, average will be 2, and good will be 3.

Looking at the dataset histogram below, most wines in this dataset are average – rated a 5 or 6.  Good wines are rated above 6 and bad wines below 5.  So for the sake of this exercise, we’ll say wines rated 7 and higher are good (3), wines rated 5 or 6 are average (2), and those rated less than 5 are bad (1).  Even though the rating system is 0-10, we don’t have any wines rated less than 3 or greater than 8.

Wine quality Histogram

We could use spreadsheet formulas or write a simple Python ‘for’ loop to recode our CSV “quality” column.  Replace the 3-4 ratings with a 1 (bad), the 5-6 ratings with a 2 (average), and the 7-8 ratings with a 3 (good).  Then we’ll write out the changes to a new CSV file which we’ll use for our new ML model datasource.
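Here’s one way to do that recoding with pandas (the output filename is arbitrary):

Python code
import pandas as pd

wine = pd.read_csv('winequality-red.csv')
# Recode the 0-10 quality score into three classes:
# 1 = bad (below 5), 2 = average (5-6), 3 = good (above 6).
wine['quality'] = pd.cut(wine['quality'], bins=[0, 4, 6, 10], labels=[1, 2, 3])
wine.to_csv('multiclass.wine.csv', index=False)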

Wine quality Histogram AFTER PROCESSING

Running the dataset through Amazon Machine Learning

Our wine dataset now has a multiclass-style rating – 1 for bad wine, 2 for average wine, and 3 for good wine.  Upload this new CSV file to an AWS S3 bucket that we will use for machine learning.

The process to create a datasource, model, and evaluation of the model is the same for multiclass classification as we documented in the blog post about linear regression.  Create a new datasource and ML model from the main dashboard and point the datasource to the S3 bucket with our multiclass CSV file.  The wizard will verify the schema and confirm that our file is clean.

We need to identify our quality rating as a category rather than a number for Amazon to recognize this as a multiclass classification problem.  When setting up the schema, edit the ‘quality’ column and set the data type to ‘categorical’ rather than the default ‘numerical’.  We will again select “yes” to show that our CSV file contains column names.

SCHEMA WITH CATEGORICAL DATA TYPE

After finishing the schema, select the target prediction value which is the ‘quality’ column.  Continue through the rest of the wizard and accept all the default values.  Return to the machine learning dashboard and wait for the model to complete and get evaluated.

How did we do?

Let’s first look at our source dataset prior to the 70/30 split just to get an idea of our data distribution.  82% of the wines are rated a 2 (average), 14% are rated a 3 (good), and 4% are rated a 1 (bad).  By default, the ML wizard is going to randomly split this data 70/30 to use for training and performance evaluation.

Datasource Histogram prior to Train/test split

If all went well, we should have a completed model and evaluation of that model with no failures in our dashboard.  Let’s take a look at our evaluation summary and see how we did.

EVALUATION SUMMARY

Looking at our evaluation summary, things look pretty good.  We have a summary saying our ML model’s quality score was measured with an average F1 score of 0.421.  That summary is somewhat vague, so let’s click “explore model performance”.

Model Performance- Confusion matrix

Multiclass prediction accuracy is measured per category and then combined.  How well did we predict good wines (3), average wines (2), and bad wines (1)?  Each category gets its own F1 score (the harmonic mean of precision and recall), and the per-category scores are averaged into a single F1 score, where a higher F1 score is better than a lower one.

A visualization of this accuracy is displayed above in a confusion matrix.  The matrix makes it easy to see where the model is successful and where it is not.  The darker the blue, the more frequent the correct predictions; the darker the orange/red, the more frequent the incorrect predictions.

Looking at our accuracy above, it seems we are good at predicting average wine (2) – our accuracy is almost 90% and we predicted 361/404 wines correctly as average (2).  However, the model was not so good at predicting good (3) and bad (1) wines.  We only correctly predicted 13/47 (28%) as good and only 2/26 (8%) as bad.  Our model is good at predicting the easy (2) category but not so good at predicting the more difficult and less common (1) and (3) categories.

Can we do better?

Our model is disappointing – we could have done almost as well if we just predicted that every wine was average (2).  We want to train a model that is better at predicting good (3) wines and bad (1) wines.  But how?  We could use a few ML techniques to get a better model:

Collect more data – More data would help train our model better, especially having more samples of good (3) and bad (1) wines.  In our case this isn’t possible since we are working with a static public dataset.

Remove some features – Our algorithm may have an easier time evaluating fewer features.  We have 11 features (pH, sulphates, alcohol, etc.) and we really aren’t sure whether all 11 features have a direct impact on wine quality.  Removing some features could make for a better model but would require some trial and error.  We’ll skip this for now.

Use a more powerful algorithm – Amazon uses multinomial logistic regression (multinomial logistic loss + SGD) for multiclass classification problems.  This may not be the best algorithm for our wine dataset – there may be a more powerful algorithm that works better.  However, this isn’t an option when using Amazon Machine Learning; we’d have to look at a more powerful tool like Amazon SageMaker if we wanted to experiment with different algorithms.

Data wrangling –  Feeding raw data to the Amazon Machine Learning service untouched is easy but if we aren’t getting good results we will need to perform some pre-processing to get a better model.  Some features have wide ranges and some do not – for example the citric acid values range from 0-1 while the total sulfur dioxide values range from 6-289.  So some feature scaling might be a good idea.

Also, the default random 70/30 training and testing data split may not be the greatest way to train and test our model.  We may want to use a more powerful method to split the folds of the dataset ourselves rather than letting Amazon randomly split it.  Running a stratified shuffle split might be helpful prior to uploading the data to Amazon.
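For reference, a stratified 70/30 split takes only a few lines with scikit-learn – the filenames here are placeholders:

Python code
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit

wine = pd.read_csv('multiclass.wine.csv')

# Split 70/30 while preserving the proportion of each quality class in both sets,
# instead of relying on the AWS random split.
splitter = StratifiedShuffleSplit(n_splits=1, test_size=0.3, random_state=42)
train_idx, test_idx = next(splitter.split(wine, wine['quality']))
wine.iloc[train_idx].to_csv('wine.train.csv', index=False)
wine.iloc[test_idx].to_csv('wine.test.csv', index=False)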

Lastly, Amazon Machine Learning uses a technique called quantile binning for numeric values.  Instead of treating a range of numbers as discrete values, Amazon puts the range of values in “bins” and converts them into categories.  This may work well for non-linear features but may not work great for features with a direct linear correlation to our quality ratings.  Amazon recommends some experimentation with their data transformation recipes to tweak model performance.  The default recipes may not be the best for all problems.

Final Thoughts

Machine learning is hard.  And while Amazon’s MLaaS is a powerful tool – it isn’t perfect and it doesn’t do everything.  Data needs to be clean going in and most likely needs some manipulation using various ML techniques if we want to get decent predictions from Amazon algorithms.

Just for fun I did some data wrangling on the red wine dataset to see if I could get some better prediction results.  I manually split the dataset myself using a stratified shuffle split and then ran it through the machine learning wizard using the “custom” option which allowed me to turn off the AWS 70/30 split.  The results?  Just doing a bit of work I improved the prediction accuracy for our good (3) wines to 65% correct, bad (1) wines to 25% correct, and raised the F1 score to .59 (up from .42, higher is better).

Confusion matrix with stratified shuffling

Thanks for reading!

Amazon Machine Learning for a binary classification dataset

Machine learning as a service is real.  Clean data with a well organized schema can be fed to cloud-based machine learning services with a decent ML model returned in less than 30 minutes.  The resulting model can be used for inferring  target values of new datasets in the form of batch or real-time predictions.  All three public cloud vendors (AWS, Microsoft, Google) are competing in this space which makes the services cheap and easy to consume.

In our last discussion we ran the Kaggle red wine quality dataset through the Amazon Machine Learning service.  The data was fed to AWS without any manipulation, which AWS interpreted as a regression problem with a linear regression model returned.  Each of the subjective wine quality ratings was treated as an integer from 0 (worst) to 10 (best) with a resulting model that could predict wine quality scores.  Honestly, the results weren’t spectacular – we could have gotten similar results by just guessing the median value (6) every time and we almost would have scored just as well on our RMSE value.

Our goal was to demonstrate Amazon’s machine learning capabilities in solving a regression problem and was not to create the most accurate model.  Linear regression may not be the best way to approach our Kaggle red wine quality dataset.  A (somewhat) arbitrary judge’s score from 0-10 probably does not have a linear relationship with all of the wine’s chemical measurements.

What other options do we have to solve this problem using the Amazon Machine Learning service?

Binary Classification

What is a machine learning binary classification problem?  It’s one that tries to predict a yes/no or true/false answer – the outcome is binary.  A dataset is labeled with the yes/no or true/false values.  This dataset is used to create a model to predict yes/no or true/false values on new data that is unlabeled.

An example of binary classification is a medical test for a specific disease.  Data is collected from a large group of patients who are known to have the disease and known not to have the disease.  New patients can then be tested by collecting the same data points and feeding them to a model.   The model will predict (with error rates) whether it believes the new patients have the disease.

Let’s explore how Amazon Machine Learning performs with a simple binary classification dataset.

Kaggle Red Wine Quality Dataset

We ran the Kaggle Red Wine Quality dataset through the Amazon machine learning regression algorithms in the last post.  Why this dataset?  Because it was clean data with a relatively simple objective – to predict the wine quality from its chemical measurements.  We also had no data wrangling to perform – we simply uploaded the CSV to AWS and had our model created with an RMSE evaluation score ready to review.

This time we want to change things a bit.  Rather than treat the 0-10 sensory quality ratings as integers, we want to turn the quality ratings into a binary rating.  Good or bad.  Great or not-so-great.  This greatly simplifies our problem – rather than have such a wide swing of a ten-point wine rating, we can simply categorize the wine as great or not-so-great.  In order to do this we need to edit our dataset and change the quality ratings to a one (great) or a zero (not-so-great).

Data preparation – feature engineering

Go ahead and download the “winequality-red.csv” file from Kaggle and open the .CSV file as a spreadsheet.  We need to replace the 0-10 quality scores with a 1 (great) or 0 (not-so-great).  Let’s assume most wines in this dataset are fairly average – rated a 5 or 6.  The truly great wines are rated above 6 and the bad wines below 5.  So for the sake of this exercise, we’ll say wines rated 7 and up are great and wines rated 6 and under are not-so-great.

All we have to do is edit our CSV file with the new 0 or 1 categories – easy, right?  Well, kind of.  The spreadsheet has ~1600 ratings and manually doing a search and replace is tedious and not easily repeatable.  Most machine learning datasets aren’t coming from simple and small CSV files but rather from big datasets hosted in SQL/NoSQL databases, object stores, or even distributed filesystems like HDFS.  Manual editing often won’t work and definitely won’t scale for larger problems.

Most data scientists will spend a decent amount of time manipulating and cleaning up datasets with tools that utilize some type of high-level programming language.  Jupyter notebooks are a popular tool and can support your programming language of choice.  Jupyter notebooks are much more efficient for data wrangling using code instead of working manually with spreadsheets.  Amazon even hosts cloud-based Jupyter notebooks within Amazon SageMaker.

Converting the wine ratings from 0-10 to a binary 0/1 is pretty easy in Python.  Just open the CSV file, test whether each quality rating is a 7 or higher (> 6.5), and convert the true/false result to an integer by multiplying by 1.  Then we’ll write out the changes to a new CSV file which we’ll use for our new datasource.

Python code
import pandas as pd

# Load the Kaggle CSV and convert the 0-10 quality score to a binary
# 1 (great, rated 7+) or 0 (not-so-great, rated 6 and under).
wine = pd.read_csv('winequality-red.csv')
wine['quality'] = (wine['quality'] > 6.5) * 1
# index=False keeps the pandas row index from being written as an extra column.
wine.to_csv('binary.wine.csv', index=False)

Running the binary classification dataset through Amazon Machine Learning

Our dataset now has a binary wine rating – 1 for great wine and 0 for not-so-great wine.  Upload this new CSV file to an AWS S3 bucket that we will use for machine learning.

The process to create a dataset, model, and evaluation of the model is the same for binary classification as we documented in the blog post about linear regression.  Create a new datasource and ML model from the main dashboard and point the datasource to the S3 bucket with our new binary CSV file.  The wizard will verify the schema and confirm that our file is clean.

What is different from linear regression is the schema – we want to make sure the ‘quality’ column is a ‘binary’ type rather than a ‘numerical’ type.  All other values are numerical; only the quality rating is binary.  This should be the default behavior but it’s best to double-check.  Also select “yes” to indicate that the first line in the CSV file contains the column names so the header row isn’t treated as data.

SCHEMA

After finishing the schema, select your target prediction value which is the ‘quality’ column.  Continue through the rest of the wizard and accept all the default values.  Return to the machine learning dashboard and wait for the model to complete and get evaluated.

How did we do?

If all went well, we should have a completed model and evaluation of that model with no failures in our dashboard.  Let’s check out the results by opening the evaluation.

evaluation

Looking at our evaluation summary, things look pretty good.  We have a summary saying our ML model’s quality score was measured with an AUC score of .778.  That summary is somewhat vague, so let’s click “explore model performance”.

Model Performance

By default the wizard saves 30% of our dataset for evaluation so we can measure the accuracy of our model’s predictions.  Binary classification models are measured with an AUC, or Area Under the Curve, score.  The measurement is a value between 0 and 1 – 0.5 is no better than random guessing, and 1 is a perfect model that predicts 100% of the values correctly.

Our score shows our model got 83% of our predictions correct and 17% incorrect.  Not bad!  What is nice about this type of scoring is that we can also see our false positives (not-so-great wine classified as great) and false negatives (great wine that was predicted as not-so-great).  Specifically, our model had 55 false positives and 25 false negatives.

Predicting quality wine is not exactly a matter of life and death, which means we aren’t necessarily concerned with false positives or false negatives as long as we have a decent prediction model.  But for other binary classification problems we may want to adjust our model to avoid false positives or false negatives.  This adjustment is made using the score threshold slider on the model performance screen shown above.

The adjustments come at a cost – if we want fewer false positives (bad wine predicted as great), we’ll have more false negatives (great wine accidentally predicted as bad).  The reverse is also true: if we want fewer false negatives (great wine predicted as bad), we will have more false positives (bad wine predicted as great).
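A tiny illustration of the trade-off, using made-up prediction scores rather than real model output:

Python code
# Illustration only - made-up prediction scores, not real model output.
scores = [0.10, 0.35, 0.48, 0.52, 0.71, 0.90]   # model's predicted probability of "great"
for threshold in (0.3, 0.5, 0.7):
    predicted_great = [s >= threshold for s in scores]
    print(threshold, predicted_great)
# Raising the threshold flags fewer wines as great (fewer false positives,
# more false negatives); lowering it does the opposite.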

Final thoughts

AWS uses specific learning algorithms for solving the three types of machine learning problems supported.  For binary classification, Amazon ML uses logistic regression which is a logistic loss function plus SGD.  Most beginners won’t mind these limitations but if we want to use other algorithms we’ll have to look at more advanced services like Amazon SageMaker.

Machine learning as a service is easy, fast, and cheap.  By using the AWS black box approach we built a working binary classification model in less than 30 minutes with minimal coding and only a high level knowledge of machine learning.

Thanks for reading!

 

Amazon Machine Learning for a simple regression dataset

Amazon Machine Learning is a public cloud service that offers developers access to ML algorithms as a service.  An overview can be found in the Amazon documentation and in my last blog post.   Consider the Amazon Machine Learning service a black box that automatically solves a few types of common ML problems – binary classification, multi-class classification, and regression.  If a dataset fits into one of these categories, we can quickly get a machine learning project started with minimal effort.

But just how easy is it to use this service?  What is it good for and what are the limitations?  The best way to outline the strengths and weaknesses of Amazon’s offering is to run a few examples through the black box and see the results.  In this post we’ll take a look at how AWS holds up against a very clean and simple regression dataset.

Regression

First off, what is a machine learning regression problem?  Regression modeling takes a dataset and creates a predictive algorithm to approximate one of the numerical values in that dataset.  Basically, it’s a statistical model used to estimate relationships between data points.

An example of regression modeling is a real estate price prediction algorithm.  Historical housing closing prices could be fed into an algorithm with as many relevant attributes about each house as possible (called features – e.g. square feet, number of bathrooms, etc.).  The model could then be fed attributes of houses about to go up for sale and predict the selling price.

Let’s explore how Amazon Machine Learning models and performs with a simple regression dataset.

Kaggle Red Wine Quality Dataset

Kaggle is a community-based data science and machine learning site that hosts thousands of public datasets covering the spectrum of machine learning problems.  Our interest lies in taking Amazon Machine Learning for a test drive on a simple regression problem so I’m using a simple and clean regression dataset from Kaggle – Red Wine Quality.

Why this dataset?  Because we aren’t going to cover any data cleaning or transformations – we’re more interested in finding out how the AWS black box performs on a dataset with minimal effort.  Of course this is not a real-world example, but it keeps things simple.

The dataset contains a set of numerical attributes of 1600 red wine variants of the Portuguese “Vinho Verde” wine along with a numerical quality sensory rating between 0 (worst) and 10 (best).  Our goal will be to train an AWS model to predict the quality of this variety of red wine based off the chemically measurable attributes (features).  Can data science predict the quality of wine?

Amazon Machine Learning Walkthrough

First download the “winequality-red.csv” file from Kaggle.  Open up the .CSV file in a spreadsheet and take a look around.  We can see the column names for each column which include a “quality” column that rates the wine from 0-10.  Notice all fields are populated with numbers and there are no missing values.

We’ll need to upload our .CSV file into AWS storage to get our process started so log into the AWS console and open the S3 service.

S3 console

Create a new bucket with the default settings to dedicate to our machine learning datasets.

Create a new bucket

Upload the winequality-red.csv file into this bucket and keep the default settings, permissions will be adjusted later.

Upload the winequality-red.csv file

Now open AWS Machine Learning service in the AWS console.

Amazon Machine Learning Console

Click the “get started” button and then “view dashboard”.

Machine Learning Dashboard Launch

Click “create a new datasource and ML model”.

Create new datasource and ML model

AWS will prompt us for the input data for the datasource – we will use our S3 bucket name and the wine quality CSV file that was uploaded (it should auto-populate once we start typing).  Click “verify” to confirm the CSV file is intact.  At this point AWS will adjust the permissions for the ML service to have access to the S3 bucket and file (click “yes” to grant permission).  Once we have a successful validation message, click “continue”.

Create datasource from CSV in S3 bucket

Successful validation

The wizard will now review the schema.  The schema is simply the fields in our CSV file.  Because our CSV contains the column names (fixed acidity, volatile acidity, etc.), we want to select “yes” so the first row is treated as a header rather than as data – selecting “yes” removes the column names from the model’s training data.

Schema

Now we want to select the target value that we want the model to predict. As stated earlier, we want to predict the quality of the wine based off the measurable attributes so select “quality” and continue.

Target

Keep the row identifier as default “no”.  We’ll skip this topic but it would be important if we were making predictions in a production environment.  Review the selections and “continue”.

Row Identifier

Review

Accept the default ML model settings which will split our dataset and set aside 30% of the data for evaluating the performance of the model once it is trained using 70% of the data.

Review one last time and then select “create ML model”.

Model settings

Review

We are done!  The dashboard will show the progress; the entire process takes about 10-20 minutes.  AWS will take over and read the data, perform a 70/30 split, create a model, train the model using 70% of the data, and then evaluate the model using the remaining 30% of the data.

How did we do?

If all went well, we should have a completed model and evaluation of that model with no failures in our dashboard.  Let’s check out the results by opening the evaluation.

Once again, this isn’t a real world example.  A data scientist would typically spend much more time looking at the data prior to evaluating an algorithm.  But since this is a black box approach, we will entertain the notion of skipping to the end without doing any data exploration.

EVALUATION SUMMARY

Looking at our evaluation summary, things look pretty good.  We have a green box saying our ML model’s quality score was measured with an RMSE of 0.729 and was better than the baseline value of 0.793.  That summary is somewhat vague, so let’s click “explore model performance”.

ML Model performance

Since we saved 30% of our data for evaluation, we can measure the accuracy of our model’s predictions.  A common way to measure regression accuracy is the RMSE, or root mean square error.  The RMSE measures the difference between the model’s predictions and the actual values in the evaluation set.  An RMSE of 0 is perfect – the smaller the value, the better the model.
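For anyone who wants to see the math, here’s a toy RMSE calculation in Python with made-up numbers:

Python code
import numpy as np

# RMSE is the square root of the average squared difference between
# predictions and actual values. Toy numbers, purely for illustration.
actual    = np.array([5, 6, 5, 7, 6])
predicted = np.array([5.4, 5.8, 5.6, 6.1, 6.0])
rmse = np.sqrt(np.mean((predicted - actual) ** 2))
print(round(rmse, 3))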

We can see a somewhat bell-shaped curve and our model is pretty close to zero, which seems great.  However, if we look back at our summary we see that the RMSE baseline is .793 and our score is .729.  The baseline would be our score if we just guessed the median (middle) quality score for every prediction.  So although we are a bit better, we aren’t better by much – we would have been close if we had just ignored all the attributes and guessed the median quality score (6) every time.

Amazon Machine Learning – the good

Although our results were not incredibly impressive, the easy wizard-driven process and time to results were very impressive.  We can feed the AWS ML black box data and have a working model with an evaluation of its accuracy in less than 30 minutes.  And this can be done with little to no knowledge of data science, ML techniques, computer programming, or statistics.

And this entire process can be automated via the AWS API sets.  Think of an automated process that collects data and automatically feeds it to the AWS machine learning engine using code instead of manually clicking buttons in the console.  Predictions could be generated automatically on constantly changing data using code and the black box machine learning algorithms in the public cloud.

Lastly, the service is inexpensive.  It cost about $3 to run half a dozen datasets through the service over the past month while researching this blog.  All we need is some data in S3 and the service can start providing ML insights at a very low cost.

Amazon Machine Learning – the bad

A few points become clear after using the ML service.  The first is we have to adjust our expectations when feeding the AWS black box data.  Regression problems fed to Amazon Machine Learning are going to be solved using linear regression with stochastic gradient descent.  If a problem is easily solved using those algorithms then Amazon will do a great job making predictions.  If other algorithms like random forests or decision trees get better results then we need to look at more advanced Amazon services to solve those types of problems.

The second point is that data needs to be explored, cleaned, and transformed prior to feeding it to the machine learning service.  While we can look at correlations and data distributions after we create a datasource in AWS, there is no easy way to manipulate the dataset other than directly editing the CSV file.  There is an advanced option to use built-in data transformations as part of the ML model wizard, but that is a more advanced topic and is limited to the data transformations referenced in the documentation.  Accepting the defaults and not wrangling your input data may not get great results from the AWS ML black box.

Use case

Have a machine learning problem that is easily solved using linear regression with SGD?  Have an existing data pipeline that already cleans and transforms your data?  In that case, Amazon Machine Learning can take over your machine learning work quickly and cheaply without much time spent learning the service.  Just don’t expect a shortcut to your ML process – send Amazon clean data and you can quickly get an API prediction endpoint for linear regression problems.

 

Amazon Machine Learning – Commodity Machine Learning as a Service

Commodity turns into “as a service”

Find something in computing that is expensive and cumbersome to work with and Amazon will find a way to commoditize it.  AWS created commodity storage, networking, and compute services for end users with resources left over from their online retail business.  The margins may not have been great, but Amazon could make this type of business profitable due to their scale.

But what happens now that Microsoft and Google can offer the same services at the same scale?  The need for differentiation drives innovative cloud products and platforms with the goal of attracting new customers and keeping those customers within a single ecosystem.  Basic services like storage, networking, and compute may not differ much between the cloud providers, but new software as a service (SaaS) offerings are more enticing when shopping for cloud services.

Software as a service (SaaS) probably isn’t the best way to describe these differentiated services offered by public cloud providers.  SaaS typically refers to subscription-based software running in the cloud like Salesforce.com or ServiceNow.  Recent cloud services are better described by their functionality with an “as a service” tacked on.  Database as a service.  Data warehousing as a service.  Analytics as a service.  Cloud providers build the physical infrastructure, write the software, and implement vertical scaling with a self-service provisioning portal for easy consumption.  Consumers simply implement the service within their application without worrying about the underlying details.

Legacy datacenter hardware and software vendors may not appreciate this approach due to lost revenue, but the “as a service” model is a good thing for IT consumers.  Services that were previously unavailable to everyday users have been democratized and are available to anyone with an AWS account and a credit card.  Cost isn’t necessarily the primary benefit to the consumer but rather the accessibility and the ability to consume using a public utility model.  All users now have access to previously exotic technologies and can pay by the hour to use them.

Machine Learning as a Service

Machine learning is at the peak of the emerging technology hype cycle.  But is it really all hype?  Machine learning (ML) has been around for 20+ years.  ML techniques allow users to “teach” computers to solve problems without hard-coding hundreds or thousands of explicit rules.    This isn’t a new idea but the hype is definitely a recent phenomenon for a number of reasons.

So why all the machine learning hype?  The startup costs, both in terms of hardware/software and accessibility, are much lower, which presents an opportunity to implement machine learning that wasn’t available in the past.  Data is more abundant and data storage is cheap.  Computing resources are abundant and CPU/GPU cycles are (relatively) cheap.  Most importantly, the barriers have been lifted in terms of combining access to advanced computing resources and vast sets of data.  What used to require onsite specialized HPC clusters and expensive data storage arrays can now be performed in the cloud for a reasonable cost.  Services like Amazon Machine Learning are giving everyday users the ability to perform complicated machine learning computations on datasets that in the past were only available to researchers at universities.

How does Amazon provide machine learning as a service?  Think of the service as a black box.  The implementation details within the black box are unimportant.  Data is fed into the system, a predictive model/algorithm is created and trained, the system is tweaked for its effectiveness, and then the system is used to make predictions using the trained model.  Users can use the predictive system without really knowing what is happening inside the black box.

This machine learning black box isn’t magical.  It is limited to a few basic types of models (algorithms) – regression, binary classification, and multiclass classification.  More advanced operations require users to look to AWS SageMaker, which demands a higher skill level than the basic AWS machine learning black box.  However, these three basic machine learning models can get you started on some real-world problems very quickly without knowing much math or programming.

Amazon Machine Learning Workflow

So how does this process work at a high level?  If a dataset and use case can be identified as a regression or binary/multiclass classification problem, then the data can simply be fed to the AWS machine learning black box.  AWS will use the data to automatically select a model and train it using your input data.  The effectiveness of the model is then evaluated and summarized with a numerical score.  The model is ready to use at this point but can also be tweaked to improve the score.  Bulk data can be fed to the trained model for batch predictions, or ad-hoc predictions can be performed using the AWS console or programmatically through the AWS API.
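
The batch path in that last step looks roughly like this with boto3; the IDs and S3 output location are placeholders, and the results land in S3 once the job completes:

import boto3

ml = boto3.client("machinelearning")

# Score a whole unlabeled datasource at once and drop the results back into S3.
ml.create_batch_prediction(
    BatchPredictionId="example-batch",
    MLModelId="example-model",
    BatchPredictionDataSourceId="example-unlabeled-ds",
    OutputUri="s3://my-example-bucket/predictions/",
)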

Knowing that a problem can be solved by AWS takes a bit of high-level machine learning knowledge.  The end user needs to have an understanding of their data and of the three model types offered in the AWS black box.  Reading through the AWS Machine Learning developer guide is a good start for an overview.  Regression models solve problems that need to predict numeric values.  Binary classification models predict binary outcomes (true/false, yes/no) and multiclass classification models can predict more than two outcomes (categories).

Why use Amazon machine learning?

For those starting out with machine learning, this AWS service may sound overcomplicated or of questionable value.  Most tutorials show this type of work done on a laptop with free programming tools, so why is AWS necessary?  The point is that a novice user can do some basic machine learning in AWS without the high startup opportunity cost of learning to program or learning how to use machine learning software packages.  Simply feed the AWS machine learning engine data and get instant results.

Anyone can run a proof of concept data pipeline into AWS and perform some basic machine learning predictions.  Some light programming skills would be helpful but are not mandatory.  Having a dataset with a well-defined schema is a start as well as having a business goal of using that dataset to make predictions on similar sets of incoming data.  AWS can provide these predictions for pennies an hour and eliminate the startup costs that would normally delay or even halt these types of projects.

Amazon machine learning is a way to quickly get productive and get a project past the proof of concept phase in making predictions based on company data.  Access to the platform is open to all users so don’t rely on the AWS models for a competitive advantage or a product differentiator.  Instead use AWS machine learning as a tool to quickly get a machine learning project started without much investment.

Thanks for reading, some future blog posts here will include running some well-known machine learning problems through Amazon Machine Learning to highlight the process and results.

Are VMware customers really afraid of the public cloud?

I spent last week in Las Vegas attending VMworld 2018.  Over the years I’ve been to many IT conferences both as an individual attendee and while representing a vendor.  This time around, I had on my vendor hat doing booth duty.  Which means I stood in the expo hall at my employer’s booth with my fellow engineers and spoke to the passersby who wanted to see a demo of our product and talk about our technology.

My employer was very kind this year in giving booth staff 4-hour shifts instead of a full 8-hour day (thanks!).  I worked 3 days straight for 4 hours a day for a total of 12 hours (give or take).  Each hour I probably spoke to about 4-5 people at length, giving a full demo of our product and talking about their current storage technology.  For the sake of argument, let’s say I spoke to about 50 people in total during my 12 hours working at the conference.

I had a fun time working at the conference and all the conversations I had at the booth with customers were great.  But after talking about their current technology and their future business goals for data storage, one thing struck me as honestly surprising.  These customers were not using and were generally not interested in public cloud storage.

Some context

I currently work in the field of data protection and data management.  I’ve been specializing in data storage for most of my career.  I’ve seen firsthand the explosion in the sheer amount of data stored in the enterprise datacenter over the past 15+ years.  We generate more data today than we ever had in the past.  And one of the greatest challenges to solve for is how to efficiently store this data without spending an inordinate amount of time and money on the problem.

Just because we have lots of data doesn’t mean it’s all important.  Most data is probably not that important.  But it’s often more difficult to classify data as important than it is to just keep it around.  It is also difficult for individuals and their managers to decide whether things can be deleted – will the business need this data later?  Isn’t it easier (and safer) to just keep everything?  So what companies often do is keep everything and try to find a cheap bulk platform for this data of questionable value.

Bulk data storage platforms have been around for many years and are most often located in the same datacenter as the networking, servers, and the storage platforms for more important data.  Historically, tape is the most common storage medium for bulk data, but spinning disk arrays are also popular for storing petabytes of archive data.  Backup software and data protection applications can move data around, and they usually have the ability to take older, less frequently accessed data and stick it on bulk storage somewhere in the datacenter on a tape or a slow spinning disk drive.  This was true 10 years ago and is still true today.

And I get it.  You have a solution that has worked for many years and it seems good enough.  But onsite bulk storage systems might gradually become expensive to maintain for the limited value they provide.  And their vendors may deliver very little innovation over the years and keep the architecture in business-as-usual mode.  But then again, the system is what everyone is used to, what everyone is comfortable with, and it mostly works.

Public cloud, no thanks!

Public cloud for bulk storage?  No thanks!  This was the overwhelming feeling I got when speaking with VMworld attendees. These were customers, not other vendors, who were working every day with data protection, data management, and data storage.  I would simply ask the folks I spoke with where they were keeping their bulk data and if their plans included moving this data to the big public cloud vendors – AWS, Microsoft, or Google.  And among the ~50 people I spoke with, the answer was a resounding “no”.

I’ll be the first to admit my sampling methods here are flawed and this is in no way representative of all customers at the conference.  Maybe only big iron customers wander into the vendor hall.  Or maybe only old-school sysadmins ask a vendor for a demo.  I doubt either of these were the case, but I will say this was an unscientific poll performed by me and I’m using some casual conversations I had to make a point.

Traditional enterprise customers that buy hardware and software from their favorite tried-and-true legacy vendors seem to have little interest in exploring the options and benefits offered by the major public cloud vendors.  Maybe this was reflective of their conservative employers that are risk and change averse.  Or maybe it reflected the personal views of these customers.  Either way, this seems odd to me and flies in the face of all the current perceived IT trends and the cost savings promised by the public cloud.

Operators and planners

I’ll break down the types of customers I spoke with into two categories.  This is admittedly oversimplified but will make my point easier.  End users of technology I spoke to at the conference were usually either technology operators or technology planners.

Technology operators are deep into the details of hardware and software solutions and run the day-to-day operations of IT shops.  They press the buttons, turn the knobs, and fix things when they break.  Operators become experts at the things they manage and know which vendors’ solutions work well and which might not work so well.  Operators are often seen as domain experts in their field and their advice is valued by non-experts.

Technology planners are often looking at the big picture in terms of architectures, project timelines, and cost.  Planners are less interested in the fine details and will go find an operator if that is necessary.  A planner could be an architect who is building a larger solution of several point products.  Or a planner might be a manager working directly with another set of planners on providing a solution that meets a set of business requirements.  Planners are trying to solve a business problem and keep the solution at a certain cost.

When I asked technology operators about public cloud storage, these people looked at me quizzically and said they didn’t currently use it.  Or they answered with an emphatic NO, as if AWS were going to take their job away and be a horrible end to an otherwise great career.

Technology planners had a more nuanced answer: they simply said that the business wouldn’t allow it due to some form of bureaucracy, lack of cost savings, or legal rules.  They had looked at it before and it was a dead end – the project was never implemented, there was a perceived data sovereignty problem, or the savings just couldn’t be justified.

This is easy!

Putting bulk backup and archive data in the public cloud is pretty easy.  Some might say it’s even a boring use case, because both the onsite data storage vendors and the public cloud vendors have already solved this problem.  Some may offer better integrations, some may be better solutions for certain use cases, but in general this problem has already been solved by the industry.

Is public cloud really cheaper?  For bulk data storage, it is most likely cheaper than your current onsite storage product, but your mileage may vary.  Bulk storage pricing is a race to the bottom in this market; if you are willing to pay by the month you can probably get a better deal, and you will continue to get a better deal as commodity storage prices drop.  You can even tier data within the public cloud for even lower prices.

Is my data secure in the public cloud?  Yes, if you encrypt everything at rest and in flight and work with your vendors to make sure you get all the implementation details correct (hint – your storage operators can help here).  Now if you have laws restricting your data movement or other data sovereignty issues you may be out of luck.  But I wouldn’t get hung up on security; these are industry-standard solutions that deploy agreed-upon encryption algorithms.
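
As a concrete example of “at rest and in flight,” writing an object with server-side encryption using boto3 – which talks to S3 over HTTPS by default – might look roughly like this; the bucket, key, and file names are placeholders:

import boto3

s3 = boto3.client("s3")   # requests go over HTTPS by default (encryption in flight)

with open("backup-2018-09-01.tar.gz", "rb") as data:
    s3.put_object(
        Bucket="my-archive-bucket",
        Key="backups/backup-2018-09-01.tar.gz",
        Body=data,
        ServerSideEncryption="aws:kms",   # encrypted at rest with a KMS-managed key
    )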

This isn’t a commercial for my employer.  And I get it, cloud may not be the right fit for every use case or every business.  But your business may benefit greatly from storing bulk data on the cheapest possible platform, which is the public cloud.  If you aren’t using public cloud as a component in your current storage architecture, you are going to get left behind.  Why get bogged down with millions of dollars’ worth of complex storage gear onsite to store petabytes of data?  It is only a matter of time before your business competitors take advantage of cloud prices and efficiency.

Call to action

Technology operators – start to learn about public cloud storage.  You don’t need to be a cloud expert but take a look at object storage, learn about the tiers from AWS, Azure, and GCP, learn about the pricing, and compare those tools to the ones you have onsite in your datacenter.  Learn about ingress, egress, puts, and gets.  Learn how similar Glacier is to tape and how it differs from standard S3.  Most importantly, you will be more valuable to your business if you have more tools available to you and are able to use them when it makes sense.
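
To make the tiering idea concrete, here is a sketch of an S3 lifecycle rule that moves objects under a prefix to Glacier after 30 days and expires them after roughly seven years; the bucket name, prefix, and day counts are placeholders to adapt to your own retention policy:

import boto3

s3 = boto3.client("s3")

s3.put_bucket_lifecycle_configuration(
    Bucket="my-archive-bucket",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "archive-then-expire",
                "Status": "Enabled",
                "Filter": {"Prefix": "backups/"},
                "Transitions": [{"Days": 30, "StorageClass": "GLACIER"}],  # cold tier
                "Expiration": {"Days": 2555},                              # ~7 years
            }
        ]
    },
)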

Technology planners, work with enlightened technology operators who are able to objectively build solutions with you that may not include the point products you have used for 15+ years.  Revisit the possibility of storing data in different places and how you can save time and money.  Find a vendor or trusted partner to walk you through a TCO and evaluate something new that might take advantage of public cloud.

I hear business leaders who are served by the IT department say “we aren’t in the datacenter business,” but I don’t see much action when it comes to seriously looking at the public cloud.  Bulk storage is the easiest thing to put in the cloud, so take a fresh look at the solutions available!

Thanks for reading!