Amazon Machine Learning is a public cloud service that offers developers access to ML algorithms as a service. An overview can be found in the Amazon documentation and in my last blog post. Consider the Amazon Machine Learning service a black box that automatically solves a few types of common ML problems – binary classification, multi-class classification, and regression. If a dataset fits into one of these categories, we can quickly get a machine learning project started with minimal effort.
But just how easy is it to use this service? What is it good for and what are the limitations? The best way to outline the strengths and weaknesses of Amazon’s offering is to run a few examples through the black box and see the results. In this post we’ll take a look at how AWS holds up against a very clean and simple regression dataset.
First off, what is a machine learning regression problem? Regression modeling takes a dataset and builds a predictive algorithm that approximates one of the numerical values in that dataset. Basically, it's a statistical model used to estimate relationships between data points.
An example of regression modeling is a real estate price prediction algorithm. Historical housing closing prices could be fed into an algorithm along with as many relevant attributes about each house as possible (called features – e.g. square feet, number of bathrooms, etc.). The model could then be fed the attributes of houses about to go up for sale and predict the selling prices.
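A toy version of that house-price regression can be sketched in a few lines of NumPy. The features and prices below are made up purely for illustration:

```python
import numpy as np

# Hypothetical training data: [square_feet, bathrooms] for each sold house
features = np.array([
    [1400, 2],
    [1900, 2],
    [2400, 3],
    [3000, 4],
], dtype=float)
prices = np.array([240_000, 310_000, 405_000, 520_000], dtype=float)

# Append an intercept column and solve the least-squares fit
X = np.hstack([features, np.ones((len(features), 1))])
coef, *_ = np.linalg.lstsq(X, prices, rcond=None)

# Predict the price of a 2100 sq ft, 3-bathroom house about to list
new_house = np.array([2100, 3, 1], dtype=float)
predicted = new_house @ coef
print(f"predicted price: ${predicted:,.0f}")
```

The same idea scales up to real datasets with many more features and rows – which is exactly what we hand off to AWS below.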
Let’s explore how Amazon Machine Learning models and performs with a simple regression dataset.
Kaggle Red Wine Quality Dataset
Kaggle is a community-based data science and machine learning site that hosts thousands of public datasets covering the spectrum of machine learning problems. Our interest lies in taking Amazon Machine Learning for a test drive on a straightforward regression problem, so I’m using a simple and clean regression dataset from Kaggle – Red Wine Quality.
Why this dataset? Because we aren’t going to cover any data cleaning or transformations – we’re more interested in finding out how the AWS black box performs on a dataset with minimal effort. Of course this is not a real-world example, but it keeps things simple.
The dataset contains a set of numerical attributes of 1,599 red wine variants of the Portuguese “Vinho Verde” wine along with a numerical sensory quality rating between 0 (worst) and 10 (best). Our goal will be to train an AWS model to predict the quality of this variety of red wine based off the chemically measurable attributes (features). Can data science predict the quality of wine?
Amazon Machine Learning Walkthrough
First download the “winequality-red.csv” file from Kaggle. Open up the .CSV file in a spreadsheet and take a look around. The first row holds the column names, which include a “quality” column that rates the wine from 0-10. Notice all fields are populated with numbers and there are no missing values.
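The same sanity check can be done in a few lines of pandas instead of a spreadsheet. A minimal sketch, assuming the Kaggle file sits in the working directory:

```python
from pathlib import Path

import pandas as pd

def inspect_wine_csv(path):
    """Load a CSV and report its shape, columns, and missing-value count."""
    df = pd.read_csv(path)
    print(df.shape)                    # (rows, columns)
    print(list(df.columns))            # includes the 'quality' target column
    print(int(df.isna().sum().sum()))  # total missing values (0 here)
    return df

csv_path = Path("winequality-red.csv")  # the file downloaded from Kaggle
if csv_path.exists():
    wine = inspect_wine_csv(csv_path)
```

Confirming there are no missing values up front is what lets us skip data cleaning entirely for this walkthrough.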
We’ll need to upload our .CSV file into AWS storage to get our process started so log into the AWS console and open the S3 service.
Create a new bucket with the default settings to dedicate to our machine learning datasets.
Create a new bucket
Upload the winequality-red.csv file into this bucket and keep the default settings; permissions will be adjusted later.
Upload the winequality-red.csv file
Now open AWS Machine Learning service in the AWS console.
Amazon Machine Learning Console
Click the “get started” button and then “view dashboard”.
Machine Learning Dashboard Launch
Click “create a new datasource and ML model”.
Create new datasource and ML model
AWS will prompt us for the input data for the datasource; we will use our S3 bucket name and the wine quality CSV file that was uploaded (it should auto-populate once we start typing). Click “verify” to confirm the CSV file is intact. At this point AWS will adjust the permissions so the ML service has access to the S3 bucket and file (click “yes” to grant it permissions). Once we have a successful validation message, click “continue”.
Create datasource from CSV in S3 bucket
The wizard will now review the schema. The schema is simply the set of fields in our CSV file. Because the first row of our CSV contains the column names (fixed acidity, volatile acidity, etc.), we want to select “yes” so those names are treated as a header rather than as data – otherwise the model would try to learn from the column names themselves.
Now we want to select the target value that we want the model to predict. As stated earlier, we want to predict the quality of the wine based off the measurable attributes so select “quality” and continue.
Keep the row identifier as default “no”. We’ll skip this topic but it would be important if we were making predictions in a production environment. Review the selections and “continue”.
Accept the default ML model settings which will split our dataset and set aside 30% of the data for evaluating the performance of the model once it is trained using 70% of the data.
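The hold-out the wizard performs can be approximated in a few lines. A sketch of a shuffled 70/30 split (the exact split strategy AWS uses is configurable in the model settings):

```python
import numpy as np

def train_eval_split(n_rows, train_frac=0.7, seed=42):
    """Shuffle row indices and split them into training and evaluation sets,
    mimicking the 70/30 hold-out the wizard performs for us."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_rows)
    cut = int(n_rows * train_frac)
    return idx[:cut], idx[cut:]

train_idx, eval_idx = train_eval_split(1599)  # rows in winequality-red.csv
print(len(train_idx), len(eval_idx))          # roughly 70% / 30%
```

Holding data out of training is what makes the later evaluation honest – the model is scored only on rows it never saw.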
Review one last time and then select “create ML model”.
We are done! The dashboard will show the progress; the entire process takes about 10-20 minutes. AWS will take over and read the data, perform a 70/30 split, create a model, train the model using 70% of the data, and then evaluate the model using the remaining 30%.
How did we do?
If all went well, we should have a completed model and evaluation of that model with no failures in our dashboard. Let’s check out the results by opening the evaluation.
Once again, this isn’t a real world example. A data scientist would typically spend much more time looking at the data prior to evaluating an algorithm. But since this is a black box approach, we will entertain the notion of skipping to the end without doing any data exploration.
Looking at our evaluation summary, things look pretty good. We have a green box saying our ML model’s quality score was measured with an RMSE of 0.729, better than the baseline value of 0.793. That summary is somewhat vague, so let’s click “explore model performance”.
ML Model performance
Since we saved 30% of our data for evaluation, we can measure the accuracy of our model’s predictions. A common way to measure regression accuracy is the root mean square error (RMSE). The RMSE score measures the difference between the model’s predictions and the actual values in the evaluation set. An RMSE of 0 is perfect; the smaller the value, the better the model.
We can see a somewhat bell-shaped curve and our model is pretty close to zero, which seems great. However, if we look back at our summary we see that the RMSE baseline is 0.793 and our score is 0.729. The baseline is the score we would get by simply guessing the median (middle) quality score for every prediction. So although we are a bit better than the baseline, we aren’t better by much – we would get nearly the same score by ignoring all the attributes and guessing the median quality score (6) every time.
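To make the baseline comparison concrete, here is a small sketch of computing RMSE for a median-guessing baseline versus a model. The quality scores and predictions below are toy values, not the real evaluation data:

```python
import numpy as np

def rmse(actual, predicted):
    """Root mean square error between predictions and true values."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))

# Toy evaluation set of quality scores (illustrative only)
actual = np.array([5, 6, 5, 7, 6, 5, 6, 4], dtype=float)

# Baseline: ignore every feature and always guess the median quality
baseline_preds = np.full_like(actual, np.median(actual))

# Hypothetical model predictions that are only slightly closer to the truth
model_preds = np.array([5.4, 5.8, 5.2, 6.3, 5.9, 5.3, 5.8, 4.9])

print(f"baseline RMSE: {rmse(actual, baseline_preds):.3f}")
print(f"model RMSE:    {rmse(actual, model_preds):.3f}")
```

A model is only as good as its margin over this kind of trivial baseline – which is exactly why the 0.729 vs 0.793 gap above is underwhelming.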
Amazon Machine Learning – the good
Although our results were not incredibly impressive, the easy wizard-driven process and time to results were very impressive. We can feed the AWS ML black box data and have a working model with an evaluation of its accuracy in less than 30 minutes. And this can be done with little to no knowledge of data science, ML techniques, computer programming, or statistics.
And this entire process can be automated via the AWS API sets. Think of an automated process that collects data and automatically feeds it to the AWS machine learning engine using code instead of manually clicking buttons in the console. Predictions could be generated automatically on constantly changing data using code and the black box machine learning algorithms in the public cloud.
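A rough sketch of what that automation might look like using boto3’s `machinelearning` client. The IDs, names, and bucket here are made up for illustration, and a real request would also need a data schema describing the columns:

```python
def build_wine_pipeline(ml_client, bucket, key="winequality-red.csv"):
    """Create a datasource and regression model through the Amazon ML API.

    `ml_client` is a boto3 'machinelearning' client (or any stub exposing
    the same methods). All IDs and names here are illustrative.
    """
    ml_client.create_data_source_from_s3(
        DataSourceId="ds-wine-quality",
        DataSourceName="red wine quality",
        # NOTE: a real request also needs a 'DataSchema' (or
        # 'DataSchemaLocationS3') entry describing the columns
        DataSpec={"DataLocationS3": f"s3://{bucket}/{key}"},
        ComputeStatistics=True,
    )
    ml_client.create_ml_model(
        MLModelId="ml-wine-quality",
        MLModelName="red wine quality regression",
        MLModelType="REGRESSION",
        TrainingDataSourceId="ds-wine-quality",
    )
    return "ml-wine-quality"

# With real credentials this would kick off the same pipeline the console ran:
# import boto3
# build_wine_pipeline(boto3.client("machinelearning"), "my-ml-bucket")
```

From there the same API family exposes evaluation and prediction calls, so the whole console walkthrough above can run unattended on a schedule.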
Lastly, the service is inexpensive. It cost about $3 to run half a dozen datasets through the service over the past month while researching this blog. All we need is some data in S3 and the service can start providing ML insights at a very low cost.
Amazon Machine Learning – the bad
A few points become clear after using the ML service. The first is we have to adjust our expectations when feeding the AWS black box data. Regression problems fed to Amazon Machine Learning are going to be solved using linear regression with stochastic gradient descent. If a problem is easily solved using those algorithms then Amazon will do a great job making predictions. If other algorithms like random forests or decision trees get better results then we need to look at more advanced Amazon services to solve those types of problems.
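For reference, linear regression trained with stochastic gradient descent – the family of algorithm the service applies to regression problems – can be sketched in a few lines of NumPy:

```python
import numpy as np

def sgd_linear_regression(X, y, lr=0.1, epochs=200, seed=0):
    """Minimal linear regression fit by stochastic gradient descent:
    update the weights one training example at a time."""
    rng = np.random.default_rng(seed)
    Xb = np.hstack([X, np.ones((len(X), 1))])  # append an intercept term
    w = np.zeros(Xb.shape[1])
    for _ in range(epochs):
        for i in rng.permutation(len(Xb)):     # one example per update
            error = Xb[i] @ w - y[i]
            w -= lr * error * Xb[i]            # gradient of squared error
    return w

# Sanity check: recover the noiseless line y = 2x + 1
X = np.linspace(0, 1, 20).reshape(-1, 1)
y = 2 * X.ravel() + 1
w = sgd_linear_regression(X, y)
print(w)  # weights should approach [2, 1]
```

If the relationship in your data is not well captured by a linear model like this, no amount of SGD training inside the black box will fix it.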
The second point is that data needs to be explored, cleaned, and transformed prior to feeding it to the machine learning service. While we can look at correlations and data distributions after we create a datasource in AWS, there is no easy way to manipulate your dataset other than directly editing your CSV file. There is an advanced option to use built-in data transformations as part of the ML model wizard, but that is a more advanced topic and is limited to the transformations referenced in the documentation. Accepting the defaults and not wrangling your input data is unlikely to get great results from our AWS ML black box.
Have a machine learning problem that is easily solved using linear regression with SGD? Have an existing data pipeline that already cleans and transforms your data? In that case, Amazon Machine Learning can take on your machine learning very quickly and cheaply without much time spent learning the service. Just don’t expect a shortcut around the rest of your ML process: send Amazon clean data and you can quickly get an API prediction endpoint for linear regression problems.