Amazon Machine Learning for a multiclass classification dataset

We’ve taken a tour of Amazon Machine Learning over the last three posts.  Quickly recapping, Amazon supports three types of ML models with their machine learning as a service (MLaaS) engine – regression, binary classification, and multiclass classification.  Public cloud economics and automation make MLaaS an attractive option to prospective users looking to outsource their machine learning using public cloud API endpoints.

To demonstrate Amazon’s service we’ve taken the Kaggle red wine quality dataset and adjusted the dataset to demonstrate each of the AWS MLaaS model types.  Regression worked fairly well with very little effort.  Binary classification worked better with a bit of effort.  Now to finish the tour of Amazon Machine Learning we will look at altering the wine dataset once more to turn our wine quality dataset prediction engine into a multiclass classification problem.

Multiclass Classification

What is a machine learning multiclass classification problem?  It’s one that tries to predict if an instance or set of data belongs to one of three or more categories.  A dataset is first labeled with the known categories and used to train an ML model.  That model is then fed new unlabeled data and attempts to predict what categories the new data belongs to.

An example of multiclass classification is an image recognition system.  Say we wanted to digitally scan handwritten zip codes on envelopes and have our image recognition system predict the numbers.  Our multiclass classification model would first need to be trained to detect ten digits (0-9) using existing labeled data.  Then our model could use newly scanned handwritten zip codes (unlabeled) and predict (with an accuracy value) what digits were written on the envelope.

Let’s explore how Amazon Machine Learning performs with a multiclass classification dataset.

Kaggle Red Wine Quality Dataset

We ran the Kaggle Red Wine Quality dataset untouched through the Amazon machine learning regression algorithm.  The algorithm interpreted the scores as floating point numbers rather than integer categories, which isn’t necessarily what we were after.  We then doctored up the ratings into binary categories of “great” and “not-so-great” and ran the dataset through the binary classification algorithm.  The binary categories were closer to what we wanted: treating the ratings as categories instead of decimal numbers.

This time we want to change things again.  We are going to add a third category to our rating system and change the quality score again.  Rather than try to predict the wide swing of a 0-10 rating system, we will label the quality ratings as good, bad, or average – three classes.  And Amazon ML will create a multiclass classification model to predict one of the three ratings on new data.

Data preparation – converting to three classes

We will go ahead and download the “winequality-red.csv” file again from Kaggle.  We need to replace the existing 0-10 quality scores with categories of bad, average, and good.  We’ll use numeric categories to represent quality – bad will be 1, average will be 2, and good will be 3.

Looking at the dataset histogram below, most wines in this dataset are average – rated a 5 or 6.  Good wines are rated > 6 and the bad wines are < 5.  So for the sake of this exercise, we’ll say wines rated 7 and higher are good (3), wines rated 5 or 6 are average (2), and those rated less than 5 are bad (1).  Even though the rating system is 0-10, we don’t have any wines rated less than 3 or greater than 8.

Wine quality histogram

We could use spreadsheet formulas or write a simple Python ‘for’ loop to update the “quality” column in our CSV.  Replace the ratings from 3-4 with a 1 (bad), replace our 5-6 ratings with a 2 (average), and replace our 7-8 ratings with a 3 (good).  Then we’ll write out the changes to a new CSV file which we’ll use for our new ML model datasource.
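Here’s a minimal sketch of that loop in Python.  The output file name is my own choice, and it assumes the Kaggle CSV is comma-delimited with a header row (the UCI original of this file uses semicolons, so adjust if needed):

```python
import csv

# Map the 0-10 quality score to three classes:
# <5 -> 1 (bad), 5-6 -> 2 (average), >6 -> 3 (good)
def to_class(quality):
    q = int(quality)
    if q < 5:
        return 1
    if q <= 6:
        return 2
    return 3

with open("winequality-red.csv", newline="") as src, \
     open("winequality-red-multiclass.csv", "w", newline="") as dst:
    reader = csv.DictReader(src)
    writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
    writer.writeheader()
    for row in reader:
        row["quality"] = to_class(row["quality"])
        writer.writerow(row)
```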

Wine quality histogram after processing

Running the dataset through Amazon Machine Learning

Our wine dataset now has a multiclass style rating – 1 for bad wine, 2 for average wine, and 3 for good wine.  We’ll upload this new CSV file to an AWS S3 bucket that we will use for machine learning.
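If you’d rather script the upload than click through the S3 console, here’s a quick sketch with boto3 (the bucket name is a placeholder):

```python
import boto3

# Upload the relabeled CSV to S3; "my-ml-bucket" is a placeholder bucket name.
s3 = boto3.client("s3")
s3.upload_file("winequality-red-multiclass.csv",
               "my-ml-bucket",
               "winequality-red-multiclass.csv")
```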

The process of creating a datasource, model, and evaluation of the model is the same as it was for binary classification, which we documented in the blog post about linear regression.  Create a new datasource and ML model from the main dashboard and point the datasource to the S3 bucket with our multiclass CSV file.  The wizard will verify the schema and confirm that our file is clean.

We need to identify our quality rating as a category rather than a number for Amazon to recognize this as a multiclass classification problem.  When setting up the schema, edit the ‘quality’ column and set the data type to ‘categorical’ rather than the default ‘numeric’.  We will again select “yes” to indicate that our CSV file contains column names.

Schema with categorical data type
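The console wizard generates this schema for us behind the scenes, but the same thing can be written by hand as a small JSON schema file.  Here’s a sketch in Python; the field names (“targetFieldName”, “fieldName”, “fieldType”) are my reading of the Amazon ML schema format, so double-check them against the AWS developer guide:

```python
import json

# Sketch of an Amazon ML data schema with 'quality' as a categorical target.
features = ["fixed acidity", "volatile acidity", "citric acid",
            "residual sugar", "chlorides", "free sulfur dioxide",
            "total sulfur dioxide", "density", "pH", "sulphates", "alcohol"]

schema = {
    "version": "1.0",
    "targetFieldName": "quality",
    "dataFormat": "CSV",
    "dataFileContainsHeader": True,
    "attributes": (
        [{"fieldName": f, "fieldType": "NUMERIC"} for f in features]
        + [{"fieldName": "quality", "fieldType": "CATEGORICAL"}]
    ),
}

with open("winequality-red-multiclass.csv.schema", "w") as fh:
    json.dump(schema, fh, indent=2)
```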

After finishing the schema, select the target to predict, which is the ‘quality’ column.  Continue through the rest of the wizard and accept all the default values.  Return to the machine learning dashboard and wait for the model to complete and get evaluated.

How did we do?

Let’s first look at our source dataset prior to the 70/30 split just to get an idea of our data distribution.  82% of the wines are rated a 2 (average), 14% are rated a 3 (good), and 4% are rated a 1 (bad).  By default, the ML wizard is going to randomly split this data 70/30 to use for training and performance evaluation.

Datasource histogram prior to train/test split

If all went well, we should have a completed model and evaluation of that model with no failures in our dashboard.  Let’s take a look at our evaluation summary and see how we did.

Evaluation summary

Looking at our evaluation summary, things look pretty good.  We have a summary saying our ML model’s quality was measured with an average F1 score of 0.421.  That summary is somewhat vague, so let’s click “explore model performance”.

Model performance: confusion matrix

Multiclass prediction accuracy can be measured by treating each class as its own binary prediction problem.  How well did we predict good wines (3), average wines (2), and bad wines (1)?  Each category’s precision and recall are combined into a per-class F1 score, and those scores are averaged into the overall F1 score, where a higher F1 score is better than a lower score.

A visualization of this accuracy is displayed above in a confusion matrix.  The matrix makes it easy to see where the model is successful and where it is not so successful.  The darker the blue, the more accurate the correct predictions; the darker the orange/red, the more inaccurate the predictions.

Looking at our accuracy above, it seems we are good at predicting average wine (2) – our accuracy is almost 90% and we predicted 361/404 wines correctly as average (2).  However, the model was not so good at predicting good (3) and bad (1) wines.  We only correctly predicted 13/47 (28%) as good and only correctly predicted 2/26 (8%) as bad (1).  Our model is good at predicting the easy (2) category but not so good at predicting the more difficult and less common (1) and (3) categories.
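To make the F1 math concrete, here’s a rough back-of-the-envelope version of the calculation.  The diagonal counts and row totals below come from our evaluation; the off-diagonal counts are hypothetical fill-ins since they aren’t listed here, so treat the result as illustrative:

```python
# Confusion matrix: rows = actual class, columns = predicted class (1, 2, 3).
# Diagonals and row totals match our evaluation; off-diagonals are made up.
conf = [
    [2, 20, 4],     # actual 1 (bad): 26 total, 2 correct
    [10, 361, 33],  # actual 2 (average): 404 total, 361 correct
    [0, 34, 13],    # actual 3 (good): 47 total, 13 correct
]

def f1(k):
    tp = conf[k][k]
    predicted = sum(row[k] for row in conf)  # column sum
    actual = sum(conf[k])                    # row sum
    precision = tp / predicted if predicted else 0.0
    recall = tp / actual if actual else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

macro_f1 = sum(f1(k) for k in range(3)) / 3
print(round(macro_f1, 3))  # ~0.42, in the neighborhood of our reported 0.421
```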

Can we do better?

Our model is disappointing; we could have done almost as well if we just predicted that every wine was average (2).  We want to train a model that is better at predicting good (3) wines and bad (1) wines.  But how?  We could use a few ML techniques to get a better model:

Collect more data – More data would help train our model better, especially having more samples of good (3) and bad (1) wines.  In our case this isn’t possible since we are working with a static public dataset.

Remove some features – Our algorithm may have an easier time evaluating fewer features.  We have 11 features (pH, sulphates, alcohol, etc.) and we really aren’t sure if all 11 features have a direct impact on wine quality.  Removing some features could make for a better model but would require some trial and error.  We’ll skip this for now.

Use a more powerful algorithm – Amazon uses multinomial logistic regression (multinomial logistic loss + SGD) for multiclass classification problems.  This may not be the best algorithm for our wine dataset; there may be a more powerful algorithm that works better.  However, this isn’t an option when using Amazon Machine Learning – we’d have to look at a more powerful tool like Amazon SageMaker if we wanted to experiment with different algorithms.  (A quick local experiment is sketched after this list.)

Data wrangling – Feeding raw data to the Amazon Machine Learning service untouched is easy, but if we aren’t getting good results we will need to perform some pre-processing to get a better model.  Some features have wide ranges and some do not – for example, the citric acid values range from 0-1 while the total sulfur dioxide values range from 6-289.  So some feature scaling might be a good idea (a sketch follows this list).

Also, the default random 70/30 training and testing data split may not be the greatest way to train and test our model.  We may want to use a more powerful method to split the folds of the dataset ourselves rather than letting Amazon randomly split it.  Running a stratified shuffle split before uploading the data to Amazon might be helpful (sketched below).

Lastly, Amazon Machine Learning uses a technique called quantile binning for numeric values.  Instead of treating a range of numbers as discrete values, Amazon puts the range of values in “bins” and converts them into categories.  This may work well for non-linear features but may not work as well for features with a direct linear correlation to our quality ratings.  Amazon recommends some experimentation with their data transformation recipes to tweak model performance; the default recipes may not be the best for all problems.  (A rough illustration of binning follows below.)
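On the algorithm point above, we can at least run a quick local experiment outside of Amazon Machine Learning with scikit-learn.  This is a sketch, not Amazon’s implementation (scikit-learn’s SGDClassifier fits logistic loss one-vs-rest rather than true multinomial), and it assumes the relabeled CSV from earlier:

```python
import pandas as pd
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("winequality-red-multiclass.csv")
X, y = df.drop(columns=["quality"]), df["quality"]

# Logistic loss trained with SGD -- roughly the family Amazon ML uses.
# (On older scikit-learn versions the loss is named "log" instead.)
model = make_pipeline(StandardScaler(),
                      SGDClassifier(loss="log_loss", max_iter=1000))
print(cross_val_score(model, X, y, cv=5, scoring="f1_macro").mean())
```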
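For the feature scaling idea, a minimal sketch using scikit-learn’s MinMaxScaler, again assuming the relabeled CSV from earlier:

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

df = pd.read_csv("winequality-red-multiclass.csv")
features = df.columns.drop("quality")

# Rescale every feature to 0-1 so wide-range columns like 'total sulfur
# dioxide' (6-289) don't dwarf narrow ones like 'citric acid' (0-1).
df[features] = MinMaxScaler().fit_transform(df[features])
df.to_csv("winequality-red-scaled.csv", index=False)
```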
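For the stratified split, scikit-learn’s StratifiedShuffleSplit preserves the 1/2/3 class ratios in both halves.  The output file names here are my own choices:

```python
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit

df = pd.read_csv("winequality-red-multiclass.csv")

# One 70/30 split where train and test keep the same class proportions.
splitter = StratifiedShuffleSplit(n_splits=1, test_size=0.3, random_state=42)
train_idx, test_idx = next(splitter.split(df, df["quality"]))

df.iloc[train_idx].to_csv("wine-train.csv", index=False)
df.iloc[test_idx].to_csv("wine-test.csv", index=False)
```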
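And to get a feel for quantile binning, pandas’ qcut does something similar in spirit.  This is an illustration, not Amazon’s exact recipe:

```python
import pandas as pd

df = pd.read_csv("winequality-red-multiclass.csv")

# Cut 'alcohol' into 5 equal-population bins and use the bin id as a category.
df["alcohol_bin"] = pd.qcut(df["alcohol"], q=5, labels=False, duplicates="drop")
print(df[["alcohol", "alcohol_bin"]].head())
```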

Final Thoughts

Machine learning is hard.  And while Amazon’s MLaaS is a powerful tool – it isn’t perfect and it doesn’t do everything.  Data needs to be clean going in and most likely needs some manipulation using various ML techniques if we want to get decent predictions from Amazon algorithms.

Just for fun, I did some data wrangling on the red wine dataset to see if I could get better prediction results.  I manually split the dataset myself using a stratified shuffle split and then ran it through the machine learning wizard using the “custom” option, which allowed me to turn off the AWS 70/30 split.  The results?  With just a bit of work, I improved the prediction accuracy for our good (3) wines to 65% correct and our bad (1) wines to 25% correct, and raised the F1 score to .59 (up from .42; higher is better).

Confusion matrix with stratified shuffling

Thanks for reading!