Algorithms in Amazon SageMaker AI

Amazon SageMaker AI is a fully managed machine learning service provided by AWS that enables developers and data scientists to build, train, and deploy machine learning models at scale.
Amazon SageMaker AI is a cloud-based platform that simplifies the machine learning workflow by providing:
Pre-built algorithms for various ML tasks.
Managed infrastructure for training and deployment.
Integrated tools for data preprocessing, model tuning, and monitoring.
It supports a wide range of built-in machine learning algorithms across the following categories:
Time-Series
SageMaker AI provides algorithms that are tailored to the analysis of time-series data for forecasting product demand, server loads, webpage requests, and more.
DeepAR
The Amazon SageMaker AI DeepAR forecasting algorithm is a supervised learning algorithm for forecasting scalar (one-dimensional) time series using recurrent neural networks (RNN). Classical forecasting methods, such as autoregressive integrated moving average (ARIMA) or exponential smoothing (ETS), fit a single model to each individual time series.
Type: Supervised
Purpose: Forecast scalar (1D) time-series data using RNNs.
Use Cases: Demand forecasting, server load prediction, web traffic estimation.
Key Features:
Learns across multiple related time series.
Can outperform classical methods such as ARIMA and ETS, especially when many related time series are available.
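DeepAR consumes training data as JSON Lines, one time series per line with a start timestamp and the observed target values (plus optional fields such as categorical groupings). A minimal sketch with made-up series, assuming the standard `start`/`target`/`cat` field names from the DeepAR data-format documentation:

```python
import json

# Two short, made-up time series; "cat" optionally tags each series
# with a categorical grouping the model can learn across.
series = [
    {"start": "2024-01-01 00:00:00", "target": [5.0, 7.0, 9.0], "cat": [0]},
    {"start": "2024-01-01 00:00:00", "target": [3.0, 2.0, 4.0], "cat": [1]},
]

def to_jsonlines(records):
    """Serialize records into the one-JSON-object-per-line training layout."""
    return "\n".join(json.dumps(r) for r in records)

lines = to_jsonlines(series)
```

The resulting file is what you would upload to S3 as the training channel.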
Text
SageMaker AI provides algorithms that are tailored to the analysis of textual documents used in natural language processing, document classification or summarization, topic modeling or classification, and language transcription or translation.
BlazingText
BlazingText algorithm provides highly optimized implementations of the Word2vec and text classification algorithms. The Word2vec algorithm is useful for many downstream natural language processing (NLP) tasks, such as sentiment analysis, named entity recognition, machine translation, etc. Text classification is an important task for applications that perform web searches, information retrieval, ranking, and document classification.
Type: Supervised
Purpose: Word embeddings (Word2Vec) and text classification.
Use Cases: Sentiment analysis, document classification, search ranking.
Key Features:
Highly optimized for speed and scalability.
Supports multi-threading and GPU acceleration.
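For its supervised (text classification) mode, BlazingText reads one example per line, with labels marked by the `__label__` prefix followed by space-separated tokens. A small formatting helper (the label name and text here are illustrative):

```python
def to_blazingtext(label, text):
    # One training example per line: "__label__<label>" followed by
    # space-separated, lowercased tokens.
    tokens = text.lower().split()
    return f"__label__{label} " + " ".join(tokens)

line = to_blazingtext("positive", "Great product, fast shipping")
```

In practice you would also separate punctuation from words during tokenization.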
Latent Dirichlet Allocation (LDA)
Latent Dirichlet Allocation (LDA) algorithm is an unsupervised learning algorithm that attempts to describe a set of observations as a mixture of distinct categories. LDA is most commonly used to discover a user-specified number of topics shared by documents within a text corpus. Here each observation is a document, the features are the presence (or occurrence count) of each word, and the categories are the topics. Since the method is unsupervised, the topics are not specified up front, and are not guaranteed to align with how a human may naturally categorize documents. The topics are learned as a probability distribution over the words that occur in each document. Each document, in turn, is described as a mixture of topics.
Type: Unsupervised
Purpose: Topic modeling.
Use Cases: Discovering themes in document corpora.
Key Features:
Learns topics as distributions over words.
CPU-only, single-instance training.
NTM
NTM is an unsupervised learning algorithm that is used to organize a corpus of documents into topics that contain word groupings based on their statistical distribution. Documents that contain frequent occurrences of words such as "bike", "car", "train", "mileage", and "speed" are likely to share a topic on "transportation", for example. Topic modeling provides a way to visualize the contents of a large document corpus in terms of the learned topics.
Although you can use both the Amazon SageMaker AI NTM and LDA algorithms for topic modeling, they are distinct algorithms and can be expected to produce different results on the same input data. From a practicality standpoint regarding hardware and compute power, SageMaker AI NTM is more flexible than LDA and can scale better: NTM can run on CPU and GPU and can be parallelized across multiple GPU instances, whereas LDA only supports single-instance CPU training.
Type: Unsupervised
Purpose: Topic modeling using neural networks.
Use Cases: Visualizing document clusters by topic.
Key Features:
Scales better than LDA.
Supports GPU and multi-instance training.
Object2Vec
The Object2Vec algorithm is a general-purpose neural embedding algorithm that is highly customizable. It can learn low-dimensional dense embeddings of high-dimensional objects. The embeddings are learned in a way that preserves the semantics of the relationship between pairs of objects in the original space in the embedding space. You can use the learned embeddings to efficiently compute nearest neighbors of objects and to visualize natural clusters of related objects in low-dimensional space, for example. You can also use the embeddings as features of the corresponding objects in downstream supervised tasks, such as classification or regression. Object2Vec generalizes the well-known Word2Vec embedding technique for words that is optimized in the SageMaker AI BlazingText algorithm.
Type: Supervised
Purpose: Learn embeddings for high-dimensional objects.
Use Cases: Similarity search, clustering, feature engineering.
Key Features:
Generalizes Word2Vec for arbitrary objects.
Useful for downstream classification/regression.
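Once embeddings are learned, nearest neighbors are typically found with a similarity measure such as cosine similarity. A stdlib-only sketch with made-up 3-dimensional embeddings:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(y * y for y in b)))

# Made-up 3-D embeddings for two closely related products.
emb_a = [0.9, 0.1, 0.2]
emb_b = [0.8, 0.2, 0.25]
sim = cosine_similarity(emb_a, emb_b)
```

A value near 1.0 means the objects sit close together in the embedding space.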
Sequence to Sequence
Sequence to Sequence is a supervised learning algorithm where the input is a sequence of tokens (for example, text, audio) and the output generated is another sequence of tokens. Example applications include: machine translation (input a sentence from one language and predict what that sentence would be in another language), text summarization (input a longer string of words and predict a shorter string of words that is a summary), speech-to-text (audio clips converted into output sentences in tokens). Recently, problems in this domain have been successfully modeled with deep neural networks that show a significant performance boost over previous methodologies. Amazon SageMaker AI seq2seq uses Recurrent Neural Networks (RNNs) and Convolutional Neural Network (CNN) models with attention as encoder-decoder architectures.
Type: Supervised
Purpose: Map input sequences to output sequences.
Use Cases: Machine translation, summarization, speech-to-text.
Key Features:
Uses RNNs and CNNs with attention mechanisms.
Encoder-decoder architecture.
Text Classification TensorFlow
Text Classification - TensorFlow algorithm is a supervised learning algorithm that supports transfer learning with many pretrained models from the TensorFlow Hub. Use transfer learning to fine-tune one of the available pretrained models on your own dataset, even if a large amount of text data is not available. The text classification algorithm takes a text string as input and outputs a probability for each of the class labels. Training datasets must be in CSV format.
Type: Supervised
Purpose: Classify text using pretrained models.
Use Cases: Spam detection, sentiment analysis.
Key Features:
Transfer learning via TensorFlow Hub.
Requires CSV input format.
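A hedged sketch of preparing such a CSV training file with Python's csv module; this assumes one example per row with the integer class label in the first column and the raw text in the second (verify the column layout against the algorithm's current data-format documentation):

```python
import csv
import io

# Made-up examples: label 0 = spam, label 1 = not spam.
rows = [
    (0, "Win a free prize now"),
    (1, "Meeting moved to 3pm"),
]

buf = io.StringIO()
csv.writer(buf).writerows(rows)  # quotes fields only when necessary
csv_text = buf.getvalue()
```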
Tabular
AutoGluon-Tabular
AutoGluon-Tabular is a popular open-source AutoML framework that trains highly accurate machine learning models on an unprocessed tabular dataset. Unlike existing AutoML frameworks that primarily focus on model and hyperparameter selection, AutoGluon-Tabular succeeds by ensembling multiple models and stacking them in multiple layers.
Type: AutoML (Supervised)
Purpose: Automatically train and ensemble models.
Use Cases: Predictive modeling on structured data.
Key Features:
Stacks multiple models.
Minimal tuning required.
CatBoost
CatBoost is a popular and high-performance open-source implementation of the Gradient Boosting Decision Tree (GBDT) algorithm. GBDT is a supervised learning algorithm that attempts to accurately predict a target variable by combining an ensemble of estimates from a set of simpler and weaker models.
CatBoost introduces two critical algorithmic advances to GBDT: ordered boosting (a permutation-driven alternative to the classic boosting procedure) and an innovative algorithm for processing categorical features.
SageMaker AI CatBoost currently trains using CPUs only. CatBoost is a memory-bound (as opposed to compute-bound) algorithm.
Type: Supervised (GBDT)
Purpose: Classification and regression.
Use Cases: Credit scoring, churn prediction.
Key Features:
Handles categorical features natively.
CPU-only, memory-bound.
Factorization Machines
The Factorization Machines algorithm is a general-purpose supervised learning algorithm that you can use for both classification and regression tasks. It is an extension of a linear model that is designed to capture interactions between features within high dimensional sparse datasets economically. For example, in a click prediction system, the Factorization Machines model can capture click rate patterns observed when ads from a certain ad-category are placed on pages from a certain page-category. Factorization machines are a good choice for tasks dealing with high dimensional sparse datasets, such as click prediction and item recommendation.
Type: Supervised
Purpose: Capture feature interactions in sparse data.
Use Cases: Click prediction, recommendation systems.
Key Features:
Efficient for high-dimensional sparse datasets.
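The degree-2 factorization machine model underlying the algorithm scores an example as w0 + Σᵢ wᵢxᵢ + Σᵢ<ⱼ ⟨vᵢ, vⱼ⟩ xᵢxⱼ, where each feature i gets a low-dimensional factor vector vᵢ. A direct, unoptimized sketch with toy weights:

```python
def fm_predict(x, w0, w, V):
    """Degree-2 factorization machine score:
    w0 + sum_i w[i]*x[i] + sum_{i<j} dot(V[i], V[j]) * x[i] * x[j]."""
    n = len(x)
    score = w0 + sum(w[i] * x[i] for i in range(n))
    for i in range(n):
        for j in range(i + 1, n):
            dot = sum(vi * vj for vi, vj in zip(V[i], V[j]))
            score += dot * x[i] * x[j]
    return score

# Toy example: 3 sparse binary features, 2-dimensional factor vectors.
x = [1.0, 0.0, 1.0]
score = fm_predict(x, w0=0.1, w=[0.2, -0.1, 0.3],
                   V=[[0.5, 0.1], [0.0, 0.4], [0.2, 0.3]])
```

The factor vectors let the model estimate an interaction weight for every feature pair, even pairs rarely seen together, which is why it works well on sparse data.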
k-nearest neighbors (k-NN)
k-nearest neighbors (k-NN) algorithm is an index-based algorithm. It uses a non-parametric method for classification or regression. For classification problems, the algorithm queries the k points that are closest to the sample point and returns the most frequently used label of their class as the predicted label. For regression problems, the algorithm queries the k closest points to the sample point and returns the average of their feature values as the predicted value.
Training with the k-NN algorithm has three steps: sampling, dimension reduction, and index building. Sampling reduces the size of the initial dataset so that it fits into memory. For dimension reduction, the algorithm decreases the feature dimension of the data to reduce the footprint of the k-NN model in memory and inference latency. Two methods of dimension reduction are provided: random projection and the fast Johnson-Lindenstrauss transform. Typically, you use dimension reduction for high-dimensional (d > 1000) datasets to avoid the "curse of dimensionality" that troubles the statistical analysis of data that becomes sparse as dimensionality increases. The main objective of k-NN's training is to construct the index. The index enables efficient lookups of distances between points whose values or class labels have not yet been determined and the k nearest points to use for inference.
Type: Supervised
Purpose: Classification and regression via similarity.
Use Cases: Recommendation systems, anomaly detection.
Key Features:
Index-based lookup.
Includes sampling and dimensionality reduction.
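The classification rule described above, the majority label among the k nearest training points by Euclidean distance, fits in a few lines of plain Python (toy data; no sampling, dimension reduction, or index building):

```python
import math
from collections import Counter

def knn_classify(train, query, k):
    """Return the majority label among the k training points
    closest to `query` under Euclidean distance."""
    nearest = sorted(train, key=lambda p: math.dist(p[0], query))[:k]
    labels = [label for _, label in nearest]
    return Counter(labels).most_common(1)[0][0]

# Toy training set: two clusters with labels "a" and "b".
train = [((0, 0), "a"), ((0, 1), "a"), ((5, 5), "b"), ((6, 5), "b")]
pred = knn_classify(train, (0.5, 0.5), k=3)
```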
LightGBM
LightGBM is a popular and efficient open-source implementation of the Gradient Boosting Decision Tree (GBDT) algorithm. GBDT is a supervised learning algorithm that attempts to accurately predict a target variable by combining an ensemble of estimates from a set of simpler and weaker models. LightGBM uses additional techniques to significantly improve the efficiency and scalability of conventional GBDT.
Type: Supervised (GBDT)
Purpose: Classification and regression.
Use Cases: Tabular modeling, ranking.
Key Features:
Efficient and scalable.
Supports large datasets.
Linear learner algorithm
The Amazon SageMaker AI linear learner algorithm provides a solution for both classification and regression problems. The linear learner algorithm supports both recordIO-wrapped protobuf and CSV formats.
Type: Supervised
Purpose: Linear models for classification/regression.
Use Cases: Binary classification, regression tasks.
Key Features:
Fast training.
Supports CSV and RecordIO formats.
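Conceptually, a binary linear learner scores an example as sigmoid(w·x + b) and thresholds the probability. A toy sketch with made-up weights (this illustrates the model form only, not SageMaker's training internals):

```python
import math

def linear_learner_predict(x, w, b):
    """Probability = sigmoid(w.x + b); predict class 1 when the
    probability is at least 0.5 (conceptual sketch, toy weights)."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    prob = 1.0 / (1.0 + math.exp(-z))
    return prob, int(prob >= 0.5)

prob, label = linear_learner_predict([1.0, 2.0], w=[0.5, -0.25], b=0.0)
```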
TabTransformer
TabTransformer is a novel deep tabular data modeling architecture for supervised learning. The TabTransformer architecture is built on self-attention-based Transformers. The Transformer layers transform the embeddings of categorical features into robust contextual embeddings to achieve higher prediction accuracy. Furthermore, the contextual embeddings learned from TabTransformer are highly robust against both missing and noisy data features, and provide better interpretability.
Type: Supervised
Purpose: Deep learning for tabular data.
Use Cases: Predictive modeling with categorical features.
Key Features:
Uses Transformer architecture.
Robust to missing/noisy data.
XGBoost
XGBoost (eXtreme Gradient Boosting) is a popular and efficient open-source implementation of the gradient boosted trees algorithm. Gradient boosting is a supervised learning algorithm that tries to accurately predict a target variable by combining multiple estimates from a set of simpler models. The XGBoost algorithm performs well in machine learning competitions because it robustly handles a variety of data types, relationships, and distributions, and because of the large number of hyperparameters that can be fine-tuned.
You can use XGBoost for regression, classification (binary and multiclass), and ranking problems.
Type: Supervised (GBDT)
Purpose: Classification, regression, ranking.
Use Cases: ML competitions, structured data modeling.
Key Features:
Highly tunable.
Handles various data types and distributions.
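The core idea behind gradient boosted trees, where each new weak learner fits the residuals left by the ensemble so far, can be sketched for squared loss with one-feature decision stumps (a toy illustration, not XGBoost's actual implementation):

```python
def fit_stump(xs, residuals):
    """Best single-split stump on one feature: pick the threshold that
    minimizes squared error, predicting the mean residual on each side."""
    best = None
    for t in xs:
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        lm = sum(left) / len(left) if left else 0.0
        rm = sum(right) / len(right) if right else 0.0
        err = (sum((r - lm) ** 2 for r in left) +
               sum((r - rm) ** 2 for r in right))
        if best is None or err < best[0]:
            best = (err, t, lm, rm)
    _, t, lm, rm = best
    return lambda x: lm if x <= t else rm

def boost(xs, ys, rounds, lr=0.5):
    """Each round fits a stump to the current residuals and adds its
    (learning-rate-scaled) predictions to the ensemble."""
    pred = [0.0] * len(xs)
    for _ in range(rounds):
        residuals = [y - p for y, p in zip(ys, pred)]
        stump = fit_stump(xs, residuals)
        pred = [p + lr * stump(x) for p, x in zip(pred, xs)]
    return pred

xs = [1, 2, 3, 4]
ys = [1.0, 1.0, 3.0, 3.0]
pred = boost(xs, ys, rounds=20)
mse = sum((y - p) ** 2 for y, p in zip(ys, pred)) / len(ys)
```

Each round shrinks the residuals, so the ensemble's error drops toward zero on the training data.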
Unsupervised
IP insights
Amazon SageMaker AI IP Insights is an unsupervised learning algorithm that learns the usage patterns for IPv4 addresses. It is designed to capture associations between IPv4 addresses and various entities, such as user IDs or account numbers. You can use it to identify a user attempting to log into a web service from an anomalous IP address, for example. Or you can use it to identify an account that is attempting to create computing resources from an unusual IP address. Trained IP Insight models can be hosted at an endpoint for making real-time predictions or used for processing batch transforms.
SageMaker AI IP insights ingests historical data as (entity, IPv4 Address) pairs and learns the IP usage patterns of each entity. When queried with an (entity, IPv4 Address) event, a SageMaker AI IP Insights model returns a score that infers how anomalous the pattern of the event is. For example, when a user attempts to log in from an IP address, if the IP Insights score is high enough, a web login server might decide to trigger a multi-factor authentication system. In more advanced solutions, you can feed the IP Insights score into another machine learning model. For example, you can combine the IP Insight score with other features to rank the findings of another security system, such as those from Amazon GuardDuty.
The SageMaker AI IP Insights algorithm can also learn vector representations of IP addresses, known as embeddings. You can use vector-encoded embeddings as features in downstream machine learning tasks that use the information observed in the IP addresses. For example, you can use them in tasks such as measuring similarities between IP addresses in clustering and visualization tasks.
Type: Unsupervised
Purpose: Detect anomalous IP usage patterns.
Use Cases: Fraud detection, security monitoring.
Key Features:
Learns entity-IP associations.
Outputs anomaly scores.
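Conceptually, the model learns entity and IP embeddings whose dot product is high for pairings it has seen, so an unfamiliar (entity, IP) pair yields a high anomaly score. A toy sketch with made-up 2-dimensional embeddings (not the actual model internals):

```python
def anomaly_score(entity_emb, ip_emb):
    """Negated dot product of the entity and IP embeddings, so a
    HIGHER score means a LESS familiar (more anomalous) pairing."""
    return -sum(e * i for e, i in zip(entity_emb, ip_emb))

# Made-up 2-D embeddings: home_ip is an address this user logs in
# from daily; foreign_ip has never been associated with the user.
user = [0.8, 0.3]
home_ip = [0.7, 0.4]
foreign_ip = [-0.6, 0.1]
```

A downstream system would compare such scores against a threshold, for example to decide when to trigger multi-factor authentication.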
K-means
K-means is an unsupervised learning algorithm. It attempts to find discrete groupings within data, where members of a group are as similar as possible to one another and as different as possible from members of other groups. You define the attributes that you want the algorithm to use to determine similarity. The k-means algorithm expects tabular data, where rows represent the observations that you want to cluster, and the columns represent attributes of the observations. The n attributes in each row represent a point in n-dimensional space. The Euclidean distance between these points represents the similarity of the corresponding observations.
Type: Unsupervised
Purpose: Clustering.
Use Cases: Customer segmentation, pattern discovery.
Key Features:
Uses Euclidean distance.
Requires tabular data.
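The clustering loop described above, often called Lloyd's algorithm, alternates between assigning each point to its nearest centroid (Euclidean distance) and moving each centroid to the mean of its assigned points:

```python
import math

def kmeans(points, centroids, iters=10):
    """Lloyd's algorithm on tuples of coordinates: assign, then update."""
    for _ in range(iters):
        clusters = [[] for _ in centroids]
        for p in points:
            idx = min(range(len(centroids)),
                      key=lambda i: math.dist(p, centroids[i]))
            clusters[idx].append(p)
        centroids = [
            tuple(sum(c) / len(c) for c in zip(*cl)) if cl else centroids[i]
            for i, cl in enumerate(clusters)
        ]
    return centroids

# Two well-separated toy clusters.
points = [(0, 0), (0, 1), (1, 0), (9, 9), (9, 10), (10, 9)]
centers = kmeans(points, centroids=[(0.0, 0.0), (5.0, 5.0)])
```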
PCA
PCA is an unsupervised machine learning algorithm that attempts to reduce the dimensionality (number of features) within a dataset while still retaining as much information as possible. This is done by finding a new set of features called components, which are composites of the original features that are uncorrelated with one another. They are also constrained so that the first component accounts for the largest possible variability in the data, the second component the second most variability, and so on. In Amazon SageMaker AI, PCA operates in two modes, depending on the scenario:
regular: For datasets with sparse data and a moderate number of observations and features.
randomized: For datasets with both a large number of observations and features. This mode uses an approximation algorithm.
PCA uses tabular data.
Type: Unsupervised
Purpose: Dimensionality reduction.
Use Cases: Visualization, preprocessing.
Key Features:
Regular and randomized modes.
Works on tabular data.
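A minimal illustration of extracting the first component: power iteration on the covariance matrix of mean-centered 2-D data (for exposition only; this is not how SageMaker computes PCA at scale):

```python
import math

def first_component(data, iters=200):
    """First principal component of 2-D points via power iteration
    on the covariance matrix of the mean-centered data."""
    n = len(data)
    mx = sum(p[0] for p in data) / n
    my = sum(p[1] for p in data) / n
    centered = [(x - mx, y - my) for x, y in data]
    # Entries of the 2x2 covariance matrix.
    cxx = sum(x * x for x, _ in centered) / n
    cyy = sum(y * y for _, y in centered) / n
    cxy = sum(x * y for x, y in centered) / n
    v = (1.0, 0.0)
    for _ in range(iters):
        w = (cxx * v[0] + cxy * v[1], cxy * v[0] + cyy * v[1])
        norm = math.hypot(*w)
        v = (w[0] / norm, w[1] / norm)
    return v

# Points spread along the y = x diagonal: the first component should
# point along (1/sqrt(2), 1/sqrt(2)) up to sign.
pc = first_component([(0, 0), (1, 1.1), (2, 1.9), (3, 3.05)])
```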
Random Cut Forest (RCF)
Amazon SageMaker AI Random Cut Forest (RCF) is an unsupervised algorithm for detecting anomalous data points within a data set. These are observations which diverge from otherwise well-structured or patterned data. Anomalies can manifest as unexpected spikes in time series data, breaks in periodicity, or unclassifiable data points. With each data point, RCF associates an anomaly score. Low score values indicate that the data point is considered "normal." High values indicate the presence of an anomaly in the data. The definitions of "low" and "high" depend on the application, but common practice suggests that scores beyond three standard deviations from the mean score are considered anomalous.
Type: Unsupervised
Purpose: Anomaly detection.
Use Cases: Detecting outliers in time-series or structured data.
Key Features:
Assigns anomaly scores.
Suitable for streaming data.
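The three-standard-deviations rule of thumb mentioned above, applied to a list of (synthetic) anomaly scores:

```python
import math

def flag_anomalies(scores):
    """Return the indices of scores more than three standard
    deviations above the mean anomaly score."""
    n = len(scores)
    mean = sum(scores) / n
    std = math.sqrt(sum((s - mean) ** 2 for s in scores) / n)
    cutoff = mean + 3 * std
    return [i for i, s in enumerate(scores) if s > cutoff]

# Thirty near-constant "normal" scores plus one clear spike at index 30.
normal = [1.0 + 0.01 * (i % 5) for i in range(30)]
scores = normal + [8.0]
outliers = flag_anomalies(scores)
```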
Vision
Image classification-MXNet
The Amazon SageMaker image classification algorithm is a supervised learning algorithm that supports multi-label classification. It takes an image as input and outputs one or more labels assigned to that image. It uses a convolutional neural network that can be trained from scratch or trained using transfer learning when a large number of training images are not available. Image classification in Amazon SageMaker AI can be run in two modes: full training and transfer learning. In full training mode, the network is initialized with random weights and trained on user data from scratch. In transfer learning mode, the network is initialized with pre-trained weights and just the top fully connected layer is initialized with random weights. Then, the whole network is fine-tuned with new data. In this mode, training can be achieved even with a smaller dataset. This is because the network is already trained and therefore can be used in cases without sufficient training data.
Type: Supervised
Purpose: Multi-label image classification.
Use Cases: Object recognition, medical imaging.
Key Features:
Supports full training and transfer learning.
Uses CNNs.
Image Classification - TensorFlow
The Amazon SageMaker Image Classification - TensorFlow algorithm is a supervised learning algorithm that supports transfer learning with many pretrained models from the TensorFlow Hub. Use transfer learning to fine-tune one of the available pretrained models on your own dataset, even if a large amount of image data is not available. The image classification algorithm takes an image as input and outputs a probability for each provided class label.
Type: Supervised
Purpose: Image classification using pretrained models.
Use Cases: Visual recognition with limited data.
Key Features:
Transfer learning via TensorFlow Hub.
Outputs class probabilities.
Object Detection - MXNet
The Amazon SageMaker AI Object Detection - MXNet algorithm detects and classifies objects in images using a single deep neural network. It is a supervised learning algorithm that takes images as input and identifies all instances of objects within the image scene. The object is categorized into one of the classes in a specified collection with a confidence score that it belongs to the class. Its location and scale in the image are indicated by a rectangular bounding box. It uses the Single Shot multibox Detector (SSD) framework and supports two base networks: VGG and ResNet. The network can be trained from scratch, or trained with models that have been pre-trained on the ImageNet dataset.
Type: Supervised
Purpose: Detect and classify objects in images.
Use Cases: Surveillance, autonomous vehicles.
Key Features:
SSD framework with VGG/ResNet.
Outputs bounding boxes and class scores.
Object Detection - TensorFlow
The Amazon SageMaker AI Object Detection - TensorFlow algorithm is a supervised learning algorithm that supports transfer learning with many pretrained models from the TensorFlow Model Garden. Use transfer learning to fine-tune one of the available pretrained models on your own dataset, even if a large amount of image data is not available. The object detection algorithm takes an image as input and outputs a list of bounding boxes.
Type: Supervised
Purpose: Object detection using pretrained models.
Use Cases: Retail analytics, robotics.
Key Features:
Transfer learning via TensorFlow Model Garden.
Outputs bounding boxes.
Semantic segmentation
The SageMaker AI semantic segmentation algorithm provides a fine-grained, pixel-level approach to developing computer vision applications. It tags every pixel in an image with a class label from a predefined set of classes. Tagging is fundamental for understanding scenes, which is critical to an increasing number of computer vision applications, such as self-driving vehicles, medical imaging diagnostics, and robot sensing.
Type: Supervised
Purpose: Pixel-level image classification.
Use Cases: Medical diagnostics, autonomous driving.
Key Features:
Tags each pixel with a class label.
Enables fine-grained scene understanding.
Scenario-to-Algorithm Mapping Table
| Scenario | Algorithm | Type | Example Use Case |
|---|---|---|---|
| Forecasting product demand | DeepAR | Supervised (Time-Series) | Predicting weekly sales |
| Sentiment analysis | BlazingText | Supervised (NLP) | Classifying tweets as positive/negative |
| Topic modeling in documents | LDA / NTM | Unsupervised (NLP) | Discovering themes in news articles |
| Object similarity search | Object2Vec | Supervised (Embedding) | Recommending similar products |
| Machine translation | Sequence-to-Sequence | Supervised (NLP) | Translating English to French |
| Text classification with limited data | Text Classification (TensorFlow) | Supervised (Transfer Learning) | Spam detection in emails |
| Predicting customer churn | AutoGluon-Tabular | AutoML (Tabular) | Churn prediction from customer data |
| Click prediction in sparse data | Factorization Machines | Supervised (Tabular) | Ad click-through rate prediction |
| Fraud detection via IP patterns | IP Insights | Unsupervised | Detecting login anomalies |
| Customer segmentation | K-Means | Unsupervised | Grouping users by behavior |
| Dimensionality reduction | PCA | Unsupervised | Visualizing high-dimensional data |
| Anomaly detection in logs | Random Cut Forest | Unsupervised | Detecting unusual spikes in server logs |
| Image classification | Image Classification (MXNet / TensorFlow) | Supervised (Vision) | Identifying dog breeds |
| Object detection in images | Object Detection (MXNet / TensorFlow) | Supervised (Vision) | Detecting cars in traffic footage |
| Scene understanding | Semantic Segmentation | Supervised (Vision) | Medical image diagnostics |
| Tabular classification with categorical features | TabTransformer | Supervised (Tabular) | Predicting loan defaults |
| Binary classification | Linear Learner | Supervised | Predicting if a transaction is fraudulent |
| High-performance tabular modeling | CatBoost / LightGBM / XGBoost | Supervised | Credit scoring, sales prediction |
| Nearest neighbor search | k-NN | Supervised | Recommending similar users |
The content above is derived from https://docs.aws.amazon.com/sagemaker/latest/dg/whatis.html




