Algorithms in Amazon SageMaker AI

Amazon SageMaker AI is a fully managed machine learning service provided by AWS that enables developers and data scientists to build, train, and deploy machine learning models at scale.
Amazon SageMaker AI is a cloud-based platform that simplifies the machine learning workflow by providing:
Pre-built algorithms for various ML tasks.
Managed infrastructure for training and deployment.
Integrated tools for data preprocessing, model tuning, and monitoring.
It supports a wide range of built-in machine learning algorithms across the following categories:
Time-Series
SageMaker AI provides algorithms that are tailored to the analysis of time-series data for forecasting product demand, server loads, webpage requests, and more.
DeepAR
The Amazon SageMaker AI DeepAR forecasting algorithm is a supervised learning algorithm for forecasting scalar (one-dimensional) time series using recurrent neural networks (RNN). Classical forecasting methods, such as autoregressive integrated moving average (ARIMA) or exponential smoothing (ETS), fit a single model to each individual time series.
Type: Supervised
Purpose: Forecast scalar (1D) time-series data using RNNs.
Use Cases: Demand forecasting, server load prediction, web traffic estimation.
Key Features:
Learns across multiple related time series.
Can outperform classical methods such as ARIMA and ETS, especially when many related time series are available.
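DeepAR consumes training data as JSON Lines, one time series per line with a start timestamp and the observed target values (plus optional fields such as categorical groupings). A minimal sketch with made-up series, assuming the standard `start`/`target`/`cat` field names from the DeepAR data-format documentation:

```python
import json

# Two short, made-up time series; "cat" optionally tags each series
# with a categorical grouping the model can learn across.
series = [
    {"start": "2024-01-01 00:00:00", "target": [5.0, 7.0, 9.0], "cat": [0]},
    {"start": "2024-01-01 00:00:00", "target": [3.0, 2.0, 4.0], "cat": [1]},
]

def to_jsonlines(records):
    """Serialize records into the one-JSON-object-per-line training layout."""
    return "\n".join(json.dumps(r) for r in records)

lines = to_jsonlines(series)
```

The resulting file is what you would upload to S3 as the training channel.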
Text
SageMaker AI provides algorithms that are tailored to the analysis of textual documents used in natural language processing, document classification or summarization, topic modeling or classification, and language transcription or translation.
BlazingText
BlazingText algorithm provides highly optimized implementations of the Word2vec and text classification algorithms. The Word2vec algorithm is useful for many downstream natural language processing (NLP) tasks, such as sentiment analysis, named entity recognition, machine translation, etc. Text classification is an important task for applications that perform web searches, information retrieval, ranking, and document classification.
Type: Supervised
Purpose: Word embeddings (Word2Vec) and text classification.
Use Cases: Sentiment analysis, document classification, search ranking.
Key Features:
Highly optimized for speed and scalability.
Supports multi-threading and GPU acceleration.
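For its supervised (text classification) mode, BlazingText reads one example per line, with labels marked by the `__label__` prefix followed by space-separated tokens. A small formatting helper (the label name and text here are illustrative):

```python
def to_blazingtext(label, text):
    # One training example per line: "__label__<label>" followed by
    # space-separated, lowercased tokens.
    tokens = text.lower().split()
    return f"__label__{label} " + " ".join(tokens)

line = to_blazingtext("positive", "Great product, fast shipping")
```

In practice you would also separate punctuation from words during tokenization.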
Latent Dirichlet Allocation (LDA)
Latent Dirichlet Allocation (LDA) algorithm is an unsupervised learning algorithm that attempts to describe a set of observations as a mixture of distinct categories. LDA is most commonly used to discover a user-specified number of topics shared by documents within a text corpus. Here each observation is a document, the features are the presence (or occurrence count) of each word, and the categories are the topics. Since the method is unsupervised, the topics are not specified up front, and are not guaranteed to align with how a human may naturally categorize documents. The topics are learned as a probability distribution over the words that occur in each document. Each document, in turn, is described as a mixture of topics.
Type: Unsupervised
Purpose: Topic modeling.
Use Cases: Discovering themes in document corpora.
Key Features:
Learns topics as distributions over words.
CPU-only, single-instance training.
NTM
NTM is an unsupervised learning algorithm that is used to organize a corpus of documents into topics that contain word groupings based on their statistical distribution. Documents that contain frequent occurrences of words such as "bike", "car", "train", "mileage", and "speed" are likely to share a topic on "transportation", for example. Topic modeling provides a way to visualize the contents of a large document corpus in terms of the learned topics.
Although you can use both the Amazon SageMaker AI NTM and LDA algorithms for topic modeling, they are distinct algorithms and can be expected to produce different results on the same input data. From a practicality standpoint regarding hardware and compute power, SageMaker AI NTM is more flexible than LDA and can scale better: NTM can run on CPU and GPU and can be parallelized across multiple GPU instances, whereas LDA only supports single-instance CPU training.
Type: Unsupervised
Purpose: Topic modeling using neural networks.
Use Cases: Visualizing document clusters by topic.
Key Features:
Scales better than LDA.
Supports GPU and multi-instance training.
Object2Vec
The Object2Vec algorithm is a general-purpose neural embedding algorithm that is highly customizable. It can learn low-dimensional dense embeddings of high-dimensional objects. The embeddings are learned in a way that preserves the semantics of the relationship between pairs of objects in the original space in the embedding space. You can use the learned embeddings to efficiently compute nearest neighbors of objects and to visualize natural clusters of related objects in low-dimensional space, for example. You can also use the embeddings as features of the corresponding objects in downstream supervised tasks, such as classification or regression. Object2Vec generalizes the well-known Word2Vec embedding technique for words that is optimized in the SageMaker AI BlazingText algorithm.
Type: Supervised
Purpose: Learn embeddings for high-dimensional objects.
Use Cases: Similarity search, clustering, feature engineering.
Key Features:
Generalizes Word2Vec for arbitrary objects.
Useful for downstream classification/regression.
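Once embeddings are learned, nearest neighbors are typically found with a similarity measure such as cosine similarity. A stdlib-only sketch with made-up 3-dimensional embeddings:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(y * y for y in b)))

# Made-up 3-D embeddings for two closely related products.
emb_a = [0.9, 0.1, 0.2]
emb_b = [0.8, 0.2, 0.25]
sim = cosine_similarity(emb_a, emb_b)
```

A value near 1.0 means the objects sit close together in the embedding space.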
Sequence to Sequence
Sequence to Sequence is a supervised learning algorithm where the input is a sequence of tokens (for example, text, audio) and the output generated is another sequence of tokens. Example applications include: machine translation (input a sentence from one language and predict what that sentence would be in another language), text summarization (input a longer string of words and predict a shorter string of words that is a summary), speech-to-text (audio clips converted into output sentences in tokens). Recently, problems in this domain have been successfully modeled with deep neural networks that show a significant performance boost over previous methodologies. Amazon SageMaker AI seq2seq uses Recurrent Neural Networks (RNNs) and Convolutional Neural Network (CNN) models with attention as encoder-decoder architectures.
Type: Supervised
Purpose: Map input sequences to output sequences.
Use Cases: Machine translation, summarization, speech-to-text.
Key Features:
Uses RNNs and CNNs with attention mechanisms.
Encoder-decoder architecture.
Text Classification TensorFlow
Text Classification - TensorFlow algorithm is a supervised learning algorithm that supports transfer learning with many pretrained models from the TensorFlow Hub. Use transfer learning to fine-tune one of the available pretrained models on your own dataset, even if a large amount of text data is not available. The text classification algorithm takes a text string as input and outputs a probability for each of the class labels. Training datasets must be in CSV format.
Type: Supervised
Purpose: Classify text using pretrained models.
Use Cases: Spam detection, sentiment analysis.
Key Features:
Transfer learning via TensorFlow Hub.
Requires CSV input format.
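A hedged sketch of preparing such a CSV training file with Python's csv module; this assumes one example per row with the integer class label in the first column and the raw text in the second (verify the column layout against the algorithm's current data-format documentation):

```python
import csv
import io

# Made-up examples: label 0 = spam, label 1 = not spam.
rows = [
    (0, "Win a free prize now"),
    (1, "Meeting moved to 3pm"),
]

buf = io.StringIO()
csv.writer(buf).writerows(rows)  # quotes fields only when necessary
csv_text = buf.getvalue()
```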
Tabular
AutoGluon-Tabular
AutoGluon-Tabular is a popular open-source AutoML framework that trains highly accurate machine learning models on an unprocessed tabular dataset. Unlike existing AutoML frameworks that primarily focus on model and hyperparameter selection, AutoGluon-Tabular succeeds by ensembling multiple models and stacking them in multiple layers.
Type: AutoML (Supervised)
Purpose: Automatically train and ensemble models.
Use Cases: Predictive modeling on structured data.
Key Features:
Stacks multiple models.
Minimal tuning required.
CatBoost
CatBoost is a popular and high-performance open-source implementation of the Gradient Boosting Decision Tree (GBDT) algorithm. GBDT is a supervised learning algorithm that attempts to accurately predict a target variable by combining an ensemble of estimates from a set of simpler and weaker models.
CatBoost introduces two critical algorithmic advances to GBDT: ordered boosting (a permutation-driven alternative to the classic boosting procedure) and an innovative algorithm for processing categorical features.
SageMaker AI CatBoost currently trains using CPUs only. CatBoost is a memory-bound (as opposed to compute-bound) algorithm.
Type: Supervised (GBDT)
Purpose: Classification and regression.
Use Cases: Credit scoring, churn prediction.
Key Features:
Handles categorical features natively.
CPU-only, memory-bound.
Factorization Machines
The Factorization Machines algorithm is a general-purpose supervised learning algorithm that you can use for both classification and regression tasks. It is an extension of a linear model that is designed to capture interactions between features within high dimensional sparse datasets economically. For example, in a click prediction system, the Factorization Machines model can capture click rate patterns observed when ads from a certain ad-category are placed on pages from a certain page-category. Factorization machines are a good choice for tasks dealing with high dimensional sparse datasets, such as click prediction and item recommendation.
Type: Supervised
Purpose: Capture feature interactions in sparse data.
Use Cases: Click prediction, recommendation systems.
Key Features:
Efficient for high-dimensional sparse datasets.
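The degree-2 factorization machine model underlying the algorithm scores an example as w0 + Σᵢ wᵢxᵢ + Σᵢ<ⱼ ⟨vᵢ, vⱼ⟩ xᵢxⱼ, where each feature i gets a low-dimensional factor vector vᵢ. A direct, unoptimized sketch with toy weights:

```python
def fm_predict(x, w0, w, V):
    """Degree-2 factorization machine score:
    w0 + sum_i w[i]*x[i] + sum_{i<j} dot(V[i], V[j]) * x[i] * x[j]."""
    n = len(x)
    score = w0 + sum(w[i] * x[i] for i in range(n))
    for i in range(n):
        for j in range(i + 1, n):
            dot = sum(vi * vj for vi, vj in zip(V[i], V[j]))
            score += dot * x[i] * x[j]
    return score

# Toy example: 3 sparse binary features, 2-dimensional factor vectors.
x = [1.0, 0.0, 1.0]
score = fm_predict(x, w0=0.1, w=[0.2, -0.1, 0.3],
                   V=[[0.5, 0.1], [0.0, 0.4], [0.2, 0.3]])
```

The factor vectors let the model estimate an interaction weight for every feature pair, even pairs rarely seen together, which is why it works well on sparse data.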
k-nearest neighbors (k-NN)
k-nearest neighbors (k-NN) algorithm is an index-based algorithm. It uses a non-parametric method for classification or regression. For classification problems, the algorithm queries the k points that are closest to the sample point and returns the most frequently used label of their class as the predicted label. For regression problems, the algorithm queries the k closest points to the sample point and returns the average of their feature values as the predicted value.
Training with the k-NN algorithm has three steps: sampling, dimension reduction, and index building. Sampling reduces the size of the initial dataset so that it fits into memory. For dimension reduction, the algorithm decreases the feature dimension of the data to reduce the footprint of the k-NN model in memory and inference latency. Two methods of dimension reduction are provided: random projection and the fast Johnson-Lindenstrauss transform. Typically, you use dimension reduction for high-dimensional (d > 1000) datasets to avoid the "curse of dimensionality" that troubles the statistical analysis of data that becomes sparse as dimensionality increases. The main objective of k-NN's training is to construct the index. The index enables efficient lookups of distances between points whose values or class labels have not yet been determined and the k nearest points to use for inference.
Type: Supervised
Purpose: Classification and regression via similarity.
Use Cases: Recommendation systems, anomaly detection.
Key Features:
Index-based lookup.
Includes sampling and dimensionality reduction.
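The classification rule described above, the majority label among the k nearest training points by Euclidean distance, fits in a few lines of plain Python (toy data; no sampling, dimension reduction, or index building):

```python
import math
from collections import Counter

def knn_classify(train, query, k):
    """Return the majority label among the k training points
    closest to `query` under Euclidean distance."""
    nearest = sorted(train, key=lambda p: math.dist(p[0], query))[:k]
    labels = [label for _, label in nearest]
    return Counter(labels).most_common(1)[0][0]

# Toy training set: two clusters with labels "a" and "b".
train = [((0, 0), "a"), ((0, 1), "a"), ((5, 5), "b"), ((6, 5), "b")]
pred = knn_classify(train, (0.5, 0.5), k=3)
```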
LightGBM
LightGBM is a popular and efficient open-source implementation of the Gradient Boosting Decision Tree (GBDT) algorithm. GBDT is a supervised learning algorithm that attempts to accurately predict a target variable by combining an ensemble of estimates from a set of simpler and weaker models. LightGBM uses additional techniques to significantly improve the efficiency and scalability of conventional GBDT.
Type: Supervised (GBDT)
Purpose: Classification and regression.
Use Cases: Tabular modeling, ranking.
Key Features:
Efficient and scalable.
Supports large datasets.
Linear learner algorithm
The Amazon SageMaker AI linear learner algorithm provides a solution for both classification and regression problems. The linear learner algorithm supports both recordIO-wrapped protobuf and CSV formats.
Type: Supervised
Purpose: Linear models for classification/regression.
Use Cases: Binary classification, regression tasks.
Key Features:
Fast training.
Supports CSV and RecordIO formats.
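Conceptually, a binary linear learner scores an example as sigmoid(w·x + b) and thresholds the probability. A toy sketch with made-up weights (this illustrates the model form only, not SageMaker's training internals):

```python
import math

def linear_learner_predict(x, w, b):
    """Probability = sigmoid(w.x + b); predict class 1 when the
    probability is at least 0.5 (conceptual sketch, toy weights)."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    prob = 1.0 / (1.0 + math.exp(-z))
    return prob, int(prob >= 0.5)

prob, label = linear_learner_predict([1.0, 2.0], w=[0.5, -0.25], b=0.0)
```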
TabTransformer
TabTransformer is a novel deep tabular data modeling architecture for supervised learning. The TabTransformer architecture is built on self-attention-based Transformers. The Transformer layers transform the embeddings of categorical features into robust contextual embeddings to achieve higher prediction accuracy. Furthermore, the contextual embeddings learned from TabTransformer are highly robust against both missing and noisy data features, and provide better interpretability.
Type: Supervised
Purpose: Deep learning for tabular data.
Use Cases: Predictive modeling with categorical features.
Key Features:
Uses Transformer architecture.
Robust to missing/noisy data.
XGBoost
XGBoost (eXtreme Gradient Boosting) is a popular and efficient open-source implementation of the gradient boosted trees algorithm. Gradient boosting is a supervised learning algorithm that tries to accurately predict a target variable by combining multiple estimates from a set of simpler models. The XGBoost algorithm performs well in machine learning competitions because it robustly handles a variety of data types, relationships, and distributions, and because of the large number of hyperparameters that can be fine-tuned.
You can use XGBoost for regression, classification (binary and multiclass), and ranking problems.
Type: Supervised (GBDT)
Purpose: Classification, regression, ranking.
Use Cases: ML competitions, structured data modeling.
Key Features:
Highly tunable.
Handles various data types and distributions.
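The core idea behind gradient boosted trees, where each new weak learner fits the residuals left by the ensemble so far, can be sketched for squared loss with one-feature decision stumps (a toy illustration, not XGBoost's actual implementation):

```python
def fit_stump(xs, residuals):
    """Best single-split stump on one feature: pick the threshold that
    minimizes squared error, predicting the mean residual on each side."""
    best = None
    for t in xs:
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        lm = sum(left) / len(left) if left else 0.0
        rm = sum(right) / len(right) if right else 0.0
        err = (sum((r - lm) ** 2 for r in left) +
               sum((r - rm) ** 2 for r in right))
        if best is None or err < best[0]:
            best = (err, t, lm, rm)
    _, t, lm, rm = best
    return lambda x: lm if x <= t else rm

def boost(xs, ys, rounds, lr=0.5):
    """Each round fits a stump to the current residuals and adds its
    (learning-rate-scaled) predictions to the ensemble."""
    pred = [0.0] * len(xs)
    for _ in range(rounds):
        residuals = [y - p for y, p in zip(ys, pred)]
        stump = fit_stump(xs, residuals)
        pred = [p + lr * stump(x) for p, x in zip(pred, xs)]
    return pred

xs = [1, 2, 3, 4]
ys = [1.0, 1.0, 3.0, 3.0]
pred = boost(xs, ys, rounds=20)
mse = sum((y - p) ** 2 for y, p in zip(ys, pred)) / len(ys)
```

Each round shrinks the residuals, so the ensemble's error drops toward zero on the training data.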
Unsupervised
IP insights
Amazon SageMaker AI IP Insights is an unsupervised learning algorithm that learns the usage patterns for IPv4 addresses. It is designed to capture associations between IPv4 addresses and various entities, such as user IDs or account numbers. You can use it to identify a user attempting to log into a web service from an anomalous IP address, for example. Or you can use it to identify an account that is attempting to create computing resources from an unusual IP address. Trained IP Insight models can be hosted at an endpoint for making real-time predictions or used for processing batch transforms.
SageMaker AI IP insights ingests historical data as (entity, IPv4 Address) pairs and learns the IP usage patterns of each entity. When queried with an (entity, IPv4 Address) event, a SageMaker AI IP Insights model returns a score that infers how anomalous the pattern of the event is. For example, when a user attempts to log in from an IP address, if the IP Insights score is high enough, a web login server might decide to trigger a multi-factor authentication system. In more advanced solutions, you can feed the IP Insights score into another machine learning model. For example, you can combine the IP Insight score with other features to rank the findings of another security system, such as those from Amazon GuardDuty.
The SageMaker AI IP Insights algorithm can also learn vector representations of IP addresses, known as embeddings. You can use vector-encoded embeddings as features in downstream machine learning tasks that use the information observed in the IP addresses. For example, you can use them in tasks such as measuring similarities between IP addresses in clustering and visualization tasks.
Type: Unsupervised
Purpose: Detect anomalous IP usage patterns.
Use Cases: Fraud detection, security monitoring.
Key Features:
Learns entity-IP associations.
Outputs anomaly scores.
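Conceptually, the model learns entity and IP embeddings whose dot product is high for pairings it has seen, so an unfamiliar (entity, IP) pair yields a high anomaly score. A toy sketch with made-up 2-dimensional embeddings (not the actual model internals):

```python
def anomaly_score(entity_emb, ip_emb):
    """Negated dot product of the entity and IP embeddings, so a
    HIGHER score means a LESS familiar (more anomalous) pairing."""
    return -sum(e * i for e, i in zip(entity_emb, ip_emb))

# Made-up 2-D embeddings: home_ip is an address this user logs in
# from daily; foreign_ip has never been associated with the user.
user = [0.8, 0.3]
home_ip = [0.7, 0.4]
foreign_ip = [-0.6, 0.1]
```

A downstream system would compare such scores against a threshold, for example to decide when to trigger multi-factor authentication.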
K-means
K-means is an unsupervised learning algorithm. It attempts to find discrete groupings within data, where members of a group are as similar as possible to one another and as different as possible from members of other groups. You define the attributes that you want the algorithm to use to determine similarity. The k-means algorithm expects tabular data, where rows represent the observations that you want to cluster, and the columns represent attributes of the observations. The n attributes in each row represent a point in n-dimensional space. The Euclidean distance between these points represents the similarity of the corresponding observations.
Type: Unsupervised
Purpose: Clustering.
Use Cases: Customer segmentation, pattern discovery.
Key Features:
Uses Euclidean distance.
Requires tabular data.
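The clustering loop described above, often called Lloyd's algorithm, alternates between assigning each point to its nearest centroid (Euclidean distance) and moving each centroid to the mean of its assigned points:

```python
import math

def kmeans(points, centroids, iters=10):
    """Lloyd's algorithm on tuples of coordinates: assign, then update."""
    for _ in range(iters):
        clusters = [[] for _ in centroids]
        for p in points:
            idx = min(range(len(centroids)),
                      key=lambda i: math.dist(p, centroids[i]))
            clusters[idx].append(p)
        centroids = [
            tuple(sum(c) / len(c) for c in zip(*cl)) if cl else centroids[i]
            for i, cl in enumerate(clusters)
        ]
    return centroids

# Two well-separated toy clusters.
points = [(0, 0), (0, 1), (1, 0), (9, 9), (9, 10), (10, 9)]
centers = kmeans(points, centroids=[(0.0, 0.0), (5.0, 5.0)])
```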
PCA
PCA is an unsupervised machine learning algorithm that attempts to reduce the dimensionality (number of features) within a dataset while still retaining as much information as possible. This is done by finding a new set of features called components, which are composites of the original features that are uncorrelated with one another. They are also constrained so that the first component accounts for the largest possible variability in the data, the second component the second most variability, and so on. In Amazon SageMaker AI, PCA operates in two modes, depending on the scenario:
regular: For datasets with sparse data and a moderate number of observations and features.
randomized: For datasets with both a large number of observations and features. This mode uses an approximation algorithm.
PCA uses tabular data.
Type: Unsupervised
Purpose: Dimensionality reduction.
Use Cases: Visualization, preprocessing.
Key Features:
Regular and randomized modes.
Works on tabular data.
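A minimal illustration of extracting the first component: power iteration on the covariance matrix of mean-centered 2-D data (for exposition only; this is not how SageMaker computes PCA at scale):

```python
import math

def first_component(data, iters=200):
    """First principal component of 2-D points via power iteration
    on the covariance matrix of the mean-centered data."""
    n = len(data)
    mx = sum(p[0] for p in data) / n
    my = sum(p[1] for p in data) / n
    centered = [(x - mx, y - my) for x, y in data]
    # Entries of the 2x2 covariance matrix.
    cxx = sum(x * x for x, _ in centered) / n
    cyy = sum(y * y for _, y in centered) / n
    cxy = sum(x * y for x, y in centered) / n
    v = (1.0, 0.0)
    for _ in range(iters):
        w = (cxx * v[0] + cxy * v[1], cxy * v[0] + cyy * v[1])
        norm = math.hypot(*w)
        v = (w[0] / norm, w[1] / norm)
    return v

# Points spread along the y = x diagonal: the first component should
# point along (1/sqrt(2), 1/sqrt(2)) up to sign.
pc = first_component([(0, 0), (1, 1.1), (2, 1.9), (3, 3.05)])
```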
Random Cut Forest (RCF)
Amazon SageMaker AI Random Cut Forest (RCF) is an unsupervised algorithm for detecting anomalous data points within a data set. These are observations which diverge from otherwise well-structured or patterned data. Anomalies can manifest as unexpected spikes in time series data, breaks in periodicity, or unclassifiable data points. With each data point, RCF associates an anomaly score. Low score values indicate that the data point is considered "normal." High values indicate the presence of an anomaly in the data. The definitions of "low" and "high" depend on the application, but common practice suggests that scores beyond three standard deviations from the mean score are considered anomalous.
Type: Unsupervised
Purpose: Anomaly detection.
Use Cases: Detecting outliers in time-series or structured data.
Key Features:
Assigns anomaly scores.
Suitable for streaming data.
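The three-standard-deviations rule of thumb mentioned above, applied to a list of (synthetic) anomaly scores:

```python
import math

def flag_anomalies(scores):
    """Return the indices of scores more than three standard
    deviations above the mean anomaly score."""
    n = len(scores)
    mean = sum(scores) / n
    std = math.sqrt(sum((s - mean) ** 2 for s in scores) / n)
    cutoff = mean + 3 * std
    return [i for i, s in enumerate(scores) if s > cutoff]

# Thirty near-constant "normal" scores plus one clear spike at index 30.
normal = [1.0 + 0.01 * (i % 5) for i in range(30)]
scores = normal + [8.0]
outliers = flag_anomalies(scores)
```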
Vision
Image classification-MXNet
The Amazon SageMaker image classification algorithm is a supervised learning algorithm that supports multi-label classification. It takes an image as input and outputs one or more labels assigned to that image. It uses a convolutional neural network that can be trained from scratch or trained using transfer learning when a large number of training images are not available. Image classification in Amazon SageMaker AI can be run in two modes: full training and transfer learning. In full training mode, the network is initialized with random weights and trained on user data from scratch. In transfer learning mode, the network is initialized with pre-trained weights and just the top fully connected layer is initialized with random weights. Then, the whole network is fine-tuned with new data. In this mode, training can be achieved even with a smaller dataset. This is because the network is already trained and therefore can be used in cases without sufficient training data.
Type: Supervised
Purpose: Multi-label image classification.
Use Cases: Object recognition, medical imaging.
Key Features:
Supports full training and transfer learning.
Uses CNNs.
Image Classification - TensorFlow
The Amazon SageMaker Image Classification - TensorFlow algorithm is a supervised learning algorithm that supports transfer learning with many pretrained models from the TensorFlow Hub. Use transfer learning to fine-tune one of the available pretrained models on your own dataset, even if a large amount of image data is not available. The image classification algorithm takes an image as input and outputs a probability for each provided class label.
Type: Supervised
Purpose: Image classification using pretrained models.
Use Cases: Visual recognition with limited data.
Key Features:
Transfer learning via TensorFlow Hub.
Outputs class probabilities.
Object Detection - MXNet
The Amazon SageMaker AI Object Detection - MXNet algorithm detects and classifies objects in images using a single deep neural network. It is a supervised learning algorithm that takes images as input and identifies all instances of objects within the image scene. The object is categorized into one of the classes in a specified collection with a confidence score that it belongs to the class. Its location and scale in the image are indicated by a rectangular bounding box. It uses the Single Shot multibox Detector (SSD) framework and supports two base networks: VGG and ResNet. The network can be trained from scratch, or trained with models that have been pre-trained on the ImageNet dataset.
Type: Supervised
Purpose: Detect and classify objects in images.
Use Cases: Surveillance, autonomous vehicles.
Key Features:
SSD framework with VGG/ResNet.
Outputs bounding boxes and class scores.
Object Detection - TensorFlow
The Amazon SageMaker AI Object Detection - TensorFlow algorithm is a supervised learning algorithm that supports transfer learning with many pretrained models from the TensorFlow Model Garden. Use transfer learning to fine-tune one of the available pretrained models on your own dataset, even if a large amount of image data is not available. The object detection algorithm takes an image as input and outputs a list of bounding boxes.
Type: Supervised
Purpose: Object detection using pretrained models.
Use Cases: Retail analytics, robotics.
Key Features:
Transfer learning via TensorFlow Model Garden.
Outputs bounding boxes.
Semantic segmentation
The SageMaker AI semantic segmentation algorithm provides a fine-grained, pixel-level approach to developing computer vision applications. It tags every pixel in an image with a class label from a predefined set of classes. Tagging is fundamental for understanding scenes, which is critical to an increasing number of computer vision applications, such as self-driving vehicles, medical imaging diagnostics, and robot sensing.
Type: Supervised
Purpose: Pixel-level image classification.
Use Cases: Medical diagnostics, autonomous driving.
Key Features:
Tags each pixel with a class label.
Enables fine-grained scene understanding.
Scenario-to-Algorithm Mapping Table
| Scenario | Algorithm | Type | Example Use Case |
|---|---|---|---|
| Forecasting product demand | DeepAR | Supervised (Time-Series) | Predicting weekly sales |
| Sentiment analysis | BlazingText | Supervised (NLP) | Classifying tweets as positive/negative |
| Topic modeling in documents | LDA / NTM | Unsupervised (NLP) | Discovering themes in news articles |
| Object similarity search | Object2Vec | Supervised (Embedding) | Recommending similar products |
| Machine translation | Sequence-to-Sequence | Supervised (NLP) | Translating English to French |
| Text classification with limited data | Text Classification (TensorFlow) | Supervised (Transfer Learning) | Spam detection in emails |
| Predicting customer churn | AutoGluon-Tabular | AutoML (Tabular) | Churn prediction from customer data |
| Click prediction in sparse data | Factorization Machines | Supervised (Tabular) | Ad click-through rate prediction |
| Fraud detection via IP patterns | IP Insights | Unsupervised | Detecting login anomalies |
| Customer segmentation | K-Means | Unsupervised | Grouping users by behavior |
| Dimensionality reduction | PCA | Unsupervised | Visualizing high-dimensional data |
| Anomaly detection in logs | Random Cut Forest | Unsupervised | Detecting unusual spikes in server logs |
| Image classification | Image Classification (MXNet / TensorFlow) | Supervised (Vision) | Identifying dog breeds |
| Object detection in images | Object Detection (MXNet / TensorFlow) | Supervised (Vision) | Detecting cars in traffic footage |
| Scene understanding | Semantic Segmentation | Supervised (Vision) | Medical image diagnostics |
| Tabular classification with categorical features | TabTransformer | Supervised (Tabular) | Predicting loan defaults |
| Binary classification | Linear Learner | Supervised | Predicting if a transaction is fraudulent |
| High-performance tabular modeling | CatBoost / LightGBM / XGBoost | Supervised | Credit scoring, sales prediction |
| Nearest neighbor search | k-NN | Supervised | Recommending similar users |
The content above is derived from https://docs.aws.amazon.com/sagemaker/latest/dg/whatis.html




