<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title><![CDATA[Path To Machine Learning]]></title><description><![CDATA[Path To Machine Learning]]></description><link>https://path2ml.com</link><generator>RSS for Node</generator><lastBuildDate>Tue, 14 Apr 2026 02:48:45 GMT</lastBuildDate><atom:link href="https://path2ml.com/rss.xml" rel="self" type="application/rss+xml"/><language><![CDATA[en]]></language><ttl>60</ttl><item><title><![CDATA[Building an Agentic CI/CD Pipeline with Amazon Bedrock, GitLab CI, and AWS CDK]]></title><description><![CDATA[Overview
What if your CI/CD pipeline could think? Not just run lint and tests, but actually understand your code changes, generate tests for new functions, and write a risk assessment with inline code]]></description><link>https://path2ml.com/building-an-agentic-ci-cd-pipeline-with-amazon-bedrock-gitlab-ci-and-aws-cdk</link><guid isPermaLink="true">https://path2ml.com/building-an-agentic-ci-cd-pipeline-with-amazon-bedrock-gitlab-ci-and-aws-cdk</guid><category><![CDATA[ci-cd]]></category><category><![CDATA[llm]]></category><category><![CDATA[agentic AI]]></category><dc:creator><![CDATA[Nitin Sharma]]></dc:creator><pubDate>Sat, 21 Mar 2026 14:30:00 GMT</pubDate><enclosure url="https://cdn.hashnode.com/uploads/covers/67813846ff24ffd4d2354b38/64e1930f-8374-4be8-ada8-df09a4850f75.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2>Overview</h2>
<p>What if your CI/CD pipeline could think? Not just run lint and tests, but actually <em>understand</em> your code changes, generate tests for new functions, and write a risk assessment with inline code review comments -- all automatically on every merge request?</p>
<p>This blog walks through building exactly that: an <strong>agentic CI/CD pipeline</strong> that integrates Amazon Bedrock's foundation models directly into GitLab CI, turning a traditional pipeline into an intelligent code review and quality assurance system.</p>
<p>The result is a 6-stage pipeline where two stages are AI-powered -- the agent generates tests for changed code and performs a holistic merge request review with risk scoring, all deployed as infrastructure-as-code with AWS CDK.</p>
<h2>How AI Can Transform CI/CD Productivity</h2>
<p>Traditional CI/CD pipelines are deterministic: they run the same checks the same way every time. Lint passes or fails. Tests pass or fail. There's no interpretation, no context, no judgement.</p>
<p>AI agents change this fundamentally:</p>
<ul>
<li><p><strong>Automated test generation</strong> -- When a developer adds a new function, the agent can inspect the code and generate meaningful test cases, not just import checks but actual logic tests with edge cases.</p>
</li>
<li><p><strong>Holistic code review</strong> -- Instead of isolated tool outputs, an AI agent can correlate lint violations, test failures, security findings, and the actual diff to produce a unified risk assessment.</p>
</li>
<li><p><strong>Contextual inline comments</strong> -- The agent can point to specific lines of code with architectural suggestions, anti-pattern warnings, or security concerns that static tools miss entirely.</p>
</li>
<li><p><strong>Risk-based merge gating</strong> -- Rather than binary pass/fail, the pipeline produces a risk score (0-100) with weighted drivers, giving reviewers actionable context to make merge decisions.</p>
</li>
</ul>
<p>The key insight is that AI agents don't replace existing CI tools -- they <em>augment</em> them. Ruff still lints, pytest still tests, bandit still scans. The agent sits on top, consuming all their outputs plus the raw diff to produce something none of them could alone.</p>
<h2>What This Solution Solves</h2>
<p>In a typical development workflow, code review is a bottleneck. Reviewers have to:</p>
<ol>
<li><p>Manually read every diff line</p>
</li>
<li><p>Cross-reference lint, test, and security outputs</p>
</li>
<li><p>Assess overall risk based on what changed</p>
</li>
<li><p>Write comments and suggest improvements</p>
</li>
<li><p>Decide whether to approve</p>
</li>
</ol>
<p>This solution automates steps 1-4 completely and provides structured input for step 5:</p>
<ul>
<li><p><strong>AI-generated tests</strong> catch untested functions before the MR is even reviewed</p>
</li>
<li><p><strong>Automated risk scoring</strong> quantifies merge risk on a 0-100 scale with weighted risk drivers</p>
</li>
<li><p><strong>MR summary notes</strong> are posted directly to GitLab with risk score, review summary, and artifact links</p>
</li>
<li><p><strong>Inline code comments</strong> highlight specific issues on the diff itself</p>
</li>
<li><p><strong>Deterministic fallbacks</strong> ensure the pipeline never breaks even if the AI is unavailable</p>
</li>
</ul>
<h2>Running It Yourself</h2>
<h3>Prerequisites</h3>
<ul>
<li><p>Docker &amp; Docker Compose</p>
</li>
<li><p>AWS CLI configured with Bedrock access</p>
</li>
<li><p>Node.js + AWS CDK CLI</p>
</li>
<li><p>Python 3.12</p>
</li>
</ul>
<p>For a POC, self-hosted GitLab CE via Docker Compose gives full control over CI/CD configuration, runner setup, and API access without SaaS limitations. The Docker executor runs jobs in isolated containers with the Python 3.12 image.</p>
<img src="https://cdn.hashnode.com/uploads/covers/67813846ff24ffd4d2354b38/f1ee0753-db25-4a6c-a742-eea145b6bed4.png" alt="" style="display:block;margin:0 auto" />

<h2>Solution Architecture</h2>
<p>The system has three main components: a self-hosted GitLab instance with CI/CD runners, an AWS backend with API Gateway + Lambda + Bedrock Agent, and CI scripts that orchestrate everything.</p>
<h3>Architecture Overview</h3>
<img src="https://cdn.hashnode.com/uploads/covers/67813846ff24ffd4d2354b38/d6c583c4-421b-4992-b72b-f98cfd96cb0e.svg" alt="" style="display:block;margin:0 auto" />

<h3>Component Breakdown</h3>
<p><strong>GitLab (Docker Compose)</strong></p>
<ul>
<li><p>Self-hosted GitLab CE on port 8080</p>
</li>
<li><p>GitLab Runner with Docker executor</p>
</li>
<li><p>Python 3.12 base image with ruff, pytest, bandit, pip-audit</p>
</li>
</ul>
<p><strong>AWS Cloud (CDK-deployed)</strong></p>
<ul>
<li><p>API Gateway (REST) with API key authentication</p>
</li>
<li><p>Lambda function (Python 3.12, 512MB, 120s timeout)</p>
</li>
<li><p>Amazon Bedrock Agent using Nova Pro v1:0 foundation model</p>
</li>
<li><p>IAM roles for agent invocation and model access</p>
</li>
</ul>
<p><strong>CI Scripts (Python)</strong></p>
<ul>
<li><p><code>detect_changes.py</code> -- git diff analysis</p>
</li>
<li><p><code>generate_tests_from_diff.py</code> -- AI test generation with fallback</p>
</li>
<li><p><code>validate_generated_tests.py</code> -- AST safety validation</p>
</li>
<li><p><code>persist_generated_tests.py</code> -- optional bot commit</p>
</li>
<li><p><code>mr_review_agent.py</code> -- MR review with risk scoring and GitLab API posting</p>
</li>
</ul>
<h2>The 6-Stage Pipeline</h2>
<p>The pipeline runs on every merge request update. Two stages are AI-powered (Prepare and Agent Review), while the middle three are traditional CI tools.</p>
<img src="https://cdn.hashnode.com/uploads/covers/67813846ff24ffd4d2354b38/e6b2940e-ac44-4c8c-ac0b-7cf03436d61b.svg" alt="" style="display:block;margin:0 auto" />

<h3>Stage 1: PREPARE (AI-Powered)</h3>
<p>This stage runs four jobs sequentially:</p>
<p><strong>detect_changes</strong> -- Runs <code>git diff</code> against the target branch to produce three artifacts: the list of changed files, the list of changed Python files, and the full diff text. These artifacts feed every downstream job.</p>
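<p>As a rough illustration of what this diff-parsing step does (the helper below is a sketch, not the actual script's code), the changed Python files can be pulled straight out of the unified diff headers:</p>
<pre><code class="language-python">def changed_python_files(diff_text):
    """Extract paths of modified Python files from git unified diff output."""
    files = []
    for line in diff_text.splitlines():
        # Each changed file shows up as a "+++ b/" header line in the diff
        if line.startswith("+++ b/"):
            path = line[len("+++ b/"):]
            if path.endswith(".py"):
                files.append(path)
    return files
</code></pre>
<p>The real script also emits the full diff text and the complete file list as separate artifacts, but the core extraction is this simple.</p>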
<p><strong>generate_tests</strong> -- Sends the diff and changed file list to the Bedrock Agent via API Gateway. The agent analyzes the code changes and generates pytest test files. If the agent returns malformed JSON, the Lambda retries up to 2 times. If all retries fail, the CI script falls back to AST-based test generation that produces import + function existence tests.</p>
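<p>The deterministic fallback can be sketched with nothing but the standard library's <code>ast</code> module. The helper name and the style of the generated tests below are illustrative assumptions, not the repository's actual implementation:</p>
<pre><code class="language-python">import ast

def fallback_tests(module_name, source):
    """Generate minimal smoke tests: the module imports and each top-level function exists."""
    funcs = [n.name for n in ast.parse(source).body if isinstance(n, ast.FunctionDef)]
    lines = [
        "import importlib",
        "",
        "def test_module_imports():",
        f"    importlib.import_module({module_name!r})",
    ]
    for fn in funcs:
        lines += [
            "",
            f"def test_{fn}_exists():",
            f"    mod = importlib.import_module({module_name!r})",
            f"    assert callable(getattr(mod, {fn!r}, None))",
        ]
    return "\n".join(lines) + "\n"
</code></pre>
<p>These generated tests are far weaker than AI-authored logic tests, but they guarantee the pipeline always produces something runnable.</p>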
<p><strong>validate_generated_tests</strong> -- Parses every generated test file through Python's AST module to ensure they're syntactically valid and safe to run. No <code>os.system</code>, no <code>subprocess</code>, no <code>eval</code>.</p>
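<p>A minimal version of that safety gate looks like the following, using only the standard library (the banned-name lists are illustrative, not the project's exact policy):</p>
<pre><code class="language-python">import ast

BANNED_CALLS = {"eval", "exec", "compile", "__import__"}
BANNED_MODULES = {"os", "subprocess", "socket"}

def is_safe_test(source):
    """Return True only if the generated test parses and avoids dangerous constructs."""
    try:
        tree = ast.parse(source)
    except SyntaxError:
        return False
    for node in ast.walk(tree):
        # Reject imports of modules that can shell out or open sockets
        if isinstance(node, (ast.Import, ast.ImportFrom)):
            names = [alias.name.split(".")[0] for alias in node.names]
            if isinstance(node, ast.ImportFrom) and node.module:
                names.append(node.module.split(".")[0])
            if any(name in BANNED_MODULES for name in names):
                return False
        # Reject direct calls to eval/exec and friends
        if isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
            if node.func.id in BANNED_CALLS:
                return False
    return True
</code></pre>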
<p><strong>persist_generated_tests</strong> -- Optionally commits the generated tests back to the feature branch (controlled by <code>AUTO_COMMIT_GENERATED_TESTS</code> variable).</p>
<h3>Stage 2: LINT</h3>
<p>Runs <code>ruff check .</code> and saves the output to <code>artifacts/lint.txt</code>. Ruff is a fast Python linter that catches style violations, import errors, and anti-patterns.</p>
<h3>Stage 3: TEST</h3>
<p>Runs <code>pytest -v --tb=short --junitxml=artifacts/junit.xml</code> to execute all tests -- both manually written tests and AI-generated tests from Stage 1. Results are saved as a JUnit XML report that GitLab renders in the MR UI.</p>
<h3>Stage 4: SECURITY</h3>
<p>Runs <code>bandit</code> (static security analysis) and <code>pip-audit</code> (dependency vulnerability scanning). Output is saved to <code>artifacts/security.txt</code>.</p>
<h3>Stage 5: AGENT REVIEW (AI-Powered)</h3>
<p>The most sophisticated stage. The <code>mr_review_agent.py</code> script:</p>
<ol>
<li><p><strong>Collects all artifacts</strong> from previous stages (lint.txt, test output, security.txt, changes.diff)</p>
</li>
<li><p><strong>Calls the Bedrock Agent</strong> via API Gateway with the full context</p>
</li>
<li><p><strong>Computes a deterministic baseline</strong> risk score from the artifacts (test failures, lint violations, security findings)</p>
</li>
<li><p><strong>Merges AI and deterministic scores</strong> -- the final risk score is always &gt;= the deterministic baseline</p>
</li>
<li><p><strong>Posts results to GitLab</strong> as an MR summary note with risk drivers table, review summary, and artifact links</p>
</li>
<li><p><strong>Posts inline comments</strong> on specific lines of the diff for critical findings</p>
</li>
</ol>
<h3>Stage 6: DEPLOY</h3>
<p>Manual deployment to staging, only available on the main branch.</p>
<h2>The AWS Stack</h2>
<p>The entire AWS infrastructure is defined in a single CDK stack (<code>AgentGateStack</code>), making it fully reproducible with <code>cdk deploy</code>.</p>
<img src="https://cdn.hashnode.com/uploads/covers/67813846ff24ffd4d2354b38/0966e989-6bf8-4aca-8dac-9c4a5e1c766d.svg" alt="" style="display:block;margin:0 auto" />

<h2>Data Flow: End to End</h2>
<p>Here's how data flows through the entire system from developer push to MR comment:</p>
<ol>
<li><p>Developer pushes to a feature branch and opens/updates an MR</p>
</li>
<li><p>GitLab triggers the pipeline with <code>merge_request_event</code> source</p>
</li>
<li><p><code>detect_changes</code> produces the diff and file lists as CI artifacts</p>
</li>
<li><p><code>generate_tests</code> sends the diff to Bedrock and writes test files</p>
</li>
<li><p>Lint, test, and security stages run independently, producing their own artifacts</p>
</li>
<li><p><code>mr_review</code> collects ALL artifacts, calls Bedrock for a holistic review</p>
</li>
<li><p>The review result is posted as an MR note with risk score and inline comments</p>
</li>
</ol>
<img src="https://cdn.hashnode.com/uploads/covers/67813846ff24ffd4d2354b38/ad57f046-740e-46be-af86-ab6626632b4e.svg" alt="" style="display:block;margin:0 auto" />

<h2>Risk Scoring: How It Works</h2>
<p>The risk scoring system combines deterministic analysis with AI judgement:</p>
<img src="https://cdn.hashnode.com/uploads/covers/67813846ff24ffd4d2354b38/428d92e5-be1f-4781-9543-0732ab6dac31.svg" alt="" style="display:block;margin:0 auto" />

<h3>AI Enhancement</h3>
<p>The Bedrock Agent analyzes the actual code changes and produces its own risk score with drivers. The final score is the <strong>maximum</strong> of the AI score and the deterministic baseline -- ensuring that hard signals (failing tests, security vulnerabilities) can never be downplayed by the AI.</p>
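<p>The merge rule itself is a one-liner. The baseline weights below are made-up placeholders to show the shape of the computation, not the project's actual numbers:</p>
<pre><code class="language-python">def deterministic_baseline(failed_tests, lint_errors, security_findings):
    """Floor score from hard CI signals; the weights here are illustrative."""
    score = failed_tests * 25 + security_findings * 15 + lint_errors * 2
    return min(score, 100)

def final_risk_score(ai_score, baseline):
    # The AI may raise the risk, but it can never score below the deterministic floor.
    return max(min(ai_score, 100), baseline)
</code></pre>
<p>Taking the maximum is what makes the system trustworthy: a failing test suite keeps the score high no matter how optimistic the model's own assessment is.</p>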
<h3>Agent response on the GitLab merge request screen</h3>
<img src="https://cdn.hashnode.com/uploads/covers/67813846ff24ffd4d2354b38/51898fc7-7140-4797-a53d-59bbec90362e.png" alt="" style="display:block;margin:0 auto" />

<h2>Conclusion</h2>
<p>This project demonstrates that AI agents can be meaningfully integrated into CI/CD pipelines today -- not as replacements for existing tools, but as an intelligent layer that synthesizes their outputs into actionable insights.</p>
<p>The project repository used for this POC is available at <a href="https://github.com/learner14/agentic-cicd-pipeline"><strong>agentic ci/cd pipeline</strong></a>.</p>
]]></content:encoded></item><item><title><![CDATA[Reliable JSON Responses from LLMs]]></title><description><![CDATA[Getting reliable, structured (JSON) responses from Large Language Models is harder than it looks. The magentic library, paired with Pydantic, lets you define the shape of your expected output as a Pyt]]></description><link>https://path2ml.com/reliable-json-responses-from-llms</link><guid isPermaLink="true">https://path2ml.com/reliable-json-responses-from-llms</guid><category><![CDATA[Deep Learning]]></category><category><![CDATA[Machine Learning]]></category><category><![CDATA[llm]]></category><dc:creator><![CDATA[Nitin Sharma]]></dc:creator><pubDate>Sun, 08 Mar 2026 20:38:21 GMT</pubDate><enclosure url="https://cdn.hashnode.com/uploads/covers/67813846ff24ffd4d2354b38/2ff0e7ae-cb6c-424a-9fcb-86a496158c30.jpg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Getting reliable, structured (JSON) responses from Large Language Models is harder than it looks. The <strong>magentic</strong> library, paired with <strong>Pydantic</strong>, lets you define the shape of your expected output as a Python class and receive it back as a validated object — no manual prompt engineering or fragile JSON parsing required.</p>
<h2>The Problem</h2>
<p>When you call an LLM through its API, the response comes back as free-form text. If your application needs that data in a structured format — say, a JSON object with specific fields — you're left writing brittle prompt instructions like <em>"Please respond in JSON with keys name, age, and summary"</em> and then wrapping everything in <code>try/except json.loads(...)</code>.</p>
<p>This leads to:</p>
<ul>
<li><p><strong>Unreliable outputs</strong> — the model may add commentary, change key names, or break JSON syntax.</p>
</li>
<li><p><strong>Wasted tokens</strong> — lengthy system prompts that explain the desired format eat into your context window.</p>
</li>
<li><p><strong>Messy code</strong> — parsing logic, validation, and retry handling clutter your business logic.</p>
</li>
</ul>
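<p>For contrast, the brittle manual approach usually ends up looking something like this sketch (a hypothetical helper, shown only to illustrate the problem):</p>
<pre><code class="language-python">import json

def parse_llm_reply(raw):
    """The fragile status quo: hope the model returned clean JSON."""
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        # Models often wrap JSON in markdown fences or add commentary,
        # so fall back to slicing out the outermost braces and retrying.
        start, end = raw.find("{"), raw.rfind("}")
        candidate = raw[start:end + 1]
        if start == -1 or not candidate.endswith("}"):
            raise
        return json.loads(candidate)
</code></pre>
<p>Every caller then still has to validate key names and types by hand, which is exactly the boilerplate the approach below removes.</p>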
<h2>How Magentic + Pydantic Solve This</h2>
<p>The <strong>magentic</strong> library introduces a simple decorator-based approach:</p>
<ol>
<li><p><strong>Define your output schema</strong> using a Pydantic model — the same way you'd define any data class in modern Python.</p>
</li>
<li><p><strong>Decorate a function</strong> with <code>@prompt</code>, providing your prompt template.</p>
</li>
<li><p><strong>Call the function</strong> — magentic handles the system prompt injection, API call, and Pydantic-led response parsing under the hood.</p>
</li>
</ol>
<p>This means:</p>
<ul>
<li><p>The decorator manages the underlying system prompts that instruct the LLM to return structured data.</p>
</li>
<li><p>Pydantic validates and parses the response automatically.</p>
</li>
<li><p>Token usage is optimised because the formatting instructions are handled efficiently by the library.</p>
</li>
<li><p>Your codebase stays clean — no raw JSON wrangling.</p>
</li>
</ul>
<h2>Quick Example (Python + OpenAI API Key)</h2>
<h3>Install</h3>
<pre><code class="language-bash">pip install magentic pydantic openai
</code></pre>
<h3>Code</h3>
<pre><code class="language-python">"""
Structured Output Binding with Magentic + Pydantic + OpenAI
------------------------------------------------------------
pip install magentic pydantic openai python-dotenv
"""

import os
from dotenv import load_dotenv
from pydantic import BaseModel
from openai import OpenAI
from magentic import prompt

# Load API key from .env file
load_dotenv()


# =============================================
# 1. Define output schemas (shared by both approaches)
# =============================================

class MovieReview(BaseModel):
    title: str
    rating: float
    summary: str


class RecipeSuggestion(BaseModel):
    name: str
    ingredients: list[str]
    steps: list[str]
    prep_time_minutes: int


# =============================================
# APPROACH 1: Direct OpenAI API call (explicit LLM call)
# =============================================

def review_movie_direct(movie_name: str) -&gt; MovieReview:
    """Calls the OpenAI API directly and parses response into a Pydantic model."""
    client = OpenAI()  # uses OPENAI_API_KEY from env

    # This is the actual LLM call ↓
    completion = client.beta.chat.completions.parse(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "You are a movie critic. Return structured JSON."},
            {"role": "user", "content": f"Give a short review of the movie {movie_name}"},
        ],
        response_format=MovieReview,  # OpenAI parses response into this Pydantic model
    )

    return completion.choices[0].message.parsed


# =============================================
# APPROACH 2: Magentic decorator (hides the LLM call)
# =============================================
# Under the hood, @prompt sends your text to the OpenAI API
# and automatically parses the response into the return type.

@prompt("Give a short review of the movie {movie_name}")
def review_movie_magentic(movie_name: str) -&gt; MovieReview: ...


@prompt("Suggest a simple recipe using {ingredient} as the main ingredient")
def suggest_recipe(ingredient: str) -&gt; RecipeSuggestion: ...


# =============================================
# 3. Run &amp; test
# =============================================

if __name__ == "__main__":
    assert os.environ.get("OPENAI_API_KEY"), "Set OPENAI_API_KEY in your .env file first!"

    # --- Approach 1: Direct OpenAI call ---
    print("=== Movie Review (Direct OpenAI Call) ===")
    review = review_movie_direct("Inception")
    print(f"Title   : {review.title}")
    print(f"Rating  : {review.rating}")
    print(f"Summary : {review.summary}")

    print()

    # --- Approach 2: Magentic decorator ---
    print("=== Movie Review (Magentic Decorator) ===")
    review2 = review_movie_magentic("The Matrix")
    print(f"Title   : {review2.title}")
    print(f"Rating  : {review2.rating}")
    print(f"Summary : {review2.summary}")

    print()

    print("=== Recipe Suggestion (Magentic Decorator) ===")
    recipe = suggest_recipe("chicken")
    print(f"Name        : {recipe.name}")
    print(f"Ingredients : {', '.join(recipe.ingredients)}")
    print(f"Prep Time   : {recipe.prep_time_minutes} min")
    print("Steps:")
    for i, step in enumerate(recipe.steps, 1):
        print(f"  {i}. {step}")
</code></pre>
<h3>Formatted Output</h3>
<pre><code class="language-python">=== Movie Review (Direct OpenAI Call) ===
Title   : Inception
Rating  : 9.0
Summary : Inception is a mind-bending thriller directed by Christopher Nolan that explores the intricacies of dreams and the subconscious. With a stellar cast led by Leonardo DiCaprio, the film masterfully blends action, science fiction, and emotional depth. Its intricate plot and stunning visuals keep audiences engaged, while the haunting score by Hans Zimmer heightens the tension throughout. A thought-provoking narrative that challenges perceptions of reality, Inception remains a landmark achievement in modern cinema.

=== Movie Review (Magentic Decorator) ===
Title   : The Matrix
Rating  : 4.8
Summary : The Matrix is a groundbreaking science fiction film that blends mind-bending action with philosophical depth. Directed by the Wachowskis, it introduces a dystopian future where reality is simulated by AI, and human resistance fights against the machine overlords. The film's innovative special effects, particularly the iconic 'bullet-time' sequences, and its exploration of themes such as freedom, reality, and identity make it a landmark in cinematic history. With a stellar performance by Keanu Reeves as Neo, The Matrix has become a cultural phenomenon and a must-watch for any sci-fi fan.
</code></pre>
<h3>What Happens Behind the Scenes</h3>
<ol>
<li><p><code>@prompt</code> sends your template (plus automatic formatting instructions) to the OpenAI API.</p>
</li>
<li><p>The LLM returns structured data matching the <code>MovieReview</code> schema.</p>
</li>
<li><p>Magentic parses and validates the response through Pydantic before handing it back to you as a Python object.</p>
</li>
</ol>
<p>No manual JSON parsing. No retry loops. No prompt gymnastics.</p>
<h2>Key Takeaways</h2>
<table>
<thead>
<tr>
<th>Benefit</th>
<th>Without Magentic</th>
<th>With Magentic</th>
</tr>
</thead>
<tbody><tr>
<td>Output format</td>
<td>Free text / fragile JSON</td>
<td>Validated Pydantic model</td>
</tr>
<tr>
<td>Prompt overhead</td>
<td>Manual formatting instructions</td>
<td>Handled by decorator</td>
</tr>
<tr>
<td>Parsing code</td>
<td><code>json.loads</code> + validation</td>
<td>Automatic</td>
</tr>
<tr>
<td>Token efficiency</td>
<td>Extra tokens for format prompts</td>
<td>Optimised</td>
</tr>
<tr>
<td>Code cleanliness</td>
<td>Scattered parsing logic</td>
<td>Single decorated function</td>
</tr>
</tbody></table>
]]></content:encoded></item><item><title><![CDATA[Multi-Agent Loan Processing AgenticAI]]></title><description><![CDATA[In this blog post, we will explore the implementation of a multi-agent loan processing system using the crewAI framework. This innovative approach leverages advanced artificial intelligence to streamline and automate the various stages of loan proces...]]></description><link>https://path2ml.com/multi-agent-loan-processing-agenticai</link><guid isPermaLink="true">https://path2ml.com/multi-agent-loan-processing-agenticai</guid><category><![CDATA[Deep Learning]]></category><category><![CDATA[Machine Learning]]></category><category><![CDATA[agentic AI]]></category><category><![CDATA[CrewAI]]></category><category><![CDATA[llm]]></category><dc:creator><![CDATA[Nitin Sharma]]></dc:creator><pubDate>Wed, 18 Feb 2026 21:50:19 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1771297607985/8ffb53f1-82cd-49f8-a508-58d9d627fb72.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>In this blog post, we will explore the implementation of a multi-agent loan processing system using the <strong>crewAI</strong> framework. This innovative approach leverages advanced artificial intelligence to streamline and automate the various stages of loan processing, ensuring efficiency and accuracy.</p>
<p>We will employ a hierarchical architecture, commonly referred to as the <strong>Supervisor/Orchestrator pattern</strong>. This design effectively organizes agents into a structured framework that mimics a corporate team environment. Within this framework, a manager (the supervisor) oversees a group of specialized agents (the specialists), each with specific expertise.</p>
<h3 id="heading-supervisor-parent-agent">Supervisor (Parent) Agent</h3>
<p>The Supervisor agent serves as the manager of the system. Its primary responsibilities encompass several crucial tasks, including:</p>
<ul>
<li><p><strong>Planning the Workflow:</strong> The Supervisor outlines the entire loan processing cycle, establishing the order and priority of tasks to be completed.</p>
</li>
<li><p><strong>Delegating Sub-Tasks:</strong> Based on the nature of each loan application, the Supervisor assigns specific tasks to the appropriate specialist agents, ensuring that each agent is tasked with functions aligned with their expertise.</p>
</li>
<li><p><strong>Monitoring Progress:</strong> Continuous oversight allows the Supervisor to track the status of each task, providing updates on progress and identifying any potential issues or bottlenecks in the process.</p>
</li>
<li><p><strong>Synthesizing Results:</strong> After all tasks have been executed, the Supervisor consolidates the findings from the specialists to create a comprehensive overview of the loan application status.</p>
</li>
</ul>
<h3 id="heading-specialist-sub-agent">Specialist (Sub-Agent)</h3>
<p>Specialist agents are the individual experts within the multi-agent system. Each specialist is programmed to perform a particular task with a high degree of proficiency. Their roles include:</p>
<ul>
<li><p><strong>Performing Specific Tasks:</strong> Each specialist is designed for tasks such as document validation, credit score retrieval, risk assessment, and compliance checking. This specialization enables them to execute their functions efficiently and effectively.</p>
</li>
<li><p><strong>Executing Under Supervision:</strong> Upon receiving a task from the Supervisor, each specialist utilizes its dedicated tools and algorithms to complete the assigned job, ensuring that the output is accurate and meets the necessary requirements.</p>
</li>
<li><p><strong>Returning Results:</strong> Once a task is completed, the specialist sends the result back to the Supervisor for further processing and analysis.</p>
</li>
</ul>
<h3 id="heading-hierarchical-agent-system-is-shown-below">The hierarchical agent system is shown below</h3>
<ul>
<li><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1771295285033/c4f4baed-7525-4831-bbd2-a6f4e81e4580.png" alt /></li>
</ul>
<h3 id="heading-workflow-process">Workflow Process</h3>
<p>The overall workflow within this multi-agent system is divided into several distinct tasks, each mapped to a specific component of the process:</p>
<ol>
<li><p><strong>Document Fetch:</strong> The first step involves retrieving the content of the loan application document. This is done at the preprocessing stage, where the document is fetched and made available for subsequent validation and analysis.</p>
</li>
<li><p><strong>Document Validation:</strong> Once the document content is fetched, the Document Validation Specialist checks its validity and completeness. This step ensures that all required information is present and adheres to the predefined standards.</p>
</li>
<li><p><strong>Credit Check:</strong> Following validation, the Credit Check Agent is tasked with retrieving the borrower’s credit score. This process relies on the borrower’s unique customer ID to access accurate credit information, which is vital for assessing the borrower’s creditworthiness.</p>
</li>
<li><p><strong>Risk Assessment:</strong> The next phase is conducted by the Risk Assessment Analyst, who analyzes the data gathered from the document, credit score, and borrower’s income. This analysis determines the risk level associated with granting the loan.</p>
</li>
<li><p><strong>Compliance Check:</strong> Finally, the Compliance Check Agent evaluates the entire decision-making process to ensure that it aligns with established lending regulations and legal requirements, safeguarding the institution against potential compliance issues.</p>
</li>
</ol>
<p>By structuring the loan processing system in this way, we can ensure a streamlined and effective approach to handling loan applications. Each agent plays a crucial role in contributing to a well-coordinated effort, ultimately enhancing the quality and speed of the loan processing workflow.</p>
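<p>Stripped of any framework, the supervisor/specialist delegation loop boils down to something like the following framework-agnostic sketch (the specialist names and report shape here are illustrative, not CrewAI's API):</p>
<pre><code class="language-python">def supervisor(application, specialists):
    """Minimal orchestrator: run each specialist in order and stop on a hard failure."""
    report = {"application_id": application.get("id")}
    for name in ("validate", "credit_check", "risk", "compliance"):
        # Each specialist sees the application plus everything gathered so far
        result = specialists[name]({**application, **report})
        report[name] = result
        if not result.get("ok", True):
            report["decision"] = "rejected"
            return report
    report["decision"] = "approved"
    return report
</code></pre>
<p>CrewAI's hierarchical process automates this loop: the manager agent plans the order, delegates, and synthesizes, so you never write the orchestration code by hand.</p>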
<p>According to our workflow, the orchestrator agent and the four specialist agents look like this:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1771450504680/6440c21a-7ba4-4c02-b140-a660b37364e2.png" alt class="image--center mx-auto" /></p>
<h3 id="heading-crewai-framework-for-building-agenticai-loan-procesing-application">CrewAI framework for building the AgenticAI loan-processing application</h3>
<p>Key concepts in <strong>CrewAI</strong> encompass several crucial elements that define how the system operates and how agents collaborate effectively:</p>
<p><strong>Agents</strong>: Agents are the core components of CrewAI, each uniquely defined by a specific role, goal, and backstory. They utilize a language model (LLM) tailored to their persona, which allows them to communicate in a way that aligns with their expertise or field of knowledge. Additionally, agents may have access to specialized tools that enhance their capabilities. This role-playing aspect not only helps agents embody their defined personas but also enables nuanced interactions and more effective problem-solving based on their areas of expertise.</p>
<p><strong>Tasks</strong>: Within CrewAI, tasks are well-defined assignments given to agents, each with a clear description and expected outcomes. Tasks vary in complexity and purpose, and they are assigned to specific agents based on their roles and skill sets. Additionally, tasks can be structured in a manner that allows for chaining—where the output of one task can seamlessly serve as the input for another task. This chaining mechanism facilitates efficient workflows and enhances the overall productivity of the agents.</p>
<p><strong>Tools</strong>: Tools serve as essential functions or capabilities that agents can use to interact with external systems or execute specific actions. Examples of these tools include web search capabilities, API access, and data processing functions. In the CrewAI framework, tools often derive from a BaseTool class, ensuring consistency and reliability in their performance. The availability of various tools empowers agents to perform diverse tasks more effectively, expanding their range of capabilities.</p>
<p><strong>Crew</strong>: The crew represents the collaborative assembly of agents working together to accomplish a set of objectives or tasks. The composition of the crew is vital, as it determines how well agents can leverage each other's strengths and skills. Effective collaboration within the crew relies on clear communication and the coordinated execution of tasks, enhancing the overall effectiveness of the team.</p>
<p><strong>Process</strong>: The process describes the systematic workflow or methodology that the crew adheres to in order to carry out tasks. Various processes may be utilized, including sequential execution, where tasks are completed one after the other, or hierarchical execution, where a lead or manager agent is responsible for delegating tasks to ensure an orderly approach. The defined process is critical for maintaining organization and efficiency in task execution, thereby enabling the crew to meet its objectives strategically and systematically.</p>
<ol>
<li><h3 id="heading-pre-requisites">Pre-Requisites</h3>
<p> <strong>VSCode</strong> - <a target="_blank" href="https://code.visualstudio.com/download">https://code.visualstudio.com/download</a></p>
<p> <strong>Install Python</strong> --- <a target="_blank" href="https://www.python.org/downloads/">https://www.python.org/downloads/</a></p>
</li>
<li><h3 id="heading-crewai-installation">CrewAI Installation:</h3>
 <p> <strong>Run the following commands:</strong></p>
 <p> <strong>python -m venv .venv</strong> (to create a virtual environment)</p>
 <p> <strong>source .venv/bin/activate</strong> (to activate the virtual environment)</p>
 <p> <strong>pip install uv</strong> (to install the uv Python package installer)</p>
 <p> <strong>uv --version</strong> (to check the version)</p>
 <p> <strong>uv tool install crewai</strong> (to install CrewAI)</p>
 <p> <strong>uv tool update-shell</strong> (to add the tool directory to your PATH)</p>
</li>
<li><h3 id="heading-create-project">Create Project</h3>
 <p> <strong>crewai create crew loan_processing</strong> (to create the skeleton project loan_processing)</p>
 <p> This will create a project with the structure shown below</p>
<p> <img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1771296793551/f98956ec-ef56-4ed9-aa94-a165b8e93faa.png" alt /></p>
</li>
<li><p>Create a <strong>requirements.txt</strong> file listing all required libraries and save it at the root of the project</p>
<pre><code class="lang-text"> boto3
 crewai[tools]
 crewai-tools[mcp]
 streamlit==1.49.1
 ratelimit
 tenacity
</code></pre>
</li>
<li><p>Install the libraries by running</p>
<p> <strong>pip install -r requirements.txt</strong></p>
</li>
</ol>
<p><strong>Update</strong> the <strong>agents.yaml</strong> file to add the agents for the app, as shown below</p>
<pre><code class="lang-yaml">doc_specialist:
  role: &gt;
    Document Validation Specialist
  goal: &gt;
    Validate the completeness and format of a new loan application provided as a JSON string.
  backstory: &gt;
    You are a meticulous agent responsible for the first step of loan processing.

credit_analyst:
  role: &gt;
    Credit Check Agent
  goal: &gt;
    Query the credit bureau API to retrieve an applicant's credit score.
  backstory: &gt;
    You are a specialized agent that interacts with the Credit Bureau.

risk_assessor:
  role: &gt;
    Risk Assessment Analyst
  goal: &gt;
    Calculate the financial risk score for a loan application.
  backstory: &gt;
    You are a quantitative analyst agent.

compliance_officer:
  role: &gt;
    Compliance Officer
  goal: &gt;
    Check the application against all internal lending policies and compliance rules.
  backstory: &gt;
    You are the final checkpoint for policy and compliance.

manager:
  role: &gt;
    Loan Processing Manager
  goal: &gt;
    Manage the loan application workflow and compile the final report.
  backstory: &gt;
    You are the manager responsible for orchestrating the loan processing pipeline
</code></pre>
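<p>Every agent entry must supply at least <code>role</code>, <code>goal</code>, and <code>backstory</code>. As a quick sanity check of such a config (sketched here as a plain Python dict rather than the YAML file CrewAI actually loads from <code>config/</code>, and with a deliberately incomplete entry to show the check firing):</p>

```python
# Mirror of the agents.yaml structure as a plain dict (illustrative only;
# CrewAI itself loads the YAML file from the project's config/ directory).
agents_config = {
    "doc_specialist": {
        "role": "Document Validation Specialist",
        "goal": "Validate the completeness and format of a new loan application.",
        "backstory": "You are a meticulous agent responsible for the first step of loan processing.",
    },
    "manager": {
        "role": "Loan Processing Manager",
        "goal": "Manage the loan application workflow and compile the final report.",
        # "backstory" intentionally missing so the check below reports it
    },
}

REQUIRED_KEYS = {"role", "goal", "backstory"}

def missing_keys(config: dict) -> dict:
    """Return {agent_name: set_of_missing_keys} for incomplete entries."""
    return {
        name: REQUIRED_KEYS - set(entry)
        for name, entry in config.items()
        if REQUIRED_KEYS - set(entry)
    }

print(missing_keys(agents_config))  # {'manager': {'backstory'}}
```

<p>Running a check like this before <code>crewai run</code> catches a missing field earlier than a mid-pipeline failure would.</p>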
<p>Update the <strong>tasks.yaml</strong> file to add the tasks</p>
<pre><code class="lang-yaml">task_validate:
  description: &gt;
    Validate the loan application provided as a JSON string: '{document_content}'.
    Pass this string to the 'Validate Document Fields' tool.
  expected_output: &gt;
    A JSON string with the validation status
  agent: doc_specialist

task_credit:
  description: &gt;
    Extract customer_id and call Query Credit Bureau API.
  expected_output: &gt;
    A JSON string containing the credit_score.
  agent: credit_analyst
  context:
    - task_validate

task_risk:
  description: &gt;
    Extract loan details and credit score, then Calculate Risk Score.
  expected_output: &gt;
    A JSON string containing the risk_score.
  agent: risk_assessor
  context:
    - task_validate
    - task_credit

task_compliance:
  description: &gt;
    Check Lending Compliance based on history and risk score.
  expected_output: &gt;
    Compliance status JSON.
  agent: compliance_officer
  context:
    - task_validate
    - task_risk

task_report:
  description: &gt;
    Compile a final report with Approve/Deny decision.
  expected_output: &gt;
    Markdown report.
  agent: manager
  context:
    - task_validate
    - task_credit
    - task_risk
    - task_compliance
</code></pre>
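<p>The <code>context</code> lists form a dependency graph: <code>task_credit</code> consumes <code>task_validate</code>'s output, and <code>task_report</code> depends on everything. CrewAI resolves this internally, but the ordering idea can be sketched with a small topological sort over the same task names (the sort itself is generic stdlib Python, not CrewAI code):</p>

```python
from graphlib import TopologicalSorter  # stdlib, Python 3.9+

# Each task maps to the set of tasks whose output it consumes (its `context`).
task_context = {
    "task_validate": set(),
    "task_credit": {"task_validate"},
    "task_risk": {"task_validate", "task_credit"},
    "task_compliance": {"task_validate", "task_risk"},
    "task_report": {"task_validate", "task_credit", "task_risk", "task_compliance"},
}

# static_order() yields tasks so that every dependency comes before its dependents.
order = list(TopologicalSorter(task_context).static_order())
print(order)
# ['task_validate', 'task_credit', 'task_risk', 'task_compliance', 'task_report']
```

<p>Because each task here depends on all of its predecessors, the graph is a chain and the order is fully determined, with the manager's report always last.</p>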
<p>Update the <strong>crew.py</strong> file to include the following configurations:</p>
<ol>
<li><p><strong>Create an LLM</strong>: Implement the code to initialize and configure the LLM (Large Language Model); in our case I have used an OpenAI model.</p>
</li>
<li><p><strong>Configure Agents</strong>: Within the @CrewBase-decorated class, use the @agent decorator to define and configure all agents that will be part of the crew. Ensure that each agent has a clear role and responsibilities.</p>
</li>
<li><p><strong>Define Tasks</strong>: Similarly, utilize the @task decorator to specify all tasks that agents will be responsible for. Make sure each task is well-defined and includes the necessary parameters and execution criteria.</p>
</li>
<li><p><strong>Assemble the Crew</strong>: Use the @crew decorator to bring together the agents, tasks, and a designated manager. The assembly should clearly delineate how each component interacts and their roles within the overall crew structure.</p>
</li>
<li><p><strong>Define Tools</strong>: Finally, specify all the required tools using the @tool decorator. Ensure that each tool is attached to the relevant agent for seamless integration.</p>
</li>
</ol>
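<p>The @agent/@task/@crew decorators are CrewAI's own machinery, but the registration pattern behind them can be mimicked in a few lines of plain Python (hypothetical names, purely illustrative of how decorators collect methods at class-definition time):</p>

```python
def make_registry():
    """Return a decorator plus the list it appends decorated names to."""
    registered = []
    def register(func):
        registered.append(func.__name__)  # record each decorated method
        return func                       # leave the method itself unchanged
    return register, registered

agent, agents = make_registry()
task, tasks = make_registry()

class MiniCrew:
    @agent
    def doc_specialist(self):
        return "doc agent"

    @agent
    def manager(self):
        return "manager agent"

    @task
    def task_validate(self):
        return "validate"

print(agents)  # ['doc_specialist', 'manager']
print(tasks)   # ['task_validate']
```

<p>CrewAI's real decorators do considerably more (caching, config binding), but the core idea, collecting the decorated methods so the crew can assemble them later, is the same.</p>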
<p>With these changes we have created a well-structured and functional crew configuration within the <strong>crew.py</strong> file.</p>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> crewai <span class="hljs-keyword">import</span> LLM, Agent, Crew, Process, Task
<span class="hljs-keyword">from</span> crewai.project <span class="hljs-keyword">import</span> CrewBase, agent, crew, task
<span class="hljs-keyword">from</span> crewai.agents.agent_builder.base_agent <span class="hljs-keyword">import</span> BaseAgent
<span class="hljs-keyword">from</span> typing <span class="hljs-keyword">import</span> List
<span class="hljs-keyword">from</span> crewai.tools <span class="hljs-keyword">import</span> tool
<span class="hljs-keyword">import</span> json
<span class="hljs-keyword">import</span> sys
<span class="hljs-keyword">from</span> datetime <span class="hljs-keyword">import</span> datetime
<span class="hljs-keyword">import</span> os

<span class="hljs-comment">#load the OPENAI_API_KEY from environment variable or set it directly</span>
OPENAI_API_KEY = os.getenv(<span class="hljs-string">"OPENAI_API_KEY"</span>)
llm = LLM(
    model=os.getenv(<span class="hljs-string">"MODEL"</span>, <span class="hljs-string">"gpt-4o"</span>),  <span class="hljs-comment"># Default to gpt-4o if MODEL env variable is not set</span>
    api_key=OPENAI_API_KEY,  <span class="hljs-comment"># Or set OPENAI_API_KEY</span>
    temperature=<span class="hljs-number">0.0</span>,
    max_tokens=<span class="hljs-number">1000</span>,
)

<span class="hljs-meta">@CrewBase</span>
<span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">LoanProcessing</span>():</span>
    <span class="hljs-string">"""LoanProcessing crew"""</span>

    <span class="hljs-comment">#@title Run CrewAI</span>
    agents: List[BaseAgent]
    tasks: List[Task]

<span class="hljs-meta">    @agent</span>
    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">doc_specialist</span>(<span class="hljs-params">self</span>) -&gt; Agent:</span>
        <span class="hljs-keyword">return</span> Agent(
                config=self.agents_config[<span class="hljs-string">'doc_specialist'</span>], <span class="hljs-comment"># type: ignore[index]</span>
                verbose=<span class="hljs-literal">True</span>,
                tools=[self.ValidateDocumentFieldsTool],
                llm=llm
            )

<span class="hljs-meta">    @agent</span>
    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">credit_analyst</span>(<span class="hljs-params">self</span>) -&gt; Agent:</span>
        <span class="hljs-keyword">return</span> Agent(
                config=self.agents_config[<span class="hljs-string">'credit_analyst'</span>], <span class="hljs-comment"># type: ignore[index]</span>
                verbose=<span class="hljs-literal">True</span>,
                tools=[self.QueryCreditBureauAPITool],
                llm=llm
            )       
<span class="hljs-meta">    @agent</span>
    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">risk_assessor</span>(<span class="hljs-params">self</span>) -&gt; Agent:</span>
        <span class="hljs-keyword">return</span> Agent(
                config=self.agents_config[<span class="hljs-string">'risk_assessor'</span>], <span class="hljs-comment"># type: ignore[index]</span>
                verbose=<span class="hljs-literal">True</span>,
                tools=[self.CalculateRiskScoreTool],
                llm=llm
            )
<span class="hljs-meta">    @agent</span>
    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">compliance_officer</span>(<span class="hljs-params">self</span>) -&gt; Agent:</span>
        <span class="hljs-keyword">return</span> Agent(
                config=self.agents_config[<span class="hljs-string">'compliance_officer'</span>], <span class="hljs-comment"># type: ignore[index]</span>
                verbose=<span class="hljs-literal">True</span>,
                tools=[self.CheckLendingComplianceTool],
                llm=llm
            )
<span class="hljs-meta">    @agent</span>
    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">manager</span>(<span class="hljs-params">self</span>) -&gt; Agent:</span>
        <span class="hljs-keyword">return</span> Agent(
                config=self.agents_config[<span class="hljs-string">'manager'</span>], <span class="hljs-comment"># type: ignore[index]</span>
                verbose=<span class="hljs-literal">True</span>,
                llm=llm,
                allow_delegation=<span class="hljs-literal">True</span>
            )
<span class="hljs-meta">    @task</span>
    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">task_validate</span>(<span class="hljs-params">self</span>) -&gt; Task:</span>
            <span class="hljs-keyword">return</span> Task(
                config=self.tasks_config[<span class="hljs-string">'task_validate'</span>], <span class="hljs-comment"># type: ignore[index]</span>
            )

<span class="hljs-meta">    @task</span>
    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">task_credit</span>(<span class="hljs-params">self</span>) -&gt; Task:</span>
            <span class="hljs-keyword">return</span> Task(
                config=self.tasks_config[<span class="hljs-string">'task_credit'</span>], <span class="hljs-comment"># type: ignore[index]</span>
            )
<span class="hljs-meta">    @task</span>
    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">task_risk</span>(<span class="hljs-params">self</span>) -&gt; Task:</span>
            <span class="hljs-keyword">return</span> Task(
                config=self.tasks_config[<span class="hljs-string">'task_risk'</span>], <span class="hljs-comment"># type: ignore[index]</span>
            )
<span class="hljs-meta">    @task</span>
    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">task_compliance</span>(<span class="hljs-params">self</span>) -&gt; Task:</span>
            <span class="hljs-keyword">return</span> Task(
                config=self.tasks_config[<span class="hljs-string">'task_compliance'</span>], <span class="hljs-comment"># type: ignore[index]</span>
            )
<span class="hljs-meta">    @task</span>
    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">task_report</span>(<span class="hljs-params">self</span>) -&gt; Task:</span>
            <span class="hljs-keyword">return</span> Task(
                config=self.tasks_config[<span class="hljs-string">'task_report'</span>], <span class="hljs-comment"># type: ignore[index]</span>
                <span class="hljs-comment"># note: allow_delegation is an Agent option, not a Task option</span>
            )

<span class="hljs-meta">    @crew</span>
    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">crew</span>(<span class="hljs-params">self</span>) -&gt; Crew:</span>
        <span class="hljs-string">"""Creates the LoanProcessing crew"""</span>
        <span class="hljs-comment"># To learn how to add knowledge sources to your crew, check out the documentation:</span>
        <span class="hljs-comment"># https://docs.crewai.com/concepts/knowledge#what-is-knowledge</span>

        <span class="hljs-keyword">return</span> Crew(
            agents=[self.doc_specialist(), self.credit_analyst(), self.risk_assessor(), self.compliance_officer()], <span class="hljs-comment"># Automatically created by the @agent decorator</span>
            tasks=self.tasks, <span class="hljs-comment"># Automatically created by the @task decorator</span>
            manager_agent=self.manager(), <span class="hljs-comment"># Automatically created by the @agent decorator</span>
            process=Process.hierarchical,
            verbose=<span class="hljs-literal">True</span>
        )
<span class="hljs-meta">    @tool("Validate Document Fields")</span>
    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">ValidateDocumentFieldsTool</span>(<span class="hljs-params">application_data: str</span>) -&gt; str:</span>
        <span class="hljs-string">"""Validates JSON application data."""</span>
        print(<span class="hljs-string">f"--- TOOL: Validating document fields (Data: <span class="hljs-subst">{application_data}</span>) ---"</span>)
        <span class="hljs-keyword">try</span>:
            data = json.loads(application_data)
            required = [<span class="hljs-string">"customer_id"</span>, <span class="hljs-string">"loan_amount"</span>, <span class="hljs-string">"income"</span>, <span class="hljs-string">"credit_history"</span>]
            missing = [f <span class="hljs-keyword">for</span> f <span class="hljs-keyword">in</span> required <span class="hljs-keyword">if</span> f <span class="hljs-keyword">not</span> <span class="hljs-keyword">in</span> data]
            <span class="hljs-keyword">if</span> missing:
                <span class="hljs-keyword">return</span> json.dumps({<span class="hljs-string">"error"</span>: <span class="hljs-string">f"Missing fields: <span class="hljs-subst">{<span class="hljs-string">', '</span>.join(missing)}</span>"</span>})
            <span class="hljs-keyword">return</span> json.dumps({<span class="hljs-string">"status"</span>: <span class="hljs-string">"validated"</span>, <span class="hljs-string">"data"</span>: data})
        <span class="hljs-keyword">except</span> json.JSONDecodeError:
            <span class="hljs-keyword">return</span> json.dumps({<span class="hljs-string">"error"</span>: <span class="hljs-string">"Invalid JSON"</span>})
<span class="hljs-meta">    @tool("Query Credit Bureau API")</span>
    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">QueryCreditBureauAPITool</span>(<span class="hljs-params">customer_id: str</span>) -&gt; str:</span>
        <span class="hljs-string">"""Gets credit score for customer_id."""</span>
        print(<span class="hljs-string">f"--- TOOL: Calling Credit Bureau for <span class="hljs-subst">{customer_id}</span> ---"</span>)
        scores = {
            <span class="hljs-string">"CUST-12345"</span>: <span class="hljs-number">810</span>, <span class="hljs-comment"># Good</span>
            <span class="hljs-string">"CUST-99999"</span>: <span class="hljs-number">550</span>, <span class="hljs-comment"># BAD SCORE (&lt; 600)</span>
            <span class="hljs-string">"CUST-55555"</span>: <span class="hljs-number">620</span>
        }
        score = scores.get(customer_id)
        <span class="hljs-keyword">if</span> score:
            <span class="hljs-keyword">return</span> json.dumps({<span class="hljs-string">"customer_id"</span>: customer_id, <span class="hljs-string">"credit_score"</span>: score})
        <span class="hljs-keyword">return</span> json.dumps({<span class="hljs-string">"error"</span>: <span class="hljs-string">"Customer not found"</span>})
<span class="hljs-meta">    @tool("Calculate Risk Score")</span>
    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">CalculateRiskScoreTool</span>(<span class="hljs-params">loan_amount: int, income: str, credit_score: int</span>) -&gt; str:</span>
        <span class="hljs-string">"""Calculates risk based on financial data."""</span>
        print(<span class="hljs-string">f"--- TOOL: Calculating Risk (Score: <span class="hljs-subst">{credit_score}</span>) ---"</span>)
        <span class="hljs-comment"># Logic: Credit Score &lt; 600 is automatic HIGH risk</span>
        <span class="hljs-keyword">if</span> credit_score &lt; <span class="hljs-number">600</span>:
            <span class="hljs-keyword">return</span> json.dumps({<span class="hljs-string">"risk_score"</span>: <span class="hljs-number">9</span>, <span class="hljs-string">"reason"</span>: <span class="hljs-string">"Credit score too low"</span>})

        <span class="hljs-comment"># Standard logic</span>
        <span class="hljs-keyword">try</span>:
            inc_val = int(<span class="hljs-string">''</span>.join(filter(str.isdigit, income)))
            ann_inc = inc_val * <span class="hljs-number">12</span> <span class="hljs-keyword">if</span> <span class="hljs-string">"month"</span> <span class="hljs-keyword">in</span> income.lower() <span class="hljs-keyword">else</span> inc_val
        <span class="hljs-keyword">except</span> ValueError: ann_inc = <span class="hljs-number">0</span>

        risk = <span class="hljs-number">1</span>
        <span class="hljs-keyword">if</span> credit_score &lt; <span class="hljs-number">720</span>: risk += <span class="hljs-number">2</span>
        <span class="hljs-keyword">if</span> ann_inc &gt; <span class="hljs-number">0</span> <span class="hljs-keyword">and</span> (loan_amount / ann_inc) &gt; <span class="hljs-number">0.5</span>: risk += <span class="hljs-number">3</span>

        <span class="hljs-keyword">return</span> json.dumps({<span class="hljs-string">"risk_score"</span>: min(risk, <span class="hljs-number">10</span>)})
<span class="hljs-meta">    @tool("Check Lending Compliance")</span>
    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">CheckLendingComplianceTool</span>(<span class="hljs-params">loan_amount: int, risk_score: int</span>) -&gt; str:</span>
        <span class="hljs-string">"""Checks if loan complies with lending rules."""</span>
        print(<span class="hljs-string">"--- TOOL: Checking Lending Compliance ---"</span>)
        <span class="hljs-comment"># Simple compliance logic for demo</span>
        <span class="hljs-keyword">if</span> loan_amount &gt; <span class="hljs-number">500000</span>:
            <span class="hljs-keyword">return</span> json.dumps({<span class="hljs-string">"compliant"</span>: <span class="hljs-literal">False</span>, <span class="hljs-string">"reason"</span>: <span class="hljs-string">"Loan amount exceeds limit"</span>})
        <span class="hljs-keyword">if</span> risk_score &gt;= <span class="hljs-number">7</span>:
            <span class="hljs-keyword">return</span> json.dumps({<span class="hljs-string">"compliant"</span>: <span class="hljs-literal">False</span>, <span class="hljs-string">"reason"</span>: <span class="hljs-string">"Risk score too high"</span>})
        <span class="hljs-keyword">return</span> json.dumps({<span class="hljs-string">"compliant"</span>: <span class="hljs-literal">True</span>})
</code></pre>
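<p>The scoring rules inside <code>CalculateRiskScoreTool</code> are a pure function of their inputs, so they are easy to exercise outside the crew. Here is the same logic reimplemented as a plain function (without the @tool wrapper, returning a dict instead of a JSON string) together with a few spot checks:</p>

```python
def calculate_risk(loan_amount: int, income: str, credit_score: int) -> dict:
    """Same rules as CalculateRiskScoreTool, returned as a dict for easy testing."""
    if credit_score < 600:  # automatic HIGH risk
        return {"risk_score": 9, "reason": "Credit score too low"}
    try:
        # Pull digits out of strings like "USD 120000 a year" / "USD 5000 a month"
        inc_val = int("".join(filter(str.isdigit, income)))
        ann_inc = inc_val * 12 if "month" in income.lower() else inc_val
    except ValueError:
        ann_inc = 0
    risk = 1
    if credit_score < 720:
        risk += 2
    if ann_inc > 0 and (loan_amount / ann_inc) > 0.5:
        risk += 3
    return {"risk_score": min(risk, 10)}

print(calculate_risk(50000, "USD 120000 a year", 810))   # {'risk_score': 1}
print(calculate_risk(50000, "USD 40000 a year", 550))    # automatic high risk (9)
print(calculate_risk(200000, "USD 5000 a month", 620))   # {'risk_score': 6}
```

<p>Checking the branch logic this way, before wiring the tool into an agent, saves a lot of LLM round-trips while debugging.</p>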
<p>Next, update the <strong>main.py</strong> file of the CrewAI application to implement a <code>run</code> method. This method is the entry point of the app, taking in the inputs needed to kick off the crew.</p>
<p>The input is a dictionary with a key named <strong>document_content</strong>, whose value is a JSON string with the following attributes:</p>
<ul>
<li><p><strong>customer_id</strong>: A unique identifier for each customer.</p>
</li>
<li><p><strong>loan_amount</strong>: The total amount of money requested for the loan.</p>
</li>
<li><p><strong>income</strong>: The customer's declared income, which helps assess their ability to repay the loan.</p>
</li>
<li><p><strong>credit_history</strong>: A record of the customer's credit behavior, indicating their reliability in managing debt.</p>
</li>
</ul>
<p>These attributes are fetched using an identifier known as <strong>document_id</strong>, ensuring that we are working with the correct data for each customer. This setup allows the CrewAI app to process loan requests based on the provided information.</p>
<p>The code is shown below:</p>
<pre><code class="lang-python"><span class="hljs-comment">#!/usr/bin/env python</span>
<span class="hljs-keyword">import</span> sys
<span class="hljs-keyword">import</span> warnings

<span class="hljs-keyword">from</span> datetime <span class="hljs-keyword">import</span> datetime
<span class="hljs-keyword">import</span> json
<span class="hljs-keyword">from</span> tenacity <span class="hljs-keyword">import</span> retry, stop_after_attempt, wait_exponential, retry_if_exception
<span class="hljs-keyword">from</span> ratelimit <span class="hljs-keyword">import</span> limits, sleep_and_retry
<span class="hljs-keyword">import</span> time

<span class="hljs-keyword">from</span> loan_processing.crew <span class="hljs-keyword">import</span> LoanProcessing


<span class="hljs-comment"># --- CONFIGURATION ---</span>
CALLS = <span class="hljs-number">15</span>  <span class="hljs-comment"># Max calls...</span>
PERIOD = <span class="hljs-number">60</span> <span class="hljs-comment"># ...per minute</span>

loan_application_inputs_valid = {
    <span class="hljs-string">"applicant_id"</span>: <span class="hljs-string">"borrower_good_780"</span>,
    <span class="hljs-string">"document_id"</span>: <span class="hljs-string">"document_valid_123"</span>
}

loan_application_inputs_invalid = {
    <span class="hljs-string">"applicant_id"</span>: <span class="hljs-string">"borrower_bad_620"</span>,
    <span class="hljs-string">"document_id"</span>: <span class="hljs-string">"document_invalid_456"</span>
}    


warnings.filterwarnings(<span class="hljs-string">"ignore"</span>, category=SyntaxWarning, module=<span class="hljs-string">"pysbd"</span>)

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">run</span>():</span>
    <span class="hljs-string">"""
    Run the crew.
    """</span>

    <span class="hljs-keyword">try</span>:
        print(<span class="hljs-string">"--- KICKING OFF CREWAI (VALID INPUTS) ---"</span>)
        valid_json = get_document_content(loan_application_inputs_valid[<span class="hljs-string">'document_id'</span>])
        inputs = {<span class="hljs-string">'document_content'</span>: valid_json}
        robust_execute(LoanProcessing().crew().kickoff, inputs=inputs)
    <span class="hljs-keyword">except</span> Exception <span class="hljs-keyword">as</span> e:
        <span class="hljs-keyword">import</span> traceback
        traceback.print_exc()
        handle_execution_error(e)

<span class="hljs-comment"># --- 1. HELPER: Mock Document Fetcher ---</span>
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">get_document_content</span>(<span class="hljs-params">document_id: str</span>) -&gt; str:</span>
    print(<span class="hljs-string">f"--- HELPER: Simulating fetch for doc_id: <span class="hljs-subst">{document_id}</span> ---"</span>)

    <span class="hljs-keyword">if</span> document_id == <span class="hljs-string">"document_valid_123"</span>:
        <span class="hljs-comment"># Happy Path: High Income, Good History</span>
        <span class="hljs-keyword">return</span> json.dumps({
            <span class="hljs-string">"customer_id"</span>: <span class="hljs-string">"CUST-12345"</span>,
            <span class="hljs-string">"loan_amount"</span>: <span class="hljs-number">50000</span>,
            <span class="hljs-string">"income"</span>: <span class="hljs-string">"USD 120000 a year"</span>,
            <span class="hljs-string">"credit_history"</span>: <span class="hljs-string">"7 years good standing"</span>
        })

    <span class="hljs-keyword">elif</span> document_id == <span class="hljs-string">"document_risky_789"</span>:
        <span class="hljs-comment"># Unhappy Path: Valid Docs, but LOW CREDIT SCORE</span>
        <span class="hljs-keyword">return</span> json.dumps({
            <span class="hljs-string">"customer_id"</span>: <span class="hljs-string">"CUST-99999"</span>,
            <span class="hljs-string">"loan_amount"</span>: <span class="hljs-number">50000</span>,
            <span class="hljs-string">"income"</span>: <span class="hljs-string">"USD 40000 a year"</span>,
            <span class="hljs-string">"credit_history"</span>: <span class="hljs-string">"Recent Missed Payments"</span>
        })

    <span class="hljs-keyword">elif</span> document_id == <span class="hljs-string">"document_invalid_456"</span>:
        <span class="hljs-comment"># Broken Path: Missing fields (income)</span>
        <span class="hljs-keyword">return</span> json.dumps({
            <span class="hljs-string">"customer_id"</span>: <span class="hljs-string">"CUST-55555"</span>,
            <span class="hljs-string">"loan_amount"</span>: <span class="hljs-number">200000</span>,
            <span class="hljs-string">"credit_history"</span>: <span class="hljs-string">"1 year"</span>
        })
    <span class="hljs-keyword">else</span>:
        <span class="hljs-keyword">return</span> json.dumps({<span class="hljs-string">"error"</span>: <span class="hljs-string">"Document ID not found."</span>})

<span class="hljs-comment"># --- HELPER: ERROR FILTER ---</span>
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">is_rate_limit_error</span>(<span class="hljs-params">e</span>):</span>
    msg = str(e).lower()
    <span class="hljs-keyword">return</span> <span class="hljs-string">"429"</span> <span class="hljs-keyword">in</span> msg <span class="hljs-keyword">or</span> <span class="hljs-string">"quota"</span> <span class="hljs-keyword">in</span> msg <span class="hljs-keyword">or</span> <span class="hljs-string">"resource exhausted"</span> <span class="hljs-keyword">in</span> msg <span class="hljs-keyword">or</span> <span class="hljs-string">"serviceunavailable"</span> <span class="hljs-keyword">in</span> msg

<span class="hljs-comment"># --- ROBUST WRAPPER ---</span>
<span class="hljs-meta">@sleep_and_retry</span>
<span class="hljs-meta">@limits(calls=CALLS, period=PERIOD)</span>
<span class="hljs-meta">@retry(</span>
    stop=stop_after_attempt(<span class="hljs-number">5</span>),
    wait=wait_exponential(multiplier=<span class="hljs-number">2</span>, min=<span class="hljs-number">4</span>, max=<span class="hljs-number">30</span>),
    retry=retry_if_exception(is_rate_limit_error),
    reraise=<span class="hljs-literal">True</span>
)
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">robust_execute</span>(<span class="hljs-params">func, *args, **kwargs</span>):</span>
    <span class="hljs-string">"""
    Executes any function (CrewAI kickoff, LangGraph invoke) with built-in
    rate limiting and auto-retries for transient API errors.
    """</span>
    print(<span class="hljs-string">f"  &gt;&gt; [Clock <span class="hljs-subst">{time.strftime(<span class="hljs-string">'%X'</span>)}</span>] Executing Agent Action (Safe Mode)..."</span>)
    <span class="hljs-keyword">return</span> func(*args, **kwargs)

<span class="hljs-comment"># --- ERROR HANDLER ---</span>
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">handle_execution_error</span>(<span class="hljs-params">e</span>):</span>
    <span class="hljs-string">"""Prints a clean, professional error report."""</span>
    error_msg = str(e)
    is_quota = <span class="hljs-string">"429"</span> <span class="hljs-keyword">in</span> error_msg <span class="hljs-keyword">or</span> <span class="hljs-string">"quota"</span> <span class="hljs-keyword">in</span> error_msg.lower()

    print(<span class="hljs-string">"\n"</span> + <span class="hljs-string">"━"</span> * <span class="hljs-number">60</span>)
    print(<span class="hljs-string">"  🛑  MISSION ABORTED: SYSTEM CRITICAL ERROR"</span>)
    print(<span class="hljs-string">"━"</span> * <span class="hljs-number">60</span>)

    <span class="hljs-keyword">if</span> is_quota:
        print(<span class="hljs-string">"  ⚠️   CAUSE:    QUOTA EXCEEDED (API Refusal)"</span>)
        print(<span class="hljs-string">"  🔍   CONTEXT:  The LLM provider rejected the request."</span>)
        print(<span class="hljs-string">"\n  🛠️   ACTION:    [1] Wait before retrying"</span>)
        print(<span class="hljs-string">"                  [2] Check API Limits (Free Tier is ~15 RPM)"</span>)
    <span class="hljs-keyword">else</span>:
        print(<span class="hljs-string">f"  ⚠️   CAUSE:    UNEXPECTED EXCEPTION"</span>)
        print(<span class="hljs-string">f"  📝   DETAILS:  <span class="hljs-subst">{error_msg}</span>"</span>)

    print(<span class="hljs-string">"━"</span> * <span class="hljs-number">60</span> + <span class="hljs-string">"\n"</span>)
</code></pre>
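<p>tenacity and ratelimit do the heavy lifting in <code>robust_execute</code> above; the underlying retry-with-exponential-backoff idea fits in a few stdlib-only lines (sleep times shortened here so the sketch runs instantly, and the fake "API" is purely illustrative):</p>

```python
import time

def retry_with_backoff(func, max_attempts=5, base_delay=0.01, multiplier=2,
                       is_retryable=lambda e: True):
    """Call func(); on a retryable exception, wait base_delay * multiplier**n, retry."""
    for attempt in range(max_attempts):
        try:
            return func()
        except Exception as e:
            if attempt == max_attempts - 1 or not is_retryable(e):
                raise  # out of attempts, or a non-transient error: give up
            time.sleep(base_delay * (multiplier ** attempt))  # exponential backoff

# A fake "API" that fails twice with a 429-style error, then succeeds.
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("429 rate limited")
    return "ok"

result = retry_with_backoff(flaky, is_retryable=lambda e: "429" in str(e))
print(result, calls["n"])  # ok 3
```

<p>tenacity adds jitter, wait caps, and cleaner composition on top of this, which is why the real code uses it instead.</p>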
<p>Now run the app with the command:</p>
<h3 id="heading-crewai-run"><strong>crewai run</strong></h3>
<h3 id="heading-github-repo">Github Repo</h3>
<p>Code for this project is available at <a target="_blank" href="https://github.com/learner14/multi-agent-crewai">multi-agent-crewai loan processing app</a></p>
]]></content:encoded></item><item><title><![CDATA[Implementing LSTM RNN using Pytorch]]></title><description><![CDATA[Previously, I wrote an article titled "Recurrent Neural Network," where I delved into the inner workings of Recurrent Neural Networks (RNNs) and their significance in the field of machine learning. Subsequently, I provided a tutorial “Implementing LS...]]></description><link>https://path2ml.com/implementing-lstm-rnn-using-pytorch</link><guid isPermaLink="true">https://path2ml.com/implementing-lstm-rnn-using-pytorch</guid><category><![CDATA[Deep Learning]]></category><category><![CDATA[DeepLearning]]></category><category><![CDATA[Machine Learning]]></category><category><![CDATA[MachineLearning]]></category><category><![CDATA[RNN]]></category><category><![CDATA[LSTM]]></category><dc:creator><![CDATA[Nitin Sharma]]></dc:creator><pubDate>Sat, 31 Jan 2026 16:25:47 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1769876556685/64c863fe-0df7-4b07-bc10-8d75f8c7159a.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Previously, I wrote an article titled "<a target="_blank" href="https://path2ml.com/recurrent-neural-network">Recurrent Neural Network</a>," where I delved into the inner workings of <strong>Recurrent Neural Networks (RNNs)</strong> and their significance in the field of machine learning. Subsequently, I provided a tutorial “<a target="_blank" href="https://path2ml.com/implementing-lstm-rnn-using-keras-and-tensorflow"><strong>Implementing LSTM RNN using Keras and TensorFlow</strong></a><strong>”</strong> that illustrated how to implement <strong>RNNs</strong> using popular deep learning libraries, <strong>Keras and TensorFlow</strong>. In this blog post, I am excited to take it a step further by guiding you through the implementation of <strong>Long Short-Term Memory (LSTM)</strong> networks, a specific type of <strong>RNN</strong>, using <strong>PyTorch</strong>. 
LSTMs are particularly effective in handling sequential data, and I look forward to exploring this powerful tool with you.</p>
<p>I am including a diagram from my previous blog post that illustrates the architecture of <strong>Long Short-Term Memory (LSTM) networks</strong>. This will help to refresh your memory and provide a solid foundation as we begin writing the code.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1769866012116/f4c454bd-d9dc-46e9-bb33-3943595980d1.png" alt class="image--center mx-auto" /></p>
<p>In general, the current input vector \(( x_t )\) and the previous short-term/hidden state \(( h_{t-1} )\) are fed to four different fully connected layers.</p>
<ul>
<li><p>The main layer is the one responsible for outputting the candidate value \(( g_t )\). It analyzes the current inputs \(( x_t )\) and the previous short-term/hidden state \(( h_{t-1} )\). The most important parts of its output are stored in the long-term state \(( c_t )\), while the rest is discarded.</p>
</li>
<li><p>The three other layers (forget, update, and output) are <strong><em>gate controllers</em></strong>. Since they use the logistic activation function <strong>sigmoid</strong>, their outputs range from 0 to 1. As you can see, the gate controllers’ outputs are fed to <strong>element-wise multiplication</strong> operations: if they output <strong>0s</strong> they close the gate, and if they output <strong>1s</strong> they open it. Specifically:</p>
<ul>
<li><p>The <strong><em>forget gate</em></strong> (controlled by \(( f_t )\)) controls which parts of the long-term state \(( c_{t-1} )\) should be erased.</p>
</li>
<li><p>The <strong><em>update gate</em></strong> (controlled by \(( i_t )\)) controls which parts of \(( g_t )\) should be added to the long-term state.</p>
</li>
<li><p>Finally, the <strong><em>output gate</em></strong> (controlled by \(( o_t )\)) controls which parts of the long-term state should be read and output at this time step, both to \(( h_t )\) and to \(( y_t )\).</p>
</li>
</ul>
</li>
</ul>
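<p>The gate mechanics above can be sketched as a single LSTM step in NumPy. This is an illustrative sketch, not the post's PyTorch code; the names <code>f</code>, <code>i</code>, <code>g</code>, <code>o</code> and the stacked weight layout are assumptions:</p>

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM step. W (4H, D), U (4H, H) and b (4H,) stack the
    weights of the four fully connected layers row-wise."""
    z = W @ x_t + U @ h_prev + b
    H = h_prev.shape[0]
    f = sigmoid(z[0:H])          # forget gate: what to erase from c_prev
    i = sigmoid(z[H:2 * H])      # update gate: what to add to the long-term state
    g = np.tanh(z[2 * H:3 * H])  # candidate values (the "main" layer)
    o = sigmoid(z[3 * H:4 * H])  # output gate: what to expose as h_t
    c = f * c_prev + i * g       # new long-term state
    h = o * np.tanh(c)           # new short-term/hidden state
    return h, c
```

<p>Each gate output lies in (0, 1) because of the sigmoid, so it acts as an element-wise soft switch on the multiplications above.</p>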
<p><strong>Let's try to understand this through the image shown below, which uses a simple RNN cell as an example</strong></p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1769956870680/7994127a-9377-4307-af44-c6729d2f582a.png" alt class="image--center mx-auto" /></p>
<ol>
<li><p><strong>Input Data Block (Left)</strong></p>
<ul>
<li><p><strong>Shape:</strong> (Batch Size, Time Steps, Num Features)</p>
</li>
<li><p>This cube represents your multivariate time-series input:</p>
<ul>
<li><p><strong>Batch:</strong> Multiple sequences processed together.</p>
</li>
<li><p><strong>Time Steps (Window Size):</strong> How far back in time you look (T).</p>
</li>
<li><p><strong>Features:</strong> The number of variables at each time step (D).</p>
</li>
</ul>
</li>
<li><p>👉 At each time step ( t ), the model sees all features together, not one at a time.</p>
</li>
</ul>
</li>
<li><p><strong>Feature Vector at Each Time Step</strong></p>
<ul>
<li><p>The feature vectors are represented as follows:</p>
<ul>
<li><p>\(( x_1 = [f_1, f_2, f_3, \ldots, f_D] )\)</p>
</li>
<li><p>\(( x_2 = [f_1, f_2, f_3, \ldots, f_D] )\)</p>
</li>
<li><p>...</p>
</li>
<li><p>\(( x_T = [f_1, f_2, f_3, \ldots, f_D] )\)</p>
</li>
</ul>
</li>
<li><p>This means:</p>
<ul>
<li><p>At time step ( t ), the input consists of one vector of ( D ) features.</p>
</li>
<li><p>There is NOT one RNN per feature; all features are fed simultaneously into the same RNN cell.</p>
</li>
</ul>
</li>
</ul>
</li>
<li><p><strong>RNN Cells (Middle/Right)</strong></p>
<ul>
<li><p>Each box labeled “RNN Cell (One Shared Network)” represents the same RNN, reused at every time step.</p>
</li>
<li><p>Important points:</p>
<ul>
<li><p>The boxes are drawn multiple times only to illustrate time unfolding.</p>
</li>
<li><p>Weights are shared across all time steps.</p>
</li>
<li><p>This is one RNN, not many.</p>
</li>
</ul>
</li>
<li><p>Mathematically:</p>
<ul>
<li><p>\(( h_t = RNN(x_t, h_{t-1}) )\)</p>
</li>
<li><p>Where:</p>
<ul>
<li><p>\(( x_t )\) = all features at time ( t )</p>
</li>
<li><p>\(( h_t )\) = hidden state at time ( t )</p>
</li>
</ul>
</li>
</ul>
</li>
</ul>
</li>
<li><p><strong>Hidden States</strong> \(\mathbf{(( h_1, h_2, \ldots, h_T ))}\)</p>
<ul>
<li><p>Each RNN cell outputs:</p>
<ul>
<li>\(( h_1, h_2, h_3, \ldots, h_T )\)</li>
</ul>
</li>
<li><p>These represent:</p>
<ul>
<li><p>The model's memory after processing data up to that time.</p>
</li>
<li><p>Each \(( h_t )\) is typically a vector of size "hidden_size."</p>
</li>
</ul>
</li>
<li><p>Hidden states:</p>
<ul>
<li><p>Change at every time step.</p>
</li>
<li><p>Carry temporal information forward.</p>
</li>
</ul>
</li>
</ul>
</li>
</ol>
<p>At every time step, one <strong>RNN</strong> processes all features together, updates its memory, and passes that memory to the next time step using the same weights.</p>
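<p>The shapes described above can be checked directly in PyTorch with a plain <code>nn.RNN</code>; the toy sizes here (batch 8, T = 5, D = 3) are arbitrary:</p>

```python
import torch
import torch.nn as nn

# One shared RNN applied across all time steps of a multivariate series.
rnn = nn.RNN(input_size=3, hidden_size=16, batch_first=True)
x = torch.randn(8, 5, 3)   # (batch, time steps, features)
out, h_n = rnn(x)
print(out.shape)   # torch.Size([8, 5, 16]) – one hidden state per time step
print(h_n.shape)   # torch.Size([1, 8, 16]) – final hidden state h_T
```

<p>For a single-layer, unidirectional RNN, the last slice of <code>out</code> equals <code>h_n</code>: the same cell and weights produced every step.</p>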
<h3 id="heading-implementation">Implementation</h3>
<p>I have utilized <strong>Jupyter Notebook</strong>, which is installed on my <strong>Mac</strong>, to run my code. However, an excellent alternative is <strong>Google Colab</strong>, which allows for seamless execution of the code presented in this blog.</p>
<p>In order to work with the code effectively, there are specific packages that I need to install within my virtual environment. These packages include:</p>
<ul>
<li><p><strong>PyTorch</strong>: A powerful deep learning library that provides flexibility and ease of use for building and training neural networks.</p>
</li>
<li><p><strong>TorchMetrics</strong>: A library that offers a wide range of metrics for evaluating the performance of machine learning models.</p>
</li>
</ul>
<h3 id="heading-loading-packages">Loading packages</h3>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> torch
<span class="hljs-keyword">import</span> torch.nn <span class="hljs-keyword">as</span> nn
<span class="hljs-keyword">import</span> torch.optim <span class="hljs-keyword">as</span> optim
<span class="hljs-keyword">import</span> pandas <span class="hljs-keyword">as</span> pd
<span class="hljs-keyword">import</span> numpy <span class="hljs-keyword">as</span> np
<span class="hljs-keyword">import</span> matplotlib.pyplot <span class="hljs-keyword">as</span> plt
<span class="hljs-keyword">from</span> torch.utils.data <span class="hljs-keyword">import</span> DataLoader
<span class="hljs-keyword">from</span> pathlib <span class="hljs-keyword">import</span> Path
<span class="hljs-keyword">import</span> tarfile
<span class="hljs-keyword">import</span> urllib.request
<span class="hljs-keyword">import</span> torchmetrics
</code></pre>
<h3 id="heading-leveraging-gpumps-to-boost-performance">Leveraging GPU/MPS to boost performance</h3>
<pre><code class="lang-python"><span class="hljs-keyword">if</span> torch.cuda.is_available():
    device = <span class="hljs-string">"cuda"</span>
<span class="hljs-keyword">elif</span> torch.backends.mps.is_available():
    device = <span class="hljs-string">"mps"</span>
<span class="hljs-keyword">else</span>:
    device = <span class="hljs-string">"cpu"</span>
device
</code></pre>
<h3 id="heading-download-the-data">Download the data</h3>
<pre><code class="lang-python"><span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">download_and_extract_ridership_data</span>():</span>
    tarball_path = Path(<span class="hljs-string">"datasets/ridership.tgz"</span>)
    <span class="hljs-keyword">if</span> <span class="hljs-keyword">not</span> tarball_path.is_file():
        Path(<span class="hljs-string">"datasets"</span>).mkdir(parents=<span class="hljs-literal">True</span>, exist_ok=<span class="hljs-literal">True</span>)
        url = <span class="hljs-string">"https://github.com/learner14/data"</span>
        urllib.request.urlretrieve(url, tarball_path)
        <span class="hljs-keyword">with</span> tarfile.open(tarball_path) <span class="hljs-keyword">as</span> ridership_tarball:
            ridership_tarball.extractall(path=<span class="hljs-string">"datasets"</span>, filter=<span class="hljs-string">"data"</span>)

download_and_extract_ridership_data()
</code></pre>
<pre><code class="lang-python">path = Path(<span class="hljs-string">"datasets/ridership/CTA_-_Ridership_-_Daily_Boarding_Totals.csv"</span>)
df = pd.read_csv(path, parse_dates=[<span class="hljs-string">"service_date"</span>])
df.columns = [<span class="hljs-string">"date"</span>, <span class="hljs-string">"day_type"</span>, <span class="hljs-string">"bus"</span>, <span class="hljs-string">"rail"</span>, <span class="hljs-string">"total"</span>]  <span class="hljs-comment"># shorter names</span>
df = df.sort_values(<span class="hljs-string">"date"</span>).set_index(<span class="hljs-string">"date"</span>)
df = df.drop(<span class="hljs-string">"total"</span>, axis=<span class="hljs-number">1</span>)  
df = df.drop_duplicates()
</code></pre>
<p>Let's look at the first 5 rows of data</p>
<pre><code class="lang-python">df.head()
</code></pre>
<pre><code class="lang-python">
day_type    bus    rail
date            
<span class="hljs-number">2001</span><span class="hljs-number">-01</span><span class="hljs-number">-01</span>    U    <span class="hljs-number">297192</span>    <span class="hljs-number">126455</span>
<span class="hljs-number">2001</span><span class="hljs-number">-01</span><span class="hljs-number">-02</span>    W    <span class="hljs-number">780827</span>    <span class="hljs-number">501952</span>
<span class="hljs-number">2001</span><span class="hljs-number">-01</span><span class="hljs-number">-03</span>    W    <span class="hljs-number">824923</span>    <span class="hljs-number">536432</span>
<span class="hljs-number">2001</span><span class="hljs-number">-01</span><span class="hljs-number">-04</span>    W    <span class="hljs-number">870021</span>    <span class="hljs-number">550011</span>
<span class="hljs-number">2001</span><span class="hljs-number">-01</span><span class="hljs-number">-05</span>    W    <span class="hljs-number">890426</span>    <span class="hljs-number">557917</span>
</code></pre>
<h3 id="heading-plotting-a-sample-to-look-at-data">Plotting a sample to look at data</h3>
<pre><code class="lang-python">plt.rc(<span class="hljs-string">'font'</span>, size=<span class="hljs-number">14</span>)
plt.rc(<span class="hljs-string">'axes'</span>, labelsize=<span class="hljs-number">14</span>, titlesize=<span class="hljs-number">14</span>)
plt.rc(<span class="hljs-string">'legend'</span>, fontsize=<span class="hljs-number">14</span>)
plt.rc(<span class="hljs-string">'xtick'</span>, labelsize=<span class="hljs-number">10</span>)
plt.rc(<span class="hljs-string">'ytick'</span>, labelsize=<span class="hljs-number">10</span>)
</code></pre>
<pre><code class="lang-python">df[<span class="hljs-string">"2021-03"</span>:<span class="hljs-string">"2021-05"</span>].plot(grid=<span class="hljs-literal">True</span>, marker=<span class="hljs-string">"."</span>, figsize=(<span class="hljs-number">8</span>, <span class="hljs-number">3.5</span>))
plt.show()
</code></pre>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1769869967769/4e5c0926-0df0-44a1-8698-37e7cb2a3f88.png" alt class="image--center mx-auto" /></p>
<h3 id="heading-data-analysis">Data Analysis</h3>
<p>Here is a comprehensive analysis of the time series data concerning Bus and Rail ridership from approximately March to May 2021.</p>
<p><strong>Daily Time Series:</strong></p>
<ul>
<li><p><strong>Bus Ridership (blue):</strong> Ranges from about 160,000 to 360,000.</p>
</li>
<li><p><strong>Rail Ridership (orange):</strong> Ranges from approximately 95,000 to 225,000.</p>
</li>
</ul>
<p><strong>Key Observations:</strong></p>
<ol>
<li><p><strong>Strong Weekly Seasonality:</strong> Both bus and rail ridership exhibit a pronounced pattern of weekly fluctuations.</p>
</li>
<li><p><strong>Upward Trend:</strong> There is a noticeable increase in ridership over time for both modes of transport.</p>
</li>
<li><p><strong>Highly Synchronized Dips:</strong> Whenever bus ridership declines, rail ridership also tends to decline, and vice versa during increases.</p>
</li>
<li><p><strong>Increasing Variability Over Time:</strong> The variability in ridership data appears to be growing.</p>
</li>
</ol>
<p>Given these characteristics, multivariate forecasting models, such as the Multivariate Long Short-Term Memory (LSTM) model, are likely to perform better than univariate models in predicting future ridership trends.</p>
<h3 id="heading-preparing-data-for-model">Preparing data for model</h3>
<p>Let's create a multivariate dataset using both the rail and bus series</p>
<pre><code class="lang-python">df_mulvar = df[[<span class="hljs-string">"rail"</span>, <span class="hljs-string">"bus"</span>]] / <span class="hljs-number">1e6</span>  <span class="hljs-comment"># use both rail &amp; bus series as input</span>
df_mulvar.head()
</code></pre>
<p>The first 5 rows are shown below</p>
<pre><code class="lang-python">
rail    bus
date        
<span class="hljs-number">2001</span><span class="hljs-number">-01</span><span class="hljs-number">-01</span>    <span class="hljs-number">0.126455</span>    <span class="hljs-number">0.297192</span>
<span class="hljs-number">2001</span><span class="hljs-number">-01</span><span class="hljs-number">-02</span>    <span class="hljs-number">0.501952</span>    <span class="hljs-number">0.780827</span>
<span class="hljs-number">2001</span><span class="hljs-number">-01</span><span class="hljs-number">-03</span>    <span class="hljs-number">0.536432</span>    <span class="hljs-number">0.824923</span>
<span class="hljs-number">2001</span><span class="hljs-number">-01</span><span class="hljs-number">-04</span>    <span class="hljs-number">0.550011</span>    <span class="hljs-number">0.870021</span>
<span class="hljs-number">2001</span><span class="hljs-number">-01</span><span class="hljs-number">-05</span>    <span class="hljs-number">0.557917</span>    <span class="hljs-number">0.890426</span>
</code></pre>
<p>We add the day type to the data as a one-hot encoding over the three day types:</p>
<p>next_day_type_A: Saturday<br />next_day_type_U: Sunday and holidays<br />next_day_type_W: weekdays</p>
<pre><code class="lang-python">df_mulvar[<span class="hljs-string">"next_day_type"</span>] = df[<span class="hljs-string">"day_type"</span>].shift(<span class="hljs-number">-1</span>)  <span class="hljs-comment"># we know tomorrow's type</span>
df_mulvar = pd.get_dummies(df_mulvar, dtype=float)  <span class="hljs-comment"># one-hot encode day type</span>
df_mulvar.head()
</code></pre>
<pre><code class="lang-python">
rail    bus    next_day_type_A    next_day_type_U    next_day_type_W
date                    
<span class="hljs-number">2001</span><span class="hljs-number">-01</span><span class="hljs-number">-01</span>    <span class="hljs-number">0.126455</span>    <span class="hljs-number">0.297192</span>    <span class="hljs-number">0.0</span>    <span class="hljs-number">0.0</span>    <span class="hljs-number">1.0</span>
<span class="hljs-number">2001</span><span class="hljs-number">-01</span><span class="hljs-number">-02</span>    <span class="hljs-number">0.501952</span>    <span class="hljs-number">0.780827</span>    <span class="hljs-number">0.0</span>    <span class="hljs-number">0.0</span>    <span class="hljs-number">1.0</span>
<span class="hljs-number">2001</span><span class="hljs-number">-01</span><span class="hljs-number">-03</span>    <span class="hljs-number">0.536432</span>    <span class="hljs-number">0.824923</span>    <span class="hljs-number">0.0</span>    <span class="hljs-number">0.0</span>    <span class="hljs-number">1.0</span>
<span class="hljs-number">2001</span><span class="hljs-number">-01</span><span class="hljs-number">-04</span>    <span class="hljs-number">0.550011</span>    <span class="hljs-number">0.870021</span>    <span class="hljs-number">0.0</span>    <span class="hljs-number">0.0</span>    <span class="hljs-number">1.0</span>
<span class="hljs-number">2001</span><span class="hljs-number">-01</span><span class="hljs-number">-05</span>    <span class="hljs-number">0.557917</span>    <span class="hljs-number">0.890426</span>    <span class="hljs-number">1.0</span>    <span class="hljs-number">0.0</span>    <span class="hljs-number">0.0</span>
</code></pre>
<p>The following code generates sliding windows from a time series, which can be used for predicting the next value in the sequence. The input parameters consist of the time series data itself and a specified window length, denoted as T. This method enables the extraction of overlapping segments of the series, facilitating the analysis and forecasting of future data points based on the patterns observed in these windows.</p>
<pre><code class="lang-python"><span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">MulvarTimeSeriesDataset</span>(<span class="hljs-params">TimeSeriesDataset</span>):</span>
    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">__getitem__</span>(<span class="hljs-params">self, idx</span>):</span>
        window, target = super().__getitem__(idx)
        <span class="hljs-keyword">return</span> window, target[:<span class="hljs-number">2</span>]
</code></pre>
<p>We now create train, validation, and test tensors through date slicing. Using pandas' label-based time slicing, the expression <strong>df_mulvar["2016-01":"2018-12"]</strong> selects all columns of the multivariate frame for the specified date range.</p>
<p>The rail and bus counts were already scaled by dividing by 1 million (1e6) when we built <strong>df_mulvar</strong>. Keeping the values numerically small contributes to training stability.</p>
<p>The splits are as follows:</p>
<ul>
<li><p><strong>rail_train</strong>: January 2016 to December 2018</p>
</li>
<li><p><strong>rail_valid</strong>: January 2019 to May 2019</p>
</li>
<li><p><strong>rail_test</strong>: June 2019 to the end of the data</p>
</li>
</ul>
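<p>One subtlety worth noting: pandas label-based slicing on a DatetimeIndex includes <em>both</em> endpoints, unlike ordinary positional slicing. A tiny demo with made-up data (not the CTA set):</p>

```python
import numpy as np
import pandas as pd

# Toy daily series to show inclusive date slicing.
idx = pd.date_range("2016-01-01", periods=10, freq="D")
df_demo = pd.DataFrame({"rail": np.arange(10)}, index=idx)
sliced = df_demo["2016-01-03":"2016-01-05"]
print(len(sliced))   # 3 – Jan 3, 4 and 5 are all included
```

<p>So the three splits above partition the data cleanly without any overlapping days.</p>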
<p>Typically, these tensors can be fed into a <strong>TimeSeriesDataset</strong> with a specified <strong>window_length</strong> to generate sliding windows for next-day predictions.</p>
<pre><code class="lang-python">mulvar_train = torch.FloatTensor(df_mulvar[<span class="hljs-string">"2016-01"</span>:<span class="hljs-string">"2018-12"</span>].values)
mulvar_valid = torch.FloatTensor(df_mulvar[<span class="hljs-string">"2019-01"</span>:<span class="hljs-string">"2019-05"</span>].values)
mulvar_test = torch.FloatTensor(df_mulvar[<span class="hljs-string">"2019-06"</span>:].values)
</code></pre>
<p>Create a <strong>MulvarTimeSeriesDataset</strong> and <strong>DataLoader</strong> for the <strong>train, validation, and test</strong> splits</p>
<pre><code class="lang-python">window_length = <span class="hljs-number">56</span>
mulvar_train_set = MulvarTimeSeriesDataset(mulvar_train, window_length)
mulvar_train_loader = DataLoader(mulvar_train_set, batch_size=<span class="hljs-number">32</span>, shuffle=<span class="hljs-literal">True</span>)
mulvar_valid_set = MulvarTimeSeriesDataset(mulvar_valid, window_length)
mulvar_valid_loader = DataLoader(mulvar_valid_set, batch_size=<span class="hljs-number">32</span>)
mulvar_test_set = MulvarTimeSeriesDataset(mulvar_test, window_length)
mulvar_test_loader = DataLoader(mulvar_test_set, batch_size=<span class="hljs-number">32</span>)
</code></pre>
<p>Create the following functions to evaluate and train the <strong>LSTM model</strong>:</p>
<ul>
<li><p><strong>evaluate_tm</strong>(<strong>model, data_loader, metric</strong>)</p>
</li>
<li><p><strong>train(model, optimizer, loss_fn, metric, train_loader, valid_loader, n_epochs, patience=10, factor=0.1)</strong></p>
</li>
</ul>
<h3 id="heading-evaltm-functionmodel-dataloader-metric">evaluate_tm function(<strong>model, data_loader, metric</strong>)</h3>
<p>The following process evaluates a trained model on a dataset and returns a performance metric (e.g., Mean Absolute Error (MAE) or Accuracy) without updating the model weights.</p>
<ol>
<li><p><strong>Evaluation Mode:</strong> Use <code>model.eval()</code> to set the model in evaluation mode. This adjusts the behavior of layers like dropout and batch normalization to ensure consistent results during inference.</p>
</li>
<li><p><strong>No Gradients:</strong> Employ <code>torch.no_grad()</code> to disable gradient tracking. This saves memory and speeds up the evaluation process.</p>
</li>
<li><p><strong>Batch Loop:</strong> For each batch of data:</p>
<ul>
<li><p>Move <code>X_batch</code> and <code>y_batch</code> to the appropriate device (e.g., GPU).</p>
</li>
<li><p>Compute predictions: <code>y_pred = model(X_batch)</code>.</p>
</li>
<li><p>Update the performance metric using the predictions and the true targets.</p>
</li>
</ul>
</li>
<li><p><strong>Metric Lifecycle:</strong></p>
<ul>
<li><p>Call <code>metric.reset()</code> to clear any prior state.</p>
</li>
<li><p>Use <code>metric.compute()</code> to return the aggregated score across all batches.</p>
</li>
</ul>
</li>
</ol>
<p><strong>Inputs:</strong></p>
<ul>
<li><p><code>model</code> (an instance of <code>nn.Module</code>)</p>
</li>
<li><p><code>data_loader</code> (provides batches for evaluation)</p>
</li>
<li><p><code>metric</code> (an object from <code>torchmetrics</code>)</p>
</li>
</ul>
<p><strong>Output:</strong> The evaluation process outputs a single scalar metric that summarizes the model's performance over the entire data loader.</p>
<pre><code class="lang-python"><span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">evaluate_tm</span>(<span class="hljs-params">model, data_loader, metric</span>):</span>
    model.eval()
    metric.reset()
    <span class="hljs-keyword">with</span> torch.no_grad():
        <span class="hljs-keyword">for</span> X_batch, y_batch <span class="hljs-keyword">in</span> data_loader:
            X_batch, y_batch = X_batch.to(device), y_batch.to(device)
            y_pred = model(X_batch)
            metric.update(y_pred, y_batch)
    <span class="hljs-keyword">return</span> metric.compute()
</code></pre>
<h3 id="heading-trainmodel-optimizer-lossfn-metric-trainloader-validloader-nepochs-patience10-factor01"><strong>train(model, optimizer, loss_fn, metric, train_loader, valid_loader, n_epochs, patience=10, factor=0.1)</strong></h3>
<p>This function orchestrates the model training process while tracking metrics and adapting the learning rate.</p>
<p><strong>Inputs:</strong> It requires the following parameters: <code>model</code>, <code>optimizer</code>, <code>loss_fn</code>, <code>metric</code>, <code>train_loader</code>, <code>valid_loader</code>, <code>n_epochs</code>, <code>patience</code>, and <code>factor</code>.</p>
<p><strong>Scheduler:</strong> The function employs <code>ReduceLROnPlateau</code> with <code>mode="min"</code> to decrease the learning rate when there is no improvement in the validation metric.</p>
<p><strong>Loop:</strong> For each epoch, the function computes <code>y_pred</code>, calculates the loss, performs backpropagation (<code>loss.backward()</code>), updates the model parameters (<code>optimizer.step()</code>), resets the gradients (<code>optimizer.zero_grad()</code>), and updates the metric.</p>
<p><strong>Logging:</strong> It records the average training loss and training metric, evaluates the validation metric using <code>evaluate_tm()</code>, and prints a concise summary of the epoch.</p>
<p><strong>Output:</strong> The function returns a history object containing <code>train_losses</code>, <code>train_metrics</code>, and <code>valid_metrics</code>, which can be used for plotting and analysis.</p>
<pre><code class="lang-python">
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">train</span>(<span class="hljs-params">model, optimizer, loss_fn, metric, train_loader, valid_loader,
          n_epochs, patience=<span class="hljs-number">10</span>, factor=<span class="hljs-number">0.1</span></span>):</span>
    scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
        optimizer, mode=<span class="hljs-string">"min"</span>, patience=patience, factor=factor)
    history = {<span class="hljs-string">"train_losses"</span>: [], <span class="hljs-string">"train_metrics"</span>: [], <span class="hljs-string">"valid_metrics"</span>: []}
    <span class="hljs-keyword">for</span> epoch <span class="hljs-keyword">in</span> range(n_epochs):
        total_loss = <span class="hljs-number">0.0</span>
        metric.reset()
        model.train()
        <span class="hljs-keyword">for</span> X_batch, y_batch <span class="hljs-keyword">in</span> train_loader:
            X_batch, y_batch = X_batch.to(device), y_batch.to(device)
            y_pred = model(X_batch)
            loss = loss_fn(y_pred, y_batch)
            total_loss += loss.item()
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()
            metric.update(y_pred, y_batch)
        history[<span class="hljs-string">"train_losses"</span>].append(total_loss / len(train_loader))
        history[<span class="hljs-string">"train_metrics"</span>].append(metric.compute().item())
        val_metric = evaluate_tm(model, valid_loader, metric).item()
        history[<span class="hljs-string">"valid_metrics"</span>].append(val_metric)
        scheduler.step(val_metric)
        print(<span class="hljs-string">f"Epoch <span class="hljs-subst">{epoch + <span class="hljs-number">1</span>}</span>/<span class="hljs-subst">{n_epochs}</span>, "</span>
              <span class="hljs-string">f"train loss: <span class="hljs-subst">{history[<span class="hljs-string">'train_losses'</span>][<span class="hljs-number">-1</span>]:<span class="hljs-number">.4</span>f}</span>, "</span>
              <span class="hljs-string">f"train metric: <span class="hljs-subst">{history[<span class="hljs-string">'train_metrics'</span>][<span class="hljs-number">-1</span>]:<span class="hljs-number">.4</span>f}</span>, "</span>
              <span class="hljs-string">f"valid metric: <span class="hljs-subst">{history[<span class="hljs-string">'valid_metrics'</span>][<span class="hljs-number">-1</span>]:<span class="hljs-number">.4</span>f}</span>"</span>)
    <span class="hljs-keyword">return</span> history
</code></pre>
<p>The utility function <strong>fit_and_evaluate()</strong> will be called to train the model.</p>
<ul>
<li><p>Trains a model on <code>train_loader</code>, evaluates on <code>valid_loader</code>, and returns the best validation score scaled back to riders.</p>
</li>
<li><p>Loss: Uses <code>nn.HuberLoss()</code> for robustness to outliers compared to MSE.</p>
</li>
<li><p>Optimizer: Applies <code>SGD</code> with momentum <code>0.95</code> and user-provided learning rate <code>lr</code> for stable convergence.</p>
</li>
<li><p>Metric: Tracks <code>torchmetrics.MeanAbsoluteError</code> (<code>MAE</code>) on the active <code>device</code>.</p>
</li>
<li><p>Training: Delegates to <code>train()</code> with <code>n_epochs</code>, <code>patience</code>, and <code>factor</code> (for <code>ReduceLROnPlateau</code> inside <code>train()</code>), recording loss and metrics each epoch.</p>
</li>
<li><p>Returns <code>min(history["valid_metrics"]) * 1e6</code> — the best validation <code>MAE</code> rescaled from “millions” back to riders for an intuitive number.</p>
</li>
</ul>
<pre><code class="lang-python"><span class="hljs-comment"># extra code – defines a utility function we'll reuse several times</span>

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">fit_and_evaluate</span>(<span class="hljs-params">model, train_loader, valid_loader, lr, n_epochs=<span class="hljs-number">50</span>,
                     patience=<span class="hljs-number">20</span>, factor=<span class="hljs-number">0.1</span></span>):</span>
    loss_fn = nn.HuberLoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=lr, momentum=<span class="hljs-number">0.95</span>)
    metric = torchmetrics.MeanAbsoluteError().to(device)
    history = train(model, optimizer, loss_fn, metric,
                    train_loader, valid_loader, n_epochs=n_epochs,
                    patience=patience, factor=factor)
    <span class="hljs-keyword">return</span> min(history[<span class="hljs-string">"valid_metrics"</span>]) * <span class="hljs-number">1e6</span>
</code></pre>
<h3 id="heading-lstm-module">LSTM module</h3>
<p>This model unrolls an <strong>LSTMCell</strong> over the time dimension (many-to-one). It takes the final hidden state and maps it to the output.</p>
<pre><code class="lang-python"><span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">LstmModel</span>(<span class="hljs-params">nn.Module</span>):</span>
    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">__init__</span>(<span class="hljs-params">self, input_size, hidden_size, output_size</span>):</span>
        super().__init__()
        self.hidden_size = hidden_size
        self.memory_cell = nn.LSTMCell(input_size, hidden_size)
        self.output = nn.Linear(hidden_size, output_size)

    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">forward</span>(<span class="hljs-params">self, X</span>):</span>
        batch_size, window_length, dimensionality = X.shape
        X_time_first = X.transpose(<span class="hljs-number">0</span>, <span class="hljs-number">1</span>)
        H = torch.zeros(batch_size, self.hidden_size, device=X.device)
        C = torch.zeros(batch_size, self.hidden_size, device=X.device)
        <span class="hljs-keyword">for</span> X_t <span class="hljs-keyword">in</span> X_time_first:
            H, C = self.memory_cell(X_t, (H, C))
        <span class="hljs-keyword">return</span> self.output(H)
</code></pre>
<pre><code class="lang-plaintext">Input X (batch, T, input_size)
          │
   time-first transpose → X_time_first (T, batch, input_size)
          │
  X_1 ──► [ LSTMCell ] ──► H_1, C_1
          │
  X_2 ──► [ LSTMCell ] ──► H_2, C_2
          │
   …      [ LSTMCell ]     …
          │
  X_T ──► [ LSTMCell ] ──► H_T, C_T
          │
          ▼
   take final hidden H_T (batch, hidden_size)
          │
          ▼
   Linear(hidden_size → output_size)
          │
          ▼
   y_pred (batch, output_size)
</code></pre>
<ul>
<li><p><code>LSTMCell</code> runs one step at a time; <code>nn.LSTM</code> can process all steps at once (faster on GPU).</p>
</li>
<li><p>Many-to-one setup: a single output per sequence, taken from the last hidden state.</p>
</li>
<li><p>For many-to-many tasks, apply a head to each time step (e.g., over all hidden states).</p>
</li>
</ul>
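<p>For comparison, here is a sketch (not from the original post) of the same many-to-one model built on <code>nn.LSTM</code>, which processes the whole window in one call:</p>

```python
import torch
import torch.nn as nn

class LstmModelFast(nn.Module):
    """Same many-to-one architecture as LstmModel, but using nn.LSTM
    so the time loop runs inside the library instead of Python."""
    def __init__(self, input_size, hidden_size, output_size):
        super().__init__()
        self.lstm = nn.LSTM(input_size, hidden_size, batch_first=True)
        self.output = nn.Linear(hidden_size, output_size)

    def forward(self, X):                  # X: (batch, T, input_size)
        out, (h_n, c_n) = self.lstm(X)     # out: (batch, T, hidden_size)
        return self.output(out[:, -1])     # use the last hidden state only
```

<p>With <code>batch_first=True</code> no manual transpose is needed, and the initial states default to zeros, matching the explicit loop above.</p>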
<h3 id="heading-train-the-model">Train the Model</h3>
<pre><code class="lang-python">torch.manual_seed(<span class="hljs-number">42</span>)
Lstm_model = LstmModel(
    input_size=<span class="hljs-number">5</span>, hidden_size=<span class="hljs-number">32</span>, output_size=<span class="hljs-number">2</span>).to(device)
fit_and_evaluate(Lstm_model, mulvar_train_loader, mulvar_valid_loader, lr=<span class="hljs-number">0.05</span>, n_epochs=<span class="hljs-number">50</span>)
</code></pre>
<pre><code class="lang-python">Epoch <span class="hljs-number">1</span>/<span class="hljs-number">50</span>, train loss: <span class="hljs-number">0.0675</span>, train metric: <span class="hljs-number">0.3049</span>, valid metric: <span class="hljs-number">0.2044</span>
Epoch <span class="hljs-number">2</span>/<span class="hljs-number">50</span>, train loss: <span class="hljs-number">0.0184</span>, train metric: <span class="hljs-number">0.1556</span>, valid metric: <span class="hljs-number">0.1573</span>
Epoch <span class="hljs-number">3</span>/<span class="hljs-number">50</span>, train loss: <span class="hljs-number">0.0104</span>, train metric: <span class="hljs-number">0.1184</span>, valid metric: <span class="hljs-number">0.0974</span>
...
Epoch <span class="hljs-number">47</span>/<span class="hljs-number">50</span>, train loss: <span class="hljs-number">0.0013</span>, train metric: <span class="hljs-number">0.0364</span>, valid metric: <span class="hljs-number">0.0265</span>
Epoch <span class="hljs-number">48</span>/<span class="hljs-number">50</span>, train loss: <span class="hljs-number">0.0013</span>, train metric: <span class="hljs-number">0.0365</span>, valid metric: <span class="hljs-number">0.0258</span>
Epoch <span class="hljs-number">49</span>/<span class="hljs-number">50</span>, train loss: <span class="hljs-number">0.0013</span>, train metric: <span class="hljs-number">0.0363</span>, valid metric: <span class="hljs-number">0.0305</span>
Epoch <span class="hljs-number">50</span>/<span class="hljs-number">50</span>, train loss: <span class="hljs-number">0.0013</span>, train metric: <span class="hljs-number">0.0368</span>, valid metric: <span class="hljs-number">0.0266</span>
</code></pre>
<h3 id="heading-evaluate-the-model-on-test-dataset">Evaluate the model on test dataset</h3>
<pre><code class="lang-python"><span class="hljs-comment"># Evaluate the trained LSTM model on the test set</span>
metric = torchmetrics.MeanAbsoluteError().to(device)
test_mae = evaluate_tm(Lstm_model, mulvar_test_loader, metric).item()
print(<span class="hljs-string">f"Test MAE: <span class="hljs-subst">{test_mae:<span class="hljs-number">.6</span>f}</span> (<span class="hljs-subst">{test_mae*<span class="hljs-number">1e6</span>:<span class="hljs-number">.2</span>f}</span> riders)"</span>)
</code></pre>
<p><strong>Test MAE: 0.134912 (134911.54 riders)</strong></p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1769873715148/dacc26b1-4103-4703-a017-789d699cc750.png" alt class="image--center mx-auto" /></p>
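<p>The helper <code>evaluate_tm</code> is defined earlier in this series. If it isn't in scope, a minimal stand-in (an assumption, not necessarily the author's exact definition) that accumulates a torchmetrics-style metric over a DataLoader looks like this; it assumes model, batches, and metric already live on the same device:</p>

```python
import torch

def evaluate_tm(model, loader, metric):
    """Sketch: accumulate a torchmetrics-style metric over a DataLoader."""
    model.eval()
    metric.reset()
    with torch.no_grad():
        for X, y in loader:
            metric.update(model(X), y)  # metric tracks a running aggregate
    return metric.compute()
```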
<h3 id="heading-predicting-future-date">Predicting future date</h3>
<p>Let's print the last 5 rows of our dataset:</p>
<pre><code class="lang-python"><span class="hljs-comment"># Also show the last 5 rows for context</span>
print(<span class="hljs-string">"\nLast 5 rows:"</span>)
print(df_mulvar.tail(<span class="hljs-number">5</span>))
</code></pre>
<pre><code class="lang-python">
Last <span class="hljs-number">5</span> rows:
                rail       bus  next_day_type_A  next_day_type_U  \
date                                                               
<span class="hljs-number">2021</span><span class="hljs-number">-11</span><span class="hljs-number">-26</span>  <span class="hljs-number">0.189694</span>  <span class="hljs-number">0.257700</span>              <span class="hljs-number">1.0</span>              <span class="hljs-number">0.0</span>   
<span class="hljs-number">2021</span><span class="hljs-number">-11</span><span class="hljs-number">-27</span>  <span class="hljs-number">0.187065</span>  <span class="hljs-number">0.237839</span>              <span class="hljs-number">0.0</span>              <span class="hljs-number">1.0</span>   
<span class="hljs-number">2021</span><span class="hljs-number">-11</span><span class="hljs-number">-28</span>  <span class="hljs-number">0.147830</span>  <span class="hljs-number">0.184817</span>              <span class="hljs-number">0.0</span>              <span class="hljs-number">0.0</span>   
<span class="hljs-number">2021</span><span class="hljs-number">-11</span><span class="hljs-number">-29</span>  <span class="hljs-number">0.276090</span>  <span class="hljs-number">0.421322</span>              <span class="hljs-number">0.0</span>              <span class="hljs-number">0.0</span>   
<span class="hljs-number">2021</span><span class="hljs-number">-11</span><span class="hljs-number">-30</span>  <span class="hljs-number">0.302349</span>  <span class="hljs-number">0.450230</span>              <span class="hljs-number">0.0</span>              <span class="hljs-number">0.0</span>   

            next_day_type_W  
date                         
<span class="hljs-number">2021</span><span class="hljs-number">-11</span><span class="hljs-number">-26</span>              <span class="hljs-number">0.0</span>  
<span class="hljs-number">2021</span><span class="hljs-number">-11</span><span class="hljs-number">-27</span>              <span class="hljs-number">0.0</span>  
<span class="hljs-number">2021</span><span class="hljs-number">-11</span><span class="hljs-number">-28</span>              <span class="hljs-number">1.0</span>  
<span class="hljs-number">2021</span><span class="hljs-number">-11</span><span class="hljs-number">-29</span>              <span class="hljs-number">1.0</span>  
<span class="hljs-number">2021</span><span class="hljs-number">-11</span><span class="hljs-number">-30</span>              <span class="hljs-number">0.0</span>
</code></pre>
<p>The day-type one-hot columns encode:</p>
<ul>
<li><p>'<strong>W</strong>' (weekday)</p>
</li>
<li><p>'<strong>A</strong>' (Saturday)</p>
</li>
<li><p>'<strong>U</strong>' (Sunday/holiday)</p>
</li>
</ul>
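<p>Columns like these are what one-hot encoding a day-type string produces. A quick sketch with hypothetical data (assuming pandas' <code>get_dummies</code>, which the article's preprocessing likely used):</p>

```python
import pandas as pd

# Hypothetical day-type series for four consecutive days
s = pd.Series(["W", "A", "U", "W"], name="next_day_type")
one_hot = pd.get_dummies(s, prefix="next_day_type", dtype=float)
print(one_hot.columns.tolist())
# ['next_day_type_A', 'next_day_type_U', 'next_day_type_W']
```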
<p>Let's make a prediction for <strong>W (weekday)</strong>.</p>
<p>We will build a single time-series instance using a sliding window of <strong>56</strong> days to feed into the <strong>RNN</strong> model. The one-hot encoding for "<strong>next_day_type_W</strong>" in the last row is set to "1" (weekday), so the model predicts the next day's output (y) for both rail and bus riders.</p>
<pre><code class="lang-python"><span class="hljs-comment"># Predict next day's ridership using the last window and a chosen day type</span>
<span class="hljs-keyword">try</span>:
    future_day_type = <span class="hljs-string">'W'</span>  <span class="hljs-comment"># options: 'W' (weekday), 'A' (Saturday), 'U' (Sunday/holiday)</span>


    <span class="hljs-comment"># Build the window (use defined window_length or default to 56)</span>
    window_len = window_length <span class="hljs-keyword">if</span> <span class="hljs-string">'window_length'</span> <span class="hljs-keyword">in</span> globals() <span class="hljs-keyword">else</span> <span class="hljs-number">56</span>
    X_window = df_mulvar.tail(window_len).values.copy()

    print(<span class="hljs-string">f"Window shape: <span class="hljs-subst">{X_window.shape}</span>"</span>)

    <span class="hljs-comment"># Set the last row's day-type one-hots to the chosen future type</span>
    <span class="hljs-comment"># (columns: 2 = 'A', 3 = 'U', 4 = 'W')</span>
    X_window[<span class="hljs-number">-1</span>, <span class="hljs-number">2</span>:] = <span class="hljs-number">0.0</span>
    X_window[<span class="hljs-number">-1</span>, <span class="hljs-number">4</span>] = <span class="hljs-number">1.0</span>

    print(<span class="hljs-string">f"Modified last row of window for future day type '<span class="hljs-subst">{future_day_type}</span>':"</span>)
    print(X_window[<span class="hljs-number">-1</span>])

    <span class="hljs-comment"># Predict</span>
    Lstm_model.eval()
    <span class="hljs-keyword">with</span> torch.no_grad():
        X_t = torch.FloatTensor(X_window).unsqueeze(<span class="hljs-number">0</span>).to(device)  <span class="hljs-comment"># (1, T, 5)</span>
        print(<span class="hljs-string">f"Input tensor shape for prediction: <span class="hljs-subst">{X_t.shape}</span>"</span>)
        y_pred = Lstm_model(X_t).squeeze(<span class="hljs-number">0</span>).cpu().numpy() 


    rail_pred_m, bus_pred_m = float(y_pred[<span class="hljs-number">0</span>]), float(y_pred[<span class="hljs-number">1</span>])
    last_date = df_mulvar.index[<span class="hljs-number">-1</span>]
    future_date = pd.to_datetime(last_date) + pd.Timedelta(days=<span class="hljs-number">1</span>)

    print(<span class="hljs-string">f"Future date: <span class="hljs-subst">{future_date.date()}</span> (day type=<span class="hljs-subst">{future_day_type}</span>)"</span>)
    print(<span class="hljs-string">f"Predicted (millions): rail=<span class="hljs-subst">{rail_pred_m:<span class="hljs-number">.6</span>f}</span>, bus=<span class="hljs-subst">{bus_pred_m:<span class="hljs-number">.6</span>f}</span>"</span>)
    print(<span class="hljs-string">f"Predicted (riders):   rail=<span class="hljs-subst">{rail_pred_m*<span class="hljs-number">1e6</span>:<span class="hljs-number">.0</span>f}</span>, bus=<span class="hljs-subst">{bus_pred_m*<span class="hljs-number">1e6</span>:<span class="hljs-number">.0</span>f}</span>"</span>)
<span class="hljs-keyword">except</span> NameError <span class="hljs-keyword">as</span> e:
    print(<span class="hljs-string">"Required variables not defined. Ensure df_mulvar, window_length, device, and Lstm_model exist."</span>)
    print(<span class="hljs-string">"Error:"</span>, e)
<span class="hljs-keyword">except</span> Exception <span class="hljs-keyword">as</span> e:
    print(<span class="hljs-string">"Prediction failed:"</span>, e)
</code></pre>
<pre><code class="lang-python">Window shape: (<span class="hljs-number">56</span>, <span class="hljs-number">5</span>)
Modified last row of window <span class="hljs-keyword">for</span> future day type <span class="hljs-string">'W'</span>:
[<span class="hljs-number">0.302349</span> <span class="hljs-number">0.45023</span>  <span class="hljs-number">0.</span>       <span class="hljs-number">0.</span>       <span class="hljs-number">1.</span>      ]
Input tensor shape <span class="hljs-keyword">for</span> prediction: torch.Size([<span class="hljs-number">1</span>, <span class="hljs-number">56</span>, <span class="hljs-number">5</span>])
Future date: <span class="hljs-number">2021</span><span class="hljs-number">-12</span><span class="hljs-number">-01</span> (day type=W)
Predicted (millions): rail=<span class="hljs-number">0.481875</span>, bus=<span class="hljs-number">0.560757</span>
Predicted (riders):   rail=<span class="hljs-number">481875</span>, bus=<span class="hljs-number">560757</span>
</code></pre>
]]></content:encoded></item><item><title><![CDATA[Implementing a ResNet-34 CNN Using PyTorch]]></title><description><![CDATA[A while ago, I authored an article Implementing ResNet CNN that provided a detailed explanation of ResNet Convolutional Neural Networks (CNN) along with an implementation using TensorFlow. In this upcoming article, we will take a closer look at ResNe...]]></description><link>https://path2ml.com/implementing-a-resnet-34-cnn-using-pytorch</link><guid isPermaLink="true">https://path2ml.com/implementing-a-resnet-34-cnn-using-pytorch</guid><category><![CDATA[Deep Learning]]></category><category><![CDATA[DeepLearning]]></category><category><![CDATA[pytorch]]></category><category><![CDATA[CNN]]></category><category><![CDATA[CNNs (Convolutional Neural Networks)]]></category><dc:creator><![CDATA[Nitin Sharma]]></dc:creator><pubDate>Thu, 29 Jan 2026 02:41:24 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1769651820908/f89ef455-104f-4fd0-abca-4c7506cbd2a6.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>A while ago, I authored an article <a target="_blank" href="https://path2ml.com/implementing-resnet-cnn"><strong>Implementing ResNet CNN</strong></a> that provided a detailed explanation of <strong>ResNet Convolutional Neural Networks (CNN)</strong> along with an implementation using <strong>TensorFlow</strong>. In this article, we take a closer look at <strong>ResNet34</strong>, a specific variant of the <strong>ResNet</strong> architecture, and implement it using <strong>PyTorch</strong>. This lets us explore PyTorch's features while leveraging <strong>ResNet34</strong> for a range of deep learning tasks.</p>
<h2 id="heading-resnet34-architecture">ResNet34 Architecture</h2>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1769652364117/dfc28772-4dbf-470b-85aa-72b7a4468d06.png" alt class="image--center mx-auto" /></p>
<p>The ResNet34 class constructs a complete ResNet-34 network consisting of 34 layers. Here’s a breakdown of its structure:</p>
<h3 id="heading-1-stem-initial-layers"><strong>1. Stem (Initial Layers):</strong></h3>
<ul>
<li><p><strong>Conv2d:</strong> Converts 3 input channels to 64 filters with a 7×7 kernel and a stride of 2, which downsamples the image by a factor of 2.</p>
</li>
<li><p><strong>BatchNorm2d + ReLU:</strong> Applies batch normalization followed by the ReLU activation function.</p>
</li>
<li><p><strong>MaxPool2d:</strong> Further downsamples the image with a stride of 2.</p>
</li>
</ul>
<h3 id="heading-2-residual-blocks-core"><strong>2. Residual Blocks (Core):</strong></h3>
<p>The network comprises ResidualUnits grouped into four sections:</p>
<ul>
<li><p><strong>Stage 1:</strong> 3 units with 64 filters (stride = 1, maintaining spatial dimensions).</p>
</li>
<li><p><strong>Stage 2:</strong> 4 units with 128 filters (the first unit has a stride of 2 for downsampling, while the remaining units have a stride of 1).</p>
</li>
<li><p><strong>Stage 3:</strong> 6 units with 256 filters (the first unit has a stride of 2, while the rest have a stride of 1).</p>
</li>
<li><p><strong>Stage 4:</strong> 3 units with 512 filters (the first unit has a stride of 2, while the rest have a stride of 1).</p>
</li>
</ul>
<p>In total, there are 3 + 4 + 6 + 3 = <strong>16 residual blocks</strong>, giving 32 convolutional layers; together with the initial 7×7 convolution and the final fully connected layer, that makes <strong>34 weighted layers</strong> overall.</p>
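<p>The layer count is easy to verify with a couple of lines (a sanity check, not part of the model code):</p>

```python
units_per_stage = [3, 4, 6, 3]             # residual units per stage
residual_convs = sum(units_per_stage) * 2  # two 3x3 convs per unit -> 32
total_layers = residual_convs + 1 + 1      # + initial 7x7 conv + final FC
print(sum(units_per_stage), residual_convs, total_layers)  # 16 32 34
```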
<p><strong>Stride Logic:</strong></p>
<ul>
<li><p>A stride of 2 is used when the number of filters changes, which reduces spatial resolution and increases the number of channels.</p>
</li>
<li><p>A stride of 1 is maintained when the number of filters remains the same.</p>
</li>
</ul>
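<p>The stride logic can be sketched on its own before we see the full model: stride 2 exactly when the filter count changes (variable names here are illustrative):</p>

```python
# Build the (filters, stride) plan for the 16 residual units
prev_filters = 64
plan = []
for filters in [64] * 3 + [128] * 4 + [256] * 6 + [512] * 3:
    plan.append((filters, 1 if filters == prev_filters else 2))
    prev_filters = filters
print(plan[:5])  # [(64, 1), (64, 1), (64, 1), (128, 2), (128, 1)]
```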
<h3 id="heading-3-classification-head"><strong>3. Classification Head:</strong></h3>
<ul>
<li><p><strong>AdaptiveAvgPool2d:</strong> Performs global average pooling, resulting in an output shape of (batch_size, 512, 1, 1).</p>
</li>
<li><p><strong>Flatten:</strong> Converts the output to a shape of (batch_size, 512).</p>
</li>
<li><p><strong>LazyLinear:</strong> Maps the flattened output from 512 to 10 classes.</p>
</li>
</ul>
<p><strong>Key Design Points:</strong></p>
<ul>
<li><p>Progressively reduces spatial dimensions (56 → 28 → 14 → 7) while increasing channels</p>
</li>
<li><p>Each stage transition uses stride=2 to halve dimensions</p>
</li>
<li><p>Skip connections allow gradients to flow through all 34 layers</p>
</li>
<li><p>Total parameters: ~21.3 million (with the 10-class head used here)</p>
</li>
</ul>
<p>Let's implement this in PyTorch.</p>
<h3 id="heading-import-the-packages">Import the packages</h3>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> numpy <span class="hljs-keyword">as</span> np
<span class="hljs-keyword">import</span> torch
<span class="hljs-keyword">from</span> sklearn.datasets <span class="hljs-keyword">import</span> load_sample_images
<span class="hljs-keyword">import</span> matplotlib.pyplot <span class="hljs-keyword">as</span> plt
<span class="hljs-keyword">import</span> torchvision
<span class="hljs-keyword">import</span> torch.nn <span class="hljs-keyword">as</span> nn
<span class="hljs-keyword">import</span> torchvision.transforms.v2 <span class="hljs-keyword">as</span> T
<span class="hljs-keyword">from</span> functools <span class="hljs-keyword">import</span> partial
<span class="hljs-keyword">import</span> torchmetrics
<span class="hljs-keyword">import</span> torch.nn.functional <span class="hljs-keyword">as</span> F
</code></pre>
<h3 id="heading-residualunit">ResidualUnit</h3>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1769653150484/4624e421-22be-46ea-8093-cc8d077abb48.png" alt class="image--center mx-auto" /></p>
<pre><code class="lang-python"><span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">ResidualUnit</span>(<span class="hljs-params">nn.Module</span>):</span>
    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">__init__</span>(<span class="hljs-params">self, in_channels, out_channels, stride=<span class="hljs-number">1</span></span>):</span>
        super().__init__()
        DefaultConv2d = partial(
            nn.Conv2d, kernel_size=<span class="hljs-number">3</span>, stride=<span class="hljs-number">1</span>, padding=<span class="hljs-number">1</span>, bias=<span class="hljs-literal">False</span>)
        self.main_layers = nn.Sequential(
            DefaultConv2d(in_channels, out_channels, stride=stride),
            nn.BatchNorm2d(out_channels),
            nn.ReLU(),
            DefaultConv2d(out_channels, out_channels),
            nn.BatchNorm2d(out_channels),
        )
        <span class="hljs-keyword">if</span> stride &gt; <span class="hljs-number">1</span>:
            self.skip_connection = nn.Sequential(
                DefaultConv2d(in_channels, out_channels, kernel_size=<span class="hljs-number">1</span>,
                              stride=stride, padding=<span class="hljs-number">0</span>),
                nn.BatchNorm2d(out_channels),
            )
        <span class="hljs-keyword">else</span>:
            self.skip_connection = nn.Identity()

    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">forward</span>(<span class="hljs-params">self, inputs</span>):</span>
        <span class="hljs-keyword">return</span> F.relu(self.main_layers(inputs) + self.skip_connection(inputs))
</code></pre>
<p>The <strong>ResidualUnit</strong> class implements a <strong>residual block</strong>, the core building block of ResNet (Residual Networks). Here's the breakdown:</p>
<p><strong>Key Components</strong></p>
<p><strong>1. Main Path</strong></p>
<ul>
<li><p>Two convolutional blocks in sequence:</p>
<ul>
<li>Conv2d → BatchNorm2d → ReLU → Conv2d → BatchNorm2d</li>
</ul>
</li>
<li><p>The first Conv2d uses the stride parameter (for downsampling if needed)</p>
</li>
<li><p>The second Conv2d always uses stride=1</p>
</li>
</ul>
<p><strong>2. Skip Connection</strong></p>
<ul>
<li><p><strong>If</strong> stride <code>&gt; 1:</code> Creates a 1×1 convolution with the specified stride + batch norm (adjusts dimensions and spatial resolution)</p>
</li>
<li><p><strong>If</strong> stride <code>= 1:</code> Uses nn.Identity() (passes input unchanged)</p>
</li>
<li><p>This ensures the skip connection has the same dimensions as the main path output</p>
</li>
</ul>
<p><strong>3. Forward Pass</strong></p>
<ul>
<li><p>Adds the output of <code>main_layers</code> and <code>skip_connection</code></p>
</li>
<li><p>Applies ReLU activation to the sum</p>
</li>
</ul>
<p>The key innovation is <strong>addition of the skip connection</strong> to the main path. This allows:</p>
<ul>
<li><p>Gradients to bypass layers during backpropagation (easier training)</p>
</li>
<li><p>The network to learn residual mappings (differences) rather than full transformations</p>
</li>
<li><p>Training of very deep networks without degradation</p>
</li>
</ul>
<h2 id="heading-resnet34">ResNet34</h2>
<p>The <strong>ResNet34</strong> class builds the complete <strong>ResNet34</strong> architecture from the <strong>ResidualUnit</strong> class; the architecture diagram is shown above.</p>
<pre><code class="lang-python"><span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">ResNet34</span>(<span class="hljs-params">nn.Module</span>):</span>
    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">__init__</span>(<span class="hljs-params">self</span>):</span>
        super().__init__()
        layers = [
            nn.Conv2d(in_channels=<span class="hljs-number">3</span>, out_channels=<span class="hljs-number">64</span>, kernel_size=<span class="hljs-number">7</span>, stride=<span class="hljs-number">2</span>,
                      padding=<span class="hljs-number">3</span>, bias=<span class="hljs-literal">False</span>),
            nn.BatchNorm2d(num_features=<span class="hljs-number">64</span>),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=<span class="hljs-number">3</span>, stride=<span class="hljs-number">2</span>, padding=<span class="hljs-number">1</span>),
        ]
        prev_filters = <span class="hljs-number">64</span>
        <span class="hljs-keyword">for</span> filters <span class="hljs-keyword">in</span> [<span class="hljs-number">64</span>] * <span class="hljs-number">3</span> + [<span class="hljs-number">128</span>] * <span class="hljs-number">4</span> + [<span class="hljs-number">256</span>] * <span class="hljs-number">6</span> + [<span class="hljs-number">512</span>] * <span class="hljs-number">3</span>:
            stride = <span class="hljs-number">1</span> <span class="hljs-keyword">if</span> filters == prev_filters <span class="hljs-keyword">else</span> <span class="hljs-number">2</span>
            layers.append(ResidualUnit(prev_filters, filters, stride=stride))
            prev_filters = filters
        layers += [
            nn.AdaptiveAvgPool2d(output_size=<span class="hljs-number">1</span>),
            nn.Flatten(),
            nn.LazyLinear(<span class="hljs-number">10</span>),
        ]
        self.resnet = nn.Sequential(*layers)

    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">forward</span>(<span class="hljs-params">self, inputs</span>):</span>
        <span class="hljs-keyword">return</span> self.resnet(inputs)
</code></pre>
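<p>A quick sanity check on output shape and parameter count (restating the two classes above so this cell runs standalone). Note that <code>nn.LazyLinear</code> has no parameters until it sees data, so we run one forward pass before counting:</p>

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from functools import partial

class ResidualUnit(nn.Module):  # as defined above
    def __init__(self, in_channels, out_channels, stride=1):
        super().__init__()
        DefaultConv2d = partial(nn.Conv2d, kernel_size=3, stride=1, padding=1, bias=False)
        self.main_layers = nn.Sequential(
            DefaultConv2d(in_channels, out_channels, stride=stride),
            nn.BatchNorm2d(out_channels), nn.ReLU(),
            DefaultConv2d(out_channels, out_channels), nn.BatchNorm2d(out_channels))
        self.skip_connection = nn.Sequential(
            DefaultConv2d(in_channels, out_channels, kernel_size=1, stride=stride, padding=0),
            nn.BatchNorm2d(out_channels)) if stride > 1 else nn.Identity()

    def forward(self, x):
        return F.relu(self.main_layers(x) + self.skip_connection(x))

class ResNet34(nn.Module):  # as defined above
    def __init__(self):
        super().__init__()
        layers = [nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3, bias=False),
                  nn.BatchNorm2d(64), nn.ReLU(),
                  nn.MaxPool2d(kernel_size=3, stride=2, padding=1)]
        prev_filters = 64
        for filters in [64] * 3 + [128] * 4 + [256] * 6 + [512] * 3:
            stride = 1 if filters == prev_filters else 2
            layers.append(ResidualUnit(prev_filters, filters, stride=stride))
            prev_filters = filters
        layers += [nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.LazyLinear(10)]
        self.resnet = nn.Sequential(*layers)

    def forward(self, x):
        return self.resnet(x)

model = ResNet34().eval()
with torch.no_grad():
    y = model(torch.randn(1, 3, 224, 224))   # materializes the LazyLinear layer
n_params = sum(p.numel() for p in model.parameters())
print(y.shape, f"{n_params:,}")  # torch.Size([1, 10]) 21,289,802
```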
<h3 id="heading-loading-the-cifar-10-dataset">Loading the CIFAR-10 dataset</h3>
<pre><code class="lang-python"><span class="hljs-comment"># Load CIFAR-10 Dataset</span>
transform = T.Compose([
    T.ToImage(),
    T.ToDtype(torch.float32, scale=<span class="hljs-literal">True</span>),
    T.Normalize(mean=[<span class="hljs-number">0.485</span>, <span class="hljs-number">0.456</span>, <span class="hljs-number">0.406</span>], std=[<span class="hljs-number">0.229</span>, <span class="hljs-number">0.224</span>, <span class="hljs-number">0.225</span>])
])

<span class="hljs-comment"># Load full training set (download if not already present)</span>
train_valid_dataset = torchvision.datasets.CIFAR10(root=<span class="hljs-string">'./datasets'</span>, train=<span class="hljs-literal">True</span>, 
                                            download=<span class="hljs-literal">True</span>, transform=transform)
<span class="hljs-comment"># Load the held-out test set</span>
test_dataset = torchvision.datasets.CIFAR10(root=<span class="hljs-string">'./datasets'</span>, train=<span class="hljs-literal">False</span>, 
                                            download=<span class="hljs-literal">True</span>, transform=transform)

torch.manual_seed(<span class="hljs-number">42</span>)
train_dataset, valid_dataset = torch.utils.data.random_split(
    train_valid_dataset, [<span class="hljs-number">45</span>_000, <span class="hljs-number">5</span>_000]
)

<span class="hljs-comment"># Create Data Loaders</span>
batch_size = <span class="hljs-number">128</span>
train_loader = torch.utils.data.DataLoader(
    dataset=train_dataset, batch_size=batch_size, shuffle=<span class="hljs-literal">True</span>)
valid_loader = torch.utils.data.DataLoader(
    dataset=valid_dataset, batch_size=batch_size, shuffle=<span class="hljs-literal">False</span>
)
test_loader = torch.utils.data.DataLoader(
    dataset=test_dataset, batch_size=batch_size, shuffle=<span class="hljs-literal">False</span>
)

print(<span class="hljs-string">f"Training samples: <span class="hljs-subst">{len(train_dataset)}</span>"</span>)
print(<span class="hljs-string">f"Validation samples: <span class="hljs-subst">{len(valid_dataset)}</span>"</span>)
print(<span class="hljs-string">f"Testing  samples: <span class="hljs-subst">{len(test_dataset)}</span>"</span>)
</code></pre>
<p>The code below sets up the training environment for the <strong>ResNet34</strong> model. It:</p>
<ul>
<li><p>Selects the compute device (GPU if available, otherwise CPU)</p>
</li>
<li><p>Creates a new ResNet34 instance and moves it to that device</p>
</li>
<li><p>Defines the loss function for multi-class classification and the Adam optimizer</p>
</li>
</ul>
<pre><code class="lang-python"><span class="hljs-comment"># Setup for Training</span>
device = torch.device(<span class="hljs-string">"cuda"</span> <span class="hljs-keyword">if</span> torch.cuda.is_available() <span class="hljs-keyword">else</span> <span class="hljs-string">"mps"</span> <span class="hljs-keyword">if</span> torch.backends.mps.is_available() <span class="hljs-keyword">else</span> <span class="hljs-string">"cpu"</span>)
print(<span class="hljs-string">f"Using device: <span class="hljs-subst">{device}</span>"</span>)

<span class="hljs-comment"># Initialize model</span>
model = ResNet34().to(device)

<span class="hljs-comment"># Loss function and optimizer</span>
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=<span class="hljs-number">0.001</span>)
</code></pre>
<p>The <code>train_epoch</code> function trains the model for one complete pass through the training dataset and returns average loss and accuracy for the entire epoch</p>
<h3 id="heading-step-by-step-breakdown"><strong>Step-by-Step Breakdown</strong></h3>
<p><strong>1.</strong> Set the model to training mode</p>
<p><strong>2.</strong> Initialize tracking variables</p>
<p><strong>3.</strong> Loop through each batch</p>
<p><strong>4.</strong> Forward pass: compute outputs and the loss</p>
<p><strong>5.</strong> Backward pass: compute gradients, then step the optimizer</p>
<ul>
<li><p>Updates all model weights using the computed gradients</p>
</li>
<li><p>Moves weights in the direction that reduces the loss</p>
</li>
</ul>
<p><strong>6.</strong> Track metrics: accumulate loss and accuracy per batch</p>
<pre><code class="lang-python"><span class="hljs-comment"># Training function</span>
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">train_epoch</span>(<span class="hljs-params">model, train_loader, criterion, optimizer, device</span>):</span>
    model.train()
    total_loss = <span class="hljs-number">0</span>
    correct = <span class="hljs-number">0</span>
    total = <span class="hljs-number">0</span>

    <span class="hljs-keyword">for</span> images, labels <span class="hljs-keyword">in</span> train_loader:
        images, labels = images.to(device), labels.to(device)

        <span class="hljs-comment"># Forward pass</span>
        outputs = model(images)
        loss = criterion(outputs, labels)

        <span class="hljs-comment"># Backward pass</span>
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        total_loss += loss.item()
        _, predicted = torch.max(outputs.data, <span class="hljs-number">1</span>)
        total += labels.size(<span class="hljs-number">0</span>)
        correct += (predicted == labels).sum().item()

    avg_loss = total_loss / len(train_loader)
    accuracy = <span class="hljs-number">100</span> * correct / total
    <span class="hljs-keyword">return</span> avg_loss, accuracy
</code></pre>
<h3 id="heading-train-the-model">Train The Model</h3>
<pre><code class="lang-python"><span class="hljs-comment"># Train for 10 epochs</span>
num_epochs = <span class="hljs-number">10</span>
train_losses = []
train_accs = []
valid_losses = []
valid_accs = []

print(<span class="hljs-string">"Starting training for 10 epochs..."</span>)
print(<span class="hljs-string">"="</span> * <span class="hljs-number">80</span>)

<span class="hljs-keyword">for</span> epoch <span class="hljs-keyword">in</span> range(num_epochs):
    <span class="hljs-comment"># Train for one epoch</span>
    train_loss, train_acc = train_epoch(model, train_loader, criterion, optimizer, device)
    train_losses.append(train_loss)
    train_accs.append(train_acc)

    <span class="hljs-comment"># Validate for one epoch</span>
    valid_loss, valid_acc = test_epoch(model, valid_loader, criterion, device)
    valid_losses.append(valid_loss)
    valid_accs.append(valid_acc)

    <span class="hljs-comment"># Print results for each epoch</span>
    print(<span class="hljs-string">f"Epoch [<span class="hljs-subst">{epoch+<span class="hljs-number">1</span>:<span class="hljs-number">2</span>d}</span>/<span class="hljs-subst">{num_epochs}</span>] | "</span>
          <span class="hljs-string">f"Train Loss: <span class="hljs-subst">{train_loss:<span class="hljs-number">.4</span>f}</span> | Train Acc: <span class="hljs-subst">{train_acc:<span class="hljs-number">6.2</span>f}</span>% | "</span>
          <span class="hljs-string">f"Valid Loss: <span class="hljs-subst">{valid_loss:<span class="hljs-number">.4</span>f}</span> | Valid Acc: <span class="hljs-subst">{valid_acc:<span class="hljs-number">6.2</span>f}</span>%"</span>)

print(<span class="hljs-string">"="</span> * <span class="hljs-number">80</span>) 
print(<span class="hljs-string">"Training completed!"</span>)

<span class="hljs-comment"># Save the trained model</span>
torch.save(model.state_dict(), <span class="hljs-string">'./my_resnet34_checkpoint.pt'</span>)
print(<span class="hljs-string">"Model saved to './my_resnet34_checkpoint.pt'"</span>)
</code></pre>
<pre><code class="lang-python">Starting training <span class="hljs-keyword">for</span> <span class="hljs-number">10</span> epochs...
================================================================================
Epoch [ <span class="hljs-number">1</span>/<span class="hljs-number">10</span>] | Train Loss: <span class="hljs-number">0.1853</span> | Train Acc:  <span class="hljs-number">93.58</span>% | Valid Loss: <span class="hljs-number">0.6287</span> | Valid Acc:  <span class="hljs-number">82.22</span>%
Epoch [ <span class="hljs-number">2</span>/<span class="hljs-number">10</span>] | Train Loss: <span class="hljs-number">0.1498</span> | Train Acc:  <span class="hljs-number">94.80</span>% | Valid Loss: <span class="hljs-number">0.7269</span> | Valid Acc:  <span class="hljs-number">80.48</span>%
Epoch [ <span class="hljs-number">3</span>/<span class="hljs-number">10</span>] | Train Loss: <span class="hljs-number">0.1282</span> | Train Acc:  <span class="hljs-number">95.52</span>% | Valid Loss: <span class="hljs-number">0.7559</span> | Valid Acc:  <span class="hljs-number">80.24</span>%
Epoch [ <span class="hljs-number">4</span>/<span class="hljs-number">10</span>] | Train Loss: <span class="hljs-number">0.0961</span> | Train Acc:  <span class="hljs-number">96.71</span>% | Valid Loss: <span class="hljs-number">0.8131</span> | Valid Acc:  <span class="hljs-number">80.16</span>%
Epoch [ <span class="hljs-number">5</span>/<span class="hljs-number">10</span>] | Train Loss: <span class="hljs-number">0.0948</span> | Train Acc:  <span class="hljs-number">96.57</span>% | Valid Loss: <span class="hljs-number">0.8196</span> | Valid Acc:  <span class="hljs-number">80.94</span>%
Epoch [ <span class="hljs-number">6</span>/<span class="hljs-number">10</span>] | Train Loss: <span class="hljs-number">0.0853</span> | Train Acc:  <span class="hljs-number">97.06</span>% | Valid Loss: <span class="hljs-number">0.8924</span> | Valid Acc:  <span class="hljs-number">79.26</span>%
Epoch [ <span class="hljs-number">7</span>/<span class="hljs-number">10</span>] | Train Loss: <span class="hljs-number">0.0755</span> | Train Acc:  <span class="hljs-number">97.44</span>% | Valid Loss: <span class="hljs-number">0.8582</span> | Valid Acc:  <span class="hljs-number">80.14</span>%
Epoch [ <span class="hljs-number">8</span>/<span class="hljs-number">10</span>] | Train Loss: <span class="hljs-number">0.0661</span> | Train Acc:  <span class="hljs-number">97.79</span>% | Valid Loss: <span class="hljs-number">0.9182</span> | Valid Acc:  <span class="hljs-number">80.18</span>%
Epoch [ <span class="hljs-number">9</span>/<span class="hljs-number">10</span>] | Train Loss: <span class="hljs-number">0.0653</span> | Train Acc:  <span class="hljs-number">97.71</span>% | Valid Loss: <span class="hljs-number">0.9218</span> | Valid Acc:  <span class="hljs-number">80.42</span>%
Epoch [<span class="hljs-number">10</span>/<span class="hljs-number">10</span>] | Train Loss: <span class="hljs-number">0.0518</span> | Train Acc:  <span class="hljs-number">98.22</span>% | Valid Loss: <span class="hljs-number">0.9642</span> | Valid Acc:  <span class="hljs-number">79.80</span>%
================================================================================
Training completed!
Model saved to <span class="hljs-string">'./my_resnet34_checkpoint.pt'</span>
</code></pre>
<h3 id="heading-chart-of-losses-and-accuracy-for-training-and-validation-data">Chart of losses and accuracy for training and validation data</h3>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1769654124233/b14bc28a-3c0c-443c-ab52-96709d9b1168.png" alt class="image--center mx-auto" /></p>
<pre><code class="lang-python">======================================================================
TRAINING SUMMARY
======================================================================
Total Epochs Trained: 10

Final Metrics:
Train Loss: 0.0518 | Train Accuracy: 98.22%
Valid Loss: 0.9642 | Valid Accuracy: 79.80%

Best Validation Metrics:
Best Valid Accuracy: 82.22% (Epoch 1)
Best Valid Loss: 0.6287 (Epoch 1)
======================================================================
</code></pre>
<h3 id="heading-evaluating-on-test-data">Evaluating on Test Data</h3>
<pre><code class="lang-python"><span class="hljs-comment"># Evaluate on Test Data</span>
print(<span class="hljs-string">"Evaluating model on test data..."</span>)
test_loss, test_acc = test_epoch(model, test_loader, criterion, device)
print(<span class="hljs-string">f"Test Loss: <span class="hljs-subst">{test_loss:<span class="hljs-number">.4</span>f}</span>"</span>)
print(<span class="hljs-string">f"Test Accuracy: <span class="hljs-subst">{test_acc:<span class="hljs-number">.2</span>f}</span>%"</span>)
</code></pre>
<pre><code class="lang-python">Evaluating model on test data...
Test Loss: <span class="hljs-number">1.2146</span>
Test Accuracy: <span class="hljs-number">76.64</span>%
</code></pre>
]]></content:encoded></item><item><title><![CDATA[Deep learning using Pytorch on Images dataset]]></title><description><![CDATA[The CIFAR-10 dataset is a widely used collection of images in the field of machine learning. It consists of 60,000 32x32 color images categorized into 10 different classes, with each class containing 6,000 images. These classes include airplanes, aut...]]></description><link>https://path2ml.com/deep-learning-using-pytorch-on-images-dataset</link><guid isPermaLink="true">https://path2ml.com/deep-learning-using-pytorch-on-images-dataset</guid><category><![CDATA[Deep Learning]]></category><category><![CDATA[DeepLearning]]></category><category><![CDATA[pytorch]]></category><dc:creator><![CDATA[Nitin Sharma]]></dc:creator><pubDate>Wed, 28 Jan 2026 01:18:53 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1769562920414/8dec8c2e-bb22-4e32-9068-efa68886ee53.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>The <strong>CIFAR-10</strong> dataset is a widely used collection of images in the field of machine learning. It consists of 60,000 32x32 color images categorized into <strong>10</strong> different classes, with each class containing 6,000 images. These classes are airplanes, automobiles, birds, cats, deer, dogs, frogs, horses, ships, and trucks.</p>
<p>In the PyTorch vision library, the CIFAR-10 dataset can be easily accessed and utilized. PyTorch provides a convenient way to download the dataset and offers various transformations to preprocess the images for training and testing machine learning models. This makes it a popular choice for training and evaluating image classification algorithms.</p>
<p>In this blog, we will train a deep neural network on the <strong>CIFAR-10</strong> image dataset.</p>
<p>I have utilized Jupyter Notebook, which is installed on my <strong>Mac</strong>, to run my code. However, an excellent alternative is Google Colab, which allows for seamless execution of the code presented in this blog.</p>
<p>In order to work with the code effectively, there are specific packages that I need to install within my virtual environment. These packages include:</p>
<ul>
<li><p><strong>PyTorch</strong>: A powerful deep learning library that provides flexibility and ease of use for building and training neural networks.</p>
</li>
<li><p><strong>TorchMetrics</strong>: A library that offers a wide range of metrics specifically designed for evaluating the performance of machine learning models.</p>
</li>
<li><p><strong>TorchVision</strong>: A package that simplifies the process of loading and preprocessing datasets, including the popular CIFAR-10 dataset, which contains a variety of images used for training machine learning models.</p>
</li>
</ul>
<p>By setting up these packages, I can leverage the capabilities of deep learning and effectively work with image data.</p>
<h2 id="heading-loading-packages-and-cifar-10-dataset">Loading Packages and CIFAR-10 DataSet</h2>
<p>Let's load the packages we need:</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> time
<span class="hljs-keyword">import</span> torch
<span class="hljs-keyword">import</span> torchmetrics
<span class="hljs-keyword">import</span> torchvision
<span class="hljs-keyword">import</span> torch.nn <span class="hljs-keyword">as</span> nn
<span class="hljs-keyword">import</span> torchvision.transforms.v2 <span class="hljs-keyword">as</span> T
<span class="hljs-keyword">import</span> matplotlib.pyplot <span class="hljs-keyword">as</span> plt
</code></pre>
<p>Next, we detect and select the best available device for PyTorch computations:</p>
<pre><code class="lang-python"><span class="hljs-keyword">if</span> torch.cuda.is_available():
    device = <span class="hljs-string">"cuda"</span>
<span class="hljs-keyword">elif</span> torch.backends.mps.is_available():
    device = <span class="hljs-string">"mps"</span>
<span class="hljs-keyword">else</span>:
    device = <span class="hljs-string">"cpu"</span>
</code></pre>
<p>Using TorchVision, we will load the dataset and split it into training, validation, and test sets.</p>
<p>We create a preprocessing <strong>pipeline</strong> that transforms images as they're loaded:</p>
<ol>
<li><p><strong>T.Compose([...])</strong> - Chains multiple transforms together in sequence. Each transform is applied in order.</p>
</li>
<li><p><strong>T.ToImage()</strong> - Converts images from various formats (PIL images, NumPy arrays, raw tensors) into TorchVision's Image class (a specialized tensor type). Standardizes the image format.</p>
</li>
<li><p><strong>T.ToDtype(torch.float32, scale=True)</strong> - Converts image values to 32-bit floats and scales them:</p>
<ul>
<li><p><strong>torch.float32</strong> - Sets data type to float32</p>
</li>
<li><p><strong>scale=True</strong> - Normalizes pixel values from their original range (0-255 for typical images) to <strong>0.0-1.0</strong></p>
</li>
</ul>
</li>
</ol>
<pre><code class="lang-python">toTensor = T.Compose([T.ToImage(), T.ToDtype(torch.float32, scale=<span class="hljs-literal">True</span>)])
train_and_valid_set = torchvision.datasets.CIFAR10(
    root=<span class="hljs-string">"datasets"</span>, train=<span class="hljs-literal">True</span>, download=<span class="hljs-literal">True</span>, transform=toTensor
)
test_set= torchvision.datasets.CIFAR10(
    root=<span class="hljs-string">"datasets"</span>, train=<span class="hljs-literal">False</span>, download=<span class="hljs-literal">True</span>, transform=toTensor
)
</code></pre>
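<p>To see what <strong>scale=True</strong> does numerically, here is a minimal pure-Python sketch of the rescaling step (the <code>scale_pixel</code> helper is illustrative, not part of TorchVision):</p>

```python
# Sketch of the scale=True step: map 8-bit pixel values (0-255) to [0.0, 1.0].
# scale_pixel is an illustrative helper, not a torchvision function.
def scale_pixel(v: int) -> float:
    return v / 255.0

pixels = [0, 64, 128, 255]
print([scale_pixel(p) for p in pixels])  # [0.0, ~0.251, ~0.502, 1.0]
```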
<ul>
<li><p>Randomly splits the 50,000 CIFAR-10 training images into two subsets:</p>
<ul>
<li><p><strong>train_set</strong> → 45,000 images (for training)</p>
</li>
<li><p><strong>valid_set</strong> → 5,000 images (for validation/testing model performance during training)</p>
</li>
</ul>
</li>
</ul>
<pre><code class="lang-python">torch.manual_seed(<span class="hljs-number">42</span>)
train_set, valid_set = torch.utils.data.random_split(
    train_and_valid_set, [<span class="hljs-number">45</span>_000, <span class="hljs-number">5</span>_000]
)
</code></pre>
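<p>Conceptually, <code>random_split</code> shuffles the 50,000 indices with a seeded generator and carves them into two disjoint subsets. A stdlib-only sketch of the same idea (using <code>random.Random</code> purely for illustration; <code>random_split</code> uses torch's own generator):</p>

```python
import random

# Illustrative stand-in for torch.utils.data.random_split: shuffle the
# indices with a fixed seed, then slice into disjoint 45k/5k subsets.
indices = list(range(50_000))
rng = random.Random(42)  # fixed seed -> reproducible split
rng.shuffle(indices)
train_idx, valid_idx = indices[:45_000], indices[45_000:]
print(len(train_idx), len(valid_idx))  # 45000 5000
```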
<p>Now load the datasets into PyTorch’s DataLoader objects:</p>
<pre><code class="lang-python">batch_size = <span class="hljs-number">128</span>
train_loader = torch.utils.data.DataLoader(
    dataset=train_set, batch_size=batch_size, shuffle=<span class="hljs-literal">True</span>)
valid_loader = torch.utils.data.DataLoader(
    dataset=valid_set, batch_size=batch_size, shuffle=<span class="hljs-literal">False</span>
)
test_loader = torch.utils.data.DataLoader(
    dataset=test_set, batch_size=batch_size, shuffle=<span class="hljs-literal">False</span>
)
</code></pre>
<h2 id="heading-build-the-modeldeep-neural-network">Build the Model (Deep Neural Network)</h2>
<p>We will build a deep neural network (<strong>DNN</strong>) with <strong>20</strong> hidden layers, each containing <strong>100</strong> neurons. We will use He initialization for the weights and the Swish activation function (implemented as nn.SiLU). Since this is a classification task, the output layer will have one neuron for each class.</p>
<h3 id="heading-he-initialization"><strong>He Initialization</strong></h3>
<p>We will use <strong>He Initialization</strong> (also known as <strong>Kaiming Initialization</strong>) which is a technique for initializing weights in neural networks that use <strong>ReLU</strong> (Rectified Linear Unit) or <strong>Swish/SiLU</strong> activation functions.</p>
<h3 id="heading-the-problem-it-solves">The Problem It Solves:</h3>
<p>When training deep networks, weights that are initialized randomly can lead to several issues:</p>
<ul>
<li><p><strong>Vanishing Gradients</strong>: Gradients can become very small, causing learning to slow down significantly.</p>
</li>
<li><p><strong>Exploding Gradients</strong>: Gradients can grow too large, resulting in unstable training.</p>
</li>
</ul>
<h3 id="heading-how-it-works">How It Works:</h3>
<p>He initialization scales the weights based on the number of input neurons to the layer:</p>
<p>\(w \sim U\left(-\sqrt{\frac{6}{n_{\text{in}}}}, \sqrt{\frac{6}{n_{\text{in}}}}\right)\)</p>
<p>where \(n_{\text{in}}\) is the number of input neurons for that layer.</p>
<p>Layers with more input neurons will have smaller weight magnitudes. This approach helps maintain consistent signal variance throughout the network.</p>
<h3 id="heading-why-it-matters">Why It Matters:</h3>
<ul>
<li><p><strong>Faster Convergence</strong>: The network trains more efficiently.</p>
</li>
<li><p><strong>Better Performance for Deeper Networks</strong>: It helps prevent gradient-related issues in very deep architectures.</p>
</li>
<li><p><strong>Optimized for ReLU/SiLU</strong>: Specifically tuned for these activation functions.</p>
</li>
</ul>
<p>Without <strong>He initialization</strong>, training a 20-layer network like ours would be challenging. With it, deep networks can learn significantly faster and more stably.</p>
<pre><code class="lang-python"><span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">use_he_init</span>(<span class="hljs-params">module</span>):</span>
    <span class="hljs-keyword">if</span> isinstance(module, torch.nn.Linear):
        torch.nn.init.kaiming_uniform_(module.weight)
        torch.nn.init.zeros_(module.bias)
</code></pre>
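<p>The bound follows the formula above: weights are drawn from \(U(-b, b)\) with \(b = \sqrt{6 / n_{\text{in}}}\). A quick sketch of how the bound shrinks as fan-in grows (<code>he_uniform_bound</code> is an illustrative helper, not a PyTorch API):</p>

```python
import math

# He-uniform bound b = sqrt(6 / n_in): layers with more inputs get
# smaller initial weights, keeping signal variance roughly constant.
def he_uniform_bound(n_in: int) -> float:
    return math.sqrt(6 / n_in)

print(he_uniform_bound(3 * 32 * 32))  # input layer, 3072 inputs  -> ~0.0442
print(he_uniform_bound(100))          # hidden layers, 100 inputs -> ~0.2449
```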
<p>Now we build a deep neural network with <strong>20</strong> hidden layers, each containing <strong>100</strong> neurons, using the <strong>SiLU</strong> activation function:</p>
<pre><code class="lang-python"><span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">build_deep_model</span>(<span class="hljs-params">n_hidden, n_neurons, n_inputs, n_outputs</span>):</span>
    layers = [nn.Flatten(), nn.Linear(n_inputs, n_neurons), nn.SiLU()]
    <span class="hljs-keyword">for</span> _ <span class="hljs-keyword">in</span> range(n_hidden - <span class="hljs-number">1</span>):
        layers += [nn.Linear(n_neurons, n_neurons), nn.SiLU()]

    layers += [nn.Linear(n_neurons, n_outputs)]
    model = torch.nn.Sequential(*layers)
    model.apply(use_he_init)
    <span class="hljs-keyword">return</span> model
</code></pre>
<h3 id="heading-activation-function-silu"><strong>Activation Function SiLU</strong></h3>
<p><strong>SiLU (Sigmoid Linear Unit)</strong>, also known as <strong>Swish</strong>, is a smooth activation function that has gained popularity in modern deep learning.</p>
<p><strong>Mathematical Definition:</strong></p>
<p>\(\mathbf{  \text{SiLU}(x) = x \cdot \sigma(x) = x \cdot \frac{1}{1 + e^{-x}} }\)</p>
<p>Here, \(\mathbf{ \sigma(x) }\) represents the sigmoid function.</p>
<p><strong>How It Works:</strong></p>
<ul>
<li><p>The function multiplies the input \(x\) by its sigmoid value, which lies between 0 and 1.</p>
</li>
<li><p>When \(x\) is negative, the sigmoid value approaches 0, resulting in a small output.</p>
</li>
<li><p>When \(x\) is positive, the sigmoid value approaches 1, making the output close to \(x\).</p>
</li>
<li><p>This creates a smooth, non-linear curve.</p>
</li>
</ul>
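<p>The formula is easy to check numerically; here is a pure-Python version of \(\text{SiLU}(x) = x \cdot \sigma(x)\), equivalent in spirit to what <code>nn.SiLU</code> computes on tensors:</p>

```python
import math

# SiLU(x) = x * sigmoid(x), written out with the stdlib for illustration.
def silu(x: float) -> float:
    return x * (1.0 / (1.0 + math.exp(-x)))

print(silu(0.0))   # 0.0     (x = 0, so the product is 0)
print(silu(5.0))   # ~4.967  (sigmoid -> 1, output approaches x)
print(silu(-5.0))  # ~-0.033 (sigmoid -> 0, output approaches 0)
```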
<p><strong>Key Advantages:</strong></p>
<ul>
<li><p><strong>Smoothness</strong>: Unlike ReLU, which has a sharp corner at 0, SiLU is smooth everywhere.</p>
</li>
<li><p><strong>Self-gating</strong>: The sigmoid component acts as a "gate," determining which activations can pass through.</p>
</li>
<li><p><strong>Better Gradient Flow</strong>: It helps prevent issues related to vanishing or exploding gradients in deep networks.</p>
</li>
<li><p><strong>Compatibility with He Initialization</strong>: SiLU is designed to work effectively with He weight initialization.</p>
</li>
</ul>
<p><strong>Comparison to ReLU:</strong></p>
<ul>
<li><p><strong>ReLU</strong>: Returns \(\max(0, x)\) – it is fast but has a sharp transition at 0.</p>
</li>
<li><p><strong>SiLU</strong>: Returns \(x \cdot \sigma(x)\) – it is smoother and more expressive.</p>
</li>
</ul>
<p>The chart below shows the <strong>Sigmoid</strong>, <strong>SiLU (Swish)</strong>, and <strong>ReLU</strong> activation functions:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1769553680246/f976cdc7-a9d5-4e17-b7bd-0a8281a9ee0b.png" alt class="image--center mx-auto" /></p>
<p>Now we build the model by calling our <strong>build_deep_model</strong> function:</p>
<pre><code class="lang-python">torch.manual_seed(<span class="hljs-number">42</span>)
<span class="hljs-comment"># build the model   </span>
model = build_deep_model(n_hidden=<span class="hljs-number">20</span>, n_neurons=<span class="hljs-number">100</span>, n_inputs=<span class="hljs-number">3</span> * <span class="hljs-number">32</span> * <span class="hljs-number">32</span>, n_outputs=<span class="hljs-number">10</span>)
model.to(device)
</code></pre>
<pre><code class="lang-python">Sequential(
  (<span class="hljs-number">0</span>): Flatten(start_dim=<span class="hljs-number">1</span>, end_dim=<span class="hljs-number">-1</span>)
  (<span class="hljs-number">1</span>): Linear(in_features=<span class="hljs-number">3072</span>, out_features=<span class="hljs-number">100</span>, bias=<span class="hljs-literal">True</span>)
  (<span class="hljs-number">2</span>): SiLU()
  (<span class="hljs-number">3</span>): Linear(in_features=<span class="hljs-number">100</span>, out_features=<span class="hljs-number">100</span>, bias=<span class="hljs-literal">True</span>)
  (<span class="hljs-number">4</span>): SiLU()
  (<span class="hljs-number">5</span>): Linear(in_features=<span class="hljs-number">100</span>, out_features=<span class="hljs-number">100</span>, bias=<span class="hljs-literal">True</span>)
  (<span class="hljs-number">6</span>): SiLU()
  (<span class="hljs-number">7</span>): Linear(in_features=<span class="hljs-number">100</span>, out_features=<span class="hljs-number">100</span>, bias=<span class="hljs-literal">True</span>)
  (<span class="hljs-number">8</span>): SiLU()
  (<span class="hljs-number">9</span>): Linear(in_features=<span class="hljs-number">100</span>, out_features=<span class="hljs-number">100</span>, bias=<span class="hljs-literal">True</span>)
  (<span class="hljs-number">10</span>): SiLU()
  (<span class="hljs-number">11</span>): Linear(in_features=<span class="hljs-number">100</span>, out_features=<span class="hljs-number">100</span>, bias=<span class="hljs-literal">True</span>)
  (<span class="hljs-number">12</span>): SiLU()
  (<span class="hljs-number">13</span>): Linear(in_features=<span class="hljs-number">100</span>, out_features=<span class="hljs-number">100</span>, bias=<span class="hljs-literal">True</span>)
  (<span class="hljs-number">14</span>): SiLU()
  (<span class="hljs-number">15</span>): Linear(in_features=<span class="hljs-number">100</span>, out_features=<span class="hljs-number">100</span>, bias=<span class="hljs-literal">True</span>)
  (<span class="hljs-number">16</span>): SiLU()
  (<span class="hljs-number">17</span>): Linear(in_features=<span class="hljs-number">100</span>, out_features=<span class="hljs-number">100</span>, bias=<span class="hljs-literal">True</span>)
  (<span class="hljs-number">18</span>): SiLU()
  (<span class="hljs-number">19</span>): Linear(in_features=<span class="hljs-number">100</span>, out_features=<span class="hljs-number">100</span>, bias=<span class="hljs-literal">True</span>)
  (<span class="hljs-number">20</span>): SiLU()
  (<span class="hljs-number">21</span>): Linear(in_features=<span class="hljs-number">100</span>, out_features=<span class="hljs-number">100</span>, bias=<span class="hljs-literal">True</span>)
  (<span class="hljs-number">22</span>): SiLU()
  (<span class="hljs-number">23</span>): Linear(in_features=<span class="hljs-number">100</span>, out_features=<span class="hljs-number">100</span>, bias=<span class="hljs-literal">True</span>)
...
  (<span class="hljs-number">38</span>): SiLU()
  (<span class="hljs-number">39</span>): Linear(in_features=<span class="hljs-number">100</span>, out_features=<span class="hljs-number">100</span>, bias=<span class="hljs-literal">True</span>)
  (<span class="hljs-number">40</span>): SiLU()
  (<span class="hljs-number">41</span>): Linear(in_features=<span class="hljs-number">100</span>, out_features=<span class="hljs-number">10</span>, bias=<span class="hljs-literal">True</span>)
)
</code></pre>
<h2 id="heading-train-the-model">Train the model</h2>
<p>Now we will write a function that trains a neural network with early stopping to prevent overfitting. Here's what it does:</p>
<ol>
<li><p><strong>History Tracking:</strong> The function sets up a system to track losses and metrics during training.</p>
</li>
<li><p><strong>Metric Calculation:</strong> It calculates both training and validation metrics at each epoch.</p>
</li>
<li><p><strong>Validation Improvement:</strong></p>
<ul>
<li>If the validation metric improves, the function saves the model weights and resets the patience counter to zero.</li>
</ul>
</li>
<li><p><strong>No Improvement Case:</strong></p>
<ul>
<li><p>If the validation metric does not improve, the function increments the patience counter.</p>
</li>
<li><p>After a specified number of epochs without improvement (the patience threshold), training is stopped early.</p>
</li>
</ul>
</li>
<li><p><strong>Restoring the Best Model:</strong> The function restores the model to the version with the highest validation metric.</p>
</li>
</ol>
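<p>The patience bookkeeping in steps 3–4 can be traced on made-up validation accuracies (this toy <code>early_stop_trace</code> helper is only for illustration; the real training function appears below):</p>

```python
# Toy trace of the early-stopping logic: track the best validation metric,
# reset the patience counter on improvement, stop after `patience` stale epochs.
def early_stop_trace(valid_metrics, patience=3):
    best_metric, best_epoch, counter = 0.0, -1, 0
    for epoch, m in enumerate(valid_metrics):
        if m > best_metric:
            best_metric, best_epoch, counter = m, epoch, 0
        else:
            counter += 1
            if counter >= patience:
                return best_epoch, epoch  # (best epoch, epoch we stopped at)
    return best_epoch, len(valid_metrics) - 1

print(early_stop_trace([0.20, 0.31, 0.30, 0.29, 0.28]))  # (1, 4)
```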
<h3 id="heading-why-this-matters">Why This Matters:</h3>
<ul>
<li><p><strong>Prevents Overfitting:</strong> Early stopping helps to stop training before the model memorizes the training data.</p>
</li>
<li><p><strong>Saves the Best Model:</strong> It retains the checkpoint that exhibits the best validation performance.</p>
</li>
<li><p><strong>Efficiency:</strong> This approach avoids unnecessary training once the model's performance plateaus.</p>
</li>
</ul>
<pre><code class="lang-python"><span class="hljs-comment"># This function evaluates a trained model on a dataset and computes a metric</span>
<span class="hljs-comment"># (like accuracy)</span>
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">evaluate_tm</span>(<span class="hljs-params">model, data_loader, metric</span>):</span>
    model.eval()
    metric.reset()
    <span class="hljs-keyword">with</span> torch.no_grad():
        <span class="hljs-keyword">for</span> X_batch, y_batch <span class="hljs-keyword">in</span> data_loader:
            X_batch, y_batch = X_batch.to(device), y_batch.to(device)
            y_pred = model(X_batch)
            metric.update(y_pred, y_batch)
    <span class="hljs-keyword">return</span> metric.compute()

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">train_with_early_stopping</span>(<span class="hljs-params">model, optimizer, loss_fn, metric, train_loader,
                              valid_loader, n_epochs, patience=<span class="hljs-number">10</span>,
                              checkpoint_path=None, scheduler=None</span>):</span>
    checkpoint_path = checkpoint_path <span class="hljs-keyword">or</span> <span class="hljs-string">"my_checkpoint.pt"</span>
    history = {<span class="hljs-string">"train_losses"</span>: [], <span class="hljs-string">"train_metrics"</span>: [], <span class="hljs-string">"valid_metrics"</span>: []}
    best_metric = <span class="hljs-number">0.0</span>
    patience_counter = <span class="hljs-number">0</span>
    <span class="hljs-keyword">for</span> epoch <span class="hljs-keyword">in</span> range(n_epochs):
        total_loss = <span class="hljs-number">0.0</span>
        metric.reset()
        model.train()
        t0 = time.time()
        <span class="hljs-keyword">for</span> X_batch, y_batch <span class="hljs-keyword">in</span> train_loader:
            X_batch, y_batch = X_batch.to(device), y_batch.to(device)
            y_pred = model(X_batch)
            loss = loss_fn(y_pred, y_batch)
            total_loss += loss.item()
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()
            metric.update(y_pred, y_batch)

        train_metric = metric.compute().item()
        valid_metric = evaluate_tm(model, valid_loader, metric).item()
        <span class="hljs-keyword">if</span> valid_metric &gt; best_metric:
            torch.save(model.state_dict(), checkpoint_path)
            best_metric = valid_metric
            best = <span class="hljs-string">" (best)"</span>
            patience_counter = <span class="hljs-number">0</span>
        <span class="hljs-keyword">else</span>:
            patience_counter += <span class="hljs-number">1</span>
            best = <span class="hljs-string">""</span>

        t1 = time.time()
        history[<span class="hljs-string">"train_losses"</span>].append(total_loss / len(train_loader))
        history[<span class="hljs-string">"train_metrics"</span>].append(train_metric)
        history[<span class="hljs-string">"valid_metrics"</span>].append(valid_metric)
        print(<span class="hljs-string">f"Epoch <span class="hljs-subst">{epoch + <span class="hljs-number">1</span>}</span>/<span class="hljs-subst">{n_epochs}</span>, "</span>
              <span class="hljs-string">f"train loss: <span class="hljs-subst">{history[<span class="hljs-string">'train_losses'</span>][<span class="hljs-number">-1</span>]:<span class="hljs-number">.4</span>f}</span>, "</span>
              <span class="hljs-string">f"train metric: <span class="hljs-subst">{history[<span class="hljs-string">'train_metrics'</span>][<span class="hljs-number">-1</span>]:<span class="hljs-number">.4</span>f}</span>, "</span>
              <span class="hljs-string">f"valid metric: <span class="hljs-subst">{history[<span class="hljs-string">'valid_metrics'</span>][<span class="hljs-number">-1</span>]:<span class="hljs-number">.4</span>f}</span><span class="hljs-subst">{best}</span>"</span>
              <span class="hljs-string">f" in <span class="hljs-subst">{t1 - t0:<span class="hljs-number">.1</span>f}</span>s"</span>
        )
        <span class="hljs-keyword">if</span> scheduler <span class="hljs-keyword">is</span> <span class="hljs-keyword">not</span> <span class="hljs-literal">None</span>:
            scheduler.step()
        <span class="hljs-keyword">if</span> patience_counter &gt;= patience:
            print(<span class="hljs-string">"Early stopping!"</span>)
            <span class="hljs-keyword">break</span>

    model.load_state_dict(torch.load(checkpoint_path))
    <span class="hljs-keyword">return</span> history
</code></pre>
<p>Let's use the <strong>NAdam</strong> optimizer with a learning rate of 0.002.</p>
<p><strong>NAdam</strong> (Nesterov-accelerated Adaptive Moment Estimation) is a variant of the <strong>Adam</strong> optimizer that adds <strong>Nesterov momentum</strong>, designed to improve convergence speed and training stability.</p>
<p><strong>Key Advantages Over Adam:</strong></p>
<ol>
<li><p><strong>Faster convergence</strong> - Nesterov momentum helps avoid oscillations</p>
</li>
<li><p><strong>Better final performance</strong> - Often achieves lower final loss</p>
</li>
<li><p><strong>Adaptive learning rates</strong> - Still maintains per-parameter learning rates like Adam</p>
</li>
<li><p><strong>Good for deep networks</strong> - Especially effective with deep architectures</p>
</li>
</ol>
<pre><code class="lang-python">optimizer = torch.optim.NAdam(model.parameters(), lr=<span class="hljs-number">2e-3</span>)
criterion = nn.CrossEntropyLoss()
accuracy = torchmetrics.Accuracy(task=<span class="hljs-string">"multiclass"</span>, num_classes=<span class="hljs-number">10</span>).to(device)
</code></pre>
<p>Now we will call the training function:</p>
<pre><code class="lang-python">n_epochs = <span class="hljs-number">100</span>
<span class="hljs-comment"># now we will call the training function</span>
history = train_with_early_stopping(
    model, optimizer, criterion, accuracy,
    train_loader, valid_loader,
    n_epochs
)
</code></pre>
<pre><code class="lang-python">Epoch <span class="hljs-number">1</span>/<span class="hljs-number">100</span>, train loss: <span class="hljs-number">2.0548</span>, train metric: <span class="hljs-number">0.2134</span>, valid metric: <span class="hljs-number">0.1974</span> (best) <span class="hljs-keyword">in</span> <span class="hljs-number">4.1</span>s
Epoch <span class="hljs-number">2</span>/<span class="hljs-number">100</span>, train loss: <span class="hljs-number">1.9637</span>, train metric: <span class="hljs-number">0.2550</span>, valid metric: <span class="hljs-number">0.2746</span> (best) <span class="hljs-keyword">in</span> <span class="hljs-number">4.0</span>s
Epoch <span class="hljs-number">3</span>/<span class="hljs-number">100</span>, train loss: <span class="hljs-number">1.8881</span>, train metric: <span class="hljs-number">0.2879</span>, valid metric: <span class="hljs-number">0.3152</span> (best) <span class="hljs-keyword">in</span> <span class="hljs-number">3.9</span>s
Epoch <span class="hljs-number">4</span>/<span class="hljs-number">100</span>, train loss: <span class="hljs-number">1.8266</span>, train metric: <span class="hljs-number">0.3196</span>, valid metric: <span class="hljs-number">0.2904</span> <span class="hljs-keyword">in</span> <span class="hljs-number">3.9</span>s
...
Epoch <span class="hljs-number">29</span>/<span class="hljs-number">100</span>, train loss: <span class="hljs-number">1.3953</span>, train metric: <span class="hljs-number">0.5011</span>, valid metric: <span class="hljs-number">0.4284</span> <span class="hljs-keyword">in</span> <span class="hljs-number">4.0</span>s
Epoch <span class="hljs-number">30</span>/<span class="hljs-number">100</span>, train loss: <span class="hljs-number">1.3910</span>, train metric: <span class="hljs-number">0.5036</span>, valid metric: <span class="hljs-number">0.4216</span> <span class="hljs-keyword">in</span> <span class="hljs-number">4.1</span>s
Epoch <span class="hljs-number">31</span>/<span class="hljs-number">100</span>, train loss: <span class="hljs-number">1.3841</span>, train metric: <span class="hljs-number">0.5062</span>, valid metric: <span class="hljs-number">0.4260</span> <span class="hljs-keyword">in</span> <span class="hljs-number">4.1</span>s
Early stopping!
</code></pre>
<p>This output shows the <strong>training stopped early</strong> due to no improvement. Here's what it means:</p>
<p><strong>Key Observations:</strong></p>
<ol>
<li><p><strong>Train &gt; Valid accuracy gap</strong> (50.62% vs 42.60%) - Shows some <strong>overfitting</strong>, but this is expected</p>
</li>
<li><p><strong>Stopped at epoch 31</strong> - Saved time and computational resources by not training all 100 epochs</p>
</li>
<li><p><strong>Model restored</strong> - The best model checkpoint (from epoch 21) was automatically loaded</p>
</li>
</ol>
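<p>As an aside on the optimizer: the Nesterov "lookahead" that NAdam adds to Adam can be illustrated in one dimension. The following is a schematic of plain vs Nesterov momentum on \(f(x) = x^2\), not NAdam itself (which also keeps Adam-style adaptive per-parameter learning rates):</p>

```python
# Schematic comparison: classical momentum vs Nesterov momentum minimizing
# f(x) = x^2 (gradient 2x). Nesterov evaluates the gradient at a lookahead
# point x + mu*v, which damps oscillations on this toy problem.
def run(nesterov, steps=200, lr=0.1, mu=0.9):
    x, v = 5.0, 0.0
    for _ in range(steps):
        point = x + mu * v if nesterov else x
        v = mu * v - lr * (2.0 * point)  # gradient of x^2 at the chosen point
        x = x + v
    return x

print(abs(run(False)), abs(run(True)))  # both near 0; Nesterov typically closer
```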
<h2 id="heading-test-the-model">Test the Model</h2>
<p>We will evaluate the model on the test set:</p>
<pre><code class="lang-python">
test_accuracy = evaluate_tm(model, test_loader, accuracy).item()
print(<span class="hljs-string">f"\nTest Accuracy: <span class="hljs-subst">{test_accuracy:<span class="hljs-number">.4</span>f}</span> (<span class="hljs-subst">{test_accuracy*<span class="hljs-number">100</span>:<span class="hljs-number">.2</span>f}</span>%)"</span>)
</code></pre>
<pre><code class="lang-python">
Test Accuracy: <span class="hljs-number">0.4377</span> (<span class="hljs-number">43.77</span>%)
</code></pre>
<h2 id="heading-charts">Charts</h2>
<p>Let's plot the training loss and the accuracy over epochs:</p>
<pre><code class="lang-python"><span class="hljs-comment"># Plot training history</span>
fig, axes = plt.subplots(<span class="hljs-number">1</span>, <span class="hljs-number">2</span>, figsize=(<span class="hljs-number">14</span>, <span class="hljs-number">5</span>))

<span class="hljs-comment"># Plot loss</span>
epochs_range = range(<span class="hljs-number">1</span>, len(history[<span class="hljs-string">'train_losses'</span>]) + <span class="hljs-number">1</span>)
axes[<span class="hljs-number">0</span>].plot(epochs_range, history[<span class="hljs-string">'train_losses'</span>], <span class="hljs-string">'b-'</span>, linewidth=<span class="hljs-number">2</span>, label=<span class="hljs-string">'Train Loss'</span>)
axes[<span class="hljs-number">0</span>].set_xlabel(<span class="hljs-string">'Epoch'</span>, fontsize=<span class="hljs-number">12</span>)
axes[<span class="hljs-number">0</span>].set_ylabel(<span class="hljs-string">'Loss'</span>, fontsize=<span class="hljs-number">12</span>)
axes[<span class="hljs-number">0</span>].set_title(<span class="hljs-string">'Training Loss Over Epochs'</span>, fontsize=<span class="hljs-number">14</span>, fontweight=<span class="hljs-string">'bold'</span>)
axes[<span class="hljs-number">0</span>].grid(<span class="hljs-literal">True</span>, alpha=<span class="hljs-number">0.3</span>)
axes[<span class="hljs-number">0</span>].legend()

<span class="hljs-comment"># Plot accuracy</span>
axes[<span class="hljs-number">1</span>].plot(epochs_range, history[<span class="hljs-string">'train_metrics'</span>], <span class="hljs-string">'g-'</span>, linewidth=<span class="hljs-number">2</span>, label=<span class="hljs-string">'Train Accuracy'</span>)
axes[<span class="hljs-number">1</span>].plot(epochs_range, history[<span class="hljs-string">'valid_metrics'</span>], <span class="hljs-string">'r-'</span>, linewidth=<span class="hljs-number">2</span>, label=<span class="hljs-string">'Valid Accuracy'</span>)
axes[<span class="hljs-number">1</span>].set_xlabel(<span class="hljs-string">'Epoch'</span>, fontsize=<span class="hljs-number">12</span>)
axes[<span class="hljs-number">1</span>].set_ylabel(<span class="hljs-string">'Accuracy'</span>, fontsize=<span class="hljs-number">12</span>)
axes[<span class="hljs-number">1</span>].set_title(<span class="hljs-string">'Accuracy Over Epochs'</span>, fontsize=<span class="hljs-number">14</span>, fontweight=<span class="hljs-string">'bold'</span>)
axes[<span class="hljs-number">1</span>].grid(<span class="hljs-literal">True</span>, alpha=<span class="hljs-number">0.3</span>)
axes[<span class="hljs-number">1</span>].legend()

plt.tight_layout()
plt.show()

<span class="hljs-comment"># Print summary</span>
print(<span class="hljs-string">"\n"</span> + <span class="hljs-string">"="</span>*<span class="hljs-number">50</span>)
print(<span class="hljs-string">"TRAINING SUMMARY"</span>)
print(<span class="hljs-string">"="</span>*<span class="hljs-number">50</span>)
print(<span class="hljs-string">f"Total Epochs Trained: <span class="hljs-subst">{len(history[<span class="hljs-string">'train_losses'</span>])}</span>"</span>)
print(<span class="hljs-string">f"Final Train Loss: <span class="hljs-subst">{history[<span class="hljs-string">'train_losses'</span>][<span class="hljs-number">-1</span>]:<span class="hljs-number">.4</span>f}</span>"</span>)
print(<span class="hljs-string">f"Final Train Accuracy: <span class="hljs-subst">{history[<span class="hljs-string">'train_metrics'</span>][<span class="hljs-number">-1</span>]:<span class="hljs-number">.4</span>f}</span> (<span class="hljs-subst">{history[<span class="hljs-string">'train_metrics'</span>][<span class="hljs-number">-1</span>]*<span class="hljs-number">100</span>:<span class="hljs-number">.2</span>f}</span>%)"</span>)
print(<span class="hljs-string">f"Final Valid Accuracy: <span class="hljs-subst">{history[<span class="hljs-string">'valid_metrics'</span>][<span class="hljs-number">-1</span>]:<span class="hljs-number">.4</span>f}</span> (<span class="hljs-subst">{history[<span class="hljs-string">'valid_metrics'</span>][<span class="hljs-number">-1</span>]*<span class="hljs-number">100</span>:<span class="hljs-number">.2</span>f}</span>%)"</span>)
print(<span class="hljs-string">f"Best Valid Accuracy: <span class="hljs-subst">{max(history[<span class="hljs-string">'valid_metrics'</span>]):<span class="hljs-number">.4</span>f}</span> (<span class="hljs-subst">{max(history[<span class="hljs-string">'valid_metrics'</span>])*<span class="hljs-number">100</span>:<span class="hljs-number">.2</span>f}</span>%)"</span>)
print(<span class="hljs-string">"="</span>*<span class="hljs-number">50</span>)
</code></pre>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1769562422970/a6ab6c3a-e3fc-4430-ba49-aca4b8c6d026.png" alt class="image--center mx-auto" /></p>
]]></content:encoded></item><item><title><![CDATA[Convolutional Neural Network]]></title><description><![CDATA[A convolutional neural network (CNN) is an advanced deep learning architecture designed for the identification and classification of images. In addition to image recognition, CNNs are utilized for object detection within images, audio classification,...]]></description><link>https://path2ml.com/convolutional-neural-network</link><guid isPermaLink="true">https://path2ml.com/convolutional-neural-network</guid><category><![CDATA[Deep Learning]]></category><category><![CDATA[Machine Learning]]></category><category><![CDATA[convolutional networks]]></category><category><![CDATA[CNNs (Convolutional Neural Networks)]]></category><dc:creator><![CDATA[Nitin Sharma]]></dc:creator><pubDate>Tue, 27 Jan 2026 05:00:00 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1739674185364/5601406e-b1ab-4ebc-a5e4-7cf4be3d3a4b.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>A convolutional neural network <strong>(CNN)</strong> is an advanced deep learning architecture designed for the identification and classification of images. In addition to image recognition, CNNs are utilized for object detection within images, audio classification, and the analysis of time-series data. A convolutional layer processes an input volume, transforming it into an output volume that may vary in size.</p>
<p><strong>Convolutional neural networks</strong> (commonly known as <strong>convnets</strong>) are powerful architectures that build upon the foundation of fully connected neural networks. They consist of layers of neurons equipped with learnable weights and biases. Each neuron processes input data by performing a linear transformation followed by a nonlinear activation function, culminating in a unified scoring function that translates raw image pixels at the input layer into definitive class scores at the output layer.</p>
<p>The unique advantage of convnets lies in their deliberate assumptions about the structure of input data, especially images. These assumptions empower the architecture to encode critical properties, leading to remarkable implementation efficiency and a significant reduction in the number of parameters in the network.</p>
<p>Instead of treating input data as simple linear arrays, convnets expertly manage information as three-dimensional volumes defined by width, height, and depth. This allows each layer to accept a 3D volume of numerical data as input and produce another 3D volume as output. By incorporating color depth as the third dimension, a two-dimensional input image is seamlessly transformed into a three-dimensional representation, enhancing the network's ability to interpret and analyze visual information.</p>
<p>A <strong>Convolutional Neural Network (CNN)</strong> is structured with several layers, primarily categorized into three main types: convolutional layers, pooling layers, and fully connected layers. CNNs are often comprised of many layers, particularly a combination of convolutional and pooling layers, which work together to extract and refine features from input data.</p>
<p>In convolutional layers, specialized nodes, or filters, slide over the input data to detect patterns, such as edges or textures, by performing convolution operations. Pooling layers follow, downsampling the feature maps generated by the convolutional layers to reduce their dimensionality, thereby retaining essential information while minimizing computational load.</p>
<p>Finally, fully connected layers integrate the features learned throughout the network, connecting all nodes to produce the final output. Together, these layers enable the CNN to effectively analyze and interpret complex data, making them a powerful tool in various applications, such as image and video recognition.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1739739623933/7c638645-2637-413f-8cae-e37fcf5d3283.png" alt class="image--center mx-auto" /></p>
<h3 id="heading-convolution-layer">Convolution Layer</h3>
<p>The convolution process consists of the following steps:</p>
<ol>
<li><p>It commences with an input volume.</p>
</li>
<li><p>A filter is applied at every position throughout the input.</p>
</li>
<li><p>The process yields an output volume, which typically differs in size from the input.</p>
</li>
</ol>
<p>To convolve a 3x3 filter with an image, one multiplies the filter's values element-wise with the corresponding values of the original matrix. The resulting products are then summed, and a bias is added to produce the final output. The filter then slides one step to the right and the same convolution is applied, repeating across the row; the same process continues one step at a time in the vertical direction. Moving one step at a time is called using a stride of one.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1739630442421/18a93f56-4b97-455c-b78c-c2a7bf8a6490.jpeg" alt class="image--center mx-auto" /></p>
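<p>The sliding-window computation described above can be sketched in a few lines of NumPy. This is a minimal illustration with made-up input and filter values, not a framework-grade implementation:</p>
<pre><code class="lang-python">import numpy as np

def conv2d(image, kernel, bias=0.0, stride=1):
    """Valid (no padding) 2D convolution of a single-channel image."""
    n, f = image.shape[0], kernel.shape[0]
    out_size = (n - f) // stride + 1
    out = np.zeros((out_size, out_size))
    for i in range(out_size):
        for j in range(out_size):
            # element-wise product of the filter with the current patch, then sum
            patch = image[i*stride:i*stride+f, j*stride:j*stride+f]
            out[i, j] = np.sum(patch * kernel) + bias
    return out

image = np.arange(25, dtype=float).reshape(5, 5)   # toy 5x5 input
kernel = np.ones((3, 3)) / 9.0                      # 3x3 averaging filter
print(conv2d(image, kernel).shape)                  # (3, 3)
</code></pre>
<p>Each output entry is the sum of an element-wise product between the filter and the patch it currently covers, plus the bias.</p>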
<h3 id="heading-zero-padding">Zero-Padding</h3>
<p>Zero-padding adds zeros around the border of an image.</p>
<p>The primary advantages of utilizing padding in convolutional neural networks are as follows:</p>
<ul>
<li><p>Padding enables the application of a convolutional layer without necessarily reducing the height and width of the input volumes. This characteristic is crucial for constructing deeper networks, as it prevents a reduction in height and width as one progresses through subsequent layers. Notably, the "same" convolution is a specific instance where the height and width are accurately maintained after processing through one layer.</p>
</li>
<li><p>Additionally, padding contributes to the retention of information at the periphery of an image. In the absence of padding, the influence of border pixels on subsequent layers would be significantly diminished, thereby compromising the utilization of crucial edge data.</p>
</li>
</ul>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1739630463928/0e3ae299-5e8c-4d81-b90b-11e19e976c14.jpeg" alt class="image--center mx-auto" /></p>
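<p>A quick sketch with <code>np.pad</code> (the array size here is made up for illustration) shows how "same" padding preserves the spatial size for a stride of 1, using \(P=(f-1)/2\):</p>
<pre><code class="lang-python">import numpy as np

f = 3                       # filter size
P = (f - 1) // 2            # padding that preserves size for stride 1
image = np.ones((5, 5))     # toy single-channel image

padded = np.pad(image, P, mode="constant", constant_values=0.0)
print(padded.shape)         # (7, 7): a 3x3 valid convolution now returns a 5x5 output
</code></pre>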
<h3 id="heading-stride">Stride</h3>
<p>The stride is the amount by which the window moves each time it slides, shown by the red arrows in the image below, where a convolution with a 3x3 filter and a stride of 2 produces a 3x3 output. In convolutional neural networks (CNNs), the main advantage of using stride is its capacity to efficiently downsample input data. This allows the network to concentrate on more significant features while also reducing computational complexity by processing fewer positions. Increasing the stride decreases the computational load, as the filter covers more pixels with each step and therefore performs fewer operations overall. This can speed up both training and inference.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1739641923292/b08dbf40-15c1-43a1-9cbb-df9e90a4596b.jpeg" alt class="image--center mx-auto" /></p>
<p>If n x n is the size of the input image, f x f is the size of the filter, P is the padding, and S is the stride, then the output size will be</p>
<p>\(\frac{n+2P-f}{S} +1\) by \(\frac{n+2P-f}{S} +1\)</p>
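<p>As a sanity check, the formula can be wrapped in a tiny helper (illustrative only; floor division models the case where the filter does not fit an exact number of times):</p>
<pre><code class="lang-python">def conv_output_size(n, f, P=0, S=1):
    """Spatial output size of a convolution: floor((n + 2P - f) / S) + 1."""
    return (n + 2 * P - f) // S + 1

print(conv_output_size(7, 3, P=0, S=2))  # 3: e.g. a 7x7 input, 3x3 filter, stride 2
print(conv_output_size(5, 3, P=1, S=1))  # 5: a "same" convolution
</code></pre>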
<h3 id="heading-pooling-layer"><strong>Pooling Layer</strong></h3>
<p>The pooling (POOL) layer reduces the height and width of the input, which helps decrease computation while also making feature detectors more invariant to their position in the input. There are two main types of pooling layers:</p>
<ul>
<li><p><strong>Max-Pooling Layer:</strong> This layer stores the maximum value within the specified window in the output.</p>
</li>
<li><p><strong>Average-Pooling Layer:</strong> This layer calculates and stores the average value within the specified window in the output.</p>
</li>
</ul>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1739670803338/3e243e12-f82b-4949-b14e-9444bb4230c4.jpeg" alt class="image--center mx-auto" /></p>
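<p>Both pooling types can be sketched with a single helper function; the 4x4 input below is made up for illustration:</p>
<pre><code class="lang-python">import numpy as np

def pool2d(x, size=2, stride=2, mode="max"):
    """Max- or average-pool a 2D array with a square window (illustrative sketch)."""
    out_size = (x.shape[0] - size) // stride + 1
    out = np.zeros((out_size, out_size))
    for i in range(out_size):
        for j in range(out_size):
            window = x[i*stride:i*stride+size, j*stride:j*stride+size]
            out[i, j] = window.max() if mode == "max" else window.mean()
    return out

x = np.array([[1., 3., 2., 4.],
              [5., 6., 1., 2.],
              [7., 2., 9., 3.],
              [4., 8., 6., 5.]])
print(pool2d(x, mode="max"))       # [[6. 4.] [8. 9.]]
print(pool2d(x, mode="average"))   # [[3.75 2.25] [5.25 5.75]]
</code></pre>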
<p>Looking at a <strong>CNN</strong> in 3D, in a <strong>convolutional</strong> layer each pixel in the <strong>feature map</strong> corresponds to a single neuron. All neurons within a specific feature map share the same parameters, meaning they use the same kernel and bias term. However, neurons in different feature maps utilize different parameters. Each neuron's receptive field remains consistent, extending across all the feature maps from the previous layer. In summary, a convolutional layer applies multiple trainable filters simultaneously to its inputs, allowing it to detect various features anywhere within the input data.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1769608657284/781bb89d-137e-446c-8f5b-266a7ef5ae53.gif" alt class="image--center mx-auto" /></p>
<p>Having all neurons in a feature map share the same parameters significantly reduces the number of parameters in the model.</p>
]]></content:encoded></item><item><title><![CDATA[Deep Learning Explained]]></title><description><![CDATA[In the realm of popular neural network architectures, a diverse array of layer types is employed, each serving a distinct purpose. In this blog, we will delve into one of the most fundamental components: the linear layer. This layer is characterized ...]]></description><link>https://path2ml.com/deep-learning-explained</link><guid isPermaLink="true">https://path2ml.com/deep-learning-explained</guid><category><![CDATA[Deep Learning]]></category><category><![CDATA[DeepLearning]]></category><category><![CDATA[Machine Learning]]></category><category><![CDATA[MachineLearning]]></category><category><![CDATA[Gradient-Descent ]]></category><dc:creator><![CDATA[Nitin Sharma]]></dc:creator><pubDate>Sat, 17 Jan 2026 21:06:43 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1768663272258/07691eba-2c31-4252-b732-d76826316fc6.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>In the realm of popular neural network architectures, a diverse array of layer types is employed, each serving a distinct purpose. In this blog, we will delve into one of the most fundamental components: the linear layer. This layer is characterized by a structure in which each neuron, also known as a perceptron, in the preceding layer is intricately connected to every neuron in the subsequent layer. This design is commonly referred to as a fully <strong>connected layer</strong>, given that all neurons engage in interactions with one another across layers.</p>
<h2 id="heading-forward-propagation">Forward Propagation</h2>
<p>To better understand this, consider that if the preceding layer consists of \(m\) neurons and the following layer has \(n\) neurons, the network establishes a total of \(m \times n\) individual connections. Each of these connections carries its own unique weight, which plays a critical role in determining how information is processed as it flows through the network. This foundational layer effectively facilitates the transmission of signals, allowing for complex computations and ultimately contributing to the learning capabilities of the neural network.</p>
<p>The weight associated with the connection between the <strong>kth</strong> neuron in the previous layer \(\textbf{l-1}\) and the <strong>jth</strong> neuron in the current layer \(\textbf{l}\) is denoted as \(\mathbf{w_{jk}^l}\). This weight represents the strength of the influence that the <strong>kth</strong> neuron has on the activation of the <strong>jth</strong> neuron, playing a crucial role in the computations performed by the neural network.</p>
<h3 id="heading-single-perceptron-at-layer-l">Single perceptron at layer l</h3>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1768618116926/3cb92982-a47e-4344-844e-ffe79943477a.png" alt /></p>
<p>Neurons are depicted in circles and <strong>jth</strong> neuron in layer <strong>l</strong> as <strong>red</strong> circle</p>
<p>A parameterized function is a mathematical construct that processes an input to yield a specific decision or estimate. One of the most basic forms of this function is the <strong>weighted (w) sum of inputs</strong>, augmented by a <strong>bias term (b)</strong>. This approach assigns varying levels of importance to each input before they are combined, enabling a more nuanced evaluation of how each input contributes to the overall outcome. Nonlinearity is introduced through functions such as the sigmoid function. The weights <strong><em>w</em><sub>0</sub></strong>, <strong><em>w</em><sub>1</sub></strong> and the bias <strong><em>b</em></strong> are the parameters of the function</p>
<p>Let \(\mathbf{a_0^{l-1}}\), \(a_1^{l-1}\) …. , \(\mathbf{a_m^{l-1}}\) are the outputs of <strong>m</strong> neurons in layer <strong>l-1. and</strong> \(\mathbf{a_0^{l}}\), \(\mathbf{a_1^{l}}\)… \(\mathbf{a_n^{l}}\) are the outputs of <strong>n neurons</strong> at layer l.</p>
<p>Considering the <strong>jth neuron</strong> in layer l</p>
<p>Parameterized model function in layer l for <strong>jth</strong> neuron is defined as</p>
<p>\(\mathbf{f_{w,b}(a)}=\mathbf{z_j^l= \sum_{k=0}^m w_{jk}^l a_k^{l-1} +b_j^l}\) where the number of neurons in layer <strong>l-1</strong> and <strong>l</strong> are <strong>m</strong> and <strong>n</strong></p>
<p>In dot product this can be represented as</p>
<p>\(\mathbf{z_j^l=[ w_{j0}^l\enspace w_{j1}^l \enspace ... w_{jm}^l] \begin{bmatrix} a_0^{l-1} \\\\ a_1^{l-1}   \\\\..... \\\\ a_m^{l-1} \end{bmatrix}+b_j^l}\)</p>
<p>And the activation vector which is the output at layer <strong>l</strong> for <strong>jth</strong> neuron is derived by applying activation function on the model function \(z_j^l\)</p>
<p>\(\mathbf{a_j^l=\sigma(z_j^l)}\)</p>
<p>For all \(\mathbf{j=0....n}\) (an arrow superscript is used here to denote a vector), this can be written as</p>
<p>\(\mathbf{\vec{z}^{\,l}=W^l  \enspace \vec{a}^{\,l-1}+\vec{b}^{\,l}}\)</p>
<p>\(\mathbf{\vec{a}^{\,l}=\sigma(\vec{z}^{\,l})}\) <strong>………………………………………………</strong> \(\mathbf{(1)}\)</p>
<p>Here \(\mathbf{W^l}\) is an <strong><em>n</em> × <em>m</em></strong> matrix representing the weights of <em>all connections from layer</em> <strong><em>l − 1</em></strong> <em>to layer</em> <strong><em>l</em></strong></p>
<p>\(W^l=\begin{bmatrix} w_{00}^{l}  \enspace w_{01}^{l}   \enspace ....  \enspace w_{0m}^{l}    \\\\ w_{10}^{l}  \enspace w_{11}^{l}   \enspace ....  \enspace w_{1m}^{l}   \\\\..... \\\\ w_{n0}^{l}  \enspace w_{n1}^{l}   \enspace ....  \enspace w_{nm}^{l}\end{bmatrix}\) and \(\vec{a}^{\,l-1}=\begin{bmatrix} a_0^{l-1} \\\\ a_1^{l-1}   \\\\..... \\\\ a_m^{l-1} \end{bmatrix}\) and \(\vec{b}^{\,l}=\begin{bmatrix} b_0^{l} \\\\ b_1^{l}   \\\\..... \\\\ b_n^{l} \end{bmatrix}\) and</p>
<p>\(\sigma(\vec{z}^{\,l})= \begin{bmatrix} \sigma (z_0^{l}) \\\\ \sigma (z_1^{l})   \\\\..... \\\\ \sigma (z_n^{l}) \end{bmatrix}\)</p>
<p><strong>Equation 1</strong> details the process of <strong>forward</strong> <strong>propagation</strong> within a single linear layer of a neural network for example <strong>layer</strong> \(\mathbf{l}\) as shown above. In the context of a multilayer perceptron (MLP), which consists of a series of fully connected layers ranging from layer 0 to layer L, the final output can be achieved by continually applying this equation to the input data. Each application of the equation transforms the input, allowing information to flow through the network and ultimately produce the desired output.</p>
<p>This expression is evaluated incrementally through the repeated application of the linear layer.</p>
<p>\(\mathbf{\vec{a}^{\,0}=\sigma(W^0  \enspace \vec{x}+\vec{b}^{\,0})}\)</p>
<p>\(\mathbf{\vec{a}^{\,1}=\sigma(W^1  \enspace \vec{a}^{\,0}+\vec{b}^{\,1})}\)</p>
<p>……………………</p>
<p>\(\mathbf{\vec{a}^{\,L}=\sigma(W^L  \enspace \vec{a}^{\,L-1}+\vec{b}^{\,L})}\)</p>
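<p>This repeated application of equation \(\mathbf{(1)}\) can be sketched in NumPy. The layer sizes and the random initialization below are arbitrary, chosen only for illustration:</p>
<pre><code class="lang-python">import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# layer sizes: 3 inputs, one hidden layer of 4, and 2 outputs (arbitrary choice)
sizes = [3, 4, 2]
weights = [rng.standard_normal((n, m)) for m, n in zip(sizes[:-1], sizes[1:])]
biases = [rng.standard_normal(n) for n in sizes[1:]]

def forward(x):
    a = x
    for W, b in zip(weights, biases):
        a = sigmoid(W @ a + b)   # equation (1): a^l = sigma(W^l a^(l-1) + b^l)
    return a

out = forward(np.array([0.5, -1.0, 2.0]))
print(out.shape)  # (2,)
</code></pre>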
<h2 id="heading-backward-propagation">Backward Propagation</h2>
<h3 id="heading-loss-function-and-training">Loss function and training</h3>
<p>In the context of neural networks, let \(y\) represent the output generated by the model, while \( \hat{y} \) signifies the actual or ground truth value that we aim to predict. To quantify the difference between these two values, we utilize a common metric known as the mean squared error <strong>(MSE)</strong>. This loss function is expressed mathematically as \(\mathbf{ (y - \hat{y})^2 }\), which calculates the square of the difference between the predicted output and the true value. For simplicity, we select MSE as our loss function for training the neural network.</p>
<p>\(L= \mathbf{\frac{1}{2} \sum_{i} (y_i-\hat{y}_i)^2}\) where the sum runs over all training instances <strong>…………………………………………………………</strong> \(\mathbf{(2)}\)</p>
<p>We can transform each layer's weight matrix, denoted as \(\mathbf{w^l}\), along with its corresponding bias \(\mathbf{b^l}\), into individual vectors. After this conversion, we concatenate the vectors from all layers in sequence, resulting in a single, extensive vector that encompasses all the weights and biases throughout the multilayer perceptron (MLP). These concatenated vectors, one for the weights and one for the biases, serve as a unified representation of the entire model's parameters, facilitating efficient processing and optimization during training.</p>
<p>\(\mathbf{\vec{w}=[ w_{00}^0\enspace w_{01}^0 \enspace ..... w_{00}^1 w_{01}^1\enspace ... \enspace w_{00}^L  w_{01}^L \enspace ..]}\)</p>
<p>\(\mathbf{\vec{b}=[ b_{0}^0\enspace b_{1}^0 \enspace ..... b_{0}^1 b_{1}^1\enspace ... \enspace b_{0}^L  b_{1}^L \enspace ..]}\)</p>
<p>The primary objective of training is to discover the optimal parameters and configurations that will effectively reduce the loss <strong>(equation 2)</strong> to its lowest possible level.</p>
<p>We begin by calculating the gradients of the loss function in relation to the weights and biases of our model. These gradients indicate the direction and rate at which we should adjust the weights and biases to minimize the loss. To refine our model, we update the weights and biases by a value that is proportional to these computed gradients. By repeatedly performing this update process, we progressively move toward the minimum point of the loss function, ultimately leading to improved model performance.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1768618009518/c0f66472-783b-449b-bb58-ccaf387150a0.png" alt /></p>
<p>The equations for updating weights and biases in gradient descent are</p>
<p>\(\mathbf{\vec{w}=\vec{w}- \lambda \nabla_{\vec{w}} L  }\) where \(\mathbf{\nabla_{\vec{w}} L}\) stacks \(\mathbf{\frac{\partial L}{\partial w_{jk}^l}}\) for all <strong>l, j, k</strong></p>
<p>\(\mathbf{\vec{b}=\vec{b}- \lambda \nabla_{\vec{b}} L  }\) where \(\mathbf{\nabla_{\vec{b}} L}\) stacks \(\mathbf{\frac{\partial L}{\partial b_{j}^l}}\) for all <strong>l, j</strong> <strong>……..………………………………………</strong> \(\mathbf{(3)}\)</p>
<p>The equations for updating individual weights and biases using their partial derivatives are</p>
<p>\(\mathbf{w_{jk}^l=w_{jk}^l- \lambda \frac{\partial L}{\partial w_{jk}^l}  }\) for all <strong>l, j, k</strong></p>
<p>\(\mathbf{b_j^l=b_j^l- \lambda \frac{\partial L}{\partial b_{j}^l}  }\) <strong>………………………………………………………….</strong> \(\mathbf{(4)}\)</p>
<p>Gradient descent is an iterative optimization algorithm that updates the weights and biases of a model to minimize the loss function. It accomplishes this by applying <strong>equation</strong> \(\mathbf{(3)}\) in each iteration, allowing for systematic adjustment of the model parameters based on the calculated error. This method is equivalent to updating each weight and bias individually using its specific partial derivative. By doing so, gradient descent effectively fine-tunes the model, gradually improving its performance with each update.</p>
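<p>The update rule can be illustrated on a toy one-parameter loss \(L(w)=(w-3)^2\), whose gradient is \(2(w-3)\), a deliberately simple stand-in for the network loss:</p>
<pre><code class="lang-python"># Gradient descent on the toy loss L(w) = (w - 3)^2, so dL/dw = 2(w - 3)
def grad(w):
    return 2.0 * (w - 3.0)

w = 0.0        # initial parameter value (made up)
lam = 0.1      # learning rate
for _ in range(100):
    w = w - lam * grad(w)   # equation (3)-style update

print(round(w, 4))  # prints 3.0, the minimizer of the loss
</code></pre>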
<h3 id="heading-back-propagation-with-single-neuron-par-layer">Back propagation with a single neuron per layer</h3>
<p>We will assess <strong>back propagation</strong> on a simple perceptron that consists of only one neuron per layer. This simplification allows us to avoid using subscripts for individual <strong>weights</strong> and <strong>biases</strong>, as there is only one weight and one bias between two consecutive layers. We will use superscripts to indicate the layer. We will employ Mean Squared Error (MSE) as our loss function and will focus on a single input-output pair, denoted as \(\mathbf{x_i}\) and \(\mathbf{y_i}\). The total loss <strong>L</strong>, which represents the summation across all training data instances, can be easily derived by applying the same steps repeatedly.</p>
<p>Forward propagation for an arbitrary layer <strong>l</strong> is defined as</p>
<p>\(\mathbf{z^l=w^l a^{l-1} + b^l}\) and \(\mathbf{a^l=\sigma(z^l)}\) <strong>……………………………………………..</strong> \(\mathbf{(5)}\)</p>
<p>Loss function for a given \(\mathbf{(x_i,y_i)}\) is \(L= \mathbf{\frac{1}{2}  (a^L-\hat{y_i})^2}\) where <strong>L</strong> is last layer <strong>…………………………</strong> \(\mathbf{(6)}\)</p>
<h3 id="heading-partial-derivative-of-loss-with-respect-to-the-weights-w-for-the-last-layer-l">Partial derivative of loss with respect to the weights (w) for the last layer, L</h3>
<p>\(\mathbf{\frac{\partial L}{\partial w^L}= \frac{\partial L}{\partial z^L}\frac{\partial z^L}{\partial w^L}= \frac{\partial L}{\partial z^L}  a^{L-1}}\) ………………………………………. \(\mathbf{(7)}\)</p>
<p>\(\mathbf{\frac{\partial z^L}{\partial w^L}=  a^{L-1}}\) <strong>and …………………………………</strong> \(\mathbf{(8)}\)</p>
<p>\(\mathbf{\frac{\partial L}{\partial z^L}=  \frac{\partial L}{\partial a^L}   \frac{\partial a^L}{\partial z^L}}\) Using the chain rule for partial derivatives ………………………….. \(\mathbf{(9)}\)</p>
<p>\(\mathbf{\frac{\partial L}{\partial a^L}=  (a^L-\hat{y_i}) }\) ………………………………………….. \(\mathbf{(10)}\)</p>
<p>\(\mathbf{\frac{\partial a^L}{\partial z^L}=  \frac{\partial \sigma(z^L)}{\partial z^L}   }\) ………………………………………………..…. \(\mathbf{(11)}\)</p>
<p>The backpropagation algorithm is a powerful tool used in training artificial neural networks, and its effectiveness is not limited to just the <strong>sigmoid</strong> \(\mathbf{\sigma}\) activation function. In fact, back propagation can work well with a variety of activation functions, including <strong>ReLU</strong> <strong>(Rectified Linear Unit)</strong>, <strong>tanh (hyperbolic tangent),</strong> and <strong>softmax</strong>, among others. Each of these functions has unique properties that can lead to improved performance in different contexts. For instance, while the sigmoid function is helpful for binary classification tasks, the ReLU function is often preferred in deeper networks because it mitigates issues related to vanishing gradients, allowing for faster convergence. The flexibility of back propagation in accommodating multiple activation functions enables it to be applied across a wide range of neural network architectures and applications, enhancing its versatility and effectiveness in machine learning.</p>
<p>These functions possess the characteristic of maintaining a nonzero derivative throughout their entire domain, which enables the <strong>gradient descent</strong> algorithm to consistently make progress at each iteration. This ensures that the optimization process can effectively navigate the function landscape without encountering flat regions, allowing for a more fluid and efficient <strong>convergence</strong> toward the optimal solution.</p>
<p>Some popular activation functions and their derivatives are shown below.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1768838220580/364e15fe-d540-4296-bd69-5dc645e58318.png" alt class="image--center mx-auto" /></p>
<p>For our current use case we have picked <strong>sigmoid</strong> activation function.</p>
<p>Let's calculate the derivative of the <strong>sigmoid</strong> function for a variable x</p>
<p>\(\mathbf{\frac{d \sigma(x)}{ dx}=\frac{d( (1+e^{-x} )^{-1})}{dx} =-(1+e^{-x} )^{-2}  \frac{d}{dx}(1+e^{-x})=-(1+e^{-x} )^{-2}  (-e^{-x})=\frac{e^{-x}}{ (1+e^{-x})^{2} }=\sigma(x)(1-\sigma(x))}\) now replacing x with \(z^L\)</p>
<p>\(\mathbf{\frac{\partial a^L}{\partial z^L}=  \frac{\partial \sigma(z^L)}{\partial z^L}   =\sigma(z^L)(1-\sigma(z^L))=a^L(1-a^L)}\) …………………………… \(\mathbf{(12)}\)</p>
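<p>The identity \(\sigma'(x)=\sigma(x)(1-\sigma(x))\) is easy to verify numerically against a finite-difference approximation (a quick sanity check, not part of training):</p>
<pre><code class="lang-python">import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

xs = np.array([-2.0, 0.0, 1.5])   # arbitrary test points
h = 1e-6

# analytic derivative from equation (12) vs. a central finite difference
analytic = sigmoid(xs) * (1.0 - sigmoid(xs))
numeric = (sigmoid(xs + h) - sigmoid(xs - h)) / (2.0 * h)
print(np.allclose(analytic, numeric, atol=1e-6))  # True
</code></pre>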
<p>Now substituting \(\mathbf{(10)}\) and \(\mathbf{(12)}\) in \(\mathbf{(9)}\) we get</p>
<p>\(\mathbf{\frac{\partial L}{\partial z^L}= (a^L-\hat{y_i})( a^L(1-a^L))  }\) ………………………………………………. \(\mathbf{(13)}\)</p>
<p>Now substituting \(\mathbf{(8)}\) and \(\mathbf{(13)}\) in equation \(\mathbf{(7)}\) we get</p>
<p>\(\mathbf{\frac{\partial L}{\partial w^L}= a^{L-1}(a^L-\hat{y_i})(a^L(1-a^L)) }\) ………………………………………. \(\mathbf{(14)}\)</p>
<h3 id="heading-partial-derivative-of-loss-with-respect-to-the-bias-b-for-the-last-layer-l">Partial derivative of loss with respect to the bias (b) for the last layer, L</h3>
<p>\(\mathbf{\frac{\partial L}{\partial b^L}= \frac{\partial L}{\partial z^L}\frac{\partial z^L}{\partial b^L}= \frac{\partial L}{\partial z^L}\cdot 1}\) ………………………………………………………. \(\mathbf{(15)}\)</p>
<p>so \(\mathbf{\frac{\partial L}{\partial b^L}=(a^L-\hat{y_i})(a^L(1-a^L))}\) …………………………………………. \(\mathbf{(16)}\)</p>
<p>Weights and biases in equation \(\mathbf{(4)} \) get adjusted as shown below using equation \(\mathbf{(14)}\) and \(\mathbf{(16)}\)</p>
<p>\(\mathbf{w^L=w^L- \lambda \frac{\partial L}{\partial w^L}  }\)</p>
<p>\(\mathbf{b^L=b^L- \lambda \frac{\partial L}{\partial b^L}  }\)</p>
<p>Where \(\mathbf{\lambda}\) is the learning rate parameter, which decides how big a step is taken in gradient descent.</p>
<p>After adjusting the weights and biases in the last layer <strong>L</strong> of the neural network, we begin the process of back propagation. This involves systematically moving backwards through each preceding layer, making necessary adjustments to both the weights and biases. We continue this process layer by layer until we reach the very first layer of the network, ensuring that each component is fine-tuned to improve the overall performance of the model.</p>
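<p>Putting equations (5), (13), (14), and (16) together, the full loop for a tiny two-layer chain with one neuron per layer fits in a short script. The input, target, initial parameters, and learning rate below are made up for illustration, and \(\hat{y}\) denotes the ground truth, following the convention used above:</p>
<pre><code class="lang-python">import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# One neuron per layer: x, then layer 1, then layer 2 (the final layer L)
x, y_hat = 1.0, 0.25             # single training pair (values made up)
w1, b1, w2, b2 = 0.5, 0.0, -0.3, 0.1
lam = 0.5                        # learning rate

for _ in range(2000):
    # forward pass, equation (5)
    a1 = sigmoid(w1 * x + b1)
    a2 = sigmoid(w2 * a1 + b2)
    # backward pass: dL/dz^L, equation (13)
    dz2 = (a2 - y_hat) * a2 * (1.0 - a2)
    # chain one layer back, computed before w2 is updated
    dz1 = dz2 * w2 * a1 * (1.0 - a1)
    # parameter updates, equations (14) and (16) and their layer-1 analogues
    w2 -= lam * dz2 * a1
    b2 -= lam * dz2
    w1 -= lam * dz1 * x
    b1 -= lam * dz1

pred = sigmoid(w2 * sigmoid(w1 * x + b1) + b2)
print(round(pred, 3))  # 0.25, matching the target
</code></pre>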
]]></content:encoded></item><item><title><![CDATA[Importance of dot product in machine learning]]></title><description><![CDATA[In the most basic type of machine learning model, the output is calculated by taking a weighted sum of the input features. Each input is multiplied by a corresponding weight that represents its importance in the model. Once this weighted sum is obtai...]]></description><link>https://path2ml.com/importance-of-dot-product-in-machine-learning</link><guid isPermaLink="true">https://path2ml.com/importance-of-dot-product-in-machine-learning</guid><category><![CDATA[Machine Learning]]></category><category><![CDATA[Deep Learning]]></category><category><![CDATA[machine-learning-math]]></category><category><![CDATA[Data Science]]></category><category><![CDATA[Python]]></category><dc:creator><![CDATA[Nitin Sharma]]></dc:creator><pubDate>Sun, 11 Jan 2026 22:40:13 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1768163651460/23d5de7b-fa03-4188-83bb-92b7aecde0b8.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>In the most basic type of machine learning model, the output is calculated by taking a weighted sum of the input features. Each input is multiplied by a corresponding weight that represents its importance in the model. Once this weighted sum is obtained, a bias term is added to the result. The bias allows the model to adjust the output independently of the input values, helping to improve accuracy and fit the model better to the data. This fundamental approach serves as the foundation for more complex machine learning algorithms.</p>
<p>For a single input instance defined as \(x=[x_0 \enspace x_1]\), where \(x_0 \) and \(x_1\) are features in the data set, the output of the model is defined as</p>
<p>\( y=w_0x_0+w_1x_1+b\) where \(w_0,w_1\) are weights and \(b\) is bias.</p>
<p>In situations where there are multiple features and weights, we utilize dot product notation for representation. The <strong>dot product</strong> of two vectors is the sum of the element-wise products of their components, which facilitates the analysis of their relationships.</p>
<p>Let's say we have two vectors, vector \(x= \\\begin{bmatrix} x_0  \\ x_1  \\ \vdots \\ x_n \end{bmatrix}\\\) and vector \(w= \\\begin{bmatrix} w_0  \\ w_1  \\ \vdots \\w_n \end{bmatrix}\\\); their dot product is then</p>
<p>\(w.x=w_0x_0+w_1x_1+.....+w_nx_n\) In other words, the dot product of two vectors is the sum of the products of their corresponding elements.</p>
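<p>In NumPy this is <code>np.dot</code>; the weights, input, and bias below are made-up values for illustration:</p>
<pre><code class="lang-python">import numpy as np

w = np.array([0.2, -0.5, 1.0])   # weights (made-up values)
x = np.array([3.0, 2.0, 1.0])    # one input instance (made-up values)
b = 0.1                          # bias

# the model output: sum of element-wise products of w and x, plus the bias
y = np.dot(w, x) + b
print(round(float(y), 4))        # 0.7
</code></pre>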
<p>Let's assume a machine learning model that is designed to predict a specific target value, represented as \(y\). However, instead of achieving this target exactly, the model produces an output, denoted as \(\hat{y}\), which may differ from what we expect. To evaluate the model's performance and understand how accurately it is making predictions, we need to calculate the error, defined as the difference between the desired target value \(y\) and the actual output \(\hat{y}\). To measure this discrepancy, we employ a statistical method known as mean squared error, which quantifies the average of the squared differences between the predicted and target values. This allows us to gain insight into the model's accuracy and areas for improvement.</p>
<p>squared error \(e^2=(y-\hat{y})^2\)</p>
<p>The total error across the entire training dataset is determined by calculating the difference between the output vector and the ground truth vector. Each element of this difference is squared, and the resulting squared values are summed to obtain the total error. This procedure is equivalent to computing the dot product of the difference vector with itself. This operation represents the <strong>squared magnitude</strong>, or length, known as the <strong>L2 norm</strong> of a vector, which is defined as the <strong>dot product of the vector with itself</strong>.</p>
<p>\(E^2=(Y-\hat{Y}).(Y-\hat{Y})=(Y-\hat{Y})^T(Y-\hat{Y})\). Writing the difference vector as \(e=Y-\hat{Y}= \\\begin{bmatrix} e_0  \\ e_1  \\ \vdots \\ e_n \end{bmatrix}\\\), this becomes</p>
<p>\(E^2=e.e=e_0^2+e_1^2+\cdots+e_n^2\)</p>
<p>The <strong>L2 norm of a vector</strong>, often referred to as the Euclidean norm, is a mathematical concept that measures the length or magnitude of a vector in a multi-dimensional space. It is calculated as the square root of the sum of the squares of its components.</p>
<p>The L2 norm of a vector \(V\) is denoted \(||V||\) and is defined as \(||V||=\sqrt{V^TV}=\sqrt{v_0^2+v_1^2+\cdots+v_n^2}\)</p>
<p>In a machine learning model with an output vector \(\hat{Y}\) and a target vector \(Y\), the error is defined as the magnitude or L2 norm of the difference between these vectors.</p>
<p>\(E=\sqrt{(Y-\hat{Y}).(Y-\hat{Y})}=\sqrt{(Y-\hat{Y})^T(Y-\hat{Y})}\)</p>
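<p>The same error can be computed numerically; a minimal sketch with made-up target and output vectors:</p>

```python
import numpy as np

# hypothetical target and model-output vectors
Y     = np.array([1.0, 2.0, 3.0])
Y_hat = np.array([1.5, 1.5, 2.0])

diff = Y - Y_hat                 # difference vector
E_sq = np.dot(diff, diff)        # squared L2 norm: sum of squared differences
E    = np.sqrt(E_sq)             # error as the L2 norm of the difference

# np.linalg.norm computes the same L2 norm directly
same = np.isclose(E, np.linalg.norm(diff))
```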
<h3 id="heading-feature-similarity-using-dot-product">Feature similarity using dot product</h3>
<p>Let's take the example below, where each document is a sentence and the words eligible for the feature vector are highlighted in bold. The first element of the feature vector counts occurrences of the word <strong>home</strong>, and the second counts occurrences of <strong>office</strong>.</p>
<div class="hn-table">
<table>
<thead>
<tr>
<td>id</td><td>document</td><td>feature vector</td></tr>
</thead>
<tbody>
<tr>
<td>\(d_0\)</td><td>I can't wait to go <strong>home</strong> after a long vacation.</td><td>\([1,0]\)</td></tr>
<tr>
<td>\(d_1\)</td><td>I have the flexibility to work from my <strong>home</strong> <strong>office</strong> three days a week, but I still prefer going into the main <strong>office</strong> for meetings</td><td>\([1,2]\)</td></tr>
<tr>
<td>\(d_2\)</td><td>In his new remote setup, his <strong>home</strong> had to function simultaneously as both a quiet <strong>home</strong> environment and a fully operational <strong>home</strong> office, blending the comfort of <strong>home</strong> with the structure of the <strong>office</strong> until he couldn't tell where the <strong>home</strong> ended and the <strong>office</strong> began</td><td>\([5,2]\)</td></tr>
<tr>
<td>\(d_3\)</td><td>I need to stop by the main <strong>office</strong> to pick up my new employee badge before the meeting starts.</td><td>\([0,1]\)</td></tr>
</tbody>
</table>
</div><p>We have a collection of documents, each represented by its own feature vector. To evaluate the similarity between any two documents, we need to assess the similarity between their corresponding <strong>feature vectors</strong>. In this section, we will explore how the dot product of a pair of vectors can serve as a measure of their <strong>similarity</strong>.</p>
<p>Feature vectors corresponding to \(d_0\) and \(d_3\) are \(\\\begin{bmatrix} 1 \\ 0\end{bmatrix}\\\) and \(\\\begin{bmatrix} 0\\ 1\end{bmatrix}\\\); their dot product is \(\\\begin{bmatrix} 1 \\ 0\end{bmatrix}\\.\\\begin{bmatrix} 0 \\ 1\end{bmatrix}\\=1  . 0+0 . 1=0\). This low score aligns with our intuition that there is no common word of interest between the documents, indicating they are very dissimilar.</p>
<p>Feature vectors corresponding to \(d_1\) and \(d_2\) are \(\\\begin{bmatrix} 1 \\ 2\end{bmatrix}\\\) and \(\\\begin{bmatrix} 5 \\ 2\end{bmatrix}\\\) their dot product will be</p>
<p>\(\\\begin{bmatrix} 1 \\ 2\end{bmatrix}\\.\\\begin{bmatrix} 5 \\ 2\end{bmatrix}\\=1  . 5+2 . 2=9\) .</p>
<p>This high score aligns with our intuition that the documents share many common words of interest and exhibit similarities. Therefore, we can conclude that <strong>similar vectors</strong> produce <strong>larger dot products</strong>, while <strong>dissimilar vectors</strong> yield dot products that are close to zero.</p>
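<p>Using the feature vectors from the table above, the similarity scores can be reproduced in a few lines of NumPy:</p>

```python
import numpy as np

# feature vectors [count of "home", count of "office"] from the table
d0 = np.array([1, 0])
d1 = np.array([1, 2])
d2 = np.array([5, 2])
d3 = np.array([0, 1])

s_dissimilar = np.dot(d0, d3)  # 0: no shared words of interest
s_similar    = np.dot(d1, d2)  # 9: many shared words of interest
```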
]]></content:encoded></item><item><title><![CDATA[Algorithms in Amazon SageMaker AI]]></title><description><![CDATA[Amazon SageMaker AI is a fully managed machine learning service provided by AWS that enables developers and data scientists to build, train, and deploy machine learning models at scale.
Amazon SageMaker AI is a cloud-based platform that simplifies th...]]></description><link>https://path2ml.com/algorithms-in-amazon-sagemaker-ai</link><guid isPermaLink="true">https://path2ml.com/algorithms-in-amazon-sagemaker-ai</guid><category><![CDATA[sagemaker ai]]></category><category><![CDATA[AWS]]></category><category><![CDATA[Machine Learning]]></category><category><![CDATA[Deep Learning]]></category><dc:creator><![CDATA[Nitin Sharma]]></dc:creator><pubDate>Fri, 10 Oct 2025 22:48:26 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1760135635756/5dfbebcc-79f4-4b7d-ac7e-65d3fcd4630c.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><strong>Amazon SageMaker AI</strong> is a fully managed machine learning service provided by AWS that enables developers and data scientists to build, train, and deploy machine learning models at scale.</p>
<p>Amazon SageMaker AI is a cloud-based platform that simplifies the machine learning workflow by providing:</p>
<ul>
<li><p><strong>Pre-built algorithms</strong> for various ML tasks.</p>
</li>
<li><p><strong>Managed infrastructure</strong> for training and deployment.</p>
</li>
<li><p><strong>Integrated tools</strong> for data preprocessing, model tuning, and monitoring.</p>
</li>
</ul>
<p>It supports a wide range of <strong>machine learning algorithms</strong> across different categories:</p>
<h3 id="heading-types-of-algorithm-supported-in-sagemaker-ai-in-different-categories-are">Types of algorithms supported in SageMaker AI, by category</h3>
<h2 id="heading-time-series"><strong><em>Time-Series</em></strong></h2>
<p>SageMaker AI provides algorithms that are tailored to the analysis of time-series data for forecasting product demand, server loads, webpage requests, and more.</p>
<ol>
<li><h3 id="heading-deepar">DeepAR</h3>
<p> The Amazon SageMaker AI DeepAR forecasting algorithm is a supervised learning algorithm for forecasting scalar (one-dimensional) time series using recurrent neural networks (RNN). Classical forecasting methods, such as autoregressive integrated moving average (ARIMA) or exponential smoothing (ETS), fit a single model to each individual time series.</p>
<ul>
<li><p><strong>Type</strong>: Supervised</p>
</li>
<li><p><strong>Purpose</strong>: Forecast scalar (1D) time-series data using RNNs.</p>
</li>
<li><p><strong>Use Cases</strong>: Demand forecasting, server load prediction, web traffic estimation.</p>
</li>
<li><p><strong>Key Features</strong>:</p>
<ul>
<li><p>Learns across multiple related time series.</p>
</li>
<li><p>Outperforms classical methods like ARIMA and ETS.</p>
</li>
</ul>
</li>
</ul>
</li>
</ol>
<h2 id="heading-text"><strong><em>Text</em></strong></h2>
<p>SageMaker AI provides algorithms that are tailored to the analysis of textual documents used in natural language processing, document classification or summarization, topic modeling or classification, and language transcription or translation.</p>
<ol>
<li><h3 id="heading-blazingtext">BlazingText</h3>
<p> BlazingText algorithm provides highly optimized implementations of the Word2vec and text classification algorithms. The Word2vec algorithm is useful for many downstream natural language processing (NLP) tasks, such as sentiment analysis, named entity recognition, machine translation, etc. Text classification is an important task for applications that perform web searches, information retrieval, ranking, and document classification.</p>
<ul>
<li><p><strong>Type</strong>: Supervised</p>
</li>
<li><p><strong>Purpose</strong>: Word embeddings (Word2Vec) and text classification.</p>
</li>
<li><p><strong>Use Cases</strong>: Sentiment analysis, document classification, search ranking.</p>
</li>
<li><p><strong>Key Features</strong>:</p>
<ul>
<li><p>Highly optimized for speed and scalability.</p>
</li>
<li><p>Supports multi-threading and GPU acceleration</p>
</li>
</ul>
</li>
</ul>
</li>
</ol>
<ol start="2">
<li><h3 id="heading-latent-dirichlet-allocation-lda">Latent Dirichlet Allocation (LDA)</h3>
<p> Latent Dirichlet Allocation (LDA) algorithm is an unsupervised learning algorithm that attempts to describe a set of observations as a mixture of distinct categories. LDA is most commonly used to discover a user-specified number of topics shared by documents within a text corpus. Here each observation is a document, the features are the presence (or occurrence count) of each word, and the categories are the topics. Since the method is unsupervised, the topics are not specified up front, and are not guaranteed to align with how a human may naturally categorize documents. The topics are learned as a probability distribution over the words that occur in each document. Each document, in turn, is described as a mixture of topics.</p>
<ul>
<li><p><strong>Type</strong>: Unsupervised</p>
</li>
<li><p><strong>Purpose</strong>: Topic modeling.</p>
</li>
<li><p><strong>Use Cases</strong>: Discovering themes in document corpora.</p>
</li>
<li><p><strong>Key Features</strong>:</p>
<ul>
<li><p>Learns topics as distributions over words.</p>
</li>
<li><p>CPU-only, single-instance training.</p>
</li>
</ul>
</li>
</ul>
</li>
</ol>
<ol start="3">
<li><h3 id="heading-ntm">NTM</h3>
<p> NTM is an unsupervised learning algorithm that is used to organize a corpus of documents into <em>topics</em> that contain word groupings based on their statistical distribution. Documents that contain frequent occurrences of words such as "bike", "car", "train", "mileage", and "speed" are likely to share a topic on "transportation", for example. Topic modeling provides a way to visualize the contents of a large document corpus in terms of the learned topics.</p>
<p> Although you can use both the Amazon SageMaker AI <strong>NTM</strong> and <strong>LDA</strong> algorithms for topic modeling, they are distinct algorithms and can be expected to produce different results on the same input data. From a hardware and compute standpoint, <strong>SageMaker NTM is more flexible than LDA and can scale better</strong>: NTM can run on CPU and GPU and can be parallelized across multiple GPU instances, whereas LDA only supports single-instance CPU training.</p>
<ul>
<li><p><strong>Type</strong>: Unsupervised</p>
</li>
<li><p><strong>Purpose</strong>: Topic modeling using neural networks.</p>
</li>
<li><p><strong>Use Cases</strong>: Visualizing document clusters by topic.</p>
</li>
<li><p><strong>Key Features</strong>:</p>
<ul>
<li><p>Scales better than LDA.</p>
</li>
<li><p>Supports GPU and multi-instance training.</p>
</li>
</ul>
</li>
</ul>
</li>
</ol>
<ol start="4">
<li><h3 id="heading-object2vec">Object2Vec</h3>
<p> Object2Vec algorithm is a general-purpose neural embedding algorithm that is highly customizable. It can learn low-dimensional dense embeddings of high-dimensional objects. The embeddings are learned in a way that preserves the semantics of the relationship between pairs of objects in the original space in the embedding space. You can use the learned embeddings to efficiently compute nearest neighbors of objects and to visualize natural clusters of related objects in low-dimensional space, for example. You can also use the embeddings as features of the corresponding objects in downstream supervised tasks, such as classification or regression. Object2Vec generalizes the well-known Word2Vec embedding technique for words that is optimized in the SageMaker AI <a target="_blank" href="https://docs.aws.amazon.com/sagemaker/latest/dg/blazingtext.html">BlazingText algorithm</a>. For a blog post that discusses how to apply Object2Vec to some practical use cases, see <a target="_blank" href="https://aws.amazon.com/blogs/machine-learning/introduction-to-amazon-sagemaker-object2vec/">Introduction to Amazon SageMaker AI Object2Vec</a>.</p>
<ul>
<li><p><strong>Type</strong>: Supervised</p>
</li>
<li><p><strong>Purpose</strong>: Learn embeddings for high-dimensional objects.</p>
</li>
<li><p><strong>Use Cases</strong>: Similarity search, clustering, feature engineering.</p>
</li>
<li><p><strong>Key Features</strong>:</p>
<ul>
<li><p>Generalizes Word2Vec for arbitrary objects.</p>
</li>
<li><p>Useful for downstream classification/regression.</p>
</li>
</ul>
</li>
</ul>
</li>
</ol>
<ol start="5">
<li><h3 id="heading-sequence-to-sequence">Sequence to Sequence</h3>
<p> Sequence to Sequence is a supervised learning algorithm where the input is a sequence of tokens (for example, text, audio) and the output generated is another sequence of tokens. Example applications include: machine translation (input a sentence from one language and predict what that sentence would be in another language), text summarization (input a longer string of words and predict a shorter string of words that is a summary), speech-to-text (audio clips converted into output sentences in tokens). Recently, problems in this domain have been successfully modeled with deep neural networks that show a significant performance boost over previous methodologies. Amazon SageMaker AI seq2seq uses Recurrent Neural Networks (RNNs) and Convolutional Neural Network (CNN) models with attention as encoder-decoder architectures.</p>
<ul>
<li><p><strong>Type</strong>: Supervised</p>
</li>
<li><p><strong>Purpose</strong>: Map input sequences to output sequences.</p>
</li>
<li><p><strong>Use Cases</strong>: Machine translation, summarization, speech-to-text.</p>
</li>
<li><p><strong>Key Features</strong>:</p>
<ul>
<li><p>Uses RNNs and CNNs with attention mechanisms.</p>
</li>
<li><p>Encoder-decoder architecture.</p>
</li>
</ul>
</li>
</ul>
</li>
</ol>
<ol start="6">
<li><h3 id="heading-text-classification-tensorflow">Text Classification TensorFlow</h3>
<p> Text Classification - TensorFlow algorithm is a supervised learning algorithm that supports transfer learning with many pretrained models from the <a target="_blank" href="https://tfhub.dev/">TensorFlow Hub</a>. Use transfer learning to fine-tune one of the available pretrained models on your own dataset, even if a large amount of text data is not available. The text classification algorithm takes a text string as input and outputs a probability for each of the class labels. Training datasets must be in CSV format.</p>
<ul>
<li><p><strong>Type</strong>: Supervised</p>
</li>
<li><p><strong>Purpose</strong>: Classify text using pretrained models.</p>
</li>
<li><p><strong>Use Cases</strong>: Spam detection, sentiment analysis.</p>
</li>
<li><p><strong>Key Features</strong>:</p>
<ul>
<li><p>Transfer learning via TensorFlow Hub.</p>
</li>
<li><p>Requires CSV input format.</p>
</li>
</ul>
</li>
</ul>
</li>
</ol>
<h2 id="heading-tabular"><em>Tabular</em></h2>
<ol>
<li><h3 id="heading-autogluon-tabular">AutoGluon-Tabular</h3>
<p> <a target="_blank" href="https://auto.gluon.ai/stable/index.html">AutoGluon-Tabular</a> is a popular open-source AutoML framework that trains highly accurate machine learning models on an unprocessed tabular dataset. Unlike existing AutoML frameworks that primarily focus on model and hyperparameter selection, AutoGluon-Tabular succeeds by ensembling multiple models and stacking them in multiple layers.</p>
<ul>
<li><p><strong>Type</strong>: AutoML (Supervised)</p>
</li>
<li><p><strong>Purpose</strong>: Automatically train and ensemble models.</p>
</li>
<li><p><strong>Use Cases</strong>: Predictive modeling on structured data.</p>
</li>
<li><p><strong>Key Features</strong>:</p>
<ul>
<li><p>Stacks multiple models.</p>
</li>
<li><p>Minimal tuning required</p>
</li>
</ul>
</li>
</ul>
</li>
</ol>
<ol start="2">
<li><h3 id="heading-catboost">CatBoost</h3>
<p> <a target="_blank" href="https://catboost.ai/">CatBoost</a> is a popular and high-performance open-source implementation of the Gradient Boosting Decision Tree (GBDT) algorithm. GBDT is a supervised learning algorithm that attempts to accurately predict a target variable by combining an ensemble of estimates from a set of simpler and weaker models.</p>
<p> CatBoost introduces two critical algorithmic advances to GBDT:</p>
<ol>
<li><p>The implementation of ordered boosting, a permutation-driven alternative to the classic algorithm</p>
</li>
<li><p>An innovative algorithm for processing categorical features</p>
</li>
</ol>
</li>
</ol>
<p>    SageMaker AI CatBoost currently only trains using CPUs. CatBoost is a memory-bound (as opposed to compute-bound) algorithm.</p>
<ul>
<li><p><strong>Type</strong>: Supervised (GBDT)</p>
</li>
<li><p><strong>Purpose</strong>: Classification and regression.</p>
</li>
<li><p><strong>Use Cases</strong>: Credit scoring, churn prediction.</p>
</li>
<li><p><strong>Key Features</strong>:</p>
<ul>
<li><p>Handles categorical features natively.</p>
</li>
<li><p>CPU-only, memory-bound.</p>
</li>
</ul>
</li>
</ul>
<ol start="3">
<li><h3 id="heading-factorization-machines">Factorization Machines</h3>
<p> The Factorization Machines algorithm is a general-purpose supervised learning algorithm that you can use for both classification and regression tasks. It is an extension of a linear model that is designed to capture interactions between features within high dimensional sparse datasets economically. For example, in a click prediction system, the Factorization Machines model can capture click rate patterns observed when ads from a certain ad-category are placed on pages from a certain page-category. Factorization machines are a good choice for tasks dealing with high dimensional sparse datasets, such as click prediction and item recommendation.</p>
<ul>
<li><p><strong>Type</strong>: Supervised</p>
</li>
<li><p><strong>Purpose</strong>: Capture feature interactions in sparse data.</p>
</li>
<li><p><strong>Use Cases</strong>: Click prediction, recommendation systems.</p>
</li>
<li><p><strong>Key Features</strong>:</p>
<ul>
<li><p>Efficient for high-dimensional sparse datasets.</p>
</li>
</ul>
</li>
</ul>
</li>
</ol>
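<p>To make the "economical interactions" point concrete, the standard second-order factorization machine model (the general FM formulation, not necessarily SageMaker's exact implementation) can be written as:</p>
<p>\(\hat{y}(x)=w_0+\sum_{i=1}^{n} w_i x_i+\sum_{i=1}^{n}\sum_{j=i+1}^{n}\langle v_i,v_j\rangle x_i x_j\)</p>
<p>Each feature \(i\) is assigned a low-dimensional latent vector \(v_i\), so the interaction weight for a pair of features is the dot product \(\langle v_i,v_j\rangle\) rather than an independent parameter. Factorizing the pairwise weights this way is what keeps the model tractable on high-dimensional sparse data.</p>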
<ol start="4">
<li><h3 id="heading-k-nearest-neighbors-k-nn">k-nearest neighbors (k-NN)</h3>
<p> k-nearest neighbors (k-NN) algorithm is an index-based algorithm. It uses a non-parametric method for classification or regression. For classification problems, the algorithm queries the <em>k</em> points that are closest to the sample point and returns the most frequently used label of their class as the predicted label. For regression problems, the algorithm queries the <em>k</em> closest points to the sample point and returns the average of their feature values as the predicted value.</p>
<p> Training with the k-NN algorithm has three steps: sampling, dimension reduction, and index building. Sampling reduces the size of the initial dataset so that it fits into memory. For dimension reduction, the algorithm decreases the feature dimension of the data to reduce the footprint of the k-NN model in memory and inference latency. Two dimension reduction methods are provided: random projection and the fast Johnson-Lindenstrauss transform. Typically, you use dimension reduction for high-dimensional (d &gt; 1000) datasets to avoid the “curse of dimensionality” that troubles the statistical analysis of data that becomes sparse as dimensionality increases. The main objective of k-NN's training is to construct the index. The index enables efficient lookups of distances between points whose values or class labels have not yet been determined and the k nearest points to use for inference.</p>
<ul>
<li><p><strong>Type</strong>: Supervised</p>
</li>
<li><p><strong>Purpose</strong>: Classification and regression via similarity.</p>
</li>
<li><p><strong>Use Cases</strong>: Recommendation systems, anomaly detection.</p>
</li>
<li><p><strong>Key Features</strong>:</p>
<ul>
<li><p>Index-based lookup.</p>
</li>
<li><p>Includes sampling and dimensionality reduction.</p>
</li>
</ul>
</li>
</ul>
</li>
</ol>
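<p>The prediction rule itself is simple to sketch. The following toy NumPy example (brute-force distance search on made-up data, without the sampling, dimension reduction, or index building SageMaker adds) shows the classification case:</p>

```python
import numpy as np

def knn_predict(X_train, y_train, x, k=3):
    """Classify x by majority vote among its k nearest training points."""
    dists = np.linalg.norm(X_train - x, axis=1)  # Euclidean distance to each point
    nearest = np.argsort(dists)[:k]              # indices of the k closest points
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]             # most frequent label wins

# made-up 2D training data: two well-separated classes
X_train = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
y_train = np.array([0, 0, 1, 1])
pred = knn_predict(X_train, y_train, np.array([0.95, 1.0]))  # predicts class 1
```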
<ol start="5">
<li><h3 id="heading-lightgbm">LightGBM</h3>
<p> <a target="_blank" href="https://lightgbm.readthedocs.io/en/latest/">LightGBM</a> is a popular and efficient open-source implementation of the Gradient Boosting Decision Tree (GBDT) algorithm. GBDT is a supervised learning algorithm that attempts to accurately predict a target variable by combining an ensemble of estimates from a set of simpler and weaker models. LightGBM uses additional techniques to significantly improve the efficiency and scalability of conventional GBDT.</p>
<ul>
<li><p><strong>Type</strong>: Supervised (GBDT)</p>
</li>
<li><p><strong>Purpose</strong>: Classification and regression.</p>
</li>
<li><p><strong>Use Cases</strong>: Tabular modeling, ranking.</p>
</li>
<li><p><strong>Key Features</strong>:</p>
<ul>
<li><p>Efficient and scalable.</p>
</li>
<li><p>Supports large datasets.</p>
</li>
</ul>
</li>
</ul>
</li>
</ol>
<ol start="6">
<li><h3 id="heading-linear-learner-algorithm">Linear learner algorithm</h3>
<p> The Amazon SageMaker AI linear learner algorithm provides a solution for both classification and regression problems. The linear learner algorithm supports both <code>recordIO-wrapped protobuf</code> and <code>CSV</code> formats.</p>
<ul>
<li><p><strong>Type</strong>: Supervised</p>
</li>
<li><p><strong>Purpose</strong>: Linear models for classification/regression.</p>
</li>
<li><p><strong>Use Cases</strong>: Binary classification, regression tasks.</p>
</li>
<li><p><strong>Key Features</strong>:</p>
<ul>
<li><p>Fast training.</p>
</li>
<li><p>Supports CSV and RecordIO formats.</p>
</li>
</ul>
</li>
</ul>
</li>
</ol>
<ol start="7">
<li><h3 id="heading-tabtransformer">TabTransformer</h3>
<p> <a target="_blank" href="https://arxiv.org/abs/2012.06678">TabTransformer</a> is a novel deep tabular data modeling architecture for supervised learning. The TabTransformer architecture is built on self-attention-based Transformers. The Transformer layers transform the embeddings of categorical features into robust contextual embeddings to achieve higher prediction accuracy. Furthermore, the contextual embeddings learned from TabTransformer are highly robust against both missing and noisy data features, and provide better interpretability.</p>
<ul>
<li><p><strong>Type</strong>: Supervised</p>
</li>
<li><p><strong>Purpose</strong>: Deep learning for tabular data.</p>
</li>
<li><p><strong>Use Cases</strong>: Predictive modeling with categorical features.</p>
</li>
<li><p><strong>Key Features</strong>:</p>
<ul>
<li><p>Uses Transformer architecture.</p>
</li>
<li><p>Robust to missing/noisy data.</p>
</li>
</ul>
</li>
</ul>
</li>
</ol>
<ol start="8">
<li><h3 id="heading-xgboost">XGBoost</h3>
<p> The <a target="_blank" href="https://github.com/dmlc/xgboost">XGBoost</a> (eXtreme Gradient Boosting) is a popular and efficient open-source implementation of the gradient boosted trees algorithm. Gradient boosting is a supervised learning algorithm that tries to accurately predict a target variable by combining multiple estimates from a set of simpler models. The XGBoost algorithm performs well in machine learning competitions for the following reasons:</p>
<ul>
<li><p>Its robust handling of a variety of data types, relationships, and distributions.</p>
</li>
<li><p>The variety of hyperparameters that you can fine-tune.</p>
</li>
</ul>
</li>
</ol>
<p>    You can use XGBoost for regression, classification (binary and multiclass), and ranking problems.</p>
<ul>
<li><p><strong>Type</strong>: Supervised (GBDT)</p>
</li>
<li><p><strong>Purpose</strong>: Classification, regression, ranking.</p>
</li>
<li><p><strong>Use Cases</strong>: ML competitions, structured data modeling.</p>
</li>
<li><p><strong>Key Features</strong>:</p>
<ul>
<li><p>Highly tunable.</p>
</li>
<li><p>Handles various data types and distributions.</p>
</li>
</ul>
</li>
</ul>
<h2 id="heading-unsupervised"><em>Unsupervised</em></h2>
<ol>
<li><h3 id="heading-ip-insights">IP insights</h3>
<p> Amazon SageMaker AI IP Insights is an unsupervised learning algorithm that learns the usage patterns for IPv4 addresses. It is designed to capture associations between IPv4 addresses and various entities, such as user IDs or account numbers. You can use it to identify a user attempting to log into a web service from an anomalous IP address, for example. Or you can use it to identify an account that is attempting to create computing resources from an unusual IP address. Trained IP Insight models can be hosted at an endpoint for making real-time predictions or used for processing batch transforms.</p>
<p> SageMaker AI IP Insights ingests historical data as (entity, IPv4 Address) pairs and learns the IP usage patterns of each entity. When queried with an (entity, IPv4 Address) event, a SageMaker AI IP Insights model returns a score that infers how anomalous the pattern of the event is. For example, when a user attempts to log in from an IP address, if the IP Insights score is high enough, a web login server might decide to trigger a multi-factor authentication system. In more advanced solutions, you can feed the IP Insights score into another machine learning model. For example, you can combine the IP Insights score with other features to rank the findings of another security system, such as those from <a target="_blank" href="https://docs.aws.amazon.com/guardduty/latest/ug/what-is-guardduty.html">Amazon GuardDuty</a>.</p>
<p> The SageMaker AI IP Insights algorithm can also learn vector representations of IP addresses, known as <em>embeddings</em>. You can use vector-encoded embeddings as features in downstream machine learning tasks that use the information observed in the IP addresses. For example, you can use them in tasks such as measuring similarities between IP addresses in clustering and visualization tasks.</p>
<ul>
<li><p><strong>Type</strong>: Unsupervised</p>
</li>
<li><p><strong>Purpose</strong>: Detect anomalous IP usage patterns.</p>
</li>
<li><p><strong>Use Cases</strong>: Fraud detection, security monitoring.</p>
</li>
<li><p><strong>Key Features</strong>:</p>
<ul>
<li><p>Learns entity-IP associations.</p>
</li>
<li><p>Outputs anomaly scores.</p>
</li>
</ul>
</li>
</ul>
</li>
</ol>
<ol start="2">
<li><h3 id="heading-k-means">K-means</h3>
<p> K-means is an unsupervised learning algorithm. It attempts to find discrete groupings within data, where members of a group are as similar as possible to one another and as different as possible from members of other groups. You define the attributes that you want the algorithm to use to determine similarity. The k-means algorithm expects tabular data, where rows represent the observations that you want to cluster, and the columns represent attributes of the observations. The <em>n</em> attributes in each row represent a point in <em>n</em>-dimensional space. The Euclidean distance between these points represents the similarity of the corresponding observations.</p>
<ul>
<li><p><strong>Type</strong>: Unsupervised</p>
</li>
<li><p><strong>Purpose</strong>: Clustering.</p>
</li>
<li><p><strong>Use Cases</strong>: Customer segmentation, pattern discovery.</p>
</li>
<li><p><strong>Key Features</strong>:</p>
<ul>
<li><p>Uses Euclidean distance.</p>
</li>
<li><p>Requires tabular data.</p>
</li>
</ul>
</li>
</ul>
</li>
</ol>
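<p>The assign-then-update loop at the heart of k-means can be sketched in a few lines of NumPy (Lloyd's algorithm on made-up data; SageMaker's implementation is a scalable streaming variant, not this simple loop):</p>

```python
import numpy as np

def kmeans(X, k=2, iters=10, seed=0):
    """Minimal Lloyd's algorithm: alternate assignment and centroid update."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # assign each observation to its nearest centroid (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = np.argmin(dists, axis=1)
        # move each centroid to the mean of the points assigned to it
        centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
    return labels, centroids

# made-up data with two obvious groups
X = np.array([[0.0, 0.0], [0.2, 0.1], [5.0, 5.0], [5.1, 4.9]])
labels, centers = kmeans(X)
```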
<ol start="3">
<li><h3 id="heading-pca">PCA</h3>
<p> PCA is an unsupervised machine learning algorithm that attempts to reduce the dimensionality (number of features) within a dataset while still retaining as much information as possible. This is done by finding a new set of features called <em>components</em>, which are composites of the original features that are uncorrelated with one another. They are also constrained so that the first component accounts for the largest possible variability in the data, the second component the second most variability, and so on. In Amazon SageMaker AI, PCA operates in two modes, depending on the scenario:</p>
<ul>
<li><p><strong>regular</strong>: For datasets with sparse data and a moderate number of observations and features.</p>
</li>
<li><p><strong>randomized</strong>: For datasets with both a large number of observations and features. This mode uses an approximation algorithm.</p>
</li>
</ul>
</li>
</ol>
<p>    PCA uses tabular data.</p>
<ul>
<li><p><strong>Type</strong>: Unsupervised</p>
</li>
<li><p><strong>Purpose</strong>: Dimensionality reduction.</p>
</li>
<li><p><strong>Use Cases</strong>: Visualization, preprocessing.</p>
</li>
<li><p><strong>Key Features</strong>:</p>
<ul>
<li><p>Regular and randomized modes.</p>
</li>
<li><p>Works on tabular data.</p>
</li>
</ul>
</li>
</ul>
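<p>The core computation behind exact ("regular"-style) PCA can be sketched with NumPy: center the data, compute the covariance matrix, and take its eigendecomposition. The synthetic data below is made up, with one dominant direction of variance:</p>

```python
import numpy as np

# synthetic data: second feature is ~3x the first, so one direction
# carries almost all of the variance
rng = np.random.default_rng(0)
t = rng.normal(size=(200, 1))
X = np.hstack([t, 3.0 * t + 0.1 * rng.normal(size=(200, 1))])

Xc = X - X.mean(axis=0)                 # center the data
cov = np.cov(Xc, rowvar=False)          # 2x2 covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)  # eigh returns ascending eigenvalues
order = np.argsort(eigvals)[::-1]       # sort components by variance
components = eigvecs[:, order]

# project onto the first component: a 1D representation keeping most variance
X_reduced = Xc @ components[:, :1]
explained = eigvals[order][0] / eigvals.sum()
```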
<ol start="4">
<li><h3 id="heading-random-cut-forest-rcf">Random Cut Forest (RCF)</h3>
<p> Amazon SageMaker AI Random Cut Forest (RCF) is an unsupervised algorithm for detecting anomalous data points within a data set. These are observations which diverge from otherwise well-structured or patterned data. Anomalies can manifest as unexpected spikes in time series data, breaks in periodicity, or unclassifiable data points. With each data point, RCF associates an anomaly score. Low score values indicate that the data point is considered "normal." High values indicate the presence of an anomaly in the data. The definitions of "low" and "high" depend on the application, but common practice suggests that scores beyond three standard deviations from the mean score are considered anomalous.</p>
<ul>
<li><p><strong>Type</strong>: Unsupervised</p>
</li>
<li><p><strong>Purpose</strong>: Anomaly detection.</p>
</li>
<li><p><strong>Use Cases</strong>: Detecting outliers in time-series or structured data.</p>
</li>
<li><p><strong>Key Features</strong>:</p>
<ul>
<li><p>Assigns anomaly scores.</p>
</li>
<li><p>Suitable for streaming data.</p>
</li>
</ul>
</li>
</ul>
</li>
</ol>
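<p>RCF itself is a managed algorithm, but the "three standard deviations from the mean score" convention mentioned above is simple to illustrate. The NumPy sketch below thresholds a vector of scores (the scores here are synthetic stand-ins, not real RCF output):</p>

```python
import numpy as np

def flag_anomalies(scores, k=3.0):
    """Flag scores more than k standard deviations above the mean score."""
    scores = np.asarray(scores, dtype=float)
    threshold = scores.mean() + k * scores.std()
    return scores > threshold

# Twenty well-behaved points followed by one obvious spike.
normal = [1.0, 1.1, 0.9, 1.05, 0.95] * 4
scores = normal + [9.0]
flags = flag_anomalies(scores)
print(flags.sum())    # 1: only the spike is flagged
print(flags[-1])      # True
```

<p>In a streaming setting, the mean and standard deviation would be maintained over a sliding window rather than computed over the whole history.</p>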
<h2 id="heading-vision"><em>Vision</em></h2>
<ol>
<li><h3 id="heading-image-classification-mxnet">Image classification-MXNet</h3>
<p> The Amazon SageMaker image classification algorithm is a supervised learning algorithm that supports multi-label classification. It takes an image as input and outputs one or more labels assigned to that image. It uses a convolutional neural network that can be trained from scratch or trained using transfer learning when a large number of training images are not available. Image classification in Amazon SageMaker AI can be run in two modes: full training and transfer learning. In full training mode, the network is initialized with random weights and trained on user data from scratch. In transfer learning mode, the network is initialized with pre-trained weights and just the top fully connected layer is initialized with random weights. Then, the whole network is fine-tuned with new data. In this mode, training can succeed even with a smaller dataset, because the network has already learned general features from its original training data.</p>
<ul>
<li><p><strong>Type</strong>: Supervised</p>
</li>
<li><p><strong>Purpose</strong>: Multi-label image classification.</p>
</li>
<li><p><strong>Use Cases</strong>: Object recognition, medical imaging.</p>
</li>
<li><p><strong>Key Features</strong>:</p>
<ul>
<li><p>Supports full training and transfer learning.</p>
</li>
<li><p>Uses CNNs.</p>
</li>
</ul>
</li>
</ul>
</li>
</ol>
<ol start="2">
<li><h3 id="heading-image-classification-tensorflow">Image Classification - TensorFlow</h3>
<p> The Amazon SageMaker Image Classification - TensorFlow algorithm is a supervised learning algorithm that supports transfer learning with many pretrained models from the <a target="_blank" href="https://tfhub.dev/s?fine-tunable=yes&amp;module-type=image-classification&amp;subtype=module,placeholder&amp;tf-version=tf2">TensorFlow Hub</a>. Use transfer learning to fine-tune one of the available pretrained models on your own dataset, even if a large amount of image data is not available. The image classification algorithm takes an image as input and outputs a probability for each provided class label.</p>
<ul>
<li><p><strong>Type</strong>: Supervised</p>
</li>
<li><p><strong>Purpose</strong>: Image classification using pretrained models.</p>
</li>
<li><p><strong>Use Cases</strong>: Visual recognition with limited data.</p>
</li>
<li><p><strong>Key Features</strong>:</p>
<ul>
<li><p>Transfer learning via TensorFlow Hub.</p>
</li>
<li><p>Outputs class probabilities.</p>
</li>
</ul>
</li>
</ul>
</li>
</ol>
<ol start="3">
<li><h3 id="heading-object-detection-mxnet">Object Detection - MXNet</h3>
<p> The Amazon SageMaker AI Object Detection - MXNet algorithm detects and classifies objects in images using a single deep neural network. It is a supervised learning algorithm that takes images as input and identifies all instances of objects within the image scene. Each object is categorized into one of the classes in a specified collection with a confidence score that it belongs to the class. Its location and scale in the image are indicated by a rectangular bounding box. The algorithm uses the <a target="_blank" href="https://arxiv.org/pdf/1512.02325.pdf">Single Shot multibox Detector (SSD)</a> framework and supports two base networks: <a target="_blank" href="https://arxiv.org/pdf/1409.1556.pdf">VGG</a> and <a target="_blank" href="https://arxiv.org/pdf/1603.05027.pdf">ResNet</a>. The network can be trained from scratch, or trained with models that have been pre-trained on the <a target="_blank" href="http://www.image-net.org/">ImageNet</a> dataset.</p>
<ul>
<li><p><strong>Type</strong>: Supervised</p>
</li>
<li><p><strong>Purpose</strong>: Detect and classify objects in images.</p>
</li>
<li><p><strong>Use Cases</strong>: Surveillance, autonomous vehicles.</p>
</li>
<li><p><strong>Key Features</strong>:</p>
<ul>
<li><p>SSD framework with VGG/ResNet.</p>
</li>
<li><p>Outputs bounding boxes and class scores.</p>
</li>
</ul>
</li>
</ul>
</li>
</ol>
<ol start="4">
<li><h3 id="heading-object-detection-tensorflow">Object Detection - TensorFlow</h3>
<p> The Amazon SageMaker AI Object Detection - TensorFlow algorithm is a supervised learning algorithm that supports transfer learning with many pretrained models from the <a target="_blank" href="https://github.com/tensorflow/models">TensorFlow Model Garden</a>. Use transfer learning to fine-tune one of the available pretrained models on your own dataset, even if a large amount of image data is not available. The object detection algorithm takes an image as input and outputs a list of bounding boxes.</p>
<ul>
<li><p><strong>Type</strong>: Supervised</p>
</li>
<li><p><strong>Purpose</strong>: Object detection using pretrained models.</p>
</li>
<li><p><strong>Use Cases</strong>: Retail analytics, robotics.</p>
</li>
<li><p><strong>Key Features</strong>:</p>
<ul>
<li><p>Transfer learning via TensorFlow Model Garden.</p>
</li>
<li><p>Outputs bounding boxes.</p>
</li>
</ul>
</li>
</ul>
</li>
</ol>
<ol start="5">
<li><h3 id="heading-semantic-segmentation">Semantic segmentation</h3>
<p> The SageMaker AI semantic segmentation algorithm provides a fine-grained, pixel-level approach to developing computer vision applications. It tags every pixel in an image with a class label from a predefined set of classes. Tagging is fundamental for understanding scenes, which is critical to an increasing number of computer vision applications, such as self-driving vehicles, medical imaging diagnostics, and robot sensing.</p>
<ul>
<li><p><strong>Type</strong>: Supervised</p>
</li>
<li><p><strong>Purpose</strong>: Pixel-level image classification.</p>
</li>
<li><p><strong>Use Cases</strong>: Medical diagnostics, autonomous driving.</p>
</li>
<li><p><strong>Key Features</strong>:</p>
<ul>
<li><p>Tags each pixel with a class label.</p>
</li>
<li><p>Enables fine-grained scene understanding.</p>
</li>
</ul>
</li>
</ul>
</li>
</ol>
<h2 id="heading-scenario-to-algorithm-mapping-table"><strong>Scenario-to-Algorithm Mapping Table</strong></h2>
<div class="hn-table">
<table>
<thead>
<tr>
<td><strong>Scenario</strong></td><td><strong>Algorithm</strong></td><td><strong>Type</strong></td><td><strong>Example Use Case</strong></td></tr>
</thead>
<tbody>
<tr>
<td>Forecasting product demand</td><td>DeepAR</td><td>Supervised (Time-Series)</td><td>Predicting weekly sales</td></tr>
<tr>
<td>Sentiment analysis</td><td>BlazingText</td><td>Supervised (NLP)</td><td>Classifying tweets as positive/negative</td></tr>
<tr>
<td>Topic modeling in documents</td><td>LDA / NTM</td><td>Unsupervised (NLP)</td><td>Discovering themes in news articles</td></tr>
<tr>
<td>Object similarity search</td><td>Object2Vec</td><td>Supervised (Embedding)</td><td>Recommending similar products</td></tr>
<tr>
<td>Machine translation</td><td>Sequence-to-Sequence</td><td>Supervised (NLP)</td><td>Translating English to French</td></tr>
<tr>
<td>Text classification with limited data</td><td>Text Classification (TensorFlow)</td><td>Supervised (Transfer Learning)</td><td>Spam detection in emails</td></tr>
<tr>
<td>Predicting customer churn</td><td>AutoGluon-Tabular</td><td>AutoML (Tabular)</td><td>Churn prediction from customer data</td></tr>
<tr>
<td>Click prediction in sparse data</td><td>Factorization Machines</td><td>Supervised (Tabular)</td><td>Ad click-through rate prediction</td></tr>
<tr>
<td>Fraud detection via IP patterns</td><td>IP Insights</td><td>Unsupervised</td><td>Detecting login anomalies</td></tr>
<tr>
<td>Customer segmentation</td><td>K-Means</td><td>Unsupervised</td><td>Grouping users by behavior</td></tr>
<tr>
<td>Dimensionality reduction</td><td>PCA</td><td>Unsupervised</td><td>Visualizing high-dimensional data</td></tr>
<tr>
<td>Anomaly detection in logs</td><td>Random Cut Forest</td><td>Unsupervised</td><td>Detecting unusual spikes in server logs</td></tr>
<tr>
<td>Image classification</td><td>Image Classification (MXNet / TensorFlow)</td><td>Supervised (Vision)</td><td>Identifying dog breeds</td></tr>
<tr>
<td>Object detection in images</td><td>Object Detection (MXNet / TensorFlow)</td><td>Supervised (Vision)</td><td>Detecting cars in traffic footage</td></tr>
<tr>
<td>Scene understanding</td><td>Semantic Segmentation</td><td>Supervised (Vision)</td><td>Medical image diagnostics</td></tr>
<tr>
<td>Tabular classification with categorical features</td><td>TabTransformer</td><td>Supervised (Tabular)</td><td>Predicting loan defaults</td></tr>
<tr>
<td>Binary classification</td><td>Linear Learner</td><td>Supervised</td><td>Predicting if a transaction is fraudulent</td></tr>
<tr>
<td>High-performance tabular modeling</td><td>CatBoost / LightGBM / XGBoost</td><td>Supervised</td><td>Credit scoring, sales prediction</td></tr>
<tr>
<td>Nearest neighbor search</td><td>k-NN</td><td>Supervised</td><td>Recommending similar users</td></tr>
</tbody>
</table>
</div><p>The content above is derived from <a target="_blank" href="https://docs.aws.amazon.com/sagemaker/latest/dg/whatis.html">https://docs.aws.amazon.com/sagemaker/latest/dg/whatis.html</a>.</p>
]]></content:encoded></item><item><title><![CDATA[Seq2Seq Encoder-Decoder Model]]></title><description><![CDATA[The Seq2Seq (Sequence to Sequence) architecture is a highly advanced design in neural networks that underpins numerous complex tasks across various fields, particularly in natural language processing. Its significance is especially evident in applica...]]></description><link>https://path2ml.com/seq2seq-encoder-decoder-model</link><guid isPermaLink="true">https://path2ml.com/seq2seq-encoder-decoder-model</guid><category><![CDATA[Deep Learning]]></category><category><![CDATA[Seq2seq Models]]></category><category><![CDATA[encoder decoder]]></category><category><![CDATA[Machine Learning]]></category><category><![CDATA[Text Translation]]></category><dc:creator><![CDATA[Nitin Sharma]]></dc:creator><pubDate>Tue, 05 Aug 2025 22:29:44 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1754354396072/960b1cee-cb81-438b-9d23-bf38e7c2ae03.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>The <strong>Seq2Seq</strong> (Sequence to Sequence) architecture is a highly advanced design in neural networks that underpins numerous complex tasks across various fields, particularly in natural language processing. Its significance is especially evident in applications such as language translation, text summarization, and conversational AI. The architecture is structured around two fundamental components: the encoder and the decoder. These components are frequently constructed using Long Short-Term Memory (<strong>LSTM</strong>) networks, although alternative structures like Gated Recurrent Units (<strong>GRUs</strong>) may also be employed for specific scenarios.</p>
<p><strong>Encoder:</strong> The encoder acts as the foundational stage of the <strong>Seq2Seq</strong> model, dedicated to the meticulous processing of the input sequence. This input can consist of anything from a simple sentence to expansive blocks of text, and its complexity requires careful handling. The primary role of the encoder is to convert this input into a fixed-size context vector—a compact but rich abstract representation that captures the most imperative information from the entire input sequence. This context vector is meticulously crafted to encapsulate the essential features and semantic nuances of the original input, thus empowering the decoder to interpret the information effectively. Typically, the encoder is organized in layers of <strong>LSTM</strong> cells, which synergistically collaborate to absorb and retain patterns over time. This design adeptly navigates the intricacies of sequential data, enabling it to learn from prior inputs while accounting for their contextual significance.</p>
<p><strong>Decoder:</strong> Following the encoder, the decoder takes on the task of outputting the desired sequence. It begins its operation with the context vector created by the encoder, using it as a springboard to generate outputs one element at a time—this could be a word, token, or any other applicable unit of information, depending on the specific application. At each step of the decoding process, the model integrates not only the context vector but also the elements it has previously produced. This integration is crucial; it enables the decoder to maintain coherence and relevance throughout the generated sequence. Thanks to the sequential nature of this decoding process, the architecture is capable of producing output that feels more natural and contextually appropriate, making it particularly effective for tasks like language translation and chatbot interactions.</p>
<p>The LSTM architecture is particularly well-suited for <strong>Seq2Seq</strong> tasks due to its exceptional ability to learn and remember long-term dependencies in data. Traditional recurrent neural networks (RNNs) often encounter difficulties with the vanishing gradient problem, which can prevent them from effectively capturing information from earlier parts of the sequence. LSTMs circumvent this challenge through the strategic use of gates—specifically, input, output, and forget gates. These gates meticulously regulate the flow of information, allowing <strong>LSTMs</strong> to keep hold of crucial details while discarding less relevant information as the sequence unfolds. This capability is vital for tasks requiring nuanced contextual understanding over extended sequences, such as comprehending lengthy sentences or analyzing multi-sentence paragraphs.</p>
<p><strong>Seq2Seq</strong> architecture, particularly when integrated with <strong>LSTM</strong> networks, presents a robust framework for transforming one sequence into another with remarkable efficiency and accuracy. By encoding input information into a detailed and comprehensive context vector and subsequently decoding it into a desired output format, this architecture facilitates applications that necessitate high levels of precision and contextual sensitivity. As a result, it stands as an indispensable tool in the realm of natural language processing and is invaluable across numerous other domains.</p>
<h2 id="heading-seq2seq-encoder-part">Seq2Seq Encoder part</h2>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1754410213631/16b0201b-f862-4ef3-a158-e851e602b8ac.png" alt class="image--center mx-auto" /></p>
<p>We are developing an <strong>Encoder-Decoder</strong> model designed to translate the <strong>English</strong> phrase <strong>"I am"</strong> into its Spanish equivalent, "<strong>soy</strong>." The process begins with the creation of an embedding layer that captures the relationships between different words in our input vocabulary. This layer outputs a dense representation of the phrase, which will serve as the input for our Long Short-Term Memory (<strong>LSTM</strong>) network.</p>
<p>To effectively manage the flow of information, we initialize both the <strong>long-term</strong> and <strong>short-term memory</strong> states of the <strong>LSTM</strong>. As we proceed, we unroll the <strong>LSTM</strong>, which involves unfolding the network across the sequence of inputs. This unrolling process allows us to maintain consistent weights and biases across each time step.</p>
<p>During the operation of the <strong>LSTM</strong>, we perform a series of calculations to determine the cell state, which maintains information over long sequences, and the hidden state, which carries short-term information. These states allow the network to retain relevant context and understand the relationships between different parts of the input.</p>
<p>Ultimately, we generate a <strong>context vector</strong> that encapsulates both the long-term and short-term memories produced by the encoder. This context vector is crucial as it serves as the foundational input for the decoder, guiding it to accurately produce the corresponding translation in <strong>Spanish</strong>.</p>
<p>The following calculations are performed in the encoder and the decoder:</p>
<ol>
<li><p>Forget Gate</p>
</li>
<li><p>Candidate Value</p>
</li>
<li><p>Update gate</p>
</li>
<li><p>Output gate</p>
</li>
<li><p>Cell state</p>
</li>
<li><p>Hidden state</p>
</li>
</ol>
<h3 id="heading-forget-gate">Forget gate</h3>
<p>\(\mathbf{\Gamma}_f^{\langle t \rangle} = \sigma(\mathbf{W}_f[\mathbf{a}^{\langle t-1 \rangle}, \mathbf{x}^{\langle t \rangle}] + \mathbf{b}_f)\tag{1}\)</p>
<p>The previous time step's hidden state \(a^{\langle t-1 \rangle}\) and the current time step's input \(x^{\langle t \rangle}\) are concatenated together and multiplied by \(\mathbf{W_{f}}\).</p>
<h3 id="heading-candidate-value">Candidate value</h3>
<p>\(\mathbf{\tilde{c}}^{\langle t \rangle} = \tanh\left( \mathbf{W}_{c} [\mathbf{a}^{\langle t - 1 \rangle}, \mathbf{x}^{\langle t \rangle}] + \mathbf{b}_{c} \right) \tag{3}\)</p>
<p>The candidate value is a tensor containing information from the current time step that <strong>may</strong> be stored in the current cell state \(\mathbf{c}^{\langle t \rangle}.\)</p>
<p>The parts of the candidate value that get passed on depend on the update gate.</p>
<h3 id="heading-update-gate">Update gate</h3>
<p>\(\mathbf{\Gamma}_i^{\langle t \rangle} = \sigma(\mathbf{W}_i[a^{\langle t-1 \rangle}, \mathbf{x}^{\langle t \rangle}] + \mathbf{b}_i)\tag{2}\)</p>
<p>The update gate \(\mathbf{\Gamma}_i^{\langle t \rangle}\) decides what parts of the candidate value \(\tilde{\mathbf{c}}^{\langle t \rangle}\) are added to the cell state \(\mathbf{c}^{\langle t \rangle}\).</p>
<h3 id="heading-output-gate">Output gate</h3>
<p>\(\mathbf{\Gamma}_o^{\langle t \rangle}=  \sigma(\mathbf{W}_o[\mathbf{a}^{\langle t-1 \rangle}, \mathbf{x}^{\langle t \rangle}] + \mathbf{b}_{o})\tag{5}\)</p>
<p>The output gate decides what gets sent as the prediction (output) of the time step.</p>
<h3 id="heading-cell-state">Cell state</h3>
<p>\(\mathbf{c}^{\langle t \rangle} = \mathbf{\Gamma}_f^{\langle t \rangle}* \mathbf{c}^{\langle t-1 \rangle} + \mathbf{\Gamma}_{i}^{\langle t \rangle} *\mathbf{\tilde{c}}^{\langle t \rangle} \tag{4}\)</p>
<p>The cell state is the "memory" that gets passed on to future time steps.</p>
<p>The previous cell state \(\mathbf{c}^{\langle t-1 \rangle}\) is weighted by the forget gate \(\mathbf{\Gamma}_{f}^{\langle t \rangle}\), and the candidate value \(\tilde{\mathbf{c}}^{\langle t \rangle}\) is weighted by the update gate \(\mathbf{\Gamma}_{i}^{\langle t \rangle}\).</p>
<h3 id="heading-hidden-state">Hidden state</h3>
<p>\(\mathbf{a}^{\langle t \rangle} = \mathbf{\Gamma}_o^{\langle t \rangle} * \tanh(\mathbf{c}^{\langle t \rangle})\tag{6}\)</p>
<p>The hidden state gets passed to the LSTM cell's next time step.</p>
<p>The hidden state \(\mathbf{a}^{\langle t \rangle}\) is determined by the cell state \(\mathbf{c}^{\langle t \rangle}\) in combination with the output gate \(\mathbf{\Gamma}_{o}^{\langle t \rangle}\).</p>
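<p>Equations (1) through (6) can be collected into a single LSTM cell step. The NumPy sketch below shows the data flow only; the weight matrices and biases are randomly initialized stand-ins, whereas a real model would learn \(\mathbf{W}_f, \mathbf{W}_i, \mathbf{W}_c, \mathbf{W}_o\) and the biases during training:</p>

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, a_prev, c_prev, params):
    """One LSTM time step, following equations (1)-(6) above."""
    concat = np.concatenate([a_prev, x_t])                     # [a^{t-1}, x^{t}]
    gamma_f = sigmoid(params["Wf"] @ concat + params["bf"])    # forget gate   (1)
    gamma_i = sigmoid(params["Wi"] @ concat + params["bi"])    # update gate   (2)
    c_tilde = np.tanh(params["Wc"] @ concat + params["bc"])    # candidate     (3)
    c_t = gamma_f * c_prev + gamma_i * c_tilde                 # cell state    (4)
    gamma_o = sigmoid(params["Wo"] @ concat + params["bo"])    # output gate   (5)
    a_t = gamma_o * np.tanh(c_t)                               # hidden state  (6)
    return a_t, c_t

n_a, n_x = 4, 3                    # hidden size, embedding size (arbitrary)
rng = np.random.default_rng(0)
params = {f"W{g}": rng.normal(size=(n_a, n_a + n_x)) for g in "fico"}
params.update({f"b{g}": np.zeros(n_a) for g in "fico"})

a, c = np.zeros(n_a), np.zeros(n_a)
for x_t in rng.normal(size=(2, n_x)):   # stand-ins for the "I", "am" embeddings
    a, c = lstm_step(x_t, a, c, params)
print(a.shape, c.shape)                 # (4,) (4,)
```

<p>After the last input token, the pair \((a, c)\) is the context vector: the short-term and long-term memories that initialize the decoder.</p>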
<h2 id="heading-seq2seq-decoder-part">Seq2Seq Decoder part</h2>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1754431297937/4031d521-d38f-436c-9544-b5dff3e39037.png" alt class="image--center mx-auto" /></p>
<p>The process begins with the encoder generating a context vector from the initial phrase <strong>"I am."</strong> This context vector serves as a crucial starting point, as it initializes both the long-term and short-term memory components of the Long Short-Term Memory (LSTM) network in the decoder.</p>
<p>As the decoding process commences, the input to the LSTM is set using the embedding value of the end-of-sequence token, denoted as <strong>&lt;EOS&gt;</strong>. This value is sourced from an embedding layer that has been trained to represent the output vocabulary.</p>
<p>Within the LSTM, the short-term memory undergoes processing and is subsequently fed into a fully connected dense layer. This dense layer applies a <strong>softmax</strong> activation function, which plays a critical role in determining the first word that the decoder will output. In this instance, the generated word is "<strong>soy</strong>."</p>
<p>However, the decoding journey is not complete with the generation of "<strong>soy</strong>." The decoder continues to unroll, using the last generated word as the input for the embedding layer for the subsequent <strong>LSTM</strong> cycle. Once again, the same calculations are performed to update both long-term and short-term memory. This information then flows into the dense layer, followed by the application of the softmax function. This iterative process continues until the decoder finally produces the end-of-sequence <strong>&lt;EOS&gt;</strong> token, signaling that the generation is complete.</p>
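<p>The unrolled decoding loop described above amounts to: embed the previous output, run one recurrent step, project through the dense layer, apply softmax, take the most likely word, and repeat until <strong>&lt;EOS&gt;</strong>. The NumPy sketch below shows only that control flow; the vocabulary is a toy one, the weights are random stand-ins, and a simple tanh recurrence is used in place of the full LSTM, so the "translation" it emits is meaningless:</p>

```python
import numpy as np

vocab = ["<EOS>", "soy", "estoy", "yo"]           # toy output vocabulary
n_a, n_e = 8, 4                                    # hidden size, embedding size
rng = np.random.default_rng(0)
E = rng.normal(size=(len(vocab), n_e))             # output embedding layer
W_dense = rng.normal(size=(len(vocab), n_a))       # dense layer -> vocab logits
W_h = rng.normal(size=(n_a, n_a + n_e)) * 0.1      # stand-in recurrent weights

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def greedy_decode(context, max_len=10):
    """Greedy decoding: start from <EOS>, stop when <EOS> is produced again."""
    a, out, token = context, [], 0                 # token 0 is <EOS>
    for _ in range(max_len):
        x = E[token]                               # embed the previous output
        a = np.tanh(W_h @ np.concatenate([a, x]))  # simplified recurrent step
        probs = softmax(W_dense @ a)               # dense layer + softmax
        token = int(np.argmax(probs))              # pick the most likely word
        if token == 0:                             # <EOS> ends generation
            break
        out.append(vocab[token])
    return out

print(greedy_decode(np.zeros(n_a)))   # output depends on the random weights
```

<p>In a trained model, the context vector produced by the encoder replaces the zero vector passed in here, and the learned weights make the argmax at each step pick the right word.</p>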
]]></content:encoded></item><item><title><![CDATA[Vanishing/Exploding gradients in RNN]]></title><description><![CDATA[A basic Recurrent Neural Network (RNN) is a specialized type of artificial neural network designed to effectively process sequences of data, which is common in various fields such as natural language processing, time series analysis, and speech recog...]]></description><link>https://path2ml.com/vanishingexploding-gradients-in-rnn</link><guid isPermaLink="true">https://path2ml.com/vanishingexploding-gradients-in-rnn</guid><category><![CDATA[Deep Learning]]></category><category><![CDATA[RNN]]></category><category><![CDATA[AI]]></category><category><![CDATA[neural networks]]></category><category><![CDATA[Neural Network]]></category><dc:creator><![CDATA[Nitin Sharma]]></dc:creator><pubDate>Wed, 30 Jul 2025 02:34:07 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1753828231322/d1dec0be-44f4-4a10-9072-59770b536ceb.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>A basic <strong>Recurrent Neural Network (RNN)</strong> is a specialized type of artificial neural network designed to effectively process sequences of data, which is common in various fields such as natural language processing, time series analysis, and speech recognition. Unlike traditional feedforward neural networks, where information moves in one direction from input to output, <strong>RNNs</strong> incorporate connections that allow certain neurons to loop back onto themselves. This unique architecture enables RNNs to maintain a form of memory, which is crucial for understanding the context and dependencies in sequential data.</p>
<p>In an <strong>RNN</strong>, the input at each time step is not only processed independently but is also combined with the hidden state derived from the previous time step. The hidden state serves as a form of memory that captures relevant information from prior inputs in the sequence. This dynamic updating of the hidden state allows the network to incorporate both the current input and the context of previous inputs, effectively enabling the learning of temporal dependencies. Consequently, <strong>RNNs</strong> can adaptively handle sequences of varying lengths, making them particularly advantageous for tasks where the input size is not fixed, such as in natural language sentences or time-varying signals.</p>
<p>Basic Recurrent Neural Networks (<strong>RNNs</strong>) often encounter significant challenges related to the phenomena of vanishing and exploding gradients. The <strong>vanishing gradient</strong> problem arises when gradients become progressively smaller as they are propagated backward through the network during training, leading to inadequate updates to the weights of earlier layers. This makes it difficult for the network to learn long-term dependencies from the input sequences. Conversely, the <strong>exploding gradient</strong> problem occurs when gradients grow excessively large, causing sudden and erratic changes in the weights, which can destabilize the learning process. Both of these issues can severely hinder the performance of RNNs and limit their ability to effectively model sequential data. In this article, we will explore how backpropagation can lead to vanishing and exploding gradients. We will begin by examining a simple <strong>RNN</strong> architecture, which includes a feedback loop along with its associated weights and biases. Very simple RNN is shown below.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1753828796146/2194f3d1-9cdf-451f-9cb6-88907cd30c10.png" alt class="image--center mx-auto" /></p>
<p>\(Image-1\)</p>
<p>To illustrate this, we will start with a basic design of an <strong>RNN</strong>, as shown below, to demonstrate the calculation of back propagation.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1753833049181/08a028ca-dc98-4db1-bb0b-a08b2762ea3d.png" alt class="image--center mx-auto" /></p>
<p>For this scenario we will use <strong>SSR</strong> ( Sum of Squared Residuals) as cost function. The sum of squared residuals serves as a cost function in various statistical models. It measures the discrepancy between observed values and the values predicted by the model. This cost function is calculated by taking the difference between each observed value and its corresponding predicted value (the residual), squaring each of those differences to eliminate negative values, and then summing all the squared differences together. The goal is to minimize this sum, which indicates that the model's predictions are closely aligned with the actual data. It can be defined as</p>
<p>$$SSR= \sum_i^m (Observed_i-Predicted_i)^2$$</p><p>Let's ignore the feedback loop for now and just calculate the derivative of SSR with respect to W1, considering only Input3.</p>
<p>Applying the chain rule, we can say that</p>
<p>\(\begin{flalign*} &amp; \frac{dSSR}{dW1}= \frac{dSSR}{dPredicted} \cdot \frac{dPredicted}{dW1}\space\space\space\space\cdots\cdots\cdots\cdots\cdots1    &amp;\\ \end{flalign*}\)</p>
<p>First we calculate the derivative of SSR with respect to predicted value(output)</p>
<p>\(\begin{flalign*} &amp; \frac{dSSR}{dpredicted}= \frac{d \sum_i^m (Observed_i-Predicted_i)^2}{dPredicted} &amp;\\ \end{flalign*}\)</p>
<p>Applying the chain rule,</p>
<p>\(\begin{flalign*} &amp; \frac{dSSR}{dpredicted}= {\sum_i^m 2*(Observed_i-Predicted_i)} *  -1   &amp;\\ \end{flalign*}\)</p>
<p>\(\begin{flalign*} &amp; \frac{dSSR}{dpredicted}= {\sum_i^m -2*(Observed_i-Predicted_i)}   &amp;\\ \end{flalign*}\)</p>
<p>Now let's calculate the derivative of the predicted value with respect to W1</p>
<p>\(\begin{flalign*} &amp; \frac{dPredicted}{dW1}= \frac{d(W_1*Input3)}{dW1}=Input3   &amp;\\ \end{flalign*}\)</p>
<p>So now <strong>Equation 1</strong> can be written as</p>
<p>\(\begin{flalign*} &amp; \frac{dSSR}{dW1}= {\sum_i^m -2*(Observed_i-Predicted_i)}  * Input3  &amp;\\ \end{flalign*}\)</p>
<p>Now let's calculate the derivative when we unroll the RNN to include the previous input as feedback, as shown below.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1753838582170/a9a0aa09-e801-4116-97c3-ac8438f5f58b.png" alt class="image--center mx-auto" /></p>
<p>When we unroll the RNN, the predicted value is the previous value (Input2) multiplied by <strong>W1</strong> and <strong>W2</strong>, plus Input3 multiplied by <strong>W1</strong>:</p>
<p>\(Predicted=(Input2 * W_1 * W_2)+(W_1*Input3)\)</p>
<p>\(\begin{flalign*} &amp; \frac{dPredicted}{dW1}= \frac{d(Input2 * W_1*W_2)+(Input3*W_1)}{dW1}=(Input2 *W_2)+Input3   &amp;\\ \end{flalign*}\)</p>
<p>If we consider one more previous input (Input1), as in <strong>Image-1</strong>, then the predicted value changes to</p>
<p>\(Predicted=[(Input1 * W_1 * W_2)+(W_1*Input2)]*W_2+(Input3*W_1)\)</p>
<p>Expanding this,</p>
<p>\(Predicted=(Input1 * W_1 * W_2^2)+(Input2*W_1*W_2)+(Input3*W_1)\)</p>
<p>\(\begin{flalign*} &amp; \frac{dPredicted}{dW1}= \frac{d[(Input1 * W_1 * W_2)+(W_1*Input2)]*W_2+(Input3*W_1)}{dW1}  &amp;\\ \end{flalign*}\)</p>
<p>\(=(Input1*W_2^2)+(Input2*W_2)+Input3\)</p>
<p>Now let's substitute this back into the derivative of <strong>SSR</strong> with respect to <strong>W1</strong>:</p>
<p>\(\begin{flalign*} &amp; \frac{dSSR}{dW1}= {\sum_i^m -2*(Observed_i-Predicted_i)}  * ((Input1*W_2^2)+(Input2*W_2)+Input3)  &amp;\\ \end{flalign*}\)</p>
<p>We see a pattern: <strong>the power of</strong> \(W_2\) <strong>grows with the number of times we unroll the RNN to include a previous input</strong>, as in the term \((Input1*W_2^2)\).</p>
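<p>The unrolled expression for \(\frac{dPredicted}{dW1}\) derived above can be checked numerically against a finite-difference approximation. A quick NumPy sanity check, using arbitrary values for the inputs and weights:</p>

```python
import numpy as np

def predicted(w1, w2, i1, i2, i3):
    # Two-step unrolled RNN: ((i1*w1*w2 + i2*w1) * w2) + i3*w1
    return (i1 * w1 * w2 + i2 * w1) * w2 + i3 * w1

w1, w2, i1, i2, i3 = 0.7, 0.9, 2.0, -1.0, 0.5

# Analytic derivative derived above: Input1*W2^2 + Input2*W2 + Input3
analytic = i1 * w2**2 + i2 * w2 + i3

# Central finite difference with a small step h
h = 1e-6
numeric = (predicted(w1 + h, w2, i1, i2, i3) -
           predicted(w1 - h, w2, i1, i2, i3)) / (2 * h)

print(abs(analytic - numeric) < 1e-8)   # True
```

<p>Because the predicted value is linear in \(W_1\), the finite difference matches the analytic derivative up to floating-point rounding.</p>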
<h3 id="heading-vanishing-and-exploding-gradients">Vanishing and Exploding Gradients</h3>
<p>Let's say we unroll the <strong>RNN</strong> many times to include many previous values, far more than shown in Image-1.</p>
<p>If the weight \(W_2\) is between <strong>-1</strong> and <strong>1</strong>, then terms like \((Input1*W_2^2)\) in \(\frac{dSSR}{dW_1}\) become very small; this is the <strong>vanishing gradient</strong> problem. In other words, the contribution of earlier values effectively disappears.</p>
<p>If the weight \(W_2\) is less than -1 or greater than 1, then terms like \((Input1*W_2^2)\) in \(\frac{dSSR}{dW_1}\) grow without bound; this is an <strong>exploding gradient</strong>, meaning the contributions from earlier values become excessively large.</p>
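<p>The effect of repeatedly multiplying by \(W_2\) is easy to see numerically. The contribution of an input \(n\) steps in the past scales as \(W_2^n\) (two arbitrary weight values are used here for illustration):</p>

```python
# |W2| < 1 shrinks toward zero; |W2| > 1 blows up as n grows.
for w2 in (0.5, 1.5):
    for n in (1, 10, 50):
        print(f"W2={w2}, n={n:2d}: W2**n = {w2 ** n:.3e}")
# 0.5**50 is on the order of 1e-15 (vanishes);
# 1.5**50 is on the order of 1e8 (explodes).
```

<p>This is exactly why a basic RNN can only be unrolled a limited number of steps before earlier inputs either stop mattering or dominate the gradient.</p>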
<p>The fundamental concept revolves around the inherent limitations of a basic Recurrent Neural Network (RNN) concerning the temporal dependencies it can effectively manage. Specifically, an RNN can only unroll for a limited number of time steps before the influence of older data points on the training process becomes problematic. When the sequence length exceeds this optimal range, the older inputs may either lose their significance—resulting in diminishing returns on their contribution to learning—or exert an overwhelming influence, thereby skewing the model's predictions and learning dynamics. This imbalance can hinder the model’s ability to retain relevant information over long sequences, ultimately affecting its performance on tasks that involve longer temporal dependencies.</p>
<h3 id="heading-long-short-term-memory-networkslstm"><strong>Long Short-Term Memory networks(LSTM)</strong></h3>
<p>Long Short-Term Memory networks, commonly known as <strong>LSTMs</strong>, are a specialized type of recurrent neural network (<strong>RNN</strong>) designed to overcome the significant challenges of vanishing and exploding gradients that often occur in traditional <strong>RNNs</strong> during training. Vanishing gradients can make it difficult for the network to learn long-range dependencies in sequences, as the gradients used to update the model's weights become excessively small, effectively freezing the learning of earlier layers. Conversely, exploding gradients can lead to numerical instability and erratic updates, causing the model to diverge.</p>
<p><strong>LSTMs</strong> address these issues through a unique architectural design featuring memory cells and three distinct gates: the input gate, the forget gate, and the output gate. The input gate regulates the flow of new information into the memory cell, the forget gate decides what information to discard from the memory cell, and the output gate controls the information that is sent out of the cell. This gating mechanism enables <strong>LSTMs</strong> to maintain information over extended sequences, allowing them to learn complex patterns and relationships in data, making them particularly effective for tasks such as language modeling, speech recognition, and time series forecasting.</p>
<p>We will cover <strong>LSTM</strong> in another article.</p>
]]></content:encoded></item><item><title><![CDATA[Implementing Convolutional Neural Network using PyTorch]]></title><description><![CDATA[My Previous article “ convolutional neural network **“**explained about the architecture of CNN and how it works. Another article “Implementing CNN using TensorFlow” showed how to implement CNN using TensorFlow.
In this article, we will explore the p...]]></description><link>https://path2ml.com/implementing-convolutional-neural-network-using-pytorch</link><guid isPermaLink="true">https://path2ml.com/implementing-convolutional-neural-network-using-pytorch</guid><category><![CDATA[Deep Learning]]></category><category><![CDATA[CNNs (Convolutional Neural Networks)]]></category><category><![CDATA[CNN]]></category><category><![CDATA[pytorch]]></category><category><![CDATA[DeepLearning]]></category><category><![CDATA[Machine Learning]]></category><category><![CDATA[MachineLearning]]></category><dc:creator><![CDATA[Nitin Sharma]]></dc:creator><pubDate>Sun, 27 Jul 2025 15:51:06 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1753631365019/14f97c37-ba7b-4de8-baca-16f67c07e93a.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>My previous article <a target="_blank" href="https://path2ml.com/convolutional-neural-network"><strong>“Convolutional Neural Network”</strong></a> explained the architecture of CNN and how it works. Another article, <a target="_blank" href="https://path2ml.com/implementing-convolutional-neural-network-using-tensorflow"><strong>“Implementing CNN using TensorFlow”</strong></a>, showed how to implement a <strong>CNN</strong> using <strong>TensorFlow</strong>.</p>
<p>In this article, we will explore the process of creating and optimizing a simple Convolutional Neural Network (<strong>CNN</strong>) using <strong>PyTorch</strong> and <strong>Lightning</strong>. A CNN is a specialized type of neural network that excels in processing and classifying images.</p>
<p>We will begin by outlining the fundamental concepts of Convolutional Neural Networks, including their architecture and the role of convolutional layers, pooling layers, and activation functions. Following that, we will walk through the implementation steps in PyTorch, detailing how to set up the environment, load the data, and construct the network.</p>
<p>We will focus on building a CNN that can distinguish between images of Xs and Os. We will also cover the optimization techniques used to improve the model's accuracy and efficiency, making use of Lightning to streamline our training process.</p>
<p>An example of a CNN, comprising Conv2D and MaxPool layers, is shown below.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1739895921483/ffb5e267-7cf0-40a4-a5a2-fa57933dc849.png?auto=compress,format&amp;format=webp" alt /></p>
<h3 id="heading-we-will-start-with-importing-needed-libraries-first-we-install-lightening-framework">We will start by importing the needed <strong>libraries</strong>. First we install the Lightning framework</h3>
<pre><code class="lang-python">%%capture

!pip install lightning
</code></pre>
<pre><code class="lang-python"><span class="hljs-comment"># torch will allow us to create tensors.</span>
<span class="hljs-keyword">import</span> torch 
<span class="hljs-comment"># torch.nn allows us to create a neural network.</span>
<span class="hljs-keyword">import</span> torch.nn <span class="hljs-keyword">as</span> nn 
<span class="hljs-comment"># nn.functional give us access to the activation and loss functions.</span>
<span class="hljs-keyword">import</span> torch.nn.functional <span class="hljs-keyword">as</span> F 
<span class="hljs-comment"># optim contains many optimizers. This time we're using Adam</span>

<span class="hljs-keyword">from</span> torch.optim <span class="hljs-keyword">import</span> Adam 
<span class="hljs-comment"># lightning has tons of cool tools that make neural networks easier</span>
<span class="hljs-keyword">import</span> lightning <span class="hljs-keyword">as</span> L 
<span class="hljs-comment"># these are needed for the training data</span>
<span class="hljs-keyword">from</span> torch.utils.data <span class="hljs-keyword">import</span> TensorDataset, DataLoader
<span class="hljs-comment">## matplotlib allows us to draw the images used for input.</span>
<span class="hljs-keyword">import</span> matplotlib.pyplot <span class="hljs-keyword">as</span> plt
</code></pre>
<p>Once we import the necessary Python modules, our next step will be to create images of the letters O and X. These images are essential for training and testing our neural network's performance. We need to design the images to closely resemble the examples provided below, ensuring they are clear and correctly formatted for optimal neural network processing. This will involve defining the size, resolution, and any specific features that make the letters recognizable. By preparing these images carefully, we can improve the accuracy and reliability of our model in recognizing and interpreting these characters.</p>
<p>We will begin the process by generating a visual representation of the letter "O." To do this, we will construct a 6x6 matrix of numbers. In this matrix, the number 0 will represent the color white, while the number 1 will represent the color black. Each element of the matrix will correspond to a pixel in the image, allowing us to form the distinctive shape of the letter "O" through the arrangement of these values.</p>
<pre><code class="lang-python"><span class="hljs-comment">## Create a 6x6 matrix of numbers where 0 represents white</span>
<span class="hljs-comment">## and 1 represents black.</span>
o_image = [[<span class="hljs-number">0</span>, <span class="hljs-number">0</span>, <span class="hljs-number">1</span>, <span class="hljs-number">1</span>, <span class="hljs-number">0</span>, <span class="hljs-number">0</span>],
           [<span class="hljs-number">0</span>, <span class="hljs-number">1</span>, <span class="hljs-number">0</span>, <span class="hljs-number">0</span>, <span class="hljs-number">1</span>, <span class="hljs-number">0</span>],
           [<span class="hljs-number">1</span>, <span class="hljs-number">0</span>, <span class="hljs-number">0</span>, <span class="hljs-number">0</span>, <span class="hljs-number">0</span>, <span class="hljs-number">1</span>],
           [<span class="hljs-number">1</span>, <span class="hljs-number">0</span>, <span class="hljs-number">0</span>, <span class="hljs-number">0</span>, <span class="hljs-number">0</span>, <span class="hljs-number">1</span>],
           [<span class="hljs-number">0</span>, <span class="hljs-number">1</span>, <span class="hljs-number">0</span>, <span class="hljs-number">0</span>, <span class="hljs-number">1</span>, <span class="hljs-number">0</span>],
           [<span class="hljs-number">0</span>, <span class="hljs-number">0</span>, <span class="hljs-number">1</span>, <span class="hljs-number">1</span>, <span class="hljs-number">0</span>, <span class="hljs-number">0</span>]]
o_image <span class="hljs-comment"># print out the matrix to verify that it is what we expect</span>
</code></pre>
<p>We will create an image of the letter <strong>X</strong> by creating a similar 6x6 matrix, where the 1s are now in an <strong>X</strong> pattern.</p>
<pre><code class="lang-python">x_image = [[<span class="hljs-number">1</span>, <span class="hljs-number">0</span>, <span class="hljs-number">0</span>, <span class="hljs-number">0</span>, <span class="hljs-number">0</span>, <span class="hljs-number">1</span>],
           [<span class="hljs-number">0</span>, <span class="hljs-number">1</span>, <span class="hljs-number">0</span>, <span class="hljs-number">0</span>, <span class="hljs-number">1</span>, <span class="hljs-number">0</span>],
           [<span class="hljs-number">0</span>, <span class="hljs-number">0</span>, <span class="hljs-number">1</span>, <span class="hljs-number">1</span>, <span class="hljs-number">0</span>, <span class="hljs-number">0</span>],
           [<span class="hljs-number">0</span>, <span class="hljs-number">0</span>, <span class="hljs-number">1</span>, <span class="hljs-number">1</span>, <span class="hljs-number">0</span>, <span class="hljs-number">0</span>],
           [<span class="hljs-number">0</span>, <span class="hljs-number">1</span>, <span class="hljs-number">0</span>, <span class="hljs-number">0</span>, <span class="hljs-number">1</span>, <span class="hljs-number">0</span>],
           [<span class="hljs-number">1</span>, <span class="hljs-number">0</span>, <span class="hljs-number">0</span>, <span class="hljs-number">0</span>, <span class="hljs-number">0</span>, <span class="hljs-number">1</span>]]
x_image
</code></pre>
<p>To visualize the o_image and x_image with <strong>matplotlib</strong>, we begin by using the subplots() function. This function generates a grid of subplots, returning an array named axarr[]. Each element in this array corresponds to a subplot defined by the parameters nrows (number of rows) and ncols (number of columns) we specify. By organizing the images into this grid, we can easily position and display each image within its respective subplot for clear and effective comparison.</p>
<pre><code class="lang-python"><span class="hljs-comment">## To draw the o_image and x_image, we first call subplots(), which creates </span>
<span class="hljs-comment">## an array, called axarr[], with an entry for each element in a grid</span>
<span class="hljs-comment">## specified by nrows and ncols.</span>
fig, axarr = plt.subplots(nrows=<span class="hljs-number">1</span>, ncols=<span class="hljs-number">2</span>, figsize=(<span class="hljs-number">5</span>, <span class="hljs-number">5</span>))

<span class="hljs-comment">## Now we pass o_image and x_image to .imshow() for each element</span>
<span class="hljs-comment">## in the grid created by plt.subplots()</span>
axarr[<span class="hljs-number">0</span>].imshow(o_image, cmap=<span class="hljs-string">'gray_r'</span>) <span class="hljs-comment">## Setting cmap='gray_r' gives us reverse grayscale.</span>
axarr[<span class="hljs-number">1</span>].imshow(x_image, cmap=<span class="hljs-string">'gray_r'</span>)
</code></pre>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1753626723637/3a08122b-b6d6-4314-ae91-afbdd1c9da6e.png" alt class="image--center mx-auto" /></p>
<p>We will begin by loading the training data into a DataLoader, a powerful tool in PyTorch that streamlines the process of feeding data into our neural network for training. DataLoaders are particularly advantageous when working with large datasets for several reasons. First, they enable us to access our data in manageable batches, which helps reduce memory consumption and speeds up the training process. Second, DataLoaders provide an easy way to shuffle our dataset at the beginning of each epoch, ensuring that the model does not learn any unintended patterns from the order of the data. Finally, if we want to quickly test our code or validate our model's functionality without using the entire dataset, DataLoaders allow us to work with a smaller subset of the data.</p>
<p>In order to prepare our training data for the DataLoader, we will convert the images into tensors using the <code>torch.tensor()</code> function. This step is crucial because PyTorch requires inputs to be in tensor format for processing. Once converted, we will save these tensors as <code>input_images</code>, which will then be passed to the DataLoader for efficient batch processing during training. This systematic approach will facilitate a smoother training experience and help us achieve better results with our neural network.</p>
<pre><code class="lang-python"><span class="hljs-comment">## Convert the images into tensors...</span>
input_images = torch.tensor([o_image, x_image]).type(torch.float32)
</code></pre>
<p>In this step, we will create tensors that represent the labels, which are the ideal output values corresponding to each input image in our dataset. Specifically, our convolutional neural network is designed to recognize two distinct letters: O and X.</p>
<p>To achieve this, we will define our output for the letter O as the tensor [1.0, 0.0], indicating that the first output neuron is activated for the letter O while the second one is not. Conversely, the tensor [0.0, 1.0] will be used to represent the ideal output for the letter X, where the second output neuron is activated.</p>
<p>These tensors will be crucial for training the neural network, as they will guide the model in learning to differentiate between the two letters based on the input images. All the generated labels for our training dataset will be saved in a variable named <code>input_labels</code>, which will facilitate easy access and manipulation during the training process.</p>
<pre><code class="lang-python"><span class="hljs-comment">## Create the labels for the input images</span>
input_labels = torch.tensor([[<span class="hljs-number">1.0</span>, <span class="hljs-number">0.0</span>], [<span class="hljs-number">0.0</span>, <span class="hljs-number">1.0</span>]]).type(torch.float32)
</code></pre>
<p>We will combine the input images with the input labels to create a TensorDataset, which we will then use to create a DataLoader.</p>
<pre><code class="lang-python"><span class="hljs-comment">## Now combine input_images and input_labels into a TensorDataset...</span>
dataset = TensorDataset(input_images, input_labels) 
<span class="hljs-comment">## ...and use the TensorDataset to create a DataLoader.</span>
dataloader = DataLoader(dataset)
</code></pre>
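<p>For our tiny two-image dataset the defaults are fine, but the batching and shuffling benefits mentioned earlier come from DataLoader's optional arguments. A small illustrative sketch, using placeholder tensors rather than the article's images:</p>

```python
import torch
from torch.utils.data import TensorDataset, DataLoader

# Placeholder data: four samples with two features each, plus class labels.
features = torch.arange(8, dtype=torch.float32).reshape(4, 2)
labels = torch.tensor([0, 1, 0, 1])
dataset = TensorDataset(features, labels)

# batch_size controls how many samples each iteration yields;
# shuffle=True reorders the dataset at the start of every epoch.
loader = DataLoader(dataset, batch_size=2, shuffle=True)
for batch_features, batch_labels in loader:
    print(batch_features.shape, batch_labels.shape)
```

<p>With <code>batch_size=2</code> the four samples arrive in two batches per epoch, and shuffling prevents the model from learning anything from the order of the data.</p>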
<h2 id="heading-build-a-convolutional-neural-network-with-pytorch-and-lightning">Build a convolutional neural network with PyTorch and Lightning</h2>
<p>To build a convolutional neural network (CNN) using PyTorch, we will need to define a new class that extends the capabilities of LightningModule. This approach simplifies the training process and enhances model organization. The new class will encompass several key methods, each serving a specific purpose in the model's functionality:</p>
<ul>
<li><p><strong>__init__()</strong>: This method is crucial for initializing the CNN's parameters. Inside this method, you will set up the weights and biases for the network layers. Additionally, you’ll maintain any necessary bookkeeping information, such as the architecture details of the network and configurations for training.</p>
</li>
<li><p><strong>forward()</strong>: In this method, you will define how data flows through the network during a forward pass. This includes the series of operations performed on the input data as it travels through each layer of the CNN, such as convolutional layers, activation functions, and pooling layers.</p>
</li>
<li><p><strong>configure_optimizers()</strong>: This method is used to set up the optimization algorithm that will update the model's weights during training. In this tutorial, we will be using the Adam optimizer, which is well-regarded for its efficiency and effectiveness in optimizing deep learning models.</p>
</li>
<li><p><strong>training_step()</strong>: This method handles the training process for each batch of data. It takes the training data as input and feeds it into the <code>forward()</code> method to obtain predictions. Afterward, it calculates the loss by comparing the predicted values with the actual target values. Additionally, it keeps track of the loss values, allowing for logging and monitoring during training, which is essential for assessing model performance.</p>
</li>
</ul>
<p>By implementing these methods, we will have a well-structured and functional convolutional neural network ready for training with PyTorch.</p>
<h3 id="heading-steps-to-build-cnn-using-pytorch"><strong>Steps to build CNN using PyTorch</strong></h3>
<p>Let's build a simple Convolutional Neural Network (CNN) using the LightningModule. This network will help us extract features from images and classify them accordingly.</p>
<h3 id="heading-step-1-initializing-weights-and-biases">Step 1: Initializing Weights and Biases</h3>
<p>We begin by initializing the weights and biases for our CNN. This step is crucial, as these parameters will be adjusted during training to improve the model's performance.</p>
<h3 id="heading-step-2-setting-up-the-convolutional-layer">Step 2: Setting Up the Convolutional Layer</h3>
<p>The first layer of our CNN is the convolutional layer, which we set up using nn.Conv2d(). This layer applies a filter to our input data to extract features. The parameters needed to configure this layer include:</p>
<ul>
<li><p><strong>in_channels</strong>: This parameter specifies the number of input channels. For instance, a grayscale (black and white) image has one channel, while a color image typically has three (for red, green, and blue).</p>
</li>
<li><p><strong>out_channels</strong>: This parameter determines how many output channels the convolutional layer will produce. If the model receives multiple input channels, we can combine them into fewer output channels, or we can increase the number of output channels to capture more features.</p>
</li>
<li><p><strong>kernel_size</strong>: This refers to the dimensions of the filter (also known as the convolutional kernel). In our implementation, we will use a 3x3 filter, but we have the flexibility to choose other sizes, including rectangular shapes, depending on our specific needs.</p>
</li>
</ul>
<h3 id="heading-step-3-implementing-max-pooling">Step 3: Implementing Max Pooling</h3>
<p>After the convolutional layer, we apply a max pooling operation using nn.MaxPool2d(). This step reduces the dimensionality of the feature maps, helping to extract the most important features and reduce computational load. The parameters for the max pooling layer include:</p>
<ul>
<li><p><strong>kernel_size</strong>: This defines the size of the pooling filter. In our case, we are using a 2x2 filter, which will help summarize the features in each 2x2 section of the input.</p>
</li>
<li><p><strong>stride</strong>: The stride determines how far we move the pooling filter with each operation. In our example, we set the stride to 2, meaning that after applying the filter to one section, it will move 2 units over (or down), ensuring there is no overlap between pooling sections.</p>
</li>
</ul>
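<p>The conv and pooling steps above can be sanity-checked by tracing tensor shapes. A minimal sketch, assuming the article's 6x6 single-channel input: a 3x3 convolution with stride 1 produces (6 - 3) + 1 = 4, and 2x2 max pooling with stride 2 halves that to 2:</p>

```python
import torch
import torch.nn as nn

# Same layer configuration as described above.
conv = nn.Conv2d(in_channels=1, out_channels=1, kernel_size=3)
pool = nn.MaxPool2d(kernel_size=2, stride=2)

x = torch.randn(1, 1, 6, 6)       # (batch, channels, height, width)
after_conv = conv(x)
after_pool = pool(after_conv)
flat = torch.flatten(after_pool, 1)  # flatten everything except the batch dim

print(after_conv.shape)   # torch.Size([1, 1, 4, 4])
print(after_pool.shape)   # torch.Size([1, 1, 2, 2])
print(flat.shape)         # torch.Size([1, 4])
```

<p>Flattening the 2x2 feature map yields 4 values, which is why the fully connected layer in the next step uses <code>in_features=4</code>.</p>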
<h3 id="heading-step-4-constructing-the-fully-connected-neural-network">Step 4: Constructing the Fully Connected Neural Network</h3>
<p>Now, we move on to constructing a fully connected neural network (also known as a dense layer). This network will take in the features extracted from the convolutional and pooling layers. The configuration of this layer includes:</p>
<ul>
<li><p><strong>Input features (</strong><code>in_features=4</code>): This specifies the number of features that will be input into the neural network.</p>
</li>
<li><p><strong>Output features (</strong><code>out_features=1</code>): This indicates that this layer produces a single output value, which is then passed through a ReLU activation function before entering the hidden layer.</p>
</li>
</ul>
<p>Additionally, we will implement a hidden layer that has:</p>
<ul>
<li><p><strong>Input features (</strong><code>in_features=1</code>): Here, the output from the previous layer feeds into this hidden layer.</p>
</li>
<li><p><strong>Output features (</strong><code>out_features=2</code>): This layer will produce two outputs, allowing the network to classify the input into two different categories.</p>
</li>
</ul>
<h3 id="heading-step-5-calculating-loss-with-cross-entropy">Step 5: Calculating Loss with Cross Entropy</h3>
<p>To assess how well our neural network is performing, we will use Cross Entropy Loss. This loss function compares the network's predicted classifications against the actual labels in our dataset. The implementation is done using <code>nn.CrossEntropyLoss</code>, which conveniently applies a SoftMax function to the output values. This means we don't need to apply the SoftMax ourselves during training. However, we must remember to apply it during inference after the model has been trained.</p>
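<p>A quick sketch (not from the article) confirming that <code>nn.CrossEntropyLoss</code> applies the SoftMax internally: feeding it raw logits gives the same loss as applying log-softmax followed by negative log-likelihood by hand.</p>

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

logits = torch.tensor([[2.0, 0.5]])   # raw network outputs (no softmax applied)
target = torch.tensor([0])            # class index of the true label

# CrossEntropyLoss on raw logits...
loss_ce = nn.CrossEntropyLoss()(logits, target)
# ...equals log-softmax + negative log-likelihood done manually.
loss_manual = F.nll_loss(F.log_softmax(logits, dim=1), target)

print(loss_ce.item(), loss_manual.item())
```

<p>This is why the network's <code>forward()</code> can return raw logits during training, with softmax applied only at inference time.</p>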
<h3 id="heading-step-6-applying-the-filter-and-activation-functions">Step 6: Applying the Filter and Activation Functions</h3>
<p>We start the forward pass of our CNN by applying the filter to the input image. After this, the output from the convolution is passed through a ReLU activation function, which introduces non-linearity into the model:</p>
<p>Next, we take the output from the ReLU layer and feed it into the max pooling layer:</p>
<p>At this stage, we have a reduced matrix of feature values. To prepare this for input into our fully connected neural network, we flatten the matrix into a vector format:</p>
<h3 id="heading-step-7-running-the-flattened-values-through-the-neural-network">Step 7: Running the Flattened Values Through the Neural Network</h3>
<p>Once the values are flattened, we can pass them through our fully connected layer, which includes the hidden layer along with the activation function, to obtain the final output for classification.</p>
<h3 id="heading-step-8-configuring-the-optimizer">Step 8: Configuring the Optimizer</h3>
<p>Finally, we need to set up the optimizer that will adjust our model's parameters. We pass the parameters we want to optimize, accessed via <code>self.parameters()</code>, into the optimizer. For this implementation, we’ll use the Adam optimizer with a learning rate (<code>lr</code>) of 0.001.</p>
<p>We have now established a functioning CNN capable of processing images and making predictions. With proper training and validation, this model will learn to classify images effectively based on the features extracted from the data.</p>
<pre><code class="lang-python"><span class="hljs-comment">## Now build a simple CNN...</span>
<span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">SimpleCNN</span>(<span class="hljs-params">L.LightningModule</span>):</span>

    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">__init__</span>(<span class="hljs-params">self</span>):</span>

        super().__init__() 
        L.seed_everything(seed=<span class="hljs-number">42</span>)

        self.conv = nn.Conv2d(in_channels=<span class="hljs-number">1</span>, out_channels=<span class="hljs-number">1</span>, kernel_size=<span class="hljs-number">3</span>)

        self.pool = nn.MaxPool2d(kernel_size=<span class="hljs-number">2</span>, stride=<span class="hljs-number">2</span>)

        self.input_to_hidden = nn.Linear(in_features=<span class="hljs-number">4</span>, out_features=<span class="hljs-number">1</span>)
        <span class="hljs-comment">## ..and the single hidden layer, in_features=1, goes to</span>
        <span class="hljs-comment">## two outputs, out_features=2</span>
        self.hidden_to_output = nn.Linear(in_features=<span class="hljs-number">1</span>, out_features=<span class="hljs-number">2</span>)

        self.loss = nn.CrossEntropyLoss()


    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">forward</span>(<span class="hljs-params">self, x</span>):</span>

        <span class="hljs-comment">## First we apply a filter to the input image</span>
        x = self.conv(x)

        <span class="hljs-comment">## Then we run the output from the filter through a ReLU...</span>
        x = F.relu(x)
        <span class="hljs-comment">## Then we run the output from the ReLU through a Max Pooling layer...</span>
        x = self.pool(x)
        x = torch.flatten(x, <span class="hljs-number">1</span>) <span class="hljs-comment"># flatten all dimensions except batch </span>
        x = self.input_to_hidden(x)
        x = F.relu(x)
        x = self.hidden_to_output(x)

        <span class="hljs-keyword">return</span> x


    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">configure_optimizers</span>(<span class="hljs-params">self</span>):</span>

        <span class="hljs-keyword">return</span> Adam(self.parameters(), lr=<span class="hljs-number">0.001</span>)

    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">training_step</span>(<span class="hljs-params">self, batch, batch_idx</span>):</span>

        inputs, labels = batch 

        outputs = self.forward(inputs)

        <span class="hljs-comment">## Then we calculate the loss.</span>
        loss = self.loss(outputs, labels)


        <span class="hljs-keyword">return</span> loss
</code></pre>
<h2 id="heading-training-our-neural-network">Training our Neural Network</h2>
<p>To train our new convolutional neural network, we create a model from the new class, SimpleCNN, build a Lightning Trainer with <code>L.Trainer()</code>, and use it to optimize the parameters. Note that we will train for 700 epochs, which means we will complete 700 full passes through our training data. This may be sufficient to successfully optimize all of the parameters, but there is a possibility it might not be enough.</p>
<pre><code class="lang-python">model = SimpleCNN()
trainer = L.Trainer(max_epochs=<span class="hljs-number">700</span>)
trainer.fit(model, train_dataloaders=dataloader)
</code></pre>
<pre><code class="lang-python">INFO: 💡 Tip: For seamless cloud uploads <span class="hljs-keyword">and</span> versioning, <span class="hljs-keyword">try</span> installing [litmodels](https://pypi.org/project/litmodels/) to enable LitModelCheckpoint, which syncs automatically <span class="hljs-keyword">with</span> the Lightning model registry.
INFO:lightning.pytorch.utilities.rank_zero:💡 Tip: For seamless cloud uploads <span class="hljs-keyword">and</span> versioning, <span class="hljs-keyword">try</span> installing [litmodels](https://pypi.org/project/litmodels/) to enable LitModelCheckpoint, which syncs automatically <span class="hljs-keyword">with</span> the Lightning model registry.
INFO: GPU available: <span class="hljs-literal">False</span>, used: <span class="hljs-literal">False</span>
INFO:lightning.pytorch.utilities.rank_zero:GPU available: <span class="hljs-literal">False</span>, used: <span class="hljs-literal">False</span>
INFO: TPU available: <span class="hljs-literal">False</span>, using: <span class="hljs-number">0</span> TPU cores
INFO:lightning.pytorch.utilities.rank_zero:TPU available: <span class="hljs-literal">False</span>, using: <span class="hljs-number">0</span> TPU cores
INFO: HPU available: <span class="hljs-literal">False</span>, using: <span class="hljs-number">0</span> HPUs
INFO:lightning.pytorch.utilities.rank_zero:HPU available: <span class="hljs-literal">False</span>, using: <span class="hljs-number">0</span> HPUs
INFO: 
  | Name             | Type             | Params | Mode 
--------------------------------------------------------------
<span class="hljs-number">0</span> | conv             | Conv2d           | <span class="hljs-number">10</span>     | train
<span class="hljs-number">1</span> | pool             | MaxPool2d        | <span class="hljs-number">0</span>      | train
<span class="hljs-number">2</span> | input_to_hidden  | Linear           | <span class="hljs-number">5</span>      | train
<span class="hljs-number">3</span> | hidden_to_output | Linear           | <span class="hljs-number">4</span>      | train
<span class="hljs-number">4</span> | loss             | CrossEntropyLoss | <span class="hljs-number">0</span>      | train
--------------------------------------------------------------
<span class="hljs-number">19</span>        Trainable params
<span class="hljs-number">0</span>         Non-trainable params
<span class="hljs-number">19</span>        Total params
<span class="hljs-number">0.000</span>     Total estimated model params size (MB)
<span class="hljs-number">5</span>         Modules <span class="hljs-keyword">in</span> train mode
<span class="hljs-number">0</span>         Modules <span class="hljs-keyword">in</span> eval mode
INFO:lightning.pytorch.callbacks.model_summary:
  | Name             | Type             | Params | Mode 
--------------------------------------------------------------
<span class="hljs-number">0</span> | conv             | Conv2d           | <span class="hljs-number">10</span>     | train
<span class="hljs-number">1</span> | pool             | MaxPool2d        | <span class="hljs-number">0</span>      | train
<span class="hljs-number">2</span> | input_to_hidden  | Linear           | <span class="hljs-number">5</span>      | train
<span class="hljs-number">3</span> | hidden_to_output | Linear           | <span class="hljs-number">4</span>      | train
<span class="hljs-number">4</span> | loss             | CrossEntropyLoss | <span class="hljs-number">0</span>      | train
--------------------------------------------------------------
<span class="hljs-number">19</span>        Trainable params
<span class="hljs-number">0</span>         Non-trainable params
<span class="hljs-number">19</span>        Total params
<span class="hljs-number">0.000</span>     Total estimated model params size (MB)
<span class="hljs-number">5</span>         Modules <span class="hljs-keyword">in</span> train mode
<span class="hljs-number">0</span>         Modules <span class="hljs-keyword">in</span> eval mode
</code></pre>
<p>Epoch 699: 100% 2/2 [00:00&lt;00:00, 82.10it/s, v_num=2]</p>
<pre><code class="lang-python">INFO: `Trainer.fit` stopped: `max_epochs=<span class="hljs-number">700</span>` reached.
INFO:lightning.pytorch.utilities.rank_zero:`Trainer.fit` stopped: `max_epochs=<span class="hljs-number">700</span>` reached.
</code></pre>
<p>Having completed the training of our model, we are now positioned to utilize it for making predictions using new data. In particular, we will evaluate the efficacy of our model in predicting an image of the letter <strong>"X"</strong> that has been shifted one pixel to the right. To initiate this process, we will first generate an image of the letter <strong>"X"</strong> that is displaced by one pixel.</p>
<pre><code class="lang-python">shifted_x_image = [[<span class="hljs-number">0</span>, <span class="hljs-number">1</span>, <span class="hljs-number">0</span>, <span class="hljs-number">0</span>, <span class="hljs-number">0</span>, <span class="hljs-number">0</span>],
                   [<span class="hljs-number">0</span>, <span class="hljs-number">0</span>, <span class="hljs-number">1</span>, <span class="hljs-number">0</span>, <span class="hljs-number">0</span>, <span class="hljs-number">1</span>],
                   [<span class="hljs-number">0</span>, <span class="hljs-number">0</span>, <span class="hljs-number">0</span>, <span class="hljs-number">1</span>, <span class="hljs-number">1</span>, <span class="hljs-number">0</span>],
                   [<span class="hljs-number">0</span>, <span class="hljs-number">0</span>, <span class="hljs-number">0</span>, <span class="hljs-number">1</span>, <span class="hljs-number">1</span>, <span class="hljs-number">0</span>],
                   [<span class="hljs-number">0</span>, <span class="hljs-number">0</span>, <span class="hljs-number">1</span>, <span class="hljs-number">0</span>, <span class="hljs-number">0</span>, <span class="hljs-number">1</span>],
                   [<span class="hljs-number">0</span>, <span class="hljs-number">1</span>, <span class="hljs-number">0</span>, <span class="hljs-number">0</span>, <span class="hljs-number">0</span>, <span class="hljs-number">0</span>]]
shifted_x_image
</code></pre>
<p>Let's check the image by drawing it with matplotlib.</p>
<pre><code class="lang-python">fig, ax = plt.subplots(figsize=(<span class="hljs-number">2.5</span>, <span class="hljs-number">2.5</span>))
ax.imshow(shifted_x_image, cmap=<span class="hljs-string">'gray_r'</span>) <span class="hljs-comment">## Setting cmap='gray_r' gives us reverse grayscale.</span>
</code></pre>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1753630836766/19756479-b163-457f-9fc6-11962d012018.png" alt class="image--center mx-auto" /></p>
<p>Let's see if our trained convolutional neural network can accurately classify it as an X.</p>
<pre><code class="lang-python"><span class="hljs-comment">## First, let's make a prediction with the new image...</span>
prediction = model(torch.tensor([shifted_x_image]).type(torch.float32))

<span class="hljs-comment">## Now make the prediction easy to read and interpret by</span>
<span class="hljs-comment">## running it through torch.softmax() and torch.round()</span>
predicted_label = torch.round(torch.softmax(prediction, dim=<span class="hljs-number">1</span>), decimals=<span class="hljs-number">2</span>) <span class="hljs-comment">## dim=1 applies softmax across each row's columns</span>

predicted_label
</code></pre>
<pre><code class="lang-python">tensor([[<span class="hljs-number">0.0200</span>, <span class="hljs-number">0.9800</span>]], grad_fn=&lt;RoundBackward1&gt;)
</code></pre>
<p>We see that the trained network correctly predicted X, as the second output value, representing X, is larger than the first output value, representing O.</p>
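<p>Instead of reading off the rounded probabilities by eye, we can map the softmax output straight to a class label with <code>torch.argmax()</code>. A minimal sketch with a made-up probability tensor (the class names and values here are illustrative, not the model's actual output):</p>

```python
import torch

# Hypothetical softmax output for one image: [P(O), P(X)] -- illustrative values.
probabilities = torch.tensor([[0.02, 0.98]])

# torch.argmax(dim=1) returns the index of the largest value in each row.
class_names = ["O", "X"]
predicted_index = torch.argmax(probabilities, dim=1).item()
print(class_names[predicted_index])  # X
```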
]]></content:encoded></item><item><title><![CDATA[Single Layer Neural Network Using PyTorch]]></title><description><![CDATA[In this article, we will explore the Iris flower dataset, a well-known and historically significant dataset in the field of machine learning. Originally introduced by the statistician Ronald Fisher in 1936, this dataset has been widely used for class...]]></description><link>https://path2ml.com/single-layer-neural-network-using-pytorch</link><guid isPermaLink="true">https://path2ml.com/single-layer-neural-network-using-pytorch</guid><category><![CDATA[Machine Learning]]></category><category><![CDATA[pytorch]]></category><category><![CDATA[neural networks]]></category><category><![CDATA[Deep Learning]]></category><category><![CDATA[DeepLearning]]></category><category><![CDATA[MachineLearning]]></category><dc:creator><![CDATA[Nitin Sharma]]></dc:creator><pubDate>Wed, 23 Jul 2025 01:29:44 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1753228222806/fbaa6065-0648-425a-8876-66c068599ec4.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>In this article, we will explore the Iris flower dataset, a well-known and historically significant dataset in the field of machine learning. Originally introduced by the statistician Ronald Fisher in 1936, this dataset has been widely used for classification tasks. It consists of 150 samples from three different species of Iris flowers—<strong>Setosa, Versicolor, and Virginica</strong>—each characterized by four features: sepal length, sepal width, petal length, and petal width.</p>
<p>We will utilize the PyTorch framework to develop a classification model that can accurately identify the species of Iris flowers based on these features. Throughout the article, we will walk through the process step by step, from data loading and preprocessing to building, training, and evaluating our model. By the end, you will have a solid understanding of how to apply machine learning techniques to this dataset using PyTorch.</p>
<p>We will utilize the Lightning framework of <strong>PyTorch</strong>, which simplifies the process of building and training deep learning models. This framework provides a high-level interface that promotes best practices, enhances code organization, and facilitates efficient model training and testing. By leveraging Lightning, we can focus on developing our model's architecture and experiment with different training strategies, while the framework handles the boilerplate code and optimization tasks for us.</p>
<p>We will construct the neural network architecture as illustrated below. The model will take two features as input and consist of one hidden layer that contains two neurons. Each of these neurons will utilize the <strong>ReLU</strong> (Rectified Linear Unit) activation function to introduce non-linearity into the model. The final output layer will be designed to classify the input data into one of three distinct categories. This configuration aims to effectively capture the underlying patterns in the data for accurate classification.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1753228526407/9f9f549a-7228-436b-aa7d-fe7ce55b83a5.png" alt class="image--center mx-auto" /></p>
<h3 id="heading-start-with-installing-lightening-framework">Start by installing the Lightning framework</h3>
<pre><code class="lang-python">%%capture
!pip install lightning
</code></pre>
<h3 id="heading-next-we-import-all-libraries">Next we import all libraries</h3>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> torch <span class="hljs-comment"># torch will allow us to create tensors.</span>
<span class="hljs-keyword">import</span> torch.nn <span class="hljs-keyword">as</span> nn <span class="hljs-comment"># torch.nn allows us to create a neural network.</span>
<span class="hljs-comment"># nn.functional give us access to the activation and loss functions.</span>
<span class="hljs-keyword">import</span> torch.nn.functional <span class="hljs-keyword">as</span> F 
<span class="hljs-keyword">from</span> torch.optim <span class="hljs-keyword">import</span> Adam <span class="hljs-comment"># optim contains many optimizers. This time we're using Adam</span>

<span class="hljs-keyword">import</span> lightning <span class="hljs-keyword">as</span> L <span class="hljs-comment"># lightning has tons of cool tools that make neural networks easier</span>
<span class="hljs-comment"># these are needed for the training data</span>
<span class="hljs-keyword">from</span> torch.utils.data <span class="hljs-keyword">import</span> TensorDataset, DataLoader 

<span class="hljs-keyword">import</span> pandas <span class="hljs-keyword">as</span> pd <span class="hljs-comment"># We'll use pandas to read in the data and normalize it</span>
<span class="hljs-comment"># We'll use this to create training and testing datasets</span>
<span class="hljs-keyword">from</span> sklearn.model_selection <span class="hljs-keyword">import</span> train_test_split 
<span class="hljs-keyword">from</span> sklearn.preprocessing <span class="hljs-keyword">import</span> MinMaxScaler
<span class="hljs-keyword">from</span> sklearn.datasets <span class="hljs-keyword">import</span> load_iris
</code></pre>
<h3 id="heading-we-load-iris-dataset-using-scikit">We load iris dataset using Scikit</h3>
<pre><code class="lang-python">iris = load_iris(as_frame=<span class="hljs-literal">True</span>)
df = iris.data
</code></pre>
<p>The dataset consists of 150 samples total, 50 for each of 3 species of Iris, Setosa, Versicolor, and Virginica.</p>
<pre><code class="lang-python">df.shape
(<span class="hljs-number">150</span>, <span class="hljs-number">4</span>)
</code></pre>
<p>To start our analysis, we need to divide the dataset into training and testing subsets. The first step in this process is to identify and separate the relevant columns into two distinct DataFrames: one for the input values and another for the labels.</p>
<p>The first DataFrame, which we will name "<strong>input_values</strong>," will contain the features that we will use to make our predictions. Specifically, this DataFrame will include the measurements of the petal and sepal widths, which are critical for our predictive model.</p>
<p>The second DataFrame, labeled "<strong>label_values</strong>," will hold the target variable we aim to predict. This DataFrame will consist of the species classifications, which will allow us to assess the accuracy and effectiveness of our predictions once the model is trained.</p>
<p>By clearly defining these two DataFrames, we set the foundation for an organized approach to model training and evaluation.</p>
<p>In this example, we will keep the neural network simple by using only the values for petal width and sepal width as inputs. First, we'll ensure we can correctly isolate the columns we want from those we don't need. To do this, we will pass the DataFrame (df) a list of the column names we want to retrieve values for: <code>['petal width (cm)', 'sepal width (cm)']</code>.</p>
<pre><code class="lang-python">input_values = df[[<span class="hljs-string">'petal width (cm)'</span>, <span class="hljs-string">'sepal width (cm)'</span>]]
label_values = iris.target
</code></pre>
<p>Using the pandas <strong>factorize</strong>() function, you get two outputs: an array of numeric codes with the same shape as your input, and an array of the unique values each code corresponds to.</p>
<pre><code class="lang-python">classes_as_numbers, classes = label_values.factorize()
</code></pre>
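<p>To see what <code>factorize()</code> returns, here is a small sketch on a toy Series standing in for the species labels (the values are made up for illustration, not the real Iris data):</p>

```python
import pandas as pd

# A toy Series standing in for the species labels.
species = pd.Series(["setosa", "versicolor", "setosa", "virginica"])

# Codes are assigned in order of first appearance: setosa=0, versicolor=1, virginica=2.
codes, uniques = species.factorize()
print(list(codes))    # [0, 1, 0, 2]
print(list(uniques))  # ['setosa', 'versicolor', 'virginica']
```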
<p>We will separate the variables, specifically <code>input_values</code> and <code>classes_as_numbers</code>, into distinct training and testing datasets. This process is essential for building a robust machine learning model and helps us evaluate its performance effectively. To accomplish this, we will utilize the <code>train_test_split()</code> function from the <code>sklearn</code> library. This function allows us to randomly partition our data, ensuring that we have a subset for training the model and a separate subset for testing its accuracy and reliability.</p>
<pre><code class="lang-python">input_train, input_test, label_train, label_test = train_test_split(input_values,
                                                                    classes_as_numbers,
                                                                    test_size=<span class="hljs-number">0.25</span>,
                                                                    stratify=classes_as_numbers)
</code></pre>
<pre><code class="lang-python">input_train.shape
(<span class="hljs-number">112</span>, <span class="hljs-number">2</span>)
input_test.shape
(<span class="hljs-number">38</span>, <span class="hljs-number">2</span>)
</code></pre>
<p>Since our neural network has three outputs, one for each species (as illustrated in the drawing of the neural network above), we need to convert the numbers in <code>label_train</code> into arrays with three elements. Each element in the array corresponds to a specific output of the neural network. We will use the following encoding: [1.0, 0.0, 0.0] for Setosa, [0.0, 1.0, 0.0] for Versicolor, and [0.0, 0.0, 1.0] for Virginica. The good news is that we can easily perform this one-hot encoding. Additionally, we'll use <code>type(torch.float32)</code> to ensure that the numbers are stored in the correct format for efficient processing by the neural network.</p>
<pre><code class="lang-python">one_hot_label_train = F.one_hot(torch.tensor(label_train)).type(torch.float32)
</code></pre>
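<p>On a small tensor of class indices, <code>F.one_hot()</code> produces exactly the three-element arrays described above. A quick sketch (the toy indices are ours, not taken from the training split):</p>

```python
import torch
import torch.nn.functional as F

# Class indices 0, 1, 2 correspond to Setosa, Versicolor, and Virginica.
labels = torch.tensor([0, 1, 2])

# one_hot() infers the number of classes from the largest index;
# .type(torch.float32) converts the result for use as network targets.
one_hot = F.one_hot(labels).type(torch.float32)
print(one_hot)
# tensor([[1., 0., 0.],
#         [0., 1., 0.],
#         [0., 0., 1.]])
```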
<p>To enhance the effectiveness of our machine learning models, it is important to normalize the input variables so that their values fall within a range of 0 to 1. Normalization standardizes the data, ensuring that all features contribute equally during the training process. This scaling helps to improve the model's convergence and overall performance. To achieve this, we will utilize the MinMaxScaler, a tool provided by the scikit-learn library, which efficiently transforms the data by adjusting the minimum and maximum values accordingly.</p>
<pre><code class="lang-python"><span class="hljs-comment"># Initialize the scaler</span>
scaler = MinMaxScaler()
input_train_normalized = scaler.fit_transform(input_train)
input_test_normalized = scaler.transform(input_test) <span class="hljs-comment"># use transform(), not fit_transform(), so the test data reuses the training min/max</span>
</code></pre>
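<p>Under the hood, <code>MinMaxScaler</code> applies <code>(x - min) / (max - min)</code> column by column. A minimal sketch on toy one-feature data (the values are made up, not the Iris measurements):</p>

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy 1-feature data: min=2.0, max=4.0.
data = np.array([[2.0], [3.0], [4.0]])

scaler = MinMaxScaler()
scaled = scaler.fit_transform(data)  # (x - min) / (max - min)
print(scaled.ravel())  # [0.  0.5 1. ]

# transform() reuses the min/max learned during fitting, which is why
# test data should only be transformed, never re-fit.
print(scaler.transform(np.array([[3.5]])).ravel())  # [0.75]
```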
<p>To effectively train our neural network, we need to organize our training data into a DataLoader. DataLoaders are particularly useful for handling large datasets, as they facilitate the processing of data in manageable batches. This approach not only allows us to shuffle the dataset at the beginning of each epoch, enhancing the training process by reducing potential overfitting, but it also lets us work with a smaller subset of the data if we're aiming for a quick, preliminary run—perfect for debugging our code.</p>
<p>To start, we will convert our training inputs, <code>input_train</code>, into PyTorch tensors using the function <code>torch.tensor()</code>. This step is crucial because neural networks in PyTorch operate with tensors.</p>
<p>Once we have our input data in tensor format, we'll combine <code>input_train</code> with our labels, <code>one_hot_label_train</code>, to form a <code>TensorDataset</code>. This dataset acts as a wrapper that pairs our inputs with their corresponding labels, ensuring that during training, the model learns from the correct label for each input.</p>
<p>Finally, we'll use the <code>TensorDataset</code> to create the DataLoader. By doing so, we can specify parameters such as batch size and whether we would like to shuffle the data. With everything set up in this manner, the DataLoader will streamline the process of feeding data to our neural network during training, enhancing both efficiency and ease of use.</p>
<pre><code class="lang-python"><span class="hljs-comment">## Convert the DataFrame input_train into tensors</span>
input_train_tensors = torch.tensor(input_train.values).type(torch.float32)
<span class="hljs-comment"># Convert the DataFrame input_test into tensors</span>
input_test_tensors = torch.tensor(input_test.values).type(torch.float32)
train_dataset = TensorDataset(input_train_tensors, one_hot_label_train)
train_dataloader = DataLoader(train_dataset)
</code></pre>
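<p>Iterating over a DataLoader yields one batch of (inputs, labels) at a time. A small sketch on toy tensors (the shapes mirror our two-feature inputs and three-class one-hot labels, but the values are illustrative):</p>

```python
import torch
from torch.utils.data import TensorDataset, DataLoader

# Toy tensors standing in for the training inputs and one-hot labels.
inputs = torch.arange(12, dtype=torch.float32).reshape(6, 2)
labels = torch.eye(3).repeat(2, 1)

dataset = TensorDataset(inputs, labels)
# batch_size and shuffle are the knobs mentioned above; shuffle=False here
# keeps the iteration order deterministic.
loader = DataLoader(dataset, batch_size=2, shuffle=False)

for batch_inputs, batch_labels in loader:
    print(batch_inputs.shape, batch_labels.shape)  # torch.Size([2, 2]) torch.Size([2, 3])
```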
<p>To build a neural network using PyTorch, you need to create a new class that inherits from <code>LightningModule</code>. This approach makes it easier to train the neural network.</p>
<p>Our new class will include the following methods:</p>
<ol>
<li><p><code>__init__()</code>: This method initializes the weights and biases, as well as manages other housekeeping tasks.</p>
</li>
<li><p><code>forward()</code>: This method performs a forward pass through the neural network.</p>
</li>
<li><p><code>configure_optimizers()</code>: This method sets up the optimizer. Although there are many optimizers available, for this tutorial, we will use the Adam optimizer.</p>
</li>
<li><p><code>training_step()</code>: This method takes the training data, passes it to the <code>forward()</code> method, calculates the loss, and logs the loss values.</p>
</li>
</ol>
<p>By implementing these methods, we will create a functional and efficient neural network ready for training.</p>
<pre><code class="lang-python"><span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">MultipleInsOuts</span>(<span class="hljs-params">L.LightningModule</span>):</span>

  <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">__init__</span>(<span class="hljs-params">self</span>):</span>
    super().__init__()

    L.seed_everything(seed=<span class="hljs-number">42</span>)
    self.input_to_hidden=nn.Linear(in_features=<span class="hljs-number">2</span>,out_features=<span class="hljs-number">2</span>,bias=<span class="hljs-literal">True</span>)
    self.hidden_to_output = nn.Linear(in_features=<span class="hljs-number">2</span>, out_features=<span class="hljs-number">3</span>, bias=<span class="hljs-literal">True</span>)
    self.loss = nn.MSELoss(reduction=<span class="hljs-string">'sum'</span>)

  <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">forward</span>(<span class="hljs-params">self, input</span>):</span>
    <span class="hljs-comment">## First, we run the input values through the linear layer</span>
    <span class="hljs-comment">## feeding the hidden layer...</span>
    hidden = self.input_to_hidden(input)
    <span class="hljs-comment">## ...then through a ReLU activation function,</span>
    <span class="hljs-comment">## and finally through the output layer.</span>
    output_values = self.hidden_to_output(torch.relu(hidden))
    <span class="hljs-keyword">return</span> output_values

  <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">configure_optimizers</span>(<span class="hljs-params">self</span>):</span>
        <span class="hljs-comment">## configuring the optimizer</span>
        <span class="hljs-comment">## consists of passing it the weights and biases we want</span>
        <span class="hljs-comment">## to optimize, which are all in self.parameters(),</span>
        <span class="hljs-comment">## and setting the learning rate with lr=0.001.</span>
        <span class="hljs-keyword">return</span> Adam(self.parameters(), lr=<span class="hljs-number">0.001</span>)

  <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">training_step</span>(<span class="hljs-params">self, batch, batch_idx</span>):</span>
        <span class="hljs-comment">## The first thing we do is split 'batch'</span>
        <span class="hljs-comment">## into the input and label values.</span>
        inputs, labels = batch

        <span class="hljs-comment">## Then we run the input through the neural network</span>
        outputs = self.forward(inputs)

        <span class="hljs-comment">## Then we calculate the loss.</span>
        loss = self.loss(outputs, labels)

        <span class="hljs-keyword">return</span> loss
</code></pre>
<p>Training our new neural network involves creating a model from the new class, MultipleInsOuts.</p>
<pre><code class="lang-python">model = MultipleInsOuts()
</code></pre>
<pre><code class="lang-python">INFO: Seed set to <span class="hljs-number">42</span>
INFO:lightning.fabric.utilities.seed:Seed set to <span class="hljs-number">42</span>
</code></pre>
<p>We will develop a Lightning Trainer, referred to as L.Trainer, aimed at optimizing our model parameters. The training process will commence with an initial setting of 100 epochs. This approach allows us to thoroughly evaluate and adjust the model's performance over multiple iterations, ensuring that we can refine our techniques and achieve better accuracy in our results.</p>
<pre><code class="lang-python">trainer = L.Trainer(max_epochs=<span class="hljs-number">100</span>)
trainer.fit(model, train_dataloaders=train_dataloader)
</code></pre>
<h3 id="heading-lets-test-using-test-data">Let's test using the test data</h3>
<pre><code class="lang-python"><span class="hljs-comment"># Run the input_test_tensors through the neural network</span>
predictions = model(input_test_tensors)

<span class="hljs-comment">## Select the output with highest value...</span>
predicted_labels = torch.argmax(predictions, dim=<span class="hljs-number">1</span>) <span class="hljs-comment">## dim=1 takes the argmax across each row's columns</span>

torch.sum(torch.eq(torch.tensor(label_test), predicted_labels)) / len(predicted_labels)
</code></pre>
<pre><code class="lang-python">tensor(<span class="hljs-number">0.8947</span>)
</code></pre>
<p>We get <strong>89% Accuracy</strong></p>
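<p>The manual accuracy computation above can be cross-checked against scikit-learn's <code>accuracy_score</code>. A sketch with toy label tensors (illustrative values, not the actual test-set results):</p>

```python
import torch
from sklearn.metrics import accuracy_score

# Toy true/predicted labels: 4 of 5 predictions match, so accuracy is 0.8.
true_labels = torch.tensor([0, 1, 2, 1, 0])
predicted_labels = torch.tensor([0, 1, 2, 2, 0])

# The same computation as torch.sum(torch.eq(...)) / len(...) above.
manual = torch.sum(torch.eq(true_labels, predicted_labels)) / len(predicted_labels)
print(manual)  # tensor(0.8000)
print(accuracy_score(true_labels.numpy(), predicted_labels.numpy()))  # 0.8
```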
<p>With our model now trained, we can use it to make predictions from new data. This is achieved by passing the model a tensor that includes normalized petal and sepal widths.</p>
<p>The Jupyter notebook is available on GitHub at <a target="_blank" href="https://github.com/learner14/MachineLearning/tree/main/iris_PyTorch">iris_PyTorch</a>.</p>
]]></content:encoded></item><item><title><![CDATA[Linear Regression]]></title><description><![CDATA[Machine learning is an evolving domain within the field of artificial intelligence (AI) that focuses on the development of algorithms capable of learning from data. These algorithms empower systems to improve their performance on tasks over time thro...]]></description><link>https://path2ml.com/linear-regression</link><guid isPermaLink="true">https://path2ml.com/linear-regression</guid><category><![CDATA[Machine Learning]]></category><category><![CDATA[linearregression]]></category><category><![CDATA[Gradient-Descent ]]></category><dc:creator><![CDATA[Nitin Sharma]]></dc:creator><pubDate>Wed, 25 Jun 2025 14:43:39 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1750862485515/47f33022-ab43-4e6a-8653-fd3b47dde67e.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Machine learning is an evolving domain within the field of artificial intelligence (AI) that focuses on the development of algorithms capable of learning from data. These algorithms empower systems to improve their performance on tasks over time through experience, primarily by recognizing patterns and making data-driven predictions. Among the diverse array of techniques employed in machine learning, linear regression emerges as a fundamental statistical approach extensively used for analyzing relationships among variables.</p>
<p>At its core, linear regression examines the connection between a dependent variable—often referred to as the outcome we aim to predict—and one or multiple independent variables, which are the factors believed to influence this outcome. This technique is particularly powerful in situations where the relationship between variables can be approximated by a straight line.</p>
<p>To implement linear regression, a mathematical line is fitted to the data points in such a manner that the distance between the line and the actual data points is minimized. This method, known as the least squares method, optimally determines the line that best represents the data. The equation of this line can be expressed as</p>
<p>\(\textbf {Y=mX+c}\)</p>
<p>Where (Y) is the dependent variable, (X) is the independent variable, (m) denotes the slope of the line indicating how changes in (X) impact (Y), and (c) represents the y-intercept.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1750799979935/ba4123a9-c328-4f5d-88cd-60e49802dbb6.gif" alt class="image--center mx-auto" /></p>
<p>By effectively drawing this line, linear regression not only aids in forecasting and predicting outcomes but also facilitates a deeper understanding of trends in data. It can be utilized to draw insights that inform decision-making processes across various fields, such as economics, where it can elucidate market trends; biology, for exploring relationships between biological factors; engineering, for modeling processes and behaviors; and social sciences, for analyzing societal trends and implications.</p>
<p>A linear model generates predictions by first assessing each input feature, which represents a specific characteristic of the data. It calculates a weighted sum, where each feature is multiplied by a corresponding weight that reflects its importance in the prediction process. To refine this computation, a constant value known as the bias term is added. This bias term allows the model to adjust its predictions to better fit the observed data, ensuring more accurate outcomes.</p>
<p>The model function for linear regression is represented as</p>
<p>\(\textbf  f_{w,b}(x) = \textbf {wx + b}\)</p>
<p>Where \(w\) is weights and \(b\) is bias</p>
<p>For a multivariate problem with many features, this can be defined as</p>
<p>\(\hat{y}= w_0+w_1x_1+w_2x_2+\dots+w_nx_n\)</p>
<p>\(\hat{y}\) is the predicted value, \(w_0\) is the bias term, and \(n\) is the number of features.</p>
<p>Or we can say \(\hat{y}= w.x\)</p>
<ul>
<li><p>\(w\) is the model’s parameter vector, containing the bias term \(w_0\) and the feature weights \(w_1\) to \(w_n\) .</p>
</li>
<li><p>x is the instance’s feature vector, containing \(x_1\) to \(x_n\).</p>
</li>
<li><p>\(w.x\) is the dot product of the vectors \(w\) and \(x\).</p>
</li>
</ul>
<p>To evaluate the effectiveness of various pairs of parameters \(\textbf {(w,b)}\) in a linear regression model, we employ a cost function represented as \(\textbf {J(w,b)}\). This function plays a crucial role in measuring the performance of the selected parameters by quantifying the discrepancy between the predicted outcomes generated by the linear model and the actual target values observed in the dataset.</p>
<p>In more detail, the cost function typically calculates <strong>the sum of the squared differences between the predicted values (obtained from the linear equation defined by (w) and (b)) and the true values</strong> from the data. This quantification allows us to assess how accurately the linear model is able to predict outcomes based on the input features.</p>
<p>By systematically varying the parameters (w) (the weights) and (b) (the bias) and computing the corresponding values of the cost function (J(w,b)), we can analyze and compare the performance for different combinations. The ultimate goal of this assessment is to identify the pair of parameters that results in the lowest cost, indicating the best fit for the linear regression model. This process is foundational in optimizing the model to achieve the most accurate predictions possible based on the input data.</p>
<p>The <strong>cost function</strong> for linear regression \(\textbf {J(w, b) }\) is defined as</p>
<p>\( J(w,b) = \frac{1}{2m} \sum\limits_{i = 0}^{m-1} (f_{w,b}(x^{(i)}) - y^{(i)})^2 \)</p>
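<p>The cost function above translates almost line for line into NumPy. A minimal sketch on made-up data (the helper name <code>cost</code> and the toy values are ours, for illustration only):</p>

```python
import numpy as np

def cost(w, b, x, y):
    """J(w, b) = (1 / 2m) * sum((w*x + b - y)^2), as defined above."""
    m = len(x)
    predictions = w * x + b
    return np.sum((predictions - y) ** 2) / (2 * m)

# Toy data lying exactly on y = 2x + 1, so the cost at (w=2, b=1) is 0.
x = np.array([1.0, 2.0, 3.0])
y = np.array([3.0, 5.0, 7.0])

print(cost(2.0, 1.0, x, y))  # 0.0
print(cost(1.0, 0.0, x, y))  # ((1-3)^2 + (2-5)^2 + (3-7)^2) / 6 = 29/6
```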
<p>To find the optimal values for the parameters \(\textbf {(w,b)}\) that minimize the cost function \(\textbf {J(w, b) }\), one effective method we can use is <strong>gradient descent</strong>. This powerful iterative optimization technique systematically refines the parameter values over time.</p>
<p>The process begins by calculating the gradient of the cost function, which indicates the direction of the steepest increase in cost. By following the negative gradient, essentially moving in the opposite direction, gradient descent gradually adjusts \(\textbf {(w,b)}\) in small steps. Each update is designed to reduce the cost function \(\textbf {(w,b)}\), guiding the parameters toward those values that achieve the lowest cost.</p>
<p>Through repeated iterations, where each step brings us closer to the optimal solution, gradient descent effectively uncovers the best-fitting parameters for your model. This method is not only fundamental in machine learning but also widely applicable in various optimization problems across different fields.</p>
<p>The gradient descent algorithm is:</p>
<p>\(\begin{align*}&amp; \text{repeat until convergence:} \; \lbrace \newline \; &amp; \phantom {0000} b := b - \alpha \frac{\partial J(w,b)}{\partial b} \newline \; &amp; \phantom {0000} w := w - \alpha \frac{\partial J(w,b)}{\partial w}  \; &amp; \newline &amp; \rbrace\end{align*}\)<br />where the parameters</p>
<p>\(\textbf w\) and <strong>b</strong> are updated simultaneously, and where</p>
<p>\(\frac{\partial J(w,b)}{\partial b} = \frac{1}{m} \sum\limits_{i = 0}^{m-1} (f_{w,b}(x^{(i)}) - y^{(i)}) \)</p>
<p>\(\frac{\partial J(w,b)}{\partial w} = \frac{1}{m} \sum\limits_{i = 0}^{m-1} (f_{w,b}(x^{(i)}) -y^{(i)})x^{(i)} \)</p>
<ul>
<li><p><strong>m</strong> is the number of training examples in the dataset</p>
</li>
<li><p>\(f_{w,b}(x^{(i)})\) is the model's prediction, while \(y^{(i)}\) is the target value</p>
</li>
</ul>
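<p>The update rules above can be sketched as a few lines of NumPy. The function name, learning rate, iteration count, and toy data below are all illustrative choices, not prescribed by the algorithm:</p>

```python
import numpy as np

def gradient_descent(x, y, alpha=0.1, iterations=1000):
    """Minimal batch gradient descent for f(x) = w*x + b, following the update rules above."""
    w, b = 0.0, 0.0
    m = len(x)
    for _ in range(iterations):
        error = (w * x + b) - y
        dj_dw = np.sum(error * x) / m  # partial derivative of J w.r.t. w
        dj_db = np.sum(error) / m      # partial derivative of J w.r.t. b
        w -= alpha * dj_dw             # simultaneous update of w and b
        b -= alpha * dj_db
    return w, b

# Toy data generated from y = 2x + 1; the fitted parameters should come out close.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = 2 * x + 1
w, b = gradient_descent(x, y)
print(round(w, 2), round(b, 2))  # 2.0 1.0
```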
<h3 id="heading-deriving-partial-derivative-of-cost-function-for-gradient-descent">Deriving the partial derivatives of the cost function for gradient descent</h3>
<p>Let's derive the gradient descent equations above by taking the partial derivative of \(\textbf {J(w, b)}\) with respect to <strong>w</strong> and <strong>b</strong>.</p>
<p>\(\frac{\partial J(w,b)}{\partial b}=\frac{\partial      \frac{1}{2m} \sum\limits_{i = 0}^{m-1} (f_{w,b}(x^{(i)}) - y^{(i)})^2                }{\partial b}\)</p>
<p>Applying the chain rule of derivative with respect to <strong>b</strong> we get</p>
<p>$$\frac {\partial J(w,b)}{\partial b}= { 2.\frac{1}{2m} \sum\limits_{i = 0}^{m-1} (f_{w,b}(x^{(i)}) - y^{(i)})}={\frac{1}{m} \sum\limits_{i = 0}^{m-1} (f_{w,b}(x^{(i)}) - y^{(i)})}$$</p><p>Applying the chain rule and sum rule of derivative with respect to <strong>w</strong> we get</p>
<p>$$\frac {\partial J(w,b)}{\partial w}= { 2.\frac{1}{2m} \sum\limits_{i = 0}^{m-1} (f_{w,b}(x^{(i)}) - y^{(i)})}x^{(i)}={\frac{1}{m} \sum\limits_{i = 0}^{m-1} (f_{w,b}(x^{(i)}) - y^{(i)})}x^{(i)}$$</p><p>We start the process by assigning random values to the weights, denoted as <strong>w</strong> . This initial step is referred to as random initialization and serves as the starting point for our optimization. Once the weights are initialized, we enter the optimization phase, where we strive to improve these weights iteratively.</p>
<p>During this phase, we take small, measured steps, often referred to as "baby steps", to modify the weights. Each step involves calculating the gradient of the cost function, such as the Mean Squared Error (MSE), with respect to the weights. The gradient tells us the direction in which to adjust <strong>w</strong> in order to minimize the cost function.</p>
<p>We continue this iterative process of adjusting the weights and evaluating the cost function until the algorithm converges, meaning that further adjustments result in negligible changes to the cost function. This convergence indicates that we have reached a minimum point, where the weights are optimized for our model.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1750861986002/7bdd74bf-9d32-4a51-8f3d-2fa164b13aaa.gif" alt class="image--center mx-auto" /></p>
<p>When the learning rate \( \alpha\) is set excessively high, it can cause the optimization process to overshoot the minimum point in the loss landscape, akin to a ball that rolls too far down one side of a valley and ends up on the opposite slope. Instead of homing in on the optimal solution, the algorithm may land at a point with a higher error value than it started with. This misstep can lead to divergence, where the values of the solution continue to increase uncontrollably, ultimately preventing the algorithm from converging on a suitable and effective solution.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1750862430855/78edb6db-589e-4a8d-90c9-22b3ebc719dd.gif" alt class="image--center mx-auto" /></p>
]]></content:encoded></item><item><title><![CDATA[Performance Measures for Classification model using Scikit]]></title><description><![CDATA[Performance metrics are essential tools for assessing the effectiveness and reliability of classification machine learning models. These metrics provide a structured and quantitative approach to evaluate how accurately a model can assign data points ...]]></description><link>https://path2ml.com/performance-measures-for-classification-model-using-scikit</link><guid isPermaLink="true">https://path2ml.com/performance-measures-for-classification-model-using-scikit</guid><category><![CDATA[Machine Learning]]></category><category><![CDATA[Deep Learning]]></category><dc:creator><![CDATA[Nitin Sharma]]></dc:creator><pubDate>Sat, 21 Jun 2025 22:21:54 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1750520176307/e9d44a9d-6b32-4348-926d-d0e3d6f25a22.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Performance metrics are essential tools for assessing the effectiveness and reliability of classification machine learning models. These metrics provide a structured and quantitative approach to evaluate how accurately a model can assign data points to specific, predefined categories. A thorough evaluation of a model's performance typically includes a range of measures, each offering unique insights into different aspects of its predictive capabilities. Key metrics include accuracy, precision, recall, F1 score, and the area under the receiver operating characteristic curve (AUC-ROC).</p>
<h2 id="heading-confusion-matrix">Confusion Matrix</h2>
<p>A confusion matrix is a tool used to evaluate the performance of a classification model. It is a table that summarizes the results of the model's predictions compared to the actual outcomes. The matrix typically has four components:</p>
<ol>
<li><p>True Positives (TP): The cases in which the model correctly predicted the positive class.</p>
</li>
<li><p>True Negatives (TN): The cases where the model correctly predicted the negative class.</p>
</li>
<li><p>False Positives (FP): The instances in which the model incorrectly predicted the positive class (also known as Type I error).</p>
</li>
<li><p>False Negatives (FN): The cases where the model failed to predict the positive class but should have (also known as Type II error).</p>
</li>
</ol>
<p>From these four values, various performance metrics can be calculated, such as accuracy, precision, recall, and F1-score, which help in understanding how well the model is performing.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1750521259729/10ad975b-3510-4bd4-ba24-4d4d842e4a0c.png" alt class="image--center mx-auto" /></p>
<p><strong>Accuracy</strong> is often seen as the most straightforward metric. It represents the overall proportion of correct predictions made by the model, combining both true positives (correctly identified positive instances) and true negatives (correctly identified negative instances) relative to the total number of predictions. Although accuracy is a useful starting point, it can be misleading in cases where the dataset is imbalanced — for instance, in scenarios where one class significantly outweighs another. In such cases, a high accuracy rate might mask poor performance in predicting the minority class.</p>
<p><strong>Precision</strong> is another critical metric that specifically focuses on the accuracy of the positive predictions made by the model. It is calculated as the number of true positives divided by the sum of true positives and false positives. High precision is particularly crucial in contexts where the consequences of false positives are high, such as in fraud detection or medical testing, where incorrect positive identifications can lead to unnecessary interventions or alarm.</p>
<p>\(\mathbf {precision= \frac {TP}{TP+FP}}\)</p>
<p><strong>Recall</strong>, also known as sensitivity, measures the model's ability to identify all relevant cases within a dataset. It quantifies this capability by dividing the number of true positives by the sum of true positives and false negatives. High recall values are especially important in areas where the cost of missing a positive case can have severe ramifications, such as in disease screening or safety-critical applications, where overlooking a positive instance could result in dire outcomes.</p>
<p>\(\mathbf {recall= \frac {TP}{TP+FN}}\)</p>
<p>The <strong>F1 score</strong> is a composite measure that serves as the harmonic mean of precision and recall. This metric is particularly beneficial in scenarios where both false positives and false negatives carry significant weight, as it offers a single score that balances both metrics. It becomes particularly important in the context of imbalanced classes, where one class may be much smaller than the other, leading to inflated accuracy metrics that do not faithfully represent model performance.</p>
<p>\(\mathbf {F1= \frac {2 * Precision * Recall}{Precision+Recall}}\)</p>
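<p>To make these definitions concrete, here is a small sketch that computes accuracy, precision, recall, and F1 directly from the four confusion-matrix counts. The counts used below are the ones we will obtain later in this post for the “is it a 5?” classifier.</p>

```python
# Metrics computed by hand from the four confusion-matrix counts
# (these counts are produced later in this post).
TP, TN, FP, FN = 3530, 53892, 687, 1891

accuracy = (TP + TN) / (TP + TN + FP + FN)
precision = TP / (TP + FP)
recall = TP / (TP + FN)
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean

print(round(accuracy, 4), round(precision, 4), round(recall, 4), round(f1, 4))
# → 0.957 0.8371 0.6512 0.7325
```

<p>These hand-computed values match the <code>precision_score</code>, <code>recall_score</code>, and <code>f1_score</code> outputs shown later in this post.</p>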
<p>Finally, the <strong>area under the receiver operating characteristic curve (AUC-ROC)</strong> provides a nuanced perspective on the model's capability to differentiate between classes across various classification thresholds. It plots the true positive rate against the false positive rate, outlining the trade-offs involved in model predictions at differing levels of sensitivity and specificity. A high AUC value indicates that the model is effective at distinguishing between classes, giving practitioners a clear indication of performance across a continuum of potential decision thresholds.</p>
<p>By meticulously analyzing these diverse performance metrics, data scientists and machine learning practitioners can uncover the strengths and weaknesses of their models. This multifaceted evaluation empowers them to make informed adjustments and enhancements to their models, ultimately leading to improved performance and more accurate predictions in real-world applications. This rigorous approach not only enhances model robustness but also fosters a deeper understanding of the models' operational characteristics in various contexts.</p>
<h3 id="heading-lets-analyze-these-metrics-using-an-example-of-a-classification-model">Let’s analyze these metrics using an example of a classification model.</h3>
<p>We will use the MNIST dataset, which is available through scikit-learn. Let’s load the data:</p>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> sklearn.datasets <span class="hljs-keyword">import</span> fetch_openml
<span class="hljs-keyword">import</span> matplotlib.pyplot <span class="hljs-keyword">as</span> plt
mnist = fetch_openml(<span class="hljs-string">'mnist_784'</span>, as_frame=<span class="hljs-literal">False</span>)
mnist.keys()
</code></pre>
<pre><code class="lang-python">dict_keys([<span class="hljs-string">'data'</span>, <span class="hljs-string">'target'</span>, <span class="hljs-string">'frame'</span>, <span class="hljs-string">'categories'</span>, <span class="hljs-string">'feature_names'</span>, <span class="hljs-string">'target_names'</span>, <span class="hljs-string">'DESCR'</span>, <span class="hljs-string">'details'</span>, <span class="hljs-string">'url'</span>])
</code></pre>
<p>Create the features and target labels, then check the shape of the data:</p>
<pre><code class="lang-python">X, y = mnist.data, mnist.target
X.shape
</code></pre>
<pre><code class="lang-python">(<span class="hljs-number">70000</span>, <span class="hljs-number">784</span>)
</code></pre>
<p>Let’s check the first data point.</p>
<pre><code class="lang-python"><span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">plot_digit</span>(<span class="hljs-params">image_data</span>):</span>
    image = image_data.reshape(<span class="hljs-number">28</span>, <span class="hljs-number">28</span>)
    plt.imshow(image, cmap=<span class="hljs-string">"binary"</span>)
    plt.axis(<span class="hljs-string">"off"</span>)

some_digit = X[<span class="hljs-number">0</span>]
plot_digit(some_digit)
plt.show()
</code></pre>
<p>It is the digit 5.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1750536156385/bdb7d3aa-8ade-43f9-a7a3-ecb48c4f9234.png" alt /></p>
<p>We will divide the data into train and test</p>
<pre><code class="lang-python">X_train, X_test, y_train, y_test = X[:<span class="hljs-number">60000</span>], X[<span class="hljs-number">60000</span>:], y[:<span class="hljs-number">60000</span>], y[<span class="hljs-number">60000</span>:]
</code></pre>
<p>In order to effectively display all performance metrics, we will develop a binary classifier. This classifier will categorize the labels by assigning a value of true when the digit is 5 and a value of false for all other digits. This approach will allow us to analyze the model's ability to correctly identify the presence of the digit 5 compared to other digits.</p>
<pre><code class="lang-python">y_train_5 = (y_train == <span class="hljs-string">'5'</span>)  <span class="hljs-comment"># True for all 5s, False for all other digits</span>
y_test_5 = (y_test == <span class="hljs-string">'5'</span>)
</code></pre>
<p>We will be implementing a stochastic gradient descent (SGD) classifier, which is a powerful and efficient approach for optimizing our machine learning model. This method works by updating the model's parameters incrementally, using randomly selected subsets of the training data known as mini-batches. By doing so, we can navigate the loss function more effectively, allowing us to refine our model's performance while also reducing the computational burden typically associated with processing the entire dataset at once. This iterative process helps us find the optimal weights for our model, ultimately leading to better predictions.</p>
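<p>As a rough illustration of the incremental updates described above, here is a minimal, hypothetical sketch of a single SGD step for a linear model with squared loss. The learning rate and data are made up for illustration; <code>SGDClassifier</code> handles all of this internally.</p>

```python
import numpy as np

# One stochastic-gradient-descent step for a linear model with squared loss.
rng = np.random.default_rng(0)
w = np.zeros(3)                 # model weights
lr = 0.01                       # learning rate (hypothetical value)

x_i = rng.standard_normal(3)    # one randomly drawn training instance
y_i = 1.0                       # its target value

pred = w @ x_i                  # current prediction
grad = (pred - y_i) * x_i       # gradient of 0.5 * (pred - y_i)**2 w.r.t. w
w = w - lr * grad               # nudge the weights toward the target
```

<p>Repeating this step over many randomly drawn instances (or mini-batches) is what lets SGD refine the weights without ever processing the whole dataset at once.</p>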
<pre><code class="lang-python"><span class="hljs-keyword">from</span> sklearn.linear_model <span class="hljs-keyword">import</span> SGDClassifier

sgd_clf = SGDClassifier(random_state=<span class="hljs-number">42</span>)
sgd_clf.fit(X_train, y_train_5)
</code></pre>
<p>After fitting the model, let’s test it on the first digit by calling the <code>predict</code> function.</p>
<pre><code class="lang-python">sgd_clf.predict([some_digit])
array([ <span class="hljs-literal">True</span>])
</code></pre>
<p>In order to assess the accuracy of our model, we will utilize a technique known as cross-validation. Specifically, we will employ the <code>cross_val_score</code> function from the scikit-learn library. This function allows us to evaluate the performance of our model by splitting the dataset into multiple subsets, training the model on some of these subsets, and validating it on the remaining ones. By repeating this process several times, we can obtain a more reliable estimate of the model's accuracy.</p>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> sklearn.model_selection <span class="hljs-keyword">import</span> cross_val_score

cross_val_score(sgd_clf, X_train, y_train_5, cv=<span class="hljs-number">3</span>, scoring=<span class="hljs-string">"accuracy"</span>)
</code></pre>
<pre><code class="lang-python">array([<span class="hljs-number">0.95035</span>, <span class="hljs-number">0.96035</span>, <span class="hljs-number">0.9604</span> ])
</code></pre>
<p>It gives an accuracy above 95%, which looks good, but accuracy can be misleading on an imbalanced dataset like this one, where only about 10% of the images are 5s. To understand the model better, we will create a confusion matrix and analyze the other performance metrics.</p>
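<p>To see why accuracy alone can mislead here, consider that only about 10% of the training images are 5s, so a classifier that never predicts “5” is still right about 90% of the time. A small sketch, with synthetic labels standing in for <code>y_train_5</code>:</p>

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Synthetic stand-in for y_train_5: about 10% positives, like the 5s in MNIST.
y = np.zeros(60_000, dtype=bool)
y[:6_000] = True
X = np.zeros((60_000, 1))            # features are ignored by the dummy

never_5 = DummyClassifier(strategy="most_frequent")  # always predicts False
never_5.fit(X, y)
print(never_5.score(X, y))           # → 0.9  (90% accuracy, yet 0% recall)
```

<p>High accuracy with zero recall is exactly the failure mode the remaining metrics are designed to expose.</p>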
<p>We will utilize the <code>cross_val_predict</code> method from the scikit-learn library to generate predicted values based on our model. This method allows us to perform cross-validation and provides a way to obtain predictions for each data point in our dataset by training the model multiple times on different subsets of the data. This approach helps ensure that we get a more accurate estimate of the model's performance.</p>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> sklearn.model_selection <span class="hljs-keyword">import</span> cross_val_predict

y_train_pred = cross_val_predict(sgd_clf, X_train, y_train_5, cv=<span class="hljs-number">3</span>)
</code></pre>
<p>We will utilize the <code>confusion_matrix</code> function provided by the scikit-learn library, which allows us to evaluate the performance of our classification model by comparing the predicted classifications to the actual outcomes. This function generates a matrix that summarizes the correct and incorrect predictions, offering insights into the model's accuracy and error types.</p>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> sklearn.metrics <span class="hljs-keyword">import</span> confusion_matrix

cm = confusion_matrix(y_train_5, y_train_pred)
cm
</code></pre>
<pre><code class="lang-python">array([[<span class="hljs-number">53892</span>,   <span class="hljs-number">687</span>],
       [ <span class="hljs-number">1891</span>,  <span class="hljs-number">3530</span>]])
</code></pre>
<h3 id="heading-tn53892-fp687-fn1891-and-tp-3530">TN=53892, FP=687, FN=1891, and TP=3530</h3>
<h3 id="heading-lets-calculate-the-precision">Let’s calculate the <strong>precision</strong></h3>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> sklearn.metrics <span class="hljs-keyword">import</span> precision_score, recall_score

precision_score(y_train_5, y_train_pred)
</code></pre>
<pre><code class="lang-python"><span class="hljs-number">0.8370879772350012</span>
</code></pre>
<h3 id="heading-recall">Recall</h3>
<pre><code class="lang-python">recall_score(y_train_5, y_train_pred)
</code></pre>
<pre><code class="lang-python"><span class="hljs-number">0.6511713705958311</span>
</code></pre>
<h3 id="heading-f1-score">F1 score</h3>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> sklearn.metrics <span class="hljs-keyword">import</span> f1_score

f1_score(y_train_5, y_train_pred)
</code></pre>
<pre><code class="lang-python"><span class="hljs-number">0.7325171197343847</span>
</code></pre>
<p>We will utilize the <code>precision_recall_curve</code> function from the scikit-learn library. First, we call <code>cross_val_predict</code> with <code>method="decision_function"</code> so that, instead of class predictions, we obtain a decision <strong>score</strong> for every instance in the dataset. From these scores we can calculate precision and recall at various <strong>threshold</strong> levels, which is crucial for understanding how well the model distinguishes between the classes and for identifying the threshold that best balances precision and recall.</p>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> sklearn.metrics <span class="hljs-keyword">import</span> precision_recall_curve

y_scores = cross_val_predict(sgd_clf, X_train, y_train_5, cv=<span class="hljs-number">3</span>,
                             method=<span class="hljs-string">"decision_function"</span>)
precisions, recalls, thresholds = precision_recall_curve(y_train_5, y_scores)
threshold = <span class="hljs-number">3000</span>
</code></pre>
<h3 id="heading-we-will-plot-the-precision-recall-and-threshold">We will plot the precision and recall against the threshold</h3>
<pre><code class="lang-python">plt.figure(figsize=(<span class="hljs-number">8</span>, <span class="hljs-number">4</span>))  <span class="hljs-comment"># extra code – it's not needed, just formatting</span>
plt.plot(thresholds, precisions[:<span class="hljs-number">-1</span>], <span class="hljs-string">"b--"</span>, label=<span class="hljs-string">"Precision"</span>, linewidth=<span class="hljs-number">2</span>)
plt.plot(thresholds, recalls[:<span class="hljs-number">-1</span>], <span class="hljs-string">"g-"</span>, label=<span class="hljs-string">"Recall"</span>, linewidth=<span class="hljs-number">2</span>)
plt.vlines(threshold, <span class="hljs-number">0</span>, <span class="hljs-number">1.0</span>, <span class="hljs-string">"k"</span>, <span class="hljs-string">"dotted"</span>, label=<span class="hljs-string">"threshold"</span>)

<span class="hljs-comment"># extra code – this section just beautifies and saves Figure 3–5</span>
idx = (thresholds &gt;= threshold).argmax()  <span class="hljs-comment"># first index ≥ threshold</span>
plt.plot(thresholds[idx], precisions[idx], <span class="hljs-string">"bo"</span>)
plt.plot(thresholds[idx], recalls[idx], <span class="hljs-string">"go"</span>)
plt.axis([<span class="hljs-number">-50000</span>, <span class="hljs-number">50000</span>, <span class="hljs-number">0</span>, <span class="hljs-number">1</span>])
plt.grid()
plt.xlabel(<span class="hljs-string">"Threshold"</span>)
plt.legend(loc=<span class="hljs-string">"center right"</span>)

plt.show()
</code></pre>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1750541781138/88dbbf53-2289-4f20-a533-c47a53857cd7.png" alt /></p>
<p>Precision reaches 90% at roughly 50% recall.</p>
<h3 id="heading-receiver-operating-characteristic-roc">Receiver Operating Characteristic (ROC)</h3>
<p>The Receiver Operating Characteristic (ROC) curve is an important tool used for evaluating the performance of binary classifiers. It visually represents the trade-off between the True Positive Rate (TPR), also known as sensitivity, and the False Positive Rate (FPR). TPR indicates the proportion of actual positive cases that are correctly identified by the model, while FPR reflects the proportion of actual negative cases that are incorrectly classified as positive. Additionally, the True Negative Rate (TNR), which is also called specificity, measures the model’s ability to correctly identify negative cases. The ROC curve essentially plots TPR against 1 minus specificity, providing a graphical representation of the classifier's performance across various threshold settings. This allows for a comprehensive assessment of the model's strengths and weaknesses in distinguishing between the two classes.</p>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> sklearn.metrics <span class="hljs-keyword">import</span> roc_curve

fpr, tpr, roc_thresholds = roc_curve(y_train_5, y_scores)
</code></pre>
<h3 id="heading-lets-get-a-threshold-value-for-a-precision-of-90-by-using-argmax-function">Let’s get the threshold value for a precision of 90% by using the argmax function.</h3>
<p>Note that we use the <code>precisions</code> and <code>thresholds</code> arrays returned earlier by <code>precision_recall_curve</code>, while the ROC thresholds are kept in a separate variable, <code>roc_thresholds</code>, so the two sets of thresholds don’t overwrite each other.</p>
<pre><code class="lang-python">idx_for_90_precision = (precisions &gt;= <span class="hljs-number">0.90</span>).argmax()
threshold_for_90_precision = thresholds[idx_for_90_precision]
threshold_for_90_precision
</code></pre>
<p>Plot the ROC curve:</p>
<pre><code class="lang-python">idx_for_threshold_at_90 = (roc_thresholds &lt;= threshold_for_90_precision).argmax()
tpr_90, fpr_90 = tpr[idx_for_threshold_at_90], fpr[idx_for_threshold_at_90]

plt.figure(figsize=(<span class="hljs-number">6</span>, <span class="hljs-number">5</span>))  <span class="hljs-comment"># extra code – not needed, just formatting</span>
plt.plot(fpr, tpr, linewidth=<span class="hljs-number">2</span>, label=<span class="hljs-string">"ROC curve"</span>)
plt.plot([<span class="hljs-number">0</span>, <span class="hljs-number">1</span>], [<span class="hljs-number">0</span>, <span class="hljs-number">1</span>], <span class="hljs-string">'k:'</span>, label=<span class="hljs-string">"Random classifier's ROC curve"</span>)
plt.plot([fpr_90], [tpr_90], <span class="hljs-string">"ko"</span>, label=<span class="hljs-string">"Threshold for 90% precision"</span>)

plt.text(<span class="hljs-number">0.12</span>, <span class="hljs-number">0.71</span>, <span class="hljs-string">"Higher\nthreshold"</span>, color=<span class="hljs-string">"#333333"</span>)
plt.xlabel(<span class="hljs-string">'False Positive Rate (Fall-Out)'</span>)
plt.ylabel(<span class="hljs-string">'True Positive Rate (Recall)'</span>)
plt.grid()
plt.axis([<span class="hljs-number">0</span>, <span class="hljs-number">1</span>, <span class="hljs-number">0</span>, <span class="hljs-number">1</span>])
plt.legend(loc=<span class="hljs-string">"lower right"</span>, fontsize=<span class="hljs-number">13</span>)
plt.show()
</code></pre>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1750542946343/968d0d1c-a7d6-4a1e-a4a4-2ac0765b65ec.png" alt class="image--center mx-auto" /></p>
<p>The dotted diagonal represents a purely random classifier. A good classifier’s ROC curve stays as far from this diagonal as possible, toward the top-left corner.</p>
<p>To effectively evaluate the performance of a classification model, we can measure the area under the Receiver Operating Characteristic <strong>(ROC)</strong> curve. Scikit-learn conveniently provides a function specifically designed for estimating this area. A perfect <strong>ROC-AUC</strong> score, which indicates flawless model performance, is represented by a value of 1. This score signifies that the model can perfectly distinguish between positive and negative classes.</p>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> sklearn.metrics <span class="hljs-keyword">import</span> roc_auc_score

roc_auc_score(y_train_5, y_scores)
</code></pre>
<pre><code class="lang-python">np.float64(<span class="hljs-number">0.9604938554008616</span>)
</code></pre>
<p>Code for this blog is available at <a target="_blank" href="https://github.com/learner14/MachineLearning/blob/main/performanceMeasures/Performance_Measures.ipynb"><strong>PerformanceMeasures</strong></a></p>
]]></content:encoded></item><item><title><![CDATA[Implementing a neural network using Keras for NCAA college basketball game data.]]></title><description><![CDATA[In this Blog, we embark on an exciting journey as data scientists aiming to predict the outcomes of NCAA college basketball games. Our primary objective is to analyze multiple years' worth of game results, process this data meticulously, and utilize ...]]></description><link>https://path2ml.com/implementing-a-neural-network-using-keras-for-ncaa-college-basketball-game-data</link><guid isPermaLink="true">https://path2ml.com/implementing-a-neural-network-using-keras-for-ncaa-college-basketball-game-data</guid><category><![CDATA[kera]]></category><category><![CDATA[Deep Learning]]></category><category><![CDATA[keras]]></category><category><![CDATA[Machine Learning]]></category><category><![CDATA[neural networks]]></category><category><![CDATA[DeepLearning]]></category><dc:creator><![CDATA[Nitin Sharma]]></dc:creator><pubDate>Mon, 02 Jun 2025 01:08:49 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1748824333059/adf9738c-75bd-4064-a4a9-2f356146b0b1.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>In this Blog, we embark on an exciting journey as data scientists aiming to predict the outcomes of NCAA college basketball games. Our primary objective is to analyze multiple years' worth of game results, process this data meticulously, and utilize it to train a neural network for accurate predictions.</p>
<p>Our overarching goal is to develop a machine learning model that offers us a competitive advantage in predicting game results. Throughout this project, we will navigate the entire life cycle of a machine learning endeavor, which includes:</p>
<ol>
<li><p><strong>Designing a Neural Network</strong>: We will create a robust neural network architecture using Keras, leveraging its powerful capabilities for building and training models.</p>
</li>
<li><p><strong>Training, Testing, and Validation</strong>: We will initiate a comprehensive training process, followed by rigorous testing and validation phases to ensure our model's accuracy and reliability.</p>
</li>
</ol>
<p>By undertaking these steps, we aim to deliver a cutting-edge machine learning project that not only enhances our understanding of data-driven predictions but also equips us with the tools necessary to make informed decisions in the realm of college basketball.</p>
<p>Data for this project can be grabbed from <a target="_blank" href="https://github.com/learner14/DeepLearning/blob/main/BasketballGame/Games-Calculated.csv">Games_Calculated.csv</a>.</p>
<p><strong>Columns</strong> are:</p>
<ol>
<li><p>Date of the game</p>
</li>
<li><p>Home Team</p>
</li>
<li><p>Home Team’s Score</p>
</li>
<li><p>Away Team</p>
</li>
<li><p>Away Team’s Score</p>
</li>
<li><p>Home Team’s Offensive average (points scored) while at home</p>
</li>
<li><p>Home Team’s Defensive average (points given up) while at home</p>
</li>
<li><p>Away Team’s Offensive average while away</p>
</li>
<li><p>Away Team’s Defensive average while away</p>
</li>
<li><p>Score difference from the home team’s perspective</p>
</li>
</ol>
<p>Our primary objective is to thoroughly clean and preprocess the dataset by identifying and rectifying any inconsistencies or errors. This includes standardizing formats, removing duplicates, and addressing missing values to ensure the data is reliable. Additionally, we will identify and eliminate any unnecessary columns that do not contribute to our analytical goals. By refining the dataset in this way, we aim to create a streamlined, normalized version that enhances the learning process, ensuring that the insights derived from our analysis are meaningful and actionable.</p>
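<p>As a hypothetical sketch (the real dataset is handled below), the cleaning steps described above might look like this in pandas on a toy frame with score columns like those in the games file:</p>

```python
import numpy as np
import pandas as pd

# Toy frame mimicking two of the game-file columns, with one exact
# duplicate row and one row containing a missing value.
df = pd.DataFrame({
    "HomeScore": [87.0, 70.0, 70.0, np.nan],
    "AwayScore": [76.0, 50.0, 50.0, 60.0],
})

df = df.drop_duplicates()   # remove exact duplicate rows
df = df.dropna()            # drop rows with missing values
print(len(df))              # → 2
```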
<p>For this project, I utilized <strong>Google Colab</strong> as my development environment. To begin, I mounted my Google Drive to access the data files stored there. This step is crucial as it allows me to work with the dataset directly from my Drive while taking advantage of Colab's computational resources.</p>
<h3 id="heading-mount-the-drive-to-load-data">Mount the drive to load data</h3>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> google.colab <span class="hljs-keyword">import</span> drive
drive.mount(<span class="hljs-string">'/content/drive'</span>)
</code></pre>
<pre><code class="lang-python">Mounted at /content/drive
</code></pre>
<h3 id="heading-load-the-csv-games-file">Load the CSV games file</h3>
<pre><code class="lang-python">game_file = <span class="hljs-string">'/content/drive/MyDrive/Colab_Notebooks/live_project/game/Games-Calculated.csv'</span>
</code></pre>
<h3 id="heading-import-the-libraries-we-will-use-in-this-project">Import the Libraries we will use in this project</h3>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> numpy <span class="hljs-keyword">as</span> np
<span class="hljs-keyword">import</span> pandas <span class="hljs-keyword">as</span> pd
<span class="hljs-keyword">import</span> tensorflow <span class="hljs-keyword">as</span> tf
<span class="hljs-keyword">from</span> tensorflow <span class="hljs-keyword">import</span> keras
<span class="hljs-keyword">import</span> matplotlib.pyplot <span class="hljs-keyword">as</span> plt
</code></pre>
<p>Specify the column names for the dataset and then load the CSV file into a Pandas DataFrame for further analysis.</p>
<pre><code class="lang-python">column_names = [<span class="hljs-string">'Date'</span>,<span class="hljs-string">'HomeTeam'</span>,<span class="hljs-string">'HomeScore'</span>,<span class="hljs-string">'AwayTeam'</span>,<span class="hljs-string">'AwayScore'</span>,
                <span class="hljs-string">'HomeScoreAverage'</span>,<span class="hljs-string">'HomeDefenseAverage'</span>,<span class="hljs-string">'AwayScoreAverage'</span>,<span class="hljs-string">'AwayDefenseAverage'</span>,
                <span class="hljs-string">'Result'</span>]
data = pd.read_csv(game_file,names=column_names)
</code></pre>
<p>Let’s read the first two rows:</p>
<pre><code class="lang-python">data.head(<span class="hljs-number">2</span>)
</code></pre>
<div class="hn-table">
<table>
<thead>
<tr>
<td></td><td><strong>Date</strong></td><td><strong>HomeTeam</strong></td><td><strong>HomeScore</strong></td><td><strong>AwayTeam</strong></td><td><strong>AwayScore</strong></td><td><strong>HomeScoreAverage</strong></td><td><strong>HomeDefenseAverage</strong></td><td><strong>AwayScoreAverage</strong></td><td><strong>AwayDefenseAverage</strong></td><td><strong>Result</strong></td></tr>
</thead>
<tbody>
<tr>
<td><strong>0</strong></td><td>2015-11-13</td><td>Hawaii</td><td>87</td><td>Montana State</td><td>76</td><td>87.0</td><td>76.0</td><td>76.0</td><td>87.0</td><td>11</td></tr>
<tr>
<td><strong>1</strong></td><td>2015-11-13</td><td>Eastern Michigan</td><td>70</td><td>Vermont</td><td>50</td><td>70.0</td><td>50.0</td><td>50.0</td><td>70.0</td><td>20</td></tr>
</tbody>
</table>
</div><p>We will eliminate the columns that are unnecessary for our training process to streamline the dataset and improve the efficiency of our model.</p>
<pre><code class="lang-python">updated_data=data.drop([<span class="hljs-string">'Date'</span>,<span class="hljs-string">'HomeTeam'</span>,<span class="hljs-string">'HomeScore'</span>,<span class="hljs-string">'AwayTeam'</span>,<span class="hljs-string">'AwayScore'</span>], axis=<span class="hljs-number">1</span>)
updated_data.shape
</code></pre>
<pre><code class="lang-python">(<span class="hljs-number">20160</span>, <span class="hljs-number">5</span>)
</code></pre>
<p>So we have 20,160 records and 5 columns: 4 features plus the target label.</p>
<h3 id="heading-splitting-the-train-and-test-data">Splitting the train and test data</h3>
<p>We need to divide the dataset into two parts using an 80:20 ratio. This means that 80% of the data will be allocated for training our model, while the remaining 20% will be reserved for testing its performance. To achieve this, we will utilize the Pandas library, which provides powerful data manipulation tools. We will first load the dataset into a Pandas DataFrame, then use functions to randomly shuffle and split the data accordingly. This ensures that both the training and testing sets are representative samples of the original dataset.</p>
<pre><code class="lang-python">trainX=updated_data.sample(frac=<span class="hljs-number">0.8</span>,random_state=<span class="hljs-number">0</span>)
testX=updated_data.drop(trainX.index)
</code></pre>
<pre><code class="lang-python">trainX.shape
</code></pre>
<pre><code class="lang-python">(<span class="hljs-number">16128</span>, <span class="hljs-number">5</span>)
</code></pre>
<pre><code class="lang-python">testX.shape
</code></pre>
<pre><code class="lang-python">(<span class="hljs-number">4032</span>, <span class="hljs-number">5</span>)
</code></pre>
<p>Currently, we have divided our dataset into two parts: 80% of the data will be used for training our model, while the remaining 20% will serve as the test dataset. Our next step is to create the target variables for both the training and test sets.</p>
<pre><code class="lang-python">trainY=trainX.pop(<span class="hljs-string">'Result'</span>)
testY=testX.pop(<span class="hljs-string">'Result'</span>)
</code></pre>
<h3 id="heading-normalizing-the-data">Normalizing the data</h3>
<p>We will apply data normalization techniques to both the training and testing datasets. Specifically, we will implement <strong>z-score standardization</strong>, which involves converting our data into a standard format. This process will ensure that each feature has a mean of zero and a standard deviation of one, allowing for a more accurate comparison across different scales and distributions. By doing this, we aim to enhance the performance of our machine learning models and improve the overall predictive accuracy.</p>
<pre><code class="lang-python"><span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">z_score_standardization</span>(<span class="hljs-params">df</span>):</span>
    df_scaled = df.copy()
    <span class="hljs-keyword">for</span> column <span class="hljs-keyword">in</span> df.columns:
        df_scaled[column] = (df[column] - df[column].mean()) / df[column].std()
    <span class="hljs-keyword">return</span> df_scaled
</code></pre>
<p>We call the above function for both test and train data to get scaled data.</p>
<pre><code class="lang-python">scaledTrainX=z_score_standardization(trainX)
scaledTestX=z_score_standardization(testX)
</code></pre>
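<p>As a quick sanity check of the standardization (on a toy frame, not the real data), each scaled column should end up with mean roughly 0 and standard deviation roughly 1:</p>

```python
import pandas as pd

# Same formula as z_score_standardization above, applied to a toy frame.
toy = pd.DataFrame({"a": [1.0, 2.0, 3.0, 4.0], "b": [10.0, 20.0, 30.0, 40.0]})
scaled = (toy - toy.mean()) / toy.std()

print(scaled.mean().abs().round(10).tolist())  # → [0.0, 0.0]
print(scaled.std().round(10).tolist())         # → [1.0, 1.0]
```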
<h3 id="heading-building-the-model">Building the Model</h3>
<p>We will develop a sequential model using Keras, which is a high-level neural networks API. The model will consist of two hidden layers, each containing 32 neurons and utilizing the ReLU (Rectified Linear Unit) activation function to introduce non-linearity. This choice allows the model to learn complex patterns in the data.</p>
<p>Following the two hidden layers, we will include an output layer with a single neuron. This layer produces the model’s prediction: the score difference from the home team’s perspective, which makes this a regression task.</p>
<p>To optimize the model’s performance, we will compile it with the RMSprop optimizer, which is effective for training deep learning models, and a mean squared error loss suited to this regression task, tracking mean absolute error and mean squared error during training. (The accuracy metric that appears in the logs below is not meaningful here, since a regression prediction rarely matches the integer score difference exactly.)</p>
<pre><code class="lang-python"><span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">buildModel</span>():</span>
  model=keras.models.Sequential([
      keras.layers.Dense(<span class="hljs-number">32</span>,activation=<span class="hljs-string">'relu'</span>,input_shape=[<span class="hljs-number">4</span>]),
      keras.layers.Dense(<span class="hljs-number">32</span>,activation=<span class="hljs-string">'relu'</span>),
      keras.layers.Dense(<span class="hljs-number">1</span>)
      ])
  model.compile(optimizer=<span class="hljs-string">'rmsprop'</span>,loss=<span class="hljs-string">'mean_squared_error'</span>,metrics=[<span class="hljs-string">'accuracy'</span>,<span class="hljs-string">'MeanAbsoluteError'</span>,<span class="hljs-string">'MeanSquaredError'</span>])

  <span class="hljs-keyword">return</span> model
</code></pre>
<h3 id="heading-create-model-and-call-fit-for-100-epochs">Create the model and call <code>fit</code> for 100 epochs</h3>
<pre><code class="lang-python">model=buildModel()
history = model.fit(scaledTrainX, trainY, epochs=<span class="hljs-number">100</span>, validation_split=<span class="hljs-number">0.2</span>)
</code></pre>
<pre><code class="lang-python"><span class="hljs-number">404</span>/<span class="hljs-number">404</span> ━━━━━━━━━━━━━━━━━━━━ <span class="hljs-number">1</span>s <span class="hljs-number">3</span>ms/step - MeanAbsoluteError: <span class="hljs-number">7.9093</span> - MeanSquaredError: <span class="hljs-number">103.3424</span> - accuracy: <span class="hljs-number">0.0156</span> - loss: <span class="hljs-number">103.3424</span> - val_MeanAbsoluteError: <span class="hljs-number">7.8313</span> - val_MeanSquaredError: <span class="hljs-number">102.0163</span> - val_accuracy: <span class="hljs-number">0.0130</span> - val_loss: <span class="hljs-number">102.0163</span>
Epoch <span class="hljs-number">99</span>/<span class="hljs-number">100</span>
<span class="hljs-number">404</span>/<span class="hljs-number">404</span> ━━━━━━━━━━━━━━━━━━━━ <span class="hljs-number">1</span>s <span class="hljs-number">3</span>ms/step - MeanAbsoluteError: <span class="hljs-number">7.9773</span> - MeanSquaredError: <span class="hljs-number">104.4646</span> - accuracy: <span class="hljs-number">0.0166</span> - loss: <span class="hljs-number">104.4646</span> - val_MeanAbsoluteError: <span class="hljs-number">7.8514</span> - val_MeanSquaredError: <span class="hljs-number">102.4821</span> - val_accuracy: <span class="hljs-number">0.0124</span> - val_loss: <span class="hljs-number">102.4821</span>
Epoch <span class="hljs-number">100</span>/<span class="hljs-number">100</span>
<span class="hljs-number">404</span>/<span class="hljs-number">404</span> ━━━━━━━━━━━━━━━━━━━━ <span class="hljs-number">1</span>s <span class="hljs-number">3</span>ms/step - MeanAbsoluteError: <span class="hljs-number">7.8406</span> - MeanSquaredError: <span class="hljs-number">102.1461</span> - accuracy: <span class="hljs-number">0.0168</span> - loss: <span class="hljs-number">102.1461</span> - val_MeanAbsoluteError: <span class="hljs-number">7.8153</span> - val_MeanSquaredError: <span class="hljs-number">101.8889</span> - val_accuracy: <span class="hljs-number">0.0133</span> - val_loss: <span class="hljs-number">101.8889</span>
</code></pre>
<p><strong>MeanAbsoluteError: 7.8406 - MeanSquaredError: 102.1461 - accuracy: 0.0168 - loss: 102.1461 - val_MeanAbsoluteError: 7.8153 - val_MeanSquaredError: 101.8889 - val_accuracy: 0.0133 - val_loss: 101.8889</strong></p>
<p>Next, we evaluate the model on the test dataset to measure how well it generalizes to unseen data.</p>
<pre><code class="lang-python"><span class="hljs-comment"># evaluate() returns values in compile order: loss, accuracy, MAE, MSE</span>
test_loss, test_acc, mae, mse = model.evaluate(scaledTestX, testY)
</code></pre>
<pre><code class="lang-python"><span class="hljs-number">126</span>/<span class="hljs-number">126</span> ━━━━━━━━━━━━━━━━━━━━ <span class="hljs-number">0</span>s <span class="hljs-number">2</span>ms/step - MeanAbsoluteError: <span class="hljs-number">7.3928</span> - MeanSquaredError: <span class="hljs-number">93.3554</span> - accuracy: <span class="hljs-number">0.0240</span> - loss: <span class="hljs-number">93.3554</span>
</code></pre>
<h3 id="heading-lets-plot-training-and-validation-loss-using-matplotlib">Let's plot the training and validation loss using Matplotlib</h3>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> matplotlib.pyplot <span class="hljs-keyword">as</span> plt

history_dict = history.history
loss_values = history_dict[<span class="hljs-string">"loss"</span>]
val_loss_values = history_dict[<span class="hljs-string">"val_loss"</span>]
epochs = range(<span class="hljs-number">1</span>, len(loss_values) + <span class="hljs-number">1</span>)
plt.plot(epochs, loss_values, <span class="hljs-string">"bo"</span>, label=<span class="hljs-string">"Training loss"</span>)
plt.plot(epochs, val_loss_values, <span class="hljs-string">"r"</span>, label=<span class="hljs-string">"Validation loss"</span>)
plt.title(<span class="hljs-string">"Training and validation loss"</span>)
plt.xlabel(<span class="hljs-string">"Epochs"</span>)
plt.ylabel(<span class="hljs-string">"Loss"</span>)
plt.legend()
plt.show()
</code></pre>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1748826254151/ddf8a1ee-020b-4c07-84ad-865aa09b2985.png" alt class="image--center mx-auto" /></p>
<h3 id="heading-lets-also-plot-meanabsoluteerror">Let's also plot the mean absolute error</h3>
<pre><code class="lang-python">abs_error = history_dict[<span class="hljs-string">"MeanAbsoluteError"</span>]
val_abs_error = history_dict[<span class="hljs-string">"val_MeanAbsoluteError"</span>]
epochs = range(<span class="hljs-number">1</span>, len(abs_error) + <span class="hljs-number">1</span>)
plt.plot(epochs, abs_error, <span class="hljs-string">"bo"</span>, label=<span class="hljs-string">"Training mean absolute error"</span>)
plt.plot(epochs, val_abs_error, <span class="hljs-string">"r"</span>, label=<span class="hljs-string">"Validation mean absolute error"</span>)
plt.title(<span class="hljs-string">"Training and validation mean absolute error"</span>)
plt.xlabel(<span class="hljs-string">"Epochs"</span>)
plt.ylabel(<span class="hljs-string">"Mean absolute error"</span>)
plt.legend()
plt.show()
</code></pre>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1748826333211/6caf1aa3-1ad2-4c3e-be4f-2867d952ba91.png" alt class="image--center mx-auto" /></p>
<p>The Jupyter notebook for this project can be found at <a target="_blank" href="https://github.com/learner14/DeepLearning/tree/main/BasketballGame">Basketball Game Prediction</a>.</p>
]]></content:encoded></item><item><title><![CDATA[Agentic AI Plan and Execute using LangChain]]></title><description><![CDATA[Agentic AI refers to artificial intelligence systems that possess a degree of autonomy and decision-making capability, allowing them to act independently in specific contexts. Unlike traditional AI, which typically follows predetermined rules and alg...]]></description><link>https://path2ml.com/agentic-ai-plan-and-execute-using-langchain</link><guid isPermaLink="true">https://path2ml.com/agentic-ai-plan-and-execute-using-langchain</guid><category><![CDATA[Deep Learning]]></category><category><![CDATA[Machine Learning]]></category><category><![CDATA[agentic AI]]></category><category><![CDATA[langchain]]></category><category><![CDATA[DeepLearning]]></category><category><![CDATA[RAG ]]></category><category><![CDATA[llm]]></category><dc:creator><![CDATA[Nitin Sharma]]></dc:creator><pubDate>Sat, 31 May 2025 18:40:55 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1748712185587/2f26e695-c5ef-4934-8318-d245091e3135.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><strong>Agentic AI</strong> refers to artificial intelligence systems that possess a degree of autonomy and decision-making capability, allowing them to act independently in specific contexts. Unlike traditional AI, which typically follows predetermined rules and algorithms, Agentic AI can evaluate situations, assess potential outcomes, and make choices based on its understanding of the environment and objectives. This includes the ability to adapt to new information and improve its performance over time.</p>
<p><strong>Agentic AI</strong> systems are designed to perform tasks that require a level of judgment and reasoning, effectively enabling them to engage in complex interactions or to solve problems that were not explicitly programmed into them. This concept raises important discussions surrounding ethics, accountability, and the implications of delegating decision-making power to machines, as well as the potential impact on industries such as healthcare, transportation, and robotics.</p>
<p><strong>Plan-and-execute</strong> agents utilize a language model <strong>(LLM)</strong> to develop detailed task plans, which are then carried out by a separate execution agent. This collaborative approach allows for more sophisticated task management, where the LLM generates strategies and instructions while the execution agent focuses on implementing the tasks effectively.</p>
<p>The strategy is composed of two key elements. The first element is a planner, which leverages the reasoning capabilities of large language models <strong>(LLMs)</strong> to develop a comprehensive plan by outlining specific steps. The second element is an executor, responsible for interpreting the steps outlined by the planner. This executor identifies the essential tools, resources, or actions required to successfully carry out each step of the plan. Together, these components work in harmony to ensure effective execution of tasks.</p>
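<p>Before bringing in LangChain, the control flow of the pattern can be sketched in a few lines of plain Python. The hard-coded <code>planner</code> and <code>executor</code> below are stand-ins for what are, in the real system, LLM calls:</p>

```python
# Toy illustration of the plan-and-execute pattern (no LLM involved):
# the "planner" breaks a goal into steps, and the "executor" carries
# out each step in order, accumulating results.

def planner(goal: str) -> list[str]:
    # A real planner would ask an LLM to decompose the goal;
    # here we return a hard-coded plan for illustration.
    return [f"research: {goal}", f"summarize: {goal}", f"report: {goal}"]

def executor(step: str) -> str:
    # A real executor would pick a tool (search, retrieval, ...) per step.
    return f"done({step})"

def plan_and_execute(goal: str) -> list[str]:
    results = []
    for step in planner(goal):
        results.append(executor(step))
    return results

print(plan_and_execute("AI regulations"))  # one result per planned step
```

<p>The real planner produces steps dynamically from the goal, and the real executor chooses among registered tools for each step, but the loop structure is exactly this.</p>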
<p>In this post we will implement the plan-and-execute pattern with the <strong>LangChain</strong> framework. This involves defining the agent's objective, selecting the tools and libraries it can use, wiring it up to a language model, and then monitoring its output and adjusting the setup as needed.</p>
<p>The whole work flow is shown below</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1748714496376/cb38228b-b7e2-49a0-bdb7-81e68d0df2d2.png" alt class="image--center mx-auto" /></p>
<p>We will develop this application as a RESTful endpoint using FastAPI, a modern web framework for building APIs with Python. In a previous blog post, I outlined the process of creating and deploying a FastAPI endpoint <a target="_blank" href="https://path2ml.com/rag-chatbot-using-langchain-and-openai">RAG_CHATBOT</a>, which I recommend checking out for background information.</p>
<p>For handling HTTP requests, we will utilize the popular <code>requests</code> library, which simplifies the process of sending requests and receiving responses. To extract and parse news articles efficiently, we will employ the <code>newspaper</code> package, a powerful tool designed for web scraping and article extraction.</p>
<p>Once we retrieve the articles, we will store them in DeepLake, a vector database optimized for managing embeddings. This method allows us to organize and access the articles in a format that enhances retrieval and analysis, as demonstrated in the accompanying workflow.</p>
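<p>As a mental model of what the vector database does at query time, here is a toy nearest-neighbour search over made-up 3-dimensional "embeddings". Real embeddings such as <code>text-embedding-ada-002</code> have 1536 dimensions, but the ranking logic is the same:</p>

```python
import math

# Each stored document is represented by an embedding vector; a query is
# answered by returning the k documents whose embeddings have the highest
# cosine similarity to the query embedding. Vectors below are invented.

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

store = {
    "article about AI regulation": [0.9, 0.1, 0.0],
    "article about speech models": [0.1, 0.9, 0.1],
    "article about AI scams":      [0.7, 0.2, 0.3],
}

def retrieve(query_vec, k=2):
    ranked = sorted(store, key=lambda doc: cosine(store[doc], query_vec), reverse=True)
    return ranked[:k]

print(retrieve([1.0, 0.0, 0.1]))  # regulation-themed articles rank first
```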
<p>Load the libraries</p>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> fastapi <span class="hljs-keyword">import</span> FastAPI
<span class="hljs-keyword">import</span> os
<span class="hljs-comment">#from langchain.embeddings.openai import OpenAIEmbeddings</span>
<span class="hljs-keyword">from</span> langchain_openai <span class="hljs-keyword">import</span> OpenAIEmbeddings
<span class="hljs-keyword">from</span> langchain_deeplake.vectorstores <span class="hljs-keyword">import</span> DeeplakeVectorStore
<span class="hljs-keyword">from</span> langchain.text_splitter <span class="hljs-keyword">import</span> CharacterTextSplitter, RecursiveCharacterTextSplitter

<span class="hljs-keyword">from</span> langchain_community.document_loaders <span class="hljs-keyword">import</span> SeleniumURLLoader
<span class="hljs-keyword">from</span> langchain_community.document_loaders <span class="hljs-keyword">import</span> WebBaseLoader
<span class="hljs-keyword">from</span> langchain_core.prompts <span class="hljs-keyword">import</span> PromptTemplate
<span class="hljs-keyword">from</span> langchain_openai <span class="hljs-keyword">import</span> ChatOpenAI
<span class="hljs-keyword">import</span> requests
<span class="hljs-keyword">from</span>  newspaper <span class="hljs-keyword">import</span> Article
<span class="hljs-keyword">import</span> time
<span class="hljs-keyword">from</span> langchain_core.tools <span class="hljs-keyword">import</span> Tool
<span class="hljs-keyword">from</span> langchain_experimental.plan_and_execute <span class="hljs-keyword">import</span> PlanAndExecute, load_agent_executor, load_chat_planner
</code></pre>
<h3 id="heading-save-the-openai-and-deeplake-keys">Save the OpenAI and DeepLake keys</h3>
<pre><code class="lang-python">
os.environ[<span class="hljs-string">"OPENAI_API_KEY"</span>] = <span class="hljs-string">'Key_Here'</span>
os.environ[<span class="hljs-string">"ACTIVELOOP_TOKEN"</span>] = <span class="hljs-string">'Key_Here'</span>
</code></pre>
<p>After completing the initial setup, the next step is to create the foundational structure for your FastAPI application. This involves establishing the basic framework that will support your API's functionality, including defining the directory structure, setting up configuration files, and initializing the main application instance. This skeleton will serve as the groundwork for implementing endpoints, integrating middleware, and managing dependencies as you develop your project further.</p>
<h3 id="heading-fastapi">FastAPI</h3>
<pre><code class="lang-python">app=FastAPI(
    title=<span class="hljs-string">"Langchain Server"</span>,
    version=<span class="hljs-string">"1.0"</span>,
    description=<span class="hljs-string">"A simple API Server"</span>

)
</code></pre>
<p>Next we will flesh out the function below, in which the agent calls the LLM and runs the plan-and-execute pattern.</p>
<pre><code class="lang-python"><span class="hljs-meta">@app.get("/chat/")</span>
<span class="hljs-keyword">async</span> <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">root</span>(<span class="hljs-params">query:str</span>):</span>
     <span class="hljs-keyword">return</span> {<span class="hljs-string">"response"</span>: response}
</code></pre>
<h3 id="heading-custom-tool">Custom Tool</h3>
<p>We will develop a function that defines our custom tool within an agentic AI framework, utilizing Langchain. This function will be designed to efficiently retrieve relevant documents from a Deep Lake database. The database will contain previously fetched, parsed, and saved documents, enabling the AI to access and deliver pertinent information based on user queries or tasks. This process will enhance our AI's ability to provide accurate and contextually relevant responses by leveraging the structured data stored in Deep Lake.</p>
<pre><code class="lang-python">
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">retrieve_n_docs_tool</span>(<span class="hljs-params">query: str,</span>) -&gt; str:</span>
    <span class="hljs-string">"""Searches for relevant documents that may contain the answer to the query."""</span>
    embeddings=OpenAIEmbeddings(model=<span class="hljs-string">'text-embedding-ada-002'</span>)
    db = DeeplakeVectorStore(dataset_path=<span class="hljs-string">"./my_deeplake/"</span>, embedding_function=embeddings, overwrite=<span class="hljs-literal">False</span>) <span class="hljs-comment"># overwrite must be False here, otherwise the tool would wipe the stored articles</span>
    <span class="hljs-comment"># Get the retriever object from the deep lake db object and set the number</span>
    <span class="hljs-comment"># of retrieved documents to 3</span>
    retriever = db.as_retriever()
    retriever.search_kwargs[<span class="hljs-string">'k'</span>] = <span class="hljs-number">3</span>
    <span class="hljs-comment"># We define some variables that will be used inside our custom tool</span>
    CUSTOM_TOOL_DOCS_SEPARATOR =<span class="hljs-string">"\n---------------\n"</span> <span class="hljs-comment"># how to join together the retrieved docs to form a single string</span>
    docs = retriever.get_relevant_documents(query)
    texts = [doc.page_content <span class="hljs-keyword">for</span> doc <span class="hljs-keyword">in</span> docs]
    texts_merged = <span class="hljs-string">"---------------\n"</span> + CUSTOM_TOOL_DOCS_SEPARATOR.join(texts) + <span class="hljs-string">"\n---------------"</span>
    <span class="hljs-keyword">return</span> texts_merged
</code></pre>
<p>In the function @app.get("/chat/"), we will implement the code necessary to retrieve content from specified URLs and subsequently store this data in the DeepLake database. To do this, we will first fetch the content from each URL. Next, we will utilize the RecursiveCharacterTextSplitter to divide the retrieved text into manageable chunks. These smaller segments will allow for easier processing and analysis. Finally, we will generate embeddings for each chunk and save them securely in the DeepLake database for future retrieval and use.</p>
<pre><code class="lang-python"> headers = {
        <span class="hljs-string">'User-Agent'</span>: <span class="hljs-string">'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.82 Safari/537.36'</span>
    }   

    article_urls = [
        <span class="hljs-string">"https://www.artificialintelligence-news.com/2023/05/23/meta-open-source-speech-ai-models-support-over-1100-languages/"</span>,
        <span class="hljs-string">"https://www.artificialintelligence-news.com/2023/05/18/beijing-launches-campaign-against-ai-generated-misinformation/"</span>,
        <span class="hljs-string">"https://www.artificialintelligence-news.com/2023/05/16/openai-ceo-ai-regulation-is-essential/"</span>,
        <span class="hljs-string">"https://www.artificialintelligence-news.com/2023/05/15/jay-migliaccio-ibm-watson-on-leveraging-ai-to-improve-productivity/"</span>,
        <span class="hljs-string">"https://www.artificialintelligence-news.com/2023/05/15/iurii-milovanov-softserve-how-ai-ml-is-helping-boost-innovation-and-personalisation/"</span>,
        <span class="hljs-string">"https://www.artificialintelligence-news.com/2023/05/11/ai-and-big-data-expo-north-america-begins-in-less-than-one-week/"</span>,
        <span class="hljs-string">"https://www.artificialintelligence-news.com/2023/05/11/eu-committees-green-light-ai-act/"</span>,
        <span class="hljs-string">"https://www.artificialintelligence-news.com/2023/05/09/wozniak-warns-ai-will-power-next-gen-scams/"</span>,
        <span class="hljs-string">"https://www.artificialintelligence-news.com/2023/05/09/infocepts-ceo-shashank-garg-on-the-da-market-shifts-and-impact-of-ai-on-data-analytics/"</span>,
        <span class="hljs-string">"https://www.artificialintelligence-news.com/2023/05/02/ai-godfather-warns-dangers-and-quits-google/"</span>,
        <span class="hljs-string">"https://www.artificialintelligence-news.com/2023/04/28/palantir-demos-how-ai-can-used-military/"</span>,
        <span class="hljs-string">"https://www.artificialintelligence-news.com/2023/04/26/ftc-chairwoman-no-ai-exemption-to-existing-laws/"</span>,
        <span class="hljs-string">"https://www.artificialintelligence-news.com/2023/04/24/bill-gates-ai-teaching-kids-literacy-within-18-months/"</span>,
        <span class="hljs-string">"https://www.artificialintelligence-news.com/2023/04/21/google-creates-new-ai-division-to-challenge-openai/"</span>
    ]

    session=requests.Session()
    pages_content = [] <span class="hljs-comment"># where we save the scraped articles</span>
    <span class="hljs-keyword">for</span> url <span class="hljs-keyword">in</span> article_urls:
        <span class="hljs-keyword">try</span>:
            time.sleep(<span class="hljs-number">2</span>) <span class="hljs-comment"># sleep two seconds for gentle scraping</span>
            response = session.get(url, headers=headers, timeout=<span class="hljs-number">10</span>)

            <span class="hljs-keyword">if</span> response.status_code == <span class="hljs-number">200</span>:
                article = Article(url)
                article.download() <span class="hljs-comment"># download HTML of webpage</span>
                article.parse() <span class="hljs-comment"># parse HTML to extract the article text</span>
                pages_content.append({ <span class="hljs-string">"url"</span>: url, <span class="hljs-string">"text"</span>: article.text })
            <span class="hljs-keyword">else</span>:
                print(<span class="hljs-string">f"Failed to fetch article at <span class="hljs-subst">{url}</span>"</span>)
        <span class="hljs-keyword">except</span> Exception <span class="hljs-keyword">as</span> e:
            print(<span class="hljs-string">f"Error occurred while fetching article at <span class="hljs-subst">{url}</span>: <span class="hljs-subst">{e}</span>"</span>)

    <span class="hljs-comment">#If an error occurs while fetching an article, we catch the exception and print</span>
    <span class="hljs-comment">#an error message. This ensures that even if one article fails to download,</span>
    <span class="hljs-comment">#the rest of the articles can still be processed.</span>


    embeddings=OpenAIEmbeddings(model=<span class="hljs-string">'text-embedding-ada-002'</span>)

    db = DeeplakeVectorStore(dataset_path=<span class="hljs-string">"./my_deeplake/"</span>, embedding_function=embeddings, overwrite=<span class="hljs-literal">True</span>)

    text_splitter = RecursiveCharacterTextSplitter(chunk_size=<span class="hljs-number">1000</span>, chunk_overlap=<span class="hljs-number">100</span>)
    all_texts = []
    <span class="hljs-keyword">for</span> d <span class="hljs-keyword">in</span> pages_content:
            chunks = text_splitter.split_text(d[<span class="hljs-string">"text"</span>])
            <span class="hljs-keyword">for</span> chunk <span class="hljs-keyword">in</span> chunks:
                all_texts.append(chunk)

    ids = db.add_texts(all_texts)
</code></pre>
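<p>To make the chunking step concrete, here is a minimal pure-Python sketch of fixed-size splitting with overlap. The real <code>RecursiveCharacterTextSplitter</code> is smarter, preferring to break on paragraph and sentence boundaries, but the sliding-window idea is the same:</p>

```python
def split_text(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    # Slide a window of chunk_size characters across the text, stepping by
    # (chunk_size - chunk_overlap) so that consecutive chunks share
    # chunk_overlap characters of context. Assumes chunk_overlap < chunk_size.
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break
    return chunks

chunks = split_text("a" * 2500, chunk_size=1000, chunk_overlap=100)
print(len(chunks))  # 3 chunks, starting at offsets 0, 900, and 1800
```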
<p>We are developing a tool within the Langchain framework that leverages the "retrieve_n_docs_tool" function. This function is designed to access and extract content stored in the DeepLake database, enabling us to efficiently retrieve and utilize relevant information as needed. By implementing this tool, we aim to streamline the process of accessing our data, ensuring that users can quickly obtain the necessary documents for their tasks.</p>
<pre><code class="lang-python">   tools = [
        Tool(
            name=<span class="hljs-string">"Search Private Docs"</span>,
            func=retrieve_n_docs_tool,
            description=<span class="hljs-string">"useful for when you need to answer questions about current events about Artificial Intelligence"</span>
        )
    ]
</code></pre>
<p>We will create a planning agent and an execution agent tailored to our specific dataset. This involves configuring the agents to effectively interpret and process the data, ensuring they can execute tasks efficiently and accurately.</p>
<pre><code class="lang-python">    model = ChatOpenAI(model_name=<span class="hljs-string">"gpt-3.5-turbo"</span>, temperature=<span class="hljs-number">0</span>)

    planner = load_chat_planner(model)
    executor = load_agent_executor(model, tools, verbose=<span class="hljs-literal">True</span>)
    agent = PlanAndExecute(planner=planner, executor=executor, verbose=<span class="hljs-literal">True</span>)
    response = agent.run(query)
</code></pre>
<h3 id="heading-in-browser-type-url-1270018000chatqueryhttp1270018000chatquerywrite-an-overview-of-artificial-intelligence-regulations-by-governments-by-country"><strong>In browser type url</strong> <a target="_blank" href="http://127.0.0.1:8000/chat/?query="><strong>127.0.0.1:8000/chat/?query=</strong></a><strong>”Write an overview of Artificial Intelligence regulations by governments by country”</strong></h3>
<p>Plan gets created with multiple steps by calling LLM</p>
<pre><code class="lang-python">
&gt; Entering new PlanAndExecute chain...
steps=[Step(value=<span class="hljs-string">'Research and gather information on Artificial Intelligence regulations 
by governments in different countries.'</span>), Step(value=<span class="hljs-string">'Organize the information by country, 
including details such as key regulations, policies, and guidelines related to Artificial 
Intelligence.'</span>), Step(value=<span class="hljs-string">'Summarize the regulations for each country in a concise 
manner.'</span>), Step(value=<span class="hljs-string">'Include any recent updates or developments in the field of 
Artificial Intelligence regulations.'</span>), Step(value=<span class="hljs-string">'Provide a comparison of the regulations
 across different countries, highlighting similarities and differences.'</span>), 
Step(value=<span class="hljs-string">'Check for any official government sources or reputable publications to verify
 the accuracy of the information.'</span>), Step(value=<span class="hljs-string">'Compile the overview in a clear and 
structured format for easy understanding.'</span>), Step(value=<span class="hljs-string">"Review the overview to ensure it is
 comprehensive and up-to-date.\nGiven the above steps taken, please respond to the user's 
original question. \n"</span>)]
</code></pre>
<p>Each of these steps is then executed by the executor agent, using the custom tool function we wrote</p>
<pre><code class="lang-python">*****

Step: Research <span class="hljs-keyword">and</span> gather information on Artificial Intelligence regulations by governments <span class="hljs-keyword">in</span> different countries.

Response: I will now search the private documents <span class="hljs-keyword">for</span> information on Artificial Intelligence regulations by governments <span class="hljs-keyword">in</span> different countries to assist the user <span class="hljs-keyword">with</span> their research objective.

&gt; Entering new AgentExecutor chain...
Thought: The user needs assistance <span class="hljs-keyword">in</span> organizing information on Artificial Intelligence regulations by country. I can help by searching the private documents <span class="hljs-keyword">for</span> relevant details on key regulations, policies, <span class="hljs-keyword">and</span> guidelines related to Artificial Intelligence <span class="hljs-keyword">in</span> different countries.

Action:
```
{
  <span class="hljs-string">"action"</span>: <span class="hljs-string">"Search Private Docs"</span>,
  <span class="hljs-string">"action_input"</span>: {<span class="hljs-string">"type"</span>: <span class="hljs-string">"Artificial Intelligence regulations by country"</span>}
}
```
</code></pre>
<p>The agent will systematically compile a comprehensive overview of artificial intelligence regulations by analyzing and synthesizing a variety of documents stored in the DeepLake database. This process will involve multiple iterations to ensure that all relevant information is captured and accurately represented, drawing from diverse sources to provide a well-rounded understanding of the current regulatory landscape surrounding AI technology.</p>
]]></content:encoded></item><item><title><![CDATA[Implementing XGBoost using Scikit]]></title><description><![CDATA[XGBoost, which stands for Extreme Gradient Boosting, is an advanced machine learning algorithm that is widely used for regression, classification, and ranking tasks. It is particularly known for its speed and performance, making it a popular choice i...]]></description><link>https://path2ml.com/implementing-xgboost-using-scikit</link><guid isPermaLink="true">https://path2ml.com/implementing-xgboost-using-scikit</guid><category><![CDATA[Machine Learning]]></category><category><![CDATA[Xgboost]]></category><category><![CDATA[Deep Learning]]></category><category><![CDATA[decisiontree]]></category><category><![CDATA[MachineLearning]]></category><dc:creator><![CDATA[Nitin Sharma]]></dc:creator><pubDate>Wed, 28 May 2025 21:59:01 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1748391930218/d211ea49-f283-499e-8bc5-b778cc7ea3be.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>XGBoost, which stands for Extreme Gradient Boosting, is an advanced machine learning algorithm that is widely used for regression, classification, and ranking tasks. It is particularly known for its speed and performance, making it a popular choice in data science competitions and practical applications.</p>
<p>At its core, XGBoost is based on the concept of boosting, which is an ensemble learning technique. This approach combines the predictions from multiple weak learners, typically decision trees, to create a strong predictive model. The primary idea behind boosting is to focus on the instances that previous models misclassified, thereby sequentially improving the model's performance.</p>
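<p>The residual-fitting loop at the heart of boosting can be illustrated with a deliberately stripped-down sketch: each round fits a depth-1 "stump" to the residuals of the current ensemble and adds a damped copy of its prediction. XGBoost follows the same loop with full regularized trees, second-order gradients, and heavy engineering on top; the tiny dataset and stump learner below are invented purely for illustration:</p>

```python
# Minimal gradient-boosting sketch for squared-error regression.

def fit_stump(x, residuals):
    # Try every midpoint between sorted x values as a split threshold
    # and keep the one that minimizes the sum of squared errors.
    best = None
    for i in range(1, len(x)):
        t = (x[i - 1] + x[i]) / 2
        left = [r for xi, r in zip(x, residuals) if xi <= t]
        right = [r for xi, r in zip(x, residuals) if xi > t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    _, t, lm, rm = best
    return lambda xi: lm if xi <= t else rm

def boost(x, y, n_rounds=50, lr=0.3):
    # Start from a zero prediction; each round fits a stump to the
    # current residuals and adds a learning-rate-damped correction.
    pred = [0.0] * len(y)
    for _ in range(n_rounds):
        residuals = [yi - pi for yi, pi in zip(y, pred)]
        stump = fit_stump(x, residuals)
        pred = [pi + lr * stump(xi) for xi, pi in zip(x, pred)]
    return pred

x = [1, 2, 3, 4, 5, 6]
y = [1.0, 1.2, 0.9, 3.8, 4.1, 4.0]  # a step-shaped target
pred = boost(x, y)
mse = sum((yi - pi) ** 2 for yi, pi in zip(y, pred)) / len(y)
print(round(mse, 4))  # small: the ensemble fits the step closely
```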
<h3 id="heading-key-features-of-xgboost">Key Features of XGBoost:</h3>
<ol>
<li><p><strong>Regularization</strong>: XGBoost includes a regularization term in its objective function, which helps to prevent overfitting. This feature differentiates it from many other boosting algorithms, as it adds both L1 (Lasso) and L2 (Ridge) penalties. This allows for more flexibility in managing model complexity and improves generalization on unseen data.</p>
</li>
<li><p><strong>Handling Missing Values</strong>: One of the standout features of XGBoost is its ability to handle missing data internally. It does this by learning the best direction to handle missing values during training, making it robust against incomplete datasets.</p>
</li>
<li><p><strong>Parallel Processing</strong>: Unlike traditional gradient boosting algorithms that build trees sequentially, XGBoost leverages parallel processing to speed up the training process. It does this by building trees one level at a time, allowing the algorithm to construct trees much more quickly than its predecessors.</p>
</li>
<li><p><strong>Tree Pruning</strong>: Instead of the standard pre-pruning method used in decision trees, XGBoost employs maximum depth for tree construction and then prunes the trees backwards (post-pruning). This helps to optimize the tree structure and improve overall performance.</p>
</li>
<li><p><strong>Scalability</strong>: XGBoost is designed to be highly scalable. It can handle large datasets and can be run on distributed systems, making it suitable for modern data processing needs.</p>
</li>
</ol>
<h3 id="heading-usage">Usage:</h3>
<p>To use XGBoost, data scientists typically follow these steps:</p>
<ol>
<li><p><strong>Data Preparation</strong>: Clean and preprocess the dataset, addressing missing values and converting categorical variables as necessary.</p>
</li>
<li><p><strong>Model Configuration</strong>: Set parameters for the XGBoost model. This includes specifying the learning rate, the number of trees to create, maximum depth, regularization parameters, and evaluation metrics.</p>
</li>
<li><p><strong>Training</strong>: Train the model on the training dataset while monitoring performance on a validation dataset to avoid overfitting.</p>
</li>
<li><p><strong>Prediction</strong>: After training, the model is used to make predictions on new data.</p>
</li>
<li><p><strong>Evaluation</strong>: Finally, the model’s predictions are evaluated using appropriate metrics (e.g., accuracy, RMSE, F1 score).</p>
</li>
</ol>
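<p>The five steps above can be sketched end to end. The example below is a stand-in using scikit-learn's <code>GradientBoostingClassifier</code> on a synthetic dataset; with <code>xgboost</code> installed, <code>XGBClassifier</code> exposes the same <code>fit</code>/<code>predict</code> interface, as we will see with the Adult data shortly:</p>

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# 1. Data preparation: a synthetic binary-classification dataset
X, y = make_classification(n_samples=500, n_features=8, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 2. Model configuration: learning rate, number of trees, tree depth
clf = GradientBoostingClassifier(learning_rate=0.1, n_estimators=100,
                                 max_depth=3, random_state=42)

# 3. Training
clf.fit(X_train, y_train)

# 4. Prediction
y_pred = clf.predict(X_test)

# 5. Evaluation
print(f"accuracy: {accuracy_score(y_test, y_pred):.3f}")
```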
<p>XGBoost has become a go-to algorithm due to its impressive performance across a variety of tasks, and its ability to produce predictive models that are both accurate and efficient. Given its flexibility and robustness, it has gained immense popularity in the machine learning community, making it a critical tool for practitioners.</p>
<h3 id="heading-adult-data-set">Adult Data set</h3>
<p>The Adult dataset, frequently referred to as the Census Income dataset, is a widely recognized collection of data utilized for binary classification tasks in machine learning. This dataset contains information from the U.S. Census and includes various attributes such as age, education level, occupation, and marital status, among others. <strong>The primary objective when working with this dataset is to predict whether an individual's income exceeds $50,000 per year based on these features.</strong> Its rich variety of demographic information and clear binary target variable make it an excellent resource for testing algorithms and exploring concepts in classification and predictive modeling.</p>
<p>Let's start by importing the libraries we will use:</p>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> sklearn.datasets <span class="hljs-keyword">import</span> fetch_openml
<span class="hljs-keyword">from</span> sklearn.model_selection <span class="hljs-keyword">import</span> GridSearchCV, train_test_split
<span class="hljs-keyword">from</span> sklearn.preprocessing <span class="hljs-keyword">import</span> OrdinalEncoder, LabelEncoder
<span class="hljs-keyword">from</span> sklearn.compose <span class="hljs-keyword">import</span> ColumnTransformer
<span class="hljs-keyword">from</span> xgboost <span class="hljs-keyword">import</span> XGBClassifier
<span class="hljs-keyword">import</span> numpy <span class="hljs-keyword">as</span> np
<span class="hljs-keyword">from</span> collections <span class="hljs-keyword">import</span> Counter
</code></pre>
<p>We will obtain the Adult dataset by utilizing the <code>fetch_openml</code> function from the scikit-learn library. This function allows us to easily download and load the dataset from OpenML, a platform that hosts various machine learning datasets. By using this method, we can access the data in a structured format, making it convenient for further analysis and modeling tasks.</p>
<h2 id="heading-goal">Goal</h2>
<p>Predicting whether an individual’s income exceeds $50,000 per year.</p>
<h2 id="heading-load-the-adult-dataset">Load the Adult dataset</h2>
<pre><code class="lang-python">adult = fetch_openml(<span class="hljs-string">'adult'</span>, as_frame=<span class="hljs-literal">True</span>)
X, y = adult.data, adult.target
</code></pre>
<p>Let's print the shape and contents of the loaded data.</p>
<h3 id="heading-print-key-information-about-the-dataset">Print key information about the dataset</h3>
<pre><code class="lang-python">
print(<span class="hljs-string">f"Dataset shape: <span class="hljs-subst">{X.shape}</span>"</span>)
print(<span class="hljs-string">f"Features: <span class="hljs-subst">{adult.feature_names}</span>"</span>)
print(<span class="hljs-string">f"Target variable: <span class="hljs-subst">{adult.target_names}</span>"</span>)
print(<span class="hljs-string">f"Class distributions: <span class="hljs-subst">{Counter(y)}</span>"</span>)
</code></pre>
<pre><code class="lang-python">Dataset shape: (<span class="hljs-number">48842</span>, <span class="hljs-number">14</span>)
Features: [<span class="hljs-string">'age'</span>, <span class="hljs-string">'workclass'</span>, <span class="hljs-string">'fnlwgt'</span>, <span class="hljs-string">'education'</span>, <span class="hljs-string">'education-num'</span>, 
<span class="hljs-string">'marital-status'</span>, <span class="hljs-string">'occupation'</span>, <span class="hljs-string">'relationship'</span>, <span class="hljs-string">'race'</span>, <span class="hljs-string">'sex'</span>, <span class="hljs-string">'capitalgain'</span>, <span class="hljs-string">'capitalloss'</span>,
 <span class="hljs-string">'hoursperweek'</span>, <span class="hljs-string">'native-country'</span>]
Target variable: [<span class="hljs-string">'class'</span>]
Class distributions: Counter({<span class="hljs-string">'&lt;=50K'</span>: <span class="hljs-number">37155</span>, <span class="hljs-string">'&gt;50K'</span>: <span class="hljs-number">11687</span>})
</code></pre>
<p>Let's look at some of the data.</p>
<pre><code class="lang-python">X.head(<span class="hljs-number">5</span>)
</code></pre>
<div class="hn-table">
<table>
<thead>
<tr>
<td></td><td><strong>age</strong></td><td><strong>workclass</strong></td><td><strong>fnlwgt</strong></td><td><strong>education</strong></td><td><strong>education-num</strong></td><td><strong>marital-status</strong></td><td><strong>occupation</strong></td><td><strong>relationship</strong></td><td><strong>race</strong></td><td><strong>sex</strong></td><td><strong>capitalgain</strong></td><td><strong>capitalloss</strong></td><td><strong>hoursperweek</strong></td><td><strong>native-country</strong></td></tr>
</thead>
<tbody>
<tr>
<td><strong>0</strong></td><td>2</td><td>State-gov</td><td>77516</td><td>Bachelors</td><td>13</td><td>Never-married</td><td>Adm-clerical</td><td>Not-in-family</td><td>White</td><td>Male</td><td>1</td><td>0</td><td>2</td><td>United-States</td></tr>
<tr>
<td><strong>1</strong></td><td>3</td><td>Self-emp-not-inc</td><td>83311</td><td>Bachelors</td><td>13</td><td>Married-civ-spouse</td><td>Exec-managerial</td><td>Husband</td><td>White</td><td>Male</td><td>0</td><td>0</td><td>0</td><td>United-States</td></tr>
<tr>
<td><strong>2</strong></td><td>2</td><td>Private</td><td>215646</td><td>HS-grad</td><td>9</td><td>Divorced</td><td>Handlers-cleaners</td><td>Not-in-family</td><td>White</td><td>Male</td><td>0</td><td>0</td><td>2</td><td>United-States</td></tr>
<tr>
<td><strong>3</strong></td><td>3</td><td>Private</td><td>234721</td><td>11th</td><td>7</td><td>Married-civ-spouse</td><td>Handlers-cleaners</td><td>Husband</td><td>Black</td><td>Male</td><td>0</td><td>0</td><td>2</td><td>United-States</td></tr>
<tr>
<td><strong>4</strong></td><td>1</td><td>Private</td><td>338409</td><td>Bachelors</td><td>13</td><td>Married-civ-spouse</td><td>Prof-specialty</td><td>Wife</td><td>Black</td><td>Female</td><td>0</td><td>0</td><td>2</td><td>Cuba</td></tr>
</tbody>
</table>
</div><p>We will utilize the Scikit-learn library to transform categorical features into integer codes. This process involves using techniques such as label encoding or one-hot encoding, which allow us to convert string values representing categories into numerical formats. This transformation is crucial for machine learning models, as they typically perform better with numerical data. By encoding these categorical variables, we ensure that our models can effectively interpret and learn from the input data.</p>
<p>From the above data, we can see that the following columns are categorical:</p>
<pre><code class="lang-python">nominal = [<span class="hljs-string">'workclass'</span>, <span class="hljs-string">'education'</span>, <span class="hljs-string">'marital-status'</span>, <span class="hljs-string">'occupation'</span>, <span class="hljs-string">'relationship'</span>,
 <span class="hljs-string">'race'</span>, <span class="hljs-string">'sex'</span>, <span class="hljs-string">'native-country'</span>]
</code></pre>
<p>We will utilize the <code>ColumnTransformer</code> from the scikit-learn library to construct a data transformation pipeline. This pipeline will process the categorical columns with the <code>OrdinalEncoder</code>, which maps each category to an integer code. Note that these features are nominal, so the assigned codes carry no true ordering; tree-based models such as XGBoost are relatively insensitive to this, which makes ordinal encoding a pragmatic choice here. For the remaining columns in the dataset, we will apply the 'passthrough' option, retaining those features without any transformation. This approach tailors the preprocessing to the specific needs of both the categorical and numerical data in our dataset.</p>
<pre><code class="lang-python">transformer = ColumnTransformer(transformers=[(<span class="hljs-string">'ordinal'</span>, OrdinalEncoder(), nominal)],
 remainder=<span class="hljs-string">'passthrough'</span>)
</code></pre>
<h3 id="heading-perform-ordinal-encoding">Perform ordinal encoding</h3>
<pre><code class="lang-python">X = transformer.fit_transform(X)
</code></pre>
<h3 id="heading-labelencoder"><strong>LabelEncoder</strong></h3>
<p>The LabelEncoder is a utility in data preprocessing that transforms categorical target labels into a numerical format suitable for machine learning models. It converts each unique label into an integer value ranging from 0 to n_classes - 1, where n_classes represents the total number of distinct categories present in the target variable. This encoding technique is particularly useful for classification problems, as many machine learning algorithms require numerical input rather than categorical data.</p>
<p>When using the LabelEncoder, it is important to apply it exclusively to the target labels, not to the features. This ensures that the transformation accurately reflects the classes without altering the structure of the input data. The LabelEncoder can also assist in normalizing the labels, making them more manageable for algorithms that rely on numerical computations. By encoding the labels in this way, models can learn the underlying patterns in the data effectively, leading to better performance on predictive tasks.</p>
<pre><code class="lang-python">y = LabelEncoder().fit_transform(y)
</code></pre>
<h3 id="heading-we-will-split-the-data-into-train-and-test-sets">We will split the data into train and test sets</h3>
<pre><code class="lang-python">
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=<span class="hljs-number">0.2</span>, random_state=<span class="hljs-number">42</span>, stratify=y)
</code></pre>
<p>In order to optimize the performance of an XGBoost model, it's essential to establish a detailed parameter grid that encompasses a variety of hyperparameters. This grid will allow for a systematic exploration of different configurations to identify the best combination for our specific dataset.</p>
<ol>
<li><p><strong>Learning Rate (eta)</strong>: This controls the contribution of each tree. Values typically range from 0.01 to 0.3.</p>
</li>
<li><p><strong>Maximum Depth (max_depth)</strong>: Defines the maximum depth of a tree in the ensemble. Common values are between 3 and 10.</p>
</li>
<li><p><strong>Subsample</strong>: This parameter represents the fraction of samples to be used for each tree. It usually takes values between 0.5 and 1.0.</p>
</li>
<li><p><strong>Colsample_bytree</strong>: The fraction of features to consider when building each tree, typically ranging from 0.3 to 1.</p>
</li>
<li><p><strong>Number of Estimators (n_estimators)</strong>: The number of trees to be created in the boosting process, commonly set between 100 and 1000.</p>
</li>
</ol>
<p>By methodically defining this parameter grid, we can employ techniques like grid search or random search to uncover the optimal hyperparameter settings that enhance the model's predictive capabilities.</p>
<pre><code class="lang-python">param_grid = {
    <span class="hljs-string">'max_depth'</span>: [<span class="hljs-number">3</span>, <span class="hljs-number">4</span>, <span class="hljs-number">5</span>],
    <span class="hljs-string">'learning_rate'</span>: [<span class="hljs-number">0.1</span>, <span class="hljs-number">0.01</span>, <span class="hljs-number">0.05</span>],
    <span class="hljs-string">'n_estimators'</span>: [<span class="hljs-number">50</span>, <span class="hljs-number">100</span>, <span class="hljs-number">200</span>],
    <span class="hljs-string">'subsample'</span>: [<span class="hljs-number">0.8</span>, <span class="hljs-number">1.0</span>],
    <span class="hljs-string">'colsample_bytree'</span>: [<span class="hljs-number">0.8</span>, <span class="hljs-number">1.0</span>]
}
</code></pre>
<p>Next, we create the <code>XGBClassifier</code>:</p>
<pre><code class="lang-python">model = XGBClassifier(objective=<span class="hljs-string">'binary:logistic'</span>, random_state=<span class="hljs-number">42</span>, n_jobs=<span class="hljs-number">1</span>)
</code></pre>
<p>The <strong>"binary:logistic"</strong> objective function in <strong>XGBoost</strong> is specifically designed for binary classification tasks, where the target variable consists of two distinct classes or outcomes. In this context, it focuses on predicting which of the two classes a given instance belongs to. The optimization process targets the log loss function, which measures the performance of a classification model whose output is a probability value between 0 and 1.</p>
<p>By utilizing the log loss function, this objective effectively quantifies how far off the predicted probabilities are from the actual class labels. This makes "binary:logistic" particularly suitable for applications where understanding the likelihood of an instance belonging to a specific class is crucial.</p>
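<p>To make the log loss concrete, here is a small hand-rolled version (a sketch for illustration; scikit-learn's <code>log_loss</code> computes the same quantity):</p>

```python
import math

# A hand-rolled binary log loss (cross-entropy); illustrative sketch only.
def log_loss(y_true, y_prob):
    eps = 1e-15  # clip probabilities to avoid log(0)
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

# Confident, correct probabilities give a low loss...
print(round(log_loss([1, 0, 1], [0.9, 0.1, 0.8]), 3))  # → 0.145
# ...while confident but wrong probabilities are punished heavily
print(round(log_loss([1, 0, 1], [0.1, 0.9, 0.2]), 3))  # → 2.072
```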
<h3 id="heading-perform-grid-search">Perform grid search</h3>
<p>To conduct a thorough optimization of our model's hyperparameters, we will utilize the GridSearchCV method from the <strong>scikit-learn</strong> library. This approach involves specifying a range of values for each hyperparameter in the <code>param_grid</code> dictionary. We will create an instance of GridSearchCV, passing in our model as the <code>estimator</code>, alongside the defined parameter grid. Additionally, we will set the <code>cv</code> parameter to 3 to implement three-fold cross-validation during the search process. To leverage all available CPU cores for efficiency, we will assign <code>n_jobs</code> a value of -1. After setting up the grid search configuration, we will fit the model on our training dataset, <code>X_train</code> and <code>y_train</code>, which will allow the algorithm to explore and identify the optimal combination of hyperparameter settings based on cross-validated performance.</p>
<pre><code class="lang-python">grid_search = GridSearchCV(estimator=model, param_grid=param_grid, cv=<span class="hljs-number">3</span>, n_jobs=<span class="hljs-number">-1</span>)
grid_search.fit(X_train, y_train)
</code></pre>
<h3 id="heading-print-best-score-and-parameters">Print best score and parameters</h3>
<pre><code class="lang-python">
print(<span class="hljs-string">f"Best score: <span class="hljs-subst">{grid_search.best_score_:<span class="hljs-number">.3</span>f}</span>"</span>)
print(<span class="hljs-string">f"Best parameters: <span class="hljs-subst">{grid_search.best_params_}</span>"</span>)
</code></pre>
<pre><code class="lang-python">Best score: <span class="hljs-number">0.859</span>
Best parameters: {<span class="hljs-string">'colsample_bytree'</span>: <span class="hljs-number">1.0</span>, <span class="hljs-string">'learning_rate'</span>: <span class="hljs-number">0.1</span>, <span class="hljs-string">'max_depth'</span>: <span class="hljs-number">5</span>, 
<span class="hljs-string">'n_estimators'</span>: <span class="hljs-number">100</span>, <span class="hljs-string">'subsample'</span>: <span class="hljs-number">0.8</span>}
</code></pre>
<h3 id="heading-access-the-best-model-from-gridsearch">Access the best model from grid_search</h3>
<p>To obtain the optimal model from the grid search results, we can access the best estimator by referencing the <code>best_estimator_</code> attribute of the <code>grid_search</code> object. This attribute contains the model that achieved the highest performance based on the evaluation criteria set during the grid search process.</p>
<pre><code class="lang-python">
best_model = grid_search.best_estimator_
</code></pre>
<h3 id="heading-save-the-best-model">Save the best model</h3>
<pre><code class="lang-python">
best_model.save_model(<span class="hljs-string">'best_model_adult.ubj'</span>)
</code></pre>
<h3 id="heading-now-we-load-the-saved-model">Now we load the saved model</h3>
<pre><code class="lang-python">
loaded_model = XGBClassifier()
loaded_model.load_model(<span class="hljs-string">'best_model_adult.ubj'</span>)
</code></pre>
<p>To generate predictions using the trained model that has been previously loaded into memory, we will apply it to the test dataset. This is done by calling the <code>predict</code> method on the loaded model and passing in the features from the test set, denoted as <code>X_test</code>. The output will be a set of predictions based on the input data.</p>
<pre><code class="lang-python"><span class="hljs-comment"># Use loaded model for predictions</span>
predictions = loaded_model.predict(X_test)
</code></pre>
<h3 id="heading-print-the-accuracy-score">Print the accuracy score</h3>
<p>To evaluate the performance of the model, we will calculate the accuracy score using the test dataset. The accuracy score is determined by comparing the predicted labels generated by the loaded model against the actual labels in the test set. We can achieve this by applying the <code>score</code> method on the <code>loaded_model</code>, passing in <code>X_test</code> as the input features and <code>y_test</code> as the corresponding true labels. This will yield a numerical value representing the proportion of correctly predicted instances in the test data.</p>
<pre><code class="lang-python">
accuracy = loaded_model.score(X_test, y_test)
print(<span class="hljs-string">f"Accuracy: <span class="hljs-subst">{accuracy:<span class="hljs-number">.3</span>f}</span>"</span>)
</code></pre>
<pre><code class="lang-python">Accuracy: <span class="hljs-number">0.862</span>
</code></pre>
<p>Pretty good accuracy.</p>
]]></content:encoded></item><item><title><![CDATA[Implementing Random Forest using Scikit learn]]></title><description><![CDATA[The Random Forest algorithm is an ensemble learning method primarily used for classification and regression tasks. It operates by constructing a multitude of decision trees during training and outputs the mode of the classes (for classification) or t...]]></description><link>https://path2ml.com/implementing-random-forest-using-scikit-learn</link><guid isPermaLink="true">https://path2ml.com/implementing-random-forest-using-scikit-learn</guid><category><![CDATA[Machine Learning]]></category><category><![CDATA[Random Forest]]></category><category><![CDATA[scikit learn]]></category><category><![CDATA[Decision Tree]]></category><category><![CDATA[MachineLearning]]></category><category><![CDATA[algorithms]]></category><dc:creator><![CDATA[Nitin Sharma]]></dc:creator><pubDate>Mon, 26 May 2025 13:49:33 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1748218932369/8bed66d6-ece1-49a5-8d4b-1a5e12cf1326.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>The <strong>Random Forest algorithm</strong> is an ensemble learning method primarily used for classification and regression tasks. It operates by constructing a multitude of decision trees during training and outputs the mode of the classes (for classification) or the mean prediction (for regression) of the individual trees.</p>
<p>The process begins by creating various subsets of the training data through a technique known as bootstrap sampling, where random samples (with replacement) are drawn from the dataset. For each of these subsets, a decision tree is built. Unlike a standard decision tree that considers all features when making splits, Random Forest introduces an additional layer of randomness by only selecting a random subset of features at each split, which helps to enhance the diversity among the trees.</p>
<p>This diversity among the trees reduces the risk of overfitting, which is a common problem in single decision trees. After all trees are constructed, the final output is determined through majority voting for classification tasks or averaging for regression tasks. Random Forest is valued for its robustness, high accuracy, and ability to handle large datasets with higher dimensionality while maintaining computational efficiency. Additionally, it provides insights into feature importance, allowing for better understanding and interpretability of the model's predictions. To better understand Random Forest, we first need to understand the <strong>Decision Tree algorithm</strong>.</p>
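<p>The bootstrap-plus-voting idea described above can be sketched by hand. This is an illustrative toy example (the data and tree count are assumptions); scikit-learn's <code>RandomForestClassifier</code> packages all of this, plus per-split feature sampling, in one estimator:</p>

```python
# Random Forest by hand: many trees, each fit on a bootstrap sample,
# combined by majority vote. Toy data only.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

n_trees = 25
trees = []
for _ in range(n_trees):
    # Bootstrap sample: draw rows with replacement
    idx = rng.integers(0, len(X), size=len(X))
    # max_features='sqrt' adds the per-split feature randomness
    tree = DecisionTreeClassifier(max_features='sqrt', random_state=0)
    trees.append(tree.fit(X[idx], y[idx]))

# Majority vote across the ensemble
votes = np.stack([t.predict(X) for t in trees])
ensemble_pred = (votes.mean(axis=0) >= 0.5).astype(int)
print("Training accuracy:", (ensemble_pred == y).mean())
```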
<h2 id="heading-decision-tree-algorithm-of-machine-learning">Decision Tree algorithm of Machine Learning</h2>
<p>The Decision Tree algorithm is a popular machine learning method used for classification and regression tasks. It models decisions in a tree-like structure, where each internal node represents a feature (or attribute), each branch corresponds to a decision rule, and each leaf node represents an outcome (or class label). The decision tree is constructed through an iterative process where the goal is to partition the input space in a way that maximizes the homogeneity of the resulting subsets.</p>
<p>Key concepts involved in building a decision tree include:</p>
<ol>
<li><p><strong>Entropy</strong>: Entropy is a measure of the disorder or uncertainty in a set of data. In the context of decision trees, it quantifies the impurity or randomness of the class labels in a dataset. The formula for the entropy \(H(S)\) of a set \(S\) with class labels is given by:</p>
</li>
<li><p>\( H(S) = - \sum_{i=1}^{c} p_i \log_2 p_i \)</p>
<p> where \(p_i\) is the proportion of class \(i\) in the dataset and \(c\) is the total number of classes. Lower entropy indicates that the data is more pure and homogeneous, while higher entropy signifies more mixed data.</p>
</li>
<li><p><strong>Information Gain</strong>: Information Gain measures the reduction in entropy after a dataset is split based on a particular feature. It helps to identify which feature best separates the classes. The Information Gain \(IG(S, A)\) for a feature \(A\) is calculated as:</p>
<p> entropy(parent) – [average entropy(children)]</p>
<p> \(    IG(S, A) = H(S) - \sum_{v \in Values(A)} \frac{|S_v|}{|S|} H(S_v)    \)</p>
<p> where \(S_v\) is the subset of \(S\) where feature \(A\) takes on value \(v\). The feature with the highest Information Gain is chosen for the split, as it provides the clearest separation of classes.</p>
</li>
<li><p><strong>Gini Impurity</strong>: Gini Impurity is an alternative metric for measuring the quality of a split in a decision tree. It assesses how often a randomly chosen element would be incorrectly labeled if it were labeled randomly according to the distribution of labels in the subset. The Gini Impurity \(Gini(S)\) for a set \(S\) is calculated as:</p>
<p> \(Gini(S) = 1 - \sum_{i=1}^{c} p_i^2 \)</p>
<p> Like entropy, a lower Gini Impurity indicates a more homogeneous subset. Decision trees can be constructed using Gini Impurity as the criterion for splitting nodes, typically yielding faster results compared to entropy.</p>
</li>
</ol>
<p>Decision Trees utilize Entropy or Gini Impurity as criteria to decide on the best features to split the dataset, aiming to create a model that accurately represents the underlying patterns of the data while fostering interpretability and ease of use.</p>
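<p>In scikit-learn, choosing between the two split criteria is a single parameter. A small sketch on synthetic data (the dataset here is illustrative):</p>

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, n_features=4, random_state=0)

# The `criterion` parameter selects the split-quality measure
for criterion in ('gini', 'entropy'):
    tree = DecisionTreeClassifier(criterion=criterion, random_state=0)
    tree.fit(X, y)
    print(criterion, round(tree.score(X, y), 3))
```

<p>Both criteria usually produce similar trees; Gini is slightly cheaper to compute since it avoids the logarithm.</p>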
<h3 id="heading-step-by-step-calculation-of-information-gain">Step-by-Step Calculation of Information Gain</h3>
<ol>
<li><p><strong>Calculate the Entropy of the Whole Dataset</strong>: Let's say we have a dataset of 10 instances with the following classes: 6 positive instances (Yes) and 4 negative instances (No). The formula for entropy (H) is given by:</p>
<p> \( H(S) = - \sum (p_i \cdot \log_2(p_i)) \)</p>
<p> Here, \( p_i \) is the proportion of each class in the dataset.</p>
<ul>
<li><p>For our dataset:</p>
<ul>
<li><p>Proportion of Yes: \(p_Y = 6/10 = 0.6\)</p>
</li>
<li><p>Proportion of No: \(p_N = 4/10 = 0.4\)</p>
</li>
</ul>
</li>
</ul>
</li>
</ol>
<p>    Substituting these values into the entropy formula:</p>
<p>    \(H(S) = - (0.6 \cdot \log_2(0.6) + 0.4 \cdot \log_2(0.4)) \approx 0.971 \)</p>
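<p>This entropy calculation can be checked with a few lines of Python (a quick sketch, not part of the original walkthrough):</p>

```python
import math

def entropy(proportions):
    """Shannon entropy in bits for a list of class proportions."""
    return -sum(p * math.log2(p) for p in proportions if p > 0)

# 6 Yes and 4 No instances out of 10
print(round(entropy([0.6, 0.4]), 3))  # → 0.971
```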
<p><strong>Split the Dataset based on a Feature</strong>: Suppose we have a feature called "Weather" with three possible outcomes: Sunny, Rainy, and Overcast.</p>
<ul>
<li><p>Let's say it splits our dataset into:</p>
<ul>
<li><p>Sunny: 3 Yes, 1 No (4 instances)</p>
</li>
<li><p>Rainy: 2 Yes, 2 No (4 instances)</p>
</li>
<li><p>Overcast: 1 Yes, 1 No (2 instances)</p>
</li>
</ul>
</li>
</ul>
<ol>
<li><p><strong>Calculate the Entropy for Each Subset</strong>: For each subset, we calculate the entropy.</p>
<ul>
<li><p><strong>Sunny</strong>:</p>
<ul>
<li><p>Proportion of Yes: \(p_Y = 3/4 = 0.75\)</p>
</li>
<li><p>Proportion of No: \(p_N = 1/4 = 0.25\)</p>
<p>  \(  H(Sunny) \approx - (0.75 \cdot \log_2(0.75) + 0.25 \cdot \log_2(0.25)) \approx 0.811 \)</p>
</li>
</ul>
</li>
<li><p><strong>Rainy</strong>:</p>
<ul>
<li><p>Proportion of Yes: \(p_Y = 2/4 = 0.5\)</p>
</li>
<li><p>Proportion of No: \(p_N = 2/4 = 0.5\)</p>
</li>
</ul>
</li>
</ul>
</li>
</ol>
<p>    \(H(Rainy) \approx - (0.5 \cdot \log_2(0.5) + 0.5 \cdot \log_2(0.5)) = 1.0 \)</p>
<ul>
<li><p><strong>Overcast</strong>:</p>
<ul>
<li><p>Proportion of Yes: \(p_Y = 1/2 = 0.5\)</p>
</li>
<li><p>Proportion of No: \(p_N = 1/2 = 0.5\)</p>
<p>  \( H(Overcast) \approx 1.0 \)</p>
</li>
</ul>
</li>
</ul>
<ol start="2">
<li><p><strong>Calculate the Weighted Average Entropy of Subsets</strong>: Now, we need to find the weighted average entropy based on the size of each subset:</p>
<p> \(H(Feature) = \frac{4}{10} \cdot H(Sunny) + \frac{4}{10} \cdot H(Rainy) + \frac{2}{10} \cdot H(Overcast) = \frac{4}{10} \cdot 0.811 + \frac{4}{10} \cdot 1.0 + \frac{2}{10} \cdot 1.0 \approx 0.925\)</p>
</li>
<li><p><strong>Calculate Information Gain</strong>: Finally, we compute the Information Gain by subtracting the weighted average entropy of the feature from the original entropy:</p>
<p> \(IG = H(S) - H(Feature) = 0.971 - 0.925 \approx 0.046 \)</p>
</li>
</ol>
<p>The Information Gain tells us how much information about the classification of the dataset is provided by the "Weather" feature. In this case, the Information Gain of approximately 0.046 indicates that the "Weather" feature does provide some useful information, making it a candidate for splitting the dataset in decision tree algorithms. The IG is calculated for every feature, and the feature with the highest gain is chosen as the root of the tree.</p>
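<p>The whole calculation can be reproduced in a few lines of Python (a quick sketch to verify the arithmetic):</p>

```python
import math

def entropy(counts):
    """Shannon entropy in bits from raw class counts."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)

H_S = entropy([6, 4])                  # whole dataset: 6 Yes, 4 No
subsets = [[3, 1], [2, 2], [1, 1]]     # Sunny, Rainy, Overcast
n = sum(sum(s) for s in subsets)       # 10 instances in total

# Weighted average entropy of the subsets after the split
H_feature = sum(sum(s) / n * entropy(s) for s in subsets)
print(round(H_S, 3))               # → 0.971
print(round(H_feature, 3))         # → 0.925
print(round(H_S - H_feature, 3))   # → 0.046
```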
<p>Now let's implement <strong>Random Forest</strong> on a sample dataset using the scikit-learn library.</p>
<h2 id="heading-problem-domain">Problem Domain</h2>
<p>To address the challenge of predicting whether the price of a New York City Airbnb listing will be above or below the average price, we utilize a tabular dataset containing information about various Airbnb listings in the city. The dataset is available to download from</p>
<p><a target="_blank" href="https://www.kaggle.com/datasets/dgomonov/new-york-city-airbnb-open-data">https://www.kaggle.com/datasets/dgomonov/new-york-city-airbnb-open-data</a></p>
<p>Or it can be loaded from</p>
<pre><code class="lang-python">https://raw.githubusercontent.com/lmassaron/tabular_datasets/master/AB_NYC_2019.csv
</code></pre>
<h2 id="heading-lets-load-all-libraries">Let's load all the libraries</h2>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> numpy <span class="hljs-keyword">as</span> np
<span class="hljs-keyword">import</span> pandas <span class="hljs-keyword">as</span> pd
<span class="hljs-keyword">from</span> sklearn.preprocessing <span class="hljs-keyword">import</span> OneHotEncoder, OrdinalEncoder
<span class="hljs-keyword">from</span> sklearn.compose <span class="hljs-keyword">import</span> ColumnTransformer
<span class="hljs-keyword">from</span> sklearn.impute <span class="hljs-keyword">import</span> SimpleImputer
<span class="hljs-keyword">from</span> sklearn.ensemble <span class="hljs-keyword">import</span> BaggingClassifier
<span class="hljs-keyword">from</span> sklearn.tree <span class="hljs-keyword">import</span> DecisionTreeClassifier
<span class="hljs-keyword">from</span> sklearn.metrics <span class="hljs-keyword">import</span> accuracy_score
<span class="hljs-keyword">from</span> sklearn.pipeline <span class="hljs-keyword">import</span> Pipeline
<span class="hljs-keyword">from</span> sklearn.metrics <span class="hljs-keyword">import</span> make_scorer, accuracy_score
<span class="hljs-keyword">from</span> sklearn.model_selection <span class="hljs-keyword">import</span> KFold, cross_validate
<span class="hljs-keyword">from</span> sklearn.ensemble <span class="hljs-keyword">import</span> RandomForestClassifier
</code></pre>
<p>Now let's load the data and take a look at it.</p>
<pre><code class="lang-python">data = pd.read_csv(<span class="hljs-string">"https://raw.githubusercontent.com/lmassaron/tabular_datasets/master/AB_NYC_2019.csv"</span>)
</code></pre>
<div class="hn-table">
<table>
<thead>
<tr>
<td><strong>index</strong></td><td><strong>id</strong></td><td><strong>name</strong></td><td><strong>host_id</strong></td><td><strong>host_name</strong></td><td><strong>neighbourhood_group</strong></td><td><strong>neighbourhood</strong></td><td><strong>latitude</strong></td><td><strong>longitude</strong></td><td><strong>room_type</strong></td><td><strong>price</strong></td><td><strong>minimum_nights</strong></td><td><strong>number_of_reviews</strong></td><td><strong>last_review</strong></td><td><strong>reviews_per_month</strong></td><td><strong>calculated_host_listings_count</strong></td><td><strong>availability_365</strong></td></tr>
</thead>
<tbody>
<tr>
<td><strong>0</strong></td><td>2539</td><td>Clean &amp; quiet apt home by the park</td><td>2787</td><td>John</td><td>Brooklyn</td><td>Kensington</td><td>40.64749</td><td>-73.97237</td><td>Private room</td><td>149</td><td>1</td><td>9</td><td>2018-10-19</td><td>0.21</td><td>6</td><td>365</td></tr>
<tr>
<td><strong>1</strong></td><td>2595</td><td>Skylit Midtown Castle</td><td>2845</td><td>Jennifer</td><td>Manhattan</td><td>Midtown</td><td>40.75362</td><td>-73.98377</td><td>Entire home/apt</td><td>225</td><td>1</td><td>45</td><td>2019-05-21</td><td>0.38</td><td>2</td><td>355</td></tr>
<tr>
<td><strong>2</strong></td><td>3647</td><td>THE VILLAGE OF HARLEM....NEW YORK !</td><td>4632</td><td>Elisabeth</td><td>Manhattan</td><td>Harlem</td><td>40.80902</td><td>-73.9419</td><td>Private room</td><td>150</td><td>3</td><td>0</td><td>NaN</td><td>NaN</td><td>1</td><td>365</td></tr>
<tr>
<td><strong>3</strong></td><td>3831</td><td>Cozy Entire Floor of Brownstone</td><td>4869</td><td>LisaRoxanne</td><td>Brooklyn</td><td>Clinton Hill</td><td>40.68514</td><td>-73.95976</td><td>Entire home/apt</td><td>89</td><td>1</td><td>270</td><td>2019-07-05</td><td>4.64</td><td>1</td><td>194</td></tr>
<tr>
<td><strong>4</strong></td><td>5022</td><td>Entire Apt: Spacious Studio/Loft by central park</td><td>7192</td><td>Laura</td><td>Manhattan</td><td>East Harlem</td><td>40.79851</td><td>-73.94399</td><td>Entire home/apt</td><td>80</td><td>10</td><td>9</td><td>2018-11-19</td><td>0.1</td><td>1</td><td>0</td></tr>
</tbody>
</table>
</div><h3 id="heading-list-of-features-to-be-excluded-from-the-analysis">List of features to be excluded from the analysis</h3>
<p><em>A list of features that should be excluded from the analysis, such as unique identifiers and text features</em></p>
<pre><code class="lang-python">excluding_list = [<span class="hljs-string">'price'</span>, <span class="hljs-string">'id'</span>, <span class="hljs-string">'latitude'</span>, <span class="hljs-string">'longitude'</span>, <span class="hljs-string">'host_id'</span>,
                  <span class="hljs-string">'last_review'</span>, <span class="hljs-string">'name'</span>, <span class="hljs-string">'host_name'</span>]
</code></pre>
<p>Let's take a look at the categorical features:</p>
<pre><code class="lang-python">categorical = [<span class="hljs-string">'neighbourhood_group'</span>, <span class="hljs-string">'neighbourhood'</span>, <span class="hljs-string">'room_type'</span>]

data[categorical].nunique()
</code></pre>
<pre><code class="lang-python">neighbourhood_group      <span class="hljs-number">5</span>
neighbourhood          <span class="hljs-number">221</span>
room_type                <span class="hljs-number">3</span>
dtype: int64
</code></pre>
<p>If we one-hot encoded all of the categorical features, we would create many columns filled mostly with zeros, which causes problems during model training. So we divide them into a list of low-cardinality categorical features to be one-hot encoded and a list of high-cardinality categorical features to be ordinally encoded. <code>low_card_categorical</code> is the subset of categorical features with low cardinality (few unique values) that will be one-hot encoded; <code>high_card_categorical</code> is the subset with high cardinality (many unique values) that will be encoded with an ordinal encoding.</p>
<pre><code class="lang-python">low_card_categorical = [<span class="hljs-string">'neighbourhood_group'</span>, <span class="hljs-string">'room_type'</span>] 
high_card_categorical = [<span class="hljs-string">'neighbourhood'</span>]
</code></pre>
<p>All remaining numeric columns go into <code>continuous</code>, a list of continuous numerical features that will be standardized for analysis:</p>
<pre><code class="lang-python">continuous = [<span class="hljs-string">'minimum_nights'</span>, <span class="hljs-string">'number_of_reviews'</span>, <span class="hljs-string">'reviews_per_month'</span>,
              <span class="hljs-string">'calculated_host_listings_count'</span>, <span class="hljs-string">'availability_365'</span>]
</code></pre>
<p>The whole dataset's shape looks like this:</p>
<pre><code class="lang-python">data.shape
(<span class="hljs-number">48895</span>, <span class="hljs-number">16</span>)
</code></pre>
<p>We create a binary target, <code>target_median</code>, by thresholding the price at its median for classification purposes. Because splitting at the median yields a balanced binary target, we can safely use accuracy as an effective performance measure: counting the values shows an almost equal number of cases in the positive and negative classes.</p>
<pre><code class="lang-python">target_median.value_counts()
</code></pre>
<pre><code class="lang-python">price
<span class="hljs-number">0</span>    <span class="hljs-number">24472</span>
<span class="hljs-number">1</span>    <span class="hljs-number">24423</span>
Name: count, dtype: int64
</code></pre>
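<p>The construction of <code>target_median</code> itself is not shown above. A minimal sketch of one way to build such a balanced binary target, assuming the raw <code>price</code> column is available (the thresholding rule and toy values here are an assumption for illustration, not the author's exact code):</p>

```python
import pandas as pd

# Hypothetical reconstruction: label a listing 1 if its price is above the
# median price, else 0. Splitting at the median yields a near-balanced target.
prices = pd.Series([50, 80, 100, 120, 200], name="price")  # toy values
target_median = (prices > prices.median()).astype(int)
print(target_median.value_counts())
```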
<p>In the context of the Scikit-learn library, a transformer object is a component that is designed to perform data transformation tasks as part of a machine learning pipeline. These objects are crucial for preprocessing data, including steps such as normalization, encoding categorical variables, or reducing dimensionality.</p>
<p>In the next step, we will develop a series of transformers designed to preprocess the data, ensuring it is well-prepared for the analysis required for this project. These transformers will help clean, organize, and transform the raw data into a format that facilitates more accurate and insightful analysis.</p>
<ul>
<li><p><strong>categorical_onehot_encoding</strong>: This transformer is designed to perform one-hot encoding on low-cardinality categorical features. It converts categorical variables into a format that can be provided to machine learning algorithms, effectively representing each category as a binary vector.</p>
</li>
<li><p><strong>categorical_ord_encoding</strong>: This transformer is tailored for high-cardinality categorical features and employs ordinal encoding. It assigns integer values to unique categories based on their order, making it suitable for situations where categories have a meaningful sequence.</p>
</li>
<li><p><strong>numeric_passthrough</strong>: This transformer forwards the continuous numerical features to the next stage of the pipeline essentially unchanged. It is implemented as a constant-value imputer that replaces any missing values with 0, so downstream estimators receive a complete numeric matrix.</p>
</li>
</ul>
<pre><code class="lang-python">categorical_onehot_encoding = OneHotEncoder(handle_unknown=<span class="hljs-string">'ignore'</span>)
categorical_ord_encoding = OrdinalEncoder(handle_unknown=<span class="hljs-string">"use_encoded_value"</span>, unknown_value=np.nan)
numeric_passthrough = SimpleImputer(strategy=<span class="hljs-string">"constant"</span>, fill_value=<span class="hljs-number">0</span>)
</code></pre>
<p>The code below creates a ColumnTransformer object that manages the different feature types in the dataset by applying a specific transformation to each subset: one-hot encoding for the low-cardinality categorical features, ordinal encoding for the high-cardinality categorical feature, and the pass-through imputer for the continuous numerical features.</p>
<p>The transformer is configured to drop any features not explicitly included in the defined transformation steps, maintaining a clean and relevant set of output features, and <code>verbose_feature_names_out=False</code> keeps the output feature names concise. Setting <code>sparse_threshold</code> to zero guarantees that the transformer returns dense arrays, regardless of the input data's sparsity.</p>
<pre><code class="lang-python">column_transform = ColumnTransformer(
    [(<span class="hljs-string">'low_card_categories'</span>, categorical_onehot_encoding, low_card_categorical),
     (<span class="hljs-string">'high_card_categories'</span>, categorical_ord_encoding, high_card_categorical),
     (<span class="hljs-string">'numeric'</span>, numeric_passthrough, continuous),
    ],
    remainder=<span class="hljs-string">'drop'</span>,
    verbose_feature_names_out=<span class="hljs-literal">False</span>,
    sparse_threshold=<span class="hljs-number">0.0</span>)
</code></pre>
<p><strong>K-fold cross-validation</strong> is a powerful technique used to evaluate the performance of a machine learning model. It involves dividing the available training dataset into k distinct partitions or "folds." The process begins by training the model k times, where in each iteration, the model is trained on k-1 of these partitions while reserving the one remaining partition as a testing set. This means that each fold gets the opportunity to serve as the validation set once, allowing for a comprehensive assessment of the model’s performance.</p>
<p>Once all k models have been trained and evaluated, we calculate the average of the performance scores obtained from each fold. Additionally, we assess the standard deviation of these scores to gauge the consistency of the model’s performance across the different subsets of data. This statistical approach not only provides a more reliable estimate of how the model is likely to perform on unseen data but also quantifies the uncertainty surrounding this estimate, giving insights into the model's robustness and generalizability.</p>
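<p>The mechanics of the split can be illustrated directly with Scikit-learn's <code>KFold</code> on a toy array; with k=5 every sample appears in a test fold exactly once:</p>

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(10)  # ten toy samples
cv = KFold(n_splits=5, shuffle=True, random_state=0)
for fold, (train_idx, test_idx) in enumerate(cv.split(X)):
    # each iteration trains on 4 folds (8 samples), tests on 1 fold (2 samples)
    print(f"fold {fold}: train={len(train_idx)} test={len(test_idx)}")
```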
<p>We are setting up a <strong>RandomForestClassifier</strong>, a popular ensemble learning method for classification tasks. Here we use 300 estimators, meaning the model builds 300 individual decision trees to improve predictive accuracy and robustness, and we require a minimum of 3 samples per leaf node, which helps prevent overfitting by ensuring each leaf has enough samples to generalize. Note that <code>column_transform</code> is redefined below for this model using only the one-hot encoded low-cardinality categories and the numeric features; the ordinally encoded neighbourhood feature is left out.</p>
<pre><code class="lang-python">accuracy = make_scorer(accuracy_score)
cv = KFold(<span class="hljs-number">5</span>, shuffle=<span class="hljs-literal">True</span>, random_state=<span class="hljs-number">0</span>)
model = RandomForestClassifier(n_estimators=<span class="hljs-number">300</span>,
                               min_samples_leaf=<span class="hljs-number">3</span>,
                               random_state=<span class="hljs-number">0</span>)

column_transform = ColumnTransformer(
    [(<span class="hljs-string">'categories'</span>, categorical_onehot_encoding, low_card_categorical),
     (<span class="hljs-string">'numeric'</span>, numeric_passthrough, continuous)],
    remainder=<span class="hljs-string">'drop'</span>,
    verbose_feature_names_out=<span class="hljs-literal">False</span>,
    sparse_threshold=<span class="hljs-number">0.0</span>)
</code></pre>
<p>Next we assemble a Scikit-learn pipeline that first applies the column transformations and then trains the Random Forest classifier on the transformed features. Chaining the preprocessing steps and the model into a single estimator streamlines the machine learning workflow and ensures the same transformations are applied during both training and prediction.</p>
<pre><code class="lang-python">model_pipeline = Pipeline(
    [(<span class="hljs-string">'processing'</span>, column_transform),
     (<span class="hljs-string">'modeling'</span>, model)])
</code></pre>
<p>In our analysis, we utilize Scikit-learn's <code>cross_validate</code> function to perform a comprehensive five-fold cross-validation. This method involves segmenting our dataset into five distinct subsets or "folds." For each iteration, we train the model on four of these folds while using the remaining fold as a validation set. This process is repeated until each fold has been used as the validation set once. Throughout this procedure, we calculate and record the accuracy scores for each fold, allowing us to assess the performance of our defined machine learning pipeline more robustly. By averaging these accuracy scores across all five folds, we can obtain a reliable estimate of the model's overall effectiveness.</p>
<pre><code class="lang-python">cv_scores = cross_validate(estimator=model_pipeline,
                           X=data,
                           y=target_median,
                           scoring=accuracy,
                           cv=cv,
                           return_train_score=<span class="hljs-literal">True</span>,
                           return_estimator=<span class="hljs-literal">True</span>)
</code></pre>
<p>Finally, we retrieve the mean and standard deviation of the accuracy scores from cross-validation, along with the average fit and scoring times:</p>
<pre><code class="lang-python">mean_cv = np.mean(cv_scores[<span class="hljs-string">'test_score'</span>])
std_cv = np.std(cv_scores[<span class="hljs-string">'test_score'</span>])
fit_time = np.mean(cv_scores[<span class="hljs-string">'fit_time'</span>])
score_time = np.mean(cv_scores[<span class="hljs-string">'score_time'</span>])
print(<span class="hljs-string">f"<span class="hljs-subst">{mean_cv:<span class="hljs-number">0.3</span>f}</span> (<span class="hljs-subst">{std_cv:<span class="hljs-number">0.3</span>f}</span>)"</span>,
      <span class="hljs-string">f"fit: <span class="hljs-subst">{fit_time:<span class="hljs-number">0.2</span>f}</span> secs pred: <span class="hljs-subst">{score_time:<span class="hljs-number">0.2</span>f}</span> secs"</span>)
</code></pre>
<pre><code class="lang-python"><span class="hljs-number">0.826</span> (<span class="hljs-number">0.004</span>) fit: <span class="hljs-number">13.86</span> secs pred: <span class="hljs-number">0.58</span> secs
</code></pre>
<p>We have successfully implemented the Random Forest algorithm using Scikit-learn, achieving about 82.6% cross-validated accuracy on the balanced binary target.</p>
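<p>One detail worth noting: because we passed <code>return_estimator=True</code>, <code>cv_scores['estimator']</code> holds the five fitted pipelines. A sketch on toy data of how a fold's fitted model can be inspected afterwards, for example via its feature importances (the toy frame here is an illustration, not the Airbnb data):</p>

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, cross_validate

# Toy data where feature 'a' fully determines the label
rng = np.random.default_rng(0)
X = pd.DataFrame({'a': rng.normal(size=100), 'b': rng.normal(size=100)})
y = (X['a'] > 0).astype(int)

cv_res = cross_validate(
    RandomForestClassifier(n_estimators=20, random_state=0),
    X, y, cv=KFold(5, shuffle=True, random_state=0),
    return_estimator=True)

first_model = cv_res['estimator'][0]  # fitted model from the first fold
print(first_model.feature_importances_)  # importance of 'a' should dominate
```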
]]></content:encoded></item></channel></rss>