WebPage Summarizer


  • 🌍 Task: Summarizing webpage content using AI.

  • 🧠 Models: OpenAI’s gpt-4o-mini and llama3.2 (via Ollama) for text summarization.

  • 🕵️‍♂️ Data Extraction: Selenium for handling both static and JavaScript-rendered websites.

  • 📌 Output Format: Markdown-formatted summaries.

  • 🔗 Scope: Processes only the given webpage URL (not the entire site).

  • 🚀 Tools: Python, Requests, Selenium, BeautifulSoup, OpenAI API, Ollama.

🛠️ Requirements

  • ⚙️ Hardware: ✅ CPU is sufficient; no GPU required

  • 🔑 OpenAI API key (for the GPT model; see the .env sketch after this list)

  • Install Ollama and pull llama3.2 or another lightweight model

  • Google Chrome browser installed
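The OpenAI key is read from a .env file in the project root via python-dotenv (loaded in the configuration cell below). A minimal sketch of that file, with a placeholder value:

# .env  (placeholder only; never commit a real key)
OPENAI_API_KEY=sk-proj-your-key-here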

✨ This script handles both static and JavaScript-rendered websites by driving headless Chrome through Selenium, which makes content extraction reliable even for modern web applications.

Let’s get started and automate website summarization! 🚀


🛠️ Environment Setup & Dependencies

%pip install selenium webdriver-manager
Collecting selenium
  Downloading selenium-4.35.0-py3-none-any.whl.metadata (7.4 kB)
Collecting webdriver-manager
  Downloading webdriver_manager-4.0.2-py2.py3-none-any.whl.metadata (12 kB)
Requirement already satisfied: urllib3<3.0,>=2.5.0 in e:\anaconda\envs\llms\lib\site-packages (from urllib3[socks]<3.0,>=2.5.0->selenium) (2.5.0)
Collecting trio~=0.30.0 (from selenium)
  Downloading trio-0.30.0-py3-none-any.whl.metadata (8.5 kB)
Collecting trio-websocket~=0.12.2 (from selenium)
  Downloading trio_websocket-0.12.2-py3-none-any.whl.metadata (5.1 kB)
Requirement already satisfied: certifi>=2025.6.15 in e:\anaconda\envs\llms\lib\site-packages (from selenium) (2025.8.3)
Collecting typing_extensions~=4.14.0 (from selenium)
  Downloading typing_extensions-4.14.1-py3-none-any.whl.metadata (3.0 kB)
Requirement already satisfied: websocket-client~=1.8.0 in e:\anaconda\envs\llms\lib\site-packages (from selenium) (1.8.0)
Requirement already satisfied: attrs>=23.2.0 in e:\anaconda\envs\llms\lib\site-packages (from trio~=0.30.0->selenium) (25.3.0)
Collecting sortedcontainers (from trio~=0.30.0->selenium)
  Downloading sortedcontainers-2.4.0-py2.py3-none-any.whl.metadata (10 kB)
Requirement already satisfied: idna in e:\anaconda\envs\llms\lib\site-packages (from trio~=0.30.0->selenium) (3.10)
Collecting outcome (from trio~=0.30.0->selenium)
  Downloading outcome-1.3.0.post0-py2.py3-none-any.whl.metadata (2.6 kB)
Requirement already satisfied: sniffio>=1.3.0 in e:\anaconda\envs\llms\lib\site-packages (from trio~=0.30.0->selenium) (1.3.1)
Requirement already satisfied: cffi>=1.14 in e:\anaconda\envs\llms\lib\site-packages (from trio~=0.30.0->selenium) (2.0.0)
Collecting wsproto>=0.14 (from trio-websocket~=0.12.2->selenium)
  Downloading wsproto-1.2.0-py3-none-any.whl.metadata (5.6 kB)
Requirement already satisfied: pysocks!=1.5.7,<2.0,>=1.5.6 in e:\anaconda\envs\llms\lib\site-packages (from urllib3[socks]<3.0,>=2.5.0->selenium) (1.7.1)
Requirement already satisfied: requests in e:\anaconda\envs\llms\lib\site-packages (from webdriver-manager) (2.32.5)
Requirement already satisfied: python-dotenv in e:\anaconda\envs\llms\lib\site-packages (from webdriver-manager) (1.1.1)
Requirement already satisfied: packaging in e:\anaconda\envs\llms\lib\site-packages (from webdriver-manager) (25.0)
Requirement already satisfied: pycparser in e:\anaconda\envs\llms\lib\site-packages (from cffi>=1.14->trio~=0.30.0->selenium) (2.22)
Requirement already satisfied: h11<1,>=0.9.0 in e:\anaconda\envs\llms\lib\site-packages (from wsproto>=0.14->trio-websocket~=0.12.2->selenium) (0.16.0)
Requirement already satisfied: charset_normalizer<4,>=2 in e:\anaconda\envs\llms\lib\site-packages (from requests->webdriver-manager) (3.4.3)
Downloading selenium-4.35.0-py3-none-any.whl (9.6 MB)
Downloading trio-0.30.0-py3-none-any.whl (499 kB)
Downloading trio_websocket-0.12.2-py3-none-any.whl (21 kB)
Downloading typing_extensions-4.14.1-py3-none-any.whl (43 kB)
Downloading webdriver_manager-4.0.2-py2.py3-none-any.whl (27 kB)
Downloading outcome-1.3.0.post0-py2.py3-none-any.whl (10 kB)
Downloading wsproto-1.2.0-py3-none-any.whl (24 kB)
Downloading sortedcontainers-2.4.0-py2.py3-none-any.whl (29 kB)
Installing collected packages: sortedcontainers, wsproto, typing_extensions, outcome, webdriver-manager, trio, trio-websocket, selenium

  Attempting uninstall: typing_extensions
    Found existing installation: typing_extensions 4.15.0
    Uninstalling typing_extensions-4.15.0:
      Successfully uninstalled typing_extensions-4.15.0

Successfully installed outcome-1.3.0.post0 selenium-4.35.0 sortedcontainers-2.4.0 trio-0.30.0 trio-websocket-0.12.2 typing_extensions-4.14.1 webdriver-manager-4.0.2 wsproto-1.2.0
Note: you may need to restart the kernel to use updated packages.
# ===========================
# System & Environment
# ===========================
import os
from dotenv import load_dotenv

# ===========================
# Web Scraping
# ===========================
import time
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# ===========================
# AI-related
# ===========================
from IPython.display import Markdown, display
from openai import OpenAI
import ollama

🔐 Model Configuration & Authentication

load_dotenv(override=True)
api_key = os.getenv('OPENAI_API_KEY')

if not api_key:
   raise ValueError("OPENAI_API_KEY not found in environment variables")

print("βœ… API key loaded successfully!")
openai = OpenAI()
✅ API key loaded successfully!
MODEL_OPENAI = "gpt-4o-mini"
MODEL_OLLAMA = "llama3.2"
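
Before calling the local model, it helps to confirm that the Ollama server is reachable and that llama3.2 has actually been pulled. A minimal sketch; the response shape of ollama.list() differs between library versions, hence the defensive attribute access:

# Optional sanity check for the local Ollama setup (illustrative sketch)
try:
    resp = ollama.list()
    # Newer versions return a typed object with .models; older ones return a dict
    models = resp.models if hasattr(resp, "models") else resp["models"]
    names = [getattr(m, "model", None) or m.get("name", "") for m in models]
    if any(MODEL_OLLAMA in name for name in names):
        print(f"✅ {MODEL_OLLAMA} is available locally")
    else:
        print(f"⚠️ {MODEL_OLLAMA} not found; run `ollama pull {MODEL_OLLAMA}` first")
except Exception as e:
    print(f"⚠️ Could not reach the Ollama server: {e}")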

🌐 Web Scraping Infrastructure

class WebsiteCrawler:
    def __init__(self, url):
        self.url = url
        self.title = ""
        self.text = ""
        self.scrape()

    def scrape(self):
        driver = None  # created below; tracked so cleanup can run in finally
        try:
            # Chrome options
            chrome_options = Options()
            chrome_options.add_argument("--headless")
            chrome_options.add_argument("--no-sandbox")
            chrome_options.add_argument("--disable-dev-shm-usage")
            chrome_options.add_argument("--disable-gpu")
            chrome_options.add_argument("--window-size=1920,1080")
            chrome_options.add_argument("--user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36")

            # Try to find Chrome
            chrome_paths = [
                r"C:\Program Files\Google\Chrome\Application\chrome.exe",
                r"C:\Program Files (x86)\Google\Chrome\Application\chrome.exe",
                r"C:\Users\{}\AppData\Local\Google\Chrome\Application\chrome.exe".format(os.getenv('USERNAME')),
            ]

            chrome_binary = None
            for path in chrome_paths:
                if os.path.exists(path):
                    chrome_binary = path
                    break

            if chrome_binary:
                chrome_options.binary_location = chrome_binary

            # Create driver
            driver = webdriver.Chrome(options=chrome_options)
            driver.set_page_load_timeout(30)

            print(f"πŸ” Loading: {self.url}")
            driver.get(self.url)

            # Wait for page to load
            time.sleep(5)

            # Try to wait for main content
            try:
                WebDriverWait(driver, 10).until(
                    EC.presence_of_element_located((By.TAG_NAME, "main"))
                )
            except Exception:
                try:
                    WebDriverWait(driver, 10).until(
                        EC.presence_of_element_located((By.TAG_NAME, "body"))
                    )
                except Exception:
                    pass  # Continue anyway

            # Get title and page source
            self.title = driver.title
            page_source = driver.page_source

            print(f"βœ… Page loaded: {self.title}")

            # Parse with BeautifulSoup
            soup = BeautifulSoup(page_source, 'html.parser')

            # Remove unwanted elements
            for element in soup(["script", "style", "img", "input", "button", "nav", "footer", "header"]):
                element.decompose()

            # Get main content
            main = soup.find('main') or soup.find('article') or soup.find(class_='content') or soup.find('body')
            if main:
                self.text = main.get_text(separator="\n", strip=True)
            else:
                self.text = soup.get_text(separator="\n", strip=True)

            # Clean up text
            lines = [line.strip() for line in self.text.split('\n') if line.strip() and len(line.strip()) > 2]
            self.text = '\n'.join(lines[:200])  # Limit to first 200 lines

            print(f"πŸ“„ Extracted {len(self.text)} characters")

        except Exception as e:
            print(f"❌ Error occurred: {e}")
            self.title = "Error occurred"
            self.text = "Could not scrape website content"
        finally:
            # Ensure the browser is closed even if scraping failed mid-way
            if driver is not None:
                try:
                    driver.quit()
                except Exception:
                    pass
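
Before wiring the crawler to a model, it can be exercised on its own. A minimal sketch; the URL is just a stand-in:

# Quick standalone check of the crawler (illustrative URL)
site = WebsiteCrawler("https://example.com")
print(site.title)        # the page <title>
print(site.text[:200])   # first 200 characters of the extracted text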

🧠 Prompt Engineering & Templates

system_prompt = "You are an assistant that analyzes the contents of a website \
and provides a short summary, ignoring text that might be navigation related. \
Respond in markdown."

def user_prompt_for(website):
    user_prompt = f"You are looking at a website titled {website.title}"
    user_prompt += "\nThe contents of this website is as follows; please provide a short summary of this website in markdown. If it includes news or announcements, then summarize these too.\n\n"
    user_prompt += website.text
    return user_prompt

def messages_for(website):
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_prompt_for(website)}
    ]
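
To see exactly what payload the chat APIs receive, messages_for can be invoked with a tiny stand-in object (the _DemoSite class below is purely illustrative):

# Inspect the message payload with a hypothetical stand-in site
class _DemoSite:
    title = "Example Domain"
    text = "This domain is for use in illustrative examples in documents."

for message in messages_for(_DemoSite()):
    print(message["role"], ":", message["content"][:80])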

📝 Summarization

def summarize_gpt(url):
    """Scrape website and summarize with GPT"""
    site = WebsiteCrawler(url)

    if "Error occurred" in site.title or len(site.text) < 50:
        print(f"❌ Failed to scrape meaningful content from {url}")
        return

    print("πŸ€– Creating summary...")

    # Create summary using the shared prompt builder
    response = openai.chat.completions.create(
        model=MODEL_OPENAI,
        messages=messages_for(site)
    )

    web_summary = response.choices[0].message.content
    display(Markdown(web_summary))

summarize_gpt('https://openai.com')
# summarize_gpt('https://stripe.com')
# summarize_gpt('https://vercel.com')
# summarize_gpt('https://react.dev')
🔍 Loading: https://openai.com
✅ Page loaded: OpenAI
📄 Extracted 3417 characters
🤖 Creating summary...

OpenAI Website Summary

OpenAI’s website serves as a hub for AI-driven products and services, featuring various tools and resources for users. Key components include:

Features

  • ChatGPT: A versatile AI tool assisting with diverse tasks ranging from travel planning and language translation to coding and creative writing.

  • GPT-5: The latest model introduced, noted as the smartest and fastest yet, enhancing various applications including creative and medical research.

News and Announcements

  • GPT-5 Release: Positioned as OpenAI’s most advanced model, aimed at making AI more useful across different domains.

  • Stargate Updates: Partnership with Oracle and SoftBank to expand AI data center capabilities.

  • New Collaborations: Strategic partnerships with companies like NVIDIA and SAP to enhance AI infrastructure and services.

Recent Publications

  • Focused on real-world applications of their AI models and ongoing enhancements in safety, performance measurement, and reducing biases in AI.

The website emphasizes innovative AI applications, partnerships, and updates for effective deployment of AI solutions in various fields.

def summarize_ollama(url):
    """Scrape website and summarize with a local Ollama model"""
    website = WebsiteCrawler(url)

    if "Error occurred" in website.title or len(website.text) < 50:
        print(f"❌ Failed to scrape meaningful content from {url}")
        return

    response = ollama.chat(
        model=MODEL_OLLAMA,
        messages=messages_for(website))
    display(Markdown(response['message']['content']))  # Render the summary as Markdown

summarize_ollama('https://github.com')
# summarize_ollama('https://nextjs.org')
🔍 Loading: https://github.com
✅ Page loaded: GitHub · Build and ship software on a single, collaborative platform · GitHub
📄 Extracted 4757 characters

GitHub

A Collaborative Platform for Building and Shipping Software

GitHub is a leading platform for developers to build, ship, and manage software projects collaboratively. The company features various tools and services, including GitHub Copilot, an AI-powered developer platform that assists with coding, security, and more.

Features

  • GitHub Copilot: A chat-based code editor that provides AI-powered assistance with writing, reviewing, and refactoring code.

  • Dependabot: A tool for automating dependency updates to ensure software is secure and up-to-date.

  • GitHub Actions: A comprehensive platform for managing CI/CD (Continuous Integration and Continuous Deployment) pipelines.

  • GitHub Codespaces: A cloud-based development environment that allows users to start building software immediately.

Benefits

  • Increased Productivity: GitHub Copilot helps developers work 55% faster, with features like code completion, chat, and more.

  • Improved Security: The platform offers automated vulnerability fixes, security campaigns, and secret scanning for detecting and preventing leaked secrets across organizations.

  • Enhanced Collaboration: GitHub provides a single, collaborative platform for managing projects, issues, pull requests, and discussions.

Customer Stories

  • Duolingo boosts developer speed by 25% with GitHub Copilot.

  • Mercedes-Benz standardizes source code and automates onboarding using GitHub.

  • Mercado Libre cuts coding time by 50% using GitHub.

News and Announcements

  • The AI wave continues to grow on software development teams, as stated in a survey (2024).

Note: This summary focuses on the core features and benefits of GitHub and its associated products, while ignoring navigation-related text.