WebPage Summarizer#
📝 Task: Summarizing webpage content using AI.
🧠 Model: OpenAI's gpt-4o-mini and llama3.2 for text summarization.
🕵️ Data Extraction: Selenium for handling both static and JavaScript-rendered websites.
📄 Output Format: Markdown-formatted summaries.
🔍 Scope: Processes only the given webpage URL (not the entire site).
🔧 Tools: Python, Requests, Selenium, BeautifulSoup, OpenAI API, Ollama.
🛠️ Requirements
⚙️ Hardware: ✅ CPU is sufficient; no GPU required
🔑 OpenAI API Key (for the GPT model)
Install Ollama and pull llama3.2 or another lightweight model
Google Chrome browser installed
✨ This script handles both JavaScript and non-JavaScript websites, using Selenium with Chrome WebDriver for reliable content extraction from modern web applications.
Let's get started and automate website summarization! 🚀
🛠️ Environment Setup & Dependencies#
%pip install selenium webdriver-manager
Collecting selenium
Downloading selenium-4.35.0-py3-none-any.whl.metadata (7.4 kB)
Collecting webdriver-manager
Downloading webdriver_manager-4.0.2-py2.py3-none-any.whl.metadata (12 kB)
Requirement already satisfied: urllib3<3.0,>=2.5.0 in e:\anaconda\envs\llms\lib\site-packages (from urllib3[socks]<3.0,>=2.5.0->selenium) (2.5.0)
Collecting trio~=0.30.0 (from selenium)
Downloading trio-0.30.0-py3-none-any.whl.metadata (8.5 kB)
Collecting trio-websocket~=0.12.2 (from selenium)
Downloading trio_websocket-0.12.2-py3-none-any.whl.metadata (5.1 kB)
Requirement already satisfied: certifi>=2025.6.15 in e:\anaconda\envs\llms\lib\site-packages (from selenium) (2025.8.3)
Collecting typing_extensions~=4.14.0 (from selenium)
Downloading typing_extensions-4.14.1-py3-none-any.whl.metadata (3.0 kB)
Requirement already satisfied: websocket-client~=1.8.0 in e:\anaconda\envs\llms\lib\site-packages (from selenium) (1.8.0)
Requirement already satisfied: attrs>=23.2.0 in e:\anaconda\envs\llms\lib\site-packages (from trio~=0.30.0->selenium) (25.3.0)
Collecting sortedcontainers (from trio~=0.30.0->selenium)
Downloading sortedcontainers-2.4.0-py2.py3-none-any.whl.metadata (10 kB)
Requirement already satisfied: idna in e:\anaconda\envs\llms\lib\site-packages (from trio~=0.30.0->selenium) (3.10)
Collecting outcome (from trio~=0.30.0->selenium)
Downloading outcome-1.3.0.post0-py2.py3-none-any.whl.metadata (2.6 kB)
Requirement already satisfied: sniffio>=1.3.0 in e:\anaconda\envs\llms\lib\site-packages (from trio~=0.30.0->selenium) (1.3.1)
Requirement already satisfied: cffi>=1.14 in e:\anaconda\envs\llms\lib\site-packages (from trio~=0.30.0->selenium) (2.0.0)
Collecting wsproto>=0.14 (from trio-websocket~=0.12.2->selenium)
Downloading wsproto-1.2.0-py3-none-any.whl.metadata (5.6 kB)
Requirement already satisfied: pysocks!=1.5.7,<2.0,>=1.5.6 in e:\anaconda\envs\llms\lib\site-packages (from urllib3[socks]<3.0,>=2.5.0->selenium) (1.7.1)
Requirement already satisfied: requests in e:\anaconda\envs\llms\lib\site-packages (from webdriver-manager) (2.32.5)
Requirement already satisfied: python-dotenv in e:\anaconda\envs\llms\lib\site-packages (from webdriver-manager) (1.1.1)
Requirement already satisfied: packaging in e:\anaconda\envs\llms\lib\site-packages (from webdriver-manager) (25.0)
Requirement already satisfied: pycparser in e:\anaconda\envs\llms\lib\site-packages (from cffi>=1.14->trio~=0.30.0->selenium) (2.22)
Requirement already satisfied: h11<1,>=0.9.0 in e:\anaconda\envs\llms\lib\site-packages (from wsproto>=0.14->trio-websocket~=0.12.2->selenium) (0.16.0)
Requirement already satisfied: charset_normalizer<4,>=2 in e:\anaconda\envs\llms\lib\site-packages (from requests->webdriver-manager) (3.4.3)
Downloading selenium-4.35.0-py3-none-any.whl (9.6 MB)
Downloading trio-0.30.0-py3-none-any.whl (499 kB)
Downloading trio_websocket-0.12.2-py3-none-any.whl (21 kB)
Downloading typing_extensions-4.14.1-py3-none-any.whl (43 kB)
Downloading webdriver_manager-4.0.2-py2.py3-none-any.whl (27 kB)
Downloading outcome-1.3.0.post0-py2.py3-none-any.whl (10 kB)
Downloading wsproto-1.2.0-py3-none-any.whl (24 kB)
Downloading sortedcontainers-2.4.0-py2.py3-none-any.whl (29 kB)
Installing collected packages: sortedcontainers, wsproto, typing_extensions, outcome, webdriver-manager, trio, trio-websocket, selenium
  Attempting uninstall: typing_extensions
    Found existing installation: typing_extensions 4.15.0
    Uninstalling typing_extensions-4.15.0:
      Successfully uninstalled typing_extensions-4.15.0
Successfully installed outcome-1.3.0.post0 selenium-4.35.0 sortedcontainers-2.4.0 trio-0.30.0 trio-websocket-0.12.2 typing_extensions-4.14.1 webdriver-manager-4.0.2 wsproto-1.2.0
Note: you may need to restart the kernel to use updated packages.
# ===========================
# System & Environment
# ===========================
import os
from dotenv import load_dotenv
# ===========================
# Web Scraping
# ===========================
import time
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
# ===========================
# AI-related
# ===========================
from IPython.display import Markdown, display
from openai import OpenAI
import ollama
🔑 Model Configuration & Authentication#
load_dotenv(override=True)
api_key = os.getenv('OPENAI_API_KEY')
if not api_key:
    raise ValueError("OPENAI_API_KEY not found in environment variables")

print("✅ API key loaded successfully!")

openai = OpenAI()
✅ API key loaded successfully!
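Beyond checking that the variable exists, a lightweight format check can catch common paste mistakes (stray whitespace, a truncated key). This is a heuristic sketch only; the helper name and the "sk-" prefix assumption are not an official format specification:

```python
def looks_like_openai_key(key: str) -> bool:
    # Heuristic sanity check only; key formats are an assumption and can change.
    # Checks: no surrounding whitespace, a plausible "sk-" prefix, sane length.
    return key == key.strip() and key.startswith("sk-") and len(key) > 20
```

If this returns False for a key you believe is valid, the key itself is the authority; the check is just a tripwire for copy-paste errors.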
MODEL_OPENAI = "gpt-4o-mini"
MODEL_OLLAMA = "llama3.2"
🌐 Web Scraping Infrastructure#
class WebsiteCrawler:
    def __init__(self, url):
        self.url = url
        self.title = ""
        self.text = ""
        self.scrape()

    def scrape(self):
        try:
            # Chrome options
            chrome_options = Options()
            chrome_options.add_argument("--headless")
            chrome_options.add_argument("--no-sandbox")
            chrome_options.add_argument("--disable-dev-shm-usage")
            chrome_options.add_argument("--disable-gpu")
            chrome_options.add_argument("--window-size=1920,1080")
            chrome_options.add_argument("--user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36")

            # Try to find Chrome in common Windows install locations
            chrome_paths = [
                r"C:\Program Files\Google\Chrome\Application\chrome.exe",
                r"C:\Program Files (x86)\Google\Chrome\Application\chrome.exe",
                r"C:\Users\{}\AppData\Local\Google\Chrome\Application\chrome.exe".format(os.getenv('USERNAME')),
            ]
            chrome_binary = None
            for path in chrome_paths:
                if os.path.exists(path):
                    chrome_binary = path
                    break
            if chrome_binary:
                chrome_options.binary_location = chrome_binary

            # Create driver
            driver = webdriver.Chrome(options=chrome_options)
            driver.set_page_load_timeout(30)

            print(f"🌐 Loading: {self.url}")
            driver.get(self.url)

            # Give JavaScript-heavy pages time to render
            time.sleep(5)

            # Try to wait for main content, falling back to <body>
            try:
                WebDriverWait(driver, 10).until(
                    EC.presence_of_element_located((By.TAG_NAME, "main"))
                )
            except Exception:
                try:
                    WebDriverWait(driver, 10).until(
                        EC.presence_of_element_located((By.TAG_NAME, "body"))
                    )
                except Exception:
                    pass  # Continue anyway

            # Get title and page source, then release the browser
            self.title = driver.title
            page_source = driver.page_source
            driver.quit()

            print(f"✅ Page loaded: {self.title}")

            # Parse with BeautifulSoup
            soup = BeautifulSoup(page_source, 'html.parser')

            # Remove unwanted elements
            for element in soup(["script", "style", "img", "input", "button", "nav", "footer", "header"]):
                element.decompose()

            # Get main content (class lookup needs find(class_=...); find('.content') would match nothing)
            main = soup.find('main') or soup.find('article') or soup.find(class_='content') or soup.find('body')
            if main:
                self.text = main.get_text(separator="\n", strip=True)
            else:
                self.text = soup.get_text(separator="\n", strip=True)

            # Clean up text: drop blank or near-empty lines, then cap the length
            lines = [line.strip() for line in self.text.split('\n') if line.strip() and len(line.strip()) > 2]
            self.text = '\n'.join(lines[:200])  # Limit to first 200 lines

            print(f"📄 Extracted {len(self.text)} characters")

        except Exception as e:
            print(f"❌ Error occurred: {e}")
            self.title = "Error occurred"
            self.text = "Could not scrape website content"
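The BeautifulSoup cleanup step above can be exercised in isolation on a small HTML snippet, without launching a browser (the markup below is made up for illustration):

```python
from bs4 import BeautifulSoup

html = """<html><body>
<nav>Home | About</nav>
<main><h1>Hello</h1><p>First paragraph.</p><script>var x = 1;</script></main>
<footer>(c) 2024</footer>
</body></html>"""

soup = BeautifulSoup(html, "html.parser")
# Strip elements that rarely carry summarizable content
for element in soup(["script", "style", "nav", "footer"]):
    element.decompose()

# Prefer <main>, falling back to <body>, mirroring the crawler's logic
main = soup.find("main") or soup.find("body")
text = main.get_text(separator="\n", strip=True)
print(text)  # → Hello\nFirst paragraph.
```

Only the `<main>` content survives; the navigation, footer, and inline script are gone before the text ever reaches the model.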
🧠 Prompt Engineering & Templates#
system_prompt = "You are an assistant that analyzes the contents of a website \
and provides a short summary, ignoring text that might be navigation related. \
Respond in markdown."
def user_prompt_for(website):
    user_prompt = f"You are looking at a website titled {website.title}"
    user_prompt += "\nThe contents of this website are as follows; please provide a short summary of this website in markdown. If it includes news or announcements, then summarize these too.\n\n"
    user_prompt += website.text
    return user_prompt
def messages_for(website):
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_prompt_for(website)}
    ]
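The message structure these helpers produce can be sanity-checked without a live scrape by passing a minimal stand-in object (`DummySite` below is hypothetical; the real `WebsiteCrawler` does the scraping, and the prompt text here is a shortened sketch of the one above):

```python
class DummySite:
    # Hypothetical stand-in for WebsiteCrawler: just the attributes
    # the prompt helpers read.
    title = "Example Domain"
    text = "This domain is for use in illustrative examples in documents."

system_prompt = (
    "You are an assistant that analyzes the contents of a website "
    "and provides a short summary, ignoring text that might be "
    "navigation related. Respond in markdown."
)

def user_prompt_for(website):
    user_prompt = f"You are looking at a website titled {website.title}"
    user_prompt += "\nThe contents of this website are as follows; please provide a short summary in markdown.\n\n"
    user_prompt += website.text
    return user_prompt

def messages_for(website):
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_prompt_for(website)},
    ]

msgs = messages_for(DummySite())
print([m["role"] for m in msgs])  # → ['system', 'user']
```

Both the OpenAI and Ollama chat APIs accept this same `[{role, content}, …]` list, which is why one builder serves both backends below.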
📝 Summarization#
def summarize_gpt(url):
    """Scrape website and summarize with GPT"""
    site = WebsiteCrawler(url)
    if "Error occurred" in site.title or len(site.text) < 50:
        print(f"❌ Failed to scrape meaningful content from {url}")
        return

    print("🤖 Creating summary...")
    # Create summary using the shared message builder
    response = openai.chat.completions.create(
        model=MODEL_OPENAI,
        messages=messages_for(site)
    )
    web_summary = response.choices[0].message.content
    display(Markdown(web_summary))
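Chat API calls can fail transiently (rate limits, network blips). One option is a generic retry wrapper around the call; treat the sketch below as an optional pattern, not part of the notebook's pipeline, and note the OpenAI SDK also performs some retries on its own:

```python
import time

def with_retries(fn, attempts=3, base_delay=1.0):
    """Call fn(), retrying with exponential backoff on any exception."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise  # Out of attempts: surface the last error
            time.sleep(base_delay * (2 ** attempt))
```

Usage would look like `with_retries(lambda: openai.chat.completions.create(model=MODEL_OPENAI, messages=messages_for(site)))`; narrowing the `except` to specific error types is advisable in real use.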
summarize_gpt('https://openai.com')
# summarize_gpt('https://stripe.com')
# summarize_gpt('https://vercel.com')
# summarize_gpt('https://react.dev')
🌐 Loading: https://openai.com
✅ Page loaded: OpenAI
📄 Extracted 3417 characters
🤖 Creating summary...
OpenAI Website Summary
OpenAI's website serves as a hub for AI-driven products and services, featuring various tools and resources for users. Key components include:
Features
ChatGPT: A versatile AI tool assisting with diverse tasks ranging from travel planning and language translation to coding and creative writing.
GPT-5: The latest model introduced, noted as the smartest and fastest yet, enhancing various applications including creative and medical research.
News and Announcements
GPT-5 Release: Positioned as OpenAI's most advanced model, aimed at making AI more useful across different domains.
Stargate Updates: Partnership with Oracle and SoftBank to expand AI data center capabilities.
New Collaborations: Strategic partnerships with companies like NVIDIA and SAP to enhance AI infrastructure and services.
Recent Publications
Focused on real-world applications of their AI models and ongoing enhancements in safety, performance measurement, and reducing biases in AI.
The website emphasizes innovative AI applications, partnerships, and updates for effective deployment of AI solutions in various fields.
def summarize_ollama(url):
    """Scrape website and summarize with a local Ollama model"""
    website = WebsiteCrawler(url)
    response = ollama.chat(
        model=MODEL_OLLAMA,
        messages=messages_for(website)
    )
    display(Markdown(response['message']['content']))  # Generate and display output
summarize_ollama('https://github.com')
# summarize_ollama('https://nextjs.org')
🌐 Loading: https://github.com
✅ Page loaded: GitHub · Build and ship software on a single, collaborative platform · GitHub
📄 Extracted 4757 characters
GitHub
A Collaborative Platform for Building and Shipping Software
GitHub is a leading platform for developers to build, ship, and manage software projects collaboratively. The company features various tools and services, including GitHub Copilot, an AI-powered developer platform that assists with coding, security, and more.
Features
GitHub Copilot: A chat-based code editor that provides AI-powered assistance with writing, reviewing, and refactoring code.
Dependabot: A tool for automating dependency updates to ensure software is secure and up-to-date.
GitHub Actions: A comprehensive platform for managing CI/CD (Continuous Integration and Continuous Deployment) pipelines.
GitHub Codespaces: A cloud-based development environment that allows users to start building software immediately.
Benefits
Increased Productivity: GitHub Copilot helps developers work 55% faster, with features like code completion, chat, and more.
Improved Security: The platform offers automated vulnerability fixes, security campaigns, and secret scanning for detecting and preventing leaked secrets across organizations.
Enhanced Collaboration: GitHub provides a single, collaborative platform for managing projects, issues, pull requests, and discussions.
Customer Stories
Duolingo boosts developer speed by 25% with GitHub Copilot.
Mercedes-Benz standardizes source code and automates onboarding using GitHub.
Mercado Libre cuts coding time by 50% using GitHub.
News and Announcements
The AI wave continues to grow on software development teams, as stated in a survey (2024).
Note: This summary focuses on the core features and benefits of GitHub and its associated products, while ignoring navigation-related text.