Website Scraping with ChatGPT API (Python and Beautiful Soup 4)

WARNING:

“Fair Use” is not a global right. It is an American concept that has no corresponding legal concept in many countries. Even in the United States, Fair Use is not set in stone; it is only ever really determined by a court on a case-by-case basis.

Courts evaluate fair use claims on a case-by-case basis, and the outcome of any given case depends on a fact-specific inquiry. This means that there is no formula to ensure that a predetermined percentage or amount of a work—or specific number of words, lines, pages, copies—may be used without permission.

https://www.copyright.gov/fair-use/index.html
  • 00:00 Intro
  • 06:23 Legal Comments on Copyright and Fair Use with AI
  • 11:18 ChatGPT Pricing and Web Scraping
  • 15:10 Getting the ChatGPT API Key
  • 18:00 Why can’t you “just” submit an entire website to ChatGPT
  • 27:36 Using Beautiful Soup 4 to Extract Text from a web page
  • 35:44 Combining Beautiful Soup and ChatGPT API to Summarize a web page
  • 45:00 Using ChatGPT to create tags for a web page
  • 50:20 Final Thoughts

Setup:

Create OpenAI Account and get API Key
https://openai.com/

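For Python (these examples use openai.ChatCompletion, which was removed in version 1.0 of the openai package, so install an older release):
pip3 install "openai<1"
pip3 install requests
pip3 install beautifulsoup4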
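
The scripts below hard-code the key as "APIKEY" to keep things short. A minimal sketch of one alternative is to read the key from an environment variable so it never ends up in a saved script. The helper name load_api_key is hypothetical, and OPENAI_API_KEY is assumed as the variable name; adjust both to your own setup.

```python
import os


def load_api_key(env_var="OPENAI_API_KEY"):
    """Return the API key stored in an environment variable.

    The helper name and the default variable name are assumptions made
    for this sketch; they are not part of the original scripts.
    """
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable first.")
    return key


# Usage in the scripts below, replacing the hard-coded string:
# openai.api_key = load_api_key()
```

Set the variable in your shell before running a script, for example with export OPENAI_API_KEY=... on macOS or Linux.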

Code Without Beautiful Soup or RegEx:

This code will fail on the gpt-3.5-turbo 4K model because the request requires 19K tokens. When you submit an entire web page, you also submit the JavaScript, CSS, and any other code that is not the actual article.

import requests
import openai

openai.api_key = "APIKEY"

url = "https://arstechnica.com/science/2023/07/the-heat-wave-scorching-the-us-is-a-self-perpetuating-monster/"
# Download the raw HTML, including scripts, styles, and navigation markup
page = requests.get(url).text

# Expected to fail: the raw page is larger than the model's context window
response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system", "content": "You are a journalist."},
        {"role": "user", "content": "write a 100 word summary of this article"},
        {"role": "user", "content": page},
    ],
)
summary = response["choices"][0]["message"]["content"]

print(url)
print(summary)
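
To see ahead of time whether a request is likely to fit, you can estimate the token count before calling the API. This is a rough sketch only: it assumes about 4 characters per token for English text, which is a rule of thumb and not an exact count, and it assumes a 4,096-token window with some room reserved for the reply. The function names are hypothetical.

```python
import math

# Rule of thumb (an assumption, not an exact tokenizer): ~4 characters per token
CHARS_PER_TOKEN = 4


def estimate_tokens(text):
    """Rough token estimate based on character count."""
    return math.ceil(len(text) / CHARS_PER_TOKEN)


def fits_in_context(text, context_limit=4096, reserved_for_reply=500):
    """True if the text probably fits, leaving room for the model's reply."""
    return estimate_tokens(text) <= context_limit - reserved_for_reply


def truncate_to_budget(text, max_tokens):
    """Cut the text so its estimated size stays within max_tokens."""
    return text[: max_tokens * CHARS_PER_TOKEN]


sample = "word " * 20000  # stand-in for a large raw HTML page
print(estimate_tokens(sample), fits_in_context(sample))
```

Truncating is a blunt tool because it can cut an article off mid-sentence. Extracting only the article text first usually shrinks the request far more.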

Beautiful Soup 4 Example:

With Beautiful Soup you can pull out text from specific tags. Make sure to use get_text() if you only want the text; otherwise you’ll also get the markup inside each tag, such as a href and img elements.

import requests
from bs4 import BeautifulSoup

url = "https://arstechnica.com/science/2023/07/the-heat-wave-scorching-the-us-is-a-self-perpetuating-monster/"
page = requests.get(url).text

# Parse the HTML and collect every <p> tag on the page
soup = BeautifulSoup(page, "html.parser")
article = soup.find_all('p')

print(url)
print(soup.title.get_text())
# get_text() drops the tags and keeps only the readable text
for x in article:
    print(x.get_text())
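
The difference get_text() makes is easiest to see on a small piece of HTML. The snippet below is a minimal sketch using a made-up paragraph (the example.com link is a placeholder, not taken from the article above).

```python
from bs4 import BeautifulSoup

# Made-up HTML for illustration only
html_doc = '<p>Read <a href="https://example.com/more">the full report</a> today.</p>'

soup = BeautifulSoup(html_doc, "html.parser")
paragraph = soup.find("p")

# Converting the tag to a string keeps the markup, including the a href
with_markup = str(paragraph)

# get_text() returns only the readable text
text_only = paragraph.get_text()

print(with_markup)
print(text_only)
```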

Summarize a Blog Post with ChatGPT:

This code summarizes a blog post by extracting the text with Beautiful Soup 4 and then submitting it to ChatGPT. We then write the summary and the full text to an HTML file to simulate building an autoblog system.

import requests
from bs4 import BeautifulSoup
import openai

openai.api_key = "APIKEY"

url = "https://arstechnica.com/gadgets/2023/07/with-macos-sonoma-intel-macs-are-still-getting-fewer-updates-than-they-used-to/"
page = requests.get(url).text

# Parse the HTML and collect every <p> tag on the page
soup = BeautifulSoup(page, "html.parser")
item = soup.find_all('p')

print(url)
title = soup.title.get_text()
print(title)

# Join the paragraph text into one string to send to ChatGPT
article = ""
for x in item:
    article += f" {x.get_text()}"

print(article)

response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system", "content": "You are a journalist."},
        {"role": "user", "content": "write a 20 word summary of this article"},
        {"role": "user", "content": article},
    ],
)
summary = response["choices"][0]["message"]["content"]

print(f"Summary:\n {summary}")

# Write the title, URL, summary, and full text to a simple HTML page
with open("parse.html", "w", encoding="utf-8") as file:
    file.write(f"<h1>{title}</h1>")
    file.write(f"<p><strong>URL: </strong>{url}</p>")
    file.write(f"<h2>Summary:</h2><p>{summary}</p>")
    file.write(f"<h2>Article:</h2><p>{article}</p>")
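
Scraped text and model output can contain characters such as < and & that break an HTML page when written as-is. Here is a small sketch of one way to handle that with html.escape from the standard library. The build_page helper is hypothetical and only mirrors the layout written to parse.html above.

```python
import html


def build_page(title, url, summary, article):
    """Build the HTML page, escaping every value before inserting it.

    Hypothetical helper for this sketch; it mirrors the parse.html layout.
    """
    parts = [
        f"<h1>{html.escape(title)}</h1>",
        f"<p><strong>URL: </strong>{html.escape(url)}</p>",
        f"<h2>Summary:</h2><p>{html.escape(summary)}</p>",
        f"<h2>Article:</h2><p>{html.escape(article)}</p>",
    ]
    return "\n".join(parts)


page_html = build_page(
    "Macs & updates",
    "https://example.com/?a=1&b=2",
    "Fewer updates: old < new",
    "Full article text goes here.",
)
print(page_html)
```

The returned string can be written to parse.html in a single file.write() call.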

Creating Tags/Taxonomy for Other People’s Blogs:

It may be useful to create an index system for other people’s blogs to support your work or your organization’s work. They may have great content but a poor search system. This script creates tags for blog posts so that you can build your own indexing system for the sites you use.

import requests
from bs4 import BeautifulSoup
import openai

openai.api_key = "APIKEY"

url = "https://arstechnica.com/gadgets/2023/07/with-macos-sonoma-intel-macs-are-still-getting-fewer-updates-than-they-used-to/"
page = requests.get(url).text

# Parse the HTML and collect every <p> tag on the page
soup = BeautifulSoup(page, "html.parser")
item = soup.find_all('p')

print(url)
title = soup.title.get_text()
print(title)

# Join the paragraph text into one string to send to ChatGPT
article = ""
for x in item:
    article += f" {x.get_text()}"

print(article)

response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system", "content": "You are a journalist."},
        {"role": "user", "content": "give me 10 tags for this blog post"},
        {"role": "user", "content": "return them in a python list"},
        {"role": "user", "content": "formatted like ['tag1','tag2','tag3']"},
        {"role": "user", "content": article},
    ],
)
# The reply is a single string, even when it looks like a Python list
tag = response["choices"][0]["message"]["content"]

print(f"Tags:\n {tag}")
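
The reply comes back as one string, so you need to convert it before you can loop over the tags. Below is a minimal sketch using ast.literal_eval, which reads Python literals without running any code. The parse_tags helper is hypothetical, and it assumes the model mostly follows the requested ['tag1','tag2','tag3'] format; it returns an empty list when it does not.

```python
import ast


def parse_tags(reply):
    """Turn a reply like "['a','b']" into a real list of strings.

    Hypothetical helper for this sketch. Returns [] if no valid list is found.
    """
    # The model may add text around the list, so keep only the bracketed part
    start = reply.find("[")
    end = reply.rfind("]")
    if start == -1 or end <= start:
        return []
    try:
        value = ast.literal_eval(reply[start : end + 1])
    except (ValueError, SyntaxError):
        return []
    if not isinstance(value, list):
        return []
    # Normalize to stripped strings and drop empty entries
    return [str(t).strip() for t in value if str(t).strip()]


print(parse_tags("Here are the tags: ['macOS', 'Apple', 'Intel Macs']"))
```

In the script above you would call parse_tags(tag) and then store or index the resulting list however you like.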
