RSS Scraping with ChatGPT API (OpenAI, Python, Feedparser, Beautiful Soup)

WARNING:

“Fair Use” is not a global right. It is an American concept with no corresponding legal doctrine in many countries. Even in the United States, Fair Use is not set in stone; it is only ever determined by a court on a case-by-case basis.

Courts evaluate fair use claims on a case-by-case basis, and the outcome of any given case depends on a fact-specific inquiry. This means that there is no formula to ensure that a predetermined percentage or amount of a work—or specific number of words, lines, pages, copies—may be used without permission.

https://www.copyright.gov/fair-use/index.html
  • 00:00 Intro
  • 02:11 Why RSS Matters
  • 04:47 Demonstration of Parsing RSS with ChatGPT API
  • 11:26 Copyright WARNING
  • 18:50 How RSS Works
  • 24:07 Lab Setup
  • 25:43 Basic Usage of Feedparser Module
  • 31:19 Parse Multiple RSS Feeds
  • 33:52 Use ChatGPT to Query RSS Feed
  • 42:15 Use Feedparser and Beautiful Soup to scrape websites
  • 55:06 Final Thoughts

Setup:

Get OpenAI API Key - https://platform.openai.com/account/api-keys
pip3 install feedparser
pip3 install beautifulsoup4
pip3 install openai
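
NOTE: These scripts use the legacy openai.ChatCompletion interface from the pre-1.0 SDK, so if pip pulls a newer version of the openai package you may need to pin it with pip3 install "openai<1". The scripts also hardcode the key as 'APIKEY'; if you would rather keep it out of your source files, a minimal sketch of reading it from an environment variable instead looks like this:

import os
import openai

# Read the API key from the environment instead of hardcoding it.
# Set it first in your shell with: export OPENAI_API_KEY="sk-..."
openai.api_key = os.environ["OPENAI_API_KEY"]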

1-feed.py

This code prints the title and link for every article in an RSS feed.

import feedparser

# Parse the feed; feedparser downloads the URL and returns a dict-like object
feed = feedparser.parse("https://gizmodo.com/rss")

# Inspect the fields available on a single entry
entry = feed.entries[0]
print(entry.keys())

# Print the title and link for every article in the feed
for x in feed.entries:
    print(f"{x.title}\n {x.link}\n")

print(f'Total Stories: {len(feed.entries)}')
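
The entry.keys() call above shows every field a given feed exposes, and the exact fields vary from feed to feed. A hedged sketch of pulling common extras such as the publish date and summary, using .get() so a missing field does not raise an error:

import feedparser

feed = feedparser.parse("https://gizmodo.com/rss")

# Entries are dict-like, so .get() returns a fallback value
# for feeds that omit optional fields like published or summary
for entry in feed.entries:
    print(entry.get('title', 'untitled'))
    print(entry.get('published', 'no publish date'))
    print(f"{entry.get('link', '')}\n")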

2-multiple.py

This code builds a single list of article titles from multiple RSS feeds.

import feedparser

feed_list = ['https://gizmodo.com/rss',
             'https://engadget.com/rss.xml',
             'https://feeds.arstechnica.com/arstechnica/index']

# Collect the title of every entry from every feed into one list
feed_all = []
for source in feed_list:
    feed = feedparser.parse(source)
    for item in feed.entries:
        feed_all.append(f'{item.title}\n')

for x in feed_all:
    print(x)

print(f'Total Titles: {len(feed_all)}')
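
Once titles from several feeds are mixed together it can be hard to tell where each one came from. A small sketch of tagging every title with its source, assuming each feed publishes a channel-level title (it falls back to the URL if not):

import feedparser

feed_list = ['https://gizmodo.com/rss',
             'https://engadget.com/rss.xml',
             'https://feeds.arstechnica.com/arstechnica/index']

feed_all = []
for source in feed_list:
    feed = feedparser.parse(source)
    # feed.feed holds channel-level metadata; use the feed URL
    # as a fallback if this feed does not publish a title
    name = feed.feed.get('title', source)
    for item in feed.entries:
        feed_all.append(f'[{name}] {item.title}')

for x in feed_all:
    print(x)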

3-gpt-feed.py

This code lets you ask ChatGPT to pull information from the titles of all articles across multiple RSS feeds. This script specifically asks for any product names and for them to be returned in CSV format.

import feedparser
import openai

openai.api_key = 'APIKEY'  # replace with your OpenAI API key

feed_list = ['https://gizmodo.com/rss',
             'https://engadget.com/rss.xml',
             'https://feeds.arstechnica.com/arstechnica/index',
             'https://www.theverge.com/rss/index.xml']

# Collect the title of every entry from every feed
feed_all = []
for source in feed_list:
    feed = feedparser.parse(source)
    for item in feed.entries:
        feed_all.append(f'{item.title} \n')

for item in feed_all:
    print(item)

print(f'Total Titles: {len(feed_all)}')

# Ask GPT-4 to extract product names from the collected titles
response = openai.ChatCompletion.create(
    model="gpt-4",
    messages=[
        {"role": "system", "content": "You are a database computer"},
        {"role": "assistant",
            "content": "list all product names from these titles"},
        {"role": "assistant",
         "content": "return values in a comma separated value format in a single list"},
        {"role": "assistant",
         "content": "only provide the list, do not add any context or disclaimers"},
        {"role": "user",
            "content": f"{str(feed_all)}"}
    ]
)
summary = response["choices"][0]["message"]["content"]
print(summary)
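
Sending every title in a single request can exceed the model's context window when the feeds are busy. A minimal sketch of batching the same query, assuming the legacy SDK used above and a hypothetical batch size of 50 titles:

import openai

def product_names_in_batches(titles, batch_size=50):
    # batch_size is an arbitrary assumption; tune it so each
    # request stays under the model's context window
    results = []
    for i in range(0, len(titles), batch_size):
        batch = titles[i:i + batch_size]
        response = openai.ChatCompletion.create(
            model="gpt-4",
            messages=[
                {"role": "system", "content": "You are a database computer"},
                {"role": "assistant",
                 "content": "list all product names from these titles"},
                {"role": "assistant",
                 "content": "return values in a comma separated value format in a single list"},
                {"role": "user", "content": str(batch)}
            ]
        )
        results.append(response["choices"][0]["message"]["content"])
    return ', '.join(results)

print(product_names_in_batches(feed_all))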

4-gpt-feed-bs.py

This code uses RSS feeds to index article URLs, then passes those URLs to Beautiful Soup to extract only the article text. From there we ask ChatGPT for a 25-word summary of each article and for 5 appropriate tags. We then write the article, summary, and tags to an HTML file.

NOTE: I have limited the script to 3 iterations of the for loop.

import feedparser
import requests
from bs4 import BeautifulSoup
import openai

openai.api_key = 'APIKEY'

feed_list = ['https://engadget.com/rss.xml',
             'https://feeds.arstechnica.com/arstechnica/index',
             'https://www.theverge.com/rss/index.xml']

# Collect the link for every entry from every feed
feed_all = []
for source in feed_list:
    feed = feedparser.parse(source)
    for item in feed.entries:
        feed_all.append(f'{item.link}')

file = open("rss.html", "w")
count = 0
for url in feed_all:
    # Limit API usage: stop after 3 articles
    if count >= 3:
        break
    # Download the page and collect the text of every <p> tag
    page = requests.get(url).text
    soup = BeautifulSoup(page, "html.parser")
    item = soup.find_all('p')
    article = ''
    for text in item:
        article = f'{article} {text.get_text()}'

    # Ask ChatGPT for a 25-word summary of the article text
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": "You are a journalist."},
            {"role": "assistant", "content": "write a 25 word summary of this article"},
            {"role": "user", "content": article}
        ]
    )
    summary = response["choices"][0]["message"]["content"]

    # Ask ChatGPT for 5 tags, returned as comma separated values
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": "You are a journalist."},
            {"role": "assistant", "content": "provide 5 tags for this blog post"},
            {"role": "assistant", "content": "output tags as CSV, comma separated values"},
            {"role": "user", "content": article}
        ]
    )
    tag = response["choices"][0]["message"]["content"]

    print(f'Title: {soup.title.get_text()}\n')
    print(f'{article} \n')
    print(f"Summary:\n {summary}\n")
    print(f"Tags:\n {tag}\n\n")

    # Append the title, URL, summary, article, and tags to the HTML report
    file.write(f"<h1>{soup.title.get_text()}</h1>")
    file.write(f"<p><strong>URL: </strong>{url}</p>")
    file.write(f"<h2>Summary:</h2><p>{summary}</p>")
    file.write(f"<h2>Article:</h2><p>{article}</p>")
    file.write(f"<h2>Tags:</h2><p>{tag}</p>")

    count += 1

file.close()
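
Scraping arbitrary article pages will occasionally time out or return an error page, and very long articles can exceed the model's context limit. A hedged sketch of hardening the fetch step (the 8,000-character cap is an arbitrary assumption, not a documented limit):

import requests
from bs4 import BeautifulSoup

def fetch_article_text(url, max_chars=8000):
    # Bail out cleanly on network errors or non-200 responses
    try:
        page = requests.get(url, timeout=10)
        page.raise_for_status()
    except requests.RequestException as err:
        print(f'Skipping {url}: {err}')
        return None
    soup = BeautifulSoup(page.text, "html.parser")
    # Join all <p> text and truncate to keep the prompt small
    article = ' '.join(p.get_text() for p in soup.find_all('p'))
    return article[:max_chars]

In the loop above you could call fetch_article_text(url) in place of the requests and Beautiful Soup lines, and skip to the next URL whenever it returns None.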
