Jason Thorsness

github
github icon
linkedin
linkedin icon
twitter
twitter icon
hi
13Aug 26 24

Spicing Things Up With Jobs

SingleStore Cloud offers a number of services beyond the core database. For a while it’s been possible to create and interact with Jupyter Notebooks. Now, you can schedule your notebooks to run on a fixed schedule. This simple service is a powerful building block you can use to perform maintenance, updates, or many other tasks using Python or SQL, and run it all within the same data platform. I’ve recently applied it to enhance the home page of weather.bingo.

A Quick Use Case

For months, the home page of weather.bingo was a static list of “Popular Cities”. I picked these cities, so I suppose they were popular in my home office (I am alone in my office). Alas, they did not capture the interest of everyone. So what could be better than a static list? How about cities in the news? I decided to update the home page with a list of cities appearing in recent news headlines.

The Notebook

From the SingleStore Portal, I created a new notebook and started filling it out with the help of ChatGPT. First let’s get some package installation out of the way.

pip install requests beautifulsoup4 openai

Now we can import the functionality we need.

import requests
import json
import ibis
from bs4 import BeautifulSoup
from openai import OpenAI
from singlestoredb.management import get_secret

Grab the links from lite.cnn.com. Here each link is a headline.

url = "https://lite.cnn.com"

response = requests.get(url)

a_tags_str = None

if response.status_code == 200:
    soup = BeautifulSoup(response.content, 'html.parser')
    a_tags = soup.find_all('a')
    a_tags_str = "\n".join(tag.get_text(strip=True) for tag in a_tags)
else:
    print(f"Failed to retrieve the webpage. Status code: {response.status_code}")

How to get the city names from the headlines? This is a perfect use case for an LLM. Here I use the larger GPT-4o model, as the mini version returns countries and states along with cities no matter what I prompt it with. Note I stored the OpenAI key in a secret, which I added in the SingleStore Portal.

client = OpenAI(api_key=get_secret('OPENAI_API_KEY'))

response = client.chat.completions.create(
    model="gpt-4o-2024-08-06",
    messages=[
        {"role": "system", "content": f"""Extract the top ten city names from the
        following text and return them as a JSON list, most-to-least famous. Only
        include a city if it's a famous well-known city, that might be known
        around the world. Don't include something that's more famous for another
        thing - like don't include Florida, that's a famous state, not a city.
        Only include famous cities.\n\n{a_tags_str}"""},
    ],
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "city_extraction",
            "strict": True,
            "schema": {
                "type": "object",
                "properties":{
                    "cities": {
                        "type": "array",
                        "items": {
                            "type": "string",
                        }
                    }
                },
                "required": ["cities"],
                "additionalProperties": False
            },
        },
    }
)

parsed_response = json.loads(response.choices[0].message.content)
cities_list = parsed_response.get('cities', [])
print(cities_list)

After I get the city names, I have another cell that translates them to the lk internal ids that weather.bingo uses which I’ve omitted here because I want to jump straight to the cool part — in these notebooks (and jobs) I can use SQL too, either for the full cell like this:

CREATE TABLE IF NOT EXISTS inTheNews_new(lk BIGINT NOT NULL);

Or embedded in Python like this:

%config SqlMagic.named_parameters=True
for lk in lks:
    print(lk)
    %sql INSERT INTO inTheNews_new(lk) VALUES (:lk)

This syntax is called “Magic” and is a cool feature of the Jupyter Notebook environment. Now I just need to swap new table for the old, and my update is done.

DROP TABLE IF EXISTS inTheNews;
ALTER TABLE inTheNews_new RENAME inTheNews;

I did all of this within the web interface, interactively checking each part as I went, without needing to install any kind of developer environment or tooling on my local machine. Notebooks can be a major convenience for small tasks.

The Schedule

From the Portal, in the upper-right there’s a “Schedule” button that let me set up a schedule for running my notebook. I chose to run it every six hours. It executes each cell sequentially, halting on errors. Learn more about jobs here.

Closing Thoughts

I’m interested to hear what use cases become the most popular applications for these scheduled jobs. I am sure they will be used to delete old data, ingest new data, and generate derived tables sort of like a materialized view, but I am also positive there are many other use cases for this general tool. If you have any comments or questions on this feature, or on the weather.bingo home page, please let me know!

 Top