How to Query the Product Hunt API Responsibly with Python
I recently wrote an article for the folks at ScrapingBee that reached the #1 spot on Hacker News. It focused on retrieving details for the 90,000+ featured products from the Product Hunt API and digging into interesting facets of the data. I was excited to see it sit at #1 for a few hours, and on the front page for longer, whilst I repeatedly took screenshots to regale my children with in later life. I’ve even discovered that some students are considering pursuing a master’s on a topic around the dataset after seeing the article.
Here, I’m going to go into detail on how to retrieve information responsibly from the Product Hunt API for your own data experiments. To do so, I used the x-rate-limit-remaining and x-rate-limit-reset header values that are returned each time a query is made. These values are provided so that you don’t overwhelm the Product Hunt API servers when querying a large collection of data, and you won’t get a response if you exceed them. It’s therefore important to factor them into any script you write to query data and be a responsible API citizen.
Making a Single Query
To get set up, you’ll need an API key, which you’ll get when making a request for your first application from the API dashboard.
Product Hunt has two versions of its API: one uses REST (deprecated) and one uses GraphQL. The GraphQL one, whilst much more versatile (e.g. it allows complex filtering), returns far less data on each call (up to 20 products vs 50), so you’ll hit your limits much quicker. You’ll also be able to make fewer requests, depending on how much data you want each one to return. I calculated that getting all 90,000 products with all the additional data we wanted would take several days (!) using the GraphQL API, and I’d prefer not to be waiting that long.
For the original article, I used V1 of the API and I’ll be doing the same thing here. Additionally, I’m using HTTPX as the library to make my queries, but you can use any library you like.
Firstly, let’s kick off with a simple query to get the most recent posts and header information we’re interested in. We need to ensure we pass an Authorization token to authenticate and I extract some of the header details to use later.
import httpx
import os

PRODUCTHUNT_API_TOKEN = os.environ.get("PRODUCTHUNT_API_TOKEN")


def get_posts():
    headers = {"Authorization": f"Bearer {PRODUCTHUNT_API_TOKEN}"}
    with httpx.Client(headers=headers) as client:
        r = client.get(
            "https://api.producthunt.com/v1/posts/all",
            params={"per_page": 50},
            timeout=20,
        )
        limits = {h: v for h, v in r.headers.items() if h.startswith("x-rate-limit")}
        return r.json()["posts"], limits


if __name__ == "__main__":
    posts, limits = get_posts()
    print(limits)
If this is the first request you’ve made, the header values that are printed out to the command line should be as follows:
{'x-rate-limit-limit': '250', 'x-rate-limit-remaining': '250', 'x-rate-limit-reset': '900'}
We can see our limits are currently at their maximum value, as we might expect. The x-rate-limit-reset value is the remaining time in seconds before the rate limit resets. A subsequent request should look like the following:
{'x-rate-limit-limit': '250', 'x-rate-limit-remaining': '249', 'x-rate-limit-reset': '273'}
You can see some of our values have now reduced: we’ve decremented our remaining limit by 1 and can see we have about four and a half minutes to wait before the limit resets. The reset time is a fixed point in time based on when our first request was made, and the window resets every 15 minutes. Therefore, if we made a single request, disappeared for 5 minutes, returned and made another, the reset value would be the same as if we’d made any number of requests in quick succession over those 5 minutes. You can see from my own two examples that I waited a full 10 minutes before executing the second one. The limits on the V2 API are higher, but they will reduce much more quickly too, and even more so if you request a more complex payload.
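To make that arithmetic concrete, here is a small sketch that turns the header strings into numbers and works out how far into the current window a response arrived. The RateLimit class and parse_rate_limit helper are hypothetical names of my own, not part of HTTPX or the Product Hunt API, and the 900-second window is an assumption taken from the first response shown above.

```python
from dataclasses import dataclass
from typing import Mapping

# Assumed window length (15 minutes), taken from the first response above.
WINDOW_SECONDS = 900


@dataclass
class RateLimit:
    limit: int      # requests allowed per window
    remaining: int  # requests left in the current window
    reset: int      # seconds until the window resets

    @property
    def elapsed(self) -> int:
        """Seconds since the current window started."""
        return WINDOW_SECONDS - self.reset


def parse_rate_limit(headers: Mapping[str, str]) -> RateLimit:
    """Convert the x-rate-limit-* header strings into integers.

    Expects lowercase header names, as in the printed output above.
    """
    return RateLimit(
        limit=int(headers["x-rate-limit-limit"]),
        remaining=int(headers["x-rate-limit-remaining"]),
        reset=int(headers["x-rate-limit-reset"]),
    )


# The headers from the second request shown above.
second = parse_rate_limit(
    {
        "x-rate-limit-limit": "250",
        "x-rate-limit-remaining": "249",
        "x-rate-limit-reset": "273",
    }
)
minutes, seconds = divmod(second.elapsed, 60)
print(f"{second.remaining} requests left, {minutes}m {seconds}s into the window")
# prints: 249 requests left, 10m 27s into the window
```

With 273 seconds left of a 900-second window, 627 seconds have passed, which matches the roughly 10 minutes between my two requests.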
Making Multiple Queries
If we were just to start pulling posts from the API as quickly as possible in a loop, we’d very quickly hit our limit and start getting empty responses back. There isn’t much use in making requests that return nothing, especially if we want to leave our machine unattended whilst making them.
Instead, let’s code a loop that takes the limit values we’re returned into consideration.
import httpx
import os
import time
import json
from datetime import datetime

PRODUCTHUNT_API_TOKEN = os.environ.get("PRODUCTHUNT_API_TOKEN")


def get_posts_loop():
    more_data = True
    params = {"per_page": 50}
    page_limit = 100
    headers = {"Authorization": f"Bearer {PRODUCTHUNT_API_TOKEN}"}
    with httpx.Client(headers=headers) as client:
        while more_data:
            r = client.get(
                "https://api.producthunt.com/v1/posts/all", params=params, timeout=20
            )
            data = r.json()
            limits = {
                h: v for h, v in r.headers.items() if h.startswith("x-rate-limit")
            }
            print(limits)
            # Do something exciting with the data (more so than write it to a file)
            # The timestamp avoids spaces and colons so the filename works on any OS
            with open(f"{datetime.now():%Y%m%dT%H%M%S%f}.json", "w") as f:
                json.dump(data, f, indent=4)
            if int(r.headers["x-rate-limit-remaining"]) == 0:
                # Out of requests: wait for the window to reset, plus a buffer
                time_to_sleep = int(r.headers["x-rate-limit-reset"]) + 30
                print(f"Rate limit exceeded, waiting {time_to_sleep}s...")
                time.sleep(time_to_sleep)
            else:
                # Keep a steady pace between ordinary requests
                time.sleep(2)
            more_data = (
                "posts" in data
                and len(data["posts"]) > 0
                and params.get("page", 1) < page_limit
            )
            if more_data:
                params["page"] = params.get("page", 1) + 1


if __name__ == "__main__":
    get_posts_loop()
In the above example you can see we make the request as part of a loop, incrementing the ‘page’ parameter on each pass to step back through the catalog 50 items at a time. If our x-rate-limit-remaining value gets to zero, we sleep for the remaining x-rate-limit-reset value (plus a buffer of 30 seconds) before requesting the next 50 items. By this time our remaining and reset values will have reset, allowing us to continue.
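The sleep decision in the loop above can also be pulled out into a small function, which makes it easy to check without touching the network. This is only a sketch: seconds_to_sleep is a hypothetical helper of my own, its defaults mirror the 2-second pause and 30-second buffer used above, and it treats any remaining value at or below zero as exhausted rather than testing for exactly zero.

```python
from typing import Mapping


def seconds_to_sleep(headers: Mapping[str, str], pause: int = 2, buffer: int = 30) -> int:
    """Return how long to wait before making the next request.

    pause  -- steady delay between ordinary requests
    buffer -- extra margin added once the window is exhausted
    """
    remaining = int(headers["x-rate-limit-remaining"])
    if remaining > 0:
        # Requests are still available, so just keep a steady pace.
        return pause
    # No requests left: wait for the reset, plus a safety margin.
    return int(headers["x-rate-limit-reset"]) + buffer


# Requests still available (the second response shown earlier).
print(seconds_to_sleep({"x-rate-limit-remaining": "249", "x-rate-limit-reset": "273"}))
# prints: 2

# Window exhausted; the reset value of 400 is made up for illustration.
print(seconds_to_sleep({"x-rate-limit-remaining": "0", "x-rate-limit-reset": "400"}))
# prints: 430
```

Inside the loop, the if/else block would then reduce to a single time.sleep(seconds_to_sleep(r.headers)) call.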
If we leave the above script running, it will retrieve 100 pages of products from the Product Hunt featured product catalog, writing them to disk before terminating. This gives us details of the most recent 5,000 products, and we can change our page limit to go back even further.
Even within the default loop behaviour, I’ve added a sleep of 2 seconds between each request. It is unnecessary to make all our requests as soon as possible; we’re limited anyway, so we might as well keep a slow and steady pace throughout. The same approach would also work for V2 of the API, though we would need to change the remaining comparison with zero to a higher value.
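For a rough sense of what that pacing means in practice, the sketch below estimates how long a run would take. It is a back-of-the-envelope model rather than a measurement: estimate_seconds is a hypothetical helper, it ignores network latency, and it assumes the numbers seen earlier (250 requests per 900-second window, 50 products per page, a 2-second pause and a 30-second buffer).

```python
import math


def estimate_seconds(total_items, per_page=50, limit=250, window=900, pause=2, buffer=30):
    """Rough run-time estimate in seconds, ignoring network latency."""
    pages = math.ceil(total_items / per_page)
    if limit * pause >= window:
        # Pacing alone keeps us under the limit, so we never wait for a reset.
        return pages * pause
    full_windows, leftover = divmod(pages, limit)
    # Each exhausted window lasts until its reset plus the safety buffer;
    # the final partial window only costs the per-request pause.
    return full_windows * (window + buffer) + leftover * pause


print(estimate_seconds(5_000))   # 100 pages
# prints: 200
print(estimate_seconds(90_000))  # 1,800 pages
# prints: 6610
```

Under these assumptions the 100-page run takes a little over three minutes and never touches the limit, while stepping back through all 90,000 products would need around 1 hour 50 minutes.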
Closing Thoughts
The Product Hunt API provides access to a really interesting dataset which can be interpreted in a multitude of ways. When querying it, or any API, for large amounts of data, however, we need to be mindful of the limits that have been put in place by the API owners and behave responsibly when making queries.
I had a lot of fun interpreting this data and thoroughly enjoyed working with it. Since the article was posted, I’ve been given a whole bunch of ideas for other ways the Product Hunt data could be interpreted, and I will likely explore them in a future article here.
I will also say I’d love to do more paid ‘data analysis’ technical writing like this. If you have an interesting dataset and like my work, please get in touch and let’s make it happen.