Write Your Own Spider

This guide explains how to create a custom Scrapy spider to crawl firmware images from a vendor of your choice.

Basics

  • The framework used is Scrapy. You can find full documentation at https://docs.scrapy.org/en/latest/
  • All spiders are located in the /firmware_scraper/firmware_scraper/spiders directory.
  • Spiders are responsible for collecting the download URLs and relevant metadata.
  • The actual firmware file downloads are handled separately by the Firmware Scraper.

Step-by-Step Guide

1. Find the Vendor's Download Page

In most cases, a simple Google search like “YourVendor Downloads” will lead you directly to the firmware download page. If not, visit the vendor’s official website and look for sections such as Support or Downloads.

Once on the download page, try to understand the structure of the site and confirm that it actually contains firmware downloads.

2. Create the Spider Class

Create a new file in /firmware_scraper/firmware_scraper/spiders called yourvendor.py, then follow these steps:

  • Import scrapy
  • Define a scrapy.Spider subclass
  • Set the vendor name
  • Define a start_urls list and add the vendor’s download page
  • Implement the parse method
  • Once this basic structure is in place, the scraper will recognize and run your spider.
  • You can also run it manually from the command line with:
    scrapy crawl YourVendor -o output.json
import scrapy

class YourVendorSpider(scrapy.Spider):
    name = "YourVendor"
    start_urls = ["https://yourvendor.com/vendorsDownloadSite"]

    def parse(self, response):
        ...
info

The parse method is the entry point of the spider. It receives the response from one of the start_urls and is used to start the crawling logic.

3. Implement the Crawling Logic

There is no general solution, as each vendor website is different. However, in most cases you’ll need to navigate through all relevant links on the site and its subpages until you reach the actual firmware download.

Links to downloads and subpages can usually be found in one of three ways:

  1. From the HTML structure, directly embedded in the page as <a href="...">.
  2. From API requests, loaded dynamically via background HTTP calls (check the browser's DevTools → Network tab).
  3. From JavaScript code, rendered or generated by frontend scripts, which may require reverse-engineering.

In some cases, you may need to combine two or even all three of these approaches.

3.1 Visit Subpages

To visit subpages, it's recommended to create a separate parse method for the subpage and call it with the link you found.
For example, if the vendor's site is split between different categories and you have found the category links, you can call parse_category(self, response, category) like this:

def parse(self, response):
    ...
    yield response.follow(
        category_url,
        callback=self.parse_category,
        cb_kwargs={"category": category},
    )
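
The callback then receives the extra keyword argument defined in cb_kwargs, so its signature looks like this (a minimal sketch using the names assumed above):

def parse_category(self, response, category):
    # "category" arrives via cb_kwargs from the request above
    ...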

3.2 Extract Links from the HTML

There’s no general way to get the links, since every site is different, but here’s an example based on a typical download page structure.

  • Press Ctrl + Shift + C in your browser and hover over the download buttons.

  • If the links are directly visible in the HTML, you can extract them using XPath or CSS selectors in your spider.

  • The URLs typically look something like this:

    <a class="download-button" href="/downloads/firmware_v1.bin">Download</a>
  • If there are multiple download links on a single page, they are often grouped in a table or list structure. A typical list container might look like this:

    <div id="list" class="download-list">
  • So the relevant part of the page might look like this:

    ...
    <div id="list" class="download-list">
      <a class="download-button" href="/downloads/firmware_v1.bin">firmware_v1</a>
      <a class="download-button" href="/downloads/firmware_v2.bin">firmware_v2</a>
      <a class="download-button" href="/downloads/firmware_v3.bin">firmware_v3</a>
    </div>
    ...

Once you’ve identified the download links or the container holding the download entries (e.g. a list or table), you can extract the individual download links using CSS or XPath selectors.

In our example, you would do it like this:

3.2.1 Get the download list

def parse(self, response):
    download_list = response.css("div.download-list")

→ This selects the HTML container holding all the download entries.

3.2.2 Get the download buttons inside that list

    links = download_list.css("a.download-button")

→ This selects all the individual download buttons within the container.

3.2.3 Iterate over each link to extract URL

    for link in links:
        relative_url = link.attrib["href"]
        full_url = response.urljoin(relative_url)

3.2.4 Extract other metadata

At this point, you can also extract some metadata from the HTML content. In this case, the version can be extracted from the text content of the link:

        version = link.css("::text").get().split("_")[1]

After that, you can either yield the extracted data and finish parsing or, if the link points to a subpage, use response.follow to visit and parse that page next.
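
Putting these steps together, a complete parse method for the example page might look like this (a sketch based on the sample HTML above; the guard against missing link text is an added robustness assumption):

def parse(self, response):
    download_list = response.css("div.download-list")
    links = download_list.css("a.download-button")
    for link in links:
        relative_url = link.attrib["href"]
        full_url = response.urljoin(relative_url)
        # Extract the version from the link text, e.g. "firmware_v1" -> "v1"
        text = link.css("::text").get() or ""
        version = text.split("_")[1] if "_" in text else None
        ...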

info

For more information on CSS or XPath selectors, check out the Scrapy selector documentation at https://docs.scrapy.org/en/latest/topics/selectors.html.
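
As a comparison, the CSS selection from the example above could also be written with XPath (equivalent for the sample HTML in this guide):

links = response.xpath('//div[@class="download-list"]/a[@class="download-button"]')
for link in links:
    relative_url = link.xpath("@href").get()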

3.3 Replicate API Requests

If the links don’t appear in the HTML structure or are loaded dynamically, they might be fetched via an API request. You can inspect these requests using your browser’s Developer Tools (Ctrl + Shift + I) in the Network tab. Look for requests triggered when interacting with the site (e.g. refreshing, selecting a product or scrolling). Pay attention to the requests that return JSON data, because those often contain the actual download URLs.

Just like with HTML parsing, there’s no universal solution. You’ll need to analyze how the request works and replicate it using Scrapy.

Let’s say you find a POST request like this:

Host URL:

https://api.yourvendor.com/downloads

Request:

{
  "category": "firmware",
  "limit": 50,
  "offset": 0
}

Response:

{
  "results": [
    {
      "file_name": "firmware_v1.bin",
      "url": "https://yourvendor.com/files/firmware_v1.bin",
      "version": "v1.0"
    },
    ...
  ]
}

You can replicate this in your spider like this:

3.3.1 Start without a start_urls list

Since you don’t need to crawl a static page first, you can override the start_requests() method instead of using the default parse() method:

import scrapy
import json

class YourApiVendorSpider(scrapy.Spider):
    name = "YourApiVendor"

    def start_requests(self):

3.3.2 Define the target URL and payload

Set the API endpoint and the data you want to send:

        url = "https://example.com/api/downloads"
payload = {
"category": "firmware",
"limit": 50,
"offset": 0
}

3.3.3 Send the request:

Make sure to:

  • Use the same HTTP method as the original request (e.g., POST or GET)
  • Set the correct headers (e.g., "Content-Type": "application/json") to get a JSON response
  • Convert the payload to a JSON string
  • Set a callback function to handle the response
        yield scrapy.Request(
            url=url,
            method="POST",
            body=json.dumps(payload),
            headers={"Content-Type": "application/json"},
            callback=self.parse_api_response
        )
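
Alternatively, Scrapy provides a JsonRequest class (scrapy.http.JsonRequest) that serializes the payload and sets the Content-Type header for you; with it, the request above shortens to:

        # requires: from scrapy.http import JsonRequest
        yield JsonRequest(
            url=url,
            data=payload,
            callback=self.parse_api_response
        )

JsonRequest defaults to a POST request when data is passed, so the method does not need to be set explicitly.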

3.3.4 Parse the JSON response

Convert the response to a Python dictionary and handle the entries:

    def parse_api_response(self, response):
        data = response.json()
        for entry in data["results"]:
            # Do smart things with each entry in the response
            ...
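
Since the example payload uses limit and offset parameters, the API is likely paginated. A common pattern (an assumption here, not something every vendor API guarantees) is to keep requesting the next page until an incomplete page comes back:

    def parse_api_response(self, response):
        data = response.json()
        for entry in data["results"]:
            ...  # extract the download URL and metadata from each entry

        # Hypothetical pagination: if a full page came back, request the next one.
        payload = json.loads(response.request.body)
        if len(data["results"]) == payload["limit"]:
            payload["offset"] += payload["limit"]
            yield scrapy.Request(
                url=response.url,
                method="POST",
                body=json.dumps(payload),
                headers={"Content-Type": "application/json"},
                callback=self.parse_api_response
            )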

3.4 Reverse-Engineer JavaScript

If you can’t find any download links in the HTML or through API requests, they might be generated dynamically by JavaScript.
In such cases, you may need to reverse-engineer the JavaScript logic.

This process can vary widely. Sometimes you’re lucky, and it’s a simple script with clearly visible URL patterns or data. But often, the scripts are automatically generated or heavily obfuscated, which makes reverse engineering nearly impossible.
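
In the simple case, where the script embeds plain URL strings, a first attempt might be to scan the raw page source with a regular expression (a rough sketch; the .bin suffix is an assumption about the vendor's file naming):

import re

def parse(self, response):
    # Look for firmware-like URLs anywhere in the page source,
    # including inline <script> blocks.
    for url in re.findall(r'https?://[^\s"\']+\.bin', response.text):
        ...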

4. Yield firmware_info

Once you have collected all the necessary data, you need to yield it as a firmware_info dictionary.
Each yield represents one firmware image.

4.1 Mandatory fields:

These fields must be set (should not be None):

  • product_name
  • product_type
  • manufacturer
  • download_link

4.2 Optional fields:

These fields can be None if the data is not available:

  • version
  • release_date (must be in format YYYY-MM-DD)
  • checksum_scraped
  • additional_data

firmware_info = {
    "product_name": product_name,
    "version": version,
    "release_date": release_date,
    "download_link": download_link,
    "product_type": product_type,
    "manufacturer": manufacturer,
    "checksum_scraped": checksum_scraped,
    "additional_data": additional_data,
}

yield firmware_info