Web scraping Jogathon data with JavaScript

Scraping the web using Cheerio and Node.js to get jogathon data

Read time is about 14 minutes

Alexander Garcia is an effective JavaScript Engineer who crafts stunning web experiences.

Background

Sometimes the best coding projects come from the most random situations. I'm sure many of you parents have kids whose schools run a jog-a-thon to raise funds. When dropping off my son at school one day, I asked his teacher if he could tell me where my son stood compared to the rest of his class.

The teacher made it clear he would not be divulging that information, then closed his laptop.

So naturally, I did what any engineer-parent would do — I went home and wrote a web scraper.

What is web scraping?

For those unfamiliar, web scraping is the process of programmatically extracting data from websites. Instead of manually visiting pages and copying information, you write a script that fetches the HTML, parses it, and pulls out the specific data you need. It's commonly used for price monitoring, data aggregation, research, and — in my case — finding out if your kid is winning the jog-a-thon fundraiser.

Web scraping sits in a gray area legally and ethically, so a few ground rules before we start:

  • Check the site's robots.txt — this file tells you what pages the site owner allows or disallows for automated access
  • Don't hammer the server — add delays between requests. Sending thousands of rapid-fire requests can effectively DDoS a small site
  • Respect rate limits — if you get 429 (Too Many Requests) responses, back off
  • Only scrape public data — don't bypass authentication or access controls
  • Consider the purpose — scraping public fundraiser data for your kid's class? Fine. Scraping user data for commercial resale? Not fine.

In this case, the jog-a-thon progress pages were publicly accessible with no authentication required — each student had a public URL with their participant number.
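As a toy illustration of that first ground rule, checking robots.txt can be as simple as fetching `https://example.com/robots.txt` and scanning its Disallow lines. This parser is not from the original post and is deliberately simplistic (real robots.txt handling involves user-agent groups, Allow rules, and wildcards):

```js
// Minimal illustrative parser for robots.txt Disallow rules.
function disallowedPaths(robotsTxt) {
  return robotsTxt
    .split("\n")
    .map((line) => line.trim())
    .filter((line) => line.toLowerCase().startsWith("disallow:"))
    .map((line) => line.slice("disallow:".length).trim())
    .filter(Boolean); // drop empty "Disallow:" lines, which mean "allow everything"
}

const sample = "User-agent: *\nDisallow: /admin/\nDisallow: /private/\n";
console.log(disallowedPaths(sample)); // → [ '/admin/', '/private/' ]
```

If a path you plan to scrape shows up in that list, the site owner has asked you not to fetch it programmatically.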

The Problem

Because I'm curious (okay, nosy), I wanted to find out where my son stands in the running with his 4th grade classmates. The progress page shown to students lives at https://ultrafunrun.com/pledge/print_progress.php?par=125001, with the par query parameter representing a specific student's participant number.

Two challenges right away:

  • I don't know the upper and lower bounds for the participant number (par query parameter) for a particular school or class
  • I don't want to make requests to parse HTML pages we've already filtered out — that wastes time and server resources

Let's get poppin'

So first I want to import some libraries to help us out.

  • cheerio used to parse HTML pages
  • axios used to perform network requests
```js
// scraper.js
const axios = require("axios");
const cheerio = require("cheerio");
```

Next I wanted to perform a single network request, just to see how the cheerio library worked and what data I needed to filter on. So using the following URL for UltraFunRun.com, I made a single network request and logged the result of grabbing the school name.

```js
// scraper.js
const axios = require("axios");
const cheerio = require("cheerio");

const url = `https://ultrafunrun.com/pledge/print_progress.php?par=`;

async function scrapeData() {
  const participantId = 125000;
  try {
    const { data } = await axios.get(`${url}${participantId}`);
    const $ = cheerio.load(data);
    const schoolName = $("body > table > tbody > tr:nth-child(2) > td");
    console.log(schoolName); // School Name: <Specific Elementary Name>
  } catch (err) {
    console.error("problem with request", err);
  }
}

scrapeData();
```

I then modified the scripts in my package.json so I could run the scraper and see the results.

```json
{
  "scripts": {
    // rest of scripts
    "scrape": "node scraper.js"
  }
}
```

Reading through the cheerio documentation, I found it has a built-in method called text which helps with transformations. I also noticed that when grabbing the school name with that selector, the label School Name: was included in the .text() output.
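The original script just filters on that labeled text as-is, but if you wanted the bare values, one way to strip the label prefix is a small helper like this (stripLabel is a hypothetical name, not part of the original script):

```js
// Hypothetical helper: strip a field label like "School Name:" from
// cheerio's .text() output, leaving only the value.
function stripLabel(text, label) {
  return text.startsWith(label) ? text.slice(label.length).trim() : text.trim();
}

console.log(stripLabel("School Name: Barnett Elementary", "School Name:"));
// → "Barnett Elementary"
```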

Let's loop it up

So from here, after learning that the <td> selectors were grabbing the entire fields, labels included (i.e. School Name, Participant Name, Score, and Teacher Name), I modified the function to set upper and lower bounds covering around 1000 requests.

```js
// scraper.js
const axios = require("axios");
const cheerio = require("cheerio");

const url = `https://ultrafunrun.com/pledge/print_progress.php?par=`;

async function scrapeData() {
  const scrappedData = [];
  try {
    for (let participantId = 125000; participantId <= 126000; participantId++) {
      const { data } = await axios.get(`${url}${participantId}`);
      const $ = cheerio.load(data);
      // select all elements to filter on
      const schoolName = $("body > table > tbody > tr:nth-child(2) > td");
      const teacherName = $("body > table > tbody > tr:nth-child(3) > td");
      const student = $("body > table > tbody > tr:nth-child(4) > td");
      const studentPoints = $(
        "body > table > tbody > tr:nth-child(9) > td:nth-child(3)",
      );

      if (
        schoolName.text().includes("School Name: <Specific Elementary Name>") &&
        teacherName.text().includes("Teacher Name: <Mr. Teachers Name>")
      ) {
        scrappedData.push(`${student.text()} Points: ${studentPoints.text()}`);
      }
    }
  } catch (err) {
    console.error("problem with request", err);
  }
  console.log(scrappedData);
}

scrapeData();
```

So this definitely got us a few results (around 8 students in his class) and their points, but I had a problem. The scrappedData array only exists in memory, and every run of this scraping function took about 5-8 minutes. Increasing the upper and lower bounds would only make each run take even longer.

To address these annoying issues I wanted to do the following:

  • Log the participant numbers I already attempted that were not within our filters
    • Skip the ones I already attempted
  • Log the saved students data to a JSON file

Tweaking and refactoring

Let's start with saving both the found students and attempted participant numbers to a JSON file. I will need to use Node.js built-in file system using fs, so let's go ahead and import it.

```js
// scraper.js
// ...rest of imports
const fs = require("fs");
```

Our filtered-students array is good to go, but before I can create a file I need some data for the attempted numbers. I first initialized an empty attemptedNumbers array; every time a request did not get a hit I pushed the participantId to that array, and when it did get a hit I added it to scrappedData.

```js
const url = `https://ultrafunrun.com/pledge/print_progress.php?par=`;

async function scrapeData() {
  const scrappedData = [];
  const attemptedNumbers = [];
  try {
    for (let participantId = 124000; participantId <= 127000; participantId++) {
      // skip participant numbers we've already attempted
      if (attemptedNumbers.includes(participantId)) {
        continue;
      }
      const { data } = await axios.get(`${url}${participantId}`);
      const $ = cheerio.load(data);
      // select all elements to filter on
      const schoolName = $("body > table > tbody > tr:nth-child(2) > td");
      const teacherName = $("body > table > tbody > tr:nth-child(3) > td");
      const student = $("body > table > tbody > tr:nth-child(4) > td");
      const studentPoints = $(
        "body > table > tbody > tr:nth-child(9) > td:nth-child(3)",
      );

      if (
        schoolName.text().includes("School Name: Barnett Elementary") &&
        teacherName.text().includes("Teacher Name: Kinsey")
      ) {
        scrappedData.push(`${student.text()} Points: ${studentPoints.text()}`);
      } else {
        attemptedNumbers.push(participantId);
      }
    }
  } catch (err) {
    console.error("problem with request", err);
  }
}

// calling `scrapeData` function
scrapeData();
```

Now that both of those arrays of data exist, let's save them to JSON files so the next run can skip work it has already done. We'll create the files at the end of our data scraping, right after the for-loop ends.

I did write all of this code in about 15-20 minutes, so you'll see that it's not very DRY code (aka it definitely repeats itself). I didn't feel the need to write out an abstraction to help with file creation, but you can do as you like.

```js
    } // ending of for-loop

    fs.writeFile(
      "students.json",
      JSON.stringify(scrappedData, null, 2),
      (err) => {
        if (err) {
          console.error(err);
          return;
        }
        console.log("written <Mr. Teachers Name> class successfully");
      },
    );

    fs.writeFile(
      "attemptedNumbers.json",
      JSON.stringify(attemptedNumbers, null, 2),
      (err) => {
        if (err) {
          console.error(err);
          return;
        }
        console.log("written attempted numbers successfully");
      },
    );
  } catch (err) {
    console.error("problem with request", err);
  }
} // end of function
```

The only other thing I decided to do was replace the empty attemptedNumbers initialization with the data imported from the newly created attemptedNumbers.json file, like so:

```js
// scraper.js
const axios = require("axios");
const cheerio = require("cheerio");
const fs = require("fs");
const attemptedNumbers = require("./attemptedNumbers");

const url = `https://ultrafunrun.com/pledge/print_progress.php?par=`;

async function scrapeData() {
  const scrappedData = [];
  // deleted: const attemptedNumbers = [];
  // ...rest of code
```

The results

This was a fun little web scraping project to find out real information about how many points my son received compared to the others. Not to sound boastful but he got 54.45 points — which was more than the rest of the class combined. That's my boy.

What I'd do differently today

Looking back at this code a few years later, there are some improvements I'd make:

1. Add request delays

The code above fires requests as fast as the loop can iterate. Adding a simple await new Promise(r => setTimeout(r, 200)) between requests would be polite to the server and reduce the chance of getting rate-limited or blocked.
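As a rough sketch of what that looks like in context (the sleep helper and the 200 ms value are illustrative, not from the original script):

```js
// Small promise-based delay helper.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function politeLoop() {
  for (let participantId = 125000; participantId <= 125002; participantId++) {
    // ...fetch and parse the page here...
    await sleep(200); // breathe between requests
  }
}

politeLoop();
```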

2. Use a Set instead of an Array for attempted numbers

attemptedNumbers.includes(participantId) is O(n) on every iteration. With thousands of IDs, that adds up. A Set gives you O(1) lookups with attemptedNumbers.has(participantId) — a small change that makes a big difference at scale.
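The two lookups side by side, with sample IDs for illustration:

```js
// Array lookup: O(n) — scans the whole array on every check.
const attemptedArray = [124000, 124001, 124002];
console.log(attemptedArray.includes(124001)); // true

// Set lookup: O(1) — a near drop-in replacement for this script.
const attemptedSet = new Set(attemptedArray);
console.log(attemptedSet.has(124001)); // true

attemptedSet.add(124003); // push(...) becomes add(...)
console.log(attemptedSet.has(999999)); // false
```

A Set also deduplicates for free, and `JSON.stringify([...attemptedSet])` gets you back to an array for saving to disk.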

3. Consider Puppeteer for dynamic sites

Cheerio works great for static HTML pages like this one. But if the site rendered content with JavaScript (React, Vue, etc.), Cheerio wouldn't see it, because it only parses the HTML the server returns. For JavaScript-rendered sites I would reach for Puppeteer or Playwright, which drive a real browser.

Wrapping up

The broader takeaway here isn't about jog-a-thons — it's about recognizing when a quick script can answer a question faster than waiting for someone else to give you the data. The tools are simple (Cheerio for HTML parsing, Axios for HTTP requests, fs for persistence), and the pattern of iterate-filter-persist applies to a lot of real-world data extraction problems. The whole thing took about 15-20 minutes to write, and the persistence layer (saving attempted IDs to skip on the next run) turned a throwaway script into something reusable.

Just remember to scrape responsibly.