Web scraping Jogathon data
Background
So I'm sure many of you parents have children whose school runs a jog-a-thon to raise funds. When dropping my son off for his jog-a-thon fundraiser today, I asked his teacher whether he could tell me where my son stood compared to the rest of his class.
The teacher mentioned that he would not be divulging that information, because he had just closed his computer. :sweat_smile:
Assumptions
I assume that the reader is knowledgeable with Node.js and JavaScript.
The Problem
Because I'm ~~nosy~~ curious, I wanted to find out where Noah stands in the running with his classmates. The website presented to students to see their progress is this page: https://ultrafunrun.com/pledge/print_progress.php?par=125001, with the `par` query parameter representing a specific student's participant number.

There are two constraints:

- I don't know the upper and lower bounds of the participant number (the `par` query parameter) for a particular school or class.
- I don't want to make requests to parse HTML pages we've already filtered out.
Let's get poppin'
So first I want to import some libraries to help us out.
```javascript
// scraper.js
const axios = require("axios");
const cheerio = require("cheerio");
```
Next, I wanted to perform a single network request just to see how the `cheerio` library worked and what data I needed to filter on. Using the following URL for UltraFunRun.com, I made a single network request and logged the result of grabbing the school name.
```javascript
// scraper.js
const axios = require("axios");
const cheerio = require("cheerio");

const url = `https://ultrafunrun.com/pledge/print_progress.php?par=`;

async function scrapeData() {
  const participantId = 125000;
  try {
    const { data } = await axios.get(`${url}${participantId}`);
    const $ = cheerio.load(data);
    const schoolName = $("body > table > tbody > tr:nth-child(2) > td");
    console.log(schoolName.text()); // School Name: <Specific Elementary Name>
  } catch (err) {
    console.error("problem with request", err);
  }
}

scrapeData();
```
I then modified the `scripts` in my `package.json` so I could run the script like so and see the results.
```json
{
  "scripts": {
    // rest of scripts
    "scrape": "node scraper.js"
  }
}
```
Reading through the `cheerio` documentation, I found it has a built-in method called `text` which helps with transformations. Additionally, I noticed that when grabbing the school name with the selector, the `School Name:` label was included as part of the `.text()` call.
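Since `.text()` returns the field label along with the value, a tiny helper can strip the label off. This is a hypothetical sketch of my own (the script below keeps the labels in place and matches with `includes` instead):

```javascript
// Sketch: strip the "School Name: " style label that .text() returns.
function stripLabel(cellText, label) {
  return cellText.replace(`${label}:`, "").trim();
}

const raw = "School Name: Barnett Elementary"; // example value from .text()
console.log(stripLabel(raw, "School Name")); // "Barnett Elementary"
```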
Let's loop it up
From here, after learning that the `<td>` selectors were grabbing the entire field labels (i.e. School Name, Participant Name, Score, and Teacher Name), I modified the function to set upper and lower bounds spanning about 1000 requests.
```javascript
// scraper.js
const axios = require("axios");
const cheerio = require("cheerio");

const url = `https://ultrafunrun.com/pledge/print_progress.php?par=`;

async function scrapeData() {
  const scrappedData = [];
  try {
    for (let participantId = 125000; participantId <= 126000; participantId++) {
      const { data } = await axios.get(`${url}${participantId}`);
      const $ = cheerio.load(data);
      // select all elements to filter on
      const schoolName = $("body > table > tbody > tr:nth-child(2) > td");
      const teacherName = $("body > table > tbody > tr:nth-child(3) > td");
      const student = $("body > table > tbody > tr:nth-child(4) > td");
      const studentPoints = $(
        "body > table > tbody > tr:nth-child(9) > td:nth-child(3)"
      );

      if (
        schoolName.text().includes("School Name: <Specific Elementary Name>") &&
        teacherName.text().includes("Teacher Name: <Mr. Teachers Name>")
      ) {
        scrappedData.push(`${student.text()} Points: ${studentPoints.text()}`);
      }
    }
  } catch (err) {
    console.error("problem with request", err);
  }
  console.log(scrappedData);
}

scrapeData();
```
This definitely got us a few results (around 8 students in his class) and their points; however, I had a problem. The `scrappedData` array only exists in memory, and every time I ran this scraping function it would take about 5-8 minutes. Increasing the upper and lower bounds would increase the run time even further.
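One way to chip away at that 5-8 minute runtime is to issue requests in small parallel batches instead of strictly one at a time. This is a hedged sketch of my own, not the approach the post takes; `fetchParticipant` is a placeholder for the real axios + cheerio work, and a real batch size should stay small to be polite to the server:

```javascript
// Sketch: run requests in small parallel batches with Promise.all.
// fetchParticipant stands in for the real axios.get + cheerio parsing.
async function fetchParticipant(participantId) {
  return participantId; // the real version would fetch and parse the page
}

async function scrapeInBatches(start, end, batchSize = 20) {
  const results = [];
  for (let id = start; id <= end; id += batchSize) {
    const batch = [];
    for (let i = id; i < id + batchSize && i <= end; i++) {
      batch.push(fetchParticipant(i));
    }
    // wait for the whole batch to settle before starting the next one
    results.push(...(await Promise.all(batch)));
  }
  return results;
}
```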
To address these annoying issues I wanted to do the following:
- Log the participant numbers I already attempted that were not within our filters
- Skip the ones I already attempted
- Log the saved students data to a JSON file
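A side note on the skip check: `Array.prototype.includes` scans the whole array on every iteration, so with thousands of attempted numbers a `Set` is the more natural fit. A small sketch of that alternative (the script below sticks with a plain array):

```javascript
// Sketch: a Set-based skip list; Set.has is O(1) per lookup,
// while Array.includes rescans the array on every check.
const attempted = new Set([125000, 125001]); // e.g. loaded from attemptedNumbers.json

function shouldSkip(participantId) {
  return attempted.has(participantId);
}

// JSON.stringify can't serialize a Set directly, so convert back to an array:
const serializable = [...attempted];
```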
Tweaking and refactoring
Let's start with saving both the found students and the attempted participant numbers to JSON files. I'll need Node.js's built-in file system module, `fs`, so let's go ahead and import it.
```javascript
// scraper.js
// ... rest of imports
const fs = require("fs");
```
Our filtered student data is good to go, but before I can create a file I need some data for the attempted numbers. I first initialized an empty array, `attemptedNumbers`; every time I did not get a hit I pushed the `participantId` to that array, and when I did get a hit I added it to `scrappedData`.
```javascript
const url = `https://ultrafunrun.com/pledge/print_progress.php?par=`;

async function scrapeData() {
  const scrappedData = [];
  const attemptedNumbers = [];
  try {
    for (let participantId = 124000; participantId <= 127000; participantId++) {
      // skip participant numbers we've already tried
      if (attemptedNumbers.includes(participantId)) {
        continue;
      }
      const { data } = await axios.get(`${url}${participantId}`);
      const $ = cheerio.load(data);
      // select all
      const schoolName = $("body > table > tbody > tr:nth-child(2) > td");
      const teacherName = $("body > table > tbody > tr:nth-child(3) > td");
      const student = $("body > table > tbody > tr:nth-child(4) > td");
      const studentPoints = $(
        "body > table > tbody > tr:nth-child(9) > td:nth-child(3)"
      );

      if (
        schoolName.text().includes("School Name: Barnett Elementary") &&
        teacherName.text().includes("Teacher Name: Kinsey")
      ) {
        scrappedData.push(`${student.text()} Points: ${studentPoints.text()}`);
      } else {
        attemptedNumbers.push(participantId);
      }
    }
  } catch (err) {
    console.error("problem with request", err);
  }
}

// calling `scrapeData` function
```
Now that both of those arrays of data exist, let's save them to JSON files so the next run can be faster. We'll create the files at the end of our data scraping, right after the for-loop ends.
I wrote all of this code in about 15-20 minutes, so you'll see that it's not very DRY (it definitely repeats itself). I didn't feel the need to write an abstraction for file creation, but you can do as you like.
```javascript
    } // ending of for-loop

    fs.writeFile(
      "students.json",
      JSON.stringify(scrappedData, null, 2),
      (err) => {
        if (err) {
          console.error(err);
          return;
        }
        console.log("wrote <Mr. Teachers Name> class successfully");
      }
    );

    fs.writeFile(
      "attemptedNumbers.json",
      JSON.stringify(attemptedNumbers, null, 2),
      (err) => {
        if (err) {
          console.error(err);
          return;
        }
        console.log("wrote attempted numbers successfully");
      }
    );
  } catch (err) {
    console.error("problem with request", err);
  }
} // end of function
```
The only other thing I decided to do was replace the initialized `attemptedNumbers` array with the data imported from the newly created `attemptedNumbers.json` file, like so:
```javascript
// scraper.js
const axios = require("axios");
const cheerio = require("cheerio");
const fs = require("fs");
const attemptedNumbers = require("./attemptedNumbers");

const url = `https://ultrafunrun.com/pledge/print_progress.php?par=`;

async function scrapeData() {
  const scrappedData = [];
  // deleted const attemptedNumbers = []
  // ... rest of code
```
Final thoughts
This was a fun little web scraping project to find out real information about how many points my son received compared to the others. Not to sound boastful, but he got 54.45 points, which was more than the rest of the class combined.
Originally Posted on Alexander Garcia's Blog https://alexandergarcia.me/blog/web-scraping-jogathon
Hopefully some of you found that useful. Cheers!
If you enjoyed this article, please feel free to connect with me on Dev.to or on LinkedIn.