Web scraping Jogathon data
Background
So I'm sure many of you parents have children whose school runs a jog-a-thon to raise funds. When dropping my son off for his jog-a-thon fundraiser today, I asked his teacher whether he could tell me where my son stood compared to the rest of his class.
The teacher mentioned that he would not be divulging that information, because he had just closed his computer. :sweat_smile:
Assumptions
I assume that the reader is knowledgeable with Node.js and JavaScript.
The Problem
Because I'm ~~nosy~~ curious, I wanted to find out where Noah stands in the running with his classmates. The website presented to students to see their progress is this page: https://ultrafunrun.com/pledge/print_progress.php?par=125001, with the `par` query parameter representing a specific student's participant number.

There are two constraints:

- I don't know the upper and lower bounds of the participant number (the `par` query parameter) for a particular school or class.
- I don't want to make requests to parse HTML pages we've already filtered out.
Let's get poppin'
So first I want to import some libraries to help us out.
```javascript
// scraper.js
const axios = require("axios");
const cheerio = require("cheerio");
```
Next, I wanted to perform a single network request just to see how the `cheerio` library worked and what data I needed to filter on. Using the following URL for UltraFunRun.com, I made a single network request and logged the result of grabbing the school name.
```javascript
// scraper.js
const axios = require("axios");
const cheerio = require("cheerio");

const url = `https://ultrafunrun.com/pledge/print_progress.php?par=`;

async function scrapeData() {
  const participantId = 125000;
  try {
    const { data } = await axios.get(`${url}${participantId}`);
    const $ = cheerio.load(data);
    const schoolName = $("body > table > tbody > tr:nth-child(2) > td");
    console.log(schoolName.text()); // School Name: <Specific Elementary Name>
  } catch (err) {
    console.error("problem with request", err);
  }
}

scrapeData();
```
I then modified the `scripts` in my `package.json` so I could run the script like so and see the results.
```json
{
  "scripts": {
    // rest of scripts
    "scrape": "node scraper.js"
  }
}
```
Reading through the `cheerio` documentation, I found it has a built-in method called `text` which helps with transformations. Additionally, I noticed that when grabbing the school name with the selector, the `School Name:` label was included as part of the `.text()` call.
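Since `.text()` returns the field label along with the value, a tiny helper can strip the label off. This is a hypothetical sketch of my own (the script below keeps the labels in place and matches with `includes` instead):

```javascript
// Sketch: strip the "School Name: " style label that .text() returns.
function stripLabel(cellText, label) {
  return cellText.replace(`${label}:`, "").trim();
}

const raw = "School Name: Barnett Elementary"; // example value from .text()
console.log(stripLabel(raw, "School Name")); // "Barnett Elementary"
```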
Let's loop it up
From here, after learning that the `<td>` selectors were grabbing the entire field labels (i.e. School Name, Participant Name, Score, and Teacher Name), I modified the function to set upper and lower bounds spanning about 1000 requests.
```javascript
// scraper.js
const axios = require("axios");
const cheerio = require("cheerio");

const url = `https://ultrafunrun.com/pledge/print_progress.php?par=`;

async function scrapeData() {
  const scrappedData = [];
  try {
    for (let participantId = 125000; participantId <= 126000; participantId++) {
      const { data } = await axios.get(`${url}${participantId}`);
      const $ = cheerio.load(data);
      // select all elements to filter on
      const schoolName = $("body > table > tbody > tr:nth-child(2) > td");
      const teacherName = $("body > table > tbody > tr:nth-child(3) > td");
      const student = $("body > table > tbody > tr:nth-child(4) > td");
      const studentPoints = $(
        "body > table > tbody > tr:nth-child(9) > td:nth-child(3)"
      );

      if (
        schoolName.text().includes("School Name: <Specific Elementary Name>") &&
        teacherName.text().includes("Teacher Name: <Mr. Teachers Name>")
      ) {
        scrappedData.push(`${student.text()} Points: ${studentPoints.text()}`);
      }
    }
  } catch (err) {
    console.error("problem with request", err);
  }
  console.log(scrappedData);
}

scrapeData();
```
This definitely got us a few results (around 8 students in his class) and their points; however, I had a problem. The `scrappedData` array only exists in memory, and every time I ran this scraping function it would take about 5-8 minutes. Increasing the upper and lower bounds would increase the run time even further.
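One way to chip away at that 5-8 minute runtime is to issue requests in small parallel batches instead of strictly one at a time. This is a hedged sketch of my own, not the approach the post takes; `fetchParticipant` is a placeholder for the real axios + cheerio work, and a real batch size should stay small to be polite to the server:

```javascript
// Sketch: run requests in small parallel batches with Promise.all.
// fetchParticipant stands in for the real axios.get + cheerio parsing.
async function fetchParticipant(participantId) {
  return participantId; // the real version would fetch and parse the page
}

async function scrapeInBatches(start, end, batchSize = 20) {
  const results = [];
  for (let id = start; id <= end; id += batchSize) {
    const batch = [];
    for (let i = id; i < id + batchSize && i <= end; i++) {
      batch.push(fetchParticipant(i));
    }
    // wait for the whole batch to settle before starting the next one
    results.push(...(await Promise.all(batch)));
  }
  return results;
}
```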
To address these annoying issues I wanted to do the following:
- Log the participant numbers I already attempted that were not within our filters
- Skip the ones I already attempted
- Log the saved students data to a JSON file
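A side note on the skip check: `Array.prototype.includes` scans the whole array on every iteration, so with thousands of attempted numbers a `Set` is the more natural fit. A small sketch of that alternative (the script below sticks with a plain array):

```javascript
// Sketch: a Set-based skip list; Set.has is O(1) per lookup,
// while Array.includes rescans the array on every check.
const attempted = new Set([125000, 125001]); // e.g. loaded from attemptedNumbers.json

function shouldSkip(participantId) {
  return attempted.has(participantId);
}

// JSON.stringify can't serialize a Set directly, so convert back to an array:
const serializable = [...attempted];
```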
Tweaking and refactoring
Let's start with saving both the found students and the attempted participant numbers to JSON files. I'll need Node.js's built-in file system module, `fs`, so let's go ahead and import it.
```javascript
// scraper.js
// ... rest of imports
const fs = require("fs");
```
Our filtered student data is good to go, but before I can create a file I need some data for the attempted numbers. I first initialized an empty array, `attemptedNumbers`; every time I did not get a hit I pushed the `participantId` to that array, and when I did get a hit I added it to `scrappedData`.
```javascript
const url = `https://ultrafunrun.com/pledge/print_progress.php?par=`;

async function scrapeData() {
  const scrappedData = [];
  const attemptedNumbers = [];
  try {
    for (let participantId = 124000; participantId <= 127000; participantId++) {
      // skip participant numbers we've already tried
      if (attemptedNumbers.includes(participantId)) {
        continue;
      }
      const { data } = await axios.get(`${url}${participantId}`);
      const $ = cheerio.load(data);
      // select all
      const schoolName = $("body > table > tbody > tr:nth-child(2) > td");
      const teacherName = $("body > table > tbody > tr:nth-child(3) > td");
      const student = $("body > table > tbody > tr:nth-child(4) > td");
      const studentPoints = $(
        "body > table > tbody > tr:nth-child(9) > td:nth-child(3)"
      );

      if (
        schoolName.text().includes("School Name: Barnett Elementary") &&
        teacherName.text().includes("Teacher Name: Kinsey")
      ) {
        scrappedData.push(`${student.text()} Points: ${studentPoints.text()}`);
      } else {
        attemptedNumbers.push(participantId);
      }
    }
  } catch (err) {
    console.error("problem with request", err);
  }
}

// calling `scrapeData` function
```
Now that both of those arrays of data exist, let's save them to JSON files so the next run can be faster. We'll create the files at the end of our data scraping, right after the for-loop ends.
I wrote all of this code in about 15-20 minutes, so you'll see that it's not very DRY (it definitely repeats itself). I didn't feel the need to write an abstraction for file creation, but you can do as you like.
```javascript
    } // ending of for-loop

    fs.writeFile(
      "students.json",
      JSON.stringify(scrappedData, null, 2),
      (err) => {
        if (err) {
          console.error(err);
          return;
        }
        console.log("wrote <Mr. Teachers Name> class successfully");
      }
    );

    fs.writeFile(
      "attemptedNumbers.json",
      JSON.stringify(attemptedNumbers, null, 2),
      (err) => {
        if (err) {
          console.error(err);
          return;
        }
        console.log("wrote attempted numbers successfully");
      }
    );
  } catch (err) {
    console.error("problem with request", err);
  }
} // end of function
```
The only other thing I decided to do was replace the initialized `attemptedNumbers` array with the data imported from the newly created `attemptedNumbers.json` file, like so:
```javascript
// scraper.js
const axios = require("axios");
const cheerio = require("cheerio");
const fs = require("fs");
const attemptedNumbers = require("./attemptedNumbers");

const url = `https://ultrafunrun.com/pledge/print_progress.php?par=`;

async function scrapeData() {
  const scrappedData = [];
  // deleted const attemptedNumbers = []
  // ... rest of code
```
Final thoughts
This was a fun little web scraping project to find out real information about how many points my son received compared to the others. Not to sound boastful, but he got 54.45 points, which was more than the rest of the class combined.
Originally Posted on Alexander Garcia's Blog https://alexandergarcia.me/blog/web-scraping-jogathon
Hopefully some of you found that useful. Cheers!
If you enjoyed this article, please feel free to connect with me on Dev.to or on LinkedIn.