How to scrape the top 5 headlines from BBC using puppeteer & NodeJS

Web scraping is a method of extracting data from websites. In the JavaScript world, puppeteer is one of the most commonly used tools for this kind of task. In this tutorial, we will use puppeteer to scrape the top 5 headlines from BBC, one of the world's leading news websites, all in fewer than 35 lines of code.

Let's start by creating a folder with a server.js file in it. Then we can initialize the project and install the only two dependencies we need: express and puppeteer.

npm init -y

npm i express puppeteer

Set up our server from the express boilerplate, start it with node server.js, then go to http://localhost:3000/ and make sure you see the "Hello World!" message.

const express = require('express')
const app = express()
const port = 3000

app.get('/', (req, res) => res.send('Hello World!'))

app.listen(port, () => console.log(`Example app listening at http://localhost:${port}`))

Puppeteer's API is asynchronous: each call returns a promise that we have to await. So we will first wrap our code in an async function called getData, which accepts our target URL.

const baseURL = "http://bbc.com/"

app.get('/', (req, res) => {
    async function getData(url) {
        // Our code will go here
    }
    getData(baseURL)
})

Now we have a six-step process that we want puppeteer to go through within our getData function:

1. Launch the browser

2. Open a new tab

3. Go to our URL

4. Scrape

5. Send data to us

6. Close 

Steps one to three are straightforward once puppeteer is required at the top of server.js with const puppeteer = require("puppeteer"):

const browser = await puppeteer.launch(); //Step 1
const page = await browser.newPage(); //Step 2
await page.goto(url); //Step 3
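The snippet above assumes every call succeeds. As a hedged sketch (not part of the original code), these steps can be wrapped so that the browser is closed even when navigation or scraping throws. The names withPage, launcher and task are hypothetical; launcher is expected to be the puppeteer module, passed in as a parameter so the helper can also be exercised with a stand-in object.

```javascript
// Hypothetical helper: runs `task` against a freshly opened page and
// always closes the browser afterwards, whether `task` succeeds or throws.
async function withPage(launcher, url, task) {
    const browser = await launcher.launch()   // Step 1
    try {
        const page = await browser.newPage()  // Step 2
        await page.goto(url)                  // Step 3
        return await task(page)               // Steps 4 and 5 happen in the callback
    } finally {
        await browser.close()                 // Step 6, on success and on failure
    }
}
```

With the real library this could be called as withPage(require('puppeteer'), baseURL, page => page.title()), where page.title() is only a placeholder task.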

Step 4 is where most of the work lies. The page object has a method called evaluate, which runs a callback function inside the page, where you can specify which DOM elements you want data from. Inspecting the BBC page shows that, at the time of writing, the articles are simply list items with the class name 'media-list__item'. Selecting all of them returns a NodeList, which we convert into an array and save in a variable like this:

const listOfAllNews = Array.from(document.querySelectorAll('.media-list__item'))

Now in order to access the actual titles, we need to dig deeper within each element of the array in the following way:

listOfAllNews[itemNumber].childNodes[1].children[1].childNodes[1].outerText
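A chain like this throws a TypeError as soon as one list item has a slightly different structure. A minimal sketch of a more forgiving lookup, assuming the same nesting as above, is shown below; extractTitle is a hypothetical helper name, not something from the original code.

```javascript
// Hypothetical helper: follows the same childNodes/children path as above,
// but returns null instead of throwing when any level is missing.
function extractTitle(item) {
    const text = item?.childNodes?.[1]?.children?.[1]?.childNodes?.[1]?.outerText
    if (typeof text !== 'string') return null
    const trimmed = text.trim()
    return trimmed === '' ? null : trimmed
}
```

If it is used inside page.evaluate, the helper has to be declared inside the callback, because that callback runs in the page and cannot see functions defined in server.js.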

Finally, we want to loop through the listOfAllNews array and push as many titles as we wish into a new array. We should also make sure we are not pushing titles that are already in it.

for (const item of listOfAllNews) {
    if (topNews.length === 5) break // stop once we have 5 unique titles
    const title = item.childNodes[1].children[1].childNodes[1].outerText
    if (!topNews.includes(title)) topNews.push(title) // skip duplicates
}
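The "take the first few titles, skipping repeats" idea can also be isolated into a small pure function, which makes it easy to check in plain Node without a browser. This is only a sketch; pickUniqueTitles and its parameters are hypothetical names, and the sample titles are made up.

```javascript
// Hypothetical helper: returns the first `limit` distinct titles,
// preserving their original order.
function pickUniqueTitles(titles, limit) {
    const unique = []
    for (const title of titles) {
        if (unique.length >= limit) break
        if (!unique.includes(title)) unique.push(title)
    }
    return unique
}

// Made-up sample data with one repeated title.
console.log(pickUniqueTitles(['A', 'A', 'B', 'C', 'D', 'E', 'F'], 5))
// prints [ 'A', 'B', 'C', 'D', 'E' ]
```

Walking the whole list until enough unique titles are collected means a duplicate never reduces the number of headlines returned, as long as the page has enough distinct items.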

The whole code for step 4 looks like this:

//Step 4
const news = await page.evaluate(() => {
    const topNews = []
    const listOfAllNews = Array.from(document.querySelectorAll('.media-list__item'))
    for (const item of listOfAllNews) {
        if (topNews.length === 5) break // stop once we have 5 unique titles
        const title = item.childNodes[1].children[1].childNodes[1].outerText
        if (!topNews.includes(title)) topNews.push(title) // skip duplicates
    }
    return topNews
})
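Because the callback given to page.evaluate runs inside the page, it cannot read variables from server.js; values such as the selector or the number of headlines have to be passed as extra arguments to evaluate. A hedged sketch of that pattern follows, where collectTitles and scrapeTitles are hypothetical names and the DOM path is the same one assumed above.

```javascript
// Runs inside the page: it may only use its own parameters and browser
// globals such as `document`.
function collectTitles(selector, limit) {
    const titles = []
    for (const item of document.querySelectorAll(selector)) {
        if (titles.length >= limit) break
        const title = item.childNodes[1].children[1].childNodes[1].outerText
        if (!titles.includes(title)) titles.push(title)
    }
    return titles
}

// Runs in Node: forwards the values to the page as extra evaluate() arguments
// and resolves with the array returned by collectTitles.
async function scrapeTitles(page, selector, limit) {
    return page.evaluate(collectTitles, selector, limit)
}
```

Inside getData this might be used as const news = await scrapeTitles(page, '.media-list__item', 5). The return value must be serializable, which an array of strings is.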

Finally, add steps 5 and 6 inside our getData function and you should be all set. The final code looks like this:

const puppeteer = require("puppeteer");
const express = require('express')
const app = express()
const port = 3000

const baseURL = "http://bbc.com/"

app.get('/', (req, res) => {
    async function getData(url) {
        const browser = await puppeteer.launch(); //Step 1
        const page = await browser.newPage(); //Step 2
        await page.goto(url); //Step 3

        const news = await page.evaluate(() => {
            const topNews = []
            const listOfAllNews = Array.from(document.querySelectorAll('.media-list__item'))
            for (const item of listOfAllNews) {
                if (topNews.length === 5) break // stop once we have 5 unique titles
                const title = item.childNodes[1].children[1].childNodes[1].outerText
                if (!topNews.includes(title)) topNews.push(title) // skip duplicates
            }
            return topNews
        }) //Step 4

        console.log(news)

        res.send(news); //Step 5
        await browser.close(); // Step 6
    }
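    getData(baseURL).catch(err => { if (!res.headersSent) res.status(500).send(err.message) }) // report failures instead of leaving the request hanging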
})


app.listen(port, () => console.log(`Example app listening at http://localhost:${port}`))

Now when we go to http://localhost:3000/ we should get an array of 5 strings, which are our titles.

Hope you found this useful. You can browse through the whole code in this repo and read more about puppeteer here

Cheers,

Ildana