Website scraping with a Google Spreadsheet

In a recent project, a client was doing some market research. They had a (long!) list of website URLs and needed to collect every email address displayed anywhere on each site's page. I created a Google Spreadsheet to accomplish this. The requirements were as follows:

  • the client wanted to be able to add URLs to (and remove them from) the spreadsheet over the course of several days
  • as new URLs were added (typed or pasted) to the spreadsheet, it should scrape the email addresses from each site and put them (comma separated) into the second column of the corresponding row
  • if a URL was removed from the spreadsheet, the corresponding email addresses should be removed as well
  • display only unique email addresses (no duplicates) per URL

The requirements were fairly straightforward, but making it responsive and easy to use for an end user was tricky. My approach:

  • Create a trigger to call a custom function that will run any time the spreadsheet is edited (the "on edit" event); a setup sketch follows this list
  • The function looks at the modified cell (or each cell, if a range of cells was modified, for example if someone pasted 10 rows into the spreadsheet) and performs an HTTP GET of each of the URLs
  • The function then parses the returned HTML looking for email addresses, removes duplicates, and populates the corresponding cell in the spreadsheet with a comma-separated list
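
A note on step one: the trigger needs to be an installable one (a simple onEdit function also fires on edits, but it can't call services like UrlFetchApp that require authorization). You can add it from the script editor's triggers menu, or install it programmatically. Here's a minimal sketch of the latter, assuming the handler is named myOnEdit as in the code further down:

//run this once from the script editor to install the "on edit" trigger
//(a sketch -- assumes the edit handler is named myOnEdit, as below)
function createOnEditTrigger() {
  var ss = SpreadsheetApp.getActiveSpreadsheet();
  ScriptApp.newTrigger("myOnEdit")
    .forSpreadsheet(ss)
    .onEdit()
    .create();
}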

It turns out parsing email addresses out of a string of characters isn't completely straightforward either. I used the email regex listed at http://www.regular-expressions.info/email.html, which will match 99.99% of all email addresses in use today (good enough!).
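
To illustrate (a quick sketch you can run from the script editor, not part of the two spreadsheet functions), here's that regex and the de-duplication step applied to a sample string:

//quick illustration of the email regex and de-duplication on a sample string
function demoEmailRegex() {
  var sample = "Contact sales@example.com or support@example.com, or sales@example.com again.";
  var matched = sample.match(/[a-z0-9!#$%&'*+\/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+\/=?^_`{|}~-]+)*@(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?/gi);
  var unique = matched.reduce(function(a, b) { if (a.indexOf(b) < 0) a.push(b); return a; }, []);
  Logger.log(unique.join()); //logs "sales@example.com,support@example.com"
}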

I created the following two functions. The first gets called via the “on edit” event (anytime the spreadsheet is modified). It goes through the modified cells, and calls the second function, which performs the HTTP GET and parses all email addresses out of the response.

//triggered via the "on edit" event
function myOnEdit(e){

  var range = e.range;
  var sheet = SpreadsheetApp.getActiveSpreadsheet().getActiveSheet();
  
  // only update if first column is edited (but not header row)
  if (range.getColumn() > 1 || range.getRow() == 1)
    return 0;

  //loop through each of the updated cells
  for (var i = 1; i <= range.getNumRows(); i++) {
    var updatedCell = range.getCell(i,1);

    // get cell to update (beside the edited one)
    var cellToUpdate = sheet.getRange(updatedCell.getRow(), 2);
  
    if (updatedCell.getValue().valueOf() == "") {
      //cell was emptied out.. so delete the value
      cellToUpdate.setValue(null);
    } 
    else {
      //set value by scraping emails
      cellToUpdate.setValue(scrapeEmails(updatedCell.getValue()));
    }
  }
}

//SCRAPE URL for EMAILS
function scrapeEmails(url) {

   var response = UrlFetchApp.fetch(url).getContentText();

   // email regex from http://www.regular-expressions.info/email.html
   // matches 99.99% of all email address formats used today; the "i" flag makes the match case-insensitive

   var matched = response.match(/[a-z0-9!#$%&'*+\/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+\/=?^_`{|}~-]+)*@(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?/gi);
   
   if (matched == null)
     return null;
   
   //remove duplicate emails from the list
   var unique = matched.reduce(function(a, b) { if (a.indexOf(b) < 0) a.push(b); return a; }, []);

   //join() defaults to a comma-separated list
   return unique.join();
}
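
One caveat: UrlFetchApp.fetch will throw an exception if a URL is unreachable or malformed, which would abort the edit handler partway through a pasted range. If that becomes a problem, a more defensive variant might look something like the sketch below:

//sketch: a defensive version of scrapeEmails that returns null instead of throwing on bad URLs
function scrapeEmailsSafe(url) {
  var response;
  try {
    //muteHttpExceptions keeps non-200 responses from throwing
    response = UrlFetchApp.fetch(url, {muteHttpExceptions: true}).getContentText();
  } catch (err) {
    //unreachable or malformed URL -- skip it rather than aborting the whole run
    return null;
  }

  var matched = response.match(/[a-z0-9!#$%&'*+\/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+\/=?^_`{|}~-]+)*@(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?/gi);
  if (matched == null)
    return null;

  var unique = matched.reduce(function(a, b) { if (a.indexOf(b) < 0) a.push(b); return a; }, []);
  return unique.join();
}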

Here's a screenshot of the spreadsheet in use. Any URL entered into column A is automatically scraped, and column B is populated with the email addresses found.

[Screenshot: URLs in column A, scraped email addresses in column B]

If you have any questions about the above, or need help getting it working for yourself, let me know! I'd be happy to help.
