Scraping Meta Information from URLs

by Ruchita Rathi

Last week I completed the following items for the readlater app, thanks to generous help from brantje:

  • Delete an item in the readlater app
  • Scrape meta info from a given URL
  • Fix the search function

For the delete function, I ran into an issue where the click event for sidebar items was not being caught in the JS. Per brantje’s recommendation, it seems that the direct jQuery click binding:

 $('a .icon-delete').click(function(){…});

needs to be replaced with the following in the ownCloud JS file:

$(document).on('click', 'a .icon-delete', function(){…});

Also, per brantje’s pull request, the route for the delete function in the routes.php file was changed from:

array('name' => 'item_api#remove_item', 'url' => '/deleteitem', 'verb' => 'DELETE'),

to

array('name' => 'item_api#remove_item', 'url' => '/deleteitem', 'verb' => 'GET'),


For scraping the meta information from a given URL, brantje recommended using cURL or file_get_contents to fetch the page and then searching it for the description meta tag, as discussed in this Stack Overflow question: http://stackoverflow.com/questions/3711357/get-title-and-meta-tags-of-external-site
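
The controller code in this post calls a helper named file_get_contents_curl whose body is not shown. As a rough idea of what the fetching step could look like, here is a minimal sketch using PHP's cURL extension. The function name fetchPage, the timeout parameter, and the redirect limit are my own assumptions for illustration, not the code from brantje's PR.

```php
<?php
// Hypothetical helper (not from the readlater app): fetch a page body with cURL.
// Returns the response body as a string, or null if the request fails.
function fetchPage(string $url, int $timeoutSeconds = 10): ?string
{
    $ch = curl_init($url);
    if ($ch === false) {
        return null;
    }

    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,  // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true,  // follow HTTP redirects
        CURLOPT_MAXREDIRS      => 5,     // assumed limit, to avoid redirect loops
        CURLOPT_TIMEOUT        => $timeoutSeconds,
    ]);

    $body = curl_exec($ch);

    // curl_exec returns false on failure (DNS error, refused connection, timeout).
    return is_string($body) ? $body : null;
}
```

Calling fetchPage with a URL gives back either the HTML as a string or null, so the caller can skip the parsing step when the page could not be retrieved.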

Following brantje’s PR, I added the logic for scraping meta information to the itemapicontroller.php file like so:

$isURL = (bool)parse_url($url);
if ($isURL) {
    $html = $this->file_get_contents_curl($url);
    $doc = new \DOMDocument();
    // Suppress warnings caused by malformed real-world HTML.
    @$doc->loadHTML($html);

    // Read the page title, if the page has one.
    $nodes = $doc->getElementsByTagName('title');
    $title = ($nodes->length > 0) ? $nodes->item(0)->nodeValue : '';

    // Default to empty strings so pages without these tags do not trigger notices.
    $description = '';
    $keywords = '';
    $metas = $doc->getElementsByTagName('meta');

    for ($i = 0; $i < $metas->length; $i++) {
        $meta = $metas->item($i);
        if ($meta->getAttribute('name') == 'description') {
            $description = $meta->getAttribute('content');
        }
        if ($meta->getAttribute('name') == 'keywords') {
            $keywords = $meta->getAttribute('content');
        }
    }

    $item = array();
    $item['url'] = $url;
    $item['title'] = $title;
    $item['description'] = $description;
    $item['keywords'] = $keywords;

    $result['itemid'] = $this->ItemBusinessLayer->create($item);
}
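
To try the parsing step outside the controller, the same idea can be written as a standalone function and run against a literal HTML string. This is only a sketch: the function name extractMeta and the sample markup are assumptions of mine, and it differs from the controller code in that it matches the name attribute case-insensitively and uses libxml's internal error handling instead of the @ operator.

```php
<?php
// Hypothetical standalone function (not part of the readlater app):
// pull the title, description, and keywords out of an HTML string.
function extractMeta(string $html): array
{
    $result = ['title' => '', 'description' => '', 'keywords' => ''];
    if (trim($html) === '') {
        return $result; // nothing to parse
    }

    $doc = new DOMDocument();
    // Collect parser warnings internally instead of emitting them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $titleNode = $doc->getElementsByTagName('title')->item(0);
    if ($titleNode !== null) {
        $result['title'] = trim($titleNode->textContent);
    }

    foreach ($doc->getElementsByTagName('meta') as $meta) {
        // Treat name="Description" and name="description" the same way.
        $name = strtolower($meta->getAttribute('name'));
        if ($name === 'description' || $name === 'keywords') {
            $result[$name] = $meta->getAttribute('content');
        }
    }

    return $result;
}

// Sample input made up for this example.
$sample = '<html><head><title>Example page</title>'
    . '<meta name="Description" content="A short summary">'
    . '</head><body></body></html>';

print_r(extractMeta($sample));
```

Keeping the parsing in a function that takes a string makes it easy to check against saved pages without making a network request each time.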


 
