Curl and screen scraping advice please :)

4th February 2010, 08:06 PM   #1
ljackson (Member)
Join Date: Feb 2009
Location: Cornwall
Expertise: PHP
Experience: Intermediate

Hi all,

As I'm sure some of you are aware, I'm setting up a price comparison feature from an API which stores millions of products.

The main problem is that it's not very reliable. For example, if I search for Modern Warfare 2 on Xbox 360 it might return 10 stores which sell it, but half the products it finds are the communicator, and there is no record of the actual game from particular merchants. Play and Zavvi, for example, don't have the game listed that I can find. So I was wondering how easy cURL and screen scraping are, and whether I'm more likely to get a better result set :)

I have found a website, Find DVD ("Compare DVD prices from dozens of UK retailers"), which seems to be spot on when it comes to price comparison. All the links point to the actual product I'm searching for, but I'm not sure how they do it :)

Any advice would be appreciated :)
Cheers
Luke
__________________
www.kernow-connect.com - follow us on Twitter
6th February 2010, 12:13 AM   #2
CloudedVision (supermod)
Join Date: Jan 2009
Location: Your Imagination
Expertise: PHP
Experience: Professional

Well, first you would need a spider:
  • cURL to a product page
  • Parse it with an XML/HTML parser. PHP's default one is rather complex, but this looks promising: PHP Simple HTML DOM Parser
  • Find the price, product picture, etc.
  • Put them into a database
And then users can search the database on your website. Here's a good link on how to create a PHP search function: PHP search engine
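
The steps above can be sketched roughly as follows. This is only a minimal sketch under stated assumptions: it uses PHP's built-in DOM extension rather than the Simple HTML DOM Parser mentioned above, the class names `product-title`, `price` and `product-image` and the `products` table are made up for illustration, and a hard-coded sample page stands in for HTML fetched with cURL.

```php
<?php
// Hypothetical sample page; a real run would use the HTML returned by fetchPage().
$sampleHtml = '<html><body>
  <h1 class="product-title">Example Game (Xbox 360)</h1>
  <span class="price">&pound;29.99</span>
  <img class="product-image" src="http://example.com/covers/1234.jpg" />
</body></html>';

// Step 1: fetch a product page with cURL (defined but not called in this sketch).
function fetchPage(string $url)
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_TIMEOUT, 10);
    return curl_exec($ch); // string on success, false on failure
}

// Steps 2 and 3: parse the HTML and pick out the fields we care about.
function extractProduct(string $html): array
{
    libxml_use_internal_errors(true); // real-world HTML is rarely valid
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();
    $xpath = new DOMXPath($dom);

    // The class names below are assumptions about the page layout.
    $title = $xpath->evaluate('string(//h1[@class="product-title"])');
    $price = $xpath->evaluate('string(//span[@class="price"])');
    $image = $xpath->evaluate('string(//img[@class="product-image"]/@src)');

    return [
        'title' => trim($title),
        // Strip the currency symbol and keep only the number.
        'price' => (float) preg_replace('/[^0-9.]/', '', $price),
        'image' => $image,
    ];
}

// Step 4: store the result (hypothetical `products` table, any PDO driver).
function storeProduct(PDO $db, array $product): void
{
    $stmt = $db->prepare('INSERT INTO products (title, price, image) VALUES (?, ?, ?)');
    $stmt->execute([$product['title'], $product['price'], $product['image']]);
}

$product = extractProduct($sampleHtml);
print_r($product);
```

Each retailer would most likely need its own set of XPath expressions, since every site lays out its product pages differently.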
7th February 2010, 08:47 PM   #3
ljackson (Member)

Interesting... thanks, mate.
19th March 2010, 01:31 PM   #4
ljackson (Member)

Hi, I'm back! lol

After spending some time on the rest of my site I decided to take a closer look at cURL, and I have some code:

PHP Code:
<?php
include("dbinfo.php");

function storeLink($url, $gathered_from) {
    $query = "INSERT INTO links (url, gathered_from) VALUES ('$url', '$gathered_from')";
    mysql_query($query) or die('Error, insert query failed');
}

$target_url = "http://www.play.com/DVD/DVD/4-/-/-/Product.html?title=10674623&source=9593";
$userAgent = 'Googlebot/2.1 (http://www.googlebot.com/bot.html)';

// make the cURL request to $target_url
$ch = curl_init();
curl_setopt($ch, CURLOPT_USERAGENT, $userAgent);
curl_setopt($ch, CURLOPT_URL, $target_url);
curl_setopt($ch, CURLOPT_FAILONERROR, true);
#curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_AUTOREFERER, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_TIMEOUT, 10);
$html = curl_exec($ch);
if (!$html) {
    echo "<br />cURL error number:" . curl_errno($ch);
    echo "<br />cURL error:" . curl_error($ch);
    exit;
}

// parse the html into a DOMDocument
$dom = new DOMDocument();
@$dom->loadHTML($html);

// grab all the <object> elements on the page
$xpath = new DOMXPath($dom);
$hrefs = $xpath->evaluate("/html/body/center//object");

for ($i = 0; $i < $hrefs->length; $i++) {
    $href = $hrefs->item($i);
    $url = $href->getAttribute('href');
    storeLink($url, $target_url);
    echo "<br />Link stored: $url";
}
?>
But this throws up an error on this line:
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);

which is why I've commented it out. Even so, it does not insert anything into the DB.

The error thrown is:

Warning: curl_setopt() [function.curl-setopt]: CURLOPT_FOLLOWLOCATION cannot be activated when in safe_mode or an open_basedir is set on line 17

Any ideas?
Thanks
Luke

P.S. I'm trying to get the trailer from the URL.
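
The warning says that PHP refuses to enable CURLOPT_FOLLOWLOCATION while safe_mode or open_basedir is in effect. If the server settings cannot be changed, one possible workaround is to follow redirects by hand. The sketch below is an illustration only: the function names `parseLocationHeader` and `fetchFollowingRedirects` are made up, and it assumes the server sends absolute URLs in its Location header (relative ones are not resolved).

```php
<?php
// Pull the target of a redirect out of a raw response header block.
// Returns null when there is no Location header.
function parseLocationHeader(string $rawHeaders): ?string
{
    if (preg_match('/^Location:\s*(\S+)/mi', $rawHeaders, $m)) {
        return $m[1];
    }
    return null;
}

// Fetch $url, following up to $maxHops redirects manually.
// Simplification: relative Location values are not resolved.
function fetchFollowingRedirects(string $url, int $maxHops = 5)
{
    for ($hop = 0; $hop <= $maxHops; $hop++) {
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_setopt($ch, CURLOPT_HEADER, true); // include headers in the output
        curl_setopt($ch, CURLOPT_TIMEOUT, 10);
        $response = curl_exec($ch);
        if ($response === false) {
            return false;
        }
        $status     = curl_getinfo($ch, CURLINFO_HTTP_CODE);
        $headerSize = curl_getinfo($ch, CURLINFO_HEADER_SIZE);
        $headers    = substr($response, 0, $headerSize);
        $body       = substr($response, $headerSize);

        $next = parseLocationHeader($headers);
        if ($status >= 300 && $status < 400 && $next !== null) {
            $url = $next; // go round again with the new address
            continue;
        }
        return $body;
    }
    return false; // too many redirects
}

// Offline check of the header parsing on a made-up response.
$sample = "HTTP/1.1 302 Found\r\nLocation: http://www.example.com/new\r\nContent-Length: 0\r\n\r\n";
echo parseLocationHeader($sample), "\n";
```

This only matters if the product page actually redirects; if it does not, leaving the option commented out should make no difference to the result.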
19th March 2010, 07:06 PM   #5
CloudedVision (supermod)

There are several places where it could be going wrong. What HTML is cURL getting? What are the results from XPath?
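
One way to answer both questions is to print part of what came back and count the nodes the XPath expression matches. The following is a rough sketch, not a definitive recipe: the helper name `describeMatches` is made up, and the hard-coded `$html` string is an assumption standing in for whatever `curl_exec()` returned.

```php
<?php
// Stand-in for the string returned by curl_exec(); an assumption for this sketch.
$html = '<html><body><div><center><object data="movie.swf"></object></center></div></body></html>';

// Return the markup of every node that an XPath expression matches.
function describeMatches(string $html, string $expression): array
{
    libxml_use_internal_errors(true); // collect parser warnings quietly
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($dom);
    $nodes = $xpath->query($expression);
    if ($nodes === false) {
        return []; // the expression itself was invalid
    }

    $found = [];
    foreach ($nodes as $node) {
        // saveHTML($node) gives the markup of just that node.
        $found[] = $dom->saveHTML($node);
    }
    return $found;
}

// 1. What HTML did cURL get? Show the length and the first few hundred characters.
echo strlen($html), " bytes\n";
echo htmlspecialchars(substr($html, 0, 300)), "\n";

// 2. What does XPath return? Print the count and each matched node.
$matches = describeMatches($html, '//object');
echo count($matches), " match(es)\n";
print_r($matches);
```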
19th March 2010, 11:00 PM   #6
ljackson (Member)

??? Dunno mate, how does one find out?

The code surrounding the Flash object is
Code:
<div class="special slice"><ul><li><br><center><object type="application/x-shockwave-flash" width="403" height="298" id="playVideoPlayer" wmode="transparent" data="http://media.play.com/trailers/videoPlayer.swf?file=http://media.play.com/ProductPage_Trailers/Films/10674623.flv&vol=0.5&packShot=http://images.play.com/covers/10674623m.jpg" allowScriptAccess="always"><param name="movie" value="http://media.play.com/trailers/videoPlayer.swf?file=http://media.play.com/ProductPage_Trailers/Films/10674623.flv&vol=0.5&packShot=http://images.play.com/covers/10674623m.jpg" /><param name="wmode" value="transparent" /><param name="allowScriptAccess" value="always" /><embed name="playVideoPlayer" src="http://media.play.com/trailers/videoPlayer.swf?file=http://media.play.com/ProductPage_Trailers/Films/10674623.flv&vol=0.5&packShot=http://images.play.com/covers/10674623m.jpg" loop="false" width="403" height="298" allowScriptAccess="always" type="application/x-shockwave-flash" pluginspage="http://www.macromedia.com/go/getflashplayer" /></object></center><br><br></li><li> Behind the Scenes</li><li> Interviews with Cast and Crew</li></ul></div>   
Is it the object I want?

Sorry if I'm being thick, just got in from the pub :D
19th March 2010, 11:04 PM   #7
ljackson (Member)

OK, I've just done a print_r like so:

PHP Code:
print_r($xpath);
print_r($hrefs);
and it outputs
DOMXPath Object ( ) DOMNodeList Object ( )

so not a lot :(

Does that mean that this part is wrong:
PHP Code:
$hrefs = $xpath->evaluate("/html/body/center//object");
or is it because this line is throwing up an error:
PHP Code:
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
Thanks mate
Luke
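
Judging by the markup quoted earlier in the thread, the `<center>` element sits inside `<div>`, `<ul>` and `<li>` elements, so an absolute path such as `/html/body/center//object` would probably not match it. The `<object>` tag also keeps its URL in the `data` attribute, not in `href`. The sketch below compares the two expressions on a trimmed copy of that markup; it assumes the live page is structured the same way as the quoted snippet. An empty print_r dump of these objects is not very telling on its own, so the sketch checks the node list's `length` property instead.

```php
<?php
// Trimmed copy of the markup quoted earlier in the thread (attributes shortened).
$html = '<div class="special slice"><ul><li><br><center>'
      . '<object type="application/x-shockwave-flash" id="playVideoPlayer" '
      . 'data="http://media.play.com/trailers/videoPlayer.swf?file=http://media.play.com/ProductPage_Trailers/Films/10674623.flv">'
      . '<param name="wmode" value="transparent" /></object>'
      . '</center><br><br></li></ul></div>';

libxml_use_internal_errors(true); // keep parser warnings out of the output
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();
$xpath = new DOMXPath($dom);

// Absolute path: only matches a <center> that is a direct child of <body>.
$absolute = $xpath->query('/html/body/center//object');
// Relative search: matches an <object> anywhere in the document.
$anywhere = $xpath->query('//object');

echo "absolute path: ", $absolute->length, " node(s)\n";
echo "//object:      ", $anywhere->length, " node(s)\n";

$trailerUrls = [];
foreach ($anywhere as $object) {
    // <object> has no href attribute; the movie URL is in data.
    $trailerUrls[] = $object->getAttribute('data');
}
print_r($trailerUrls);
```

If the relative expression finds the node on the real page too, the XPath is the more likely culprit; the commented-out CURLOPT_FOLLOWLOCATION line would only matter if the page redirects.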
20th March 2010, 06:27 AM   #8
ljackson (Member)

Hi mate,

Turns out I didn't need to use cURL after all; the solution was much simpler:
PHP Code:
$uri = 'http://www.play.com/DVD/Blu-ray/4-/-/-/Product.html?title=12090170&source=9593';
$input = file_get_contents($uri);

if (preg_match('#/Films/(\w+?)\.flv\&#', $input, $match))
{
    $link = '<object type="application/x-shockwave-flash" width="403" height="298" id="playVideoPlayer" wmode="transparent" data="http://media.play.com/trailers/videoPlayer.swf?file=http://media.play.com/ProductPage_Trailers/Films/'.$match[1].'.flv&vol=0.5&packShot=http://images.play.com/covers/'.$match[1].'m.jpg" allowScriptAccess="always"><param name="movie" value="http://media.play.com/trailers/videoPlayer.swf?file=http://media.play.com/ProductPage_Trailers/Films/'.$match[1].'.flv&vol=0.5&packShot=http://images.play.com/covers/'.$match[1].'m.jpg" /><param name="wmode" value="transparent" /><param name="allowScriptAccess" value="always" /><embed name="playVideoPlayer" src="http://media.play.com/trailers/videoPlayer.swf?file=http://media.play.com/ProductPage_Trailers/Films/'.$match[1].'.flv&vol=0.5&packShot=http://images.play.com/covers/'.$match[1].'m.jpg" loop="false" width="403" height="298" allowScriptAccess="always" type="application/x-shockwave-flash" pluginspage="http://www.macromedia.com/go/getflashplayer"/></object>';
    print $match[1];
    print($link);
}
which works a treat :)

Cheers
Luke
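
The regex step can also be wrapped in a small function so it can be tried without fetching the page. This is just a sketch: the function names `extractTrailerId` and `trailerUrls` and the sample string are made up for illustration, while the pattern and URL layout are the ones used in the post above.

```php
<?php
// Return the ID from a ".../Films/<id>.flv&" reference, or null if none is found.
function extractTrailerId(string $pageHtml): ?string
{
    if (preg_match('#/Films/(\w+?)\.flv&#', $pageHtml, $match)) {
        return $match[1];
    }
    return null;
}

// Build the trailer and cover URLs from the ID (URL layout taken from the thread).
function trailerUrls(string $id): array
{
    return [
        'flv'   => 'http://media.play.com/ProductPage_Trailers/Films/' . $id . '.flv',
        'cover' => 'http://images.play.com/covers/' . $id . 'm.jpg',
    ];
}

// Made-up fragment standing in for the output of file_get_contents().
$sample = 'data="http://media.play.com/trailers/videoPlayer.swf?file=http://media.play.com/ProductPage_Trailers/Films/10674623.flv&vol=0.5"';

$id = extractTrailerId($sample);
if ($id !== null) {
    print_r(trailerUrls($id));
} else {
    echo "No trailer found\n";
}
```

Returning null when nothing matches makes it easy to skip products that have no trailer.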

Tags
curl, scraping
