I’m scraping a website for data. However, the content I need appears at random: each page load shows a different sentence. I need every unique sentence the website can display.
How do I do this? The same way I’ve done it many times before: curl_multi_exec. Except I never remember how to use it and end up back at the manuals.
Here’s my code:
<?php
$data = array();
$url = "http://www.example.com/";
$mh = curl_multi_init();

// I loop forever and watch the output
// put your condition to stop looping here
while (true) {
    // create all cURL resources
    $chs = array();
    // number of concurrent requests sent out per batch
    // from experience I find 20-40 to be the fastest
    // but you should experiment and find out yourself
    for ($i = 0; $i < 10; $i++) {
        $ch = curl_init();
        curl_setopt($ch, CURLOPT_URL, $url);
        curl_setopt($ch, CURLOPT_HEADER, 0);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
        curl_multi_add_handle($mh, $ch);
        $chs[] = $ch;
    }

    $active = null;
    // execute the handles
    // this loop looks weird because php 5.3.18 broke curl_multi_select
    do {
        do {
            $mrc = curl_multi_exec($mh, $active);
        } while ($mrc == CURLM_CALL_MULTI_PERFORM);
        // this fixes the multi select from returning -1 forever
        usleep(10000);
    } while (curl_multi_select($mh) === -1);

    while ($active && $mrc == CURLM_OK) {
        // if select fails, pause instead of spinning, then keep the transfers moving
        if (curl_multi_select($mh) === -1) {
            usleep(10000);
        }
        do {
            $mrc = curl_multi_exec($mh, $active);
        } while ($mrc == CURLM_CALL_MULTI_PERFORM);
    }

    // collect the results and remove the handles
    foreach ($chs as $ch) {
        $html = curl_multi_getcontent($ch);
        curl_multi_remove_handle($mh, $ch);

        // parse here...
        // the pattern needs a capture group, because $matches[1] is read below
        if (!preg_match_all('/(pattern)/', $html, $matches)) {
            echo "Error!";
        }

        // store matches as keys so I can find the unique ones
        // bonus: increment the value to count how many times you find that match
        foreach ($matches[1] as $match) {
            if (!isset($data[$match])) {
                $data[$match] = 1;
                echo "+: ".$match."\n";
            } else {
                $data[$match]++;
                echo " : ".$match."\n";
            }
        }
    }
}

// only reached once you add a stop condition to the loop above
curl_multi_close($mh);
?>
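The script above leaves the stop condition open. One possible condition, sketched here as an assumption rather than something from my original code, is to stop once several batches in a row turn up no new sentences. The helper name record_matches, the $maxIdleBatches threshold and the stand-in $batches data are all hypothetical; in the real script each batch would be the $matches[1] array from one page.

```php
<?php
// Hypothetical helper: merge one batch of matches into the running counts
// and return how many of them had not been seen before.
function record_matches(array &$data, array $matches)
{
    $new = 0;
    foreach ($matches as $match) {
        if (!isset($data[$match])) {
            $data[$match] = 1;   // first sighting
            $new++;
        } else {
            $data[$match]++;     // seen before, just count it
        }
    }
    return $new;
}

$data = array();
$idleBatches = 0;
$maxIdleBatches = 2; // assumed threshold, tune it for your site

// Stand-in data; the real script would feed in scraped matches instead.
$batches = array(array('a', 'b'), array('a'), array('b', 'a'), array('c'));

foreach ($batches as $batch) {
    // reset the counter when something new appears, otherwise count the idle batch
    $idleBatches = record_matches($data, $batch) > 0 ? 0 : $idleBatches + 1;
    if ($idleBatches >= $maxIdleBatches) {
        break; // nothing new for a while, assume we have seen everything
    }
}
```

A low threshold stops early and can miss rare sentences (here 'c' is never recorded), so a larger value is safer when some sentences show up much less often than others.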
I used the PHP manual, which is a good place to start. Special thanks to ‘Alex Palmer’ for his comment on how to fix the curl_multi_select issue, and to ‘bfanger at gmail dot com’ for his quick solution.
Originally, curl_multi_select would block until something happened, such as a URL returning, but now it returns -1 no matter what. Alex posted a bug report, in which bfanger suggests adding a pause before checking curl_multi_select again.
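The same workaround can be folded into a single loop. This is a minimal sketch, not the exact code from the bug report: the function name run_multi_handle and the $pauseMicros parameter are hypothetical, and the 10000 microsecond default simply mirrors the usleep() call in my script.

```php
<?php
// Hypothetical helper: drive every transfer attached to $mh until none are active.
function run_multi_handle($mh, $pauseMicros = 10000)
{
    $active = null;
    do {
        $mrc = curl_multi_exec($mh, $active);
        if ($active && $mrc === CURLM_OK) {
            // Normally this blocks until a connection has activity or 1 second passes.
            if (curl_multi_select($mh, 1.0) === -1) {
                // select failed: pause briefly instead of spinning at 100% CPU
                usleep($pauseMicros);
            }
        }
        // keep going while cURL asks to be called again or transfers are still running
    } while ($mrc === CURLM_CALL_MULTI_PERFORM || ($active && $mrc === CURLM_OK));

    return $mrc; // CURLM_OK when everything ran without a multi-level error
}
```

After it returns, the content of each handle can be read with curl_multi_getcontent, exactly as in the script. The pause only happens when curl_multi_select reports -1, so on a setup where select works normally the loop does not add any delay.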