Asynchronous URL requests with PHP and cURL

I’m scraping a website for data. However, the content I need appears at random, so each page load shows a different sentence. I need all the unique sentences that the website can display.

How to do this? Like I’ve done many times before: curl_multi_exec. Except I never remember how to do it and end up back at the manuals.

Here’s my code:

<?php

$data = array();
$url = "http://www.example.com/";
$mh = curl_multi_init();

// I loop forever and watch the output
// put your condition to stop looping here
while(true){
	// create all cURL resources
	$chs = array();
	// number of concurrent requests sent out in each batch
	// from experience I find 20-40 to be the fastest
	// but you should experiment and find out yourself
	for($i = 0; $i < 10; $i++){
		$ch = curl_init();
		curl_setopt($ch, CURLOPT_URL, $url);
		curl_setopt($ch, CURLOPT_HEADER, 0);
		curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
		curl_multi_add_handle($mh, $ch);
		$chs[] = $ch;
	}

	$active = null;
	// execute the handles
	// this loop looks weird because PHP 5.3.18 broke curl_multi_select:
	// it can return -1 straight away instead of blocking
	do {
		$mrc = curl_multi_exec($mh, $active);
		if ($mrc == CURLM_OK && $active && curl_multi_select($mh) === -1) {
			// pause briefly so a -1 from select does not turn into a busy loop
			usleep(10000);
		}
	} while ($mrc == CURLM_CALL_MULTI_PERFORM || ($active && $mrc == CURLM_OK));

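	// collect the responses and detach the handles
	foreach($chs as $ch){
		// a failed transfer can give null, so cast to string before parsing
		$html = (string) curl_multi_getcontent($ch);
		curl_multi_remove_handle($mh, $ch);

		// parse here...
		// the pattern needs a capture group, because $matches[1] is read below
		if(!preg_match_all('/(pattern)/', $html, $matches)){
			echo "Error!\n";
			continue;
		}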
		
		// store matches as keys so I can find the unique ones
		// bonus: increment the value to count how many times you find that match
		foreach($matches[1] as $match){
			if(!isset($data[$match])){
				$data[$match] = 1;
				echo "+: ".$match."\n";
			} else {
				$data[$match]++;
				echo " : ".$match."\n";
			}
		}
	}
}

curl_multi_close($mh);

?>
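
The script above loops forever and leaves the stop condition up to you. One possible rule, and this is only my assumption about what would suit this kind of scrape, is to quit once several batches in a row turn up nothing new. Here is a minimal sketch of that idea. The helper tallyMatches, the $maxStaleBatches limit and the stand-in $batches array are all made up for illustration; in the real script each batch would be the $matches[1] array from one parsed response.

```php
<?php

// Hypothetical helper, not part of the original script: merges one batch of
// matches into $data (match => count) and returns how many of them were new.
function tallyMatches(array &$data, array $matches)
{
    $new = 0;
    foreach ($matches as $match) {
        if (!isset($data[$match])) {
            $data[$match] = 1;
            $new++;
        } else {
            $data[$match]++;
        }
    }
    return $new;
}

// Assumed stopping rule: give up after this many batches in a row with nothing new.
$maxStaleBatches = 3;

// Stand-in batches so the sketch runs without any network access.
$batches = array(
    array('a', 'b'),
    array('b', 'c'),
    array('a'),
    array('c', 'b'),
    array('a', 'a'),
    array('d'), // never reached: the loop stops before this batch
);

$data = array();
$staleBatches = 0;
foreach ($batches as $matches) {
    $new = tallyMatches($data, $matches);
    // reset the counter whenever something new shows up
    $staleBatches = ($new > 0) ? 0 : $staleBatches + 1;
    if ($staleBatches >= $maxStaleBatches) {
        break;
    }
}

print_r($data);
```

The trade-off is that a rare sentence can still be missed if it simply does not come up during the stale batches, so a higher limit costs more requests but lowers that risk.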

I used the PHP manual, which is a good place to start. And special thanks to an ‘Alex Palmer’ for his comment on how to fix the curl_multi_select issue, and ‘bfanger at gmail dot com’ for his quick solution.

Originally, curl_multi_select would block until something happened, such as a URL returning, but now it returns -1 no matter what. Alex posted a bug report, where bfanger suggests adding a pause before checking curl_multi_select again.
