This article presents code examples on how to scrape multiple URLs concurrently using the Java multi-threading API, in particular ExecutorService. The main reason I have been doing scraping lately is the need to get data from the web in order to apply data analytics/science (machine learning algorithms) and extract knowledge from that data. Please feel free to comment/suggest if I missed one or more important points.
Following are the three methods whose code samples are presented below:
- scrapeURLs, which takes an input file consisting of URLs to be scraped and an output file where the results need to be written
- scrapeIndividualURL, which takes the URL to be scraped as an argument and returns its content
- writeToFile, which writes the output to a file
Code Samples – Web scraping using Multi-threading
Pay attention to some of the following concepts:
- An ExecutorService instance is used for concurrent scraping. The instance is created using Executors.newFixedThreadPool with a pool size of 10.
- For each URL, a Callable task is submitted to the ExecutorService. This task could do some of the following activities and return a String (or a set of strings):
- Get the content from the web
- Parse the content using a DOM parser. One could use the JTidy framework to achieve this objective.
- Apply XPath to get selective content from the DOM object (see the sketch after this list).
- Write the returned string to a file. Note that the content is retrieved and processed through a Future object.
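Since the parsing and XPath steps are not shown in the listing below, here is a minimal sketch of what they could look like. It assumes JTidy is on the classpath; the method name extractTitle and the //title expression are illustrative placeholders, not part of the original code.
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.tidy.Tidy;

// Minimal sketch: parse scraped HTML with JTidy and apply an XPath expression
public static String extractTitle(String html) throws Exception {
    Tidy tidy = new Tidy();
    tidy.setQuiet(true);
    tidy.setShowWarnings(false);
    // Parse the (possibly malformed) HTML into a W3C DOM document
    Document doc = tidy.parseDOM(
            new ByteArrayInputStream(html.getBytes(StandardCharsets.UTF_8)), null);
    // Apply XPath to get selective content from the DOM object, e.g. the page title
    XPath xpath = XPathFactory.newInstance().newXPath();
    return xpath.evaluate("//title", doc);
}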
/**
 * @param urlFile Path of the file which consists of URLs to be scraped
 * @param outputFile File where the scrape results will be written
 * @throws InterruptedException
 * @throws ExecutionException
 * @throws TimeoutException
 */
public void scrapeURLs(String urlFile, String outputFile)
        throws InterruptedException, ExecutionException, TimeoutException {
    Iterator<String> uiter = null;
    // Get the URLs from the file
    try {
        uiter = FileUtil.getURLIterator(urlFile);
    } catch (IOException e1) {
        e1.printStackTrace();
    }
    // Iterate through all URLs
    if (uiter != null) {
        //
        // Create an ExecutorService using a newFixedThreadPool of size 10
        //
        ExecutorService executorService = Executors.newFixedThreadPool(10);
        //
        // Create a map of Futures and their URLs
        //
        Map<Future<String>, String> tasks = new LinkedHashMap<Future<String>, String>();
        // Iterate through all URLs and submit one scraping task per URL
        while (uiter.hasNext()) {
            final String urlstr = uiter.next();
            //
            // Create a Callable instance which invokes the scraping for each URL
            // and returns the content (full or part, based on some rules)
            //
            Callable<String> callable = new Callable<String>() {
                public String call() throws Exception {
                    return scrapeIndividualURL(urlstr);
                }
            };
            //
            // Submit the task to the executorService; at this point the scraping starts
            //
            Future<String> future = executorService.submit(callable);
            tasks.put(future, urlstr);
        }
        //
        // For each task, get the content and write it to the output file
        //
        tasks.forEach((future, url) -> {
            try {
                String content = future.get(120, TimeUnit.SECONDS);
                writeToFile(url, content, outputFile);
            } catch (InterruptedException | ExecutionException
                    | TimeoutException e) {
                e.printStackTrace();
                writeToFile(url, "Not Found", outputFile);
            }
        });
        executorService.shutdown();
    }
}
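The FileUtil.getURLIterator call above refers to a small helper class that is not shown in this article. A minimal sketch of such a helper, assuming the input file contains one URL per line, could look like the following; the implementation details are my own, only the class and method names mirror the call in scrapeURLs.
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Iterator;

public class FileUtil {
    /**
     * Read the input file (one URL per line) and return an iterator over the URLs.
     */
    public static Iterator<String> getURLIterator(String urlFile) throws IOException {
        // readAllLines loads the whole file into memory; fine for a moderate list of URLs
        return Files.readAllLines(Paths.get(urlFile)).iterator();
    }
}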
/**
 * Scrape the individual URL and return its content
 * @param urlstr URL to be scraped
 * @return content of the scraped URL (empty if the request fails)
 */
public static String scrapeIndividualURL(String urlstr) {
    StringBuilder contentb = new StringBuilder();
    try {
        // Get the URL content
        URL url = new URL(urlstr);
        // Create a URL connection object
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        // Set the configuration parameters.
        // Note the readTimeout set to 30 seconds.
        // This is quite important when you are planning to scrape URLs.
        conn.setConnectTimeout(100000);
        conn.setReadTimeout(30000);
        conn.connect();
        // Pick the error stream for HTTP errors (status >= 400), else the input stream
        InputStream in;
        if (conn.getResponseCode() >= 400) {
            in = conn.getErrorStream();
        } else {
            in = conn.getInputStream();
        }
        // Open the stream and read it line by line through a BufferedReader
        if (in != null) {
            BufferedReader br = new BufferedReader(new InputStreamReader(in));
            String inputLine;
            while ((inputLine = br.readLine()) != null) {
                contentb.append(inputLine);
                contentb.append("\n");
            }
            br.close();
        }
    } catch (MalformedURLException e) {
        e.printStackTrace();
    } catch (IOException e) {
        e.printStackTrace();
    }
    return contentb.toString();
}
/**
 * Write the scrape result for one URL to the output file
 * @param url URL that was scraped
 * @param value content returned for the URL
 * @param outputFile file to which the result is appended
 */
private void writeToFile(String url, String value, String outputFile) {
    // Append one tab-separated line per URL. The IOException is handled here so that
    // this method can be called from within the lambda in scrapeURLs.
    try (BufferedWriter bw = new BufferedWriter(new FileWriter(new File(outputFile), true))) {
        if (value != null) {
            bw.write(url + "\t" + value + "\n");
        } else {
            bw.write(url + "\t" + "Not Found" + "\n");
        }
    } catch (IOException e) {
        e.printStackTrace();
    }
}
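For completeness, the three methods above could be driven as shown below. This is only a hypothetical driver: the class name WebScraper and the file names are placeholders and not part of the original code.
public static void main(String[] args)
        throws InterruptedException, ExecutionException, TimeoutException {
    // Scrape the URLs listed in urls.txt and append the tab-separated
    // results (URL, content) to scraped-output.txt
    WebScraper scraper = new WebScraper();
    scraper.scrapeURLs("urls.txt", "scraped-output.txt");
}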