
Java – How to Scrape Web using Multi-threading (ExecutorService)

This article presents code examples showing how to scrape multiple URLs concurrently using the Java multi-threading API, specifically ExecutorService. The main reason I have been doing scraping lately is the need to get data from the web, apply data analytics/data science (machine learning algorithms) to it, and extract knowledge from the data. Please feel free to comment or suggest if I missed one or more important points. Also, sorry for the typos.
The code samples below cover the following three methods:
  1. scrapeURLs, which takes an input file consisting of the URLs to be scraped and an output file where the results are written
  2. scrapeIndividualURL, which takes the URL to be scraped as an argument and returns the content of the scraped URL
  3. writeToFile, which writes the output to a file


Code Samples – Web scraping using Multi-threading

Pay attention to some of the following concepts:

  • An ExecutorService instance is used for concurrent scraping. The instance is created using Executors.newFixedThreadPool with a pool size of 10.
  • For each URL, a Callable task is submitted to the ExecutorService. This task could do some of the following and return a String or a set of strings:
    • Get the content from the web
    • Parse the content using a DOM parser. One could use the Tidy framework to achieve this objective.
    • Apply XPath to get selective content from the DOM object.
  • Write the returned string to a file. Note that the content is retrieved and processed through the Future object returned when the task is submitted.
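
The parse-and-select steps in the list above can be sketched with the JDK's built-in DOM parser and XPath API. This is only a minimal sketch, assuming the page content is already well-formed XML/XHTML (real-world HTML usually needs to be cleaned up first, for example with Tidy as mentioned above). The class name XPathSketch, the extract helper and the sample markup are hypothetical and not part of the original code.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class XPathSketch {

    // Parses markup into a DOM object and evaluates an XPath expression against it.
    // Assumption: the input is already well-formed XML/XHTML.
    static String extract(String xhtml, String expression) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xhtml)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            // Returns the text value of the first node matched by the expression
            return xpath.evaluate(expression, doc);
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse or query the content", e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical sample page used only for illustration
        String page = "<html><head><title>Sample page</title></head>"
                + "<body><h1>Heading</h1><p>First paragraph</p></body></html>";
        System.out.println(extract(page, "/html/head/title"));
        System.out.println(extract(page, "//p[1]"));
    }
}
```

Running main prints Sample page followed by First paragraph. A helper like this could be called from inside the callable task so that only the selected part of each page is returned.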
 
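// Imports needed by the three methods in this article
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileWriter;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.MalformedURLException;
import java.net.URL;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/**
 * @param urlFile Path for file which consists of URLs to be scraped
 * @param outputFile File where scrape results will be written
 * @throws InterruptedException
 * @throws ExecutionException
 * @throws TimeoutException
 */
public void scrapeURLs(String urlFile, String outputFile)
        throws InterruptedException, ExecutionException, TimeoutException {
    Iterator<String> uiter = null;
    // Get the URLs from the file (FileUtil is a helper class not shown in this article)
    try {
        uiter = FileUtil.getURLIterator(urlFile);
    } catch (IOException e1) {
        e1.printStackTrace();
    }
    if (uiter != null) {
        // Create an ExecutorService using a newFixedThreadPool of 10 threads
        ExecutorService executorService = Executors.newFixedThreadPool(10);
        // Create a map of Future and URLs; LinkedHashMap keeps the submission order
        Map<Future<String>, String> tasks = new LinkedHashMap<Future<String>, String>();
        // Iterate through all URLs for scraping the web
        while (uiter.hasNext()) {
            final String urlstr = uiter.next();
            // Create a callable instance which calls the function that invokes the
            // scraping for each URL and gets the content (full or part based on some rules)
            Callable<String> callable = new Callable<String>() {
                public String call() throws Exception {
                    return scrapeIndividualURL(urlstr);
                }
            };
            // Submit the task to executorService; at this point the scraping starts
            Future<String> future = executorService.submit(callable);
            tasks.put(future, urlstr);
        }
        // For each task, wait up to 120 seconds for the content and write it to the file
        tasks.forEach((future, url) -> {
            String content;
            try {
                content = future.get(120, TimeUnit.SECONDS);
            } catch (InterruptedException | ExecutionException | TimeoutException e) {
                e.printStackTrace();
                content = "Not Found";
            }
            // writeToFile throws IOException, which must be handled inside the lambda
            try {
                writeToFile(url, content, outputFile);
            } catch (IOException e) {
                e.printStackTrace();
            }
        });
        executorService.shutdown();
    }
}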
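
The method above ends with a call to shutdown(), which stops the pool from accepting new tasks but does not wait for anything. A common companion, following the pattern described in the ExecutorService documentation, is to wait for termination and cancel whatever is still running after a timeout. The sketch below is one way to do that; the class name ShutdownSketch, the helper shutdownAndAwait and the 5-second timeout are hypothetical choices for illustration.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

class ShutdownSketch {

    // Stops accepting new tasks, waits for running ones to finish,
    // then cancels whatever is left. Returns true if the pool terminated.
    static boolean shutdownAndAwait(ExecutorService pool, long timeoutSeconds) {
        pool.shutdown();
        try {
            if (!pool.awaitTermination(timeoutSeconds, TimeUnit.SECONDS)) {
                // Interrupt tasks that are still running and wait once more
                pool.shutdownNow();
                return pool.awaitTermination(timeoutSeconds, TimeUnit.SECONDS);
            }
            return true;
        } catch (InterruptedException e) {
            // Cancel remaining tasks and preserve the interrupt status
            pool.shutdownNow();
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        ExecutorService pool = Executors.newFixedThreadPool(10);
        try {
            pool.submit(() -> System.out.println("scraping would happen here"));
        } finally {
            // The finally block ensures the pool is shut down even if a task submission fails
            System.out.println("terminated: " + shutdownAndAwait(pool, 5));
        }
    }
}
```

Placing the shutdown in a finally block matters because the threads of a fixed thread pool are not daemon threads by default, so a pool that is never shut down can keep the JVM from exiting.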

/**
 * Scrape the URL
 * @param urlstr URL to be scraped
 * @return Content of the response body; empty if the request fails
 */
public static String scrapeIndividualURL(String urlstr) {
    StringBuilder contentb = new StringBuilder();
    try {
        // Create a URL connection object
        URL url = new URL(urlstr);
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        // Set the configuration parameters.
        // Note the connect timeout of 100 seconds and the read timeout of 30 seconds.
        // Timeouts are quite important when you are planning to scrape URLs,
        // because a slow server would otherwise block a pool thread.
        conn.setConnectTimeout(100000);
        conn.setReadTimeout(30000);
        conn.connect();
        // For error responses (status >= 400) read the error stream, which may be null;
        // otherwise read the regular input stream
        InputStream in = conn.getResponseCode() >= 400
                ? conn.getErrorStream() : conn.getInputStream();
        if (in != null) {
            // Open the stream, put it into a BufferedReader and read it line by line
            try (BufferedReader br = new BufferedReader(new InputStreamReader(in))) {
                String inputLine;
                while ((inputLine = br.readLine()) != null) {
                    contentb.append(inputLine);
                    contentb.append("\n");
                }
            }
        }
    } catch (MalformedURLException e) {
        e.printStackTrace();
    } catch (IOException e) {
        e.printStackTrace();
    }
    return contentb.toString();
}
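
The loop that drains the response stream can also be pulled out into a small helper, which makes it possible to try the reading logic without any network access. The following is a minimal sketch under two assumptions that are not in the original code: the body is decoded as UTF-8, and I/O failures are rethrown as unchecked exceptions. The names StreamReadSketch and readAll are hypothetical.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

class StreamReadSketch {

    // Reads the whole stream line by line and joins the lines with "\n",
    // in the same way as the scraping method above.
    // Assumption: the content is UTF-8 encoded.
    static String readAll(InputStream in) {
        StringBuilder content = new StringBuilder();
        // try-with-resources closes the reader (and the stream) even if reading fails
        try (BufferedReader br = new BufferedReader(
                new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String line;
            while ((line = br.readLine()) != null) {
                content.append(line).append("\n");
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return content.toString();
    }

    public static void main(String[] args) {
        // An in-memory stream stands in for conn.getInputStream()
        InputStream fake = new ByteArrayInputStream(
                "<html>\n<body>Hello</body>\n</html>".getBytes(StandardCharsets.UTF_8));
        System.out.print(readAll(fake));
    }
}
```

Note that every line, including the last one, ends with a newline in the returned string, which matches the behaviour of the loop in the article's method.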

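/**
 * Write to the file
 * @param url URL that was scraped
 * @param value Scraped content; "Not Found" is written if it is null
 * @param outputFile File to which the line is appended
 * @throws IOException if the file cannot be opened or written
 */
private void writeToFile(String url, String value, String outputFile) throws IOException {
    // Open the file in append mode; try-with-resources closes the writer even if write fails
    try (BufferedWriter bw = new BufferedWriter(new FileWriter(new File(outputFile), true))) {
        if (value != null) {
            bw.write(url + "\t" + value + "\n");
        } else {
            bw.write(url + "\t" + "Not Found" + "\n");
        }
    }
}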


Ajitesh Kumar

