Categories: JavaWeb

Java – How to Scrape Content from a URL

This article represents take-away code sample that could be used to get or scrape content from a given URL. Those wanting to do a quick web scrape could use this piece of code. I shall be posting a series of blogs which would help one to create a web scraper using Java. The reason why I am hooked to Java web scraping is the need to get the data from web for data analysis. Please feel free to comment/suggest if I missed to mention one or more important points. Also, sorry for the typos.
Code Sample – Get Content from URL

Pay attention to some of the following aspects of fetching content from a given URL:

  • Create a URL object using actual URL string
  • Create a URLConnection object using that URL object created in above step
  • Set the configuration parameters. Key is to note the connection and read timeout. At times when scraping the websites, it helps with slow websites.
  • Create a BufferedReader for reading the data
  • Read line by line
 public String getContent(String urlstr) {
  URL url = null;
  StringBuilder contentb = new StringBuilder();
  try {
   // get URL content
   url = new URL(urlstr);
   // Create a URL Connection Object
   URLConnection conn = (HttpURLConnection) url.openConnection();
   // Set the configuration parameters
   // Note the readTimeOut set to 30 seconds.
   // This is quite important when you are planning to scrape URLs. 
   conn.setConnectTimeout(100000);
   conn.setReadTimeout(30000);
   conn.connect();
   // open the stream and put it into BufferedReader
   BufferedReader br = new BufferedReader(new InputStreamReader(
     conn.getInputStream()));
   String inputLine;
   while ((inputLine = br.readLine()) != null) {
    contentb.append(inputLine);
    contentb.append("\n");
   }
   br.close();
  } catch (MalformedURLException e) {
   e.printStackTrace();
  } catch (IOException e) {
   e.printStackTrace();
  }
  return contentb.toString();
 }

 

Ajitesh Kumar

I have been recently working in the area of Data analytics including Data Science and Machine Learning / Deep Learning. I am also passionate about different technologies including programming languages such as Java/JEE, Javascript, Python, R, Julia, etc, and technologies such as Blockchain, mobile computing, cloud-native technologies, application security, cloud computing platforms, big data, etc. For latest updates and blogs, follow us on Twitter. I would love to connect with you on Linkedin. Check out my latest book titled as First Principles Thinking: Building winning products using first principles thinking. Check out my other blog, Revive-n-Thrive.com

Recent Posts

Mean Squared Error vs Cross Entropy Loss Function

Last updated: 28th April, 2024 As a data scientist, understanding the nuances of various cost…

2 days ago

Cross Entropy Loss Explained with Python Examples

Last updated: 28th April, 2024 In this post, you will learn the concepts related to…

2 days ago

Logistic Regression in Machine Learning: Python Example

Last updated: 26th April, 2024 In this blog post, we will discuss the logistic regression…

4 days ago

MSE vs RMSE vs MAE vs MAPE vs R-Squared: When to Use?

Last updated: 22nd April, 2024 As data scientists, we navigate a sea of metrics to…

5 days ago

Gradient Descent in Machine Learning: Python Examples

Last updated: 22nd April, 2024 This post will teach you about the gradient descent algorithm…

1 week ago

Loss Function vs Cost Function vs Objective Function: Examples

Last updated: 19th April, 2024 Among the terminologies used in training machine learning models, the…

2 weeks ago