Java – How to Scrape Content from a URL

This article presents a take-away code sample that can be used to get, or scrape, content from a given URL. Anyone who wants to do a quick web scrape can use this piece of code. I will be posting a series of blogs that will help you create a web scraper using Java. I am hooked on Java web scraping because I need to get data from the web for data analysis. Please feel free to comment or suggest if I missed one or more important points. Also, sorry for the typos.

Code Sample – Get Content from URL

Pay attention to the following aspects of fetching content from a given URL:

  • Create a URL object from the actual URL string
  • Create a URLConnection object from the URL object created in the step above
  • Set the configuration parameters. The key ones are the connect and read timeouts. When scraping, they keep a slow or unresponsive website from blocking the program indefinitely.
  • Create a BufferedReader for reading the data
  • Read the content line by line
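 // Imports needed at the top of the enclosing class file
 import java.io.BufferedReader;
 import java.io.IOException;
 import java.io.InputStreamReader;
 import java.net.MalformedURLException;
 import java.net.URL;
 import java.net.URLConnection;
 import java.nio.charset.StandardCharsets;

 public String getContent(String urlstr) {
  StringBuilder contentb = new StringBuilder();
  try {
   // Create a URL object from the URL string
   URL url = new URL(urlstr);
   // Create a URLConnection object
   URLConnection conn = url.openConnection();
   // Set the configuration parameters (timeouts are in milliseconds).
   // Note the read timeout of 30 seconds; the connect timeout is 100 seconds.
   // This is quite important when you are planning to scrape URLs.
   conn.setConnectTimeout(100000);
   conn.setReadTimeout(30000);
   conn.connect();
   // Open the stream and wrap it in a BufferedReader.
   // UTF-8 is an assumption here; adjust it if the page uses another encoding.
   // try-with-resources closes the reader even if reading fails.
   try (BufferedReader br = new BufferedReader(new InputStreamReader(
     conn.getInputStream(), StandardCharsets.UTF_8))) {
    String inputLine;
    while ((inputLine = br.readLine()) != null) {
     contentb.append(inputLine);
     contentb.append("\n");
    }
   }
  } catch (MalformedURLException e) {
   e.printStackTrace();
  } catch (IOException e) {
   e.printStackTrace();
  }
  // Returns whatever was read; this is empty or partial if the request failed
  return contentb.toString();
 }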
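
If you want to reuse the steps above with different timeouts, one option is to split the read loop from the connection setup. The following is only a minimal sketch, not a definitive implementation: the class name UrlContentFetcher, the method names readAll and fetch, and the timeout parameters are hypothetical names chosen for illustration, and the code assumes the page is UTF-8 encoded. Unlike the method above, it rethrows failures as UncheckedIOException instead of printing the stack trace.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.net.URL;
import java.net.URLConnection;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class; the name and method signatures are illustrative only.
class UrlContentFetcher {

    // Reads everything from the reader line by line, ending each line with "\n".
    static String readAll(Reader reader) {
        StringBuilder content = new StringBuilder();
        try (BufferedReader br = new BufferedReader(reader)) {
            String line;
            while ((line = br.readLine()) != null) {
                content.append(line).append("\n");
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return content.toString();
    }

    // Opens the URL with caller-supplied timeouts (in milliseconds) and returns its content.
    // Assumption: the content is UTF-8 encoded.
    static String fetch(String urlStr, int connectTimeoutMs, int readTimeoutMs) {
        try {
            URLConnection conn = new URL(urlStr).openConnection();
            conn.setConnectTimeout(connectTimeoutMs);
            conn.setReadTimeout(readTimeoutMs);
            return readAll(new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

A call such as UrlContentFetcher.fetch("https://example.com", 10000, 30000) would then use a 10 second connect timeout and a 30 second read timeout. Keeping readAll separate also lets you try the read loop on any Reader, for example a StringReader, without touching the network.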

 

Ajitesh Kumar

