Java – How to Scrape Content from a URL

This article presents a take-away code sample that can be used to get, or scrape, content from a given URL. Anyone who wants to do a quick web scrape can use this piece of code. I will be posting a series of blogs that should help you create a web scraper using Java. I am hooked on Java web scraping because I need to get data from the web for data analysis. Please feel free to comment or suggest if I have missed one or more important points.
Code Sample – Get Content from URL

Pay attention to the following aspects of fetching content from a given URL:

  • Create a URL object using the actual URL string
  • Create a URLConnection object using the URL object created in the step above
  • Set the configuration parameters. The key ones are the connection timeout and the read timeout, which keep the scraper from hanging indefinitely on slow websites.
  • Create a BufferedReader for reading the data
  • Read line by line
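 // Imports needed at the top of the enclosing class file
 import java.io.BufferedReader;
 import java.io.IOException;
 import java.io.InputStreamReader;
 import java.net.MalformedURLException;
 import java.net.URL;
 import java.net.URLConnection;
 import java.nio.charset.StandardCharsets;

 public String getContent(String urlstr) {
  StringBuilder contentb = new StringBuilder();
  try {
   // Create a URL object from the URL string
   URL url = new URL(urlstr);
   // Create a URLConnection object
   URLConnection conn = url.openConnection();
   // Set the configuration parameters (values are in milliseconds).
   // The connect timeout is 100 seconds and the read timeout is 30 seconds.
   // This is quite important when you are planning to scrape URLs,
   // because a slow site would otherwise block the scraper.
   conn.setConnectTimeout(100000);
   conn.setReadTimeout(30000);
   conn.connect();
   // Open the stream and wrap it in a BufferedReader.
   // try-with-resources closes the reader even if reading fails.
   // UTF-8 is an assumption here; adjust it if the site uses another charset.
   try (BufferedReader br = new BufferedReader(new InputStreamReader(
     conn.getInputStream(), StandardCharsets.UTF_8))) {
    String inputLine;
    // Read line by line, normalizing line endings to "\n"
    while ((inputLine = br.readLine()) != null) {
     contentb.append(inputLine);
     contentb.append("\n");
    }
   }
  } catch (MalformedURLException e) {
   // The URL string could not be parsed
   e.printStackTrace();
  } catch (IOException e) {
   // Connection or read failure, including timeouts
   e.printStackTrace();
  }
  // Returns an empty string if anything went wrong above
  return contentb.toString();
 }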
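
The method above swallows errors and returns an empty string when something goes wrong. If you would rather see failures, a variation along the following lines may help. This is only a rough sketch, not a definitive implementation: the class name UrlContentFetcher, the method names readAll and fetch, the 10-second connect timeout, the User-Agent value and the UTF-8 charset are all my assumptions for illustration. It checks the HTTP status code before reading and lets the IOException reach the caller.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class; the names here are illustrative only.
class UrlContentFetcher {

    // Reads everything from the reader line by line,
    // normalizing line endings to "\n". Closes the reader when done.
    static String readAll(Reader reader) throws IOException {
        StringBuilder content = new StringBuilder();
        try (BufferedReader br = new BufferedReader(reader)) {
            String line;
            while ((line = br.readLine()) != null) {
                content.append(line).append('\n');
            }
        }
        return content.toString();
    }

    // Fetches the body of an http:// or https:// URL.
    // Throws IOException on a malformed URL, a timeout, or a non-200 status.
    // Other URL schemes (for example file:) would fail the cast below.
    static String fetch(String urlString) throws IOException {
        HttpURLConnection conn =
                (HttpURLConnection) new URL(urlString).openConnection();
        conn.setConnectTimeout(10_000); // assumed value: 10 seconds
        conn.setReadTimeout(30_000);    // 30 seconds, as in the sample above
        // Assumed header value; some sites reject requests without a User-Agent.
        conn.setRequestProperty("User-Agent",
                "Mozilla/5.0 (compatible; ExampleScraper/1.0)");
        try {
            int status = conn.getResponseCode();
            if (status != HttpURLConnection.HTTP_OK) {
                throw new IOException(
                        "Unexpected HTTP status " + status + " for " + urlString);
            }
            // UTF-8 is assumed; a real scraper might inspect Content-Type instead.
            return readAll(new InputStreamReader(
                    conn.getInputStream(), StandardCharsets.UTF_8));
        } finally {
            conn.disconnect();
        }
    }
}
```

With this sketch, a call such as UrlContentFetcher.fetch("https://example.com") either returns the page body as a string or throws an IOException that the caller can log or retry on. Keeping readAll separate also makes it possible to test the reading logic with a StringReader, without touching the network.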

 

Ajitesh Kumar

