Java – How to Scrape Content from a URL

This article presents a take-away code sample that can be used to get, or scrape, content from a given URL. Anyone wanting to do a quick web scrape can use this piece of code. I will be posting a series of blogs to help you create a web scraper using Java. The reason I am hooked on Java web scraping is the need to get data from the web for data analysis. Please feel free to comment or suggest if I have missed one or more important points. Also, sorry for any typos.
Code Sample – Get Content from URL

Pay attention to the following steps when fetching content from a given URL:

  • Create a URL object from the actual URL string
  • Create a URLConnection object from the URL object created in the step above
  • Set the configuration parameters. The key ones are the connect timeout and the read timeout, both given in milliseconds. When scraping, they keep a slow or unresponsive website from blocking the program indefinitely.
  • Create a BufferedReader for reading the data
  • Read the content line by line
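	// Imports needed by this method
	import java.io.BufferedReader;
	import java.io.IOException;
	import java.io.InputStreamReader;
	import java.net.MalformedURLException;
	import java.net.URL;
	import java.net.URLConnection;
	import java.nio.charset.StandardCharsets;

	public String getContent(String urlstr) {
		StringBuilder contentb = new StringBuilder();
		try {
			// Create the URL object from the URL string
			URL url = new URL(urlstr);
			// Create a URLConnection object; no cast is needed for these calls
			URLConnection conn = url.openConnection();
			// Set the configuration parameters (values are in milliseconds):
			// connect timeout of 100 seconds, read timeout of 30 seconds.
			// This is quite important when you are planning to scrape URLs.
			conn.setConnectTimeout(100000);
			conn.setReadTimeout(30000);
			conn.connect();
			// Open the stream and wrap it in a BufferedReader.
			// try-with-resources closes the reader even if reading fails midway.
			// UTF-8 is an assumption; use the page's declared charset if it differs.
			try (BufferedReader br = new BufferedReader(new InputStreamReader(
					conn.getInputStream(), StandardCharsets.UTF_8))) {
				String inputLine;
				while ((inputLine = br.readLine()) != null) {
					contentb.append(inputLine);
					contentb.append("\n");
				}
			}
		} catch (MalformedURLException e) {
			// The URL string could not be parsed
			e.printStackTrace();
		} catch (IOException e) {
			// Connection failure or timeout; whatever was read so far is returned
			e.printStackTrace();
		}
		return contentb.toString();
	}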
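
To see what the timeouts actually do, here is a minimal sketch that can be run without touching a real website. It is not part of the method above: the class name TimeoutDemo, the helpers readAll, fetch and addPage, the /fast and /slow paths, and the timeout values are all made up for illustration. It uses the JDK's built-in com.sun.net.httpserver.HttpServer as a stand-in for a slow site, assumes UTF-8 content, and lets exceptions reach the caller instead of printing them, so a timed-out page can be told apart from an empty one.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.InetSocketAddress;
import java.net.SocketTimeoutException;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class TimeoutDemo {

    // Reads the whole stream line by line, decoding it as UTF-8 (an assumption).
    static String readAll(InputStream in) throws IOException {
        StringBuilder content = new StringBuilder();
        try (BufferedReader br = new BufferedReader(
                new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String line;
            while ((line = br.readLine()) != null) {
                content.append(line).append("\n");
            }
        }
        return content.toString();
    }

    // Hypothetical variant of getContent for http/https URLs:
    // failures reach the caller as exceptions instead of being printed.
    static String fetch(String urlstr, int connectTimeoutMs, int readTimeoutMs)
            throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(urlstr).openConnection();
        conn.setConnectTimeout(connectTimeoutMs);
        conn.setReadTimeout(readTimeoutMs);
        try {
            int status = conn.getResponseCode(); // connects and reads the status line
            if (status != HttpURLConnection.HTTP_OK) {
                throw new IOException("HTTP " + status + " for " + urlstr);
            }
            return readAll(conn.getInputStream());
        } finally {
            conn.disconnect();
        }
    }

    // Adds a page to the local test server that waits delayMs before replying.
    static void addPage(HttpServer server, String path, String body, long delayMs) {
        server.createContext(path, exchange -> {
            try {
                Thread.sleep(delayMs);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            byte[] bytes = body.getBytes(StandardCharsets.UTF_8);
            exchange.sendResponseHeaders(200, bytes.length);
            try (OutputStream os = exchange.getResponseBody()) {
                os.write(bytes);
            }
        });
    }

    public static void main(String[] args) throws Exception {
        // Throwaway server on a free local port, standing in for a real website.
        HttpServer server = HttpServer.create(new InetSocketAddress("127.0.0.1", 0), 0);
        addPage(server, "/fast", "hello", 0);
        addPage(server, "/slow", "too late", 2000);
        server.start();
        String base = "http://127.0.0.1:" + server.getAddress().getPort();
        try {
            // Answers well within the 1 second read timeout.
            System.out.print(fetch(base + "/fast", 1000, 1000));
            try {
                // The page needs 2 seconds, but we only wait 500 ms for data.
                fetch(base + "/slow", 1000, 500);
            } catch (SocketTimeoutException e) {
                System.out.println("slow page timed out: " + e.getMessage());
            }
        } finally {
            server.stop(0);
        }
    }
}
```

When run, the fast page should be printed as-is and the slow page should end in a SocketTimeoutException rather than hanging. In a real scraper, that exception is the natural place to decide whether to retry the URL or skip it.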

 

Ajitesh Kumar