Python Scraper for GoogleNews, Twitter, Reddit & Arxiv

In this post, you will get Python code for scraping the latest news on any topic from Google News, Twitter, Reddit and Arxiv. This can be very useful for data scientists and machine learning enthusiasts who want to keep track of the latest developments in artificial intelligence. If you are doing research work, these pieces of code are a handy way to access the information quickly. The code in this post was written and run in a Google Colab notebook.

First, install the necessary Python libraries for GoogleNews, Twitter and Arxiv:

!pip install GoogleNews
!pip install arxiv
!pip install twitter
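
The `!pip` prefix only works inside a notebook. If you run the code elsewhere, it can help to check which packages are importable first. Here is a minimal sketch using only the standard library. The helper name `missing_packages` is made up for this illustration, and it assumes the import name of each package matches its pip name.

```python
import importlib.util

def missing_packages(names):
    """Return the names from `names` that cannot be imported in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]

# Packages installed in the step above
required = ["GoogleNews", "arxiv", "twitter"]
for name in missing_packages(required):
    print(f"Missing: {name} (install it with: pip install {name})")
```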

Python Code for mining GoogleNews

Here is the code you can use to get the latest news from Google News. Note the search method invoked on the GoogleNews instance, and the period parameter passed as 1d (1 day) when the instance is created; it restricts the results to news from the last N days. For further details, see my related post: Google News search Python API example

from GoogleNews import GoogleNews
# Language: lang as English
# Period: period as number, N, representing news from last N days
googlenews = GoogleNews(lang='en', period='1d')
# Search method takes parameter as search text
googlenews.search("machine learning")
# Returns a list of dictionaries, one per news item
results = googlenews.results(sort=True)
# Clear GoogleNews to do fresh search next time
googlenews.clear()
# Print result
print(results)
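
The results come back as a list of dictionaries, so they are easy to post-process. Here is a small sketch that removes duplicate articles and filters titles by keyword. It assumes each item carries 'title' and 'link' keys; the helper names and the sample data are made up for this illustration and do not come from the GoogleNews package.

```python
def dedupe_by_link(results):
    """Keep the first article seen for each link, preserving order."""
    seen = set()
    unique = []
    for item in results:
        link = item.get("link")
        if link is not None and link in seen:
            continue  # same article already kept
        seen.add(link)
        unique.append(item)
    return unique

def titles_containing(results, keyword):
    """Return the titles that contain `keyword`, case-insensitively."""
    keyword = keyword.lower()
    titles = (item.get("title", "") for item in results)
    return [title for title in titles if keyword in title.lower()]

# Hand-written sample standing in for googlenews.results()
sample = [
    {"title": "Machine Learning in practice", "link": "https://example.com/a"},
    {"title": "Machine Learning in practice", "link": "https://example.com/a"},
    {"title": "Deep learning for vision", "link": "https://example.com/b"},
]
unique = dedupe_by_link(sample)
print(titles_containing(unique, "machine"))
```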

Python Code for mining Twitter

Here is the code you can use for mining Twitter. Note the consumer key/secret and the OAuth access token/access token secret: you need to create a developer app on Twitter to obtain them. For further details, see one of my related posts: Mining Twitter data using Python. Pay attention to the search.tweets API call, which is passed the hashtags #machinelearning, #deeplearning and #datascience combined with the AND operator.

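import twitter
# Retweet threshold used in the loop below (example value; adjust as needed)
RETWEET_COUNT_THRESHOLD = 25
# Instantiate Twitter auth (replace these values with your own app's credentials)
CONSUMER_KEY = 'vXstCDSkF1Un1244406JmxABmK'
CONSUMER_SECRET = 'vM4PFAqksajjfhsakjfhsaJNMfZxzl2dQ16t4jucmCnrMKMCAO'
OAUTH_ACCESS_TOKEN = '87654312-giJ08B9oTRkajddhadkdoDsLA03w0MbfzweK6auN3'
OAUTH_ACCESS_TOKEN_SECRET = 'sjdahgOI0pyzB0yZXvfksdjjfh2B6MDD3kdjfhNigZDkRDza'
auth = twitter.oauth.OAuth(OAUTH_ACCESS_TOKEN, OAUTH_ACCESS_TOKEN_SECRET,
                           CONSUMER_KEY, CONSUMER_SECRET)
# Get Twitter object
twitter_api = twitter.Twitter(auth=auth)
# Search twitter by hashtag
tweets = twitter_api.search.tweets(q="#machinelearning AND #deeplearning AND #datascience", max_results=200)
# Print Tweets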
for status in tweets['statuses']:
  if status['retweet_count'] > RETWEET_COUNT_THRESHOLD:
    print('\n\n', status['user']['screen_name'], ":", status['text'], '\nTweet URL: ', status['retweeted_status']['entities']['urls'][0]['expanded_url'],
          '\nRetweet count: ', status['retweet_count'])
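
The loop above reads status['retweeted_status'] and the first entry of its URL list, so it raises an error for tweets that are not retweets or that carry no links. Here is a more tolerant sketch that also builds the hashtag query from a list. It works on plain dictionaries shaped like the search response used above; the function names and the sample statuses are made up for this illustration.

```python
def build_hashtag_query(hashtags, operator="AND"):
    """Join hashtags into one query string, adding a leading '#' where missing."""
    tags = [tag if tag.startswith("#") else "#" + tag for tag in hashtags]
    return f" {operator} ".join(tags)

def first_expanded_url(status):
    """Return the first expanded URL of the retweeted tweet, or None if absent."""
    urls = status.get("retweeted_status", {}).get("entities", {}).get("urls", [])
    return urls[0].get("expanded_url") if urls else None

def popular_tweets(statuses, threshold):
    """Yield (screen_name, text, url, retweet_count) for tweets above the threshold."""
    for status in statuses:
        if status.get("retweet_count", 0) > threshold:
            yield (status["user"]["screen_name"], status["text"],
                   first_expanded_url(status), status["retweet_count"])

# Hand-written sample standing in for tweets['statuses']
sample_statuses = [
    {"user": {"screen_name": "alice"}, "text": "RT great paper", "retweet_count": 120,
     "retweeted_status": {"entities": {"urls": [{"expanded_url": "https://example.com/paper"}]}}},
    {"user": {"screen_name": "bob"}, "text": "original tweet, no links", "retweet_count": 80},
    {"user": {"screen_name": "carol"}, "text": "quiet tweet", "retweet_count": 2},
]

print(build_hashtag_query(["machinelearning", "deeplearning", "datascience"]))
for name, text, url, count in popular_tweets(sample_statuses, threshold=50):
    print(name, ":", text, "| URL:", url, "| Retweets:", count)
```

Returning None for a missing URL keeps the loop going instead of stopping at the first tweet without a link.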

Python Code for mining Arxiv

Here is the code you can use for mining Arxiv. Note the Search object, which represents the query and the search criteria. The AND operator combines multiple keywords in one search, and double quotes mark multi-word phrases. For further details, see one of my related posts: Scraping Arxiv using Python code.

import arxiv
# Maximum number of papers to fetch (example value; adjust as needed)
MAX_RESULTS_COUNT = 10
# Create Arxiv Search object
search = arxiv.Search(
  query = "\"python\" AND \"machine learning\"",
  max_results = MAX_RESULTS_COUNT,
  sort_by = arxiv.SortCriterion.SubmittedDate,
  sort_order = arxiv.SortOrder.Descending
)
# Print search results
for result in search.results():
  print('Title: ', result.title, '\nDate: ',result.published , '\nId: ', result.entry_id, '\nSummary: ',
        result.summary ,'\nURL: ', result.pdf_url, '\n\n')
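
Writing the escaped quotes by hand gets error-prone as the keyword list grows. Here is a small sketch that builds the same kind of query string from a list of phrases. The helper name `build_arxiv_query` is made up for this illustration and is not part of the arxiv package.

```python
def build_arxiv_query(phrases, operator="AND"):
    """Wrap each non-empty phrase in double quotes and join them with the operator."""
    cleaned = (phrase.strip().strip('"') for phrase in phrases)
    quoted = ['"' + phrase + '"' for phrase in cleaned if phrase]
    return f" {operator} ".join(quoted)

query = build_arxiv_query(["python", "machine learning"])
print(query)  # "python" AND "machine learning"
```

The resulting string can be passed as the query argument of the Search object shown above.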

Python Code for mining Reddit

Here is the code you can use for mining Reddit. You need to register with Reddit and create an app to get the client id (the personal use script code) and the secret token. For further details, see one of my related posts: Mining Reddit using Python code.

import requests
# note that CLIENT_ID refers to 'personal use script' and SECRET_TOKEN to 'token'
auth = requests.auth.HTTPBasicAuth('kasjdhsJN-bkskjhVR3e1w', 'SX98I9U9WQasjdhsakdjjdAAAEgkIKiQ')
# here we pass our login method (password), username, and password
data = {'grant_type': 'password',
        'username': 'vitalflux',
        'password': 'vitalflux123'}
# setup our header info, which gives reddit a brief description of our app
headers = {'User-Agent': 'vitalflux-pybot/0.0.1'}
# send our request for an OAuth token; the first argument is the URL of Reddit's access-token endpoint
res = requests.post('',
                    auth=auth, data=data, headers=headers)
# convert response to JSON and pull access_token value
TOKEN = res.json()['access_token']
# add authorization to our headers dictionary
headers = {**headers, **{'Authorization': f"bearer {TOKEN}"}}

# Print the subreddit popular posts
params = {'limit' : 10}
# the first argument is the URL of the subreddit listing to fetch
res = requests.get("", headers=headers, params=params)
for post in res.json()['data']['children']:
  print(
        '\nTitle: ', post['data']['title'],
        '\nUps: ', post['data']['ups'], ' -- Upvote ratio: ', post['data']['upvote_ratio'],
        '\nText: ', post['data']['selftext'])
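
If you want to keep the posts for later processing instead of only printing them, the listing can be flattened into a simple list. Here is a minimal sketch that works on a dictionary shaped like the res.json() response used above. The function name `summarize_posts`, the text_limit parameter and the sample listing are made up for this illustration.

```python
def summarize_posts(listing, text_limit=200):
    """Flatten a Reddit listing into small dicts, most upvoted post first."""
    posts = []
    for child in listing.get("data", {}).get("children", []):
        data = child.get("data", {})
        text = data.get("selftext", "")
        if len(text) > text_limit:
            text = text[:text_limit] + "..."  # shorten long self-posts
        posts.append({
            "title": data.get("title", ""),
            "ups": data.get("ups", 0),
            "upvote_ratio": data.get("upvote_ratio"),
            "text": text,
        })
    return sorted(posts, key=lambda post: post["ups"], reverse=True)

# Hand-written sample standing in for res.json()
sample_listing = {
    "data": {"children": [
        {"data": {"title": "Post A", "ups": 12, "upvote_ratio": 0.9, "selftext": "short"}},
        {"data": {"title": "Post B", "ups": 40, "upvote_ratio": 0.97, "selftext": "x" * 300}},
    ]}
}
for post in summarize_posts(sample_listing):
    print(post["title"], "| Ups:", post["ups"], "| Ratio:", post["upvote_ratio"])
```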

Ajitesh Kumar

I have recently been working in the area of data analytics, including data science and machine learning / deep learning. I am also passionate about different technologies, including programming languages such as Java/JEE, JavaScript, Python, R and Julia, and technologies such as blockchain, mobile computing, cloud-native technologies, application security, cloud computing platforms and big data. For the latest updates and blogs, follow us on Twitter. I would love to connect with you on LinkedIn. Check out my latest book, First Principles Thinking: Building winning products using first principles thinking.
Posted in Data Science, Python.