
Python Scraper Code to Search Arxiv Latest Papers

In this post, you will learn how to use Python to search Arxiv for the latest and most relevant machine learning and data science research papers. If you want a faster way to look up Arxiv papers without visiting the Arxiv website, this piece of code is worth keeping in your kitty. You can further automate the search to get notified based on some logic. Without further ado, let’s get started.

Step 1: Install Python Arxiv Library

As a first step, install the Python Arxiv library by running the command below in a terminal. In a Jupyter notebook or Google Colab instance, prefix the command with an exclamation mark (!pip install arxiv):

pip install arxiv
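
Because the library's interface has changed between major releases, it can help to confirm which version pip installed. A minimal sketch using only the standard library (the helper name installed_version is made up for this illustration):

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package_name):
    """Return the installed version string of a package, or None if it is missing."""
    try:
        return version(package_name)
    except PackageNotFoundError:
        return None


# Prints a version string if the arxiv package is installed, otherwise None.
print(installed_version("arxiv"))
```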

Step 2: Execute the code to search the papers

Once the Arxiv library is set up, the next step is to define a search that retrieves papers matching your keywords. Here is the code.

import arxiv

search = arxiv.Search(
  query = "automl",
  max_results = 3,
  sort_by = arxiv.SortCriterion.SubmittedDate,
  sort_order = arxiv.SortOrder.Descending
)

Pay attention to the following in the above Python code:

  • Parameter “query” is used to assign the query word (text format).
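  • Parameter “max_results” is used to assign the maximum number of results to return (numeric). The default when it is not set depends on the library version (the Arxiv API itself defaults to 10, while recent library releases return every available result), so it is safest to set it explicitly. The API limit is 300,000 results per query.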
  • Parameter “sort_by” is used to specify the criterion used to sort the output. The value can be arxiv.SortCriterion.SubmittedDate, arxiv.SortCriterion.LastUpdatedDate, or arxiv.SortCriterion.Relevance. When set to SubmittedDate with descending order, the latest papers come first.
  • Parameter “sort_order” is used to specify the order in which results will be sorted. The value can be arxiv.SortOrder.Ascending or arxiv.SortOrder.Descending.
  • There is also an additional parameter called id_list, which can be used in place of query when you want to get a specific set of papers. Pass a list of Arxiv IDs, for example id_list = ["2107.10495v1"].
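
The id_list parameter expects short IDs such as 2107.10495v1, whereas result.entry_id (printed in Step 3) is a full URL. The sketch below, which assumes the URL has the form arxiv.org/abs/<id>, converts one into the other. The helper names are illustrative and not part of the library; result objects also offer a get_short_id() method for the same purpose.

```python
import re


def short_id(entry_id):
    """Extract the short Arxiv ID from an entry URL (the text after '/abs/')."""
    return entry_id.split("/abs/")[-1]


def strip_version(arxiv_id):
    """Drop a trailing version suffix such as 'v1' from an Arxiv ID."""
    return re.sub(r"v\d+$", "", arxiv_id)


ids = [short_id("http://arxiv.org/abs/2107.10495v1")]
print(ids)                    # ['2107.10495v1']
print(strip_version(ids[0]))  # 2107.10495
```

The resulting list can be passed as id_list. An ID without a version suffix is generally resolved by the Arxiv API to the latest version of the paper.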

You can use the following query formats to make a search more specific and focused when you have multiple keywords:

  • Use double quotes to match an exact phrase, such as query = "\"logistic regression\""
  • Use the AND and OR operators. For example, to search for applications of random forest in the insurance domain, use query = "insurance AND \"random forest\""
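
Escaping the quotes by hand gets error-prone as the number of keywords grows. As a rough sketch, a small helper (build_query is a hypothetical name, not part of the library) can assemble the query string from a list of terms:

```python
def build_query(terms, operator="AND"):
    """Combine search terms into a single query string.

    Terms containing spaces are wrapped in double quotes so that they are
    matched as an exact phrase.
    """
    if operator not in ("AND", "OR"):
        raise ValueError("operator must be 'AND' or 'OR'")
    parts = []
    for term in terms:
        term = term.strip()
        parts.append(f'"{term}"' if " " in term else term)
    return f" {operator} ".join(parts)


print(build_query(["insurance", "random forest"]))
# insurance AND "random forest"
```

The returned string can be passed directly as the query parameter.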

Step 3: Print the search results

Finally, you can iterate over the results and print them using code such as the following.

for result in search.results():
  print('Title: ', result.title, '\nDate: ',result.published , '\nId: ', result.entry_id, '\nSummary: ',result.summary ,'\nURL: ', result.pdf_url, '\n\n')

This produces output such as the following:

Fig 1. Print search results – Arxiv python library
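
In recent releases of the library (2.x), calling search.results() appears to emit a deprecation warning, and results are instead fetched through a client object, as in arxiv.Client().results(search). A minimal sketch under that assumption (the helper name fetch_titles is made up for this illustration):

```python
def fetch_titles(search, client):
    """Return (title, pdf_url) pairs for a search, fetched through a client.

    `client` is assumed to be an arxiv.Client() instance; any object exposing
    a compatible results(search) method works as well.
    """
    return [(result.title, result.pdf_url) for result in client.results(search)]
```

With the search object from Step 2, this could be called as fetch_titles(search, arxiv.Client()).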

You can print any of the following attributes of the result object:

  • result.entry_id
  • result.published
  • result.title
  • result.summary
  • result.authors
  • result.pdf_url
  • result.primary_category
  • result.categories
  • result.links
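
Some of these attributes are not plain strings. The sketch below assumes that result.authors is a list of objects with a name attribute, result.categories is a list of strings, and result.published is a datetime, which is how the library appears to expose them. The function name format_result is illustrative.

```python
def format_result(result):
    """Build a short, readable text block from a result's attributes."""
    authors = ", ".join(author.name for author in result.authors)
    categories = ", ".join(result.categories)
    return (
        f"Title: {result.title}\n"
        f"Authors: {authors}\n"
        f"Primary category: {result.primary_category}\n"
        f"All categories: {categories}\n"
        f"Published: {result.published:%Y-%m-%d}\n"
        f"PDF: {result.pdf_url}"
    )
```

Inside the loop from Step 3, print(format_result(result)) would then replace the long print call.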

Putting the code together

Here is the complete Python code that you can copy to get started right away. The code below searches for papers containing the keywords healthcare and machine learning.

import arxiv 

search = arxiv.Search(
  query = "healthcare AND \"machine learning\"",
  max_results = 3,
  sort_by = arxiv.SortCriterion.SubmittedDate,
  sort_order = arxiv.SortOrder.Descending
)

for result in search.results():
  print('Title: ', result.title, '\nDate: ',result.published , '\nId: ', result.entry_id, '\nSummary: ',
        result.summary ,'\nURL: ', result.pdf_url, '\n\n')
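
To automate the search and get notified only about new papers, one possible approach is to filter the results by publication date before sending them anywhere. A minimal sketch, assuming each result carries a timezone-aware published datetime (the function name published_within is hypothetical):

```python
from datetime import datetime, timedelta, timezone


def published_within(results, days, now=None):
    """Yield only the results published within the last `days` days.

    Assumes each result has a timezone-aware `published` datetime.
    """
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=days)
    for result in results:
        if result.published >= cutoff:
            yield result
```

For example, published_within(search.results(), days=7) would keep only papers from the past week, which a scheduled job could then email or post to a chat channel.
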
Ajitesh Kumar

