If you love Python, you may be interested in doing information retrieval (IR) in Python.
So what Python tools are out there for information retrieval?

Today I would like to introduce the ones that, I think, are the most widely used and best known.

Note (2021-08-14): for newcomers in academia, I think newer tools such as Pyserini and Capreolus are worth mentioning. Pyserini is built on top of Lucene, which is widely used in industry (it underlies Elasticsearch and Solr), and it mainly focuses on reproducing academic experiments. Capreolus relies on Pyserini and mainly focuses on IR experiments with deep-learning-based approaches.

1. Pyserini

As its README puts it:
Pyserini is a Python toolkit for reproducible information retrieval research with sparse and dense representations. Retrieval using sparse representations is provided via integration with our group's Anserini IR toolkit, which is built on Lucene. Retrieval using dense representations is provided via integration with Facebook's Faiss library.
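
To make the sparse/dense distinction a bit more concrete, here is a toy sketch in plain Python. It does not use Pyserini, Anserini, or Faiss; the function names and the tiny hand-written vectors are assumptions made up for illustration. The idea is that a sparse representation stores weights only for the terms that actually occur, while a dense representation is a fixed-length vector, typically produced by a neural encoder.

```python
# Toy illustration of sparse vs. dense scoring (not Pyserini's API).
from collections import Counter


def sparse_vector(text):
    """Bag-of-words term counts: only terms that occur get an entry."""
    return Counter(text.lower().split())


def sparse_score(query_vec, doc_vec):
    """Dot product over the terms the two sparse vectors share."""
    return sum(
        weight * doc_vec[term]
        for term, weight in query_vec.items()
        if term in doc_vec
    )


def dense_score(query_vec, doc_vec):
    """Inner product of two equal-length dense vectors."""
    return sum(q * d for q, d in zip(query_vec, doc_vec))


# Sparse: vectors are built directly from the text.
q_sparse = sparse_vector("python search")
d_sparse = sparse_vector("Whoosh is a search library written in Python")

# Dense: hand-written 3-dimensional vectors standing in for encoder output.
q_dense = [0.9, 0.1, 0.0]
docs_dense = {"doc_a": [0.8, 0.2, 0.1], "doc_b": [0.0, 0.3, 0.9]}

# Brute-force search: pick the document with the largest inner product.
best_dense = max(docs_dense, key=lambda name: dense_score(q_dense, docs_dense[name]))
```

In a real system the sparse side would use an inverted index and a weighting scheme rather than raw counts, and the dense side would use a vector index instead of a brute-force loop.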

Pyserini is primarily designed to provide effective, reproducible, and easy-to-use first-stage retrieval in a multi-stage ranking architecture. Our toolkit is self-contained as a standard Python package and comes with queries, relevance judgments, pre-built indexes, and evaluation scripts for many commonly used IR test collections.

With Pyserini, it's easy to reproduce runs on a number of standard IR test collections! A low-effort way to try things out is to look at our online notebooks, which will allow you to get started with just a few clicks.
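
The phrase "first-stage retrieval in a multi-stage ranking architecture" may be easier to picture with a small example. The sketch below is conceptual and written in plain Python; it is not Pyserini's API, and the names first_stage and rerank, the toy corpus, and both scoring rules are assumptions for illustration. A cheap first stage narrows the corpus down to a few candidates, and a second, costlier stage reorders only those candidates.

```python
# Conceptual two-stage ranking pipeline (hypothetical names, not Pyserini's API).

def first_stage(query, corpus, k):
    """Cheap candidate retrieval: rank by the number of shared query terms."""
    query_terms = set(query.lower().split())
    scored = []
    for doc_id, text in corpus.items():
        overlap = len(query_terms & set(text.lower().split()))
        if overlap > 0:
            scored.append((overlap, doc_id))
    # Highest overlap first; ties broken by doc_id for a stable order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for _, doc_id in scored[:k]]


def rerank(query, candidates, corpus):
    """Stand-in for a costlier reranker: fraction of tokens that are query terms."""
    query_terms = set(query.lower().split())

    def density(doc_id):
        tokens = corpus[doc_id].lower().split()
        return sum(token in query_terms for token in tokens) / len(tokens)

    return sorted(candidates, key=density, reverse=True)


corpus = {
    "d1": "lucene is a search library written in java",
    "d2": "python bindings make the java search library usable from python code",
    "d3": "cooking recipes for busy weeknights",
    "d4": "python search",
}

candidates = first_stage("python search", corpus, k=3)
ranking = rerank("python search", candidates, corpus)
```

Here the first stage drops d3 entirely, and the reranker then moves d4 ahead of d2. In practice the second stage would usually be a learned model, which is why it only sees the short candidate list.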

2. Capreolus

A toolkit for end-to-end neural ad hoc retrieval. 
Capreolus is a toolkit for conducting end-to-end ad hoc retrieval experiments. Capreolus provides fine control over the entire experimental pipeline through the use of interchangeable and configurable modules.

3. PyLucene

Another one I should mention is PyLucene, which is a Python extension for accessing Java Lucene.

As you may already know, Lucene is very well known and powerful in the IR community, especially since version 4.x added many useful features. PyLucene, of course, inherits them.

One problem with PyLucene is that it is not a pure-Python package; it is a Python wrapper around Java Lucene. This may cause problems for people who want a pure-Python environment.

Here is more from its official site:
PyLucene is not a Lucene port but a Python wrapper around Java Lucene. PyLucene embeds a Java VM with Lucene into a Python process. The PyLucene Python extension, a Python module called lucene, is machine-generated by JCC.
PyLucene is built with JCC, a C++ code generator that makes it possible to call into Java classes from Python via Java's Native Invocation Interface (JNI). Sources for JCC are included with the PyLucene sources.
See the official PyLucene site for more information and documentation.

4. Whoosh

As its homepage claims, Whoosh is a fast, pure-Python full-text indexing, search, and spell-checking library.

About Whoosh

Whoosh is a fast, featureful full-text indexing and searching library implemented in pure Python. Programmers can use it to easily add search functionality to their applications and websites. Every part of how Whoosh works can be extended or replaced to meet your needs exactly.
Some of Whoosh's features include:
  • Pythonic API.
  • Pure-Python. No compilation or binary packages needed, no mysterious crashes.
  • Fielded indexing and search.
  • Fast indexing and retrieval -- faster than any other pure-Python, scoring, full-text search solution I know of.
  • Pluggable scoring algorithm (including BM25F), text analysis, storage, posting format, etc.
  • Powerful query language.
  • Pure Python spell-checker (as far as I know, the only one).
Whoosh might be useful in the following circumstances:
  • Anywhere a pure-Python solution is desirable to avoid having to build/compile native libraries (or force users to build/compile them).
  • As a research platform (at least for programmers that find Python easier to read and work with than Java ;)
  • When an easy-to-use Pythonic interface is more important to you than raw speed.  
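
The feature list above mentions BM25F as one of the pluggable scoring algorithms. As a rough sketch of what such a scorer computes, here is plain single-field BM25 in pure Python. This is not Whoosh's implementation and it is not BM25F, which additionally weights fields; the function name bm25_scores, the whitespace tokenization, and the defaults k1=1.2 and b=0.75 are assumptions chosen for illustration.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score every document in `docs` (a list of strings) against `query`."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

    scores = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        # Longer-than-average documents are penalized, shorter ones boosted.
        length_norm = 1 - b + b * len(tokens) / avg_len
        score = 0.0
        for term in query.lower().split():
            if term not in term_freq:
                continue
            # Smoothed IDF variant that stays non-negative.
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            tf = term_freq[term]
            score += idf * tf * (k1 + 1) / (tf + k1 * length_norm)
        scores.append(score)
    return scores


docs = [
    "whoosh is a pure python search library",
    "lucene is a java search library",
    "a recipe for apple pie",
]
scores = bm25_scores("python search", docs)
```

The first document matches both query terms and scores highest, the second matches only "search", and the third matches nothing and scores zero. A pluggable design means a function like this could be swapped for another weighting scheme without touching the index.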


