If you love Python, you may be interested in doing information retrieval (IR) in Python.
So what Python tools are out there for information retrieval?

Today I would like to introduce two that, I think, are the most widely used and best known.

1. PyLucene

The first one I should mention is PyLucene, which is a Python extension for accessing Java Lucene.

As you may already know, Lucene is very well known and powerful in the IR community, especially since version 4.x added many useful features. PyLucene, of course, inherits them.

One problem with PyLucene is that it is not a pure-Python package; it is a Python wrapper around Java Lucene. This may cause problems for people who want a pure-Python environment.

Here is more of the introduction from its official site:
PyLucene is not a Lucene port but a Python wrapper around Java Lucene. PyLucene embeds a Java VM with Lucene into a Python process. The PyLucene Python extension, a Python module called lucene, is machine-generated by JCC.
PyLucene is built with JCC, a C++ code generator that makes it possible to call into Java classes from Python via Java's Native Invocation Interface (JNI). Sources for JCC are included with the PyLucene sources.
See the official PyLucene site for more information and documentation.

2. Whoosh

As its homepage claims, Whoosh is a fast, pure-Python full-text indexing, search, and spell-checking library.

About Whoosh

Whoosh is a fast, featureful full-text indexing and searching library implemented in pure Python. Programmers can use it to easily add search functionality to their applications and websites. Every part of how Whoosh works can be extended or replaced to meet your needs exactly.
Some of Whoosh's features include:
  • Pythonic API.
  • Pure-Python. No compilation or binary packages needed, no mysterious crashes.
  • Fielded indexing and search.
  • Fast indexing and retrieval -- faster than any other pure-Python, scoring, full-text search solution I know of.
  • Pluggable scoring algorithm (including BM25F), text analysis, storage, posting format, etc.
  • Powerful query language.
  • Pure Python spell-checker (as far as I know, the only one).
Whoosh might be useful in the following circumstances:
  • Anywhere a pure-Python solution is desirable to avoid having to build/compile native libraries (or force users to build/compile them).
  • As a research platform (at least for programmers that find Python easier to read and work with than Java ;)
  • When an easy-to-use Pythonic interface is more important to you than raw speed.  

Other links for further reference: 

Information retrieval software that can be used with Python:

