If you love Python, you may be interested in doing information retrieval (IR) in Python.
So what Python tools are out there for information retrieval?
Today I would like to introduce two that, I think, are the most widely used and best known.
As you may already know, Lucene is super famous and powerful in the IR community, especially with version 4.x adding many useful features. PyLucene, of course, inherits them.
One problem with PyLucene is that it is not a pure-Python package; it is a Python wrapper around Java Lucene. This may cause problems for people who want a pure-Python environment.
For a fuller introduction, see its official site:
Whoosh, as its homepage claims, is a fast, pure-Python full-text indexing, search, and spell-checking library. See what spell checking is here.
Its homepage lists the following features:
- Pythonic API.
- Pure-Python. No compilation or binary packages needed, no mysterious crashes.
- Fielded indexing and search.
- Fast indexing and retrieval -- faster than any other pure-Python, scoring, full-text search solution I know of.
- Pluggable scoring algorithm (including BM25F), text analysis, storage, posting format, etc.
- Powerful query language.
- Pure Python spell-checker (as far as I know, the only one).
Whoosh might be useful in the following circumstances:
- Anywhere a pure-Python solution is desirable to avoid having to build/compile native libraries (or force users to build/compile them).
- As a research platform (at least for programmers that find Python easier to read and work with than Java ;)
- When an easy-to-use Pythonic interface is more important to you than raw speed.