If you love Python, you may be interested in doing information retrieval (IR) with it.
So what Python tools are out there for information retrieval?
Today I would like to introduce a few that, I think, are the most widely used and best known.
1. Pyserini
In the words of its README, Pyserini is a Python toolkit for reproducible information retrieval research with sparse and dense representations. Retrieval using sparse representations is provided via integration with the Anserini IR toolkit (from the same research group), which is built on Lucene. Retrieval using dense representations is provided via integration with Facebook's Faiss library.
Pyserini is primarily designed to provide effective, reproducible, and easy-to-use first-stage retrieval in a multi-stage ranking architecture. The toolkit is self-contained as a standard Python package and comes with queries, relevance judgments, pre-built indexes, and evaluation scripts for many commonly used IR test collections.
With Pyserini, it's easy to reproduce runs on a number of standard IR test collections. A low-effort way to try things out is to look at the project's online notebooks, which let you get started with just a few clicks.
2. Capreolus
3. PyLucene
Another one I should mention is PyLucene, which is a Python extension for accessing Java Lucene. As you may already know, Lucene is very well known and powerful in the IR community, especially since version 4.x added many useful features. PyLucene, of course, inherits them.
One problem with PyLucene is that it is not a pure-Python package; it is a Python wrapper around Java Lucene. This may cause problems for people who want a pure-Python environment.
For a fuller introduction, see its official site.
4. Whoosh
As its homepage puts it, Whoosh is a fast, pure-Python full-text indexing, search, and spell-checking library.
About Whoosh
- Pythonic API.
- Pure-Python. No compilation or binary packages needed, no mysterious crashes.
- Fielded indexing and search.
- Fast indexing and retrieval -- faster than any other pure-Python, scoring, full-text search solution I know of.
- Pluggable scoring algorithm (including BM25F), text analysis, storage, posting format, etc.
- Powerful query language.
- Pure Python spell-checker (as far as I know, the only one).
Whoosh might be useful in the following circumstances:
- Anywhere a pure-Python solution is desirable to avoid having to build/compile native libraries (or force users to build/compile them).
- As a research platform (at least for programmers that find Python easier to read and work with than Java ;)
- When an easy-to-use Pythonic interface is more important to you than raw speed.
Other links for further reference:
- GrassyKnoll: a search engine in Python
- TF IDF: a basic TF-IDF module on Google Code
- IRLib: Information Retrieval Library (in Python)