3 HTML text extractors in Python

python html to text

I'd like to introduce a few HTML text extractors written in Python.
I have tested all of them. Each one has its own advantages and disadvantages.

1. python-readability

This is a python port of a ruby port of arc90's readability project

http://lab.arc90.com/experiments/readability/

In few words,
Given a html document, it pulls out the main body text and cleans it up.
It also can clean up title based on latest readability.js code.

Based on:
 - Latest readability.js ( https://github.com/MHordecki/readability-redux/blob/master/readability/readability.js )
 - Ruby port by starrhorne and iterationlabs
 - Python port by gfxmonk ( https://github.com/gfxmonk/python-readability , based on BeautifulSoup )
 - Decruft effort to move to lxml ( http://www.minvolai.com/blog/decruft-arc90s-readability-in-python/ )
 - "BR to P" fix from readability.js which improves quality for smaller texts.
 - Github users contributions.

Download: https://github.com/buriy/python-readability

2. python-boilerpipe

A python wrapper for Boilerpipe, an excellent Java library for boilerplate removal and fulltext extraction from HTML pages.

Configuration

Dependencies: jpype, chardet

The boilerpipe jar files will get fetched and included automatically when building the package.

Usage

Be sure to have set JAVA_HOME properly since jpype depends on this setting.

The constructor takes a keyword argment extractor, being one of the available boilerpipe extractor types:

DefaultExtractor
ArticleExtractor
ArticleSentencesExtractor
KeepEverythingExtractor
KeepEverythingWithMinKWordsExtractor
LargestContentExtractor
NumWordsRulesExtractor
CanolaExtractor

If no extractor is passed the DefaultExtractor will be used by default. Additional keyword arguments are either html for HTML text or url.

from boilerpipe.extract import Extractor
extractor = Extractor(extractor='ArticleExtractor', url=your_url)

Then, to extract relevant content:

extracted_text = extractor.getText()

extracted_html = extractor.getHTML()

Download: https://github.com/misja/python-boilerpipe

3. python-goose

Goose was originally an article extractor written in Java that has most recently (aug2011) converted to a scala project by Gravity.com

This is a complete rewrite in python. The aim of the software is to take any news article or article type web page and not only extract what is the main body of the article but also all meta data and most probable image candidate.

Goose will try to extract the following information:

Main text of an article
Main image of article
Any Youtube/Vimeo movies embedded in article (TODO)
Meta Description
Meta tags

Originally, Goose was open sourced by Gravity.com in 2011

Lead Programmer: Jim Plush (Gravity.com)
Contributers: Robbie Coleman (Gravity.com)

The python version was rewrite by:

Xavier Grangier (Recrutae.com)

For more information, please go to its github homepage: https://github.com/xgdlm/python-goose