Python HTML to Text

I'd like to introduce a few HTML text extractors written in Python.
I have tested all of them. Each one has its own advantages and disadvantages.

1.  python-readability


This is a Python port of a Ruby port of arc90's Readability project:

http://lab.arc90.com/experiments/readability/

In short, given an HTML document, it pulls out the main body text and cleans it up.
It can also clean up the title, based on the latest readability.js code.

Based on:
 - Latest readability.js ( https://github.com/MHordecki/readability-redux/blob/master/readability/readability.js )
 - Ruby port by starrhorne and iterationlabs
 - Python port by gfxmonk ( https://github.com/gfxmonk/python-readability , based on BeautifulSoup )
 - Decruft effort to move to lxml ( http://www.minvolai.com/blog/decruft-arc90s-readability-in-python/ )
 - "BR to P" fix from readability.js which improves quality for smaller texts.
 - Contributions from GitHub users.

Download: https://github.com/buriy/python-readability
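
As a rough illustration of going from a page to plain text with this port, here is a minimal sketch. It assumes the package exposes a Document class with summary() (the cleaned main-body HTML) and short_title(), which should be checked against the installed version. The helpers html_fragment_to_text and extract_main_text are hypothetical names, and the tag-stripping step uses only the standard library's html.parser, not anything from readability itself.

```python
from html.parser import HTMLParser

# Tags treated as line breaks when flattening HTML to text (an arbitrary choice).
BLOCK_TAGS = {"p", "div", "br", "li", "tr", "h1", "h2", "h3", "h4", "h5", "h6"}


class TextCollector(HTMLParser):
    """Collects text nodes and inserts newlines around block-level tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self.parts.append("\n")

    def handle_data(self, data):
        self.parts.append(data)


def html_fragment_to_text(fragment):
    """Hypothetical helper: strip tags and collapse whitespace line by line."""
    collector = TextCollector()
    collector.feed(fragment)
    collector.close()
    lines = (" ".join(line.split()) for line in "".join(collector.parts).splitlines())
    return "\n".join(line for line in lines if line)


def extract_main_text(html):
    """Hypothetical helper: return (title, plain text) for an HTML page."""
    # Imported lazily so the helper above works without the package installed.
    # Assumption: Document, summary() and short_title() exist as described.
    from readability.readability import Document

    doc = Document(html)
    return doc.short_title(), html_fragment_to_text(doc.summary())


sample = "<div><p>Hello <b>world</b></p><p>Bye</p></div>"
plain = html_fragment_to_text(sample)  # "Hello world\nBye"
```

Calling extract_main_text(page_html) on a downloaded page would then give the cleaned title and the body as plain text, provided the assumptions above hold.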


2.  python-boilerpipe


A Python wrapper for Boilerpipe, an excellent Java library for boilerplate removal and full-text extraction from HTML pages.

Configuration

Dependencies: jpype, chardet
The boilerpipe jar files will get fetched and included automatically when building the package.

Usage

Make sure JAVA_HOME is set properly, since jpype depends on it.
The constructor takes a keyword argument, extractor, which must be one of the available boilerpipe extractor types:
  • DefaultExtractor
  • ArticleExtractor
  • ArticleSentencesExtractor
  • KeepEverythingExtractor
  • KeepEverythingWithMinKWordsExtractor
  • LargestContentExtractor
  • NumWordsRulesExtractor
  • CanolaExtractor
If no extractor is passed, the DefaultExtractor is used. The input is given through one further keyword argument: either html (a string of HTML) or url.

    from boilerpipe.extract import Extractor
    extractor = Extractor(extractor='ArticleExtractor', url=your_url)

Then, to extract the relevant content:

    extracted_text = extractor.getText()
    extracted_html = extractor.getHTML()
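
To make the constructor rules above concrete, here is a small sketch that checks the arguments before handing them to the wrapper. The names make_extractor_kwargs and extract_text are hypothetical, and requiring exactly one of html or url is a choice made for this example, not something the library is known to enforce.

```python
# Extractor type names as listed above.
VALID_EXTRACTORS = {
    "DefaultExtractor",
    "ArticleExtractor",
    "ArticleSentencesExtractor",
    "KeepEverythingExtractor",
    "KeepEverythingWithMinKWordsExtractor",
    "LargestContentExtractor",
    "NumWordsRulesExtractor",
    "CanolaExtractor",
}


def make_extractor_kwargs(extractor="DefaultExtractor", html=None, url=None):
    """Hypothetical helper: validate and build keyword arguments for Extractor."""
    if extractor not in VALID_EXTRACTORS:
        raise ValueError("unknown extractor type: %r" % extractor)
    # Assumption of this sketch: exactly one input source must be given.
    if (html is None) == (url is None):
        raise ValueError("pass exactly one of html or url")
    kwargs = {"extractor": extractor}
    if html is not None:
        kwargs["html"] = html
    else:
        kwargs["url"] = url
    return kwargs


def extract_text(**options):
    """Hypothetical helper: run boilerpipe and return the extracted text."""
    # Imported lazily: needs python-boilerpipe, jpype and a working JAVA_HOME.
    from boilerpipe.extract import Extractor

    return Extractor(**make_extractor_kwargs(**options)).getText()


kwargs = make_extractor_kwargs(extractor="ArticleExtractor", html="<p>Some text</p>")
```

With a working Java setup, extract_text(extractor='ArticleExtractor', html=page_html) would extract from a string already in memory instead of fetching a URL.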

Download: https://github.com/misja/python-boilerpipe

3. python-goose



Goose was originally an article extractor written in Java that was most recently (August 2011) converted to a Scala project by Gravity.com.
This is a complete rewrite in Python. It aims to take any news article or article-style web page and extract not only the main body of the article but also its metadata and the most probable main image.
Goose will try to extract the following information:
  • Main text of an article
  • Main image of article
  • Any Youtube/Vimeo movies embedded in article (TODO)
  • Meta Description
  • Meta tags
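
The fields listed above could be gathered roughly as follows. This is only a sketch: it assumes the interface shown in the project's README (Goose().extract(url=...), returning an article with title, meta_description, cleaned_text and top_image.src), and the helpers summarize_article and fetch_article are hypothetical names. A stand-in object is used so the example does not need network access.

```python
from types import SimpleNamespace


def summarize_article(article):
    """Hypothetical helper: collect the main extracted fields into a dict."""
    # Assumption: top_image may be missing or None when no image is found.
    top_image = getattr(article, "top_image", None)
    return {
        "title": article.title,
        "description": article.meta_description,
        "text": article.cleaned_text,
        "image_url": top_image.src if top_image is not None else None,
    }


def fetch_article(url):
    """Hypothetical helper: download and extract an article with Goose."""
    # Imported lazily so the rest of the sketch runs without the package.
    # Assumption: the package exposes Goose with an extract(url=...) method.
    from goose import Goose

    return Goose().extract(url=url)


# Stand-in for the object Goose would return, with made-up values.
fake_article = SimpleNamespace(
    title="Example headline",
    meta_description="Short description",
    cleaned_text="Body of the article.",
    top_image=SimpleNamespace(src="http://example.com/lead.jpg"),
)
summary = summarize_article(fake_article)
```

With the library installed, summarize_article(fetch_article(your_url)) would produce the same kind of dict for a real page.
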
Originally, Goose was open sourced by Gravity.com in 2011
  • Lead Programmer: Jim Plush (Gravity.com)
  • Contributors: Robbie Coleman (Gravity.com)
The Python version was rewritten by:
  • Xavier Grangier (Recrutae.com)
For more information, see the project's GitHub page: https://github.com/xgdlm/python-goose
