In my years of programming in Python and roaming around GitHub's Explore
section, I've come across a few libraries that stood out to me as being
particularly enjoyable to use. This blog post is an effort to further
spread that knowledge.
I specifically excluded awesome libs like requests, SQLAlchemy, Flask, fabric, etc.
because I think they're already pretty "mainstream". If you know what
you're trying to do, it's almost guaranteed that you'll stumble over the
aforementioned. This is a list of libraries that in my opinion should be better known, but aren't.
For parsing HTML in Python, Beautiful Soup is
oft recommended and it does a great job. It sports a good pythonic API
and it's easy to find introductory guides on the web. All is good in
parsing-land ... until you want to parse more than a dozen documents at a
time and immediately run head-first into performance problems. It is,
simply put, very, very slow.
What immediately stands out is how fast lxml is. Compared to Beautiful Soup,
the lxml docs are pretty sparse and that's what originally kept me from
adopting this mustang of a parsing library. lxml is pretty clunky to
use. Yeah, you can learn and use XPath or cssselect to
select specific elements out of the tree and it becomes kind of
tolerable. But once you've selected the elements that you actually want
to get, you have to navigate the labyrinth of attributes lxml exposes,
some containing the bits you want to get at, but the vast majority just
returning None. This becomes easier after a couple dozen uses but it remains unintuitive.
So either slow and easy to use or fast and hard to use, right?
Handling dates is a pain. Thank god dateutil exists. I won't even go near parsing dates without trying dateutil.parser first:
    >>> from dateutil.parser import parse
    >>> parse('Mon, 11 Jul 2011 10:01:56 +0200 (CEST)')
    datetime.datetime(2011, 7, 11, 10, 1, 56, tzinfo=tzlocal())
    >>> # fuzzy ignores unknown tokens
    >>> s = """Today is 25 of September of 2003, exactly... at 10:49:41 with timezone -03:00."""
    >>> parse(s, fuzzy=True)
    datetime.datetime(2003, 9, 25, 10, 49, 41, tzinfo=tzoffset(None, -10800))
Another thing that dateutil does for you, that would be a total pain to do manually, is recurrence:
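A minimal sketch of recurrence with `dateutil.rrule`, assuming a made-up schedule (every Monday and Friday, starting on 1 January 2024). The dates and variable names are illustrative and not taken from the original post:

```python
from datetime import datetime

from dateutil.rrule import rrule, WEEKLY, MO, FR

# Hypothetical schedule: every Monday and Friday at 09:00, four occurrences,
# starting on Monday 1 January 2024.
start = datetime(2024, 1, 1, 9, 0)
occurrences = list(rrule(WEEKLY, byweekday=(MO, FR), dtstart=start, count=4))

for moment in occurrences:
    print(moment.isoformat())
```

Each occurrence is an ordinary `datetime`, so the result can be fed straight into whatever code already works with dates.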
fuzzywuzzy allows you to do fuzzy comparison on strings. This has a whole host of use cases and is especially nice when you have to deal with human-generated data.
Consider the following code that uses the Levenshtein distance to compare some user input against a list of possible choices.
    from Levenshtein import distance

    countries = ['Canada', 'Antarctica', 'Togo', ...]

    def choose_least_distant(element, choices):
        'Return the one element of choices that is most similar to element'
        return min(choices, key=lambda s: distance(element, s))

    user_input = 'canaderp'
    choose_least_distant(user_input, countries)
    >>> 'Canada'
This is all nice and dandy, but we can do better. The ocean of 3rd party libs
in Python is so vast that in most cases we can just import something and be on our way:
Watchdog is a Python API and set of shell utilities to monitor file system events. This
means you can watch some directory and define a "push-based" system.
Watchdog supports all kinds of platforms. A solid piece of engineering
that does it much better than the 5 or so libraries I tried before
finding out about it.
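A "push-based" setup with watchdog might look roughly like the sketch below. `RecordingHandler` and `watch` are hypothetical names invented for this example; only `Observer`, `FileSystemEventHandler` and the `on_created` hook come from watchdog itself:

```python
import time

from watchdog.events import FileSystemEventHandler
from watchdog.observers import Observer


class RecordingHandler(FileSystemEventHandler):
    """Hypothetical handler that remembers every file created under the watch."""

    def __init__(self):
        super().__init__()
        self.created = []

    def on_created(self, event):
        # watchdog calls this from its observer thread for each creation event.
        if not event.is_directory:
            self.created.append(event.src_path)


def watch(directory, seconds):
    """Watch `directory` recursively for `seconds`, then return new file paths."""
    handler = RecordingHandler()
    observer = Observer()
    observer.schedule(handler, directory, recursive=True)
    observer.start()
    try:
        time.sleep(seconds)
    finally:
        observer.stop()
        observer.join()
    return handler.created
```

Calling `watch('.', 10)` would then collect whatever files show up in the current directory during the next ten seconds, without any polling loop of your own.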
That listdir is in os and not os.path is
unfortunate and unexpected and one would really hope for more from such
a prominent module. And then all this manual fiddling for what really
should be as simple as possible.
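To illustrate the kind of manual fiddling meant here, a minimal stdlib-only sketch. The helper name `text_files` and the `.txt` filter are made up for the example:

```python
import os
import os.path


def text_files(directory):
    """Return absolute paths of the .txt files directly inside `directory`."""
    result = []
    # listdir lives in os, while nearly everything else lives in os.path.
    for name in sorted(os.listdir(directory)):
        full_path = os.path.join(directory, name)
        if os.path.isfile(full_path) and os.path.splitext(name)[1] == '.txt':
            result.append(os.path.abspath(full_path))
    return result
```

Four different functions from two modules, all taking and returning bare strings, for what is conceptually one question about a directory.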
But with the power of path, handling file paths becomes fun again:
Best part of it all? path subclasses Python's str, so you can use it completely guilt-free without constantly having to cast it to str or worrying about libraries that check isinstance(s, basestring) (or even worse, isinstance(s, str)).
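Why subclassing `str` matters can be seen with a toy stand-in. `MiniPath` below is hypothetical and is not how path is implemented. It only shows that a `str` subclass passes the usual type checks and can be handed to functions that expect plain strings:

```python
import os.path


class MiniPath(str):
    """Toy str subclass; a stand-in for illustration, not the real path class."""

    def __truediv__(self, other):
        # Join with the OS separator and stay a MiniPath.
        return MiniPath(os.path.join(self, other))

    @property
    def ext(self):
        return os.path.splitext(self)[1]


config = MiniPath('settings') / 'app.ini'

print(isinstance(config, str))    # passes the usual string type check
print(os.path.basename(config))   # stdlib functions accept it unchanged
print(config.ext)
```

Because the object really is a string, third-party code never needs to know that it is dealing with anything fancier.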
I also found some other interesting packages you may want to know:
Docopt. Forget optparse and argparse, and build beautiful, readable and (if you need) complex command-line interfaces using docstrings. IMO the best module created in 2013.
Requests, or HTTP for humans, is a more pythonic approach to deal with HTTP requests. Much, much, much better than urllib2. And it has been downloaded over 5,000,000 times from PyPI for a reason :)
Bottle is a fast, simple and lightweight WSGI micro web-framework. Build small websites and REST APIs in seconds. The whole framework is a single .py file that you can drop into your directory.
Structlog is an advanced processor for logging. It integrates with any existing logger, wrapping the Python standard library. You can build custom loggers, adding context as you go, to keep your logs consistent and readable.
Delorean lets you play with dates and times in a very convenient way. Set time zones, truncate to seconds, minutes, hours, or even iterate from one date to another using a specific step. Check out the docs, they contain plenty of examples.