# Copyright (C) 2015 Chintalagiri Shashank
#
# This file is part of Tendril.
#
# This program is free software: you can redistribute it and/or modify
# it under the terms of the GNU Affero General Public License as published by
# the Free Software Foundation, either version 3 of the License, or
# (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
# GNU Affero General Public License for more details.
#
# You should have received a copy of the GNU Affero General Public License
# along with this program.  If not, see <http://www.gnu.org/licenses/>.

"""
The WWW Utils Module (:mod:`tendril.utils.www`)
===============================================
This module provides utilities to deal with the internet. All application
code should access the internet through this module, since this is where
support for proxies and caching is implemented.
.. rubric:: Main Provided Methods
.. autosummary::

    strencode
    urlopen
    get_soup
This module uses the following configuration values from :mod:`tendril.utils.config`:
.. rubric:: Basic Settings
- :data:`tendril.utils.config.ENABLE_REDIRECT_CACHING`
  Whether or not redirect caching should be used.
- :data:`tendril.utils.config.TRY_REPLICATOR_CACHE_FIRST`
  Whether or not a replicator cache should be used.
Redirect caching speeds up network accesses by remembering ``301`` and
``302`` redirects, so that the correct URL does not have to be resolved
again on subsequent accesses. This redirect cache is stored as a pickled
object in the ``INSTANCE_CACHE`` folder. The effect of this caching is far
more apparent when a replicator cache is also used.
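When redirect caching is enabled, known redirects can be short-circuited
before a request is even made. A minimal sketch (the helper name here is
hypothetical, not part of the module's API)::

    def follow_cached_redirects(url):
        # Walk recorded 301/302 redirects until no cached entry remains.
        while url in redirect_cache:
            url = redirect_cache[url]
        return url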
.. rubric:: Network Proxy Settings
- :data:`tendril.utils.config.NETWORK_PROXY_TYPE`
- :data:`tendril.utils.config.NETWORK_PROXY_IP`
- :data:`tendril.utils.config.NETWORK_PROXY_PORT`
- :data:`tendril.utils.config.NETWORK_PROXY_USER`
- :data:`tendril.utils.config.NETWORK_PROXY_PASS`
.. rubric:: Replicator Cache Settings
The replicator cache is intended to be an ``http-replicator`` instance,
used to locally cache the web pages that are accessed. If
``TRY_REPLICATOR_CACHE_FIRST`` is ``False``, the replicator is never hit at
all. Typical settings for a local replicator instance are sketched after
the following list.
- :data:`tendril.utils.config.REPLICATOR_PROXY_TYPE`
- :data:`tendril.utils.config.REPLICATOR_PROXY_IP`
- :data:`tendril.utils.config.REPLICATOR_PROXY_PORT`
- :data:`tendril.utils.config.REPLICATOR_PROXY_USER`
- :data:`tendril.utils.config.REPLICATOR_PROXY_PASS`
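For a local ``http-replicator`` instance, these settings might look
something like the following (illustrative values only; note that the port
is kept a string, since it is concatenated into the proxy URL)::

    REPLICATOR_PROXY_TYPE = 'http'
    REPLICATOR_PROXY_IP = 'localhost'
    REPLICATOR_PROXY_PORT = '8080'
    REPLICATOR_PROXY_USER = None
    REPLICATOR_PROXY_PASS = None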
This module also provides the :class:`WWWCachedFetcher` class, an instance
of which is available in :data:`cached_fetcher`. It is used by
:func:`get_soup` and by any application code that wants cached results.
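For instance (illustrative; uses the ``fetch`` method of the cached
fetcher as reconstructed in this module)::

    from tendril.utils import www

    page = www.cached_fetcher.fetch('http://www.example.com/')
    soup = www.get_soup('http://www.example.com/')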
Overall, caching should look something like this:

- :class:`WWWCachedFetcher` provides short term (~5 days) caching,
  aggressively expiring whatever it holds (see the trace sketched after
  this list).
- :class:`CachingRedirectHandler` is something of a special case, handling
  redirects which would otherwise be incredibly expensive. Unfortunately,
  this layer is also the dumbest cacher, and never expires anything. To
  'invalidate' an entry in this cache, the entire cache needs to be nuked.
  It may be worthwhile to consider moving this to redis instead.
- ``http-replicator`` provides an underlying caching layer which is
  HTTP/1.1 compliant.
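Put together, a single :func:`get_soup` call traverses these layers roughly
as follows (an illustrative trace, not code)::

    get_soup(url)
     -> cached_fetcher       (filesystem cache, ~5 days)
      -> urlopen             (redirect cache consulted and updated)
       -> replicator proxy   (if TRY_REPLICATOR_CACHE_FIRST)
        -> the network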
"""
""" This function converts unicode strings to ASCII, using python's :func:`str.encode`, replacing any unicode characters present in the string. Unicode characters which Tendril expects to see in web content related to it are specifically replaced first with ASCII characters or character sequences which reasonably reproduce the original meanings.
:param string: unicode string to be encoded. :return: ASCII version of the string.
"""
try:
    # Load the pickled redirect cache from the file system, if one exists.
    with open(REDIR_CACHE_FILE, 'rb') as f:
        redirect_cache = pickle.load(f)
    logger.info('Loaded Redirect Cache from file')
except IOError:
    redirect_cache = {}
    logger.info('Created new Redirect Cache')
""" Called during python interpreter shutdown, this function dumps the redirect cache to the file system. """ if DUMP_REDIR_CACHE_ON_EXIT: with open(REDIR_CACHE_FILE, 'wb') as f: pickle.dump(redirect_cache, f, protocol=2) logger.info('Dumping Redirect Cache to file')
""" This handler modifies the behavior of :class:`urllib2.HTTPRedirectHandler`, resulting in a HTTP ``301`` or ``302`` status to be included in the ``result``.
When this handler is attached to a ``urllib2`` opener, if the opening of the URL resulted in a redirect via HTTP ``301`` or ``302``, this is reported along with the result. This information can be used by the opener to maintain a redirect cache. """ """ Wraps the :func:`urllib2.HTTPRedirectHandler.http_error_301` handler, setting the ``result.status`` to ``301`` in case a http ``301`` error is encountered. """ self, req, fp, code, msg, headers)
""" Wraps the :func:`urllib2.HTTPRedirectHandler.http_error_302` handler, setting the ``result.status`` to ``302`` in case a http ``302`` error is encountered. """ self, req, fp, code, msg, headers)
""" Tests an opener obtained using :func:`urllib2.build_opener` by attempting to open Google's homepage. This is used to test internet connectivity. """ except URLError: return False
""" Creates an opener for the internet.
It also attaches the :class:`CachingRedirectHandler` to the opener and sets its User-agent to ``Mozilla/5.0``.
If the Network Proxy settings are set and recognized, it creates the opener and attaches the proxy_handler to it. The opener is tested and returned if the test passes.
If the test fails an opener without the proxy settings is created instead and is returned instead. """
use_proxy = True proxyurl = 'http://' + NETWORK_PROXY_IP if NETWORK_PROXY_PORT: proxyurl += ':' + NETWORK_PROXY_PORT proxy_handler = ProxyHandler({NETWORK_PROXY_TYPE: proxyurl}) if NETWORK_PROXY_USER is not None: use_proxy_auth = True password_mgr = HTTPPasswordMgrWithDefaultRealm() password_mgr.add_password( None, proxyurl, NETWORK_PROXY_USER, NETWORK_PROXY_PASS ) proxy_auth_handler = ProxyBasicAuthHandler(password_mgr) if use_proxy_auth: openr = build_opener( proxy_handler, proxy_auth_handler, CachingRedirectHandler ) else: openr = build_opener( proxy_handler, CachingRedirectHandler ) else: openr = build_opener(CachingRedirectHandler) openr.addheaders = [('User-agent', 'Mozilla/5.0')] return openr
""" Creates an opener for the replicator.
It also attaches the :class:`CachingRedirectHandler` to the opener and sets its User-agent to ``Mozilla/5.0``.
If the Network Proxy settings are set and recognized, it creates the opener and attaches the proxy_handler to it, and is returned. """
    use_proxy = False
    use_proxy_auth = False
    proxy_handler = None
    proxy_auth_handler = None

    if REPLICATOR_PROXY_TYPE == 'http':
        use_proxy = True
        proxyurl = 'http://' + REPLICATOR_PROXY_IP
        if REPLICATOR_PROXY_PORT:
            proxyurl += ':' + REPLICATOR_PROXY_PORT
        proxy_handler = ProxyHandler(
            {REPLICATOR_PROXY_TYPE: proxyurl}
        )
        if REPLICATOR_PROXY_USER is not None:
            use_proxy_auth = True
            password_mgr = HTTPPasswordMgrWithDefaultRealm()
            password_mgr.add_password(
                None, proxyurl,
                REPLICATOR_PROXY_USER, REPLICATOR_PROXY_PASS
            )
            proxy_auth_handler = ProxyBasicAuthHandler(password_mgr)
    if use_proxy:
        if use_proxy_auth:
            openr = build_opener(
                proxy_handler, proxy_auth_handler, CachingRedirectHandler
            )
        else:
            openr = build_opener(
                proxy_handler, CachingRedirectHandler
            )
    else:
        openr = build_opener(CachingRedirectHandler)
    openr.addheaders = [('User-agent', 'Mozilla/5.0')]
    return openr
opener = _create_opener()
replicator_opener = _create_replicator_opener()
""" Opens a url specified by the ``url`` parameter.
This function handles : - Redirect caching, if enabled. - Trying the replicator first, if enabled. - Retries upto 5 times if it encounters a http ``500`` error.
""" try: page = replicator_opener.open(url) try: if ENABLE_REDIRECT_CACHING is True and page.status == 301: logger.debug('Detected New Permanent Redirect:\n' + url + '\n' + page.url) redirect_cache[url] = page.url except AttributeError: pass return page except HTTPError as e: logger.error("HTTP Error : " + str(e.code) + str(url)) if e.code == 500: time.sleep(0.5) retries -= 1 else: raise except URLError as e: logger.error("URL Error : " + str(e.errno) + " " + str(e.reason)) raise
url + '\n' + page.url) else:
class WWWCachedFetcher(object):
    # TODO Improve this to use / provide a decent caching layer.
    """
    This class implements a simple filesystem cache which can be used to
    create and obtain from cached www requests.

    The cache is stored in the ``cache_fs`` filesystem, with a filename
    constructed from the md5 sum of the url (encoded as ``utf-8`` if
    necessary).
    """
    def __init__(self, cache_fs=None):
        # The cache filesystem is assumed to be injectable here; how the
        # original module constructs cache_fs is not shown in the source.
        self.cache_fs = cache_fs if cache_fs is not None else TempFS()

    def fetch(self, url, max_age=500000):
        # max_age is in seconds; the default is an assumption, chosen to
        # match the ~5 day window described in the module docstring.
        # Use MD5 hash of the URL as the filename
        filepath = md5(url.encode('utf-8')).hexdigest()
        if self.cache_fs.exists(filepath):
            # TODO This seriously needs cleaning up.
            if int(time.time()) - int(time.mktime(self.cache_fs.getinfo(filepath)['modified_time'].timetuple())) < max_age:  # noqa
                return self.cache_fs.open(filepath).read()
        # Retrieve over HTTP and cache, using rename to avoid collisions.
        # This can be pretty expensive if the move is across a real filesystem
        # boundary. We should instead use a temporary file in the cache_fs
        # itself.
        data = urlopen(url).read()
        temp_fs = TempFS()
        temp_fs.setcontents(filepath, data)
        movefile(temp_fs, filepath,
                 self.cache_fs, filepath)
        return data
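def _cache_atomically(target_fs, filepath, data):
    # Sketch of the TODO in fetch() above (an illustration, not the original
    # implementation): write to a temporary name inside the cache filesystem
    # itself, then rename within that same filesystem, so the final move
    # never crosses a real filesystem boundary.
    temp_name = filepath + '.part'
    target_fs.setcontents(temp_name, data)
    target_fs.rename(temp_name, filepath)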
#: The module's :class:`WWWCachedFetcher` instance which should be
#: used whenever cached results are desired.
cached_fetcher = WWWCachedFetcher()
""" Gets a :mod:`bs4` parsed soup for the ``url`` specified by the parameter. The :mod:`lxml` parser is used.
This function returns a soup constructed of the cached page if one exists and is valid, or obtains one and dumps it into the cache if it doesn't. """ return None |
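# Illustrative usage: fetch and parse a page through the caching machinery
# configured above.
#
#     soup = get_soup('http://www.example.com/')
#     if soup is not None:
#         print(soup.title.string)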