Browser API

API documentation for the mechanize Browser object. You can create a mechanize Browser instance as:

from mechanize import Browser
br = Browser()

The Browser

class mechanize.Browser(history=None, request_class=None, content_parser=None, factory_class=<class mechanize._html.Factory>, allow_xhtml=False)

Browser-like class with support for history, forms and links.

BrowserStateError is raised whenever the browser is in the wrong state to complete the requested operation - e.g., when back() is called when the browser history is empty, or when follow_link() is called when the current response does not contain HTML data.

Public attributes:

request: current request (mechanize.Request)

form: currently selected form (see select_form())

Parameters:
  • history – object implementing the mechanize.History interface. Note this interface is still experimental and may change in future. This object is owned by the browser instance and must not be shared among browsers.
  • request_class – Request class to use. Defaults to mechanize.Request
  • content_parser – A function that is responsible for parsing received html/xhtml content. See the builtin mechanize._html.content_parser() function for details on the interface this function must support.
  • factory_class – HTML Factory class to use. Defaults to mechanize.Factory

add_client_certificate(url, key_file, cert_file)

Add an SSL client certificate, for HTTPS client auth.

key_file and cert_file must be filenames of the key and certificate files, in PEM format. You can use e.g. OpenSSL to convert a p12 (PKCS 12) file to PEM format:

openssl pkcs12 -clcerts -nokeys -in cert.p12 -out cert.pem
openssl pkcs12 -nocerts -in cert.p12 -out key.pem

Note that client certificate password input is currently very inflexible. It appears to be console-only, which is presumably the default behaviour of libopenssl. In future mechanize may support third-party libraries that (I assume) allow more options here.

back(n=1)

Go back n steps in history, and return response object.

n: go back this number of steps (default 1 step)

click(*args, **kwds)

See mechanize.HTMLForm.click() for documentation.

click_link(link=None, **kwds)

Find a link and return a Request object for it.

Arguments are as for find_link(), except that a link may be supplied as the first argument.

cookiejar

Return the current cookiejar (mechanize.CookieJar) or None

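find_link(text=None, text_regex=None, name=None, name_regex=None, url=None, url_regex=None, tag=None, predicate=None, nr=0)

Find a link in the current page.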

Links are returned as mechanize.Link objects. Examples:

# Return third link that .search()-matches the regexp "python" (by
# ".search()-matches", I mean that the regular expression method
# .search() is used, rather than .match()).
find_link(text_regex=re.compile("python"), nr=2)

# Return first http link in the current page that points to
# somewhere on python.org whose link text (after tags have been
# removed) is exactly "monty python".
find_link(text="monty python",
        url_regex=re.compile("http.*python.org"))

# Return first link with exactly three HTML attributes.
find_link(predicate=lambda link: len(link.attrs) == 3)

Links include anchors <a>, image maps <area>, and frames <iframe>.

All arguments must be passed by keyword, not position. Zero or more arguments may be supplied. In order to find a link, all arguments supplied must match.

If a matching link is not found, mechanize.LinkNotFoundError is raised.

Parameters:
  • text – link text between link tags: e.g. <a href="blah">this bit</a>, with whitespace compressed.
  • text_regex – link text between tags (as defined above) must match the regular expression object or regular expression string passed as this argument, if supplied
  • name – as for text and text_regex, but matched against the name HTML attribute of the link tag
  • url – as for text and text_regex, but matched against the URL of the link tag (note this matches against Link.url, which is a relative or absolute URL according to how it was written in the HTML)
  • tag – element name of opening tag, e.g. “a”
  • predicate – a function taking a Link object as its single argument, returning a boolean result, indicating whether the link matches
  • nr – matches the nth link that matches all other criteria (default 0)
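
The matching rules above (every supplied criterion must hold, regular expressions are applied with .search(), and nr picks the nth link that survives) can be illustrated with a small stand-in that does not use mechanize at all. FakeLink and pick_link are hypothetical names invented for this sketch; this is not mechanize's implementation, only a model of the documented behaviour.

```python
import re
from collections import namedtuple

# Hypothetical stand-in for mechanize.Link, carrying only the fields used here.
FakeLink = namedtuple("FakeLink", "url text")


def pick_link(links, text=None, text_regex=None, url_regex=None,
              predicate=None, nr=0):
    """Return the nr-th link for which every supplied criterion matches."""
    seen = 0
    for link in links:
        if text is not None and link.text != text:
            continue
        # .search() finds the pattern anywhere in the string;
        # .match() would only look at the start of the string.
        if text_regex is not None and not re.search(text_regex, link.text):
            continue
        if url_regex is not None and not re.search(url_regex, link.url):
            continue
        if predicate is not None and not predicate(link):
            continue
        if seen == nr:
            return link
        seen += 1
    # mechanize raises LinkNotFoundError here; LookupError is a placeholder.
    raise LookupError("no matching link")


links = [
    FakeLink("http://www.python.org/", "monty python"),
    FakeLink("/docs", "python docs"),
    FakeLink("http://example.com/", "elsewhere"),
]

# Exact text plus a URL pattern: both criteria must match.
first = pick_link(links, text="monty python",
                  url_regex=re.compile("http.*python.org"))

# nr=1 selects the second link whose text contains "python".
second = pick_link(links, text_regex="python", nr=1)
```

Note that "python docs" is found by the second call even though a .match() against the pattern "python" would also succeed there, while "monty python" is only counted because .search() is used.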

follow_link(link=None, **kwds)

Find a link and open() it.

Arguments are as for click_link().

Return value is same as for open().

forms()

Return iterable over forms.

The returned form objects implement the mechanize.HTMLForm interface.

geturl()

Get URL of current document.

global_form()

Return the global form object, or None if the factory implementation did not supply one.

The “global” form object contains all controls that are not descendants of any FORM element.

The returned form object implements the mechanize.HTMLForm interface.

This is a separate method since the global form is not regarded as part of the sequence of forms in the document – mostly for backwards-compatibility.

links(**kwds)

Return iterable over links (mechanize.Link objects).

open(url_or_request, data=None, timeout=<object object>)

Open a URL. Loads the page so that you can subsequently use forms(), links(), etc. on it.

Parameters:
  • url_or_request – Either a URL or a mechanize.Request
  • data (dict) – data to send with a POST request
  • timeout – Timeout in seconds
Returns:

A mechanize.Response object
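
As a rough illustration of the flow described above, the helper below opens a page and then reads back some of the state that open() makes available. It is a sketch rather than a recommended pattern: page_overview is a hypothetical name, and br is assumed to be a mechanize Browser instance created as shown at the top of this page.

```python
def page_overview(br, url):
    """Open url with a mechanize Browser and summarise the loaded page."""
    br.open(url)
    if not br.viewing_html():
        # title(), links() and forms() only make sense for HTML responses.
        return None
    return {
        "url": br.geturl(),
        "title": br.title(),
        # Each link is a mechanize.Link; absolute_url is its resolved URL.
        "links": [link.absolute_url for link in br.links()],
        "form_count": len(list(br.forms())),
    }
```

With network access this could be called as page_overview(Browser(), "http://example.com/"), where the URL is only a placeholder.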

open_novisit(url_or_request, data=None, timeout=<object object>)

Open a URL without visiting it.

Browser state (including request, response, history, forms and links) is left unchanged by calling this function.

The interface is the same as for open().

This is useful for things like fetching images.

See also retrieve()

reload()

Reload current document, and return response object.

response()

Return a copy of the current response.

The returned object has the same interface as the object returned by open()

retrieve(fullurl, filename=None, reporthook=None, data=None, timeout=<object object>, open=<built-in function open>)

Returns (filename, headers).

For remote objects, the default filename will refer to a temporary file. Temporary files are removed when the OpenerDirector.close() method is called.

For file: URLs, at present the returned filename is None. This may change in future.

If the actual number of bytes read is less than indicated by the Content-Length header, raises ContentTooShortError (a URLError subclass). The exception’s .result attribute contains the (filename, headers) that would have been returned.

select_form(name=None, predicate=None, nr=None, **attrs)

Select an HTML form for input.

This is a bit like giving a form the “input focus” in a browser.

If a form is selected, the Browser object supports the HTMLForm interface, so you can call methods like set_value(), set(), and click().

Another way to select a form is to assign to the .form attribute. The form assigned should be one of the objects returned by the forms() method.

If no matching form is found, mechanize.FormNotFoundError is raised.

If name is specified, then the form must have the indicated name.

If predicate is specified, then the form must match that function. The predicate function is passed the mechanize.HTMLForm as its single argument, and should return a boolean value indicating whether the form matched.

nr, if supplied, is the sequence number of the form (where 0 is the first). Note that form 0 is the first form matching all the other arguments (if supplied); it is not necessarily the first form in the document. The “global form” (consisting of all form controls not contained in any FORM element) is considered not to be part of this sequence and to have no name, so will not be matched unless both name and nr are None.

You can also match on any HTML attribute of the <form> tag by passing the attribute name and value as keyword arguments. Because not every HTML attribute name is a valid Python keyword argument, the keyword name is converted to an HTML attribute name by replacing all underscores with hyphens and removing any trailing underscores. The values to match can be strings, functions or regular expression objects. For example:

# Match form with the exact action specified
br.select_form(action='http://foo.com/submit.php')
# Match form with a class attribute that contains 'login'
br.select_form(class_=lambda x: 'login' in x)
# Match form with a data-form-type attribute that matches a regex
br.select_form(data_form_type=re.compile(r'a|b'))
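
The keyword-to-attribute rule and the three kinds of match value can be modelled in a few lines of plain Python. The helper names below are hypothetical and this is not mechanize's own code; in particular, the text above does not say whether regular expressions are applied with .search() or .match(), so the sketch assumes .search().

```python
import re


def kwarg_to_attribute(name):
    """Drop trailing underscores, then turn remaining underscores into hyphens."""
    return name.rstrip("_").replace("_", "-")


def value_matches(expected, actual):
    """Compare an attribute value against a string, a function or a regex."""
    if callable(expected):
        return bool(expected(actual))
    if hasattr(expected, "search"):
        # Assumption: regex values are tested with .search().
        return expected.search(actual) is not None
    return expected == actual


# class_ avoids the reserved word "class"; data_form_type maps to data-form-type.
names = [kwarg_to_attribute(n) for n in ("action", "class_", "data_form_type")]

checks = [
    value_matches("http://foo.com/submit.php", "http://foo.com/submit.php"),
    value_matches(lambda x: "login" in x, "box login-form"),
    value_matches(re.compile(r"a|b"), "b"),
]
```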

set_ca_data(cafile=None, capath=None, cadata=None, context=None)

Set the SSL Context used for connecting to SSL servers.

This method accepts the same arguments as the ssl.SSLContext.load_verify_locations() method from the Python standard library. You can also pass a pre-built ssl.SSLContext via the context keyword argument. Note that to use this feature, you must be using Python >= 2.7.9.
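
A minimal sketch of the pre-built context route, assuming Python 3 and a Browser instance passed in as br. The helper name use_custom_context is hypothetical, and the commented-out CA bundle path is a placeholder rather than a real file.

```python
import ssl

# Start from the standard library's default client-side settings,
# which verify certificates and check host names.
context = ssl.create_default_context()

# A private CA bundle could be added here; the path is a placeholder.
# context.load_verify_locations(cafile="/path/to/private-ca.pem")


def use_custom_context(br):
    """Hand the prepared SSLContext to a mechanize Browser instance."""
    br.set_ca_data(context=context)
```

Building the context separately keeps all TLS policy in one place, which may be convenient when several Browser instances should share the same settings.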

set_client_cert_manager(cert_manager)

Set a mechanize.HTTPClientCertMgr, or None.

set_cookie(cookie_string)

Set a cookie.

Note that it is NOT necessary to call this method under ordinary circumstances: cookie handling is normally entirely automatic. The intended use case is rather to simulate the setting of a cookie by client script in a web page (e.g. JavaScript). In that case, use of this method is necessary because mechanize currently does not support JavaScript, VBScript, etc.

The cookie is added in the same way as if it had arrived with the current response, as a result of the current request. This means that, for example, if it is not appropriate to set the cookie based on the current request, no cookie will be set.

The cookie will be returned automatically with subsequent requests made by the Browser instance whenever that’s appropriate.

cookie_string should be a valid value of the Set-Cookie header.

For example:

browser.set_cookie(
    "sid=abcdef; expires=Wednesday, 09-Nov-06 23:12:40 GMT")

Currently, this method does not allow for adding RFC 2965 cookies. This limitation will be lifted if anybody requests it.

See also set_simple_cookie() for an easier way to set cookies without needing to create a Set-Cookie header string.
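
If the cookie string is assembled in code rather than typed by hand, the standard library can produce a Set-Cookie header value. This is only a sketch under the assumption of Python 3 (http.cookies); build_cookie_string is a hypothetical helper, and browser in the final comment stands for an existing Browser instance that has already opened a page on the matching site.

```python
from http.cookies import SimpleCookie


def build_cookie_string(name, value, path="/"):
    """Return a Set-Cookie header value for a single cookie."""
    cookie = SimpleCookie()
    cookie[name] = value
    cookie[name]["path"] = path
    # OutputString() renders only the value part, without "Set-Cookie:".
    return cookie[name].OutputString()


cookie_string = build_cookie_string("sid", "abcdef")

# Then, on a browser whose current request matches the cookie:
#     browser.set_cookie(cookie_string)
```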

set_cookiejar(cookiejar)

Set a mechanize.CookieJar, or None.

set_debug_http(handle)

Print HTTP headers to sys.stdout.

set_debug_redirects(handle)

Log information about HTTP redirects (including refreshes).

Logging is performed using the logging module. The logger name is “mechanize.http_redirects”. For example, to actually print some debug output:

import sys, logging
logger = logging.getLogger("mechanize.http_redirects")
logger.addHandler(logging.StreamHandler(sys.stdout))
logger.setLevel(logging.INFO)

Other logger names relevant to this module:

  • mechanize.http_responses
  • mechanize.cookies

To turn on everything:

import sys, logging
logger = logging.getLogger("mechanize")
logger.addHandler(logging.StreamHandler(sys.stdout))
logger.setLevel(logging.INFO)

set_debug_responses(handle)

Log HTTP response bodies.

See set_debug_redirects() for details of logging.

Response objects may be .seek()able if this is set (currently returned responses are, raised HTTPError exception responses are not).

set_handle_equiv(handle, head_parser_class=None)

Set whether to treat HTML http-equiv headers like HTTP headers.

Response objects may be .seek()able if this is set (currently returned responses are, raised HTTPError exception responses are not).

set_handle_gzip(handle)

Add header indicating to server that we handle gzip content encoding. Note that if the server sends gzip’ed content, it is handled automatically in any case, regardless of this setting.

set_handle_redirect(handle)

Set whether to handle HTTP 30x redirections.

set_handle_referer(handle)

Set whether to add Referer header to each request.

set_handle_refresh(handle, max_time=None, honor_time=True)

Set whether to handle HTTP Refresh headers.

set_handle_robots(handle)

Set whether to observe rules from robots.txt.

set_handled_schemes(schemes)

Set sequence of URL scheme (protocol) strings.

For example: ua.set_handled_schemes(["http", "ftp"])

If this fails (with ValueError) because you’ve passed an unknown scheme, the set of handled schemes will not be changed.

set_header(header, value=None)

Convenience method to set a header value in self.addheaders so that the header is sent out with all requests automatically.

Parameters:
  • header – The header name, e.g. User-Agent
  • value – The header value. If set to None the header is removed.

set_html(html, url='http://example.com/')

Set the current response to a dummy response containing the given HTML, and URL if given.

This allows you to then parse that HTML, especially to extract forms information. If no URL is given, the default is “http://example.com/”.

set_password_manager(password_manager)

Set a mechanize.HTTPPasswordMgrWithDefaultRealm, or None.

set_proxies(proxies=None, proxy_bypass=None)

Configure proxy settings.

Parameters:
  • proxies – dictionary mapping URL scheme to proxy specification. None means use the default system-specific settings.
  • proxy_bypass – function taking hostname, returning whether the proxy should be bypassed for that host. None means use the default system-specific settings.

The default is to try to obtain proxy settings from the system (see the documentation for urllib.urlopen for information about the system-specific methods used – note that’s urllib, not urllib2).

To avoid all use of proxies, pass an empty proxies dict.

>>> ua = UserAgentBase()
>>> def proxy_bypass(hostname):
...     return hostname == "noproxy.com"
>>> ua.set_proxies(
...     {"http": "joe:password@myproxy.example.com:3128",
...      "ftp": "proxy.example.com"},
...     proxy_bypass)

set_proxy_password_manager(password_manager)

Set a mechanize.HTTPProxyPasswordMgr, or None.

set_request_gzip(handle)

Add header indicating to server that we handle gzip content encoding. Note that if the server sends gzip’ed content, it is handled automatically in any case, regardless of this setting.

set_response(response)

Replace current response with (a copy of) response.

response may be None.

This is intended mostly for HTML-preprocessing.

set_simple_cookie(name, value, domain, path='/')

Similar to set_cookie() except that instead of using a cookie string, you simply specify the name, value, domain and optionally the path. The created cookie will never expire. For example:

browser.set_simple_cookie('some_key', 'some_value', '.example.com',
                          path='/some-page')

submit(*args, **kwds)

Submit current form.

Arguments are as for mechanize.HTMLForm.click().

Return value is same as for open().
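
Putting open(), select_form() and submit() together, a form submission might be sketched as follows. submit_first_form is a hypothetical helper and br is assumed to be a Browser instance. Setting controls by item assignment (br[name] = value) follows mechanize's usual examples but is not described in this section, so treat that line as an assumption.

```python
def submit_first_form(br, url, fields):
    """Open url, fill in the first form on the page, and submit it."""
    br.open(url)
    br.select_form(nr=0)  # nr=0 selects the first form in the document
    for name, value in fields.items():
        # Assumption: item assignment sets the value of the named control.
        br[name] = value
    # submit() returns the same kind of response object as open().
    return br.submit()
```

A call could look like submit_first_form(br, "http://example.com/login", {"user": "alice"}), where the URL and field name are placeholders.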

title()

Return title, or None if there is no title element in the document.

viewing_html()

Return whether the current response contains HTML data.

visit_response(response, request=None)

Visit the response, as if it had been open()ed.

Unlike set_response(), this updates history rather than replacing the current response.

The Request

class mechanize.Request(url, data=None, headers={}, origin_req_host=None, unverifiable=False, visit=None, timeout=<object object>, method=None)

A request for some network resource. Note that if you specify the method as ‘GET’ and the data as a dict, then it will be automatically appended to the URL. If you leave method as None, then the method will be auto-set to POST and the data will become part of the POST request.

Parameters:
  • url (str) – The URL to request
  • data – Data to send with this request. Can be either a dictionary which will be encoded and sent as application/x-www-form-urlencoded data or a bytestring which will be sent as is. If you use a bytestring you should also set the Content-Type header appropriately.
  • headers (dict) – Headers to send with this request
  • method (str) – Method to use for HTTP requests. If not specified mechanize will choose GET or POST automatically as appropriate.
  • timeout (float) – Timeout in seconds

The remaining arguments are for internal use.
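
To make the GET versus POST behaviour concrete, the sketch below shows what application/x-www-form-urlencoded data looks like for a small dictionary, using only the standard library (Python 3 assumed). The URL and parameter names are placeholders, make_requests is a hypothetical helper, and the exact bytes mechanize produces may differ in detail.

```python
from urllib.parse import urlencode

params = {"q": "mechanize", "page": 2}

# The dictionary in form-urlencoded shape: "q=mechanize&page=2"
encoded = urlencode(params)

# With method='GET', data of this shape ends up in the URL ...
get_url = "http://example.com/search?" + encoded

# ... whereas for a POST it travels in the request body.
post_body = encoded.encode("ascii")


def make_requests(request_class):
    """Build one GET and one POST request; pass mechanize.Request in."""
    get_req = request_class("http://example.com/search",
                            data=params, method="GET")
    # method left as None: with data present, POST is chosen automatically.
    post_req = request_class("http://example.com/search", data=params)
    return get_req, post_req
```

Taking the request class as a parameter mirrors the Browser's own request_class argument and keeps the sketch independent of any particular import.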

add_data(data)

Set the data (a bytestring) to be sent with this request

add_header(key, val=None)

Add the specified header, replacing existing one, if needed. If val is None, remove the header.

add_unredirected_header(key, val)

Same as add_header() except that this header will not be sent for redirected requests.

get_data()

The data to be sent with this request

get_header(header_name, default=None)

Get the value of the specified header. If absent, return default

get_method()

The method used for HTTP requests

has_data()

True iff there is some data to be sent with this request

has_header(header_name)

Check if the specified header is present

has_proxy()

Private method.

header_items()

Get a copy of all headers for this request as a list of 2-tuples

set_data(data)

Set the data (a bytestring) to be sent with this request

The Response

Response objects in mechanize are seek()able file-like objects that support some additional methods, depending on the protocol used for the connection. The documentation below is for HTTP(S) responses, as these are the most common.

Additional methods present for HTTP responses:

class mechanize._mechanize.HTTPResponse

code

The HTTP status code

getcode()

Return HTTP status code

geturl()

Return the URL of the resource retrieved, commonly used to determine if a redirect was followed

get_all_header_names(normalize=True)

Return a list of all headers names. When normalize is True, the case of the header names is normalized.

get_all_header_values(name, normalize=True)

Return a list of all values for the specified header name (which is case-insensitive). Since headers in HTTP can be specified multiple times, the returned value is always a list. See rfc822.Message.getheaders().

info()

Return the headers of the response as a rfc822.Message instance.

__getitem__(header_name)

Return the last HTTP header matching the specified name, as a string. mechanize Response objects act like dictionaries for convenient access to header values, for example: response['Date']. Header names are matched case-insensitively. Note that when more than one header with the same name is present, only the value of the last header is returned; use get_all_header_values() to get the values of all headers.

get(header_name, default=None)

Return the header value for the specified header_name or default if the header is not present. See __getitem__().
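
The accessors above can be combined into a small summary helper. This is a sketch only: summarize_response is a hypothetical name, and response is assumed to be an HTTP response returned by a Browser's open().

```python
def summarize_response(response):
    """Collect a few facts from a mechanize HTTP response."""
    return {
        "status": response.getcode(),
        # Differs from the requested URL if a redirect was followed.
        "final_url": response.geturl(),
        # Dictionary-style lookup with a fallback for a missing header.
        "content_type": response.get("Content-Type", "unknown"),
        # Always a list, because a header may appear several times.
        "set_cookie": response.get_all_header_values("Set-Cookie"),
    }
```

Using get() with a default avoids having to handle a missing header separately, while get_all_header_values() is the safer choice for headers such as Set-Cookie that commonly repeat.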

Miscellaneous

class mechanize.Link(base_url, url, text, tag, attrs)

A link in an HTML document

Variables:
  • absolute_url – The absolutized link URL
  • url – The link URL
  • base_url – The base URL against which this link is resolved
  • text – The link text
  • tag – The link tag name
  • attrs – The tag attributes

class mechanize.History

Though this will become public, the implied interface is not yet stable.

mechanize._html.content_parser(data, url=None, response_info=None, transport_encoding=None, default_encoding='utf-8', is_html=True)

Parse data (a bytes object) into an etree representation such as xml.etree.ElementTree or lxml.etree

Parameters:
  • data (bytes) – The data to parse
  • url – The URL of the document being parsed or None
  • response_info – Information about the document (contains all HTTP headers as HTTPMessage)
  • transport_encoding – The character encoding for the document being parsed as specified in the HTTP headers or None.
  • default_encoding – The character encoding to use if no encoding could be detected and no transport_encoding is specified
  • is_html – If the document is to be parsed as HTML.
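
A function with the same call signature can be written against the standard library, which may help when reading the parameter list above. This sketch assumes well-formed (X)HTML input and uses a strict XML parser; strict_xml_content_parser is a hypothetical name. Whether mechanize can work with the tree it returns (for example, how XML namespaces are treated) is not covered by the text above, so treat it purely as an illustration of the interface.

```python
import xml.etree.ElementTree as ET


def strict_xml_content_parser(data, url=None, response_info=None,
                              transport_encoding=None,
                              default_encoding='utf-8', is_html=True):
    """Parse well-formed (X)HTML bytes into an ElementTree root element."""
    # Prefer the encoding announced over HTTP, else fall back to the default.
    parser = ET.XMLParser(encoding=transport_encoding or default_encoding)
    parser.feed(data)
    return parser.close()


root = strict_xml_content_parser(
    b"<html><body><a href='x'>t</a></body></html>")
```

Real-world HTML is rarely well-formed XML, which is presumably why the built-in parser exists; a strict parser like this one will raise an error on typical pages.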