Browser API
API documentation for the mechanize Browser object.
You can create a mechanize Browser instance as:
from mechanize import Browser
br = Browser()
The Browser
- class mechanize.Browser(history=None, request_class=None, content_parser=None, factory_class=<class mechanize._html.Factory>, allow_xhtml=False)
  Browser-like class with support for history, forms and links.
  BrowserStateError is raised whenever the browser is in the wrong state to complete the requested operation, e.g. when back() is called while the browser history is empty, or when follow_link() is called while the current response does not contain HTML data.
  Public attributes:
  - request: current request (mechanize.Request)
  - form: currently selected form (see select_form())
  Parameters:
  - history – object implementing the mechanize.History interface. Note that this interface is still experimental and may change in future. This object is owned by the browser instance and must not be shared among browsers.
  - request_class – Request class to use. Defaults to mechanize.Request
  - content_parser – A function that is responsible for parsing received html/xhtml content. See the builtin mechanize._html.content_parser() function for details on the interface this function must support.
  - factory_class – HTML Factory class to use. Defaults to mechanize.Factory
- add_client_certificate(url, key_file, cert_file)
  Add an SSL client certificate, for HTTPS client auth.
  key_file and cert_file must be filenames of the key and certificate files, in PEM format. You can use e.g. OpenSSL to convert a p12 (PKCS 12) file to PEM format:
    openssl pkcs12 -clcerts -nokeys -in cert.p12 -out cert.pem
    openssl pkcs12 -nocerts -in cert.p12 -out key.pem
  Note that client certificate password input is currently very inflexible. It seems to be console only, which is presumably the default behaviour of libopenssl. In future mechanize may support third-party libraries that (I assume) allow more options here.
- back(n=1)
  Go back n steps in history, and return the response object.
  n: go back this number of steps (default 1 step)
- click(*args, **kwds)
  See mechanize.HTMLForm.click() for documentation.
- click_link(link=None, **kwds)
  Find a link and return a Request object for it.
  Arguments are as for find_link(), except that a link may be supplied as the first argument.
- cookiejar
  Return the current cookiejar (mechanize.CookieJar) or None.
- find_link(text=None, text_regex=None, name=None, name_regex=None, url=None, url_regex=None, tag=None, predicate=None, nr=0)
  Find a link in the current page.
  Links are returned as mechanize.Link objects. Examples:
    # Return third link that .search()-matches the regexp "python" (by
    # ".search()-matches", I mean that the regular expression method
    # .search() is used, rather than .match()).
    find_link(text_regex=re.compile("python"), nr=2)
    # Return first http link in the current page that points to
    # somewhere on python.org whose link text (after tags have been
    # removed) is exactly "monty python".
    find_link(text="monty python", url_regex=re.compile("http.*python.org"))
    # Return first link with exactly three HTML attributes.
    find_link(predicate=lambda link: len(link.attrs) == 3)
  Links include anchors <a>, image maps <area>, and frames <iframe>.
  All arguments must be passed by keyword, not position. Zero or more arguments may be supplied. In order to find a link, all arguments supplied must match.
  If a matching link is not found, mechanize.LinkNotFoundError is raised.
  Parameters:
  - text – link text between link tags: e.g. <a href="blah">this bit</a> with whitespace compressed.
  - text_regex – link text between link tags (as defined above) must match the regular expression object or regular expression string passed as this argument, if supplied
  - name – as for text and text_regex, but matched against the name HTML attribute of the link tag
  - url – as for text and text_regex, but matched against the URL of the link tag (note this matches against Link.url, which is a relative or absolute URL according to how it was written in the HTML)
  - tag – element name of opening tag, e.g. "a"
  - predicate – a function taking a Link object as its single argument, returning a boolean result, indicating whether the link matches
  - nr – matches the nth link that matches all other criteria (default 0)
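The difference between .search() and .match() mentioned above, and the meaning of nr, can be sketched with the standard re module alone. The link texts below are made up for illustration, and third_python_link() is a hypothetical helper that assumes br is a Browser with a page already loaded.

```python
import re

pattern = re.compile("python")

# Made-up link texts standing in for the links of a page.
link_texts = [
    "Monty Python",        # capital P: no case-sensitive match at all
    "learn python today",  # pattern occurs in the middle
    "python.org home",     # pattern occurs at the start
    "about pythons",       # pattern occurs inside a longer word
]

# .search() accepts the pattern anywhere in the text ...
search_hits = [t for t in link_texts if pattern.search(t)]
# ... whereas .match() only accepts it at the very beginning.
match_hits = [t for t in link_texts if pattern.match(t)]

# nr is a zero-based index into the matches, so nr=2 picks the third one.
third_hit = search_hits[2]


def third_python_link(br):
    """Hypothetical helper: the third link whose text .search()-matches."""
    return br.find_link(text_regex=pattern, nr=2)
```

Pass re.IGNORECASE to re.compile() if link text such as "Monty Python" should match as well.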
- follow_link(link=None, **kwds)
  Find a link and open() it.
  Arguments are as for click_link().
  Return value is same as for open().
- forms()
  Return iterable over forms.
  The returned form objects implement the mechanize.HTMLForm interface.
- global_form()
  Return the global form object, or None if the factory implementation did not supply one.
  The “global” form object contains all controls that are not descendants of any FORM element.
  The returned form object implements the mechanize.HTMLForm interface.
  This is a separate method since the global form is not regarded as part of the sequence of forms in the document, mostly for backwards-compatibility.
- links(**kwds)
  Return iterable over links (mechanize.Link objects).
- open(url_or_request, data=None, timeout=<object object>)
  Open a URL. Loads the page so that you can subsequently use forms(), links(), etc. on it.
  Parameters:
  - url_or_request – Either a URL or a mechanize.Request
  - data (dict) – data to send with a POST request
  - timeout – Timeout in seconds
  Returns: A mechanize.Response object
- open_novisit(url_or_request, data=None, timeout=<object object>)
  Open a URL without visiting it.
  Browser state (including request, response, history, forms and links) is left unchanged by calling this function.
  The interface is the same as for open().
  This is useful for things like fetching images.
  See also retrieve()
- response()
  Return a copy of the current response.
  The returned object has the same interface as the object returned by open()
- retrieve(fullurl, filename=None, reporthook=None, data=None, timeout=<object object>, open=<built-in function open>)
  Returns (filename, headers).
  For remote objects, the default filename will refer to a temporary file. Temporary files are removed when the OpenerDirector.close() method is called.
  For file: URLs, at present the returned filename is None. This may change in future.
  If the actual number of bytes read is less than indicated by the Content-Length header, raises ContentTooShortError (a URLError subclass). The exception’s .result attribute contains the (filename, headers) that would have been returned.
- select_form(name=None, predicate=None, nr=None, **attrs)
  Select an HTML form for input.
  This is a bit like giving a form the “input focus” in a browser.
  If a form is selected, the Browser object supports the HTMLForm interface, so you can call methods like set_value(), set(), and click().
  Another way to select a form is to assign to the .form attribute. The form assigned should be one of the objects returned by the forms() method.
  If no matching form is found, mechanize.FormNotFoundError is raised.
  If name is specified, then the form must have the indicated name.
  If predicate is specified, then the form must match that function. The predicate function is passed the mechanize.HTMLForm as its single argument, and should return a boolean value indicating whether the form matched.
  nr, if supplied, is the sequence number of the form (where 0 is the first). Note that form 0 is the first form matching all the other arguments (if supplied); it is not necessarily the first form in the document. The “global form” (consisting of all form controls not contained in any FORM element) is considered not to be part of this sequence and to have no name, so will not be matched unless both name and nr are None.
  You can also match on any HTML attribute of the <form> tag by passing in the attribute name and value as keyword arguments. Because HTML attribute names are not always syntactically valid Python keyword arguments, a simple rule is used: the Python keyword argument name is converted to an HTML attribute name by replacing all underscores with hyphens and removing any trailing underscores. You can pass in strings, functions or regular expression objects as the values to match. For example:
    # Match form with the exact action specified
    br.select_form(action='http://foo.com/submit.php')
    # Match form with a class attribute that contains 'login'
    br.select_form(class_=lambda x: 'login' in x)
    # Match form with a data-form-type attribute that matches a regex
    br.select_form(data_form_type=re.compile(r'a|b'))
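The keyword-to-attribute naming rule can be illustrated with a short standalone sketch. Both helpers below are hypothetical and are not mechanize's own code. In particular, the text does not say whether regular expression values are applied with .search() or .match(); this sketch assumes .search().

```python
import re


def kwarg_to_attr_name(kwarg):
    """Hypothetical helper: map a Python keyword name to an HTML attribute name.

    Trailing underscores are dropped (class_ -> class) and the remaining
    underscores become hyphens (data_form_type -> data-form-type).
    """
    return kwarg.rstrip("_").replace("_", "-")


def value_matches(matcher, attr_value):
    """Hypothetical helper: test an attribute value against one matcher.

    A matcher may be a plain string (exact comparison), a function
    (called with the value), or a compiled regular expression
    (assumed here to use .search()).
    """
    if hasattr(matcher, "search"):   # compiled regular expression
        return matcher.search(attr_value) is not None
    if callable(matcher):            # predicate function
        return bool(matcher(attr_value))
    return matcher == attr_value     # plain string


# The three styles from the examples above, applied to made-up values.
exact = value_matches("http://foo.com/submit.php", "http://foo.com/submit.php")
by_function = value_matches(lambda x: "login" in x, "big login-form")
by_regex = value_matches(re.compile(r"a|b"), "b")
```

The trailing-underscore rule exists because class is a reserved word in Python, so class_ is the only way to spell that attribute as a keyword argument.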
- set_ca_data(cafile=None, capath=None, cadata=None, context=None)
  Set the SSL Context used for connecting to SSL servers.
  This method accepts the same arguments as the ssl.SSLContext.load_verify_locations() method from the Python standard library. You can also pass a pre-built ssl.SSLContext via the context keyword argument. Note that to use this feature, you must be using Python >= 2.7.9.
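A pre-built context can be created with the standard ssl module. The sketch below only builds the context; use_context() is a hypothetical helper that assumes br is a Browser instance, and the CA bundle path in the comment is a placeholder.

```python
import ssl

# A context that verifies server certificates against the system's
# default CA store and checks that the hostname matches.
context = ssl.create_default_context()

# To trust a private CA as well, load it into the context first
# (the path is a placeholder, not a real file):
# context.load_verify_locations(cafile="/path/to/private-ca.pem")


def use_context(br, context):
    """Hypothetical helper: hand the pre-built context to a Browser."""
    br.set_ca_data(context=context)
```

Passing cafile, capath or cadata directly to set_ca_data() is the shorter route when no other context settings need to change.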
- set_client_cert_manager(cert_manager)
  Set a mechanize.HTTPClientCertMgr, or None.
- set_cookie(cookie_string)
  Set a cookie.
  Note that it is NOT necessary to call this method under ordinary circumstances: cookie handling is normally entirely automatic. The intended use case is rather to simulate the setting of a cookie by client script in a web page (e.g. JavaScript). In that case, use of this method is necessary because mechanize currently does not support JavaScript, VBScript, etc.
  The cookie is added in the same way as if it had arrived with the current response, as a result of the current request. This means that, for example, if it is not appropriate to set the cookie based on the current request, no cookie will be set.
  The cookie will be returned automatically with subsequent requests made by the Browser instance whenever that’s appropriate.
  cookie_string should be a valid value of the Set-Cookie header.
  For example:
    browser.set_cookie("sid=abcdef; expires=Wednesday, 09-Nov-06 23:12:40 GMT")
  Currently, this method does not allow for adding RFC 2965 cookies. This limitation will be lifted if anybody requests it.
  See also set_simple_cookie() for an easier way to set cookies without needing to create a Set-Cookie header string.
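One way to build a valid Set-Cookie value without hand-writing the string is the standard library's cookie classes. This is a sketch for Python 3 (the module is http.cookies there); simulate_script_cookie() is a hypothetical helper that assumes browser is a Browser with a current response.

```python
from http.cookies import SimpleCookie

# Build the cookie with the standard library instead of by hand.
jar = SimpleCookie()
jar["sid"] = "abcdef"
jar["sid"]["path"] = "/"

# OutputString() gives the header value only, without "Set-Cookie: ".
cookie_string = jar["sid"].OutputString()


def simulate_script_cookie(browser, cookie_string):
    """Hypothetical helper: set a cookie as client-side script would."""
    browser.set_cookie(cookie_string)
```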
- set_cookiejar(cookiejar)
  Set a mechanize.CookieJar, or None.
- set_debug_http(handle)
  Print HTTP headers to sys.stdout.
- set_debug_redirects(handle)
  Log information about HTTP redirects (including refreshes).
  Logging is performed using module logging. The logger name is “mechanize.http_redirects”. To actually print some debug output, e.g.:
    import sys, logging
    logger = logging.getLogger("mechanize.http_redirects")
    logger.addHandler(logging.StreamHandler(sys.stdout))
    logger.setLevel(logging.INFO)
  Other logger names relevant to this module:
  - mechanize.http_responses
  - mechanize.cookies
  To turn on everything:
    import sys, logging
    logger = logging.getLogger("mechanize")
    logger.addHandler(logging.StreamHandler(sys.stdout))
    logger.setLevel(logging.INFO)
- set_debug_responses(handle)
  Log HTTP response bodies.
  See set_debug_redirects() for details of logging.
  Response objects may be .seek()able if this is set (currently returned responses are, raised HTTPError exception responses are not).
- set_handle_equiv(handle, head_parser_class=None)
  Set whether to treat HTML http-equiv headers like HTTP headers.
  Response objects may be .seek()able if this is set (currently returned responses are, raised HTTPError exception responses are not).
- set_handle_gzip(handle)
  Add header indicating to server that we handle gzip content encoding. Note that if the server sends gzip’ed content, it is handled automatically in any case, regardless of this setting.
- set_handle_redirect(handle)
  Set whether to handle HTTP 30x redirections.
- set_handle_refresh(handle, max_time=None, honor_time=True)
  Set whether to handle HTTP Refresh headers.
- set_handle_robots(handle)
  Set whether to observe rules from robots.txt.
- set_handled_schemes(schemes)
  Set sequence of URL scheme (protocol) strings.
  For example: ua.set_handled_schemes(["http", "ftp"])
  If this fails (with ValueError) because you’ve passed an unknown scheme, the set of handled schemes will not be changed.
- set_header(header, value=None)
  Convenience method to set a header value in self.addheaders so that the header is sent out with all requests automatically.
  Parameters:
  - header – The header name, e.g. User-Agent
  - value – The header value. If set to None the header is removed.
- set_html(html, url='http://example.com/')
  Set the response to a dummy response with the given HTML, and URL if given.
  Allows you to then parse that HTML, especially to extract forms information. If no URL is given, the default is “http://example.com/”.
- set_password_manager(password_manager)
  Set a mechanize.HTTPPasswordMgrWithDefaultRealm, or None.
- set_proxies(proxies=None, proxy_bypass=None)
  Configure proxy settings.
  Parameters:
  - proxies – dictionary mapping URL scheme to proxy specification. None means use the default system-specific settings.
  - proxy_bypass – function taking hostname, returning whether the proxy should be bypassed for that host. None means use the default system-specific settings.
  The default is to try to obtain proxy settings from the system (see the documentation for urllib.urlopen for information about the system-specific methods used; note that’s urllib, not urllib2).
  To avoid all use of proxies, pass an empty proxies dict.
    >>> ua = UserAgentBase()
    >>> def proxy_bypass(hostname):
    ...     return hostname == "noproxy.com"
    >>> ua.set_proxies(
    ...     {"http": "joe:password@myproxy.example.com:3128",
    ...      "ftp": "proxy.example.com"},
    ...     proxy_bypass)
- set_proxy_password_manager(password_manager)
  Set a mechanize.HTTPProxyPasswordMgr, or None.
- set_request_gzip(handle)
  Add header indicating to server that we handle gzip content encoding. Note that if the server sends gzip’ed content, it is handled automatically in any case, regardless of this setting.
- set_response(response)
  Replace current response with (a copy of) response.
  response may be None.
  This is intended mostly for HTML-preprocessing.
- set_simple_cookie(name, value, domain, path='/')
  Similar to set_cookie() except that instead of using a cookie string, you simply specify the name, value, domain and optionally the path. The created cookie will never expire. For example:
    browser.set_simple_cookie('some_key', 'some_value', '.example.com', path='/some-page')
- submit(*args, **kwds)
  Submit current form.
  Arguments are as for mechanize.HTMLForm.click().
  Return value is same as for open().
- visit_response(response, request=None)
  Visit the response, as if it had been open()ed.
  Unlike set_response(), this updates history rather than replacing the current response.
The Request
- class mechanize.Request(url, data=None, headers={}, origin_req_host=None, unverifiable=False, visit=None, timeout=<object object>, method=None)
  A request for some network resource. Note that if you specify the method as ‘GET’ and the data as a dict, then it will be automatically appended to the URL. If you leave method as None and supply data, then the method will be auto-set to POST and the data will become part of the POST request.
  Parameters:
  - url (str) – The URL to request
  - data – Data to send with this request. Can be either a dictionary, which will be encoded and sent as application/x-www-form-urlencoded data, or a bytestring, which will be sent as is. If you use a bytestring you should also set the Content-Type header appropriately.
  - headers (dict) – Headers to send with this request
  - method (str) – Method to use for HTTP requests. If not specified mechanize will choose GET or POST automatically as appropriate.
  - timeout (float) – Timeout in seconds
  The remaining arguments are for internal use.
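What "encoded as application/x-www-form-urlencoded" means for a dictionary can be shown with the standard library alone. This is a Python 3 sketch of the encoding itself, not of mechanize's internals; the URL and field names are made up, and how mechanize merges the data with an existing query string is not covered here.

```python
from urllib.parse import urlencode

# Made-up form fields.
form_data = {"q": "mechanize", "page": "2"}

# The same encoding serves both cases described above.
encoded = urlencode(form_data)

# method='GET' with dict data: the encoded pairs end up in the URL.
get_url = "http://example.com/search?" + encoded

# method left as None with data supplied: the pairs form the POST body.
post_body = encoded.encode("ascii")

# When passing a bytestring yourself, set the matching Content-Type.
headers = {"Content-Type": "application/x-www-form-urlencoded"}
```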
- add_data(data)
  Set the data (a bytestring) to be sent with this request
- add_header(key, val=None)
  Add the specified header, replacing existing one, if needed. If val is None, remove the header.
- add_unredirected_header(key, val)
  Same as add_header() except that this header will not be sent for redirected requests.
The Response
Response objects in mechanize are seek()able file-like objects that support some additional methods, depending on the protocol used for the connection. The documentation below is for HTTP(s) responses, as these are the most common.
Additional methods present for HTTP responses:
- class mechanize._mechanize.HTTPResponse
- code
  The HTTP status code
- getcode()
  Return HTTP status code
- geturl()
  Return the URL of the resource retrieved, commonly used to determine if a redirect was followed
- get_all_header_names(normalize=True)
  Return a list of all header names. When normalize is True, the case of the header names is normalized.
- get_all_header_values(name, normalize=True)
  Return a list of all values for the specified header name (which is case-insensitive). Since headers in HTTP can be specified multiple times, the returned value is always a list. See rfc822.Message.getheaders().
- info()
  Return the headers of the response as a rfc822.Message instance.
- __getitem__(header_name)
  Return the last HTTP header matching the specified name as a string. mechanize Response objects act like dictionaries for convenient access to header values. For example: response['Date']. You can access header values using the header names, case-insensitively. Note that when more than one header with the same name is present, only the value of the last header is returned; use get_all_header_values() to get the values of all headers.
- get(header_name, default=None)
  Return the header value for the specified header_name or default if the header is not present. See __getitem__().
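The lookup rules described above (case-insensitive names, the last value winning for single-value access, a list for multi-value access) can be modelled in a few lines. This is a toy model over a made-up header list, not mechanize's implementation; header_report() is a hypothetical helper that uses only the methods documented above and assumes response is a mechanize HTTP response.

```python
# Made-up headers of one response, in arrival order.
raw_headers = [
    ("Set-Cookie", "a=1"),
    ("Content-Type", "text/html"),
    ("set-cookie", "b=2"),
]


def all_values(name):
    """Model of get_all_header_values(): every value, any case of name."""
    wanted = name.lower()
    return [value for key, value in raw_headers if key.lower() == wanted]


def get(name, default=None):
    """Model of get(): the last matching value, or default if absent."""
    values = all_values(name)
    return values[-1] if values else default


def header_report(response):
    """Hypothetical helper: map each header name to all of its values."""
    return {
        name: response.get_all_header_values(name)
        for name in response.get_all_header_names()
    }
```

Headers that legitimately repeat, such as Set-Cookie, are the main reason to prefer get_all_header_values() over dictionary-style access.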
Miscellaneous
- class mechanize.Link(base_url, url, text, tag, attrs)
  A link in an HTML document
  Variables:
  - absolute_url – The absolutized link URL
  - url – The link URL
  - base_url – The base URL against which this link is resolved
  - text – The link text
  - tag – The link tag name
  - attrs – The tag attributes
- class mechanize.History
  Though this will become public, the implied interface is not yet stable.
- mechanize._html.content_parser(data, url=None, response_info=None, transport_encoding=None, default_encoding='utf-8', is_html=True)
  Parse data (a bytes object) into an etree representation such as xml.etree.ElementTree or lxml.etree
  Parameters:
  - data (bytes) – The data to parse
  - url – The URL of the document being parsed or None
  - response_info – Information about the document (contains all HTTP headers as HTTPMessage)
  - transport_encoding – The character encoding for the document being parsed as specified in the HTTP headers or None.
  - default_encoding – The character encoding to use if no encoding could be detected and no transport_encoding is specified
  - is_html – If the document is to be parsed as HTML.
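A replacement parser has to accept the same arguments. The sketch below is a deliberately strict stand-in built on xml.etree.ElementTree: it only handles well-formed XHTML, ignores url, response_info and is_html, and returns the root element. Whether mechanize expects the root element or a tree object is not stated above, so treat that choice as an assumption; strict_xhtml_parser is a made-up name.

```python
import xml.etree.ElementTree as ET


def strict_xhtml_parser(data, url=None, response_info=None,
                        transport_encoding=None, default_encoding="utf-8",
                        is_html=True):
    """Hypothetical content parser for well-formed XHTML only.

    url, response_info and is_html are accepted for interface
    compatibility but are not used in this sketch.
    """
    # Prefer the encoding announced by the transport, else the default.
    encoding = transport_encoding or default_encoding
    text = data.decode(encoding, "replace")
    # Raises xml.etree.ElementTree.ParseError on malformed markup.
    return ET.fromstring(text)


# Exercise the parser on a small, well-formed document.
root = strict_xhtml_parser(
    b"<html><body><a href='/x'>link</a></body></html>")
hrefs = [a.get("href") for a in root.iter("a")]
```

Such a function would be passed as Browser(content_parser=strict_xhtml_parser). Real-world HTML is rarely well-formed, which is why a tolerant HTML parser is normally the better choice.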