- AdvancedHTMLParser.Parser.AdvancedHTMLParser(HTMLParser.HTMLParser)
  - ValidatingAdvancedHTMLParser
- AdvancedHTMLParser.exceptions.HTMLValidationException(exceptions.Exception)
  - AdvancedHTMLParser.exceptions.InvalidAttributeNameException
  - AdvancedHTMLParser.exceptions.InvalidCloseException
  - AdvancedHTMLParser.exceptions.MissedCloseException
class ValidatingAdvancedHTMLParser(AdvancedHTMLParser.Parser.AdvancedHTMLParser)
ValidatingAdvancedHTMLParser - A parser which raises exceptions for a couple of HTML errors that would otherwise cause
an assumption to be made during parsing.
exceptions.InvalidCloseException - The parsed string/file tried to close something it shouldn't have.
exceptions.MissedCloseException - The parsed string/file missed closing an item.
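The two close-related errors can be pictured as checks against a stack of open tags. The following is a minimal sketch built on the Python 3 standard library's html.parser, not this class's actual implementation. InvalidCloseError, MissedCloseError, StrictCloseParser, check, and VOID_TAGS are hypothetical names used only for illustration, and the exact conditions under which the real parser raises each exception may differ.

```python
from html.parser import HTMLParser


class InvalidCloseError(Exception):
    """Hypothetical stand-in for InvalidCloseException."""


class MissedCloseError(Exception):
    """Hypothetical stand-in for MissedCloseException."""


# Void elements never take a closing tag, so they are not tracked.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class StrictCloseParser(HTMLParser):
    """Tracks open tags and raises instead of guessing the intended nesting."""

    def __init__(self):
        super().__init__()
        self.open_tags = []

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if tag not in self.open_tags:
            # Closing something that was never opened.
            raise InvalidCloseError("</%s> closes a tag that is not open" % tag)
        if self.open_tags[-1] != tag:
            # An inner tag is still open when its ancestor is closed.
            raise MissedCloseError(
                "<%s> was not closed before </%s>" % (self.open_tags[-1], tag))
        self.open_tags.pop()


def check(html):
    """Parse html and return the list of tags still open at the end."""
    parser = StrictCloseParser()
    parser.feed(html)
    parser.close()
    return parser.open_tags
```

In this sketch, check("<div></span></div>") raises InvalidCloseError and check("<div><span></div>") raises MissedCloseError, while well-nested input returns an empty list.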
- Method resolution order:
- ValidatingAdvancedHTMLParser
- AdvancedHTMLParser.Parser.AdvancedHTMLParser
- HTMLParser.HTMLParser
- markupbase.ParserBase
Methods defined here:
- handle_endtag(self, tagName)
- Internal for parsing
- handle_starttag(self, tagName, attributeList, isSelfClosing=False)
- handle_starttag - internal for parsing,
ValidatingAdvancedHTMLParser will run through the attributes list and make sure
none have an invalid name, or will raise an error.
@raises - InvalidAttributeNameException if an attribute name is passed with invalid character(s)
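The attribute check described for handle_starttag can be sketched as follows, again using only the Python 3 standard library. The pattern for a valid name is an assumption made for this example; the rule actually applied by ValidatingAdvancedHTMLParser is not stated here. InvalidAttributeNameError, VALID_ATTR_NAME, check_attribute_names, and AttrCheckingParser are hypothetical names.

```python
import re
from html.parser import HTMLParser


class InvalidAttributeNameError(Exception):
    """Hypothetical stand-in for InvalidAttributeNameException."""


# Assumption for illustration: a name starts with a letter, "_" or ":" and
# continues with letters, digits, "_", ":", "." or "-".
VALID_ATTR_NAME = re.compile(r"[A-Za-z_:][A-Za-z0-9_:.\-]*")


def check_attribute_names(attrs):
    """Raise if any (name, value) pair has a name outside the allowed pattern."""
    for name, _value in attrs:
        if not VALID_ATTR_NAME.fullmatch(name):
            raise InvalidAttributeNameError("invalid attribute name: %r" % name)


class AttrCheckingParser(HTMLParser):
    """Validates attribute names on every start tag before recording the tag."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        check_attribute_names(attrs)
        self.tags.append(tag)
```

Keeping the check in a separate function makes the rule easy to test without feeding a whole document through the parser.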
Methods inherited from AdvancedHTMLParser.Parser.AdvancedHTMLParser:
- __contains__(self, other)
- __getstate__(self)
- __getstate__ - Get state for pickling
@return <dict>
- __init__(self, filename=None, encoding='utf-8')
- __init__ - Creates an Advanced HTML parser object. For read-only parsing, consider IndexedAdvancedHTMLParser for faster searching.
@param filename <str> - Optional filename to parse. Otherwise use parseFile or parseStr methods.
@param encoding <str> - Specifies the document encoding. Default utf-8
- __setstate__(self, state)
- __setstate__ - Restore state for loading pickle
@param state <dict> - The state
- asHTML = getHTML(self)
- getHTML - Get the full HTML as contained within this tree.
If parsed from a document, this will contain the original whitespacing.
@returns - <str> of html
@see getFormattedHTML
@see getMiniHTML
- contains(self, em)
- Checks if #em is found anywhere within this element tree
@param em <AdvancedTag> - Tag of interest
@return <bool> - If element #em is within this tree
- containsUid(self, uid)
- Check if #uid is found anywhere within this element tree
@param uid <uuid.UUID> - Uid
@return <bool> - If #uid is found within this tree
- createElement(self, tagName)
- createElement - Create an unattached tag with the given tag name
@param tagName <str> - Name of tag
@return <AdvancedTag> - A tag with the given tag name
- evaluate(self, xpathExprStr, whichDoc=None)
- evaluate - Evaluate an xpath expression against this document
@param xpathExprStr <str> - An XPath expression string (e.g. """//div[@name="someName"]/span[3]""" )
@param whichDoc <None/Parser.AdvancedHTMLParser> Default None - Which document.
NOTE: This is for compatibility with the JS DOM interface.
This must be None (Default) to refer to the current document, or "self" to refer to the same.
May allow other values in the future.
@return <TagCollection> - TagCollection of all matching elements
NOTE: JS DOM returns an iterable object for this function's return. May in the future match that interface.
For now the XPath engine does not run off a generator, so this will likely at first be a wrapper for the sake of interface compatibility
@see AdvancedHTMLParser.xpath.XPathExpression.evaluate for @throws and similar
- feed(self, contents)
- feed - Feed contents. Use parseStr or parseFile instead.
@param contents - Contents
- filter(self, **kwargs)
- filter aka filterAnd - Filter ALL the elements in this DOM.
Results must match ALL of the filter criteria. For ANY, use the *Or methods
Requires the QueryableList module to be installed (i.e. AdvancedHTMLParser was installed
without '--no-deps' flag.)
For alternative without QueryableList,
consider #AdvancedHTMLParser.AdvancedHTMLParser.find method or the getElement* methods
Special Keys:
tagname - The tag name
text - The inner text
@return TagCollection<AdvancedTag>
- filterAnd = filter(self, **kwargs)
- filter aka filterAnd - Filter ALL the elements in this DOM.
Results must match ALL of the filter criteria. For ANY, use the *Or methods
Requires the QueryableList module to be installed (i.e. AdvancedHTMLParser was installed
without '--no-deps' flag.)
For alternative without QueryableList,
consider #AdvancedHTMLParser.AdvancedHTMLParser.find method or the getElement* methods
Special Keys:
tagname - The tag name
text - The inner text
@return TagCollection<AdvancedTag>
- filterOr(self, **kwargs)
- filterOr - Perform a filter operation on this node and all children (and their children, recursively)
Results must match ANY of the filter criteria. For ALL, use the *And methods
For special filter keys, @see #AdvancedHTMLParser.AdvancedHTMLParser.filter
Requires the QueryableList module to be installed (i.e. AdvancedHTMLParser was installed
without '--no-deps' flag.)
For alternative, consider AdvancedHTMLParser.AdvancedHTMLParser.find method or the getElement* methods
@return TagCollection<AdvancedTag>
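The difference between the ALL semantics of filter/filterAnd and the ANY semantics of filterOr can be illustrated with plain dictionaries standing in for tags. This is only a sketch of the matching rule, not of QueryableList or of this library's code. The functions filter_and and filter_or and the sample data are hypothetical.

```python
def filter_and(elements, **criteria):
    """Keep elements for which every key/value pair in criteria matches."""
    return [e for e in elements
            if all(e.get(key) == value for key, value in criteria.items())]


def filter_or(elements, **criteria):
    """Keep elements for which at least one key/value pair in criteria matches."""
    return [e for e in elements
            if any(e.get(key) == value for key, value in criteria.items())]


# Sample "tags" represented as dicts; tagname mirrors the special key above.
ELEMENTS = [
    {"tagname": "span", "name": "blah"},
    {"tagname": "div", "name": "blah"},
    {"tagname": "span", "name": "other"},
]

both = filter_and(ELEMENTS, tagname="span", name="blah")
either = filter_or(ELEMENTS, tagname="span", name="blah")
```

Here both holds only the first element, while either holds all three, because each element matches at least one criterion.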
- find(self, **kwargs)
- find - Perform a search of elements using attributes as keys and potential values as values
(i.e. parser.find(name='blah', tagname='span') will return all elements in this document
with the name "blah" of the tag type "span" )
Arguments are key = value, or key can equal a tuple/list of values to match ANY of those values.
Append __contains to a key to test whether a str (or any of several possible strs) is contained within the element's value for that key
Append __icontains to a key to perform the same __contains test, but ignoring case
Special keys:
tagname - The tag name of the element
text - The text within an element
NOTE: Empty string means both "not set" and "no value" in this implementation.
NOTE: If you installed the QueryableList module (i.e. ran setup.py without --no-deps) it is
better to use the "filter"/"filterAnd" or "filterOr" methods, which are also available
on all tags and tag collections (tag collections also have filterAllAnd and filterAllOr)
@return TagCollection<AdvancedTag> - A list of tags that matched the filter criteria
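The key conventions described for find (exact match, a tuple/list meaning "any of these", and the __contains / __icontains suffixes) can be sketched over plain dictionaries. This is an illustration of the described behaviour under the assumption that all keys must match, not the library's implementation. The function names and sample data are hypothetical.

```python
def _value_matches(actual, expected, op):
    """Compare one actual value with one expected value using the given op."""
    actual = actual or ""  # empty string means both "not set" and "no value"
    if op == "contains":
        return expected in actual
    if op == "icontains":
        return expected.lower() in actual.lower()
    return actual == expected


def find(elements, **kwargs):
    """Return elements matching every key; list/tuple values match ANY item."""
    results = []
    for element in elements:
        for key, expected in kwargs.items():
            # "text__icontains" -> name "text", op "icontains"; no suffix -> op ""
            name, _sep, op = key.partition("__")
            candidates = expected if isinstance(expected, (list, tuple)) else [expected]
            if not any(_value_matches(element.get(name), c, op) for c in candidates):
                break
        else:
            results.append(element)
    return results


ELEMENTS = [
    {"tagname": "span", "name": "blah", "text": "Hello World"},
    {"tagname": "div", "name": "blah", "text": "goodbye"},
    {"tagname": "span", "name": "", "text": "hello again"},
]
```

With this data, find(ELEMENTS, name='blah', tagname='span') matches only the first element, text__contains='hello' matches only the third, and text__icontains='hello' matches the first and third.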
- getAllNodes(self)
- getAllNodes - Get every element
@return TagCollection<AdvancedTag>
- getElementById(self, _id, root='root')
- getElementById - Searches and returns the first (should only be one) element with the given ID.
@param _id <str> - A string of the id attribute.
@param root <AdvancedTag/'root'> - Search starting at a specific node, if provided. If the string 'root' [default], the root of the parsed tree will be used.
- getElementsByAttr(self, attrName, attrValue, root='root')
- getElementsByAttr - Searches the full tree for elements with a given attribute name and value combination. This is always a full scan.
@param attrName <lowercase str> - A lowercase attribute name
@param attrValue <str> - Expected value of attribute
@param root <AdvancedTag/'root'> - Search starting at a specific node, if provided. if string 'root', the root of the parsed tree will be used.
- getElementsByClassName(self, className, root='root')
- getElementsByClassName - Searches and returns all elements containing a given class name.
@param className <str> - One or more space-separated class names
@param root <AdvancedTag/'root'> - Search starting at a specific node, if provided. if string 'root' [default], the root of the parsed tree will be used.
- getElementsByName(self, name, root='root')
- getElementsByName - Searches and returns all elements with a specific name.
@param name <str> - A string of the name attribute
@param root <AdvancedTag/'root'> - Search starting at a specific node, if provided. if string 'root' [default], the root of the parsed tree will be used.
- getElementsByTagName(self, tagName, root='root')
- getElementsByTagName - Searches and returns all elements with a specific tag name.
@param tagName <lowercase str> - A lowercase string of the tag name.
@param root <AdvancedTag/'root'> - Search starting at a specific node, if provided. if string 'root', the root of the parsed tree will be used.
- getElementsByXPath = getElementsByXPathExpression(self, xpathExprStr)
- getElementsByXPathExpression - Evaluate an XPath expression string against this document
@param xpathExprStr <str> - An XPath expression string (e.g. """//div[@name="someName"]/span[3]""" )
@return <TagCollection> - TagCollection of all matching elements
@see AdvancedHTMLParser.xpath.XPathExpression.evaluate for @throws and similar
- getElementsByXPathExpression(self, xpathExprStr)
- getElementsByXPathExpression - Evaluate an XPath expression string against this document
@param xpathExprStr <str> - An XPath expression string (e.g. """//div[@name="someName"]/span[3]""" )
@return <TagCollection> - TagCollection of all matching elements
@see AdvancedHTMLParser.xpath.XPathExpression.evaluate for @throws and similar
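To show what the example expression selects, here is a small sketch using the standard library's xml.etree.ElementTree, which supports a limited XPath subset. It is not this library's XPath engine: ElementTree requires well-formed XML and a relative path (hence the leading "."), and the sample markup and the function third_span_texts are made up for this example.

```python
import xml.etree.ElementTree as ET

# Sample markup invented for this sketch; it must be well-formed XML.
MARKUP = (
    '<html><body>'
    '<div name="someName"><span>one</span><span>two</span><span>three</span></div>'
    '<div name="other"><span>x</span></div>'
    '</body></html>'
)


def third_span_texts(markup):
    """Return the text of the third <span> child of each div named someName."""
    root = ET.fromstring(markup)
    # ".//div[...]" searches all descendants; "span[3]" is the 1-based position.
    matches = root.findall(".//div[@name='someName']/span[3]")
    return [match.text for match in matches]
```

The position predicate is 1-based, so span[3] picks the third span child of each matching div; the second div is skipped because its name attribute differs.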
- getElementsCustomFilter(self, filterFunc, root='root')
- getElementsCustomFilter - Scan elements using a provided function
@param filterFunc <function>(node) - A function that takes an AdvancedTag as an argument, and returns True if some arbitrary criteria is met
@return - TagCollection of all matching elements
- getElementsWithAttrValues(self, attrName, attrValues, root='root')
- getElementsWithAttrValues - Returns elements whose attribute, named by #attrName, has one of the values in #attrValues
@param attrName <lowercase str> - A lowercase attribute name
@param attrValues set<str> - A set of all valid values.
@return - TagCollection of all matching elements
- getFirstElementCustomFilter(self, filterFunc, root='root')
- getFirstElementCustomFilter - Scan elements using a provided function, stop and return the first match.
@see getElementsCustomFilter to match multiple elements
@param filterFunc <function>(node) - A function that takes an AdvancedTag as an argument, and returns True if some arbitrary criteria is met
@return - An AdvancedTag of the node that matched, or None if no match.
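The relationship between scanning for all matches and stopping at the first match can be sketched with a tiny tree type. Node, iter_nodes, and the two functions below are hypothetical stand-ins written for this example; they do not reproduce AdvancedTag or the traversal order used by the library.

```python
class Node:
    """Hypothetical minimal tag: a name, attributes, and child nodes."""

    def __init__(self, tag, attrs=None, children=None):
        self.tag = tag
        self.attrs = attrs or {}
        self.children = children or []


def iter_nodes(node):
    """Yield node and all of its descendants, depth first."""
    yield node
    for child in node.children:
        yield from iter_nodes(child)


def get_elements_custom_filter(root, filter_func):
    """Return every node for which filter_func(node) is true."""
    return [node for node in iter_nodes(root) if filter_func(node)]


def get_first_element_custom_filter(root, filter_func):
    """Return the first matching node, or None; stops scanning at the match."""
    for node in iter_nodes(root):
        if filter_func(node):
            return node
    return None


TREE = Node("div", children=[
    Node("span", {"class": "a"}),
    Node("p", children=[Node("span", {"class": "b"})]),
])
```

Because iter_nodes is a generator, the first-match variant does not visit the rest of the tree once filter_func returns True.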
- getFormattedHTML(self, indent=' ')
- getFormattedHTML - Get a formatted XHTML version of this document, replacing the original whitespace
with a pretty-printed version
@param indent - space/tab/newline of each level of indent, or integer for how many spaces per level
@return - <str> Formatted html
@see getHTML - Get HTML with original whitespace
@see getMiniHTML - Get HTML with only functional whitespace remaining
- getHTML(self)
- getHTML - Get the full HTML as contained within this tree.
If parsed from a document, this will contain the original whitespacing.
@returns - <str> of html
@see getFormattedHTML
@see getMiniHTML
- getMiniHTML(self)
- getMiniHTML - Gets the HTML representation of this document without any pretty formatting
and disregarding original whitespace beyond the functional.
@return <str> - HTML with only functional whitespace present
- getRoot(self)
- getRoot - returns the root Tag.
NOTE: if there are multiple roots, this will be a special tag.
You may want to consider using getRootNodes instead if this
is a possible situation for you.
@return AdvancedTag
- getRootNodes(self)
- getRootNodes - Gets all objects at the "root" (first level; no parent). Use this if you may have multiple roots (not children of <html>)
Use this method to get objects, for example, in an AJAX request where <html> may not be your root.
Note: If there are multiple root nodes (i.e. no <html> at the top), getRoot will return a special tag. This function automatically
handles that, and returns all root nodes.
@return list<AdvancedTag> - A list of AdvancedTags which are at the root level of the tree.
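The multiple-root situation described above arises whenever a fragment has more than one top-level tag. The sketch below uses the Python 3 standard library's html.parser to collect the names of top-level tags by tracking nesting depth. It only illustrates the idea and assumes well-nested input; RootCollector, root_tag_names, and VOID_TAGS are hypothetical names.

```python
from html.parser import HTMLParser

# Void elements have no closing tag, so they must not increase the depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class RootCollector(HTMLParser):
    """Records the name of every tag opened at nesting depth zero."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.roots = []

    def handle_starttag(self, tag, attrs):
        if self.depth == 0:
            self.roots.append(tag)
        if tag not in VOID_TAGS:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.depth > 0:
            self.depth -= 1


def root_tag_names(html):
    """Return the tag names found at the root level of html."""
    collector = RootCollector()
    collector.feed(html)
    collector.close()
    return collector.roots
```

A full document yields a single root (html), while a fragment such as one returned by an AJAX request may yield several.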
- handle_charref(self, charRef)
- Internal for parsing
- handle_comment(self, comment)
- Internal for parsing
- handle_data(self, data)
- Internal for parsing
- handle_decl(self, decl)
- Internal for parsing
- handle_entityref(self, entity)
- Internal for parsing
- handle_startendtag(self, tagName, attributeList)
- Internal for parsing
- parseFile(self, filename)
- parseFile - Parses a file and creates the DOM tree and indexes
@param filename <str/file> - A filename string or a file object. A file object is not closed by this method; the caller must close it.
- parseStr(self, html)
- parseStr - Parses a string and creates the DOM tree and indexes.
@param html <str> - valid HTML
- setDoctype(self, newDoctype)
- setDoctype - Set the doctype for this document, or clear it.
@param newDoctype <str/None> -
If None, will clear the doctype and not return one with #getHTML
Otherwise, a string of the full doctype tag.
For example, the HTML5 doctype would be "DOCTYPE html"
- setRoot(self, root)
- Sets the root node, and reprocesses the indexes
- toHTML = getHTML(self)
- getHTML - Get the full HTML as contained within this tree.
If parsed from a document, this will contain the original whitespacing.
@returns - <str> of html
@see getFormattedHTML
@see getMiniHTML
- unknown_decl(self, decl)
- Internal for parsing
Class methods inherited from AdvancedHTMLParser.Parser.AdvancedHTMLParser:
- createBlocksFromHTML(cls, html, encoding='utf-8') from __builtin__.classobj
- createBlocksFromHTML - Returns the root level node (unless multiple nodes), and
a list of "blocks" added (text and nodes).
@return list< str/AdvancedTag > - List of blocks created. May be strings (text nodes) or AdvancedTag (tags)
NOTE:
Results may be checked by:
issubclass(block.__class__, AdvancedTag)
If True, block is a tag, otherwise, it is a text node
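The check in the note above can be wrapped in a small helper that separates tags from text nodes. In this sketch AdvancedTag is a hypothetical stand-in class defined locally so the example runs on its own; with the real library, the class imported from AdvancedHTMLParser would be used instead, and split_blocks is a made-up helper name.

```python
class AdvancedTag:
    """Hypothetical stand-in for the library's AdvancedTag class."""

    def __init__(self, tagName):
        self.tagName = tagName


def split_blocks(blocks):
    """Split a list of blocks into (tags, text_nodes), preserving order."""
    tags = []
    text_nodes = []
    for block in blocks:
        # Same test as in the note: tags are instances of AdvancedTag subclasses.
        if issubclass(block.__class__, AdvancedTag):
            tags.append(block)
        else:
            text_nodes.append(block)
    return tags, text_nodes


# A list shaped like the documented return value: strings and tags mixed.
BLOCKS = ["Hello ", AdvancedTag("b"), " world", AdvancedTag("br")]
TAGS, TEXT_NODES = split_blocks(BLOCKS)
```

The issubclass test on block.__class__ is equivalent to isinstance(block, AdvancedTag) for ordinary classes.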
- createElementFromHTML(cls, html, encoding='utf-8') from __builtin__.classobj
- createElementFromHTML - Creates an element from a string of HTML.
If this could create multiple root-level elements (children are okay),
you must use #createElementsFromHTML which returns a list of elements created.
@param html <str> - Some html data
@param encoding <str> - Encoding to use for document
@raises MultipleRootNodeException - If given html would produce multiple root-level elements (use #createElementsFromHTML instead)
@return AdvancedTag - A single AdvancedTag
NOTE: If there is text outside the tag, it will be lost.
Use createBlocksFromHTML instead if you need to retain both text and tags.
Also, if you are just appending to an existing tag, use AdvancedTag.appendInnerHTML
- createElementsFromHTML(cls, html, encoding='utf-8') from __builtin__.classobj
- createElementsFromHTML - Creates elements from the provided html, and returns a list of the root-level elements.
Children of these root-level nodes are accessible via the usual means.
@param html <str> - Some html data
@param encoding <str> - Encoding to use for document
@return list<AdvancedTag> - The root (top-level) tags from parsed html.
NOTE: If there is text outside the tags, it will be lost.
Use createBlocksFromHTML instead if you need to retain both text and tags.
Also, if you are just appending to an existing tag, use AdvancedTag.appendInnerHTML
Data descriptors inherited from AdvancedHTMLParser.Parser.AdvancedHTMLParser:
- body
- body - Get the body element
@return <AdvancedTag> - The body tag, or None if no body tag present
- forms
- forms - Return all forms associated with this document
@return <TagCollection> - All "form" elements
- head
- head - Get the head element
@return <AdvancedTag> - The head tag, or None if no head tag present
Methods inherited from HTMLParser.HTMLParser:
- check_for_whole_start_tag(self, i)
- # Internal -- check to see if we have a complete starttag; return end
# or -1 if incomplete.
- clear_cdata_mode(self)
- close(self)
- Handle any buffered data.
- error(self, message)
- get_starttag_text(self)
- Return full source of start tag: '<...>'.
- goahead(self, end)
- # Internal -- handle data as far as reasonable. May leave state
# and data to be processed by a subsequent call. If 'end' is
# true, force handling all data as if followed by EOF marker.
- handle_pi(self, data)
- # Overridable -- handle processing instruction
- parse_bogus_comment(self, i, report=1)
- # Internal -- parse bogus comment, return length or -1 if not terminated
# see http://www.w3.org/TR/html5/tokenization.html#bogus-comment-state
- parse_endtag(self, i)
- # Internal -- parse endtag, return end or -1 if incomplete
- parse_html_declaration(self, i)
- # Internal -- parse html declarations, return length or -1 if not terminated
# See w3.org/TR/html5/tokenization.html#markup-declaration-open-state
# See also parse_declaration in _markupbase
- parse_pi(self, i)
- # Internal -- parse processing instr, return end or -1 if not terminated
- parse_starttag(self, i)
- # Internal -- handle starttag, return end or -1 if not terminated
- reset(self)
- Reset this instance. Loses all unprocessed data.
- set_cdata_mode(self, elem)
- unescape(self, s)
Data and other attributes inherited from HTMLParser.HTMLParser:
- CDATA_CONTENT_ELEMENTS = ('script', 'style')
- entitydefs = None
Methods inherited from markupbase.ParserBase:
- getpos(self)
- Return current line number and offset.
- parse_comment(self, i, report=1)
- # Internal -- parse comment, return length or -1 if not terminated
- parse_declaration(self, i)
- # Internal -- parse declaration (for use by subclasses).
- parse_marked_section(self, i, report=1)
- # Internal -- parse a marked section
# Override this to handle MS-word extension syntax <![if word]>content<![endif]>
- updatepos(self, i, j)
- # Internal -- update line number and offset. This should be
# called for each piece of data exactly once, in order -- in other
# words the concatenation of all the input strings to this
# function should be exactly the entire input.