Selectors point to the specific tags on a web page from which content is to be extracted. In Scrapple, selectors are implemented as selector classes, which define methods to extract content using a given selector expression and to extract links from anchor tags for the crawler to follow.
Scrapple supports two selector types: XPath expressions and CSS selector expressions.
These selector types are implemented through the XpathSelector and CssSelector classes, respectively. Both classes use the Selector class as their superclass.
In the superclass, the URL of the web page to be loaded is validated, ensuring that the scheme has been specified and that the URL is valid. An HTTP GET request is made to load the web page, and the HTML content of the fetched page is used to generate the element tree. This element tree is then parsed to extract the necessary content.
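The validation step can be sketched with the standard library alone. This is a minimal illustration, not Scrapple's own code: the function name `validate_url` and the restriction to `http`/`https` are assumptions made for the example, and the GET request and tree construction are left out.

```python
from urllib.parse import urlparse


def validate_url(url):
    """Return the URL unchanged if it has a scheme and a host, else raise ValueError.

    Hypothetical helper: accepting only http/https is an assumption of this sketch.
    """
    parts = urlparse(url)
    if parts.scheme not in ("http", "https"):
        raise ValueError("URL must start with http:// or https://: %r" % url)
    if not parts.netloc:
        raise ValueError("URL has no host: %r" % url)
    return url


# A URL with a scheme and a host passes through untouched.
checked = validate_url("http://example.com/page")
```

A URL such as `example.com/page` is rejected here because `urlparse` finds no scheme in it, which mirrors the "scheme has been specified" check described above.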
The XpathSelector class implements selectors that use XPath expressions.
The extract_content method performs the content extraction for the given XPath expression.
The XPath selector expression can be used to extract content from the element tree corresponding to the fetched web page.
If the selector is “url”, the URL of the current web page is returned. Otherwise, the selector expression is used to extract content. The particular attribute to be extracted (“text”, “href”, etc.) is specified in the method arguments, and this is used to extract the required content. If the content extracted is a link (from an attr value of “href” or “src”), the URL is parsed to convert the relative path into an absolute path.
If the selector does not fetch any content, the default value is returned. If no default value is specified, an exception is raised.
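Parameters: | selector – The XPath expression, or “url” for the URL of the current page; attr – The attribute to be extracted (“text”, “href”, etc.); default – The value returned if the selector does not fetch any content |
---|---|
Returns: | The extracted content |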
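The extraction rules described above can be sketched as follows. This is an illustration under stated assumptions rather than Scrapple's implementation: it uses the standard library's `xml.etree.ElementTree` (which supports only a subset of XPath) as a stand-in for the real element tree, it looks only at the first matching tag, and the free function, the `_MISSING` sentinel, `PAGE_URL` and the sample markup are all hypothetical.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

PAGE_URL = "http://example.com/blog/index.html"  # hypothetical page URL
PAGE = ET.fromstring(
    "<html><body>"
    "<h1>Latest posts</h1>"
    "<a class='next' href='page2.html'>Next</a>"
    "</body></html>"
)

_MISSING = object()  # sentinel: tells "no default given" apart from default=None


def extract_content(tree, page_url, selector, attr, default=_MISSING):
    # The special selector "url" returns the URL of the current page.
    if selector == "url":
        return page_url

    tag = tree.find(selector)  # first match only, a simplification
    if tag is None:
        value = None
    elif attr == "text":
        value = (tag.text or "").strip() or None
    else:
        value = tag.get(attr)

    # Nothing fetched: fall back to the default, or fail if none was given.
    if value is None:
        if default is _MISSING:
            raise LookupError("no content for selector %r" % selector)
        return default

    # Links are converted from relative paths to absolute URLs.
    if attr in ("href", "src"):
        value = urljoin(page_url, value)
    return value


title = extract_content(PAGE, PAGE_URL, ".//h1", "text")
next_page = extract_content(PAGE, PAGE_URL, ".//a[@class='next']", "href")
subtitle = extract_content(PAGE, PAGE_URL, ".//h2", "text", default="n/a")
```

Here `next_page` comes back as an absolute URL because the relative `href` is joined against the page URL, and `subtitle` falls back to the default since the page has no `h2` tag.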
The extract_links method performs the link extraction for the crawler.
The selector passed as the argument points to the anchor tags that the crawler should follow. A list of links is obtained and iterated through. Each relative path is converted into an absolute path, an XpathSelector object is created with the URL of the next page as the argument, and this object is yielded.
The extract_links method thus generates XpathSelector objects for all of the links to be crawled through.
Parameters: | selector – The selector for the anchor tags to be crawled through |
---|---|
Returns: | An XpathSelector object for every page to be crawled through |
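The generator pattern behind extract_links can be sketched as below. The `PageSelector` class is a hypothetical stand-in for a selector class: it takes an already parsed tree instead of fetching the page, and it uses the standard library's `xml.etree.ElementTree` rather than the library's real element tree.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin


class PageSelector:
    """Hypothetical stand-in for a selector class; it performs no HTTP request."""

    def __init__(self, url, tree=None):
        self.url = url
        self.tree = tree  # a real selector would fetch and parse the page here

    def extract_links(self, selector):
        # Walk the anchor tags matched by the selector expression.
        for anchor in self.tree.findall(selector):
            href = anchor.get("href")
            if href:
                # Relative path -> absolute URL, then one new object per link.
                yield PageSelector(urljoin(self.url, href))


index = PageSelector(
    "http://example.com/catalog/",
    ET.fromstring(
        "<div><a href='item/1'>One</a><a href='/about'>About</a><a>No link</a></div>"
    ),
)
urls = [page.url for page in index.extract_links(".//a")]
```

Because the method is a generator, no object is created until the crawler asks for the next page, so a page with many links does not trigger many requests up front.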
The CssSelector class implements selectors that use CSS selector expressions.
The extract_content method performs the content extraction for the given CSS selector.
The cssselect library is used to handle CSS selector expressions. XPath expressions execute faster, so the given CSS selector expression is translated into the corresponding XPath expression by the CSSSelector class (lxml.cssselect.CSSSelector, which relies on the cssselect library). The resulting selector can then be used to extract content from the element tree corresponding to the fetched web page.
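To make the idea of translating a CSS selector into XPath concrete, here is a toy translator for a very small CSS subset (tag names, `#id`, `.class` and the descendant combinator). It is not what cssselect does internally: the function `css_to_xpath` is hypothetical, class matching is exact rather than token-based, and the result is evaluated with the standard library's `xml.etree.ElementTree` so the sketch has no third-party dependencies.

```python
import re
import xml.etree.ElementTree as ET

# One compound selector: optional tag, optional #id, optional .class
_COMPOUND = re.compile(r"([a-zA-Z][\w-]*|\*)?(?:#([\w-]+))?(?:\.([\w-]+))?")


def css_to_xpath(css):
    """Translate e.g. 'div.content a' into an ElementTree-compatible XPath."""
    steps = []
    for part in css.split():  # whitespace is the descendant combinator
        match = _COMPOUND.fullmatch(part)
        if match is None:
            raise ValueError("unsupported selector part: %r" % part)
        tag, id_, cls = match.groups()
        step = tag or "*"
        if id_:
            step += "[@id='%s']" % id_
        if cls:
            step += "[@class='%s']" % cls  # exact match only, a simplification
        steps.append(step)
    return ".//" + "//".join(steps)


doc = ET.fromstring(
    "<html><body>"
    "<div class='content'><p><a href='/x'>X</a></p></div>"
    "<a href='/y'>Y</a>"
    "</body></html>"
)
xpath = css_to_xpath("div.content a")
hrefs = [a.get("href") for a in doc.findall(xpath)]
```

Once the CSS expression has been turned into an XPath string, the same tree query machinery serves both selector types, which is why the CSS-based methods mirror the XPath-based ones.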
If the selector is “url”, the URL of the current web page is returned. Otherwise, the selector expression is used to extract content. The particular attribute to be extracted (“text”, “href”, etc.) is specified in the method arguments, and this is used to extract the required content. If the content extracted is a link (from an attr value of “href” or “src”), the URL is parsed to convert the relative path into an absolute path.
If the selector does not fetch any content, the default value is returned. If no default value is specified, an exception is raised.
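Parameters: | selector – The CSS selector expression, or “url” for the URL of the current page; attr – The attribute to be extracted (“text”, “href”, etc.); default – The value returned if the selector does not fetch any content |
---|---|
Returns: | The extracted content |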
The extract_links method performs the link extraction for the crawler implementation.
As in the extract_content method, the cssselect library is used to translate the CSS selector expression into an XPath expression.
The selector passed as the argument is a selector to point to the anchor tags that the crawler should pass through. A list of links is obtained, and the links are iterated through. The relative paths are converted into absolute paths and a CssSelector object is created with the URL of the next page as the argument and this created object is yielded.
The extract_links method thus generates CssSelector objects for all of the links to be crawled through.
Parameters: | selector – The selector for the anchor tags to be crawled through |
---|---|
Returns: | A CssSelector object for every page to be crawled through |