---
title: Query
keywords: fastai
sidebar: home_sidebar
nb_path: "nbs/01_query.ipynb"
---
{% raw %}
{% endraw %} {% raw %}
%load_ext autoreload
%autoreload 2
{% endraw %}

This is an interface for querying resources that can be fetched.

It covers local WARC files, the Common Crawl Index and the Internet Archive's CDX Server for finding archived resources based on URL patterns. The latter two overlap a lot with cdx_toolkit, but this is more optimised for speed and ease of use.

{% raw %}
{% endraw %}

Local WARC Files

Object

{% raw %}

class WarcFileRecord[source]

WarcFileRecord(url:str, timestamp:datetime, mime:str, status:int, path:Path, offset:int, digest:str)


{% endraw %} {% raw %}

get_warc_url[source]

get_warc_url(record:ArcWarcRecord)

{% endraw %} {% raw %}

get_warc_timestamp[source]

get_warc_timestamp(record:ArcWarcRecord)

{% endraw %} {% raw %}

get_warc_mime[source]

get_warc_mime(record:ArcWarcRecord)

{% endraw %} {% raw %}

get_warc_status[source]

get_warc_status(record:ArcWarcRecord)

{% endraw %} {% raw %}

get_warc_digest[source]

get_warc_digest(record:ArcWarcRecord)

{% endraw %} {% raw %}

class WarcFileQuery[source]

WarcFileQuery(path:Union[str, Path])

{% endraw %} {% raw %}
{% endraw %}
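Under the hood a query over a local file presumably walks the WARC with warcio's ArchiveIterator. A minimal sketch of listing the response records in a file (illustrative only, not the library's exact implementation):

{% raw %}
from warcio.archiveiterator import ArchiveIterator

def list_response_urls(path):
    # Iterate records in the (possibly gzipped) WARC and keep HTTP responses
    with open(path, 'rb') as f:
        for record in ArchiveIterator(f):
            if record.rec_type == 'response':
                yield record.rec_headers.get_header('WARC-Target-URI')
{% endraw %}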

Testing

Some test data was generated with:

wget -r -Q1M --domains skeptric.com --warc-file=skeptric --delete-after --no-directories https://skeptric.com/pagination-wayback-cdx/

See the warcio library for how to do this in Python.
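For reference, a rough equivalent in Python using warcio's capture_http (the filename and URL here are illustrative):

{% raw %}
from warcio.capture_http import capture_http
import requests  # import after capture_http so the HTTP capture hooks apply

# Record the responses of ordinary requests calls into a WARC file
with capture_http('skeptric.warc.gz'):
    requests.get('https://skeptric.com/pagination-wayback-cdx/')
{% endraw %}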

{% raw %}
test_data = '../resources/test/skeptric.warc.gz'
{% endraw %} {% raw %}
warc = WarcFileQuery(test_data)
results = warc.query()
results
[WarcFileRecord(url='https://skeptric.com/pagination-wayback-cdx/', timestamp=datetime.datetime(2021, 11, 26, 11, 28, 34), mime='text/html', status=200, path=PosixPath('../resources/test/skeptric.warc.gz'), offset=889, digest='Q6HKH563Z7HF2333QILSSSHY2K3B6NOK'),
 WarcFileRecord(url='https://skeptric.com/robots.txt', timestamp=datetime.datetime(2021, 11, 26, 11, 28, 34), mime='text/html', status=404, path=PosixPath('../resources/test/skeptric.warc.gz'), offset=5804, digest='QRNGXIUXE4LAI3XR5RVATIUX5GTB33HX'),
 WarcFileRecord(url='https://skeptric.com/style.main.min.5ea2f07be7e07e221a7112a3095b89d049b96c48b831f16f1015bf2d95d914e5.css', timestamp=datetime.datetime(2021, 11, 26, 11, 28, 35), mime='text/css', status=200, path=PosixPath('../resources/test/skeptric.warc.gz'), offset=7197, digest='LINCDTSPQGAQGZZ6LY2XFXZHG2X476H6'),
 WarcFileRecord(url='https://skeptric.com/', timestamp=datetime.datetime(2021, 11, 26, 11, 28, 36), mime='text/html', status=200, path=PosixPath('../resources/test/skeptric.warc.gz'), offset=17122, digest='JJVB3MQERHRZJCHOJNKS5VDOODXPZAV2'),
 WarcFileRecord(url='https://skeptric.com/about/', timestamp=datetime.datetime(2021, 11, 26, 11, 28, 37), mime='text/html', status=200, path=PosixPath('../resources/test/skeptric.warc.gz'), offset=125261, digest='Z5NRUTRW3XTKZDCJFDKGPJ5BWIBNQCG7'),
 WarcFileRecord(url='https://skeptric.com/tags/data', timestamp=datetime.datetime(2021, 11, 26, 11, 28, 37), mime='text/html', status=302, path=PosixPath('../resources/test/skeptric.warc.gz'), offset=129093, digest='ZZZXDZTTV2KTABRO64ESHVWFPNKB4I5H'),
 WarcFileRecord(url='https://skeptric.com/tags/data/', timestamp=datetime.datetime(2021, 11, 26, 11, 28, 38), mime='text/html', status=200, path=PosixPath('../resources/test/skeptric.warc.gz'), offset=130269, digest='R7CLAACFU5L7T5LKI5G53RZSMCNUNV6F'),
 WarcFileRecord(url='https://skeptric.com/images/wayback_empty_returns.png', timestamp=datetime.datetime(2021, 11, 26, 11, 28, 38), mime='image/png', status=200, path=PosixPath('../resources/test/skeptric.warc.gz'), offset=160971, digest='SU7JRTHNW6KFCJQFL5PMMKV33U2VLV7T'),
 WarcFileRecord(url='https://skeptric.com/searching-100b-pages-cdx', timestamp=datetime.datetime(2021, 11, 26, 11, 28, 39), mime='text/html', status=302, path=PosixPath('../resources/test/skeptric.warc.gz'), offset=173368, digest='AYVHQLVFIVGZGUYPEHX46CHMZ5NUDDBF'),
 WarcFileRecord(url='https://skeptric.com/searching-100b-pages-cdx/', timestamp=datetime.datetime(2021, 11, 26, 11, 28, 39), mime='text/html', status=200, path=PosixPath('../resources/test/skeptric.warc.gz'), offset=174558, digest='CMT3ZNELTRC7H7ICVCGAYYS6GQ2NSZGP'),
 WarcFileRecord(url='https://skeptric.com/fast-web-data-workflow/', timestamp=datetime.datetime(2021, 11, 26, 11, 28, 39), mime='text/html', status=200, path=PosixPath('../resources/test/skeptric.warc.gz'), offset=188608, digest='UMQCERJOQ3AGE3Z576ABKKKNDGSNBQAX'),
 WarcFileRecord(url='https://skeptric.com/key-web-captures/', timestamp=datetime.datetime(2021, 11, 26, 11, 28, 40), mime='text/html', status=200, path=PosixPath('../resources/test/skeptric.warc.gz'), offset=195651, digest='5GLUI5DL5PFUM6QY53NWUCMJBC2CTXKS'),
 WarcFileRecord(url='https://skeptric.com/emacs-tempfile-hugo/', timestamp=datetime.datetime(2021, 11, 26, 11, 28, 40), mime='text/html', status=200, path=PosixPath('../resources/test/skeptric.warc.gz'), offset=201243, digest='JY2UE6SKCYPHXSUN5IR6SQVMZRRAGU64')]
{% endraw %}

Try fetching a record

{% raw %}
image_record = [r for r in results if r.mime == 'image/png'][0]

content = image_record.content
from IPython.display import Image
Image(content)
{% endraw %}

Check the digests

{% raw %}
for result in results:
    assert result.digest == sha1_digest(result.get_content())
{% endraw %}
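The digests follow the usual WARC/CDX convention: a base32-encoded SHA-1 of the payload. A sketch of what sha1_digest is assumed to compute:

{% raw %}
import base64, hashlib

def sha1_digest_sketch(content: bytes) -> str:
    # Base32-encoded SHA-1, the format used in WARC and CDX digest fields
    return base64.b32encode(hashlib.sha1(content).digest()).decode('ascii')
{% endraw %}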

Internet Archive

Querying CDX

The Internet Archive runs its own Java CDX Server. Hacker Noon has an article that gives a good overview of querying it.

See the documentation for all the parameters; here are the ones I find most interesting:

  • url: The URL to query; a wildcard * implicitly sets the match type (so we don't need matchType)
  • output: "json" returns JSON instead of space-separated text
  • fl: Comma-separated list of fields to return
  • from, to: Date filtering, 1-14 digits of yyyyMMddhhmmss, inclusive (truncated dates seem to do the right thing)
  • filter: [!]field:regex filters a returned field with a Java regex, inverting with !
  • limit/offset: For getting a small number of results (sampling). The internal limit is 150,000 results.

Default fields returned: ["urlkey","timestamp","original","mimetype","statuscode","digest","length"]

Pagination isn't really useful because it's applied before filtering (including date filtering) so most pages are empty with a date filter. Unpaginated requests are fast enough anyway.
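To make these parameters concrete, here is a direct request against the CDX endpoint using requests (the parameter values are illustrative; query_wayback_cdx below wraps this up):

{% raw %}
import requests

resp = requests.get(
    'http://web.archive.org/cdx/search/cdx',
    params={
        'url': 'skeptric.com/*',        # wildcard implies prefix matching
        'output': 'json',               # JSON rows instead of space-separated text
        'fl': 'timestamp,original,statuscode,mimetype,digest',
        'from': '2020', 'to': '2020',   # truncated dates, inclusive
        'filter': 'statuscode:200',
        'limit': 10,
    })
resp.raise_for_status()
rows = resp.json()                      # first row is the header, the rest are data
{% endraw %}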

Exporting the data

The data is a JSON array of arrays; the first row contains the headers and the subsequent rows the data. Let's transform it into a list of dictionaries mapping keys to values, which makes it a bit easier to work with (although less efficient in memory).

An alternative would be to use something like Pandas directly.

{% raw %}

header_and_rows_to_dict[source]

header_and_rows_to_dict(rows:Iterable[list[Any]])

{% endraw %} {% raw %}
{% endraw %}
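A minimal sketch of what header_and_rows_to_dict does (the real implementation may differ): pair the header row with each data row.

{% raw %}
from typing import Iterable

def header_and_rows_to_dict_sketch(rows: Iterable[list]) -> list[dict]:
    rows = iter(rows)
    header = next(rows, None)       # first row holds the column names
    if header is None:
        return []
    return [dict(zip(header, row)) for row in rows]
{% endraw %}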

Check it on some data

{% raw %}
assert header_and_rows_to_dict([['col_1', 'col_2'], [1, 'a'], [2, 'b']]) == [
    {'col_1': 1, 'col_2': 'a'},
    {'col_1': 2, 'col_2': 'b'}]
{% endraw %} {% raw %}
assert header_and_rows_to_dict([['col_1', 'col_2']]) == []
{% endraw %} {% raw %}
assert header_and_rows_to_dict([]) == []
{% endraw %}

Making a query

{% raw %}

mimetypes_to_regex[source]

mimetypes_to_regex(mime:list[str], prefix='mimetype:')

{% endraw %} {% raw %}

query_wayback_cdx[source]

query_wayback_cdx(url:str, start:Optional[str], end:Optional[str], status_ok:bool=True, mime:Optional[Union[str, Iterable[str]]]=None, limit:Optional[int]=None, offset:Optional[int]=None, session:Optional[Session]=None)

Get references to Wayback Machine Captures for url.

Queries the Internet Archive Capture Index (CDX) for url.

Arguments:

  • start: Minimum date in format YYYYmmddHHMMSS (or any substring), inclusive
  • end: Maximum date in format YYYYmmddHHMMSS (or any substring), inclusive
  • status_ok: Only return captures with an HTTP status of 200
  • mime: Filter on mimetypes, '' is a wildcard (e.g. 'image/')
  • limit: Only return the first limit records
  • offset: Skip the first offset records, combine with limit
  • session: Session to use when making requests

Filters results between start and end inclusive, in format YYYYmmddHHMMSS or any substring (e.g. start="202001", end="202001" will get all captures in January 2020).
{% endraw %} {% raw %}
{% endraw %}

Test querying

{% raw %}
%%time
full_sample = query_wayback_cdx('skeptric.com/*', start=None, end=None, status_ok=False)
CPU times: user 29.7 ms, sys: 7.14 ms, total: 36.8 ms
Wall time: 1.17 s
{% endraw %}

Test statuscode filter

{% raw %}
%%time
ok_sample = query_wayback_cdx('skeptric.com/*', start=None, end=None) #status_ok=True
CPU times: user 20.1 ms, sys: 3.37 ms, total: 23.5 ms
Wall time: 928 ms
{% endraw %} {% raw %}
assert ok_sample == [x for x in full_sample if x['statuscode'] == '200']
{% endraw %}

Test mimetypes

{% raw %}
%%time
html_sample = query_wayback_cdx('skeptric.com/*', start=None, end=None, status_ok=False, mime='text/html')
CPU times: user 8.03 ms, sys: 6.29 ms, total: 14.3 ms
Wall time: 631 ms
{% endraw %} {% raw %}
assert html_sample == [x for x in full_sample if x['mimetype'] == 'text/html']
{% endraw %} {% raw %}
%%time
image_sample = query_wayback_cdx('skeptric.com/*', start=None, end=None, status_ok=False, mime='image/*')
CPU times: user 12.9 ms, sys: 3.26 ms, total: 16.2 ms
Wall time: 659 ms
{% endraw %} {% raw %}
assert image_sample == [x for x in full_sample if x['mimetype'].startswith('image/')]
{% endraw %} {% raw %}
%%time
prog_sample = query_wayback_cdx('skeptric.com/*', start=None, end=None, status_ok=False, 
                                 mime=['text/css', 'application/javascript'])
CPU times: user 7.98 ms, sys: 3.15 ms, total: 11.1 ms
Wall time: 452 ms
{% endraw %} {% raw %}
assert prog_sample == [x for x in full_sample if x['mimetype'] in ['text/css', 'application/javascript']]
{% endraw %}

Test date filters

{% raw %}
%%time
sample_2020 = query_wayback_cdx('skeptric.com/*', start='2020', end='2020', status_ok=False)
len(sample_2020)
CPU times: user 4.74 ms, sys: 3.33 ms, total: 8.08 ms
Wall time: 458 ms
106
{% endraw %} {% raw %}
assert sample_2020 == [x for x in full_sample if '2020' <= x['timestamp'] < '2021']
{% endraw %} {% raw %}
%%time
sample_to_2020 = query_wayback_cdx('skeptric.com/*', start=None, end='2020', status_ok=False)
CPU times: user 10.6 ms, sys: 0 ns, total: 10.6 ms
Wall time: 476 ms
{% endraw %} {% raw %}
assert sample_to_2020 == [x for x in full_sample if x['timestamp'] < '2021']
{% endraw %} {% raw %}
%%time
sample_from_2020 = query_wayback_cdx('skeptric.com/*', start='2020', end=None, status_ok=False)
CPU times: user 16.2 ms, sys: 13.2 ms, total: 29.4 ms
Wall time: 1 s
{% endraw %} {% raw %}
assert sample_from_2020 == [x for x in full_sample if x['timestamp'] >= '2020']
{% endraw %}

Test limits

{% raw %}
%%time
sample_10 = query_wayback_cdx('skeptric.com/*', start=None, end=None, status_ok=False, limit=10)
CPU times: user 4.71 ms, sys: 2.66 ms, total: 7.37 ms
Wall time: 442 ms
{% endraw %} {% raw %}
sample_10 == full_sample[:10]
True
{% endraw %} {% raw %}
%%time
sample_10_offset_20 = query_wayback_cdx('skeptric.com/*', start=None, end=None, status_ok=False, limit=10, offset=20)
CPU times: user 5.54 ms, sys: 3.12 ms, total: 8.66 ms
Wall time: 434 ms
{% endraw %} {% raw %}
assert sample_10_offset_20  == full_sample[20:20+10]
{% endraw %}

Fetching Content

The original content is at http://web.archive.org/web/{timestamp}_id/{url}, and the Wayback Machine version is the same URL without _id. The Wayback Machine version makes some changes to the page for interactive viewing, so its content differs from the original capture.

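A sketch of how such a URL could be built (illustrative; wayback_url below is the library's version):

{% raw %}
def wayback_url_sketch(timestamp: str, url: str, wayback: bool = False) -> str:
    # '_id' asks for the original bytes; omitting it gives the rewritten Wayback page
    modifier = '' if wayback else '_id'
    return f'http://web.archive.org/web/{timestamp}{modifier}/{url}'
{% endraw %}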
{% raw %}

wayback_url[source]

wayback_url(timestamp:str, url:str, wayback:bool=False)

{% endraw %} {% raw %}

fetch_wayback_content[source]

fetch_wayback_content(timestamp:str, url:str, session:Optional[Session]=None)

{% endraw %} {% raw %}
{% endraw %} {% raw %}
record = image_sample[0]
record
{'urlkey': 'com,skeptric)/favicon.ico',
 'timestamp': '20180819051126',
 'original': 'http://skeptric.com/favicon.ico',
 'mimetype': 'image/vnd.microsoft.icon',
 'statuscode': '200',
 'digest': 'R6YE2GPPT4BM4IAMHGUJPDJF6BGKKHDA',
 'length': '851'}
{% endraw %} {% raw %}
content = fetch_wayback_content(record['timestamp'], record['original'])
{% endraw %} {% raw %}
from IPython.display import Image
Image(content)
{% endraw %} {% raw %}
assert record['digest'] == sha1_digest(content)
{% endraw %}

And fetching an HTML webpage

{% raw %}
record = ok_sample[0]
{% endraw %} {% raw %}
content = fetch_wayback_content(record['timestamp'], record['original'])
{% endraw %} {% raw %}
assert record['digest'] == sha1_digest(content)
{% endraw %}

Get the Wayback URL for preview

{% raw %}
URL(wayback_url(record['timestamp'], record['original'], wayback=True))
{% endraw %}

Wayback Record Object

{% raw %}

class WaybackRecord[source]

WaybackRecord(url:str, timestamp:datetime, mime:str, status:Optional[int], digest:str)


{% endraw %} {% raw %}
{% endraw %} {% raw %}
record = WaybackRecord.from_dict(full_sample[0])
record
WaybackRecord(url='http://skeptric.com/', timestamp=datetime.datetime(2018, 8, 5, 13, 27, 31), mime='text/html', status=200, digest='2N5QQYZFAM36CSESTGSDKRTRV7I5HEXJ')
{% endraw %} {% raw %}
record.preview()
{% endraw %} {% raw %}
%%time
content = record.content
content[:100]
CPU times: user 7.08 ms, sys: 0 ns, total: 7.08 ms
Wall time: 372 ms
b'<!DOCTYPE html>\n<html lang="en">\n  <head>\n    <meta charset="utf-8" />\n    <title>Edward Ross - An \xc9'
{% endraw %} {% raw %}
assert sha1_digest(content) == record.digest
{% endraw %}

Wayback Query

{% raw %}

class WaybackQuery[source]

WaybackQuery(url:str, start:Optional[str], end:Optional[str], status_ok:bool=True, mime:Optional[Union[str, Iterable[str]]]=None)


{% endraw %} {% raw %}
{% endraw %}

Test fetching one

{% raw %}
wb = WaybackQuery('skeptric.com/*', start=None, end=None)
{% endraw %} {% raw %}
%%time
items = list(wb.query())
CPU times: user 74.8 ms, sys: 6.06 ms, total: 80.9 ms
Wall time: 1.31 s
{% endraw %} {% raw %}
assert len(items) > 50
{% endraw %} {% raw %}
content = items[0].content
assert sha1_digest(content) == items[0].digest
{% endraw %}

Test fetching many

I have no idea whether Session is actually thread-safe; if it's not, we should have one session per thread. As far as I can tell the issue occurs when you have lots of hosts, so in this case it should be ok.

I guess we'll try it and see. Maybe long term we're better off going with asyncio. Using a Session makes things slightly faster; we can always turn it off by passing session=False.
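Roughly speaking, the parallel fetch is just a thread pool over the records' content property; a minimal sketch under that assumption (the library's wayback_fetch_parallel below also handles the shared session):

{% raw %}
from concurrent.futures import ThreadPoolExecutor

def fetch_all_sketch(records, threads=8):
    # Fetch each record's content concurrently, preserving input order
    with ThreadPoolExecutor(max_workers=threads) as pool:
        return list(pool.map(lambda r: r.content, records))
{% endraw %}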

{% raw %}

wayback_fetch_parallel[source]

wayback_fetch_parallel(items, threads=8, session=None)

{% endraw %} {% raw %}
{% endraw %} {% raw %}
%%time
contents_16 = WaybackRecord.fetch_parallel(items[:16], threads=8)
CPU times: user 278 ms, sys: 59.9 ms, total: 338 ms
Wall time: 3.57 s
{% endraw %} {% raw %}
assert contents_16[0] == content
{% endraw %}

Common Crawl

Common Crawl is similar but slightly different, because it is split over many indexes, roughly one per month, each covering about two weeks of crawling.

Indexes

We cache the index list within a session, but not between sessions, since the indexes change over the months.

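Common Crawl publishes its index list as collinfo.json; a sketch of fetching and caching it for the session (assumed to be roughly what get_cc_indexes does):

{% raw %}
from functools import lru_cache
import requests

@lru_cache(maxsize=1)
def get_cc_indexes_sketch():
    # Cached for the lifetime of the process only; the list changes over months
    resp = requests.get('https://index.commoncrawl.org/collinfo.json')
    resp.raise_for_status()
    return resp.json()
{% endraw %}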
{% raw %}

get_cc_indexes[source]

get_cc_indexes()

{% endraw %} {% raw %}
{% endraw %} {% raw %}
indexes = get_cc_indexes()
indexes[:3]
[{'id': 'CC-MAIN-2021-43',
  'name': 'October 2021 Index',
  'timegate': 'https://index.commoncrawl.org/CC-MAIN-2021-43/',
  'cdx-api': 'https://index.commoncrawl.org/CC-MAIN-2021-43-index'},
 {'id': 'CC-MAIN-2021-39',
  'name': 'September 2021 Index',
  'timegate': 'https://index.commoncrawl.org/CC-MAIN-2021-39/',
  'cdx-api': 'https://index.commoncrawl.org/CC-MAIN-2021-39-index'},
 {'id': 'CC-MAIN-2021-31',
  'name': 'July 2021 Index',
  'timegate': 'https://index.commoncrawl.org/CC-MAIN-2021-31/',
  'cdx-api': 'https://index.commoncrawl.org/CC-MAIN-2021-31-index'}]
{% endraw %}

Notice the oldest ones have a different format.

{% raw %}
indexes[-4:]
[{'id': 'CC-MAIN-2013-20',
  'name': 'Summer 2013 Index',
  'timegate': 'https://index.commoncrawl.org/CC-MAIN-2013-20/',
  'cdx-api': 'https://index.commoncrawl.org/CC-MAIN-2013-20-index'},
 {'id': 'CC-MAIN-2012',
  'name': 'Index of 2012 ARC files',
  'timegate': 'https://index.commoncrawl.org/CC-MAIN-2012/',
  'cdx-api': 'https://index.commoncrawl.org/CC-MAIN-2012-index'},
 {'id': 'CC-MAIN-2009-2010',
  'name': 'Index of 2009 - 2010 ARC files',
  'timegate': 'https://index.commoncrawl.org/CC-MAIN-2009-2010/',
  'cdx-api': 'https://index.commoncrawl.org/CC-MAIN-2009-2010-index'},
 {'id': 'CC-MAIN-2008-2009',
  'name': 'Index of 2008 - 2009 ARC files',
  'timegate': 'https://index.commoncrawl.org/CC-MAIN-2008-2009/',
  'cdx-api': 'https://index.commoncrawl.org/CC-MAIN-2008-2009-index'}]
{% endraw %}

Parsing the times

We can parse approximate periods for the crawls using the same logic as cdx_toolkit.

However, in reality it's more complicated.

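For recent crawls the id embeds a year and ISO week number; a rough sketch of that parse (older ids such as 'CC-MAIN-2012' or 'CC-MAIN-2009-2010' need the special handling that parse_cc_crawl_date below provides):

{% raw %}
from datetime import datetime

def approx_crawl_date(crawl_id: str) -> datetime:
    # 'CC-MAIN-2021-43' -> Monday of ISO week 43 of 2021
    year, week = crawl_id.split('-')[-2:]
    return datetime.strptime(f'{year}-{week}-1', '%G-%V-%u')
{% endraw %}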
{% raw %}

parse_cc_crawl_date[source]

parse_cc_crawl_date(crawl_id:str)

{% endraw %} {% raw %}
{% endraw %} {% raw %}
dates = [parse_cc_crawl_date(i['id']) for i in indexes]
{% endraw %} {% raw %}

cc_index_by_time[source]

cc_index_by_time(start:Optional[datetime]=None, end:Optional[datetime]=None, indexes:Optional[list[str]]=None)

Gets all indexes that may contain entries between start and end

Generally errs on the side of giving an additional index

{% endraw %} {% raw %}
{% endraw %} {% raw %}
cc_index_by_time(start=datetime(2021,5,1), end=datetime(2021,12,1))
['CC-MAIN-2021-43',
 'CC-MAIN-2021-39',
 'CC-MAIN-2021-31',
 'CC-MAIN-2021-25',
 'CC-MAIN-2021-21',
 'CC-MAIN-2021-17',
 'CC-MAIN-2021-10']
{% endraw %} {% raw %}
cc_index_by_time(end=datetime(2010,1,1))
['CC-MAIN-2009-2010', 'CC-MAIN-2008-2009']
{% endraw %} {% raw %}
cc_index_by_time(start=datetime(2021,1,1))
['CC-MAIN-2021-43',
 'CC-MAIN-2021-39',
 'CC-MAIN-2021-31',
 'CC-MAIN-2021-25',
 'CC-MAIN-2021-21',
 'CC-MAIN-2021-17',
 'CC-MAIN-2021-10',
 'CC-MAIN-2021-04',
 'CC-MAIN-2020-50']
{% endraw %}

Querying CDX

{% raw %}
cc_sample_api = 'https://index.commoncrawl.org/CC-MAIN-2021-43-index'
assert cc_sample_api in [i['cdx-api'] for i in indexes]
{% endraw %}

The Common Crawl CDX is much slower than the Wayback Machine's. It also uses the zipped CDX by default, which automatically paginates.

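The page count comes from the CDX API itself; a sketch of asking for it with showNumPages (assumed to be roughly what query_cc_cdx_num_pages does):

{% raw %}
import requests

def num_pages_sketch(api, url, page_size=5):
    # Ask the pywb-based CDX server how many pages the query will span
    resp = requests.get(api, params={'url': url, 'output': 'json',
                                     'showNumPages': 'true', 'pageSize': page_size})
    resp.raise_for_status()
    return resp.json()['pages']
{% endraw %}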
{% raw %}

jsonl_loads[source]

jsonl_loads(jsonl)

{% endraw %} {% raw %}
{% endraw %} {% raw %}
test_jline = b'{"status": "200", "mime": "text/html"}\n{"status": "301", "mime": "-"}\n'
test_jdict = [{'status': '200', 'mime': 'text/html'}, {'status': '301', 'mime': '-'}]
assert jsonl_loads(test_jline) == test_jdict
assert jsonl_loads(test_jline.rstrip()) == test_jdict
{% endraw %} {% raw %}
{% endraw %} {% raw %}

query_cc_cdx_num_pages[source]

query_cc_cdx_num_pages(api:str, url:str, page_size:int=5, session:Optional[Session]=None)

{% endraw %} {% raw %}

query_cc_cdx_page[source]

query_cc_cdx_page(api:str, url:str, page:int, start:Optional[str]=None, end:Optional[str]=None, status_ok:bool=True, mime:Optional[Union[str, Iterable[str]]]=None, limit:Optional[int]=None, offset:Optional[int]=None, page_size:int=5, session:Optional[Session]=None)

Get references to Common Crawl Captures for url.

Queries the Common Crawl Capture Index (CDX) for url.

Filters:

  • api: API endpoint to use (e.g. 'https://index.commoncrawl.org/CC-MAIN-2021-43-index')
  • start: Minimum date in format YYYYmmddHHMMSS (or any substring) inclusive
  • end: Maximum date in format YYYYmmddHHMMSS (or any substring) inclusive
  • status_ok: Only return captures with an HTTP status of 200
  • mime: Filter on mimetypes, '' is a wildcard (e.g. 'image/')
  • limit: Only return the first limit records
  • offset: Skip the first offset records, combine with limit
  • session: Session to use when making requests

Filters results between start and end inclusive, in format YYYYmmddHHMMSS or any substring (e.g. start="202001", end="202001" will get all captures in January 2020).
{% endraw %} {% raw %}
{% endraw %}

Testing pagination

{% raw %}
test_url = 'rmy.wikipedia.org/*'
{% endraw %} {% raw %}
pages = query_cc_cdx_num_pages(cc_sample_api, test_url, page_size=1)
pages
2
{% endraw %} {% raw %}
records = []
for page in range(pages):
    records += query_cc_cdx_page(cc_sample_api, test_url, page=page, mime='text/html', page_size=1)
len(records)
576
{% endraw %} {% raw %}
timestamps = [record['timestamp'] for record in records]
min(timestamps), max(timestamps)
('20211017144648', '20211026191502')
{% endraw %}

We should get an error querying past the last page (400: Invalid Request)

{% raw %}
from requests.exceptions import HTTPError

try:
    query_cc_cdx_page(cc_sample_api, test_url, page=pages, mime='text/html', page_size=1)
    raise AssertionError('Expected Failure')
except HTTPError as e:
    status = e.response.status_code
    if status != 400:
        raise AssertionError(f'Expected 400, got {status}')
{% endraw %}

Error for missing fields

For CC-MAIN-2015-11 and CC-MAIN-2015-06, both the status and mime are missing from the index, which leads to problems.

{% raw %}
api_2015_11 = next(idx['cdx-api'] for idx in get_cc_indexes() if idx['id'] == 'CC-MAIN-2015-11')
{% endraw %}

In this particular case it shows up as a 404.

{% raw %}
try:
    results = query_cc_cdx_page(api_2015_11, test_url, page=0, page_size=1)
    raise ValueError("Has it been fixed? Remove the blacklist")
except HTTPError as e:
    if e.response.status_code != 404:
        raise
{% endraw %}

Which doesn't happen when we remove the filters

{% raw %}
results = query_cc_cdx_page(api_2015_11, test_url, page=0, status_ok=False, page_size=1)
{% endraw %} {% raw %}
{% endraw %}

Downloading

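Fetching a capture boils down to an HTTP Range request for the byte span [offset, offset + length) of the WARC file in Common Crawl's public bucket, then decompressing that gzip member. A rough sketch (the endpoint is an assumption, and this returns the raw WARC record rather than just the HTTP payload; fetch_cc below is the real implementation):

{% raw %}
import gzip
import requests

def fetch_cc_sketch(filename: str, offset, length) -> bytes:
    offset, length = int(offset), int(length)
    headers = {'Range': f'bytes={offset}-{offset + length - 1}'}
    resp = requests.get(f'https://data.commoncrawl.org/{filename}', headers=headers)
    resp.raise_for_status()
    return gzip.decompress(resp.content)   # one gzip member == one WARC record
{% endraw %}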
{% raw %}

fetch_cc[source]

fetch_cc(filename:str, offset:int, length:int, session:Optional[Session]=None)

{% endraw %} {% raw %}
{% endraw %} {% raw %}
record = records[0]
record
{'urlkey': 'org,wikipedia,rmy)/wiki/%c3%93lvega',
 'timestamp': '20211019031358',
 'url': 'https://rmy.wikipedia.org/wiki/%C3%93lvega',
 'mime': 'text/html',
 'mime-detected': 'text/html',
 'status': '200',
 'digest': 'QZSQHAQSLMBHHCJSADEUS3I4U3EJALH5',
 'length': '11604',
 'offset': '641473227',
 'filename': 'crawl-data/CC-MAIN-2021-43/segments/1634323585231.62/warc/CC-MAIN-20211019012407-20211019042407-00526.warc.gz',
 'languages': 'ron,nno,vie',
 'encoding': 'UTF-8'}
{% endraw %} {% raw %}
record = {'urlkey': 'org,wikipedia,en)/?curid=3516101',
 'timestamp': '20211024051554',
 'url': 'https://en.wikipedia.org/?curid=3516101',
 'mime': 'text/html',
 'mime-detected': 'text/html',
 'status': '200',
 'digest': 'TFZFWXWKS3NFJLLLXLYU6JTNZ77D3IVD',
 'length': '8922',
 'offset': '332329771',
 'filename': 'crawl-data/CC-MAIN-2021-43/segments/1634323585911.17/warc/CC-MAIN-20211024050128-20211024080128-00689.warc.gz',
 'languages': 'eng',
 'encoding': 'UTF-8'}
{% endraw %} {% raw %}
%%time
content = fetch_cc(record['filename'], record['offset'], record['length'])
CPU times: user 28 ms, sys: 349 µs, total: 28.3 ms
Wall time: 1.09 s
{% endraw %} {% raw %}
content[:1000]
b'<!DOCTYPE html>\n<html class="client-nojs" lang="en" dir="ltr">\n<head>\n<meta charset="UTF-8"/>\n<title>Category:Tad Morose albums - Wikipedia</title>\n<script>document.documentElement.className="client-js";RLCONF={"wgBreakFrames":!1,"wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","September","October","November","December"],"wgRequestId":"362a73a3-175d-4af3-92bb-4707797d6067","wgCSPNonce":!1,"wgCanonicalNamespace":"Category","wgCanonicalSpecialPageName":!1,"wgNamespaceNumber":14,"wgPageName":"Category:Tad_Morose_albums","wgTitle":"Tad Morose albums","wgCurRevisionId":906041584,"wgRevisionId":906041584,"wgArticleId":3516101,"wgIsArticle":!0,"wgIsRedirect":!1,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["Set categories","Albums by artist","Power metal albums by Swedish artists"],"wgPageContentLanguage":"en","wgPageContentModel":"wik'
{% endraw %}

Check the digest

{% raw %}
assert sha1_digest(content) == record['digest']
{% endraw %}

Put it into an Object

{% raw %}

class CommonCrawlRecord[source]

CommonCrawlRecord(url:str, timestamp:datetime, filename:str, offset:int, length:int, mime:Optional[str], status:Optional[int], digest:Optional[str])


{% endraw %} {% raw %}
{% endraw %} {% raw %}
record
{'urlkey': 'org,wikipedia,en)/?curid=3516101',
 'timestamp': '20211024051554',
 'url': 'https://en.wikipedia.org/?curid=3516101',
 'mime': 'text/html',
 'mime-detected': 'text/html',
 'status': '200',
 'digest': 'TFZFWXWKS3NFJLLLXLYU6JTNZ77D3IVD',
 'length': '8922',
 'offset': '332329771',
 'filename': 'crawl-data/CC-MAIN-2021-43/segments/1634323585911.17/warc/CC-MAIN-20211024050128-20211024080128-00689.warc.gz',
 'languages': 'eng',
 'encoding': 'UTF-8'}
{% endraw %} {% raw %}
x = _cc_cdx_to_record(record)
x
CommonCrawlRecord(url='https://en.wikipedia.org/?curid=3516101', timestamp=datetime.datetime(2021, 10, 24, 5, 15, 54), filename='crawl-data/CC-MAIN-2021-43/segments/1634323585911.17/warc/CC-MAIN-20211024050128-20211024080128-00689.warc.gz', offset='332329771', length='8922', mime='text/html', status=200, digest='TFZFWXWKS3NFJLLLXLYU6JTNZ77D3IVD')
{% endraw %} {% raw %}
x.content[:100]
b'<!DOCTYPE html>\n<html class="client-nojs" lang="en" dir="ltr">\n<head>\n<meta charset="UTF-8"/>\n<title'
{% endraw %} {% raw %}
from tempfile import NamedTemporaryFile
f = NamedTemporaryFile('wb', dir='.', suffix='.html')
name = Path(f.name).relative_to(Path('.').absolute())
x.preview(name)
{% endraw %} {% raw %}
f.close()
{% endraw %}

Common Crawl Query

{% raw %}

class CommonCrawlQuery[source]

CommonCrawlQuery(url:str, start:Optional[str]=None, end:Optional[str]=None, apis:Optional[list[str]]=None, status_ok:bool=True, mime:Optional[Union[str, Iterable[str]]]=None)


{% endraw %} {% raw %}
{% endraw %}

Test querying on blacklisted APIs

{% raw %}
CC_API_FILTER_BLACKLIST
['CC-MAIN-2015-11', 'CC-MAIN-2015-06']
{% endraw %} {% raw %}
query_cc_all = CommonCrawlQuery('www.commoncrawl.org/*',
                           start=datetime(2015,3,1), end=datetime(2015,3,1))
{% endraw %} {% raw %}
%%time
query_cc_all.cdx_apis
CPU times: user 4.66 ms, sys: 359 µs, total: 5.02 ms
Wall time: 4.98 ms
{'CC-MAIN-2015-11': 'https://index.commoncrawl.org/CC-MAIN-2015-11-index',
 'CC-MAIN-2015-06': 'https://index.commoncrawl.org/CC-MAIN-2015-06-index'}
{% endraw %} {% raw %}
x = list(query_cc_all.query())
len(x)
13
{% endraw %}

Notice there is no mime or status. We could fill these in from the archive record.

{% raw %}
x[0]
CommonCrawlRecord(url='http://commoncrawl.org/', timestamp=datetime.datetime(2015, 3, 2, 3, 27, 5), filename='crawl-data/CC-MAIN-2015-11/segments/1424936462700.28/warc/CC-MAIN-20150226074102-00159-ip-10-28-5-156.ec2.internal.warc.gz', offset='53235662', length='2526', mime=None, status=None, digest='QE4UUUWUJWEZBBK6PUG3CHFAGEKDMDBZ')
{% endraw %} {% raw %}
x[0].content[:100]
b'\n<!doctype html>\n<html class="no-js" lang="en">\n<head>\n<meta charset="utf-8"/>\n<meta name="viewport"'
{% endraw %}

Test querying on happy case data

{% raw %}
test_url = 'upload.wikimedia.org/wikipedia/commons/*'
{% endraw %} {% raw %}
cc_archive_all = CommonCrawlQuery(test_url,
                                  apis=['CC-MAIN-2021-43'],
                                  status_ok=False)

results = list(cc_archive_all.query(page_size=5))

from collections import Counter
Counter((r.mime, r.status) for r in results)
Counter({('application/pdf', 200): 561,
         ('text/html', 404): 181,
         ('image/vnd.djvu', 200): 125,
         ('image/jpeg', 200): 65,
         ('image/png', 200): 20,
         ('unk', 301): 31,
         ('application/ogg', 200): 2,
         ('application/sla', 200): 2,
         ('image/svg+xml', 200): 4,
         ('image/gif', 200): 2,
         ('image/x-xcf', 200): 2,
         ('text/html', 412): 2,
         ('warc/revisit', 304): 1,
         ('text/html', 400): 154})
{% endraw %}

Test status

{% raw %}
results_ok = list(CommonCrawlQuery(test_url,
                                  apis=['CC-MAIN-2021-43']).query(page_size=5))

assert results_ok == [r for r in results if r.status == 200]
{% endraw %}

Test mimetypes

These would fail on the blacklisted APIs.

{% raw %}
results_html = list(CommonCrawlQuery(test_url,
                                  apis=['CC-MAIN-2021-43'],
                                  status_ok=False,
                                  mime='text/html').query(page_size=5))

assert results_html == [r for r in results if r.mime == 'text/html']
{% endraw %} {% raw %}
mimes = ['application/pdf', 'image/vnd.djvu', 'application/ogg']
results_mime = list(CommonCrawlQuery(test_url,
                                  apis=['CC-MAIN-2021-43'],
                                  status_ok=False,
                                  mime=mimes).query(page_size=5))

assert results_mime == [r for r in results if r.mime in mimes]
{% endraw %} {% raw %}
results_image = list(CommonCrawlQuery(test_url,
                                  apis=['CC-MAIN-2021-43'],
                                  status_ok=False,
                                  mime='image/*').query(page_size=5))

assert results_image == [r for r in results if r.mime.startswith('image/')]
{% endraw %}

Check fetching

{% raw %}
result = _cc_cdx_to_record({'urlkey': 'com,skeptric)/',
 'timestamp': '20211028110756',
 'url': 'https://skeptric.com/',
 'mime': 'text/html',
 'mime-detected': 'text/html',
 'status': '200',
 'digest': '7RBLUZ55MD4FUPDRPVVJVTM7YDYQXRS3',
 'length': '107238',
 'offset': '650789450',
 'filename': 'crawl-data/CC-MAIN-2021-43/segments/1634323588284.71/warc/CC-MAIN-20211028100619-20211028130619-00465.warc.gz',
 'languages': 'eng',
 'encoding': 'UTF-8'})
{% endraw %} {% raw %}
assert sha1_digest(result.content) == result.digest
{% endraw %}

Check fetching in parallel

Since they're hosted on AWS S3 we can go wild with parallelism. This code is actually identical to the Wayback version.

{% raw %}

cc_fetch_parallel[source]

cc_fetch_parallel(items, threads=64, session=None, callback=None)

{% endraw %} {% raw %}
{% endraw %} {% raw %}
many_content = CommonCrawlRecord.fetch_parallel(results_html[:100])
{% endraw %}