---
title: CDX
keywords: fastai
sidebar: home_sidebar
nb_path: "nbs/01_cdx.ipynb"
---
{% raw %}
{% endraw %}

This is an interface to the Common Crawl Index and the Internet Archive's CDX Server for finding archived resources based on URL patterns. It has a lot of overlap with cdx_toolkit but is more optimised for speed and ease of use.

{% raw %}
{% endraw %} {% raw %}
test_cache = '../data/test/cache'
{% endraw %}

Internet Archive

Querying CDX

The Internet Archive runs its own Java CDX Server. Hackernoon has an article that gives a good overview of querying it.

See the documentation for all the parameters; here are the ones I find most interesting (a raw query using them is sketched after these notes):

  • url: The URL to query; a wildcard * in the URL implicitly sets the match type (so we don't need matchType)
  • output: "json" returns JSON instead of space-separated text
  • fl: Comma-separated list of fields to return
  • from, to: Date filtering, 1-14 digits of yyyyMMddhhmmss, inclusive (truncated dates seem to do the right thing)
  • filter: [!]field:regex filters a returned field with a Java regex, inverting with !
  • limit/offset: For getting a small number of results (sampling). The internal limit is 150000 results.

Default fields returned: ["urlkey","timestamp","original","mimetype","statuscode","digest","length"]

Pagination isn't really useful because it's applied before filtering (including date filtering) so most pages are empty with a date filter. Unpaginated requests are fast enough anyway.
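
To make those parameters concrete, here's a minimal raw query against the CDX endpoint using requests. The parameter names come from the documentation above; the specific values (URL, fields, dates) are just illustrative.

```python
import requests

# A hand-rolled CDX query; query_wayback_cdx below wraps this kind of request.
params = {
    'url': 'skeptric.com/*',                # wildcard match, no matchType needed
    'output': 'json',                       # JSON array-of-arrays instead of space-separated text
    'fl': 'timestamp,original,statuscode',  # only return these fields
    'from': '2020', 'to': '2020',           # truncated dates: everything in 2020, inclusive
    'filter': 'statuscode:200',             # Java regex on a field; prefix with ! to invert
    'limit': 5,
}
response = requests.get('http://web.archive.org/cdx/search/cdx', params=params)
response.raise_for_status()
print(response.json())  # first row is the header, the rest are captures
```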

Exporting the data

The data is a JSON array of arrays; the first row contains the headers and the subsequent rows the data. Let's transform it into a list of dictionaries mapping keys to values, which makes it a bit easier to work with (although less efficient in memory).

An alternative would be to use something like Pandas directly.
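
For illustration, a minimal version of this transform could look like the sketch below; it is written to be consistent with the tests that follow rather than being the actual source.

```python
from typing import Any, Dict, Iterable, List

def header_and_rows_to_dict(rows: Iterable[List[Any]]) -> List[Dict[str, Any]]:
    rows = iter(rows)
    header = next(rows, None)      # first row holds the column names
    if header is None:             # no header at all -> no records
        return []
    return [dict(zip(header, row)) for row in rows]
```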

{% raw %}

header_and_rows_to_dict[source]

header_and_rows_to_dict(rows:Iterable[list[Any]])

{% endraw %} {% raw %}
{% endraw %}

Check it on some data

{% raw %}
assert header_and_rows_to_dict([['col_1', 'col_2'], [1, 'a'], [2, 'b']]) == [
    {'col_1': 1, 'col_2': 'a'},
    {'col_1': 2, 'col_2': 'b'}]
{% endraw %} {% raw %}
assert header_and_rows_to_dict([['col_1', 'col_2']]) == []
{% endraw %} {% raw %}
assert header_and_rows_to_dict([]) == []
{% endraw %}

Making a query

{% raw %}

mimetypes_to_regex[source]

mimetypes_to_regex(mime:list[str], prefix='mimetype:')

{% endraw %} {% raw %}

query_wayback_cdx[source]

query_wayback_cdx(url:str, start:Optional[str], end:Optional[str], status_ok:bool=True, mime:Optional[Union[str, Iterable[str]]]=None, limit:Optional[int]=None, offset:Optional[int]=None, session:Optional[Session]=None)

Get references to Wayback Machine Captures for url.

Queries the Internet Archive Capture Index (CDX) for url.

Arguments:

  • start: Minimum date in format YYYYmmddHHMMSS (or any substring) inclusive
  • end: Maximum date in format YYYYmmddHHMMSS (or any substring) inclusive
  • status_ok: Only return those with a HTTP status 200
  • mime: Filter on mimetypes, '' is a wildcard (e.g. 'image/')
  • limit: Only return first limit records
  • offset: Skip the first offset records, combine with limit
  • session: Session to use when making requests

Filters results between start and end inclusive, in format YYYYmmddHHMMSS or any substring (e.g. start="202001", end="202001" will get all captures in January 2020).
{% endraw %} {% raw %}
{% endraw %}
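
As a rough illustration of how the mime argument might become a CDX filter expression, here's one plausible mimetypes_to_regex, written to match the behaviour the tests below rely on (exact matches for a list, * acting as a wildcard); the real implementation may differ.

```python
import re
from typing import Iterable, Union

def mimetypes_to_regex(mime: Union[str, Iterable[str]], prefix: str = 'mimetype:') -> str:
    # Turn one or more mimetypes into a single CDX filter like 'mimetype:(text/html|image/.*)'.
    if isinstance(mime, str):
        mime = [mime]
    patterns = [re.escape(m).replace(r'\*', '.*') for m in mime]
    return prefix + '(' + '|'.join(patterns) + ')'

# e.g. mimetypes_to_regex('image/*')  -> 'mimetype:(image/.*)'
#      mimetypes_to_regex(['text/css', 'application/javascript'])
#                                      -> 'mimetype:(text/css|application/javascript)'
```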

Test querying

{% raw %}
%%time
full_sample = query_wayback_cdx('skeptric.com/*', start=None, end=None, status_ok=False)
CPU times: user 16.8 ms, sys: 0 ns, total: 16.8 ms
Wall time: 1.53 s
{% endraw %} {% raw %}
assert len(full_sample) > 1000
len(full_sample)
1579
{% endraw %} {% raw %}
import pandas as pd
pd.DataFrame(full_sample)
|      | urlkey                     | timestamp      | original                           | mimetype     | statuscode | digest                           | length |
|------|----------------------------|----------------|------------------------------------|--------------|------------|----------------------------------|--------|
| 0    | com,skeptric)/             | 20180805132731 | http://skeptric.com/               | text/html    | 200        | 2N5QQYZFAM36CSESTGSDKRTRV7I5HEXJ | 1982   |
| 1    | com,skeptric)/             | 20190610142553 | http://skeptric.com/               | warc/revisit | -          | 2N5QQYZFAM36CSESTGSDKRTRV7I5HEXJ | 580    |
| 2    | com,skeptric)/             | 20201126064102 | https://skeptric.com/              | text/html    | 200        | WDYU3RU7ZMFFSZPAPE56PC4L3EK4FE3D | 82985  |
| 3    | com,skeptric)/             | 20210109075617 | https://skeptric.com/              | text/html    | 200        | M6GZ4ZVD7U5L2TUXS5POHN36SSLOEPO7 | 92869  |
| 4    | com,skeptric)/             | 20210417050827 | https://skeptric.com/              | text/html    | 200        | ZOYZF2RVY44PBSNNXMVH27MHUPGGO4OK | 101114 |
| ...  | ...                        | ...            | ...                                | ...          | ...        | ...                              | ...    |
| 1574 | com,skeptric)/wsl2-xserver | 20201216173655 | https://skeptric.com/wsl2-xserver/ | warc/revisit | -          | UAS4RFHUDI6TNMPWVTD4N6RZWLKQAOGT | 720    |
| 1575 | com,skeptric)/wsl2-xserver | 20201216185608 | https://skeptric.com/wsl2-xserver/ | warc/revisit | -          | UAS4RFHUDI6TNMPWVTD4N6RZWLKQAOGT | 721    |
| 1576 | com,skeptric)/wsl2-xserver | 20210128160531 | https://skeptric.com/wsl2-xserver/ | text/html    | 200        | BBSHS372N5MIYDLGSQWQKF6TKNBEDD35 | 5794   |
| 1577 | com,skeptric)/wsl2-xserver | 20210520180504 | https://skeptric.com/wsl2-xserver/ | text/html    | 200        | BBSHS372N5MIYDLGSQWQKF6TKNBEDD35 | 4821   |
| 1578 | com,skeptric)/wsl2-xserver | 20211011012618 | https://skeptric.com/wsl2-xserver/ | text/html    | 200        | LZPYUW2572JR5QP3W3JLQ4SAHQWCY7PQ | 5603   |

1579 rows × 7 columns

{% endraw %} {% raw %}
timestamps = [x['timestamp'] for x in full_sample]
min(timestamps), max(timestamps)
('20170915060403', '20211123083615')
{% endraw %} {% raw %}
pd.DataFrame(full_sample).groupby(['mimetype', 'statuscode'])['urlkey'].count()
mimetype                  statuscode
application/javascript    200             1
application/octet-stream  200             3
image/gif                 200             3
image/jpeg                200            59
image/png                 200           292
image/svg+xml             200            51
image/vnd.microsoft.icon  200             1
image/x-icon              200             2
text/css                  200            10
text/html                 200           393
                          302             1
                          404             5
warc/revisit              -             758
Name: urlkey, dtype: int64
{% endraw %}

Test statuscode filter

{% raw %}
%%time
ok_sample = query_wayback_cdx('skeptric.com/*', start=None, end=None) #status_ok=True
CPU times: user 14.6 ms, sys: 2.33 ms, total: 16.9 ms
Wall time: 829 ms
{% endraw %} {% raw %}
assert ok_sample == [x for x in full_sample if x['statuscode'] == '200']
{% endraw %}

Test mimetypes

{% raw %}
%%time
html_sample = query_wayback_cdx('skeptric.com/*', start=None, end=None, status_ok=False, mime='text/html')
CPU times: user 6.73 ms, sys: 0 ns, total: 6.73 ms
Wall time: 634 ms
{% endraw %} {% raw %}
assert html_sample == [x for x in full_sample if x['mimetype'] == 'text/html']
{% endraw %} {% raw %}
%%time
image_sample = query_wayback_cdx('skeptric.com/*', start=None, end=None, status_ok=False, mime='image/*')
CPU times: user 9.46 ms, sys: 0 ns, total: 9.46 ms
Wall time: 701 ms
{% endraw %} {% raw %}
assert image_sample == [x for x in full_sample if x['mimetype'].startswith('image/')]
{% endraw %} {% raw %}
%%time
prog_sample = query_wayback_cdx('skeptric.com/*', start=None, end=None, status_ok=False, 
                                 mime=['text/css', 'application/javascript'])
CPU times: user 0 ns, sys: 4.06 ms, total: 4.06 ms
Wall time: 472 ms
{% endraw %} {% raw %}
assert prog_sample == [x for x in full_sample if x['mimetype'] in ['text/css', 'application/javascript']]
{% endraw %}

Test date filters

{% raw %}
%%time
sample_2020 = query_wayback_cdx('skeptric.com/*', start='2020', end='2020', status_ok=False)
len(sample_2020)
CPU times: user 3.84 ms, sys: 0 ns, total: 3.84 ms
Wall time: 505 ms
106
{% endraw %} {% raw %}
assert sample_2020 == [x for x in full_sample if '2020' <= x['timestamp'] < '2021']
{% endraw %} {% raw %}
%%time
sample_to_2020 = query_wayback_cdx('skeptric.com/*', start=None, end='2020', status_ok=False)
CPU times: user 5.08 ms, sys: 0 ns, total: 5.08 ms
Wall time: 447 ms
{% endraw %} {% raw %}
assert sample_to_2020 == [x for x in full_sample if x['timestamp'] < '2021']
{% endraw %} {% raw %}
%%time
sample_from_2020 = query_wayback_cdx('skeptric.com/*', start='2020', end=None, status_ok=False)
CPU times: user 15.9 ms, sys: 0 ns, total: 15.9 ms
Wall time: 1.34 s
{% endraw %} {% raw %}
assert sample_from_2020 == [x for x in full_sample if x['timestamp'] >= '2020']
{% endraw %}

Test limits

{% raw %}
%%time
sample_10 = query_wayback_cdx('skeptric.com/*', start=None, end=None, status_ok=False, limit=10)
CPU times: user 4.59 ms, sys: 0 ns, total: 4.59 ms
Wall time: 442 ms
{% endraw %} {% raw %}
sample_10 == full_sample[:10]
True
{% endraw %} {% raw %}
%%time
sample_10_offset_20 = query_wayback_cdx('skeptric.com/*', start=None, end=None, status_ok=False, limit=10, offset=20)
CPU times: user 0 ns, sys: 4.72 ms, total: 4.72 ms
Wall time: 432 ms
{% endraw %} {% raw %}
assert sample_10_offset_20  == full_sample[20:20+10]
{% endraw %}

Fetching Content

{% raw %}

fetch_internet_archive_content[source]

fetch_internet_archive_content(timestamp:str, url:str, session:Optional[Session]=None)

{% endraw %} {% raw %}
{% endraw %} {% raw %}
record = image_sample[0]
record
{'urlkey': 'com,skeptric)/favicon.ico',
 'timestamp': '20180819051126',
 'original': 'http://skeptric.com/favicon.ico',
 'mimetype': 'image/vnd.microsoft.icon',
 'statuscode': '200',
 'digest': 'R6YE2GPPT4BM4IAMHGUJPDJF6BGKKHDA',
 'length': '851'}
{% endraw %} {% raw %}
content = fetch_internet_archive_content(record['timestamp'], record['original'])
{% endraw %} {% raw %}
from IPython.display import Image
Image(content)
{% endraw %} {% raw %}
len(content)
1150
{% endraw %} {% raw %}
record = ok_sample[0]
record
{'urlkey': 'com,skeptric)/',
 'timestamp': '20180805132731',
 'original': 'http://skeptric.com/',
 'mimetype': 'text/html',
 'statuscode': '200',
 'digest': '2N5QQYZFAM36CSESTGSDKRTRV7I5HEXJ',
 'length': '1982'}
{% endraw %} {% raw %}
content = fetch_internet_archive_content(record['timestamp'], record['original'])
{% endraw %} {% raw %}
content[:100]
b'<!DOCTYPE html>\n<html lang="en">\n  <head><script src="//archive.org/includes/analytics.js?v=cf34f82"'
{% endraw %}

Wrap it in a fast Object

I have no idea whether Session is actually thread-safe; if it's not, we should use one session per thread. As far as I can tell the issues occur when you have lots of different hosts, so in this case it should be ok.

I guess we'll try it and see. Maybe in the long term we're better off going with asyncio. Using a Session makes things slightly faster; we can always turn it off by passing session=False.

{% raw %}

make_session[source]

make_session(pool_maxsize)

{% endraw %} {% raw %}
{% endraw %}
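
make_session presumably just builds a requests.Session whose connection pool is large enough for the thread count; a sketch under that assumption:

```python
import requests
from requests.adapters import HTTPAdapter

def make_session(pool_maxsize: int) -> requests.Session:
    # A Session whose urllib3 pool can hold pool_maxsize connections,
    # so concurrent threads don't keep discarding and reopening connections.
    session = requests.Session()
    adapter = HTTPAdapter(pool_maxsize=pool_maxsize)
    session.mount('http://', adapter)
    session.mount('https://', adapter)
    return session
```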

We cache everything with joblib and make it fast.

Note we only use 8 threads to avoid overloading the servers.

{% raw %}

class WaybackResult[source]

WaybackResult(params, session=None)

{% endraw %} {% raw %}

class WaybackMachine[source]

WaybackMachine(location:Optional[str], session:Union[bool, Session]=True, threads:int=8, verbose=0)

{% endraw %} {% raw %}
{% endraw %}
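
WaybackMachine essentially glues the pieces above together: a joblib Memory for on-disk caching and a thread pool for concurrent fetches. Roughly, with illustrative names rather than the actual class body:

```python
from concurrent.futures import ThreadPoolExecutor
from joblib import Memory

# Illustrative sketch of the cache + thread-pool pattern used by WaybackMachine.
memory = Memory(test_cache, verbose=0)
cached_fetch = memory.cache(fetch_internet_archive_content)

def fetch_many(records, threads=8):
    # Fetch several captures concurrently; keep the thread count low
    # to avoid overloading the Internet Archive's servers.
    with ThreadPoolExecutor(max_workers=threads) as pool:
        futures = [pool.submit(cached_fetch, r['timestamp'], r['original']) for r in records]
        return [f.result() for f in futures]
```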

Test fetching one

{% raw %}
wb = WaybackMachine(test_cache, verbose=1)
{% endraw %} {% raw %}
wb.memory.clear()
WARNING:root:[Memory(location=data/test/cache/joblib)]: Flushing completely the cache
{% endraw %} {% raw %}
%%time
items = wb.query('skeptric.com/*', None, None)
________________________________________________________________________________
[Memory] Calling __main__--tmp-ipykernel-2773891608.query_wayback_cdx...
query_wayback_cdx('skeptric.com/*', None, None, True, None, None, None, session=<requests.sessions.Session object at 0x7f81dbb45100>)
________________________________________________query_wayback_cdx - 0.9s, 0.0min
CPU times: user 49 ms, sys: 0 ns, total: 49 ms
Wall time: 894 ms
{% endraw %} {% raw %}
%%time
items2 = wb.query('skeptric.com/*', None, None)
CPU times: user 22.3 ms, sys: 158 µs, total: 22.4 ms
Wall time: 19.4 ms
{% endraw %} {% raw %}
assert items2 == items
{% endraw %} {% raw %}
%%time
items3 = wb.query('skeptric.com/*', None, None, force=True)
________________________________________________________________________________
[Memory] Calling __main__--tmp-ipykernel-2773891608.query_wayback_cdx...
query_wayback_cdx('skeptric.com/*', None, None, True, None, None, None, session=<requests.sessions.Session object at 0x7f81dbb45100>)
________________________________________________query_wayback_cdx - 0.7s, 0.0min
CPU times: user 41.8 ms, sys: 4.34 ms, total: 46.1 ms
Wall time: 751 ms
{% endraw %} {% raw %}
assert items3 == items
{% endraw %} {% raw %}
%%time
content = wb.fetch_one(items[0])
________________________________________________________________________________
[Memory] Calling __main__--tmp-ipykernel-2187831798.fetch_internet_archive_content...
fetch_internet_archive_content('20180805132731', 'http://skeptric.com/', session=<requests.sessions.Session object at 0x7f81dbb45100>)
___________________________________fetch_internet_archive_content - 0.2s, 0.0min
CPU times: user 5.63 ms, sys: 2.69 ms, total: 8.32 ms
Wall time: 199 ms
{% endraw %} {% raw %}
%%time
content2 = wb.fetch_one(items[0])
CPU times: user 1.5 ms, sys: 714 µs, total: 2.21 ms
Wall time: 1.18 ms
{% endraw %} {% raw %}
assert content2 == content
{% endraw %} {% raw %}
%%time
content3 = wb.fetch_one(items[0], force=True)
________________________________________________________________________________
[Memory] Calling __main__--tmp-ipykernel-2187831798.fetch_internet_archive_content...
fetch_internet_archive_content('20180805132731', 'http://skeptric.com/', session=<requests.sessions.Session object at 0x7f81dbb45100>)
___________________________________fetch_internet_archive_content - 0.2s, 0.0min
CPU times: user 4.82 ms, sys: 2.29 ms, total: 7.11 ms
Wall time: 189 ms
{% endraw %} {% raw %}
assert content3 == content
{% endraw %}

Test fetching many

{% raw %}
wb = WaybackMachine(test_cache, verbose=0)
{% endraw %} {% raw %}
%%time
contents_16 = wb.fetch(items[:16], threads=8)
CPU times: user 197 ms, sys: 9.6 ms, total: 207 ms
Wall time: 4.24 s
{% endraw %} {% raw %}
assert contents_16[0] == content
{% endraw %}

Common Crawl

Common Crawl is similar but slightly different, because it is split over many indexes: roughly one per month, each covering a crawl of about two weeks.

Indexes

We cache the index list within a session, but not between sessions, since the available indexes change over the months.
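
The index listing comes from Common Crawl's collinfo.json endpoint; here is a sketch of an in-process cache for it (the caching details of the real get_cc_indexes may differ):

```python
from functools import lru_cache

import requests

CC_COLLINFO = 'https://index.commoncrawl.org/collinfo.json'

@lru_cache(maxsize=1)
def get_cc_indexes():
    # Cached only for the lifetime of the process; the list grows as new crawls are published.
    response = requests.get(CC_COLLINFO)
    response.raise_for_status()
    return response.json()
```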

{% raw %}

get_cc_indexes[source]

get_cc_indexes()

{% endraw %} {% raw %}
{% endraw %} {% raw %}
indexes = get_cc_indexes()
indexes[:3]
[{'id': 'CC-MAIN-2021-43',
  'name': 'October 2021 Index',
  'timegate': 'https://index.commoncrawl.org/CC-MAIN-2021-43/',
  'cdx-api': 'https://index.commoncrawl.org/CC-MAIN-2021-43-index'},
 {'id': 'CC-MAIN-2021-39',
  'name': 'September 2021 Index',
  'timegate': 'https://index.commoncrawl.org/CC-MAIN-2021-39/',
  'cdx-api': 'https://index.commoncrawl.org/CC-MAIN-2021-39-index'},
 {'id': 'CC-MAIN-2021-31',
  'name': 'July 2021 Index',
  'timegate': 'https://index.commoncrawl.org/CC-MAIN-2021-31/',
  'cdx-api': 'https://index.commoncrawl.org/CC-MAIN-2021-31-index'}]
{% endraw %}

Notice that the oldest indexes have a different id format.

{% raw %}
indexes[-4:]
[{'id': 'CC-MAIN-2013-20',
  'name': 'Summer 2013 Index',
  'timegate': 'https://index.commoncrawl.org/CC-MAIN-2013-20/',
  'cdx-api': 'https://index.commoncrawl.org/CC-MAIN-2013-20-index'},
 {'id': 'CC-MAIN-2012',
  'name': 'Index of 2012 ARC files',
  'timegate': 'https://index.commoncrawl.org/CC-MAIN-2012/',
  'cdx-api': 'https://index.commoncrawl.org/CC-MAIN-2012-index'},
 {'id': 'CC-MAIN-2009-2010',
  'name': 'Index of 2009 - 2010 ARC files',
  'timegate': 'https://index.commoncrawl.org/CC-MAIN-2009-2010/',
  'cdx-api': 'https://index.commoncrawl.org/CC-MAIN-2009-2010-index'},
 {'id': 'CC-MAIN-2008-2009',
  'name': 'Index of 2008 - 2009 ARC files',
  'timegate': 'https://index.commoncrawl.org/CC-MAIN-2008-2009/',
  'cdx-api': 'https://index.commoncrawl.org/CC-MAIN-2008-2009-index'}]
{% endraw %}

Parsing the times

{% raw %}
import re
from datetime import datetime
year_week = re.match(r'^CC-MAIN-(\d{4}-\d{2})$', indexes[0]['id']).group(1)
datetime.strptime(year_week  + '-6', '%Y-%W-%w')
datetime.datetime(2021, 10, 30, 0, 0)
{% endraw %} {% raw %}
indexes[0]['id']
'CC-MAIN-2021-43'
{% endraw %}

Querying CDX

{% raw %}
cc_sample_api = 'https://index.commoncrawl.org/CC-MAIN-2021-43-index'
assert cc_sample_api in [i['cdx-api'] for i in indexes]
{% endraw %}

The Common Crawl CDX is much slower than the Wayback Machine's. It also uses the zipped CDX by default, which automatically paginates results.
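
For reference, the raw pagination flow looks roughly like this: ask the index how many pages a query spans, then fetch each page of JSON lines. This sketch assumes the pywb-style showNumPages/page parameters and that the page count comes back under a 'pages' key; the helpers below wrap this up.

```python
import requests

api = 'https://index.commoncrawl.org/CC-MAIN-2021-43-index'
query = {'url': 'commoncrawl.org/*', 'output': 'json'}

# Ask how many pages of the zipped index this query covers
# (assumption: the JSON response has a 'pages' field).
n_pages = requests.get(api, params={**query, 'showNumPages': 'true'}).json()['pages']

# Fetch each page; the body is JSON lines (one capture per line).
pages = [requests.get(api, params={**query, 'page': page}) for page in range(n_pages)]
```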

{% raw %}

jsonl_loads[source]

jsonl_loads(jsonl)

{% endraw %} {% raw %}
{% endraw %} {% raw %}
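
The CDX pages come back as JSON lines, so jsonl_loads just needs to parse one JSON object per line; a minimal version consistent with the tests below:

```python
import json

def jsonl_loads(jsonl: bytes):
    # Parse JSON lines: one JSON object per line, tolerating a trailing newline.
    return [json.loads(line) for line in jsonl.splitlines() if line.strip()]
```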
test_jline = b'{"status": "200", "mime": "text/html"}\n{"status": "301", "mime": "-"}\n'
test_jdict = [{'status': '200', 'mime': 'text/html'}, {'status': '301', 'mime': '-'}]
assert jsonl_loads(test_jline) == test_jdict
assert jsonl_loads(test_jline.rstrip()) == test_jdict
{% endraw %} {% raw %}

query_cc_cdx_num_pages[source]

query_cc_cdx_num_pages(api:str, url:str, session:Optional[Session]=None)

{% endraw %} {% raw %}

query_cc_cdx_page[source]

query_cc_cdx_page(api:str, url:str, page:int, start:Optional[str]=None, end:Optional[str]=None, status_ok:bool=True, mime:Optional[Union[str, Iterable[str]]]=None, limit:Optional[int]=None, offset:Optional[int]=None, session:Optional[Session]=None)

Get references to Common Crawl Captures for url.

Queries the Common Crawl Capture Index (CDX) for url.

Filters:

  • api: API endpoint to use (e.g. 'https://index.commoncrawl.org/CC-MAIN-2021-43-index')
  • start: Minimum date in format YYYYmmddHHMMSS (or any substring) inclusive
  • end: Maximum date in format YYYYmmddHHMMSS (or any substring) inclusive
  • status_ok: Only return those with a HTTP status 200
  • mime: Filter on mimetypes, '' is a wildcard (e.g. 'image/')
  • limit: Only return first limit records
  • offset: Skip the first offset records, combine with limit
  • session: Session to use when making requests Filters results between start and end inclusive, in format YYYYmmddHHMMSS or any substring (e.g. start="202001", end="202001" will get all captures in January 2020)
{% endraw %} {% raw %}
{% endraw %}

Testing pagination

{% raw %}
test_url = 'mn.wikipedia.org/*'
{% endraw %} {% raw %}
pages = query_cc_cdx_num_pages(cc_sample_api, test_url)
pages
2
{% endraw %} {% raw %}
records = []
for page in range(pages):
    records += query_cc_cdx_page(cc_sample_api, test_url, page=page, mime='text/html')
len(records)
---------------------------------------------------------------------------
KeyboardInterrupt                         Traceback (most recent call last)
(traceback truncated: the request inside query_cc_cdx_page was interrupted by hand while waiting on the Common Crawl index)
{% endraw %} {% raw %}
timestamps = [record['timestamp'] for record in records]
min(timestamps), max(timestamps)
{% endraw %}

We should get an error querying past the last page (400: Invalid Request)

{% raw %}
from requests.exceptions import HTTPError

try:
    query_cc_cdx_page(cc_sample_api, test_url, page=pages, mime='text/html')
    raise AssertionError('Expected Failure')
except HTTPError as e:
    status = e.response.status_code
    if status != 400:
        raise AssertionError(f'Expected 400, got {status}')
{% endraw %}

Downloading

{% raw %}

fetch_cc[source]

fetch_cc(filename:str, offset:int, length:int, session:Optional[Session]=None)

{% endraw %} {% raw %}
{% endraw %} {% raw %}
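
Under the hood this is an HTTP range request: each capture is an individually gzipped WARC record inside one of the big crawl files, so we can pull just its byte range and decompress it. Here is a sketch of the range request, assuming the public data.commoncrawl.org mirror; extracting the HTTP payload from the decompressed WARC record (which is what the digest below is computed over) is left out, and the real fetch_cc may differ in details.

```python
import gzip

import requests

CC_DATA_PREFIX = 'https://data.commoncrawl.org/'   # assumed public mirror of the crawl files

def fetch_cc_record(filename: str, offset, length, session=None) -> bytes:
    # Request only the bytes of this capture's gzip member and decompress it.
    # Note: this yields the whole WARC record (WARC + HTTP headers + body),
    # not just the payload.
    start, end = int(offset), int(offset) + int(length) - 1
    response = (session or requests).get(CC_DATA_PREFIX + filename,
                                         headers={'Range': f'bytes={start}-{end}'})
    response.raise_for_status()
    return gzip.decompress(response.content)
```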
record = records[0]
record
{% endraw %} {% raw %}
record = {'urlkey': 'org,wikipedia,en)/?curid=3516101',
 'timestamp': '20211024051554',
 'url': 'https://en.wikipedia.org/?curid=3516101',
 'mime': 'text/html',
 'mime-detected': 'text/html',
 'status': '200',
 'digest': 'TFZFWXWKS3NFJLLLXLYU6JTNZ77D3IVD',
 'length': '8922',
 'offset': '332329771',
 'filename': 'crawl-data/CC-MAIN-2021-43/segments/1634323585911.17/warc/CC-MAIN-20211024050128-20211024080128-00689.warc.gz',
 'languages': 'eng',
 'encoding': 'UTF-8'}
{% endraw %} {% raw %}
%%time
content = fetch_cc(record['filename'], record['offset'], record['length'])
{% endraw %} {% raw %}
content[:1000]
{% endraw %}

Check the digest

{% raw %}
from hashlib import sha1
from base64 import b32encode

def get_digest(content: bytes) -> str:
    return b32encode(sha1(content).digest()).decode('ascii')

assert get_digest(content) == record['digest']
{% endraw %}

Make it go fast

{% raw %}

class CommonCrawl[source]

CommonCrawl(location:Optional[str], session:Union[bool, Session]=True, threads:int=64, verbose=0)

{% endraw %} {% raw %}
{% endraw %}

Test querying

{% raw %}
cc = CommonCrawl(test_cache, verbose=1)
{% endraw %} {% raw %}
cc.memory.clear()
{% endraw %} {% raw %}
assert 'CC-MAIN-2021-43' in cc.cdx_apis
{% endraw %} {% raw %}
results = cc.query('CC-MAIN-2021-43', 'archive.org/s*', status_ok=False)


pd.DataFrame(results).groupby(['mime', 'status'])['url'].count()
{% endraw %} {% raw %}
results2 = cc.query('CC-MAIN-2021-43', 'archive.org/s*', status_ok=False)

assert results == results2
{% endraw %} {% raw %}
results3 = cc.query('CC-MAIN-2021-43', 'archive.org/s*', status_ok=False, force=True)

assert results == results3
{% endraw %}

Test status

{% raw %}
results_ok = cc.query('CC-MAIN-2021-43', 'archive.org/s*', status_ok=True)

assert results_ok == [r for r in results if r['status'] == '200']
{% endraw %}

Test mimetypes

{% raw %}
results_html = cc.query('CC-MAIN-2021-43', 'archive.org/s*', mime='text/html', status_ok=False)

assert results_html == [r for r in results if r['mime'] == 'text/html']
{% endraw %} {% raw %}
mimes = ['audio/x-mpegurl', 'image/x.djvu', 'text/xml']
results_mime = cc.query('CC-MAIN-2021-43', 'archive.org/s*', mime=mimes, status_ok=False)

assert results_mime == [r for r in results if r['mime'] in mimes]
{% endraw %} {% raw %}
results_image = cc.query('CC-MAIN-2021-43', 'archive.org/s*', mime='image/*', status_ok=False)

assert results_image == [r for r in results if r['mime'].startswith('image/')]
{% endraw %}

Check fetching

{% raw %}
cc = CommonCrawl(test_cache)
{% endraw %} {% raw %}
result = {'urlkey': 'com,skeptric)/',
 'timestamp': '20211028110756',
 'url': 'https://skeptric.com/',
 'mime': 'text/html',
 'mime-detected': 'text/html',
 'status': '200',
 'digest': '7RBLUZ55MD4FUPDRPVVJVTM7YDYQXRS3',
 'length': '107238',
 'offset': '650789450',
 'filename': 'crawl-data/CC-MAIN-2021-43/segments/1634323588284.71/warc/CC-MAIN-20211028100619-20211028130619-00465.warc.gz',
 'languages': 'eng',
 'encoding': 'UTF-8'}
{% endraw %} {% raw %}
contents = cc.fetch([result])
content = cc.fetch_one(result)
assert len(contents) == 1
assert content == contents[0]
{% endraw %}