---
title: CDX
keywords: fastai
sidebar: home_sidebar
nb_path: "nbs/01_cdx.ipynb"
---
This is an interface to the Common Crawl Index and the Internet Archive's CDX Server for finding archived resources based on URL patterns. It has a lot of overlap with cdx_toolkit but is more optimised for speed and ease of use.
test_cache = '../data/test/cache'
The Internet Archive runs its own Java CDX Server. Hackernoon has an article that gives a good overview of querying it.
See the documentation for all the parameters; here are the ones that I find most interesting:
* a `*` in the `url` can implicitly define the match type (which means we don't need `matchType`)
* `filter` takes `[!]field:regex` to filter a returned field with a Java regex, inverting with `!`
* Default fields returned: `["urlkey","timestamp","original","mimetype","statuscode","digest","length"]`
Pagination isn't really useful because it's applied before filtering (including date filtering), so with a date filter most pages come back empty. Unpaginated requests are fast enough anyway.
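To make this concrete, here's a minimal sketch of a raw query against the Wayback CDX Server with plain `requests` (the helper name is mine; `query_wayback_cdx` below wraps the same endpoint with friendlier status, MIME and date arguments):

```python
import requests

WAYBACK_CDX_URL = 'https://web.archive.org/cdx/search/cdx'

def raw_wayback_query(url_pattern, **params):
    # output=json returns a JSON array of arrays: a header row, then data rows
    response = requests.get(WAYBACK_CDX_URL,
                            params={'url': url_pattern, 'output': 'json', **params})
    response.raise_for_status()
    return response.json()

# e.g. the first 10 successful HTML captures from 2020
# raw_wayback_query('skeptric.com/*', limit=10,
#                   filter=['statuscode:200', 'mimetype:text/html'],
#                   **{'from': '2020', 'to': '2020'})
```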
The data is a JSON array of arrays; the first line contains the headers and the subsequent lines the data. Let's transform it into a list of dictionaries mapping keys to values, which makes it a bit easier to work with (although less efficient in memory).
An alternative would be to use something like Pandas directly.
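The exported implementation isn't shown here, but a minimal `header_and_rows_to_dict` consistent with the tests below would be:

```python
def header_and_rows_to_dict(data: list) -> list:
    # The first row is the header, the rest are data rows;
    # an empty input (or a header with no rows) gives an empty list.
    if not data:
        return []
    header, *rows = data
    return [dict(zip(header, row)) for row in rows]
```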
Check it on some data
assert header_and_rows_to_dict([['col_1', 'col_2'], [1, 'a'], [2, 'b']]) == [
{'col_1': 1, 'col_2': 'a'},
{'col_1': 2, 'col_2': 'b'}]
assert header_and_rows_to_dict([['col_1', 'col_2']]) == []
assert header_and_rows_to_dict([]) == []
%%time
full_sample = query_wayback_cdx('skeptric.com/*', start=None, end=None, status_ok=False)
assert len(full_sample) > 1000
len(full_sample)
import pandas as pd
pd.DataFrame(full_sample)
timestamps = [x['timestamp'] for x in full_sample]
min(timestamps), max(timestamps)
pd.DataFrame(full_sample).groupby(['mimetype', 'statuscode'])['urlkey'].count()
%%time
ok_sample = query_wayback_cdx('skeptric.com/*', start=None, end=None) #status_ok=True
assert ok_sample == [x for x in full_sample if x['statuscode'] == '200']
%%time
html_sample = query_wayback_cdx('skeptric.com/*', start=None, end=None, status_ok=False, mime='text/html')
assert html_sample == [x for x in full_sample if x['mimetype'] == 'text/html']
%%time
image_sample = query_wayback_cdx('skeptric.com/*', start=None, end=None, status_ok=False, mime='image/*')
assert image_sample == [x for x in full_sample if x['mimetype'].startswith('image/')]
%%time
prog_sample = query_wayback_cdx('skeptric.com/*', start=None, end=None, status_ok=False,
mime=['text/css', 'application/javascript'])
assert prog_sample == [x for x in full_sample if x['mimetype'] in ['text/css', 'application/javascript']]
%%time
sample_2020 = query_wayback_cdx('skeptric.com/*', start='2020', end='2020', status_ok=False)
len(sample_2020)
assert sample_2020 == [x for x in full_sample if '2020' <= x['timestamp'] < '2021']
%%time
sample_to_2020 = query_wayback_cdx('skeptric.com/*', start=None, end='2020', status_ok=False)
assert sample_to_2020 == [x for x in full_sample if x['timestamp'] < '2021']
%%time
sample_from_2020 = query_wayback_cdx('skeptric.com/*', start='2020', end=None, status_ok=False)
assert sample_from_2020 == [x for x in full_sample if x['timestamp'] >= '2020']
%%time
sample_10 = query_wayback_cdx('skeptric.com/*', start=None, end=None, status_ok=False, limit=10)
assert sample_10 == full_sample[:10]
%%time
sample_10_offset_20 = query_wayback_cdx('skeptric.com/*', start=None, end=None, status_ok=False, limit=10, offset=20)
assert sample_10_offset_20 == full_sample[20:20+10]
record = image_sample[0]
record
content = fetch_internet_archive_content(record['timestamp'], record['original'])
from IPython.display import Image
Image(content)
len(content)
record = ok_sample[0]
record
content = fetch_internet_archive_content(record['timestamp'], record['original'])
content[:100]
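Fetching the content of a capture presumably just builds a Wayback Machine URL from the timestamp and the original URL; the `id_` flag after the timestamp asks for the raw archived bytes rather than the rewritten page. A rough sketch (the helper name is mine, not the library's):

```python
import requests

def fetch_wayback_raw(timestamp: str, original_url: str) -> bytes:
    # id_ requests the original, unmodified response body
    # instead of the Wayback Machine's rewritten HTML
    url = f'https://web.archive.org/web/{timestamp}id_/{original_url}'
    response = requests.get(url)
    response.raise_for_status()
    return response.content
```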
I have no idea whether Session is actually thread-safe; if it's not we should have one session per thread. As far as I can tell the issue occurs when you have lots of hosts, so in this case it should be ok.
I guess we'll try it and see. Maybe in the long term we're better off going with asyncio. Using a Session makes things slightly faster; we can always turn it off by passing session=False.
We cache everything with joblib to make repeated queries fast.
Note we only use 8 threads to avoid overloading the servers.
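As an illustration of the approach (not the actual WaybackMachine implementation), combining a joblib disk cache for single fetches with a small thread pool for batches looks roughly like this:

```python
import requests
from concurrent.futures import ThreadPoolExecutor
from joblib import Memory

def _fetch_one_capture(timestamp, original):
    # Fetch the raw archived bytes for one capture (see the sketch above)
    url = f'https://web.archive.org/web/{timestamp}id_/{original}'
    response = requests.get(url)
    response.raise_for_status()
    return response.content

class CachedFetcher:
    """Illustrative sketch: disk-cache single fetches, fan batches out over threads."""
    def __init__(self, cache_dir, verbose=0):
        self.memory = Memory(cache_dir, verbose=verbose)
        # joblib memoises the fetch to disk, so repeated calls are nearly free
        self._cached_fetch = self.memory.cache(_fetch_one_capture)

    def fetch_one(self, record):
        return self._cached_fetch(record['timestamp'], record['original'])

    def fetch(self, records, threads=8):
        # Keep the thread count small so we don't overload the servers
        with ThreadPoolExecutor(max_workers=threads) as pool:
            return list(pool.map(self.fetch_one, records))
```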
wb = WaybackMachine(test_cache, verbose=1)
wb.memory.clear()
%%time
items = wb.query('skeptric.com/*', None, None)
%%time
items2 = wb.query('skeptric.com/*', None, None)
assert items2 == items
%%time
items3 = wb.query('skeptric.com/*', None, None, force=True)
assert items3 == items
%%time
content = wb.fetch_one(items[0])
%%time
content2 = wb.fetch_one(items[0])
assert content2 == content
%%time
content3 = wb.fetch_one(items[0], force=True)
assert content3 == content
wb = WaybackMachine(test_cache, verbose=0)
%%time
contents_16 = wb.fetch(items[:16], threads=8)
assert contents_16[0] == content
Common Crawl is similar, but slightly different, because it is split over many indexes: a new index is released roughly monthly, each covering about two weeks of crawling.
We cache the index list within a session, but not between sessions, since the available indexes change over the months.
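The list of indexes is published at https://index.commoncrawl.org/collinfo.json. A sketch of fetching it with a per-session cache (the function name is mine; `get_cc_indexes` below is the real interface):

```python
import requests
from functools import lru_cache

@lru_cache(maxsize=1)
def list_cc_indexes():
    # collinfo.json lists every index with its id, name and cdx-api endpoint.
    # lru_cache keeps the result for this process only, so a fresh session
    # picks up newly released indexes.
    response = requests.get('https://index.commoncrawl.org/collinfo.json')
    response.raise_for_status()
    return response.json()
```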
indexes = get_cc_indexes()
indexes[:3]
Notice the oldest ones have a different format
indexes[-4:]
import re
from datetime import datetime
year_week = re.match(r'^CC-MAIN-(\d{4}-\d{2})$', indexes[0]['id']).group(1)
datetime.strptime(year_week + '-6', '%Y-%W-%w')
indexes[0]['id']
cc_sample_api = 'https://index.commoncrawl.org/CC-MAIN-2021-43-index'
assert cc_sample_api in [i['cdx-api'] for i in indexes]
The Common Crawl CDX is much slower than the Wayback Machine's. It also serves the zipped CDX by default, which automatically paginates.
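A sketch of the paginated query against one index endpoint; `showNumPages` and `page` are standard parameters of the pywb-backed index server, and the JSON output is newline-delimited (one object per line), which is what `jsonl_loads` below parses. The helper names are illustrative, not the library's API:

```python
import json
import requests

def cc_num_pages(cdx_api, url_pattern):
    # showNumPages returns a small JSON object like {"pages": 5, "pageSize": 5, "blocks": 23}
    response = requests.get(cdx_api, params={
        'url': url_pattern, 'output': 'json', 'showNumPages': 'true'})
    response.raise_for_status()
    return response.json()['pages']

def cc_query_page(cdx_api, url_pattern, page, **params):
    # Each page comes back as JSON Lines: one JSON object per line
    response = requests.get(cdx_api, params={
        'url': url_pattern, 'output': 'json', 'page': page, **params})
    response.raise_for_status()
    return [json.loads(line) for line in response.text.splitlines() if line]
```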
test_jline = b'{"status": "200", "mime": "text/html"}\n{"status": "301", "mime": "-"}\n'
test_jdict = [{'status': '200', 'mime': 'text/html'}, {'status': '301', 'mime': '-'}]
assert jsonl_loads(test_jline) == test_jdict
assert jsonl_loads(test_jline.rstrip()) == test_jdict
test_url = 'mn.wikipedia.org/*'
pages = query_cc_cdx_num_pages(cc_sample_api, test_url)
pages
records = []
for page in range(pages):
records += query_cc_cdx_page(cc_sample_api, test_url, page=page, mime='text/html')
len(records)
timestamps = [record['timestamp'] for record in records]
min(timestamps), max(timestamps)
We should get an error querying past the last page (400: Invalid Request)
from requests.exceptions import HTTPError
try:
query_cc_cdx_page(cc_sample_api, test_url, page=pages, mime='text/html')
raise AssertionError('Expected Failure')
except HTTPError as e:
status = e.response.status_code
if status != 400:
raise AssertionError(f'Expected 400, got {status}')
record = records[0]
record
record = {'urlkey': 'org,wikipedia,en)/?curid=3516101',
'timestamp': '20211024051554',
'url': 'https://en.wikipedia.org/?curid=3516101',
'mime': 'text/html',
'mime-detected': 'text/html',
'status': '200',
'digest': 'TFZFWXWKS3NFJLLLXLYU6JTNZ77D3IVD',
'length': '8922',
'offset': '332329771',
'filename': 'crawl-data/CC-MAIN-2021-43/segments/1634323585911.17/warc/CC-MAIN-20211024050128-20211024080128-00689.warc.gz',
'languages': 'eng',
'encoding': 'UTF-8'}
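fetch_cc pulls just this record out of the public Common Crawl data: conceptually an HTTP Range request for `length` bytes starting at `offset` of the gzipped WARC file, which is then decompressed and split into WARC headers, HTTP headers and body. A rough sketch under those assumptions (the data URL prefix and the naive record parsing are mine; a real implementation would use a proper WARC parser such as warcio):

```python
import gzip
import requests

CC_DATA_PREFIX = 'https://data.commoncrawl.org/'  # assumed public data endpoint

def fetch_cc_payload(filename, offset, length):
    # Each CDX record points at one gzip member inside a large WARC file;
    # a Range request pulls out just those bytes.
    start, end = int(offset), int(offset) + int(length) - 1
    response = requests.get(CC_DATA_PREFIX + filename,
                            headers={'Range': f'bytes={start}-{end}'})
    response.raise_for_status()
    warc_record = gzip.decompress(response.content)
    # Naive split: WARC headers, then HTTP headers, then the body we're after
    _warc_headers, _http_headers, body = warc_record.split(b'\r\n\r\n', 2)
    return body
```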
%%time
content = fetch_cc(record['filename'], record['offset'], record['length'])
content[:1000]
Check the digest
from hashlib import sha1
from base64 import b32encode
def get_digest(content: bytes) -> str:
return b32encode(sha1(content).digest()).decode('ascii')
assert get_digest(content) == record['digest']
cc = CommonCrawl(test_cache, verbose=1)
cc.memory.clear()
assert 'CC-MAIN-2021-43' in cc.cdx_apis
results = cc.query('CC-MAIN-2021-43', 'archive.org/s*', status_ok=False)
pd.DataFrame(results).groupby(['mime', 'status'])['url'].count()
results2 = cc.query('CC-MAIN-2021-43', 'archive.org/s*', status_ok=False)
assert results == results2
results3 = cc.query('CC-MAIN-2021-43', 'archive.org/s*', status_ok=False, force=True)
assert results == results3
results_ok = cc.query('CC-MAIN-2021-43', 'archive.org/s*', status_ok=True)
assert results_ok == [r for r in results if r['status'] == '200']
results_html = cc.query('CC-MAIN-2021-43', 'archive.org/s*', mime='text/html', status_ok=False)
assert results_html == [r for r in results if r['mime'] == 'text/html']
mimes = ['audio/x-mpegurl', 'image/x.djvu', 'text/xml']
results_mime = cc.query('CC-MAIN-2021-43', 'archive.org/s*', mime=mimes, status_ok=False)
assert results_mime == [r for r in results if r['mime'] in mimes]
results_image = cc.query('CC-MAIN-2021-43', 'archive.org/s*', mime='image/*', status_ok=False)
assert results_image == [r for r in results if r['mime'].startswith('image/')]
cc = CommonCrawl(test_cache)
result = {'urlkey': 'com,skeptric)/',
'timestamp': '20211028110756',
'url': 'https://skeptric.com/',
'mime': 'text/html',
'mime-detected': 'text/html',
'status': '200',
'digest': '7RBLUZ55MD4FUPDRPVVJVTM7YDYQXRS3',
'length': '107238',
'offset': '650789450',
'filename': 'crawl-data/CC-MAIN-2021-43/segments/1634323588284.71/warc/CC-MAIN-20211028100619-20211028130619-00465.warc.gz',
'languages': 'eng',
'encoding': 'UTF-8'}
contents = cc.fetch([result])
content = cc.fetch_one(result)
assert len(contents) == 1
assert content == contents[0]