---
title: Query
keywords: fastai
sidebar: home_sidebar
nb_path: "nbs/01_query.ipynb"
---
%load_ext autoreload
%autoreload 2
This is an interface for querying resources that can then be fetched.
It covers local files, the Common Crawl Index and the Internet Archive's CDX Server for finding archived resources based on URL patterns. The latter two overlap a lot with cdx_toolkit, but this library is more optimised for speed and ease of use.
Generated some test data with:
wget -r -Q1M --domains skeptric.com --warc-file=skeptric --delete-after --no-directories https://skeptric.com/pagination-wayback-cdx/
See the warcio library for how to do this in Python.
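For reference, a rough sketch of doing the same capture with warcio's capture_http (the file name and URL here are just illustrative):

# Sketch: write requests made inside the block to a WARC file using warcio.
from warcio.capture_http import capture_http
import requests  # note: requests must be imported after capture_http
with capture_http('skeptric.warc.gz'):
    requests.get('https://skeptric.com/pagination-wayback-cdx/')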
test_data = '../resources/test/skeptric.warc.gz'
warc = WarcFileQuery(test_data)
results = warc.query()
results
Try fetching a record
image_record = [r for r in results if r.mime == 'image/png'][0]
content = image_record.content
from IPython.display import Image
Image(content)
Check the digests
for result in results:
assert result.digest == sha1_digest(result.get_content())
The Internet Archive runs its own Java CDX Server. Hackernoon has an article that gives a good overview of querying it.
See the documentation for all the parameters; here are the ones that I find most interesting:

- `*` in the URL can implicitly define the match type (which means we don't need `matchType`)
- `filter=[!]field:regex` filters a returned field with a Java Regex, inverting with `!`
- Default fields returned: `["urlkey","timestamp","original","mimetype","statuscode","digest","length"]`
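As a rough illustration of these parameters (this is not the library code), a raw query against the Wayback CDX endpoint could look like:

# Illustrative raw request to the Wayback CDX server using the parameters above.
import requests
params = {
    'url': 'skeptric.com/*',   # the trailing * implicitly sets the match type
    'output': 'json',          # JSON output: an array of arrays, first row is the header
    'filter': ['statuscode:200', 'mimetype:text/html'],  # [!]field:regex filters
    'limit': 5,
}
response = requests.get('http://web.archive.org/cdx/search/cdx', params=params)
response.raise_for_status()
response.json()[:3]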
Pagination isn't very useful here because it's applied before filtering (including date filtering), so with a date filter most pages come back empty. Unpaginated requests are fast enough anyway.
The data is a JSON array of arrays; the first line contains the headers and the subsequent lines the data. Let's transform it into a list of dictionaries mapping keys to values, which makes it a bit easier to work with (although less efficient in memory).
An alternative would be to use something like Pandas directly.
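A minimal sketch of what header_and_rows_to_dict could look like (the exported implementation may differ):

# Sketch: convert [[header...], [row...], ...] into a list of dicts keyed by the header.
def header_and_rows_to_dict(header_and_rows):
    if not header_and_rows:
        return []
    header, *rows = header_and_rows
    return [dict(zip(header, row)) for row in rows]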
Check it on some data
assert header_and_rows_to_dict([['col_1', 'col_2'], [1, 'a'], [2, 'b']]) == [
{'col_1': 1, 'col_2': 'a'},
{'col_1': 2, 'col_2': 'b'}]
assert header_and_rows_to_dict([['col_1', 'col_2']]) == []
assert header_and_rows_to_dict([]) == []
%%time
full_sample = query_wayback_cdx('skeptric.com/*', start=None, end=None, status_ok=False)
%%time
ok_sample = query_wayback_cdx('skeptric.com/*', start=None, end=None) #status_ok=True
assert ok_sample == [x for x in full_sample if x['statuscode'] == '200']
%%time
html_sample = query_wayback_cdx('skeptric.com/*', start=None, end=None, status_ok=False, mime='text/html')
assert html_sample == [x for x in full_sample if x['mimetype'] == 'text/html']
%%time
image_sample = query_wayback_cdx('skeptric.com/*', start=None, end=None, status_ok=False, mime='image/*')
assert image_sample == [x for x in full_sample if x['mimetype'].startswith('image/')]
%%time
prog_sample = query_wayback_cdx('skeptric.com/*', start=None, end=None, status_ok=False,
mime=['text/css', 'application/javascript'])
assert prog_sample == [x for x in full_sample if x['mimetype'] in ['text/css', 'application/javascript']]
%%time
sample_2020 = query_wayback_cdx('skeptric.com/*', start='2020', end='2020', status_ok=False)
len(sample_2020)
assert sample_2020 == [x for x in full_sample if '2020' <= x['timestamp'] < '2021']
%%time
sample_to_2020 = query_wayback_cdx('skeptric.com/*', start=None, end='2020', status_ok=False)
assert sample_to_2020 == [x for x in full_sample if x['timestamp'] < '2021']
%%time
sample_from_2020 = query_wayback_cdx('skeptric.com/*', start='2020', end=None, status_ok=False)
assert sample_from_2020 == [x for x in full_sample if x['timestamp'] >= '2020']
%%time
sample_10 = query_wayback_cdx('skeptric.com/*', start=None, end=None, status_ok=False, limit=10)
assert sample_10 == full_sample[:10]
%%time
sample_10_offset_20 = query_wayback_cdx('skeptric.com/*', start=None, end=None, status_ok=False, limit=10, offset=20)
assert sample_10_offset_20 == full_sample[20:20+10]
The original content is at `http://web.archive.org/web/{timestamp}id_/{url}`, and the Wayback Machine version is the same URL without the `id_`.
The Wayback Machine version makes some changes that make it better for interactive viewing, but the content is different from the original.
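A sketch of how these URLs could be constructed (the actual wayback_url helper may differ):

# Sketch: build the raw-content or interactive Wayback Machine URL for a capture.
def wayback_url(timestamp, url, wayback=False):
    # id_ returns the original bytes; without it the page is rewritten for viewing
    flag = '' if wayback else 'id_'
    return f'http://web.archive.org/web/{timestamp}{flag}/{url}'

wayback_url('20211028110756', 'https://skeptric.com/')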
record = image_sample[0]
record
content = fetch_wayback_content(record['timestamp'], record['original'])
from IPython.display import Image
Image(content)
assert record['digest'] == sha1_digest(content)
And fetching an HTML webpage
record = ok_sample[0]
content = fetch_wayback_content(record['timestamp'], record['original'])
assert record['digest'] == sha1_digest(content)
Get the Wayback URL for preview
URL(wayback_url(record['timestamp'], record['original'], wayback=True))
record = WaybackRecord.from_dict(full_sample[0])
record
record.preview()
%%time
content = record.content
content[:100]
assert sha1_digest(content) == record.digest
wb = WaybackQuery('skeptric.com/*', start=None, end=None)
%%time
items = list(wb.query())
assert len(items) > 50
content = items[0].content
assert sha1_digest(content) == items[0].digest
I have no idea whether Session is actually thread-safe; if it's not, we should have one session per thread. As far as I can tell the issue occurs when you have lots of hosts, so in this case it should be ok.
I guess we'll try it and see. Maybe long term we'd be better off going with asyncio. Using a Session makes things slightly faster, and we can always turn it off by passing session=False.
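If a shared Session did turn out to be a problem, one option would be a session per thread, roughly like this sketch (the real fetch_parallel may be implemented differently):

# Sketch: fetch a batch of URLs in parallel with one requests.Session per worker thread.
import threading
from concurrent.futures import ThreadPoolExecutor
import requests

_local = threading.local()

def _thread_session():
    # each thread lazily creates and then reuses its own Session
    if not hasattr(_local, 'session'):
        _local.session = requests.Session()
    return _local.session

def fetch_all(urls, threads=8):
    def fetch(url):
        response = _thread_session().get(url)
        response.raise_for_status()
        return response.content
    with ThreadPoolExecutor(max_workers=threads) as pool:
        return list(pool.map(fetch, urls))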
%%time
contents_16 = WaybackRecord.fetch_parallel(items[:16], threads=8)
assert contents_16[0] == content
Common Crawl is similar but slightly different: it is split over many indexes, released roughly monthly, each covering about 2 weeks of crawling.
We cache the index list within a session, but not between sessions, since the indexes change over the months.
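The list of indexes is published at https://index.commoncrawl.org/collinfo.json; a sketch of get_cc_indexes with per-process caching (the real function may differ):

# Sketch: fetch the Common Crawl index list, cached for the lifetime of the process.
from functools import lru_cache
import requests

@lru_cache(maxsize=1)
def get_cc_indexes():
    response = requests.get('https://index.commoncrawl.org/collinfo.json')
    response.raise_for_status()
    return response.json()  # list of dicts with 'id', 'name', 'timegate' and 'cdx-api'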
indexes = get_cc_indexes()
indexes[:3]
Notice the oldest ones have a different format
indexes[-4:]
We can parse approximate periods for the crawls using the same logic as cdx_toolkit.
However, in reality it's more complicated.
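For the recent CC-MAIN-YYYY-WW ids, a rough sketch of the parsing, assuming the last two components are an ISO year and week (older ids need special handling, and the actual parse_cc_crawl_date may differ):

# Sketch: approximate the start date of a crawl from an id like 'CC-MAIN-2021-43'.
from datetime import datetime

def parse_cc_crawl_date_sketch(crawl_id):
    year, week = crawl_id.split('-')[-2:]  # assumes the modern CC-MAIN-YYYY-WW format
    return datetime.strptime(f'{year}-{week}-1', '%G-%V-%u')  # Monday of that ISO week

parse_cc_crawl_date_sketch('CC-MAIN-2021-43')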
dates = [parse_cc_crawl_date(i['id']) for i in indexes]
cc_index_by_time(start=datetime(2021,5,1), end=datetime(2021,12,1))
cc_index_by_time(end=datetime(2010,1,1))
cc_index_by_time(start=datetime(2021,1,1))
cc_sample_api = 'https://index.commoncrawl.org/CC-MAIN-2021-43-index'
assert cc_sample_api in [i['cdx-api'] for i in indexes]
The Common Crawl CDX is much slower than the Wayback Machine's. It also uses the zipped CDX by default, which automatically paginates.
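Each page of results comes back as newline-delimited JSON; a minimal sketch of jsonl_loads (the exported version may differ):

# Sketch: parse newline-delimited JSON bytes into a list of dicts.
import json

def jsonl_loads(data):
    return [json.loads(line) for line in data.splitlines() if line.strip()]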
test_jline = b'{"status": "200", "mime": "text/html"}\n{"status": "301", "mime": "-"}\n'
test_jdict = [{'status': '200', 'mime': 'text/html'}, {'status': '301', 'mime': '-'}]
assert jsonl_loads(test_jline) == test_jdict
assert jsonl_loads(test_jline.rstrip()) == test_jdict
test_url = 'rmy.wikipedia.org/*'
pages = query_cc_cdx_num_pages(cc_sample_api, test_url, page_size=1)
pages
records = []
for page in range(pages):
records += query_cc_cdx_page(cc_sample_api, test_url, page=page, mime='text/html', page_size=1)
len(records)
timestamps = [record['timestamp'] for record in records]
min(timestamps), max(timestamps)
We should get an error querying past the last page (400: Invalid Request)
from requests.exceptions import HTTPError
try:
query_cc_cdx_page(cc_sample_api, test_url, page=pages, mime='text/html', page_size=1)
raise AssertionError('Expected Failure')
except HTTPError as e:
status = e.response.status_code
if status != 400:
raise AssertionError(f'Expected 400, got {status}')
For CC-MAIN-2015-11 and CC-MAIN-2015-06, both the status and mime are missing from the index, which leads to problems.
api_2015_11 = next(idx['cdx-api'] for idx in get_cc_indexes() if idx['id'] == 'CC-MAIN-2015-11')
This turns up in this particular case as a 404
try:
results = query_cc_cdx_page(api_2015_11, test_url, page=0, page_size=1)
raise ValueError("Has it been fixed? Remove the blacklist")
except requests.exceptions.HTTPError as e:
if e.response.status_code != 404:
raise
This doesn't happen when we remove the filters
results = query_cc_cdx_page(api_2015_11, test_url, page=0, status_ok=False, page_size=1)
record = records[0]
record
record = {'urlkey': 'org,wikipedia,en)/?curid=3516101',
'timestamp': '20211024051554',
'url': 'https://en.wikipedia.org/?curid=3516101',
'mime': 'text/html',
'mime-detected': 'text/html',
'status': '200',
'digest': 'TFZFWXWKS3NFJLLLXLYU6JTNZ77D3IVD',
'length': '8922',
'offset': '332329771',
'filename': 'crawl-data/CC-MAIN-2021-43/segments/1634323585911.17/warc/CC-MAIN-20211024050128-20211024080128-00689.warc.gz',
'languages': 'eng',
'encoding': 'UTF-8'}
%%time
content = fetch_cc(record['filename'], record['offset'], record['length'])
content[:1000]
Check the digest
assert sha1_digest(content) == record['digest']
record
x = _cc_cdx_to_record(record)
x
x.content[:100]
from tempfile import NamedTemporaryFile
f = NamedTemporaryFile('wb', dir='.', suffix='.html')
name = Path(f.name).relative_to(Path('.').absolute())
x.preview(name)
f.close()
CC_API_FILTER_BLACKLIST
query_cc_all = CommonCrawlQuery('www.commoncrawl.org/*',
start=datetime(2015,3,1), end=datetime(2015,3,1))
%%time
query_cc_all.cdx_apis
x = list(query_cc_all.query())
len(x)
Notice there is no mime or status. We could fill these in from the archive record; a sketch of that follows below.
x[0]
x[0].content[:100]
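As noted above, the mime and status could be filled in by fetching and parsing the raw WARC record. Here is a rough sketch using warcio and an HTTP range request, for a raw CDX dict with 'filename', 'offset' and 'length'; the base URL and helper name are assumptions, not part of the library:

# Sketch (hypothetical helper): recover status and mime by parsing the raw WARC record.
import io
import requests
from warcio.archiveiterator import ArchiveIterator

CC_DATA_URL = 'https://data.commoncrawl.org/'  # assumed host for crawl-data files

def fill_status_and_mime(record):
    start = int(record['offset'])
    end = start + int(record['length']) - 1
    response = requests.get(CC_DATA_URL + record['filename'],
                            headers={'Range': f'bytes={start}-{end}'})
    response.raise_for_status()
    warc_record = next(ArchiveIterator(io.BytesIO(response.content)))
    record['status'] = warc_record.http_headers.get_statuscode()
    content_type = warc_record.http_headers.get_header('Content-Type') or ''
    record['mime'] = content_type.split(';')[0]
    return record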
test_url = 'upload.wikimedia.org/wikipedia/commons/*'
cc_archive_all = CommonCrawlQuery(test_url,
apis=['CC-MAIN-2021-43'],
status_ok=False)
results = list(cc_archive_all.query(page_size=5))
from collections import Counter
Counter((r.mime, r.status) for r in results)
results_ok = list(CommonCrawlQuery(test_url,
apis=['CC-MAIN-2021-43']).query(page_size=5))
assert results_ok == [r for r in results if r.status == 200]
These would fail for the APIs in CC_API_FILTER_BLACKLIST.
results_html = list(CommonCrawlQuery(test_url,
apis=['CC-MAIN-2021-43'],
status_ok=False,
mime='text/html').query(page_size=5))
assert results_html == [r for r in results if r.mime == 'text/html']
mimes = ['application/pdf', 'image/vnd.djvu', 'application/ogg']
results_mime = list(CommonCrawlQuery(test_url,
apis=['CC-MAIN-2021-43'],
status_ok=False,
mime=mimes).query(page_size=5))
assert results_mime == [r for r in results if r.mime in mimes]
results_image = list(CommonCrawlQuery(test_url,
apis=['CC-MAIN-2021-43'],
status_ok=False,
mime='image/*').query(page_size=5))
assert results_image == [r for r in results if r.mime.startswith('image/')]
result = _cc_cdx_to_record({'urlkey': 'com,skeptric)/',
'timestamp': '20211028110756',
'url': 'https://skeptric.com/',
'mime': 'text/html',
'mime-detected': 'text/html',
'status': '200',
'digest': '7RBLUZ55MD4FUPDRPVVJVTM7YDYQXRS3',
'length': '107238',
'offset': '650789450',
'filename': 'crawl-data/CC-MAIN-2021-43/segments/1634323588284.71/warc/CC-MAIN-20211028100619-20211028130619-00465.warc.gz',
'languages': 'eng',
'encoding': 'UTF-8'})
assert sha1_digest(result.content) == result.digest
many_content = CommonCrawlRecord.fetch_parallel(results_html[:100])