PyCommonCrawl – Processing the Web

The goal of this project is to create a simple tool to process the data provided by CommonCrawl in Python.

I wanted to keep things simple. The problem is that downloading all of the CommonCrawl data requires too much space (around 53 TiB for the compressed version), so I needed a streaming/online solution. Fortunately, the data is divided into segments (around 56,000). My solution makes the whole process of downloading and deleting the segments transparent: the dataset looks as if it were a single file. It is possible to iterate by line and by WARC block (WARC being the format used for Internet archives).
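To illustrate the idea, here is a minimal sketch of how such a stream could work. It is not the package's actual code: the names `fetch_to_temp`, `iter_lines` and `iter_warc_blocks` are made up for this example, and I assume each segment is a gzip-compressed file reachable by URL. Each segment is downloaded to a temporary file, read line by line, then deleted before the next one is fetched.

```python
import gzip
import os
import shutil
import tempfile
import urllib.request


def fetch_to_temp(url):
    """Download one segment to a temporary file and return its path."""
    fd, path = tempfile.mkstemp(suffix=".gz")
    with os.fdopen(fd, "wb") as out, urllib.request.urlopen(url) as resp:
        shutil.copyfileobj(resp, out)
    return path


def iter_lines(segment_urls, fetch=fetch_to_temp):
    """Yield the lines of every segment in order, as if they were one file."""
    for url in segment_urls:
        path = fetch(url)
        try:
            with gzip.open(path, "rb") as f:
                for line in f:
                    yield line
        finally:
            # Delete the segment so only one is on disk at a time.
            os.remove(path)


def iter_warc_blocks(lines):
    """Group lines into blocks, starting a new block at each 'WARC/' line.

    Simplification: a payload line that happens to start with 'WARC/'
    would be split wrongly; a real parser should use Content-Length.
    """
    block = []
    for line in lines:
        if line.startswith(b"WARC/") and block:
            yield b"".join(block)
            block = []
        block.append(line)
    if block:
        yield b"".join(block)
```

With a list of segment URLs in a (hypothetical) variable `urls`, `for block in iter_warc_blocks(iter_lines(urls)): ...` walks the whole dataset while keeping at most one segment on disk. The `fetch` argument is there only so the download step can be swapped out, for example for testing with local files.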

However, this solution can be extremely slow, the main limiting factor being the download.
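To get a feel for the orders of magnitude, here is a back-of-the-envelope calculation based on the figures above (53 TiB, about 56,000 segments). The 100 Mbit/s connection speed is purely an assumption for illustration; plug in your own bandwidth.

```python
TOTAL_BYTES = 53 * 2**40      # about 53 TiB of compressed data
NUM_SEGMENTS = 56_000         # approximate number of segments
BANDWIDTH_BITS_PER_S = 100e6  # assumed 100 Mbit/s link (hypothetical)

# Average size of one segment, in GiB.
segment_gib = TOTAL_BYTES / NUM_SEGMENTS / 2**30

# Time to download everything at the assumed bandwidth, in days.
download_days = TOTAL_BYTES * 8 / BANDWIDTH_BITS_PER_S / 86_400

print(f"average segment size: {segment_gib:.2f} GiB")
print(f"full download time:   {download_days:.0f} days")
```

Under these assumptions each segment is a little under 1 GiB, and streaming the whole dataset takes roughly 54 days of uninterrupted download.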

All the code is available on GitHub, and I created a Python package.
