Resources

Dataset

Do not download the files using the web browser!
Use the code provided in the notebooks instead.

The WARC files are organized into four categories:

    WARC collection        Size
    ====================================
    Lifranum-method        2.84 GB
    Cartoweb               336 MB
    autres                 721 MB
    Repo-ecriture-num      158 MB

Each category name corresponds to the methodology used in the LIFRANUM project to define the web crawler's input list. For each URL in the list, the web crawler retrieved:

  • all resources one level away from the URLs used as input (see LIFRANUM curated URLs)

  • all resources (i.e., recursively) within the same domain
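
The two crawl rules above can be read as a scope test applied to every link the crawler discovers. The following is a minimal sketch of that reading, not the crawler's actual configuration: the function names, the depth convention, the example.org URLs, and the interpretation of "same domain" as "same host name, ignoring a leading www." are all assumptions made for illustration.

```python
from urllib.parse import urlsplit


def same_domain(seed_url: str, candidate_url: str) -> bool:
    """Assumed rule: same host name, ignoring case and a leading 'www.'."""
    def host(url: str) -> str:
        name = (urlsplit(url).hostname or "").lower()
        return name[4:] if name.startswith("www.") else name

    seed_host = host(seed_url)
    return seed_host != "" and seed_host == host(candidate_url)


def in_scope(seed_url: str, candidate_url: str, depth: int) -> bool:
    """Decide whether a discovered URL would be captured.

    depth is the number of links followed from the seed URL
    (0 = the seed itself, 1 = a resource linked directly from it).
    """
    # Rule 1: everything at most one level away from the seed, on any domain.
    if depth <= 1:
        return True
    # Rule 2: deeper resources only if they stay on the seed's domain.
    return same_domain(seed_url, candidate_url)


# Hypothetical seed, used only to exercise the rules.
seed = "https://www.example.org/blog/"
print(in_scope(seed, "https://other.example.net/page", 1))
print(in_scope(seed, "https://other.example.net/page", 2))
print(in_scope(seed, "https://example.org/blog/2019/post", 5))
```

Under this reading, an external page linked directly from a seed is captured, but links found on that external page are not followed further, while pages on the seed's own domain are followed at any depth.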

Notebooks

Docker-compatible notebooks are available at the Datathon repository.

Visualization Tools

Online docs

Posters
