Dataset
Do not download the files using a web browser!
Use the code provided in the notebooks instead.
The WARC files are organized into 4 categories:
WARC collection      Size
====================================
Lifranum-method      2.84 GB
Cartoweb             336 MB
autres               721 MB
Repo-ecriture-num    158 MB
Each category name corresponds to the methodology used in the LIFRANUM project to define the web crawler's input list. For each URL in the list, the web crawler retrieved:
- all resources one level away from the URLs used as input (see LIFRANUM curated URLs)
- all resources within the same domain (i.e., followed recursively)
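The two retrieval rules above can be read as a scope test that is applied to every link the crawler discovers. The following is a minimal sketch of that reading, not the crawler's actual configuration: the function names `same_domain` and `in_crawl_scope` and the `hops_from_seed` parameter are hypothetical, and "same domain" is assumed here to mean an identical host name.

```python
from urllib.parse import urlsplit


def same_domain(seed_url: str, candidate_url: str) -> bool:
    """Return True when both URLs have the same, non-empty host name.

    Assumption: "same domain" is interpreted as an identical host name.
    """
    seed_host = urlsplit(seed_url).hostname or ""
    candidate_host = urlsplit(candidate_url).hostname or ""
    return seed_host != "" and seed_host == candidate_host


def in_crawl_scope(seed_url: str, candidate_url: str, hops_from_seed: int) -> bool:
    """Hypothetical scope test combining the two rules.

    hops_from_seed is the number of links followed from the seed page
    (0 for the seed page itself).
    """
    if same_domain(seed_url, candidate_url):
        # Same-domain resources are kept at any depth (recursive rule).
        return True
    # Resources on other domains are kept only if at most one link away.
    return 0 <= hops_from_seed <= 1


if __name__ == "__main__":
    seed = "https://example.org/"
    print(in_crawl_scope(seed, "https://example.org/a/b/c", 5))
    print(in_crawl_scope(seed, "https://other.example.com/x", 1))
    print(in_crawl_scope(seed, "https://other.example.com/x", 2))
```

Under these assumptions, a page deep inside the seed's own site is in scope, while a page on another site is in scope only when it is linked directly from the seed.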
Notebooks
Docker-compatible notebooks are available at the Datathon repository.
Datathon WebArchives Demo notebook
Datathon WebArchives Quickstart notebook
AUT text analysis notebook
Visualization Tools
Gephi: Graph Visualization Platform
Voyant Tools: Analysis Environment for Digital Texts
- see Voyant Tutorial
Plotly Express (Python library)
Online docs
- Spark by Examples
- Archives Unleashed Toolkit (0.91.0): Framework for analyzing web archives with Apache Spark
- Spark (3.0.0)