Schema
To answer the profiling queries, you need to extend the dataframe returned by the AUT .webpages() method as shown in the extended schema below.
Note that:
- all new columns are derived from the url column.
- the derivation can be done using the tldextract and urllib.parse libraries.
Extended Schema
df.webpages()
 |
 |-- crawl_date
 |-- mime_type_web_server
 |-- mime_type_tika
 |-- language
 |-- content
 |-- url: http://forums.news.cnn.com:80/
 |-- url_host_name: forums.news.cnn.com
 |-- url_domain: cnn
 |-- url_subdomain: forums.news
 |-- url_tld: com
 |-- url_registered_domain: cnn.com
 |-- url_domain_reversed: com.cnn.news.forums
 |-- url_protocol: http
 |-- url_host_port: 80
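The derived columns in the schema above could be computed along the following lines. This is a minimal sketch, not the course's reference solution: the function name `derive_url_columns`, the `extract` parameter (which lets a stub stand in for tldextract), and the fallback to the scheme's well-known port when the URL carries none are all assumptions made for illustration.

```python
from urllib.parse import urlsplit

# Assumption: when the URL has no explicit port, use the scheme's default.
DEFAULT_PORTS = {"http": 80, "https": 443}


def derive_url_columns(url, extract=None):
    """Return the derived url_* columns for a single URL as a dict."""
    if extract is None:
        # Imported lazily so the function can also be exercised with a stub.
        import tldextract  # third-party: pip install tldextract
        extract = tldextract.extract

    parts = urlsplit(url)
    host = parts.hostname or ""
    ext = extract(url)  # exposes .subdomain, .domain and .suffix

    # Registered domain = domain + public suffix, e.g. "cnn" + "com".
    registered = f"{ext.domain}.{ext.suffix}" if ext.domain and ext.suffix else ""
    return {
        "url": url,
        "url_host_name": host,
        "url_domain": ext.domain,
        "url_subdomain": ext.subdomain,
        "url_tld": ext.suffix,
        "url_registered_domain": registered,
        "url_domain_reversed": ".".join(reversed(host.split("."))) if host else "",
        "url_protocol": parts.scheme,
        "url_host_port": parts.port or DEFAULT_PORTS.get(parts.scheme),
    }
```

On the dataframe itself, a function like this would typically be wrapped in a user-defined function and applied to the url column.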
Optional
(see HTTP header fields)
- url_host_ip: 192.168.1.10 (cf. log.txt file)
- content_length: Content-Length: 348 (cf. content’s HTTP header)
- content_charset: Content-Type: text/html; charset=utf-8 (cf. content’s HTTP header)
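The two header-based optional columns could be extracted as sketched below. It assumes the raw HTTP header block is available as text (for example at the start of the content column, terminated by a blank line); whether that holds depends on the AUT version and should be checked against your data. The function name `parse_http_header_fields` is hypothetical.

```python
import re

# Matches the charset parameter of a Content-Type value, quoted or not.
CHARSET_RE = re.compile(r"charset\s*=\s*\"?([\w.:-]+)", re.IGNORECASE)


def parse_http_header_fields(raw_header):
    """Extract content_length and content_charset from raw HTTP header text."""
    fields = {}
    for line in raw_header.splitlines():
        if not line.strip():
            break  # a blank line ends the header block
        name, sep, value = line.partition(":")
        if sep:  # the status line has no colon and is skipped
            fields[name.strip().lower()] = value.strip()

    length = fields.get("content-length", "")
    match = CHARSET_RE.search(fields.get("content-type", ""))
    return {
        "content_length": int(length) if length.isdigit() else None,
        "content_charset": match.group(1).lower() if match else None,
    }
```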
Queries
The following queries must be computed using the WARC data collection of your choice.
Domain names
Which are the top registered domains (see example)?
- # urls per domain
- % domain urls in collection
Which are the top TLDs and gTLDs (see example)?
- see What Is a TLD? for more information
Identify domains having more than one TLD (e.g., amazon/.com/.fr).
Identify subdomains for each domain.
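The domain-name queries above can be prototyped on a small sample before being written against the full dataframe. The sketch below assumes each row is a plain dict carrying the url_* columns from the extended schema; the helper names (`top_values`, `domains_with_multiple_tlds`, `subdomains_per_domain`) are hypothetical.

```python
from collections import Counter, defaultdict


def top_values(rows, column, n=10):
    """Return (value, count, percentage of all rows) for the n most common values."""
    counts = Counter(row[column] for row in rows)
    total = sum(counts.values())
    return [(value, count, 100.0 * count / total)
            for value, count in counts.most_common(n)]


def domains_with_multiple_tlds(rows):
    """Map each domain seen under more than one TLD to its sorted TLDs."""
    tlds = defaultdict(set)
    for row in rows:
        tlds[row["url_domain"]].add(row["url_tld"])
    return {domain: sorted(seen) for domain, seen in tlds.items() if len(seen) > 1}


def subdomains_per_domain(rows):
    """Map each registered domain to its sorted, non-empty subdomains."""
    subdomains = defaultdict(set)
    for row in rows:
        if row["url_subdomain"]:
            subdomains[row["url_registered_domain"]].add(row["url_subdomain"])
    return {domain: sorted(seen) for domain, seen in subdomains.items()}
```

Calling `top_values(rows, "url_registered_domain")` gives the count and percentage per registered domain, and `top_values(rows, "url_tld")` does the same per TLD.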
URLs
Compute the distribution of the URL components. For instance:
- http/https
- port number
Count the number of words used in URLs
- e.g., split URLs at . (dot) and - (hyphen)
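One possible way to approach the URL queries is sketched below. The helper names `distribution` and `url_words` are hypothetical. Splitting at dot and hyphen follows the hint above; also splitting at the slash, and restricting the text to host name plus path, are added assumptions that keep the scheme and query string out of the word list.

```python
import re
from collections import Counter
from urllib.parse import urlsplit

# Dot and hyphen come from the exercise hint; the slash is an added assumption
# so that path segments are separated as well.
WORD_DELIMITERS = re.compile(r"[.\-/]+")


def distribution(values):
    """Return the relative frequency of each distinct value."""
    counts = Counter(values)
    total = sum(counts.values())
    return {value: count / total for value, count in counts.items()}


def url_words(url):
    """Split the host name and path of a URL into words."""
    parts = urlsplit(url)
    text = (parts.hostname or "") + parts.path
    return [word for word in WORD_DELIMITERS.split(text) if word]
```

For example, `distribution(urlsplit(u).scheme for u in urls)` gives the http/https split, and `Counter(len(url_words(u)) for u in urls)` gives the number of URLs per word count.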
Web pages content
Compute the list of languages used in web pages.
Compute the distribution of:
Identify page titles (optional)
Collection & Hosts
How many pages, per language, were collected per host?
What is the total number of bytes that was collected by the crawler for each host?
What is the ratio between the web pages and web resources that were collected per host? (optional)
For each host, list the server IPs with which the crawler interacted. Then, determine whether custom domains point to any of these hosts (optional)
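The first two host-level questions reduce to grouped counts and sums. The sketch below assumes rows are dicts with url_host_name, language, and content, plus the optional content_length column when it has been derived. When content_length is missing, it falls back to the UTF-8 size of content, which is only an approximation of the bytes actually transferred. Both function names are hypothetical.

```python
from collections import Counter, defaultdict


def pages_per_host_and_language(rows):
    """Count collected pages for each (host, language) pair."""
    return Counter((row["url_host_name"], row["language"]) for row in rows)


def bytes_per_host(rows):
    """Sum the collected bytes for each host."""
    totals = defaultdict(int)
    for row in rows:
        size = row.get("content_length")
        if size is None:
            # Assumption: approximate the size from the stored content.
            size = len(row.get("content", "").encode("utf-8"))
        totals[row["url_host_name"]] += size
    return dict(totals)
```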
Images
Identify the largest/smallest image in the dataset (width x height).
Identify the images appearing in the data collection under different names (i.e., images having the same MD5 hash)
Find images shared between more than 2 domains (example)
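The image queries can be sketched by grouping records on their MD5 hash. The code assumes each image record is a dict with md5, filename, width, and height fields, plus a url_domain field derived from the image URL in the same way as for web pages; these field names and all function names are assumptions to be adapted to the actual image dataframe.

```python
from collections import defaultdict


def largest_and_smallest(images):
    """Return the images with the largest and smallest pixel area."""
    area = lambda image: image["width"] * image["height"]
    return max(images, key=area), min(images, key=area)


def values_per_md5(images, field):
    """Collect the distinct values of one field for each MD5 hash."""
    groups = defaultdict(set)
    for image in images:
        groups[image["md5"]].add(image[field])
    return groups


def same_image_different_names(images):
    """Map each MD5 hash seen under several file names to those names."""
    groups = values_per_md5(images, "filename")
    return {md5: sorted(names) for md5, names in groups.items() if len(names) > 1}


def shared_between_domains(images, min_domains=3):
    """Map each MD5 hash found on more than two domains to those domains."""
    groups = values_per_md5(images, "url_domain")
    return {md5: sorted(domains) for md5, domains in groups.items()
            if len(domains) >= min_domains}
```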
Web graph
- Identify domains with strong/weak connectivity
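"Strong/weak connectivity" is open to interpretation. The sketch below takes one simple reading: a domain is strongly connected when it links to, or is linked from, many distinct other domains. It assumes the web graph is available as (source domain, destination domain) pairs; the function name `domain_degrees` is hypothetical. Another reading would be to compute connected components of the graph, which is not shown here.

```python
from collections import defaultdict


def domain_degrees(edges):
    """Return {domain: (in_degree, out_degree)} over distinct neighbour domains."""
    out_links = defaultdict(set)
    in_links = defaultdict(set)
    for src, dest in edges:
        if src == dest:
            continue  # ignore links that stay within the same domain
        out_links[src].add(dest)
        in_links[dest].add(src)

    domains = set(out_links) | set(in_links)
    return {domain: (len(in_links.get(domain, ())), len(out_links.get(domain, ())))
            for domain in domains}
```

Sorting the result by the sum of both degrees puts the most connected domains first and the least connected last.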
Multimedia
- Compute the statistics of the multimedia files within the data collection (optional)