A Systematic Approach Towards Web Preservation

Muzammil Khan and Arif Ur Rahman

Muzammil Khan (muzammilkhan86@gmail.com), Assistant Professor, Department of Computer and Software Technology, University of Swat. Arif Ur Rahman (badwanpk@gmail.com), Assistant Professor, Department of Computer Science, Bahria University Islamabad.

ABSTRACT

The main purpose of the article is to divide the web preservation process into small explicable stages and to design a step-by-step web preservation process that leads to a well-organized web archive. A number of research articles on web preservation projects and web archives were studied, and a step-by-step systematic approach for web preservation was designed. The proposed comprehensive web preservation process describes and combines the strengths of different techniques observed during the study for preserving digital web contents in a digital web archive. For each web preservation step, different approaches and possible implementation techniques have been identified that can be adopted in digital archiving. The potential value of the proposed model is to guide archivists, related personnel, and organizations to effectively preserve their intellectual digital contents for future use. Moreover, the model can help to initiate a web preservation process and create a well-organized web archive in order to efficiently manage the archived web contents. A section briefly describes the implementation of the proposed approach in a digital news stories preservation framework for archiving news published online from different sources.

INTRODUCTION

The amount of information generated by institutions is increasing with the passage of time. One of the media used to share this information is the World Wide Web (WWW). The WWW has become a tool to share information quickly with everyone regardless of their physical location. The number of web pages is vast; Google and Bing each index approximately 4.8 billion pages.1 Though the WWW is a rapidly growing source of information, it is fragile in nature. According to the available statistics, 80 percent of pages become unavailable after one year and 13 percent of links (mostly web references) in scholarly articles are broken after 27 months.2 Moreover, 11 percent of posts and comments on websites for various purposes are lost within a year. According to another study conducted on 10 million web pages collected from the Internet Archive in 2001, the average survival rate of web pages is 1,132.1 days with a standard deviation of 903.5 days, and 90.6 percent of those web pages are inaccessible today.3 This fragility causes valuable scholarly, cultural, and scientific information to vanish and become inaccessible to future generations. In recent years, it was realized that the lifespan of digital objects is very short, and rapid technological changes make it more difficult to access these objects. Therefore, there is a need to preserve the information available on the WWW.
Digital preservation is performed using the primary methods of emulation and migration, in which emulation provides the preserved digital objects in their original format while migration provides the objects in a different format.4 In the last two decades, a number of institutions worldwide, such as national and international libraries, universities, and companies, started to preserve their web resources (resources found at a web server, i.e., web contents and web structure). The first web archive, the Internet Archive, was initiated in 1996 by Brewster Kahle; it holds more than 30 petabytes of data, which includes 279 billion web pages, 11 million books and texts, and 8 million other digital objects such as audio, video, and image files. More than seventy web archive initiatives have been started in 33 countries since 1996, which shows the importance of web preservation projects and the preservation of web contents. This information era encourages librarians, archivists, and researchers to preserve the information available online for upcoming generations. While digital resources may not replace the information available in physical form, the digital version of these information resources improves access to the available information.5

There are different aspects of the preservation process and web archiving, e.g., digital objects' ingestion into the archive during the preservation process, digital objects' format and storage, archival management, administrative issues, access and security of the archive, and preservation planning. These aspects need to be understood for effective web preservation and will help in addressing the challenges that occur during the preservation process. The Reference Model for an Open Archival Information System (OAIS) is an attempt to provide a high-level framework for the development and comparison of digital archives. In web preservation, a challenging task is to identify the starting point of the preservation process and to complete each step effectively, which helps in proceeding to the other activities. The complicated nature of the web and the complex structure of web contents make the preservation of web content even more difficult. The OAIS reference model helps in achieving the goals of a preservation task in a step-by-step manner. The stakeholders are identified, i.e., producer, management, and consumer, and the packages that need to be processed, i.e., the submission information package (SIP), the archival information package (AIP), and the dissemination information package (DIP), are clearly defined6 (a small illustrative sketch of this package flow appears at the end of this section).

This study aims to design a step-by-step systematic approach for web preservation that helps in understanding the challenges of preservation or archival activities, especially those that relate to digital information objects at various steps of the preservation process. The systematic approach may lead to an easy way to analyze, design, implement, and evaluate the archive with clarity and with different options for an effective preservation process and archival development. An effective preservation process is one that leads to a well-organized, easily managed web archive and accomplishes designated community requirements. This approach may help to address the challenges and risks that confront archivists and analysts during preservation activities.
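To make the OAIS package flow mentioned above concrete, the following is a minimal Python sketch of how a producer's SIP might be turned into an AIP and then a DIP. The class and field names are illustrative assumptions only; OAIS defines the packages conceptually, not as a concrete data structure.

```python
# A minimal sketch of the OAIS package flow (SIP -> AIP -> DIP).
# Field names are illustrative; OAIS does not prescribe this layout.
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class InformationPackage:
    content: bytes                      # the digital object itself
    descriptive_metadata: dict          # e.g., title, creator, subject
    preservation_metadata: dict = field(default_factory=dict)


def ingest(sip: InformationPackage) -> InformationPackage:
    """The producer submits a SIP; the archive turns it into an AIP by adding
    the preservation metadata needed for long-term management."""
    return InformationPackage(
        content=sip.content,
        descriptive_metadata=dict(sip.descriptive_metadata),
        preservation_metadata={
            "ingest_date": datetime.now(timezone.utc).isoformat(),
            "format": sip.descriptive_metadata.get("format", "unknown"),
        },
    )


def disseminate(aip: InformationPackage) -> InformationPackage:
    """The archive derives a DIP for the consumer, exposing the content plus
    the descriptive metadata needed to interpret it."""
    return InformationPackage(
        content=aip.content,
        descriptive_metadata=aip.descriptive_metadata,
    )
```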
STEP-BY-STEP SYSTEMATIC APPROACH

Digital preservation is "the set of processes and activities that ensure long-term, sustained storage of, access to and interpretation of digital information."7 The growth and decline rates of WWW content and the importance of the information presented on the web make it a key candidate for preservation. Web preservation confronts a number of challenges due to the web's complex structure, the variety of available formats, and the type of information (purpose) it provides. The overall layout of the web varies from domain to domain based on the type of information and its presentation. Websites can be categorized based on two things: first, the type of information (i.e., the web contents), and second, the way this information is presented (i.e., the layout or structure of the web page). Examples include educational, personal, news, e-commerce, and social networking websites, which vary a lot in their contents and structure. The variations in the overall layout make it difficult to preserve different web contents in a single web archive. The web preservation activities are summarized in figure 1. The following sections explain the web preservation activities and their possible implementation in the proposed systematic approach.

Defining the Scope of the Web Archive

The WWW provides an opportunity to share information using various services, such as blogs, social networking websites, e-commerce, wikis, and e-libraries. These websites provide information on a variety of topics and address different communities based on their interests and needs. There are many differences in the way the information is handled and presented on the WWW. In addition, the overall layout of the web changes from one domain to another.8 Therefore, it is not practically feasible to develop a single system to preserve all types of websites for the long term. So, before starting to preserve the web, the archivist should define the scope of the web to be archived. The archive will be either a site-centric, topic-centric, or domain-centric archive.9

Site-centric Archive

A site-centric archive focuses on a particular website for preservation. These types of archives are mostly initiated by the website creator or owner. Site-centric web archives allow access to the old versions of the website.

Topic-centric Archive

Topic-centric archives are created to preserve information on a particular topic published on the web for future use. For scientific verification, researchers need to refer to the available information, while it is difficult to ensure access to these contents due to the ephemeral nature of the web. A number of topic-centric archive projects have been carried out, including the Archipol archive of Dutch political websites,10 the Digital Archive for Chinese Studies (DACHS),11 Minerva by the Library of Congress,12 and the French Elections Web archive for archiving websites related to the French elections.13

Domain-centric Archive

The word "domain" refers to a location, network, or web extension. A domain-centric archive covers websites published under a specific domain name, using either a top-level domain (TLD), e.g., .com, .edu, or .org, or a second-level domain (SLD), e.g., .edu.pk or .edu.fr. An advantage of domain-centric archiving is that the collection can be built by automatically detecting websites that fall under the chosen domain.
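As a simple illustration of such automatic detection, the following minimal Python sketch keeps only those URLs whose hostname falls under a configured TLD or SLD. The suffixes and candidate URLs are hypothetical examples, not taken from any of the projects discussed here.

```python
# A minimal sketch of automatic website detection for a domain-centric archive:
# keep only URLs whose hostname falls under a configured TLD or SLD scope.
from urllib.parse import urlparse

IN_SCOPE_SUFFIXES = (".edu.pk", ".gov.uk")   # hypothetical SLD/TLD scope


def in_scope(url: str) -> bool:
    host = urlparse(url).hostname or ""
    return any(host == s.lstrip(".") or host.endswith(s) for s in IN_SCOPE_SUFFIXES)


candidates = [
    "http://www.example.edu.pk/about",
    "https://news.example.com/story",
    "https://data.example.gov.uk/datasets",
]
print([u for u in candidates if in_scope(u)])
# keeps the .edu.pk and .gov.uk URLs and drops the .com one
```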
Several projects have a domain-centric scope, e.g., the Portuguese Web Archive (PWA) collection of national websites,14 Kulturarw3, a Swedish Royal Library web archive collection of .se and .com domain websites,15 and the UK Government Web Archive collection of UK government websites, e.g., .gov.uk domain websites.

Understanding the Web Structure

After defining the scope of the intended web archive, the archivist will have a better understanding of the interests and expected queries of the intended community based on the resources available or the information provided by the selected domain. The focus in this step is to understand the type of information (contents) provided by the selected domain and how the information has been presented. The web can be understood along two dimensions. The first considers the web as a medium that communicates contents using various protocols, e.g., HTTP, and the second considers the web as a content container, which not only holds the contents but also presents them to viewers, e.g., through the underlying technology used to display the contents.16 The preservation team should understand such parameters as the technical issues, the future technologies, and the expected inclusion of other related content.

Figure 1. Systematic Approach for Web Preservation Process.

Identify the Web Resources

The archivist should understand the contents and the representation of the contents of the selected domain, e.g., blogs, social networking websites, institutional websites, educational institutional websites, newspaper websites, or entertainment websites. All of these websites provide different information and address individual communities that have distinct information needs. A web page is the combination of two things, i.e., web contents and web structure.17 The resources which can be preserved are as follows.

Web Contents

Web contents or web information can be categorized into the following categories:

• Textual Contents (Plain Text): This category describes textual information that appears on a web page. It does not include links, behaviors, and presentation stylesheets.

• Visual Contents (Images): These contents are the visual forms of information or are complementary material to the information provided in textual form.

• Multimedia Contents: As another form of information, multimedia contents mainly include audio and video. They may also include animation or even text as part of a video or a combination of text, audio, and video.

Web Structure

Web structure can be categorized into the following categories:

• Appearance (Web Layout or Presentation): This category indicates the overall layout or presentation of a web page. The look and feel of a web page (the representation of the contents) are important and are maintained with different technologies, e.g., HTML or stylesheets.

• Behavior (Code Navigations): This category covers link navigation, which can be within a website or to other websites, external document links, or dynamic and animated features such as live feeds, comments, tagging, or bookmarking.

Identify Designated Community

The archivist should identify the designated community of the intended web archive and carefully analyze their functional requirements and expected queries.
The designated community means the potential users who may access the archived web contents for different purposes, e.g., accessing old information that is no longer available through normal channels, referring to an old news article that was not properly bookmarked, or retrieving relevant news articles published long ago.

Prioritize the Web Resources

After a comprehensive assessment of the resources of the selected domain and the identification of potential users' requirements and expected queries, the archivist should prioritize the web resources. The complexity of web resources and their representation causes complications in the digital preservation process. Generally, it may be undesirable or unviable to preserve all web resources; therefore, it is worthwhile to designate the web resources for preservation. Priority should be assigned on the basis of two things: first, the potential reuse of the resource, and second, the frequency with which the resource will be accessed. Resources with no value, little value, or those managed elsewhere can be excluded. For prioritization of resources, the MoSCoW method can be applied.18 The acronym MoSCoW can be elaborated as:

M - MUST have: the resource must be preserved and must be a part of the archive. For example, in the Digital News Story Archive (DNSA), the textual news story must be preserved in the archive because the preservation emphasis is on the textual news story.19 Online news contains textual news stories; many news stories contain associated images, and a fraction of news stories contain associated audio-video contents.

S - SHOULD have: the resource should be preserved if at all possible. Almost all news stories have associated images; a few news stories have associated audio and video that complement them and should be preserved as part of the news story in the web archive.

C - COULD have: the resource could be preserved if it does not affect anything else or is nice to have. The web structure in the DNSA depends on the resources to be used for the preservation of news stories; the layout of the newspaper website could (C) be a part of the preservation process if it does not affect anything else, e.g., storage capacity and system efficiency.

W - WON'T have: the resource will not be included. Archiving multiple versions of the layout or structure of an online newspaper is not worthwhile and hence would not (W) be preserved.

The prioritization of these resources is very important in the context of web preservation planning because it avoids wasting time and effort, and it is the best way to handle users' requirements and fulfill their expected queries.

How to Capture the Resource(s)

The selection of a feasible capturing technique depends on, first, the resources to be captured and, second, the frequency of the capturing task. There are three web resource capturing techniques, i.e., capturing by browser, by web crawler, and by authoring system. Each capturing technique has associated advantages and disadvantages.7

Web Capturing Using Browsers

The intended web content can be captured using browsers after a web page is rendered, when the HTTP transaction occurs. This technique is also referred to as a snapshot or post-rendering technique. The method captures what is visible to users; the behavior and other attributes remain invisible.
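As one way to picture post-rendering capture, the following minimal sketch assumes Selenium and a local headless Chrome/chromedriver installation, neither of which is mentioned in the article; it saves a snapshot image of what the user would see, together with the rendered HTML.

```python
# A minimal sketch of post-rendering (snapshot) capture with a browser,
# assuming Selenium and headless Chrome/chromedriver are installed.
from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_argument("--headless")

driver = webdriver.Chrome(options=options)
try:
    driver.get("https://example.com/")                 # render the page as a user would
    driver.save_screenshot("example-snapshot.png")     # the visible contents only
    with open("example-rendered.html", "w", encoding="utf-8") as f:
        f.write(driver.page_source)                    # the post-render markup
finally:
    driver.quit()
```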
One disadvantage of the browser-capture approach is that it captures only static contents; the approach generally preserves contents in the form of images. It is best for well-organized websites, and commercial tools are available for capturing the web. The following are well-known tools for capturing the web using browsers.

WebCapture (https://web-capture.net/) is a free online web-capturing service. It is a fast web page snapshot tool that can grab web pages in seven different formats, i.e., JPEG, TIFF, PNG, and BMP image formats, PDF, SVG, and high-quality PostScript files. It also allows downloading the intended format in a ZIP file and is suitable for long vertical web pages with no distortion in layout.

A.nnotate (http://a.nnotate.com/) is an online annotating web snapshot tool for keeping track of information gathered from the web efficiently and easily. It allows adding tags and notes to the snapshot and building a personal index of web pages as a document index. The annotation feature can be used for multiple purposes, for example, compiling an annotated library of objects for organization, sharing commented web pages, or product comparison.

SnagIt (https://www.techsmith.com/screen-capture.html) is a well-known snapshot tool for capturing screens with built-in advanced image editing features and screen recording. SnagIt is a commercial and advanced screen capture tool that can capture web pages with images, linked files, source code, and the URL of the web page.

Acrobat WebCapture (File > Create > PDF from Web Page...) creates a tagged PDF file from the web page that a user visits, while the Adobe PDF toolbar can be used for an entire website.20

The capture-by-browser technique has the following advantages:

• The archivist captures only the displayed contents, which is an advantage if only the displayed contents need to be preserved.

• It is a relatively simple technique for well-organized websites.

• Commercial tools exist for web capturing using browsers.

In addition, the disadvantages are the following:

• Capturing only the displayed contents is a disadvantage if the focus is not limited to the displayed contents.

• It results in frozen contents and treats contents as if they were publications.

• It loses the web structure, such as the appearance, behavior, and other attributes of the web page.

Web Capturing Using an Authoring System/Server

The authoring system capturing technique is used for web harvesting directly from the website hosting server. All the contents, e.g., textual information, images, and source code, are collected from the source web server. The authoring system allows the archivist to preserve different versions of the website. The authoring system depends on the infrastructure of the content management system and is not a good choice for external resources. The system is best for a web server owned by the organization and works well for limited internal purposes.

The Web Curator Tool (http://webcurator.sourceforge.net/), PANDAS (an old British Library harvesting tool), and NetarchiveSuite (https://sbforge.org/display/NAS/NetarchiveSuite) are well-known tools used for planning and scheduling web harvesting. They can be used by non-technical personnel for both the selection and the harvesting of web content according to selection policies. These web archiving tools were developed in a collaboration between the National Library of New Zealand and the British Library and are used for the UK Web Archive (http://www.ariadne.ac.uk/issue50/beresford/).
The tools can interface with web crawlers, such as Heritrix (https://sourceforge.net/projects/archive-crawler/). Authoring systems are also referred to as workflow systems or curatorial tools.

The authoring system has the following advantages:

• It is best for web harvesting, which captures everything available.

• It is easy to perform if you have proper access permissions or own the server or system from which the resources are captured.

• It works for short- to medium-term resources and is feasible for internal access within organizations.

The disadvantages of web capturing using the authoring system are:

• It captures all available raw information, not only the presentation.

• It may be too reliant on the authoring infrastructure or the content management system.

• It is not feasible for long-term resources or for external access from outside the organization.

Web Capturing Using Web Crawlers

Web crawlers are perhaps the most widely used technique for capturing web contents in a systematic and automated manner.21 Crawler deployment requires expertise and experience with different tools, i.e., knowledge of the strengths and weaknesses of the technologies and of the viability of a tool in a specific scenario. The main advantage of crawlers is that they extract embedded content. Heritrix, HTTrack, Wget, and DeepArc are common examples of web crawlers; a minimal crawler sketch is shown after the advantages and disadvantages below.

Heritrix (https://github.com/internetarchive/heritrix3/wiki) is an open-source, freely available web crawler written in Java and developed by the Internet Archive. Heritrix is one of the most widely used extensible, web-scale web crawlers in web preservation projects. Initially, Heritrix was developed for special-purpose crawling of specific websites; it is now a versatile, customizable web crawler for archiving the web.

HTTrack (https://www.httrack.com/) is a freely available, configurable offline browser utility. HTTrack crawls HTML, images, and other files from a server to a local directory and allows offline viewing of the website. The HTTrack crawler downloads a complete website from the web server to a local computer system and makes it available for offline viewing with all its related link structure, so that it appears as if the user is browsing it online. It also updates the archived websites at the local system from the server and resumes interrupted extractions. HTTrack is available for both Windows and Linux/Unix operating systems.

Wget (http://www.gnu.org/software/wget/) is a freely available, non-interactive command-line tool that can easily be combined with other technologies and different scripts. It can capture files from the web using the widely used FTP, FTPS, HTTP, and HTTPS protocols, and it supports cookies as well. It also updates archived websites and resumes interrupted extractions. Wget is available for both Microsoft Windows and Unix operating systems.

The advantages of web crawling:

• It is the most widely used capturing technique.

• It can capture specific content or everything.

• It avoids some access issues, such as link rewriting and embedded external content (from an archive or live).

Disadvantages associated with web crawling:

• Much work is required, as well as tools or development expertise and experience.

• The web crawler may not have the right scope: sometimes it does not capture everything that it should, and sometimes it captures too much content.
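The promised sketch follows. It uses only the Python standard library to illustrate the basic fetch-parse-follow loop that crawlers such as Heritrix, HTTrack, and Wget implement far more robustly; the seed URL and page limit are hypothetical, and no politeness rules (robots.txt, rate limiting) or storage format handling are included.

```python
# A minimal single-site crawler sketch: fetch a page, extract its links,
# and follow only links that stay on the same host, up to a page limit.
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
from urllib.request import urlopen


class LinkParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed: str, limit: int = 20) -> dict:
    domain = urlparse(seed).netloc
    queue, seen, captured = deque([seed]), {seed}, {}
    while queue and len(captured) < limit:
        url = queue.popleft()
        try:
            html = urlopen(url, timeout=10).read().decode("utf-8", errors="replace")
        except OSError:
            continue                       # skip unreachable pages
        captured[url] = html               # in a real project this would go to the archive
        parser = LinkParser()
        parser.feed(html)
        for link in parser.links:
            absolute = urljoin(url, link)
            if urlparse(absolute).netloc == domain and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return captured


if __name__ == "__main__":
    pages = crawl("https://example.com/")  # hypothetical seed
    print(f"captured {len(pages)} pages")
```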
Web Content Selection Policy

In the previous steps, the web resources were identified and prioritized based on the requirements and expected queries of the designated community, and a feasible capturing technique was identified based on the capturing frequency. Now the contents need to be prepared and filtered for selection, and a feasible selection approach needs to be chosen based on the contents. A web content selection policy helps to determine and clarify which web contents are required to be captured, based on the priorities, purpose, and scope of the web contents already defined.22 The selection policy decision comprises a description of the context, the intended users, the access mechanisms, and the expected uses of the archive. The selection policy may comprise the selection process and the selection approach.

The selection process can be divided into subtasks which, in combination, provide a qualitative selection of web contents to a certain extent, i.e., preparation, discovery, and filtering, as shown in figure 2. The main objective of the preparation phase is to determine the targeted information space, the capture technique, the capturing tools, the extension categorization, the granularity level, and the frequency of the archiving activity. The personnel best placed to help in preparation are the domain experts, regardless of the scope of the web archive. The domain experts may be archivists, researchers, or librarians, or any other authentic reference may be used, e.g., a document or a research article. The tools defined in the preparation phase will help to discover the intended information in the discovery phase, and they can be divided into the following four categories:

1. Hubs may be global directories or topical directories, collections of sites, or even a single web page with essential links related to a particular subject or topic.

2. Search engines can facilitate discovery by defining a precise query or a set of alternative queries related to a topic. The use of specialized search engines can significantly improve the discovery of related information.

3. Crawlers can be used to extract web contents such as textual information, images, audio, video, and links. Moreover, the overall layout of a web page or a whole website can also be extracted in a well-defined, systematic manner.

4. External sources may be non-web sources of any kind, such as printed material or mailing lists, which can be monitored by the selection team.

The main objective of the discovery phase is to determine the sources of information to be stored in the archive. This determination can be achieved in two ways, corresponding to two discovery methods, i.e., exogenous and endogenous (a small sketch of maintaining such an entry point list follows figure 2). First, a manually created entry point list is used to determine the list of entry points (usually links) for crawling the collection, and the list is updated manually during the crawl. Exogenous discovery is used in manual selection and mostly relies on the exploitation of an entry point list built from hubs, search engines, and non-web documents. Second, an automatically created entry point list determines the entry points by extracting links automatically and obtaining an updated list during each crawl. Endogenous discovery is used in automatic selection and relies on link extraction by crawlers exploring the entry point list.

Figure 2. Selection Process.
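The following minimal Python sketch illustrates the entry point list just described: exogenous entries come from manually curated sources (hubs, search engine results, non-web documents), while endogenous entries are the links a crawler extracts from pages already on the list. All URLs are hypothetical.

```python
# A minimal sketch of maintaining an entry point list during selection.
manual_entry_points = {                       # exogenous discovery (hand-picked seeds)
    "https://news.example.org/",              # e.g., found via a topical hub
    "https://www.example-gazette.pk/",        # e.g., found via a search engine query
}


def update_entry_points(entry_points: set, visited_url: str, extracted_links: list) -> set:
    """Endogenous discovery: add links found on a crawled page to the list."""
    updated = set(entry_points)
    updated.update(extracted_links)
    updated.discard(visited_url)              # the visited page no longer needs to be queued
    return updated


# Links a crawler reported for one visited entry point.
links_found = ["https://news.example.org/politics/", "https://news.example.org/sports/"]
entry_points = update_entry_points(manual_entry_points, "https://news.example.org/", links_found)
print(sorted(entry_points))
```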
The main objective of the filtering phase is to optimize the discovered web contents (the discovery space) and make them more concise. Filtering is important in order to collect more specific web content and to remove unwanted or duplicated content. Usually, an automatic filtering method is used for preservation; manual filtering is useful if robots or automatic tools cannot interpret the web content. The discovery and filtering phases can be combined practically or logically. Several evaluation axes can be used for the selection policy (e.g., quality, subject, genre, and publisher).

In the literature, there are three known techniques for selecting web content. The selection approach can be either automatic or manual. Manual content selection is very rare because it is labor intensive: it requires automatic tools for finding the content and then a manual review of that collection to identify the subset that should be captured. Automatic selection policies are used frequently in web preservation projects for web collection, especially for web archives.23 The choice of collection approach depends on the frequency with which the web content will be preserved in the archive. There are four different selection approaches for web content collection.

Unselective Approach

The unselective approach implies collecting everything possible; using this approach, the whole website and its related domains and subdomains are downloaded to the archive. It is also referred to as automatic harvesting or selection, bulk selection, or domain selection.24 The automatic approach is used in situations where a web crawler usually performs the collection, for example, the collection of websites from a domain, i.e., .edu, meaning all educational institution websites (harvesting at domain level), or the collection of all possible contents/pages from a website (harvesting at website level) by extracting the embedded links. A section of the data preservation community believes that, technically, it is a relatively cheap, quick collection approach that yields a comprehensive picture of the web as a whole. In contrast, its significant drawbacks are that it generates huge amounts of unsorted, duplicated, and potentially useless data, consuming too many resources.

The Swedish Royal Library's project Kulturarw3 harvests websites at the domain level, i.e., collecting websites from the .se domain (websites physically located in Sweden), and was one of the first projects to adopt this approach.25 Usually, nationally based web archive initiatives adopt the unselective approach, most notably NEDLIB, a Helsinki University Library harvester, and AOLA, the Austrian Online Archive.26

Selective Approach

The selective approach was adopted by the National Library of Australia (NLA) in the PANDAS project in 1997. In this approach, a website is included for archiving based on certain predefined strategies and on the access and information provided by the archive. The Library of Congress's project Minerva and the British Library project "Britain on the Web" are other known projects that have adopted the selective approach. According to the NLA, the selected websites are archived based on NLA guidelines after negotiation with the owners.27 The inclusion decision could be taken at one of the following levels:

• Website level: which websites should be included from a selected domain, e.g., to archive all educational websites from the top-level domain .pk.
• Web page level: which web pages should be included from a selected website, e.g., to archive the homepages of all educational websites.

• Web content level: which types of web contents should be preserved, e.g., to archive all the images from the homepages of educational websites.

A selective approach is best if the number of websites to be archived is very large or if the archiving process targets the entire WWW and needs to narrow its scope by identifying the resources in which the archivists are most interested. This approach makes implicit or explicit assumptions about the web contents that are not to be selected for preservation. It may be very helpful to initiate a pilot preservation project, which identifies what is possible and what can be managed. In addition, some tangible results may be obtained easily and quickly in order to enhance the scope of the project in a broader perspective. The selective approach may be based on predefined criteria or on an event.

A selective approach based on criteria involves selecting web resources based on various predefined sets of criteria. The NLA's guidance characterizes the criteria-based selective approach as the "most narrowly defined method" and describes it as "thematic selection." Simple or complex content-selection criteria can be defined, depending on the overall goal of preservation: for example, all resources owned by an organization; all resources of one genre, e.g., all programming blogs; resources contributed to a common subject; resources addressing a specific community within an institution, e.g., students or staff; all publications belonging to an individual organization or group of organizations; or all resources that may benefit external users or an external user community, e.g., historians or alumni.

A selective approach based on events involves selecting web resources or websites related to various time-based events. The archivists may focus on websites that address important national or international events, e.g., disasters, elections, or the football World Cup. Event-based websites have two characteristics: (1) very frequent updates and (2) website content that is lost after a short time, e.g., a few weeks or a few months. Other examples include the start and end of a term or academic year, the duration of an activity, e.g., a research project, or the appointment or departure of a new senior official.

Deposit Approach

In the deposit collection approach, an information package is submitted by the administrator or owner of the website; it includes a copy of the website with related files that can be accessed through different hyperlinks. The deposit approach is applicable to small collections (of a few websites), or the owner of a website can initiate the preservation project, e.g., a company can initiate a project to preserve its own website. The deposit collection approach was adopted by the National Archives and Records Administration (NARA) for the collection of US federal agency websites in 2001 and by Die Deutsche Bibliothek (DDB, http://deposit.ddb.de/) for the collection of dissertations and some online publications.
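To make the deposit workflow concrete, the following minimal Python sketch shows how a website owner might bundle a local copy of the site with a simple manifest before submitting it to the archive. The paths, field names, and manifest layout are hypothetical and do not follow any formal packaging standard.

```python
# A minimal sketch of preparing a deposit (submission) package: a local copy
# of the website plus a small JSON manifest, bundled into one ZIP file.
import json
import zipfile
from datetime import date
from pathlib import Path


def build_deposit_package(site_copy: Path, package_path: Path, producer: str) -> None:
    files = [str(p.relative_to(site_copy)) for p in site_copy.rglob("*") if p.is_file()]
    manifest = {
        "producer": producer,
        "submission_date": date.today().isoformat(),
        "files": files,
    }
    with zipfile.ZipFile(package_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for p in site_copy.rglob("*"):
            if p.is_file():
                zf.write(p, arcname=str(p.relative_to(site_copy)))
        zf.writestr("manifest.json", json.dumps(manifest, indent=2))


# Example (hypothetical paths): package a locally mirrored website for submission.
# build_deposit_package(Path("mirror/www.example.com"), Path("deposit.zip"), "Example Co.")
```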
New deposit initiatives are heavily dependent on administrator or owner support and should provide an easy way to deposit new content to the repository; e.g., in MacEwan University's institutional repository, the librarians leading the project tried to offer an easy and effective way to deposit archival contents.28

Combined Approach

There are advantages and disadvantages associated with each collection approach, and there is ongoing debate about which approach is best in a given situation. For example, the deposit approach requires an inexpensive agreement with the depositors. The emphasis is on using a combination of the automatic harvesting and selective approaches, as these two are cheaper than the other selection approaches because only a few staff members are required to cope with the technological challenges. This initiative was taken by the Bibliothèque nationale de France (BnF) in 2006. The BnF automatically crawls information regarding updated web pages, stores it in an XML-based "site delta," and uses page relevancy and importance, similar to how Google ranks pages, to evaluate individual pages.29 The BnF used a selective approach for the deep web (that is, web pages or websites that are behind a password or are otherwise not generally accessible to search engines), referred to as the "deposit track."

Metadata Identification

Cataloging is required to discover a specific item in a digital collection. An identifier or set of identifiers is required to retrieve a digital record in digital repositories or an archive. For digital documents, this catalog, registration, or identifier is referred to as metadata.30 Metadata are structured information concerning resources that describe, locate (discover or place), manage, easily retrieve (access), and use digital information resources. Metadata are often referred to as "data about data" or "information about information," but it may be more helpful and informative to describe these data as "descriptive and technical documentation."31 Metadata can be divided into the following three categories:

1. Descriptive metadata describes a resource for discovery and identification purposes. It may consist of elements for a document such as title, author(s), abstract, and keywords.

2. Structural metadata describes how compound objects are put together, for example, how sections are ordered to form chapters.

3. Administrative metadata imparts information to facilitate resource management, such as when and how a file was created, who can access the file, its type, and other technical information. Administrative metadata is classified into two types: (1) rights management metadata, which addresses intellectual property rights, and (2) preservation metadata, which contains the information needed to archive and preserve a resource32 (a short sketch of such metadata follows below).

Due to new information technologies, digital repositories, especially web-based repositories, have grown rapidly over the last two decades. This growth has prompted the digital library community to devise metadata strategies to manage the immense amount of data stored in digital libraries.33 Metadata play a vital role in the long-term preservation of digital objects, and it is important to identify the metadata which may help to retrieve a specific object from the archive after preservation.
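As a small illustration of the administrative/preservation category above, the following Python sketch records fixity and technical metadata for a captured file. The field names and the file path are hypothetical and are not drawn from any particular metadata standard.

```python
# A minimal sketch of recording administrative (preservation) metadata for a
# captured file: a fixity checksum, size, declared format, and capture date.
import hashlib
from datetime import datetime, timezone
from pathlib import Path


def preservation_metadata(path: Path, mime_type: str) -> dict:
    data = path.read_bytes()
    return {
        "file": path.name,
        "size_bytes": len(data),
        "sha256": hashlib.sha256(data).hexdigest(),    # fixity check value
        "format": mime_type,
        "captured": datetime.now(timezone.utc).isoformat(),
    }


# Example (hypothetical path): metadata for one archived page.
# print(preservation_metadata(Path("archive/example-page.html"), "text/html"))
```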
According to Duff et al., "the right metadata is the key to preserving digital objects."34 There are hundreds of metadata standards, developed over the years for different user environments, disciplines, and purposes; many of them are in their second, third, or nth edition.35 Digital preservation and archiving require metadata standards to trace digital objects and ensure access to them. Several of the common standards are briefly discussed below.

The Dublin Core Metadata Initiative (DCMI, http://dublincore.org/) was initiated at the second World Wide Web conference in 1994 and was standardized as ANSI/NISO Z39.85 in 2001 and ISO 15836 in 2003.36 The main purpose of the DCMI was to define an element set for representing web resources; initially, thirteen core elements were defined, which later increased to a fifteen-element set. The elements are optional, repeatable, can appear in any order, and can be expressed in XML.37

The Metadata Encoding and Transmission Standard (METS, http://www.loc.gov/standards/mets/) is an XML metadata standard intended to represent information about complex digital objects. METS evolved from the earlier Making of America II (MOA2) project in 2001; it is supported by the Library of Congress, was sponsored by the Digital Library Federation (DLF), and was registered with the National Information Standards Organization (NISO) in 2004. A METS document contains seven major sections, each of which contains different aspects of metadata.38

The Metadata Object Description Schema (MODS, http://www.loc.gov/standards/mods/) was initiated by the MARC21 maintenance agency at the Library of Congress in 2002. MODS elements are richer than DCMI, simpler than the MARC21 bibliographic format, and expressed in XML.39 MODS identifies the broadest facets or features of an object and presents nineteen high-level optional elements.40

The Visual Resources Association Core (VRA Core, http://www.loc.gov/standards/vracore/) was developed in 1996, and the current version 4.0 was released in 2007. The VRA Core is a widely used standard for art libraries and archives for such objects as paintings, drawings, sculpture, architecture, and photographs, as well as books and decorative and performance art.41 The VRA Core contains nineteen elements and nine sub-elements.42

Preservation Metadata Implementation Strategies (PREMIS, http://www.loc.gov/standards/premis/), developed in 2005 and sponsored by the Online Computer Library Center (OCLC) and the Research Libraries Group (RLG), includes a data dictionary and accompanying information about preservation metadata. PREMIS defines a set of five interacting core semantic units, or entities, and an XML schema for supporting digital preservation activities. It is concerned not with discovery and access but with common preservation metadata; for descriptive metadata, other standards (Dublin Core, METS, or MODS) need to be used. The PREMIS data model contains intellectual entities (contents that can be described as a unit, e.g., books, articles, databases), objects (discrete units of information in digital form, which can be files, bitstreams, or any representation), agents (people, organizations, or software), events (actions that involve an object and an agent known to the system), and rights (assertions of rights and permissions).43 It is indisputable that good metadata improves access to the digital objects in a digital repository.
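Since most of these standards are expressed in XML, the following minimal Python sketch serializes a simple Dublin Core descriptive record for one archived page using only the standard library. The element values are hypothetical; only the DCMI element names and namespace are taken from the standard, and just a subset of the fifteen elements is shown.

```python
# A minimal sketch of a Dublin Core descriptive record for one archived page.
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)

record_values = {                      # hypothetical values for an archived news page
    "title": "Example headline",
    "creator": "Example Gazette",
    "subject": "politics",
    "description": "News story captured for the web archive",
    "date": "2019-01-31",
    "type": "Text",
    "format": "text/html",
    "identifier": "https://www.example-gazette.pk/story/12345",
    "language": "en",
    "rights": "Copyright Example Gazette",
}

root = ET.Element("record")
for element, value in record_values.items():   # DC elements are optional and repeatable
    ET.SubElement(root, f"{{{DC_NS}}}{element}").text = value

print(ET.tostring(root, encoding="unicode"))
```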
Therefore, the creation and selection of appropriate metadata make the web archive accessible to the archive user. Structural metadata helps to manage the archival collection internally, as well as the related services, but may not always help to discover the primary source of the digital object.44 Currently, there are many semi-automated metadata generation tools. The use of these semi-automatic tools for generating metadata is crucial for the future, considering the complexity and cost of manual metadata creation.45

Archival Format

Web archive initiatives select websites for archiving based on the relevance of the contents and the intended audience of the archived information. The size of web archives varies significantly depending on their scope and the type of content they are preserving, e.g., web pages, PDF documents, images, audio, or video files.46 To preserve these contents, a web archive uses different storage formats containing metadata and utilizes data compression techniques. The Internet Archive defined the ARC format (http://archive.org/web/researcher/ArcFileFormat.php), later used as a de facto standard. In 2009, the International Organization for Standardization (ISO) established the WARC format (https://goo.gl/0RBWSN) as an official standard for web archiving. Approximately 54 percent of web archive initiatives have applied the ARC and WARC formats for archiving. The use of standard formats helps archivists facilitate the creation of collaborative tools, such as search engines and UI utilities, to efficiently manipulate the archived data.47

Information Dissemination Mechanisms

A well-defined preservation process can lead to a well-organized web archive that is easy to maintain and from which a specific digital object can easily be retrieved using information dissemination techniques. Poor search results are one of the main problems in the information dissemination of web archives. The users of a web archive expend excessive time retrieving the documents or information that satisfy their queries. Archivists are more concerned with "ofness," "what collections are made up of," although archive users are concerned with "aboutness," "what collections are about."48 To use the full potential of web archives, a usable interface is needed to help the user search the archive for a specific digital object. Full-text and keyword search are the dominant ways to search an unstructured information repository, as is evident from online search engines. The sophistication of search results for user queries depends on the ranking tools.49 Access tools and techniques are getting the attention of researchers, and approximately 82 percent of European web archives concentrate on such tools, which makes these web archives easily accessible.50 The Lucene full-text search engine and its extension NutchWAX are widely used in web archiving.
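To illustrate the idea behind full-text and keyword search, the following minimal Python sketch builds an inverted index over a few hypothetical archived documents and answers a conjunctive keyword query. Engines such as Lucene/NutchWAX do this at scale, with tokenization, ranking, and persistent indexes.

```python
# A minimal sketch of keyword search over archived documents via an inverted index.
from collections import defaultdict


def build_index(docs: dict) -> dict:
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def search(index: dict, query: str) -> set:
    """Return documents containing every query term (simple AND semantics)."""
    results = [index.get(term, set()) for term in query.lower().split()]
    return set.intersection(*results) if results else set()


archive = {                                   # hypothetical archived stories
    "story-001": "election results announced in the capital",
    "story-002": "new library web archive launched",
    "story-003": "archive of election coverage now searchable",
}
index = build_index(archive)
print(search(index, "election archive"))      # -> {'story-003'}
```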
Moreover, by combining semantic descriptions that already exist in, or are implicit within, their descriptive metadata, reasoning-based or semantic searching of the archival collection can enable the system to offer novel possibilities for archival content retrieval and browsing.51 Even in the current era of digital archives, mobile services are adopted in digital libraries; e.g., access to e-books, library databases, catalogs, and text messaging are common mobile services offered in university libraries.52 In a massive repository, a user query retrieves millions of documents, which makes it difficult for users to identify the most relevant information. To overcome this problem, a ranking model estimates the relevancy of results to the user's query using specified criteria and sorts the results, placing the most relevant result at the top.53 A number of ranking models exist in the literature, e.g., conventional ranking models such as TF-IDF and BM25F, temporal ranking models such as PageRank, and learning-to-rank models such as L2R.

The findings of the systematic approach for web preservation are used to automate the process of digital news story preservation. The steps of the proposed model were carefully adopted to develop a tool that is able to add contextual information to the stories to be preserved.

DIGITAL NEWS STORIES PRESERVATION FRAMEWORK

The advancement of web technologies and the maturation of the internet attract news readers to access news online, provided by multiple sources, and to obtain the desired information comprehensively. The amount of news published online has grown rapidly, and for an individual it is cumbersome to browse through all online sources for relevant news articles. News generation in the digital environment is no longer a periodic process with a fixed single output, such as printed newspapers; news is instantly generated and updated online in a continuous fashion. For different reasons, such as the short lifespan of digital information and the speed at which information is generated, it has become vital to preserve digital news for the long term. Digital preservation includes various actions to ensure that digital information remains accessible and usable as long as it is considered important.54 Libraries and archives carefully digitize and preserve newspapers, considering them a good source for knowing history. Many approaches have been developed to preserve digital information for the long term.

The lifespan of news stories published online varies from one newspaper to another, i.e., from one day to a month. Although a newspaper may be backed up and archived by the news publisher or national archives, in the future it will be difficult to access particular information published in various newspapers regarding the same news story. The issues become even more complicated if a story is to be tracked through an archive of many newspapers, which requires different access technologies. The Digital News Story Preservation (DNSP) framework was introduced to preserve digital news articles published online from multiple sources.55 The DNSP framework is planned by adopting the proposed step-by-step systematic approach for web preservation to develop a well-organized web archive. Initially, the main objectives defined for the DNSP framework are:

• To initiate a well-organized, national-level digital news archive of multiple news sources.
• To normalize news articles during preservation to a common format for future use.

• To extract explicit and implicit metadata, which would be helpful in ingesting stories to the archive and browsing through the archive in the future.

• To introduce content-based similarity measures to link digital news articles during preservation.

The Digital News Story Extractor (DNSE) is a tool developed to facilitate the extraction of news stories from online newspapers and their migration to a normalized format for preservation. The normalized format also includes a step to add metadata in the Digital News Stories Archive (DNSA) for future use.56 To facilitate the accessibility of news articles preserved from multiple sources, some mechanisms need to be adopted for linking the archived digital news articles. An effective term-based approach for linking digital news articles in the DNSA, the "Common Ratio Measure for Stories (CRMS)," was introduced; it links similar news articles during the preservation process.57 The approach was empirically analyzed, and the results of the proposed approach were compared to obtain conclusive arguments. The initial results, computed automatically using the common ratio measure for stories, are encouraging and were compared with the similarity of news articles based on human judgment. The results are generalized by defining a threshold value based on multiple experimental results using the proposed approach. Currently, there is ongoing work to extend the scope of the DNSA to dual languages, i.e., Urdu and English, as well as content-based similarity measures to link news articles published in Urdu and English. Moreover, research is underway to develop tools for exploiting the linkage created among stories during the preservation process for search and retrieval tasks.

SUMMARY

Effective strategic planning is critical in creating web archives; hence, it requires a well-understood and well-planned preservation process. The process should result in a well-organized web archive that includes not only the content to be preserved but also the contextual information required to interpret the content. The study attempts to answer many questions by guiding archivists and related personnel, such as: How can the web preservation process be led effectively? How should the preservation process be initiated? How should one proceed through the different steps? What are the possible techniques that may help to create a well-organized web archive? How can the archived information be used to its greatest potential? To answer these questions, the study resulted in an appropriate step-by-step process for web preservation and a well-organized web archive. The targeted goal of each step is identified by researching the existing approaches that can be adopted. The possible techniques for those approaches are discussed in detail for each step.

REFERENCES

1 "World Wide Web Size," The size of the World Wide Web, visited on Jan 31, 2019, http://www.worldwidewebsize.com/.

2 Brian F. Lavoie, "The Open Archival Information System Reference Model: Introductory Guide," Microform & Imaging Review 33, no. 2 (2004): 68-81; Alexandros Ntoulas, Junghoo Cho, and Christopher Olston, "What's New on the Web? The Evolution of the Web from a Search Engine Perspective," in Proceedings of the 13th International Conference on World Wide Web-04 (New York, NY: ACM, 2004), 1-12.
3 Teru Agata et al., "Life Span of Web Pages: A Survey of 10 Million Pages Collected in 2001," IEEE/ACM Joint Conference on Digital Libraries (IEEE, 2014), 463-64, https://doi.org/10.1109/JCDL.2014.6970226.

4 Timothy Robert Hart and Denise de Vries, "Metadata Provenance and Vulnerability," Information Technology and Libraries 36, no. 4 (Dec. 2017): 24-33, https://doi.org/10.6017/ital.v36i4.10146.

5 Claire Warwick et al., "Library and Information Resources and Users of Digital Resources in the Humanities," Program 42, no. 1 (2008): 5-27, https://doi.org/10.1108/00330330810851555.

6 Lavoie, "Open Archival Information System Reference Model."

7 Susan Farrell, K. Ashley, and R. Davis, "A Guide to Web Preservation," Practical Advice for Web and Records Managers Based on Best Practices from the JISC-Funded PoWR Project (2010), https://jiscpowr.jiscinvolve.org/wp/files/2010/06/Guide-2010-final.pdf.

8 Lavoie, "Open Archival Information System Reference Model"; Farrell, Ashley, and Davis, "Guide to Web Preservation."

9 Peter Lyman, "Archiving the World Wide Web," Washington, Library of Congress (2002), https://www.clir.org/pubs/reports/pub106/web/.

10 Diomidis Spinellis, "The Decay and Failures of Web References," Communications of the ACM 46, no. 1 (2003): 71-77, https://dl.acm.org/citation.cfm?doid=602421.602422.

11 Digital Archive for Chinese Studies (DACHS), https://www.zo.uni-heidelberg.de/boa/digital_resources/dachs/index_en.html, visited on Jan 31, 2019.

12 Julien Masanès, "Web Archiving Methods and Approaches: A Comparative Study," Library Trends 54, no. 1 (2005): 72-90, https://doi.org/10.1353/lib.2006.0005.

13 Hanno Lecher, "Small Scale Academic Web Archiving: DACHS," in Web Archiving (Berlin/Heidelberg: Springer, 2006), 213-25, https://doi.org/10.1007/978-3-540-46332-0_10.

14 Daniel Gomes et al., "Introducing the Portuguese Web Archive Initiative," in 8th International Web Archiving Workshop (Berlin/Heidelberg: Springer, 2009).

15 Gerrit Voerman et al., "Archiving the Web: Political Party Web Sites in the Netherlands," European Political Science 2, no. 1 (2002): 68-75, https://doi.org/10.1057/eps.2002.51.

16 Sonja Gabriel, "Public Sector Records Management: A Practical Guide," Records Management Journal 18, no. 2 (2008), https://doi.org/10.1108/00242530810911914.

17 Farrell, Ashley, and Davis, "Guide to Web Preservation."

18 Jung-ran Park and Andrew Brenza, "Evaluation of Semi-Automatic Metadata Generation Tools: A Survey of the Current State of the Art," Information Technology and Libraries 34, no. 3 (Sept. 2015): 22-42, https://doi.org/10.6017/ital.v34i3.5889.

19 Muzammil Khan and Arif Ur Rahman, "Digital News Story Preservation Framework," in Digital Libraries: Providing Quality Information: 17th International Conference on Asia-Pacific Digital Libraries, ICADL 2015, Seoul, Korea, December 9-12, 2015, Proceedings, vol. 9469 (Springer, 2015), 350-52, https://doi.org/10.1007/978-3-319-27974-9; Muzammil Khan, "Using Text Processing Techniques for Linking News Stories for Digital Preservation," PhD thesis, Faculty of Computer Science, Preston University Kohat, Islamabad Campus, HEC Pakistan, 2018.

20 Dennis Dimick, "Adobe Acrobat Captures the Web," Washington Apple Pi Journal (1999): 23-25.

21 Trupti Udapure, Ravindra D. Kale, and Rajesh C. Dharmik, "Study of Web Crawler and Its Different Types," IOSR Journal of Computer Engineering (IOSR-JCE) 16, no. 1 (2014): 1-5, https://doi.org/10.9790/0661-16160105.
22 Dora Biblarz et al., "Guidelines for a Collection Development Policy Using the Conspectus Model," International Federation of Library Associations and Institutions, Section on Acquisition and Collection Development (2001).

23 Farrell, Ashley, and Davis, "Guide to Web Preservation"; E. Pinsent et al., "PoWR: The Preservation of Web Resources Handbook," http://jisc.ac.uk/publications/programmerelated/2008/powrhandbook.aspx (2010); Michael Day, "Preserving the Fabric of Our Lives: A Survey of Web Preservation Initiatives," Lecture Notes in Computer Science (Berlin/Heidelberg: Springer, 2003): 461-72, https://doi.org/10.1007/978-3-540-45175-4_42.

24 Pinsent et al., "PoWR"; Day, "Preserving the Fabric."

25 Allan Arvidson, "The Royal Swedish Web Archive: A Complete Collection of Web Pages," International Preservation News (2001): 10-12.

26 Andreas Rauber, Andreas Aschenbrenner, and Oliver Witvoet, "Austrian Online Archive Processing: Analyzing Archives of the World Wide Web," in Research and Advanced Technology for Digital Libraries, ECDL 2002, Lecture Notes in Computer Science, vol. 2458 (Berlin/Heidelberg: Springer, 2002), 16-31, https://doi.org/10.1007/3-540-45747-X_2.

27 William Arms, "Collecting and Preserving the Web: The Minerva Prototype," RLG DigiNews 5, no. 2 (2001).

28 Sonya Betz and Robyn Hall, "Self-Archiving with Ease in an Institutional Repository: Micro Interactions and the User Experience," Information Technology and Libraries 34, no. 3 (Sept. 2015): 43-58, https://doi.org/10.6017/ital.v34i3.5900.

29 Serge Abiteboul et al., "A First Experience in Archiving the French Web," in International Conference on Theory and Practice of Digital Libraries (Berlin/Heidelberg: Springer, 2002), 1-15, https://doi.org/10.1007/3-540-45747-X_1; Sergey Brin and Lawrence Page, "Reprint of: The Anatomy of a Large-Scale Hypertextual Web Search Engine," Computer Networks 56, no. 18 (2012): 3825-33, https://doi.org/10.1016/j.comnet.2012.10.007.

30 Masanès, "Web Archiving."

31 NISO Press, "Understanding Metadata," National Information Standards (2004), http://www.niso.org/publications/understanding-metadata.

32 Ibid.

33 Jane Greenberg, "Understanding Metadata and Metadata Schemes," Cataloging & Classification Quarterly 40, no. 3-4 (2009): 17-36, https://doi.org/10.1300/J104v40n03_02.

34 Michael Day, "Preservation Metadata Initiatives: Practicality, Sustainability, and Interoperability," Archivschule Marburg (2004): 91-117.

35 Jenn Riley, Glossary of Metadata Standards (2010).

36 Corey Harper, "Dublin Core Metadata Initiative: Beyond the Element Set," Information Standards Quarterly 22, no. 1 (2010): 20-31.

37 Jane Greenberg, "Dublin Core: History, Key Concepts, and Evolving Context (Part One)," slide presentation at DC-2010, International Conference on Dublin Core and Metadata Applications, Pittsburgh, PA (2010).

38 Morgan V. Cundiff, "An Introduction to the Metadata Encoding and Transmission Standard (METS)," Library Hi Tech 22, no. 1 (2004): 52-64, https://doi.org/10.1108/07378830410524495; Leta Negandhi, "Metadata Encoding and Transmission Standard (METS)," in Texas Conference on Digital Libraries, TCDL-2012 (2012).

39 Sally H. McCallum, "An Introduction to the Metadata Object Description Schema (MODS)," Library Hi Tech 22, no. 1 (2004): 82-88, https://doi.org/10.1108/07378830410524521.
40 R. Gartner, "MODS: Metadata Object Description Schema," JISC Techwatch Report TSW (2003): 03-06, www.loc.gov/standards/mods/.

41 VRA Core, "An Introduction to VRA Core," http://www.loc.gov/standards/vracore/VRA_Core4_Intro.pdf, created Oct. 2014.

42 VRA Core, "VRA Core Element Outline," http://www.loc.gov/standards/vracore/VRA_Core4_Outline.pdf, created Feb. 2007.

43 Priscilla Caplan, "Understanding PREMIS," Washington, DC: Library of Congress (2009), https://www.loc.gov/standards/premis/understanding-premis.pdf; J. Relay, "An Introduction to PREMIS," Singapore iPRES Tutorial (2011), http://www.loc.gov/standards/premis/premistutorial iPRES2011 singapore.pdf.

44 Jennifer Schaffner, "The Metadata is the Interface: Better Description for Better Discovery of Archives and Special Collections, Synthesized from User Studies," Making Archival and Special Collections More Accessible, 85 (2015).

45 Joao Miranda and Daniel Gomes, "Trends in Web Characteristics," in Web Congress, 2009, LA-WEB'09, Latin American (IEEE, 2009), 146-53, https://doi.org/10.1109/LA-WEB.2009.28.

46 Daniel Gomes, João Miranda, and Miguel Costa, "A Survey on Web Archiving Initiatives," Research and Advanced Technology for Digital Libraries (2011): 408-20, https://doi.org/10.1007/978-3-642-24469-8_41.

47 Ibid.

48 Schaffner, "Metadata is the Interface."

49 Miguel Costa and Mário J. Silva, "Evaluating Web Archive Search Systems," in International Conference on Web Information Systems Engineering (Berlin/Heidelberg: Springer, 2012), 440-54, https://doi.org/10.1007/978-3-642-35063-4_32.

50 Foundation, I., "Web Archiving in Europe," technical report, CommerceNet Labs (2010).

51 Georgia Solomou and Dimitrios Koutsomitropoulos, "Towards an Evaluation of Semantic Searching in Digital Repositories: A DSpace Case-Study," Program 49, no. 1 (2015): 63-90, https://doi.org/10.1108/PROG-07-2013-0037.

52 Yan Quan Liu and Sarah Briggs, "A Library in the Palm of Your Hand: Mobile Services in Top 100 University Libraries," Information Technology and Libraries 34, no. 2 (June 2015): 133, https://doi.org/10.6017/ital.v34i2.5650.

53 Ricardo Baeza-Yates and Berthier Ribeiro-Neto, Modern Information Retrieval (New York: ACM Press, 1999), 463.

54 Daniel Burda and Frank Teuteberg, "Sustaining Accessibility of Information through Digital Preservation: A Literature Review," Journal of Information Science 39, no. 4 (2013): 442-58, https://doi.org/10.1177/0165551513480107.

55 Muzammil Khan et al., "Normalizing Digital News-Stories for Preservation," in Digital Information Management (ICDIM), 2016 Eleventh International Conference on (IEEE, 2016), 85-90, https://doi.org/10.1109/ICDIM.2016.7829785.

56 Khan et al., "Normalizing Digital News."

57 Muzammil Khan, Arif Ur Rahman, and M. Daud Awan, "Term-Based Approach for Linking Digital News Stories," in Italian Research Conference on Digital Libraries (Cham, Switzerland: Springer, 2018), 127-38, https://doi.org/10.1007/978-3-319-73165-0_13.