ER 2011.pptx


Best-Effort Modeling of Structured Data on the Web

Alon Halevy
Google

November 2, 2011


Structured Data and the Web

•  A huge amount of structured data on the Web
– Reference data, hobbies, products, coffee, …
– Reacting to existing data

•  A platform for getting more data out:
– Government data, crime, water conditions, …
– Being proactive

•  New kinds of data collection and management
– Collaboration, crowd-sourcing, real-time data, crisis response, …
– Inventing a bright future


Goal: Structured Data Ecosystem

[Diagram: Discover, Import, Clean, Query, Share, Integrate, Visualize, Publish, Communicate]


Outline

•  Google Fusion Tables:
– A database management service for the Web

•  WebTables:
– Discovering a (structured) needle in an (unstructured) haystack

•  Observations about modeling along the way


Fusion Tables
google.com/fusiontables

•  Goal: an easy-to-use database system that is integrated with the Web.

•  Key features:
– Easy upload (CSV, KML, spreadsheets)
– Sharing (even outside your company)
– Visualizations front and center
– Easy publishing

•  Goal 2: a data cloud. Discover others’ data and combine it with yours.


Crowd Sourcing 


A GIS in the Cloud

•  That’s not what we set out to do, really.
•  Challenges:
– Trickling: show only a small number of features (points, polygons) from a large data set
– Need to thin polygons, clip to the window
– Style features on the fly
– All in less than 100ms
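
A minimal sketch of the trickling and clipping challenges, for point features only. The slide does not say how features are chosen, so the uniform random sampling here is an assumption, and the names `clip_to_window` and `trickle` are hypothetical. Polygon thinning, styling, and the 100ms budget are not modeled.

```python
import random
from typing import List, Tuple

Point = Tuple[float, float]                # (x, y), e.g. longitude and latitude
BBox = Tuple[float, float, float, float]   # (min_x, min_y, max_x, max_y)


def clip_to_window(points: List[Point], window: BBox) -> List[Point]:
    """Keep only the points that fall inside the visible window."""
    min_x, min_y, max_x, max_y = window
    return [(x, y) for x, y in points
            if min_x <= x <= max_x and min_y <= y <= max_y]


def trickle(points: List[Point], window: BBox,
            max_features: int, seed: int = 0) -> List[Point]:
    """Return at most max_features visible points.

    Assumption: a uniform random sample stands in for whatever
    selection strategy a production system would use.
    """
    visible = clip_to_window(points, window)
    if len(visible) <= max_features:
        return visible
    rng = random.Random(seed)  # fixed seed keeps the output stable across redraws
    return rng.sample(visible, max_features)
```

Clipping before sampling means the feature budget is spent only on what the user can see in the current window.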


And the Credit Goes to…

•  Hector Gonzalez
•  Jayant Madhavan
•  Sree Balakrishnan
•  Heidi Lam
•  Hongrae Lee
•  Warren Shen
•  Anno Langen
•  Rebecca Shapley

•  Anish Das Sarma
•  Boulos Harb
•  Fei Wu
•  Cong Yu
•  Spiros Papadimitriou


Outline

✓ Google Fusion Tables:
– A database management service for the Web

•  WebTables:
– Discovering a (structured) needle in an (unstructured) haystack

Discovery = incentive to publish


Tables on the Web 


Goal: Search for Structured Data

Challenges:
•  Finding the good tables on the Web
•  Understanding their semantics
•  Understanding users’ intentions


The Deep Web 

Examples: store locations, used cars, radio stations, patents, recipes

See “Google’s Deep Web Crawl”, VLDB 2008 


HTML Lists 


The Needle in the Haystack

Finding high-quality HTML tables


Vertical Tables 


Semantics Embedded in Surrounding Text 


And Sometimes, Complicated 


WebTables: Exploring the Relational Web 
[Cafarella et al., VLDB 2008, WebDB 08]  

•  In a corpus of 14B raw tables, we estimate
154M are “good” relations
–  Single-table databases; Schema = attr labels + types 
–  Largest corpus of databases & schemas we know of 
 

•  The WebTables system:
–  Recovers good relations from crawl and enables search 
–  Builds novel apps on the recovered data 
 

Searching Tables is Tricky 
[Tweak Document Search, Cafarella 08] 

•  Consider new cues in ranking: 
–  Hits on left column 
–  Hits on schema (where there is one) 
–  Number of rows, columns 
–  Hits on table body 
–  Size of table relative to page 

•  ~25% increase in good results in the top-10
results (compared to filtering Google
results for tables)
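
A minimal sketch of how the ranking cues above could be combined into one score. The linear form, the weights, and the names `Table`, `features`, `score`, and `rank` are all assumptions for illustration; the slide lists the cues but not how they are weighted or learned.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Table:
    header: List[str]       # schema row; empty when the table has none
    rows: List[List[str]]   # body cells, row by row
    page_fraction: float    # share of the page the table occupies, 0..1


def count_hits(cells: List[str], terms: List[str]) -> int:
    """Count (cell, term) pairs where the term occurs in the cell text."""
    return sum(1 for cell in cells for term in terms if term in cell.lower())


def features(table: Table, query: str) -> Dict[str, float]:
    """Compute the cues listed on the slide for one table and one query."""
    terms = query.lower().split()
    left_column = [row[0] for row in table.rows if row]
    body = [cell for row in table.rows for cell in row]
    return {
        "left_hits": count_hits(left_column, terms),
        "schema_hits": count_hits(table.header, terms),
        "body_hits": count_hits(body, terms),
        "num_rows": len(table.rows),
        "num_cols": max((len(row) for row in table.rows), default=0),
        "page_fraction": table.page_fraction,
    }


# Hypothetical weights; a real ranker would fit these on labeled results.
WEIGHTS = {
    "left_hits": 2.0,
    "schema_hits": 3.0,
    "body_hits": 0.5,
    "num_rows": 0.01,
    "num_cols": 0.01,
    "page_fraction": 1.0,
}


def score(table: Table, query: str) -> float:
    """Weighted sum of the cue values."""
    return sum(WEIGHTS[name] * value for name, value in features(table, query).items())


def rank(tables: List[Table], query: str) -> List[Table]:
    """Order tables from best to worst score for the query."""
    return sorted(tables, key=lambda t: score(t, query), reverse=True)
```

Left-column and schema hits get their own features because a match on the subject column or on an attribute label says more about a table than a match somewhere in its body.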


Tree Search 

Amish quilts 

Parking tickets in India 
Horses 

Modeling Challenge: Data is About Everything


Even if we had a KB of everything
[Consider Freebase as example]

•  User:
–  Action movies

•  Freebase:
–  Movies AND Genre=Action

•  User:
–  Governor of California

•  Freebase:
–  US State has governing positions
–  Governing positions have office holder
–  Office holder has position

Mismatch between KB model and users’ conceptual model!


The People’s Ontology
[Open Information Extraction]

Mine a database of entities and classes from the Web:

Mine binary relationships

Broad, dirty, but uses culturally aware terminology


Recovering Table Semantics 
[Venetis et al., VLDB 2011] 


Recovering Binary Relationships 

Recovering semantics: better search and quality filter


Attribute Correlations DB 

[Diagram: Raw crawled pages, Raw HTML Tables, Recovered Relations, Relation Search, Inverted Index]

Attribute Correlation Statistics Db

Schema                          Frequency
Job-title, company, date        104
Make, model, year               916
Rbi, ab, h, r, bb, avg, slg     12
Dob, player, height, weight     4
…                               …

•  2.6M distinct schemas 

•  5.4M attributes 
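
A minimal sketch of the statistics such a schema-frequency table supports, using only the four example rows shown on the slide. The function names and the lowercase attribute spelling are assumptions; the real ACSDb covers 2.6M schemas and its construction details are not given here.

```python
from collections import Counter
from itertools import combinations

# Schema frequencies copied from the slide's example rows.
SCHEMA_COUNTS = {
    ("job-title", "company", "date"): 104,
    ("make", "model", "year"): 916,
    ("rbi", "ab", "h", "r", "bb", "avg", "slg"): 12,
    ("dob", "player", "height", "weight"): 4,
}


def build_stats(schema_counts):
    """Aggregate schema frequencies into attribute and attribute-pair counts."""
    total = sum(schema_counts.values())
    attr_counts = Counter()
    pair_counts = Counter()
    for schema, n in schema_counts.items():
        attrs = sorted(set(schema))
        for a in attrs:
            attr_counts[a] += n
        # Pairs are stored in sorted order so (a, b) and (b, a) share one key.
        for a, b in combinations(attrs, 2):
            pair_counts[(a, b)] += n
    return total, attr_counts, pair_counts


def p(a, total, attr_counts):
    """Estimate p(a): the share of schema occurrences that contain attribute a."""
    return attr_counts[a] / total


def p_given(a, b, attr_counts, pair_counts):
    """Estimate p(a | b): how often a appears in schemas that contain b."""
    if attr_counts[b] == 0:
        return 0.0
    return pair_counts[tuple(sorted((a, b)))] / attr_counts[b]


total, attr_counts, pair_counts = build_stats(SCHEMA_COUNTS)
```

With these four rows, `p_given("model", "make", attr_counts, pair_counts)` is 1.0 because every schema containing `make` also contains `model`.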

The Unreasonable Effectiveness of Data [Halevy, Norvig, Pereira]


Synonym Discovery 

•  Use schema statistics to automatically 
compute attribute synonyms 
–  More complete than thesaurus 

•  Given input “context” attribute set C: 
1.  A = all attrs that appear with C 
2.  P = all (a,b) where a∈A, b∈A, a≠b 
3.  Remove all (a,b) from P where p(a,b)>0 (synonyms should never appear in the same schema)
4.  For each remaining pair (a,b) compute: 
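
The four steps can be sketched as follows. The scoring formula for step 4 is not present in this text, so the sketch assumes the score from the WebTables VLDB 2008 paper as best it can be reconstructed: syn(a,b) = p(a)·p(b) / (ε + Σ_z (p(z|a,C) − p(z|b,C))²), summing over z in A other than a and b. The function name `synonyms`, the default `eps`, and the toy schema counts in the usage note are hypothetical.

```python
from itertools import combinations


def synonyms(schema_counts, context, eps=0.01):
    """Rank candidate synonym pairs for a context attribute set C.

    schema_counts maps a tuple of attribute names to how often that schema occurs.
    The score in step 4 is an assumed reconstruction, not taken from the slide.
    """
    total = sum(schema_counts.values())
    C = set(context)

    def freq(required):
        """Total frequency of schemas containing every attribute in `required`."""
        required = set(required)
        return sum(n for schema, n in schema_counts.items() if required <= set(schema))

    # Step 1: A = all attributes that appear together with C.
    A = set()
    for schema in schema_counts:
        if C <= set(schema):
            A |= set(schema) - C

    # Steps 2 and 3: all pairs from A, minus pairs that ever co-occur.
    pairs = [(a, b) for a, b in combinations(sorted(A), 2) if freq({a, b}) == 0]

    def p_given(z, a):
        """Estimate p(z | a, C)."""
        denom = freq(C | {a})
        return freq(C | {a, z}) / denom if denom else 0.0

    # Step 4: score each remaining pair; similar co-occurrence profiles score high.
    scored = []
    for a, b in pairs:
        dist = sum((p_given(z, a) - p_given(z, b)) ** 2 for z in A - {a, b})
        score = (freq({a}) / total) * (freq({b}) / total) / (eps + dist)
        scored.append((score, a, b))
    return sorted(scored, reverse=True)
```

For example, with the made-up counts `{("name", "email", "phone"): 10, ("name", "e-mail", "phone"): 8, ("name", "email", "telephone"): 3, ("name", "e-mail", "telephone"): 2}` and context `["name"]`, the only surviving pairs are `e-mail|email` and `phone|telephone`, since every other pair co-occurs in some schema.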


Synonym Discovery Examples 
name: e-mail|email, phone|telephone, e-mail_address|email_address, date|last_modified

instructor: course-title|title, day|days, course|course-#, course-name|course-title

elected: candidate|name, presiding-officer|speaker

ab: k|so, h|hits, avg|ba, name|player

sqft: bath|baths, list|list-price, bed|beds, price|rent


Conclusions
•  Fusion Tables: helping get the ecosystem started.
•  Search for structured data sets:
– Much more to do!
– Unify with other search
– Manually created ontologies vs. extracted ones?

•  Can we get the crowds to help?
– Resolving heterogeneity
– Create new data sets

•  Can we help domain-specific expert communities?


A Few References

•  Communications of the ACM: Feb 2011
•  Deep web: VLDB 2008
•  WebTables: VLDB 2008, 2009, 2011
•  Fusion Tables: SIGMOD 2010, SOCC 2010
– google.com/fusiontables

•  Principles of Data Integration (Doan, Halevy, Ives): Morgan Kaufmann, 2012.

•  The Infinite Emotions of Coffee (Halevy): next month!