ER 2011

Best-Effort Modeling of Structured Data on the Web
Alon Halevy
Google
November 2, 2011

Structured Data and the Web
• A huge amount of structured data on the Web
  – Reference data, hobbies, products, coffee, …
  – Reacting to existing data
• A platform for getting more data out:
  – Government data, crime, water conditions, …
  – Being proactive
• New kinds of data collection and management
  – Collaboration, crowd-sourcing, real-time data, crisis response, …
  – Inventing a bright future

Goal: Structured Data Ecosystem
Discover, Import, Clean, Query, Share, Integrate, Visualize, Publish, Communicate

Outline
• Google Fusion Tables:
  – A Database management service for the Web
• WebTables:
  – Discovering a (structured) needle in an (unstructured) haystack
• Observations about modeling along the way

Fusion Tables
google.com/fusiontables
• Goal: an easy-to-use database system that is integrated with the Web.
• Key features:
  – Easy upload (CSV, KML, spreadsheets)
  – Sharing (even outside your company)
  – Visualizations front and center
  – Easy publishing
• Goal 2: a data cloud -- discover others’ data and combine with yours.

Crowd Sourcing

A GIS in the Cloud
• That’s not what we set out to do, really.
• Challenges:
  – Trickling: show only a small number of features (points, polygons) from a large data set
  – Need to thin polygons, clip to the window
  – Style features on the fly
  – All in less than 100ms

And the Credit Goes to…
• Hector Gonzalez
• Jayant Madhavan
• Sree Balakrishnan
• Heidi Lam
• Hongrae Lee
• Warren Shen
• Anno Langen
• Rebecca Shapley
• Anish Das Sarma
• Boulos Harb
• Fei Wu
• Cong Yu
• Spiros Papadimitriou

Outline
✓ Google Fusion Tables:
  – A Database management service for the Web
• WebTables:
  – Discovering a (structured) needle in an (unstructured) haystack

Discovery = incentive to publish

Tables on the Web

Goal: Search for Structured Data
Challenges:
• Finding the good tables on the Web
• Understanding their semantics
• Understanding user’s intentions

The Deep Web
store locations, used cars, radio stations, patents, recipes
See “Google’s Deep Web Crawl”, VLDB 2008

HTML Lists

The Needle in the Haystack
Finding high quality HTML tables

Vertical Tables

Semantics Embedded in Surrounding Text

And Sometimes, Complicated

WebTables: Exploring the Relational Web
[Cafarella et al., VLDB 2008, WebDB 08]
• In corpus of 14B raw tables, we estimate 154M are “good” relations
  – Single-table databases; Schema = attr labels + types
  – Largest corpus of databases & schemas we know of
• The WebTables system:
  – Recovers good relations from crawl and enables search
  – Builds novel apps on the recovered data

Searching Tables is Tricky
[Tweak Document Search, Cafarella 08]
• Consider new cues in ranking:
  – Hits on left column
  – Hits on schema (where there is one)
  – Number of rows, columns
  – Hits on table body
  – Size of table relative to page
• ~25% increase in good results in top-10 results (compared to filtering Google results for tables)

Tree Search
Amish quilts
Parking tickets in India
Horses

Modeling Challenge: Data is About Everything

Even if we had a KB of everything
[Consider Freebase as example]
• User:
  – Action movies
• Freebase:
  – Movies AND Genre=Action
• User:
  – Governor of California
• Freebase:
  – US State has governing positions
  – Governing positions have office holder
  – Office holder has position
Mismatch between KB model and users’ conceptual model!

The People’s Ontology
[Open Information Extraction]
Mine a database of entities and classes from the Web:
Mine binary relationships
Broad, dirty, but uses culturally aware terminology

Recovering Table Semantics
[Venetis et al., VLDB 2011]

Recovering Binary Relationships

Recovering semantics: better search and quality filter

Attribute Correlations DB
[Diagram labels: Raw crawled pages, Raw HTML Tables, Recovered Relations, Relation Search, Inverted Index, Attribute Correlation Statistics Db]

Schema                         Count
Job-title, company, date       104
Make, model, year              916
Rbi, ab, h, r, bb, avg, slg    12
Dob, player, height, weight    4
…                              …

Attribute Correlation Statistics Db
• 2.6M distinct schemas
• 5.4M attributes

The Unreasonable Effectiveness of Data [Halevy, Norvig, Pereira]

Synonym Discovery
• Use schema statistics to automatically compute attribute synonyms
  – More complete than thesaurus
• Given input “context” attribute set C:
  1. A = all attrs that appear with C
  2. P = all (a,b) where a∈A, b∈A, a≠b
  3. rm all (a,b) from P where p(a,b)>0
  4. For each remaining pair (a,b) compute:

Synonym Discovery Examples

Context      Synonym pairs
name         e-mail|email, phone|telephone, e-mail_address|email_address, date|last_modified
instructor   course-title|title, day|days, course|course-#, course-name|course-title
elected      candidate|name, presiding-officer|speaker
ab           k|so, h|hits, avg|ba, name|player
sqft         bath|baths, list|list-price, bed|beds, price|rent

Conclusions
• Fusion Tables: helping get the ecosystem started.
• Search for structured data sets:
  – Much more to do!
  – Unify with other search
  – Manually created ontologies vs. extracted ones?
• Can we get the crowds to help?
  – Resolving heterogeneity
  – Create new data sets
• Can we help domain-specific expert communities?

A Few References
• Communications of the ACM: Feb 2011
• Deep web: VLDB 2008
• WebTables: VLDB 2008, 2009, 2011
• Fusion Tables: SIGMOD 2010, SOCC 2010
  – google.com/fusiontables
• Principles of Data Integration (Doan, Halevy, Ives): Morgan Kaufmann, 2012.
• The Infinite Emotions of Coffee (Halevy): next month!
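
Appendix sketch: Synonym Discovery

The four steps on the Synonym Discovery slide can be sketched roughly as follows. This is a minimal illustration, not the WebTables implementation. The toy schema counts, the function names (`prob`, `cond_prob`, `synonyms`), and the `eps` parameter are all hypothetical. The scoring formula for step 4 does not appear in the slide text, so the sketch assumes a score of the form p(a)·p(b) / (eps + Σ_z (p(z|a,C) − p(z|b,C))²), which favors frequent attributes with similar co-occurrence profiles; treat that formula as an assumption.

```python
from itertools import combinations

# Toy schema statistics as (schema, count) pairs. Hypothetical data for
# illustration only; the real statistics database holds millions of schemas.
SCHEMAS = [
    (frozenset({"name", "email", "phone"}), 10),
    (frozenset({"name", "e-mail", "phone"}), 8),
    (frozenset({"name", "email", "telephone"}), 5),
    (frozenset({"name", "e-mail", "telephone"}), 2),
]


def prob(schemas, attrs):
    """Fraction of schema occurrences that contain every attribute in attrs."""
    total = sum(count for _, count in schemas)
    hits = sum(count for schema, count in schemas if attrs <= schema)
    return hits / total if total else 0.0


def cond_prob(schemas, z, given):
    """p(z | given): how often z appears among schemas containing `given`."""
    denom = prob(schemas, given)
    return prob(schemas, given | {z}) / denom if denom else 0.0


def synonyms(schemas, context, eps=0.01):
    """Return (score, a, b) candidates for the context set, best first."""
    context = frozenset(context)

    # Step 1: A = all attributes that appear together with the context C.
    candidates = set()
    for schema, _ in schemas:
        if context <= schema:
            candidates |= schema - context

    scored = []
    # Step 2: P = all unordered pairs (a, b) with a != b.
    for a, b in combinations(sorted(candidates), 2):
        # Step 3: drop pairs that ever co-occur; synonyms should not.
        if prob(schemas, frozenset({a, b})) > 0:
            continue
        # Step 4 (assumed formula): compare how a and b co-occur with the
        # other candidates z. Excluding a and b from z is a choice made here.
        dist = sum(
            (cond_prob(schemas, z, context | {a})
             - cond_prob(schemas, z, context | {b})) ** 2
            for z in candidates - {a, b}
        )
        score = prob(schemas, frozenset({a})) * prob(schemas, frozenset({b}))
        scored.append((score / (eps + dist), a, b))

    return sorted(scored, reverse=True)


if __name__ == "__main__":
    for score, a, b in synonyms(SCHEMAS, {"name"}):
        print(f"{a}|{b}  score={score:.2f}")
```

On this toy input with context `name`, only the pairs that never share a schema survive step 3, which mirrors the e-mail|email and phone|telephone entries in the examples table.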