Building Google Dataset Search and Fostering an Open Data Ecosystem

Posted by Matthew Burgess and Natasha Noy, Google AI

Earlier this month we launched Google Dataset Search, a tool designed to make it easier for researchers to discover datasets that can help with their work. What we colloquially call “Google Scholar for data,” Google Dataset Search is a search engine across metadata for millions of datasets in thousands of repositories across the Web. In this post, we go into some detail about how Dataset Search is built and outline what we believe will help develop an open data ecosystem. We also address the question that we have received frequently since the Dataset Search launch: “Why is my dataset not showing up in Google Dataset Search?”

An Overview
At a very high level, Google Dataset Search relies on dataset providers, big and small, adding structured metadata on their sites using the open schema.org/Dataset standard. The metadata specifies the salient properties of each dataset: its name and description, spatial and temporal coverage, provenance information, and so on. Dataset Search uses this metadata, links it with other resources that are available at Google (more on this below!), and builds an index of this enriched corpus of metadata. Once we have built the index, we can start answering user queries and figuring out which results best correspond to each query.

An overview of the technology behind Google Dataset Search

Using Structured Metadata from Data Providers
When Google’s search engine processes a Web page with schema.org/Dataset mark-up, it understands that there is dataset metadata there and processes that structured metadata to create “records” describing each annotated dataset on a page. The use of schema.org allows developers to embed this structured information into HTML, without affecting the appearance of the page while making the semantics of the information visible to all search engines.
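To make the idea of embedded markup concrete, here is a minimal sketch of how a provider might generate a schema.org/Dataset description as JSON-LD, one of the formats in which schema.org metadata can be embedded in a page. The dataset, URL, and organization name are hypothetical, and the helper functions are illustrative rather than part of any Google tool.

```python
import json


def dataset_jsonld(name, description, url, publisher_name):
    """Build a schema.org/Dataset record as a JSON-LD dictionary."""
    return {
        "@context": "https://schema.org/",
        "@type": "Dataset",
        "name": name,
        "description": description,
        "url": url,
        "publisher": {"@type": "Organization", "name": publisher_name},
    }


def as_script_tag(record):
    """Wrap the record in the script element that carries JSON-LD in HTML."""
    # Escape "<" so a value containing "</script>" cannot end the element early.
    payload = json.dumps(record, indent=2).replace("<", "\\u003c")
    return '<script type="application/ld+json">\n' + payload + "\n</script>"


# Hypothetical dataset, used only for illustration.
example = dataset_jsonld(
    name="Example rainfall measurements",
    description="Daily rainfall totals from a hypothetical weather station.",
    url="https://example.org/datasets/rainfall",
    publisher_name="Example Weather Lab",
)
print(as_script_tag(example))
```

Because the script element is not rendered, adding it to a dataset's landing page leaves the visible page unchanged while exposing the metadata to any crawler that reads JSON-LD.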

However, no matter how precise schema.org definitions or guidelines are, some metadata will inevitably be incomplete, wrong, or entirely missing. Furthermore, distinctions between some fields can be vague: is the dataset repository a publisher or a provider of a dataset? How do we distinguish between citations to a scientific paper that describes the creation of the dataset vs. papers describing its use? Indeed, many of these questions often generate active scholarly discussions.

Despite these variations, Dataset Search must provide a uniform and predictable user experience on the front end. Therefore, in some cases we substitute a more general field name (e.g., “provided by”) to display the values coming from multiple other fields (e.g., “publisher”, “creator”, etc.). In other cases, we are not able to use some of the fields at all: if a specific field is being misinterpreted in many different ways by dataset providers, we bypass that field for now and work with the community to clarify the guidelines. In each decision, we had one specific question that helped us in difficult cases: “What will help data discovery the most?” This focus on the task that we were addressing made some of the problems easier than they seemed at first.
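The substitution of one general display label for several underlying fields can be sketched as follows. The mapping, the order of the fields, and the inclusion of “provider” are assumptions made for illustration; the post does not specify which fields feed which label.

```python
# Hypothetical mapping: one display label backed by several schema.org fields,
# listed in an assumed order of preference.
DISPLAY_FIELDS = {
    "provided by": ["publisher", "provider", "creator"],
}


def display_values(record, label):
    """Collect values for a display label from the source fields behind it."""
    values = []
    for field in DISPLAY_FIELDS[label]:
        value = record.get(field)
        if value is None:
            continue
        # A field may hold a single value or a list of values.
        items = value if isinstance(value, list) else [value]
        for item in items:
            # Values may be plain strings or nested objects with a "name".
            name = item.get("name") if isinstance(item, dict) else item
            if name and name not in values:
                values.append(name)
    return values


record = {
    "publisher": {"name": "Example Repository"},
    "creator": ["Ada Example", {"name": "Example Repository"}],
}
print(display_values(record, "provided by"))
```

Collapsing the fields this way trades some precision for a predictable display: a reader sees who stands behind the dataset even when providers disagree on which field to use.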

Connecting Replicas of Datasets
It is very common for a dataset, in particular a popular one, to be present in more than one repository. We use a variety of signals to determine when two datasets are replicas of each other. For example, schema.org has a way to specify the connection explicitly, through schema.org/sameAs, which is the best way to link different replicas together and to point to the canonical source of a dataset. Other signals include two dataset descriptions pointing to the same canonical page, having the same Digital Object Identifier (DOI), sharing links for downloading the dataset, or having a large overlap in other metadata fields. None of these signals is perfect in isolation; therefore, we combine them to get the strongest possible indication of when two datasets are the same.
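One simple way to combine such signals is a weighted vote, sketched below. The record layout, the weights, and the threshold are all hypothetical; the post does not describe how Dataset Search actually weighs or combines its signals, and a name match stands in here for the broader idea of overlapping metadata fields.

```python
# Illustrative integer weights; real values would have to be tuned on data.
WEIGHTS = {"sameAs": 6, "doi": 5, "download": 3, "name": 2}


def replica_signals(a, b):
    """Return the names of the signals on which two metadata records agree."""
    signals = set()
    same_as_a = set(a.get("sameAs", []))
    same_as_b = set(b.get("sameAs", []))
    # Explicit link: one record names the other's URL, or both name the same page.
    if b.get("url") in same_as_a or a.get("url") in same_as_b or same_as_a & same_as_b:
        signals.add("sameAs")
    if a.get("doi") and a.get("doi") == b.get("doi"):
        signals.add("doi")
    if set(a.get("downloads", [])) & set(b.get("downloads", [])):
        signals.add("download")
    name_a = a.get("name", "").strip().lower()
    name_b = b.get("name", "").strip().lower()
    if name_a and name_a == name_b:
        signals.add("name")
    return signals


def likely_replicas(a, b, threshold=5):
    """Treat two records as replicas when the combined signal weight is high enough."""
    score = sum(WEIGHTS[s] for s in replica_signals(a, b))
    return score >= threshold


a = {"url": "https://repo-a.example/d1", "doi": "10.1234/abcd", "name": "Rainfall"}
b = {"url": "https://repo-b.example/x", "sameAs": ["https://repo-a.example/d1"], "name": "rainfall "}
c = {"url": "https://repo-c.example/y", "name": "Rainfall"}
print(likely_replicas(a, b), likely_replicas(a, c))
```

In this sketch a shared name alone is too weak to merge two records, while an explicit sameAs link or a shared DOI is enough on its own, which mirrors the point that no single weak signal should be trusted in isolation.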

Reconciling to the Google Knowledge Graph
Google’s Knowledge Graph is a powerful platform that describes and links information about many entities, including the ones that appear in dataset metadata: organizations providing datasets, locations for spatial coverage of the data, funding agencies, and so on. Therefore, we try to reconcile information mentioned in the metadata fields with the items in the Knowledge Graph. We can do this reconciliation with good precision for two main reasons. First, we know the types of items in the Knowledge Graph and the types of entities that we expect in the metadata fields. Therefore, we can limit the types of entities from the Knowledge Graph that we match with values for a particular metadata field. For example, a provider of a dataset should match with an organization entity in the Knowledge Graph and not with, say, a location. Second, the context of the Web page itself helps reduce the number of choices, which is particularly useful for distinguishing between organizations that share the same acronym. For example, the acronym CAMRA can stand for “Chilbolton Advanced Meteorological Radar” or “Campaign for Real Ale”. If we use terms from the Web page, we can then more easily determine that CAMRA is in fact the Chilbolton Radar when we see terms such as “clouds”, “vapor”, and “water” on the page.
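The two ideas above, restricting candidates by expected type and using page context to break ties, can be illustrated with a toy example. The entity list, the context terms, and the place entry are invented stand-ins; this is not the Knowledge Graph API or the matching logic that Dataset Search uses.

```python
import re

# Toy stand-in for a knowledge graph; entries and context terms are illustrative.
ENTITIES = [
    {"name": "Chilbolton Advanced Meteorological Radar", "acronym": "CAMRA",
     "type": "Organization", "terms": {"radar", "clouds", "vapor", "water"}},
    {"name": "Campaign for Real Ale", "acronym": "CAMRA",
     "type": "Organization", "terms": {"ale", "beer", "pub", "brewery"}},
    {"name": "Camra (hypothetical place)", "acronym": "CAMRA",
     "type": "Place", "terms": set()},
]

# Which entity type each metadata field is expected to hold (assumed mapping).
EXPECTED_TYPE = {"provider": "Organization", "spatialCoverage": "Place"}


def reconcile(field, value, page_text):
    """Match a metadata value to an entity of the expected type, using page context."""
    wanted_type = EXPECTED_TYPE[field]
    candidates = [e for e in ENTITIES
                  if e["type"] == wanted_type and value in (e["name"], e["acronym"])]
    if not candidates:
        return None
    page_words = set(re.findall(r"[a-z]+", page_text.lower()))
    # Score each candidate by how many of its context terms appear on the page.
    scored = sorted(((len(e["terms"] & page_words), e["name"]) for e in candidates),
                    reverse=True)
    if len(scored) > 1 and scored[0][0] == scored[1][0]:
        return None  # Context does not separate the top candidates; leave unresolved.
    return scored[0][1]


print(reconcile("provider", "CAMRA", "Measurements of clouds and water vapor."))
```

Returning no match when the context is ambiguous is a deliberate choice in this sketch: a missing link is usually less harmful to search quality than a wrong one.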

This type of reconciliation opens up lots of possibilities to improve the search experience for users. For instance, Dataset Search can localize results by showing reconciled values of metadata in the same language as the rest of the page. Additionally, it can rely on synonyms, correct misspellings, expand acronyms, or use other relations in the Knowledge Graph for query expansion.

Linking to other Google Resources
Google has many other data resources that are useful in augmenting the dataset metadata, such as Google Scholar. Knowing which datasets are referenced and cited in publications serves at least two purposes:

  1. It provides a valuable signal about the importance and prominence of a dataset.
  2. It gives dataset authors an easy place to see citations to their data and to get credit.

Indeed, we hope that highlighting publications that use the data will lead to a healthier ecosystem of data citation. For the moment, our links to Google Scholar are very approximate, as we lack a good model of how people cite data. We try to go beyond DOIs to give somewhat better coverage, but the number of articles citing a dataset ends up being approximate. We hope to make more progress in this area in order to get a higher level of precision.

Search and Ranking of Results
When a user issues a query, we search through the corpus of datasets, in a way not unlike how Google Search works over Web pages. Just like with any search, we need to determine whether a document is relevant for the query and then rank the relevant documents. Because there are no large-scale studies on how users search for datasets, as a first approximation, we rely on Google Web ranking. However, ranking datasets is different from ranking Web pages, so we add some additional signals that take into account the metadata quality, citations, and so on. As Dataset Search gets used more and we understand better how users search for datasets, we hope that ranking will improve significantly.

A Better Open Data Ecosystem
We built Dataset Search in an attempt to create a tool that will positively impact the discoverability of data. The decision to rely on open standards (schema.org, W3C DCAT, JSON-LD, etc.) for markup is intentional, as Dataset Search can only be as good as the open-data ecosystem that it supports. As such, Google Dataset Search aims to support a strong open data ecosystem by encouraging:

  1. Widespread adoption of open metadata formats to describe published data.
  2. Further development of open metadata formats to describe more types of data and in more detail.
  3. The culture of citing data the way we cite research publications, giving those who create and publish the data the credit that they deserve.
  4. The development of tools that leverage this metadata to enable more discovery or better use of data. 

The increased adoption of open metadata standards in conjunction with the continued development of Dataset Search (and, hopefully, other tools) should foster a healthier open data ecosystem where data is a first-class citizen of research.

So, Where is Your Dataset?
It is probably clear by now that Dataset Search is only as good as the metadata that exists on the Web pages for datasets. The most common answer to the question of why a specific dataset does not show up in our results is that the Web page for that dataset does not have any markup. Just pop that page into the Structured Data Testing Tool and you will see whether the markup is there. If you don’t see any markup there and you own the page, you can add it. If you don’t own the page, you can ask the page owners to add it, which will make their page more easily discoverable by everyone.
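For a quick local check before reaching for a testing tool, the following sketch looks for a schema.org/Dataset description in a page's JSON-LD using only the Python standard library. It is a rough approximation, not a replacement for a full validator: it ignores markup written as microdata or RDFa, and it does not look inside nested structures such as "@graph".

```python
import json
from html.parser import HTMLParser


class JsonLdCollector(HTMLParser):
    """Collect the text of every <script type="application/ld+json"> element."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self.blocks.append("")

    def handle_data(self, data):
        if self._in_jsonld:
            self.blocks[-1] += data

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_jsonld = False


def has_dataset_markup(html):
    """Return True if the page contains a top-level JSON-LD item of type Dataset."""
    collector = JsonLdCollector()
    collector.feed(html)
    collector.close()
    for block in collector.blocks:
        try:
            data = json.loads(block)
        except json.JSONDecodeError:
            continue  # Skip blocks that are not valid JSON.
        items = data if isinstance(data, list) else [data]
        for item in items:
            if not isinstance(item, dict):
                continue
            types = item.get("@type")
            types = types if isinstance(types, list) else [types]
            if "Dataset" in types:
                return True
    return False


page = ('<html><head><script type="application/ld+json">'
        '{"@context": "https://schema.org/", "@type": "Dataset", "name": "X"}'
        '</script></head><body></body></html>')
print(has_dataset_markup(page))
```

To check a live page, the HTML would first have to be fetched (for example with `urllib.request`) and passed to `has_dataset_markup` as a string.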

We hope that the community finds Dataset Search useful, that users make serendipitous discoveries and save time, and that scientists and journalists spend less time searching for data and more time using it.

Acknowledgements
We would like to thank Xiaomeng Ban, Dan Brickley, Lee Butler, Thomas Chen, Corinna Cortes, Kevin Espinoza, Archana Jain, Mike Jones, Kishore Papineni, Chris Sater, Gokhan Turhan, Shubin Zhao and Andi Vajda for their work on the project and all our partners, collaborators, and early adopters for their help.

303 TOYS 1/6th scale Masterpiece Series – The Military Marquis – Yuchi Gong a.k.a. Jingde figure

Yuchi Gong (尉遲恭) or Yuchi Rong (尉遲融) (585–658), a.k.a. Jingde (敬德), was a Chinese general who lived in the early Tang dynasty. Yuchi Jingde and another general Qin Shubao are worshipped as door gods in Chinese folk religion. Yuchi Jingde was born in 585, during the reign of Emperor Wen of Sui. When agrarian rebels rose against Sui rule near the end of the reign of Emperor Wen’s son Emperor Yang, Yuchi initially served in the governmental militia fighting agrarian rebels, and was known and awarded for his bravery.

Over 1000 pure bronze plates with leather underlay, made with handmade techniques for extreme effect

303 TOYS MP004 1/6th scale Masterpiece Series – The Military Marquis – Yuchi Gong a.k.a. Jingde 12-inch Collectible Figure specially features: head sculpt with magnetic golden hairpin, body, Eight (8) pieces of interchangeable palms (2 fists, 2 weapon holding hands, 2 relaxing hands and 2 open palms)

Costume: brown open cross-collar inner shirt, red coat, dark brown trousers, red culottes, bronze caligas, gold monster buckle, gem-attached leather belts, bronze cloud girdle, double-phoenix helmet, gold mask, suite of gold armor, gold vambraces, gold cuishes, black and red cloak

Weapons: dragon-head blade, Two (2) iron whips, gem-attached sword, scabbard

Accessory: black figure stand

Alert Line 1/6th scale WWII German Female DAK (Afrika Korps) Officer 12-inch action figure

The Afrika Korps or German Africa Corps (German: Deutsches Afrikakorps, DAK) was the German expeditionary force in Africa during the North African Campaign of World War II. First sent as a holding force to shore up the Italian defense of their African colonies, the formation fought on in Africa, under various appellations, from March 1941 until its surrender in May 1943. The unit’s best known commander was Field Marshal Erwin Rommel.

Alert Line 1/6th scale WWII German Female DAK Officer 12-inch action figure Features: Female Head Sculpt, Female Body, Hands, Visor Cap, Officer Cap, Shirt, Afrika Korps Uniform Coat, Afrika Korps Breeches, Short Skirt, Necktie, Officer’s Belt, Stockings With Long Tube, Female Style Desert boots, MP40 Sub-machine Gun, MP40 Ammunition Pouch, PPK Pistol, PPK Holster, DAK Armband, Medals. NOTE: Hyenas not included (display only).

Related toy blog posts:
World War II – The Germans are coming!! posted HERE
Soldier Story DAK Cyrenaica 1941 – Preview pics HERE

TOYS-ERA 1/6th scale The Steel 2.0 37cm tall action figure aka Colossus in Deadpool films

Colossus (Piotr “Peter” Nikolayevich Rasputin) is a Russian mutant and a member of the X-Men. Colossus is able to transform himself into metallic form, making him the physically strongest of the team. Even when his powers are not engaged, he is still a physically imposing figure of 6 ft 7 in (200 cm). He is portrayed as quiet, honest and virtuous. He has had a fairly consistent presence in X-Men-related comic books since his debut.

In film, actor Daniel Cudmore portrays Colossus in X2, X-Men: The Last Stand and X-Men: Days of Future Past, and Stefan Kapičić provides the voice of a CGI character in Deadpool and Deadpool 2.

The Steel is big and strong. He has the ability to transform his entire body into a form of incredibly dense organic steel, which grants him incredible levels of physical strength and durability.

TOYS-ERA PE002 1/6th scale The Steel 2.0 Collectible action figure specially features: Metallic silver skin Head Sculpt x 2 | Specially designed extra strong 37cm tall muscular body, with metallic silver paint and sculpted body lines | Hands, 3 pairs: Fists and Palms, with metallic silver paint and sculpted body lines.

Costume: Body vest, with X pattern details | Black Singlet, with foam buff layer | Black Pants, with X logo belt | Boots, with articulated ankles.

Accessories: Handcuff | Extra large figure Stand

Release date: Q4 2018

Going Big Time on National TV

So a little lost in the shuffle of the push for GlassBuild was some exciting promotion that our industry got from a very popular TV show.  Treehouse Masters, the #1 show on the Animal Planet network and one of the most popular Friday night shows in all of TV, did a treehouse for Pittsburgh Steeler Antonio Brown.  The treehouses built on this show are not your typical wooden piece that you may remember as a kid.  No, these structures are incredible, nicer than most of the homes we live in, and the one for Brown showed off glass and metal in an incredible way.  His treehouse featured an awesome 2-story window wall with framing from YKK, dynamic glass from Pleotint-Suntuitive and Insulating Glass from Thompson IG.  The installation was done by Modern Wall Systems.  To me the best part was the constant compliments about the way the glazing looked and the camera shots reinforcing it.  It was beautiful.  So a major congratulations to all parties involved!  As I noted last week, when I saw Tom Donovan of Suntuitive he was all smiles and he should be- as should the others from our industry who were involved with this.  We showed a major audience that our products are difference makers and that they can shine in prime time!! If you are interested in watching the episode (it was very interesting overall) click HERE and go to episode 4.

Elsewhere…

–  A couple of leftover notes from GlassBuild… I forgot to promote and note the great book that sold like crazy at the show- “An Owners Guide to Exit & Succession Planning.” This book features in-depth advice on the exit process and management succession. It’s really an awesome read whether you are ready to sell your business or not. You can order it HERE
–  Also one comment/question that came up during the Glazing Executives Forum was about driverless trucks.  Obviously transportation and logistics is a big issue and that was a major story during the session.  So this week when I saw this story on some wild new autonomous trucks, I had to share it on here.  I still don’t see it being a major mover for us in the industry any time soon- but you never know.
–  Good positive news from the latest Architectural Billings Index… 54.2 rating (remember over 50 is the positive territory) so a nice bounce back after last month barely got over the 50 mark. The south region and multi family building were the strengths that the analysts pointed to as reasons for the score.  The ABI trend is certainly our friend these days.  Not as friendly is the Dodge Momentum Index.  That had a negative showing last time out- but I am waiting to see if it gets adjusted up with the next report. 

–  Have you been following the story of the Millennium Tower?  I mentioned it here a while back and it popped back in the news last week with a cracked window issue.  This is surely a job we all should be watching to see what is happening and how the issues (building is evidently sinking) will be addressed.

–  Long time readers of this blog know I love lists and rankings, so when the latest poll of the 50 Best Places to Live in America was released, I was all over it.  Here’s the top 10…from 10 to 1

10- Woodbury, MN

9- Sammamish, WA (I initially thought this was the home of TGP- but I guess not- I would live at the TGP HQ though, it is stunning!)

8- Highlands Ranch, CO

7- Dublin, CA

6- Franklin, TN

5- Cary, NC

4- Ellicott City, MD

3- Carmel, IN

2- Ashburn, VA

and the best place to live in 2018 is…. Frisco, TX

So any of my readers live in these cities?  If so- congrats!!  The top 50 can be found here…  good list overall!
–  Last this week- programming note- there’ll be no post from me next week.  I will return to this space the week of 10/7.  Of course if news breaks I’ll post and also tweet it out.

LINKS of the WEEK

Fascinating story of a great resistance fighter who just passed
Story this week on how golf pictures helped an inmate get justice
Cool way to save a wedding

VIDEO of the WEEK

I am all Apple when it comes to my products but the latest phones are underwhelming… and this review really nailed it…

Sideshow Collectibles Quarter scale 21.5-inch tall Psylocke Premium Format™ Figure Pre-order

“Mutants like ourselves- if we don’t stand up to defend them, who will?”

Sideshow is proud to present the Psylocke Premium Format™ Figure.

The psychic X-Men Psylocke measures 21.5” tall standing on top of a psionic energy base inspired by her famous butterfly aura. Painted in stunning purples and pinks, this psychic platform shows off the heights of Betsy Braddock’s omega-level strength.

The polyresin Psylocke Premium Format™ Figure features an all-sculpt costume based on her beloved 90’s look with a blue bodysuit, arm guards, and boots. Her suit also includes a red sculpted sash, fluttering with dynamic motion as she holds her dual katanas at the ready. Psylocke’s beautifully sculpted portrait is detailed with flowing purple hair and piercing purple eyes, capturing all the focus of her physical and psychic abilities.

The Exclusive Edition of the Psylocke Premium Format Figure includes a swap-out left arm wielding a psyblade weapon. Manifested by her mind, this energized psychic construct makes an exciting additional display option for your Marvel statue.

Pair Psylocke with other notable mutants from Sideshow’s X-Men Collection like Mystique, Emma Frost, and Rogue to create a dynamic X-Men display with your Marvel collectibles.

Related post:
Review of Kotobukiya Marvel Bishoujo collection 1/7th scale X-Force Psylocke 8-inch Statue posted on my toy blog HERE and HERE
TOYS ERA 1/6th scale THE PRESCIENCE aka Olivia Munn as Psylocke in X-Men: Apocalypse preview pics HERE
