There are some pages on sites that you don’t necessarily want to have indexed. Some of them are easy to find because they fulfill important purposes on sites, but still shouldn’t be crawled and indexed. Others can be considerably more difficult to find.
For example, on one site I found widgets that appeared on only 28 pages, yet they inflated the number of URLs for that site in Google’s index from around 3,500 to over 90,000.
It was one of the first assignments I had at an agency I went to work for years ago. Looking at the URLs returned in a site search at Google [site:www.example.com], there seemed to be an almost infinite number of URLs for the site, most of them not very search friendly, which made understanding the site’s architecture extremely difficult. Was this a bug, or was the site truly that large? The first part of SEO for a site is often a matter of discovery.
On some sites there are pages you want to keep from being crawled and/or indexed, such as “Email to a Friend” pages, “Write a Review” pages, “Compare Products” pages, and other pages that have little unique text on them, are unlikely to be linked to by most people, and will rarely if ever be shared socially or “emailed to a friend.” Sometimes these pages contain only one or two lines of unique information, followed by something like a map or a calendar.
Sometimes there are pages that are near duplicates of other pages, containing substantially similar content that might differ only in how the content (often products) on a page is sorted or displayed:
- Best selling
- Alphabetical order A-Z
- Alphabetical order Z-A
- Price high-low
- Price low-high
- 15 items per page
- 30 items per page
- 50 items per page
- List display
- Grid display
Limiting which versions of those sorted (and filtered) pages get crawled and indexed, and using canonical link elements and pagination link elements wisely can be helpful.
For example, when you have multiple pages within a product category, you may want to limit (using robots.txt disallows, robots noindex meta elements, hash URLs (#), JavaScript, option dropdowns, etc.) the sorting order and other parameters to one version, such as pages that display: (1) best sellers, (2) in a grid display, (3) with 15 items per page, and have that be the sorted version that gets crawled and indexed by search engines.
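As a rough sketch of how those controls fit together, a robots.txt rule can keep parameterized sort variants from being crawled, while a canonical link element on each variant points search engines at the preferred version. The paths and parameter names below are made up for illustration:

```
# robots.txt (hypothetical paths and parameter names)
User-agent: *
Disallow: /*?sort=
Disallow: /*?display=
Disallow: /email-to-a-friend/
```

```html
<!-- On each sorted/filtered variant of the category page (URL is a placeholder) -->
<link rel="canonical" href="http://www.example.com/widgets/" />

<!-- Or, to keep a page out of the index while still letting its links be followed -->
<meta name="robots" content="noindex, follow" />
```

One caution: a page blocked in robots.txt can’t be crawled, so search engines may never see a noindex or canonical element placed on it. The two approaches work best when applied to different sets of pages.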
Sometimes the way elements of a content management system work can cause issues beyond those sorting pages, or beyond pages that add features but shouldn’t be indexed (such as email-to-a-friend pages). For example, as I noted above, I once worked on a site that had approximately 3,500 URLs that I wanted indexed. I didn’t know that when I started working on the site. All I knew was that Google estimated there were about 90,000 URLs on the site.
Shortly after I started on the site, I tried to crawl it using a program intended to find broken links (Xenu Link Sleuth), and I noticed that there were times during the crawl when URLs just started to get really ugly. Some of the pages on the site had widgets on them that expanded or contracted sections of content on those pages.
Every time a section was expanded or contracted, the URL for the page changed. For example, if a page had 21 of these widgets on it, and the first three widgets were clicked to expand them, the URL would gain parameters recording that state.
If I then contracted the first widget and expanded the 15th, the parameters in the URL would change again.
The order of the different parameters could change, and so could the number of parameters. Since the parameters could be listed in any order, there was a very large number of variations of that URL on pages with 21 of these expansion/contraction widgets.
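To get a sense of the scale involved, a quick back-of-the-envelope calculation helps. The 21-widget figure comes from the pages described above; everything else here is illustration:

```python
# Each of the 21 expand/contract widgets is either open or closed,
# so each combination of open widgets can produce a distinct URL.
widgets = 21
url_states = 2 ** widgets
print(url_states)  # 2097152
```

And since the parameters could also appear in any order, the number of distinct URL strings a crawler could encounter was larger still.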
I ended up spending about 15 hours running Xenu Link Sleuth on the site. I disallowed crawling of the pages that had those widgets on them, one at a time, to find that there were 28 pages on the site that did include those widgets with the changing URLs. If I kept the extra parameters from those pages from being crawled, I ended up with only 3,500 or so pages.
Within a month or so, Google had removed many of those expanding and contracting URLs that appeared only because of the widgets. Instead of estimating 90,000 pages for the site in its index, Google reported “about” 3,500.
Having a manageable number of URLs on the site, and being able to understand why the inflation of URLs had happened, made it a lot easier to manage and make changes to the remaining URLs and content. It was like the sun was shining on the site.
One of the first steps I take these days is looking for URLs for unnecessary pages, so that I can keep pages that don’t provide value in a meaningful way from being indexed. Pages that allow you to email a friend the URL of a page, or that enable you to rate and review a specific product, are important, but there’s usually no value in letting those pages get indexed by search engines.
With unnecessary pages removed from the index, I can then move on to the pages that should be indexed, making them even more valuable with added content, faster loading and rendering times, and other improvements that enhance the experience of visitors, help meet their informational and situational needs, and help meet the objectives of the site’s owners.
The research and development team at Google doesn’t always focus upon search algorithms. Sometimes their efforts seem more suited to Indiana Jones than Luke Skywalker. In the past year we’ve seen Google map the Grand Canyon using pack donkeys and a team exploring the depths of the Canyon. Google has been mapping the underwater surfaces of oceans with Street View as well. The Amazon Rainforest has also been the target of Google’s excursions.
Google has used specialized Street View cars to film a wide range of roads around the globe. They’ve used tricycles to film areas where cars can’t go, and they have a page (Cars, Trikes, and More) that shows other ways the search engine captures images, including trolleys, snowmobiles, and cameras sticking up out of backpacks.
So I was a little surprised to see that Google targeted walking sticks with a patent granted to the search engine this week. Then again, the patent tells us:
However, even the use of vehicles such as tricycles or snowmobiles does not offer access to areas where vehicular travel is difficult, such as in rugged areas or areas where roads are not present.
The walking stick in question has one or more cameras at one end, and a “trigger” at the other end, which sets the camera or cameras off when you contact the ground. But it also has more than that. The stick includes an Inertial Measurement Unit (IMU), a collection of microelectronics that can include gyroscopes, accelerometers, and magnetometers to help identify its location. It likely also uses a GPS sensor, and possibly other sensors as well. That might remind you of my last post here, How Google Now and Phone Sensors Might Change Search as We Know It, in which I wrote about how Google might start taking advantage of a lot of sensor data in mobile devices, and aggregate that data to predict future events.
The IMU sensors can be used when you are indoors and/or GPS isn’t available, and can help improve the accuracy of GPS information when you are outdoors.
Here’s the patent:
Walking stick with IMU
Invented by Daniel Jason Ratner and Russell Leigh Smith
Assigned to Google
United States Patent 8,467,674
Granted June 18, 2013
Filed: September 21, 2011
An elongated member is provided with one or more imaging sensors, location sensors, and a switch in its bottom end. For example, in an embodiment the elongated member may be a walking stick and the one or more imaging sensors may be one or more cameras. Such a walking stick takes pictures of its surrounding environment and keeps records of its location when the switch touches the ground, so that the pictures and location information can be used to create a virtual simulation of the area that a user of the walking stick has walked through.
The images from the patent show a traditional looking camera at the top of the stick, but the patent mentions that other types of imaging sensors could be used as well.
The patent concludes by telling us that while this walking stick device could supply a stream of photos, those could be stitched together virtually to create a video as well.
Such a method of operation is advantageous in that it provides a stable base for the one or more cameras by causing the walking stick to act as a monopod. Moreover, because users will generally have fairly regular strides, pictures of the surroundings of the area surrounding the walking stick will be taken at regular intervals.
This provides for acquiring data which is appropriate for a virtual simulation of the environment of the walking stick, because the image data can be transformed and combined to yield an interactive simulation of the environment of the walking stick.
Furthermore, time stamps and location information from a GPS or IMU can improve the quality of the virtual environment data still further, by aiding virtual environment application 240 in combining the pieces of image data into a virtual environment visualization.
When I saw the title of this patent, I was actually taken a little aback by the low-tech nature of the invention. But when we start thinking about all of the ways that Google may gather information about the world around us, from Street View cars, trikes, snowmobiles, backpack cameras, boats, submarines, self-driving cars, Google Glass, and more, it’s probably not surprising to see them cover another method that might seem much more low key.
Just as Googlebot crawls the World Wide Web, Google is finding new ways to capture and collect information from the world around us.
Given that people from Google engage in activities such as climbing Mt. Kilimanjaro, maybe it shouldn’t have come as a surprise that Google patented a walking stick. Especially one with sensors in it.
At the Google I/O Developers Conference last week, we were introduced to the future of search, or as Google’s Head of Search Quality Amit Singhal called it, the “death of search.” The presentations from the day-long event told us that features like Google Now will provide information to us as we need it, rather than when we ask for it.
Perhaps that’s best explained by looking more closely at how Google Now works, and considering some fairly recent hires by Google. In the post Why Google’s Predictive Personal Assistant is better than Siri, published last September, I wrote about the patent that describes the predictive algorithm behind Google Now.
For instance, Google Now learns from your habits and your actions. If you go to the ball game at a nearby stadium on a regular basis, Google Now might start regularly showing you a knowledge card with the scores of games from the local team. If you only go to games when the local team is playing a specific competitor, Google Now may figure out that you’re a fan of that competitor, and start showing you knowledge cards with their scores. All of this is based upon Google Now learning more of your online and offline habits and activities.
Google Now will be coming to Chrome, and a hands-free verbal searching experience was displayed at Google I/O for desktop searchers as well, referred to as Hot-Word Detection.
While this is worth paying attention to, where things get really interesting is when we look at three new employees at Google, the team behind Behav.io, who have been engaged in finding deeper ways to gather and use sensor information from your mobile device, and from the devices of people you are connected to.
When news of the hiring took place, I looked at the USPTO assignment database to get an idea of what kind of technology the team had been working upon. A patent originally assigned to MIT was reassigned to Behav.io, which describes the kind of work they’ve been doing.
They developed a mobile application that can predict whether or not people might install apps based upon their behaviors and those of the people they communicate with. They kept an eye on a number of different kinds of informational graphs to be able to make this kind of prediction.
Here are some examples of those types of graphs:
- A call log graph (with edges weighted by the number of calls between nodes),
- A text message graph (with edges weighted by the number of text messages between nodes),
- A Bluetooth proximity graph,
- A co-location graph (from GPS data),
- A friendship graph (from Facebook), and
- An affiliation graph (from contacts)
If you go back to my (Siri) post above from last September, and click on the link to the patent, it describes how Google might make predictions based upon contextual information. For example, if you drive to work each morning, Google might figure out where you work. If you get in your car to go to work, and there’s congestion on the route you usually drive, Google Now might suggest a different commute to you.
Looking at informational graphs like the ones studied by the Behav.io team can provide a much richer set of information to make predictions upon. In addition to these types of communications, the patent describes the many different types of sensors that mobile smart phones come with.
Smart phones can gather data using many different sensors, from accelerometers to barometers, gyroscopes to magnetometers, ambient light sensors to proximity sensors, network position sensors to whether or not a screen is on or off. The Samsung Galaxy S4 is shipping with a thermometer and hygrometer as well. Given all these sensors, a phone can act as a mobile weather station, and can collect a lot of information that can be used in other ways as well.
The MIT patent goes far beyond predicting which apps people might install, and uses that only as an example. As we’re told in the patent:
This invention is not limited to predicting installation of apps. For example, in some implementations, this invention can be used to predict the conditional probability of a user taking any action, including adopting an idea
Sensor data from a mobile device can even be used to tell when you’re getting sick.
For example, a phone could predict that you’re coming down with the flu a couple of days before you show visible symptoms, based upon your movements, if they show you to be a little slower, weaker, and not as steady as normal. It might also look at who you’ve interacted with physically (by looking at Bluetooth signals between phones, for instance) and at communications between you and others.
The patent also tells us that it can tell when an idea is starting to spread across a network by predicting the diffusion of ideas across that network:
In exemplary implementations of this invention, “trend ignition” in a social-influence campaign in a network is predicted. For example, network data may be used to predict the probability that a certain portion of the user population of a network will adopt an idea (due to diffusion of the idea through the network), if a specific portion number of users in the network are initial “seeds” for that idea (persons who adopt the idea initially). This enables campaign managers to allocate resources efficiently.
The patent is:
Methods and apparatus for prediction and modification of behavior in networks
Invented by Wei Pan, Yaniv Altshuler, Alex Paul Pentland, and Nadav Aharony
Assigned to Massachusetts Institute of Technology
US Patent Application 20120303573
Published November 29, 2012
Filed May 29, 2012
In exemplary implementations of this invention, mobile application (app) installations by users of one or more networks are predicted. Using network data gathered by smartphones, multiple “candidate” graphs (including a call log graph) are calculated.
The “candidate” graphs are weighted by an optimization vector and then summed to calculate a composite graph. The composite graph is used to predict the conditional probabilities that the respective users will install an app, depending in part on whether the user’s neighbors have previously installed the app.
Exogenous factors, such as the app’s quality, may be taken into account by creating a virtual candidate graph. The conditional probabilities may be used to select a subset of the users. Signals may be sent to the subset of users, including to recommend an app.
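A toy sketch may help make the composite-graph idea concrete. The graphs, weights, user names, and scoring rule below are illustrative stand-ins, not details taken from the patent:

```python
# Sketch of the composite-graph idea: several "candidate" graphs (call log,
# text messages, Bluetooth proximity, ...) over the same users are weighted
# and summed, and a user's predicted likelihood of installing an app rises
# with the composite edge weight to neighbors who already installed it.

def composite_graph(candidate_graphs, weights):
    """Weighted sum of edge-weight dicts keyed by (user_a, user_b)."""
    composite = {}
    for graph, w in zip(candidate_graphs, weights):
        for edge, strength in graph.items():
            composite[edge] = composite.get(edge, 0.0) + w * strength
    return composite

def install_score(user, installed, composite):
    """Sum composite edge weights from `user` to neighbors who installed."""
    return sum(strength for (a, b), strength in composite.items()
               if a == user and b in installed)

# Hypothetical candidate graphs and weights
call_log = {("alice", "bob"): 12, ("alice", "carol"): 3}
sms      = {("alice", "bob"): 30}
composite = composite_graph([call_log, sms], weights=[0.6, 0.4])

# Alice's score if only Bob has installed the app
print(install_score("alice", {"bob"}, composite))
```

In the patent, the weights forming the composite graph are themselves learned (the “optimization vector”), rather than fixed by hand as they are here.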
Also, the probability of successful “trend ignition” may be predicted from network data.
The patent is long and very detailed, but worth skimming through with a highlighter, or by making notes in a margin (if your ebook reader can do that), or by pasting it into notepad and deleting all the stuff you don’t want to keep (which is what I do).
The difference between what the team working on Google Now has been doing and what the people at Behav.io have been doing isn’t just that the Behav.io team looks at more sensor data on a mobile device. It differs because Behav.io has been looking at aggregating data across multiple devices and multiple people to predict the adoption of apps, to figure out how illnesses spread, and to understand where ideas might start and ignite socially.
How will this impact the work being done on Google Now, which according to the Google I/O presentations, is one of the key aspects of the future of search? What will these new employees bring with their new focus on sensors and communication between people on smart phones? It can potentially change things significantly, as noted in the article, Google I/O: How Google Now Is Bringing Search Closer to Science Fiction.
For some more on what people involved in Behav.io have been working upon, check out the following resources:
- Winner: Behavio / Nadav Aharony
- Investigating Social Mechanisms with Mobile Phones
- The “Friends and Family” Study and the FunF platform (pdf)
The future of search isn’t going to be the death of search, but it is working toward knowing the things we might need to know before we realize we need them.
Back in 1997 I was a webmaster for a site that incorporated businesses in Delaware. In those days, promoting a site meant finding ways to deliver traffic to its pages, and building relationships with other site owners and businesses. PageRank was unheard of at the time, and the idea of anchor text being used as a relevance signal for web pages wasn’t anything we were concerned with at all.
Being included in directories was one way of being found, and finding complementary businesses we might share referrals with was a great way to cross-promote a site, as long as you felt you could trust the owners of that other site with your visitors. We didn’t have social media sites like Twitter or Facebook or Google Plus, but the internet had been a pretty social place even before the Web came about, with newsgroups and Bulletin Board Systems (BBS) providing a way for people to meet and interact. Forums provided similar chances to be social in the 90s.
I remember adding a link to a Polish classified ads site, written in both English and Polish. Honestly, I didn’t expect any response at all from it. I had found it, and it looked like people might be using it, but would there actually be any interest from people in the region in creating Delaware corporations? If I had judged it by the standards of a world preoccupied with SEO, I probably would have passed.
The listing didn’t give me a chance to provide a live link, so it wouldn’t have passed along any PageRank or anchor text relevance.
The audience was business people, but the site wasn’t topically inclined toward people creating new businesses. I created a classified ad with an address and phone number (the owner of the business liked talking to people on the phone rather than responding to inquiries by email), left it there, and didn’t think about it again.
We started getting phone calls from Eastern Europe. Not a lot of calls, but they were interesting. Financial services companies from countries that were little known in the United States were working with clients that wanted to ship goods across the world, and wanted to incorporate each voyage, to limit their liability from other journeys.
The little classified advertisement I left on the Polish site led to enough business to sustain the incorporation company for its first few years, all for the sake of getting cargo containers filled with commodities like olive oil from one side of the Atlantic to the other.
I recalled that link, the phone calls in response, and the business that it led to because it reminded me of a time when the Web was more social.
Even without sites like Twitter, we built pages to answer questions, to offer goods and services, and to provide information to people. We weren’t concerned about how highly a site might rank in search results, or even what search results were. I think sites today sometimes forget about that social element of the Web, and worry more about where they might rank for some main terms, and how much traffic their products and part numbers might draw through web searches.
Social isn’t just about including sharing buttons and profile buttons on your pages to Twitter and Facebook and Google Plus, though those are a good start these days. Make it as easy as possible for people to find your profiles on those sites, and for them to share something you’ve written. Social isn’t just about making sure that you use avatars and images and backgrounds on social sites and profiles, though that can help as well.
Social is about what you contribute to social sites and how you interact with others.
Social is about building relationships with others, and taking the time to meet people, to talk with them, to find common interests and objectives.
Social networking sites are channels that enable you to find others, to educate and be educated, to influence and be influenced. There should be a thrill of discovery when you log into Twitter and have a chance to learn something new.
Social media isn’t a broadcast channel where you blast out offers to others, and tweet and retweet things you think “your” audience might be interested in.
The “social” in social media and social networking is the most important part of being involved.
When I put the classified out there on the Polish website all those years ago, I was inviting conversations with others who might be interested in what we offered. It was a first step, and the phone number I provided belonged to someone who loved to talk, to educate, and to learn about others and how they could work together. Judged in terms of today’s SEO, the listing might seem to have had very little value, but in social terms, it was the right place to be listed at the right time for people who needed the service offered.
Social Signals Today
Social media is being incorporated into both Google and Bing these days, with social annotations appearing in searches from both. When you’re signed into Google and perform a search, you may see relevant results that appear because someone you’re connected to shared something from a site or endorsed it. When you perform searches at Bing, you may see similar results with annotations from people you’re connected to, results that wouldn’t have appeared (even though they are relevant) without those connections being made.
If you run a business and have a website, one of the reasons to get involved in Facebook and Twitter and Google Plus is to connect with people on those sites that might be interested in what you offer, and in what interests you may have.
Social networks enable you to reach out across your neighborhood and across the globe to find people who share common interests with you.
A lot of people are writing about Google Authorship Markup these days, and how Google might introduce Agent Rank, or an author rank, that might influence organic search results even when someone isn’t logged into a Google Account. Google hasn’t explicitly provided us with details on how they might rank such results, but there are a lot of hints that have come out via patent filings and white papers and even interviews that people have held with representatives from Google.
In January, I wrote a post titled, What’s Your Google Viral Score? about a patent from Google that describes how Google might come up with a viral score, or “content propagation likelihood” score that attempts to estimate how information might be spread by someone within social networks based upon activity such as sharing, responding to comments and threads, endorsing content, liking pages and profiles, and so on. Such a score might attract advertisers to choose a particular member of a social network to share something with them.
Signals Involving Activity and Influence
In 2010, Google published a whitepaper titled AdHeat: An Influence-based Diffusion Model for Propagating Hints to Match Ads (pdf), presented at the WWW 2010 Conference held in Raleigh, North Carolina (where it was a Best Paper Nominee). Yesterday, Google published a patent application based upon the paper, which provides some additional details about how an advertising system like the AdHeat system might be set up on a social network such as Google Plus. The patent is:
AdHeat Advertisement Model for Social Network
Invented by Dong Zhang and Edward Y. Chang
Assigned to Google Inc.
US Patent Application 20130103503
Published April 25, 2013
Filed: September 14, 2012
In one implementation, a computer-implemented method includes receiving at a server information indicating activity levels of users of a computer-implemented social network or acquaintance relationships of the users on the computer-implemented social network. The method further includes generating by the server influence scores for the users based on the received information.
The method also includes recursively propagating by the server an ad through the computer-implemented social network between users having an acquaintance relationship by transmitting the ad from a propagating user to a recipient user when a difference between a first influence score of the propagating user and a second influence score of the recipient user is greater than a threshold.
According to the paper on AdHeat, this advertising method was tested on Google’s Q&A websites around the world. Google doesn’t have a Q&A site in the United States, but it does in many other countries. Those sites are very social in nature, and participants’ reputations and results are ranked on the basis of a user rank that looks at both the contributions they make to the sites and their interactions there.
One of the interesting things about this user rank is that it would translate over very well to Google Plus, and a system like Agent Rank, which could potentially be used to help rank web search results based upon user activities with the social network.
We don’t know if Google will ever run advertising on Google Plus, and it’s possible that they might not, but that they might look at Google Plus activity for people, and use that information to show people ads in Google search results and on pages that carry Google sponsored results.
At the heart of such an advertising method is how people might be decided upon by advertisers to show ads to, and how those ads might be spread to others as well:
In some instances, a method and system is described by which an advertiser can target ads to users of a social network according to a user’s interests and influence on the social network. An opportunity to display an ad to an influential user with interests relevant to the ad may be more valuable to an advertiser than an opportunity to display an ad to a non-influential user with unrelated interests (or even a non-influential user with related interests). A user’s influence on a social network may be determined by looking at the user’s level of activity and/or acquaintance relationships on the social network. An advertiser may receive a ranked list of anonymous users according to user interest and influence. A bidding mechanism may be used to accommodate multiple advertisers seeking to obtain the opportunity to display ads to a finite number of relevant, influential users on a social network.
In some instances, once an opportunity to display an ad to a specific anonymous user has been awarded to an advertiser, an ad from the advertiser may be propagated from the user to the user’s friends using a heat diffusion model. For instance, a user’s influence on the social network can be represented as a heat intensity or a heat score, where users with more influence have a higher heat score. Propagation between users can then be modeled using a heat diffusion model. For example, an ad may spread (propagate) between two connected users as long as the user targeted with the ad has greater “heat” than the user yet to be targeted. This may result in ads propagating throughout the social network from more influential users to less influential users. One advantage that may be gained from the described method is the ability of advertisers to maximize advertising efficiency by propagating ads from influential users to influenced users.
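The heat-diffusion rule in that passage can be sketched in a few lines of code. The edges, heat scores, names, and threshold below are made up for illustration; the patent itself describes the model in far more general terms:

```python
# Toy heat-diffusion propagation: an ad passes along an edge only when the
# already-targeted user's "heat" (influence score) exceeds the neighbor's
# by more than a threshold, so ads flow from more to less influential users.

def propagate_ad(seed, edges, heat, threshold):
    """Spread from `seed` down the influence gradient; return users reached."""
    reached = {seed}
    frontier = [seed]
    while frontier:
        user = frontier.pop()
        for neighbor in edges.get(user, []):
            if neighbor not in reached and heat[user] - heat[neighbor] > threshold:
                reached.add(neighbor)
                frontier.append(neighbor)
    return reached

# Hypothetical network: who is connected to whom, and each user's heat score
edges = {"ana": ["ben", "cal"], "ben": ["dia"], "cal": ["dia"]}
heat  = {"ana": 0.9, "ben": 0.6, "cal": 0.85, "dia": 0.5}

print(sorted(propagate_ad("ana", edges, heat, threshold=0.2)))  # ['ana', 'ben']
```

Notice that the ad never reaches "cal" or "dia": the heat differences along those edges fall below the threshold, so propagation stops even though the users are connected.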
The patent filing and the paper on AdHeat provide more information on how social activities and interactions might influence the spread of advertisements based upon user interests and user influence. They also provide hints at how Google might be viewing participants in a social network such as Google Plus. These signals can include who you interact with, what topics you write about and respond to, how you interact with others, and who shares the content that you create, share, and respond to.
Google Plus cares little about a link graph, and more about an interest graph uncovered from the topics you are involved with, and a social graph that identifies who you interact with and whom you might influence and be influenced by. That’s not to say that profile and post pages in Google Plus don’t accumulate PageRank and can’t show up in search results themselves. They are pages that can be ranked that way, like other indexable pages on the Web. But that’s just a small part of the picture.
We’ve been told by representatives from Google that Google Authorship and social signals will likely become part of how pages are ranked on the Web at some point in the future, and it’s something that the search engine is experimenting with.
As my story at the beginning of this post illustrates though, social activity can lead to good things regardless of whether a search engine is working upon how to incorporate those signals into search results.
Make it easy for people to interact with you and your website through social sites, profile buttons, and social sharing buttons, and get on those social sites to actually interact with people in meaningful ways. It can lead to positive results regardless of how it might be incorporated into search engine rankings in the future.
In the past couple of years, we’ve been seeing Google bring a level of social activity and awareness to search that was missing in the past. They’ve developed metadata approaches that enable authors to connect their Google accounts with web pages they write and contribute to. They’ve also set up a way for businesses to connect their sites to Google profiles as publishers.
Bing seems to be trying to broaden their social reach as well, though their approaches have been different. Chances are that methods similar to what Bing is exploring will be used by Google in the future as well.
When Google launched Search Plus Your World (SPYW) results, it let Google surface information from social networks (primarily Google Plus) shared by people you are connected to. While performing searches, you might see relevant results for queries that someone you’re connected to either shared or endorsed with a +1, shown as a social search result.
We see similar social annotations appearing in Bing, when logged into our Facebook accounts. See: How Buddies and People Who Know are Selected for Bing Social Searches.
Another feature that we see in Google is that some search results might display Author Badges even when we aren't logged into Google. Those badges can show up when someone who created a page adds author information metadata to that page, and connects their Google Profile to the page or to a profile on the site where it appears.
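For reference, the authorship markup Google supported at the time looked roughly like the following; the profile URL and author name below are made-up placeholders:

```html
<!-- On the article page, linking to the author's Google Profile: -->
<a href="https://plus.google.com/112233445566778899000?rel=author">by Robert Example</a>

<!-- Or, in the document head: -->
<link rel="author" href="https://plus.google.com/112233445566778899000">
```

The Google Profile also needed to link back to the site (in its "Contributor to" section) for the two-way connection to be verified.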
Interestingly, Google is also showing author badges for content on other sites that might be from the same author where authorship markup hasn’t been added to those sites. (For example, Google is showing author badges for a couple of my sites where I haven’t set up authorship markup, but I linked to those from my Google Profile).
Bing Profile Pictures in Search Results
This week, Bing also started showing profile pictures in their search results when someone isn’t logged into either Facebook or Bing. These results appear to be similar to the authors that Google shows. Bing doesn’t offer the kind of authorship markup that Google offers. So how does Bing find these pictures and profiles?
AJ Kohn, of Blind Five Year Old, published a thoughtful analysis of these profile pictures from Bing search results, in the post People Snippets on Tuesday. AJ notes regarding these images:
The new faces (for the most part) showing up in Bing search results are not authorship snippets per se but are people snippets derived from entities. It’s about who the content is about rather than who created the content.
It appears that Bing is doing some creative data mining of pages within its index to identify images of people to display next to pages in search results. Given the fact that not everyone will set up authorship markup for Google, it’s likely that Google will also have to do something similar if they want to try to associate authors with content on the Web that doesn’t use authorship.
There are a lot of people on the Web who publish a lot of content in formats that might make it difficult to add the kind of authorship markup that Google has made available. For instance, consider an academic or industrial researcher who has a profile page and a list of whitepapers available as PDF files on the Web. Google's authorship markup really isn't very easy to apply to those pages and that content, even if it might not be difficult to point a link to the profile pages for that author from a Google profile.
Below is an image of Bing results that include my picture. There’s another using the same profile picture in the search below for a different profile page for me. And a little lower is a link to my profile at MySpace that shows a completely different profile image for me, which is the profile picture I used at MySpace. These aren’t links to content I created as much as they are links to pages about me, such as profile pages.
Microsoft published a pending patent at the USPTO in December that describes how they might use a data mining approach to identify authors on the Web. The patent application is:
Discovering Expertise Using Document Metadata in Part to Rank Authors
Invented by Aninda Ray and Dmitriy Meyerzon
US Patent Application 20120310928
Published December 6, 2012
Filed: June 1, 2011
Expertise mining features are provided based in part on the use of an expertise mining algorithm and expertise mining queries. A method of an embodiment operates to provide an expanded feedback query based in part on search results using an expertise mining query and a number of author-ranking heuristics used to rank authors and/or co-authors (e.g., primary authors, secondary authors, etc.) as part of an expertise mining operation.
A search system of an embodiment includes an author ranker component to rank authors based in part on an expertise mining query and author-ranking heuristics, and a query expander component to provide expanded queries as part of identifying relevant search results. Other embodiments are also disclosed.
How Bing May be Finding Authors on Different Topics
There’s a multiple step process that Bing might use to find authors on the Web, and to display images of those authors in search results.
The first step involves doing some focused crawling on the Web to locate authors who might have some kind of expertise on a particular topic.
As an example, if someone is searching for expertise on a new mobile phone running Windows Mobile Phone 7 operating system (example is from the patent, which is why it uses a Microsoft example), they might type “Windows Mobile Phone 7 expert” into a search engine to find out about experts on that topic.
Search results that are returned for these queries might be limited to certain types of results as well, such as:
- Design plans,
- White papers,
- Curriculum vitae,
- Published (white) paper lists,
- Citation lists,
- Patent applications
The patent filing uses an example of a search engineer to describe this process, which is probably a good choice. I’ve searched for more information on more than a couple of inventors on search related patents, and many of them have one or more profile pages, often at a university page or an industry page, and these link to things like whitepapers that they’ve written.
These profiles aren’t likely to have something like Google’s authorship markup on them, and adding markup to PDF documents involves making changes on a server level rather than to those PDF documents themselves.
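One conceivable server-level approach is an HTTP Link header (RFC 5988), set in the web server's configuration rather than inside the PDF itself. The Apache sketch below is purely illustrative, and whether search engines actually honored such a header for authorship isn't confirmed; the profile URL is a made-up placeholder:

```apache
# Hypothetical: attach an author link relation to PDF responses via
# an HTTP Link header, using Apache's mod_headers module.
<FilesMatch "\.pdf$">
    Header add Link '<https://plus.google.com/112233445566778899000>; rel="author"'
</FilesMatch>
```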
When authors are located through this focused crawling for different topics, the search engine might expand queries to learn more about these specific authors, including identifying other profile pages for them, and possibly other things that they've written.
The kinds of information looked for might include things such as:
- User profiles
- User expertise summaries
Let's say that Robert Example is one of the "experts" identified during that original search. The search engine might work at uncovering more information by exploring search results where the author's name is added to words (or tokens, as they might be called) from the original query. So it might search for [Robert Example windows] or [Robert Example windows 7] or [Robert Example Windows Mobile Phone 7] and so on.
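As a sketch, that expansion step might look like the following; the function name and the token-prefix scheme are my assumptions, since the patent doesn't spell out an exact algorithm:

```python
def expand_author_queries(author, query):
    """Append a candidate author's name to growing prefixes of the
    original query's tokens, producing expanded queries like
    [Robert Example windows], [Robert Example windows mobile], etc.
    (Illustrative sketch, not the patent's actual implementation.)"""
    tokens = query.split()
    return [f"{author} {' '.join(tokens[:i])}" for i in range(1, len(tokens) + 1)]

queries = expand_author_queries("Robert Example", "windows mobile phone 7")
# → ['Robert Example windows', 'Robert Example windows mobile',
#    'Robert Example windows mobile phone', 'Robert Example windows mobile phone 7']
```

Each expanded query could then be issued against the index to surface more profile pages and documents tied to that author.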
The purpose behind doing this is to be able to provide searchers with search results that might include links to profile pages for authors as well as results that might contain related pages from those authors. The related pages not only help by providing results that answer a query, but they also reinforce the expertise that might be cited in a profile about an author.
The patent provides a lot of details on how they might not only identify authors and author profile pages, but also content created by those authors, and how different authors might be ranked in determining which ones to show for different topics evidenced in queries.
Why This is Important
Bing is showing us that authorship markup like the kind developed by Google isn't completely necessary for identifying the authors of pages on the Web, or for learning which authors might be experts on specific topics. It doesn't seem like Bing has completely refined this process yet, but they seem to be working on it. While Google's authorship markup probably makes it easier for a search engine to learn about authors, and to bring a social element to showing content from some authors to people they're connected to through Google's SPYW approach, not everyone is going to adopt or use Google Authorship.
Setting up authorship is recommended because Google does appear to be working towards a day when content on the Web is tied to its author. That could mean that reputation scores associated with certain authors might help boost rankings for pages they write on topics where they have good reputations.
But there are likely going to be many pages on the Web that don’t have authorship markup, and an approach like the one that appears to be in development from Bing shows how authors can sometimes be identified when they have profile pages, and they write about topics that they could be considered experts on.
So Bing might be able to use a data mining process like this to find authors that might be shown in search results as having an expertise in fishing, in interior design, in designing parts for hybrid cars, and in many other industries covering many different topics.
Showing off your expertise on the Web through profile pages at different sites such as LinkedIn, Google Plus, and many others is often a good idea in that it helps show your credibility and the things you have an interest in. Under Bing's data mining approach, it may also land you in search results with a picture.
The Microsoft patent doesn't mention including pictures from profiles in search results, but given how often those pictures are appearing next to profile pages, it's a clear sign that Bing is interested in making authors and profiles stand out.
Linkbuilding isn’t dead, and it likely won’t be any time soon, but it has changed over the past few years, and the search engines are looking at other signals as well, such as social sharing…
Less Reliance on Links
If you look through the links that you might have attracted or acquired for your website in the past, you might see directories that no longer have much value. You visit their home page and notice that either the directory no longer exists or its home page is showing a zero PageRank in the Google Toolbar.
You may also see links from pages that once linked to you but are no longer online, because the site closed, moved to a new domain, or removed the resource pages or blog posts that linked to you.
The Web is in a constant state of churn, and sites come and go. Many directories that existed in the past targeted people interested in building links to their pages more than building directories that help people find pages and information on the Web.
Google used to tell people in their Webmaster Guidelines to use sources like directories to tell other people about their sites, but that was removed from the guidelines a while ago. Google seems to have cracked down on directories and other resources that were aimed specifically at helping pages rank higher in Google’s search results.
Google’s page on Link Schemes explicitly warns people against such directories as well, telling people that “Low-quality directory or bookmark site links” are unnatural links that violate Google’s guidelines.
There has been a lot of discussion on the Web that “earning” or “attracting” links is the new linkbuilding, by creating great content that people value for the information it shares and the resources it provides. There’s a lot of value in giving people information that they will find useful, in presenting it in a way that engages, persuades, and inspires people to share it with others through social sites, through links and referrals, by email, and in other ways.
That’s really nothing new and maybe even goes back to the days before we had sites like Google or Bing or even Altavista.
Social Annotations in Search Results
Both Google and Facebook have been getting more social, and if you perform a search at Google while logged into your Google Account, you’ll see annotations from people you’ve connected to on Google Plus regarding whether or not they’ve shared pages appearing in search results.
If you log into your Facebook account and search at Bing, you’ll also see annotations from people you’re connected to about whether or not they’ve liked certain pages.
Google is experimenting with how they display annotations from social sources. The result pictured above is in Google’s top ten results (for me), and it was shared on Google Plus by Webimax. Because of my social connection to Webimax, a link to the page appears in relevant search results on a query that I performed for [seo 404 errors] (ignore the brackets).
Without the page having been shared by an account that I’m connected to on Google Plus, it might not have ranked as highly in search results. While that result has acquired some PageRank, and it’s relevant for my query, it appears in part because of our social connection.
Google has been performing experiments with this kind of social connection, and they shared a whitepaper publicly that discusses some of the experiments they’ve been performing. In May of 2012, the paper Social Annotations in Web Search (pdf) was published at the CHI 12 conference in Austin, Texas. Its authors include Aditi Muralidharan of UC Berkeley, and Zoltan Gyongyi and Ed H. Chi, both from Google.
The paper explores the topic of how best to present socially shared documents in search results. Note that my result displayed above doesn't have a small image under it, as socially shared results used to in the past and as they are presented in the research paper. The intro to the paper tells us of three major points:
- First, only certain social contacts are useful sources of information, depending on the search topic.
- Second, faces lose their well-documented power to draw attention when rendered small as part of a social search result annotation.
- Third, and perhaps most surprisingly, social annotations go largely unnoticed by users in general due to selective, structured visual parsing behaviors specific to search result pages.
Muting Social Results by Person/Publisher?
Google also published a patent application last week that involves social annotations in search results as well. The filing would enable searchers to decide whether or not they wanted to see pages from specific friends or connections at social networks in the search results they received. They would be able to “mute” results from people or companies of their choosing. The patent is:
Filtering Social Search Results
Invented by Matthew E. Kulick, Adam D. Bursey, and Maureen Heymans
Assigned to Google
US Patent Application 20130036109
Published February 7, 2013
Filed: August 5, 2011
This specification describes technologies relating to searching. In general, aspects of the subject matter described in this specification can be embodied in methods that include the actions of:
- Receiving a search query from a user of a search service,
- Identifying search results including general search results responsive to the search query and social search results associated with content generated by one or more members of a user social graph associated with the user that are responsive to the search query, the search results corresponding to digital content stored in one or more computer-readable storage media,
- Determining that a first social search result is associated with a first muted member that is a member of the user social graph,
- Generating filtered search results in response to determining that the social search result is associated with the first muted member, and
- Providing the filtered search results for display to the user.
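The filtering step in those claims can be sketched in a few lines; the field names and data shapes here are assumptions for illustration, not details from the patent:

```python
def filter_social_results(results, muted_members):
    """Drop social search results whose sharer appears in the
    searcher's muted list; general (non-social) results pass through.
    (Illustrative sketch of the patent's filtering step.)"""
    return [r for r in results if r.get("shared_by") not in muted_members]

results = [
    {"url": "a.example", "shared_by": "alice"},
    {"url": "b.example", "shared_by": "bob"},
    {"url": "c.example", "shared_by": None},  # general search result
]
filtered = filter_social_results(results, {"bob"})
# → the "alice" and general results remain; the "bob" result is muted
```

Presumably a real system would apply something like this after retrieval, before blending social and general results for display.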
What the Patent Tells Us About Social Signals
In addition to the information that a patent contains, it can be really helpful to look at some of the assumptions and implications it might hold. Here are some points and takeaways to consider:
Point: People are likely to give more weight to search results from people they know, including reviews and opinions. Sometimes though, those social results can include information that they really don’t find value in.
Take Away: When you share something on a social site such as Google Plus, think about how much value it may actually provide to searchers who might see it as a socially shared search result. If people come to value the things you share, they're less likely to mute your results (they can't yet, but might be able to in the future), and more people are likely to follow you on that social network as well.
Point: When someone sees your annotation in social search results, they can use it to go back to your profile. A profile not only acts to indicate that you are connected through a social network, but also to enable people to verify what they might know about you.
Take Away: Spend time on your social profiles, making them interesting and engaging, and credible. Doing so can also have the benefit of attracting more people to add you on those networks as well.
Point: The patent discusses the possibility of multiple interconnected social networks being involved in an analysis like this, including people you might chat with, or email to, or connections you’ve made in specific social networking services. Connections might also be considered based upon blogs or feeds that you’ve subscribed to, or sites that you’ve become a member of. Your connections in social networks may also have connections who might be shown your content in a social search result as well.
Take Away: We do know that Google is showing results from people whom we are directly connected to in Google Plus at this point, but there is a possibility that they could broaden the connections that they show. It would make sense for them to do that in conjunction with a feature that would allow people to mute some of the connections they are shown.
Point: A social graph might be more than just the people you are connected to. It might include “friends” of people you are connected to, sites and resources that are linked to from their profiles, and more. This could be governed by a number of degrees of separation, but if you connect directly to someone, you might be connecting to some of their connections as well.
Take Away: There’s value in connecting to others as an individual or as a business to make the connection itself and work towards building a relationship with them. You may start seeing social results from people they are connected to as well.
Point: Google might decide that the connections you make, and the degree of separation at which their content might show up in social results, or at which your content might show up in their search results might involve other actions, including a frequency of interaction. This can include how often you visit a specific social network, how frequently you endorse or click on items shown by friends, and more. This might be a dynamic feature, and could change over time.
Take Away: If you're going to share something or endorse something through a social network, choose things that your audience is likely to be interested in, so they may visit the pages, pictures, and videos that you share or endorse. If you choose things that don't appeal to that audience, your socially shared items will reach a smaller audience. If the muting feature described in this patent is put in place, more people may mute your results as well.
Point: The patent provides a laundry list of the kinds of content that might potentially be shared on a social network, and it includes: “local reviews (e.g., for restaurants or services), video reviews and ratings, product reviews, book reviews, blog comments, news comments, maps, public web annotations, public documents, streaming updates, photos and photo albums. Thus, the content can include both content generated by the members of the user’s social graph, as well as content endorsed or reviewed by the members of the user’s social graph.”
Take Away: I’ve seen many social profiles that limit the kinds of things they share about themselves and their business, or include a limited number of resources. If you broaden what you include to share things that people who are interested in you might also find interest in (and that you find interest in), you broaden the potential range of socially annotated results that you might be seen for.
Be careful, though: spamming people through social sharing might be a quick way of getting removed as a connection today, and if the kind of muting described in the patent is implemented, it might get you muted as well.
Ranking in Social Results
The patent does provide some details about how social signals might be ranked in search results. It’s hard to tell how strong these signals might be today, or if there might be others that aren’t included in the description of the patent. But they seem like good things to keep in mind.
The following list, taken from a screenshot in the patent, covers some of the sources of signals that might be used in ranking social search results:
Affinity based signals, which can include:
- How a friend is connected to the user (directly on a social network, through a blog subscription, as an email connection, etc.)
- Which social networking site the friend is a member of
- Whether friend or friend of friend
- How many paths to get to the friend of a friend (e.g., common middle friends)
- How frequently someone clicks on posts by a particular contact
- Leaving comments on a contact’s posts can result in higher affinity than occasional endorsements
Other ranking signals:
- An information retrieval score of content for a query
- Content type (e.g., blogs versus images)
- Date of the associated content
- If content was originally shared with a smaller group of friends, including you, rather than everybody
- The number of friends who might have endorsed or shared a resource
- A close affinity to someone who endorsed a resource
- The type of endorsement made by someone you might be related to (starred results, visited the resource, left a comment on it, etc.)
- Based upon authorship of a resource, and a relationship with that author
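To make the idea concrete, here's one hypothetical way signals like these could be combined into a single score. The patent gives no weights or formula, so the function shape, weights, and signal ranges below are entirely my assumptions:

```python
def social_result_score(ir_score, affinity, freshness, endorsements,
                        w_affinity=0.5, w_fresh=0.2, w_endorse=0.3):
    """Boost a base information-retrieval score using social signals.
    affinity and freshness are assumed to be normalized to [0, 1];
    endorsement counts are capped so a pile-on can't dominate.
    (Hypothetical combination, not the patent's actual formula.)"""
    endorse_signal = min(endorsements, 10) / 10
    return ir_score * (1 + w_affinity * affinity
                         + w_fresh * freshness
                         + w_endorse * endorse_signal)

# A result shared by a close contact outranks the same result
# with no social signals attached:
base = social_result_score(1.0, affinity=0.0, freshness=0.0, endorsements=0)
boosted = social_result_score(1.0, affinity=1.0, freshness=0.5, endorsements=3)
```

The multiplicative form means social signals can only boost a result that is already relevant to the query, which matches the patent's framing of social annotations as supplementing, not replacing, relevance.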
This is the first extended set of signals that I’ve seen from Google that describe how social search results might be ranked. It’s quite possible that some of these may be in use at present, some might be things that could be added or ignored, and other signals could be included as well.
These signals do make a lot of sense if the goal of social search is to surface things that are fresh and come from people whose opinions you value.
The patent does include some more details, and some examples that might be worth looking at as well. Definitely take a look or ask if there’s something you’re interested in.
Google’s invested a lot of time, energy, and resources towards using social signals to make content available from people we may be connected to in social settings. We know that Google is exploring other signals as well, such as using knowledge bases to provide richer results. That’s likely a good topic for a future post.
Regardless of whether or not Google provides us with the ability to mute certain people from showing up in social search results, social annotations are capable of providing searchers with information about relevant results, including pages and content that they might not see based upon PageRank and relevance signals alone.