Archive for the ‘Uncategorized’ Category
Read our thoughts on the future of communication in social networks. Let us know what you think.
We are working on combining these two worlds: spam-free zones of social networks and Web search. The basic idea is simple. You sign in with your accounts (such as Twitter, Facebook, etc.), we extract the links posted by your friends and build the list of web sites that make up your trusted Web. You can then limit your search to your trusted Web. We are experimenting with this idea in various ways. For example, the links from your friends can be extended with links from friends of friends (i.e. one hop via the social graph). Or they can be extended with pages that are linked from the pages posted by your friends (i.e. one hop via the Web graph). We can also compute the popularity of Web sites among your friends and use it to improve the ranking of search results. We are going to leverage The Tweeted Times technology to build it.
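The core of the idea can be sketched in a few lines. This is only an illustration under stated assumptions, not how the service is implemented: the input format (a mapping from friend to posted URLs), the function names `site_of`, `trusted_web` and `rerank`, and the choice to count distinct friends per site are all hypothetical.

```python
from collections import Counter
from urllib.parse import urlparse

def site_of(url):
    """Return the lowercase host of a URL, without a leading 'www.'."""
    host = urlparse(url).netloc.lower()
    return host[4:] if host.startswith("www.") else host

def trusted_web(links_by_friend):
    """Count, for each site, how many distinct friends posted a link to it."""
    popularity = Counter()
    for urls in links_by_friend.values():
        # A set, so a friend who posts ten links to one site counts once.
        for site in {site_of(u) for u in urls}:
            if site:
                popularity[site] += 1
    return popularity

def rerank(results, popularity):
    """Stable sort: sites popular among friends come first."""
    return sorted(results, key=lambda url: -popularity[site_of(url)])

# Hypothetical sample data.
links_by_friend = {
    "alice": ["http://www.example.org/a", "http://example.org/b",
              "http://news.example.com/x"],
    "bob": ["https://example.org/c"],
}
popularity = trusted_web(links_by_friend)
ranked = rerank(
    ["http://other.net/p", "http://news.example.com/y", "http://example.org/z"],
    popularity,
)
```

The keys of `popularity` are the trusted Web; the counts are the popularity signal that could feed into ranking. Extending the list by one hop over the social graph would amount to passing in the links of friends of friends as well.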
Note that it is not only about spam-free search. It is also a great personalization tool, as you search through the sites that are most relevant to you. You get these sites at the top of the search results, which rarely happens when you search over the whole Web.
To reduce search results to the trusted Web we use the Blekko API. Blekko is a really cool search engine that allows you to “slash” the Web with a list of Web sites. We generate Blekko slashtags that you can use to search only through your trusted Web. By combining our slashtags with other tags you can personalize the search further.
There are ongoing efforts to improve search via the social graph. Google already ranks higher, and highlights in the results, those links that your friends posted on social networks. Blekko allows you to search through Web pages liked by your friends on Facebook (use the /likes slashtag). But in both cases it is about indexing links to a particular page. In their current form Google and Blekko work more like a bookmarking system for finding pages linked by your friends. We want to take it further – to go from page links to Web sites as described above. We will keep you posted on the progress.
Further reading: for an in-depth introduction see Understanding the Cassandra Data Model at datastax.com
For newcomers the Cassandra data model is a mess. Even experienced database developers spend quite a bit of time learning it. There are great articles on the Web that explain the model. Read WTF is a SuperColumn? An Intro to the Cassandra Data Model and my favorite one – Installing and using Apache Cassandra With Java. This blog post is my attempt to explain the Cassandra model to those who would like to understand the key ideas in 15 minutes or less.
In a nutshell, the Cassandra data model can be described as follows:
1) Cassandra is based on a key-value model
A database consists of column families. A column family is a set of key-value pairs. I know the terminology is confusing, but so far it is just the basic key-value model. Drawing an analogy with relational databases, you can think of a column family as a table and of a key-value pair as a record in a table.
2) Cassandra extends the basic key-value model with two levels of nesting
At the first level the value of a record is in turn a sequence of key-value pairs. These nested key-value pairs are called columns, where the key is the name of the column. In other words, a record in a column family has a key and consists of columns. This level of nesting is mandatory – a record must contain at least one column (so in the first point above the value of a record was an intermediate notion, as the value is actually a sequence of columns).
At the second level, which is optional, the value of a nested key-value pair can be a sequence of key-value pairs as well. When the second level of nesting is present, the outer key-value pairs are called super columns, with the key being the name of the super column, and the inner key-value pairs are called columns.
3) The names of both columns and super columns can be used in two ways: as names or as values (usually reference value).
First, names can play the role of attribute names. For example, the name of a column in a record about User can be
Second, names can also be used to store values! For example, the column names in a record that represents a Blog can be the identifiers of the posts of this blog, and the corresponding column values are the posts themselves. You really can use column (or super column) names to store values because (a) theoretically there is no limit on the number of columns (or super columns) for any given record and (b) names are byte arrays, so you can encode any value in them.
4) Columns and super columns are stored ordered by names.
You can specify the sorting behavior by defining how Cassandra treats the names of (super) columns (recall that a name is just a byte array). A name can be treated as BytesType, LongType, AsciiType, UTF8Type, LexicalUUIDType or TimeUUIDType.
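The four points can be mimicked with plain Python dictionaries. This is an analogy only, not a Cassandra client API: the `users` and `blogs` sample data and the helper names are hypothetical, and the sketch assumes that treating a name as a long means decoding it as a signed 8-byte big-endian integer.

```python
import struct

# Plain dicts standing in for Cassandra structures (an analogy, not a client API).
# Points 1 and 2: column family -> record key -> column name -> column value.
users = {
    b"user-1": {b"name": b"Alice", b"city": b"Berlin"},
}

# Point 3: column names used as values. Post IDs are the names, posts the values.
blogs = {
    b"blog-1": {b"post-002": b"Second post", b"post-001": b"First post"},
}

# Point 4: the same byte-array names sort differently under different comparators.
def bytes_order(name):
    """Compare names as raw bytes."""
    return name

def long_order(name):
    """Compare names as signed 8-byte big-endian integers (assumed encoding)."""
    return struct.unpack(">q", name)[0]

def ordered_names(record, comparator):
    """Return the column names of a record in comparator order."""
    return sorted(record, key=comparator)

record = {struct.pack(">q", n): b"" for n in (1, -1, 256)}
as_raw_bytes = [long_order(n) for n in ordered_names(record, bytes_order)]
as_longs = [long_order(n) for n in ordered_names(record, long_order)]
```

Compared as raw bytes, the name that encodes -1 sorts last because its bytes are all 0xff; compared as longs, it sorts first. That is why the comparator has to match the way values are encoded into names.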
So now you know everything you need to know. Let’s consider a classic :) example of a Twitter database to demonstrate the points.
Tweets contains records representing tweets. The key of a record is of Time UUID type and is generated when the tweet is received (we will use this feature in the User_Timelines column family below). The records consist of columns (no super columns here). Columns simply represent attributes of tweets. So it is very similar to how one would store it in a relational database.
The next example is User_Timelines (i.e. tweets posted by a user). Records are keyed by user IDs (referenced by the User_ID columns in the Tweets column family). User_Timelines demonstrates how column names can be used to store values – tweet IDs in this case. The type of the column names is defined as Time UUID. It means that tweet IDs are kept ordered by the time of posting. That is very useful, as we usually want to show the last N tweets for a user. The values of all columns are set to an empty byte array (denoted “-”) as they are not used.
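Here is a small in-memory sketch of the User_Timelines idea, not Cassandra code. The helper `time_uuid` is hypothetical and builds version-1 UUIDs from explicit timestamps only to keep the example repeatable; in practice the IDs would come from something like `uuid.uuid1()`. Sorting by the UUID timestamp is meant to imitate a Time UUID comparator.

```python
import uuid

def time_uuid(timestamp, node=0):
    """Build a version-1 UUID from an explicit 60-bit timestamp (demo helper)."""
    time_low = timestamp & 0xFFFFFFFF
    time_mid = (timestamp >> 32) & 0xFFFF
    time_hi = (timestamp >> 48) & 0x0FFF
    return uuid.UUID(fields=(time_low, time_mid, time_hi, 0x80, 0, node), version=1)

# user ID -> {tweet ID (used as the column name): empty value}
user_timelines = {}

def add_tweet(user_id, tweet_id):
    """Store the tweet ID as a column name; the column value stays empty."""
    user_timelines.setdefault(user_id, {})[tweet_id] = b""

def last_tweets(user_id, n):
    """Return the n most recent tweet IDs, newest first."""
    names = user_timelines.get(user_id, {})
    return sorted(names, key=lambda u: u.time, reverse=True)[:n]

first, second, third = time_uuid(100), time_uuid(200), time_uuid(300)
for tweet_id in (first, third, second):  # insertion order does not matter
    add_tweet("user-1", tweet_id)

latest = last_tweets("user-1", 2)
```

In this sketch the ordering is computed on read; the point of the column family design is that the store keeps the names ordered, so fetching the last N tweets is a cheap slice.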
To demonstrate super columns let us assume that we want to collect statistics about URLs posted by each user. For that we need to group all the tweets posted by a user by URLs contained in the tweets. It can be stored using super columns as follows.
In User_URLs the names of the super columns are used to store URLs and the names of the nested columns are the corresponding tweet IDs.
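As an analogy in plain Python (again, not a Cassandra API), User_URLs is a three-level nested map. The function names `record_tweet` and `url_counts` and the sample IDs are hypothetical.

```python
from collections import defaultdict

# user ID -> URL (super column name) -> tweet ID (column name) -> empty value
user_urls = defaultdict(lambda: defaultdict(dict))

def record_tweet(user_id, tweet_id, urls):
    """File the tweet ID under every URL the tweet contains."""
    for url in urls:
        user_urls[user_id][url][tweet_id] = b""

def url_counts(user_id):
    """Per-URL statistics: how many tweets of this user mention each URL."""
    urls = user_urls.get(user_id, {})
    return {url: len(tweets) for url, tweets in urls.items()}

# Hypothetical sample data.
record_tweet("user-1", "tweet-1", ["http://example.org/a"])
record_tweet("user-1", "tweet-2", ["http://example.org/a", "http://example.org/b"])
counts = url_counts("user-1")
```

Both the URLs and the tweet IDs live in names here; every stored value is an empty byte array.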
Important note: currently Cassandra automatically supports indexes for column names but does not support indexes for the names of super columns. In our example it means that you cannot efficiently retrieve/update tweet IDs by URL.
[Update: The above note is incorrect! It is subcolumn names that are not indexed inside super columns. Super column names are always indexed. This is great news, as it enables the use case of data denormalization to speed up queries. For more on this, find the first comment by Jonathan Ellis below. I cover denormalization use cases in my next post.]
Let me know if I missed anything or something is unclear.
After reading the paper on Amazon Dynamo I was confused. It is clear that when you move from strictly consistent to eventually consistent storage every bad anomaly becomes possible: stale data reads, reappearance of deleted items, etc. Shifting the task of supporting consistency from the storage to the applications helps sometimes but does not fully eliminate the possibility of the anomalies. Business decisions made in the presence of such anomalies can lead to serious flaws such as overbooking airplane seats, clearing bounced checks, etc. It might sound completely unacceptable, but don’t jump to conclusions.
In the paper “Building on Quicksand” Pat Helland and Dave Campbell consider such anomalies and the subsequent business decisions as part of common business practice – every business should be ready to apologize. To justify the statement, the authors provide a number of striking parallels from the pre-computer world where the same anomalies can be found. The key point here is not the possibility of the anomalies but their probability. The authors believe that such anomalies cannot be avoided. But if you manage to build a system which guarantees their low probability, they become an acceptable business expense that is generously repaid by the following benefits:
- The system provides scalability and high availability. High availability might be critical for many online businesses. By apologizing in some very rare cases, a business avoids losing many potential customers to system outages.
- It might also reduce the cost of infrastructure as it allows for “building systems on quicksand” – on unreliable and inexpensive components such as flaky commodity computers, slow links, low quality data centers, etc.
Besides relying on the low probability of the anomalies, what else can be done to mitigate their effect on users? The main approach to keeping the user experience coherent in the presence of the anomalies is to expose more information to the user about what is going on in the system. For example, the process of ordering can be decomposed into multiple steps including Order Entry and Order Fulfillment. On Order Entry the system responds “Your order for the book has been accepted and will be processed” – the system reports a tentative operation which might not be fulfilled as a result of data inconsistency (more on this can also be found in “Principles for Inconsistency” by Dean Jacobs, etc.). Moreover, as any computer system is disconnected from the real world, there might be external causes that prevent order fulfillment, for example, the forklift runs over the only item you have in stock. So you cannot make any serious promises to your customers based on decisions made by computers anyway.
The ideas presented in “Building on Quicksand” are controversial, as there is no comprehensive study of whether it is possible to achieve the required probability. Nevertheless, it is an inspiring manifesto for researchers and fascinating reading for a broader audience, as it might change your understanding of how IT solutions should be aligned with the reality of business operations.
The current practice of dealing with streams is common to all applications. The user subscribes to sources (such as news or blog feeds) or followees (e.g. in Twitter, FriendFeed) and reads the stream made up of posts from the sources/followees. The problem with this approach is that there is always a compromise between the number of sources/followees that the user would like to read and the amount of information he/she is able to consume. I am sure that many of you know this problem. Even if you subscribe to, say, 30 blogs you cannot even look through all of the posts, especially if there are some “top” blogs that usually publish up to 20 posts a day. As a result you get hundreds of unread posts in just one week.
The solution to this problem is a personalized stream that contains only a moderate number of posts that are potentially interesting for the user. It sounds good, but stream personalization systems have not reached wide adoption. I think that the problem with these systems is not the poor algorithms they utilize but the wrong assumptions they are based on. Modern personalization systems assume that the user has already formed his interests. So the system tries to identify the user’s interests and then uses this information to bring relevant posts via content match or collaborative filtering.
The idea of building a system around the user’s interests seems wrong. First of all, I am sure that the majority of people don’t have any formed interests. People read streams to learn about significant news, to identify new trends or just to have fun. It explains why voting/commenting social news sites (e.g. Digg, Reddit), which rank news by absolute popularity without any personalization, are quite popular. Even if the user has some concrete interests, posts about these interests usually make up a small fraction of the user’s whole media consumption. For example, I am currently interested in semantic search because we are working on such a system. But news on semantic search is quite rare, and anyway I would not like to read only about semantic search every day. So focusing on the user’s interests is very limiting. Systems which utilize collaborative filtering try to go beyond the user’s immediate interests and recommend posts that are read by other people who have similar interests. But that is not useful either, as it often recommends very diverse topics and is more about researching what people interested in a particular topic also read than about what might be interesting for me.
Besides being limited, modern personalization systems fail to explain why they recommend this or that post to the user. This is because content match and collaborative filtering are based on statistical aggregation, the result of which is hard to explain. The user has to do non-trivial interpretation and maybe even additional research to understand what the recommended posts are about, while the system does not provide any evidence of why the user should make the effort.
So the user’s interests are fluid, diverse and hard to grasp. Trying to build something around them automatically is in vain. To become successful, personalization systems should rethink their fundamental assumption. I think that a personalization system should stop trying to capture the user’s interests and focus on what inspires them. The causes of the user’s interests are easy to list. People usually find something interesting and worth reading in two main cases. First, it is something that is very popular, widely discussed and leads to universal resonance – trending topics. Note that it might be absolutely unrelated to the user’s interests. For example, swine flu does not touch me at all, but I might want to know about it so as not to look like a fool among my friends who are well-informed about current global issues. Second, it is something that is inspired by people that the user knows and respects – what is hot in the user’s community. It is more likely to match the user’s immediate interests, but not necessarily, because the user would usually like to know about all important events in his/her community. This new approach to personalization is more social than the one based on the user’s interests, and social approaches have proved to be effective for many tasks. Another advantage of the approach is that it allows us to explain to the user why we recommend a post and why the user should spend his/her time reading it. We just need to say that many people have posted links to it (in the case of a global trending topic) or that one or more influential people from the user’s community have posted/liked it.
Several years ago we could not build a personalized user stream based on these considerations. Now, thanks to the recent development and high popularity of public micro-blogging systems, we can collect enough information to identify global and in-community trending topics. Public micro-blogging systems can be used as a ranking system to identify significant posts for personalized streams. Global trending topics are already identified by a number of services (e.g. tweetmeme.com) which analyze Twitter, etc. Let me describe how I see it could work for in-community trending topics. The user’s community is defined by a list of favorite sources that the user is subscribed to or regularly visits (e.g. the user’s favorite blogs, news sites and his/her friends’ feeds) and a list of followees on services like Twitter or FriendFeed. As I already mentioned above, many of us cannot even look through all the posts from these sources and followees but would like not to miss important news, ideas, etc. The user’s personalized stream should include significant posts from the favorite sources and the most posted/retweeted/favorited/liked links to articles/pages/posts among the user’s followees. The significant posts from the favorite sources can be identified as follows. Select those posts from a source whose global popularity (i.e. among all users of Twitter/FriendFeed, not only the user’s followees) is greater than the average for all posts of this source. This means selecting all posts from a source which cause some resonance and skipping the others. As for the most posted/etc. links among followees, they represent significant in-community trending topics. They can include links to posts from the favorite sources (for such links followees contribute to their global popularity, maybe with greater weight) or links to other sources that help to introduce the user to something new. You can find some technical details on how we identify significant posts in Twitter here.
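The two selection rules described above can be sketched as follows. It is an illustration under assumptions, not our production code: the input shapes, the function names `significant_posts` and `top_links`, and the use of a plain arithmetic mean and of distinct-followee counts are all hypothetical choices.

```python
from collections import Counter
from statistics import mean

def significant_posts(posts):
    """posts: list of (source, url, popularity) tuples.

    Keep the posts whose global popularity is above the average
    popularity of all posts from the same source.
    """
    by_source = {}
    for source, _url, popularity in posts:
        by_source.setdefault(source, []).append(popularity)
    average = {source: mean(values) for source, values in by_source.items()}
    return [url for source, url, popularity in posts
            if popularity > average[source]]

def top_links(links_by_followee, n):
    """Return the n links posted by the largest number of distinct followees."""
    counts = Counter()
    for links in links_by_followee.values():
        counts.update(set(links))  # each followee counts once per link
    return [url for url, _count in counts.most_common(n)]

# Hypothetical sample data.
posts = [
    ("blogA", "a1", 10), ("blogA", "a2", 30), ("blogA", "a3", 20),
    ("blogB", "b1", 5),
]
links_by_followee = {
    "f1": ["u1", "u2"],
    "f2": ["u1"],
    "f3": ["u1", "u2", "u3"],
}
selected = significant_posts(posts)
trending = top_links(links_by_followee, 2)
```

Note that with a strict "greater than the average" rule a source with a single post never contributes anything, so a real system would probably need a minimum history per source or a different baseline.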
- Show real-time updates from various social apps (e.g. social networks, email applications, RSS feeds, etc.). For example, RoamAbout has a toolbar at the bottom of the screen with which you can get custom-tailored updates from friends in your Facebook network, tweets from people you follow on Twitter, RSS feeds from your favorite blogs and news sites, and notifications when you get an email in your Gmail inbox. And you can get these non-intrusive updates while browsing any web page. It eliminates the need to constantly switch between tabs to check email, updates, and tweets.
- Submit information from the current Web page to social apps. For example, with Flock you can drag and drop any picture from the current web page to the Facebook app to share it. In RoamAbout it can be done by launching an application in the context of the web page you are on, or in the context of a word highlighted on the web page. For example, you can highlight a phrase on the Web page and then launch the Twitter app to post it on Twitter.