Max's Output

Cassandra for Enterprises


There is huge hype around relational databases (such as MySQL) versus NoSQL systems (such as Cassandra). After Digg's successful migration in September 2009 and Twitter's announcement in March 2010, the whole Web has gone insanely mad about Cassandra. "NoSQL vs. RDBMS: Let the flames begin!" is just one of hundreds of blog posts on the topic.

Given the number of successful migrations to Cassandra (Digg is not the only example) and the fact that Facebook has been using it for years, we can conclude that it just works. On the other hand, Cassandra has clearly not reached maturity yet (you can easily feel this by trying to install Cassandra and looking at its stop-server script). There is still not enough understanding of all the pros and cons of using it. So how widely it will be adopted for Web applications is still an open question – for more on this, read "I Can't Wait for NoSQL to Die".

It is interesting that enterprise software companies have been paying attention since the very beginning of the NoSQL movement. Of course, they are not talking about Cassandra yet, but they are definitely interested in the Amazon Dynamo principles on which Cassandra is based. For example, read "Principles for Inconsistency" from SAP. So we can expect the NoSQL hype to expand to enterprise software soon. As Amazon Dynamo uses MySQL or Berkeley DB as storage, we might foresee an enterprise version of Cassandra using Oracle or DB2 as storage. Is that where we are going now? LOL

Written by maxgrinev

March 28, 2010 at 11:08 am

Posted in Cassandra

Want scalable, highly available storage? Be ready to apologize to your customers


After reading the paper on Amazon Dynamo, I was confused. It is clear that when you move from strictly consistent to eventually consistent storage, every bad anomaly becomes possible: stale reads, reappearing deleted items, etc. Shifting the task of maintaining consistency from the storage to applications helps sometimes but does not fully eliminate the possibility of these anomalies. Business decisions made in the presence of such anomalies can lead to serious flaws such as overbooking airplane seats, clearing bounced checks, etc. This might sound completely unacceptable, but don't jump to conclusions.
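To make the stale-read anomaly concrete, here is a minimal Python sketch. The replica model and key names are illustrative only, not Dynamo's actual protocol: two replicas with asynchronous replication, where a read served by the lagging replica returns outdated data.

```python
# Minimal model of eventual consistency: two replicas, asynchronous
# replication. All names here are illustrative, not Dynamo's design.

class Replica:
    def __init__(self):
        self.data = {}

    def write(self, key, value):
        self.data[key] = value

    def read(self, key):
        return self.data.get(key)

def replicate(src, dst):
    # Replication is applied some time after the write, not atomically.
    dst.data.update(src.data)

a, b = Replica(), Replica()
a.write("seats_left", 1)
replicate(a, b)               # both replicas now agree

a.write("seats_left", 0)      # the last seat is sold via replica A
stale = b.read("seats_left")  # replica B has not caught up yet
assert stale == 1             # the stale read described above
```

A decision made on the stale value (selling the "remaining" seat again) is exactly how overbooking arises.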

In the paper "Building on Quicksand", Pat Helland and Dave Campbell treat such anomalies, and the business decisions made despite them, as part of common business practice – every business should be ready to apologize. To justify this claim, the authors provide a number of striking parallels from the pre-computer world where the same anomalies can be found. The key point is not the possibility of the anomalies but their probability. The authors believe such anomalies cannot be avoided; but if you manage to build a system that keeps their probability low, they become an acceptable business expense, generously repaid by the following benefits:

  1. The system provides scalability and high availability. High availability can be critical for many online businesses: by apologizing in a few very rare cases, the business does not lose potential customers to system outages.
  2. It may also reduce infrastructure cost, as it allows "building systems on quicksand" – on unreliable and inexpensive components such as flaky commodity computers, slow links, low-quality data centers, etc.

Besides relying on the low probability of the anomalies, what else can be done to mitigate their effect on users? The main approach to making the user experience coherent in the presence of anomalies is to expose more information to the user about what is going on in the system. For example, the ordering process can be decomposed into multiple steps, including Order Entry and Order Fulfillment. On Order Entry the system responds "Your order for the book has been accepted and will be processed" – it exposes a tentative operation that might not be fulfilled as a result of data inconsistency (more on this can be found in "Principles for Inconsistency" by Dean Jacobs et al.). Moreover, as any computer system is disconnected from the real world, there may be external causes that prevent order fulfillment – for example, a forklift runs over the only item you have in stock. So you cannot make serious promises to your customers based solely on decisions made by computers anyway.
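The decomposition into Order Entry and Order Fulfillment can be sketched as a tiny state machine. The function and state names below are hypothetical, chosen only to illustrate surfacing tentative status to the user:

```python
# Hypothetical sketch: decompose an order into Order Entry (tentative)
# and Order Fulfillment (which may fail and trigger an apology).

ACCEPTED, FULFILLED, APOLOGIZED = "accepted", "fulfilled", "apologized"

def enter_order(order_id, orders):
    # Order Entry only records a tentative operation; nothing is promised.
    orders[order_id] = ACCEPTED
    return "Your order has been accepted and will be processed"

def fulfill_order(order_id, orders, in_stock):
    # Fulfillment can fail due to inconsistency or real-world causes
    # (e.g. the forklift incident above); then the business apologizes.
    orders[order_id] = FULFILLED if in_stock else APOLOGIZED
    return orders[order_id]

orders = {}
message = enter_order(42, orders)                      # tentative: "accepted"
status = fulfill_order(42, orders, in_stock=False)     # ends as "apologized"
```

The point is the wording of `message`: the user is told the operation is tentative, so a later apology is coherent rather than a broken promise.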

The ideas presented in "Building on Quicksand" are controversial, as there is no comprehensive study of whether the required probability can be achieved. Nevertheless, it is an inspiring manifesto for researchers and fascinating reading for a broader audience, as it might change your understanding of how IT solutions should be aligned with the reality of business operations.

Written by maxgrinev

January 4, 2010 at 12:16 pm

Posted in Uncategorized

Extending The Twitter Tim.es with Source-Oriented Algorithms

There are several ideas for the next significant features to add to The Twitter Tim.es. In this post I summarize a number of my favorites, all derived from the following general idea: using Twitter as a voting system with respect to a source (e.g., a blog feed or a Twitter user timeline) or a bundle of sources (e.g., a Google Reader bundle or a Twitter list).

How it works

It works as follows. Links published by a source are ranked or filtered according to how many times they are posted on Twitter. For example, if the source is your favorite blog, which may be too prolific to read in full, such a system lets you identify the interesting posts you should not miss.
In the case of ranking, it works like The Twitter Times: links are simply ordered by their popularity and recency on Twitter. There are two reasonable options for computing popularity. One is to consider only tweets from your friends and friends-of-friends (fofs), as The Twitter Times does. The other is to consider all Twitter users (still weighting friends higher), because for some sources there may be few or no tweets from your friends.
In the case of filtering, the system identifies a subset of the source's links that should be interesting to you. This can be implemented by selecting the links whose popularity is above the average for the source. As with ranking, it is very likely that we will have to consider all Twitter users (not only friends and fofs) to implement filtering.
To conclude, I summarize the differences between the friends-oriented ranking currently supported in The Twitter Times and the source-oriented ranking/filtering described above:
1) The friends-oriented algorithm uses Twitter as a voting system over all links posted by the user's friends and fofs. The source-oriented algorithms consider only links coming from a single source.
2) The friends-oriented algorithm counts votes only from the user's friends and fofs. The source-oriented algorithms have the option to count votes from all Twitter users.
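The source-oriented ranking described above can be sketched as follows. The friend weight and recency half-life are assumed constants, not The Twitter Times' actual parameters:

```python
# Sketch of source-oriented ranking: score each link from a source by
# tweet counts (friends weighted higher), decayed by the link's age.
# FRIEND_WEIGHT and HALF_LIFE_H are illustrative assumptions.

FRIEND_WEIGHT = 3.0   # a friend's/fof's tweet counts more than a stranger's
HALF_LIFE_H = 24.0    # recency half-life, in hours

def score(friend_tweets, other_tweets, age_hours):
    popularity = FRIEND_WEIGHT * friend_tweets + other_tweets
    recency = 0.5 ** (age_hours / HALF_LIFE_H)  # exponential decay
    return popularity * recency

def rank(links):
    # links: list of (url, friend_tweets, other_tweets, age_hours)
    return sorted(links, key=lambda l: score(l[1], l[2], l[3]), reverse=True)

links = [
    ("http://blog.example/a", 0, 40, 48.0),  # popular, but two days old
    ("http://blog.example/b", 5, 10, 2.0),   # fresh and tweeted by friends
]
ranked = rank(links)  # "b" outranks "a": friends' votes plus recency win
```

Counting all Twitter users versus only friends/fofs is just a matter of which tweets feed the `friend_tweets` and `other_tweets` counters.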

Applications

There are at least two interesting applications of the source-oriented ranking/filtering algorithms.
First, they can be used to identify interesting posts from your favorite blog, bundle of blogs, or newspaper (for example, The New York Times or The Los Angeles Times). Each day (or hour) you can read a ranking of posts, or a list of selected posts from your favorite source, that are hot on Twitter.

Second, they can be used to build the thematic newspapers described by Maria Grineva in her post. You simply create a bundle of blogs (using, for example, Google Reader bundles) or a Twitter list that includes sources covering a common theme.

In both cases the results can be embedded in The Twitter Times interface as a sidebar or a tab for each source (bundle or Twitter list).

Written by maxgrinev

November 22, 2009 at 5:45 pm

The Twitter Times – a real-time personalized newspaper!


From the massive volume of daily news, the most interesting items are those actively discussed by the people you follow: your friends, and the respected/famous people and celebrities you admire. This is the most effective filter. We have built The Twitter Times – a newspaper constructed for you in real time from the news discussed in your Twitter community (i.e., the people you follow on Twitter and their followees). The Twitter Times gives you a new, effective way to keep up with news on a daily, or even hourly, basis.

Here are the main features of The Twitter Times:

  • Real-time – based on your real-time Twitter stream.
  • Personalized – we identify important news items posted by people in your Twitter community and rank those items by their recency and popularity among your followees. This is essentially different from existing services such as tweetmeme.com and digg.com, which are based on global popularity.
  • Media-rich – we extract news content so that you can read the text of news, watch videos and photos in your newspaper, all in one place.

What is also interesting about The Twitter Times:

  • The Twitter Times helps you extend your community and find like-minded people – for every news article we display who posted it, not only from the 1st but also from the 2nd circle of your friends (the people followed by your followees). Discover and follow more people you like.
  • It helps you become more engaged in your community – you can propagate news by retweeting (the green retweet button under every news article). Build up your authority in the community by posting more.
  • You can read other users' newspapers (e.g., those of famous people such as Esther Dyson – http://www.twittertim.es/edyson) and construct your own.
Currently The Twitter Times is in private alpha. There are a few newspapers running for selected people. You can try some of them: Richard MacManus (http://www.twittertim.es/rww), Esther Dyson (http://www.twittertim.es/edyson), and Robert Scoble (http://www.twittertim.es/notsecretscoble).
You can leave us your Twitter user name here, and we will build a newspaper for you and notify you via a Twitter @-reply when it is ready.
We plan to launch The Twitter Times publicly soon, and any Twitter user will be able to register and get their newspaper immediately.
Acknowledgement: We would like to thank Filip Dousek (@fdousek) for his valuable contributions to the idea of this product.

Written by maxgrinev

July 31, 2009 at 1:58 pm

Towards Stream Personalization Powered by Twitter


Streams have now become the main source of information for many of us. Streams are the central concept in the most momentous applications, such as Twitter, FriendFeed, and Facebook. News and blogs are also consumed as streams. Streams have become so important that they are even replacing search engines as the starting point of Web browsing – a typical Web session now consists of reading Twitter and Google Reader streams and following links found there, instead of starting with a Web search. Streams are amazing because they keep us involved and connected, but they quickly cause information overload. You know there should be something interesting, but you cannot read it all to find it. In this post I consider how modern personalization systems address the information overload problem and how it can be done in a new, more social way by harnessing public micro-blogging streams such as Twitter and FriendFeed.

The current practice of dealing with streams is common to all applications. The user subscribes to sources (such as news or blog feeds) or followees (e.g., on Twitter or FriendFeed) and reads the stream made up of posts from those sources/followees. The problem with this approach is that there is always a tension between the number of sources/followees the user would like to read and the amount of information he or she is able to consume. I am sure many of you know this problem. Even if you subscribe to, say, 30 blogs, you cannot even look through all of the posts, especially if some "top" blogs issue up to 20 posts a day. As a result, you accumulate hundreds of unread posts in just one week.

The solution to this problem is a personalized stream that contains only a moderate number of posts that are potentially interesting to the user. It sounds good, but stream personalization systems have not reached wide adoption. I think the problem with these systems is not the poor algorithms they use but the wrong assumptions they are built on. Modern personalization systems assume that the user has already formed his or her interests. The system tries to identify the user's interests and then uses this information to surface relevant posts via content matching or collaborative filtering. The idea of building a system around the user's interests seems wrong. First of all, I am sure that the majority of people do not have firmly formed interests. People read streams to learn significant news, to identify new trends, or just to have fun. This explains why voting/commenting social news sites (e.g., Digg, Reddit), which rank news by absolute popularity without any personalization, are quite popular. Even if the user has some concrete interests, posts about those interests usually make up a small fraction of the user's overall media consumption. For example, I am currently interested in semantic search because we are working on such a system. But news on semantic search is quite rare, and in any case I would not want to read only semantic-search material every day. So focusing on the user's interests is very limiting. Systems that use collaborative filtering try to go beyond the user's immediate interests and recommend posts read by other people with similar interests. But this is not very useful either: it often recommends very diverse topics, and it is more a study of what people interested in a particular topic also read than of what might be interesting to me. Besides being limited, modern personalization systems fail to explain why they recommend this or that post to the user. Because content matching and collaborative filtering are based on statistical aggregation, their results are hard to explain. The user has to do non-trivial interpretation, and maybe even additional research, to understand what the recommended posts are about, while the system provides no evidence for why the user should make the effort.

So users' interests are fluid, diverse, and hard to grasp. Trying to build something around them automatically is futile. To become successful, personalization systems should rethink their fundamental assumption: stop trying to capture users' interests and focus on what inspires them. The causes of users' interest are easy to list. People usually find something interesting and worth reading in two main cases. First, something that is very popular, widely discussed, and leads to universal resonance – trending topics. Note that it might be completely unrelated to the user's interests. For example, swine flu does not touch me at all, but I might want to know that it is happening so as not to look like a fool among my friends, who are well informed about current global issues. Second, something inspired by people the user knows and respects – what is hot in the user's community. This is more likely to match the user's immediate interests, but not necessarily, because the user would usually like to know about all important events in his or her community. This new approach to personalization is more social than one based on the user's interests, and social approaches have proved effective for many tasks. Another advantage of the approach is that it makes it possible to explain to users why we recommend a post and why they should spend time reading it: we need only say that many people have posted links to it (for a global trending topic), or that one or more influential people from the user's community have posted or liked it.

Several years ago we could not build a personalized user stream based on these considerations. Now, thanks to the recent development and high popularity of public micro-blogging systems, we can collect enough information to identify global and in-community trending topics. Public micro-blogging systems can be used as a ranking system to identify significant posts for personalized streams. Global trending topics are already identified by a number of services (e.g., tweetmeme.com) that analyze Twitter. Let me describe how I see it working for in-community trending topics. The user's community is defined by a list of favorite sources that the user subscribes to or regularly visits (e.g., the user's favorite blogs, news sites, and friends' feeds) and a list of followees on services like Twitter or FriendFeed. As I mentioned above, many of us cannot even look through all the posts from these sources and followees but would like not to miss important news, ideas, etc. The user's personalized stream should include significant posts from the favorite sources and the most posted/retweeted/favorited/liked links to articles/pages/posts among the user's followees. The significant posts from the favorite sources can be identified as follows: select those posts from a source whose global popularity (i.e., among all users of Twitter/FriendFeed, not only the user's followees) is greater than the average for all posts from this source. This selects the posts that cause some resonance and skips the others. As for the most-posted links among followees, they represent significant in-community trending topics. They can include links to posts from the favorite sources (for such links, followees contribute to their global popularity, perhaps with greater weight) or links to other sources that introduce the user to something new. You can find some technical details on how we identify significant posts on Twitter here.
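The above-average selection rule for significant posts can be sketched in a few lines; the tweet counts below are illustrative:

```python
# "Significant post" selection: keep posts whose global tweet count exceeds
# the average over all posts from the same source.

def significant_posts(tweet_counts):
    # tweet_counts: {post_url: number of tweets linking to that post}
    if not tweet_counts:
        return []
    average = sum(tweet_counts.values()) / len(tweet_counts)
    return [url for url, n in tweet_counts.items() if n > average]

counts = {"post1": 2, "post2": 30, "post3": 4, "post4": 1}
# the average is 9.25, so only post2 causes any "resonance"
assert significant_posts(counts) == ["post2"]
```

Using the source's own average as the threshold adapts automatically to each source's baseline popularity, so a niche blog and a major newspaper can be filtered with the same rule.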

Written by maxgrinev

May 17, 2009 at 9:26 am

Posted in Uncategorized

Social Browsers: Only Half Way There


The concept of a Social Browser sounds very appealing. Indeed, if you are infatuated with social applications, you would be interested in a browser that integrates your social environment into the browsing experience. However, experimenting with existing social browsers, I have found that they lack essential features they must have to be called social.
I have tried Flock, Cruz, and RoamAbout 2.0. The latter is a plug-in for Firefox or Internet Explorer, while the others are stand-alone browsers. The features provided by these browsers fall roughly into two main categories:
  1. Showing real-time updates from various social apps (e.g., social networks, email applications, RSS feeds). For example, RoamAbout has a toolbar at the bottom of the screen with which you can get custom-tailored updates from friends in your Facebook network, tweets from people you follow on Twitter, RSS feeds from your favorite blogs and news sites, and notifications when you get an email in your Gmail inbox. And you can get these non-intrusive updates while browsing any web page, eliminating the need to constantly switch tabs to check email, updates, and tweets.
  2. Submitting information from the current Web page to social apps. For example, with Flock you can drag and drop any picture from the current web page into the Facebook app to share it. In RoamAbout this can be done by launching an application in the context of the web page you are on, or in the context of a word highlighted on the page. For example, you can highlight a phrase on the page and then launch the Twitter app to post it.
At first sight this looks like a good (or even complete) set of features: you get information from your community via updates, and you can easily share what you find on the Web by submitting it to your favorite social apps. But it is not really enough. Remember, we are talking about browsing, yet the updates are unrelated to your browsing activity – they have nothing to do with the page you are currently on. With respect to the current page, you can interact with your friends in only one direction: sharing what you found on that page. Wouldn't you also like to know what your friends have posted about that page? Here we come to the main idea of this post.
To reveal the full potential of social browsing, we need two-way interaction with our community. I believe that in addition to sharing what you found, you would also appreciate knowing what your friends (or, more broadly, your followees) think about that particular article or web page. In other words, we need to collect everything our friends (followees) have shared that relates to the current page, from all the social apps they use. This functionality opens up a new way to explore information on the Web: the pages you browse should be augmented and connected according to how your community sees them. For example, when you read a blog post, you would see your friends' comments on that post, whether they come from Twitter, Facebook, or FriendFeed. That is an example of augmentation. An example of connection is showing links to related articles derived from your social network (via collaborative filtering or content analysis) – links to other articles your friends have read that relate to the post. Web pages augmented and connected in this way form a new Web, personalized by the people in your community. Browsing this new Web should be much more fun and full of interesting discoveries.

Written by maxgrinev

April 28, 2009 at 9:40 pm

Posted in Uncategorized

The Long Tail


I decided to describe the concept that made the greatest impression on me in 2006: the Long Tail phenomenon in the context of Internet business models.

In general, the long tail phenomenon can be described as follows: low-demand items (or minor events), taken together, are in greater demand (or produce a greater effect) than popular items (or major events). To understand where the term comes from, look at a graph of the distribution of demand for items, ordered by decreasing demand: the low-demand items form a long tail, and the integral over that tail can be greater than the integral over the popular items. In business terms this means, for example, that non-bestsellers collectively outsell bestsellers. Thus, if you manage to aggregate a sufficiently large number of low-demand goods (that is, build a long tail out of them), it can produce a substantial economic effect. In the context of business models, the long tail phenomenon was first examined by Chris Anderson in a 2004 article in Wired. A few months ago he also published a book devoted to analyzing this phenomenon.
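The claim that the tail's integral can exceed the head's is easy to check numerically. The sketch below assumes an illustrative Zipf-like demand curve and arbitrary catalogue sizes:

```python
# Numeric check of the long-tail claim under an assumed Zipf-like demand
# curve: demand for the item at a given rank falls off as 1/rank.

def demand(rank):
    return 1.0 / rank  # illustrative assumption: Zipf with exponent 1

head = sum(demand(r) for r in range(1, 101))          # top 100 "bestsellers"
tail = sum(demand(r) for r in range(101, 1_000_001))  # the other 999,900 items

# head ~ ln(100) ~ 5.2, while head + tail ~ ln(1e6) ~ 14.4,
# so the tail alone accounts for roughly 9.2 of the total demand.
assert tail > head
```

Whether the tail actually dominates depends on how fast demand decays and how many items the catalogue can hold, which is exactly why removing physical shelf limits matters.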

Building a long tail has become possible mainly thanks to the wide spread of the Internet, since the Internet removes many physical constraints. For example, a physical bookstore can sell only bestsellers because of objective physical limits on shelf space, while an online store has no such limits and can offer an arbitrarily wide assortment. The most frequently discussed example of an (economically) successful long tail is the online store Amazon: it is known that about half of the company's revenue comes from long-tail products. Another landmark example is Google's advertising programs (such as AdWords and AdSense). These programs allow a large number of small advertisers (for whom advertising in traditional mass media such as television, newspapers, and magazines was practically inaccessible) to run their advertising campaigns on the Internet. Again, traditional mass media have physical limits on airtime or page count that are far less significant on the Internet. Moreover, the cost of delivering information to the end consumer is substantially lower for Internet companies than for traditional mass media.

The ability to build a long tail via the Internet is a new opportunity that does not yet seem exhausted and could be realized in many more domains.

Written by maxgrinev

January 9, 2007 at 12:18 pm
