Thoughts on Performance Optimization at Scale

I came across Alex Levenson's presentation, and it made me feel that we might be living in a fool's paradise. The challenges of Big Data are not that simple to address. Every scenario is different, which makes it more difficult for developers to provide a generic solution for Big Data challenges. The presentation is worth listening to and you might… Read more →

How Big Data Can Transform Healthcare

Big Data is mostly thought of as a way to evolve businesses or create smart cities; however, Big Data can also be used in the healthcare sector. McKinsey & Company identified four main sources of Big Data in the healthcare industry: activity (claims) and cost data, clinical data, pharmaceutical R&D data, and patient behavior and sentiment data. I recently came across a very informative article which talks… Read more →

What is Wrong With All Machine Learning Models

John Langford, a machine learning research scientist who works at Microsoft and is the author of the weblog, has recently published a brilliant article about flaws in machine learning models. Currently the link to his original article is down, but you can find his article below. John's Article (Taken from here) Attempts to abstract and study machine learning are within some given… Read more →

Machine Learning is a new form of statistics

Statistics and machine learning are thought to be two separate fields. But if you read good articles from highly reputed machine learning journals, you will realize that these two fields are merging. Not too long ago, a new field, “statistical machine learning”, made it clear that these two fields have a great deal in common. Coming from a computer science background, I… Read more →

Data science related top 20 short tutorials (must read)

I have finished reading 20 short tutorials suggested by datasciencecentral. They are amazing; I particularly liked the clustering and bigdata related articles. Following is the complete list; go ahead and let me know which article is your favourite. Tutorial: How to detect spurious correlations, and how to find the …; Practical illustration of Map-Reduce (Hadoop-style), on real data; Jackknife logistic and linear regression for… Read more →

Basics of Bigdata

Bigdata is often misunderstood as simply very large data; however, size is just one aspect of bigdata. The term Bigdata refers to data that is too complex for traditional approaches to handle. Bigdata has the following characteristics: Volume – Large amount of the data. Velocity – Rapid generation of the data. Variability – Inconsistency of the data. Veracity – Quality of… Read more →

Weka or LingPipe for New Data Scientist

I started working with Weka and LingPipe around two years ago. My task was to develop a better clustering algorithm for text data. I initially used Weka to familiarize myself with basic clustering algorithms; however, I found that Weka has more documentation for classification algorithms than for clustering algorithms. I came across the LingPipe framework on the internet and found that their blog provides… Read more →

Clustering Bigdata

Clustering large amounts of data brings complexity and requires special clustering algorithms. Common clustering algorithms like k-means are not designed to handle such tasks. Anil K. Jain, a big name in the domain of clustering algorithms, explains this phenomenon in his video lecture. He provides a solution, the “approximate k-means algorithm”, which clusters large amounts of data (bigdata). Other researchers like Xiao Cai et.… Read more →