Python is a popular choice for beginners, yet it is powerful enough to back some of the world’s most popular products and applications. Its design makes the programming experience feel almost as natural as writing in English. The Python basics and Python Debugger cheat sheets for beginners cover the essential syntax to get started. Community-provided libraries such as NumPy, SciPy, scikit-learn and pandas are widely relied on, and the NumPy/SciPy/Pandas Cheat Sheet provides a quick refresher on them.

- Python 2.7 Quick Reference Sheet
- Python Cheat Sheet by DaveChild
- Python Basics Reference sheet
- Python Debugger Cheatsheet
- NumPy / SciPy / Pandas Cheat Sheet
- Python OverAPI cheatsheet
- Python Decorators cheatsheet
- Python 2.4 Quick Reference Card
- Python 3 Cheat Sheet
- Python Language & Syntax Cheat Sheet
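As a taste of what the NumPy/SciPy/Pandas cheat sheets cover, here is a minimal sketch of vectorized array math and a grouped DataFrame aggregate; the data and column names are made up for illustration:

```python
import numpy as np
import pandas as pd

# Vectorized arithmetic on a NumPy array (no explicit loops)
scores = np.array([70, 85, 90, 60])
normalized = (scores - scores.mean()) / scores.std()

# A small pandas DataFrame with a grouped aggregate
df = pd.DataFrame({
    "language": ["Python", "R", "Python", "R"],
    "downloads": [120, 80, 150, 95],
})
totals = df.groupby("language")["downloads"].sum()
print(totals["Python"])  # 270
```

This is exactly the kind of two-line idiom (vectorize, then group-and-aggregate) the cheat sheets help you recall quickly.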

**Cheat sheets for R:**

R’s ecosystem has been expanding so much that a lot of referencing is needed. The R Reference Card covers most of the R world in a few pages. RStudio has also published a series of cheat sheets for the R community. The data visualization with ggplot2 cheat sheet seems to be a favorite, as it helps when you are creating graphs of your results.

- R cheat sheet (Google Drive)
- R functions for Regression Analysis
- R Reference Card
- R functions for Time series Analysis
- R Reference Card for Data Mining
- R Cheat Sheet
- Data Analysis the data.table way
- Interactive Web Apps cheatsheet by RStudio
- Data Visualisation with ggplot2 cheatsheet by RStudio
- Package Development with devtools cheatsheet by RStudio
- Data Wrangling cheatsheet
- R markdown cheatsheet
- R Markdown Reference guide
- R Data Management cheatsheet
- R Cheatsheet for graphical parameters

**Cheat sheets for MySQL & SQL:**

For a data scientist, the basics of SQL are as important as those of any other language. Both Pig and Hive Query Language are closely related to SQL, the original Structured Query Language. SQL cheat sheets provide a five-minute quick guide to learning it, after which you can explore Hive and MySQL!

- MySQL Cheatsheet by Dave Child
- SQL Cheat sheet
- SQL in one page
- MySQL Reference guide
- Visual SQL Joins
- SQL for dummies
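To make the SQL basics concrete, here is a small sketch using Python’s built-in sqlite3 module; the tables and data are hypothetical, and the query shows the kind of JOIN plus aggregate the cheat sheets cover:

```python
import sqlite3

# In-memory database with two toy tables
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, name TEXT)")
conn.execute("CREATE TABLE orders (user_id INTEGER, amount REAL)")
conn.executemany("INSERT INTO users VALUES (?, ?)", [(1, "ada"), (2, "bob")])
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [(1, 10.0), (1, 5.0), (2, 7.5)])

# INNER JOIN with an aggregate: total order amount per user
rows = conn.execute("""
    SELECT u.name, SUM(o.amount)
    FROM users u
    JOIN orders o ON o.user_id = u.id
    GROUP BY u.name
    ORDER BY u.name
""").fetchall()
print(rows)  # [('ada', 15.0), ('bob', 7.5)]
```

The same JOIN/GROUP BY pattern carries over almost unchanged to MySQL and HiveQL.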

**Cheat sheets for Spark:**

Apache Spark is an engine for large-scale data processing. For certain applications, such as iterative machine learning, Spark can be up to 100x faster than Hadoop using MapReduce. The Essentials of Apache Spark cheat sheet explains Spark’s place in the big data ecosystem, walks through the setup and creation of a basic Spark application, and covers commonly used actions and operations.

- Apache Spark Refcard: https://dzone.com/refcardz/apache-spark
- Scala cheatsheets 1
- Scala cheatsheets 2
- Scala from DZone Reference Card
- Spark cheatsheet on GitHub
- Scala on Spark Cheatsheet
- Essential Apache Spark cheatsheet by MapR
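Spark’s core API is built around transformations such as flatMap and reduceByKey. As a rough, framework-free sketch (no Spark installation required), the classic word-count pattern those cheat sheets open with can be mimicked in plain Python:

```python
from collections import Counter
from itertools import chain

lines = ["big data big ideas", "data beats opinions"]

# "flatMap": split each line into words, flattening into one stream
words = chain.from_iterable(line.split() for line in lines)

# "reduceByKey": count occurrences of each word
counts = Counter(words)
print(counts["data"])  # 2
```

In actual PySpark the same steps run distributed across a cluster, which is where the 100x speedups for iterative workloads come from.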

**Cheat sheets for Hadoop & Hive:**

Hadoop emerged as an untraditional tool to solve what was thought to be unsolvable, by providing an open-source software framework for the parallel processing of massive amounts of data. Explore the Hadoop cheat sheets to find useful commands for working with Hadoop on the command line. A combination of SQL and Hive functions is another one to check out.

- Hadoop for Dummies cheatsheet
- Getting Started Apache Hadoop Reference Card
- Hadoop Command Line cheatsheet
- Working with HDFS from the command line – Hadoop Cheat sheet
- Hive Function cheatsheet
- SQL to Hive cheatsheet

**Cheat sheets for Machine learning:**

We often find ourselves spending time wondering which algorithm is best, and then going back to our big reference books! These cheat sheets give you an idea about both the nature of your data and the problem you’re working to address, and then suggest an algorithm for you to try.

- Choosing the right estimator Machine Learning cheatsheet
- Patterns for Predictive learning cheatsheet
- Machine learning algorithm cheat sheet for Microsoft Azure
- Machine Learning cheatsheet GitHub 1
- Machine Learning cheatsheet GitHub 2
- Machine Learning: which algorithm performs best?
- Cheat sheet: 10 machine learning algorithms R commands
- Patterns for Predictive Analytics

**Cheat sheets for Django:**

Django is a free and open-source web application framework written in Python. If you are new to Django, you can go over these cheat sheets to brainstorm quick concepts, then dive into each one at a deeper level.

- Django cheat sheet v.1
- Django cheatsheet 1
- Django cheatsheet 2
- Django cheatsheet 3
- Django cheatsheet 4
- Django Reference Cheatsheet
- Django Quick start guide & Cheatsheet
- Flask Cheatsheet

Originally posted by Bhavya Geethika


You can view the code at https://github.com/awahid101/mist and let me know your thoughts.

Please note that the code is not at its final stage. I am trying to find a suitable time to comment the code and improve its quality.


I came across Alex Levenson’s presentation, and it made me feel that we might be living in a fool’s paradise. The challenges of Big Data are not that simple to address. Every scenario is different, which makes it difficult for developers to provide a generic solution to Big Data challenges.

The presentation is worth listening to, and you might find it useful to learn from Alex’s experience. Following is the link to the presentation.


I recently came across a very informative article which talks about how these four sources of Big Data can be used in healthcare. Following is the link to the article; I hope you will find it useful.

http://www.datasciencecentral.com/profiles/blogs/4-ways-big-data-is-transforming-healthcare

Apart from the Data Architect skill, opportunities for the rest of the jobs mentioned in the survey number around 1,500, whereas opportunities for Big Data and Data Science are far higher. One thing is for sure: Big Data and Data Science skills are much more in demand than any other highly paid technical skills. If you already have those skills, consider yourself lucky; if you don’t, my advice would be to start learning about Big Data and Data Science today. It is worth spending your time on, and it will definitely pay off in the long run.

The complete list of the top 30 highly paid skills can be found Here

**John’s article (taken from here)**

Attempts to abstract and study machine learning are within some given framework or mathematical model. It turns out that all of these models are significantly flawed for the purpose of studying machine learning. I’ve created a table (below) outlining the major flaws in some common models of machine learning.

The point here is not simply “woe unto us”. There are several implications which seem important.

- The multitude of models is a point of continuing confusion. It is common for people to learn about machine learning within one framework, which often becomes their “home framework” through which they attempt to filter all machine learning. (Have you met people who can only think in terms of kernels? Only via Bayes’ law? Only via PAC learning?) Explicitly understanding the existence of these other frameworks can help resolve the confusion. This is particularly important when reviewing, and particularly important for students.
- Algorithms which conform to multiple approaches can have substantial value. “I don’t really understand it yet, because I only understand it one way”. Reinterpretation alone is not the goal – we want algorithmic guidance.
- We need to remain constantly open to new mathematical models of machine learning. It’s common to forget the flaws of the model that you are most familiar with in evaluating other models while the flaws of new models get exaggerated. The best way to avoid this is simply education.
- The value of theory alone is more limited than many theoreticians may be aware. Theories need to be tested to see if they correctly predict the underlying phenomena.

Here is a summary of what is wrong with various frameworks for learning. To avoid being entirely negative, I have added a column about what’s right as well.

**Bayesian Learning**

**Methodology:** You specify a prior probability distribution over data-makers, *P(datamaker)*, then use Bayes’ law to find a posterior *P(datamaker|x)*. True Bayesians integrate over the posterior to make predictions, while many simply use the world with the largest posterior directly.

**What is wrong:**

- Information theoretically problematic. Explicitly specifying a reasonable prior is often hard.
- Computationally difficult problems are commonly encountered.
- Human intensive. Partly due to the difficulties above, and partly because “first specify a prior” is built into the framework, this approach is not very automatable.
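As a minimal illustration of the methodology (with a hand-picked prior and likelihoods, both made up for this sketch), here is Bayes’ law applied to a two-hypothesis problem:

```python
# Hypothetical two-hypothesis example: is a coin fair or biased?
prior = {"fair": 0.5, "biased": 0.5}       # P(datamaker), specified by hand
likelihood = {"fair": 0.5, "biased": 0.9}  # P(heads | datamaker)

# Observe one head; apply Bayes' law: P(h|x) is proportional to P(x|h) P(h)
unnormalized = {h: likelihood[h] * prior[h] for h in prior}
evidence = sum(unnormalized.values())
posterior = {h: p / evidence for h, p in unnormalized.items()}
print(round(posterior["biased"], 3))  # 0.643
```

Note that even this toy case required us to invent the prior, which is exactly the “human intensive” flaw listed above.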

**Graphical/generative Models**

**Methodology:** Sometimes Bayesian and sometimes not. Data-makers are typically assumed to be IID samples of fixed or varying length data. Data-makers are represented graphically with conditional independencies encoded in the graph. For some graphs, fast algorithms for making (or approximately making) predictions exist.

**What is wrong:**

- Often (still) fails to fix problems with the Bayesian approach.
- In real world applications, true conditional independence is rare, and results degrade rapidly with systematic misspecification of conditional independence.

**Convex Loss Optimization**

**Methodology:** Specify a loss function, related to the world-imposed loss function, which is convex on some parametric predictive system. Optimize the parametric predictive system to find the global optimum.

**What is wrong:**

- The temptation to forget that the world imposes nonconvex loss functions is sometimes overwhelming, and the mismatch is always dangerous.
- Limited models. Although switching to a convex loss means that some optimizations become convex, optimization on representations which aren’t single layer linear combinations is often difficult.

**Gradient Descent**

**Methodology:** Specify an architecture with free parameters and use gradient descent with respect to data to tune the parameters.

**What is wrong:**

- Finicky. There are issues with parameter initialization, step size, and representation. It helps a great deal to have accumulated experience using this sort of system, and there is little theoretical guidance.
- Overfitting is a significant issue.
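The methodology above can be sketched in a few lines. Here is hand-rolled gradient descent on a simple quadratic; the objective, initialization, and step size are illustrative choices, and the last two are precisely the “finicky” knobs:

```python
def grad(w):
    # Gradient of f(w) = (w - 3)^2, which is minimized at w = 3
    return 2 * (w - 3)

w = 0.0       # parameter initialization (one of the finicky choices)
step = 0.1    # step size: too large diverges, too small converges slowly
for _ in range(100):
    w -= step * grad(w)
print(round(w, 4))  # 3.0
```

Each update contracts the error by a constant factor here; on nonconvex real-world objectives no such guarantee exists, which is why experience matters so much.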

**Kernel-based learning**

**Methodology:** You choose a kernel *K(x,x’)* between datapoints that satisfies certain conditions, and then use it as a measure of similarity when learning.

**What is wrong:** Specification of the kernel is not easy for some applications (this is another example of prior elicitation). *O(n²)* computation is not efficient enough when there is much data.
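For concreteness, a common kernel satisfying the required conditions is the RBF (Gaussian) kernel; a small sketch, with the bandwidth parameter chosen arbitrarily:

```python
import math

def rbf_kernel(x, y, gamma=0.5):
    # K(x, x') = exp(-gamma * ||x - x'||^2): equals 1 for identical points
    # and decays toward 0 as the points move apart
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)

print(rbf_kernel([0, 0], [0, 0]))            # 1.0
print(round(rbf_kernel([0, 0], [1, 1]), 4))  # 0.3679
```

Picking gamma is the prior-elicitation problem in miniature, and evaluating K on all pairs of n points is where the O(n²) cost comes from.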

**Boosting**

**Methodology:** You create a learning algorithm that may be imperfect but which has some predictive edge, then apply it repeatedly in various ways to make a final predictor.

**What is wrong:** The boosting framework tells you nothing about how to build that initial algorithm. The weak learning assumption becomes violated at some point in the iterative process.

**Online Learning with Experts**

**Methodology:** You make many base predictors and then a master algorithm automatically switches between the use of these predictors so as to minimize regret.

**What is wrong:** Computational intractability can be a problem. This approach lives and dies on the effectiveness of the experts, but it provides little or no guidance in their construction.
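A classic master algorithm of this kind is weighted majority; here is a rough sketch with two hypothetical experts on a binary prediction task, halving the weight of any expert that errs:

```python
def weighted_majority(expert_preds, outcomes, beta=0.5):
    # expert_preds[i][t] is expert i's 0/1 prediction at round t
    weights = [1.0] * len(expert_preds)
    mistakes = 0
    for t, outcome in enumerate(outcomes):
        # Master predicts by weighted vote over the experts
        vote = sum(w for w, e in zip(weights, expert_preds) if e[t] == 1)
        prediction = 1 if vote >= sum(weights) / 2 else 0
        mistakes += prediction != outcome
        # Multiplicatively penalize experts that were wrong
        weights = [w * (beta if e[t] != outcome else 1.0)
                   for w, e in zip(weights, expert_preds)]
    return mistakes

# Expert 0 is always right, expert 1 always wrong, on this toy sequence
outcomes = [1, 0, 1, 1, 0, 1]
experts = [outcomes, [1 - o for o in outcomes]]
print(weighted_majority(experts, outcomes))  # 0
```

The regret bound depends entirely on the best expert being good, which is the “lives and dies on the effectiveness of the experts” point above.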

**Learning Reductions**

**Methodology:** You solve complex machine learning problems by reducing them to well-studied base problems in a robust manner.

**What is wrong:** The existence of an algorithm satisfying reduction guarantees is not sufficient to guarantee success. Reductions tell you little or nothing about the design of the base learning algorithm.

**PAC Learning**

**Methodology:** You assume that samples are drawn IID from an unknown distribution D. You think of learning as finding a near-best hypothesis amongst a given set of hypotheses in a computationally tractable manner.

**What is right:** The focus on computation is pretty right-headed, because we are ultimately limited by what we can compute.

**What is wrong:** There are not many substantial positive results, particularly when D is noisy. Data isn’t IID in practice anyways.

**Statistical Learning Theory**

**Methodology:** You assume that samples are drawn IID from an unknown distribution D. You think of learning as figuring out the number of samples required to distinguish a near-best hypothesis from a set of hypotheses.

**What is wrong:** The data is not IID. Ignorance of computational difficulties often results in difficulty of application. More importantly, the bounds are often loose (sometimes to the point of vacuousness).

**Decision tree learning**

**Methodology:** Learning is a process of cutting up the input space and assigning predictions to pieces of the space.

**What is wrong:** There are learning problems which cannot be solved by decision trees, but which are solvable. It’s common to find that other approaches give you a bit more performance. A theoretical grounding for many choices in these algorithms is lacking.
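The “cutting up the input space” idea can be shown with a single hypothetical split (a depth-1 tree, often called a decision stump); the threshold and labels here are made up:

```python
def stump_predict(x, threshold=2.5, left_label=0, right_label=1):
    # A depth-1 decision tree: one cut of the input space at `threshold`,
    # with a constant prediction assigned to each side
    return left_label if x < threshold else right_label

# Toy 1-D data that a single split separates perfectly
data = [(1.0, 0), (2.0, 0), (3.0, 1), (4.0, 1)]
accuracy = sum(stump_predict(x) == y for x, y in data) / len(data)
print(accuracy)  # 1.0
```

Full tree learners recurse this split on each side; the flaw above is that problems whose structure does not align with axis-parallel cuts (XOR-like labelings, for instance) need very deep trees or cannot be captured well at all.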

**Algorithmic complexity**

**Methodology:** Learning is about finding a program which correctly predicts the outputs given the inputs.

**What is wrong:** The theory literally suggests solving halting problems to solve machine learning.

**RL, MDP learning**

**Methodology:** Learning is about finding and acting according to a near-optimal policy in an unknown Markov Decision Process.

**What is wrong:** Has anyone counted the number of states in real world problems? We can’t afford to wait that long. Discretizing the states creates a POMDP (see below). In the real world, we often have to deal with a POMDP anyways.

**RL, POMDP learning**

**Methodology:** Learning is about finding and acting according to a near-optimal policy in a Partially Observed Markov Decision Process.

**What is wrong:** All known algorithms scale badly with the number of hidden states.

This set is incomplete of course, but it forms a starting point for understanding what’s out there. (Please fill in the what/pro/con of anything I missed.)

Coming from a computer science background, I can sense that statistics will dominate future algorithms.

What is your opinion about the future of machine learning? Do you think it will find its own direction, or will it follow statistics?


Big Data is commonly characterized by five Vs:

- Volume – the large amount of data.
- Velocity – the rapid generation of data.
- Variability – the inconsistency of the data.
- Veracity – the quality of the data.
- Variety – the various forms of the data.

I would also like to point out that rich data with multiple views or representations should also be considered a characteristic of Big Data. The next step for you would be to have a look at the Wikipedia article about Big Data and explore more information.
