Microsoft’s R tools bring data science to the masses
One of Microsoft’s more interesting recent acquisitions was Revolution Analytics, a company that built tools for tackling big data problems with the open source statistical programming language R. Mixing an open source model with commercial products, Revolution Analytics offered a range of tools supporting academic and personal use, alongside software designed to handle massive amounts of data, including data stored in Hadoop. Under Microsoft’s stewardship, the now-renamed R Server has become a bridge between on-premises and cloud data.
Two years on, Microsoft has announced a set of major updates to its R tools. The R programming language has become an important part of its data strategy, with support in Azure and SQL Server and, more important, in its Azure Machine Learning service, where it can be used to preprocess data before delivering it to a machine learning pipeline. R Server is also one of Microsoft’s key cross-platform server products, with versions for both Red Hat Linux and SUSE Linux.
R is everywhere in Microsoft’s ecosystem
Outside of Microsoft, open source R has become a key tool for data science, with strong support in academic environments. (It currently ranks fifth among all programming languages, according to the IEEE.) You don’t need to be a statistics expert to get started with R, because the Comprehensive R Archive Network (CRAN, a public repository of R packages) now offers more than 9,000 packages of statistical modules and algorithms you can use with your data.
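As a quick illustration of how low that barrier is, here’s a minimal sketch of pulling one of those CRAN packages and applying it to a built-in dataset; the forecast package is just one example among the thousands available.

# Minimal sketch: install a statistical package from CRAN and apply it to
# one of R's built-in datasets.
install.packages("forecast")      # fetch the package from a CRAN mirror
library(forecast)

fit <- auto.arima(AirPassengers)  # fit a time-series model to sample data
print(forecast(fit, h = 12))      # forecast the next 12 months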
Microsoft’s vision for R is one that crosses the boundaries between desktop, on-premises servers, and the cloud. Locally, there’s a free R development client, as well as R support in Microsoft’s (paid) flagship Visual Studio development environment. On-premises, R Server runs on Windows and Linux, as well as inside SQL Server, giving you access to statistical analysis tools alongside your data. Local big data services based on Hadoop and Spark are also supported, while on Azure you can run R Server alongside Microsoft’s HDInsight services.
R is a tool for data scientists. Although the R language itself is relatively simple, you need deep knowledge of statistical analysis to get the most from it. It’s been a long while since I took college-level statistics classes, so I found getting started with R challenging: many of the underlying concepts require a graduate-level understanding of statistics. The question isn’t so much whether you can write R code; it’s whether you can understand the results you’re getting.
That’s probably the biggest issue facing any organization that wants to work with big data: getting the skills needed to produce the analysis you want and, more important, to interpret the results you get. R certainly helps here, with built-in graphing tools for visualizing key statistical measures.
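A small example of that gap between writing code and reading results, using nothing but base R: fitting the model is one line, while the summary output and diagnostic plots are where the statistical judgment comes in.

# Fitting a linear model is trivial; interpreting coefficients, p-values,
# R-squared, and residual diagnostics is the hard part.
fit <- lm(mpg ~ wt + hp, data = mtcars)  # regression on a built-in dataset
summary(fit)                             # coefficients, significance, fit

# R's built-in graphics make the key measures easier to inspect visually.
par(mfrow = c(2, 2))
plot(fit)  # residual, Q-Q, scale-location, and leverage plots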
Working with Microsoft R Server
The free Microsoft R Open can help your analytics team get up to speed with R before investing in any of the server products. It’s also a useful tool for quickly trying out new analytical algorithms and exploring the questions you want answered using your data. That approach works well as part of an overall analytics lifecycle, starting with data preparation, moving on to model development, and finally turning the model into tools that can be built into your business applications.
One interesting role for R is alongside GPU-based machine-learning tools. Here, R is employed to help train models before they’re used at scale. Microsoft is bundling its own machine learning algorithms with the latest R Server release, so you can test a model before uploading it to either a local big data instance or to the cloud. During a recent press event, Microsoft demonstrated this approach with astronomy images, training a machine-learning-based classifier on a local server with a library of galaxies before running the resulting model on cloud-hosted GPUs.
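A rough sketch of that train-locally-then-deploy pattern, assuming the RevoScaleR functions that ship with Microsoft R Client and R Server are available; the dataset and formula here are stand-ins, not the astronomy demo.

# Sketch only: train a simple classifier locally with RevoScaleR's rx*
# functions (assumed available), keeping the model object for later
# deployment to a bigger compute context.
library(RevoScaleR)

# mtcars stands in for real training data; am is a 0/1 outcome.
model <- rxLogit(am ~ wt + hp, data = mtcars)

# Score locally to sanity-check the model before pushing it to SQL Server,
# Hadoop, or a cloud-hosted R Server instance.
scored <- rxPredict(model, data = mtcars)
head(scored)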
R is an extremely portable language, and its models are designed to work over discrete samples of data, which makes it scalable and well suited to data-parallel problems. The same R model can run on multiple servers, so it’s simple to process large amounts of data quickly: all you need to do is parcel out your data appropriately, then deliver it to your various R Server instances. Similarly, the same code runs across different R implementations, so a model built and tested against local data sources can be deployed inside a SQL Server database or run against a Hadoop data lake.
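The parceling-out step can be sketched with nothing more than base R’s parallel package; in production the chunks would go to separate R Server instances rather than local worker processes.

# Generic data-parallel sketch: split the data into chunks and run the same
# function over each chunk on separate workers.
library(parallel)

score_chunk <- function(chunk) {
  # A real deployment would call predict() with a trained model here;
  # this toy version just summarizes each chunk.
  data.frame(rows = nrow(chunk), mean_mpg = mean(chunk$mpg))
}

chunks <- split(mtcars, cut(seq_len(nrow(mtcars)), 4))  # parcel out the data
cl <- makeCluster(4)                                    # four local workers
results <- parLapply(cl, chunks, score_chunk)
stopCluster(cl)

do.call(rbind, results)  # combine the per-chunk results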
R makes operationalizing data models easy
Thus, R is very easy to operationalize. Your data science team can work on building the model you need, while your developers write the applications and build the infrastructure that will take advantage of their code. Once it’s ready, the model can be deployed quickly, and it can even be swapped out for an improved model in the future without affecting the rest of the application. Likewise, a single model can be shared by different applications that work with the same data.
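One way to picture that separation of concerns, using only base R: the data science team serializes a trained model to a file, and the application simply loads whatever model is currently deployed, so an improved version can be dropped in without touching application code. The file name and model below are purely illustrative.

# Data science side: train and serialize the model artifact.
model_v1 <- lm(mpg ~ wt, data = mtcars)  # placeholder model
saveRDS(model_v1, "deployed_model.rds")  # hypothetical file name

# Application side: load whatever model is currently deployed and score.
current_model <- readRDS("deployed_model.rds")
predict(current_model, newdata = data.frame(wt = c(2.5, 3.2)))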
With a common model, your internal dashboards can show you the same answers as customer- and consumer-facing code. You can then use data to respond proactively—for example, providing delay and rebooking information to airline passengers when a model predicts weather delays. That model can be refined as you get more data, reducing the risks of false positives and false negatives.
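Those false-positive and false-negative rates are easy to track as the model is refined; a toy confusion matrix in base R (the labels and predictions below are made up) shows the calculation.

# Toy example: measure false positives and false negatives for a binary
# prediction, the error rates that should shrink as the model improves.
actual    <- factor(c(1, 0, 1, 1, 0, 0, 1, 0), levels = c(0, 1))
predicted <- factor(c(1, 0, 0, 1, 1, 0, 1, 0), levels = c(0, 1))

cm <- table(predicted, actual)
false_positive_rate <- cm["1", "0"] / sum(cm[, "0"])  # predicted 1, actually 0
false_negative_rate <- cm["0", "1"] / sum(cm[, "1"])  # predicted 0, actually 1
c(fpr = false_positive_rate, fnr = false_negative_rate)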
Building R support into SQL Server makes a lot of sense. As Microsoft’s database platform becomes a bridge between on-premises data and the cloud, as well as between your systems of record and big data tools, having fine-grained analytics tools in your database is a no-brainer. A simple utility takes your R models and turns them into stored procedures, ready for use inside your SQL applications. Database developers can work with data analytics teams to implement those models, and they don’t need to learn any new skills to build them into their applications.
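The proc-generation utility itself isn’t shown here, but here is a rough sketch of the R side of that workflow, assuming the RevoScaleR data-source functions available with SQL Server R Services; the connection string, table, and column names are placeholders.

# Sketch only: point an R model at data that lives in SQL Server.
library(RevoScaleR)

conn <- "Driver=SQL Server;Server=myserver;Database=mydb;Trusted_Connection=yes"  # placeholder
orders <- RxSqlServerData(connectionString = conn, table = "dbo.CustomerOrders")  # hypothetical table

# Fit the model directly against the database table; the trained model can
# then be handed over to be wrapped as a stored procedure.
churn_model <- rxLogit(Churned ~ OrderCount + TenureMonths, data = orders)
summary(churn_model)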
Microsoft is aware that not every enterprise needs or has the budget to employ data scientists. If you’re dealing with common analytics problems, like trying to predict customer churn or detecting fraud in an online store, you have the option of working with a range of predefined templates for SQL Server’s R Services that contain ready-to-use models. Available from Microsoft’s MSDN, they’re fully customizable in any R-compatible IDE, and you can deploy them with a PowerShell script.
Source: InfoWorld Big Data