iQuate Releases iQCloud

iQuate has announced the availability of iQCloud, which it describes as the most advanced automated discovery and service mapping solution for digital enterprises.

As cloud computing becomes mainstream, business and IT professionals must understand how their IT services are delivered to run their business in a digital age. How can you harness the power of the cloud if you don’t understand how your existing IT services are delivered today?

“iQCloud gives organizations what they need for a smarter way to the cloud,” says Patrick McNally, CEO of iQuate. “We call it Discovery and Service Mapping 2.0 because it automatically discovers, maps, sizes, tracks, and enables dynamic service management with top-down application services visibility together with bottom-up infrastructure clarity.  We will tell you how your IT services are delivered now, and we’ll help you manage them wherever they are delivered in the future — across legacy, private and public cloud environments.”

McNally brought together a highly respected team with several decades of combined experience in discovery and service mapping to create iQCloud. The iQuate team worked with an existing global customer base to build a solution that reduces to minutes and hours what once required weeks or months of manual effort and deep in-house knowledge of IT resources.

iQCloud provides actionable information to IT and business professionals within the first hour of onboarding and doesn’t require installation or deep knowledge of the IT enterprise it exposes. “The technology has been designed to get more organizations into the cloud faster and with lower risk,” says McNally. “iQCloud automatically provides a holistic view across your entire estate, including highly dynamic, hybrid IT environments.”

Source: CloudStrategyMag

Vertiv Introduces Cloud Capabilities And IoT Gateway

Vertiv, formerly Emerson Network Power, has announced a significant cloud-based addition to the Vertiv portfolio that will empower customers with deeper insights across the data center. The Vertiv cloud offering will leverage the collective knowledge gleaned from decades of data center projects to deliver real-time analysis that will simplify and streamline data center operations and planning.

As part of the Vertiv cloud offering, a new Internet of Things (IoT) gateway is now available that provides added security, with simple installation and commissioning to streamline data center connectivity. The Vertiv™ RDU300 gateway, a new entry in the Vertiv RDU family of monitoring and control products, integrates with building management systems and ensures that any data passed to the Vertiv cloud from the customer site is transferred securely and with minimum bandwidth. Together, the Vertiv cloud offering and Vertiv RDU300 gateway enable remote visibility, collection, and analysis of critical infrastructure data across all Vertiv products.

“As an organization, we have designed and built data centers of all shapes and sizes and have millions of equipment installations in data centers and IT facilities in every corner of the globe,” said Patrick Quirk, vice president and general manager of Global Management Systems at Vertiv. “The accumulated knowledge from past, present and future deployments is a powerful resource, and this cloud-based initiative operationalizes that resource in a way that will bring unprecedented capabilities to our customers.”

The Vertiv cloud initiative unlocks the data and deep domain knowledge Vertiv has accrued from its history of monitoring and servicing hardware, software and sensors, including its Chloride®, Liebert®, and NetSure™ brands. With billions of historical uninterruptible power supply (UPS), battery and thermal system data points populating the Vertiv cloud, supplemented by the constant inflow of real-time data, operators will be able to make decisions and take actions based on data-based insight and best practices from across the industry.

Vertiv will use its cloud to aggregate, anonymize, and analyze data from IT deployments around the world, identifying trends and patterns that will transform data center operation practices and virtually eliminate the traditional break/fix model and preventative maintenance. Starting with battery monitoring and monitoring for select UPS and power distribution unit (PDU) systems, Vertiv will leverage its cloud to continuously evaluate performance against billions of existing data points to anticipate everything from maintenance needs to efficiency improvements. The Vertiv cloud will synthesize that information and deliver preemptive prompts to data center managers, who can remotely trigger the appropriate actions through qualified personnel and, eventually, through secure Vertiv gateway systems within their facilities, and can plan more effectively in the short and long term.

Source: CloudStrategyMag

No, you shouldn’t keep all that data forever

The modern ethos is that all data is valuable, that it should be stored forever, and that machine learning will one day magically find the value in it. You’ve probably seen that EMC picture about how there will be 44 zettabytes of data by 2020? Remember how everyone had Fitbits and Jawbone Ups for about a minute? Now Jawbone is out of business. Have you considered that this “all data is valuable” fad might be the corporate equivalent? Maybe we shouldn’t take a data storage company’s word for it that we should store all data and never delete anything.

Back in the early days of the web it was said that the main reasons people went there were for porn, jobs, or cat pictures. If we download all of those cat pictures and run a machine learning algorithm on them, we can possibly determine the most popular colors of cats, the most popular breeds of cats, and the fact that people really like their cats. But we don’t need to do this—because we already know these things. Type any of those three things into Google and you’ll find the answer. Also, with all due respect to cat owners, this isn’t terribly important data.

Your company has a lot of proverbial cat pictures. It doesn’t matter what your policies and procedures for inventory retention were in 1999. Any legal issues that gave you reason to store that data back then have long since passed the statute of limitations. There isn’t anything conceivable that you could glean from that old data that could not be gleaned from any of the more recent revisions.

Machine learning or AI isn’t going to tell you anything interesting about any of your 1999 policies and procedures for inventory retention. That old material might even become a kind of “dark data,” because your search tool probably boosts everything else above it, so unless someone queries for “inventory retention procedure for 1999,” it isn’t going to come up.

You’ve got logs going back to the beginning of time. Even the Jawbone UP didn’t capture my every breath and certainly didn’t store my individual steps for all time. Sure, each breath or step may have slightly different characteristics, but they aren’t important. Likewise, it probably doesn’t matter how many exceptions per hour your Java EE application server used to throw in 2006. You use Node.js now anyhow. If “how many errors per hour per year” is a useful metric, you can probably just summarize that. You don’t need to keep every log for all time. It isn’t reasonable to expect it to be useful.
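
If a per-hour error summary really is the metric you need, it is tiny compared with the raw logs. Here is a minimal sketch using pandas, assuming a hypothetical CSV export of old application logs; the file name and the “timestamp” and “level” columns are placeholders for your own log format.

```python
# Illustrative sketch: keep the per-hour error summary, not every log line.
# "app_logs_2006.csv", "timestamp", and "level" are hypothetical placeholders.
import pandas as pd

logs = pd.read_csv("app_logs_2006.csv", parse_dates=["timestamp"])

errors_per_hour = (
    logs[logs["level"] == "ERROR"]  # keep only error entries
    .set_index("timestamp")
    .resample("1H")                 # bucket by hour
    .size()
    .rename("error_count")
)

# A few kilobytes of summary can stand in for gigabytes of raw logs.
errors_per_hour.to_csv("error_summary_2006.csv")
```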

Supposedly, we’re keeping this stuff around for the day when AI or machine learning finds something useful in it. But machine learning isn’t magical. Mostly, machine learning falls into classification, regression, and clustering. Clustering basically groups stuff that looks “similar”—but it isn’t very likely your 2006 app server logs have anything useful in them that can be found via clustering. The other two approaches require you to think of something and “train” the machine learning model. This means you need a theory of what could be useful, find examples of it yourself, and then train the computer to find more. Don’t you have better things to do?

Storage is cheap, but organization and insight are not. Just because you got a good deal on your SAN or have been running some kind of mirrored JBOD setup with a clustered file system doesn’t mean that storing noise is actually cheap. You need to consider the human costs of organizing, maintaining, and keeping all this stuff around. Moreover, while modern search technology is good at sorting relevant stuff from irrelevant, it does cost you something to do so. So while autumn is on the wane, go ahead and burn some proverbial corporate leaves.

It really is okay if you don’t keep it.

Source: InfoWorld Big Data

IDG Contributor Network: How in-memory computing drives digital transformation with HTAP

In-memory computing (IMC) is becoming a fixture in the data center, and Gartner predicts that by 2020, IMC will be incorporated into most mainstream products. One of the benefits of IMC is that it will enable enterprises to start implementing hybrid transactional/analytical processing (HTAP) strategies, which have the potential to revolutionize data processing by providing real-time insights into big data sets while simultaneously driving down costs.

Here’s why IMC and HTAP are tech’s new power couple.

Extreme processing performance with IMC

IMC platforms maintain data in RAM, processing and analyzing it without continually reading from and writing to a disk-based database. Architected to distribute processing across a cluster of commodity servers, these platforms can easily be inserted between existing application and data layers with no rip-and-replace.

They can also be easily and cost-effectively scaled by adding new servers to the cluster, automatically taking advantage of the added RAM and CPU processing power. The benefits of IMC platforms include performance gains of 1,000X or more, the ability to scale to petabytes of in-memory data, and high availability thanks to distributed computing.
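
For a concrete, if simplified, point of reference, the sketch below uses pyignite, the Python thin client for Apache Ignite, one open source IMC platform (the article itself does not name a specific product). It assumes a local Ignite node with the thin-client listener on its default port, 10800; the cache name and keys are hypothetical.

```python
# Hedged sketch: key/value data held in cluster RAM via the Ignite thin client.
# Assumes an Ignite node is running locally; adding nodes to the cluster adds
# RAM and CPU without changing this client code.
from pyignite import Client

client = Client()
client.connect("127.0.0.1", 10800)

balances = client.get_or_create_cache("account_balances")  # hypothetical cache
balances.put("acct-1", 1000.0)   # write into memory
print(balances.get("acct-1"))    # read back from memory: 1000.0
```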

In-memory computing isn’t new, but until recently, only companies with extremely high-performance, high-value applications could justify the cost of such solutions. However, the cost of RAM has dropped steadily, approximately 10 percent per year for decades. So today the value gained from in-memory computing and the increase in performance it provides can be cost-effectively realized by a growing number of companies in an increasing number of use cases.

Transactions and analytics on the same data set with HTAP

HTAP is a simple concept: the ability to process transactions (such as investment buy and sell orders) while also performing real-time analytics (such as calculating historical account balances and performance) on the operational data set.
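
As a toy illustration of the concept only, the sketch below records transactions and runs analytics against the same live, in-memory table, with no ETL step in between. It uses Python’s built-in sqlite3 module, which is a single-node embedded database rather than a distributed IMC platform, and the table and values are hypothetical.

```python
# Toy HTAP illustration: the "transactional" writes and the "analytical" query
# share one live, in-memory data set. sqlite3 just keeps the example small.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE orders (account TEXT, side TEXT, qty INTEGER, price REAL)")

# Transaction processing: record buy and sell orders as they arrive.
db.executemany(
    "INSERT INTO orders VALUES (?, ?, ?, ?)",
    [
        ("acct-1", "BUY", 100, 52.10),
        ("acct-1", "SELL", 40, 53.00),
        ("acct-2", "BUY", 10, 53.05),
    ],
)
db.commit()

# Real-time analytics: net position and notional per account, same table.
query = """
    SELECT account,
           SUM(CASE WHEN side = 'BUY' THEN qty ELSE -qty END) AS net_qty,
           SUM(qty * price) AS notional
      FROM orders
     GROUP BY account
"""
for account, net_qty, notional in db.execute(query):
    print(account, net_qty, round(notional, 2))
```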

For example, in a recent In-Memory Computing Summit North America keynote, Rafique Awan from Wellington Management described the importance of HTAP to the performance of the company’s new investment book of record (IBOR). Wellington has more than $1 trillion in assets under management.

But HTAP isn’t easy. In the earliest days of computing, the same data set was used for both transaction processing and analytics. However, as data sets grew in size, queries started slowing down the system and could lock up the database.

To ensure fast transaction processing and flexible analytics for large data sets, companies deployed transactional databases, referred to as online transaction processing (OLTP) systems, solely for the purpose of recording and processing transactions. Separate online analytical processing (OLAP) databases were deployed, and data from an OLTP system was periodically (daily, weekly, etc.) extracted, transformed, and loaded (ETLed) into the OLAP system.

This bifurcated architecture has worked well for the last few decades. But the need for real-time transaction and analytics processing in the face of rapidly growing operational data sets has become crucial for digital transformation initiatives, such as those driving web-scale applications and internet of things (IoT) use cases. With separate OLTP and OLAP systems, however, by the time the data is replicated from the OLTP to the OLAP system, it is simply too late—real-time analytics are impossible.

Another disadvantage of the current strategy of separate OLTP and OLAP systems is that IT must maintain separate architectures, typically on separate technology stacks. This results in hardware and software costs for both systems, as well as the cost for human resources to build and maintain them.

The new power couple

With in-memory computing, the entire transactional data set is already in RAM and ready for analysis. More sophisticated in-memory computing platforms can co-locate compute with the data to run fast, distributed analytics across the data set without impacting transaction processing. This means replicating the operational data set to an OLAP system is no longer necessary.

According to Gartner, in-memory computing is ideal for HTAP because it supports real-time analytics and situational awareness on the live transaction data instead of relying on after-the-fact analyses on stale data. IMC also has the potential to significantly reduce the cost and complexity of the data layer architecture, allowing real-time, web-scale applications at a much lower cost than separate OLTP/OLAP approaches.

To be fair, not all data analytics can be performed using HTAP. Highly complex, long-running queries must still be performed in OLAP systems. However, HTAP can provide businesses with a completely new ability to react immediately to a rapidly changing environment.

For example, for industrial IoT use cases, HTAP can enable the real-time capture of incoming sensor data and simultaneously make real-time decisions. This can result in more timely maintenance, higher asset utilization, and reduced costs, driving significant financial benefits. Financial services firms can process transactions in their IBORs and analyze their risk and capital requirements at any point in time to meet the real-time regulatory reporting requirements that impact their business.

Online retailers can transact purchases while simultaneously analyzing inventory levels and other factors, such as weather conditions or website traffic, to update pricing for a given item in real time. And health care providers can continually analyze the transactional data being collected from hundreds or thousands of in-hospital and home-based patients to provide immediate individual recommendations while also looking at trend data for possible disease outbreaks.

Finally, by eliminating the need for separate databases, an IMC-powered HTAP system can simplify life for development teams and eliminate duplicative costs by reducing the number of technologies in use and downsizing to just one infrastructure.

The fast data opportunity

The rapid growth of data and the drive to make real-time decisions based on the data generated as a result of digital transformation initiatives is driving companies to consider IMC-based HTAP solutions. Any business faced with the opportunities and challenges of fast data from initiatives such as web-scale applications and the internet of things, which require ever-greater levels of performance and scale, should definitely take the time to learn more about in-memory computing-driven hybrid transactional/analytical processing.

This article is published as part of the IDG Contributor Network.

Source: InfoWorld Big Data

Equinix Collaboration with AWS Expands to Additional Markets

Equinix, Inc. announced an expansion of its collaboration with Amazon Web Services (AWS) with the extension of direct, private connectivity to the AWS Direct Connect service to four additional Equinix International Business Exchange™ (IBX®) data centers in North America and Europe. The move advances the Equinix and AWS collaboration that enables businesses to connect their owned and managed infrastructure directly to AWS via a private connection, which helps customers reduce costs, improve performance and achieve a more consistent network experience.

“When businesses compete at the digital edge, proximity matters. To be successful, enterprises require superior interconnection. Together, Equinix and AWS are catalysts and enablers of this new digital and interconnected world. By offering AWS Direct Connect in our data centers across the globe, we are helping our customers solve their business challenges, drive better outcomes, and simplify their journey to the public cloud,” said Kaushik Joshi, global managing director, Strategic Alliances at Equinix.

Effective immediately, AWS Direct Connect will be available to customers in Equinix IBX data centers in Helsinki, Madrid, Manchester, and Toronto, bringing the total number of Equinix metros offering AWS Direct Connect to 21, globally. Customers can connect to AWS Direct Connect at all available speeds via Equinix Cloud Exchange™ (ECX), cross connects or Equinix-provided metro connectivity options. Additionally, with the recently announced AWS Direct Connect Gateway, Equinix customers can also access multiple AWS regions with a single connection to AWS Direct Connect.

In addition to the four new markets announced today, Equinix offers AWS Direct Connect in the Amsterdam, Chicago, Dallas, Frankfurt, Los Angeles, London, Munich, New York, Osaka, São Paulo, Seattle, Silicon Valley, Singapore, Sydney, Tokyo, Warsaw, and Washington, D.C. metro areas.

Direct connection to AWS inside Equinix IBX data centers is ideal for specific enterprise use cases, such as:

  • Securing and accelerating data flows: Applications such as business intelligence, pattern recognition and data visualization require heavy compute and low-latency connectivity to large data sets. Equinix Data Hub™ and Cloud Exchange can help enterprises control data movement and placement by enabling private, secure and fast connectivity between private data storage devices and AWS compute nodes, maintaining data residency and accelerating access between storage and compute resources.
  • Interconnecting to hybrid cloud and business ecosystems: Direct connection to AWS via the Equinix Cloud Exchange offers enterprises access to networks, IaaS, PaaS and SaaS providers and connectivity to thousands of other business ecosystems.

Direct and private connectivity to strategic cloud providers that avoids the public internet is a growing business practice for leading companies. According to the Global Interconnection Index, a market study published recently by Equinix, the capacity for private data exchange between enterprises and cloud providers is forecast to grow at a 160% CAGR between now and 2020.

Source: CloudStrategyMag

Equinix Achieves AWS Networking Competency Status

Equinix, Inc. has announced it has achieved Amazon Web Services (AWS) Networking Competency status in the AWS Partner Network (APN), underscoring Equinix’s ongoing commitment to serving AWS customers by providing private and secure access inside its global footprint of International Business Exchange™ (IBX®) data centers. This distinction recognizes Equinix as a key Technology Partner in the APN, helping customers adopt, develop, and deploy networks on AWS.

“Equinix is proud to achieve AWS Networking Competency status. Together, Equinix and AWS Direct Connect accelerate Amazon Web Services adoption by making it easier to directly and securely connect to AWS and ensure the performance and availability of mission-critical applications and workloads,” said Kaushik Joshi, global managing director, Strategic Alliances at Equinix.

Achieving the AWS Networking Competency differentiates Equinix as an APN member that has demonstrated specialized technical proficiency and proven customer success, with a specific focus on networking based on AWS Direct Connect. To receive the designation, APN members must possess deep AWS expertise and deliver solutions seamlessly on AWS.

AWS enables scalable, flexible, and cost-effective solutions for customers ranging from startups to global enterprises. To support the seamless integration and deployment of these solutions, AWS established the AWS Competency Program to help customers identify Consulting and Technology APN Partners with deep industry experience and expertise.

In April of this year, Equinix achieved Advanced Technology Partner status in the AWS Partner Network. To obtain this status, AWS requires partners to meet stringent criteria, including the ability to demonstrate success in providing AWS services to a wide range of customers and use cases. Additionally, partners must complete a technical solution validation by AWS.

To help customers reduce costs, improve performance and achieve a more consistent network experience, Equinix offers AWS Direct Connect service in its IBX data centers in 21 markets globally, including the Amsterdam, Chicago, Dallas, Frankfurt, Helsinki, Los Angeles, London, Madrid, Manchester, Munich, New York, Osaka, São Paulo, Seattle, Silicon Valley, Singapore, Sydney, Tokyo, Toronto, Warsaw and Washington, D.C. metro areas.

Direct and private connectivity to strategic cloud providers that avoids the public internet is a growing business practice for leading companies. According to the Global Interconnection Index, a market study published recently by Equinix, the capacity for private data exchange between enterprises and cloud providers is forecast to grow at 160% CAGR between now and 2020.

Source: CloudStrategyMag

IDG Contributor Network: Are you treating your data as an asset?

It’s a phrase we constantly hear, isn’t it? Data is a crucial business asset from which we can extract value and gain competitive advantage. Those who use data well will be the success stories of the future.

This got me thinking: If data is such a major asset, why do we hear so many stories about data leaks? Would these companies be quite so loose with other assets? You don’t hear about businesses losing hundreds of company cars or half a dozen buildings, do you?

If data is a potential asset, why aren’t companies treating it as such?

The reality is that many businesses don’t treat data as an asset. In fact, it’s treated so badly that increasing regulation is forcing organizations to take better care of it. These external pressures have the potential to deliver significant benefits by changing the way data is viewed across organizations from top to bottom, forcing data to be treated as the asset it is.

If you can start to treat data as an asset, you can put yourself in a position where data really can provide a competitive advantage.

Where to start?

Clean up the mess

Do you have too little data in your organization? Probably not. In data discussion groups, a common refrain is that companies “have too much” and “it’s out of control.” Organizations are spending more and more resources on storing, protecting and securing it, but it’s not only the cost of keeping data that’s a problem. Tightening regulation will force you to clean up what you have.

It’s not an asset if you just keep collecting it and never do the housekeeping and maintenance that you should with any asset. If you don’t look after it, you will find it very difficult to realize value.

Your starting point is to ask yourself what you have, why you have it, and why you need it.

Gain some control

I talk regularly with people about the what, where, who and why of data. Understanding this will allow you to start to gain control of your asset.

Once it’s decided what your organization should have—and what you should be keeping—you need to understand exactly what you do have and, importantly, where it is stored: in data centers, on laptops, on mobile devices, or with cloud providers.

Next, the who and why. What other business asset does your company own where you wouldn’t know who’s using it and why? Yet companies seem to do this with data all the time. Look inside your own organization: Do you have a full understanding of who’s accessing your data…and why?

To treat our data like an asset, it’s crucial to understand how our data is being treated.

Build it the right home

As with any asset, data needs the right environment in which to thrive. Your organization no doubt offers decent working conditions for your employees, has a parking lot, provides regular maintenance for your car fleets and so on, doesn’t it? The same should be true for your data.

Consider your data strategy. Is it focused on the storage, media type, or a particular vendor? Or are you building a modern, forward-thinking strategy focused on the data itself, not the technology? This includes looking at how to ensure data is never siloed, can be placed in the right repository as needed, and can move seamlessly between repositories—be they on-prem, in the cloud, or elsewhere. Is your data always available? Can it be recovered quickly?

Build a strategy with a focus on the asset itself: the data.

Be ready to put it to work

To truly treat data as an asset, be prepared to sweat it like you would any other. If you can apply the things I’ve mentioned—cleanse it, gain control of it, have a data-focused strategy and have the right data in the right place—you can start to take advantage of tools that will allow you to gain value from it.

The ability to apply data analytics, machine learning, artificial intelligence and big data techniques to your assets allows you to not only understand your data better, but to begin to learn things from your data that you’d never previously been aware of…which is the most exciting opportunity data presents you.

Culture

All the above said, perhaps the best thing you can do for your data is to encourage a culture that is data-focused, one that realizes the importance of security and privacy, as well as understanding that data is crucial to your organization’s success.

If you can encourage and drive that cultural shift, there is every chance that your data will be treated as the asset it truly is—and you and your organization will be well-placed to reap the rewards that taking care of your data can bring.

This article is published as part of the IDG Contributor Network.

Source: InfoWorld Big Data

Azure Databricks: Fast analytics in the cloud with Apache Spark

We’re living in a world of big data. The current generation of line-of-business computer systems generates terabytes of data every year, tracking sales and production through CRM and ERP. It’s a flood of data that’s only going to get bigger as we add the sensors of the industrial internet of things, and the data that’s needed to deliver even the simplest predictive-maintenance systems.

Having that data is one thing; using it is another. Big data is often unstructured, spread across many servers and databases. You need something to bring it together. That’s where big data analysis tools like Apache Spark come into play; these distributed analytical tools work across clusters of computers. Building on techniques developed for the MapReduce algorithms used by tools like Hadoop, today’s big data analysis tools go further to support more database-like behavior, working with in-memory data at scale, using loops to speed up queries, and providing a foundation for machine learning systems.

Apache Spark is fast, but Databricks is faster. Founded by the team that created Spark, Databricks provides a cloud-optimized version of Spark that takes advantage of public cloud services to scale rapidly and uses cloud storage to host its data. It also offers tools that make it easier to explore your data, using the notebook model popularized by tools like Jupyter Notebooks.

Microsoft’s new support for Databricks on Azure—called Azure Databricks—signals a new direction for its cloud services, bringing Databricks in as a partner rather than through an acquisition.

Although you’ve always been able to install Spark or Databricks on Azure, Azure Databricks makes it a one-click experience, driving the setup process from the Azure Portal. You can host multiple analytical clusters, using autoscaling to minimize the resources in use. You can clone and edit clusters, tuning them for specific jobs or running different analyses on the same underlying data.

Configuring the Azure Databricks virtual appliance

The heart of Microsoft’s new service is a managed Databricks virtual appliance built using containers running on Azure Container Services. You choose the number of VMs in each cluster it controls, and once the service is configured and running, it handles load automatically, spinning up new VMs to handle scaling.

Databricks’ tools interact directly with the Azure Resource Manager, which adds a security group and a dedicated storage account and virtual network to your Azure subscription. It lets you use any class of Azure VM for your Databricks cluster – so if you’re planning on using it to train machine learning systems, you’ll want to choose one of the latest GPU-based VMs. And of course, if one VM model isn’t right for your problem, you can switch it out for another. All you need to do is clone a cluster and change the VM definitions.

Querying in Spark brings engineering to data science

Spark has its own query language based on SQL, which works with Spark DataFrames to handle both structured and unstructured data. DataFrames are the equivalent of a relational table, constructed on top of collections of distributed data in different stores. Using named columns, you can construct and manipulate DataFrames with languages like R and Python; thus, both developers and data scientists can take advantage of them.

The DataFrame API is essentially a domain-specific language for your data, one that extends the data analysis features of your chosen platform. By using familiar libraries with DataFrames, you can construct complex queries that take data from multiple sources, working across columns.
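
Here is a minimal PySpark sketch of what that looks like in practice. The file paths and column names are hypothetical placeholders, and the same aggregation is shown both through the DataFrame API and as a Spark SQL query over temporary views.

```python
# Minimal sketch of the DataFrame API; paths and columns are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("dataframe-demo").getOrCreate()

orders = spark.read.json("/data/orders.json")      # semi-structured source
customers = spark.read.parquet("/data/customers")  # structured source

# DataFrame API: join two sources and aggregate by named columns.
revenue_by_region = (
    orders.join(customers, "customer_id")
          .groupBy("region")
          .agg(F.sum("amount").alias("revenue"))
)
revenue_by_region.show()

# The same query expressed in Spark SQL against temporary views.
orders.createOrReplaceTempView("orders")
customers.createOrReplaceTempView("customers")
spark.sql("""
    SELECT c.region, SUM(o.amount) AS revenue
    FROM orders o JOIN customers c ON o.customer_id = c.customer_id
    GROUP BY c.region
""").show()
```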

Because Azure Databricks is inherently data-parallel, and its queries are evaluated only when called to deliver actions, results can be delivered very quickly. Because Spark supports most common data sources, either natively or through extensions, you can add Azure Databricks DataFrames and queries to existing data relatively easily, reducing the need to migrate data to take advantage of its capabilities.

Although Azure Databricks provides a high-speed analytics layer across multiple sources, it’s also a useful tool for data scientists and developers trying to build and explore new models, turning data science into data engineering. Using Databricks Notebooks, you can develop scratchpad views of your data, with code and results in a single view.

The resulting notebooks are shared resources, so anyone can use them to explore their data and try out new queries. Once a query is tested and turned into a regular job, its output can be exposed as an element of a Power BI dashboard, making Azure Databricks part of an end-to-end data architecture that allows more complex reporting than a simple SQL or NoSQL service—or even Hadoop.

Microsoft plus Databricks: a new model for Azure Services

Microsoft hasn’t yet detailed its pricing for Azure Databricks, but it does claim that the service can improve performance and reduce cost by as much as 99 percent compared to running your own unmanaged Spark installation on Azure’s infrastructure services. If that claim is borne out, it promises significant savings, especially when you factor in no longer having to run your own Spark infrastructure.

Azure’s Databricks service will connect directly to Azure storage services, including Azure Data Lake, with optimizations for queries and caching. There’s also the option of using it with Cosmos DB, so you can take advantage of global data sources and a range of NoSQL data models, including MongoDB and Cassandra compatibility—as well as Cosmos DB’s graph APIs. It should also work well with Azure’s data-streaming tools, giving you a new option for near real-time IoT analytics.

If you’re already using Databricks’ Spark tools, this new service won’t affect you or your relationship with Databricks. It’s only if you take the models and analytics you’ve developed on-premises to Azure’s cloud that you’ll get a billing relationship with Microsoft. You’ll also have fewer management tasks, leaving you more time to work with your data.

Microsoft’s decision to work with an expert partner on a new service makes a lot of sense. Databricks has the expertise, and Microsoft has the platform. If the resulting service is successful, it could set a new pattern for how Azure evolves in the future, building on what businesses are already using and making them part of the Azure hybrid cloud without absorbing those services into Microsoft.

Source: InfoWorld Big Data

IDG Contributor Network: Use the cloud to create open, connected data lakes for AI, not data swamps

Produced by every single organization, data is the common denominator across industries as we look to advance how cloud and AI are incorporated into our operations and daily lives. Before the potential of cloud-powered data science and AI is fully realized, however, we first face the challenge of grappling with the sheer volume of data. This means figuring out how to turn its velocity and mass from an overwhelming firehose into an organized stream of intelligence.

To capture all the complex data streaming into systems from various sources, businesses have turned to data lakes. Often on the cloud, these are storage repositories that hold an enormous amount of data until it’s ready to be analyzed: raw or refined, and structured or unstructured. This concept seems sound: the more data companies can collect, the less likely they are to miss important patterns and trends coming from their data.

However, a data scientist will quickly tell you that the data lake approach is a recipe for a data swamp, and there are a few reasons why. First, a good amount of data is often hastily stored, without a consistent strategy in place around how to organize, govern and maintain it. Think of your junk drawer at home: Various items get thrown in at random over time, until it’s often impossible to find something you’re looking for in the drawer, as it’s gotten buried.

This disorganization leads to the second problem: users are often not able to find a dataset once it has been ingested into the data lake. Without a way to easily search for data, it’s nearly impossible to discover and use it, making it difficult for teams to ensure it stays within compliance or gets fed to the right knowledge workers. These problems combine to create a breeding ground for dark data: unorganized, unstructured, and unmanageable data.

Many companies have invested in growing their data lakes, but what they soon realize is that having too much information is an organizational nightmare. Multiple channels of data in a wide range of formats can cause businesses to quickly lose sight of the big picture and how their datasets connect.

Compounding the problem further, incomplete or inadequate datasets often add even more noise when data scientists are searching for specific datasets. It’s like trying to solve a riddle without a critical clue. This leads to a major issue: data scientists spend, on average, only 20 percent of their time on actual data analysis and 80 percent of their time finding, cleaning, and reorganizing tons of data.

The power of the cloud

One of the most promising elements of the cloud is that it offers capabilities to reach across open and proprietary platforms to connect and organize all a company’s data, regardless of where it resides. This equips data science teams with complete visibility, helping them to quickly find the datasets they need and better share and govern them.

Accessing and cataloging data via the cloud also offers the ability to use and connect into new analytical techniques and services, such as predictive analytics, data visualization and AI. These cloud-fueled tools help data to be more easily understood and shared across multiple business teams and users—not just data scientists.

It’s important to note that the cloud has evolved. Preliminary cloud technologies required some assembly and self-governance, but today’s cloud allows companies to subscribe to an instant operating system in which data governance and intelligence are native. As a result, data scientists can get back to what’s important: developing algorithms, building machine learning models, and analyzing the data that matters.

For example, an enterprise can augment its data lake with cloud services that use machine learning to classify and cleanse incoming data sets, helping organize and prepare them for ingestion into AI apps. The metadata from this process builds an index of all data assets, and data stewards can apply governance policies to ensure that only authorized users can access sensitive resources.
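
As a rough, vendor-neutral illustration of the indexing idea, the sketch below walks a hypothetical staging directory and records basic metadata for each file. Real catalog services layer classification, lineage, and policy enforcement on top of metadata like this; the path used here is a placeholder.

```python
# Illustrative sketch: build a tiny metadata index over files in a staging area.
# "/data/lake/staging" is a hypothetical placeholder path.
import hashlib
import json
import os
from datetime import datetime, timezone

def index_assets(root):
    catalog = []
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            stat = os.stat(path)
            with open(path, "rb") as f:
                digest = hashlib.sha256(f.read()).hexdigest()  # spot duplicate data sets
            catalog.append({
                "path": path,
                "bytes": stat.st_size,
                "modified": datetime.fromtimestamp(stat.st_mtime, timezone.utc).isoformat(),
                "sha256": digest,
            })
    return catalog

print(json.dumps(index_assets("/data/lake/staging"), indent=2))
```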

These actions set a data-driven culture in motion by giving teams the ability to access the right data at the right time. In turn, this gives them the confidence that all the data they share will only be viewed by appropriate teams.

Disillusioned with data? You’re not the only one

Even with cloud services and the right technical infrastructure, different teams are often reluctant to share their data. It’s all about trust. Most data owners are worried about a lack of data governance—the management of secure data—since they have no way of knowing who will use their data, or how they will use it. Data owners don’t want to take this risk, so they choose to hold onto their data, rather than share it or upload it into the data lake.

This can change. By shifting the focus away from restricting usage of data to enabling access, sharing and reuse, organizations will realize the positive value that good governance and strong security delivers to a data lake, which can then serve as an intelligent backbone of every decision and initiative a company undertakes.

Overall, the amount of data that enterprises need to collect and analyze will continue to grow unabated. If nothing is done differently, so will the problems associated with it. Instead, there needs to be a material change in the way people think of solving complex data problems. It starts by solving data findability, management and governance issues with a detailed data index. This way, data scientists can navigate through the deepest depths of their data lakes and unlock the value of organized and indexed data lakes—the foundation for AI innovation.

This article is published as part of the IDG Contributor Network.

Source: InfoWorld Big Data

Spark tutorial: Get started with Apache Spark

Apache Spark has become the de facto standard for processing data at scale, whether for querying large datasets, training machine learning models to predict future trends, or processing streaming data. In this article, we’ll show you how to use Apache Spark to analyze data in both Python and Spark SQL. And we’ll extend our code to support Structured Streaming, the current state of the art for handling streaming data within the platform. We’ll be using Apache Spark 2.2.0 here, but the code in this tutorial should also work on Spark 2.1.0 and above.

How to run Apache Spark

Before we begin, we’ll need an Apache Spark installation. You can run Spark in a number of ways. If you’re already running a Hortonworks, Cloudera, or MapR cluster, then you might have Spark installed already, or you can install it easily through Ambari, Cloudera Navigator, or the MapR custom packages.

If you don’t have such a cluster at your fingertips, then Amazon EMR or Google Cloud Dataproc are both easy ways to get started. These cloud services allow you to spin up a Hadoop cluster with Apache Spark installed and ready to go. You’ll be billed for compute resources with an extra fee for the managed service. Remember to shut the clusters down when you’re not using them!

Of course, you could instead download the latest release from spark.apache.org and run it on your own laptop. You will need a Java 8 runtime installed (Java 7 will work, but is deprecated). Although you won’t have the compute power of a cluster, you will be able to run the code snippets in this tutorial.
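
Once Spark is unpacked, a first run can be as small as the sketch below. It assumes the PySpark API that ships with the Spark download, and the CSV path and column names are placeholders for your own data; run it with bin/spark-submit, or paste the body into the interactive bin/pyspark shell, which creates the spark session for you.

```python
# Minimal first run with PySpark; the CSV path and columns are placeholders.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("first-run")
    .master("local[*]")   # use all local cores; drop this on a real cluster
    .getOrCreate()
)

df = spark.read.csv("data/sales.csv", header=True, inferSchema=True)
df.printSchema()

df.createOrReplaceTempView("sales")
spark.sql("SELECT product, COUNT(*) AS orders FROM sales GROUP BY product").show()

spark.stop()
```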

Source: InfoWorld Big Data