
End-to-end analytics with Azure Synapse

This example scenario shows how to use Azure Data Services to build a modern data platform that's capable of handling the most common data processing tasks in an organization.

The solution described in this article combines a range of Azure services that ingest, store, process, enrich, and serve data and insights from different sources (structured, semi-structured, unstructured, and streaming).

Relevant use cases

This approach can also be used to:

  • Establish an enterprise-wide data hub that consists of a data warehouse for structured data and a data lake for semi-structured and unstructured data. This data hub becomes the single source of truth for your reporting data.
  • Integrate relational data sources with other unstructured datasets by using big data processing technologies.
  • Use semantic modeling and powerful visualization tools for simpler data analysis.
  • Share datasets within the organization or with trusted external partners.

Architecture

Note

  • The services covered by this architecture are only a subset of a much larger family of Azure services. Similar outcomes can be achieved by using other services or features that are not covered by this design.
  • Specific business requirements for your analytics use case may also call for the use of other services or features that this design doesn't consider.

Use cases for analytics

The analytics use cases covered by the architecture are represented by the different data sources on the left side of the diagram. Data flows through the solution as follows (from bottom to top):

Azure Data Services: cloud-native HTAP with Azure Cosmos DB

  1. Azure Synapse Link for Azure Cosmos DB enables you to run near real-time analytics over operational data in Azure Cosmos DB, using the two analytics engines available in your Azure Synapse workspace: SQL (serverless) and Spark pools.

  2. With a SQL (serverless) query or a Spark pool notebook, you can access the Azure Cosmos DB analytical store and then combine datasets from your near real-time operational data with data from your data lake or from your data warehouse; a minimal Spark sketch follows this list.

  3. The resulting datasets from your SQL (serverless) queries can be persisted in your data lake. If you're using Spark notebooks, the resulting datasets can be persisted either in your data lake or in your data warehouse (SQL pool).

  4. Load relevant data from the Azure Synapse SQL pool into Power BI datasets to enable data visualization. Power BI models implement a semantic model to simplify the analysis of business data and relationships.

  5. Business analysts use Power BI reports and dashboards to analyze data and gain business insights.

  6. Data can also be securely shared with other business units or external trusted partners using Azure Data Share.
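
The following is a minimal PySpark sketch of step 2, as it might appear in a Synapse Spark pool notebook (where the `spark` session is predefined). The linked service name, container name, join column, and data lake paths are illustrative assumptions, not part of the original scenario.

```python
# Read near real-time operational data from the Azure Cosmos DB analytical
# store via Azure Synapse Link (no impact on the transactional workload).
orders = (spark.read
    .format("cosmos.olap")
    .option("spark.synapse.linkedService", "CosmosDbLinkedService")  # assumed name
    .option("spark.cosmos.container", "Orders")                      # assumed container
    .load())

# Combine the operational data with a curated dataset already in the data lake.
customers = spark.read.parquet(
    "abfss://curated@yourdatalake.dfs.core.windows.net/customers/")  # assumed path

enriched = orders.join(customers, on="customerId", how="left")       # assumed key

# Persist the resulting dataset back to the data lake (step 3).
enriched.write.mode("overwrite").parquet(
    "abfss://curated@yourdatalake.dfs.core.windows.net/orders_enriched/")
```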

Relational databases

  1. Azure Synapse pipelines are used to pull data from a wide variety of databases, both on-premises and in the cloud. Pipelines can be triggered based on a predefined schedule, in response to an event, or they can be called explicitly via REST APIs.

  2. Use a Copy Data activity in the Azure Synapse pipeline to stage the data copied from the relational databases into the raw zone of your Azure Data Lake Storage Gen2 data lake. You can save the data in delimited text format or compressed as Parquet files.

  3. Use either data flows, SQL (serverless) queries, or Spark notebooks to validate, transform, and move the datasets into your curated zone in your data lake; a Spark sketch of this step follows the list.

    1. As part of your data transformations, you can invoke machine learning models from your SQL pools by using standard T-SQL or Spark notebooks. These ML models can be used to enrich your datasets and generate further business insights, and they can come from Azure Cognitive Services or be custom ML models built in Azure ML.
  4. You can serve your final dataset directly from the data lake's curated zone, or you can use the Copy Data activity to ingest the final dataset into your SQL pool tables, using the COPY statement for fast ingestion.

  5. Load relevant data from the Azure Synapse SQL pool into Power BI datasets to enable data visualization. Power BI models implement a semantic model to simplify the analysis of business data and relationships.

  6. Business analysts use Power BI reports and dashboards to analyze data and gain business insights.

  7. Data can also be securely shared with other business units or external trusted partners using Azure Data Share.
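
As a concrete illustration of step 3, here is a minimal Spark notebook sketch that reads the delimited text staged in the raw zone, applies a simple cleanup, and writes Parquet into the curated zone. The storage account, container names, folder, and column names are assumptions.

```python
from pyspark.sql import functions as F

# Read the delimited text files staged in the raw zone by the Copy Data activity.
raw_sales = (spark.read
    .option("header", True)
    .option("inferSchema", True)
    .csv("abfss://raw@yourdatalake.dfs.core.windows.net/sales/"))  # assumed path

# Minimal cleanup: drop duplicates, normalize a date column, filter invalid rows.
curated_sales = (raw_sales
    .dropDuplicates()
    .withColumn("order_date", F.to_date("order_date"))  # assumed column
    .filter(F.col("amount") > 0))                       # assumed column

# Write the curated dataset as Parquet into the curated zone of the data lake.
(curated_sales.write
    .mode("overwrite")
    .parquet("abfss://curated@yourdatalake.dfs.core.windows.net/sales/"))
```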

Semi-structured data sources

  1. Azure Synapse pipelines are used to pull data from a wide variety of semi-structured data sources, both on-premises and in the cloud. For example:

    • You can collect data from file-based sources that contain CSV or JSON files.
    • You can connect to NoSQL databases such as Cosmos DB or MongoDB.
    • You can call REST APIs provided by SaaS applications that act as a data source for the pipeline.
  2. Use a Copy Data activity in the Azure Synapse pipeline to stage the data copied from the semi-structured data sources into the raw zone of your Azure Data Lake Storage Gen2 data lake. Save the data in its original format, as acquired from the data sources.

  3. Use either data flows, SQL (serverless) queries, or Spark notebooks to validate, transform, and move your datasets into your curated zone in your data lake. SQL (serverless) queries make underlying CSV, Parquet, or JSON files available as external tables, so they can be queried by using T-SQL; a sketch of such a query follows this list.

    1. As part of your data transformations, you can invoke machine learning models from your SQL pools by using standard T-SQL or Spark notebooks. These ML models can be used to enrich your datasets and generate further business insights, and they can come from Azure Cognitive Services or be custom ML models built in Azure ML.
  4. You can serve your final dataset directly from the data lake's curated zone, or you can use the Copy Data activity to ingest the final dataset into your SQL pool tables, using the COPY statement for fast ingestion.

  5. Load relevant data from the Azure Synapse SQL pool into Power BI datasets to enable data visualization. Power BI models implement a semantic model to simplify the analysis of business data and relationships.

  6. Business analysts use Power BI reports and dashboards to analyze data and gain business insights.

  7. Data can also be securely shared with other business units or external trusted partners using Azure Data Share.
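
To make the SQL (serverless) option in step 3 concrete, the sketch below issues an OPENROWSET query against staged Parquet files from Python via pyodbc. The workspace name, database, authentication method, and file path are assumptions.

```python
import pyodbc

# Connect to the serverless SQL endpoint of the Synapse workspace
# (server name and authentication method are assumptions for this sketch).
conn = pyodbc.connect(
    "Driver={ODBC Driver 18 for SQL Server};"
    "Server=yourworkspace-ondemand.sql.azuresynapse.net;"
    "Database=master;"
    "Authentication=ActiveDirectoryInteractive;"
)

# OPENROWSET exposes the staged files as a relational rowset, so
# semi-structured data can be queried in place with T-SQL.
query = """
SELECT TOP 10 *
FROM OPENROWSET(
    BULK 'https://yourdatalake.dfs.core.windows.net/raw/events/*.parquet',
    FORMAT = 'PARQUET'
) AS rows;
"""

for row in conn.cursor().execute(query):
    print(row)
```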

Unstructured data sources

  1. Azure Synapse pipelines are used to pull data from a wide variety of unstructured data sources, both on-premises and in the cloud. For example:

    • You can capture video, images, audio, or free text from file-based sources that contain the source files.
    • You can call REST APIs provided by SaaS applications that act as a data source for the pipeline.
  2. Use a Copy Data activity in the Azure Synapse pipeline to stage the data copied from the unstructured data sources into the raw zone of your Azure Data Lake Storage Gen2 data lake. Save the data in its original format, as acquired from the data sources.

  3. Use Spark notebooks to validate, transform, enrich, and move your datasets into your curated zone in your data lake.

    1. As part of your data transformations, you can invoke machine learning models from your SQL pools by using standard T-SQL or Spark notebooks. These ML models can be used to enrich your datasets and generate further business insights, and they can come from Azure Cognitive Services or be custom ML models built in Azure ML; a hedged sketch of a Cognitive Services call follows this list.
  4. You can serve your final dataset directly from the data lake's curated zone, or you can use the Copy Data activity to ingest the final dataset into your data warehouse tables, using the COPY statement for fast ingestion.

  5. Load relevant data from the Azure Synapse SQL pool into Power BI datasets to enable data visualization. Power BI models implement a semantic model to simplify the analysis of business data and relationships.

  6. Business analysts use Power BI reports and dashboards to analyze data and gain business insights.

  7. Data can also be securely shared with other business units or external trusted partners using Azure Data Share.
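
As a hedged sketch of the enrichment described in step 3.1, the snippet below scores free text with the Cognitive Services sentiment API via the azure-ai-textanalytics package. The endpoint, key, and sample documents are assumptions; in a real deployment the key would come from Azure Key Vault rather than being hard-coded.

```python
from azure.core.credentials import AzureKeyCredential
from azure.ai.textanalytics import TextAnalyticsClient

# Endpoint and key are placeholders; retrieve the key from Azure Key Vault
# in practice instead of embedding it in code.
client = TextAnalyticsClient(
    endpoint="https://your-language-resource.cognitiveservices.azure.com/",
    credential=AzureKeyCredential("<your-key>"),
)

documents = [
    "The delivery was fast and the product works great.",  # sample free text
    "Support never answered my ticket.",
]

# Enrich the unstructured text with sentiment labels from Cognitive Services.
for doc, result in zip(documents, client.analyze_sentiment(documents)):
    if not result.is_error:
        print(doc, "->", result.sentiment)
```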

Streaming

  1. Use Azure Event Hubs or Azure IoT Hub to ingest data streams generated by client applications or IoT devices. Event Hubs or IoT Hub then ingests and stores the streaming data while preserving the sequence of received events. Consumers can then connect to the Event Hubs or IoT Hub endpoints and retrieve messages for processing; a minimal producer sketch follows this list.

  2. Configure Event Hubs Capture or IoT Hub storage endpoints to save a copy of the events in the raw zone of your Azure Data Lake Storage Gen2 data lake. This feature implements the "cold path" of the Lambda architecture pattern and lets you perform historical and trend analysis on the stream data saved in your data lake, by using SQL (serverless) queries or Spark notebooks and following the pattern described above for semi-structured data sources.

  3. Use a Stream Analytics job to implement the "hot path" of the Lambda architecture pattern and derive insights from the stream data in transit. Define at least one input for the data stream coming from your Event Hubs or IoT Hub, one query to process the input data stream, and one Power BI output for where the query results will be sent.

    1. As part of your data processing with Stream Analytics, you can invoke machine learning models to enrich your stream datasets and drive business decisions based on the generated predictions. These machine learning models can come from Azure Cognitive Services or be custom ML models in Azure Machine Learning.
  4. Business analysts then use real-time Power BI datasets and dashboard capabilities to visualize the rapidly changing insights generated by your Stream Analytics query.
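
The sketch below illustrates step 1 from the producer side: a small Python client that sends telemetry events to an event hub with the azure-eventhub package. The connection string, hub name, and payload shape are assumptions.

```python
import json
from azure.eventhub import EventHubProducerClient, EventData

# Connection string and event hub name are placeholders for this sketch.
producer = EventHubProducerClient.from_connection_string(
    conn_str="<your-event-hubs-connection-string>",
    eventhub_name="telemetry",
)

# Send a small batch of telemetry events; Event Hubs preserves the order
# of events within a partition for downstream consumers.
with producer:
    batch = producer.create_batch()
    for reading in ({"deviceId": "sensor-1", "temp": 21.5},
                    {"deviceId": "sensor-1", "temp": 21.7}):
        batch.add(EventData(json.dumps(reading)))
    producer.send_batch(batch)
```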

Discover and govern

Data governance is a common challenge in large enterprise environments. On the one hand, business analysts need to be able to discover and understand data assets that can help them solve business problems. On the other hand, Chief Data Officers want insights into the privacy and security of business data.

Azure Purview

  1. Use Azure Purview for data discovery and for governance insights into your data assets, data classifications, and sensitivity labels across the entire data landscape of the company.

  2. Azure Purview can help you maintain a business glossary with the specific business terminology required for users to understand the semantics of what datasets mean and how they are meant to be used across the organization.

  3. You can register all of your data sources and set up regular scans to automatically catalog and update relevant metadata about data assets in the company. Azure Purview can also automatically add data lineage information based on information from Azure Data Factory or Azure Synapse pipelines.

  4. Data classifications and sensitivity labels can be added automatically to your data assets, based on preconfigured or custom rules applied during the regular scans.

  5. Data governance experts can leverage the reports and insights generated by Azure Purview to maintain control of the entire data landscape and protect the company from security and privacy issues.

Platform services

To improve the quality of your Azure solutions, follow the recommendations and guidelines defined in the five pillars of architecture excellence of the Azure Well-Architected Framework: Cost Optimization, Operational Excellence, Performance Efficiency, Reliability, and Security.

According to these recommendations, the following services should be considered as part of the design:

  1. Azure Active Directory: Identity services, single sign-on (SSO), and multi-factor authentication (MFA) across Azure workloads.
  2. Azure Cost Management: Financial governance over your Azure workloads.
  3. Azure Key Vault: Secure management of credentials and certificates. For example, Azure Synapse pipelines, Azure Synapse Spark pools, and Azure ML can retrieve credentials and certificates from Azure Key Vault that are used to securely access data stores; a minimal sketch follows this list.
  4. Azure Monitor: Collect, analyze, and respond to telemetry from your Azure resources to proactively identify issues and maximize performance and reliability.
  5. Azure Security Center: Strengthen and monitor the security status of your Azure workloads.
  6. Azure DevOps & GitHub: Implement DevOps practices to enforce automation and compliance for your workload development and delivery pipelines for Azure Synapse and Azure ML.
  7. Azure Policy: Implement organizational standards and governance for resource consistency, regulatory compliance, security, cost, and management.
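
As a minimal sketch of the Key Vault pattern mentioned in item 3, the snippet below retrieves a secret with the azure-identity and azure-keyvault-secrets packages. The vault URL and secret name are assumptions; DefaultAzureCredential resolves to a managed identity when the code runs inside Azure (for example, in a Synapse Spark pool).

```python
from azure.identity import DefaultAzureCredential
from azure.keyvault.secrets import SecretClient

# Vault URL and secret name are placeholders for this sketch.
client = SecretClient(
    vault_url="https://your-key-vault.vault.azure.net/",
    credential=DefaultAzureCredential(),  # managed identity inside Azure
)

# Retrieve a storage credential at runtime instead of embedding it in code.
storage_key = client.get_secret("datalake-access-key").value
```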

Components of the architecture

The following Azure services were used in the architecture:

  • Azure Synapse Analytics
  • Azure Data Lake Storage Gen2
  • Azure Cosmos DB
  • Azure Cognitive Services
  • Azure Machine Learning
  • Azure Event Hubs
  • Azure IoT Hub
  • Azure Stream Analytics
  • Azure Purview
  • Azure Data Share
  • Microsoft Power BI
  • Azure Active Directory
  • Azure Cost Management
  • Azure Key Vault
  • Azure Monitor
  • Azure Security Center
  • Azure DevOps
  • Azure Policy
  • GitHub


Considerations

The technology components of this architecture were chosen because each of them provides the functionality required to handle the most common data tasks in an organization. These services meet the requirements for scalability and availability, while helping to control costs. The services covered by this architecture are only a subset of a much larger family of Azure services. Similar outcomes can be achieved by using other services or features that are not covered by this design.

Specific business requirements for your analytics use case may also call for the use of other services or features that this design doesn't consider.

A similar architecture can also be implemented for preproduction environments in which you can develop and test your workloads. Consider the specific requirements of your workloads and the capabilities of each service for a cost-effective preproduction environment.

Pricing

In general, use the Azure pricing calculator to estimate your costs. The ideal individual pricing tier and the total overall cost of each service included in the architecture depend on the amount of data to be processed and stored, and on the acceptable performance level expected. Use the guide below to learn more about how each service is priced:

  • The serverless architecture of Azure Synapse Analytics allows you to scale your compute and storage levels independently. Compute resources are charged based on usage, and you can scale or pause these resources on demand. Storage resources are billed per terabyte, so your costs grow as you ingest more data.

  • Azure Data Lake Storage Gen2 is charged based on the amount of data stored and on the number of transactions to read and write data.

  • Azure Event Hubs and Azure IoT Hub are charged based on the amount of compute resources required to process your message streams.

  • Azure Machine Learning charges are based on the amount of compute resources used to train and deploy your machine learning models.

  • Cognitive Services are billed based on the number of calls you make to the service APIs.

  • Azure Purview is priced based on the number of data assets in the catalog and the amount of compute required to scan them.

  • Azure Stream Analytics is charged based on the number of streaming units required to process your stream queries.

  • Power BI has different product options for different requirements. Power BI Embedded provides an Azure-based option for embedding Power BI functionality into your applications.

  • Azure Cosmos DB is priced based on the amount of storage and compute resources required by your databases.
