Big data – specifically the insights it offers – is transforming organizations across all industries.
Big data analytics is enabling healthcare professionals to make better medical decisions while delivering an ever-increasing quality of patient care. It’s allowing educators to predict learning outcomes and help students find colleges and majors that fit their interests and skills. It’s providing marketers with information to create better customer experiences and facilitate customer loyalty.
Unfortunately, developing applications that employ or enable big data analytics is far from easy. In fact, big data analytics and management, in general, can be extremely complex. That’s largely due to the ever-increasing amount of data. Much of it is unstructured and entails different preparation, processing, storage, and governance than structured data.
Working with big data also requires special expertise, which is often outside the capabilities of many developers and IT teams. Infrastructure, application architecture, and cost considerations come into play as well.
There’s no single, set-in-stone way to deal with big data when developing apps. However, understanding some of the best practices used to architect successful big data analytics projects is a good place to start.
There are various architectural models that can be used. Among them: traditional big data architecture, data streaming architecture, Lambda architecture, Kappa architecture, and Unified architecture. Each has its own pros and cons; the use case will typically influence which one should be utilized. For the purposes of this blog, we’ll focus on traditional big data architecture.
The Architecture Layers
Working with big data requires an architecture that is purposefully designed to ingest, process, and analyze large, complex datasets that traditional database systems can’t handle. Multiple layers are required to address the variety of tasks entailed.
The specific layers vary by model, but the simplest typically includes three: a data ingestion layer, a storage layer, and a processing layer.
Data doesn’t just magically appear. It must be gathered, often from numerous, diverse sources. From there, some form of pipeline must move it from all the places where it’s generated to where it will be stored. This data ingestion process involves three overarching steps, extract, transform, and load (ETL), which must be covered in the app architecture:
- Extraction – taking the data from its source
- Transformation – cleansing the data for business use
- Loading – moving the data into a database, data warehouse, or data lake
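To make the three steps concrete, here is a minimal sketch of an ETL pipeline in Python. It is an illustration under stated assumptions, not a recommended production design: the sample CSV, the `payments` table, and the function names are all hypothetical, and an in-memory SQLite database stands in for whatever database, data warehouse, or data lake the real pipeline would target.

```python
import csv
import io
import sqlite3

# Hypothetical raw export; in practice this might come from a file, an API, or a queue.
RAW_CSV = """user_id,email,amount
1, Alice@Example.com ,10.50
2,bob@example.com,not-a-number
3,carol@example.com,7
"""


def extract(raw_text):
    """Extraction: read rows from the source as dictionaries."""
    return list(csv.DictReader(io.StringIO(raw_text)))


def transform(rows):
    """Transformation: normalize fields and drop rows that fail cleansing."""
    cleaned = []
    for row in rows:
        try:
            amount = float(row["amount"])
        except ValueError:
            continue  # skip rows whose amount cannot be parsed
        email = row["email"].strip().lower()
        cleaned.append((int(row["user_id"]), email, amount))
    return cleaned


def load(records, connection):
    """Loading: write the cleansed records into the target store."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS payments "
        "(user_id INTEGER, email TEXT, amount REAL)"
    )
    connection.executemany("INSERT INTO payments VALUES (?, ?, ?)", records)
    connection.commit()


def run_pipeline(raw_text, connection):
    """Run all three steps and return (row count, total amount) as a sanity check."""
    load(transform(extract(raw_text)), connection)
    return connection.execute(
        "SELECT COUNT(*), SUM(amount) FROM payments"
    ).fetchone()
```

Calling `run_pipeline(RAW_CSV, sqlite3.connect(":memory:"))` loads the two valid rows and skips the one with an unparseable amount. Keeping each step in its own function makes it easier to swap the source or the target later without touching the cleansing logic.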
Decisions made for each step regarding technologies and other factors will depend on whether the data is structured, unstructured, or semi-structured; the data processing requirements; and storage and budget considerations. Who will be using the data also comes into play.
It’s important to automate as much of the process as possible, both for speed and efficiency and to facilitate adherence to any governance protocols. Using tools that identify problems in the data collection process is also recommended.
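As a rough illustration of what an automated check on the collection process might look like, the sketch below inspects one ingested batch for missing fields and duplicate identifiers. The field names in `REQUIRED_FIELDS` and the `check_batch` function are assumptions made for this example; dedicated data quality tools cover far more than this.

```python
from collections import Counter

# Hypothetical schema: the fields every ingested record is expected to carry.
REQUIRED_FIELDS = ("event_id", "timestamp", "source")


def check_batch(records):
    """Return a list of human-readable problems found in one ingested batch."""
    problems = []
    if not records:
        problems.append("batch is empty")
        return problems

    # Flag records with missing or blank required fields.
    for index, record in enumerate(records):
        missing = [f for f in REQUIRED_FIELDS if record.get(f) in (None, "")]
        if missing:
            problems.append(f"record {index} is missing: {', '.join(missing)}")

    # Flag identifiers that appear more than once in the batch.
    id_counts = Counter(r.get("event_id") for r in records if r.get("event_id"))
    for event_id, count in id_counts.items():
        if count > 1:
            problems.append(f"event_id {event_id} appears {count} times")

    return problems
```

A check like this could run automatically after every ingestion job, with a non-empty result triggering an alert or holding the batch for review.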
Note: in some big data architectures, the data collection and data ingestion layers are separated out.
The storage layer serves two general purposes in a big data architecture: storing the data for both short- and long-term use and making it available for consumption either in batch or streaming modes.
Traditional databases generally work best in monolithic environments where the data is generated by a single app. Data warehouses support large volumes of data from multiple sources, making them more suitable for big data apps. However, data lakes are often preferred because they’re better suited to dealing with unstructured data. They can store different data sets in their native formats and are typically built on technologies like Hadoop, NoSQL databases, and cloud object storage services, often paired with processing engines such as Spark.
The data storage layer is often split into two components. One handles storage for archive and persistent data. This is where data is stored for days, months, years, or even decades. Object storage, offered as a service by cloud providers, is often used for this purpose as it allows for cost-effectively storing all kinds of data, and supports fast reading of large volumes of data.
The other component of the storage layer handles the storage required for streaming use cases, which entail low-latency read/write operations on individual messages. It’s typically associated with Apache Kafka, but there are numerous options on the market.
The cost of this type of storage is significantly higher than that of object storage. For cost-efficiency purposes, it’s advisable to configure a data retention policy, so that data is stored only for a predetermined amount of time in the more expensive storage. It is then automatically transferred to a less expensive tier of storage or deleted.
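The retention idea can be sketched as a simple age-based rule. The thresholds and the `retention_action` function below are hypothetical, and the example assumes a two-stage policy (move to a cheaper tier first, delete later); managed storage services usually let you declare rules like this in configuration rather than code.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical thresholds; real values depend on the use case and budget.
HOT_RETENTION = timedelta(days=7)        # how long data stays in expensive storage
ARCHIVE_RETENTION = timedelta(days=365)  # how long data is kept at all


def retention_action(created_at, now):
    """Decide what to do with a record based on its age."""
    age = now - created_at
    if age <= HOT_RETENTION:
        return "keep"
    if age <= ARCHIVE_RETENTION:
        return "move_to_archive"
    return "delete"


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    for days in (1, 30, 400):
        print(days, retention_action(now - timedelta(days=days), now))
```

Passing `now` in as a parameter, rather than reading the clock inside the function, keeps the rule easy to test and to replay against historical data.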
In general, the storage layer should be flexible so that different storage levels can be used to optimize costs. It should also be scalable so storage capacity can be easily changed. And it should be performant, so that large volumes of data can be read from object storage with high enough throughput and individual messages can be read and written with low latency when needed.
The processing layer is often considered the most important layer in big data architectures. That’s because this is where the real action takes place and where the value of big data is delivered. Generally, the data processing layer performs parallel computing, data cleansing, data integration, data fusion, data indexing, virtualization, and other tasks.
Depending on the specific architecture model, it may also be where big data analytics takes place. Note: in some big data architecture models, data processing and analytics – and sometimes even visualization – are separated out into their own layers.
While batch processing functions have long dominated the processing layer, real-time data messaging and streaming capabilities are becoming more affordable and, consequently, more common. As such, it’s important for big data app architecture to accommodate both batch and streaming processing.
For example, some apps may benefit from the use of stream processing and analytics solutions, such as Kafka Streams and Apache Storm, that allow for direct analysis of messages in real time. The analysis can be rule-based or use advanced analytics to extract events or signals. Historical data is often integrated into analytics to compare patterns, making these capabilities valuable for recommendation and prediction engines.
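To illustrate rule-based analysis that compares each incoming message against recent history, here is a small, framework-free sketch. The `SpikeDetector` class, its window size, and its threshold are assumptions made for this example; a real deployment would typically express similar logic inside a stream processing engine rather than in plain Python.

```python
from collections import deque


class SpikeDetector:
    """Flag a value when it exceeds a multiple of the recent average."""

    def __init__(self, window_size=5, threshold=2.0):
        # Keep only the most recent values as the historical baseline.
        self.history = deque(maxlen=window_size)
        self.threshold = threshold

    def process(self, value):
        """Return True if `value` is a spike relative to the stored history."""
        is_spike = False
        # Only apply the rule once the window holds enough history.
        if len(self.history) == self.history.maxlen:
            baseline = sum(self.history) / len(self.history)
            is_spike = value > self.threshold * baseline
        self.history.append(value)
        return is_spike


if __name__ == "__main__":
    detector = SpikeDetector(window_size=3, threshold=2.0)
    for reading in [10, 12, 11, 30, 12]:
        print(reading, detector.process(reading))
```

Each message is evaluated as it arrives, and only a small rolling window is held in memory, which is the basic trade-off that separates this style of processing from a batch job over the full data set.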
It’s also important to note that not all data has to be analyzed by humans. Machine learning algorithms and AI tools can process big data volumes that data science teams couldn’t handle on their own. These should be accommodated in the architecture as well.
Other Big Data Analytics Architecture Considerations
While a simple three-layer architecture model will work for many app development projects, chances are that something more sophisticated will be required to handle all the tasks and complexities inherent in big data applications. Separate layers may need to be designated for data collection, data ingestion, storage, processing, data query, data analytics, data visualization, data security, and data monitoring.
Other considerations include selecting components, tools, and technologies for each layer that balance current requirements, future needs, costs, and expected returns. The use case and business requirements will influence the specific choices, which will likely span these categories:
- Business intelligence and data visualization software
- Cloud platforms for processing and storage
- Data governance and data security tools
- Data lakes and data warehouses
- Extract, transform, and load tools
If legacy systems are involved, particularly in terms of data generation, there may be issues to contend with regarding integration. There’s also the matter of complying with regulatory requirements, privacy standards, and best practices.
Special expertise is required for dealing with big data apps. Experience in agile development is a must. If you don’t have it in-house, you’ll need to recruit for it or consider outsourcing. It’s also essential to have staff with the necessary skill sets to assess options, develop and mature the architecture, and ultimately manage the deployed technologies.
It’s important to decouple systems so that new tools and technologies can be integrated without major disruption. Equally important is implementing a data governance program to ensure that the data is well secured, complete for the planned use cases, and trusted by users.
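One common way to decouple components is to have pipeline code depend on a small interface rather than on a specific tool. The sketch below shows the idea; `RecordSink`, `InMemorySink`, `CountingSink`, and `ingest` are hypothetical names invented for this example, and the two sinks stand in for real storage or messaging technologies.

```python
from typing import Iterable, Protocol


class RecordSink(Protocol):
    """The only contract the pipeline relies on."""

    def write(self, record: dict) -> None: ...


class InMemorySink:
    """Stand-in for one storage technology: keeps every record."""

    def __init__(self) -> None:
        self.records: list = []

    def write(self, record: dict) -> None:
        self.records.append(record)


class CountingSink:
    """Stand-in for a different technology: only tracks how many records arrived."""

    def __init__(self) -> None:
        self.count = 0

    def write(self, record: dict) -> None:
        self.count += 1


def ingest(records: Iterable[dict], sink: RecordSink) -> None:
    """Pipeline code that works with any sink implementing `write`."""
    for record in records:
        sink.write(record)
```

Because `ingest` knows nothing about the sink beyond its `write` method, replacing one technology with another means adding a new sink class rather than rewriting the pipeline.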
Next Steps in Big Data Analytics
If it seems like there’s a lot involved in developing big data applications, that’s because there is. The information covered in this blog barely scratches the surface. That’s why many organizations opt to work with a partner experienced in developing apps that enable or employ big data, whether they’re just starting their big data journey or are mature users.
If a big data project is in your organization’s future, consider teaming up with ClearScale. We offer the full range of scalable, efficient, and cost-effective data and analytics services, including architecture design and infrastructure migration, data integration, systems integration, automation, management, and application development using AWS big data services.