Building Scalable, AI-ready IT Infrastructure
Kenneth Tan, PhD, executive director of Sardina Systems, delves into the growing necessity for companies to adeptly integrate AI into their operations, offering strategic insights on building robust AI infrastructure to unlock its full potential for business innovation and efficiency.
In recent years, the rapid expansion of artificial intelligence (AI) has revolutionized the business world, creating a demand for AI capabilities that often surpass existing organizational strategies. This has led companies to seek effective ways to integrate AI into their operations.
How and where should one start with AI infrastructure, focusing on a long-term strategy to maximize AI’s potential?
Companies are increasingly automating departments with AI and integrating new AI features into their applications. Despite the growth in AI adoption, as highlighted by McKinsey’s research showing a significant increase since 2017, AI tools are still used largely for personal tasks and the optimization of daily routines.
At the same time, AI is quickly developing at an enterprise level. Our clients, including those offering AI as a service and those embedding AI into their products, illustrate the market’s demand for easily integrated AI tools and the development of innovative services, such as providing researchers with access to computational resources.
This article addresses both service providers and product incorporators and underscores the importance of a well-planned, scalable IT infrastructure to avoid unnecessary expenses and obsolescence due to rapidly evolving technologies. Beginning the journey towards AI infrastructure integration requires understanding current capabilities and strategically planning for future needs. We outline five essential steps to navigate this process effectively.
1. Current Infrastructure Assessment
A strategic approach to cloud and overall IT infrastructure is imperative. Changes, especially to core IT infrastructure, require a comprehensive evaluation of the business model and anticipated workloads, perhaps years in advance, highlighting the critical importance of meticulous planning in the era of AI.
The initial step in this transformation involves conducting an in-depth assessment of the current IT infrastructure, pinpointing strengths, weaknesses, and gaps concerning AI requirements. This assessment should encompass hardware (e.g., servers, storage, networks), software (e.g., databases, application platforms), and existing data management practices. Understanding the present state of the IT infrastructure is essential for planning necessary upgrades or changes for AI adoption.
Several concepts and frameworks traditionally assist organizations in assessing their IT infrastructure. Utilizing such methodologies can offer structured approaches for evaluating the effectiveness, efficiency, security, and alignment of IT systems with business objectives:
- Among the classic methodologies, ITIL (Information Technology Infrastructure Library) is particularly relevant, as the integration of AI should be closely aligned with the business strategy. Developed by the UK Government and regularly updated, ITIL offers a detailed framework for IT service management to align IT services with business needs, covering service design to improvement. It encourages a flexible approach to IT management, with ITIL 4 providing significant advancements. Integrating management with technology is essential; a mere technical focus can miss broader impacts, especially in AI integration.
- Another relevant one is COBIT, by the Information Systems Audit and Control Association, which provides a comprehensive framework for enterprise IT management and governance, ensuring alignment with business objectives, risk management, and performance optimization. Utilizing COBIT ensures that IT upgrades enhance efficiency and are aligned with best practices for automation and management.
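Frameworks aside, much of the initial data gathering can be automated. Below is a minimal Python sketch, using the psutil library, that captures a per-host hardware snapshot to seed such an assessment; the fields and output format are illustrative choices, not part of ITIL or COBIT.

```python
# A minimal inventory sketch using psutil; the selected fields and the
# JSON output format are illustrative assumptions, not a formal standard.
import json
import platform

import psutil


def inventory_snapshot() -> dict:
    """Collect basic hardware facts to seed an infrastructure assessment."""
    vm = psutil.virtual_memory()
    disk = psutil.disk_usage("/")
    return {
        "host": platform.node(),
        "os": platform.platform(),
        "cpu_physical_cores": psutil.cpu_count(logical=False),
        "cpu_logical_cores": psutil.cpu_count(logical=True),
        "ram_total_gb": round(vm.total / 1024**3, 1),
        "disk_total_gb": round(disk.total / 1024**3, 1),
        "disk_free_pct": round(100 - disk.percent, 1),
    }


if __name__ == "__main__":
    print(json.dumps(inventory_snapshot(), indent=2))
```

Running this across a fleet gives a baseline of strengths and gaps (cores, memory, storage headroom) against which AI hardware requirements can be compared.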
2. Computational Power and High-performance Processors
AI, and deep learning in particular, demands processors with significant computational capabilities, such as Graphics Processing Units (GPUs), Tensor Processing Units (TPUs), or Field-Programmable Gate Arrays (FPGAs).
The hardware market is in a state of constant evolution, with new processors and accelerators designed for AI emerging globally. One such recent innovation is Intel’s Gaudi 3 processor. While it may be premature to adopt new products hastily, it is undoubtedly crucial to monitor the market closely, stay informed about novel solutions, and thoroughly explore this sector. Key questions to consider include how these new offerings differ qualitatively, what software is capable of supporting such infrastructure, and whether the new solutions can address the current challenges faced by your system.
GPUs are the most popular choice today due to their widespread commercial availability and high performance. When selecting a GPU for AI infrastructure, it’s important to consider the specific requirements of your workload. This includes determining whether the focus is on training or inference, assessing the size and complexity of your models, considering budget constraints, and evaluating the software ecosystem.
NVIDIA’s GPUs, especially the A100, H100, or H200, are highly favored in the industry for their performance, comprehensive software support, and specialized AI acceleration features. Nevertheless, AMD and Intel are becoming increasingly viable alternatives, particularly in scenarios where their unique features or cost-effectiveness present clear benefits.
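For teams already standardized on PyTorch, a quick way to ground this evaluation is to enumerate what is actually available on a given node. The sketch below uses PyTorch’s CUDA APIs; treat it as a starting point for matching hardware to workloads, not a full capacity audit.

```python
# Enumerate visible CUDA devices and their key properties before placing
# a training or inference workload.
import torch

if torch.cuda.is_available():
    for i in range(torch.cuda.device_count()):
        props = torch.cuda.get_device_properties(i)
        mem_gb = props.total_memory / 1024**3
        print(f"GPU {i}: {props.name}, {mem_gb:.0f} GB, "
              f"{props.multi_processor_count} SMs, "
              f"compute capability {props.major}.{props.minor}")
else:
    print("No CUDA device visible; falling back to CPU.")
```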
Another critical consideration is the ability to scale resources according to workload demands. This scalability is essential for managing costs and maintaining efficiency throughout the various AI model development and deployment stages.
To illustrate the critical role of scalability in AI operations, consider the regular seasonal demand faced by major e-commerce platforms. Amazon uses AI to optimize various aspects of its operations, including inventory management, personalized recommendations, and logistics. During peak shopping periods like Black Friday or Cyber Monday, the demand on Amazon’s systems surges dramatically. To handle this, Amazon leverages cloud computing platforms that allow for dynamic scaling of resources.
This scalability is crucial both for handling increased traffic and for managing costs. By scaling resources up only when needed and scaling back down when demand wanes, Amazon can efficiently manage costs without sacrificing performance.
This approach ensures that the infrastructure is not underutilized during off-peak times and overburdened during high-demand periods, maintaining an optimal balance that supports continuous AI-driven innovation and customer satisfaction.
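As a rough illustration of this scale-up/scale-down logic, here is a minimal Python sketch. The thresholds and the doubling/halving rule are illustrative assumptions; production systems would normally delegate this loop to a managed cloud autoscaler or a Kubernetes horizontal pod autoscaler.

```python
# A minimal threshold-based autoscaling sketch; the policy values and the
# doubling/halving strategy are hypothetical, chosen for clarity.
from dataclasses import dataclass


@dataclass
class ScalingPolicy:
    min_replicas: int = 2
    max_replicas: int = 20
    scale_up_at: float = 0.75    # utilization that triggers scale-up
    scale_down_at: float = 0.30  # utilization that allows scale-down


def desired_replicas(current: int, utilization: float, p: ScalingPolicy) -> int:
    """Scale up under load, scale back down when demand wanes."""
    if utilization > p.scale_up_at:
        return min(current * 2, p.max_replicas)
    if utilization < p.scale_down_at:
        return max(current // 2, p.min_replicas)
    return current


# Example: a Black Friday-style surge pushes utilization to 90%.
print(desired_replicas(current=4, utilization=0.90, p=ScalingPolicy()))  # -> 8
```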
3. High-volume Storage and Management
AI systems require the capability to store and manage substantial volumes of data. This necessitates fast, scalable storage solutions, such as object storage for unstructured data and high-performance databases for structured data.
Fast and reliable access to this data is imperative for the effective training of AI models. Achieving this may involve utilizing advanced data caching, deploying high-bandwidth networking, and implementing efficient data retrieval systems. Ceph, as a storage solution, exemplifies flexibility and efficiency in handling large data volumes. It ensures compatibility with existing applications and facilitates integration with cloud platforms, making it a cost-effective choice.
Moreover, Ceph significantly reduces the expenses associated with storing enterprise data and supports organizations in managing exponential data growth. By running on commodity server hardware, Ceph can lower capital expenditures (CapEx) and operational expenditures (OpEx) through its self-managing and self-healing capabilities.
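Because Ceph’s RADOS Gateway exposes an S3-compatible API, existing S3 tooling generally works against it. A hedged sketch using boto3 follows; the endpoint URL, credentials, and bucket name are placeholders for your own deployment.

```python
# Writing training data to Ceph via its S3-compatible RADOS Gateway.
# The endpoint, credentials, and bucket name below are hypothetical.
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="http://ceph-rgw.example.internal:7480",  # placeholder RGW endpoint
    aws_access_key_id="CEPH_ACCESS_KEY",
    aws_secret_access_key="CEPH_SECRET_KEY",
)

s3.create_bucket(Bucket="training-data")
s3.upload_file("dataset.tar", "training-data", "datasets/dataset.tar")

# List what was stored.
for obj in s3.list_objects_v2(Bucket="training-data").get("Contents", []):
    print(obj["Key"], obj["Size"])
```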
An alternative way to provide mass, high-capacity data storage efficiently is NVMe over Fabrics (NVMe-oF). It is not the product of a single manufacturer but a standard that various companies can use to create their own products and solutions.
NVMe-oF can significantly aid in building powerful and cost-effective data storage systems, especially where high performance and scalability are required. It extends NVMe SSD advantages like low latency and high data transfer speeds over network connections. This allows remote systems to access data nearly as quickly as if the storage devices were connected locally via PCIe, ideal for high-demand applications like databases, large computational workloads, and real-time big data processing.
Another advantage is the easy scaling of storage systems by adding more NVMe devices to the network without performance loss, allowing organizations to meet growing data storage needs without a complete infrastructure overhaul.
Although NVMe devices may be more expensive than traditional SATA SSDs or HDDs, using NVMe-oF can create more efficient and faster storage systems. Savings are achieved through better performance per device, which can reduce the total number of devices needed and lower maintenance and energy costs.
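As a concrete illustration, attaching a remote NVMe-oF namespace over TCP typically takes only a discovery step and a connect step with the standard nvme-cli tool. The Python wrapper below is a sketch; the target address and subsystem NQN are placeholders, and 4420 is the conventional NVMe/TCP port.

```python
# A sketch of attaching a remote NVMe-oF/TCP namespace using nvme-cli.
# TARGET_ADDR and SUBSYS_NQN are illustrative placeholders for your fabric.
import subprocess

TARGET_ADDR = "10.0.0.50"                       # hypothetical NVMe-oF target
TARGET_PORT = "4420"                            # conventional NVMe/TCP port
SUBSYS_NQN = "nqn.2024-01.example:ai-storage"   # hypothetical subsystem NQN

# Discover subsystems exported by the target.
subprocess.run(
    ["nvme", "discover", "-t", "tcp", "-a", TARGET_ADDR, "-s", TARGET_PORT],
    check=True,
)

# Connect; the remote namespace then appears locally as /dev/nvmeXnY.
subprocess.run(
    ["nvme", "connect", "-t", "tcp", "-n", SUBSYS_NQN,
     "-a", TARGET_ADDR, "-s", TARGET_PORT],
    check=True,
)
```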
4. Software and Cloud Platform Providers
Choosing the right cloud platform or vendor is a critical decision for AI infrastructure. While most cloud platforms are capable of supporting AI workloads, the primary consideration should be compatibility with the processor selected for your system. However, for an infrastructure to be truly effective, this alone is not sufficient.
The expertise of the AI infrastructure team is vital for achieving optimal performance. Despite the prevalence of cloud virtualization, it may not always be the best fit for AI systems. A hybrid model of cloud, virtualization, and bare metal, as the following case shows, can effectively meet deep learning’s demands, combining flexible distribution of computing power with direct access to high-performance bare metal. Thus, it’s crucial for vendors to precisely understand the system’s goals to tailor the most effective cloud technology solution.
JPMorgan Chase, for example, faced the challenge of processing vast amounts of data for real-time financial analysis and risk management. Recognizing the limitations of traditional cloud virtualization for its specific AI workloads, the firm adopted a hybrid infrastructure that combines cloud, virtualization, and bare metal solutions.
This hybrid model allows JPMorgan Chase to leverage the flexibility of cloud and virtualization for scalability and cost-effectiveness while also utilizing the power of bare metal servers to handle compute-intensive AI tasks. The bare metal servers provide direct access to hardware resources, bypassing the overhead that comes with virtualization, which is crucial for the performance-intensive requirements of deep learning algorithms used in financial modeling and risk assessment.
The infrastructure must be flexible enough to adapt to evolving AI demands, enabling the incorporation of new technologies and the adjustment or expansion of resources with minimal disruption. Many cloud providers offer AI and machine learning services that can be integrated into the AI infrastructure, providing access to powerful tools for model development, training, and deployment.
Technologies such as OpenStack for virtualization and Kubernetes for containerization play a vital role in managing AI applications. They simplify the deployment, scaling, and operation of AI workloads across varied environments, enhancing the infrastructure’s agility and responsiveness to changing needs.
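As a small example of this operational agility, the official Kubernetes Python client can resize a deployment in a few lines. The deployment and namespace names below are hypothetical.

```python
# Scaling a (hypothetical) AI inference deployment with the official
# Kubernetes Python client.
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() inside a pod
apps = client.AppsV1Api()

# Scale the "inference-server" deployment in "ai-workloads" to 6 replicas.
apps.patch_namespaced_deployment_scale(
    name="inference-server",
    namespace="ai-workloads",
    body={"spec": {"replicas": 6}},
)
```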
5. Energy Efficiency and Consumption
Incorporating AI into IT infrastructure boosts data processing and algorithm execution capabilities but raises energy consumption concerns, especially for deep learning models that need significant computational power. Energy is the most challenging aspect because the conventional strategy for improving efficiency involves redistributing loads and powering down unused capacity.
However, in the context of AI, shutting down machines can be particularly difficult and may only marginally improve energy efficiency. It is therefore advisable to balance performance against consumption and to identify and manage the components of the infrastructure that consume the most energy. In AI infrastructure, these are typically GPUs, FPGAs, and other hardware components that require continuous cooling or heat dissipation.
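Identifying those heavy consumers can itself be automated. The sketch below uses NVIDIA’s NVML bindings to report per-GPU power draw and temperature; it assumes NVIDIA hardware and the nvidia-ml-py package.

```python
# Report per-GPU power draw and temperature via NVML (pip install nvidia-ml-py).
import pynvml

pynvml.nvmlInit()
for i in range(pynvml.nvmlDeviceGetCount()):
    handle = pynvml.nvmlDeviceGetHandleByIndex(i)
    name = pynvml.nvmlDeviceGetName(handle)
    watts = pynvml.nvmlDeviceGetPowerUsage(handle) / 1000  # NVML reports milliwatts
    temp = pynvml.nvmlDeviceGetTemperature(handle, pynvml.NVML_TEMPERATURE_GPU)
    print(f"GPU {i} ({name}): {watts:.0f} W, {temp} C")
pynvml.nvmlShutdown()
```

Logging these readings over time reveals which accelerators dominate the power budget and where cooling or scheduling changes would pay off most.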
If optimizing energy usage directly proves difficult, significant savings can still be achieved by improving the efficiency of the cooling systems. For instance, data centers in Iceland, like Borealis or atNorth, illustrate an effective approach to energy management. Leveraging Iceland’s cool climate and abundant renewable energy sources, these data centers utilize natural cooling and geothermal energy, significantly reducing the need for artificial cooling and hence lowering the overall energy consumption of AI infrastructures.
However, operating in Iceland also presents challenges, particularly with network latency and connectivity due to its remote location. Thus, it’s wise to carefully choose the types of workloads and the timing of projects executed in these data centers. Workloads that are less sensitive to latency or those that can be scheduled during off-peak hours can optimize both energy use and network performance, making the most of Iceland’s unique environment while mitigating potential drawbacks. This strategic approach underscores the importance of thoughtful infrastructure planning in maximizing both performance and energy efficiency in AI operations.
Another avenue is adopting energy-efficient GPUs and TPUs and optimizing AI algorithms through model pruning and quantization, both key for reducing energy use while maintaining performance. Technological advancements in semiconductors and AI accelerators also help lower power consumption.
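To make pruning and quantization concrete, here is a hedged PyTorch sketch applying both to a toy model; the layer sizes and the 30% pruning amount are arbitrary illustrative choices.

```python
# Two energy-saving optimizations on a toy model: L1 unstructured pruning
# and dynamic int8 quantization.
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

model = nn.Sequential(nn.Linear(512, 256), nn.ReLU(), nn.Linear(256, 10))

# Prune 30% of the smallest-magnitude weights in each Linear layer.
for module in model.modules():
    if isinstance(module, nn.Linear):
        prune.l1_unstructured(module, name="weight", amount=0.3)
        prune.remove(module, "weight")  # make the pruning permanent

# Quantize Linear layers to int8 for cheaper, lower-power inference.
quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)
print(quantized)
```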
Furthermore, adopting green data center technologies, leveraging virtualization and cloud computing, and employing dynamic scaling and AI-driven resource management improve energy efficiency in AI operations by tailoring resource use to demand. This ensures efficient energy use across IT infrastructures without sacrificing performance.
In conclusion, for organizations aiming to harness innovation and gain a competitive edge, transforming IT infrastructure to be AI-ready is essential. Prioritizing data enhancement, investing in specialized hardware, integrating robust cloud solutions, securing networks, and cultivating a skilled team are fundamental steps that lay the groundwork for effective AI deployment and future technological progress. These efforts ensure the successful implementation of AI and the flexibility to adapt to emerging AI advancements and opportunities.