Seattle Conference on Scalability: Building a Scalable Resource Management
55:35
Google Tech Talks
June 23, 2007
ABSTRACT
2007 Google Seattle Conference on Scalability: Building a Scalable Resource Mgmt System for Grid Computing
Speaker: Khalid Ahmed, Platform Computing Corp.
This talk will describe the architecture and implementation details
for building a highly scalable resource management layer that can
support a variety of applications and workloads. This technology
has evolved from large scale computing grids deployed in
production at customers such as Texas Instruments, AMD, JP
Morgan, and various government labs. We will show how to build a
centralized dynamic load information collection service that can
handle up to 5,000 nodes/20,000 CPUs in a single cluster. The
service is able to gather a variety of system level metrics and is
extensible to collect up to 256 dynamic or static attributes of a node
and actively feed them to a centralized master. A built-in election
algorithm provides timely failover of the master service, ensuring
high availability without the need for specialized interconnects.
This building block is extended to multiple clusters that can be
organized hierarchically to support a single resource management
domain that can span multiple data centers. We believe the current
architecture could scale to 100,000 nodes/400,000 CPUs. Additional
services that build on this core scale-out clustering service, such
as a distributed process execution service and a policy-based
resource allocation engine, are described. The protocols,
communication overheads, and various design tradeoffs that were
made during the development of these services will be presented along
with experimental results from various tests, simulations and
production environments.
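The abstract does not spell out the collection protocol or the election algorithm, so the following is only a minimal sketch of the general idea: nodes push attribute reports to a central master, and a new master is chosen when the current one stops reporting. The class `Master`, the function `elect_master`, the timeout-based liveness check, and the "first live host in a fixed candidate list" rule are all assumptions made for illustration and are not taken from the talk. Only the 256-attribute limit comes from the abstract.

```python
from dataclasses import dataclass, field

MAX_ATTRIBUTES = 256  # per-node attribute limit quoted in the abstract


@dataclass
class Master:
    """Hypothetical central collector: keeps the latest report from each node."""
    reports: dict = field(default_factory=dict)    # node name -> attribute dict
    last_seen: dict = field(default_factory=dict)  # node name -> report timestamp

    def receive(self, node, attributes, now):
        """Store one node's report, enforcing the attribute limit."""
        if len(attributes) > MAX_ATTRIBUTES:
            raise ValueError("too many attributes for one node")
        self.reports[node] = dict(attributes)
        self.last_seen[node] = now

    def live_nodes(self, now, timeout):
        """Nodes whose most recent report is no older than `timeout` seconds."""
        return sorted(n for n, t in self.last_seen.items() if now - t <= timeout)


def elect_master(candidates, live):
    """Assumed rule: the first live host in a fixed candidate list becomes master."""
    for host in candidates:
        if host in live:
            return host
    return None


# Usage: node1 last reported at t=0 and is stale by t=12, so node2 takes over.
master = Master()
master.receive("node1", {"cpu_load": 0.4, "ncpus": 4}, now=0.0)
master.receive("node2", {"cpu_load": 1.2, "ncpus": 4}, now=5.0)
live = master.live_nodes(now=12.0, timeout=10.0)
new_master = elect_master(["node1", "node2"], live)
```

A fixed candidate order keeps the choice deterministic without extra messaging, which is one simple way to avoid depending on a specialized interconnect; the system described in the talk may well use a different scheme.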
Khalid Ahmed works as the Chief Architect and Director of
Technology Research at Platform Computing. In over 12 years at
Platform he has worked in a number of roles including development,
product management and architecture. His work on distributed
scheduling, wide-area resource sharing, workload management,
system automation, virtualization management, and high availability