Speaker
Description
Drawing on extensive experience in system maintenance and on advanced artificial intelligence technology, we have designed an intelligent operations and maintenance system for the IHEP computing platform. Its primary goal is to ensure that computing resources are used efficiently.
This system automatically detects user jobs that cause anomalies in computing services and dynamically adjusts their available resources in real time.
Using AI algorithms, it analyzes the file system's operational status and logs in near real time, identifying the users and process names that may be triggering anomalies.
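As a rough illustration of this first screening step, the sketch below flags (user, process) pairs whose file-system operation counts stand out from the rest. It is not the system's actual algorithm: a simple median/MAD outlier score stands in for the AI analysis, and the function name `flag_suspects`, the input layout, and the threshold of 3.5 are all assumptions made for the example.

```python
from statistics import median


def flag_suspects(ops_per_key, threshold=3.5):
    """Return the (user, process) keys whose operation count is a robust outlier.

    ops_per_key: hypothetical mapping of (user, process_name) -> operation count
    observed in one monitoring window. A median/MAD score is used here purely
    as a stand-in for the AI analysis described in the abstract.
    """
    counts = list(ops_per_key.values())
    med = median(counts)
    # Median absolute deviation: a spread estimate that one heavy user cannot skew.
    mad = median(abs(c - med) for c in counts)
    if mad == 0:
        # All counts (nearly) identical: nothing can be called an outlier.
        return []
    suspects = []
    for key, count in ops_per_key.items():
        # 0.6745 rescales the MAD so the score is comparable to a z-score.
        score = 0.6745 * (count - med) / mad
        if score > threshold:
            suspects.append(key)
    return suspects


window = {
    ("alice", "reco"): 120,
    ("bob", "sim"): 100,
    ("carol", "ana"): 110,
    ("dave", "cp"): 90,
    ("eve", "find"): 5000,
}
suspects = flag_suspects(window)
```

Only the upper tail is flagged, since the concern here is users generating unusually many operations; a production detector would presumably use richer features than a single count.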
After querying the job scheduler for the computing node on which a suspected abnormal job is running, the system uses AI algorithms to analyze the job in real time and determine whether its behavior is causing excessive system load. Once this is confirmed, the system notifies the job scheduler and the file system to limit the number of the user's job operations and the total I/O volume.
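The confirm-then-limit step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the system's implementation: the `JobSample` and `Limit` types, the `plan_limits` function, and the fixed caps are hypothetical, and a plain threshold check replaces the AI-based confirmation. Sending the resulting limits to a real job scheduler or file system is deliberately left out, since the abstract does not name those interfaces.

```python
from dataclasses import dataclass


@dataclass
class JobSample:
    """Hypothetical per-job measurement taken on the node the scheduler reported."""
    user: str
    job_id: str
    node: str
    ops_per_s: float
    io_bytes_per_s: float


@dataclass
class Limit:
    """Hypothetical limit to hand to the job scheduler and the file system."""
    user: str
    max_ops_per_s: float
    max_io_bytes_per_s: float


def is_overloading(sample, ops_cap, io_cap):
    # Stand-in for the AI-based confirmation: a fixed threshold on either metric.
    return sample.ops_per_s > ops_cap or sample.io_bytes_per_s > io_cap


def plan_limits(samples, ops_cap, io_cap):
    """Return one Limit per user with at least one confirmed overloading job."""
    limits = {}
    for sample in samples:
        if is_overloading(sample, ops_cap, io_cap):
            # Keyed by user so several bad jobs yield a single limit.
            limits[sample.user] = Limit(sample.user, ops_cap, io_cap)
    return list(limits.values())


samples = [
    JobSample("eve", "job-1", "node-a", ops_per_s=4000.0, io_bytes_per_s=1e8),
    JobSample("eve", "job-2", "node-b", ops_per_s=200.0, io_bytes_per_s=9e8),
    JobSample("bob", "job-3", "node-a", ops_per_s=150.0, io_bytes_per_s=2e7),
]
limits = plan_limits(samples, ops_cap=1000.0, io_cap=5e8)
```

Separating the decision (`plan_limits`) from enforcement keeps the confirmation logic testable without access to the scheduler or the file system.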
This system is employed for comprehensive monitoring and intelligent operations management of the computing platform. It dynamically adjusts the scale of available resources for users based on the overall situation of the computing platform, ensuring fair and efficient data processing for all users.
Speaker release: Yes