Troubleshooting ideas for Linux system CPU usage of 100%

Posted Date: 2024-01-24


Hello everyone, this is Haodao Linux, a platform that mainly shares IT knowledge about Linux, Python, and network communications.

Today Haodao shares some hard-core Linux material. When a server's CPU hits 100% at work, panicking does not help; you have to track down the problem yourself. This article walks through an approach to troubleshooting this kind of fault, with a related shell script attached at the end. Once you try it in practice, you will find that solving the problem is quite simple!

Yesterday afternoon I suddenly received an operations alert email showing that CPU utilization on the data platform server had reached 98.94% and had stayed above 70% recently. At first glance it looked as if the hardware had hit a bottleneck and needed to be scaled up. On closer thought, though, our business system is neither highly concurrent nor CPU-intensive, so this utilization was excessive and a hardware bottleneck should not have arrived so quickly. There had to be a problem somewhere in the business code logic.

2. Troubleshooting ideas

2.1 Locate the PID of the high-load process

First, log in to the server and run the top command to check the server's actual state, then analyze and judge from what it shows.

Comparing the load average against the load evaluation standard for an 8-core machine confirms that the server is under high load.

Looking at the resource usage of each process, the process with PID 682 has a noticeably high CPU share.
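As a rough, non-interactive counterpart to reading top by eye, the sketch below compares the 1-minute load average with the number of cores and then prints one snapshot of top sorted by CPU. The helper name load_is_high and the rule of thumb "load above the core count means high load" are assumptions for illustration, not something taken from the original screenshots.

```shell
#!/bin/bash
# Hypothetical helper: succeeds (exit 0) when the load average exceeds the core count.
# Usage: load_is_high <load> <cores>
load_is_high() {
    # awk performs the floating-point comparison that plain bash cannot do
    awk -v load="$1" -v cores="$2" 'BEGIN { exit !(load > cores) }'
}

cores=$(nproc)                            # number of available cores
load1=$(cut -d ' ' -f 1 /proc/loadavg)    # 1-minute load average

if load_is_high "$load1" "$cores"; then
    echo "high load: ${load1} on ${cores} cores"
else
    echo "load looks normal: ${load1} on ${cores} cores"
fi

# One batch-mode snapshot of top, sorted by CPU usage
top -b -n 1 -o %CPU | head -n 15
```

The PID at the top of the process list in that snapshot is the candidate for the next step.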

2.2 Locate the specific abnormal service

Here we can use the pwdx command to find the path of the business process from its PID, and from that identify the project and the person responsible for it.

From this we can conclude that the process is the web service of the data platform.
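A minimal sketch of that lookup follows. The helper name process_cwd is hypothetical; it prefers pwdx and falls back to the /proc/<pid>/cwd symlink when pwdx is not installed. It is demonstrated on the current shell, since PID 682 only exists on the server in this article.

```shell
#!/bin/bash
# Hypothetical helper: print the working directory of a process.
process_cwd() {
    local pid="$1"
    if command -v pwdx > /dev/null 2>&1; then
        pwdx "$pid" | cut -d ' ' -f 2-    # pwdx prints "<pid>: <path>"
    else
        readlink "/proc/${pid}/cwd"       # same information, read from /proc
    fi
}

# Demonstrated on the current shell; on the server, pass the PID found with top.
process_cwd $$
```

Inspecting another user's process this way generally requires root privileges.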

2.3 Locate the abnormal thread and specific code lines

The traditional approach generally takes 4 steps:

1. top, then press P to sort by CPU usage: 1040 // First find the PID of the process with the highest load

2. top -Hp <process PID>: 1073 // Find the ID of the busy thread inside that process

3. printf "0x%x" <thread ID>: 0x431 // Convert the thread ID to hexadecimal, ready for searching the jstack output

4. jstack <process PID> | vim +/<hex thread ID> - // For example: jstack 1040 | vim +/0x431 -
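Steps 2 to 4 can also be sketched without the interactive top and vim sessions. The helper names below (busiest_thread, to_hex, thread_stack) are hypothetical, and using ps instead of top for step 2 is a substitution made here so the lookup can run in a script.

```shell
#!/bin/bash
# Hypothetical helpers for steps 2-4; step 1 (finding the process PID) is done with top.

# Step 2: ID (LWP) of the busiest thread in a process.
busiest_thread() {
    ps -L -o lwp=,pcpu= -p "$1" | sort -k2 -n -r | head -n 1 | awk '{print $1}'
}

# Step 3: convert a decimal thread ID to the hex form used in jstack's "nid=" field.
to_hex() {
    printf '0x%x\n' "$1"
}

# Step 4: print the stack of one thread from a jstack dump.
thread_stack() {
    local pid="$1" nid="$2"
    jstack "$pid" | sed -n "/nid=${nid} /,/^\$/p"
}

to_hex 1073    # prints 0x431, matching the example in step 3

# Usage on the server, with the process PID from top (requires jstack on PATH):
#   tid=$(busiest_thread 1040)
#   thread_stack 1040 "$(to_hex "$tid")"
```

Note that ps reports CPU usage averaged over the thread's lifetime, whereas top shows a live sample, so the two can rank threads differently for a short spike.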

For production problems, however, every second counts, and the four steps above are still too cumbersome and slow. oldratlee of Taobao, introduced in an earlier post, has wrapped this process into a script tool that makes it very convenient to locate this kind of problem in production.

The result shows that a time utility method in the system is consuming a relatively large share of CPU. Having located the specific method, the next step is to check its code logic for performance problems.

※ If the production problem is urgent, you can skip 2.1 and 2.2 and go straight to 2.3. The analysis here covers several angles only to present a complete line of reasoning.

3. Root cause analysis

After the analysis and troubleshooting above, we finally traced the problem to a time utility method, which was causing the excessive server load and CPU usage.

Abnormal method logic: converts a timestamp into the corresponding date and time format;

Upper-layer call: computes every second from midnight to the current time, converts each into the corresponding format, puts them into a set, and returns the result;

Logic layer: corresponds to the query logic of the data platform's real-time report. The real-time report is generated at fixed intervals, and a single query calls the method multiple (n) times.

It follows that if the current time is 10 a.m., a single query performs 10*60*60*n = 36,000*n conversions, and the cost of a single query grows linearly as the day goes on. Because a large number of query requests from modules such as real-time query and real-time alarm call this method multiple times, a large amount of CPU is occupied and wasted.
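The arithmetic can be checked with a small sketch. The helper name conversions_per_query is hypothetical, and n = 5 is an assumed value chosen only to illustrate the growth; the article does not state the real n.

```shell
#!/bin/bash
# Hypothetical helper: number of timestamp conversions for one query,
# given the seconds elapsed since midnight and n method calls per query.
conversions_per_query() {
    local elapsed_seconds="$1" calls="$2"
    echo $(( elapsed_seconds * calls ))
}

# At 10 a.m., 10*60*60 = 36000 seconds have elapsed since midnight.
conversions_per_query $(( 10 * 60 * 60 )) 1    # prints 36000
conversions_per_query $(( 10 * 60 * 60 )) 5    # prints 180000
# By 8 p.m. the same query costs twice as much as at 10 a.m.
conversions_per_query $(( 20 * 60 * 60 )) 5    # prints 360000
```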

4. Solution

After locating the problem, the first consideration was to reduce the amount of computation and optimize the abnormal method. Investigation showed that the logic layer never used the contents of the set returned by this method; it only used the set's size. After confirming this, we replaced the call with a new method that computes the value directly (current time in seconds minus the time in seconds at midnight that day), which eliminated the excessive computation. After the release, we observed the server load and CPU usage: compared with the abnormal period, both had dropped by a factor of 30 and returned to normal. At this point the problem was solved.
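The business code itself is not shown in this article, so the following is only a shell illustration of the same subtraction, not the actual fix. The helper name seconds_since_midnight is hypothetical, and the sketch assumes GNU date (for the -d option).

```shell
#!/bin/bash
# Hypothetical helper: seconds elapsed since local midnight for an epoch timestamp.
# Assumes GNU date, which accepts -d "@<epoch>" and -d "<date> <time>".
seconds_since_midnight() {
    local now="$1"
    local midnight
    # Epoch seconds at 00:00:00 on the same calendar day as $now
    midnight=$(date -d "$(date -d "@${now}" +%F) 00:00:00" +%s)
    echo $(( now - midnight ))
}

# One subtraction replaces building a set with one entry per elapsed second.
seconds_since_midnight "$(date +%s)"
```

The result equals the size of the set the old method built, but the cost no longer depends on the time of day.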

5. Summary

When writing code, beyond implementing the business logic, we must also pay attention to code performance. Being able to implement a business requirement and being able to implement it more efficiently and elegantly reflect two quite different levels of engineering ability, and the latter is an engineer's core competitiveness.

After the code is written, review it more and think about whether it could be implemented in a better way.

Do not overlook any small detail of a production issue! The devil is in the details. Engineers need a thirst for knowledge and a drive for excellence; only then can they keep growing and improving.

Attached is the script:

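#!/bin/bash
# @Function
# Find out the highest cpu consumed threads of java, and print the stack of these threads.
#
# @Usage
#   $ ./
#
# @author Jerry Lee

readonly PROG=`basename $0`
readonly -a COMMAND_LINE=("$0" "$@")

# NOTE: the usage text, the option parsing and the colored-echo helpers that sat
# between usage() and the jstack check did not survive in this copy of the script.
# The minimal stand-ins below (option letters, default count, plain echo) are
# assumptions that keep the script runnable.
usage() {
    echo "Usage: ${PROG} [-p <pid>] [-c <count>]"
    exit $1
}

while getopts "p:c:h" opt; do
    case "$opt" in
    p) pid="$OPTARG" ;;
    c) count="$OPTARG" ;;
    h) usage 0 ;;
    *) usage 1 ;;
    esac
done
count=${count:-5}

redEcho()    { echo "$@"; }
yellowEcho() { echo "$@"; }
blueEcho()   { echo "$@"; }

# Check that the jstack command is available
if ! which jstack &> /dev/null; then
    [ -z "$JAVA_HOME" ] && {
        redEcho "Error: jstack not found on PATH!"
        exit 1
    }
    ! [ -f "$JAVA_HOME/bin/jstack" ] && {
        redEcho "Error: jstack not found on PATH and $JAVA_HOME/bin/jstack file does NOT exists!"
        exit 1
    }
    ! [ -x "$JAVA_HOME/bin/jstack" ] && {
        redEcho "Error: jstack not found on PATH and $JAVA_HOME/bin/jstack is NOT executable!"
        exit 1
    }
    export PATH="$JAVA_HOME/bin:$PATH"
fi

readonly uuid=`date +%s`_${RANDOM}_$$

# Remove the temporary jstack dumps when the script exits
cleanupWhenExit() {
    rm /tmp/${uuid}_* &> /dev/null
}
trap "cleanupWhenExit" EXIT

# Read "pid lwp user comm pcpu" lines and print the stack of each listed thread
printStackOfThreads() {
    local line
    local count=1
    while IFS=" " read -a line ; do
        local pid=${line[0]}
        local threadId=${line[1]}
        local threadId0x="0x`printf %x ${threadId}`"
        local user=${line[2]}
        local pcpu=${line[4]}

        # One jstack dump per process, reused for all of its threads
        local jstackFile=/tmp/${uuid}_${pid}

        [ ! -f "${jstackFile}" ] && {
            {
                if [ "${user}" == "${USER}" ]; then
                    jstack ${pid} > ${jstackFile}
                else
                    if [ $UID == 0 ]; then
                        sudo -u ${user} jstack ${pid} > ${jstackFile}
                    else
                        redEcho "[$((count++))] Fail to jstack Busy(${pcpu}%) thread(${threadId}/${threadId0x}) stack of java process(${pid}) under user(${user})."
                        redEcho "User of java process($user) is not current user($USER), need sudo to run again:"
                        yellowEcho "    sudo ${COMMAND_LINE[@]}"
                        echo
                        continue
                    fi
                fi
            } || {
                redEcho "[$((count++))] Fail to jstack Busy(${pcpu}%) thread(${threadId}/${threadId0x}) stack of java process(${pid}) under user(${user})."
                echo
                rm ${jstackFile}
                continue
            }
        }
        blueEcho "[$((count++))] Busy(${pcpu}%) thread(${threadId}/${threadId0x}) stack of java process(${pid}) under user(${user}):"
        sed "/nid=${threadId0x} /,/^$/p" -n ${jstackFile}
    done
}

# List all threads, keep the java ones (optionally of a single pid),
# sort by CPU usage and print the stacks of the busiest ones.
# The pid filter uses && (both conditions); a comma would make awk treat it as a range pattern.
ps -Leo pid,lwp,user,comm,pcpu --no-headers | {
    [ -z "${pid}" ] &&
    awk '$4=="java"{print $0}' ||
    awk -v "pid=${pid}" '$1==pid && $4=="java"{print $0}'
} | sort -k5 -r -n | head --lines "${count}" | printStackOfThreads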

Review Editor: Tang Zihong
