CPU Process on ME3600 (NLFM Learn Pr)



NOC said there was a high temperature caused by the switch at the RNC flapping. And yes, there is flapping..

1554207: May 14 16:13:45 WIB: %SW_MATM-4-MACFLAP_NOTIF: Host xxxx.xx08.aa81 in vlan x00 is flapping between port VP3 and port Po1+Efpx00
1554208: May 14 16:13:45 WIB: %SW_MATM-4-MACFLAP_NOTIF: Host xxxx.xx09.5278 in vlan x00 is flapping between port VP3 and port Po1+Efpx00
1554209: May 14 16:13:46 WIB: %SW_MATM-4-MACFLAP_NOTIF: Host xxxx.xx00.2f78 in vlan x00 is flapping between port VP3 and port Po1+Efpx00
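
To see at a glance which VLANs and port pairs are involved, log lines like these can be tallied with a short script. This is only a minimal sketch in Python, assuming the message format shown above; the helper name `tally_macflaps` and the `sample` variable are hypothetical and not part of any Cisco tooling.

```python
import re
from collections import Counter

# Matches the %SW_MATM-4-MACFLAP_NOTIF message format seen in the log above.
MACFLAP_RE = re.compile(
    r"%SW_MATM-4-MACFLAP_NOTIF: Host (?P<mac>\S+) in vlan (?P<vlan>\S+) "
    r"is flapping between port (?P<port_a>\S+) and port (?P<port_b>\S+)"
)

def tally_macflaps(log_text):
    """Count flap notifications per (vlan, unordered port pair). Hypothetical helper."""
    counts = Counter()
    for line in log_text.splitlines():
        m = MACFLAP_RE.search(line)
        if m:
            # Sort the two ports so "A and B" and "B and A" land in the same bucket.
            pair = tuple(sorted((m["port_a"], m["port_b"])))
            counts[(m["vlan"], pair)] += 1
    return counts

sample = """\
1554207: May 14 16:13:45 WIB: %SW_MATM-4-MACFLAP_NOTIF: Host xxxx.xx08.aa81 in vlan x00 is flapping between port VP3 and port Po1+Efpx00
1554208: May 14 16:13:45 WIB: %SW_MATM-4-MACFLAP_NOTIF: Host xxxx.xx09.5278 in vlan x00 is flapping between port VP3 and port Po1+Efpx00
1554209: May 14 16:13:46 WIB: %SW_MATM-4-MACFLAP_NOTIF: Host xxxx.xx00.2f78 in vlan x00 is flapping between port VP3 and port Po1+Efpx00
"""

for (vlan, ports), n in tally_macflaps(sample).items():
    print(vlan, ports, n)
```

A tally like this only shows where the flaps are reported; it says nothing yet about whether they explain the CPU load.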

How do they know the high utilization is caused by the flapping? This is what they said:

xxx-ME3600-xxx#sh processes cpu sorted
CPU utilization for five seconds: 93%/40%; one minute: 94%; five minutes: 94%
PID Runtime(ms)     Invoked      uSecs   5Sec   1Min   5Min TTY Process
  84   756462208  1192821049        634 46.38% 45.66% 44.87%   0 NLFM Learning Pr
  81    64159820  1247163834         51  3.21%  3.21%  3.03%   0 Nile LED Process
  77   395802052   140134743       2824  1.60%  1.31%  1.27%   0 npm counter proc
  95   118802540    34047744       3489  0.72%  0.38%  0.36%   0 NL3MD_STAT       

NOC said, "NLFM Learning Pr is there because of the MAC learning activity."
Why did NOC say that? Because they got this information from MSI:
MSI said, "there is flapping in the log: 1512929: May 13 09:40:01 WIB: %SW_MATM-4-MACFLAP_NOTIF: Host xxxx.xxc5.9b02 in vlan 900 is flapping between port Gi0/16+Efpx00 and port VP3".

Okay... let's look at this..
xxx-ME3600-xxx#sh int des
Interface                      Status         Protocol Description
---output omitted---
Gi0/16                         admin down     down     Metro namacustomer Test
---output omitted---

Gi0/16 is down.. and ... it is not connected to anything..
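
The same cross-check can be done mechanically: take the ports named in the flap messages and see whether any of them is administratively down. The sketch below is in Python and rests on assumptions: that the text after "+" in a port name such as Gi0/16+Efpx00 can be dropped to get the interface name, and that the Status column reads "admin down" as in the output above. All function names are hypothetical.

```python
def base_interface(port):
    """Drop everything from '+' onward, e.g. 'Gi0/16+Efpx00' -> 'Gi0/16' (assumed naming)."""
    return port.split("+", 1)[0]

def admin_down_interfaces(show_int_desc):
    """Return the interface names whose Status column reads 'admin down'."""
    down = set()
    for line in show_int_desc.splitlines():
        fields = line.split()
        # Matching rows look like: <name> admin down <protocol> <description...>
        if len(fields) >= 3 and fields[1] == "admin" and fields[2] == "down":
            down.add(fields[0])
    return down

def suspicious_flap_ports(flap_ports, show_int_desc):
    """Ports reported as flapping although their interface is admin down."""
    down = admin_down_interfaces(show_int_desc)
    return [p for p in flap_ports if base_interface(p) in down]

int_desc = """\
Interface                      Status         Protocol Description
Gi0/16                         admin down     down     Metro namacustomer Test
"""

print(suspicious_flap_ports(["Gi0/16+Efpx00", "VP3"], int_desc))
```

A port that shows up here is a hint that the flap message alone does not tell the whole story.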

Let's see what "NLFM Learn Pr" means, based on

CSCtr82351
ME3600 crash after ospf configuration.
Router crashes when trying to change OSPF area/multipath parameters. NLFM learning process causes CPU HOG and the router is forced to crash by watchdog.
Workaround: None. 

Restated in plainer words (lightly reworded so it reads less awkwardly):
CSCtr82351
The ME3600 crashes when you try to change its OSPF area/multipath parameters. The NLFM learning process causes a CPU HOG and the watchdog forces the router to crash.
Workaround: none.


See.. so the cause is not the flapping; it is more likely the OSPF process.

This is so different from when I was NCC at Patra. Why does the NOC here always tend to presume its own innocence? It is too procedural and long-winded. If a site had an outage yesterday, they ask for the device indication. Today the same site goes down again, and they ask for the device indication again. Tomorrow, the day after, and the day after that it goes down again, and every time they ask for the device indication. A recurring outage should be investigated..

For device indications, in my opinion Level 1 is enough. Why should the NOC still have to deal with device indications?

"The customer is king." Clearly, customers will not complain if they have no outage. And clearly, corporate customers always check their own internal side first before complaining. So always start from the presumption that we are the ones at fault. Then investigate as deeply as possible. Once you are sure the internal link is fine, attach the captures of your checks and send them to the user/customer. Do not keep throwing the blame at the customer..

Hmm...
Below is a write-up on high CPU utilization that I copy-pasted from a neighboring blog: http://allaboutmylife.wordpress.com/2008/01/21/cisco-router-high-cpu-utilization-up-to-99/


If you use Cisco routers, you may run into this someday, or you may already have. Three days ago it happened to me. Suddenly one of our routers had a problem: pings were intermittent, even to the loopback IP, and SSH was very hard to get into. After several attempts the login prompt finally appeared, but authentication failed because the connection to the TACACS server was affected as well. It was quite a hassle, especially since I was given only 30 minutes to analyze the problem and find a solution. For context, the router was a Cisco 7600.

Within the first minute I managed to identify the problem: CPU utilization was far too high, reaching 99%. I learned this from the monitoring tool I use, so I did not have to bother logging in to the router, which was very hard to get into in that state. Having identified the problem, I now had to find the cause of the high CPU utilization. Here are the things you can do if you run into the same problem.

Problem Identification
SHOW PROCESS CPU NOTIFICATIONS (if any)
Router#show process cpu sorted
This command gives you information about CPU utilization, the processes using the CPU, and the interrupt percentage. The overall utilization and interrupt figures are on the first line of the show output.

router#sh proc cpu sort
CPU utilization for five sec: 99%/54%; one minute: 99%; five minutes: 99%
PID Runtime(ms) Invoked uSecs 5Sec 1Min 5Min TTY Process
77 11697800 822858 14216 38.52% 37.51% 37.71% 0 IP Input
128 702920 714062 984 2.47% 2.57% 2.50% 0 Tag Input
Notes:
CPU utilization for five seconds: x%/y%; one minute: a%; five minutes: b%
Total CPU Utilization: x%
Process Utilization: (x - y)%
Interrupt Utilization: y%
Process Utilization is the difference between the Total and Interrupt 
(x and y). The one and five minute utilizations are exponentially 
decayed averages (rather than an arithmetic average), therefore 
recent values have more influence on the calculated average.
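
The x%/y% arithmetic above can be applied directly to the header line. The following is a minimal sketch in Python, assuming the header format shown in the outputs in this post; `split_cpu` is a hypothetical helper name, and the pattern accepts both "five seconds" and the shortened "five sec" spelling.

```python
import re

# Header line of "show processes cpu": total%/interrupt%, then the two averages.
CPU_RE = re.compile(
    r"CPU utilization for five sec(?:onds)?:\s*(\d+)%/(\d+)%;"
    r"\s*one minute:\s*(\d+)%;\s*five minutes:\s*(\d+)%"
)

def split_cpu(header):
    """Split the header into total, interrupt and process-level percentages."""
    m = CPU_RE.search(header)
    if m is None:
        raise ValueError("not a CPU utilization header line")
    total, interrupt, one_min, five_min = map(int, m.groups())
    return {
        "total": total,                  # x
        "interrupt": interrupt,          # y
        "process": total - interrupt,    # x - y
        "one_minute": one_min,
        "five_minutes": five_min,
    }

me3600 = "CPU utilization for five seconds: 93%/40%; one minute: 94%; five minutes: 94%"
print(split_cpu(me3600))
```

For the ME3600 header above (93%/40%), this gives 53% at process level and 40% at interrupt level.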

SHOW LOG
Run “show logging” to get the router's log; there may be useful information in it that helps you identify the problem and find its root cause.

Effects of High CPU Utilization
- Input queue drops
- Slow performance
- Slow response in Telnet or unable to Telnet to the router
- Slow response on the console
- Slow or no response to ping
- Router doesn’t send routing updates

Possible Causes of High CPU Utilization
Based on the “show process cpu” output, IP Input was using a fairly large share of the CPU, so the cause is likely related to IP Input. Even so, I did not rule out other causes. Things that can cause high CPU utilization include:

Hardware failure
My guess is that the hardware failures that can cause high CPU utilization are RSP failure and/or VIP failure.

Configuration
Based on the “show process cpu” output, the problem is probably related to IP Input, so it could be due to the fast switching configuration, TCP intercept, use of IP NAT, and so on. It could also be someone running a DoS attack.

Troubleshooting Steps

Hardware failure
a. RSP Failure
If the RSP is the problem, just replace it :D . As a suggestion, though, make this the last resort, because the router has to be powered off to replace the RSP. Total estimated downtime is about 15 minutes. A 7600 router takes roughly 10-15 minutes to boot, including routing updates and so on.
b. VIP Failure
To check whether the VIP is the problem, you can run:
Router#sh controllers vip all proc cpu | i utilization
CPU utilization for five seconds: 1%/1%; one minute: 1%; five minutes: 1%
CPU utilization for five seconds: 5%/5%; one minute: 5%; five minutes: 5%
CPU utilization for five seconds: 21%/21%; one minute: 21%; five minutes: 21%
CPU utilization for five seconds: 19%/19%; one minute: 20%; five minutes: 19%
If you want to see the per-process detail, run the command “sh controllers vip vip_number proc cpu”.
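
When there are many VIPs, the filtered output can be scanned for busy ones with a few lines of script. This is a rough sketch in Python, assuming one "CPU utilization" line per VIP as in the output above. The filtered lines carry no slot number, so the sketch reports only each line's position in the output; `busy_vips` and its `threshold` parameter are hypothetical.

```python
import re

VIP_CPU_RE = re.compile(r"CPU utilization for five seconds:\s*(\d+)%/(\d+)%")

def busy_vips(output, threshold=80):
    """Return (position, total%) for lines whose five-second total is >= threshold.

    position is the 1-based order of the line in the output, not a slot number.
    """
    busy = []
    position = 0
    for line in output.splitlines():
        m = VIP_CPU_RE.search(line)
        if not m:
            continue
        position += 1
        total = int(m.group(1))
        if total >= threshold:
            busy.append((position, total))
    return busy

vip_output = """\
CPU utilization for five seconds: 1%/1%; one minute: 1%; five minutes: 1%
CPU utilization for five seconds: 5%/5%; one minute: 5%; five minutes: 5%
CPU utilization for five seconds: 21%/21%; one minute: 21%; five minutes: 21%
CPU utilization for five seconds: 19%/19%; one minute: 20%; five minutes: 19%
"""

print(busy_vips(vip_output))                # default threshold of 80
print(busy_vips(vip_output, threshold=20))  # lower threshold for illustration
```

With the sample output above, no VIP comes close to the default threshold.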

Configuration
TRY THIS: If IP Input is consuming the CPU, one of the following might be the cause:
- Fast switching is disabled on an interface (or interfaces) that has a lot of outgoing traffic. Examine the output of the ‘show interfaces switching’ command to see which interface is burdened with traffic. Re-enable fast switching on that interface.
- TCP Intercept is enabled. TCP Intercept requires process switching for all packets during session set-up.
- Fast switching is disabled on an interface which supports more than one network and is routing traffic between them. This can occur when an interface has one or more secondary network addresses configured.
INFO: The router will process switch all packets sourced from the interface and destined to host(s) off the same interface which is a CPU-intensive task. Use the ‘ip route-cache same-interface’ interface configuration command to allow packets to be fast switched on the same interface.
- Traffic that can’t be fast switched is arriving. This could be any of the following types of traffic:
* Packet for which there is no entry yet in the switching cache.
INFO: If there is a device in the network which is generating lots of packets at an extremely high rate for devices reachable through the router and is using different source or destination IP addresses, there won’t be a match for these packets in the switching cache, so they will be processed by the IP Input process. This source device can be a malfunctioning device or a device attempting a Denial-of-Service (DoS) attack.
* Packets destined for the router (i.e. Routing Updates or a Spoof Attack)
* IP packets with options
* Packets that require protocol translation
* Multilink PPP
* Packets that require policy routing.
INFO: IOS versions 11.3 and higher allow policy-routed packets to be fast switched. Use the ‘ip route-cache policy’ interface configuration command to allow policy-routed packets to be fast switched.
* Packets going through serial interfaces with X.25 encapsulation. In the X.25 protocol suite, flow control is implemented in layer 2 of the OSI model.
* Compressed traffic. If there’s no Compression Service Adapter (CSA) in the router, compressed packets must be process-switched.
* Encrypted traffic. If there’s no Encryption Service Adapter (ESA) in the router, encrypted packets must be process-switched.
- A lot of packets, arriving at an extremely high rate, for a destination in a directly attached subnet, for which there is no entry in the ARP table. This shouldn’t happen with TCP traffic, because of the windowing mechanism, but it can happen with UDP traffic.
- A lot of multicast traffic going through the router. Unfortunately, there’s no easy way to examine the amount of multicast traffic. If you’ve configured multicast routing on the router, you can enable fast switching of multicast packets using the ‘ip mroute-cache’ interface configuration command (fast switching of multicast packets is off by default).
- A lot of broadcast traffic. Check the number of broadcast packets in the ‘show interfaces’ command output.
- Too much traffic is passing through the router. If the router is over-used and is incapable of handling this amount of traffic, try distributing the load among other routers or consider purchasing a high-end router.
- IP NAT is configured on the router and there are lots of DNS packets going through the router. UDP or TCP packets with source and/or destination port 53 (DNS) are always punted to process level by NAT.
- Check who’s logged on to the router and what they are doing. If someone is logged on and is issuing commands that produce long output, the high CPU utilization by the IP input process will be followed by a much higher CPU utilization by the virtual EXEC process.
- Make sure all debugging commands in your router are turned off by issuing the undebug all or no debug all command.
- Check for a possible security issue. Commonly, high CPU utilization is caused by a security issue, such as a worm or virus operating in your network. Usually, a configuration change, such as adding additional lines to your access lists can mitigate the effects of this problem. Check the Cisco Product Security Advisories and Notices for information on the most likely causes and specific workarounds.
