Used AI/AIOPS to Identify and Squash a Y2K bug... in 2023

So this is going to be a bit long, but I thought you might all get a kick out of it.

I haven't been a real Sysadmin for 15 years or so now... but it is in your blood and part of your soul. Once a Sysadmin, always a Sysadmin.

What I do now is help my clients solve problems with a broad range of technologies that we sell (of course) by building actual Minimum Viable Products. Real working code that can do the job on a very narrow focus or with very limited functionality. 4-6 week Epics, 2 to 4 Epics max.

So I was quite excited when a group of Sysadmins approached our team and asked us to try to solve a problem they have. Specifically, they have 20K to 30K ETL batch jobs that run through Informatica every night, depending on cycles. Every night a job "gets the slowness" and they get called... no, the systems are fine... ok, now let's track down the job owner and have them look at it. 25K jobs means even 99.99% is still a job or two a night.

So they wanted us to correlate all the feeds with an AI: tickets from the ticket system, Informatica job data, system perf data, Splunk feeds from the source and target databases... assuming they had Splunk feeds.

So we built it over 2 Epics and trained it on 10 years of extracted data. A mix of standard ML for identifying patterns in the metrics and an LLM for picking out patterns in the unstructured ticket data. Which kinda works... the tickets are filled out inconsistently and not to standard. So far it is having fun flagging badly filled out tickets, and the team is going back and making people fill out the Root Cause Analysis properly. We should get better results as that happens. Many RCAs are half-assed, and said half-asses are getting reamed.

Turned it loose on live feeds (it gets fed, it can't pull) and let it work over the weekend. Lo and behold... it identified some problematic jobs. One in particular stood out.

Now, let me give some background on these jobs. Many of them, 75 to 80%, were COBOL/CICS jobs from the Mainframe that were moved off to save MIPS on the mainframe. This was done in the early 2000s. Early jobs were actual refactorings, but as deadlines loomed and money ran out, 50% or more were simply wrappers for the COBOL/CICS process that now ran on Power, not Mainframe. Much of THAT code was written in the 70s, 80s, and a bit in the 90s when they moved to Java... yeah, I know! I know! But it was the early 90s.

One of the things it was trained on was to look for non-linear resource consumption. And this one job jumped out because the growth rate of lines of data processed was not in line with a typical job. So the AI flagged the process, noted it was getting a call a month minimum, mainly at the start of the month when new data was streaming in from month end closing.
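For anyone wondering what "growth rate not in line with a typical job" might look like in practice, here is a minimal sketch. It is not the model we built: it swaps in a plain median-absolute-deviation outlier check on each job's compound growth in rows processed. The job names, row counts, and the threshold of 3.0 are all made up for illustration.

```python
from statistics import median


def growth_rate(rows: list[int]) -> float:
    """Compound per-period growth in rows processed: (last / first)^(1 / periods) - 1."""
    periods = len(rows) - 1
    return (rows[-1] / rows[0]) ** (1 / periods) - 1


def flag_outliers(history: dict[str, list[int]], threshold: float = 3.0) -> list[str]:
    """Return jobs whose growth rate sits far above the fleet's typical rate.

    'Far above' is measured in median absolute deviations (MAD), a robust
    stand-in for standard deviation. The threshold is an illustrative assumption.
    """
    rates = {job: growth_rate(rows) for job, rows in history.items()}
    typical = median(rates.values())
    mad = median(abs(rate - typical) for rate in rates.values())
    if mad == 0:
        # No spread across the fleet, so there is nothing to compare against.
        return []
    return [job for job, rate in rates.items() if (rate - typical) / mad > threshold]


# Hypothetical rows-processed counts over four periods per job.
EXAMPLE_HISTORY = {
    "job_a": [1000, 1010, 1020, 1030],
    "job_b": [5000, 5070, 5130, 5200],
    "job_c": [2000, 2010, 2030, 2040],
    "job_d": [800, 815, 830, 840],
    "job_x": [1000, 1500, 2300, 3400],  # grows much faster than its peers
}

if __name__ == "__main__":
    print(flag_outliers(EXAMPLE_HISTORY))
```

With this toy data only job_x gets flagged, since the other four all grow a few percent over the window.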

So we looked. It was pulling ALL the data, even though the job spec said it should pull the last 10 years and use that. The data in the reports was accurate, there was no issue there; the extra data did not appear in them. So the report code was right.

"Hey, can we have someone look at the COBOL?"

"No, we don't have enough people to do that."

Kinda expected. Brick wall.

"Hey guys, my COBOL is really rusty, but it wouldn't hurt for me to just have a peek. If I can't find anything, we haven't lost anything."

So they (AIX SYSADMIN) pull the code for me, because, hey... once a Sysadmin, always a Sysadmin, right? Right?!

And being a Sysadmin... I lied. I never coded anything in COBOL. I am a shit programmer, honestly. But I did know it was relatively easy to read. It was designed for "Non-programmers" and it sure as hell is easier to read than C, C++, Java, or a lot of other contemporary languages. LISP anyone?

Anyway, I am looking... well, it takes the current year, subtracts 10 from it... hey, that is only a 2-digit year value!

Sure enough, it is pulling "All data from 1913 to NOW". 2023 - 10, trunc to last 2 digits... join it with a leading "19" and... 1913!
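The arithmetic above can be sketched in a few lines. This is Python, not the actual COBOL, and the function names are made up. The order of operations (subtract from the full year, keep the last two digits, then prefix "19") is just my reading of the description above.

```python
def cutoff_year_buggy(current_year: int, lookback: int = 10) -> int:
    """Subtract the lookback, keep only two digits, then assume the 1900s."""
    two_digit = (current_year - lookback) % 100  # 2013 -> 13
    return int("19" + f"{two_digit:02d}")        # "19" + "13" -> 1913


def cutoff_year_fixed(current_year: int, lookback: int = 10) -> int:
    """Keep the full four-digit year all the way through."""
    return current_year - lookback


if __name__ == "__main__":
    print(cutoff_year_buggy(2023))  # 1913
    print(cutoff_year_fixed(2023))  # 2013
```

Under that assumed order of operations, the cutoff still comes out right for 2009 (1999) and first goes wrong in 2010 (1900), which would fit a bug that sat quietly through the actual Y2K rollover.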

And that is how I identified a Y2K bug in 2023.

Now for the rest of the story... Here is where it gets good.

Mrs. "Ain't nobody got time for that!" shrugs off our findings and says "We will get to it when we get to it. It works, right?"

Translation: Fuck you, it isn't me that gets called every time it breaks.

Mr. AIX Security puts his hand up. "AKSHULLY... I am red flagging that code. Our policy is that code with Y2K issues cannot be allowed to run in PRODUCTION. It will not be run until you fix it."

Mrs. COBOL: "I never heard of that, besides, we can't finish batch without that job! The bank won't be able to open accounts in the morning!"

Mr. SVP, who was on standby and already briefed just in case we needed a big gun: "Well, you better get on it then, because Mr. Security is right. No Y2K non-compliant code can be run, per FEDERAL regulation. Yes, it runs, and yes it slipped past us for more years than you have been here, but it is what it is. Fix it. You have 9 hrs to batch, I suggest you start. I will approve an emergency change once you have a fix."

Probably the only time I have ever enjoyed hearing "It is what it is."

Posted
1 year ago