20 Year Retrospective on Troubleshooting

Author

Sarah Morgan

I have been working in software for a long time in a few different capacities. Recently, I was prompted to reflect on my career by an incident with one of our customers, and I’ve since been marveling at how much has changed. Most of my early career involved supporting various types of systems (with varying degrees of comfort and competency, tbh). The tools I had at my disposal in those days pretty much got the job done, but this recent incident really made me stop and think about the contrast between the early 2000s and now. My life would have been much easier had I had today’s monitoring and diagnostic services at my disposal, though I would have earned far fewer airline miles.

My first “real” job after graduating with a CS degree was as a QA engineer at a small startup in Boston that developed hedge fund and institutional money management software. Looking back on it, I would not recommend this world as a place to get your feet wet in software because it is s t r e s s f u l. “Time is money” rings even truer when you’re trading to beat the market, and having price and share information at your fingertips is critical. I ended up working as a software consultant, which meant I installed, maintained, and supported our software for a group of domestic and international customers.

One ubiquitous piece of technology in the early days of FinTech was the Bloomberg terminal, which was first released in 1982(!). In 2003, when I got my start, the Bloomberg terminal was on every desk, along with a PC and about 8 monitors. The terminal supplied real-time market data, including pricing, analytics, news, and other stats on the state of the financial world. It was installed on-premises as client/server and came with the proprietary hardware required to navigate the data and tools.

Our product had an early differentiator, which was our big hook. As a user, you could see pricing and availability in your Order Management System (OMS), where you manage your book of trades, or in your Portfolio Management System (PMS), where you oversee the larger fund’s status. We did this by integrating with the Bloomberg terminal, which was also installed onsite. Our software updated share prices in our database in real time from the terminal’s feed. Timing is everything when trading for hedge funds, and price information is a big part of that, so when there were issues with that feature, it was trouble.

One of my clients was a smaller hedge fund based in London. Luckily for me, they were typically fairly low maintenance, but one afternoon we got a somewhat frantic call about the client machines being out of sync with the server. Troubleshooting something from 3,000 miles away can be a challenge, as you can imagine. We went through all of the normal steps and realized that while trade information was in line between the client and server, the pricing information was not. The options available for troubleshooting at the time were to connect to a private VPN in the customer’s office, then RDP to the server and, if you were lucky, to a client machine to see what was going on. There are many things that can go wrong in this configuration. Some companies also used contracted IT teams, so they didn’t have anyone on-site to help troubleshoot network or other hardware issues. As a result, we often ended up paying a visit for installs, upgrades, problems, etc.

Since we didn’t get anywhere troubleshooting remotely, I hopped on a last-minute economy flight to London in a very uncomfortable middle seat in a five-person middle row. I headed out to the office jetlagged and feeling pretty concerned about my ability to troubleshoot the mystery issue the next day. The good news was that once I was there, I could see that all the client machines were having the same problem. That ruled out an issue with one particular client configuration and meant it was something universal on the clients or potentially a network or server misconfiguration. TL;DR is that the problem turned out to be that the IT department had recently upgraded all the client machines to XP SP2. The critical thing to note is that this introduced the Windows firewall on everyone’s machines. By default, this was blocking the incoming UDP traffic. Our software relied on UDP to distribute the pricing information because prices change quickly, and UDP avoids the connection and retransmission overhead of protocols like TCP, at the cost of guaranteed delivery. That meant that all we had to do was allow UDP traffic on the specific ports needed, and we were good to go. A couple of thousand dollars in airfare, hotel, and transportation costs and the temporary loss of a team member on the consulting team were all it took to get one small client back up and running.
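To make the failure mode concrete, here is a minimal Python sketch of a price tick sent over UDP on the loopback interface. It is not the original product's code: the function names, the JSON payload, and the symbol are all assumptions made up for illustration. The point it shows is that a UDP sender gets no error when nothing arrives; the receiver just times out, which is roughly what a firewall silently dropping datagrams looks like from the inside.

```python
import json
import socket


def open_receiver(host="127.0.0.1", port=0, timeout=2.0):
    """Bind a UDP socket; port 0 lets the OS pick a free port."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind((host, port))
    sock.settimeout(timeout)
    return sock


def send_tick(address, symbol, price):
    """Fire-and-forget: no handshake, no acknowledgement, no retry."""
    payload = json.dumps({"symbol": symbol, "price": price}).encode("utf-8")
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sender:
        sender.sendto(payload, address)


def receive_tick(sock):
    """Return the next tick, or None if nothing arrives before the timeout."""
    try:
        data, _ = sock.recvfrom(4096)
    except socket.timeout:
        return None
    return json.loads(data.decode("utf-8"))


if __name__ == "__main__":
    receiver = open_receiver()
    send_tick(receiver.getsockname(), "ACME", 101.25)  # hypothetical symbol
    print(receive_tick(receiver))
    receiver.close()
```

If a firewall drops the datagram, `send_tick` still returns normally and `receive_tick` returns `None`. Nothing in the application itself points at the network, which helps explain why stale prices looked like a sync problem rather than a blocked port.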

Fast forward almost 20 years, and I’m now working as a product manager for an observability company. It had been a while since I had been responsible for monitoring or supporting any systems, so I was definitely not up to speed on the extent to which things had changed. Of course, I was somewhat familiar with monitoring tools and SaaS applications, having been a software PM for all this time, but I had never really stopped to think about the way the world has changed until a recent incident.

Recently, one of our customers experienced an issue not dissimilar to the one my previous customer had: software installed on a user’s laptop was acting unpredictably because of the network configuration at the hotel where the user was staying during a conference. TL;DR, an issue that would previously have required someone to travel on-site to troubleshoot was resolved in about 5 minutes by reviewing the trace of the request initiated by the client that was causing issues. Think of all the man hours, money, and frustration saved! This is a sea change from the days of opaque systems, and I might still be working on the technical side of the house had I had this kind of one-stop-shop insight into the software I was supporting.

For me, this really highlights the importance of observable systems for those who support them. I’m also cognizant of the fact that running software in production is now much more complex and distributed than when I had my fingers in the mix. Back in the day, Nagios and some SQL Server alerting were all you really needed to stay more or less up to speed on the state of your systems. This clearly means that our tools must have evolved in line with the evolution of software, right? I’ve been surprised to find that this has not necessarily been true, and I’ve talked to many a frustrated developer who has been spending too much time trying to navigate logs to understand where an issue is occurring.

In some ways, things are easier. Systems are more accessible thanks to the cloud. They can be more stable and scalable thanks to the automation of resource allocation. Releases can be more rapid and less scary with CI/CD. All of this is great, but the truth is that as things become more distributed, the knowledge of those things also becomes distributed. No one person typically understands everything from top to bottom, because we are all more focused on our functional areas. That means it can be quite the wild goose chase to track down a one-off issue now.

This is where I will advocate for the benefits organizations get by utilizing OpenTelemetry in their systems, or some other method for making things more transparent and correlated. Regardless of the backend you use (but you really should check out TelemetryHub), an understanding of the lower-level relationships and communications happening in your system will, at some point, be invaluable when something unexpected happens. The ability to pivot on an attribute like a user ID when diagnosing is something I could never have dreamed of in my support days. These are exciting times, and it will be fascinating to see where the advancement of observability technology will have an impact. I’m expecting to see big changes in the way organizations work together when more people are privy to the system information. I expect greater shared responsibility in addition to more reliable and scalable systems thanks to the availability of insight into how everything works together.
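For readers who have not seen what “pivoting on an attribute” looks like, here is a rough Python sketch of the idea. It deliberately does not use the OpenTelemetry SDK or any backend’s query API; the `Span` class, the `user.id` attribute key, and the sample data are simplified stand-ins invented for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Span:
    """A simplified, hypothetical stand-in for an exported trace span."""
    trace_id: str
    name: str
    duration_ms: float
    attributes: dict = field(default_factory=dict)


def pivot_by_attribute(spans, key):
    """Group spans by the value of one attribute, skipping spans without it."""
    groups = defaultdict(list)
    for span in spans:
        if key in span.attributes:
            groups[span.attributes[key]].append(span)
    return dict(groups)


def slowest_span(spans):
    """Return the span with the longest duration."""
    return max(spans, key=lambda s: s.duration_ms)


# Made-up sample data: one healthy user and one user on a bad network.
spans = [
    Span("t1", "GET /prices", 42.0, {"user.id": "u-17"}),
    Span("t2", "GET /prices", 5100.0, {"user.id": "u-99"}),
    Span("t2", "dns.lookup", 4900.0, {"user.id": "u-99"}),
    Span("t3", "GET /health", 3.0),
]

by_user = pivot_by_attribute(spans, "user.id")
```

With the spans grouped this way, `slowest_span(by_user["u-99"])` goes straight to the slow request for the affected user, and the other spans sharing its trace ID suggest where the time went. A real backend does this filtering for you, but the underlying idea is the same: every span carries attributes you can group and filter on.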
