Frequent 504 gateway timeouts since 2020-01-29

Since 2020-01-29 I have seen more and more frequent 504 gateway timeout errors.

I have a Drupal site that syncs data from Infusionsoft to the site for various uses throughout the site (views, reports, etc…) utilizing a slightly extended class of the infusionsoft-php 1.4.9 library.

This site does the sync of data throughout the day and up until recently would only occasionally run into a 504 error. I have written the code to retry the request once when a 504 happens and most of the time the retried request goes through fine. It records a record in the log whenever this happens.
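As an illustration of the retry-once behaviour described above, here is a minimal Python sketch. It is not the site's actual PHP code. `call_with_retry` and the `request` callable (assumed to return a status code and a body) are hypothetical names used only for this example.

```python
import logging
from typing import Any, Callable, Tuple

logger = logging.getLogger("infusionsoft_sync")


def call_with_retry(
    request: Callable[[], Tuple[int, Any]], retries: int = 1
) -> Tuple[int, Any]:
    """Run request(); on HTTP 504, log the event and retry up to `retries` times.

    `request` is a hypothetical callable returning (status_code, body).
    """
    status, body = request()
    attempts = 0
    while status == 504 and attempts < retries:
        attempts += 1
        # Record every 504 so the frequency can be reviewed later.
        logger.warning(
            "504 Gateway Timeout, retrying (attempt %d of %d)", attempts, retries
        )
        status, body = request()
    return status, body
```

With the default `retries=1`, a request that fails once with a 504 is repeated exactly once, and whatever the second attempt returns is passed back to the caller.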

Since the 29th, the retry is still mostly going through, but I have seen quite an uptick in the number of times an initial 504 happens.

Does this have something to do with the migration to Apigee? I don’t think I am anywhere near the 1500 QPM limit, and I don’t think that is what I am running into, since that should be a 429 error instead of a 504. However, I don’t see the same statistics in Apigee that were available in Mashery (number of calls, type of call, any errors, etc.), unless I am just not seeing where that is available.

Probably not the QPM at all, but I did see a post about Mashery being cut off at the end of the month (Jan?) so there is some question about the timing.

Anyone else seeing this?

I’ve been having issues refreshing my token since around this time as well

I haven’t specifically had any issues with refreshing the token, just with regular calls to the API. Some are via XML-RPC and some via REST, and the error seems to happen with both.

As an example, the sync process maintains a record of the opportunities. Since our account has about 1500 opportunities, it grabs them in batches of 100. So, to get all of them, it requires 15 calls to the API. Since the 29th it will process all 15, but sometimes several of them (say the 5th, then the 12th) will get an initial 504 error and have to immediately retry, which then works (most of the time).
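The batch arithmetic above (about 1500 records at 100 per call gives 15 calls) can be sketched in Python as follows. This is only an illustration and assumes the total record count is known up front. `pages_needed`, `fetch_all`, and the `fetch_page(page, page_size)` callable are hypothetical names, not part of the infusionsoft-php library.

```python
import math
from typing import Callable, Dict, List


def pages_needed(total_records: int, page_size: int) -> int:
    """Number of API calls required to fetch every record."""
    return math.ceil(total_records / page_size)


def fetch_all(
    fetch_page: Callable[[int, int], List[Dict]],
    total_records: int,
    page_size: int = 100,
) -> List[Dict]:
    """Fetch all records page by page.

    `fetch_page(page, page_size)` is a hypothetical callable that returns
    one batch of records for the given zero-based page number.
    """
    records: List[Dict] = []
    for page in range(pages_needed(total_records, page_size)):
        records.extend(fetch_page(page, page_size))
    return records
```

Each `fetch_page` call is a separate request, so every one of the 15 calls is an independent chance to hit a 504.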

Our site is doing this with several data points (contacts, referrals, etc…) to allow us to generate reports automatically and display them on the site. So the sync operations are happening all day every day at regular intervals for different data points.

Before the 29th there was only the occasional 504, maybe one a day. Now I am seeing several an hour.

We are attempting to remedy the situation. This was caused by our new API Proxy having a shorter timeout than our old proxy.

Seeing fewer 504 errors the last day or so, though they are still happening more than they used to. I have also seen some 503 Service Unavailable errors thrown in there as well.

We modified the connection timeouts yesterday morning around 16:30 UTC to match what we had with Mashery (300 seconds for our backends to respond). We are continuing to monitor them to make sure our changes are having the desired effect.

@Daniel_Watson Our experience has been nearly identical to yours:

  • We have a process that performs periodic syncs throughout the day that involves fetching many Contact and Opportunity records, 1000 at a time.
  • Starting around 01/29/2020, we started seeing higher than usual Gateway Timeout errors with these API calls. Most calls still completed successfully in 0.5 - 2s, except for the few per hour that took ~60 seconds then returned with a 504 error.
  • Starting around 9 AM MT on 02/05/2020, the Gateway Timeout errors seem to have largely stopped. However, we are now seeing occasional API calls that take 300s and error with read timeouts (“Read timed out” from our Java code).

@bradb Things appear to have improved, I did have a 503 at around 5:30am EST today, so there may still be a few hiccups happening, but it does seem to have improved from a few days ago.

@Randy_Baldwin I haven’t been watching specifically how long the calls take, so I may have to start watching that to see. Looks like I need to expand my error checking / logging some more.
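For anyone wanting to log how long each call takes, a minimal Python sketch could look like the following. `timed_call` and the `request` callable are hypothetical names for illustration, and the real sync code here is PHP.

```python
import logging
import time
from typing import Any, Callable, Tuple

logger = logging.getLogger("infusionsoft_sync.timing")


def timed_call(label: str, request: Callable[[], Any]) -> Tuple[Any, float]:
    """Run request(), log its duration, and return (result, elapsed_seconds)."""
    start = time.monotonic()
    try:
        result = request()
    except Exception:
        # Log the duration of failed calls too, e.g. ~60s before a 504
        # or ~300s before a read timeout.
        logger.exception("%s failed after %.2fs", label, time.monotonic() - start)
        raise
    elapsed = time.monotonic() - start
    logger.info("%s completed in %.2fs", label, elapsed)
    return result, elapsed
```

Logging failures along with their durations makes it easier to tell a gateway timeout at a fixed cutoff apart from a call that is merely slow.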

You may be running into this more than me when it comes to Contact/Opportunity calls. I found that if I tried to pull the full contact or opportunity record with all our custom fields, trying to grab 1000 at a time would time out quite often, so I had to reduce the number of records I pull at a time. I think I currently only do 100. Part of that is how long our server takes to process the data after getting it, but it could also be down to how long the call takes to get a response.

I am pulling a full 1000 referral records at a time, which is far less data than contacts (9 fields as opposed to 100 or so for a contact) and hadn’t been an issue, but I had started seeing errors with referrals and other tables that hold smaller amounts of data.

Hopefully the longer timeout setting will fix things going forward.
