
packets.go: read tcp [ip-address]: connection timed out #257

Closed
@eduncan911

Description

UPDATE: My resolution was to remove all "Idle" connections from the pool. See this comment:

#257 (comment)
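A minimal sketch of that change, assuming "remove all Idle connections" means setting the idle count to zero on the same global *sql.DB shown in the code further down:

// Workaround described in the linked comment: keep no idle connections in the
// pool, so connections are closed as soon as they are released instead of
// sitting idle and possibly being dropped by the remote end while unused.
db.SetMaxIdleConns(0)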

I am currently experiencing a stalling or broken web app after an idle period of anywhere from 15 to 48 minutes. The most critical issue shows up after nothing more than this:

  • Visit a URL, any url on the site, and load the page completely (as in, the page actually loads and the logs show a complete page has loaded).
  • Close browser, and wait.

A typical request is logged like this:

2014/07/13 15:29:54 INFO template rendering: index
2014/07/13 15:29:54 METRIC, URL: /, HANDLER TIME: 7.2339ms, CTX TIME: 5.0894ms, TOTAL TIME: 12.3258ms

After a long period of time (ranging from 15m to 48m), and with no interaction at all, the system suddenly logs the lines below - the web app has been idle this entire time:

[MySQL] 2014/07/13 16:00:09 packets.go:32: read tcp remote-mysql-server-address:3306: connection timed out
[MySQL] 2014/07/13 16:00:09 packets.go:118: write tcp remote-mysql-server-address:3306: broken pipe
2014/07/13 16:00:10 INFO template rendering: index
2014/07/13 16:00:10 METRIC, URL: /, HANDLER TIME: 8.8574ms, CTX TIME: 31m19.2606723s, TOTAL TIME: 31m19.2695329s

Notice the "TOTAL TIME" is 31 minutes and 19 seconds? Also, notice the MySql driver error that is logged at the same time?

There was no activity / no web request made. The web app was simply idle.

The most critical issue is what comes next after these log messages: _the very next web request stalls completely, never returning a response_:

user@govm1:~$ wget http://localhost
--2014-07-13 17:11:18--  http://localhost/
Resolving localhost (localhost)... 127.0.0.1
Connecting to localhost (localhost)|127.0.0.1|:80... connected.
HTTP request sent, awaiting response... Read error (Connection timed out) in headers.
Retrying.

--2014-07-13 17:26:19--  (try: 2)  http://localhost/
Connecting to localhost (localhost)|127.0.0.1|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: `index.html.4'

    [ <=>                                                                                  ] 6,310       --.-K/s   in 0.001s

2014-07-13 17:26:20 (9.61 MB/s) - `index.html.4' saved [6310]

And the first attempt sits there idle, with no response, for 15 minutes until wget's read times out and it retries.

Now, if I make a 2nd or 3rd request immediately after this one stalls - or at any time while it is stalled - the Go web app responds and returns a full page for those other requests. No issues. The cycle then starts over from the last request I make before letting the site sit idle.

After this 15m, you can guess exactly what is logged next:

[MySQL] 2014/07/13 17:26:57 packets.go:32: read tcp remote-mysql-server-address:3306: connection timed out
[MySQL] 2014/07/13 17:26:57 packets.go:118: write tcp remote-mysql-server-address:3306: broken pipe
2014/07/13 17:26:57 INFO template rendering: index
2014/07/13 17:26:57 METRIC, URL: /, HANDLER TIME: 6.8938ms, CTX TIME: 15m39.1718434s, TOTAL TIME: 15m39.1787398s

Another 15m wait time.

I have ruled out Windows Azure, the cluster VIP, and the firewall/Linux VM running the Go web app as the issue, because I ran wget http://localhost locally on the same box and still got this "stalled" request that never completes and never sends anything back.

There are a number of factors in my web app so I will try to outline them accordingly.

Using:

  • Go 1.3
  • go-sql-driver/mysql Version 1.2 (2014-06-03)
  • Ubuntu 12.04 LTS, ~June 2014 Updates
  • Windows Azure

Do note that the Linux box running MySQL is a different Linux box from the one running the cluster of Go apps - and they are in separate dedicated Cloud Services. The MySQL VM is a single VM, no clustering.

Here is some related code:

// global handler for our DB
var db *sql.DB

// CLI parameter
var dbdsn string

func init() {

    flag.StringVar(&dbdsn, "dbdsn", "root:root@tcp(localhost:3306)/prod?timeout=5s&tls=false&autocommit=true", "Specifies the MySql DSN connection.")
    flag.Parse()

    var err error
    db, err = sql.Open("mysql", dbdsn)
    if err != nil {
        log.Printf("ERROR in sql.Open(): %v", err)
    }

    //db.SetMaxIdleConns(5)

    // verify the DSN is set up properly
    err = db.Ping()
    if err != nil {
        panic("PANIC when pinging db: " + err.Error()) // proper error handling instead of panic in your app
    }
}

// **********
// * omitted is the Gorilla MUX router and http handler registrations
// **********

func ArticleHandler(w http.ResponseWriter, r *http.Request, c *Context) (err error) {

    m := NewArticle(c)
    id := c.Vars["id"]

    var pid int
    var title, body, excerpt, date, slug, fi, fv, region, region_slug string
    err = db.QueryRow(
        "SELECT p.ID, p.post_title, p.post_content, p.post_excerpt, p.post_date, p.post_name, "+
            "(SELECT fpim.meta_value FROM wp_postmeta fpim WHERE fpim.meta_key = '_wp_attached_file' AND fpim.post_id = (SELECT fpim2.meta_value FROM wp_postmeta fpim2 WHERE fpim2.post_id = p.ID AND fpim2.meta_key = '_thumbnail_id' LIMIT 1) LIMIT 1) AS featured_image, "+
            "(SELECT fpim3.meta_value FROM wp_postmeta fpim3 WHERE fpim3.meta_key = 'fv_video' AND fpim3.post_id = p.ID LIMIT 1) AS featured_video, "+
            "t.name as region, t.slug as region_slug "+
            "FROM wp_posts p "+
            "JOIN wp_term_relationships tr ON tr.object_id=p.ID "+
            "JOIN wp_term_taxonomy tt ON tt.term_taxonomy_id=tr.term_taxonomy_id "+
            "JOIN wp_terms t ON t.term_id=tt.term_id "+
            "WHERE p.post_name=? AND p.post_type='post' AND p.post_status='publish' AND p.post_date <= UTC_TIMESTAMP()"+
            "AND tr.object_id=p.ID AND tt.parent = (SELECT t3.term_id FROM wp_terms t3 WHERE t3.name=? LIMIT 1) LIMIT 1",
        id, RegionsParentCategory).
        Scan(&pid, &title, &body, &excerpt, &date, &slug, &fi, &fv, &region, &region_slug)

    if err != nil {
        if err == sql.ErrNoRows {

            // snipped code for redirects

            // article was not found
            return handleNotFound(w, r, c)

        } else {
            log.Printf("ERROR in .Scan(): %v", err)
        }
    } else {
        m.Region = Region{
            Name: region,
            Slug: region_slug,
        }
        m.Id = pid
        m.Title = title
        m.Body = template.HTML(body) // render the raw html
        m.Excerpt = excerpt
        m.Datetime = date
        m.Slug = slug
        m.FeaturedImageUrl = fi
        m.FeaturedVideoUrl = fv
    }

    web.RenderTemplate(w, "article", m)
    return
}

5 more DB queries per request

In addition to this query, my "Context" you see being passed into the handler runs 4 to 6 additional SQL queries. Therefore, each "article" page load runs about 5 to 7 SQL queries at a minimum, using the exact same pattern and the same *db global variable you see above.

Timeouts / errors are always on the same DB query

Here's one of the "context" queries as a comparison:

rows2, err := db.Query(
    "SELECT p.post_title, p.post_name "+
        "FROM wp_posts p "+
        "WHERE p.post_type='page' AND p.post_status='publish' AND p.post_date <= UTC_TIMESTAMP() "+
        "AND p.post_parent = (SELECT p2.ID FROM wp_posts p2 WHERE p2.post_name=? LIMIT 1) "+
        "ORDER BY p.menu_order",
    FooterPagesParentNameSlug)
if err != nil {
    log.Printf("ERROR in AllPages .Query() : %v", err)
} else {
    defer rows2.Close()
    c.AllFooterPages = make([]FooterPage, 0)
    for rows2.Next() {
        var name, slug string
        err := rows2.Scan(&name, &slug)
        if err != nil {
            log.Printf("ERROR in AllPages row.Scan() : %v", err)
        } else {
            p := FooterPage{
                Page: Page{
                    Title: name,
                    Slug:  slug,
                },
            }
            c.AllFooterPages = append(c.AllFooterPages, p)
        }
    }
}

Nothing special there.

I do call defer rows2.Close() only if there was no error. Perhaps that is part of the issue? Under load tests, this particular SQL query is the one that logs errors - either no response at all or the MySQL driver timing out.
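For completeness, deferring rows2.Close() only when err is nil should be fine, since rows2 is not usable when Query returns an error. One thing the snippet above does not do is check rows2.Err() after the loop - Next() returns false both at end-of-rows and on error, so a connection-level failure during iteration would be silent here. A minimal sketch of that check, reusing the same rows2 variable:

for rows2.Next() {
    // ... Scan and append, exactly as in the snippet above ...
}
// Next() returning false can mean either end-of-rows or an error;
// rows.Err() is what distinguishes the two, so check it after the loop.
if err := rows2.Err(); err != nil {
    log.Printf("ERROR in AllPages rows iteration: %v", err)
}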

Questions

Why am I getting request times logged in excess of 15 to 30 minutes from an idle site? That seems like a bug in the MySQL driver I am using, possibly holding a connection open. But the last HTTP request was successful and returned a complete page + template.

I even have the timeout set in the connection string, which is 5 seconds. Even if it is a problem with the MySQL server, why is a 15-minute timeout/request being logged? Where did that request come from?
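(Side note, hedged: my understanding is that the timeout= DSN parameter in go-sql-driver/mysql only applies while establishing a connection - the dial - and does not bound reads or writes on a connection that is already open. Driver releases later than the 1.2 used here document separate readTimeout and writeTimeout DSN parameters; a rough sketch with a placeholder host, assuming a driver version that supports them:)

// Sketch only: readTimeout/writeTimeout are documented for driver releases
// newer than the v1.2 used here - check the driver docs before relying on them.
// "db-host" is a placeholder for the remote MySQL server address.
dsn := "root:root@tcp(db-host:3306)/prod" +
    "?timeout=5s" +      // dial timeout: only applies while establishing the connection
    "&readTimeout=30s" + // I/O read timeout on an already-open connection
    "&writeTimeout=30s"  // I/O write timeout on an already-open connection
db, err := sql.Open("mysql", dsn) // used exactly like the sql.Open call in init() above
if err != nil {
    log.Printf("ERROR in sql.Open(): %v", err)
}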

It could still be a MySQL driver issue blocking the request from completing - or maybe it is being blocked by the dedicated MySQL VM and an issue there. If that is the case, why is nothing logged? And what is this seemingly random timeout of 15m to 48m? It is usually 15m or 31m, but sometimes 48m is logged.

It is very interesting that the timeouts fall on rough multiples of "15m" (at 15m, 31m, and 48m), allowing for some padding in seconds.
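Working that out from the values actually logged above, the shorter interval roughly doubles into the longer one:

15m39.17s x 2 ≈ 31m18.3s   (logged above: 31m19.26s)
15m39.17s x 3 ≈ 46m57.5s   (close to the occasional ~48m)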

Thanks in advance.
