Skip to content

http.GET() randomly exits with error -1 (connection refused) -> issue with function WiFiGenericClass::hostByName() [proposal of changes in library] #3722

Closed
@asier70

Description

@asier70

Hardware:

Board: ESP32 Dev Module
Core Installation version: 1.0.4
IDE name: Arduino IDE

Description:

Because issue #2778 Getting [E][WiFiGeneric.cpp:658] hostByName(): DNS Failed randomly is closed and this problem still exists I decided to open new one.
From time to time function http.GET() immediatelly exits with error -1 which means "connection refused". This problem was observed by many other programmers, especially when internet connection was slow or function was used very often.
I investigated this problem for very long time. In this situation "connection refused" doesn't mean that remote server refused our connection. Problematic was function WiFiGenericClass::hostByName(), which resolves host name and translates it to IP address.

Analysis:

Tree of functions use is as follows:

int HTTPClient::GET()
{
    return sendRequest("GET");
} 

int HTTPClient::sendRequest(const char * type, String payload)
{
    return sendRequest(type, (uint8_t *) payload.c_str(), payload.length());
} 

int HTTPClient::sendRequest(const char * type, uint8_t * payload, size_t size)
{
    // connect to server
    if(!connect()) {
        return returnError(HTTPC_ERROR_CONNECTION_REFUSED);
    }
    [...]
} 

bool HTTPClient::connect(void)
{
    [...]
    if(!_client->connect(_host.c_str(), _port, _connectTimeout)) {
        log_d("failed connect to %s:%u", _host.c_str(), _port);
        return false;
    }
    [...]
} 

int WiFiClient::connect(const char *host, uint16_t port, int32_t timeout)
{
    IPAddress srv((uint32_t)0);
    if(!WiFiGenericClass::hostByName(host, srv)){
        return 0;
    }
    return connect(srv, port, timeout);
} 

After investigation error is caused by WiFiGenericClass::hostByName() function:

/**
 * Resolve the given hostname to an IP address.
 * @param aHostname     Name to be resolved
 * @param aResult       IPAddress structure to store the returned IP address
 * @return 1 if aIPAddrString was successfully converted to an IP address,
 *          else error code
 */
int WiFiGenericClass::hostByName(const char* aHostname, IPAddress& aResult)
{
    ip_addr_t addr;
    aResult = static_cast<uint32_t>(0);
    waitStatusBits(WIFI_DNS_IDLE_BIT, 5000);
    clearStatusBits(WIFI_DNS_IDLE_BIT);
    err_t err = dns_gethostbyname(aHostname, &addr, &wifi_dns_found_callback, &aResult);
    if(err == ERR_OK && addr.u_addr.ip4.addr) {
        aResult = addr.u_addr.ip4.addr;
    } else if(err == ERR_INPROGRESS) {
        waitStatusBits(WIFI_DNS_DONE_BIT, 4000);
        clearStatusBits(WIFI_DNS_DONE_BIT);
    }
    setStatusBits(WIFI_DNS_IDLE_BIT);
    if((uint32_t)aResult == 0){
        log_e("DNS Failed for %s", aHostname);
    }
    return (uint32_t)aResult != 0;
} 

/**
 * DNS callback
 * @param name
 * @param ipaddr
 * @param callback_arg
 */
static void wifi_dns_found_callback(const char *name, const ip_addr_t *ipaddr, void *callback_arg)
{
    if(ipaddr) {
        (*reinterpret_cast<IPAddress*>(callback_arg)) = ipaddr->u_addr.ip4.addr;
    }
    xEventGroupSetBits(_network_event_group, WIFI_DNS_DONE_BIT);
} 

Real problem lies in timeout value used in above function. This function uses DNS functionality of lwip library. How it works? When domain name is in local cache lwip library returns immediately resolved IP address from cache. If not it starts working in backgroud checking IP address in external DNS servers. Function waits only 4000[ms] for this. If resolving procedure lasts more time than 4[s] function returns with error, but background process still lives. When it finishes, callback function is called, WIFI_DNS_DONE_BIT is set and result variable is filled with resolved IP address. This situation is very dangerous for whole application! There are two big problems connected with it.
When next time we use hostByName() function it returns immediately with error because WIFI_DNS_DONE_BIT is set (look at wait for status bit function). Background callback remains alive so abnormal situation will be repeated over and over again (till DNS cache use).
Second problem is more dangerous in my opinion. Object srv of IPAddress class in fuction WiFiClient::connect() is declared locally on stack. It is passed by reference to callback function. When callback ends it writes resolved IP address to this stack memory area, but after 4[s] this area is probably used by other function so it may result in application crash...

Solution

I tried to find timeout value of DNS resolving procedure in lwip library. I found that internal timer has default 1[s] period and there are 4 retries. So deducing external DNS IP address resolving procecedure should have 4 seconds timeout. But nothing could be more wrong.
I did a test. I manually set both DNS serwers as some IPs that aren't DNS servers. I measured that real timeout occurs after 13-14 seconds! So timeout value used in hostByName() function should be more than 14000[ms].
I thought about the solution for a long time.
I suggest to make three changes in WiFiGenericClass::hostByName() function:

int WiFiGenericClass::hostByName(const char* aHostname, IPAddress& aResult)
{
    ip_addr_t addr;
    aResult = static_cast<uint32_t>(0);
    waitStatusBits(WIFI_DNS_IDLE_BIT, 16000);  //* CHANGED FROM 5000 to 16000
    clearStatusBits(WIFI_DNS_IDLE_BIT);
    clearStatusBits(WIFI_DNS_DONE_BIT);  //* NEW
    err_t err = dns_gethostbyname(aHostname, &addr, &wifi_dns_found_callback, &aResult);
    if(err == ERR_OK && addr.u_addr.ip4.addr) {
        aResult = addr.u_addr.ip4.addr;
    } else if(err == ERR_INPROGRESS) {
        waitStatusBits(WIFI_DNS_DONE_BIT, 15000);  //* CHANGED FROM 4000 to 15000
        clearStatusBits(WIFI_DNS_DONE_BIT);
    }
    setStatusBits(WIFI_DNS_IDLE_BIT);
    if((uint32_t)aResult == 0){
        log_e("DNS Failed for %s", aHostname);
    }
    return (uint32_t)aResult != 0;
} 

What do you think about this solution?
Can anyone verify my logical reasoning about it?

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions