
                     JOURNAL FOR SATURDAY 9TH MARCH, 2019
______________________________________________________________________________

SUBJECT: Random segmentation faults on Raspberry Pi 3
   DATE: Sat  9 Mar 16:44:02 GMT 2019

A few people have asked me if I’ve forgotten about posting some performance
testing figures for the new connection quotas on the Raspberry Pi. I haven’t,
but there have been some issues…

I did some testing, wasn't happy with the results, and so wanted to do some
CPU profiling. However, on the Raspberry Pi 3 I started seeing random
segmentation faults whenever CPU profiling was turned on. I managed to boil
everything down into a simple test program:


  package main

  import (
    "log"
    "net"
    "os"
    "runtime/pprof"
  )

  // Test from shell using:
  //
  //  seq 1000000 | xargs -n1 netcat -zn 127.0.0.1 4001
  //
  func main() {

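    // Set debug to false to disable CPU profiling; on the Raspberry
    // Pi 3 this also makes the segmentation faults go away.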
    debug := true

    if debug {
      f, err := os.Create("/tmp/cpuprof")
      if err != nil {
        log.Fatal(err)
      }
      if err = pprof.StartCPUProfile(f); err != nil {
        log.Fatal(err)
      }
    }

    addr, err := net.ResolveTCPAddr("tcp", "127.0.0.1:4001")
    if err != nil {
      log.Fatal(err)
    }

    l, err := net.ListenTCP("tcp", addr)
    if err != nil {
      log.Fatal(err)
    }
    log.Print("Listening...")

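    // Accept 1,000,000 connections and just close them as they
    // arrive, logging progress every 1,000 connections.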
    for x := 1; x <= 1000000; x++ {
      conn, err := l.AcceptTCP()
      if err != nil {
        log.Fatal(err)
      }
      err = conn.Close()
      if err != nil {
        log.Fatal(err)
      }
      if x%1000 == 0 {
        log.Printf("%d connections...", x)
      }
    }

    if debug {
      pprof.StopCPUProfile()
    }

    log.Printf("Finished")
  }

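
The netcat loop in the comment is what I use to drive the test, but it forks
a process for every connection. A small Go client along these lines (just a
sketch on my part, not something used for any of the figures here) should
generate the same load rather more quickly:


  package main

  import (
    "log"
    "net"
  )

  // Hypothetical load generator: dial the test listener 1,000,000 times,
  // closing each connection as soon as it is established.
  func main() {
    for x := 1; x <= 1000000; x++ {
      conn, err := net.Dial("tcp", "127.0.0.1:4001")
      if err != nil {
        log.Fatal(err)
      }
      if err = conn.Close(); err != nil {
        log.Fatal(err)
      }
    }
  }
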

The test program waits for 1,000,000 connections and just closes them as
they arrive. On amd64 I see what I expect to see:


  2019/03/09 12:45:26 Listening...
  2019/03/09 12:45:30 1000 connections...
  2019/03/09 12:45:30 2000 connections...
  :
  :
  :
  2019/03/09 12:51:10 999000 connections...
  2019/03/09 12:51:11 1000000 connections...
  2019/03/09 12:51:11 Finished


However, on the Raspberry Pi 3 I see:


  2019/03/09 12:46:30 Listening...
  2019/03/09 12:46:35 1000 connections...
  2019/03/09 12:46:38 2000 connections...
  :
  :
  :
  2019/03/09 12:47:06 14000 connections...
  2019/03/09 12:47:08 15000 connections...
  Segmentation fault (core dumped)


If I set debug to false on the Raspberry Pi 3, disabling the CPU profiling,
it runs to completion:


  2019/03/09 12:58:06 Listening...
  2019/03/09 12:58:13 1000 connections...
  2019/03/09 12:58:15 2000 connections...
  :
  :
  :
  2019/03/09 13:43:58 999000 connections...
  2019/03/09 13:44:00 1000000 connections...
  2019/03/09 13:44:00 Finished


Interestingly, the program works fine, albeit slowly, on a Raspberry Pi
Zero W with CPU profiling turned on.

The actual number of connections before it fails is random. However, it
always fails in the same place: the check in runtime.sigtrampgo that works
out whether the signal handler is already running on the goroutine's signal
stack. Looking at the core file with gdb I see:


  Core was generated by `./cpuprofile'.
  Program terminated with signal SIGSEGV, Segmentation fault.
  #0  runtime.sigtrampgo (ctx=0x2009d10, info=0x2009c90, sig=<optimized out>)
    at /home/rolfea/go1.11.1/src/runtime/signal_unix.go:307
  307             if sp < g.m.gsignal.stack.lo || sp >= g.m.gsignal.stack.hi {
  [Current thread is 1 (LWP 16249)]
  Loading Go Runtime support.
  (gdb) bt
  #0  runtime.sigtrampgo (ctx=0x2009d10, info=0x2009c90, sig=<optimized out>)
    at /home/rolfea/go1.11.1/src/runtime/signal_unix.go:307
  #1  0x00064904 in runtime.sigtramp ()
    at /home/rolfea/go1.11.1/src/runtime/sys_linux_arm.s:445
  #2  0x7ea99408 in ?? ()
  Backtrace stopped: previous frame identical to this frame (corrupt stack?)


Running with strace I see:


  accept4(5, 0xc63de0, [112], SOCK_CLOEXEC|SOCK_NONBLOCK) = -1 EAGAIN
    (Resource temporarily unavailable)
  epoll_wait(4, [{EPOLLIN, {u32=1724452800, u64=1724452800}}], 128, 0) = 1
  accept4(5, {sa_family=AF_INET, sin_port=htons(34640),
    sin_addr=inet_addr("127.0.0.1")}, [112->16], SOCK_CLOEXEC|SOCK_NONBLOCK) = 6
  epoll_ctl(4, EPOLL_CTL_ADD, 6, {EPOLLIN|EPOLLOUT|EPOLLRDHUP|EPOLLET,
    {u32=1724452672, u64=1724452672}}) = 0
  getsockname(6, {sa_family=AF_INET, sin_port=htons(4001),
    sin_addr=inet_addr("127.0.0.1")}, [112->16]) = 0
  setsockopt(6, SOL_TCP, TCP_NODELAY, [1], 4) = 0
  epoll_ctl(4, EPOLL_CTL_DEL, 6, 0xc63ec0) = 0
  close(6)                                = 0
  accept4(5, 0xc63de0, [112], SOCK_CLOEXEC|SOCK_NONBLOCK) = -1 EAGAIN
    (Resource temporarily unavailable)
  epoll_wait(4, [], 128, 0)               = 0
  epoll_wait(4, [{EPOLLIN, {u32=1724452800, u64=1724452800}}], 128, -1) = 1
  --- SIGPROF {si_signo=SIGPROF, si_code=SI_KERNEL} ---
  --- SIGSEGV {si_signo=SIGSEGV, si_code=SEGV_MAPERR, si_addr=0xa387933b} ---
  +++ killed by SIGSEGV +++
  Segmentation fault


Hrm, that doesn’t look good :( The SIGSEGV arrives while the profiler’s
SIGPROF is being handled, which ties in with the crash inside the signal
handling code above. So far on the Raspberry Pi 3 I have tried Go 1.11.1,
Go 1.12 and tip, all with the same results.

Maybe this is related to issues #20417 and #10534[1]?

I’ve just posted to golang-nuts to try and get some feedback[2].

--
Diddymus

  [1] https://github.com/golang/go/issues/20417
      https://github.com/golang/go/issues/10534

  [2] https://groups.google.com/forum/#!topic/golang-nuts/nMQHPUhB9L8

