Re: Serious Performance Problem with Sun NIT/DLPI
Author:   Van Jacobson <van@ell.ee.lbl.gov>
Date:     1995/06/09
Forums:   comp.protocols.tcp-ip, comp.sys.sun.misc

In article <3r096c$s0l@nova.netapp.com>, Guy Harris <guy@netapp.com> wrote:
>How much of it was due to STREAMS in and of itself, and how much was due
>to the fact that STREAMS NIT, in SunOS 4.x, was a STREAMS-based
>mechanism stuck atop a BSD/mbuf-based networking framework, which turned
>mbufs into STREAMS messages by copying data from the mbuf into a STREAMS
>buffer (which might not be necessary under a purely STREAMS-based
>networking implementation, such as that of SunOS 5.x, and might have
>been avoidable even in 4.x had 4.x supported a STREAMS moral equivalent
>of "loaned cluster mbufs", possibly allowing the mbuf's data to have a
>STREAMS dbuf glued onto it)?

Guy,

I think we already had this conversation in a bar in San Antonio
during some Usenix but the answer is that none of the result
had anything to do with mbufs -- Steve bracketed the call to
snit_intr() in do_protocol() with reads of the microsecond clock
counter & histogrammed the difference between the two times.
The only impact of mbufs on snit_intr() is that snit_cpymsg(),
the routine used to copy the message, calls m_cpytoc() rather than
calling bcopy directly.  Since the packets were always contiguous &
in a single mbuf or loaned cluster, m_cpytoc() contributes <2us out of
a total of more than 100us spent in snit.
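
A minimal sketch of that measurement technique (bracket a call with two
clock reads, then histogram the difference) might look like the
following.  The bucket count, bucket width and all function names here
are assumptions for illustration; the actual SunOS instrumentation is
not shown in this post.

```c
#include <stddef.h>
#include <stdint.h>

#define NBUCKETS  64   /* hypothetical: number of histogram buckets */
#define BUCKET_US 4    /* hypothetical: microseconds per bucket */

uint32_t histogram[NBUCKETS];

/* Record one elapsed interval, in microseconds.  Intervals past the
 * last bucket are clamped into it. */
void hist_record(uint32_t start_us, uint32_t end_us)
{
    uint32_t delta = end_us - start_us;  /* unsigned subtract survives counter wrap */
    size_t bucket = delta / BUCKET_US;

    if (bucket >= NBUCKETS)
        bucket = NBUCKETS - 1;
    histogram[bucket]++;
}

/* Bracket a call with two clock reads.  read_clock_us stands in for
 * whatever microsecond counter the platform provides. */
void timed_call(uint32_t (*read_clock_us)(void), void (*fn)(void *), void *arg)
{
    uint32_t t0 = read_clock_us();
    fn(arg);
    uint32_t t1 = read_clock_us();
    hist_record(t0, t1);
}
```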

I agree that it is theoretically possible that a STREAMS
implementation could have avoided the copy & done something similar
to BPF (but neither nit nor dlpi do this and, as Vernon pointed
out, there are several reasons why the system framework makes
it difficult) but, contrary to Steve Rago's claims, STREAMS would
still be slower.  The call to the BPF filter looks roughly like:

    bpf_tap(struct ifnet*, const void* packet, int plen)

(where packet is a pointer to the buffer containing the packet
just read & plen is its length).  The tap is called early in
the driver's read interrupt service routine at a place where
all three of those parameters are already in registers because
they're needed for other read processing.  The linkage was
deliberately designed to be efficient on a modern architecture &
the packet is filtered in place.

By contrast, the STREAMS religion has these 'generic' put
procedures that can only communicate in terms of mblks.  So to
get to the dlpi or nit filter, you have to allocate & initialize a
dblk pointing at the data, allocate & initialize an mblk pointing
at the dblk then finally you can do a putnext() on the mblk.
In the code we measured, an mblk & dblk each had 6 fields &
24 bytes to initialize so you end up writing then reading 48
bytes of memory (not counting the allocator overhead) just to
describe the buffer you're not copying.  Since the mean packet
size is around 100 bytes, this is at least 50% overhead (more on
an SS-2 with its stupid write through cache) solely because of
the STREAMS linkage conventions.
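
The arithmetic behind that estimate can be sketched as below.  The two
structs are hypothetical stand-ins sized to match the "6 fields & 24
bytes" figure (six 32-bit words each); they are not the real mblk/dblk
definitions, and the function name is invented for illustration.

```c
#include <stdint.h>

/* Hypothetical stand-ins: six 32-bit fields each, 24 bytes apiece. */
struct sketch_dblk { uint32_t f[6]; };
struct sketch_mblk { uint32_t f[6]; };

enum {
    DESCRIPTOR_BYTES = sizeof(struct sketch_dblk) + sizeof(struct sketch_mblk)
};

/* Descriptor bytes as a percentage of packet bytes, rounded down.
 * Allocator overhead and cache effects are deliberately left out. */
unsigned descriptor_overhead_pct(unsigned packet_len)
{
    return (unsigned)(DESCRIPTOR_BYTES * 100u / packet_len);
}
```

For a 100-byte packet this comes to 48, i.e. roughly the half cited
above, before counting the allocator or the write-through cache penalty.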

Since Steve measured the processing time vs. packet size, the
y intercept of a least-squares fit to the measurements is the
relative cost of the STREAMS vs. BPF linkage conventions (i.e.,
extrapolating back to a packet size of 0 removes the copy cost).
You can read these numbers off figures 2 & 3: 90us for STREAMS
(136us if you include the deallocate of mblk & dblk in addition
to the allocates) vs 4.6us for BPF.  Solaris-2.x has improved
on this factor-of-20 difference somewhat by doing a single
allocate for the mblk, dblk (& buffer if needed) rather than
allocating each individually.  But it also added another 8 bytes
of state to be initialized in an already baroque buffer model.
[By contrast, the mbuf model only touches 16 bytes on an allocate
(and 4 of those are purely for diagnostic purposes)].
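
The y-intercept argument rests on an ordinary least-squares line fit of
processing time against packet size: the slope is the per-byte (copy)
cost and the intercept is the fixed per-packet cost.  A minimal sketch of
such a fit, with a hypothetical function name and no connection to the
code actually used for figures 2 & 3:

```c
#include <stddef.h>

/* Ordinary least-squares fit of y = slope * x + intercept.
 * x: packet sizes, y: measured times, n: number of samples.
 * Returns 0 on success, -1 if the fit is undefined
 * (fewer than two samples, or all x values equal). */
int linear_fit(const double *x, const double *y, size_t n,
               double *slope, double *intercept)
{
    double sx = 0.0, sy = 0.0, sxx = 0.0, sxy = 0.0;
    double denom;
    size_t i;

    if (n < 2)
        return -1;
    for (i = 0; i < n; i++) {
        sx  += x[i];
        sy  += y[i];
        sxx += x[i] * x[i];
        sxy += x[i] * y[i];
    }
    denom = (double)n * sxx - sx * sx;
    if (denom == 0.0)
        return -1;
    *slope = ((double)n * sxy - sx * sy) / denom;
    *intercept = (sy - *slope * sx) / (double)n;  /* cost at packet size 0 */
    return 0;
}
```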

A point here is that a lot of the performance problems with
STREAMS are intrinsic & not just the result of poor implementation
(though things like the buffer model could certainly have been
implemented far more efficiently).  One component of good performance
is efficient linkage conventions but the STREAMS convention eschews
registers & guarantees memory traffic.  This is a poor match to
most any modern machine.

Another component is to bind as many choices as possible at
design time rather than making them at run time.  But the selling
point of STREAMS seems to be run-time composition so almost nothing
can be optimized out in the design & most has to be figured out,
and verified, at run-time.  E.g., the leread() interrupt service
routine *knows* it has a data packet for the upper level to process
(it spent a lot of clocks figuring this out).  Rather than taking
the efficient course of passing this knowledge implicitly via
control flow, i.e.,
    (*service_data[ptype])(ifp, packet, plen)
or
    class data_processor {
        virtual void receive(data_processor* src, void* data, int dlen);
        virtual void ioctl(data_processor* src, void* data, int dlen);
        virtual void flush(data_processor* src);
        ...
    };
    ...
    next_level->receive(this, packet, plen);

STREAMS forces the interrupt routine, at runtime, to push the
knowledge into memory (i.e., mp->b_datap->db_type = M_DATA)
then forces the next higher level to figure out, again at
run-time, why it was called (i.e., the "switch (mp->b_datap->db_type)"
at the front of every STREAMS put procedure).
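
The contrast between the two styles can be sketched in plain C as below.
Every name here (sk_msg, put_generic, deliver_data, the SK_ type codes
and the counters) is hypothetical and only loosely modeled on the
mblk/dblk pair; none of it is the real STREAMS or BSD code.

```c
#include <stddef.h>

/* Hypothetical message type codes and descriptor. */
enum { SK_DATA = 0, SK_IOCTL = 1, SK_FLUSH = 2 };

struct sk_msg {
    int type;            /* written by the sender, re-read by the receiver */
    const void *data;
    int len;
};

int data_calls, ioctl_calls, flush_calls;  /* counters so the sketch is observable */

void on_data(const void *p, int len)  { (void)p; (void)len; data_calls++; }
void on_ioctl(const void *p, int len) { (void)p; (void)len; ioctl_calls++; }
void on_flush(const void *p, int len) { (void)p; (void)len; flush_calls++; }

/* Style 1: the message type travels through memory and is decoded
 * again on arrival by a generic entry point. */
void put_generic(const struct sk_msg *mp)
{
    switch (mp->type) {
    case SK_DATA:  on_data(mp->data, mp->len);  break;
    case SK_IOCTL: on_ioctl(mp->data, mp->len); break;
    case SK_FLUSH: on_flush(mp->data, mp->len); break;
    default: break;
    }
}

/* Style 2: the caller already knows it holds a data packet, so it
 * calls the data handler directly and the type is implied by control
 * flow. */
void deliver_data(const void *packet, int plen)
{
    on_data(packet, plen);
}
```

Both paths end up in on_data(); the difference is that the first has to
store, reload and test the type field to get there.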

Another performance gain from early binding is establishing
conventions that particular subsystems can use to avoid processing.
So, for example, leread() knows that bpf_tap() will not modify the
packet (so no copy needs to be made just to filter) but bpf_tap()
knows that after it returns the data can be modified so if the
filter matches it needs to copy the small portion the bpf user
has asked to see.  (We could have as easily made the convention
that the data is immutable & reference counted so no bpf copies
need to be made -- we didn't because that has an adverse impact
on other network subsystems.)  Our IP code knows that at least
the first KB of a packet is contiguous & the packet pointer it's
handed is 8-byte aligned because all our drivers are written
to these conventions.  Since STREAMS has chosen to maximize the
composition options, i.e., anything can be pushed onto anything,
it's much harder to establish this kind of convention.  So,
for example, Solaris IP checks all these things at runtime and,
near as I can tell, dlpi copies everything because it doesn't
know if some other stream intends to modify it.
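
The tap convention described above (filter the packet in place, copy
only the portion the user asked for, and only on a match) can be
sketched roughly as follows.  This is not the BPF source: sk_tap,
sk_capture, sk_match_ip and the SNAPLEN value are all hypothetical, and
the filter is a fixed C function rather than a BPF program.

```c
#include <stddef.h>
#include <string.h>

#define SNAPLEN 68   /* hypothetical: bytes the user asked to see */

struct sk_capture {
    unsigned char buf[SNAPLEN];
    size_t caplen;   /* bytes saved from the most recent match */
    size_t count;    /* number of packets that matched */
};

typedef int (*sk_filter)(const unsigned char *pkt, size_t len);

/* Example filter: Ethernet frames whose type field is IP (0x0800). */
int sk_match_ip(const unsigned char *pkt, size_t len)
{
    return len >= 14 && pkt[12] == 0x08 && pkt[13] == 0x00;
}

/* The packet is const: the tap promises not to modify it, so the
 * caller need not copy it just to filter.  Because the caller may
 * reuse the buffer after the tap returns, a match copies the first
 * SNAPLEN bytes (or fewer) before returning. */
void sk_tap(struct sk_capture *cap, sk_filter match,
            const unsigned char *pkt, size_t plen)
{
    if (!match(pkt, plen))
        return;                      /* rejected in place, nothing copied */
    cap->caplen = plen < SNAPLEN ? plen : SNAPLEN;
    memcpy(cap->buf, pkt, cap->caplen);
    cap->count++;
}
```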

> Sun put the STREAMS code into SunOS 4.0 before the AT&T/Sun deal
> ever happened.  (I was there.  I know what I'm talking about here.
> Anyone who believes it was done *after* the AT&T/Sun deal, and done
> only because AT&T pressured Sun into do it, is simply mistaken.)
> I no longer remember the precise reasons why STREAMS NIT was done,
> but I think the expectation was that it would in some ways be an
> improvement over the old NIT code

Ok.  I knew your streams tty driver work predated the AT&T deal.
I know that Sun was interested in expanding into the commercial
market & AT&T had told that market that STREAMS-based ttys were
wonderful (probably because SVR3 didn't have any other selling
points :).  I think that when we first got 4.0 source the deal had
happened & snit was the only thing besides ttys that used STREAMS.
Source for both the old mbuf-based nit & streams nit were in the
kernel (& the streams nit was very buggy & very slow).  About that
time we were collaborating with several people in the Sun networks
group and they were being encouraged from above to do anything new
with streams & convert as much as possible of the bsd networking code.
Since this direction was antithetical to everything we'd learned
about improving network performance, I kept asking "why?"  No one
I was working with seemed enthusiastic about the direction or
gave any technical justification for it & the only answer I ever
seemed to get was "AT&T owns 25% of the company."  I'm certainly
willing to believe they just told me what I wanted to hear so
I'd shut up & get on with the work we were doing.

 - Van