Re: Serious Performance Problem with Sun NIT/DLPI
Author: Van Jacobson
Date: 1995/06/09
Forums: comp.protocols.tcp-ip, comp.sys.sun.misc

In article <3r096c$s0l@nova.netapp.com>, Guy Harris wrote:
>How much of it was due to STREAMS in and of itself, and how much was due
>to the fact that STREAMS NIT, in SunOS 4.x, was a STREAMS-based
>mechanism stuck atop a BSD/mbuf-based networking framework, which turned
>mbufs into STREAMS messages by copying data from the mbuf into a STREAMS
>buffer (which might not be necessary under a purely STREAMS-based
>networking implementation, such as that of SunOS 5.x, and might have
>been avoidable even in 4.x had 4.x supported a STREAMS moral equivalent
>of "loaned cluster mbufs", possibly allowing the mbuf's data to have a
>STREAMS dbuf glued onto it)?

Guy,

I think we already had this conversation in a bar in San Antonio during some Usenix but the answer is that none of the result had anything to do with mbufs -- Steve bracketed the call to snit_intr() in do_protocol() with reads of the microsecond clock counter & histogrammed the difference between the two times. The only impact of mbufs on snit_intr() is that snit_cpymsg(), the routine used to copy the message, calls m_cpytoc() rather than calling bcopy directly. Since the packets were always contiguous & in a single mbuf or loaned cluster, m_cpytoc() contributes <2us out of a total of more than 100us spent in snit.

I agree that it is theoretically possible that a STREAMS implementation could have avoided the copy & done something similar to BPF (but neither nit nor dlpi does this and, as Vernon pointed out, there are several reasons why the system framework makes it difficult) but, contrary to Steve Rago's claims, STREAMS would still be slower. The call to the BPF filter looks roughly like:

        bpf_tap(struct ifnet*, const void* packet, int plen)

(where packet is a pointer to the buffer containing the packet just read & plen is its length).
The tap is called early in the driver's read interrupt service routine at a place where all three of those parameters are already in registers because they're needed for other read processing. The linkage was deliberately designed to be efficient on a modern architecture & the packet is filtered in place.

By contrast, the STREAMS religion has these 'generic' put procedures that can only communicate in terms of mblks. So to get to the dlpi or nit filter, you have to allocate & initialize a dblk pointing at the data, allocate & initialize an mblk pointing at the dblk then finally you can do a putnext() on the mblk. In the code we measured, an mblk & dblk each had 6 fields & 24 bytes to initialize so you end up writing then reading 48 bytes of memory (not counting the allocator overhead) just to describe the buffer you're not copying. Since the mean packet size is around 100 bytes, this is at least 50% overhead (more on an SS-2 with its stupid write-through cache) solely because of the STREAMS linkage conventions.

Since Steve measured the processing time vs. packet size, the y intercept of a least-squares fit to the measurements is the relative cost of the STREAMS vs. BPF linkage conventions (i.e., extrapolating back to a packet size of 0 removes the copy cost). You can read these numbers off figures 2 & 3: 90us for STREAMS (136us if you include the deallocate of mblk & dblk in addition to the allocates) vs 4.6us for BPF.

Solaris-2.x has improved on this factor-of-20 difference somewhat by doing a single allocate for the mblk, dblk (& buffer if needed) rather than allocating each individually. But it also added another 8 bytes of state to be initialized in an already baroque buffer model. [By contrast, the mbuf model only touches 16 bytes on an allocate (and 4 of those are purely for diagnostic purposes)].
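
As a rough illustration of the descriptor bookkeeping counted above, here is a minimal C sketch. The struct layouts, the `_sketch` names and both helper functions are hypothetical stand-ins and are not the SunOS definitions; they only keep the six-fields-each shape mentioned in the text. On a typical 32-bit machine each struct comes to 24 bytes; on a 64-bit machine sizeof will report more.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a STREAMS data block: six fields. */
struct dblk_sketch {
    unsigned char *db_base;    /* start of the buffer */
    unsigned char *db_lim;     /* one past the end of the buffer */
    void (*db_free)(void *);   /* release routine, unused here */
    void *db_arg;              /* argument for db_free, unused here */
    int db_ref;                /* reference count */
    int db_type;               /* message type tag */
};

/* Hypothetical stand-in for a STREAMS message block: six fields. */
struct mblk_sketch {
    struct mblk_sketch *b_next;
    struct mblk_sketch *b_prev;
    struct mblk_sketch *b_cont;
    unsigned char *b_rptr;     /* first valid byte */
    unsigned char *b_wptr;     /* one past the last valid byte */
    struct dblk_sketch *b_datap;
};

/* Describe an existing packet buffer without copying it.
 * Returns the number of descriptor bytes that had to be written. */
static size_t describe_packet(struct mblk_sketch *mp, struct dblk_sketch *dp,
                              unsigned char *pkt, size_t plen)
{
    dp->db_base = pkt;
    dp->db_lim = pkt + plen;
    dp->db_free = NULL;
    dp->db_arg = NULL;
    dp->db_ref = 1;
    dp->db_type = 0;           /* 0 stands for "data" in this sketch */

    mp->b_next = NULL;
    mp->b_prev = NULL;
    mp->b_cont = NULL;
    mp->b_rptr = pkt;
    mp->b_wptr = pkt + plen;
    mp->b_datap = dp;

    return sizeof *mp + sizeof *dp;
}

/* Descriptor bytes as an integer percentage of the packet size. */
static unsigned overhead_percent(size_t desc_bytes, size_t pkt_bytes)
{
    assert(pkt_bytes > 0);
    return (unsigned)(100 * desc_bytes / pkt_bytes);
}
```

With the figures from the text, overhead_percent(48, 100) gives 48, i.e. roughly half the packet size in descriptor state before any allocator cost. The direct call passes the same pointer and length as arguments and writes none of this.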
A point here is that a lot of the performance problems with STREAMS are intrinsic & not just the result of poor implementation (though things like the buffer model could certainly have been implemented far more efficiently). One component of good performance is efficient linkage conventions but the STREAMS convention eschews registers & guarantees memory traffic. This is a poor match to most any modern machine.

Another component is to bind as many choices as possible at design time rather than make them at run time. But the selling point of STREAMS seems to be run-time composition so almost nothing can be optimized out in the design & most has to be figured out, and verified, at run-time. E.g., the leread() interrupt service routine *knows* it has a data packet for the upper level to process (it spent a lot of clocks figuring this out). Rather than taking the efficient course of passing this knowledge implicitly via control flow, i.e.,

        (*service_data[ptype])(ifp, packet, plen)

or

        class data_processor {
                virtual void receive(data_processor* src, void* data, int dlen);
                virtual void ioctl(data_processor* src, void* data, int dlen);
                virtual void flush(data_processor* src);
                ...
        };
        ...
        next_level->receive(this, packet, plen);

STREAMS forces the interrupt routine, at runtime, to push the knowledge into memory (i.e., mp->b_datap->db_type = M_DATA) then forces the next higher level to figure out, again at run-time, why it was called (i.e., the "switch (mp->b_datap->db_type)" at the front of every STREAMS put procedure).

Another performance gain from early binding is establishing conventions that particular subsystems can use to avoid processing. So, for example, leread() knows that bpf_tap() will not modify the packet (so no copy needs to be made just to filter) but bpf_tap() knows that after it returns the data can be modified so if the filter matches it needs to copy the small portion the bpf user has asked to see.
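
The tap convention just described can be sketched in C as follows. This is only an illustration under stated assumptions, not the BPF source: `tap_state`, `tap_sketch`, `filter_fn`, `snaplen` and the example filter are all hypothetical names. The filter reads the driver's buffer in place through a const pointer, and only on a match is the requested leading portion copied, since the driver may reuse the buffer once the tap returns.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical filter signature: nonzero means the packet matches. */
typedef int (*filter_fn)(const unsigned char *pkt, size_t plen);

/* Hypothetical per-listener state for this sketch. */
struct tap_state {
    filter_fn filter;            /* run in place on the driver's buffer */
    size_t snaplen;              /* how many leading bytes the user wants */
    unsigned char capture[256];  /* private copy, filled only on a match */
    size_t caplen;               /* bytes currently held in capture */
};

/* Filter in place; copy at most snaplen bytes only if the filter matches.
 * Returns the number of bytes captured, 0 if the packet was rejected. */
static size_t tap_sketch(struct tap_state *t,
                         const unsigned char *pkt, size_t plen)
{
    if (!t->filter(pkt, plen))
        return 0;                /* common case: nothing written at all */

    size_t n = plen < t->snaplen ? plen : t->snaplen;
    if (n > sizeof t->capture)
        n = sizeof t->capture;   /* never overrun the private buffer */
    memcpy(t->capture, pkt, n);
    t->caplen = n;
    return n;
}

/* Example filter: Ethernet frames whose type field is 0x0800 (IPv4). */
static int match_ipv4_ethertype(const unsigned char *pkt, size_t plen)
{
    return plen >= 14 && pkt[12] == 0x08 && pkt[13] == 0x00;
}
```

A rejected packet costs one filter call and no memory writes, which is the case the convention is meant to keep cheap.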
(We could have as easily made the convention that the data is immutable & reference counted so no bpf copies need to be made -- we didn't because that has an adverse impact on other network subsystems.)

Our IP code knows that at least the first KB of a packet is contiguous & the packet pointer it's handed is 8-byte aligned because all our drivers are written to these conventions. Since STREAMS has chosen to maximize the composition options, i.e., anything can be pushed onto anything, it's much harder to establish this kind of convention. So, for example, Solaris IP checks all these things at runtime and, near as I can tell, dlpi copies everything because it doesn't know if some other stream intends to modify it.

> Sun put the STREAMS code into SunOS 4.0 before the AT&T/Sun deal
> ever happened. (I was there. I know what I'm talking about here.
> Anyone who believes it was done *after* the AT&T/Sun deal, and done
> only because AT&T pressured Sun into doing it, is simply mistaken.)
> I no longer remember the precise reasons why STREAMS NIT was done,
> but I think the expectation was that it would in some ways be an
> improvement over the old NIT code

Ok. I knew your streams tty driver work predated the AT&T deal. I know that Sun was interested in expanding into the commercial market & AT&T had told that market that STREAMS-based ttys were wonderful (probably because SVR3 didn't have any other selling points :). I think that when we first got 4.0 source the deal had happened & snit was the only thing besides ttys that used STREAMS. Source for both the old mbuf-based nit & streams nit were in the kernel (& the streams nit was very buggy & very slow). About that time we were collaborating with several people in the Sun networks group and they were being encouraged from above to do anything new with streams & convert as much as possible of the bsd networking code.
Since this direction was antithetical to everything we'd learned about improving network performance, I kept asking "why?" No one I was working with seemed enthusiastic about the direction or gave any technical justification for it & the only answer I ever seemed to get was "AT&T owns 25% of the company." I'm certainly willing to believe they just told me what I wanted to hear so I'd shut up & get on with the work we were doing.

 - Van