Developing Middleware with Netty: High-Concurrency Performance Tuning

I have recently been writing a prototype of a back-end middleware component whose main job is message dispatching and pass-through. Since it had to be implemented in Java, Netty was naturally the first choice of network communication framework; I am using Netty 4. Netty is indeed efficient: without much effort it reached a fairly high TPS. But along the way I ran into a few problems that I consider classic, yet hard to find good material about online, so I am summarizing them here.

1. Excessive Context Switching

During load testing I used nmon to monitor the kernel and found the context-switch count as high as 300,000+. That is clearly abnormal, but what could a JVM be doing to cause so much context switching? Referring to my earlier reading notes on process scheduling for the "dinosaur book" Operating System Concepts, and to the Wikipedia entry on context switches, a process or thread is switched out mainly because:

- it blocks while waiting for I/O or some other resource;
- its time slice runs out and the scheduler hands the CPU to another runnable thread;
- it is preempted by an interrupt or a higher-priority task.

Based on this analysis, the focus falls on the first and second factors.

Context Switches: Processes vs. Threads

My earlier reading notes summarized the causes of process context switches, so how is a thread context switch different? Sure enough, there is a Stack Overflow question on exactly this, "thread context switch vs process context switch":

“The main distinction between a thread switch and a process switch is that during a thread switch, the virtual memory space remains the same, while it does not during a process switch. Both types involve handing control over to the operating system kernel to perform the context switch. The process of switching in and out of the OS kernel along with the cost of switching out the registers is the largest fixed cost of performing a context switch. A more fuzzy cost is that a context switch messes with the processors cacheing mechanisms. Basically, when you context switch, all of the memory addresses that the processor “remembers” in it’s cache effectively become useless. The one big distinction here is that when you change virtual memory spaces, the processor’s Translation Lookaside Buffer (TLB) or equivalent gets flushed making memory accesses much more expensive for a while. This does not happen during a thread switch.”

From the top-voted answer we learn that both process and thread context switches involve entering and leaving the OS kernel and saving/restoring registers, which is their largest fixed cost. Compared with a process switch, however, a thread switch is lighter: the key difference is that the virtual memory space stays the same, so CPU caches such as the TLB are not invalidated. Note, though, that another question, "What is the overhead of a context-switch?", mentions that technology introduced by Intel and AMD in 2008 may keep the TLB from being flushed; look into that yourself if you are interested.

1.1 Non-blocking I/O

For the first factor, I/O waiting, the most direct remedy is to use non-blocking I/O. In Netty that means using NIO on both the server and the client.

Let me also talk about how to actively write data into a Netty Channel, because the material you find online is all the same: on the server side the response is written inside the handler right after a request is received, and even the client examples all send data from the handler after channelActive(). Since I am doing message pass-through, and messages sent to downstream systems must be asynchronous and non-blocking, those examples simply do not fit, so here is my approach.

On the server side, after a request is received, obtain the Channel in channelRead0() via ctx.channel(), then retain it via a ThreadLocal variable or any other mechanism; the only requirement is that the Channel is kept reachable. When the response data is ready, actively write it to the held Channel. See Section 4 below for details.
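As an illustration of this pattern only (the DispatchHandler class, the PENDING map, and the idea of keying channels by a request id are assumptions of this sketch, not the original code), a handler can capture the Channel and expose a method that another component calls later to write the response:

import io.netty.channel.Channel;
import io.netty.channel.ChannelHandlerContext;
import io.netty.channel.SimpleChannelInboundHandler;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical handler: remembers the Channel of each request so that the
// response can be written back asynchronously, outside the handler.
public class DispatchHandler extends SimpleChannelInboundHandler<String> {

    // requestId -> Channel; any retention mechanism (ThreadLocal, a map, the
    // message object itself) works, as long as the Channel stays reachable.
    private static final ConcurrentMap<String, Channel> PENDING = new ConcurrentHashMap<>();

    @Override
    protected void channelRead0(ChannelHandlerContext ctx, String msg) {
        String requestId = msg;                 // assume the message carries an id
        PENDING.put(requestId, ctx.channel());  // capture the Channel for later use
        // ... forward the message to the downstream system asynchronously ...
    }

    // Called later by whoever receives the downstream reply.
    public static void sendResponse(String requestId, String response) {
        Channel channel = PENDING.remove(requestId);
        if (channel != null && channel.isActive()) {
            channel.writeAndFlush(response);    // actively write to the held Channel
        }
    }
}

The same idea applies on the client side: hold on to the Channel obtained after connecting and write to it whenever a message needs to go out.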

The client works the same way: after starting the client, grab the Channel, and whenever data needs to be sent proactively, write to that Channel.

EventLoopGroup group = new NioEventLoopGroup();
Bootstrap b = new Bootstrap();
b.group(group)
 .channel(NioSocketChannel.class)
 .remoteAddress(host, port)
 .handler(new ChannelInitializer<SocketChannel>() {
     @Override
     protected void initChannel(SocketChannel ch) throws Exception {
         ch.pipeline().addLast(...);
     }
 });
try {
    ChannelFuture future = b.connect().sync();
    this.channel = future.channel();
} catch (InterruptedException e) {
    throw new IllegalStateException("Error when start netty client: addr=[" + addr + "]", e);
}
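With the Channel held in this.channel, sending later is just a write. A minimal illustrative sketch (the send() method is my addition, not the original code):

public void send(Object msg) {
    // writeAndFlush() is asynchronous and non-blocking; attach a listener to the
    // returned ChannelFuture if the outcome needs to be handled.
    if (channel != null && channel.isActive()) {
        channel.writeAndFlush(msg);
    }
}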

1.2 Reducing the Number of Threads

With too many threads, each one gets a smaller time slice, and to give every thread a chance to run the CPU has to keep switching, constantly saving and restoring thread contexts. So the next place to look was Netty's I/O worker EventLoopGroup. As analyzed in my earlier post 《Netty 4源码解析:服务端启动》 (Netty 4 source analysis: server startup), the EventLoopGroup's default thread count is twice the number of CPU cores, so I configured the NioEventLoopGroup thread count manually to cut down the number of I/O threads.

public void start(int port) throws InterruptedException {
    EventLoopGroup bossGroup = new NioEventLoopGroup();
    EventLoopGroup workerGroup = new NioEventLoopGroup(4); // 4 I/O worker threads instead of the default 2 * cores
    try {
        ServerBootstrap b = new ServerBootstrap()
            .group(bossGroup, workerGroup)
            .channel(NioServerSocketChannel.class)
            .localAddress(port)
            .childHandler(new ChannelInitializer<SocketChannel>() {
                @Override
                protected void initChannel(SocketChannel ch) throws Exception {
                    ch.pipeline().addLast(...);
                }
            });
        // Bind and start to accept incoming connections.
        ChannelFuture f = b.bind(port).sync();
        // Wait until the server socket is closed.
        f.channel().closeFuture().sync();
    } finally {
        bossGroup.shutdownGracefully();
        workerGroup.shutdownGracefully();
    }
}

In addition, since Akka is used as the business thread pool, I also looked into how to change Akka's default configuration. The way to do it is to create a configuration file named application.conf, which is loaded automatically when the ActorSystem is created. The configuration below defines a custom dispatcher:

my-dispatcher {
  # Dispatcher is the name of the event-based dispatcher
  type = Dispatcher
  mailbox-type = "akka.dispatch.SingleConsumerOnlyUnboundedMailbox"
  # What kind of ExecutionService to use
  executor = "fork-join-executor"
  # Configuration for the fork join pool
  fork-join-executor {
    # Min number of threads to cap factor-based parallelism number to
    parallelism-min = 2
    # Parallelism (threads) ... ceil(available processors * factor)
    parallelism-factor = 1.0
    # Max number of threads to cap factor-based parallelism number to
    parallelism-max = 16
  }
  # Throughput defines the maximum number of messages to be
  # processed per actor before the thread jumps to the next actor.
  # Set to 1 for as fair as possible.
  throughput = 100
}

In short, the most important settings are:

- parallelism-factor: determines the size of the thread pool (surprisingly, not parallelism-max).
- throughput: determines how frequently the thread switches between actors (coroutine-style scheduling); 1 is the most frequent and also the fairest setting.
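To put the dispatcher to use, assign it to the actors that run the business logic. Below is a minimal sketch using Akka's classic Java API; the AkkaBootstrap class, the system name "middleware", and WorkerActor are illustrative names, not from the original code:

import akka.actor.ActorRef;
import akka.actor.ActorSystem;
import akka.actor.Props;

public class AkkaBootstrap {
    public static void main(String[] args) {
        // ActorSystem.create() automatically loads application.conf from the classpath.
        ActorSystem system = ActorSystem.create("middleware");

        // Run the (hypothetical) WorkerActor on the custom dispatcher defined above
        // instead of Akka's default dispatcher.
        ActorRef worker = system.actorOf(
                Props.create(WorkerActor.class).withDispatcher("my-dispatcher"),
                "worker");
    }
}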
