1. Background
The business currently stores its logs as TextFile, adding roughly 5 TB per day. Cluster storage is limited and TextFile logs are very hard to extend, so the idea was to serialize the data with protocol buffers and keep the downstream transport and processing in binary. A round of searching on Baidu and Google turned up no ready-made solution, so I decided to build one myself.
2. Questions
1. How does log4j write binary data into flume?
2. How does flume ship the binary data to hdfs?
3. What is the structure of the files that land on hdfs, and how can they be read?
3. Related concepts
3.1 log4j2
Overview: a powerful logging framework
Official site: https://logging.apache.org/log4j/2.x/manual/index.html
Example: log4j2 usage guide
3.2 flume
Overview: a distributed log collection system
Official site: https://flume.apache.org/FlumeUserGuide.html
Reference articles:
1. flume overview
2. flume source
3. flume channel
4. flume sink
3.3 protocol buffer
Overview: Protocol Buffers is a library developed by Google for data storage and for encoding/decoding protocols in network communication. Like XML and JSON, it persists structured data in a defined format; unlike them, it is a binary format, which makes it more efficient to transmit, serialize, and deserialize.
Reference: Protobuf language guide
3.4 avro
Overview: similar to protobuf, Avro supports binary serialization and can process large volumes of data conveniently and quickly; it is also dynamic-language friendly, providing mechanisms that make Avro data easy to handle from dynamic languages.
Reference: avro getting-started guide (Java)
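As a quick taste of what Avro's binary serialization looks like from Java, here is a minimal sketch using the generic API. The one-field schema and the class name AvroBinaryDemo are purely illustrative; the schema just happens to mirror the one used later for the hdfs sink.

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.DatumWriter;
import org.apache.avro.io.EncoderFactory;

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.ByteBuffer;

public class AvroBinaryDemo {
    public static void main(String[] args) throws IOException {
        // a single-field record schema: {"body": bytes}
        Schema schema = new Schema.Parser().parse(
                "{\"type\":\"record\",\"name\":\"RawData\",\"fields\":[{\"name\":\"body\",\"type\":\"bytes\"}]}");
        GenericRecord record = new GenericData.Record(schema);
        record.put("body", ByteBuffer.wrap("hello".getBytes()));

        // encode the record to raw Avro binary
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        DatumWriter<GenericRecord> writer = new GenericDatumWriter<>(schema);
        BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
        writer.write(record, encoder);
        encoder.flush();
        System.out.println("encoded " + out.size() + " bytes");
    }
}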
4. Environment setup
4.1 Install hadoop locally
See my other article: installing and configuring hadoop on windows.
4.2 Deploy flume locally
Deploying flume locally is straightforward; see my other article: configuring flume on windows.
In addition, since the agent writes to hdfs, the following hadoop jars must be copied into flume's lib directory, or errors will be thrown:
/hadoop/share/hadoop/common/*.jar
/hadoop/share/hadoop/common/lib/*.jar
/share/hadoop/hdfs/hadoop-hdfs-2.5.2.jar
5. Solution
The overall flow of the solution is shown in the figure below.
5.1 Define the protocol buffer schema
Define the pb schema and compile it to generate the AdxWinLog class.
option java_outer_classname="AdxWinLog";
//编译:protoc --java_out=./ adxlog.proto
message AdxLog {
optional string version = 1;
required string displayId = 2;
optional string ts = 3;
optional string cookieId = 4;
optional string imei = 5;
optional string idfa = 6;
optional string sessionId = 7;
optional string userId = 8;
optional string httpxForward = 9;
optional string requestSource = 10;
}
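For reference, once protoc has generated the AdxWinLog class, serializing and parsing a record is a single call each. A minimal sketch; the field values and the class name AdxLogRoundTrip are placeholders.

import com.google.protobuf.InvalidProtocolBufferException;

public class AdxLogRoundTrip {
    public static void main(String[] args) throws InvalidProtocolBufferException {
        // build a message with the generated builder and serialize it to raw bytes
        AdxWinLog.AdxLog adxLog = AdxWinLog.AdxLog.newBuilder()
                .setDisplayId("display-001")                       // the only required field
                .setTs(String.valueOf(System.currentTimeMillis()))
                .build();
        byte[] bytes = adxLog.toByteArray();

        // parse the bytes back into a message
        AdxWinLog.AdxLog parsed = AdxWinLog.AdxLog.parseFrom(bytes);
        System.out.println(parsed.getDisplayId());
    }
}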
5.2 Custom FlumeByteAppender
Three points need explaining before the code:
(1) log4j's Logger.info() accepts an Object, which is wrapped into a LogEvent inside log4j. The stock FlumeAppender, however, calls toString() on that object, formats the result through the configured layout, and puts it into the flumeEvent body before sending. So if a non-String object that does not override toString() is passed to log.info(), all you end up with is the object's hash code.
(2) When log.info() is passed an object, the Message held by the LogEvent is an ObjectMessage. This class is a bit quirky: the value is stored in a field named obj, and the getter that exposes it is called getParameters(), which returns an Object[].
(3) The flumeEvent body takes a byte[], which is normally produced by converting the LogEvent with the layout's toByteArray().
So we can pass a byte[] directly into log.info() and, inside the Appender, put it straight into the flumeEvent body; running it through a layout is unnecessary, because a binary payload needs no formatting. A small standalone check of point (2) is shown below.
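A minimal sketch (the class name ObjectMessageDemo is just for illustration; the appender below does the same lookup via event.getMessage().getParameters()):

import org.apache.logging.log4j.message.ObjectMessage;

public class ObjectMessageDemo {
    public static void main(String[] args) {
        byte[] payload = {1, 2, 3};
        // log.info(payload) ends up wrapped in an ObjectMessage inside the LogEvent
        ObjectMessage msg = new ObjectMessage(payload);
        // getParameters() returns an Object[] whose single element is the original object
        byte[] recovered = (byte[]) msg.getParameters()[0];
        System.out.println(recovered.length); // prints 3
    }
}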
The full Appender code follows:
package net.bigdataer.demo.log4j.appender;
import org.apache.logging.log4j.core.Appender;
import org.apache.logging.log4j.core.Filter;
import org.apache.logging.log4j.core.Layout;
import org.apache.logging.log4j.core.LogEvent;
import org.apache.logging.log4j.core.appender.AbstractAppender;
import org.apache.logging.log4j.core.config.Property;
import org.apache.logging.log4j.core.config.plugins.*;
import org.apache.logging.log4j.core.layout.Rfc5424Layout;
import org.apache.logging.log4j.core.net.Facility;
import org.apache.logging.log4j.core.util.Booleans;
import org.apache.logging.log4j.core.util.Integers;
import org.apache.logging.log4j.flume.appender.*;
import org.apache.logging.log4j.message.Message;
import org.apache.logging.log4j.message.ObjectMessage;
import org.apache.logging.log4j.message.SimpleMessage;
import java.io.Serializable;
import java.util.Locale;
import java.util.concurrent.TimeUnit;
/**
* Created by liuxuecheng on 2018/7/3.
* Custom FlumeByteAppender based on FlumeAppender.
* It builds the event body without a layout, so no layout needs to be configured in log4j2.xml.
* Passing a String is also supported, because a String is converted to byte[] in the event as well.
* @58corp
*/
@Plugin(name = "FlumeByte", category = "Core", elementType = Appender.ELEMENT_TYPE, printObject = true)
public final class FlumeByteAppender extends AbstractAppender implements FlumeEventFactory {
private static final String[] EXCLUDED_PACKAGES = {"org.apache.flume", "org.apache.avro"};
private static final int DEFAULT_MAX_DELAY = 60000;
private static final int DEFAULT_LOCK_TIMEOUT_RETRY_COUNT = 5;
private final AbstractFlumeManager manager;
private final String mdcIncludes;
private final String mdcExcludes;
private final String mdcRequired;
private final String eventPrefix;
private final String mdcPrefix;
private final boolean compressBody;
private final FlumeEventFactory factory;
/**
* Which Manager will be used by the appender instance.
*/
private enum ManagerType {
AVRO, EMBEDDED, PERSISTENT;
public static FlumeByteAppender.ManagerType getType(final String type) {
return valueOf(type.toUpperCase(Locale.US));
}
}
private FlumeByteAppender(final String name, final Filter filter, final Layout<? extends Serializable> layout,
final boolean ignoreExceptions, final String includes, final String excludes,
final String required, final String mdcPrefix, final String eventPrefix,
final boolean compress, final FlumeEventFactory factory, final AbstractFlumeManager manager) {
super(name, filter, layout, ignoreExceptions);
this.manager = manager;
this.mdcIncludes = includes;
this.mdcExcludes = excludes;
this.mdcRequired = required;
this.eventPrefix = eventPrefix;
this.mdcPrefix = mdcPrefix;
this.compressBody = compress;
this.factory = factory == null ? this : factory;
}
/**
* Publish the event.
* @param event The LogEvent.
*/
@Override
public void append(final LogEvent event) {
final String name = event.getLoggerName();
if (name != null) {
for (final String pkg : EXCLUDED_PACKAGES) {
if (name.startsWith(pkg)) {
return;
}
}
}
final FlumeEvent flumeEvent = factory.createEvent(event, mdcIncludes, mdcExcludes, mdcRequired, mdcPrefix,
eventPrefix, compressBody);
/**
* !!! the only modified code is here
* if the message is a SimpleMessage, run the default logic;
* if it is an ObjectMessage, fetch the object passed into log.info() via event.getMessage().getParameters();
* the returned array holds a single element, so index zero is used
*/
Message message = event.getMessage();
if(message instanceof SimpleMessage){
flumeEvent.setBody(getLayout().toByteArray(flumeEvent));
}else if(message instanceof ObjectMessage) {
Object[] parameters = event.getMessage().getParameters();
if (parameters[0] instanceof byte[]) {
byte[] bytes = (byte[]) parameters[0];
flumeEvent.setBody(bytes);
}
}
/**
* By the way, you can also add information to the headers, for example:
*/
/*flumeEvent.getHeaders().put("dir","/home/hadoop/hdp_lbg_ectech/");
flumeEvent.getHeaders().put("user","hdp_lbg_ectech");
flumeEvent.getHeaders().put("topic","flume_topic");*/
manager.send(flumeEvent);
}
@Override
public boolean stop(final long timeout, final TimeUnit timeUnit) {
setStopping();
boolean stopped = super.stop(timeout, timeUnit, false);
stopped &= manager.stop(timeout, timeUnit);
setStopped();
return stopped;
}
/**
* Create a Flume event.
* @param event The Log4j LogEvent.
* @param includes comma separated list of mdc elements to include.
* @param excludes comma separated list of mdc elements to exclude.
* @param required comma separated list of mdc elements that must be present with a value.
* @param mdcPrefix The prefix to add to MDC key names.
* @param eventPrefix The prefix to add to event fields.
* @param compress If true the body will be compressed.
* @return A Flume Event.
*/
@Override
public FlumeEvent createEvent(final LogEvent event, final String includes, final String excludes,
final String required, final String mdcPrefix, final String eventPrefix,
final boolean compress) {
return new FlumeEvent(event, mdcIncludes, mdcExcludes, mdcRequired, mdcPrefix,
eventPrefix, compressBody);
}
/**
* Create a Flume Avro Appender.
* @param agents An array of Agents.
* @param properties Properties to pass to the embedded agent.
* @param embedded true if the embedded agent manager should be used. otherwise the Avro manager will be used.
* <b>Note: </b><i>The embedded attribute is deprecated in favor of specifying the type attribute.</i>
* @param type Avro (default), Embedded, or Persistent.
* @param dataDir The directory where the Flume FileChannel should write its data.
* @param connectionTimeoutMillis The amount of time in milliseconds to wait before a connection times out. Minimum is
* 1000.
* @param requestTimeoutMillis The amount of time in milliseconds to wait before a request times out. Minimum is 1000.
* @param agentRetries The number of times to retry an agent before failing to the next agent.
* @param maxDelayMillis The maximum number of milliseconds to wait for a complete batch.
* @param name The name of the Appender.
* @param ignore If {@code "true"} (default) exceptions encountered when appending events are logged; otherwise
* they are propagated to the caller.
* @param excludes A comma separated list of MDC elements to exclude.
* @param includes A comma separated list of MDC elements to include.
* @param required A comma separated list of MDC elements that are required.
* @param mdcPrefix The prefix to add to MDC key names.
* @param eventPrefix The prefix to add to event key names.
* @param compressBody If true the event body will be compressed.
* @param batchSize Number of events to include in a batch. Defaults to 1.
* @param lockTimeoutRetries Times to retry a lock timeout when writing to Berkeley DB.
* @param factory The factory to use to create Flume events.
* @param layout The layout to format the event.
* @param filter A Filter to filter events.
*
* @return A Flume Avro Appender.
*/
@PluginFactory
public static FlumeByteAppender createAppender(@PluginElement("Agents") final Agent[] agents,
@PluginElement("Properties") final Property[] properties,
@PluginAttribute("hosts") final String hosts,
@PluginAttribute("embedded") final String embedded,
@PluginAttribute("type") final String type,
@PluginAttribute("dataDir") final String dataDir,
@PluginAliases("connectTimeout")
@PluginAttribute("connectTimeoutMillis") final String connectionTimeoutMillis,
@PluginAliases("requestTimeout")
@PluginAttribute("requestTimeoutMillis") final String requestTimeoutMillis,
@PluginAttribute("agentRetries") final String agentRetries,
@PluginAliases("maxDelay") // deprecated
@PluginAttribute("maxDelayMillis") final String maxDelayMillis,
@PluginAttribute("name") final String name,
@PluginAttribute("ignoreExceptions") final String ignore,
@PluginAttribute("mdcExcludes") final String excludes,
@PluginAttribute("mdcIncludes") final String includes,
@PluginAttribute("mdcRequired") final String required,
@PluginAttribute("mdcPrefix") final String mdcPrefix,
@PluginAttribute("eventPrefix") final String eventPrefix,
@PluginAttribute("compress") final String compressBody,
@PluginAttribute("batchSize") final String batchSize,
@PluginAttribute("lockTimeoutRetries") final String lockTimeoutRetries,
@PluginElement("FlumeEventFactory") final FlumeEventFactory factory,
@PluginElement("Layout") Layout<? extends Serializable> layout,
@PluginElement("Filter") final Filter filter) {
final boolean embed = embedded != null ? Boolean.parseBoolean(embedded) :
(agents == null || agents.length == 0 || hosts == null || hosts.isEmpty()) && properties != null && properties.length > 0;
final boolean ignoreExceptions = Booleans.parseBoolean(ignore, true);
final boolean compress = Booleans.parseBoolean(compressBody, true);
FlumeByteAppender.ManagerType managerType;
if (type != null) {
if (embed && embedded != null) {
try {
managerType = FlumeByteAppender.ManagerType.getType(type);
LOGGER.warn("Embedded and type attributes are mutually exclusive. Using type " + type);
} catch (final Exception ex) {
LOGGER.warn("Embedded and type attributes are mutually exclusive and type " + type +
" is invalid.");
managerType = FlumeByteAppender.ManagerType.EMBEDDED;
}
} else {
try {
managerType = FlumeByteAppender.ManagerType.getType(type);
} catch (final Exception ex) {
LOGGER.warn("Type " + type + " is invalid.");
managerType = FlumeByteAppender.ManagerType.EMBEDDED;
}
}
} else if (embed) {
managerType = FlumeByteAppender.ManagerType.EMBEDDED;
} else {
managerType = FlumeByteAppender.ManagerType.AVRO;
}
final int batchCount = Integers.parseInt(batchSize, 1);
final int connectTimeoutMillis = Integers.parseInt(connectionTimeoutMillis, 0);
final int reqTimeoutMillis = Integers.parseInt(requestTimeoutMillis, 0);
final int retries = Integers.parseInt(agentRetries, 0);
final int lockTimeoutRetryCount = Integers.parseInt(lockTimeoutRetries, DEFAULT_LOCK_TIMEOUT_RETRY_COUNT);
final int delayMillis = Integers.parseInt(maxDelayMillis, DEFAULT_MAX_DELAY);
if (layout == null) {
final int enterpriseNumber = Rfc5424Layout.DEFAULT_ENTERPRISE_NUMBER;
layout = Rfc5424Layout.createLayout(Facility.LOCAL0, null, enterpriseNumber, true, Rfc5424Layout.DEFAULT_MDCID,
mdcPrefix, eventPrefix, false, null, null, null, excludes, includes, required, null, false, null,
null);
}
if (name == null) {
LOGGER.error("No name provided for Appender");
return null;
}
AbstractFlumeManager manager;
switch (managerType) {
case EMBEDDED:
manager = FlumeEmbeddedManager.getManager(name, agents, properties, batchCount, dataDir);
break;
case AVRO:
manager = FlumeAvroManager.getManager(name, getAgents(agents, hosts), batchCount, delayMillis, retries, connectTimeoutMillis, reqTimeoutMillis);
break;
case PERSISTENT:
manager = FlumePersistentManager.getManager(name, getAgents(agents, hosts), properties, batchCount, retries,
connectTimeoutMillis, reqTimeoutMillis, delayMillis, lockTimeoutRetryCount, dataDir);
break;
default:
LOGGER.debug("No manager type specified. Defaulting to AVRO");
manager = FlumeAvroManager.getManager(name, getAgents(agents, hosts), batchCount, delayMillis, retries, connectTimeoutMillis, reqTimeoutMillis);
}
if (manager == null) {
return null;
}
return new FlumeByteAppender(name, filter, layout, ignoreExceptions, includes,
excludes, required, mdcPrefix, eventPrefix, compress, factory, manager);
}
private static Agent[] getAgents(Agent[] agents, final String hosts) {
if (agents == null || agents.length == 0) {
if (hosts != null && !hosts.isEmpty()) {
LOGGER.debug("Parsing agents from hosts parameter");
final String[] hostports = hosts.split(",");
agents = new Agent[hostports.length];
for(int i = 0; i < hostports.length; ++i) {
final String[] h = hostports[i].split(":");
agents[i] = Agent.createAgent(h[0], h.length > 1 ? h[1] : null);
}
} else {
LOGGER.debug("No agents provided, using defaults");
agents = new Agent[] {Agent.createAgent(null, null)};
}
}
LOGGER.debug("Using agents {}", agents);
return agents;
}
}
5.3 The log-sending client
To test the custom Appender above, create a project; the core pieces are shown below.
log4j2.xml
<?xml version="1.0" encoding="UTF-8"?>
<Configuration status="WARN">
<Appenders>
<Console name="Console" target="SYSTEM_OUT">
<PatternLayout pattern="%d{HH:mm:ss.SSS} [%t] %-5level %logger{36} - %msg%n"/>
</Console>
<!--use the custom FlumeByteAppender-->
<FlumeByte name="eventLogger" compress="false" type="Avro">
<!--more than one Agent host can be configured for HA;
make sure the host ip and port match the flume source configuration -->
<Agent host="0.0.0.0" port="7777"/>
</FlumeByte>
<!--it is recommended to reference FlumeByte through an AsyncAppender-->
<Async name="Async">
<AppenderRef ref="eventLogger"/>
</Async>
</Appenders>
<Loggers>
<Root level="info">
<AppenderRef ref="Async"/>
<AppenderRef ref="Console"></AppenderRef>
</Root>
</Loggers>
</Configuration>
The runner class; the AdxLog class it depends on is generated from the pb schema defined above.
package net.bigdataer.demo.log4j.logwriter;
import org.apache.logging.log4j.LogManager;
import org.apache.logging.log4j.Logger;
/**
* Created by liuxuecheng on 2018/6/29.
*/
public class WriteLogToFlume {
public static void main(String args[]) throws InterruptedException {
Logger log = LogManager.getLogger(WriteLogToFlume.class);
//the default buffer size of the AsyncAppender is 1024, so the loop count here must be larger than 1024
for(int i = 0;i<10000;i++){
AdxWinLog.AdxLog adxLog = AdxWinLog.AdxLog.newBuilder()
.setTs("ts"+System.currentTimeMillis())
.setUserId("userid"+i)
.setImei("ime"+i)
.setDisplayId("displayid"+i)
.setCookieId("cookie=="+i)
.build();
// log.info() accepts an Object parameter; toByteArray() converts adxLog to byte[]
log.info(adxLog.toByteArray());
}
}
}
Full project: https://github.com/bigdataer01/log4j2-flumebyte-appender
5.4 Custom CustomAvroToHdfsSerializer
(1) flume's hdfs sink uses FlumeEventAvroEventSerializer and can write files to hdfs as one of three file types: SequenceFile, DataStream, or CompressedStream.
(2) In a flume agent whose source type is avro, data is held as avro_event records, an Avro structure with the following schema:
{"type":"record","name":"Event","fields":[{"name":"headers","type":{"type":"map","values":"string"}},{"name":"body","type":"bytes"}]}
If data is serialized to hdfs in this format, the header information comes along with it, which gets in the way of parsing the byte[] stored in the body.
(3) So we rewrite the serializer to drop the header information, which means redefining the avro_event schema in code.
The code is fairly simple. Note that RawData is presumably an Avro-generated class compiled from the same single-field schema that the code parses, which is what gives it the newBuilder()/setBody() API used in convert().
package net.bigdataer.demo.flume.sink.serializer;
import org.apache.avro.Schema;
import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.serialization.AbstractAvroEventSerializer;
import org.apache.flume.serialization.EventSerializer;
import java.io.OutputStream;
import java.nio.ByteBuffer;
/**
* Created by liuxuecheng on 2018/7/2.
*/
public class CustomAvroToHdfsSerializer extends AbstractAvroEventSerializer<RawData> {
private static final Schema SCHEMA = (new Schema.Parser()).parse("{\"type\":\"record\",\"name\":\"RawData\",\"fields\":[{\"name\":\"body\",\"type\":\"bytes\"}]}");
private final OutputStream out;
private CustomAvroToHdfsSerializer(OutputStream out) {
this.out = out;
}
protected Schema getSchema() {
return SCHEMA;
}
protected OutputStream getOutputStream() {
return this.out;
}
protected RawData convert(Event event) {
/**
* drop the avro_event header and keep only the body;
* when the event is written to hdfs, the result is a SequenceFile whose records are <LongWritable, BytesWritable>
*/
return RawData.newBuilder().setBody(ByteBuffer.wrap(event.getBody())).build();
}
public static class Builder implements EventSerializer.Builder {
public Builder() {
}
public EventSerializer build(Context context, OutputStream out) {
CustomAvroToHdfsSerializer writer = new CustomAvroToHdfsSerializer(out);
writer.configure(context);
return writer;
}
}
}
Full project: https://github.com/bigdataer01/flume-hdfs-sink-serializer
Package the project and drop the resulting jar into flume's lib directory.
5.5 Configure flume
Create hdfs.conf under flume's conf directory with the following configuration:
a1.sources=source1
a1.channels=channel1
a1.sinks=sink1
a1.sources.source1.type=avro
a1.sources.source1.bind=0.0.0.0
a1.sources.source1.port=7777
a1.sources.source1.channels=channel1
a1.channels.channel1.type=memory
a1.channels.channel1.capacity=1000
a1.channels.channel1.transactionCapacity=1000
a1.channels.channel1.keep-alive=30
a1.sinks.sink1.type=hdfs
a1.sinks.sink1.channel=channel1
#hdfs directory to write to
a1.sinks.sink1.hdfs.path=hdfs://localhost:9000/home/rawdata/flume/
#file type written to hdfs
a1.sinks.sink1.hdfs.fileType=SequenceFile
a1.sinks.sink1.hdfs.rollInterval=0
a1.sinks.sink1.hdfs.rollSize=10240
a1.sinks.sink1.hdfs.rollCount=0
a1.sinks.sink1.hdfs.idleTimeout=60
#use the custom serializer here
a1.sinks.sink1.hdfs.serializer=net.bigdataer.demo.flume.sink.serializer.CustomAvroToHdfsSerializer
5.6 Start flume
Go to flume's bin directory and run the following command:
flume-ng.cmd agent -conf ../conf -conf-file ../conf/hdfs.conf -name a1 -property flume.root.logger=INFO,console
5.7 End-to-end test
(1) Run the WriteLogToFlume class from the log4j2-flumebyte-appender project directly in the IDE. The records also show up in the IDE console, because a console appender is configured alongside the flume one.
(2) The flume console prints messages like the following:
2018-07-19 14:54:42,035 (hdfs-sink1-call-runner-2) [INFO - org.apache.flume.sink.hdfs.BucketWriter$8.call(BucketWriter.java:655)] Renaming hdfs://localhost:9000/home/rawdata/flume/FlumeData.1531983230209.tmp to hdfs://localhost:9000/home/rawdata/flume/FlumeData.1531983230209
2018-07-19 14:54:42,068 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.hdfs.BucketWriter.open(BucketWriter.java:251)] Creating hdfs://localhost:9000/home/rawdata/flume//FlumeData.1531983230210.tmp
2018-07-19 14:54:42,139 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.hdfs.BucketWriter.close(BucketWriter.java:393)] Closing hdfs://localhost:9000/home/rawdata/flume//FlumeData.1531983230210.tmp
(3) Check hdfs: the files have been written.
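Before moving on to spark, the output can be sanity-checked with the plain Hadoop SequenceFile reader. A minimal sketch; the class name PeekSequenceFile is illustrative, the file path is a placeholder taken from the flume log above, and the <LongWritable, BytesWritable> record format follows the serializer comment and the spark code in the next section.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.SequenceFile;

public class PeekSequenceFile {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // placeholder path: any FlumeData.* file written by the sink
        Path path = new Path("hdfs://localhost:9000/home/rawdata/flume/FlumeData.1531983230209");
        try (SequenceFile.Reader reader = new SequenceFile.Reader(conf, SequenceFile.Reader.file(path))) {
            LongWritable key = new LongWritable();
            BytesWritable value = new BytesWritable();
            while (reader.next(key, value)) {
                // value holds the binary payload of one flume event
                System.out.println("key=" + key.get() + ", payload bytes=" + value.getLength());
            }
        }
    }
}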
6. Deserializing with spark
Pull the files above down to the local machine (hdfs dfs -get) and deserialize them with the following spark code:
import org.apache.hadoop.io.{BytesWritable, LongWritable}
import org.apache.spark.{SparkConf, SparkContext}
// AdxLog is the protobuf message generated from adxlog.proto (AdxWinLog.AdxLog); import it from your project

object SparkReadAvro {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("read").setMaster("local[*]")
    val sc = new SparkContext(conf)
    sc.sequenceFile[LongWritable, BytesWritable]("file:///d:/logs/log")
      .map(x => x._2.copyBytes())        // extract the raw protobuf payload from the BytesWritable value
      .map(AdxLog.parseFrom)             // deserialize each record
      .foreach(x => println(x.toString))
  }
}