This project creates some useful performance monitoring and measurement utilities.
The Load tool is intended to do the following:
usage: Generator -bs,--sample_bias <BIAS_FILE> The discrete distribution to bias the sampling. This is a CSV of 2 columns. The first column is the % of the templates and the 2nd column is the probability (0-100) that it's chosen. For instance: 20,80 80,20 implies that 20% of the templates will comprise 80% of the output and the remaining 80% of the templates will comprise 20% of the output. -c,--csv <CSV_FILE> A CSV file to emit monitoring data to. The format is a CSV with the following schema: timestamp, (name, eps, historical_mean, historical_stddev)+ -cg,--consumer_group <GROUP_ID> Consumer Group. The default is load.group -e,--eps <EPS> The target events per second -h,--help Generate Help screen -k,--kafka_config <CONFIG_FILE> The kafka config. This is a file containing a JSON map with the kafka config. -l,--lookback <LOOKBACK> When summarizing, how many monitoring periods should we summarize over? If 0, then no summary. Default: 5 -md,--monitor_delta_ms <TIME_IN_MS> The time (in ms) between monitoring output. Default is 10000 -mt,--monitor_topic <TOPIC> The kafka topic to monitor. -ot,--output_topic <TOPIC> The kafka topic to write to -p,--threads <NUM_THREADS> The number of threads to use when extracting data. The default is the number of cores of your machine. -sd,--send_delta_ms <TIME_IN_MS> The time (in ms) between sending a batch of messages. Default is 100 -t,--template <TEMPLATE_FILE> The template file to use for generation. This should be a file with a template per line with $METRON_TS and $METRON_GUID in the spots for timestamp and guid, if you so desire them. -tl,--time_limit_ms <MS> The total amount of time to run this in milliseconds. By default, it never stops. -z,--zk_quorum <QUORUM> zookeeper quorum
Messages are drawn from a template file. A template file has a message template per line.
For instance, let’s say we want to generate JSON maps with fields: source.type, ip_src_addr and ip_dst_addr. We can generate a template file with a template like the following per line:
{ "source.type" : "asa", "ip_src_addr" : "127.0.0.1", "ip_dst_addr" : "191.168.1.1" }
When messages are generated, there are some special replacements that can be used: $METRON_TS and $METRON_GUID. We can adjust our previous template to use these like so:
{ "source.type" : "asa", "ip_src_addr" : "127.0.0.1", "ip_dst_addr" : "191.168.1.1", "timestamp" : $METRON_TS, "guid" : "$METRON_GUID" }
One note about GUIDs generated. We do not generate global UUIDs, they are unique only within the context of a given generator run.
This load tool can be configured to use biased sampling. This is useful if, for instance, you are trying to model data which is not distributed uniformly, like many types of network data. Generating synthetic data with similar distribution to your regular data will enable the caches to be exercised in the same way, for instance, and yield a more realistic scenario.
You specify the biases in a csv file with 2 columns:
A simple example would be to generate samples based on Pareto’s principle:
20,80 80,20
This would yield biases that mean the first 20% of the templates in the template file would comprise 80% of the output.
A more complex example might be:
20,80 20,5 50,1 10,14
This would would imply:
For those who would prefer a different visualization or wish to incorporate the output of this tool into an automated test, you can specify a file to emit data in CSV format to via the -c or --csv option.
The CSV columns are as follows:
If you are generating synthetic data, then:
If you are monitoring a topic, then:
Obviously, if you are doing both generating and monitoring the throughput of a topic, then all of the columns are added.
An example of CSV output is:
1520506955047,generated,,,,throughput measured,,, 1520506964896,generated,1045,1045,0,throughput measured,,, 1520506974896,generated,1000,1022,31,throughput measured,1002,1002,0 1520506984904,generated,999,1014,26,throughput measured,999,1000,2 1520506994896,generated,1000,1011,22,throughput measured,1000,1000,1 1520507004896,generated,1000,1008,20,throughput measured,1000,1000,1
One can use the load tool to monitor performance of a kafka-to-kafka topology. For instance, we could monitor the throughput of the enrichment topology by monitoring the enrichments kafka topic:
$METRON_HOME/bin/load_tool.sh -mt enrichments -z $ZOOKEEPER
One can use the load tool to generate synthetic load and monitor performance of a kafka-to-kafka topology. For instance, we could monitor the performance of the enrichment topology. It is advised to start the enrichment topology against a new topic and write to a new topic so as to not pollute your downstream indices. So, for instance we could create a kafka topic called enrichments_load by generating load on it. We could also create a new kafka topic called indexing_load and configure the enrichment topology to output to it. We would then generate load on enrichments_load and monitor indexing_load.
#Threadpool of size 5, you want somewhere between 5 and 10 depending on the throughput numbers you're trying to drive #Messages drawn from ~/dummy.templates, which is a message template per line #Generate at a rate of 9000 messages per second #Emit the data to a CSV file ~/measurements.csv $METRON_HOME/bin/load_tool.sh -p 5 -ot enrichments_load -mt indexing_load -t ~/dummy.templates -eps 9000 -z $ZOOKEEPER -c ~/measurements.csv
Now, with the help of a bash function and gnuplot we can generate a plot of the historical throughput measurements for indexing_load:
# Ensure that you have installed gnuplot and the liberation font package # via yum install -y gnuplot liberation-sans-fonts # We will define a plot function that will generate a png plot. It takes # one arg, the output file. It expects to have a 2 column CSV streamed # with the first dimension being the timestamp and the second dimension # being what you want plotted. plot() { awk -F, '{printf "%d %d\n", $1/1000, $2} END { print "e" }' | gnuplot -e "reset;clear;set style fill solid 1.0 border -1; set nokey;set title 'Throughput Measured'; set xlabel 'Time'; set boxwidth 0.5; set xtics rotate; set ylabel 'events/sec';set xdata time; set timefmt '%s';set format x '%H:%M:%S';set term png enhanced font '/usr/share/fonts/liberation/LiberationSans-Regular.ttf' 12 size 900,400; set output '$1';plot '< cat -' using 1:2 with line lt -1 lw 2;" } # We want to transform the CSV file into a space separated file with the # timestamp followed by the throughput measurements. cat ~/measurements.csv | awk -F, '{printf "%d,%d\n", $1, $8 }' | plot performance_measurement.png
This generates a plot like so to performance_measurement.png: