Hello again everyone!
Today I wanted to talk about something I’ve been struggling with for a couple of days (although more accurately, weeks and months): how do you read log files effectively and efficiently? I work for a Managed Services Organization, which means we provide IT management services for clients. These are typically large clients with large, often application-specific, environments, and I often find myself setting up custom monitors that need to read through log files for specific entries.
The problem I was facing was that I was reading gigabytes of logs every few minutes, and my queries were having trouble keeping up with the volume of logs being generated. To compound that, I had multiple scripts reading the same log files at different times, which meant the same logs were being accessed and read multiple times while looking for slightly different entries.
So today I want to address the first issue I ran into when reading log files: what is the quickest, and most memory-efficient, way to read them?
I performed the following tests on a 50 MB log file:
Measure-Command -Expression {Get-Content .\logs\somelog.log}
Days : 0
Hours : 0
Minutes : 0
Seconds : 14
Milliseconds : 695
Ticks : 146956836
TotalDays : 0.000170088930555556
TotalHours : 0.00408213433333333
TotalMinutes : 0.24492806
TotalSeconds : 14.6956836
TotalMilliseconds : 14695.6836
Not too bad, but not super fast. This command drops the file into the pipeline line by line, and it also manages to be the least memory-efficient option. I’m not entirely sure why, but it balloons the amount of memory required to multiple times the size of the file being read. That wouldn’t be a problem on its own, but when you’re trying to read 50-100 files of 50 MB each, you can run out of memory quickly.
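To make “line by line” concrete, here’s a minimal sketch of what downstream code sees with the default behavior (the path and the ERROR pattern are just placeholders):
Get-Content .\logs\somelog.log | ForEach-Object {
    # $_ is a single [string] line; the filter runs once per line
    if ($_ -match 'ERROR') { $_ }
}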
Measure-Command -Expression {Get-Content .\logs\somelog.log -readcount 100}
Days : 0
Hours : 0
Minutes : 0
Seconds : 1
Milliseconds : 282
Ticks : 12829064
TotalDays : 1.48484537037037E-05
TotalHours : 0.000356362888888889
TotalMinutes : 0.0213817733333333
TotalSeconds : 1.2829064
TotalMilliseconds : 1282.9064
We’re starting to see a significant improvement here. This sends 100-line arrays into the pipeline. Definitely better, but not ideal for my purposes.
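One thing to keep in mind is that downstream code now receives chunks rather than individual lines. A rough sketch of what that looks like (again, path and pattern are placeholders):
Get-Content .\logs\somelog.log -ReadCount 100 | ForEach-Object {
    # $_ is a [string[]] of up to 100 lines; -match against an array
    # returns just the elements that match
    $_ -match 'ERROR'
}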
Measure-Command -Expression {Get-Content .\logs\somelog.log -readcount 1000}
Days : 0
Hours : 0
Minutes : 0
Seconds : 0
Milliseconds : 550
Ticks : 5508710
TotalDays : 6.37582175925926E-06
TotalHours : 0.000153019722222222
TotalMinutes : 0.00918118333333333
TotalSeconds : 0.550871
TotalMilliseconds : 550.871
Pretty rocking, performance-wise. 1000-line arrays are getting created here. I’ve heard from a number of people that the ReadCount that performs best varies with the log file, but generally it’s smart to start looking between 1000 and 3000.
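If you want to probe for your own sweet spot, a quick (and unscientific) loop like this works; the candidate values here are just the ones I’d start with:
foreach ($rc in 500, 1000, 2000, 3000) {
    # Time a full read of the file at each candidate ReadCount
    $t = Measure-Command { Get-Content .\logs\somelog.log -ReadCount $rc }
    "ReadCount {0}: {1:N0} ms" -f $rc, $t.TotalMilliseconds
}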
Measure-Command -Expression {Get-Content .\logs\somelog.log -readcount 0}
Days : 0
Hours : 0
Minutes : 0
Seconds : 0
Milliseconds : 493
Ticks : 4939466
TotalDays : 5.71697453703704E-06
TotalHours : 0.000137207388888889
TotalMinutes : 0.00823244333333333
TotalSeconds : 0.4939466
TotalMilliseconds : 493.9466
This is definitely the right area; on my log files this is usually the fastest. It’s still creating a string array, though. That isn’t necessarily a bad thing, but it ended up being important for me when I was actually scanning for entries. I’ll get to that next post.
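For clarity, -readcount 0 doesn’t stream anything; it hands the whole file back as one array of lines. A tiny sketch (path is a placeholder):
# The entire file comes back as a single array of lines
$lines = Get-Content .\logs\somelog.log -ReadCount 0
$lines.Count          # total number of lines
$lines[0]             # first line of the file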
Measure-Command -Expression {Get-Content .\logs\somelog.log -raw}
Days : 0
Hours : 0
Minutes : 0
Seconds : 0
Milliseconds : 410
Ticks : 4106764
TotalDays : 4.75319907407407E-06
TotalHours : 0.000114076777777778
TotalMinutes : 0.00684460666666667
TotalSeconds : 0.4106764
TotalMilliseconds : 410.6764
Before I get to the performance, I just want to talk about the -raw parameter. It switches Get-Content’s behavior so that it reads the entire file as a single string. That may or may not be what you want in your environment, but it’s crucial if you’re doing a multi-line regex match. More on that later.
I know this looks like it was faster than the last example, but -readcount 0 seemed to beat it more often than not. Regardless, both commands were extremely close performance-wise, usually a 50-100 millisecond difference. For me, the choice of one over the other had more to do with my next operation than with the read time. What these tests showed me is that when you’re using Get-Content it is really important not to accept the default ReadCount value. Or, more accurately, it’s important when the files are large and performance matters, like when you’re doing a log scrape for monitoring.
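Since -raw is the one that ended up mattering for what comes next, here’s a minimal sketch of the difference it makes; the pattern is purely illustrative:
$text = Get-Content .\logs\somelog.log -Raw
$text -is [string]    # True: one big string, not an array of lines
# (?s) lets . cross line breaks, so a single match can span several log lines
[regex]::Matches($text, '(?s)ERROR.*?END') | ForEach-Object { $_.Value }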
Next week I’ll be looking at the performance around actually scanning the log files for relevant data. Thanks for reading!