How to Read a Large File Efficiently with Java

1. Overview

This tutorial will show how to read all the lines from a large file in Java in an efficient manner.

This article is part of the “Java – Back to Basic” tutorial here on VietMX’s Blog.

2. Reading in Memory

The standard way of reading the lines of the file is in memory – both Guava and Apache Commons IO provide a quick way to do just that:

Files.readLines(new File(path), Charsets.UTF_8);
FileUtils.readLines(new File(path));

The problem with this approach is that all the file lines are kept in memory – which will quickly lead to OutOfMemoryError if the File is large enough.

For example – reading a ~1Gb file:

@Test
public void givenUsingGuava_whenIteratingAFile_thenWorks() throws IOException {
    String path = ...
    Files.readLines(new File(path), Charsets.UTF_8);
}

This starts off with a small amount of memory being consumed: (~0 Mb consumed)

[main] INFO  com.maixuanviet.java.CoreJavaIoUnitTest - Total Memory: 128 Mb
[main] INFO  com.maixuanviet.java.CoreJavaIoUnitTest - Free Memory: 116 Mb

However, after the full file has been processed, we have at the end: (~2 Gb consumed)

[main] INFO  com.maixuanviet.java.CoreJavaIoUnitTest - Total Memory: 2666 Mb
[main] INFO  com.maixuanviet.java.CoreJavaIoUnitTest - Free Memory: 490 Mb

Which means that about 2.1 Gb of memory are consumed by the process – the reason is simple – the lines of the file are all being stored in memory now.

It should be obvious by this point that keeping in memory the contents of the file will quickly exhaust the available memory – regardless of how much that actually is.

What’s more, we usually don’t need all of the lines in the file in memory at once – instead, we just need to be able to iterate through each one, do some processing and throw it away. So, this is exactly what we’re going to do – iterate through the lines without holding all of them in memory.

3. Streaming Through the File

Let’s now look at a solution – we’re going to use a java.util.Scanner to run through the contents of the file and retrieve lines serially, one by one:

FileInputStream inputStream = null;
Scanner sc = null;
try {
    inputStream = new FileInputStream(path);
    sc = new Scanner(inputStream, "UTF-8");
    while (sc.hasNextLine()) {
        String line = sc.nextLine();
        // System.out.println(line);
    }
    // note that Scanner suppresses exceptions
    if (sc.ioException() != null) {
        throw sc.ioException();
    }
} finally {
    if (inputStream != null) {
        inputStream.close();
    }
    if (sc != null) {
        sc.close();
    }
}

This solution will iterate through all the lines in the file – allowing for processing of each line – without keeping references to them – and in conclusion, without keeping them in memory(~150 Mb consumed)

[main] INFO  com.maixuanviet.java.CoreJavaIoUnitTest - Total Memory: 763 Mb
[main] INFO  com.maixuanviet.java.CoreJavaIoUnitTest - Free Memory: 605 Mb

4. Streaming With Apache Commons IO

The same can be achieved using the Commons IO library as well, by using the custom LineIterator provided by the library:

LineIterator it = FileUtils.lineIterator(theFile, "UTF-8");
try {
    while (it.hasNext()) {
        String line = it.nextLine();
        // do something with line
    }
} finally {
    LineIterator.closeQuietly(it);
}

Since the entire file is not fully in memory – this will also result in pretty conservative memory consumption numbers(~150 Mb consumed)

[main] INFO  o.b.java.CoreJavaIoIntegrationTest - Total Memory: 752 Mb
[main] INFO  o.b.java.CoreJavaIoIntegrationTest - Free Memory: 564 Mb

5. Conclusion

This quick article shows how to process lines in a large file without iteratively, without exhausting the available memory – which proves quite useful when working with these large files.

The implementation of all these examples and code snippets can be found in our GitHub project – this is a Maven-based project, so it should be easy to import and run as it is.

Related posts:

Lập trình đa luồng với Callable và Future trong Java
Java Program to Implement Fenwick Tree
Getting Started with Forms in Spring MVC
Java Program to Implement Dijkstra’s Algorithm using Priority Queue
Java Program to Implement Shell Sort
Java Program to Represent Graph Using Incidence List
Hướng dẫn sử dụng luồng vào ra nhị phân trong Java
Java Program to Describe the Representation of Graph using Incidence Matrix
Comparing Dates in Java
REST Web service: Basic Authentication trong Jersey 2.x
A Guide to the Java LinkedList
Java String to InputStream
Overview of Spring Boot Dev Tools
Kiểu dữ liệu Ngày Giờ (Date Time) trong java
Java Program to Perform Postorder Recursive Traversal of a Given Binary Tree
Java Program to implement Bit Set
Guide to the Java TransferQueue
Java Program to Implement Jarvis Algorithm
Jackson – Decide What Fields Get Serialized/Deserialized
Hướng dẫn Java Design Pattern – Observer
Java Program to Implement Merge Sort Algorithm on Linked List
Getting Started with Stream Processing with Spring Cloud Data Flow
Java Program to Implement wheel Sieve to Generate Prime Numbers Between Given Range
Java Program to Implement Max-Flow Min-Cut Theorem
Spring Security Custom AuthenticationFailureHandler
Java Program to Find MST (Minimum Spanning Tree) using Prim’s Algorithm
Jackson JSON Views
Java Program to Construct a Random Graph by the Method of Random Edge Selection
How to Remove the Last Character of a String?
The Order of Tests in JUnit
Java Program to Implement CountMinSketch
Java Program to Implement AVL Tree