学习POI对excel中xls文件读较大数据的两种方法数据处理的方法和速度对比

象相合 · 发表于 2017-11-18 14:38

本帖最后由象相合于 2018-5-17 15:59 编辑

      大数据的excel读写不像小数据，对于读写的方法，数据精度，数据冗余度，数据稳定性都有相当的要求。
      于是在网上搜到一份参考资料，沿着链接跳进去得到另一份：
               前置资料：https://www.cnblogs.com/tootwo2/p/6683143.html
                                 http://poi.apache.org/spreadsheet/how-to.html
      但是，这个人并没有提供相应的速度评测，文章内容也是从POI官网内容捞下来的，实在没有什么可比性。【= =于是就有了本文：

在这里，只讨论两种模式：1、eventmodel，只读，CPU和内存消耗非常低。事件监听实现。2、usermodel，可读可写，CPU和内存消耗非常大。普通模式，直接获取实现。
   这两种模式读取excel的时间格式都是double型，因此读取时还需要对数据整理：

   先上代码，再作解释：

   a). 配置maven依赖：

[XML] 纯文本查看 复制代码

<!-- 读取execl 文件 -->
<dependency>
    <groupId>org.apache.poi</groupId>
    <artifactId>poi</artifactId>
    <version>3.15</version>
</dependency>
<!-- https://mvnrepository.com/artifact/org.apache.poi/poi-ooxml -->
<dependency>
    <groupId>org.apache.poi</groupId>
    <artifactId>poi-ooxml</artifactId>
    <version>3.15</version>
</dependency>
<!-- https://mvnrepository.com/artifact/org.apache.poi/poi-ooxml-schemas -->
<dependency>
    <groupId>org.apache.poi</groupId>
    <artifactId>poi-ooxml-schemas</artifactId>
    <version>3.15</version>
</dependency>
<dependency>
    <groupId>commons-lang</groupId>
    <artifactId>commons-lang</artifactId>
    <version>2.6</version>
</dependency>
<!-- https://mvnrepository.com/artifact/org.apache.poi/poi-scratchpad -->
<dependency>
    <groupId>org.apache.poi</groupId>
    <artifactId>poi-scratchpad</artifactId>
    <version>3.15</version>
</dependency>
<dependency>
    <groupId>xerces</groupId>
    <artifactId>xerces</artifactId>
    <version>2.4.0</version>
</dependency>

b). XlsReadWriteMethod：普通的usermodel模式读取excel

[Java] 纯文本查看 复制代码

package excel.POI;

import org.apache.commons.lang.time.DateFormatUtils;
import org.apache.poi.hssf.usermodel.HSSFDateUtil;
import org.apache.poi.ss.usermodel.*;

import java.io.File;
import java.io.FileInputStream;
import java.util.Iterator;
import java.util.LinkedList;
import java.util.List;


[i]/**
[/i][i] * 可读可写的Xls/Xlsx
[/i][i] * */
[/i]public class XlsReadWriteMethod {

    private String filePath;

    XlsReadWriteMethod(String filePath) {
        this.filePath = filePath;
    }

    public static void main(String[] args) {
        try {
            Long start = System.[i]currentTimeMillis[/i]();
            String filePath = "F:/temp/excel/1.xls";
            XlsReadWriteMethod xlsReadWriteMethod = new XlsReadWriteMethod(filePath);
            Iterator iter = xlsReadWriteMethod.RowIterator(0);
            Long end = System.[i]currentTimeMillis[/i]();
            System.[i]out[/i].println("get data time = " + (end - start));
        } catch (Exception e) {
            e.printStackTrace();
        }
    }


    [i]/**
[/i][i]     * 迭代FileBean的每一行
[/i][i]     * */
[/i][i]    [/i]public Iterator<String[]> RowIterator(int sheetAt) throws Exception {
        Workbook workbook = WorkbookFactory.[i]create[/i](new FileInputStream(new File(filePath)));
        Sheet sheet = workbook.getSheetAt(sheetAt); //get index of sheet

        List<String[]> list = new LinkedList<String[]>();
        for (int i = 0; i < sheet.getLastRowNum(); i++) {
            Row row = sheet.getRow(i);
            if(null == row)  continue; // 这一行是空行：没东西

            int firstCellNum = row.getFirstCellNum();
            int lastCellNum = row.getLastCellNum();
            String[] oneRowValue = new String[lastCellNum - firstCellNum];
            int countNoneCol = 0;

            for (int j = firstCellNum; j < lastCellNum; j++) {
                Cell cell = row.getCell(j);
                int thisCol = j - firstCellNum;
                if(null == cell) {
                    oneRowValue[thisCol] = "";
                    continue;
                }
                switch (cell.getCellType()) // Cell.getCellType() has not deprecated in API
                {
                    // Because of Cell.getCellType().getCode() has deprecated And do not provide a replacement way So just use direct ways
                    case -1 : // _NONE
                        oneRowValue[thisCol] = "";
                        break;
                    case 0 : // NUMERIC Date
                        short dataFormat = cell.getCellStyle().getDataFormat();
                        // 31/176 : yyyy年m月d日（自定义型和日期型） ， 58 : m月d日 ， 32 : h时mm分
                        // 有时候176也是数字型，那这样会报错，干脆直接抓个异常。总之抓一堆dateFormat来兼容就好了。
                        if(HSSFDateUtil.[i]isCellDateFormatted[/i](cell) || HSSFDateUtil.[i]isCellDateFormatted[/i](cell)
                                || 31 == dataFormat || 58 == dataFormat || 32 == dataFormat || 176 == dataFormat) {
                            try {
                                oneRowValue[thisCol] = DateFormatUtils.[i]format[/i](cell.getDateCellValue(), "yyyy-MM-dd hh:MM:ss");
                            }catch ( Exception e) {
                                try {
                                    cell.setCellType(CellType.[i]STRING[/i]);
                                }catch (Exception ex) {
                                    //出问题就不转
                                }
                                oneRowValue[thisCol] = String.[i]valueOf[/i](cell.getStringCellValue());
                            }
                        } else {
                            try {
                                cell.setCellType(CellType.[i]STRING[/i]);
                            }catch (Exception ex) {
                                //出问题就不转
                            }
                            oneRowValue[thisCol] = String.[i]valueOf[/i](cell.getStringCellValue());
                        }
                        break;
                    case 1 : // STRING
                        oneRowValue[thisCol] = cell.getStringCellValue();
                        break;
                    case 2 : // FORMULA
                        try {
                            oneRowValue[thisCol] = String.[i]valueOf[/i](cell.getNumericCellValue());
                        } catch (IllegalStateException e) {
                            try {
                                oneRowValue[thisCol] = String.[i]valueOf[/i](cell.getRichStringCellValue());
                            } catch (Exception ex){
                                oneRowValue[thisCol] = "";
                            }
                        }
                        break;
                    case 3 : // BLANK
                        oneRowValue[thisCol] = "";
                        break;
                    case 4 : // BOOLEAN
                        oneRowValue[thisCol] = String.[i]valueOf[/i](cell.getBooleanCellValue());
                        break;
                    case 5 : // ERROR
                        oneRowValue[thisCol] = String.[i]valueOf[/i](cell.getErrorCellValue());
                        break;
                    default :
                        oneRowValue[thisCol] = "";
                        break;
                } // end switch
                if("".equals(oneRowValue[thisCol])) countNoneCol++;

            } // end for

            if(countNoneCol == (lastCellNum - firstCellNum)) continue;//这一行是空行，里面的东西都无效
            list.add(oneRowValue);
        } // end for 2

        return list.iterator();
    }

}

c). XlsReadOnlyMethod：eventmodel的模式，它是监听器模式，读文件的时候不停记录读取的记录，根据不同的记录特征读取不同的值。

[Java] 纯文本查看 复制代码

package excel.POI;

import org.apache.poi.hssf.eventusermodel.HSSFListener;


import org.apache.poi.hssf.eventusermodel.*;
import org.apache.poi.hssf.record.*;
import org.apache.poi.poifs.filesystem.POIFSFileSystem;

import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Iterator;
import java.util.LinkedList;
import java.util.List;


[i]/**
[/i][i] * 只可读类型的XlsMethod，比可读写快（比例：20/13
[/i][i] */
[/i]public class XlsReadOnlyMethod implements HSSFListener {

    private SSTRecord sstRec = null;

    private List<String[]> iterList = new LinkedList<String[]>();
    private String[] tmpStr_ = null;

    private int colLastRealIndex = 0; //总列
    private int rowLastRealIndex = 0; //总行

    private int thisRow = 0;        // 当前行指针
    private boolean helpThisRow = true;

    private boolean shouldInit = true;

    private static int [i]sheetAt [/i]= 0; // 传入的sheet，不用static会丢失值
    private int count = 0;   // 用来计数是第几个sheet
    private boolean isActiveSheet = false;   // 用来标记是第几个sheet
    private static FormatTrackingHSSFListener [i]formatListener[/i];

    public static void main(String[] args) {
        try {
            Long start = System.[i]currentTimeMillis[/i]();
            String filePath = "F:/temp/excel/1.xls";
            XlsReadOnlyMethod xlsReadOnlyMethod = new XlsReadOnlyMethod();
            Iterator iter = xlsReadOnlyMethod.RowIterator(filePath, 0);
            Long end = System.[i]currentTimeMillis[/i]();
            System.[i]out[/i].println("spending time = " + (end - start));
        } catch (IOException e) {
            e.printStackTrace();
        }

    }

    [i]/**
[/i][i]     * sheet想调哪张调哪张。操作流程：
[/i][i]     *  1. 构造HSSFRequest对象
[/i][i]     *  2. 使用输入流让事件工厂运作，直接监听到读取数据的事件，在监听processRecord中拿数据。
[/i][i]     */
[/i][i]    [/i]public Iterator<String[]> RowIterator(String filePath,int sheetAt) throws IOException {

        this.[i]sheetAt [/i]= sheetAt;
        FileInputStream fin = new FileInputStream(filePath);
        InputStream inputStream = new POIFSFileSystem(fin).createDocumentInputStream("Workbook");
        HSSFRequest hssfRequest = new HSSFRequest();
//        hssfRequest.addListenerForAllRecords(new XlsReadOnlyMethod(this.getSource())); // add listener
        HSSFEventFactory hssfEventFactory = new HSSFEventFactory();

        // 使用formatListener可以分别获得date和double
        MissingRecordAwareHSSFListener listener = new MissingRecordAwareHSSFListener(this);
        [i]formatListener [/i]= new FormatTrackingHSSFListener(listener);
        hssfRequest.addListenerForAllRecords([i]formatListener[/i]);

        hssfEventFactory.processEvents(hssfRequest, inputStream);
        if(null != tmpStr_)
            if(0 != tmpStr_.length) {
                iterList.add(tmpStr_);
            }
        fin.close();
        inputStream.close();
        return iterList.iterator();
    }

    [i]/**
[/i][i]     *  实现HSSFListener的方法，【监听器】获取xls的内容
[/i][i]     *  eg : http://poi.apache.org/spreadsheet/how-to.html
[/i][i]     * */
[/i][i]    /**
[/i][i]     * 1. 初始化
[/i][i]     *      获取一行所有列
[/i][i]     *      设置【当前行指针】
[/i][i]     *      初始化tmpStr_
[/i][i]     *      设置初始化在每行只走一次 === 3/2
[/i][i]     * 2. 获取值
[/i][i]     *      把值丢到tmpStr_里（空的就不丢
[/i][i]     * 3. 换行
[/i][i]     *      tmpStr_加到iterList里去
[/i][i]     *      如果【当前行指针】跳到下一行，我就换行，并初始化1/3
[/i][i]     * 4. 循环
[/i][i]     *      到最后一行的时候跳出来，我再执行一边添加。
[/i][i]     * */
[/i][i]    [/i]public void processRecord(Record record) {
        switch (record.getSid())
        {
            // the BOFRecord can represent either the beginning of a sheet or the workbook
            case BOFRecord.[i]sid[/i]:
                BOFRecord bof = (BOFRecord) record;
                if (bof.getType() == bof.[i]TYPE_WORKSHEET[/i]) { //遇到sheet
                    if(count++ == [i]sheetAt[/i]) {
                        isActiveSheet = true;
                    } else {
                        isActiveSheet = false;
                    }
                }
                break;
            case RowRecord.[i]sid[/i]:  // 用来初始化当前行有几个cell
                if(isActiveSheet) {
                    if(null != tmpStr_) {
                        iterList.add(tmpStr_);
                    }
                    RowRecord rowrec = (RowRecord) record;

                    // 可以获得行列的最大值。 如果有错就给后面的数字加大。
                    colLastRealIndex = rowrec.getLastCol() - rowrec.getFirstCol() + 5;
                    rowLastRealIndex = rowrec.getRowNumber();

                    if(helpThisRow) { //初始化当前行指针
                        thisRow = rowrec.getRowNumber();
                        shouldInit = true;
                        helpThisRow = false;
                    }
//                    System.out.println("我要判定 " + rowrec.getRowNumber() + " 行有" + (colLastRealIndex)+ "个cell了");
                }
                break;
            [i]/**
[/i][i]             *  不管是哪条路，只要第一次跑到 就开始
[/i][i]             * */
[/i][i]            [/i]case NumberRecord.[i]sid[/i]: // Date int double
                if(isActiveSheet) {
                    NumberRecord numrec = (NumberRecord) record;
                    //1.如果是下一行，2.如果是第一次运行，我就初始化
                    checkNextRowAndInit(numrec);

                    tmpStr_[numrec.getColumn()] = [i]formatListener[/i].formatNumberDateCell(numrec);

// System.out.println("Cell found with value " + formatListener.formatNumberDateCell(numrec)
//                            + " at row " + numrec.getRow() + " and column " + numrec.getColumn());
                }
                break;
            case FormulaRecord.[i]sid[/i]:  // formula
                if(isActiveSheet) {
                    FormulaRecord forRec = (FormulaRecord) record;

                    checkNextRowAndInit(forRec);

                    tmpStr_[forRec.getColumn()] = String.[i]valueOf[/i](forRec.getValue());
//                    System.out.println("Fomula found with value " + forRec.getValue() + " at row " + forRec.getRow() + " and column " + forRec.getColumn());
                }
                break;
            // SSTRecords store a array of unique strings used in Excel.
            case SSTRecord.[i]sid[/i]:
                sstRec = (SSTRecord) record; // to get SSTRecord
                break;
            case LabelSSTRecord.[i]sid[/i]:
                if(isActiveSheet) {
                    LabelSSTRecord lrec = (LabelSSTRecord) record;

                    checkNextRowAndInit(lrec);

                    tmpStr_[lrec.getColumn()] = String.[i]valueOf[/i](sstRec.getString(lrec.getSSTIndex()));

//                    System.out.println("String cell found with value "
//                            + sstRec.getString(lrec.getSSTIndex()) + " at row " + lrec.getRow()
//                    + " and column " + lrec.getColumn());
                }
                break;
        }
    }

    private void checkNextRowAndInit(CellRecord record) {
        if(thisRow != record.getRow()) { // 已经到下一行了
            thisRow = record.getRow();
            iterList.add(tmpStr_);       
            shouldInit = true;
        }
        if(shouldInit) { 
            tmpStr_ = new String[colLastRealIndex];
            shouldInit = false;
        }
    }


}

   两种模式由于驱动方式不同，获取数据时的方式和对数据的处理方式也不同。

      对于37.9M的xls文件，两者读取速度对比如下：

            ReadOnlyMethod : spending time = 4438
            ReadWriteMethod : get data time = 7056

      对于45M的xls文件，两者读取速度对比如下：

            ReadOnlyMethod : spending time = 13456
            ReadWriteMethod : get data time = 18136

      显然，只读方法不仅耗费更少的资源，而且速度也更快，因此如果没有对文件写的操作，使用只读方法更为优秀。(^o^

      文件github : https://github.com/EleComb/maven_test

      最后祝大家学习进步！

xuchengshan · 发表于 2017-11-18 17:02

谢谢分享

string峰峰 · 发表于 2017-11-18 21:39

十分感谢分享。

RUANJIAOXIAOZI · 发表于 2017-11-22 18:04

额,这个代码看着有点...,可能是52破解不大支持发代码的原因吧

xman55555 · 发表于 2017-11-29 21:34

aspose 用过没有，感觉还是可以的

暗夜未央 · 发表于 2018-5-4 11:19

谢谢分享，正好要用到

raocharles · 发表于 2018-6-21 08:41

谢谢楼主的分享

帐号		自动登录	找回密码
密码			注册[Register]

[Java 转载] 学习POI对excel中xls文件读较大数据的两种方法数据处理的方法和速度对比

免费评分